# LLM Topic Modeling

We'll use text embeddings to measure how similar pieces of text are, and use that similarity to group text into topics automatically.

For example, let's categorize these words into these topics:

In [None]:
words = ['Apple', 'Orange', 'Banana', 'Jamaica', 'Sri Lanka', 'Facebook', 'Google']
topics = ['Fruit', 'Country', 'Company']

## Embeddings

LLMs can convert text into an array of numbers such that texts with similar meanings get similar numbers. These arrays are called embeddings.

You can think of them as points in a multi-dimensional space: if two points are close to each other, the texts are similar in meaning, and the farther apart they are, the less similar they are.
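
To make the "points in space" picture concrete, here is a minimal sketch with made-up 2-D points. The coordinates in `toy_points` are invented for illustration and are not real embeddings, which have hundreds or thousands of dimensions.

```python
import numpy as np

# Hypothetical 2-D "embeddings", invented for illustration only.
toy_points = {
    "Apple": np.array([0.9, 0.1]),
    "Banana": np.array([0.8, 0.2]),
    "Google": np.array([0.1, 0.9]),
}


def distance(a, b):
    """Euclidean (straight-line) distance between two points."""
    return float(np.linalg.norm(a - b))


apple_banana = distance(toy_points["Apple"], toy_points["Banana"])
apple_google = distance(toy_points["Apple"], toy_points["Google"])

# The two fruits sit closer together than the fruit and the company.
print(apple_banana < apple_google)
```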

We'll use [OpenAI embeddings](https://platform.openai.com/docs/api-reference/embeddings) to get the numbers associated with each of these words and topics.

In [None]:
import requests
import json
from google.colab import userdata

api_key = userdata.get('OPENAI_API_KEY')

url = "https://api.openai.com/v1/embeddings"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
data = {
    "input": words,
    "model": "text-embedding-3-small",
    "encoding_format": "float"
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

{'object': 'list', 'data': [{'object': 'embedding', 'index': 0, 'embedding': [0.009169695, -0.0351853, -0.025009409, 0.03981761, 0.0018953233, -0.027667291, -0.0030708036, 0.048626594, 0.003126176, 0.0013210309, 0.010745439, 0.021566818, -0.05012007, -0.003800139, 0.006897838, 0.002627823, -0.03971636, -0.017453428, -0.031641457, -0.00037811542, -0.02074414, 0.036476273, 0.0045057437, 0.019124098, 0.061359115, -0.0035565, -0.044475235, 0.014631011, 0.003515366, -0.039134156, -0.013036281, -0.044500545, 0.027287593, -0.0027085089, -0.026502885, 0.0028350747, 0.01885831, -0.043462705, 0.0006233367, 0.01466898, -0.010960601, 0.040247936, 0.025351137, 0.032274287, 0.056650866, -0.041133896, -0.078875825, -0.020630231, 0.017997662, 0.02418673, -0.019238006, 0.018364701, -0.004078584, 0.058220282, -0.015871355, 0.010485979, -0.029110141, 0.04371584, 0.007878723, 0.024971439, 0.0029110142, 0.008207794, -0.020617574, -0.027312906, -0.0032211004, -0.017542025, -0.0008171407, 0.044475235, 0.0364
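
The request above could be wrapped in a small reusable function. The sketch below is one possible shape and is not part of the original notebook: `get_embeddings` and `parse_embeddings` are hypothetical helper names, the `timeout` value is an arbitrary choice, and sorting by each item's `index` field assumes the response layout shown in the output above.

```python
import numpy as np
import requests

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def parse_embeddings(response_json):
    """Return an (n_texts, n_dims) array, ordered by each item's `index`."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return np.array([item["embedding"] for item in items])


def get_embeddings(texts, api_key, model="text-embedding-3-small"):
    """Request embeddings for a list of texts (hypothetical helper)."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"input": texts, "model": model, "encoding_format": "float"}
    response = requests.post(EMBEDDINGS_URL, headers=headers, json=payload, timeout=30)
    # Fail loudly on HTTP errors instead of parsing an error body.
    response.raise_for_status()
    return parse_embeddings(response.json())
```

Calling `get_embeddings(words, api_key)` would then return one row per word, in the same order as `words`.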

Let's store the embeddings as a dictionary `embeddings` of `{word: embedding}`.

In [None]:
import numpy as np

embeddings = {}
for word, embedding in zip(words, response.json()['data']):
    embeddings[word] = np.array(embedding['embedding'])

embeddings

{'Apple': array([ 0.0091697 , -0.0351853 , -0.02500941, ...,  0.0021975 ,
        -0.00448992,  0.01428928]),
 'Orange': array([-0.01316147,  0.00517033, -0.02400391, ...,  0.00702487,
        -0.00882856, -0.00019707]),
 'Banana': array([ 0.02111392, -0.04319699, -0.03527755, ...,  0.01640655,
         0.00189679,  0.02478289]),
 'Jamaica': array([ 0.03438216, -0.031114  ,  0.00606471, ..., -0.02314286,
        -0.02473709,  0.01929015]),
 'Sri Lanka': array([ 0.0270712 , -0.01661132,  0.03729919, ...,  0.00990455,
        -0.03842207,  0.01406043]),
 'Facebook': array([ 0.0299663 , -0.0253963 , -0.04586077, ...,  0.01940572,
         0.02311801,  0.00629547]),
 'Google': array([ 0.00915158,  0.00340002, -0.017734  , ...,  0.02558249,
         0.00259869, -0.02914727])}

Note that each embedding is an array of 1,536 numbers.

In [None]:
list(embeddings.values())[0].shape

(1536,)

## Find similarity

The dot product, computed with `np.dot`, gives us a similarity score between 2 embeddings. The higher the score, the closer the meanings.

So this gives us a similarity between "Apple" and "Orange":

In [None]:
np.dot(embeddings['Apple'], embeddings['Orange'])

0.44532714309624044
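
The dot product works as a similarity score here because OpenAI documents its embeddings as normalized to length 1, in which case the dot product equals the cosine similarity. The sketch below illustrates that relationship with two made-up vectors. The `cosine_similarity` helper and the vectors `a` and `b` are assumptions for illustration, not part of the notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical vectors, invented for illustration.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Rescale each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

raw_dot = float(np.dot(a, b))             # grows with vector length
unit_dot = float(np.dot(a_unit, b_unit))  # length no longer matters
cosine = cosine_similarity(a, b)

print(raw_dot, unit_dot, cosine)
```

For unit-length vectors the last two numbers agree, so `np.dot` alone is enough.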

Let's now calculate the similarity between every pair of words:

In [None]:
import pandas as pd

words = list(embeddings.keys())

# Initialize an empty DataFrame (float dtype, so the results are stored as numbers)
dot_product_df = pd.DataFrame(index=words, columns=words, dtype=float)

# Calculate the dot product for every pair of words
for word1 in words:
    for word2 in words:
        dot_product_df.at[word1, word2] = np.dot(embeddings[word1], embeddings[word2])

dot_product_df

              Apple    Orange    Banana   Jamaica Sri Lanka  Facebook    Google
Apple           1.0  0.445327  0.386232  0.206973  0.195864  0.409309  0.437886
Orange     0.445327       1.0  0.357272  0.198355  0.137741  0.297724  0.257779
Banana     0.386232  0.357272       1.0  0.367946  0.238199  0.209801  0.217968
Jamaica    0.206973  0.198355  0.367946       1.0  0.395586  0.185161  0.124532
Sri Lanka  0.195864  0.137741  0.238199  0.395586       1.0  0.182928  0.182536
Facebook   0.409309  0.297724  0.209801  0.185161  0.182928       1.0  0.582094
Google     0.437886  0.257779  0.217968  0.124532  0.182536  0.582094       1.0
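
One way to read the pairwise table is to ask, for each word, which other word scores highest. The sketch below rebuilds the table from the values printed above and masks the diagonal so a word cannot match itself. The names `names`, `similarity` and `nearest` are illustrative choices, not from the notebook.

```python
import numpy as np
import pandas as pd

names = ['Apple', 'Orange', 'Banana', 'Jamaica', 'Sri Lanka', 'Facebook', 'Google']

# Values copied from the pairwise similarity table above.
similarity = pd.DataFrame(
    [
        [1.0, 0.445327, 0.386232, 0.206973, 0.195864, 0.409309, 0.437886],
        [0.445327, 1.0, 0.357272, 0.198355, 0.137741, 0.297724, 0.257779],
        [0.386232, 0.357272, 1.0, 0.367946, 0.238199, 0.209801, 0.217968],
        [0.206973, 0.198355, 0.367946, 1.0, 0.395586, 0.185161, 0.124532],
        [0.195864, 0.137741, 0.238199, 0.395586, 1.0, 0.182928, 0.182536],
        [0.409309, 0.297724, 0.209801, 0.185161, 0.182928, 1.0, 0.582094],
        [0.437886, 0.257779, 0.217968, 0.124532, 0.182536, 0.582094, 1.0],
    ],
    index=names,
    columns=names,
)

# Blank out the diagonal so a word cannot be its own nearest neighbour.
off_diagonal = similarity.mask(np.eye(len(names), dtype=bool))

# For each row, the column holding the highest remaining similarity.
nearest = off_diagonal.idxmax(axis=1)
print(nearest)
```

With these values, each company's nearest neighbour is the other company and each country's is the other country, while Banana's nearest neighbour is Apple.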


## Cluster into topics

We cluster the embeddings based on similarity using K-Means.

Here, we create 3 clusters.

In [None]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Prepare the data
words = list(embeddings.keys())
embedding_vectors = np.array(list(embeddings.values()))

# Choose the number of clusters (k)
k = 3  # You can choose a different number of clusters

# Perform K-means clustering
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(embedding_vectors)

# Get the cluster labels for each word
labels = kmeans.labels_

# Create a DataFrame to show the clusters
clustered_words = pd.DataFrame({'Word': words, 'Cluster': labels})

print(clustered_words)


        Word  Cluster
0      Apple        2
1     Orange        2
2     Banana        2
3    Jamaica        1
4  Sri Lanka        1
5   Facebook        0
6     Google        0
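
K-Means only assigns cluster numbers. It does not name the clusters, and the numbers themselves are arbitrary. To see which words ended up together, the labels can be grouped. This is a small sketch that re-enters the assignments from the output above, using the illustrative names `clustered` and `groups`.

```python
import pandas as pd

# Cluster assignments copied from the output above.
clustered = pd.DataFrame({
    'Word': ['Apple', 'Orange', 'Banana', 'Jamaica', 'Sri Lanka', 'Facebook', 'Google'],
    'Cluster': [2, 2, 2, 1, 1, 0, 0],
})

# Collect the words that share each cluster label.
groups = clustered.groupby('Cluster')['Word'].apply(list).to_dict()
print(groups)
```

In the notebook itself, the same `groupby` call should work directly on `clustered_words`.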




## Classify words to topics

Let's get the embeddings of the words AND topics together.

In [None]:
import requests
import json
from google.colab import userdata

api_key = userdata.get('OPENAI_API_KEY')

url = "https://api.openai.com/v1/embeddings"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
data = {
    "input": words + topics,
    "model": "text-embedding-3-small",
    "encoding_format": "float"
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())

{'object': 'list', 'data': [{'object': 'embedding', 'index': 0, 'embedding': [0.009183893, -0.03516589, -0.02500095, 0.039748345, 0.0018481715, -0.027621303, -0.0031140423, 0.048685394, 0.003131448, 0.0013299555, 0.010690279, 0.021545121, -0.050128486, -0.0037849538, 0.006898996, 0.0026646582, -0.03969771, -0.017431041, -0.03167209, -0.0003423785, -0.020696988, 0.036482397, 0.004497006, 0.019177943, 0.06136942, -0.0035222857, -0.044508018, 0.0146587845, 0.0035317796, -0.03911541, -0.012987835, -0.044508018, 0.027317492, -0.0027184577, -0.0264567, 0.0028371331, 0.01883616, -0.04344469, 0.00064084714, 0.014620808, -0.010962442, 0.04028001, 0.025304759, 0.032203756, 0.05666038, -0.041090168, -0.0787878, -0.020633696, 0.01803866, 0.02421611, -0.019241236, 0.018367786, -0.0040887627, 0.05823006, -0.015798068, 0.010481411, -0.02911503, 0.043773815, 0.007918023, 0.024950314, 0.002925744, 0.008221831, -0.020646354, -0.027317492, -0.0031899945, -0.017570287, -0.0008331013, 0.044432066, 0.036482

Now, we find the dot-product between the words and the topics.

In [None]:
# Parse the response to get embeddings (kept separate from the `embeddings` dict built earlier)
embedding_data = response.json()['data']
embedding_vectors = np.array([e['embedding'] for e in embedding_data])

# Separate word and topic embeddings
word_embeddings = embedding_vectors[:len(words)]
topic_embeddings = embedding_vectors[len(words):]

# Calculate dot products
dot_product_matrix = np.dot(word_embeddings, topic_embeddings.T)

# Create DataFrame
df = pd.DataFrame(dot_product_matrix, index=words, columns=topics)

df

              Fruit   Country   Company
Apple      0.414807  0.303259  0.368041
Orange     0.363135  0.308233  0.269728
Banana     0.539551  0.244805  0.212664
Jamaica    0.245623  0.351682  0.212892
Sri Lanka  0.188262   0.29896   0.21488
Facebook   0.196168  0.243862  0.376275
Google     0.211459  0.220816  0.281387


Picking the topic with the highest similarity in each row, we see that:

- Apple is a Fruit
- Orange is a Fruit
- Banana is a Fruit
- Jamaica is a Country
- Sri Lanka is a Country
- Facebook is a Company
- Google is a Company
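
Picking the top similarity in each row can be done in code rather than by eye. The sketch below re-enters the scores from the table above and uses `DataFrame.idxmax`. The names `scores` and `best_topic` are illustrative, not from the notebook.

```python
import pandas as pd

# Scores copied from the word-by-topic table above.
scores = pd.DataFrame(
    [
        [0.414807, 0.303259, 0.368041],
        [0.363135, 0.308233, 0.269728],
        [0.539551, 0.244805, 0.212664],
        [0.245623, 0.351682, 0.212892],
        [0.188262, 0.29896, 0.21488],
        [0.196168, 0.243862, 0.376275],
        [0.211459, 0.220816, 0.281387],
    ],
    index=['Apple', 'Orange', 'Banana', 'Jamaica', 'Sri Lanka', 'Facebook', 'Google'],
    columns=['Fruit', 'Country', 'Company'],
)

# For each word (row), the topic (column) with the highest similarity.
best_topic = scores.idxmax(axis=1)
print(best_topic)
```

In the notebook, `df.idxmax(axis=1)` on the DataFrame built above should give the same assignments.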

# Local embedding models

We can run embedding models locally too. The [Massive Text Embedding Benchmark (MTEB) leaderboard](https://huggingface.co/spaces/mteb/leaderboard) ranks embedding models. Let's pick [gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) and run it.

In [None]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
embedding_vectors = model.encode(words + topics)
print(embedding_vectors)

[[-0.5090499  -0.15889336 -0.38412032 ...  1.0322104   0.02407256
  -1.1023357 ]
 [-0.67409146 -0.83528525 -0.04251496 ...  0.6114075  -0.20315407
  -1.4508063 ]
 [ 0.06841635 -0.01685135 -0.6980065  ...  0.68077064 -0.19276078
  -0.70105606]
 ...
 [-0.5333297  -0.12361603 -0.59674186 ...  0.7952124  -0.6889464
  -0.6601959 ]
 [-0.7522429  -1.2153517  -1.0396134  ... -0.0702344   0.42999077
  -0.42822212]
 [-0.86104506 -0.25109887 -1.4982017  ...  0.88898855 -0.05507576
  -0.39268136]]


`gte-large-en-v1.5` gives us 1,024 numbers for each word.

In [None]:
embedding_vectors.shape

(10, 1024)

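OpenAI's text embeddings are normalized to unit length, so their dot product already equals their cosine similarity. Embeddings from other models are not necessarily normalized (some of the values above are larger than 1 in magnitude), so cosine similarity is the better way to compare them. Let's apply that.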

In [None]:
from sentence_transformers.util import cos_sim

# Separate word and topic embeddings
word_embeddings = embedding_vectors[:len(words)]
topic_embeddings = embedding_vectors[len(words):]

# Calculate cosine similarities
cosine_similarity_matrix = cos_sim(word_embeddings, topic_embeddings).numpy()

# Create DataFrame
df = pd.DataFrame(cosine_similarity_matrix, index=words, columns=topics)

df

              Fruit   Country   Company
Apple      0.709521  0.537598  0.793602
Orange     0.771039  0.559362  0.645636
Banana     0.847789  0.482939  0.588516
Jamaica    0.616004  0.555557  0.539662
Sri Lanka  0.631691   0.49555  0.589877
Facebook    0.60194  0.517291  0.733116
Google     0.570325  0.491316  0.740454


Picking the top similarity in each row now gives a slightly different set of results, two of which (marked with !) are wrong:

- Apple is a Company
- Orange is a Fruit
- Banana is a Fruit
- Jamaica is a Fruit (!)
- Sri Lanka is a Fruit (!)
- Facebook is a Company
- Google is a Company
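
One possible way to flag shaky assignments like these is to look at the margin between the best and second-best topic in each row. This is an added heuristic, not something the notebook does. The sketch re-enters the cosine similarities from the table above, and the names `cosine_scores`, `margin` and `report` are illustrative.

```python
import numpy as np
import pandas as pd

# Cosine similarities copied from the table above.
cosine_scores = pd.DataFrame(
    [
        [0.709521, 0.537598, 0.793602],
        [0.771039, 0.559362, 0.645636],
        [0.847789, 0.482939, 0.588516],
        [0.616004, 0.555557, 0.539662],
        [0.631691, 0.49555, 0.589877],
        [0.60194, 0.517291, 0.733116],
        [0.570325, 0.491316, 0.740454],
    ],
    index=['Apple', 'Orange', 'Banana', 'Jamaica', 'Sri Lanka', 'Facebook', 'Google'],
    columns=['Fruit', 'Country', 'Company'],
)

# Top-scoring topic per word.
best_topic = cosine_scores.idxmax(axis=1)

# Sort each row's scores ascending; margin = best minus second best.
sorted_scores = np.sort(cosine_scores.to_numpy(), axis=1)
margin = pd.Series(sorted_scores[:, -1] - sorted_scores[:, -2], index=cosine_scores.index)

report = pd.DataFrame({'Topic': best_topic, 'Margin': margin})
print(report.sort_values('Margin'))
```

With these numbers, the two smallest margins belong to Sri Lanka and Jamaica, the two words that were assigned the wrong topic.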