# Tutorial: Using The Google Cloud Vertex AI for Clustering and Topic Modeling
### Author: Campbell Lund
### 10/12/2023
This notebook walks through how to get started with Google Cloud Vertex AI to generate text embeddings that retain sentence context. We then use these embeddings to cluster similar sentences and perform topic modeling to determine their subjects. Finally, we use a text generation model to create cluster labels.

### Table of contents:
- 1. [Initialization and background](#sec1)
- 2. [Generate embeddings](#sec2)
- 3. [Interpreting the embeddings](#sec3)
- 4. [Labeling the clusters](#sec4)

## 1. Initialization and background<a name="sec1"></a>

Vertex AI has two pre-trained models that we'll be utilizing: `textembedding-gecko@001` and `text-bison@001`.
- The gecko model is used to create text embeddings. Simply put, text embeddings are numerical, vector representations of text. These embeddings capture semantic information about words, phrases, or documents in a way that preserves their contextual relationships. We use these vectors later for determining sentence similarity.
- The bison model is used for text generation. Similar to ChatGPT, text-bison takes a prompt as input and returns the AI-generated response. 
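To make "sentence similarity" concrete, here is a minimal sketch that compares vectors using cosine similarity, a common way to compare embeddings. The three-dimensional vectors below are made up for illustration (real gecko embeddings are much longer), and `cosine_similarity` is a helper written for this example rather than part of Vertex AI.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings" (assumption: illustrative values, not model output)
toy_embeddings = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "I forgot my login password": [0.8, 0.2, 0.1],
    "What is the weather today?": [0.0, 0.2, 0.9],
}

sentences = list(toy_embeddings)
sim_related = cosine_similarity(toy_embeddings[sentences[0]], toy_embeddings[sentences[1]])
sim_unrelated = cosine_similarity(toy_embeddings[sentences[0]], toy_embeddings[sentences[2]])

# Vectors pointing in similar directions score close to 1
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two password sentences score much higher than the password/weather pair, which is the property the clustering step relies on later.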

Import or `!pip install` the following libraries:

In [None]:
import os
from dotenv import load_dotenv
import json
import base64
import numpy as np
import matplotlib.pyplot as plt
import mplcursors
import pandas as pd
import pickle

To use Vertex AI, you'll need to create service account credentials, which are stored in a `.json` file. `key_path` refers to the location of this file, and `PROJECT_ID` refers to the project ID created in your Google Cloud account. For a tutorial on how to create your credentials, click [here](https://learn.deeplearning.ai/google-cloud-vertex-ai/lesson/8/optional---google-cloud-setup) (note: you may need to create an account to access the tutorial).

In [None]:
from google.auth.transport.requests import Request
from google.oauth2.service_account import Credentials

In [None]:
key_path = ""  # path to your service account key (.json file)
PROJECT_ID = ""  # your Google Cloud project ID
REGION = 'us-central1'

In [None]:
# create credentials object
credentials = Credentials.from_service_account_file(
    key_path,
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

if credentials.expired:
    credentials.refresh(Request())

In [None]:
import vertexai
from vertexai.language_models import TextGenerationModel
from vertexai.language_models import TextEmbeddingModel
# initialize vertex
vertexai.init(project = PROJECT_ID, location = REGION, credentials = credentials)

### helper functions:

The following helper functions were loaded from a [DeepLearning.AI tutorial](https://learn.deeplearning.ai/google-cloud-vertex-ai/lesson/1/introduction). 
- `encode_texts_to_embeddings()` takes a list of strings (one batch) as input and returns the corresponding embeddings, or `None` for each string if the API call fails.
- `encode_text_to_embedding_batched()` helps us prompt the text embedding model in batches for larger tasks. We must work in batches to avoid overloading the model and hitting rate limits. It takes a Python list of strings as input and returns a list of the corresponding embeddings. 
- `generate_batches()` creates batches of size 5 for the `encode_text_to_embedding_batched()` function. Five is the maximum batch size for the `textembedding-gecko@001` model.
- `clusters_2D()` is a function to help us visualize the high-dimensional data on a 2D plot.

It's not necessary to understand the inner workings of these functions, just how to utilize them.
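To get a feel for what batching and rate limiting mean in practice, the sketch below works through the arithmetic for this tutorial's dataset. `batches_sketch` is a stand-in written for this example that mirrors the slicing idea of `generate_batches()`, and the placeholder sentences are invented; only the numbers 628, 5, and 20 calls per minute come from the notebook.

```python
def batches_sketch(items, batch_size=5):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

n_sentences = 628       # size of the tutorial's dataset
batch_size = 5          # maximum batch size used in this tutorial
calls_per_minute = 20   # same pacing as api_calls_per_second = 20/60

# Placeholder data (assumption: stands in for the real sentences)
placeholder_sentences = [f"sentence {i}" for i in range(n_sentences)]
batches = list(batches_sketch(placeholder_sentences, batch_size))

n_batches = len(batches)
seconds_per_call = 60 / calls_per_minute
# The helper sleeps once per submitted batch, so this is a lower bound on runtime
min_runtime_seconds = n_batches * seconds_per_call

print(n_batches, len(batches[-1]), min_runtime_seconds)  # 126 3 378.0
```

In other words, embedding all 628 sentences takes 126 API calls and a little over six minutes at this pace, which is why the saved embeddings are provided.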

In [None]:
from google.auth.transport.requests import Request
from google.oauth2.service_account import Credentials
import functools
import time
from concurrent.futures import ThreadPoolExecutor
from tqdm.auto import tqdm
import math

def generate_batches(sentences, batch_size = 5):
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]

def encode_texts_to_embeddings(sentences):
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    try:
        embeddings = model.get_embeddings(sentences)
        return [embedding.values for embedding in embeddings]
    except Exception:
        return [None for _ in range(len(sentences))]
        
def encode_text_to_embedding_batched(sentences, api_calls_per_second = 0.33, batch_size = 5):
    # Generates batches and calls embedding API
    
    embeddings_list = []

    # Prepare the batches using a generator
    batches = generate_batches(sentences, batch_size)

    seconds_per_job = 1 / api_calls_per_second

    with ThreadPoolExecutor() as executor:
        futures = []
        for batch in tqdm(
            batches, total = math.ceil(len(sentences) / batch_size), position=0
        ):
            futures.append(
                executor.submit(functools.partial(encode_texts_to_embeddings), batch)
            )
            time.sleep(seconds_per_job)

        for future in futures:
            embeddings_list.extend(future.result())

    is_successful = [
        embedding is not None for sentence, embedding in zip(sentences, embeddings_list)
    ]
    embeddings_list_successful = np.squeeze(
        np.stack([embedding for embedding in embeddings_list if embedding is not None])
    )
    return is_successful, embeddings_list_successful

def clusters_2D(x_values, y_values, labels, kmeans_labels):
    fig, ax = plt.subplots()
    scatter = ax.scatter(x_values, 
                         y_values, 
                         c = kmeans_labels, 
                         cmap='Set1', 
                         alpha=0.5, 
                         edgecolors='k', 
                         s = 40)  # marker size

    # Create a mplcursors object to manage the data point interaction
    cursor = mplcursors.cursor(scatter, hover=True)

    #axes
    ax.set_title('Embedding clusters visualization in 2D')  # Add a title
    ax.set_xlabel('X_1')  # Add x-axis label
    ax.set_ylabel('X_2')  # Add y-axis label

    # Define how each annotation should look
    @cursor.connect("add")
    def on_add(sel):
        sel.annotation.set_text(labels.sentences[sel.target.index])  # df's only column is "sentences"
        sel.annotation.get_bbox_patch().set(facecolor='white', alpha=0.95) # Set annotation's background color
        sel.annotation.set_fontsize(14) 

    plt.show()


### read the data:

In [None]:
df = pd.read_csv('data/allQueries.csv', header=None, names=["sentences"])

The following `df` contains the 628 sentences that we'll create text embeddings for in this tutorial:

In [None]:
df

## 2. Generate embeddings<a name="sec2"></a>

Only run the following cells if you're generating embeddings for your own data, as the API calls take some time to complete. If you're following along with the tutorial, jump to section 3 to use the saved embeddings in `sentence_embeddings.pkl`.
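If you'd rather not decide by hand whether to rerun the slow cells, one option is a small caching pattern: load the pickle if it exists, otherwise compute and save it. This is only a sketch; `load_or_compute` is a hypothetical helper that isn't used elsewhere in this notebook.

```python
import os
import pickle

def load_or_compute(cache_path, compute_fn):
    """Return the pickled object at cache_path, computing and saving it first if missing."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as file:
            return pickle.load(file)
    result = compute_fn()
    with open(cache_path, "wb") as file:
        pickle.dump(result, file)
    return result

# Possible usage with this notebook's helper (not run here, since it calls the API):
# sentence_embeddings = load_or_compute(
#     "data/sentence_embeddings.pkl",
#     lambda: encode_text_to_embedding_batched(sentence_list)[1],
# )
```

Note that this keeps only the embeddings; the `is_successful` mask would need to be cached separately if some sentences fail.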

In [None]:
# convert our df to a list
sentence_list = df.sentences.tolist()

In [None]:
# use the encode_text_to_embedding_batched() helper function to generate embeddings
is_successful, sentence_embeddings = encode_text_to_embedding_batched(
                            sentences=sentence_list,
                            api_calls_per_second = 20/60, 
                            batch_size = 5)

In [None]:
sentence_embeddings.shape

In [None]:
# filter for successfully embedded sentences
sentence_list = np.array(sentence_list)[is_successful]

In [None]:
# write embeddings to a pickle file
with open('data/sentence_embeddings.pkl', 'wb') as file:
    pickle.dump(sentence_embeddings, file)

In [None]:
# write the successfully embedded sentence list to a pickle file
with open('data/filtered_sentences.pkl', 'wb') as file:
    pickle.dump(sentence_list, file)

## 3. Interpreting the embeddings<a name="sec3"></a>

### read the saved data:

In [None]:
with open('data/sentence_embeddings.pkl', 'rb') as file:
    sentence_embeddings = pickle.load(file)

In [None]:
with open('data/filtered_sentences.pkl', 'rb') as file:
    sentence_list = pickle.load(file)

In [None]:
sentence_embeddings

In [None]:
sentence_list

### clustering:

To group the sentences by similarity, we'll use KMeans clustering, a common unsupervised machine learning algorithm that groups data points by assigning each one to the nearest of `k` cluster centers. Here, we fit the model to our sentence embeddings and ask it to divide the corresponding sentences into `k` distinct groups or “clusters”. 

Again, don't worry about fully understanding this section if you haven't taken a machine learning class before - what's important is the number of clusters.

Import or `!pip install` the following libraries:

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

Try changing the value of `n_clusters`. You'll notice that the clusters identified in the following section tend to mix several unrelated topics if `n_clusters` is too low, and tend to split a single topic across several clusters if it's too high.
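One common heuristic for choosing the number of clusters is the "elbow" method: compute the within-cluster sum of squared distances (inertia) for several values of `k` and look for the point where it stops dropping sharply. The sketch below illustrates the idea on made-up 2D points using a bare-bones, dependency-free k-means written for this example. It is not scikit-learn's implementation, and the evenly spaced seeding is a simplifying assumption.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def kmeans_sketch(points, k, n_iter=20):
    """Bare-bones Lloyd's algorithm; returns (labels, inertia)."""
    # Assumption: seed with evenly spaced points (real libraries seed more carefully)
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    labels = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, inertia

# Toy data with three obvious groups (assumption: invented for illustration)
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

inertia_by_k = {k: kmeans_sketch(points, k)[1] for k in range(1, 5)}
for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.2f}")
```

On this toy data the inertia falls steeply up to `k=3` and barely changes afterwards. With the notebook's real embeddings, the same curve can be built by fitting `KMeans` for several values of `n_clusters` and reading each model's `inertia_` attribute, though the elbow is usually less clear-cut on real text.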

In [None]:
# this variable determines the number of clusters
n_clusters = 3

In [None]:
# run the KMeans algorithm
kmeans = KMeans(n_clusters=n_clusters, 
                random_state=0, 
                n_init = 'auto').fit(sentence_embeddings)

kmeans_labels = kmeans.labels_

In [None]:
# flatten the dimensionality of the data to help us visualize and interpret it better
PCA_model = PCA(n_components=2)
PCA_model.fit(sentence_embeddings)
new_values = PCA_model.transform(sentence_embeddings)

In [None]:
# use our helper function to display the clusters in 2D
clusters_2D(x_values = new_values[:,0], y_values = new_values[:,1], 
            labels = df, kmeans_labels = kmeans_labels)

In [None]:
clusters = [[] for cluster in range(n_clusters)]

In [None]:
# sort the sentences into lists based on their clusters
for i in range(len(sentence_list)):
    cluster_index = kmeans_labels[i]
    clusters[cluster_index].append(sentence_list[i])

### topic modeling:

Now that we've separated the sentences into distinct groups, we can use topic modeling to determine the subject of each cluster. Topic modeling is a common Natural Language Processing technique - we'll be using the Latent Dirichlet Allocation (LDA) algorithm in our analysis.
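LDA doesn't read raw text. It works on a "bag-of-words" representation in which each document becomes a list of (token id, count) pairs. The sketch below builds that representation by hand for two invented, already-tokenized documents. It mimics conceptually what the dictionary and corpus steps produce later on, but the exact ids assigned by gensim may differ.

```python
from collections import Counter

# Already-tokenized toy documents (assumption: invented for illustration)
toy_docs = [
    ["reset", "password", "account"],
    ["password", "login", "password"],
]

# Map each distinct token to an integer id, in order of first appearance
token_to_id = {}
for doc in toy_docs:
    for token in doc:
        token_to_id.setdefault(token, len(token_to_id))

# Represent each document as sorted (token_id, count) pairs; word order is discarded
bow_corpus = [sorted(Counter(token_to_id[t] for t in doc).items()) for doc in toy_docs]

print(token_to_id)  # {'reset': 0, 'password': 1, 'account': 2, 'login': 3}
print(bow_corpus)   # [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
```

Because only counts survive, cleaning steps such as lowercasing and stop word removal have a large effect on which words the topic model treats as important.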

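Import or `!pip install` the following libraries. The tokenizer and stop word list used below also rely on NLTK data packages (such as `punkt` and `stopwords`), which can be fetched once with `nltk.download()`: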

In [None]:
from gensim import corpora, models
from gensim.models import CoherenceModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

In [None]:
# helper function for cleaning and tokenizing the sentences
def clean_text(text):
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token not in stop_words and token not in punctuation]
    return tokens

In [None]:
all_topics = []
all_vis_data = []

In [None]:
# train LDA model for each cluster
for i, cluster in enumerate(clusters):
    # clean the text
    cluster_doc = [' '.join(cluster)]
    processed_docs = [clean_text(doc) for doc in cluster_doc]

    # create a dictionary and corpus
    dictionary = corpora.Dictionary(processed_docs)
    corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

    # train model
    num_topics = 3  
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

    # print the results
    topics = ""
    print(f"Cluster {i + 1} Topics:")
    for topic_num, words in lda_model.print_topics():
        print(f"Topic {topic_num + 1}: {words}")
        topics += f"{words}; "  # separate topics so they don't run together
    print("\n")
    
    # save topic and visualization data for later
    all_topics.append(topics)
    vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
    all_vis_data.append(vis_data)

In [None]:
# visualize the topics - change the index of all_vis_data to view a different cluster
pyLDAvis.display(all_vis_data[0])

### manually audit what's in each cluster:

To verify how well our clustering and topic modeling algorithms performed, we can print the contents of each cluster for cross-referencing.

In [None]:
for i, cluster in enumerate(clusters):
    print(f"Cluster {i + 1}: ")
    print(cluster)
    print("\n")

## 4. Labeling the clusters<a name="sec4"></a>

Now that we're satisfied with the number of clusters and their contents, let's create a label for each of them. We'll use a text generation model to achieve this, starting with Vertex AI's `text-bison@001`. Text-bison takes a prompt as a string and returns the response of the model.

In [None]:
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

In [None]:
for i, topic in enumerate(all_topics):
    prompt = f'''Your job is to create labels for n={n_clusters} clusters. \
    Given the topics with their associated weights, output a single, master topic \
    that summarizes all the topics identified in the cluster.\
    Topics: {topic} .'''
    print(f"Cluster: {i+1} Topic: {generation_model.predict(prompt=prompt).text}")

As you can see, this result isn't as succinct as we want. Feel free to edit the prompt to try to improve the output.
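As one possible starting point for editing the prompt, the sketch below spells out the desired length and format explicitly and ends with a cue for the answer. `build_label_prompt` is a hypothetical helper written for this example, the keyword weights are invented, and there's no guarantee this wording produces better labels. It's simply one variation to try.

```python
def build_label_prompt(topic_string, max_words=3):
    """Assemble a labeling prompt that constrains the length and format of the answer."""
    return (
        "You are labeling one cluster of sentences.\n"
        "Below are topic-model keywords with their weights.\n"
        f"Reply with a single label of at most {max_words} words, "
        "with no explanation and no punctuation.\n"
        f"Keywords: {topic_string}\n"
        "Label:"
    )

# Invented topic string in the same style as the LDA output
example_topics = '0.050*"password" + 0.030*"login" + 0.020*"account"'
prompt = build_label_prompt(example_topics)
print(prompt)
```

To try it in the loop above, pass `build_label_prompt(topic)` as the `prompt` argument to `generation_model.predict()`.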

### switching models:

Since text-bison isn't doing a very good job performing the task we've given it, let's try switching models to `gpt-3.5-turbo`. If you haven't worked with the OpenAI API before, check out my tutorial [here](https://github.com/campbellslund/OpenAI-API-for-Categorization-and-Labeling/blob/main/OpenAI%20API%20Tutorial/Using%20The%20OpenAI%20API%20for%20Categorization%20and%20Labeling%20Tutorial.ipynb) for step-by-step instructions on getting started.

In [None]:
import openai

In [None]:
# retrieving our API key from a secure file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']

In [None]:
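# returns the model's response to a given message query
# note: openai.ChatCompletion is the legacy interface and requires openai<1.0
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, # degree of randomness
                                 max_tokens=150): # limit on response tokens (about 4000 is the max for input and response combined)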
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]

In [None]:
delimiter = "####"
system_message = f"""Your job is to create labels for clusters. \
    Given the topic modeling data of topics with their associated weights, \
    output a single, master topic that summarizes all the topics identified in each cluster.\
    Clusters will be separated by {delimiter} characters."""

user_message = f"""{delimiter}"""
for i, topic in enumerate(all_topics):
    user_message += f"""{topic}{delimiter}"""
    
messages =  [  
{'role':'system', 
 'content': system_message},    
{'role':'user', 
 'content': user_message},  
]

In [None]:
response = get_completion_from_messages(messages)

In [None]:
print(response)

Much better! Now we have labels for our clusters. Again, make sure to verify these with the actual content of the clusters - it's good practice to always have a human in the loop auditing the results.