# Cross-lingual Topic Modeling Experiment using Qwen-2.5 and Multilingual Embeddings
* Notebook by Adam Lang
* Date: 12/17/2024

# Overview
* This is an experiment performing cross-lingual topic modeling using a multilingual LLM called Qwen-2.5 and multilingual embedding models.
* Emphasis on the "experiment" as some of the code may still need to be debugged.


# Background
* 1. Topic modeling is a very common technique used for extracting meaningful insights and modeling data in any industry or domain.
    * The topics are used to “tell a story” about a customer’s data through insights and visualizations.

 
* 2. Cross-lingual topic modeling is a common multi-step NLP problem when handling multilingual data.
    * Currently the process of working with multilingual data may include the following:
         * 1) Language detection → 
         * 2) Language Translation → 
         * 3) Topic modeling (e.g. BERTopic) → 
         * 4) Using an LLM to transform topics into “more interpretable” terminology → 
         * 5) Creating Data Visualizations to present the Topics to Customers

* However, I have developed a technique that leverages various transformer models and a multilingual open source LLM and multilingual embedding model to perform more robust Topic Modeling.

* While this process still involves language detection and translation as a preprocessing step, we can use the LLM as a cross-check on the results of those preprocessing tasks.

* In addition, there are some robust Data Visualization techniques that we can use to demonstrate the results.

# Multilingual Data in this dataset
* The dataset's existing labels indicate that multiple languages are represented. These labels may not be correct, because the data engineering ETL process used to detect and classify the languages is neither sensitive nor specific. Still, they hint that language detection and translation are an important first preprocessing task before topic modeling.
* Here are some of the known possible languages:
1. English (eng)
2. Spanish (esl - Latin American Spanish)
3. Portuguese (por)
4. Brazilian Portuguese (brz)
5. French (fra)
6. German (deu)
7. French-Canadian (cfr)
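
To compare the dataset's own language tags with a detector's output, the tags can be normalized to two-letter codes. The mapping below is a minimal sketch: the three-letter tags come from the list above, while the two-letter targets and the helper name `normalize_language_tag` are assumptions made for illustration and are not part of the original ETL process.

```python
# Hypothetical mapping from the dataset's language tags to two-letter codes.
# The right-hand values are an assumption, chosen to resemble ISO 639-1 labels.
DATASET_TAG_TO_CODE = {
    "eng": "en",  # English
    "esl": "es",  # Latin American Spanish
    "por": "pt",  # Portuguese
    "brz": "pt",  # Brazilian Portuguese
    "fra": "fr",  # French
    "deu": "de",  # German
    "cfr": "fr",  # Canadian French
}

def normalize_language_tag(tag):
    """Return the two-letter code for a dataset tag, or 'unknown' if unmapped."""
    if not isinstance(tag, str):
        return "unknown"
    return DATASET_TAG_TO_CODE.get(tag.strip().lower(), "unknown")

print(normalize_language_tag("BRZ"))
print(normalize_language_tag(None))
```

Collapsing regional variants (for example `brz` and `por`) into one code is a design choice; keep them separate if the variants matter for the analysis.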


# Workflow
1. Topic Modeling
    * We will perform topic modeling using BERTopic and a multilingual LLM as well as extracting keywords using KeyBERT and also MMR (maximal marginal relevance).

2. Data Visualization
    * We will experiment with some robust data visualization techniques. 
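
The workflow mentions MMR (maximal marginal relevance) for keyword selection. As a rough illustration of the idea, the sketch below picks keywords that are relevant to a document but not redundant with each other. It uses only the standard library and toy 2-d vectors; the function names and the `diversity` weighting are assumptions for this example, not the exact implementation used by KeyBERT or BERTopic.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(doc_vector, candidate_vectors, top_n=2, diversity=0.5):
    """Return indices of candidates chosen by maximal marginal relevance."""
    relevance = [cosine(doc_vector, vec) for vec in candidate_vectors]
    # Start with the candidate most similar to the document
    selected = [max(range(len(candidate_vectors)), key=lambda i: relevance[i])]
    while len(selected) < min(top_n, len(candidate_vectors)):
        remaining = [i for i in range(len(candidate_vectors)) if i not in selected]

        def mmr_score(i):
            # Penalize similarity to anything that was already picked
            redundancy = max(cosine(candidate_vectors[i], candidate_vectors[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        selected.append(max(remaining, key=mmr_score))
    return selected

# Toy 2-d "embeddings": two near-duplicate keywords and one different keyword
keywords = ["car", "cars", "store"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
document = [1.0, 0.0]

picked = mmr_select(document, vectors, top_n=2, diversity=0.7)
print([keywords[i] for i in picked])
```

With a high `diversity` the near-duplicate "cars" is skipped in favor of "store"; with `diversity=0.0` the selection falls back to pure relevance ranking.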



# Introduction
* Usually, the standard workflow for BERTopic when Topic Modeling is:

1. Embedding your documents
2. Reducing the dimensionality of embeddings (e.g. UMAP)
3. Cluster reduced embeddings (e.g. HDBSCAN)
4. Tokenize documents per cluster
5. Extract best-representing words per cluster
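
Steps 4 and 5 above can be illustrated without any model. The sketch below pools the documents of each cluster into one bag of words and scores each word with a simplified class-based TF-IDF weight, in the spirit of BERTopic's c-TF-IDF. The whitespace tokenizer, the function name, and the toy cluster labels are assumptions for illustration; in the real workflow the labels would come from steps 1 to 3.

```python
import math
from collections import Counter, defaultdict

def top_words_per_cluster(docs, labels, top_n=3):
    """Score words per cluster with a c-TF-IDF-style weight and return the top ones."""
    # Step 4: tokenize and pool all documents of a cluster into one bag of words
    cluster_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        cluster_counts[label].update(doc.lower().split())

    # Frequency of each word across all clusters, and average words per cluster
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(cluster_counts)

    # Step 5: weight = frequency in cluster * log(1 + avg_words / overall frequency)
    top_words = {}
    for label, counts in cluster_counts.items():
        scores = {w: tf * math.log(1 + avg_words / total_counts[w]) for w, tf in counts.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))  # ties broken alphabetically
        top_words[label] = ranked[:top_n]
    return top_words

docs = ["great teamwork on the project", "teamwork made the launch great",
        "thanks for the customer support", "customer support was quick thanks"]
labels = [0, 0, 1, 1]
topics = top_words_per_cluster(docs, labels, top_n=2)
print(topics)
```

Words that appear in every cluster (such as "the") receive a lower weight than words concentrated in one cluster, which is why they drop out of the top list.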


# Reasoning for using LLMs
* While BERTopic and other related models are great for easy analysis and extraction of topics, they often are **"not interpretable" and require more human interaction to make them interpretable.**
* Quite often BERTopic will give us an output topic such as: "car-car-store"
    * We could assume this topic is about cars and going to the store but it is very OPEN to interpretation.
    * This is where an LLM like Llama or Qwen-2.5 can help us produce more interpretable topic names with keywords to better represent the underlying clusters of data.
* In particular, I will leverage an open-source LLM from the Qwen family, Qwen-2.5.
    * One of the standout features of the Qwen-2 family is its **multilingual capabilities**.
    * **Thus we can use this model for cross-lingual topic modeling, taking advantage of its ability to handle multilingual and cross-lingual detection and translation.**
    * Qwen-2 was trained on data spanning **27 additional languages** beyond English and Chinese.
    * This multilingual training includes languages from diverse regions such as Western Europe, Eastern and Central Europe, the Middle East, Eastern Asia and Southern Asia.
    * The main reason for choosing this model family is its reported benchmark performance: results competitive with or better than Llama-3 and Mistral on benchmarks such as MMLU (a multitask knowledge benchmark, not a multilingual one), along with strong results on long-context "Needle in a Haystack" retrieval tests.
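
To make the idea of LLM-generated topic names concrete, here is a minimal sketch of how a keyword list such as "car-car-store" could be turned into a labeling prompt. The function name `build_topic_label_prompt` and the prompt wording are hypothetical, written for this illustration only; they are not the prompt template used later in this notebook.

```python
def build_topic_label_prompt(keywords, sample_docs, max_docs=3):
    """Assemble a plain-text prompt asking an LLM for a short topic label."""
    doc_lines = "\n".join(f"- {doc}" for doc in sample_docs[:max_docs])
    return (
        "I have a topic described by the following keywords: "
        f"{', '.join(keywords)}.\n"
        "Representative documents for this topic:\n"
        f"{doc_lines}\n"
        "Based on this information, reply with a short English label "
        "(at most five words) for the topic and nothing else."
    )

label_prompt = build_topic_label_prompt(
    keywords=["car", "car", "store"],
    sample_docs=["I drove my car to the store.", "The car store had a sale."],
)
print(label_prompt)
```

Including a few representative documents next to the keywords gives the model context that the keywords alone do not carry.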
 
# Multilingual Embeddings
* I am going to use the LLM for cross-lingual topic modeling, thus as part of the pipeline I need to use embeddings.
* You can use ANY embedding model that you want to; here we will use a `SentenceTransformer` model implementation.
* However, because we are dealing with multilingual text, I am going to use a multilingual embedding model that has performed well on text clustering and classification tasks which is what we are trying to do.
 
* Embedding Model I am using: `BAAI/bge-m3`
* Model card: https://huggingface.co/BAAI/bge-m3
* Reasons for using this model:
1) Handles long context windows up to 8192 tokens.
2) Multilingual support
3) Is able to handle 3 common retrieval functionalities of embedding models:
    * dense retrieval
    * multi-vector retrieval
    * sparse retrieval
 
* Future models to test against this model:
  1) `jinaai/jina-embeddings-v3`
  2) `Snowflake/snowflake-arctic-embed-m`

# Dependencies
* These are the dependencies:

1. `bertopic`
    * for topic modeling

2. `accelerate`
    * huggingface library for speeding up and optimizing run-time for LLMs and transformers.

3. `bitsandbytes`
    * Used for quantizing model weights (e.g. loading LLMs in 8-bit or 4-bit precision)

4. `xformers`
    * aims to improve the efficiency and memory usage of Transformers, making it possible to train larger models and handle longer sequences of data.

5. `adjustText`
    * A library to help you adjust text positions on matplotlib plots to remove or minimize overlaps with each other and data points

# Code needed to work on Sagemaker

In [None]:
%%capture 
!pip install einops

In [None]:
%%capture 
!pip install --upgrade pandas fsspec # sagemaker dependency

In [None]:
%%capture  
## upgrade accelerate to use device_map 
!pip install --upgrade accelerate ## this is for compatibility with `bitsandbytes` 

In [None]:
## check accelerate version after upgrade
import accelerate
print(f"Accelerate version: {accelerate.__version__}") 

## Main installations for this experiment 

In [None]:
%%capture
!pip install bertopic datasets bitsandbytes xformers adjustText # make sure to install ALL

In [None]:
%%capture 
## upgrade torchvision
!pip install --upgrade torchvision # if you need to upgrade torchvision run this line
!pip install --upgrade torch #upgrade torch version

Note: You may need to restart kernel after upgrading both torchvision and torch above.

In [None]:
## check versions of torch available
import torch
import torchvision 

# print versions
print(f"PyTorch version: {torch.__version__}") 
print(f"Torchvision version: {torchvision.__version__}")

# Check if GPU is available

In [None]:
# check if GPU is available 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
print(f"Using device: {device}")

# set device for PyTorch operations
if device.type == "cuda":
    torch.cuda.set_device(0) # you can use a different device ID if you have multiple GPUs running

# Install other Dependencies

In [None]:
%%capture 
!pip install seaborn
!pip install s3fs #sagemaker dependency

In [None]:
## standard Data Science imports
import pandas as pd
import numpy as np

## plotting
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython.display import HTML
import seaborn as sns

## other tools
#import response_compare
import re
import functools
from copy import deepcopy
import chardet
import html #for non ascii detection

## transformers and huggingface
import transformers
#from sentence_transformers import SentenceTransformer, util
import torch
import torchvision ## to use Qwen
import huggingface_hub

## tqdm and pandas
from tqdm.auto import tqdm
tqdm.pandas(leave=False)
#from hashlib import sha256

# Load Data from S3 bucket on AWS

In [None]:
import boto3
import s3fs
from sagemaker import get_execution_role
conn = boto3.client('s3')

## load s3 bucket
bucket = 'adamnlpbucket1'
content = conn.list_objects(Bucket=bucket)['Contents']
data_key= '<your S3 bucket file path goes here>'
data_location= 's3://{}/{}'.format(bucket,data_key)

## load df
df = pd.read_csv(data_location, index_col= False, low_memory=False)
df.head()

# Exploratory Data Analysis

In [None]:
## df info
df.info()

In [None]:
## df columns
df.columns

In [None]:
## len of df
len(df)

Summary:
* There are 423,424 rows in the dataframe. We will cut this down to a much smaller sample size for our topic modeling experiment due to time and compute power concerns.

In [None]:
## check nulls
null_counts = df.isnull().sum().sort_values(ascending=False)
null_counts

In [None]:
## check if null counts in message column
df[['message']].isnull().sum().sort_values(ascending=False)

Summary:
* I already wrangled this dataset and know that the 56,516 "null messages" are not due to any unicode issues but rather due to the message types.
* So what I will do is:
1. Take a look at the multilingual data first.
2. Keep the null values instead of removing them, and transform them to "no recognition message".
3. Filter dataframe for a sample size to use in this experiment that includes multilingual data. 

## Unicode to ascii
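* We first unescape HTML entities with `html.unescape` and then use `unidecode` to transliterate non-ASCII characters to their closest ASCII equivalents. Remember to install the library first with `pip install unidecode`.
* Note that `unidecode` also romanizes non-Latin scripts (for example Chinese or Thai characters), so any later character-level checks on `message` run against text that has already been transliterated.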
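
The two cleaning steps can be tried out without installing anything. The sketch below uses only the standard library and is not the notebook's method: `unicodedata` decomposition is swapped in for `unidecode`, so it strips accents from Latin text but simply drops characters it cannot decompose. The function name is an assumption for this example.

```python
import html
import unicodedata

def unescape_and_strip_accents(text):
    """Unescape HTML entities, then drop diacritics via Unicode decomposition."""
    if not isinstance(text, str):
        return text  # leave NaN / None untouched
    unescaped = html.unescape(text)
    decomposed = unicodedata.normalize("NFKD", unescaped)
    # Characters with no ASCII form after decomposition are dropped here,
    # whereas unidecode would try to romanize them.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(unescape_and_strip_accents("Caf&eacute; &amp; cr&egrave;me"))  # Cafe & creme
```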


In [None]:
!pip install unidecode

In [None]:
df[['message']].head()

In [None]:
df['message'].dtype

In [None]:
import html
from unidecode import unidecode

# function to process text
def process_text(text):
    if isinstance(text, str):
        # First, unescape any HTML entities
        unescaped = html.unescape(text)
        # Then, transliterate to ASCII
        return unidecode(unescaped)
    return text


In [None]:
# Apply the function to your DataFrame column
df['message'] = df['message'].apply(process_text)



In [None]:
## check nulls again
df[['message']].isna().sum()

## Multilingual Data
* First, let's look at the multilingual data.

In [None]:
## awards received
df['rec_language'].value_counts()

In [None]:
## sample of rows with these languages
df[['rec_language','message']].sample(5)

In [None]:
## awards nominated 
df['nom_language'].value_counts()

In [None]:
## sample of rows with these languages
df[['nom_language','message']].sample(3)

# HuggingFace Credentials
* This model is open source so no need to apply for credentials.
* However, we will input huggingface credentials for downloading models from the huggingface hub.

In [None]:
## hf hub login
from huggingface_hub import notebook_login
notebook_login()

# Language Detection 
* I am not convinced that the languages are correct so I am going to detect them first.
* We will use the `xlm-roberta-base-language-detection` transformer trained on 20 languages.
* We will get the model from the huggingface hub.
* We will use the Hugging Face `pipeline` API for immediate inference, which works out of the box without loading the model and tokenizer separately: https://huggingface.co/docs/transformers/main_classes/pipelines

In [None]:
# Load your model
from transformers import pipeline

## load model from HF
model = pipeline(
    'text-classification',
    model="papluca/xlm-roberta-base-language-detection",
    truncation=True,
    max_length=512
)


## Detect Language
* We define a `safe_text` function that converts NaN values to empty strings and ensures all other values are strings.
* We apply this function to the 'message' column and store the result in `processed_message`.
* This ensures that every batch passed to the model contains only strings.
* Handling the NaN values this way should avoid the `ValueError` encountered when non-string values reach the model.

Note: For empty strings (which replace NaN values), the model will still make a prediction, but it might not be meaningful. You may want to handle these cases separately in your analysis, perhaps by assigning a special label like "unknown" or "not applicable" for entries that were originally NaN.
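
One way to handle the empty-string case described in the note is to skip blank texts entirely and give them a placeholder label. The sketch below is an assumption-laden illustration: `label_languages` and `fake_classifier` are hypothetical names, and the stand-in classifier only exists so the example runs without a model. In practice `classify_batch` would wrap the language detection pipeline call.

```python
def label_languages(texts, classify_batch, empty_label="unknown"):
    """Classify non-empty texts and assign `empty_label` to blank ones."""
    labels = [empty_label] * len(texts)
    # Remember the positions of texts that actually contain characters
    keep = [i for i, text in enumerate(texts) if isinstance(text, str) and text.strip()]
    if keep:
        predictions = classify_batch([texts[i] for i in keep])
        for i, label in zip(keep, predictions):
            labels[i] = label
    return labels

# Stand-in classifier used only for this demo; a real run would call the model
def fake_classifier(batch):
    return ["en" for _ in batch]

detected = label_languages(["thank you", "", "   ", "great job"], fake_classifier)
print(detected)
```

Skipping blank inputs also saves inference time, since the model is never asked to classify text that carries no signal.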

In [None]:
from tqdm import tqdm
tqdm.pandas() # load tqdm for pandas

# Function to safely process text with nan values
def safe_text(text):
    if pd.isna(text):
        return ""  # Return empty string for NaN values
    return str(text)  # Ensure all non-NaN values are strings

# Apply safe_text function to your column with progress bar
df['processed_message'] = df['message'].progress_apply(safe_text)

# Function to process batches
def process_batch(texts):
    results = model(texts.tolist(), batch_size=32)
    return [d['label'] for d in results]

# Process in batches with a progress bar
batch_size = 1000  # Adjust this based on memory constraints
num_batches = len(df) // batch_size + (1 if len(df) % batch_size != 0 else 0)

language_labels = []
for i in tqdm(range(num_batches), desc="Processing batches"):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(df))
    batch_texts = df['processed_message'].iloc[start_idx:end_idx]
    language_labels.extend(process_batch(batch_texts))

# Assign labels to DataFrame
df['language_label'] = language_labels

# Print head
print(df.head())

In [None]:
df.head()

In [None]:
## lets see the language labels
df['language_label'].value_counts().reset_index()

In [None]:
%%capture 
!pip install plotly-express

In [None]:
import plotly.express as px 
## show languages detected
plot_title = "Languages Detected"
## axis labels are keyed by column name
labels = {"language_label": "Language"}
## plot histogram of languages for the xlm-roberta model
fig = px.histogram(df, x="language_label", template="plotly_dark",
                   title=plot_title,
                   labels=labels)
## the y axis of a histogram is a computed count, so set its title on the layout
fig.update_layout(yaxis_title="Count of text")

fig.show()

Languages from `xlm-roberta`

```
arabic (ar), bulgarian (bg), german (de), modern greek (el), english (en), spanish (es), french (fr), hindi (hi), italian (it), japanese (ja), dutch (nl), polish (pl), portuguese (pt), russian (ru), swahili (sw), thai (th), turkish (tr), urdu (ur), vietnamese (vi), and chinese (zh)
```

## Non-english text -- check the outputs

In [None]:
## Let's see the Thai outputs since it was #2 on the list
thai_rows = df[df['language_label'] == 'th']

## print the first 10 messages labeled as Thai
print('Thai Text identified:')
for message in thai_rows['message'].head(10):
    print(f"- {message}")

Summary:
* I don't believe these are Thai. They are probably just missing values (NaN, "not a number"), but we do have to be wary of certain non-English characters not displaying on screen.
* To be sure I am going to utilize some language specific libraries to take a closer look.

In [None]:
%%capture
!pip install pythainlp ## Thai specific library in Python: https://pypi.org/project/pythainlp/

In [None]:
# Function to round-trip Thai text through UTF-8
# (a valid str comes back unchanged, so this mainly acts as a sanity check)
def decode_thai(text):
    if isinstance(text, str):
        try:
            return text.encode('utf-8').decode('utf-8')
        except UnicodeEncodeError:
            return text.encode('utf-8-sig').decode('utf-8-sig')
    return text

# Create a boolean mask for Thai rows
thai_mask = df['language_label'] == 'th'

# Apply decoding to the original DataFrame using .loc
df.loc[thai_mask, 'message'] = df.loc[thai_mask, 'message'].apply(decode_thai)
df.loc[thai_mask, 'language_label'] = df.loc[thai_mask, 'language_label'].apply(decode_thai)

# Verify the changes
print(df.loc[thai_mask, ['message', 'language_label']].head())

Summary:
* The messages labeled "Thai" still show up as NaN.
* Again, these are missing values ("not a number"), not Thai text. The most likely reason for the "th" label is that `safe_text` replaced them with empty strings, and the classifier must still pick one of its 20 labels for an empty input.
* Let's try langdetect next to check whether we are missing any Chinese variants.

### Lang Detect to Extract the Chinese Variants

In [None]:
%%capture 
!pip install langdetect #install langdetect library

In [None]:
from langdetect import detect, LangDetectException

# function to detect chinese language variants
def detect_chinese_variant(text):
    if not isinstance(text, str):
        return 'zh'  # Missing or non-text values keep the general label
    try:
        lang = detect(text)
    except LangDetectException:
        return 'zh'  # If detection fails, return general Chinese
    if lang in ('zh-cn', 'zh-tw'):
        return lang  # Simplified ('zh-cn') or Traditional ('zh-tw') Chinese
    return 'zh'  # Any other result keeps the general Chinese label


In [None]:
# Create a mask for Chinese messages
chinese_mask = df['language_label'] == 'zh'

## get chinese if its there
df.loc[chinese_mask, 'language_label'] = df.loc[chinese_mask, 'message'].apply(detect_chinese_variant)

In [None]:
df['language_label'].value_counts().reset_index()

Summary
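* After using langdetect, there was no change in the language detection results. Because `detect_chinese_variant` maps every result other than `zh-cn` or `zh-tw` back to `zh`, this only shows that langdetect found no Simplified or Traditional Chinese among these rows; it does not confirm that they are Chinese.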

## Using Zhon Library to detect Chinese CJK Characters
* I am doing this to validate what was found above.

In [None]:
%%capture 
!pip install zhon 

In [None]:
import re
from zhon import hanzi  # imported for reference; the check below uses an explicit range

## detect Chinese characters via the CJK Unified Ideographs block (U+4E00 to U+9FFF)
def is_chinese(text):
    if isinstance(text, str):
        return bool(re.search(r'[\u4e00-\u9fff]', text))
    return False

# Create chinese_rows DataFrame (copy so a new column can be added safely)
chinese_rows = df[df['message'].apply(is_chinese)].copy()

# Apply the is_chinese function to the 'message' column of chinese_rows
chinese_rows['is_chinese'] = chinese_rows['message'].apply(is_chinese)

print(chinese_rows['is_chinese'].value_counts())

In [None]:
encodings = ['utf-8', 'gb18030', 'gbk', 'gb2312', 'big5']

# Re-encode as Latin-1 to recover raw bytes, then try each candidate encoding
for encoding in encodings:
    try:
        decoded = chinese_rows['message'].str.encode('latin1').str.decode(encoding)
        print(f"Decoding with {encoding}:")
        print(decoded.head())
        print("---")
    except Exception as err:
        print(f"Failed to decode with {encoding}: {err}")

### Check for Japanese and Korean
* Japanese detection: https://pypi.org/project/pykakasi/
* Korean romanization: https://github.com/osori/korean-romanizer

In [None]:
%%capture 
!pip install pykakasi hangul-romanize

In [None]:
# Function to test for Chinese, Japanese and Korean text
import pykakasi
from hangul_romanize import Transliter
from hangul_romanize.rule import academic

# Create the converters once instead of on every call
kks = pykakasi.kakasi()
transliter = Transliter(academic)

def identify_cjk(text):
    """Rough heuristic to tell Chinese, Japanese and Korean text apart."""
    if not isinstance(text, str):
        return "Not a string"

    # Check for CJK ideographs (shared by Chinese and Japanese)
    if any('\u4e00' <= char <= '\u9fff' for char in text):
        # Try Japanese conversion; pykakasi may also return readings for
        # ideographs in Chinese text, so treat this result with caution
        result = kks.convert(text)
        if any(item['hira'] for item in result):
            return "Likely Japanese"
        else:
            return "Likely Chinese"

    # Check for Korean (Hangul syllable) characters
    elif any('\uac00' <= char <= '\ud7a3' for char in text):
        romanized = transliter.translit(text)
        if romanized != text:
            return "Likely Korean"

    return "Neither Chinese, Japanese, nor Korean"

# Apply the function to your DataFrame
df['identified_language'] = df['message'].apply(identify_cjk)

# Display results
print(df['identified_language'].value_counts())

# Show some examples
for lang in ['Likely Chinese', 'Likely Japanese', 'Likely Korean']:
    print(f"\n{lang} examples:")
    print(df[df['identified_language'] == lang]['message'].head())

## Additional Language Detection Checks
* These checks should give us a much clearer picture of what's in the data that was labeled as Chinese. Based on what we find, we can determine next steps. Some possibilities:

    * If we're not seeing any Chinese characters at all, the 'zh' label might have been applied incorrectly.
    * If we're seeing other non-ASCII characters, it might be text in another non-Latin script language.
    * If we're only seeing ASCII characters, it might be transliterated Chinese or simply mislabeled data.

* Once we have this information, we can decide how to correctly process and possibly relabel this data.
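
The three possibilities above can be folded into a single helper that reports which case a message falls into. This is a minimal standard-library sketch; the function name `script_profile` and its return strings are assumptions for illustration, and the CJK test only covers the Unified Ideographs block (U+4E00 to U+9FFF).

```python
def script_profile(text):
    """Classify a message by the kind of characters it contains."""
    if not isinstance(text, str) or not text:
        return "not text"
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "contains CJK ideographs"
    if any(ord(ch) > 127 for ch in text):
        return "other non-ASCII characters"
    return "ASCII only"

samples = ["谢谢你的帮助", "Köszönöm a segítséget", "Koszonom a segitseget", None]
for sample in samples:
    print(repr(sample), "->", script_profile(sample))
```

Applied to a column, the value counts of this profile show at a glance which of the three scenarios dominates among the rows labeled 'zh'.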




In [None]:
def check_cjk(text):
    if not isinstance(text, str):
        return "Not a string"
    if any('\u4e00' <= char <= '\u9fff' for char in text):
        return "Contains CJK characters"
    return "No CJK characters"

df['cjk_check'] = df['message'].apply(check_cjk)
print(df['cjk_check'].value_counts())

In [None]:
def has_non_ascii(text):
    if isinstance(text, str):
        return any(ord(char) > 127 for char in text)
    return False

df['has_non_ascii'] = df['message'].apply(has_non_ascii)
print(df['has_non_ascii'].value_counts())

In [None]:
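def try_decoding(text):
    # Re-encode as Latin-1 to recover the raw bytes, then try each candidate encoding.
    # Note: 'iso-8859-1' accepts any byte sequence, so the later entries are never reached.
    encodings = ['utf-8', 'iso-8859-1', 'windows-1252', 'ascii']
    for encoding in encodings:
        try:
            return text.encode('iso-8859-1').decode(encoding)
        except UnicodeError:
            continue
    return text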

df['decoded_message'] = df['message'].apply(lambda x: try_decoding(x) if isinstance(x, str) else x)

In [None]:
df['decoded_message'].head()

In [None]:
df['cjk_check_after_decode'] = df['decoded_message'].apply(check_cjk)
print(df['cjk_check_after_decode'].value_counts())

### Distribution of Language Labels

In [None]:
print(df['language_label'].value_counts())

In [None]:
chinese_labeled = df[df['language_label'] == 'zh']
print(chinese_labeled['message'].head(20))

### Let's check the data types in these 'Chinese' labeled rows:

In [None]:
print(chinese_labeled['message'].apply(type).value_counts())

### For the string data labeled as Chinese, let's print out some examples along with their byte representation:


In [None]:
chinese_strings = chinese_labeled[chinese_labeled['message'].apply(lambda x: isinstance(x, str))].copy()  # copy so columns can be added later
for idx, row in chinese_strings.head(10).iterrows():
    print(f"Text: {row['message']}")
    print(f"Bytes: {row['message'].encode('utf-8')}")
    print("---")

### Let's also check for any non-ASCII characters in these 'Chinese' labeled strings:

In [None]:
# Flag strings that contain at least one non-ASCII character
def has_non_ascii(text):
    return any(ord(char) > 127 for char in text)

chinese_strings['has_non_ascii'] = chinese_strings['message'].apply(has_non_ascii)
print(chinese_strings['has_non_ascii'].value_counts())

# Print some examples with non-ASCII characters
print(chinese_strings[chinese_strings['has_non_ascii']]['message'].head())

### If we're not seeing any Chinese characters, let's check what kind of characters we are seeing in these 'Chinese' labeled rows:

In [None]:
import unicodedata

def char_details(text):
    return [(char, ord(char), unicodedata.name(char, 'Unknown')) for char in text]

for idx, row in chinese_strings.head(20).iterrows():
    print(f"Text: {row['message']}")
    print("Character details:")
    for char, code, name in char_details(row['message']):
        print(f"  '{char}': U+{code:04X} - {name}")
    print("---")

Summary:
* It appears that there is Hungarian text mislabeled as Chinese.

## Correct mislabeled Text from Chinese to Hungarian
* We have validated that this "Chinese" text is Hungarian and will make sure it is fixed.

In [None]:
import pandas as pd
from langdetect import detect, LangDetectException

## function to detect unknown
def safe_detect(text):
    if not isinstance(text, str):
        return 'unknown'
    try:
        return detect(text)
    except LangDetectException:
        return 'unknown'

# Apply language detection to the 'zh' labeled rows (copy so a column can be added safely)
chinese_labeled = df[df['language_label'] == 'zh'].copy()
chinese_labeled['detected_lang'] = chinese_labeled['message'].apply(safe_detect)

# Print the distribution of detected languages in 'zh' labeled texts
print("Distribution of detected languages in 'zh' labeled texts:")
print(chinese_labeled['detected_lang'].value_counts(normalize=True))

# Print some examples of texts labeled as 'zh' but detected as Hungarian
hungarian_examples = chinese_labeled[chinese_labeled['detected_lang'] == 'hu']
print("\nExamples of texts labeled as 'zh' but detected as Hungarian:")
print(hungarian_examples['message'].head(10))

# Create a new column for corrected language labels
df['corrected_lang'] = df['language_label']

# Update the 'zh' labeled rows with the new detection
df.loc[df['language_label'] == 'zh', 'corrected_lang'] = df.loc[df['language_label'] == 'zh', 'message'].apply(safe_detect)

# Print the distribution of corrected languages
print("\nDistribution of corrected languages:")
print(df['corrected_lang'].value_counts())

# Check how many 'zh' labels were changed
changed_labels = df[df['language_label'] != df['corrected_lang']]
print(f"\nNumber of changed labels: {len(changed_labels)}")
print(changed_labels['corrected_lang'].value_counts())

# Keep both original and corrected labels
df['original_lang'] = df['language_label']
df['language_label'] = df['corrected_lang']

# Print examples of texts that were originally labeled as 'zh' but are now detected as different languages
print("\nExamples of texts with changed language labels:")
for lang in changed_labels['corrected_lang'].unique():
    # only languages that actually replaced a 'zh' label are listed here
    print(f"\nOriginally labeled as 'zh' but detected as '{lang}':")
    examples = df[(df['original_lang'] == 'zh') & (df['corrected_lang'] == lang)]
    print(examples['message'].head(3))

# Print final statistics
print("\nFinal language distribution:")
print(df['language_label'].value_counts())

print("\nPercentage of 'zh' labels that were changed:")
percent_changed = (len(changed_labels) / len(chinese_labeled)) * 100
print(f"{percent_changed:.2f}%")

In [None]:
df['message'].sample(10)

# Create Sample Data for Topic Modeling
* I want to cut down the size of the dataset for this experiment but make sure that we have an adequate representation of the languages in the dataset.
* Here is what I did:

## Process

1. Remove the "th" labeled messages as these are really just "not a number".
2. Remove languages with fewer than 60 samples.
3. Keep original sample sizes for languages from "sw" down to "ar".
4. Reduce the sample sizes of "en", "es", and "pt" to 2401 (the size of the "sw" sample), or keep their original size if it is less than 2401.
5. Sample from each language group according to these new sample sizes.
6. Combine all samples and shuffle them.

Overall, this approach should give a more balanced dataset where the top languages ("en", "es", "pt") have similar representation to "sw", while keeping the original sample sizes for less represented languages.

We are able to adjust the `min_samples` and `top_language_target` variables if needed. This method ensures that we get a sample that represents each language more equally, with the less common languages maintaining their original representation.
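
The six steps can also be expressed without pandas, which makes the logic easy to check on toy data. The sketch below is an illustration under assumptions: rows are plain `(language, message)` tuples, the function name `balanced_sample` is hypothetical, and the thresholds in the demo are shrunk so the effect is visible on a tiny dataset.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(rows, min_samples=60, top_language_target=2401,
                    top_languages=("en", "es", "pt"), excluded=("th",), seed=42):
    """Sample (language, message) rows following the six steps listed above."""
    rng = random.Random(seed)
    by_language = defaultdict(list)
    for language, message in rows:
        by_language[language].append((language, message))

    sample = []
    for language, group in by_language.items():
        # Steps 1-2: drop excluded labels and languages with too few rows
        if language in excluded or len(group) < min_samples:
            continue
        # Steps 3-4: cap only the top languages, keep the rest at full size
        n = min(len(group), top_language_target) if language in top_languages else len(group)
        # Step 5: sample without replacement
        sample.extend(rng.sample(group, n))
    # Step 6: shuffle the combined sample
    rng.shuffle(sample)
    return sample

# Toy data with small thresholds so the effect is visible
rows = [("en", f"msg {i}") for i in range(50)] + [("sw", f"msg {i}") for i in range(10)]
rows += [("th", "") for _ in range(20)] + [("hu", "msg")] * 2
result = balanced_sample(rows, min_samples=5, top_language_target=10)
language_counts = Counter(language for language, _ in result)
print(language_counts)
```

In the toy run, "en" is capped at the target, "sw" keeps its full size, and both "th" and the under-represented "hu" are dropped.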

In [None]:
# df is the original DataFrame with 'message' and 'language_label' columns

# Set the minimum number of samples required for a language to be included
min_samples = 60

# Set the target size for the top languages (same as 'sw')
top_language_target = 2401

# Filter out languages with fewer than min_samples and exclude 'th'
language_counts = df['language_label'].value_counts()
valid_languages = language_counts[(language_counts >= min_samples) & (language_counts.index != 'th')].index

# Create a new DataFrame with only the valid languages
df_filtered = df[df['language_label'].isin(valid_languages)]

# Define the sample sizes for each language
language_samples = language_counts[valid_languages].copy()

# Set the sample size for top languages to the target
top_languages = ['en', 'es', 'pt']
for lang in top_languages:
    if lang in language_samples.index:
        language_samples[lang] = min(top_language_target, language_samples[lang])

# Function to sample from each language group
def sample_language(group, n):
    return group.sample(n=min(len(group), n), replace=False, random_state=42)

# Create the final sample (one sampled frame per language, then concatenated)
final_sample = pd.concat(
    [sample_language(group, language_samples[lang])
     for lang, group in df_filtered.groupby('language_label')]
).reset_index(drop=True)

# Shuffle the final sample
final_sample = final_sample.sample(frac=1, random_state=42).reset_index(drop=True)

# Print the distribution in the final sample
print("Final language distribution:")
print(final_sample['language_label'].value_counts())

print(f"\nTotal samples: {len(final_sample)}")

# Save the final sample
final_sample.to_csv('balanced_language_sample.csv', index=False)
print("\nBalanced sample saved to 'balanced_language_sample.csv'")

In [None]:
## final output
final_sample.head()

In [None]:
## compute percentage
def perc_of(x,y):
    z = round(((len(x)/len(y))*100),3)
    return z 

In [None]:
## percentage
perc_of(final_sample,df)

# Qwen-2.5-14B-Instruct
* This is the model we will use: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct

## Why use an Instruct model?
* The key difference between "Qwen2.5-14B" and "Qwen2.5-14B Instruct" is that the "Instruct" version is specifically fine-tuned to better follow instructions and generate text responses that are more aligned with user prompts, making it more suitable for direct interaction through conversational tasks, while the standard "Qwen2.5-14B" model serves as a more general-purpose foundation for further customization and fine-tuning by developers.

## Why are we using a 14B model and not the largest 72B model?
* According to Maarten Grootendorst, the creator of BERTopic, a mid-sized model offers a nice balance between accuracy and inference speed.
* A larger model would be more accurate but compute power would be an issue.
* A smaller model would be faster but less accurate.
* Thus we will take Maarten's advice and go with the "in-between" size.
* Previously I have used Llama-2 13B size following this same method and it worked quite well.

In [None]:
# import torch -- if not already done above
#import torch


## model_id
model_id = 'Qwen/Qwen2.5-14B-Instruct'

## device agnostic code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# print device
print(device)

# LLM Optimization & Quantization
* Before we load a 14 BILLION PARAMETER LLM, optimization is a must!
* VRAM on any device is limited, so we need to condense the model to run it locally.
* There are numerous techniques and tricks to do this but the main principle is 4-bit quantization.
* This reduces each weight from its original 16-bit or 32-bit floating-point representation to only 4 bits, which reduces the GPU memory needed to run the LLM.
* More info here:
    * QLoRA paper: https://arxiv.org/pdf/2305.14314
    * HF blog: https://huggingface.co/blog/4bit-transformers-bitsandbytes


In [None]:
from torch import bfloat16
import transformers

## now we set the quantization to load LLM with less GPU memory
## this requires the `bitsandbytes` library

## bitsandbytes config -- load the weights in 4-bit instead of 16/32-bit
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True, #4-bit quantization
    bnb_4bit_quant_type='nf4', #normalized float 4
    bnb_4bit_use_double_quant=True, #second quantization after first
    bnb_4bit_compute_dtype=bfloat16 #computation type
)

## 4 Parameters that we use
1. **load_in_4bit**
    * Allows us to load the model in 4-bit precision instead of the original 16-bit or 32-bit precision. This greatly reduces the GPU memory needed to hold the model.

2. **bnb_4bit_quant_type**
    * This is the type of 4-bit precision.
    * The original paper recommends normalized float 4-bit, so that is what we are going to use!

3. **bnb_4bit_use_double_quant**
    * This is a neat trick as it performs a second quantization after the first which further reduces the necessary bits

4. **bnb_4bit_compute_dtype**
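    * The data type used during computation. The 4-bit weights are dequantized to this type (here `bfloat16`) for the matrix multiplications, which is typically faster than computing in 32-bit floats.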
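
To give a feel for what 4-bit quantization does to individual weights, here is a deliberately simplified sketch. It is not NF4: NF4 uses non-uniformly spaced levels, whereas this example uses evenly spaced integer codes with a single shared scale (absmax quantization). The function names and toy weights are assumptions for illustration only.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed integer codes in [-7, 7] using one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [code * scale for code in codes]

weights = [0.10, -0.52, 0.25, 1.00]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
print(codes)
print([round(v, 3) for v in restored])
```

Each weight is stored as a small integer plus one shared scale per block, and the restored values differ from the originals by at most half a quantization step. Double quantization goes one step further by also quantizing those per-block scales.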

# Load LLM Model from Hugging Face

In [None]:
# Load dependencies from hf
from transformers import AutoTokenizer, AutoModelForCausalLM
import accelerate

## load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

## Qwen2.5-14B Instruct -- init model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config, #quantization config from above
    device_map='auto' ## lets accelerate place the weights, e.g. if you are distributing across GPUs
)

## model params
model.eval()

# Pipeline Setup
* Here we are going to set up a Hugging Face transformers pipeline for the LLM.

* The specific task we want is text-generation.

* The repetition penalty discourages repetitive or redundant output.
    * It is designed to address the tendency of language models to produce repeated phrases, sentences, or patterns.
    * In topic modeling we generally DO NOT want repetitive outputs.
* We also set the temperature low, which makes sampling more deterministic.

* More about pipeline params: https://medium.com/@developer.yasir.pk/understanding-the-controllable-parameters-to-run-inference-your-large-language-model-30643bb46434
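
To show what these two parameters do to the next-token distribution, here is a minimal pure-Python sketch on a toy 4-token vocabulary. The logits are invented, and the penalty rule (divide positive logits, multiply negative ones) is a simplified rendering of the common repetition-penalty scheme, not the pipeline's actual code path.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Make already-generated tokens less likely: positive logits are divided
    by the penalty, negative logits are multiplied by it."""
    out = list(logits)
    for i in set(seen_token_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [2.0, 1.0, 0.5, -1.0]           # toy scores for a 4-token vocabulary
print(softmax(logits, temperature=1.0))   # probability is spread over several tokens
print(softmax(logits, temperature=0.1))   # nearly all probability on the top token
print(apply_repetition_penalty(logits, seen_token_ids=[0, 3]))
```

A low temperature sharpens the distribution toward the highest-scoring token, which is why the labels become more repeatable from run to run.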

In [None]:
## text generator pipeline from hugging face
generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1 #penalty discourages redundant output
)

# Prompt Engineering
* We can take the model for a "test drive" just to test that we are able to prompt the model and get a result.
* Then we can go into the specific prompt template we need to instruct the model to perform cross-lingual topic modeling.
* Since we are performing cross-lingual topic modeling, I am going to ask the model to detect the language of the text I give it and translate it to English as well as one or more other languages.

In [None]:
## test prompt
prompt_test_msg = """Laci mar honapok ota vallan viszi a CRM adatmigracios projektet, az elkeszult sztorik leirasa mindig reszletes es atgondolt, a bonyolultabb fejleszteseket/javitasokat pedig a egyutt prezentaljuk a tesztelo kollegaknak, hogy ok is mindig up-to-date-ek legyenek az aktualis valtozasokkal. 
Lelkiismeretes munkaja peldaerteku mindenki szamara."""

prompt = f"Can you detect the language seen in this text: '{prompt_test_msg}', then translate it to english, french, and italian?"
res = generator(prompt)
print(res[0]['generated_text'])

# Prompt Templates for Qwen-2.5-14B-Instruct
* The prompt templates for this model can be found here:
    * Qwen home page: https://qwenlm.github.io/resources/
    * Qwen huggingface repo: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct
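
For orientation, Qwen's chat models use ChatML-style markup around each message. The hand-rolled function below is only an illustrative sketch of that layout. In practice the tokenizer's own chat template (for example via `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`) should be treated as the source of truth, and `to_chatml` is a hypothetical helper name.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in ChatML-style markup."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to start its answer
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant for labeling topics."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
print(to_chatml(messages))
```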


In [None]:
"""
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
"""

## 1. System Prompt
* First prompt needed

In [None]:
# system prompt describes information given to all conversations
system_prompt = """
<human>
You are a helpful, respectful, and honest assistant for labeling topics. 
You are skilled at analyzing text and identifying key themes and concepts.
</human>
"""

## Prompt Template we will use for topic modeling
* This is the recommended format for TOPIC MODELING. 
* It has 2 components:
    * a. **example**
        * Most LLMs will do much better generating accurate responses if you give them examples to work from. Therefore we need to show an example of what we want our output to be.
        * **A note about this prompt: it is better to keep it "general" and not specific to your domain or data. When I tried a domain-specific prompt, the LLM hallucinated because it focused on the details of the prompt rather than on the topic modeling task. The "boilerplate" prompt below, which comes from the BERTopic author Maarten Grootendorst, is a great example to use and it works well.**
      
    * b. **main prompt**


## 2. example_prompt
* This is the 2nd prompt needed.

In [None]:
# Example prompt demonstrating the output we are looking for
example_prompt = """
<human>
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.
</human>

<response>
Environmental impacts of eating meat
</response>
"""

**Summary**:

* Give the example documents or text.
* Give example keywords.
* End by stating "please create a short label of this topic"


## 3. main_prompt
* This is the 3rd prompt needed.
* The main prompt contains documents and keywords.
* Note about the code below:
  * A `[Languages]` placeholder for the `language_label` column I created above could be added to feed the LLM the previously detected language as context. The template below does not include it, and BERTopic would not fill such a tag by itself.

In [None]:
## main prompt with documents and keywords
main_prompt = """
<human>
Instructions: I have a topic that contains the following documents:
[Documents]

The topic is described by the following keywords: '[Keywords]'.

Based on the information about the topic above, please create a short label for this topic. Make sure to only return the label and nothing more.
</human>

<response>
"""

* The main_prompt has slots where the documents and keywords are inserted, and these change for every topic as we perform TOPIC MODELING.

* There are two BERTopic-specific tags that are of interest, namely `[DOCUMENTS]` and `[KEYWORDS]`. BERTopic matches these tags in uppercase, so the mixed-case `[Documents]` and `[Keywords]` written in the template above must be uppercased before the prompt is handed to BERTopic:
    * 1. `[DOCUMENTS]` contains the top 5 most relevant documents to the topic.
    * 2. `[KEYWORDS]` contains the top 10 most relevant keywords to the topic, as generated by BERTopic's c-TF-IDF algorithm.

* This template will be completed according to each topic.
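
As a minimal sketch of how such a template gets completed for one topic, the snippet below fills a miniature template with plain string replacement. The template text, the documents, and the helper name `fill_template` are invented for illustration and are not BERTopic's internal code.

```python
# Hypothetical miniature template using the same mixed-case tags as main_prompt
template = (
    "I have a topic that contains the following documents:\n[Documents]\n\n"
    "The topic is described by the following keywords: '[Keywords]'."
)

def fill_template(template, docs, keywords):
    """Insert the documents as a bulleted list and the keywords as a comma-separated string."""
    doc_block = "\n".join(f"- {d}" for d in docs)
    return template.replace("[Documents]", doc_block).replace("[Keywords]", ", ".join(keywords))

filled = fill_template(
    template,
    docs=["Thanks for leading the data migration.", "Great job presenting the fixes."],
    keywords=["migration", "project", "thanks"],
)
print(filled)
```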

And finally, we will combine this into our final prompt:

In [None]:
# final prompt format
full_prompt = system_prompt + example_prompt + main_prompt 

# Additional Data Preprocessing
* Before we create embeddings we want to store the `message` column in a separate variable.

Here's a suggested approach:

1. Create a new dataframe or a copy of the original dataframe to work with during the embedding creation and topic modeling process. This new dataframe should have the same index as the original dataframe.

2. In the new dataframe, create a new column or columns to store the embeddings, leaving the original message column untouched.

3. Perform the topic modeling process using the embeddings column(s) in the new dataframe.

4. After the topic modeling is complete, create new columns in the new dataframe to store the generated topics, keywords, or any other relevant information.

5. Since the new dataframe has the same index as the original dataframe, you can easily merge or join the new dataframe (containing the embeddings, topics, and keywords) back to the original dataframe using the index.
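
The steps above can be sketched on a toy dataframe as follows. The column values, the placeholder vectors, and the topic ids are invented. The only point is that copying keeps the index, so the new columns can be joined back without any key column.

```python
import pandas as pd

# Toy stand-in for the original data; the non-contiguous index mimics a sampled dataframe
original = pd.DataFrame({"message": ["hola", "bonjour", "hello"]}, index=[10, 42, 7])

work = original.copy()  # step 1: the copy keeps the same index as the original

# step 2: store placeholder vectors in a new column, aligned on the index
work["embeddings"] = pd.Series([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], index=work.index)

# step 4: store placeholder topic ids
work["topic"] = pd.Series([0, 0, 1], index=work.index)

# step 5: join the new columns back onto the original by index
merged = original.join(work[["embeddings", "topic"]])
print(merged)
```

Passing `index=work.index` when building each Series matters whenever the dataframe index is not a plain 0..n-1 range, because pandas aligns on index labels rather than on position.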



In [None]:
## lets review the sample df we created
final_sample.head()

In [None]:
## create copy of the sample data
df_work = final_sample.copy() 
df_work.head()

# Language Validation and Translation
* We can use the LLM to validate the `language_label` that we detected and also translate the `message` column to English prior to topic modeling.
* This removes the need for the LLM to perform BOTH topic modeling and translation at the same time, which could otherwise cause context window issues from asking it to do too much at once.
* Using the LLM as a separate language validation and translation step is the ideal setup.

## Step 1: Language Validation Using LLM
* We will use the LLM to validate the languages detected above.

In [None]:
#Using the LLM to validate language detection done above
def validate_language(text, detected_lang):
    prompt = f"""
    <human>
    Please analyze the following text and determine its language. The previously detected language was {detected_lang}.

    Text: "{text}"

    Instructions:
    1. Identify the language of the text.
    2. Compare your identification with the previously detected language ({detected_lang}).
    3. Return your language identification and whether it matches the previous detection.

    Format your response as:
    Language: [Your identified language]
    Matches Previous Detection: [Yes/No]
    </human>

    <response>
    """
    
    # return_full_text=False returns only the completion, not the prompt echoed back
    response = generator(prompt, return_full_text=False)
    return response[0]['generated_text'].strip()

# Apply the validation to your dataframe
from tqdm import tqdm

print("Validating language detection....")
tqdm.pandas()  # registers `progress_apply` on pandas objects
df_work['language_validation'] = df_work.progress_apply(lambda row: validate_language(row['message'], row['language_label']), axis=1)

## Analyze results
# You can then analyze the results to see where the LLM's detection differs from your previous detection
mismatches = df_work[df_work['language_validation'].str.contains('Matches Previous Detection: No')]
print(f"Number of language detection mismatches: {len(mismatches)}")




After analyzing and reviewing the results above, you can run the code below to update the language labels

In [None]:
# Update the language label based on LLM validation
df_work['updated_language_label'] = df_work.apply(
    lambda row: row['language_validation'].split('\n')[0].split(': ')[1] 
    if 'Matches Previous Detection: No' in row['language_validation'] 
    else row['language_label'],
    axis=1
)
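
The `split`-based extraction above assumes the reply follows the requested two-line format exactly. A slightly more defensive sketch is shown below. It assumes the string holds only the model's completion (not the echoed prompt), and `parse_validation` is a hypothetical helper name.

```python
import re

def parse_validation(response: str):
    """Return (language, matches) from a reply in the requested two-line format,
    or (None, None) when the reply does not follow it."""
    lang = re.search(r"Language:\s*(.+)", response)
    match = re.search(r"Matches Previous Detection:\s*(Yes|No)", response, re.IGNORECASE)
    if not lang or not match:
        return None, None
    return lang.group(1).strip(), match.group(1).lower() == "yes"

print(parse_validation("Language: Hungarian\nMatches Previous Detection: No"))
print(parse_validation("I think this is Hungarian."))
```

Returning `(None, None)` for malformed replies makes it easy to keep the previously detected label for those rows instead of raising an `IndexError`.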

## Step 2: LLM Translation to English
* We can now use the LLM Qwen-2.5-14B to translate the text to English.

In [None]:
def translate_to_english(text, lang):
    if lang.lower() == 'english':
        return text
    
    prompt = f"""
    <human>
    Please translate the following text from {lang} to English:

    Text: "{text}"

    Instructions:
    1. Provide an accurate English translation of the text.
    2. If the text is already in English, simply return "Already in English".

    Format your response as:
    Translation: [English translation or "Already in English"]
    </human>

    <response>
    """
    
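    # return_full_text=False returns only the completion, not the prompt echoed back
    response = generator(prompt, return_full_text=False)
    translation = response[0]['generated_text'].strip()
    
    # Extract just the translation from the response
    if "Translation:" in translation:
        translation = translation.split("Translation:")[1].strip()
    
    # The prompt asks for "Already in English" when no translation is needed; keep the original text then
    if translation.strip('"').lower().startswith("already in english"):
        return text
    
    return translation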


After verifying the code above, run the code below to translate the Award Messages to English.

In [None]:
print("Translating messages to English...")
df_work['english_translation'] = df_work.progress_apply(
    lambda row: translate_to_english(row['message'], row['updated_language_label']),
    axis=1
)


## Summary 
* Now we can use `df_work['english_translation']` for topic modeling.

# BERTopic and Topic Modeling
* Now that the LLM is set up we can get into Topic Modeling.

## 1. Prepare Embeddings
* Pre-calculating the embeddings for each document speeds up later exploration steps and lets us quickly iterate over BERTopic's hyperparameters as needed.

### Multilingual Embeddings
* I am going to use the LLM for cross-lingual topic modeling, so the pipeline also needs embeddings.
* You can use ANY embedding model that you want to; here we will use a `SentenceTransformer` implementation of an open source embedding model from Hugging Face.
* However, because we are dealing with multilingual text, I am going to use a multilingual embedding model that has performed well on text clustering and classification tasks, which is what we are trying to do.
* Embedding Model I am using: `BAAI/bge-m3`
    * Hugging Face Model card: https://huggingface.co/BAAI/bge-m3
    * This model is similar in functionality to the Jina.ai model that I had originally wanted to use.
    * 1) Handles long context windows up to 8192 tokens.
    * 2) Multilingual support
    * 3) Is able to handle 3 common retrieval functionalities of embedding models:
         * dense retrieval
         * multi-vector retrieval
         * sparse retrieval

In [None]:
%%capture
!pip install sentence-transformers

### Embedding Optimization
* To better optimize the embedding creation process and make better use of compute power I will do the following:
    * 1. Use a tqdm progress bar - The tqdm library is used to display a progress bar during the encoding process.
    * 2. Use the hugging face accelerate library -
        * This automatically handles cuda device placement, distributed training, and mixed precision.
        * The Accelerator class is used to prepare the model for acceleration and distribute the computation across multiple devices (if available).

    * 3. Use the pandas cuDF library from NVIDIA (I was going to do this but it was not needed).
        * This library helps to optimize the efficiency of pandas dataframe processing.
        * It can be used in 2 ways; the method I would use is: `%load_ext cudf.pandas`
        * Here is the repo for more information about this very popular library: https://github.com/rapidsai/cudf

    * 4. Matryoshka representation (I was going to use this, but it is a last resort).
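
For reference, the Matryoshka idea amounts to keeping only the leading dimensions of each embedding and normalizing again. The sketch below uses random stand-in vectors, and `truncate_embeddings` is a hypothetical helper. Truncation only preserves quality for models trained with a Matryoshka objective, and whether that applies to the embedding model used here is not established in this notebook.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalize again."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)  # clip avoids division by zero

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))   # random stand-ins for 1024-dimensional embeddings
small = truncate_embeddings(full, dim=256)
print(small.shape)
```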


#### Embedding Optimization Workflow
* The workflow is as follows:
    * Initialize the Accelerator instance from the accelerate library.
    * Load the specified sentence transformer model and prepare it for acceleration.
    * Iterate over the `english_translation` column of the `df_work` DataFrame in batches of 64 using the tqdm progress bar.
    * Encode each batch of text using the `embedding_model.encode` function.
    * Gather the results from all processes using `accelerator.gather`.
    * Finally, create a new `embeddings` column in the `df_work` DataFrame with the computed embeddings.


* Overall, this workflow should significantly improve the efficiency of the embedding creation process by utilizing available hardware resources and distributing the computation across multiple devices (if available).
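
The batching part of this workflow can be sketched without any model. Here `fake_encode` is a hypothetical stand-in for `embedding_model.encode`, and the texts are generated placeholders. The sketch only shows that slicing in fixed-size batches covers every row once and keeps the input order.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_encode(batch):
    """Hypothetical stand-in for `embedding_model.encode`: one 2-d vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]

texts = [f"message number {i}" for i in range(150)]
embeddings = []
for batch in iter_batches(texts, batch_size=64):
    embeddings.extend(fake_encode(batch))  # results stay in input order

print(len(embeddings), [len(b) for b in iter_batches(texts, 64)])
```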


In [None]:
## check CUDA version before install pandas cudf 
#!nvcc --version

In [None]:
## install version CUDA 12.X - based on cudf github repo: https://github.com/rapidsai/cudf
#!pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12

### Create Multilingual Embeddings
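* The code below looks for a `cached_path` directory inside the installed Transformers package and deletes it if it exists. Note that this frees disk space rather than RAM or GPU memory, and that downloaded model weights normally live in the Hugging Face hub cache (by default `~/.cache/huggingface/hub`), which this code does not touch.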

In [None]:
# clear the Hugging Face cache in your current version of the Transformers library:
# Import the necessary modules:
from transformers import __file__ as transformers_path
from pathlib import Path
import shutil

## Find cache directory path:
# Get the path to the Transformers library
transformers_dir = Path(transformers_path).parent

# Construct the path to the cache directory
cache_dir = transformers_dir / "cached_path"

In [None]:
## print the cache_dir
print(cache_dir)

In [None]:
## now clear the cache
if cache_dir.exists():
    shutil.rmtree(cache_dir)

### Load Embedding Model from HuggingFace
* This is the model I am using: https://huggingface.co/BAAI/bge-m3

In [None]:
from sentence_transformers import SentenceTransformer
from accelerate import Accelerator

# Initialize the accelerator
accelerator = Accelerator()

# Set the model name/checkpoint
embedding_model_name = "BAAI/bge-m3"

# Load the model onto the appropriate device
embedding_model = SentenceTransformer(embedding_model_name, device=accelerator.device)
embedding_model = accelerator.prepare(embedding_model)

**Code below creates the actual embeddings**

In [None]:
import pandas as pd
import torch
from tqdm import tqdm

# Create embeddings in batches of the english_translation column in the df_work dataframe
embeddings = []
batch_size = 64
bar_format = "{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]"

# Split the texts into batches
batches = [df_work['english_translation'].values[i:i + batch_size].tolist() for i in range(0, len(df_work), batch_size)]

for batch in tqdm(batches, disable=not accelerator.is_main_process, bar_format=bar_format):
    batch_embeddings = embedding_model.encode(batch, show_progress_bar=False, convert_to_tensor=True)
    # Ensure the batch_embeddings is 2-dimensional
    if batch_embeddings.dim() == 1:
        batch_embeddings = batch_embeddings.unsqueeze(0)
    embeddings.append(batch_embeddings)

# Gather the results from all processes
embeddings = accelerator.gather(torch.cat(embeddings, dim=0))

# Convert the embeddings to a pandas Series, reusing df_work's index so the rows stay aligned
if accelerator.is_main_process:
    df_work['embeddings'] = pd.Series(embeddings.cpu().numpy().tolist(), index=df_work.index)

print("Embeddings added to dataframe successfully.")

In [None]:
## view embeddings
df_work['embeddings'].head()

## 2. Sub-models
* The next step is to define a few sub-models in BERTopic.
* Then we can make small tweaks such as the number of clusters to be created, setting `random_state`, and more.
* NOTE: A technique I am NOT using here but that is worth trying is hyperparameter tuning with Bayesian optimization, which would help us choose more refined parameters. I advise using it for larger datasets, as the best hyperparameters differ from dataset to dataset!


The code below will do the following:

* Checks format of the embeddings in the DataFrame.
* If the embeddings are stored as lists, it converts them to a numpy array.
* If they're already numpy arrays, it stacks them into a 2D array.
* It then uses this array for the UMAP transformation.
* After running this, it will allow us to use reduced_embeddings for further processing or visualization.



In [None]:
import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN

# First, let's check the shape of the embeddings and convert them to a numpy array if needed
if isinstance(df_work['embeddings'].iloc[0], list):
    # If embeddings are stored as lists, convert to numpy array
    embeddings_array = np.array(df_work['embeddings'].tolist())
else:
    # If embeddings are already numpy arrays, just stack them
    embeddings_array = np.stack(df_work['embeddings'].values)

print(f"Shape of embeddings array: {embeddings_array.shape}")

# Now Create UMAP model
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=42)

# Then Create HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=150,
                        metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True)


### Reduce the embeddings just created to 2 dimensions
* This is so we can use them for visualization purposes after creating the topics.


In [None]:
# Pre-reduce embeddings for visualization
umap_2d = UMAP(n_neighbors=15,
               n_components=2,
               min_dist=0.0,
               metric='cosine',
               random_state=42)

reduced_embeddings = umap_2d.fit_transform(embeddings_array)

## print shape
print(f"Shape of reduced embeddings: {reduced_embeddings.shape}")

## 3. Representation Models
* One of the ways we are going to represent the topics is with the LLM `Qwen-2.5-14B-Instruct`.
    * This should give us a nice set of quality labels for the topics.
* However, we might want to have additional representations to view a topic from MULTIPLE angles.
* Below we will use the c-TF-IDF from BERTopic as the main representation, and KeyBERT, MMR, and Qwen-2.5-14B-Instruct as the additional representations.
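
As a reminder of what the main representation computes, c-TF-IDF scores a word in a topic as its frequency in that topic times `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the word's frequency across all topics. The sketch below is a simplified version on invented toy documents (whitespace tokenization, raw counts, no normalization), not BERTopic's implementation.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Score each word per topic as tf(word, topic) * log(1 + A / f(word)),
    where A is the average word count per topic and f(word) is the word's
    frequency across all topics."""
    counts = {t: Counter(" ".join(docs).lower().split()) for t, docs in docs_per_topic.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    avg_words = sum(total.values()) / len(counts)
    return {
        t: {w: tf * math.log(1 + avg_words / total[w]) for w, tf in c.items()}
        for t, c in counts.items()
    }

toy_topics = {  # invented toy documents, grouped by topic id
    0: ["great teamwork on the migration project", "the migration went smoothly"],
    1: ["thank you for the customer support", "excellent customer service"],
}
scores = c_tf_idf(toy_topics)
for topic, words in scores.items():
    top = sorted(words, key=words.get, reverse=True)[:3]
    print(topic, top)
```

Words that appear in every topic (such as "the") receive a smaller logarithmic weight than words concentrated in a single topic, which is what pushes topic-specific terms to the top.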


In [None]:
## import representation models from BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration

## 1. KeyBERT
keybert = KeyBERTInspired()

## 2. MMR
mmr = MaximalMarginalRelevance(diversity=0.7) ## higher value --> more diversity keywords

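## 3. Text Generation with Qwen-2.5-14B-Instruct LLM
## Pass the topic-labeling prompt (not the test prompt) and uppercase the
## placeholders, since BERTopic only fills `[DOCUMENTS]` and `[KEYWORDS]`
bertopic_prompt = full_prompt.replace("[Documents]", "[DOCUMENTS]").replace("[Keywords]", "[KEYWORDS]")
qwen2 = TextGeneration(generator, prompt=bertopic_prompt)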

### ALL Representation models combined
representation_model = {
    "KeyBERT": keybert,
    "Qwen2": qwen2,
    "MMR": mmr,
}
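
To illustrate what the `diversity` setting of MMR trades off, here is a small greedy sketch on invented 2-d vectors. It follows the general maximal marginal relevance idea (relevance to the topic minus similarity to words already picked) and is a simplified illustration, not BERTopic's exact code. The function name `mmr` and the toy vectors are assumptions.

```python
import numpy as np

def mmr(doc_vec, word_vecs, words, top_n=3, diversity=0.7):
    """Greedy maximal marginal relevance over unit-normalized vectors."""
    word_vecs = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    doc_vec = doc_vec / np.linalg.norm(doc_vec)
    relevance = word_vecs @ doc_vec        # similarity of each word to the topic
    redundancy = word_vecs @ word_vecs.T   # similarity between words
    selected = [int(np.argmax(relevance))]  # start with the most relevant word
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        scores = [
            (1 - diversity) * relevance[i] - diversity * redundancy[i, selected].max()
            for i in candidates
        ]
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]

words = ["meat", "beef", "emissions"]
word_vecs = np.array([[1.0, 0.0], [0.98, 0.2], [0.3, 0.95]])  # invented 2-d toy vectors
doc_vec = np.array([1.0, 0.3])
print(mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.7))
print(mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.0))
```

With a high diversity the second pick avoids a near-duplicate of the first keyword, while with diversity set to zero the selection falls back to pure relevance.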

# Topic Model Training
* Now that ALL models are prepared, we can start training the TOPIC MODEL.
* BERTopic is obviously what we are going to use.
* We will do the following:
    * 1. Give "sub-models" to BERTopic.
    * 2. Run `.fit_transform`
    * 3. Evaluate Topic outputs

In [None]:
import numpy as np
from bertopic import BERTopic
from tqdm import tqdm

# First, let's convert the embeddings to a numpy array
if isinstance(df_work['embeddings'].iloc[0], list):
    # If embeddings are stored as lists, convert to numpy array
    embeddings_array = np.array(df_work['embeddings'].tolist())
else:
    # If embeddings are already numpy arrays, just stack them
    embeddings_array = np.stack(df_work['embeddings'].values)

print(f"Shape of embeddings array: {embeddings_array.shape}")

## setup topic model
topic_model = BERTopic(
    # sub-models - 4 models
    embedding_model=embedding_model, #1
    umap_model=umap_model, #2
    hdbscan_model=hdbscan_model, #3
    representation_model=representation_model, #4

    # hyperparameters
    top_n_words=10,
    verbose=True,
)

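# Train topic model -- english_translation column & precomputed embeddings
# (fit_transform is a single call, so wrapping it in tqdm adds nothing; verbose=True logs progress)
topics, probs = topic_model.fit_transform(df_work['english_translation'].tolist(), embeddings=embeddings_array)

print("Topic modeling complete.")
n_topics = len(set(topics) - {-1})  # exclude the -1 topic (outliers) if present
print(f"Number of topics found: {n_topics}")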


# Topic Model Results Evaluation
* Now we can evaluate the results of the topic modeling.

In [None]:
## get topic info -- pandas DF
topic_model.get_topic_info()

## Deep Dive into a topic
* We can now investigate 1 of the topics from the model results.

In [None]:
## topic from model --> Keywords extracted using KeyBERT model
topic_model.get_topic(0, full=True)['KeyBERT']

In [None]:
## topic from model --> MMR (maximal marginal relevance)
topic_model.get_topic(0, full=True)['MMR']

In [None]:
## topic from model --> Qwen-2 model
topic_model.get_topic(0, full=True)['Qwen2']

## Summary of Results
* add info here

In [None]:
## another example of LLM labeled topics
topic_model.get_topic(3, full=True)['Qwen2']

# Assign Topic Labels using Qwen-2.5-14B-Instruct Model
* Now we can use the open source LLM to assign more "interpretable" topic labels than what BERTopic would give us. 

In [None]:
## generate topic labels using Qwen-2.5-14B-Instruct LLM
#qwen2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Qwen2"].values()]

## Tag topic labels to topic_model dataframe
#topic_model.set_topic_labels(qwen2_labels)

# Function to label topics
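def get_topic_label(topic_words, topic_docs):
    documents = "\n".join(topic_docs[:5])  # Use the first 5 documents as examples
    keywords = ", ".join(topic_words)
    
    # Fill the placeholders (either casing of the tags may appear in full_prompt)
    prompt = full_prompt
    for tag, value in (("[Documents]", documents), ("[DOCUMENTS]", documents),
                       ("[Keywords]", keywords), ("[KEYWORDS]", keywords)):
        prompt = prompt.replace(tag, value)
    
    # return_full_text=False returns only the generated label, not the prompt echoed back
    response = generator(prompt, return_full_text=False)
    label = response[0]['generated_text'].strip()
    return label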

# Apply to the topics
topic_labels = {}
for topic_id, topic_info in topic_model.get_topics().items():
    if topic_id != -1:  # Exclude the outlier topic
        topic_words = [word for word, _ in topic_info]
        topic_docs = topic_model.get_representative_docs(topic_id)
        label = get_topic_label(topic_words, topic_docs)
        topic_labels[topic_id] = label

topic_model.set_topic_labels(topic_labels)

# Visualize Topics

## 1. Interactive Plot

In [None]:
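## topic visualization -- interactive
## the first argument is the list of documents the model was fitted on
topic_model.visualize_documents(df_work['english_translation'].tolist(),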
                                reduced_embeddings=reduced_embeddings,
                                hide_annotations=True,
                                hide_document_hover=False,
                                custom_labels=True)


## 2. Word Cloud