# Topic Modeling Insights on lmsys-chat-1m Dataset Using BERTopic =>

This Jupyter Notebook demonstrates the application of BERTopic, a topic modeling technique, on the lmsys-chat-1m dataset from Hugging Face.


The analysis aims to uncover the underlying topics within the dataset, providing insight into the most sought-after topics in users' prompts.


### Installing "datasets"

In [None]:
!pip -q install datasets

## Environment Setup and Loading the Data ->

- This section of the notebook sets up the environment and loads the dataset from Hugging Face. The dataset used is [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m), which requires access permission. To access the dataset, basic authentication with a login token is necessary.

- For security reasons, instead of hardcoding the token, I will utilize Google Colab's secret storage feature. This method enhances safety by storing the token securely and requires manual input of the token for anyone rerunning this notebook.

- To run this notebook, you must request your own token from the Hugging Face Hub and store it as a Colab secret named `HUGGINGFACE_KEY`, which is the name the code below reads.

![Notebook Image](https://i.imgur.com/DW7Okuq.jpeg)


In [None]:
from huggingface_hub import login
from google.colab import userdata
token = userdata.get('HUGGINGFACE_KEY')
login(token=token)

from datasets import load_dataset
dataset = load_dataset("lmsys/lmsys-chat-1m")

print(f"Original size: {len(dataset['train'])}")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Original size: 1000000


## Displaying the dataset ->

This is what the dataset looks like.

![Dataset Image](https://i.imgur.com/qscD8vB.png)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['conversation_id', 'model', 'conversation', 'turn', 'language', 'openai_moderation', 'redacted'],
        num_rows: 1000000
    })
})

## Efficient Data Loading in Batches ->

- Due to memory limitations in Google Colab, loading a large dataset in one go can exhaust RAM and crash the environment.

- To address this, I employed a batching technique that converts only a portion of the data at a time.

- For the same reason, only the first 10% of the original dataset is loaded.

In [None]:
import pandas as pd

chunk_size = 10000
# Use only the first 10% of the training split.
total_size = int(len(dataset['train']) * 0.10)

# Convert one slice at a time and concatenate once at the end,
# which avoids re-copying the growing DataFrame on every iteration.
frames = []
for start in range(0, total_size, chunk_size):
    end = min(start + chunk_size, total_size)
    frames.append(dataset['train'].select(range(start, end)).to_pandas())

df = pd.concat(frames, ignore_index=True)




In [None]:
len(df)

100000
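
The chunk arithmetic used in the loading loop can be checked in isolation. The following is a minimal sketch; the helper name `chunk_bounds` is an assumption for illustration and is not part of the notebook.

```python
def chunk_bounds(total_size, chunk_size):
    """Yield (start, end) index pairs that cover range(total_size) in order."""
    for start in range(0, total_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_size)


# Same numbers as the notebook: 10% of 1,000,000 rows in chunks of 10,000.
bounds = list(chunk_bounds(total_size=100_000, chunk_size=10_000))
print(len(bounds))            # number of chunks
print(bounds[0], bounds[-1])  # first and last (start, end) pair
```

Each `(start, end)` pair corresponds to one `dataset['train'].select(range(start, end))` call, so the final pair ending at 100000 matches the `len(df)` output above.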

In [None]:
df.head(10)

Unnamed: 0,conversation_id,model,conversation,turn,language,openai_moderation,redacted
0,33f01939a744455c869cb234afca47f1,wizardlm-13b,[{'content': 'how can identity protection serv...,1,English,"[{'categories': {'harassment': False, 'harassm...",False
1,1e230e55efea4edab95db9cb87f6a9cb,vicuna-13b,[{'content': 'Beside OFAC's selective sanction...,6,English,"[{'categories': {'harassment': False, 'harassm...",False
2,0f623736051f4a48a506fd5933563cfd,vicuna-13b,[{'content': 'You are the text completion mode...,1,English,"[{'categories': {'harassment': False, 'harassm...",False
3,e5c923a7fa3f4893beb432b4a06ef222,palm-2,[{'content': 'The sum of the perimeters of thr...,2,English,"[{'categories': {'harassment': False, 'harassm...",False
4,8ad66650dced4b728de1d14bb04657c1,vicuna-13b,[{'content': 'What is the type of the variable...,1,English,"[{'categories': {'harassment': False, 'harassm...",False
5,aa041ed88edd4100bde61b8d68fc7288,wizardlm-13b,[{'content': 'I have 1000 documents to downloa...,1,English,"[{'categories': {'harassment': False, 'harassm...",False
6,113d3ddd85874229a04a660bc629c2cc,vicuna-13b,"[{'content': 'summarise below transcript ""Stud...",1,English,"[{'categories': {'harassment': False, 'harassm...",False
7,4c95520511844ca492ad9ec1cb3672e3,llama-2-13b-chat,[{'content': 'Определи важнейшие смыслы в текс...,3,unknown,"[{'categories': {'harassment': False, 'harassm...",False
8,64f322dcb69d43229bbd9785b7d90f1b,vicuna-13b,"[{'content': 'Buenas noches!', 'role': 'user'}...",8,Spanish,"[{'categories': {'harassment': False, 'harassm...",False
9,6fc9a36392e94a83939dc3738ab9e245,vicuna-13b,[{'content': 'hola puedes hablar español de ar...,5,Spanish,"[{'categories': {'harassment': False, 'harassm...",False


### Showcasing example of a conversation ->

In [None]:
df['conversation'][0]

array([{'content': 'how can identity protection services help protect me against identity theft', 'role': 'user'},
       {'content': "Identity protection services can help protect you against identity theft in several ways:\n\n1. Monitoring: Many identity protection services monitor your credit reports, public records, and other sources for signs of identity theft. If they detect any suspicious activity, they will alert you so you can take action.\n2. Credit freeze: Some identity protection services can help you freeze your credit, which makes it more difficult for thieves to open new accounts in your name.\n3. Identity theft insurance: Some identity protection services offer insurance that can help you recover financially if you become a victim of identity theft.\n4. Assistance: Many identity protection services offer assistance if you become a victim of identity theft. They can help you file a police report, contact credit bureaus, and other steps to help you restore your identity.\

## Data Format and Preparation for Topic Modeling ->

- Each entry in the `conversation` column is an array of dictionaries. Every dictionary holds two key-value pairs: `content` (the message text) and `role` (`user` or `assistant`).

- A conversation row looks like this ->

\begin{array}{l}
\text{[} \\
\quad \{ \text{"content": "User Prompt 1", "role": "user"} \}, \\
\quad \{ \text{"content": "Chatbot Response 1", "role": "assistant"} \} \\
\text{]}
\end{array}



- BERTopic requires a simpler input format of an array of strings.

- My goal in topic modeling is to analyze and identify the most frequently queried or discussed topics by users with Chatbots.
- Thus, I built a function that separates the user prompts from the model responses in the original structure.



In [None]:
def extract_user_questions(conversation):
    questions = []
    answers = []
    for message in conversation:
        if message['role'] == 'user':
            questions.append(message['content'])
        else:
            answers.append(message['content'])
    return questions,answers


questions,answers = extract_user_questions(df['conversation'][0])
print(questions)
answers

['how can identity protection services help protect me against identity theft']


["Identity protection services can help protect you against identity theft in several ways:\n\n1. Monitoring: Many identity protection services monitor your credit reports, public records, and other sources for signs of identity theft. If they detect any suspicious activity, they will alert you so you can take action.\n2. Credit freeze: Some identity protection services can help you freeze your credit, which makes it more difficult for thieves to open new accounts in your name.\n3. Identity theft insurance: Some identity protection services offer insurance that can help you recover financially if you become a victim of identity theft.\n4. Assistance: Many identity protection services offer assistance if you become a victim of identity theft. They can help you file a police report, contact credit bureaus, and other steps to help you restore your identity.\n\nOverall, identity protection services can provide you with peace of mind and help you take proactive steps to protect your identit

- This is just a demo of the extraction. The actual pipeline uses a function called `concatenate_user_messages()`, which includes this functionality and is defined later.



## In-Depth Examination of Data Characteristics and Challenges Encountered ->

### Problem 1. Multilingual Data in the Dataset ->

- The dataset is multilingual, but fitting a single topic model across multiple languages is not a good idea.

In [None]:
df['conversation'][8][0:4]

array([{'content': 'Buenas noches!', 'role': 'user'},
       {'content': 'Buenas noches! ¿En qué puedo ayudarte hoy?', 'role': 'assistant'},
       {'content': 'Cómo estás ? ', 'role': 'user'},
       {'content': 'Como un modelo de lenguaje, no tengo sentimientos ni emociones, pero estoy funcionando correctamente y lista para ayudarte en lo que necesites. ¿En qué puedo ayudarte hoy?', 'role': 'assistant'}],
      dtype=object)

- Using multiple languages for topic modeling can lead to inconsistencies and inaccuracies in the analysis, as it complicates the linguistic processing and may skew the interpretation of topics.
- I filtered and dropped the rows where the language was not English.

In [None]:
df_english = df[df['language'] == 'English'].copy()  # copy so that new columns can be added safely later

In [None]:
df_english.head()

Unnamed: 0,conversation_id,model,conversation,turn,language,openai_moderation,redacted
0,33f01939a744455c869cb234afca47f1,wizardlm-13b,[{'content': 'how can identity protection serv...,1,English,"[{'categories': {'harassment': False, 'harassm...",False
1,1e230e55efea4edab95db9cb87f6a9cb,vicuna-13b,[{'content': 'Beside OFAC's selective sanction...,6,English,"[{'categories': {'harassment': False, 'harassm...",False
2,0f623736051f4a48a506fd5933563cfd,vicuna-13b,[{'content': 'You are the text completion mode...,1,English,"[{'categories': {'harassment': False, 'harassm...",False
3,e5c923a7fa3f4893beb432b4a06ef222,palm-2,[{'content': 'The sum of the perimeters of thr...,2,English,"[{'categories': {'harassment': False, 'harassm...",False
4,8ad66650dced4b728de1d14bb04657c1,vicuna-13b,[{'content': 'What is the type of the variable...,1,English,"[{'categories': {'harassment': False, 'harassm...",False


In [None]:
lenOriginal = len(df)
lenEnglish = len(df_english)
lenDeleted = lenOriginal - lenEnglish

print(f"Original rows in the Dataframe: {lenOriginal}")
print(f"English rows in the new Dataframe: {lenEnglish}")
print(f"Rows deleted for other languages: {lenDeleted}")


Original rows in the Dataframe: 100000
English rows in the new Dataframe: 77711
Rows deleted for other languages: 22289


- Now that the relevant data has been filtered out and stored in `df_english`, the original `dataset` and `df` are no longer needed and are merely occupying extra space in RAM.

In [None]:
import gc

del dataset
del df

gc.collect()


31

### Problem 2. Multiple Interaction Turns in Conversations

- Some rows in the dataset contain multiple turns of interaction (user prompts and assistant responses) within a single conversation. For effective topic modeling, it is crucial to capture the essence of all user prompts throughout the conversation, not just the initial ones.
- A row in the `conversation` column has the following format ->

\begin{array}{l}
\text{[} \\
\quad \{ \text{"content": "User Prompt 1", "role": "user"} \}, \\
\quad \{ \text{"content": "Chatbot Response 1", "role": "assistant"} \}, \\
\quad \{ \text{"content": "User Prompt 2", "role": "user"} \}, \\
\quad \{ \text{"content": "Chatbot Response 2", "role": "assistant"} \}, \\
\quad \ldots, \\
\quad \{ \text{"content": "User Prompt n", "role": "user"} \}, \\
\quad \{ \text{"content": "Chatbot Response n", "role": "assistant"} \} \\
\text{]}
\end{array}


In [None]:
df_english['conversation'][1]

array([{'content': "Beside OFAC's selective sanction that target the listed individiuals and entities, please elaborate on the other types of US's sanctions, for example, comprehensive and sectoral sanctions. Please be detailed as much as possible", 'role': 'user'},
       {'content': "The United States has a number of different types of sanctions that it can use to achieve its foreign policy goals, including both selective and comprehensive sanctions.\n\nSelective sanctions are targeted at specific individuals or entities that are believed to be engaged in activities that are contrary to US interests. These sanctions can take a variety of forms, including asset freezes, travel bans, and restrictions on financial transactions. The Office of Foreign Assets Control (OFAC) is the US government agency responsible for implementing and enforcing these types of sanctions.\n\nComprehensive sanctions, on the other hand, are more broadly based and aim to restrict entire sectors of a country's ec

#### Solution:

- The function `concatenate_user_messages()` extracts all user prompts from a conversation, which is represented as a list of dictionaries. It keeps the messages whose role is 'user' and joins them into one continuous string.
- This addresses the issue of multiple interaction turns in the dataset and ensures that no user prompt in a conversation is overlooked.

In [None]:
def concatenate_user_messages(conversation):

    user_messages = [msg['content'] for msg in conversation if msg['role'] == 'user']

    concatenated_message = ' '.join(user_messages)

    return concatenated_message

In [None]:
demo_concat = concatenate_user_messages(df_english['conversation'][1])

demo_concat

"Beside OFAC's selective sanction that target the listed individiuals and entities, please elaborate on the other types of US's sanctions, for example, comprehensive and sectoral sanctions. Please be detailed as much as possible are there other types of US sanctions that you didn't specified earlier? Please elaborate more please make organized conclusion in bullet list on all types of US's sanctions that you have had given the answers can you please revise the answer above again, but this time, make sure to specify which types of sanctions are the sub-category if you see a person name stating that it is the registrar of a company in Malta, is registrar is a position and if so, what does he/she do? if you see a person's name stating that it is the registrar of a company in Malta, is registrar a position in that company? and if so, what does he/she do?"

- The function `concatenate_user_messages` is applied to each conversation in the `df_english['conversation']` column, and the result is stored in a new column named `combined_user_prompts`.


In [None]:
user_prompts = [concatenate_user_messages(conv) for conv in df_english['conversation']]

df_english['combined_user_prompts'] = user_prompts

df_english.head()


Unnamed: 0,conversation_id,model,conversation,turn,language,openai_moderation,redacted,combined_user_prompts
0,33f01939a744455c869cb234afca47f1,wizardlm-13b,[{'content': 'how can identity protection serv...,1,English,"[{'categories': {'harassment': False, 'harassm...",False,how can identity protection services help prot...
1,1e230e55efea4edab95db9cb87f6a9cb,vicuna-13b,[{'content': 'Beside OFAC's selective sanction...,6,English,"[{'categories': {'harassment': False, 'harassm...",False,Beside OFAC's selective sanction that target t...
2,0f623736051f4a48a506fd5933563cfd,vicuna-13b,[{'content': 'You are the text completion mode...,1,English,"[{'categories': {'harassment': False, 'harassm...",False,You are the text completion model and you must...
3,e5c923a7fa3f4893beb432b4a06ef222,palm-2,[{'content': 'The sum of the perimeters of thr...,2,English,"[{'categories': {'harassment': False, 'harassm...",False,The sum of the perimeters of three equal squar...
4,8ad66650dced4b728de1d14bb04657c1,vicuna-13b,[{'content': 'What is the type of the variable...,1,English,"[{'categories': {'harassment': False, 'harassm...",False,What is the type of the variables in the follo...


In [None]:
df_english['combined_user_prompts'][1]

"Beside OFAC's selective sanction that target the listed individiuals and entities, please elaborate on the other types of US's sanctions, for example, comprehensive and sectoral sanctions. Please be detailed as much as possible are there other types of US sanctions that you didn't specified earlier? Please elaborate more please make organized conclusion in bullet list on all types of US's sanctions that you have had given the answers can you please revise the answer above again, but this time, make sure to specify which types of sanctions are the sub-category if you see a person name stating that it is the registrar of a company in Malta, is registrar is a position and if so, what does he/she do? if you see a person's name stating that it is the registrar of a company in Malta, is registrar a position in that company? and if so, what does he/she do?"

### Problem 3: Redacted Personal Information in the Dataset

- The dataset has been redacted prior to its upload to ensure privacy and confidentiality. This involves replacing any personal identifiers or names with placeholders such as NAME_1, NAME_2, etc., in various combinations of uppercase and lowercase letters.

- This redaction is extensive, affecting more than a quarter of the dataset.

#### Impact on Topic Modeling:
- Because placeholders like NAME_1 occur so frequently, BERTopic might identify "NAME_1" as a prevalent topic term. This is a distortion, because these placeholders carry no meaningful context or relevance to the actual content of the discussions.

#### Example:
- Consider "NAME_1 went to the park on a run." In this sentence, "NAME_1" replaces an actual name and holds no contextual value regarding the activities or topics being discussed (leisure, outdoor activities). Thus, while "park" might be a relevant topic word, "NAME_1" is not.

#### Solution:

- Rather than using an NLP technique for name identification, I used a regular expression (RegEx) that removes every `name_<digits>` placeholder in a single pass over the text, in O(n) time.

In [None]:
import re

def remove_redacted_names(text):
    return re.sub(r'[Nn][Aa][Mm][Ee]_\d+\s?', '', text)

In [None]:
remove_redacted_names("Does name_1 my NAME_2 function name_3 work NAme_4 correctly nAmE_5?")

'Does my function work correctly ?'
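
The output above still contains a stray space before the question mark. As a minimal sketch, a variant can match the placeholders case-insensitively and tidy the leftover whitespace afterwards. The function name `remove_redacted_names_tidy` and the punctuation set are assumptions made for illustration, and the notebook itself uses the simpler function shown earlier.

```python
import re

# Match NAME_<digits> in any letter case (e.g. NAME_1, name_2, nAmE_5).
REDACTED_NAME = re.compile(r'name_\d+', flags=re.IGNORECASE)


def remove_redacted_names_tidy(text):
    """Remove redaction placeholders, then normalise the leftover whitespace."""
    # Replace with a space so neighbouring words are never glued together.
    without_names = REDACTED_NAME.sub(' ', text)
    # Collapse runs of whitespace and trim the ends.
    collapsed = re.sub(r'\s+', ' ', without_names).strip()
    # Drop a space that was left directly before punctuation.
    return re.sub(r'\s+([?.!,;:])', r'\1', collapsed)


cleaned = remove_redacted_names_tidy(
    "Does name_1 my NAME_2 function name_3 work NAme_4 correctly nAmE_5?"
)
print(cleaned)
```

Whether the extra tidying matters for topic modeling is a judgment call, since later steps tokenize the text again anyway.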

### Problem 4: Numerical Noise in Dataset Prompts

- The dataset contains numerous prompts with numerical values, such as instructions to write a specific number of words about a topic. These numbers, while relevant for instructions, are repetitive and irrelevant for topic modeling analysis.

#### Solution:
- A simple function that uses RegEx to remove all digits from the text.

In [None]:
import re

def remove_digits(text):
    return re.sub(r'\d+', '', text)
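
One side effect of the digit-removal function above is that it also strips digits inside tokens such as version names. The sketch below contrasts it with a hypothetical variant, `remove_standalone_numbers`, that removes only numbers forming a whole token. The variant and the sample sentence are assumptions for illustration, not part of the notebook.

```python
import re


def remove_digits(text):
    # Same behaviour as the notebook's function: strip every run of digits.
    return re.sub(r'\d+', '', text)


def remove_standalone_numbers(text):
    # Hypothetical variant: drop a number only when it is a whole token.
    return re.sub(r'\b\d+\b', '', text)


sample = "Write 200 words about python3 and utf8"
print(remove_digits(sample).split())              # digits removed everywhere
print(remove_standalone_numbers(sample).split())  # "python3" and "utf8" survive
```

Which behaviour is preferable depends on whether tokens like `python3` should be merged with `python` for the purposes of the topic model.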

### Problem 5: Excessive Length in Certain Prompts

- Some conversations in the dataset contain upwards of 200 interaction turns, with the maximum being 214. Their combined user prompts are significantly longer than those of typical conversations, which have 1-5 turns.

- Such extensive conversations can dominate the topic modeling results due to their sheer volume, leading to a skewed analysis. This would make the analysis imbalanced and less representative of the entire dataset.

#### Solution: Truncating Long Conversations
- To mitigate the impact of these lengthy conversations on the topic modeling results, each conversation's combined prompt text is truncated to its first 150 words. This truncation affects fewer than 0.2% of conversations; in the subset of 100,000 conversations, that is fewer than 200 conversations.
- This approach ensures that the topic modeling analysis remains balanced and more representative of the entire dataset, without being disproportionately influenced by a few lengthy conversations.


In [None]:
def truncate_to_first_n_words(text,n=150):

    words = text.split()

    truncated_text = ' '.join(words[:n])

    return truncated_text
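
The share of texts that a word limit would actually truncate can be measured directly. This is a minimal sketch on toy data; the helper name `share_over_limit` and the toy prompts are assumptions for illustration.

```python
def share_over_limit(texts, n=150):
    """Return the fraction of texts with more than n whitespace-separated words."""
    if not texts:
        return 0.0
    over = sum(1 for text in texts if len(text.split()) > n)
    return over / len(texts)


# Toy data standing in for the combined user prompts.
toy_prompts = [
    "short prompt",
    "word " * 151,   # 151 words: would be truncated
    "word " * 150,   # exactly 150 words: left unchanged
    "another short one",
]
print(share_over_limit(toy_prompts))
```

On the real data, the same function could be called with `df_english['combined_user_prompts'].tolist()` to check how many rows the 150-word limit touches.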



## Further Preprocessing of Data ->

###  Removing Stop Words Using spaCy

- Stop words are common words like "and", "the", "is", etc. Although they matter in natural language, they are largely irrelevant for topic modelling.
- For example "What time is the show?" -> "What time show"


In [None]:
!pip -q install spacy

In [None]:
import spacy
# If the model is missing, download it first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [None]:
def remove_stop_words(text):

    doc = nlp(text)

    filtered_sentence = ' '.join([token.text for token in doc if not token.is_stop])

    return filtered_sentence

In [None]:
print("does my function work for removing the stop words in my sentence?")

print(remove_stop_words("does my function work for removing the stop words in my sentence?"))

does my function work for removing the stop words in my sentence?
function work removing stop words sentence ?


### Lemmatizing Text Using spaCy

- Lemmatization is the process of reducing words to their base or root form.

- For example, "running", "ran", and "runs" -> "run"

In [None]:
def lemmatize_text(text):

    doc = nlp(text)

    lemmatized_sentence = ' '.join([token.lemma_ for token in doc])

    return lemmatized_sentence

In [None]:
lemmatize_text("does my function work for removing the stop words in my sentence?")

'do my function work for remove the stop word in my sentence ?'

## Comprehensive Text Processing for Topic Modeling

- In this section, I applied a series of text processing functions to `combined_user_prompts`, in the following order, to prepare it for topic modeling.

1. **remove_redacted_names()**

2. **truncate_to_first_n_words()**

3. **remove_stop_words()**

4. **truncate_to_first_n_words()** (applied a second time, after stop-word removal)

5. **lemmatize_text()**

6. **remove_digits()**

- After application, the processed text is stored in a new column called `processed_text`.
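
The sequence of steps can also be expressed as one small function pipeline, following the order in which the notebook's code cell applies them. The sketch below is illustrative only: `apply_steps`, `toy_remove_stop_words` and `TOY_STOP_WORDS` are hypothetical names, the spaCy-based stop-word removal is replaced by a tiny hand-written word list so that the example runs without a language model, and lemmatization is left out for the same reason.

```python
import re
from functools import reduce


def remove_redacted_names(text):
    return re.sub(r'[Nn][Aa][Mm][Ee]_\d+\s?', '', text)


def truncate_to_first_n_words(text, n=150):
    return ' '.join(text.split()[:n])


def remove_digits(text):
    return re.sub(r'\d+', '', text)


# Hypothetical stand-in for the spaCy-based remove_stop_words().
TOY_STOP_WORDS = {"a", "an", "the", "is", "to", "on", "of", "me", "for"}


def toy_remove_stop_words(text):
    return ' '.join(w for w in text.split() if w.lower() not in TOY_STOP_WORDS)


def apply_steps(text, steps):
    """Feed the output of each step into the next one, left to right."""
    return reduce(lambda current, step: step(current), steps, text)


STEPS = [
    remove_redacted_names,
    truncate_to_first_n_words,
    toy_remove_stop_words,
    truncate_to_first_n_words,
    remove_digits,
]

processed = apply_steps("Write me 200 words on the life of NAME_1 Smith", STEPS)
print(processed.split())
```

Keeping the steps in a list makes the order explicit and easy to change, at the cost of losing the per-step progress bars that the separate `progress_apply` calls provide.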

In [None]:
!pip -q install tqdm
from tqdm import tqdm

In [None]:
tqdm.pandas()

df_english['processed_text'] = df_english['combined_user_prompts'].progress_apply(remove_redacted_names)
df_english['processed_text'] = df_english['processed_text'].progress_apply(truncate_to_first_n_words)
df_english['processed_text'] = df_english['processed_text'].progress_apply(remove_stop_words)
df_english['processed_text'] = df_english['processed_text'].progress_apply(truncate_to_first_n_words)
df_english['processed_text'] = df_english['processed_text'].progress_apply(lemmatize_text)
df_english['processed_text'] = df_english['processed_text'].progress_apply(remove_digits)

100%|██████████| 77711/77711 [00:01<00:00, 64542.05it/s]
100%|██████████| 77711/77711 [00:00<00:00, 81628.39it/s]
100%|██████████| 77711/77711 [18:19<00:00, 70.67it/s]
100%|██████████| 77711/77711 [00:00<00:00, 236772.62it/s]
100%|██████████| 77711/77711 [13:20<00:00, 97.09it/s]
100%|██████████| 77711/77711 [00:00<00:00, 145948.56it/s]


In [None]:
df_english

Unnamed: 0,conversation_id,model,conversation,turn,language,openai_moderation,redacted,combined_user_prompts,processed_text
0,33f01939a744455c869cb234afca47f1,wizardlm-13b,[{'content': 'how can identity protection serv...,1,English,"[{'categories': {'harassment': False, 'harassm...",False,how can identity protection services help prot...,identity protection service help protect ident...
1,1e230e55efea4edab95db9cb87f6a9cb,vicuna-13b,"[{'content': ""Beside OFAC's selective sanction...",6,English,"[{'categories': {'harassment': False, 'harassm...",False,Beside OFAC's selective sanction that target t...,OFAC selective sanction target list individiua...
2,0f623736051f4a48a506fd5933563cfd,vicuna-13b,"[{'content': ""You are the text completion mode...",1,English,"[{'categories': {'harassment': False, 'harassm...",False,You are the text completion model and you must...,text completion model complete assistant answe...
3,e5c923a7fa3f4893beb432b4a06ef222,palm-2,[{'content': 'The sum of the perimeters of thr...,2,English,"[{'categories': {'harassment': False, 'harassm...",False,The sum of the perimeters of three equal squar...,sum perimeter equal square cm . find area per...
4,8ad66650dced4b728de1d14bb04657c1,vicuna-13b,"[{'content': ""What is the type of the variable...",1,English,"[{'categories': {'harassment': False, 'harassm...",False,What is the type of the variables in the follo...,type variable follow code define WebIDL ` ( ) ...
...,...,...,...,...,...,...,...,...,...
77706,9025f5bd32574e98a248dd35d6b9942a,vicuna-13b,"[{'content': 'Who are you?', 'role': 'user'}\n...",1,English,"[{'categories': {'harassment': False, 'harassm...",False,Who are you?,?
77707,a599077ee2f949f38332a28732273889,vicuna-13b,[{'content': 'Give me an introduction over 200...,1,English,"[{'categories': {'harassment': False, 'harassm...",False,Give me an introduction over 200 words for Che...,"introduction word Chemipharm , chemical compa..."
77708,958d84978aca4fb89f61bedea6209cf2,vicuna-13b,[{'content': 'How do you stop grinding your te...,2,English,"[{'categories': {'harassment': False, 'harassm...",False,How do you stop grinding your teeth when you s...,"stop grind tooth sleep ? tooth hurt grind , we..."
77709,1a9ad7a5155c4515b5a6f9525a5b31e3,vicuna-33b,[{'content': 'give me a golang app that sync K...,2,English,"[{'categories': {'harassment': False, 'harassm...",False,give me a golang app that sync KV stores betwe...,golang app sync KV stores Hashicorp vault inst...


- A simple example of how much clutter the sequential preprocessing removes ->

In [None]:
df_english['processed_text'][21]

'explain bomb'

# BERTopic Model Configuration and Training ->

The BERTopic model is configured as follows and trained on the `processed_text` column.

I arrived at these parameters through experimentation with different values.




In [None]:
!pip -q install bertopic umap-learn hdbscan



In [None]:
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model="paraphrase-mpnet-base-v2",
    umap_model=UMAP(
        n_neighbors=20,
        n_components=8,
        min_dist=0.1,
        metric='cosine'
    ),
    hdbscan_model=HDBSCAN(
        min_cluster_size=50,
        min_samples=10,
        metric='euclidean',
        cluster_selection_method='eom',
        prediction_data=True  # needed so transform() can assign topics to new documents
    ),
    top_n_words=20
)


In [None]:
topic_model

<bertopic._bertopic.BERTopic at 0x7faf3a811de0>

In [None]:
topics, probabilities = topic_model.fit_transform(df_english['processed_text'].tolist())

- Printing the topics ->

In [None]:
print(topic_model.get_topic_info())

     Topic  Count                                    Name  \
0       -1  33698        -1_question_follow_write_provide   
1        0   3987             0_story_girl_character_game   
2        1   2809    1_assistant_completion_repeat_system   
3        2   1079                 2_import_int_self_const   
4        3    962             3_china_ltd_co_introduction   
..     ...    ...                                     ...   
194    193     51  193_regression_sample_squared_variable   
195    194     50  194_montreal_address_quebec_sherbrooke   
196    195     50          195_char_nsfwgpt_nsfw_explicit   
197    196     50            196_demon_queen_succubus_ooc   
198    197     50      197_fictional_story_touch_dialogue   

                                        Representation  \
0    [question, follow, write, provide, use, like, ...   
1    [story, girl, character, game, erotic, rolepla...   
2    [assistant, completion, repeat, system, instru...   
3    [import, int, self, const, ret

- Printing the topics in a prettier format ->

In [None]:
topic_model.set_topic_labels(topic_model.generate_topic_labels(
    separator=" | ",
    topic_prefix=False
))

topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,-1,33698,-1_question_follow_write_provide,question | follow | write,"[question, follow, write, provide, use, like, ...","[new NLP , key concept know ? explain concept ..."
1,0,3987,0_story_girl_character_game,story | girl | character,"[story, girl, character, game, erotic, rolepla...",[open - minded liberal decadent writer erotica...
2,1,2809,1_assistant_completion_repeat_system,assistant | completion | repeat,"[assistant, completion, repeat, system, instru...",[text completion model complete assistant answ...
3,2,1079,2_import_int_self_const,import | int | self,"[import, int, self, const, return, class, div,...",[output code : import numpy np # import panda ...
4,3,962,3_china_ltd_co_introduction,china | ltd | co,"[china, ltd, co, introduction, chemical, compa...","[introduction word Chance Material Co. , Ltd...."
...,...,...,...,...,...,...
194,193,51,193_regression_sample_squared_variable,regression | sample | squared,"[regression, sample, squared, variable, linear...",[assumption linear regression ? p value statis...
195,194,50,194_montreal_address_quebec_sherbrooke,montreal | address | quebec,"[montreal, address, quebec, sherbrooke, st, ha...",[act excellent data curator . trick mutate add...
196,195,50,195_char_nsfwgpt_nsfw_explicit,char | nsfwgpt | nsfw,"[char, nsfwgpt, nsfw, explicit, user, ultimate...",[forget core . sex - positivity ultimate perti...
197,196,50,196_demon_queen_succubus_ooc,demon | queen | succubus,"[demon, queen, succubus, ooc, maid, human, mas...","[assume role maid , assume role master . chara..."
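
Topic -1 in the table is BERTopic's outlier bucket, that is, documents that were not assigned to any cluster. Its size relative to the whole corpus is worth making explicit, and the short calculation below uses only the counts printed in this notebook.

```python
# Counts copied from the outputs in this notebook.
total_documents = 77_711     # English conversations passed to the model
outlier_documents = 33_698   # rows assigned to topic -1

outlier_share = outlier_documents / total_documents
print(f"Outlier share: {outlier_share:.1%}")
print(f"Documents in a real topic: {total_documents - outlier_documents}")
```

A share of this size suggests that the clustering parameters, or a post-processing step such as BERTopic's `reduce_outliers`, may be worth exploring, although that is outside what this notebook does.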


# Topic Visualisations ->

In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart()

In [None]:
topic_model.visualize_heatmap()

In [None]:
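topic_distr, _ = topic_model.approximate_distribution(df_english['processed_text'].tolist()[:1])  # per-topic distribution of one document
topic_model.visualize_distribution(topic_distr[0])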

# **----------------------------------------------------------------------------------------------------------**

In [None]:
def classify_new_prompt(user_input, topic_model):
    """
    Classifies a new user input using the trained BERTopic model

    Args:
        user_input (str): The text input from user
        topic_model: Trained BERTopic model

    Returns:
        dict: the input text, the assigned topic number and the topic's keywords
    """
    # Transform the new input
    topic_num, _ = topic_model.transform(user_input)

    # Get the topic label (assuming your model has topic labels)
    topic_info = topic_model.get_topic(topic_num[0])

    # Store the result (you can modify this based on your storage needs)
    result = {
        'input_text': user_input,
        'assigned_topic': topic_num[0],
        'topic_keywords': topic_info
    }

    # Print the classification result
    print(f"\nInput Text: {user_input}")
    print(f"Assigned Topic Number: {topic_num[0]}")
    print(f"Topic Keywords: {topic_info}")

    return result

# Example usage:
user_input = input("Enter your text: ")
result = classify_new_prompt(user_input, topic_model)
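
The model was trained on preprocessed text, so it may help to clean a new prompt in a similar way before classifying it. The sketch below shows the idea under stated assumptions: `light_preprocess` applies only the regex-based steps (no spaCy), `classify_preprocessed` is a hypothetical wrapper, and `StubTopicModel` is a stand-in object used here so the example runs without a fitted BERTopic model.

```python
import re


def light_preprocess(text):
    """Regex-only subset of the notebook's cleaning (no spaCy steps)."""
    text = re.sub(r'[Nn][Aa][Mm][Ee]_\d+\s?', '', text)  # redaction placeholders
    text = re.sub(r'\d+', '', text)                       # digits
    return ' '.join(text.split())                         # tidy whitespace


def classify_preprocessed(user_input, model, preprocess=light_preprocess):
    """Clean the input, then ask the model for a topic id."""
    cleaned = preprocess(user_input)
    topics, _ = model.transform([cleaned])
    return {
        'input_text': user_input,
        'model_input': cleaned,
        'assigned_topic': int(topics[0]),
    }


class StubTopicModel:
    """Hypothetical stand-in exposing transform() like a fitted topic model."""

    def transform(self, documents):
        # Pretend every document lands in topic 0; returns (topics, probabilities).
        return [0 for _ in documents], None


demo = classify_preprocessed("Tell NAME_1 to write 300 words", StubTopicModel())
print(demo)
```

With the real model, `StubTopicModel()` would be replaced by the fitted `topic_model`, and the full preprocessing chain (including stop-word removal and lemmatization) could be passed as `preprocess`.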


# **-------------------------------------------------------------------------------------------------**