<a href="https://colab.research.google.com/github/cohere-ai/sandbox-topically/blob/main/notebooks/Intro%20-%20Topically%20with%20BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Making Sense of Text Collections with Language Models
This notebook demonstrates how to make sense of a large text archive using large language models (LLMs), a task known as topic modeling. Here we cluster 3,000 news and discussion headlines from Hacker News and then name the resulting clusters.

### Text Embedding

The headlines were embedded using Cohere's [Embed](https://docs.cohere.ai/embed-reference/) endpoint. More on this process and dataset here: [Combing For Insight in 10,000 Hacker News Posts With Text Clustering](https://txt.cohere.ai/combing-for-insight-in-10-000-hacker-news-posts-with-text-clustering/) and in this [video: Exploring News Headlines With Text Clustering](https://www.youtube.com/watch?v=23qfPq0m7XA).

### Clustering into topics
In this notebook, we use [BERTopic](https://github.com/MaartenGr/BERTopic) to cluster these embeddings into groups. Learn more about BERTopic in this [video: BERTopic for Topic Modeling](https://www.youtube.com/watch?v=uZxQz87lb84).

### Assigning topic names with Cohere's [Generate](https://docs.cohere.ai/generate-reference)

After grouping the texts into topics, we use [topically](https://github.com/cohere-ai/sandbox-topically/) to suggest names for these clusters (we use "topics" and "clusters" to mean the same thing).



In [None]:
!pip install bertopic topically

In [1]:
from topically import Topically
from bertopic import BERTopic
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd



## Load embeddings and Hacker News dataset
We have already prepared the embeddings and can simply download them.

In [2]:
!wget https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_embeds.npy
# or
# !curl https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_embeds.npy -o askhn3k_embeds.npy

zsh:1: command not found: wget
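
If neither `wget` nor `curl` is available, as in the output above, the file can also be fetched from Python. Below is a minimal sketch using only the standard library; `download_if_missing` is a hypothetical helper name and is not part of the notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

EMBEDS_URL = (
    "https://storage.googleapis.com/cohere-assets/blog/"
    "text-clustering/data/askhn3k_embeds.npy"
)


def download_if_missing(url, dest):
    """Download `url` to `dest` unless the file already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        urlretrieve(url, dest)
    return dest


# Usage (requires network access):
# download_if_missing(EMBEDS_URL, "askhn3k_embeds.npy")
```

Skipping the download when the file is already present keeps repeated runs of the notebook from fetching the embeddings again.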


In [2]:
# Load the embeddings matrix
embeds = np.load('askhn3k_embeds.npy')

# Load the dataframe containing the text and metadata of the posts
df = pd.read_csv('https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_df.csv', index_col=0)

print(f'Loaded a DataFrame with {len(df)} rows and an embeddings matrix of dimensions {embeds.shape}')

Loaded a DataFrame with 3000 rows and an embeddings matrix of dimensions (3000, 1024)


## Cluster with BERTopic
We'll use the KMeans clustering algorithm here to group the 3,000 texts into 8 clusters (you can change this number).

In [3]:
# Configure BERTopic to cluster with KMeans (8 clusters) in place of its default HDBSCAN.
# BERTopic accepts the clustering model through the `hdbscan_model` argument.
cluster_model = KMeans(n_clusters=8)
topic_model = BERTopic(hdbscan_model=cluster_model)

# df is a dataframe. df['title'] is the column of text we're modeling
df['topic'], probabilities = topic_model.fit_transform(df['title'], embeds)
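
To see what the clustering step produces in isolation, here is a minimal sketch that runs scikit-learn's KMeans on a synthetic matrix standing in for `embeds`. The names `toy_embeds` and `cluster_embeddings`, the matrix size, and the fixed `random_state` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the embeddings matrix: 300 rows, 16 dimensions.
rng = np.random.default_rng(0)
toy_embeds = rng.normal(size=(300, 16))


def cluster_embeddings(embeddings, n_clusters=8, seed=0):
    """Return one integer cluster id per row of `embeddings`."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(embeddings)


labels = cluster_embeddings(toy_embeds)

# One label per row, each in the range 0..7.
print(labels[:10])
# Number of rows assigned to each cluster.
print(np.bincount(labels, minlength=8))
```

Fixing the seed makes the cluster ids repeatable across runs. The notebook cell leaves it unset, so topic numbering may differ each time the notebook is executed.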

## Naming topics with Topically

In [4]:
# Pass in Cohere API key, or it will ask for it
app = Topically()

Enter your Cohere API Key········


In [8]:
# Name each cluster. This makes one request to Cohere Generate per cluster,
# so with 8 topics it calls Generate 8 times.
df['topic_name'], topic_names = app.name_topics((df['title'], df['topic']))

In [10]:
# We can see the suggested names of these topics
topic_names

{0: 'Tech, productivity, and the future',
 1: 'Personal growth',
 2: 'User experience and product development',
 3: 'Data breaches and cybersecurity',
 4: 'Resources for learning programming',
 5: 'Jobs and job hunting',
 6: 'Who is hiring?',
 7: 'Employment issues'}
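
The per-document `topic_name` column can be thought of as a lookup of each row's topic id in this dictionary. Below is a minimal sketch with pandas, using a hypothetical three-row frame (`toy_df`) and a subset of the names above. It illustrates the relationship and is not necessarily how topically builds the column.

```python
import pandas as pd

# Hypothetical miniature of the notebook's data: three headlines with topic ids.
toy_df = pd.DataFrame(
    {
        "title": [
            "Is S3 down?",
            "How to learn about the history of computing?",
            "Do you like/use Reddit's redesign?",
        ],
        "topic": [3, 4, 3],
    }
)

# Subset of the generated names, keyed by topic id.
toy_names = {
    3: "Data breaches and cybersecurity",
    4: "Resources for learning programming",
}

# Look up each row's topic id in the dictionary.
toy_df["topic_name"] = toy_df["topic"].map(toy_names)
print(toy_df)

# Count how many headlines fall under each name.
print(toy_df["topic_name"].value_counts())
```

The same `map` call can reapply an edited dictionary if we decide to rename a topic by hand.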

In [7]:
# Preview the generated names
df[['title', 'topic', 'topic_name']]

| | title | topic | topic_name |
|---|---|---|---|
| 0 | I'm a software engineer going blind, how shoul... | 1 | Other workplace-related topics |
| 1 | Am I the longest-serving programmer – 57 years... | 1 | Other workplace-related topics |
| 2 | Is S3 down? | 3 | The world of the web |
| 3 | What tech job would let me get away with the l... | 1 | Other workplace-related topics |
| 4 | What books changed the way you think about alm... | 4 | Self-improvement |
| ... | ... | ... | ... |
| 2995 | Do you like/use Reddit's redesign? | 3 | The world of the web |
| 2996 | What's your best startup idea that you're not ... | 2 | Success stories of startups |
| 2997 | Freelancer? Seeking freelancer? (June 2021) | 5 | Hiring and recruiting issues |
| 2998 | How to learn about the history of computing? | 4 | Self-improvement |


In [14]:
# Preview the headlines in one of the topics
df[df['topic'] == 1][['title', 'topic', 'topic_name']]

| | title | topic | topic_name |
|---|---|---|---|
| 0 | I'm a software engineer going blind, how shoul... | 1 | Personal growth |
| 1 | Am I the longest-serving programmer – 57 years... | 1 | Personal growth |
| 3 | What tech job would let me get away with the l... | 1 | Personal growth |
| 11 | What's the most valuable thing you can learn i... | 1 | Personal growth |
| 14 | I've been slacking off at Google for 6 years. ... | 1 | Personal growth |
| ... | ... | ... | ... |
| 2966 | Senior developers, what would you tell a young... | 1 | Personal growth |
| 2971 | What to do when all you have is talent? | 1 | Personal growth |
| 2978 | anyone ever drop everything and leave software... | 1 | Personal growth |
| 2981 | What can I do to turn things around and make m... | 1 | Personal growth |


Running `name_topics` again produces new names, which is presumably why the first preview above shows different names from the `topic_names` listing. We can now visualize the texts and their topics:

In [20]:
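# Attach the generated names as custom labels for the topics.
topic_model.set_topic_labels(topic_names)

# Plot the headlines in 2D, colored by topic and labeled with the custom names.
topic_model.visualize_documents(df['title'],
                                embeddings=embeds,
                                topics=list(range(8)),
                                custom_labels=True,
                                width=900,
                                height=600)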

There are several ways to improve the generated names, with more planned:

1. **Make a custom prompt** specific to the dataset. Use the prompts in `prompts/prompts.py` as a starting point.
1. Pass `num_generations=5` to `name_topics()`. This tells Cohere Generate to return five suggested names for each topic, and the one with the highest likelihood is selected as the topic name.
1. [coming soon] Include c-TF-IDF keywords in the prompt.
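
As a rough illustration of the second option, the sketch below picks the highest-likelihood candidate from several generations. The `Candidate` type, the `pick_best` helper, and the scores are all hypothetical; they do not reflect topically's internals or real API output.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    text: str
    likelihood: float  # hypothetical log-likelihood score; higher is better


def pick_best(candidates):
    """Return the text of the candidate with the highest likelihood."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: c.likelihood).text


# Made-up candidates for a single topic.
candidates = [
    Candidate("Career advice", -1.8),
    Candidate("Personal growth", -0.9),
    Candidate("Life in tech", -2.4),
]

best = pick_best(candidates)
print(best)  # Personal growth
```

Asking for more generations costs more per topic but gives the selection step more candidates to choose from.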