<a target="_blank" href="https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/llmu/Topic_Modeling.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Topic Modeling: Analyzing Hacker News with Six Language Understanding Methods
Large language models give machines a vastly improved representation and understanding of language. These abilities give developers more options for content recommendation, analysis, and filtering.

In this notebook we take thousands of the most popular posts from Hacker News and demonstrate some of these capabilities:

1. Given an existing post title, retrieve the most similar posts (nearest neighbor search using embeddings)
1. Given a query that we write, retrieve the most similar posts
1. Plot the archive of articles by similarity (where similar posts are close together and different ones are far)
1. Cluster the posts to identify the major common themes
1. Extract major keywords from each cluster so we can identify what the cluster is about
1. (Experimental) Naming clusters with a generative language model


## Dataset: Top 3,000 Ask HN posts
We will use the top 3,000 posts from the Ask HN section of Hacker News. We provide both a CSV containing the posts and their embeddings, computed with Cohere's small embedding model.

## Setup
Let's start by installing the tools we'll need and then importing them.

In [None]:
# Install Cohere for embeddings, UMAP to reduce embeddings to 2 dimensions,
# Altair for visualization, Annoy for approximate nearest neighbor search,
# tqdm for progress bars, and BERTopic for its c-TF-IDF algorithm
# TODO: upgrade to "cohere>5"
!pip install "cohere<5" umap-learn altair annoy datasets tqdm bertopic

In [3]:
#@title Import libraries (Run this cell to execute required code) {display-mode: "form"}

import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
from sklearn.cluster import KMeans
from bertopic._ctfidf import ClassTFIDF
from sklearn.feature_extraction.text import CountVectorizer

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

Next, we can download the embeddings matrix. It has 3,000 rows (one for each post) and 1024 columns (meaning each post title is represented with a 1024-dimensional embedding).

In [None]:
!wget https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_embeds.npy

In [5]:
#Load the embeddings matrix
embeds = np.load('askhn3k_embeds.npy')

# Load the dataframe containing the text and metadata of the posts
df = pd.read_csv('https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_df.csv', index_col=0)

print(f'Loaded a DataFrame with {len(df)} rows and an embeddings matrix of dimensions {embeds.shape}')

Loaded a DataFrame with 3000 rows and an embeddings matrix of dimensions (3000, 1024)


In [7]:
# Let's glance at the contents of the dataframe with the text and metadata
df.head()

Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,"I'm a software engineer going blind, how should I prepare?",,"I&#x27;m a 24 y&#x2F;o full stack engineer (I know some of you are rolling your eyes right now, just highlighting that I have experience on frontend apps as well as backend architecture). I&#x27;ve been working professionally for ~7 years building mostly javascript projects but also some PHP. Two years ago I was diagnosed with a condition called &quot;Usher&#x27;s Syndrome&quot; - characterized by hearing loss, balance issues, and progressive vision loss.<p>I know there are blind software engineers out there. My main questions are:<p>- Are there blind frontend engineers?<p>- What kinds of software engineering lend themselves to someone with limited vision? Backend only?<p>- Besides a screen reader, what are some of the best tools for building software with limited vision?<p>- Does your company employ blind engineers? How well does it work? What kind of engineer are they?<p>I&#x27;m really trying to get ahead of this thing and prepare myself as my vision is degrading rather quickly. I&#x27;m not sure what I can do if I can&#x27;t do SE as I don&#x27;t have any formal education in anything. I&#x27;ve worked really hard to get to where I am and don&#x27;t want it to go to waste.<p>Thank you for any input, and stay safe out there!<p>Edit:<p>Thank you all for your links, suggestions, and moral support, I really appreciate it. Since my diagnosis I&#x27;ve slowly developed a crippling anxiety centered around a feeling that I need to figure out the rest of my life before it&#x27;s too late. I know I shouldn&#x27;t think this way but it is hard not to. I&#x27;m very independent and I feel a pressure to &quot;show up.&quot; I will look into these opportunities mentioned and try to get in touch with some more members of the blind engineering community.",,zachrip,3270,1587332026,2020-04-19 21:33:46+00:00,story,22918980,,473.0,,
1,Am I the longest-serving programmer – 57 years and counting?,,"In May of 1963, I started my first full-time job as a computer programmer for Mitchell Engineering Company, a supplier of steel buildings. At Mitchell, I developed programs in Fortran II on an IBM 1620 mostly to improve the efficiency of order processing and fulfillment. Since then, all my jobs for the past 57 years have involved computer programming. I am now a data scientist developing cloud-based big data fraud detection algorithms using machine learning and other advanced analytical technologies. Along the way, I earned a Master’s in Operations Research and a Master’s in Management Science, studied artificial intelligence for 3 years in a Ph.D. program for engineering, and just two years ago I received Graduate Certificates in Big Data Analytics from the schools of business and computer science at a local university (FAU). In addition, I currently hold the designation of Certified Analytics Professional (CAP). At 74, I still have no plans to retire or to stop programming.",,genedangelo,2634,1590890024,2020-05-31 01:53:44+00:00,story,23366546,,531.0,,
2,Is S3 down?,,I&#x27;m getting<p>{\n &quot;errorCode&quot; : &quot;InternalError&quot;\n}<p>When I attempt to use the AWS Console to view s3,,iamdeedubs,2589,1488303958,2017-02-28 17:45:58+00:00,story,13755673,,1055.0,,
3,What tech job would let me get away with the least real work possible?,,"Hey HN,<p>I&#x27;ll probably get a lot of flak for this. Sorry.<p>I&#x27;m an average developer looking for ways to work as little as humanely possible.<p>The pandemic made me realize that I do not care about working anymore. The software I build is useless. Time flies real fast and I have to focus on my passions (which are not monetizable).<p>Unfortunately, I require shelter, calories and hobby materials. Thus the need for some kind of job.<p>Which leads me to ask my fellow tech workers, what kind of job (if any) do you think would fit the following requirements :<p>- No &#x2F; very little involvement in the product itself (I do not care.)<p>- Fully remote (You can&#x27;t do much when stuck in the office. Ideally being done in 2 hours in the morning then chilling would be perfect.)<p>- Low expectactions &#x2F; vague job description.<p>- Salary can be on the lower side.<p>- No career advancement possibilities required. Only tech, I do not want to manage people.<p>- Can be about helping other developers, setting up infrastructure&#x2F;deploy or pure data management since this is fun.<p>I think the only possible jobs would be some kind of backend-only dev or devops&#x2F;sysadmin work. But I&#x27;m not sure these exist anymore, it seems like you always end up having to think about the product itself. Web dev jobs always required some involvement in the frontend.<p>Thanks for any advice (or hate, which I can&#x27;t really blame you for).",,lmueongoqx,2022,1617784863,2021-04-07 08:41:03+00:00,story,26721951,,1091.0,,
4,What books changed the way you think about almost everything?,,I was reflecting today about how often I think about Freakonomics. I don&#x27;t study it religiously. I read it one time more than 10 years ago. I can only remember maybe a single specific anecdote from the book. And yet the simple idea that basically every action humans take can be traced back to an incentive has fundamentally changed the way I view the world. Can anyone recommend books that have had a similar impact on them?,,anderspitman,2009,1549387905,2019-02-05 17:31:45+00:00,story,19087418,,1165.0,,


## Building a semantic search index
For nearest-neighbor search, we can use the open-source Annoy library. Let's create a semantic search index and feed it all the embeddings.

In [8]:
# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('askhn.ann')

True

## 1- Given an existing post title, retrieve the most similar posts (nearest neighbor search using embeddings)
We can query neighbors of a specific post using `get_nns_by_item`.

In [9]:
# Choose an example (we'll retrieve others similar to it)
example_id = 50

# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id,
                                                10, # Number of results to retrieve
                                                include_distances=True)
# Format and print the text and distances
results = pd.DataFrame(data={'post titles': df.iloc[similar_item_ids[0]]['title'], 
                             'distance': similar_item_ids[1]}).drop(example_id)

print(f"Query post:'{df.iloc[example_id]['title']}'\nNearest neighbors:")
results

Query post:'Pick startups for YC to fund'
Nearest neighbors:


Unnamed: 0,post titles,distance
731,What should we fund at YC Research?,0.782082
1859,Which successful startups were rejected by YC?,0.832642
1965,Did your YC (or other incubator) startup fail? What are you doing now?,0.93235
2123,Who is seeking a cofounder?,0.967259
2910,Who's looking for a cofounder?,0.976471
1603,"Non-VC backed founders, any tips on growth?",0.979501
1206,Who's looking for a co-founder?,0.99703
1880,How to raise a seed round for people with no elite connections?,1.007137
2537,Obtaining initial users for a startup,1.007389
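
The `distance` column can be hard to read on its own. According to Annoy's documentation, the `'angular'` metric is the Euclidean distance between normalized vectors, `sqrt(2 * (1 - cos))`. Assuming that definition, the sketch below converts a distance back to a cosine similarity. The helper name `angular_to_cosine` is ours and is not part of Annoy.

```python
import numpy as np

def angular_to_cosine(distance):
    """Convert an Annoy 'angular' distance to a cosine similarity.

    Assumes distance = sqrt(2 * (1 - cosine_similarity)).
    """
    distance = np.asarray(distance, dtype=float)
    return 1.0 - distance ** 2 / 2.0

# Two distances taken from the results table above
print(angular_to_cosine([0.782082, 1.007389]))
```

Under this conversion, a distance of 0 means the same direction (cosine 1), and a distance near 1.41 means the two embeddings are roughly orthogonal.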


## 2- Given a query that we write, retrieve the most similar posts
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

Because we need to embed the query, you'll need your API key for this next cell. [Sign up to Cohere for free](https://os.cohere.ai/) and get one if you haven't yet. This gives you $75 of credits which, with the small embedding model, is enough to embed the entire Lord of the Rings trilogy (plus The Hobbit) not once, but ten times.

In [10]:
# Paste your API key here. Remember to not share publicly
api_key = ''

# Create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key)

In [15]:
query = "What is your most profound life insight?"

# Get the query's embedding
# We'll need to embed the query using the same model that embedded the archive
# so the query and archive are using the same embedding space.
query_embed = co.embed(texts=[query],
                  model="small-20220425", 
                  truncate="RIGHT").embeddings

# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(query_embed[0],10,
                                                include_distances=True)
# Format & print the results
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['title'], 
                             'distance': similar_item_ids[1]})
print(f"Query:'{query}'\nNearest neighbors:")
results

Query:'What is your most profound life insight?'
Nearest neighbors:


Unnamed: 0,texts,distance
2867,Did you learn any life changing lessons this year?,0.812961
460,What inspires you to persevere through adversity?,0.84565
100,What has your work taught you that other people don't realize?,0.846071
1985,What's your most interesting life goal currently?,0.848561
1430,What influenced your personal growth the most?,0.851577
25,Name one idea that changed your life,0.883912
650,What things do you wish you discovered earlier?,0.891235
1614,What have the past 12 months taught you?,0.921628
1387,What has the past 12 months taught you?,0.92462
329,What was the best decision you made in your career?,0.927513
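
Annoy returns approximate neighbors. With only a few thousand vectors, an exact brute-force search is cheap and can serve as a sanity check. The following is a minimal sketch using NumPy only. The function `exact_nearest_neighbors` and the random stand-in data are illustrative assumptions; in the notebook you would pass `query_embed[0]` and `embeds` instead.

```python
import numpy as np

def exact_nearest_neighbors(query_vec, matrix, k=10):
    """Return (row indices, cosine similarities) of the k rows closest to query_vec."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Normalize so that a dot product equals the cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    sims = matrix_unit @ query_unit
    # Sort by descending similarity and keep the top k
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Stand-in data: 100 random 16-dimensional "embeddings"
rng = np.random.default_rng(0)
demo_embeds = rng.normal(size=(100, 16))

# Querying with row 7 itself should return row 7 first
ids, sims = exact_nearest_neighbors(demo_embeds[7], demo_embeds, k=3)
print(ids, sims)
```

If the exact and approximate results diverge noticeably, building the Annoy index with more trees is one option to try.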


## 3- Plot the archive of articles by similarity
What if we want to browse the archive instead of only searching it? Let's plot all the questions on a 2D chart so you can visualize the posts in the archive and their similarities.

In [16]:

reducer = umap.UMAP(n_neighbors=100) 
umap_embeds = reducer.fit_transform(embeds)


In [18]:

df['x'] = umap_embeds[:,0]
df['y'] = umap_embeds[:,1]

# Plot
chart = alt.Chart(df).mark_circle(size=60).encode(
    x=alt.X('x',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    y=alt.Y('y',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    tooltip=['title']
).configure(background="#FDF7F0"
).properties(
    width=700,
    height=400,
    title='Ask HN: top 3,000 posts'
)

chart.interactive()

## 4- Cluster the posts to identify the major common themes
Let's proceed to cluster the embeddings using KMeans from scikit-learn.

In [19]:
# Pick the number of clusters
n_clusters=8

# Cluster the embeddings
kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds)
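
Before interpreting the clusters, it can help to check how many posts landed in each one, since a very small or very large cluster may suggest trying a different `n_clusters`. Below is a small sketch using NumPy. The helper `cluster_sizes` and the stand-in labels are illustrative; in the notebook you would pass `classes` and `n_clusters`.

```python
import numpy as np

def cluster_sizes(labels, n_clusters):
    """Count how many items were assigned to each cluster label 0..n_clusters-1."""
    return np.bincount(np.asarray(labels), minlength=n_clusters)

# Stand-in labels for seven items spread over eight possible clusters
demo_labels = np.array([0, 2, 2, 1, 0, 2, 7])
sizes = cluster_sizes(demo_labels, n_clusters=8)
print(sizes)  # [2 1 3 0 0 0 0 1]
```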


## 5- Extract major keywords from each cluster so we can identify what the cluster is about

In [None]:

# Extract the keywords for each cluster
documents =  df['title']
documents = pd.DataFrame({"Document": documents,
                          "ID": range(len(documents)),
                          "Topic": None})
documents['Topic'] = classes
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
count_vectorizer = CountVectorizer(stop_words="english").fit(documents_per_topic.Document)
count = count_vectorizer.transform(documents_per_topic.Document)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = count_vectorizer.get_feature_names_out()
ctfidf = ClassTFIDF().fit_transform(count).toarray()
words_per_class = {label: [words[index] for index in ctfidf[label].argsort()[-10:]] for label in documents_per_topic.Topic}
df['cluster'] = classes
df['keywords'] = df['cluster'].map(lambda topic_num: ", ".join(np.array(words_per_class[topic_num])[:]))
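
To give some intuition for what `ClassTFIDF` computes, here is a simplified NumPy sketch of class-based TF-IDF as we understand it from BERTopic's documentation: each cluster's word counts are normalized into frequencies, then multiplied by `log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is a word's total count across clusters. This is an approximation for illustration, not BERTopic's actual implementation, and the function name and toy counts are made up.

```python
import numpy as np

def class_tfidf(counts):
    """Simplified c-TF-IDF for a (n_clusters, n_words) count matrix.

    Assumes every word occurs at least once across all clusters.
    """
    counts = np.asarray(counts, dtype=float)
    # Term frequency: share of each word within its own cluster
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Total count of each word across all clusters
    word_totals = counts.sum(axis=0)
    # Average number of words per cluster
    avg_words = counts.sum(axis=1).mean()
    # Words spread across many clusters receive a smaller weight
    idf = np.log(1.0 + avg_words / word_totals)
    return tf * idf

demo_words = ["book", "startup", "the"]
demo_counts = np.array([[8, 0, 4],   # cluster 0
                        [0, 6, 4]])  # cluster 1
scores = class_tfidf(demo_counts)

# Print the top-scoring word for each cluster
for cluster_id, row in enumerate(scores):
    print(cluster_id, demo_words[int(row.argmax())])
```

In the toy example, "the" appears in both clusters, so it scores below the word that is distinctive for each cluster.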

## Plot with clusters and keywords information
We can now plot the documents with their clusters and keywords

In [20]:
selection = alt.selection_multi(fields=['keywords'], bind='legend')

chart = alt.Chart(df).transform_calculate(
    url='https://news.ycombinator.com/item?id=' + alt.datum.id
).mark_circle(size=60, stroke='#666', strokeWidth=1, opacity=0.3).encode(
    x=alt.X('x',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    y=alt.Y('y',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    href='url:N',
    color=alt.Color('keywords:N', 
                    legend=alt.Legend(columns=1, symbolLimit=0, labelFontSize=14)
                   ),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
    tooltip=['title', 'keywords', 'cluster', 'score', 'descendants']
).properties(
    width=800,
    height=500
).add_selection(
    selection
).configure_legend(labelLimit= 0).configure_view(
    strokeWidth=0
).configure(background="#FDF7F0").properties(
    title='Ask HN: Top 3,000 Posts'
)
chart.interactive()

## 6- (Experimental) Naming clusters with a generative language model
While the extracted keywords add a lot of information that helps us identify the clusters at a glance, we should be able to have a generative model look at these keywords and suggest a name. [I'm documenting my prompt engineering experiments in this forum thread](https://community.cohere.ai/t/naming-text-clusters-of-short-texts/226) and invite you to pitch in with ideas. So far I have reasonable results from a prompt that looks like this:

```
The common theme of the following words: books, book, read, the, you, are, what, best, in, your
is that they all relate to favorite books to read.
---
The common theme of the following words: startup, company, yc, failed
is that they all relate to startup companies and their failures.
---
The common theme of the following words: freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april
is that they all relate to hiring for a freelancer to join the team of a startup.
---
The common theme of the following words: <insert keywords here>
is that they all relate to
```
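
The template above can be filled in programmatically for each cluster. The sketch below only builds the prompt string and does not call any model. The names `FEW_SHOT_EXAMPLES` and `build_naming_prompt` are hypothetical helpers for illustration; the resulting string would then be sent to whichever generative endpoint you use.

```python
# Few-shot examples copied from the prompt template above: (keywords, theme)
FEW_SHOT_EXAMPLES = [
    ("books, book, read, the, you, are, what, best, in, your",
     "favorite books to read."),
    ("startup, company, yc, failed",
     "startup companies and their failures."),
    ("freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april",
     "hiring for a freelancer to join the team of a startup."),
]

def build_naming_prompt(keywords):
    """Build a few-shot prompt asking a model to name a cluster from its keywords."""
    blocks = [
        f"The common theme of the following words: {words}\n"
        f"is that they all relate to {theme}"
        for words, theme in FEW_SHOT_EXAMPLES
    ]
    # The final block is left open so the model completes the theme
    blocks.append(
        f"The common theme of the following words: {', '.join(keywords)}\n"
        "is that they all relate to"
    )
    return "\n---\n".join(blocks)

prompt = build_naming_prompt(["python", "learn", "programming", "language"])
print(prompt)
```

In the notebook, the keyword list for a cluster could come from `words_per_class[cluster_id]`.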

There's a lot of room for improvement, though. I'm really excited by this use case because it adds so much information. Imagine if, in the following tree of topics, you assigned each cluster an intelligible name. Then imagine if you assigned each *branching* a name as well.

![](https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/images/kmeans-centroid-dendrogram.png)

We can’t wait to see what you start building! Share your projects or find support on our [Discord server](https://discord.com/invite/co-mmunity).
