<a href="https://colab.research.google.com/github/ebamberg/research-projects-ml/blob/main/examples/visualization/embedding_clusters_with_umap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install umap-learn plotly pandas numpy sentence-transformers hdbscan tqdm langchain-community langchain-core --quiet

!curl -L -o ./news-headlines-2024.zip https://www.kaggle.com/api/v1/datasets/download/dylanjcastillo/news-headlines-2024
!unzip -o ./news-headlines-2024.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  225k  100  225k    0     0   402k      0 --:--:-- --:--:-- --:--:--  402k
Archive:  ./news-headlines-2024.zip
  inflating: news_data_dedup.csv     


In [2]:
import pandas as pd

df = pd.read_csv("./news_data_dedup.csv")
docs = [
    f"{title}\n{description}"
    for title, description in zip(df.title, df.description)
]
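If the CSV contains rows with an empty `description` (not verified for this dataset), pandas loads them as `NaN`, and the f-string above would embed the literal text `nan` in the document. The sketch below shows one way to guard against that. The helper names `is_missing` and `build_doc` are hypothetical, and the sample rows are made up for illustration.

```python
import math


def is_missing(value):
    """Return True for None and for float NaN (how pandas represents empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_doc(title, description):
    """Join title and description with a newline, skipping missing or blank parts."""
    parts = [str(p).strip() for p in (title, description) if not is_missing(p)]
    return "\n".join(p for p in parts if p)


# Made-up rows: the second one has a missing description.
sample_rows = [
    ("Example headline", "Example description."),
    ("Headline without description", float("nan")),
]
sample_docs = [build_doc(title, description) for title, description in sample_rows]
print(sample_docs)
```

In the notebook this could replace the f-string, for example `docs = [build_doc(t, d) for t, d in zip(df.title, df.description)]`.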



In [3]:
from sentence_transformers import SentenceTransformer
from umap import UMAP
import pandas as pd
import numpy as np
import plotly.express as px
import hdbscan

# all-MiniLM-L6-v2 is a public model, so no Hugging Face token is needed to download it
model_id = "sentence-transformers/all-MiniLM-L6-v2"

model = SentenceTransformer(model_id)

embeddings = model.encode(docs)

embeddings

array([[ 0.01300003, -0.02001945,  0.0378731 , ..., -0.0176036 ,
         0.08934786, -0.06768816],
       [-0.00416246,  0.03095501, -0.04282118, ..., -0.02408286,
        -0.04449847, -0.03713957],
       [-0.02172823, -0.00271915,  0.10941693, ...,  0.00217858,
         0.010975  , -0.09007305],
       ...,
       [ 0.08573322,  0.04818088,  0.0819603 , ...,  0.04652384,
        -0.06448119,  0.02147188],
       [ 0.08200873, -0.00633858,  0.06603416, ...,  0.00668821,
        -0.03189663, -0.00043885],
       [ 0.06723553,  0.08093947, -0.01455507, ..., -0.01280487,
        -0.00361775,  0.06653812]], dtype=float32)
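HDBSCAN is called in the next cell without a `metric` argument, so it uses its default, Euclidean distance. For unit-length vectors, squared Euclidean distance and cosine similarity are linked by `||a - b||^2 = 2 - 2 * cos(a, b)`, so both rank neighbours the same way. The sketch below checks that identity on two made-up 3-dimensional vectors using only the standard library. Whether the vectors returned by `model.encode` are already unit-length is an assumption worth verifying, for example with `np.linalg.norm(embeddings, axis=1)`.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up vectors standing in for two embedding rows.
a = normalize([0.2, -0.1, 0.7])
b = normalize([0.1, 0.3, 0.5])

cos_ab = cosine_similarity(a, b)
dist_ab = euclidean_distance(a, b)

# For unit vectors: squared distance equals 2 - 2 * cosine similarity.
identity_holds = math.isclose(dist_ab ** 2, 2 - 2 * cos_ab, abs_tol=1e-12)
print(identity_holds)
```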

In [4]:
# Cluster the embeddings; HDBSCAN labels points it cannot assign to any cluster as -1
hdb = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=10).fit(embeddings)
df = df.assign(cluster=hdb.labels_.astype(str))
df



Unnamed: 0,guid,title,description,venue,url,published_at,updated_at,cluster
0,e3dc5caa18f9a16d7edcc09f8d5c2bb4,Harvey Weinstein's 2020 rape conviction overtu...,Victims group describes the New York appeal co...,BBC,https://www.bbc.co.uk/news/world-us-canada-688...,2024-04-25 18:24:04+00,2024-04-26 20:03:00.628113+00,8
1,297b7152cd95e80dd200a8e1997e10d9,Police and activists clash on Atlanta campus a...,"Meanwhile, hundreds of students march in Washi...",BBC,https://www.bbc.co.uk/news/live/world-us-canad...,2024-04-25 13:40:25+00,2024-04-26 20:03:00.654819+00,11
2,170bd18d1635c44b9339bdbaf1e62123,Haiti PM resigns as transitional council sworn in,The council will try to restore order and form...,BBC,https://www.bbc.co.uk/news/world-latin-america...,2024-04-25 18:11:02+00,2024-04-26 20:03:00.663393+00,-1
3,e696224ac208878a5cec8bdc9f97c632,Europe risks dying and faces big decisions - M...,The French president delivers a stark warning ...,BBC,https://www.bbc.co.uk/news/world-europe-68898887,2024-04-25 14:51:15+00,2024-04-26 20:03:00.672106+00,-1
4,733f744b006fb13033d264efcaf8edad,Prosecutors ask for halt to case against Spain...,Pedro Sánchez is deciding whether to resign af...,BBC,https://www.bbc.co.uk/news/world-europe-68895727,2024-04-25 13:07:42+00,2024-04-26 20:03:00.682205+00,5
...,...,...,...,...,...,...,...,...
865,b26b379233ad059cd7b2098fa7bad86b,"A celebration ‘of the immigrant, the foreigner...",The grand gathering of contemporary art celebr...,El Pais,https://english.elpais.com/culture/2024-04-19/...,2024-04-19 21:48:57+00,2024-04-26 13:02:53.612335+00,-1
866,ad3948cb67434a9d6b7acf6beaab44fe,"Part protest, part rave: the Indigenous artist...",From Gold Lion winner Archie Moore to Brazilia...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-25 10:55:43+00,2024-04-26 13:02:54.537243+00,-1
867,efe3f14a916d7d187c42662dff1d8167,The photography studio with four wheels and a ...,‘The sun trickled in just as we were coming do...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-24 12:33:27+00,2024-04-26 13:02:54.756821+00,-1
868,89be6f3e8a2f8d75ce401e3c6c72648e,Lost New York: remembering the city’s forgotte...,A new exhibition at the New-York Historical So...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-25 14:30:06+00,2024-04-26 13:02:54.827138+00,-1
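Many rows in the table above carry the label `-1`, which is how HDBSCAN marks noise points that belong to no cluster. Before visualizing, it can help to see how large each cluster is and what share of the data is noise. The sketch below does this with the standard library on made-up labels; `summarize_labels` is a hypothetical helper name.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (cluster sizes keyed by label, fraction of points labelled as noise)."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(-1, 0)  # -1 is HDBSCAN's noise label
    total = noise + sum(counts.values())
    noise_fraction = noise / total if total else 0.0
    return dict(sorted(counts.items())), noise_fraction


# Made-up labels for illustration only.
toy_labels = [8, 11, -1, -1, 5, 8, -1, 11]
sizes, noise_fraction = summarize_labels(toy_labels)
print(sizes)           # {5: 1, 8: 2, 11: 2}
print(noise_fraction)  # 0.375
```

In the notebook, the call would be `summarize_labels(hdb.labels_)`. A high noise fraction is one signal that `min_samples` or `min_cluster_size` may need tuning.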


# Visualize the cluster result using UMAP

In [5]:
umap = UMAP(n_components=2,  n_neighbors=80, min_dist=0.2)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=['x', 'y'])
    .assign(cluster=lambda df: hdb.labels_.astype(str))
    .query('cluster != "-1"')
    .sort_values(by='cluster')
)

fig = px.scatter(df_umap, x='x', y='y', color='cluster')
fig.show()



# Generate Cluster Titles using LLM


In [6]:
#modelid="llama3.1:8b-instruct-q4_0"
#modelid="llama3.1:8b-text-q4_0"
modelid="llama3.1"

get_ipython().system_raw("curl -fsSL https://ollama.com/install.sh | sh")
get_ipython().system_raw("ollama serve &")
get_ipython().system_raw(f"ollama pull {modelid}")

curl -fsSL https://ollama.com/install.sh | sh

Then we can start the server and pull the llama3.1 model:

ollama serve & ollama pull llama3.1:8b-instruct-fp16
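Because `ollama serve` is started in the background, the pull (or the first request) may run before the server accepts connections. The sketch below polls the server until it answers. It assumes Ollama's default address `http://localhost:11434`; the function name `wait_for_server` and its parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:11434", timeout_s=30.0, interval_s=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet; retry after a short pause.
            pass
        time.sleep(interval_s)
    return False


# Example use in the notebook, between starting the server and pulling the model:
# if not wait_for_server():
#     raise RuntimeError("Ollama server did not start in time")
```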

In [14]:
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from tqdm import tqdm

# Initialize the Ollama client; setting temperature=0.0 here makes the titles as repeatable as possible
llm = Ollama(model=modelid, temperature=0.0)
# Quick smoke test (uncomment to check that the server responds):
# response = llm.invoke("What is the capital of Florida?")
# print(response)


prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You're an expert journalist. You're helping me write short but compelling topic titles for groups of news articles."),
    ("user", '''
Write a 4 to 5 word title that summarizes the following articles. Find a common theme across all the articles. Surround the title output with <title></title> tags.

ARTICLES:{articles}

TOPIC TITLE:
''')]
)

topicTitleChain = prompt_template | llm

df["cluster_name"] = "Uncategorized"

def generate_topic_titles():
    result = {}

    for c in tqdm(df.cluster.unique()):
        sample_articles = df.query(f"cluster == '{c}'").to_dict(orient="records")

        articles_str = "\n\n".join(
            [
                f"[{i}] {article['title']}\n{article['description'][:200]}{'...' if len(article['description']) > 200 else ''}"
                for i, article in enumerate(
                    sample_articles, start=1
                )
            ]
        )
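        # Extra keyword arguments to a chain's invoke() are not forwarded to the LLM step,
        # so sampling options such as temperature belong on the Ollama instance itself.
        topic_title = topicTitleChain.invoke({"articles": articles_str})
        if "<title>" in topic_title:
            topic_title = topic_title.split("<title>")[1].split("</title>")[0].strip()
        else:
            topic_title = "unknown"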
          # print (f"\n\n{articles_str}\n\n")
        print (topic_title)
        df.loc[df.cluster == c, "cluster_name"] = topic_title
        result[c]=topic_title
    return result

topic_titles=generate_topic_titles()
topic_titles

#c = 6
#with pd.option_context("display.max_colwidth", None):
#    print(df.query(f"cluster == '{c}'").cluster_name.values[0])
#    display(df.query(f"cluster == '{c}'").drop(columns=["cluster_name"]).head())

  7%|▋         | 1/15 [00:02<00:36,  2.59s/it]

Weinstein Rape Conviction Overturned


 13%|█▎        | 2/15 [00:16<02:03,  9.48s/it]



[1] Police and activists clash on Atlanta campus amid Gaza protests
Meanwhile, hundreds of students march in Washington DC, and congresswoman Ilhan Omar joins protesters at a New York campus.

[2] Gaza protesters clash with police at US universities
Demonstrators in Texas were met with state troopers as protests lead to arrests on campuses across the country.

[3] Arrests and anger at US university pro-Palestinian demos
Dozens of student protesters have been arrested at Columbia, Yale and New York University in recent days.

[4] Police arrest Gaza protesters at Emory University
Video shows chaos erupting at one Atlanta college as police clear an encampment of pro-Palestinian demonstrators.

[5] More arrests as US campus protests continue to spread
USC also cancelled a graduation ceremony, citing "safety measures", as protests against the Gaza war continue.

[6] Where are the Gaza protests at US universities?
Protests have erupted across more than two dozen campuses, including New Yor

 20%|██        | 3/15 [00:23<01:39,  8.32s/it]



[1] Haiti PM resigns as transitional council sworn in
The council will try to restore order and form a new government in the nation gripped by gang violence.

[2] Europe risks dying and faces big decisions - Macron

[3] WATCH: Would you pay a tourist fee to enter Venice?
From Thursday visitors making a trip to the famous city at peak times will be charged a trial entrance fee.

[4] Paris's Moulin Rouge loses windmill sails overnight
The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.

[5] Hundreds of villagers killed in Burkina Faso army revenge attack - report
Soldiers accused residents of aiding Islamist fighters before shooting at them, an investigation shows.

[6] Arizona indicts Trump allies over fake elector scheme
Former Trump advisers Rudy Giuliani and Mark Meadows are among those charged, court documents show.

[7] Inside Columbia's campus as students protest
Hundreds of students have been camping on the university lawn, sparking

 27%|██▋       | 4/15 [00:26<01:06,  6.05s/it]

Spanish PM Under Fire Amid Corruption Allegations


 33%|███▎      | 5/15 [00:38<01:21,  8.14s/it]



[1] Supreme Court divided on whether Trump has immunity
The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy.

[2] Playboy model's interview 'aggravated' Trump, court hears
A tabloid publisher testifies the then-president didn't understand why he let Karen McDougal speak to media.

[3] Gag orders and tabloid schemes at Trump trial
Day six of Donald Trump's hush-money trial in New York heard damaging testimony from a tabloid publisher.

[4] Tough questions as Supreme Court debates Trump's immunity claim
Justices push lawyers for both sides on whether a president can be criminally prosecuted for acts while in office.

[5] 'The boss' featured at Trump hush-money trial
Former National Enquirer Publisher David Pecker laid out his dealings with Michael Cohen for ''the boss''

[6] Trump case: Supreme Court reluctant to grant full immunity to former president
The Republican’s lawyer argues that a president can order a coup or the murder of a polit

 40%|████      | 6/15 [00:41<00:58,  6.50s/it]

Devastating Weather Disasters Unfold Globally


 47%|████▋     | 7/15 [00:57<01:16,  9.58s/it]



[1] Ukraine uses longer-range US missiles for first time
Missiles secretly delivered this month have been used to strike Russian targets in Crimea, US media say.

[2] Watch US House pass critical Ukraine aid package
The long-awaited vote comes as Kyiv faces a weapons shortage in its fight against Russia.

[3] Pentagon to 'rush' Patriot missiles to Ukraine
Ukrainian President Zelensky says munitions in $6bn package "can save lives right now".

[4] Ukraine and Israel aid back on track as House pushes toward weekend votes
President Joe Biden sent a swift endorsement of the Speaker Mike Johnson’s plan and, in a rare moment, Donald Trump, who opposes most overseas aid for Ukraine, has not derailed the speaker’s work

[5] 🔴 Live: US military begins construction of Gaza aid pier, says Pentagon
The United States military has begun construction of a pier to boost deliveries of desperately needed aid to Gaza, the Pentagon said on Thursday. UN World Food Programme deputy executive director Carl

 53%|█████▎    | 8/15 [01:01<00:53,  7.70s/it]

AI Spending Sparks Industry Turmoil


 60%|██████    | 9/15 [01:06<00:41,  6.90s/it]

US-China Tensions Escalate Over Ukraine


 67%|██████▋   | 10/15 [01:07<00:25,  5.13s/it]

TikTok Under Threat of US Ban


 73%|███████▎  | 11/15 [01:18<00:27,  6.98s/it]

Energy Transition Uncertainty Looms


 80%|████████  | 12/15 [01:21<00:16,  5.62s/it]

India Votes in Turbulent Election Phase


 87%|████████▋ | 13/15 [01:26<00:11,  5.59s/it]



[1] Brighton v Manchester City: Premier League – live
Premier League live updates from the 8pm BST kick-offLive scoreboard | Read Football Daily | And email Scott5 min: Joao Pedro steals the ball in midfield and races into the City half. He’s got options...

[2] Xavi’s Barça U-turn shows breaking up is hard to do despite tough times
Announcement was no shock but reflects environment shift at club and an appetite to find stability after a dismal weekBeing the Barcelona manager is “cruel and unpleasant” Xavi Hernández said, but in ...

[3] Have Everton dashed Liverpool’s title dreams? – Football Weekly Extra
Max Rushden is joined by Barry Glendenning, Jonathan Liew and Robyn Cowen as Liverpool lose the Merseyside derby … and maybe moreRate, review, share on Apple Podcasts, Soundcloud, Audioboom, Mixcloud,...

[4] Arsenal thrash Chelsea and a Football League update – Football Weekly
Max Rushden is joined by Barry Glendenning, Ben Fisher, Sanny Rudravajhala and George Elek as Arsenal bea

 93%|█████████▎| 14/15 [01:28<00:04,  4.31s/it]

Middle East Tensions Reach Boiling Point


100%|██████████| 15/15 [01:31<00:00,  6.11s/it]

Markets Worry About Slowing Growth





{'8': 'Weinstein Rape Conviction Overturned',
 '11': 'unknown',
 '-1': 'unknown',
 '5': 'Spanish PM Under Fire Amid Corruption Allegations',
 '9': 'unknown',
 '0': 'Devastating Weather Disasters Unfold Globally',
 '10': 'unknown',
 '3': 'AI Spending Sparks Industry Turmoil',
 '13': 'US-China Tensions Escalate Over Ukraine',
 '7': 'TikTok Under Threat of US Ban',
 '2': 'Energy Transition Uncertainty Looms',
 '4': 'India Votes in Turbulent Election Phase',
 '6': 'unknown',
 '12': 'Middle East Tensions Reach Boiling Point',
 '1': 'Markets Worry About Slowing Growth'}
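Six of the fifteen entries above came back as `'unknown'`, meaning the reply contained no literal `<title>` tag. The output does not show why. One possible cause is that the model answered with differently cased tags or left the closing tag off. The sketch below is a slightly more tolerant parser built on the standard `re` module; `extract_title` is a hypothetical helper, and it will not help if the model returned no tags at all.

```python
import re

# Match <title>...</title> case-insensitively; accept a missing closing tag.
TITLE_PATTERN = re.compile(r"<title>(.*?)(?:</title>|$)", re.IGNORECASE | re.DOTALL)


def extract_title(response, default="unknown"):
    """Return the text inside the first <title> tag, or `default` if none is found."""
    match = TITLE_PATTERN.search(response)
    if not match:
        return default
    # Collapse internal whitespace and newlines into single spaces.
    title = " ".join(match.group(1).split())
    return title or default


print(extract_title("<title>TikTok Under Threat of US Ban</title>"))
print(extract_title("Here you go: <TITLE> Campus Protests Spread </TITLE>"))
print(extract_title("No tags in this reply"))
```

Inside `generate_topic_titles`, the `if`/`else` block could then be replaced by `topic_title = extract_title(topic_title)`.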

In [8]:
umap = UMAP(n_components=2, n_neighbors=80, min_dist=0.2)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=['x', 'y'])
    # Keep the raw HDBSCAN label so noise points ("-1") can still be filtered out
    .assign(label=hdb.labels_.astype(str))
    .query('label != "-1"')
    # Map each label to its generated title; indexing the dict with the whole
    # array raised "TypeError: unhashable type: 'numpy.ndarray'"
    .assign(cluster=lambda d: d.label.map(topic_titles))
    .sort_values(by='cluster')
)

fig = px.scatter(df_umap, x='x', y='y', color='cluster')
fig.show()


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.