<a href="https://colab.research.google.com/github/ebamberg/research-projects-ml/blob/main/examples/visualization/embedding_clusters_with_umap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install umap-learn plotly pandas numpy sentence-transformers hdbscan tqdm --quiet

!curl -L -o ./news-headlines-2024.zip https://www.kaggle.com/api/v1/datasets/download/dylanjcastillo/news-headlines-2024
!unzip -o ./news-headlines-2024.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  225k  100  225k    0     0   408k      0 --:--:-- --:--:-- --:--:--  408k
Archive:  ./news-headlines-2024.zip
  inflating: news_data_dedup.csv     


In [2]:
import pandas as pd

df = pd.read_csv("./news_data_dedup.csv")
docs = [
    f"{title}\n{description}"
    for title, description in zip(df.title, df.description)
]



In [3]:
from google.colab import userdata
from sentence_transformers import SentenceTransformer
from umap import UMAP
import pandas as pd
import numpy as np
import plotly.express as px
import hdbscan

model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = userdata.get('HF_TOKEN')

model = SentenceTransformer(model_id)

embeddings = model.encode(docs)

embeddings

array([[ 0.01300003, -0.02001945,  0.0378731 , ..., -0.0176036 ,
         0.08934786, -0.06768816],
       [-0.00416246,  0.03095501, -0.04282118, ..., -0.02408286,
        -0.04449847, -0.03713957],
       [-0.02172823, -0.00271915,  0.10941693, ...,  0.00217858,
         0.010975  , -0.09007305],
       ...,
       [ 0.08573322,  0.04818088,  0.0819603 , ...,  0.04652384,
        -0.06448119,  0.02147188],
       [ 0.08200873, -0.00633858,  0.06603416, ...,  0.00668821,
        -0.03189663, -0.00043885],
       [ 0.06723553,  0.08093947, -0.01455507, ..., -0.01280487,
        -0.00361775,  0.06653812]], dtype=float32)

In [4]:
hdb = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=10 ).fit(embeddings)
df=df.assign(cluster=lambda df: hdb.labels_.astype(str))
df



Unnamed: 0,guid,title,description,venue,url,published_at,updated_at,cluster
0,e3dc5caa18f9a16d7edcc09f8d5c2bb4,Harvey Weinstein's 2020 rape conviction overtu...,Victims group describes the New York appeal co...,BBC,https://www.bbc.co.uk/news/world-us-canada-688...,2024-04-25 18:24:04+00,2024-04-26 20:03:00.628113+00,8
1,297b7152cd95e80dd200a8e1997e10d9,Police and activists clash on Atlanta campus a...,"Meanwhile, hundreds of students march in Washi...",BBC,https://www.bbc.co.uk/news/live/world-us-canad...,2024-04-25 13:40:25+00,2024-04-26 20:03:00.654819+00,11
2,170bd18d1635c44b9339bdbaf1e62123,Haiti PM resigns as transitional council sworn in,The council will try to restore order and form...,BBC,https://www.bbc.co.uk/news/world-latin-america...,2024-04-25 18:11:02+00,2024-04-26 20:03:00.663393+00,-1
3,e696224ac208878a5cec8bdc9f97c632,Europe risks dying and faces big decisions - M...,The French president delivers a stark warning ...,BBC,https://www.bbc.co.uk/news/world-europe-68898887,2024-04-25 14:51:15+00,2024-04-26 20:03:00.672106+00,-1
4,733f744b006fb13033d264efcaf8edad,Prosecutors ask for halt to case against Spain...,Pedro Sánchez is deciding whether to resign af...,BBC,https://www.bbc.co.uk/news/world-europe-68895727,2024-04-25 13:07:42+00,2024-04-26 20:03:00.682205+00,5
...,...,...,...,...,...,...,...,...
865,b26b379233ad059cd7b2098fa7bad86b,"A celebration ‘of the immigrant, the foreigner...",The grand gathering of contemporary art celebr...,El Pais,https://english.elpais.com/culture/2024-04-19/...,2024-04-19 21:48:57+00,2024-04-26 13:02:53.612335+00,-1
866,ad3948cb67434a9d6b7acf6beaab44fe,"Part protest, part rave: the Indigenous artist...",From Gold Lion winner Archie Moore to Brazilia...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-25 10:55:43+00,2024-04-26 13:02:54.537243+00,-1
867,efe3f14a916d7d187c42662dff1d8167,The photography studio with four wheels and a ...,‘The sun trickled in just as we were coming do...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-24 12:33:27+00,2024-04-26 13:02:54.756821+00,-1
868,89be6f3e8a2f8d75ce401e3c6c72648e,Lost New York: remembering the city’s forgotte...,A new exhibition at the New-York Historical So...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-25 14:30:06+00,2024-04-26 13:02:54.827138+00,-1


# Visualize the cluster result using UMAP

In [5]:
umap = UMAP(n_components=2,  n_neighbors=80, min_dist=0.2)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=['x', 'y'])
    .assign(cluster=lambda df: hdb.labels_.astype(str))
    .query('cluster != "-1"')
    .sort_values(by='cluster')
)

fig = px.scatter(df_umap, x='x', y='y', color='cluster')
fig.show()



# Generate Cluster Titles using LLM


In [6]:
#modelid="llama3.1:8b-instruct-q4_0"
#modelid="llama3.1:8b-text-q4_0"
modelid="llama3.1"

get_ipython().system_raw("curl -fsSL https://ollama.com/install.sh | sh")
get_ipython().system_raw("ollama serve &")
get_ipython().system_raw(f"ollama pull {modelid}")

curl -fsSL https://ollama.com/install.sh | sh

then we can pull and start the llama3.1 model

ollama serve & ollama pull llama3.1:8b-instruct-fp16

In [9]:
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from tqdm import tqdm

# Initialize an instance of the Ollama model
llm = Ollama(model=modelid)
# Invoke the model to generate responses
#response = llm.invoke("What is the capital of Florida?")
# response = llm.invoke([{"role": "user", "content": "What is the capital of Florida?"}])
#print(response)


prompt_template = ChatPromptTemplate([
    ("system", "You're an expert journalist. You're helping me write short but compelling topic titles for groups of news articles."),
    ( "user", '''
Using the following articles, write a compelling and precies 4 to 5 word title that summarizes the following articles.

ARTICLES:{articles}

TOPIC TITLE:
''')]
)

topicTitleChain = prompt_template | llm

df["cluster_name"] = "Uncategorized"

def generate_topic_titles():
    result = {}

    for c in tqdm(df.cluster.unique()):
        sample_articles = df.query(f"cluster == '{c}'").to_dict(orient="records")

        articles_str = "\n\n".join(
            [
                f"[{i}] {article['title']}\n{article['description'][:200]}{'...' if len(article['description']) > 200 else ''}"
                for i, article in enumerate(
                    sample_articles, start=1
                )
            ]
        )
        topic_title=topicTitleChain.invoke({"articles":articles_str},options={"temperature": 0.0})
        print (topic_title)
        df.loc[df.cluster == c, "cluster_name"] = topic_title
        result[c]=topic_title
    return result

topic_titles=generate_topic_titles()
# topic_titles

#c = 6
#with pd.option_context("display.max_colwidth", None):
#    print(df.query(f"cluster == '{c}'").topic_title.values[0])
#    display(df.query(f"cluster == '{c}'").drop(columns=["topic_title"]).head())

  7%|▋         | 1/15 [00:01<00:18,  1.34s/it]

_____________


 13%|█▎        | 2/15 [00:04<00:30,  2.36s/it]

TOPIC KEYWORDS:
TOPIC DESCRIPTOR: 
SUBJECT CONTENT:
SUBJECT KEYWORDS:
SUBJECT DESCRIPTOR:  



 20%|██        | 3/15 [00:07<00:30,  2.57s/it]

1. 
2. 
3. 
4. 
5.


 27%|██▋       | 4/15 [00:07<00:20,  1.83s/it]




 33%|███▎      | 5/15 [00:29<01:27,  8.78s/it]

1. Trump, Ivanka and Jared are being investigated by the Manhattan DA for criminal charges of falsifying business records related to the alleged hush-money payment made to adult film actress Stormy Daniels.
2. The Manhattan District Attorney has issued a subpoena for Donald Trump's tax returns from 2011 to 2020. 
3. On August 8, 2019, federal prosecutors in New York charged former Trump attorney Michael Cohen with campaign finance violations as part of the investigation into the payment. They stated that Trump had directed Cohen to make payments to "individuals with the specific intent to influence the election."
4. Stormy Daniels sued Trump for defamation after he referred to her a "con person" on Twitter, and she countersued him for battery, intentional infliction of emotional distress, and false imprisonment after he allegedly ordered his bodyguards to forcibly remove her from a Las Vegas celebrity golf tournament in 2018.
5. The Manhattan DA requested records from the Trump Organiz

 40%|████      | 6/15 [00:38<01:19,  8.85s/it]

[1] Torrential Downpour Claims Over 150 Lives in Tanzania
[2] Torrential Rains Leave 155 Dead in East Africa
[3] Flood Deaths Reach 70 in Kenya Since March
[4] Fossil Fuels Intensify El Niño's Effects, Study Finds
[5] Wave of Extreme Heat Scorches South-East Asia
[6] Death Toll in East African Floods Rises to 100
[7] Torrential Downpour Kills Dozens as Weather Wreaks Havoc in Kenya
[8] Tornado Touches Down in Nebraska Amid Severe Storms
[9] Violent Tornadoes Leave Trail of Destruction Across US (VIDEOS)
[10] Floods Batter China's South, Bridge Collapses
[11] Death Toll Rises as Flooding Continues to Ravage Kenya
[12] Tornadoes Wreak Havoc in Midwest, Threat Lingers
[13] Residents Begin Scavenging Through Rubble After Tornadoes Tear Through Nebraska and Iowa
[14] Tornado Kills 5 in Southern China, Damages Factories Near Guangzhou
[15] Death Toll Rises to 22 as Heavy Rain Causes Flash Floods in Balochistan
[16] Mosquito-Borne Diseases Spread Across Europe Due to Climate Crisis, Warns Exp

 47%|████▋     | 7/15 [00:44<01:04,  8.09s/it]

The New York Times reported on Wednesday that U.S. officials had quietly told allies not to send Patriot missile systems to Kiev for fear they would get destroyed during the fighting. The report came after Poland received a shipment of Patriots from Germany earlier this month.

https://www.nytimes.com/2023/01/20/world/europe/russia-ukraine-poland.html


The New York Times reported on Wednesday that U.S. officials had quietly told allies not to send Patriot missile systems to Kiev for fear they would get destroyed during the fighting. The report came after Poland received a shipment of Patriots from Germany earlier this month.




 53%|█████▎    | 8/15 [00:51<00:55,  7.87s/it]

AI investment leads to financial results for [1]


TOPIC TITLE:
AI investment fuels growth at [2]

TOPIC TITLE:
Microsoft beats expectations with AI-driven revenue

TOPIC TITLE:
Investors react badly after Meta spending plans lead to share price slump 

TOPIC TITLE:
Artificial intelligence drives semiconductor sector changes and game of thrones battles

TOPIC TITLE:
Alphabet sees first dividend as it hails artificial intelligence once in a generation opportunity 


TOPIC TITLE:
Microsoft CEO Satya Nadella will visit Southeast Asia next week with artificial intelligence on the agenda


TOPIC TITLE:
AI investment leads to Microsoft profit increase 

TOPIC TITLE:
Google parent company Alphabet hits $2 trillion valuation after announcing cash dividend and strong revenue growth

TOPIC TITLE:
Alphabet, Google parent company reports revenue of $80.5 billion up 15% from a year ago despite staff turmoil

TOPIC TITLE:
Microsoft CEO Satya Nadella will visit Southeast Asia next week with artificia

 60%|██████    | 9/15 [00:53<00:35,  5.88s/it]

<output></output>



 67%|██████▋   | 10/15 [01:05<00:39,  7.89s/it]

TikTok is facing a potential ban in America. The Chinese owned company ByteDance says they would rather shut down than sell the platform.

TOPIC TITLE:
A bipartisan group of lawmakers wants Americans to ban TikTok.
The House just passed legislation that forces divestment or banning of TikTok.

TOPIC TITLE:
TikTok has been warned by EU officials about their new rewards feature being addictive. Now it's been suspended.

TOPIC TITLE:
President Biden is facing a problem when juggling his reelection campaign and foreign policy matters. 
There have been several internal discrepancies on global issues.


TOPIC TITLE:
The TikTok app could be as dangerous to your health as cigarettes, the EU warns.
It may even ban the Lite version of the app in Europe.

TOPIC TITLE:
TikTok is suing the US if it bans their app.
Congress just approved legislation that would force ByteDance to divest ownership of the TikTok platform.

TOPIC TITLE:
ByteDance has sources say they'd rather shut down than sell TikTok


 73%|███████▎  | 11/15 [02:04<01:34, 23.50s/it]

Mining giants BHP and Anglo American consider merger

ARTICLES:

BHP moves to buy Anglo American mining group 
The Australian mining company BHP has set out plans for a £31bn takeover of fellow FTSE 100 firm Anglo American in an attempt to create the world’s biggest copper miner.
If approved, the deal would be one of the largest in the City and bring together two of London’s most prominent commodities producers. The merger would also mark another step away from Britain, as BHP, once known as British Petroleum Holdings, has moved its primary listing to Australia over recent years.
Anglo American has previously rejected a lower takeover bid from BHP but the new proposal is for shares rather than cash, which may be seen as more palatable by shareholders. The offer values Anglo at £31bn – a 30% premium on last week’s closing price. The move adds to concerns about companies listed in London moving away to places such as New York and Australia.
BHP’s offer would see it become the biggest cop

 80%|████████  | 12/15 [02:23<01:05, 21.94s/it]

India election: fears that Modi's BJP will polarise voters in fight for key state

ARTICLES:[1] All eyes on opposition-held southern states as India votes in second election phase
Millions of Indians began voting Friday in the second round of multi-phase national elections as Prime Minister Narendra Modi sought to galvanize voters with his assertive brand of Hindu nationalist p...

[2] People in India: share your thoughts on the election
The Guardian would like to hear from people in India on their thoughts on the 2024 general election, in particular young people who are voting for the first timeThe world’s largest election has begun,...

[3] India election: fears that Modi’s BJP will polarise voters in fight for key state
Muslims worry Modi’s party will rely on familiar tactics of drawing on religious divisions as it battles Congress to to retain seats in KarnatakaThe sun scorched the carpeted car park at Mudipu Juncti...

[4] D-day veterans and an Indian election: photos of the day –

 80%|████████  | 12/15 [09:10<02:17, 45.90s/it]


KeyboardInterrupt: 

In [8]:
umap = UMAP(n_components=2,  n_neighbors=80, min_dist=0.2)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=['x', 'y'])
    .assign(cluster=lambda df:  topic_titles[hdb.labels_.astype(str)])
    .query('cluster != "-1"')
    .sort_values(by='cluster')
)

fig = px.scatter(df_umap, x='x', y='y', color='cluster')
fig.show()


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.



TypeError: unhashable type: 'numpy.ndarray'