<a href="https://colab.research.google.com/github/ebamberg/research-projects-ml/blob/main/examples/visualization/embedding_clusters_with_umap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install umap-learn plotly pandas numpy sentence-transformers hdbscan tqdm --quiet

!curl -L -o ./news-headlines-2024.zip https://www.kaggle.com/api/v1/datasets/download/dylanjcastillo/news-headlines-2024
!unzip -o ./news-headlines-2024.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  225k  100  225k    0     0   439k      0 --:--:-- --:--:-- --:--:--  439k
Archive:  ./news-headlines-2024.zip
replace news_data_dedup.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: news_data_dedup.csv     


In [2]:
import pandas as pd

df = pd.read_csv("./news_data_dedup.csv")
docs = [
    f"{title}\n{description}"
    for title, description in zip(df.title, df.description)
]



In [3]:
from google.colab import userdata
from sentence_transformers import SentenceTransformer
from umap import UMAP
import pandas as pd
import numpy as np
import plotly.express as px
import hdbscan

model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = userdata.get('HF_TOKEN')

model = SentenceTransformer(model_id)

embeddings = model.encode(docs)

embeddings

array([[ 0.01300003, -0.02001945,  0.0378731 , ..., -0.0176036 ,
         0.08934786, -0.06768816],
       [-0.00416246,  0.03095501, -0.04282118, ..., -0.02408286,
        -0.04449847, -0.03713957],
       [-0.02172823, -0.00271915,  0.10941693, ...,  0.00217858,
         0.010975  , -0.09007305],
       ...,
       [ 0.08573322,  0.04818088,  0.0819603 , ...,  0.04652384,
        -0.06448119,  0.02147188],
       [ 0.08200873, -0.00633858,  0.06603416, ...,  0.00668821,
        -0.03189663, -0.00043885],
       [ 0.06723553,  0.08093947, -0.01455507, ..., -0.01280487,
        -0.00361775,  0.06653812]], dtype=float32)

In [4]:
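hdb = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=10).fit(embeddings)
# HDBSCAN marks noise points with the label -1; labels are stored as strings
# so that plotly treats them as categories rather than a numeric scale
df = df.assign(cluster=hdb.labels_.astype(str))
df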



Unnamed: 0,guid,title,description,venue,url,published_at,updated_at,cluster
0,e3dc5caa18f9a16d7edcc09f8d5c2bb4,Harvey Weinstein's 2020 rape conviction overtu...,Victims group describes the New York appeal co...,BBC,https://www.bbc.co.uk/news/world-us-canada-688...,2024-04-25 18:24:04+00,2024-04-26 20:03:00.628113+00,8
1,297b7152cd95e80dd200a8e1997e10d9,Police and activists clash on Atlanta campus a...,"Meanwhile, hundreds of students march in Washi...",BBC,https://www.bbc.co.uk/news/live/world-us-canad...,2024-04-25 13:40:25+00,2024-04-26 20:03:00.654819+00,11
2,170bd18d1635c44b9339bdbaf1e62123,Haiti PM resigns as transitional council sworn in,The council will try to restore order and form...,BBC,https://www.bbc.co.uk/news/world-latin-america...,2024-04-25 18:11:02+00,2024-04-26 20:03:00.663393+00,-1
3,e696224ac208878a5cec8bdc9f97c632,Europe risks dying and faces big decisions - M...,The French president delivers a stark warning ...,BBC,https://www.bbc.co.uk/news/world-europe-68898887,2024-04-25 14:51:15+00,2024-04-26 20:03:00.672106+00,-1
4,733f744b006fb13033d264efcaf8edad,Prosecutors ask for halt to case against Spain...,Pedro Sánchez is deciding whether to resign af...,BBC,https://www.bbc.co.uk/news/world-europe-68895727,2024-04-25 13:07:42+00,2024-04-26 20:03:00.682205+00,5
...,...,...,...,...,...,...,...,...
865,b26b379233ad059cd7b2098fa7bad86b,"A celebration ‘of the immigrant, the foreigner...",The grand gathering of contemporary art celebr...,El Pais,https://english.elpais.com/culture/2024-04-19/...,2024-04-19 21:48:57+00,2024-04-26 13:02:53.612335+00,-1
866,ad3948cb67434a9d6b7acf6beaab44fe,"Part protest, part rave: the Indigenous artist...",From Gold Lion winner Archie Moore to Brazilia...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-25 10:55:43+00,2024-04-26 13:02:54.537243+00,-1
867,efe3f14a916d7d187c42662dff1d8167,The photography studio with four wheels and a ...,‘The sun trickled in just as we were coming do...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-24 12:33:27+00,2024-04-26 13:02:54.756821+00,-1
868,89be6f3e8a2f8d75ce401e3c6c72648e,Lost New York: remembering the city’s forgotte...,A new exhibition at the New-York Historical So...,The Guardian,https://www.theguardian.com/artanddesign/2024/...,2024-04-25 14:30:06+00,2024-04-26 13:02:54.827138+00,-1
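
Many rows in the table above carry the label `-1`, which HDBSCAN uses for points it treats as noise. Before plotting, it can help to check how much of the data was actually assigned to a cluster. The following is a minimal sketch using an illustrative label list; `summarize_labels` is a hypothetical helper that is not part of the notebook, and in practice the labels would come from `hdb.labels_.astype(str)`.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (noise_fraction, sizes); sizes maps cluster label -> count, noise excluded."""
    counts = Counter(labels)
    noise = counts.pop("-1", 0)  # "-1" is the HDBSCAN noise label (as a string here)
    noise_fraction = noise / len(labels) if labels else 0.0
    return noise_fraction, dict(counts)


# Illustrative labels only (assumption), standing in for hdb.labels_.astype(str)
labels = ["8", "11", "-1", "-1", "5", "-1", "5"]

noise_fraction, sizes = summarize_labels(labels)
print(f"noise: {noise_fraction:.0%}, clusters: {sizes}")
```

A high noise fraction may suggest revisiting `min_samples` or `min_cluster_size`.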


# Visualize the cluster result using UMAP

In [5]:
umap = UMAP(n_components=2,  n_neighbors=80, min_dist=0.2)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=['x', 'y'])
    .assign(cluster=lambda df: hdb.labels_.astype(str))
    .query('cluster != "-1"')
    .sort_values(by='cluster')
)

fig = px.scatter(df_umap, x='x', y='y', color='cluster')
fig.show()



# Generate Cluster Titles using an LLM


In [10]:
#modelid="llama3.1:8b-instruct-q4_0"
modelid="llama3.1:8b-text-q4_0"

get_ipython().system_raw("curl -fsSL https://ollama.com/install.sh | sh")
get_ipython().system_raw("ollama serve &")
get_ipython().system_raw(f"ollama pull {modelid}")


KeyboardInterrupt



curl -fsSL https://ollama.com/install.sh | sh

Then we can start the Ollama server and pull the llama3.1 model:

ollama serve & ollama pull llama3.1:8b-instruct-fp16

In [8]:
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate
from tqdm import tqdm

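# Initialize an instance of the Ollama model.
# temperature=0.0 is set on the model itself, because keyword arguments passed
# to the chain's invoke() may not reach the LLM step.
llm = Ollama(model=modelid, temperature=0.0)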
# Invoke the model to generate responses
#response = llm.invoke("What is the capital of Florida?")
# response = llm.invoke([{"role": "user", "content": "What is the capital of Florida?"}])
#print(response)
article_delimiter = "####"
prompt_template = PromptTemplate.from_template('''
Write a compelling and informative 4 to 5 word long topic title that summarizes the following articles.
Ensure the title is engaging, concise, and accurately reflects the main point of the article. \
Output only the title and nothing else, with the tag <TITLE></TITLE> around the title.
Ensure that you only output the best option.
\
ARTICLES:{article_delimiter}{articles}{article_delimiter} \
\
TOPIC TITLE:
''')

topicTitleChain = prompt_template | llm

df["cluster_name"] = "Uncategorized"

def generate_topic_titles():
    result = {}

    for c in tqdm(df.cluster.unique()):
        sample_articles = df.query(f"cluster == '{c}'").to_dict(orient="records")

        articles_str = "\n\n".join(
            [
                f"[{i}] {article['title']}\n{article['description'][:200]}{'...' if len(article['description']) > 200 else ''}"
                for i, article in enumerate(
                    sample_articles, start=1
                )
            ]
        )
        topic_title = topicTitleChain.invoke(
            {"articles": articles_str, "article_delimiter": article_delimiter},
            options={"temperature": 0.0},
        )
        print(topic_title)
        df.loc[df.cluster == c, "cluster_name"] = topic_title
        result[c] = topic_title
    return result

topic_titles=generate_topic_titles()
topic_titles

#c = 6
#with pd.option_context("display.max_colwidth", None):
#    print(df.query(f"cluster == '{c}'").cluster_name.values[0])
#    display(df.query(f"cluster == '{c}'").drop(columns=["cluster_name"]).head())


The class `Ollama` was deprecated in LangChain 0.3.1 and will be removed in 1.0.0. An updated version of the class exists in the :class:`~langchain-ollama package and should be used instead. To use it run `pip install -U :class:`~langchain-ollama` and import as `from :class:`~langchain_ollama import OllamaLLM``.

  7%|▋         | 1/15 [00:13<03:11, 13.68s/it]

Weinstein's Rape Conviction Overturned


 13%|█▎        | 2/15 [00:19<01:56,  8.98s/it]

The topic is the widespread protests on university campuses in the United States, Europe, and other parts of the world against Israel's war with Hamas, specifically:

*   Protests demanding that universities cut financial ties to Israel and divest from companies enabling its actions
*   Arrests of protesters at various US university campuses, including Emory University, Northeastern University, Boston, and Austin, Texas
*   Police clearing pro-Palestinian encampments at three US universities
*   Heckling and boos for a visiting House Speaker in Columbia, where the protests originated


 20%|██        | 3/15 [00:43<03:11, 15.94s/it]

I'll provide you with a summary of the articles under each topic title:

**1. News**

* A cybersecurity firm, Darktrace, has agreed to be acquired by US private equity firm Thoma Bravo for $5.3 billion.
* A music company, Concord, has acquired Hipgnosis for $1.51 billion.

**2. Technology and Security**

* Cybersecurity firm Darktrace accepted a $5 billion takeover by US private equity firm Thoma Bravo.
* Air conditioning and AI are demanding more power, but renewable energy can't keep up with the demand.
* A cybersecurity firm has agreed to be acquired by a US private equity firm for a significant amount.

**3. World News**

* Confessions of an 82-year-old hitchhiker: The founder of Bradt travel guides has been thumbing lifts all over the world since her teens.
* Searching for missing loved ones in Gaza's mass graves: It is a huge challenge to find out how people died whose bodies were exhumed at Nasser hospital.

**4. Arts and Culture**

* Part protest, part rave: Indigenous artists 

 27%|██▋       | 4/15 [00:44<01:49,  9.97s/it]

Spanish PM Faces Resignation Pressure


 33%|███▎      | 5/15 [00:57<01:51, 11.11s/it]

It appears that you want me to summarize a large list of news articles related to Donald Trump's hush money trial and his presidential immunity case. I'll do my best to provide a concise summary of the main points.

**Key Points:**

1. **Trump's Immunity Case**: The Supreme Court is hearing arguments on whether Trump has presidential immunity from prosecution for actions taken while in office.
2. **Hush Money Trial**: Trump is facing trial in Manhattan over allegations that he paid hush money to adult film star Stormy Daniels to keep her silent about an affair.
3. **Trump's Lawyer**: Trump's lawyer, Alan Dershowitz, has been arguing that Trump has presidential immunity from prosecution for actions taken while in office.
4. **Man Sets Himself on Fire**: A man set himself on fire outside the Manhattan courthouse where Trump is being tried, and was later identified as a self-described "investigative researcher."
5. **Supreme Court Decision**: The Supreme Court is expected to make a decisi

 40%|████      | 6/15 [00:58<01:09,  7.78s/it]

< TITLE > Extreme Weather Devastates East Africa < /TITLE >


 47%|████▋     | 7/15 [01:03<00:53,  6.71s/it]

It appears you've provided a list of news articles and headlines related to the ongoing conflict between Russia and Ukraine. The topics range from military aid packages, missile strikes, and diplomatic tensions between various countries.

If you'd like, I can try to summarize or provide an overview of some of the key points and trends in these articles. Alternatively, if you have a specific question or topic in mind, feel free to ask!


 53%|█████▎    | 8/15 [01:04<00:35,  5.03s/it]

Big Tech's Heavy Bet on AI


 60%|██████    | 9/15 [01:06<00:23,  3.93s/it]

US Warns China on Russia Ties


 67%|██████▋   | 10/15 [01:07<00:15,  3.07s/it]

<TITLE>TikTok Ban: Global Regulatory Showdown</TITLE>


 73%|███████▎  | 11/15 [01:09<00:11,  2.78s/it]

<BIDEN ADMINISTRATION'S CLEAN ENERGY PUSH</BIDEN ADMINISTRATION'S CLEAN ENERGY PUSH>


 80%|████████  | 12/15 [01:10<00:06,  2.18s/it]

"India Votes Amidst Rising Tensions"


 87%|████████▋ | 13/15 [01:14<00:05,  2.88s/it]

It looks like you've copied a large block of text from an article about football (soccer). There is no specific question or topic title within this text. If you'd like to discuss something in particular, I can try to help with a response. However, if you'd like me to summarize the articles' main points or identify any interesting information within it, I can do that as well.


 93%|█████████▎| 14/15 [01:16<00:02,  2.40s/it]

< TITLE>Israeli-Palestinian Crisis Escalates</TITLE>


100%|██████████| 15/15 [01:17<00:00,  5.16s/it]

Rate Cuts Dashed by Slow Growth





{'8': "Weinstein's Rape Conviction Overturned",
 '11': "The topic is the widespread protests on university campuses in the United States, Europe, and other parts of the world against Israel's war with Hamas, specifically:\n\n*   Protests demanding that universities cut financial ties to Israel and divest from companies enabling its actions\n*   Arrests of protesters at various US university campuses, including Emory University, Northeastern University, Boston, and Austin, Texas\n*   Police clearing pro-Palestinian encampments at three US universities\n*   Heckling and boos for a visiting House Speaker in Columbia, where the protests originated",
 '5': 'Spanish PM Faces Resignation Pressure',
 '9': 'It appears that you want me to summarize a large list of news articles related to Donald Trump\'s hush money trial and his presidential immunity case. I\'ll do my best to provide a concise summary of the main points.\n\n**Key Points:**\n\n1. **Trump\'s Immunity Case**: The Supreme Court is h
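
The raw responses above are inconsistent: some are wrapped in `<TITLE>` tags (sometimes with stray spaces), some are quoted, and some are long summaries rather than titles. One way to handle this is to normalize each response before storing it in `cluster_name`. The following is a minimal sketch, not part of the original notebook; `extract_title`, its `fallback` value and the `max_words` threshold are hypothetical choices for illustration.

```python
import re

# Tolerates whitespace inside the tag, e.g. "< TITLE >" or "< /TITLE >"
_TITLE_RE = re.compile(
    r"<\s*TITLE\s*>(.*?)<\s*/\s*TITLE\s*>", re.IGNORECASE | re.DOTALL
)


def extract_title(raw: str, fallback: str = "Uncategorized", max_words: int = 8) -> str:
    """Return a short title taken from a raw LLM response, or `fallback`."""
    match = _TITLE_RE.search(raw)
    candidate = match.group(1) if match else raw
    # Drop surrounding whitespace and quotation marks
    candidate = candidate.strip().strip("\"'").strip()
    # Multi-line or overly long answers are treated as failed generations
    if not candidate or "\n" in candidate or len(candidate.split()) > max_words:
        return fallback
    return candidate


print(extract_title("< TITLE > Extreme Weather Devastates East Africa < /TITLE >"))
print(extract_title('"India Votes Amidst Rising Tensions"'))
print(extract_title("It appears you've provided a list of news articles and headlines"))
```

In the loop, `extract_title(topic_title)` could be applied before writing to `df` and `result`, so that failed generations fall back to "Uncategorized" instead of filling the plot legend with long summaries.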

In [9]:
umap = UMAP(n_components=2,  n_neighbors=80, min_dist=0.2)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=['x', 'y'])
    # keep the raw HDBSCAN label so noise points ("-1") can still be filtered out
    .assign(label=hdb.labels_.astype(str))
    .query('label != "-1"')
    # map each label to its generated title; a dict cannot be indexed with a whole array
    .assign(cluster=lambda df: df['label'].map(topic_titles))
    .sort_values(by='cluster')
)

fig = px.scatter(df_umap, x='x', y='y', color='cluster')
fig.show()


'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.