## Clustering attempt

In [1]:
import re
import json
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN
from transformers import pipeline


In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jared\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jared\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\jared\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jared\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
pd.set_option('display.max_colwidth', None)

Process:

1) build a list of each article's title + description
2) clean the data (lowercase, remove HTML, remove special chars)
3) preprocess (tokenize, remove stopwords, lemmatize)
4) cluster with kmeans (q: does it need a fixed number of clusters?); the cells below try DBSCAN instead
5) apply PCA
6) select only the largest clusters?
7) web scrape the articles belonging to each selected cluster
8) summarize with OpenAI

In [4]:
# Step 1: read in data
def read_data(file_path):
    with open(file_path) as f:
        data = json.load(f)
    return data

search_data = read_data('../newsapi/outputs/test_search_trump.json')

In [5]:
# Step 1 cont: convert to DF
articles_df = pd.json_normalize(search_data)


In [6]:
articles_df.head(5)

Unnamed: 0,author,title,description,url,urlToImage,publishedAt,content,user_pref,source.id,source.name
0,"Lauren Goode, Paresh Dave, Will Knight",What Donald Trump's Win Will Mean for Big Tech,Donald Trump's approach to Big Tech has oscillated between calls for stricter regulations for some players and a hands-off approach for others. Here's how he might steer tech policy in a second term.,https://www.wired.com/story/trump-tech-policy/,"https://media.wired.com/photos/672bcd1062c380013856bb0f/191:100/w_1280,c_limit/Business_bigtech_GettyImages-697900116.jpg",2024-11-07T20:00:17Z,"The most raucous cheers of the night were prompted by Trumps promise to fire Gary Gensler, chairman of the Securities and Exchange Commission, a regulatory agency that has brought a volley of lawsuit… [+2787 chars]",False,wired,Wired
1,Caroline Haskins,ICE Started Ramping Up Its Surveillance Arsenal Immediately After Donald Trump Won,US Immigration and Customs Enforcement put out a fresh call for contracts for surveillance technologies before an anticipated surge in the number of people it monitors ahead of deportation hearings.,https://www.wired.com/story/ice-surveillance-contracts-isap/,"https://media.wired.com/photos/6733a1802d34679fa5571df1/191:100/w_1280,c_limit/Security_ICE_GettyImages.jpg",2024-11-13T12:00:00Z,"Compared to its recent November notice, ICE was more detailed about its 2025 plans in a different notice released last year. This earlier notice shows that ICE was preparing for a more intense immigr… [+2023 chars]",False,wired,Wired
2,Kelly McEvers,The Great American Microchip Mobilization,"Under Donald Trump and Joe Biden alike, the US has been determined to “reshore” chipmaking. Now money and colossal infrastructure are flowing to a vast Intel site in Ohio—just as the company may be falling apart.",https://www.wired.com/story/intel-great-american-microchip-mobilization/,"https://media.wired.com/photos/67329e6ef0b163d7e81ffb79/191:100/w_1280,c_limit/Intel_Ohio_DSC09351.jpg",2024-11-14T11:00:00Z,"It took nearly two years of planning to get here. This is one of about two dozen superloadspieces of highway cargo that weigh more than 120,000 poundsthat will be lumbering across Ohio for Intel. Tod… [+2622 chars]",False,wired,Wired
3,Eric Geller,"More Spyware, Fewer Rules: What Trump’s Return Means for US Cybersecurity","Experts expect Donald Trump’s next administration to relax cybersecurity rules on businesses, abandon concerns around human rights, and take an aggressive stance against the cyber armies of US adversaries.",https://www.wired.com/story/trump-administration-cybersecurity-policy-reversals/,"https://media.wired.com/photos/6735e10f310e865b05b94244/191:100/w_1280,c_limit/Security_Animation_GettyImages.jpg",2024-11-14T10:30:00Z,"Trump is also unlikely to continue the Biden administrations campaign to limit the proliferation of commercial spyware technologies, which authoritarian governments have used to harass journalists, c… [+2985 chars]",False,wired,Wired
4,Leah Feiger,We Break Down the Internet’s Future Under Trump 2.0,"WIRED global editorial director Katie Drummond joins this week to discuss what the fragmented internet did for the Trump campaign, and what the incoming Trump administration means for the internet.",https://www.wired.com/story/the-internets-future-under-donald-trump/,"https://media.wired.com/photos/67328cfe8b3c4eb2afd88938/191:100/w_1280,c_limit/Politics-Lab-Umbrella-of-Election-Politics-2182522891.jpg",2024-11-14T18:58:19Z,"Leah Feiger: Enough trillions that don't actually currently exist in that budget.\r\nKatie Drummond: Exactly. So thinking about that very messy, very human business and Elon Musk and Donald Trump in a … [+5020 chars]",False,wired,Wired


In [7]:
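# Step 2: clean, tokenize, and lemmatize the text
# Build these once instead of on every call (and every token) inside clean_text
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # strip URLs (https links are already matched by the http\S+ branch)
    text = re.sub(r'http\S+|www\S+', '', text)
    # replace runs of non-word characters (punctuation, symbols) with one space
    text = re.sub(r'\W+', ' ', text)
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    return ' '.join(tokens)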

articles_df['combined_text'] = articles_df['title'].fillna('') + ' ' + articles_df['description'].fillna('')
articles_df['processed_text'] = articles_df['combined_text'].apply(clean_text)
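
To see what the two regex steps in clean_text do on their own, here is a minimal sketch that uses only the standard library (no NLTK). The helper name `strip_urls_and_punctuation` and the sample string are made up for illustration; the regexes are the ones from the cell above.

```python
import re

def strip_urls_and_punctuation(text):
    """Lowercase, drop URLs, and collapse non-word characters to single spaces."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'\W+', ' ', text)
    return text.strip()

# Hypothetical input that mixes a possessive, a URL, and punctuation
sample = "What Trump's Win Means: https://www.wired.com/story/trump-tech-policy/ (updated)"
print(strip_urls_and_punctuation(sample))
```

Note that `\W+` also splits possessives ("trump's" becomes "trump s"), so stray single-letter tokens are passed on to the tokenizer and stopword filter.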

In [8]:
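# Step 3: embed and cluster using DBSCAN (can also try kmeans)
model = SentenceTransformer('all-MiniLM-L6-v2')
# encode() is documented to take a str or a list of str, so convert the Series explicitly
embeddings = model.encode(articles_df['processed_text'].fillna('').tolist())

# eps is the maximum cosine distance between neighbours; DBSCAN labels noise points -1
dbscan = DBSCAN(eps=0.5, min_samples=3, metric='cosine')
clusters = dbscan.fit_predict(embeddings)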

articles_df['cluster'] = clusters
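
With `metric='cosine'`, `eps=0.5` means two articles count as neighbours when their cosine distance (1 minus cosine similarity) is at most 0.5. A small standard-library sketch of that threshold follows; the toy 2-D vectors are made up and stand in for the real sentence embeddings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity, the quantity compared against eps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

EPS = 0.5  # same threshold as the DBSCAN call

# Toy vectors (assumptions for illustration, not real embeddings)
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
diagonal = cosine_distance([1.0, 0.0], [1.0, 1.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

for name, d in [("same direction", same_direction),
                ("45 degrees apart", diagonal),
                ("orthogonal", orthogonal)]:
    print(f"{name}: distance={d:.3f}, neighbours={d <= EPS}")
```

Vector length does not matter here, only direction, which is why the first pair has distance 0 even though the magnitudes differ.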

In [16]:
articles_df['cluster'].value_counts()

cluster
-1    87
 0    13
 2     9
 5     6
 1     4
 4     4
 6     3
 3     3
Name: count, dtype: int64
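
Most articles end up as noise (-1) in the counts above. A rough sketch for step 6 of the process ("select only largest clusters?") follows, using only the standard library. The helper names and the `min_size=5` cut-off are assumptions for illustration; the cluster sizes are copied from the value_counts() output.

```python
from collections import Counter

# Cluster sizes copied from the value_counts() output
counts = {-1: 87, 0: 13, 2: 9, 5: 6, 1: 4, 4: 4, 6: 3, 3: 3}
labels = [label for label, n in counts.items() for _ in range(n)]

def noise_fraction(labels):
    """Share of points that DBSCAN labelled as noise (-1)."""
    return sum(1 for label in labels if label == -1) / len(labels)

def largest_clusters(labels, min_size):
    """Cluster ids with at least min_size members, largest first, noise excluded."""
    sizes = Counter(label for label in labels if label != -1)
    return [label for label, n in sizes.most_common() if n >= min_size]

print(len(labels))
print(round(noise_fraction(labels), 3))
print(largest_clusters(labels, min_size=5))
```

In the notebook itself the same helpers could be called on `articles_df['cluster'].tolist()` instead of the reconstructed list.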

In [17]:
print(articles_df[articles_df['cluster'] == 0]['title'])
print(articles_df[articles_df['cluster'] == 0]['description'])

0                                          What Donald Trump's Win Will Mean for Big Tech
3               More Spyware, Fewer Rules: What Trump’s Return Means for US Cybersecurity
46                           How will a second Trump presidency affect the tech industry?
51               Did you need another reminder that Donald Trump watches TV? Here you go.
59     Elon Musk went all-in on Pennsylvania — and his expensive effort paid off big time
68           Silicon Valley is betting a Musk-inspired Trump could unleash a startup boom
102                                       What Donald Trump’s Win Could Mean for Vaccines
103                             How long will Elon Musk and Donald Trump’s lovefest last?
105                                           What Donald Trump’s Win Means For Inflation
108                                    What Trump's win means for the consulting industry
109                                                  What Trump’s Win Means for Education
119       

In [18]:
print(articles_df[articles_df['cluster'] == 2]['title'])
print(articles_df[articles_df['cluster'] == 2]['description'])



22                                                             Watch: Putin congratulates 'courageous' Trump on election win
35                                                    Who is Pete Hegseth - the Fox News host who will be defence secretary?
42                                   'Crimea is gone': Trump adviser says Ukraine focus must be peace not retaking territory
64                                                              Inside Trump and Putin's first phone call since the election
67                                      With Trump's White House win, the clock is ticking on over $6 billion in Ukraine aid
70     Trump's Pentagon pick criticized US involvement in Ukraine, said Putin probably wouldn't go 'much further' if he wins
89                                                      Trump to nominate Fox News host Pete Hegseth to be defense secretary
115                                          Trump takes calls from growing list of world leaders following election victory


In [19]:
print(articles_df[articles_df['cluster'] == 5]['title'])
print(articles_df[articles_df['cluster'] == 5]['description'])

31                 Divided Arizona contends with Trump's sweeping border plan
79                                                  Donald Trump wins Arizona
110    Trump won a comprehensive victory. Can other Republicans replicate it?
117      More voters saw Trump as the candidate of change: Exit poll analysis
118                The Strategist Who Predicted Trump’s Multiracial Coalition
127                                  Why Democrats Are Losing the Culture War
Name: title, dtype: object
31                                                                                   The state is likely to be one places seeing a shift under the president-elect's immigration policies.
79     Arizona was that last state to be called by the Associated Press in the presidential race. It brings Trump's total electoral vote count to 312, with 226 for Vice President Harris.
110                                         Donald Trump scored a comprehensive victory Tuesday, improving his numbers across sever

In [20]:
print(articles_df[articles_df['cluster'] == 1]['title'])
print(articles_df[articles_df['cluster'] == 1]['description'])


8     Jeff Bezos says he’s a climate guy — why is he kissing the ring?
27                    Climate talks to open in shadow of Trump victory
29                     Trump victory 'major setback' to climate action
36    Climate fight 'bigger than one election', says Biden’s top envoy
Name: title, dtype: object
8     Jeff Bezos joined other tech moguls in congratulating Trump on his election victory. That doesn’t seem to jive with his “passion” for fighting climate change.
27                                                                   Leaders heading to the annual UN climate talks have a lot of other distractions to contend with
29                                   Trump's election will hit immediate efforts to tackle climate change, experts say - but the longer-term effect is less certain.
36                                                          COP29 climate summit opens amid fears the US election will derail efforts to stop the planet heating up.
Name: description, dtype: obj

In [21]:
print(articles_df[articles_df['cluster'] == 4]['title'])
print(articles_df[articles_df['cluster'] == 4]['description'])


63                                                 Matt Gaetz's most controversial moments
83                                     Trump announces Matt Gaetz as attorney general pick
87     Senate Republicans concerned with Gaetz nomination ask to access House ethics probe
126                                        Trump nominates Matt Gaetz for Attorney General
Name: title, dtype: object
63                                                                                 Matt Gaetz, Donald Trump's polarizing pick to be the next attorney general, has long been at the center of controversy.
83             President-elect Donald Trump announced his pick for one of the biggest jobs in his new administration: attorney general. And for that job, Trump has chosen Florida Congressman Matt Gaetz.
87     Senators are calling for access to a House Ethics Committee probe into former Rep. Matt Gaetz, R-Fla., following his nomination to be the next attorney general under President-elect Donald Trump.


In [24]:
# Step 4: summarize main clusters (may use OpenAI for this in the future)
summaries = []
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
for cluster in articles_df['cluster'].unique():
    if cluster == -1:
        continue  # -1 is DBSCAN's noise label, not a real cluster
    cluster_articles = articles_df[articles_df['cluster'] == cluster]['combined_text']
    # cap the input at the first 15 articles of the cluster
    combined_text = ' '.join(cluster_articles[:15])
    if combined_text.strip():
        # truncation=True guards against inputs longer than the model's limit
        summary = summarizer(combined_text, max_length=50, min_length=30,
                             do_sample=False, truncation=True)
        summary_text = summary[0]['summary_text']
        summaries.append({'cluster': cluster, 'summary_text': summary_text})
        print(f"Cluster {cluster} Summary: {summary_text}")

Cluster 0 Summary:  Trump's approach to Big Tech has oscillated between calls for stricter regulations for some players and a hands-off approach for others . Silicon Valley is betting a Musk-inspired Trump could unleash a startup boom . The president-elect returned to
Cluster 1 Summary:  Jeff Bezos says he’s a climate guy — why is he kissing the ring? Jeff Bezos joined other tech moguls in congratulating Trump on his election victory . COP29 climate summit opens amid fears the US election will derail efforts
Cluster 2 Summary:  Pete Hegseth, a Fox News host, has advocated for a conservative cultural shift in the US military . On a podcast last week, he said he didn't want US intervention forcing Putin's hand . With Trump's White House
Cluster 5 Summary:  Arizona was that last state to be called by the Associated Press in the presidential race . It brings Trump's total electoral vote count to 312, with 226 for Vice President Harris .
Cluster 6 Summary:  Trump fans and Harris supporters 
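
The summarizer input is a plain concatenation of up to 15 title + description strings, so large clusters can produce very long inputs. One possible way to keep that under control is a word budget, sketched below with the standard library only. The helper `build_summary_input` and the `max_words=600` default are assumptions, and a word count is only a rough proxy for the model's token limit.

```python
def build_summary_input(texts, max_articles=15, max_words=600):
    """Join article texts until the article cap or the word budget is reached.

    The first text is always kept, even if it exceeds the budget by itself.
    """
    selected = []
    words_used = 0
    for text in texts[:max_articles]:
        n_words = len(text.split())
        if selected and words_used + n_words > max_words:
            break
        selected.append(text)
        words_used += n_words
    return ' '.join(selected)

# Hypothetical mini-cluster with a tiny budget to show the cut-off
demo = build_summary_input(["a b c", "d e", "f g h i"], max_words=5)
print(demo)
```

In the loop above this could replace the `' '.join(cluster_articles[:15])` line, for example as `build_summary_input(cluster_articles.tolist())`.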

In [None]:
# PSA: I tried this with OpenAI and the summaries are much better, so we'll probably use that.