# Approach 2

# K-Means Clustering

In [105]:
from sklearn import metrics
import pandas as pd
df = pd.read_json('realDonaldTrump.jsonl', lines=True)
df = df['full_text']
df.head()

0    As everybody is aware, the past Administration...
1    “This isn’t some game. You are screwing with t...
2    I have been briefed on the U.S. C-130 “Hercule...
3    Congratulations @SecPompeo! https://t.co/ECrMG...
4    A Rigged System - They don’t want to turn over...
Name: full_text, dtype: object

In [121]:
from sklearn.feature_extraction import text 
stop_words = text.ENGLISH_STOP_WORDS.union(['https'])

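We could use either TfidfVectorizer or HashingVectorizer. HashingVectorizer is aimed at low memory usage, but it stores no vocabulary, so it cannot compute the inverse transform (mapping feature indices back to terms), and it would need a pipeline to add IDF weighting. Since we want to inspect the top terms of each cluster, we choose TfidfVectorizer instead.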
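To illustrate the trade-off, here is a minimal sketch on a made-up three-document corpus (the `docs` list and the `n_features` value are assumptions for illustration, not the tweet data). It shows that `TfidfVectorizer` learns a vocabulary, so columns can be mapped back to terms, while `HashingVectorizer` keeps no vocabulary and exposes no `inverse_transform`.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Toy corpus, assumed for illustration only (not the tweet data).
docs = [
    "fake news media",
    "border wall mexico",
    "north korea nuclear meeting",
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# TfidfVectorizer learns a term -> column mapping, so columns can be named.
vocab_size = len(tfidf.vocabulary_)
first_doc_terms = sorted(tfidf.inverse_transform(X_tfidf[:1])[0])

# HashingVectorizer is stateless: fixed output width, no stored vocabulary.
hasher = HashingVectorizer(n_features=2**10)
X_hash = hasher.transform(docs)

print(vocab_size, X_tfidf.shape)
print(first_doc_terms)
print(X_hash.shape, hasattr(hasher, "inverse_transform"))
```

The hashed matrix has a fixed width regardless of the corpus, which is what saves memory, but there is no way to recover which term a given column represents.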

In [123]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Recent scikit-learn versions require stop_words to be a list rather than a frozenset.
vectorizer = TfidfVectorizer(stop_words=list(stop_words), token_pattern=r'\b[a-zA-Z]{3,}\b', lowercase=True)
X = vectorizer.fit_transform(df)

Since our data is small and we are not trying to reduce computation time, we will use KMeans instead of MiniBatchKMeans.
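For reference, the mini-batch alternative mentioned above is available in scikit-learn as `MiniBatchKMeans`, which has the same interface as `KMeans`. The following is a minimal sketch on synthetic data (the random matrix, `n_clusters=4`, `batch_size`, and `random_state` are assumptions chosen for illustration), showing that it could be swapped in if the dataset grew:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Synthetic stand-in for the TF-IDF matrix (assumption: 300 samples, 5 features).
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 5)

# Full-batch k-means: every iteration uses all samples.
km_full = KMeans(n_clusters=4, n_init=1, random_state=0).fit(X_demo)

# Mini-batch k-means: each update uses a random subset of `batch_size` samples,
# trading some cluster quality for speed on large datasets.
km_mini = MiniBatchKMeans(
    n_clusters=4, batch_size=100, n_init=1, random_state=0
).fit(X_demo)

print(km_full.labels_.shape, km_mini.labels_.shape)
print(km_full.cluster_centers_.shape, km_mini.cluster_centers_.shape)
```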

In [124]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=10, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [125]:
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_))

Silhouette Coefficient: 0.003


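The silhouette coefficient measures how compact and well separated the clusters are. It ranges from -1 to 1, where higher is better, and it depends on the number of clusters we are targeting. A score of 0.003 is close to 0, which indicates heavily overlapping clusters; this is not unusual for sparse, high-dimensional TF-IDF vectors of short texts. Even so, the top terms per cluster suggest that the model is doing a decent job of grouping related terms. Cluster 2 is an obvious one: its top term is "news" and it contains words such as "fake", "great", and "media". This makes sense because he is always talking about "fake news", "make america great again", and the media.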
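Because the score depends on the number of clusters, one way to choose `n_clusters` is to compute it for several values and compare. Below is a minimal sketch of that idea on synthetic data (the blob centers, `cluster_std`, the range of `k`, and `random_state` are all assumptions for illustration; the same loop could be run on the TF-IDF matrix instead). For each sample, the silhouette is `(b - a) / max(a, b)`, where `a` is the mean distance to points in its own cluster and `b` is the mean distance to points in the nearest other cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three tight, well-separated 2-D blobs.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Mean of (b - a) / max(a, b) over all samples, always in [-1, 1].
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print("k=%d  silhouette=%.3f" % (k, s))
print("best k:", best_k)
```

On data with clear structure the score peaks at the natural number of clusters; on tweet-like text the curve is usually much flatter, so the top terms remain a useful sanity check.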

In [126]:
print("Top terms per cluster:")

order_centroids = km.cluster_centers_.argsort()[:, ::-1]

# get_feature_names() was removed in scikit-learn 1.2; on 1.0 or later use get_feature_names_out().
terms = vectorizer.get_feature_names()
for i in range(10):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()  # end the line once per cluster, not once per term

Top terms per cluster:
Cluster 0: trump minister prime president abeshinzo congratulations donald shows happens nikkihaley
Cluster 1: whitehouse join america great making flotus maga bush president michigan
Cluster 2: news fake great media foxandfriends doing believe spending cnn just
Cluster 3: korea north just meeting kim time nuclear south jong good
Cluster 4: border wall laws democrats country people daca mexico congress southern
Cluster 5: russia collusion tremendous long office lawyers amp investigation election campaign
Cluster 6: great honor maga thank today people whitehouse welcome american work
Cluster 7: hunt witch crime justice collusion total hard happened department obstruction
Cluster 8: jobs comey hillary james mccabe history fbi lied clinton crooked
Cluster 9: trade states united happy years china countries war end treated
