## News Headlines Clustering using Word2Vec and Count Vectorizer

- Create a word2vec embedding for every headline.
- Apply the k-means clustering algorithm to the embedding matrix to form clusters.
- Build a count-vectorizer matrix of all the headlines, limited by the `max_features` parameter.
- Look up the cluster id assigned to each headline in the count-vectorizer matrix.
- Find the most frequent words in each cluster from the word counts of its headlines.
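
The steps above can be sketched end to end on toy data. This is a minimal illustration rather than the notebook's code: the four headlines, the two-dimensional `toy_vectors` and the helper `embed` are made up for the example, and `n_clusters=2` is chosen only to fit the toy data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the real headlines and word embeddings.
toy_headlines = [
    "police arrest man",
    "man held by police",
    "film star wins award",
    "star praises new film",
]
toy_vectors = {
    "police": [1.0, 0.0], "arrest": [0.9, 0.1], "man": [0.8, 0.0], "held": [0.9, 0.0],
    "film": [0.0, 1.0], "star": [0.1, 0.9], "wins": [0.0, 0.8], "award": [0.0, 0.9],
}

def embed(headline, dim=2):
    """Average the vectors of the known words; zeros if no word is known."""
    vecs = [toy_vectors[w] for w in headline.split() if w in toy_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Steps 1-2: headline embeddings, then k-means on the embedding matrix.
X = np.vstack([embed(h) for h in toy_headlines])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 3: word counts for the same headlines, in the same row order.
vect = CountVectorizer()
counts = vect.fit_transform(toy_headlines)
# vocabulary_ maps term -> column index; sorting by index recovers column order.
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))

# Steps 4-5: for each cluster id, rank terms by their mean count in that cluster.
top_terms = {}
for c in np.unique(labels):
    mean_counts = np.asarray(counts[labels == c].mean(axis=0)).ravel()
    top_terms[int(c)] = terms[np.argsort(-mean_counts)[:3]].tolist()
print(top_terms)
```

The clustering uses only the embeddings; the count matrix is used afterwards to describe each cluster in words.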

In [4]:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("glove.6B.50d_orig.txt", binary=False)

In [5]:
import pandas as pd
import numpy as np

In [6]:
headlines = pd.read_csv("news_headlines.csv")

In [7]:
headlines.shape

(253032, 4)

In [8]:
headlines.head()

| | publish_date | headline_category | headline_text | yr |
|---|---|---|---|---|
| 0 | 20150101 | business.india-business | Core sector sees fastest growth in 5 months at... | 2015 |
| 1 | 20150101 | business.india-business | Fiscal deficit hits 99% of FY15 target at Nov-end | 2015 |
| 2 | 20150101 | business.india-business | Govt gives SpiceJet breather from AAI dues til... | 2015 |
| 3 | 20150101 | business.india-business | Govt eyes younger PSB chiefs; to tap private s... | 2015 |
| 4 | 20150101 | business.india-business | Trai sets 3G reserve price at 2;720crper MHz | 2015 |


### Data pre-processing

### Vectorize Title

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
# initialize vectorizer
vect = CountVectorizer(ngram_range=(1,2),stop_words='english', 
                       max_features=500)

In [11]:
vect.fit(headlines['headline_text'])
headline_matrix = vect.transform(headlines['headline_text'])

In [14]:
headline_matrix.shape

(253032, 500)

In [29]:
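# Find vocabulary
# (get_feature_names() is the older scikit-learn API; releases from 1.0 on provide get_feature_names_out())
features = vect.get_feature_names()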
# features
print(len(features))
print(features[20:30])

500
['act', 'action', 'activists', 'actor', 'ahead', 'ahmedabad', 'air', 'airport', 'anti', 'ap']
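
To make the vectorizer settings concrete, here is a small sketch on three made-up headlines. The `docs` list and `max_features=5` are assumptions for illustration; `vocabulary_` is used to list the kept terms because it is available across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines, only to show the effect of the parameters.
docs = [
    "swine flu cases rise in the city",
    "swine flu toll rises",
    "city police arrest man",
]

# ngram_range=(1, 2) counts single words and two-word phrases,
# stop_words="english" drops words such as "in" and "the",
# max_features keeps only the most frequent terms across the corpus.
toy_vect = CountVectorizer(ngram_range=(1, 2), stop_words="english", max_features=5)
toy_matrix = toy_vect.fit_transform(docs)

# vocabulary_ maps each kept term to its column index in the matrix.
kept_terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(kept_terms)
print(toy_matrix.toarray())
```

The bigram "swine flu" survives the cut because it occurs in two headlines, which is how terms such as 'year old' end up among the 500 features above.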


In [67]:
### Use word2vec features

In [30]:
from tqdm import tqdm
headline_vec = np.zeros((headlines.shape[0],50))
for i in tqdm(range(0,headlines.shape[0])):
    words = headlines["headline_text"].iloc[i].split(" ")
    words = [x.strip() for x in words]
    ind_word_vecs = [glove_model.word_vec(x) for x in words if x in glove_model.vocab]
    headline_vec[i] = np.array(ind_word_vecs).mean(axis=0)

100%|███████████████████████████████████████████████████████████████████████| 253032/253032 [00:12<00:00, 20761.27it/s]


In [35]:
headline_vec = np.nan_to_num(headline_vec)
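
`np.nan_to_num` is needed because a headline with no in-vocabulary words produces an empty list, and the mean of an empty array is NaN. A minimal sketch of an alternative that returns zeros directly is shown below; `average_vector` and `toy_lookup` are hypothetical names, and a plain dict stands in for the embedding model.

```python
import numpy as np

def average_vector(tokens, lookup, dim):
    """Mean of the vectors for tokens found in lookup; zeros if none are found."""
    vecs = [lookup[t] for t in tokens if t in lookup]
    if not vecs:
        # No known word: return zeros instead of taking the mean of an empty list.
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# A dict standing in for the word-vector model (assumption for the example).
toy_lookup = {"core": np.array([1.0, 3.0]), "sector": np.array([3.0, 1.0])}

print(average_vector(["core", "sector", "zzz"], toy_lookup, dim=2))  # [2. 2.]
print(average_vector(["zzz"], toy_lookup, dim=2))                    # [0. 0.]
```

Either way, headlines without any known word end up at the origin, so they will tend to fall into the same cluster.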

In [36]:
print(headline_vec.shape)

(253032, 50)


In [37]:
### Clustering the title

In [38]:
from sklearn.cluster import KMeans

In [39]:
kmeans = KMeans(n_clusters=8, random_state=0)
kmeans.fit(headline_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [45]:
print(len(kmeans.labels_))
print(kmeans.labels_)

253032
[6 6 5 ... 6 2 6]


In [40]:
# convert count vectorizer matrix to a dataframe with feature names as cols and count vector for each headline as rows

In [41]:
headline_matrix_df = pd.DataFrame(headline_matrix.toarray())
headline_matrix_df.columns = vect.get_feature_names()

In [42]:
headline_matrix_df.head()

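| | 000 | 10 | 100 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | ... | world | world cup | worth | year | year old | years | yoga | youth | yr | yr old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |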


In [43]:
# the headline_text column contains the headlines
headlines["headline_text"].values

array(['Core sector sees fastest growth in 5 months at 6.7%',
       'Fiscal deficit hits 99% of FY15 target at Nov-end',
       'Govt gives SpiceJet breather from AAI dues till Jan 10', ...,
       "Barack Obama was made to look 'blacker' in Republican Party campaign ads",
       'Bill Cosby charged with felony sexual assault in Pennsylvania',
       'At issue in UN Syria ceasefire plan: Who is a terrorist?'],
      dtype=object)

In [57]:
# add headline_text column and clusterid column to headline_matrix df

In [46]:
headline_matrix_df["headline_text"] = headlines["headline_text"].values
headline_matrix_df["cluster"] = kmeans.labels_

In [56]:
print(headline_matrix_df.shape)
headline_matrix_df[['headline_text', 'cluster']].sample(10)

(253032, 502)


| | headline_text | cluster |
|---|---|---|
| 242784 | Fake railway secretary sent to 10-day remand | 7 |
| 244133 | Nyishi community decides to protest against Ra... | 7 |
| 64527 | BJD unleashes star power | 0 |
| 90659 | Quick police action sought | 7 |
| 168763 | Titanic's last luncheon menu expected to fetch... | 6 |
| 13900 | Brokerages turn bearish on HUL due to dull res... | 1 |
| 156659 | K'bawdi flyover's final arm to open tomorrow | 6 |
| 180966 | Caribbean lounge celebrates three years of suc... | 6 |
| 217099 | FDA seizes edible oil stored in old tins | 5 |
| 209025 | Sania-Martina brigade rolls on | 5 |


In [None]:
# number of headlines in each cluster

In [55]:
headline_matrix_df["cluster"].value_counts()

6    49438
1    39825
4    37700
5    35089
7    30137
0    27429
2    24191
3     9223
Name: cluster, dtype: int64

### Analyze the most frequent words and sample headlines from each cluster
### Detect any broad themes for each cluster

In [59]:
cluster_id = 0
temp = headline_matrix_df.loc[headline_matrix_df["cluster"] == cluster_id]
print(len(temp))

27429


In [63]:
temp.sample(5)

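| | 000 | 10 | 100 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | ... | worth | year | year old | years | yoga | youth | yr | yr old | headline_text | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 251435 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Matthew McConaughey adopts wife's tradition to... | 0 |
| 246616 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Beckham eyes Sweden's Zlatan for MLS club | 0 |
| 57657 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Asia's brightest study Singapore's Lee Kuan Ye... | 0 |
| 79967 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Now; locker thief says he didn't steal papers ... | 0 |
| 206736 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Amy Winehouse's mother slams posthumous album ... | 0 |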


In [71]:
top_words = temp.drop(columns=["cluster", "headline_text"]).mean().sort_values(ascending=False)

In [77]:
#print(top_words.iloc[0:50])
print(top_words.index[0:50])

Index(['film', 'new', 'gets', 'india', 'life', 'day', 'khan', 'world', 'music',
       'girl', 'delhi', 'tv', 'kapoor', 'love', 'indian', 'city', 'mumbai',
       'wins', 'old', 'man', 'award', 'birthday', 'star', 'wedding', 'year',
       'turns', 'bollywood', 'kids', 'best', 'woman', 'women', 'singh',
       'fashion', 'says', 'play', 'photos', 'fest', 'festival', 'school',
       'makes', 'title', 'son', 'book', 'salman', '2015', 'big', 'students',
       'stars', 'films', 'modi'],
      dtype='object')
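
The same ranking can be computed for every cluster at once instead of one `cluster_id` at a time. The sketch below assumes a frame shaped like `headline_matrix_df` (term-count columns plus `headline_text` and `cluster`); `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of headline_matrix_df: three term columns,
# the headline text, and the cluster id of each headline.
toy_df = pd.DataFrame({
    "film":   [2, 1, 0],
    "police": [0, 0, 3],
    "star":   [1, 0, 0],
    "headline_text": ["a", "b", "c"],
    "cluster": [0, 0, 1],
})

# Mean count of each term within each cluster (one row per cluster id).
cluster_means = toy_df.drop(columns=["headline_text"]).groupby("cluster").mean()

# For each cluster, keep the names of the n terms with the highest mean count.
n = 2
top_by_cluster = {
    int(cid): row.nlargest(n).index.tolist()
    for cid, row in cluster_means.iterrows()
}
print(top_by_cluster)
```

With the real data, `n = 50` would reproduce the 50-word lists shown for each cluster.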


In [78]:
temp['headline_text'].iloc[0:50]

20                          Saptak festival begins today
42        Bengaluru doctors turn saviours for Afghan boy
54                     Florists make hay on New Year eve
75                   Salman faces rap for Lanka campaign
79     HC: Lawyer who protects illegality is like a m...
80        AIADMK organisational polls see little contest
84     3-month-old hippo released for public view in ...
85     Sakshi looked cute partying at Small World pub...
94                   Shivering Delhi makes its gods snug
158    Illegal constructions mar city's heritage look...
161              Udaipur river model for Narmada revival
164            Calligraphers get fillip at Sufi festival
174         City bosses promise the moon and stars again
204                       Keeper of Bengal's rice legacy
241    Mumbai midnight mass expresses gratitude for 2...
249                        A Mysore Silk shawl for Obama
259                        Sweet success with strawberry
306                     Two hel

In [80]:
# Cluster 0 mixes entertainment headlines (films, celebrities) with general and political news.
# This cluster does not seem to be strongly defined by its word frequencies.

In [81]:
cluster_id = 3
temp = headline_matrix_df.loc[headline_matrix_df["cluster"] == cluster_id]

In [82]:
top_words = temp.drop(columns=["cluster", "headline_text"]).mean().sort_values(ascending=False)

In [83]:
top_words.index[0:50]

Index(['review', 'day', 'vs', 'singh', 'india', 'celebs', 'year', 'old',
       'govt', 'pics', '2015', 'music', 'year old', 'kumar', 'rs', 'khan',
       'bjp', 'world', 'photos', 'indian', 'delhi', 'sharma', 'tv', 'new',
       'cops', 'kapoor', 'modi', 'cm', 'best', 'congress', 'slams', 'party',
       'hc', 'city', 'cup', 'awards', 'pm', 'miss', 'salman', 'eye',
       'bollywood', 'ex', 'man', 'special', 'women', 'south', 'woman', 'make',
       'mumbai', 'live'],
      dtype='object')

In [84]:
temp['headline_text'].iloc[0:50]

87                          Village installs biogas plant
130                                ABMSU slams Gogoi govt
144                 The Milad Effect: Blurring Ideologies
205                 Better lives beckon mangrove dwellers
340               Jennifer Garner slams 'abusive' hacking
352                    Britney Spears gifted beau horses?
402                         Ansiba against cybercriminals
414                                 Lokmanya Ek Yugpurush
417                       MUSIC REVIEW: Avatarachi Goshta
439              Music Review: Malli Malli Idhi Rani Roju
507                                  Recipe: Mocha Coffee
508            Restaurant Review: Beeryani (North Indian)
512                          Recipe: Kerala Chicken curry
536            Federation Cup: Kaith frustrates Salgaocar
566                         Mamata Banerjee joins Twitter
587                          Now meet Kapil Sharma's nani
589                           Our telly wishlist for 2015
590           

In [85]:
# Cluster 3 mixes film and music reviews, sports news and political news.
# This cluster is also not strongly defined by its word frequencies.

In [86]:
cluster_id = 2
temp = headline_matrix_df.loc[headline_matrix_df["cluster"] == cluster_id]

In [87]:
top_words = temp.drop(columns=["cluster", "headline_text"]).mean().sort_values(ascending=False)

In [88]:
top_words.index[0:50]

Index(['man', 'arrested', 'police', 'killed', 'woman', 'held', 'old', 'case',
       'death', 'murder', 'cops', 'dead', 'dies', 'year', 'girl', 'year old',
       'rape', 'accused', 'kills', 'injured', 'attack', 'suicide', 'accident',
       'gang', 'delhi', 'wife', 'cop', 'near', 'booked', 'car', 'youth',
       'minor', 'son', 'yr', 'jail', 'driver', 'student', 'yr old', 'killing',
       'bus', 'flu', 'rs', 'family', 'road', 'boy', 'swine', 'swine flu',
       'gets', 'missing', 'mumbai'],
      dtype='object')

In [103]:
temp['headline_text'].iloc[0:50]

23                   SP MLA's nephew missing from Mumbai
24      13 girls escape from juvenile home in Pratapgarh
39     NRI among six dead in different incidents in dist
40                      Deserted by wife; man kills self
64     Release inmates who finished their jail terms:...
65     Girl accuses constable of raping her for a yea...
72          Mohali resident held in possession of heroin
76     CB-CID to probe teen's sex torture in police c...
86     Tamil Nadu adds insult to trauma of Bengaluru ...
98             Two women held for robbing senior citizen
104     17 AC buses of DTC gutted in fire; probe ordered
114    Seven-day custody for accused in cheating case...
115               13-year-old boy drowns in St Cruz; Goa
116    Designer's postmortem done; cause of death res...
117    2 unidentified bodies found at Kundaim; Shirga...
129             Rhino; forest guard killed at Orang park
146     Peon held for sending lewd SMS to affluent women
160       Protest against morph

In [89]:
# Cluster 2 contains crime news.
# This cluster is strongly defined by its word frequencies.

## Training our own Word2Vec model

- gensim expects the training data as a list of lists, one token list per headline: `[[w1, w2, ...], [w1, w2, ...], ...]`

In [94]:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
import re

In [95]:
# clean the text: drop everything except letters and spaces, then lowercase
headline_text_clean = [re.sub("[^a-zA-Z ]","",x).lower() for x in headlines["headline_text"]]

In [96]:
# split (tokenize) each headline and store the tokens in a list.
# The result is the list of lists that gensim requires.

train_data = [x.split(" ") for x in headline_text_clean]

In [98]:
#list of words in first headline
train_data[0]

['core', 'sector', 'sees', 'fastest', 'growth', 'in', '', 'months', 'at', '']
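
The empty strings in this output come from splitting on a single space after the digits were removed: "in 5 months" becomes "in  months", and `split(" ")` yields `''` between the two spaces. These empty tokens are then passed to the model like any other word. A small sketch of a tokenizer that avoids them follows; `tokenize` is a hypothetical helper, not part of the notebook.

```python
import re

def tokenize(headline):
    """Keep letters and spaces, lowercase, and split on any run of whitespace."""
    cleaned = re.sub(r"[^a-zA-Z ]", "", headline).lower()
    # split() with no argument collapses repeated spaces and drops empty tokens.
    return cleaned.split()

print(tokenize("Core sector sees fastest growth in 5 months at 6.7%"))
# ['core', 'sector', 'sees', 'fastest', 'growth', 'in', 'months', 'at']
```

Applying it to every headline would give the same list-of-lists structure, without the `''` entries.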

In [99]:
train_data

[['core', 'sector', 'sees', 'fastest', 'growth', 'in', '', 'months', 'at', ''],
 ['fiscal', 'deficit', 'hits', '', 'of', 'fy', 'target', 'at', 'novend'],
 ['govt',
  'gives',
  'spicejet',
  'breather',
  'from',
  'aai',
  'dues',
  'till',
  'jan',
  ''],
 ['govt',
  'eyes',
  'younger',
  'psb',
  'chiefs',
  'to',
  'tap',
  'private',
  'sector'],
 ['trai', 'sets', 'g', 'reserve', 'price', 'at', 'crper', 'mhz'],
 ['over', '', 'cos', 'in', 'bse', '', 'grew', 'fold', 'in', '', 'years'],
 ['sensex', 'nifty', 'up', '', 'in', ''],
 ['ma',
  'of',
  'psu',
  'banks',
  'to',
  'be',
  'a',
  'key',
  'theme',
  'at',
  'pms',
  'gyan',
  'sangam'],
 ['no',
  'fuel',
  'price',
  'cut',
  'as',
  'oil',
  'companies',
  'get',
  'crude',
  'shock'],
 ['jaitley',
  'vows',
  'reforms',
  'push',
  'in',
  'new',
  'year',
  'to',
  'boost',
  'growth'],
 ['rupee', 'down', '', 'paise', 'against', 'dollar'],
 ['sensex',
  'begins',
  'new',
  'year',
  'on',
  'a',
  'weak',
  'note',
  'fa

In [100]:
# `size` is the gensim 3.x argument name; gensim 4+ calls it `vector_size`
model = Word2Vec(train_data, size=100, window=5, min_count=3)
model.save("word2vec_headlines.model")

In [101]:
# loading the model
headlines_w2v = Word2Vec.load("word2vec_headlines.model")

In [103]:
# find the most similar words
headlines_w2v.wv.most_similar("bjp")

[('congress', 0.9507282972335815),
 ('aap', 0.9332325458526611),
 ('cong', 0.9227748513221741),
 ('sena', 0.8621093034744263),
 ('ncp', 0.8266646265983582),
 ('nda', 0.8215234279632568),
 ('bjps', 0.8161717653274536),
 ('sad', 0.8082656860351562),
 ('opposition', 0.802887499332428),
 ('tmc', 0.8025071620941162)]

In [104]:
# find the most similar words
headlines_w2v.wv.most_similar("aap")

[('bjp', 0.9332325458526611),
 ('congress', 0.9233661890029907),
 ('cong', 0.9201474189758301),
 ('sena', 0.8885682821273804),
 ('kejriwal', 0.8732914924621582),
 ('sad', 0.8505297899246216),
 ('bjps', 0.8454707264900208),
 ('opposition', 0.8411419987678528),
 ('nda', 0.8401951789855957),
 ('nitish', 0.8323043584823608)]

In [105]:
# find the most similar words
headlines_w2v.wv.most_similar("film")

[('movie', 0.8307361602783203),
 ('song', 0.801473081111908),
 ('music', 0.790142297744751),
 ('cinema', 0.7575728893280029),
 ('debut', 0.7564358711242676),
 ('marathi', 0.7550472021102905),
 ('vijay', 0.7381199598312378),
 ('comedy', 0.7277856469154358),
 ('theatre', 0.7231904864311218),
 ('films', 0.7186452150344849)]
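
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the idea with NumPy on made-up two-dimensional vectors; the `toy` dict and both helper functions are assumptions for illustration and do not use the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to the given word."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-d vectors: two party names point the same way, "film" does not.
toy = {
    "bjp": np.array([1.0, 0.1]),
    "congress": np.array([0.9, 0.2]),
    "film": np.array([0.0, 1.0]),
}
print(most_similar("bjp", toy))
```

Words that appear in similar contexts get vectors pointing in similar directions, which is why party names rank highest for "bjp" in the trained model.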

In [106]:
from tqdm import tqdm
headline_vec = np.zeros((headlines.shape[0],100))
for i in tqdm(range(0,headlines.shape[0])):
    words = headline_text_clean[i].split(" ")
    words = [x.strip() for x in words]
    ind_word_vecs = [headlines_w2v.wv[x] for x in words if x in headlines_w2v.wv.vocab]
    headline_vec[i] = np.array(ind_word_vecs).mean(axis=0)

100%|███████████████████████████████████████████████████████████████████████| 253032/253032 [00:11<00:00, 22818.64it/s]


In [108]:
headline_vec = np.nan_to_num(headline_vec)

In [109]:
headline_vec.shape

(253032, 100)

### Re-run the clustering algorithm with the updated vectors and analyze the theme of each cluster

In [110]:
### Clustering the title

In [112]:
from sklearn.cluster import KMeans

In [115]:
kmeans = KMeans(n_clusters=8, random_state=0)
kmeans.fit(headline_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [116]:
kmeans.labels_

array([0, 0, 0, ..., 3, 5, 4])

In [117]:
headline_matrix_df = pd.DataFrame(headline_matrix.toarray())
headline_matrix_df.columns = vect.get_feature_names()

In [118]:
headline_matrix_df.head()

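| | 000 | 10 | 100 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | ... | world | world cup | worth | year | year old | years | yoga | youth | yr | yr old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |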


In [119]:
headline_matrix_df["headline_text"] = headlines["headline_text"].values
headline_matrix_df["cluster"] = kmeans.labels_

In [120]:
headline_matrix_df["cluster"].value_counts()

4    55917
2    46341
5    28315
3    27944
0    27048
1    23867
6    23059
7    20541
Name: cluster, dtype: int64
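
Cluster ids are arbitrary labels, so the same theme can get a different id in each run (crime headlines were cluster 2 with the GloVe vectors). If the labels of both runs are kept, a cross-tabulation shows which clusters correspond. This is a sketch on made-up label arrays; `glove_labels` and `w2v_labels` are hypothetical names, since the notebook overwrites `kmeans` between runs.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster ids for the same six headlines under the two embeddings.
glove_labels = np.array([2, 2, 2, 0, 0, 5])
w2v_labels = np.array([6, 6, 6, 3, 3, 3])

# Rows: clusters from the first run; columns: clusters from the second run.
# Each cell counts the headlines that received that pair of ids.
overlap = pd.crosstab(glove_labels, w2v_labels, rownames=["glove"], colnames=["own_w2v"])
print(overlap)
```

A row dominated by a single column indicates that a cluster carried over largely intact between the two runs.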

In [121]:
cluster_id = 0
temp = headline_matrix_df.loc[headline_matrix_df["cluster"] == cluster_id]

In [122]:
len(temp)

27048

In [123]:
top_words = temp.drop(columns=["cluster", "headline_text"]).mean().sort_values(ascending=False)

In [124]:
top_words.index[0:50]

Index(['rs', 'crore', 'lakh', '10', '2015', '000', 'years', 'india', 'year',
       'days', '15', 'delhi', 'cases', 'day', 'state', 'new', 'city', '20',
       'mumbai', 'rain', '50', 'worth', 'water', '12', '30', 'months', 'flu',
       '100', '25', 'swine', 'swine flu', 'students', 'cr', 'govt', 'rise',
       'women', 'trade', 'hit', 'prices', '11', '16', 'power', '14', 'gold',
       '2016', 'early', '40', '13', 'toll', 'hours'],
      dtype='object')

In [125]:
#top_words.iloc[0:50]
temp['headline_text'].iloc[0:50]

0      Core sector sees fastest growth in 5 months at...
1      Fiscal deficit hits 99% of FY15 target at Nov-end
2      Govt gives SpiceJet breather from AAI dues til...
5          Over 20 cos in BSE 500 grew 3-fold in 4 years
6                           Sensex; nifty up 30% in 2014
10                    Rupee down 21 paise against dollar
11     Sensex begins New Year on a weak note; falls b...
12      Gold prices slip in futures trade on global cues
13       Maruti's December sales jump 20.8 pc; shares up
14           India Inc promises 10 lakh new jobs in 2015
17         Sensex up by 8 pts in tepid start to new year
18           General Motors' sales decline 36.56% in Dec
19               12 flights; 5 trains delayed due to fog
21                     AMTS proposes 450 buses in budget
24      13 girls escape from juvenile home in Pratapgarh
32                 Eid-E-Milad celebrations on January 4
34     Maize sale in dist results in losses running i...
45     Police zero in on 2 susp

In [126]:
# Cluster 0 is well defined: it is dominated by business headlines and other headlines that cite figures (amounts, counts, dates)

In [127]:
cluster_id = 6
temp = headline_matrix_df.loc[headline_matrix_df["cluster"] == cluster_id]

In [128]:
len(temp)

23059

In [129]:
top_words = temp.drop(columns=["cluster", "headline_text"]).mean().sort_values(ascending=False)

In [130]:
top_words.index[0:50]

Index(['held', 'man', 'woman', 'arrested', 'old', 'killed', 'police', 'cops',
       'year', 'murder', 'year old', 'girl', 'death', 'case', 'dies', 'dead',
       'booked', 'accused', 'injured', 'kills', 'rape', 'gang', 'cop',
       'suicide', 'youth', 'yr', 'accident', 'attack', 'near', 'yr old', 'rs',
       'wife', 'minor', 'car', 'driver', 'life', 'student', 'boy', 'delhi',
       'gets', 'road', 'son', 'bus', 'self', 'missing', 'family', 'mumbai',
       'killing', 'body', 'kin'],
      dtype='object')

In [131]:
#top_words.iloc[0:50]
temp['headline_text'].iloc[0:50]

23                   SP MLA's nephew missing from Mumbai
38               Suspects lift bag; abandon it near hosp
39     NRI among six dead in different incidents in dist
40                      Deserted by wife; man kills self
63     9 booked for hurting religious sentiments in Moga
65     Girl accuses constable of raping her for a yea...
72          Mohali resident held in possession of heroin
76     CB-CID to probe teen's sex torture in police c...
86     Tamil Nadu adds insult to trauma of Bengaluru ...
88                Nurse; tout held for child trafficking
98             Two women held for robbing senior citizen
113                       Russian held in Goa with drugs
114    Seven-day custody for accused in cheating case...
115               13-year-old boy drowns in St Cruz; Goa
129             Rhino; forest guard killed at Orang park
147    Woman run over by college bus; residents vanda...
150    Khandelwal murder: Lawyers beat up fourth accused
171                   Two super

In [132]:
# Cluster 6 is well defined and contains crime-related news headlines

In [133]:
cluster_id = 7
temp = headline_matrix_df.loc[headline_matrix_df["cluster"] == cluster_id]

In [134]:
len(temp)

20541

In [135]:
top_words = temp.drop(columns=["cluster", "headline_text"]).mean().sort_values(ascending=False)

In [136]:
top_words.index[0:50]

Index(['make', 'india', 'know', 'women', 'things', 'ways', 'health', 'don',
       'tips', 'diet', 'says', 'good', 'sex', 'want', 'kids', '10', 'best',
       'men', 'people', 'need', 'time', 'home', 'life', 'help', 'say', 'right',
       'like', 'way', 'love', 'indian', 'day', 'world', 'food', 'new',
       'better', 'just', 'look', 'work', '2015', 'modi', 'yoga', 'year',
       'summer', 'cancer', 'delhi', 'city', 'big', 'experts', 'heart',
       'change'],
      dtype='object')

In [137]:
#top_words.iloc[0:50]
temp['headline_text'].iloc[0:50]

178                            So what if it's Thursday!
240    'We voted for the promise of development; not ...
333                    Lakhera to be released this month
341    I have never taken any money from anyone; says...
345    Gwyneth Paltrow says women want to be 'sexual ...
347    Women should be compassionate towards each oth...
360    Actress Samvedna Suwalka talks about her New Y...
366        A film that asks the nation to 'Take It Easy'
379                          Twinkle twinkle can be star
389             Mithoon: I will never be short of energy
394    Will it be third time lucky for Ganesh and Amulya
397                  Take a break and share a few laughs
401                   Vineeth Sreenivasan not in My God!
404                          Dulquer to have a busy 2014
453         Physical inactivity can damage blood vessels
454                         'Science should help humans'
464    Black money: 42% say Modi government has kept ...
498                         Cur

In [138]:
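# Cluster 7 is reasonably well defined, but its top words (health, tips, diet, ways, things) and sample headlines
# suggest lifestyle, health and quote-style headlines rather than political news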

In [152]:
cluster_id = 3
temp = headline_matrix_df.loc[headline_matrix_df["cluster"] == cluster_id]

In [153]:
len(temp)

27944

In [154]:
top_words = temp.drop(columns=["cluster", "headline_text"]).mean().sort_values(ascending=False)

In [155]:
top_words.index[0:50]

Index(['khan', 'kapoor', 'film', 'salman', 'singh', 'tv', 'says', 'new',
       'salman khan', 'love', 'bollywood', 'day', 'life', 'play', 'shah',
       'celebs', 'kumar', 'star', 'sharma', 'photos', 'music', 'revealed',
       'birthday', 'world', 'india', 'gets', 'wedding', 'pics', 'party',
       'look', 'actor', 'like', 'boss', 'turns', 'modi', 'best', 'indian',
       'big', 'family', 'son', 'stars', 'films', 'girl', 'year', 'rahul',
       'daughter', 'baby', 'man', 'time', 'wife'],
      dtype='object')

In [156]:
temp['headline_text'].iloc[0:50]

79     HC: Lawyer who protects illegality is like a m...
109         When Goa gave the country a defence minister
124       Amish to Gone Girl; the 2014 affair with books
144                The Milad Effect: Blurring Ideologies
174         City bosses promise the moon and stars again
202    After a life of struggle; healing touch for th...
208    Honey Singh rules dance moves in Kolkata's Chr...
259                        Sweet success with strawberry
287    Five generations of Bhosale family serve the n...
329                                 Dev turns terrorist!
330        Mahabharat stalled; director plans love story
332           Honey Singh rules dance moves on Christmas
335    Eva Mendes and Ryan Gosling think split rumour...
336      Eva Longoria: I was an annoying person in class
339    Gwyneth Paltrow wanted to stay together with C...
340              Jennifer Garner slams 'abusive' hacking
343    LiLo's father welcomes baby boy with wife Kate...
346     Snooki confirms her Ins

In [None]:
# Cluster 3 is well defined and contains film and celebrity news headlines