# K-means Clustering in sklearn

This example uses a dataset downloaded from https://www.opensubtitles.org/en/search/vip and from the raw data at opus.lingfil.uu.se/OpenSubtitles2016/raw/en. Metadata such as title, actor, and director was scraped from IMDB and is not guaranteed to be complete. The example uses the 5000 most recent movies.

The code does the following:
1. Counts words.
2. Builds a TF-IDF weighted vocabulary.
3. Applies the TF-IDF weights to the word counts to create a sparse matrix.
4. Runs K-means clustering on the sparse matrix.
5. Prints the top words for each cluster, using the largest features in the cluster centroid.



In [96]:
import sys
print sys.version

2.7.13 (default, Jul 30 2017, 14:48:40) 
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)]


## Unarchive

In [97]:
import tempfile
import zipfile
import os.path

zipFile = "./openSubtitles-5000.json.zip"

print "Unarchiving ..."
temp_dir = tempfile.mkdtemp()
zip_ref = zipfile.ZipFile(zipFile, 'r')
zip_ref.extractall(temp_dir)
zip_ref.close()

openSubtitlesFile = os.path.join(temp_dir, "openSubtitles-5000.json")
print "file unarchived to:" + openSubtitlesFile


Unarchiving ...
file unarchived to:/var/folders/9l/w4_vhqyn5rz64fh1x9zzcsvr0000gn/T/tmpcOuCH8/openSubtitles-5000.json
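
The extraction step can also be written with a context manager, so the archive is closed even if extraction raises. The following is a minimal sketch rather than the notebook's own cell: `extract_to_temp` and the `demo.json.zip` archive are hypothetical stand-ins, created here so that the snippet runs without the real data file.

```python
import os
import tempfile
import zipfile


def extract_to_temp(zip_path):
    """Extract every member of zip_path into a fresh temp directory and return that directory."""
    temp_dir = tempfile.mkdtemp()
    # The with-statement closes the archive even if extractall raises.
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(temp_dir)
    return temp_dir


# Build a tiny stand-in archive (hypothetical file names and content).
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.json.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("demo.json", '{"Title": "Example", "Text": "hello world"}\n')

extracted_dir = extract_to_temp(demo_zip)
extracted_file = os.path.join(extracted_dir, "demo.json")
print("file unarchived to: " + extracted_file)
```

Note that `tempfile.mkdtemp()` does not clean up after itself; the caller is responsible for deleting the directory when it is no longer needed.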


## Tokenizing and Filtering a Vocabulary

In [102]:

import json
from sklearn.feature_extraction.text import CountVectorizer
#from log_progress import log_progress

maxDocsToload = 2000  # index of the last line read: lines 0..2000 are loaded, i.e. 2001 documents

titles = []
def make_corpus(file):
    with open(file) as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            titles.append(doc.get('Title',''))
            if i % 100 == 0:
                print "%d " % i, 
            yield doc.get('Text','')
            if i == maxDocsToload:
                break
                
print "Starting load ..."                
textGenerator = make_corpus(openSubtitlesFile)              
count_vectorizer = CountVectorizer(min_df=2, max_df=0.5, stop_words='english', analyzer="word", token_pattern="[a-zA-Z]{3,}")
term_freq_matrix = count_vectorizer.fit_transform(textGenerator)
print "Done."
print "term_freq_matrix = \n%s" % term_freq_matrix


 Starting load ...
0  100  200  300  400  500  600  700  800  900  1000  1100  1200  1300  1400  1500  1600  1700  1800  1900  2000  Done.
term_freq_matrix = 
  (0, 1802)	1
  (0, 6086)	1
  (0, 27353)	1
  (0, 26072)	1
  (0, 34030)	1
  (0, 18759)	1
  (0, 33974)	1
  (0, 33756)	1
  (0, 28386)	1
  (0, 17081)	1
  (0, 33623)	1
  (0, 13828)	1
  (0, 4332)	1
  (0, 11248)	1
  (0, 3335)	1
  (0, 4561)	1
  (0, 40462)	1
  (0, 7422)	1
  (0, 41881)	1
  (0, 33561)	2
  (0, 29493)	1
  (0, 16605)	1
  (0, 17746)	1
  (0, 4581)	1
  (0, 22126)	1
  :	:
  (2000, 1822)	1
  (2000, 3290)	1
  (2000, 13298)	1
  (2000, 1523)	2
  (2000, 20442)	1
  (2000, 13719)	1
  (2000, 20474)	1
  (2000, 930)	2
  (2000, 24734)	1
  (2000, 11685)	2
  (2000, 4877)	1
  (2000, 17576)	5
  (2000, 13730)	1
  (2000, 14408)	1
  (2000, 15428)	1
  (2000, 29621)	1
  (2000, 4776)	5
  (2000, 20125)	1
  (2000, 7233)	3
  (2000, 35200)	1
  (2000, 41957)	1
  (2000, 22842)	1
  (2000, 32580)	2
  (2000, 4305)	1
  (2000, 6398)	7
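
The effect of the `CountVectorizer` options used above can be illustrated in plain Python. This is a simplified sketch, not scikit-learn's implementation: it lowercases the text (the vectorizer's default), keeps runs of three or more letters as in `token_pattern`, drops terms found in fewer than `min_df` documents or in more than a `max_df` fraction of them, and ignores stop-word removal entirely. The function names and the toy documents are hypothetical.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-zA-Z]{3,}")


def tokenize(text):
    """Lowercase the text and keep runs of three or more letters."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(docs, min_df=2, max_df=0.5):
    """Map each retained term to a column index, assigned in alphabetical order."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(tokenize(text)))  # count each term once per document
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs]
    return {term: idx for idx, term in enumerate(sorted(kept))}


def count_terms(docs, vocabulary):
    """Return one {column index: count} dict per document (a sparse row)."""
    rows = []
    for text in docs:
        counts = Counter(t for t in tokenize(text) if t in vocabulary)
        rows.append({vocabulary[t]: c for t, c in counts.items()})
    return rows


docs = [
    "The police captain called the police.",
    "A police detective meets the king.",
    "The king loves the queen.",
    "The queen of the gods.",
    "The dragon sleeps.",
    "Nobody talks to the dragon.",
]
vocabulary = build_vocabulary(docs)
rows = count_terms(docs, vocabulary)

# Print in the same "(row, column) count" layout as the sparse matrix above.
for row_index, row in enumerate(rows):
    for col_index in sorted(row):
        print("  (%d, %d)\t%d" % (row_index, col_index, row[col_index]))
```

In this toy corpus "the" appears in every document and is removed by `max_df`, while words that occur in a single document are removed by `min_df`.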


## Feature Vocabulary

In [103]:
print "Vocabulary length = ", len(count_vectorizer.vocabulary_)
word = "data"
# vocabulary_ maps each token to its column index in the term-frequency matrix
wordIndex = count_vectorizer.vocabulary_[word]
print "token index for \"%s\" = %d" % (word, wordIndex)
# get_feature_names() is the inverse lookup: column index -> token
feature_names = count_vectorizer.get_feature_names()
print "feature_names[%d] = %s" % (wordIndex, feature_names[wordIndex])


Vocabulary length =  42141
token index for "data" = 9324
feature_names[9324] = data
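
The relationship between the two lookups can be sketched with a plain dictionary and list. The four-term vocabulary below is a hypothetical toy mapping, not the one built from the subtitle data.

```python
# Hypothetical toy vocabulary: token -> column index.
vocabulary = {"data": 2, "aaron": 0, "abacus": 1, "police": 3}

# Invert it into a list so that feature_names[index] returns the token.
feature_names = [None] * len(vocabulary)
for term, index in vocabulary.items():
    feature_names[index] = term

word = "data"
word_index = vocabulary[word]
print("token index for \"%s\" = %d" % (word, word_index))
print("feature_names[%d] = %s" % (word_index, feature_names[word_index]))
```

Because the indices follow the alphabetical order of the tokens, the inverted list comes out sorted, which matches the listing of `feature_names` in the next cell.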


In [104]:
for i in range(0,1000):
    print "feature_names[%d] = %s" % (i, feature_names[i])

feature_names[0] = aaa
feature_names[1] = aaaaaaaaaaaaaaah
feature_names[2] = aaaaaaaaaah
feature_names[3] = aaaaaaagh
feature_names[4] = aaaaaaah
feature_names[5] = aaaaaah
feature_names[6] = aaaaah
feature_names[7] = aaaah
feature_names[8] = aaaargh
feature_names[9] = aaagh
feature_names[10] = aaah
feature_names[11] = aaahh
feature_names[12] = aaargh
feature_names[13] = aadhar
feature_names[14] = aafrin
feature_names[15] = aagh
feature_names[16] = aah
feature_names[17] = aahh
feature_names[18] = aahhh
feature_names[19] = aak
feature_names[20] = aamir
feature_names[21] = aargh
feature_names[22] = aaron
feature_names[23] = aarp
feature_names[24] = aback
feature_names[25] = abacus
feature_names[26] = abaddon
feature_names[27] = abalone
feature_names[28] = abandon
feature_names[29] = abandoned
feature_names[30] = abandoning
feature_names[31] = abandonment
feature_names[32] = abandons
feature_names[33] = abate
feature_names[34] = abattoir
feature_names[35] = abba
feature_names[36] = abbad

feature_names[348] = actress
feature_names[349] = actresses
feature_names[350] = acts
feature_names[351] = actual
feature_names[352] = actualize
feature_names[353] = actuator
feature_names[354] = acu
feature_names[355] = acumen
feature_names[356] = acupuncture
feature_names[357] = acute
feature_names[358] = acutely
feature_names[359] = ada
feature_names[360] = adage
feature_names[361] = adagio
feature_names[362] = adalind
feature_names[363] = adam
feature_names[364] = adamant
feature_names[365] = adamantine
feature_names[366] = adams
feature_names[367] = adamson
feature_names[368] = adapt
feature_names[369] = adapted
feature_names[370] = adapter
feature_names[371] = adapting
feature_names[372] = add
feature_names[373] = added
feature_names[374] = addendum
feature_names[375] = adder
feature_names[376] = adderall
feature_names[377] = addict
feature_names[378] = addicted
feature_names[379] = addicting
feature_names[380] = addiction
feature_names[381] = addictions
feature_names[382] = addi

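## TF-IDF Weighting
This applies the TF-IDF weights to the term-count matrix.

With TfidfTransformer's default settings, tfidf(t, d) = tf(t, d) * idf(t), where tf(t, d) is the count of term t in document d, idf(t) = ln((1 + n) / (1 + df(t))) + 1, n is the number of documents, and df(t) is the number of documents that contain t.

The document vectors are then normalized (norm="l2") so that each has a Euclidean magnitude of 1.0.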
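
As a worked example, the weighting and normalization can be computed by hand on a tiny dense count matrix. This sketch assumes the smoothed idf that `TfidfTransformer` uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row; `tfidf_rows` and the counts are hypothetical, and the real transformer works on sparse matrices.

```python
import math


def tfidf_rows(count_rows, n_terms):
    """Return (idf, weighted_rows) for a small dense matrix of term counts."""
    n_docs = len(count_rows)
    # Document frequency: number of documents with a non-zero count for each term.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency.
    idf = [math.log((1.0 + n_docs) / (1.0 + df)) + 1.0 for df in doc_freq]
    weighted = []
    for row in count_rows:
        scaled = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in scaled))
        # L2 normalization: divide by the Euclidean length of the row.
        weighted.append([v / norm if norm > 0 else 0.0 for v in scaled])
    return idf, weighted


counts = [
    [2, 1, 0],
    [0, 1, 3],
    [0, 1, 0],
]
idf, weighted = tfidf_rows(counts, n_terms=3)

for row in weighted:
    print(["%.4f" % v for v in row])
```

The middle term occurs in every document, so its idf is exactly 1.0 and it contributes less to each row than the rarer terms with the same count.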

In [105]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix

  (0, 1413)	0.104081639293
  (0, 40730)	0.00446522116367
  (0, 40432)	0.00981866060544
  (0, 31882)	0.00591001483191
  (0, 1092)	0.00710680029374
  (0, 32032)	0.436620103368
  (0, 11990)	0.0765188333169
  (0, 41552)	0.00352597811216
  (0, 3216)	0.0213834585029
  (0, 6398)	0.00912543656725
  (0, 38345)	0.0100068071902
  (0, 23818)	0.0108731394864
  (0, 27990)	0.0131045702889
  (0, 4305)	0.0111252531832
  (0, 19767)	0.521567697972
  (0, 32580)	0.0330213392989
  (0, 36665)	0.00969018050357
  (0, 22842)	0.00927490334951
  (0, 41957)	0.00557309620917
  (0, 4916)	0.00428518454776
  (0, 13355)	0.00995832642766
  (0, 8417)	0.00340097124456
  (0, 38155)	0.00683485970033
  (0, 36068)	0.0101113130265
  (0, 4923)	0.00433710749129
  :	:
  (2000, 19946)	0.342110171287
  (2000, 32321)	0.0718038509358
  (2000, 6953)	0.0743473912331
  (2000, 32609)	0.0822483490482
  (2000, 15305)	0.068813246568
  (2000, 38758)	0.0379347359336
  (2000, 20119)	0.0379347359336
  (2000, 20344)	0.0855275428217
  (2000, 7050

## K-Means

In [106]:
from sklearn.cluster import KMeans
import numpy

num_clusters = 10
km = KMeans(n_clusters=num_clusters, verbose=True, init='k-means++', n_init=3, n_jobs=-1)
km.fit(tf_idf_matrix)

clusters = km.labels_.tolist()
print "cluster id for each document = %s" % clusters

print "Top terms per cluster:"
print
# For each centroid, sort the feature indices from largest to smallest weight;
# the first N indices of a row are that cluster's top terms
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

        


Initialization complete
Initialization complete
Initialization complete
Iteration  0, inertia 3640.656
Iteration  0, inertia 3674.154
Iteration  0, inertia 3628.640
Iteration  1, inertia 1911.365
Iteration  1, inertia 1901.711
Iteration  1, inertia 1902.336
Iteration  2, inertia 1902.417
Iteration  2, inertia 1895.356
Iteration  2, inertia 1895.171
Iteration  3, inertia 1895.849
Iteration  3, inertia 1891.083
Iteration  3, inertia 1892.528
Iteration  4, inertia 1891.614
Iteration  4, inertia 1887.909
Iteration  4, inertia 1890.968
Iteration  5, inertia 1889.815
Iteration  5, inertia 1886.135
Iteration  5, inertia 1890.282
Iteration  6, inertia 1888.901
Iteration  6, inertia 1885.152
Iteration  6, inertia 1889.830
Iteration  7, inertia 1888.477
Iteration  7, inertia 1884.558
Iteration  7, inertia 1889.333
Iteration  8, inertia 1888.329
Iteration  8, inertia 1884.165
Iteration  8, inertia 1888.881
Iteration  9, inertia 1888.244
Iteration  9, inertia 1883.933
Iteration  9, inertia 1888.70
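
The inertia reported in the log is the sum of squared distances from each sample to its nearest centroid. The following pure-Python sketch shows why it does not increase from one iteration to the next: each pass reassigns points to the closest centroid and then moves every centroid to the mean of its members. It is a simplified illustration, not scikit-learn's implementation; in particular it initializes with the first k points instead of k-means++, and all names and data points are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]


def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


def update(points, labels, centroids):
    """Move each centroid to the mean of its members; leave it in place if it has none."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
centroids = [list(points[0]), list(points[1])]  # naive init: the first k points

history = []
for iteration in range(5):
    labels = assign(points, centroids)
    history.append(inertia(points, centroids, labels))
    print("Iteration %d, inertia %.3f" % (iteration, history[-1]))
    centroids = update(points, labels, centroids)
```

Different starting centroids can converge to different final inertia values, which is why the cell above runs several initializations (`n_init=3`) and the log shows three interleaved sequences.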

In [107]:
topNWords = 50

import pandas as pd

df = pd.DataFrame()

# One column per cluster, listing its top-weighted terms in descending order
for i in range(num_clusters):
    clusterWords = [feature_names[ind] for ind in order_centroids[i, :topNWords]]
    df['Cluster %d' % i] = pd.Series(clusterWords)

df.style.set_properties(**{'text-align': 'right'})
df

| | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 | Cluster 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | police | fuck | sighs | cole | jimmy | harry | tommy | josh | king | oliver |
| 1 | alright | fucking | chuckles | jane | ben | selfridge | lucious | school | majesty | malcolm |
| 2 | president | shit | emma | alan | tom | joss | jamie | ooh | lord | elena |
| 3 | mrs | gotta | grunts | drill | sighs | calista | ghost | cool | gods | thea |
| 4 | brother | wanna | barry | clarke | rayna | waits | carter | laughs | queen | savannah |
| 5 | captain | danny | laughs | michael | chuckles | bosch | jamal | wanna | oracle | aleister |
| 6 | blood | jesus | detective | rafael | christy | karen | tuppence | andy | galavant | ives |
| 7 | mum | fucked | indistinct | amy | lemon | alec | lobos | fun | ragnar | sara |
| 8 | shit | sighs | police | dahlia | lavon | lucy | hakeem | dude | norrell | arrow |
| 9 | daughter | jackie | noah | klaus | jess | luca | ain | april | prince | vanessa |
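
The top-words lookup amounts to ranking each centroid's features by weight and translating the leading indices back into tokens. A pure-Python equivalent of `argsort()[:, ::-1]` followed by slicing might look like the sketch below; `top_terms`, the four-word vocabulary, and the centroid values are hypothetical, and ties may be ordered differently than NumPy orders them.

```python
def top_terms(centroid, feature_names, n):
    """Return the n feature names with the largest weights in the centroid."""
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [feature_names[j] for j in order[:n]]


feature_names = ["captain", "king", "police", "queen"]
centroids = [
    [0.30, 0.01, 0.55, 0.02],  # a cluster dominated by "police" and "captain"
    [0.00, 0.60, 0.05, 0.40],  # a cluster dominated by "king" and "queen"
]

for i, centroid in enumerate(centroids):
    print("Cluster %d: %s" % (i, ", ".join(top_terms(centroid, feature_names, 2))))
```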


In [108]:

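from IPython.display import display

# Pair every document title with the cluster label assigned by K-means
titlesFrame = pd.DataFrame()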
titlesFrame['Labels']=km.labels_
titlesFrame['Titles']=titles

sort = titlesFrame.sort_values(by=['Labels','Titles'])
for i in range(num_clusters):
    display( sort.query('Labels == %d' % i) )

| | Labels | Titles |
|---|---|---|
| 1902 | 0 | "A.D. The Bible Continues" The Body Is Gone (T... |
| 1901 | 0 | "A.D. The Bible Continues" The Tomb Is Open (T... |
| 1789 | 0 | "Allegiance" Chasing Ghosts (TV Episode 2015) |
| 1785 | 0 | "Allegiance" Pilot (TV Episode 2015) |
| 1046 | 0 | "Another Period" Pilot (TV Episode 2015) |
| 1035 | 0 | "Backstrom" Ancient, Chinese, Secret (TV Episo... |
| 937 | 0 | "Backstrom" Bella (TV Episode 2015) |
| 1939 | 0 | "Backstrom" Corkscrewed (TV Episode 2015) |
| 1994 | 0 | "Backstrom" I Like to Watch (TV Episode 2015) |
| 1160 | 0 | "Backstrom" Love Is a Rose and You Better Not ... |

| | Labels | Titles |
|---|---|---|
| 1633 | 1 | "19-2" Borders (TV Episode 2015) |
| 870 | 1 | "19-2" Disorder (TV Episode 2015) |
| 1815 | 1 | "19-2" Property Line (TV Episode 2015) |
| 1676 | 1 | "19-2" Rock Garden (TV Episode 2015) |
| 950 | 1 | "19-2" School (TV Episode 2015) |
| 1814 | 1 | "19-2" Tables (TV Episode 2015) |
| 1381 | 1 | "19-2" Tribes (TV Episode 2015) |
| 413 | 1 | "Ballers" Pilot (TV Episode 2015) |
| 1904 | 1 | "Banana" Episode #1.6 (TV Episode 2015) |
| 732 | 1 | "Banshee" A Fixer of Sorts (TV Episode 2015) |

| | Labels | Titles |
|---|---|---|
| 1479 | 2 | "Agent Carter" A Sin to Err (TV Episode 2015) |
| 1475 | 2 | "Agent Carter" Bridge and Tunnel (TV Episode 2... |
| 1060 | 2 | "Agent Carter" Now Is Not the End (TV Episode ... |
| 1480 | 2 | "Agent Carter" SNAFU (TV Episode 2015) |
| 1477 | 2 | "Agent Carter" The Blitzkrieg Button (TV Episo... |
| 1478 | 2 | "Agent Carter" The Iron Ceiling (TV Episode 2015) |
| 1476 | 2 | "Agent Carter" Time and Tide (TV Episode 2015) |
| 1481 | 2 | "Agent Carter" Valediction (TV Episode 2015) |
| 1456 | 2 | "Agents of S.H.I.E.L.D." Afterlife (TV Episode... |
| 1451 | 2 | "Agents of S.H.I.E.L.D." Aftershocks (TV Episo... |

| | Labels | Titles |
|---|---|---|
| 1808 | 3 | "12 Monkeys" Arms of Mine (TV Episode 2015) |
| 1540 | 3 | "12 Monkeys" Atari (TV Episode 2015) |
| 1539 | 3 | "12 Monkeys" Cassandra Complex (TV Episode 2015) |
| 1624 | 3 | "12 Monkeys" Divine Move (TV Episode 2015) |
| 1537 | 3 | "12 Monkeys" Mentally Divergent (TV Episode 2015) |
| 1807 | 3 | "12 Monkeys" Paradox (TV Episode 2015) |
| 1538 | 3 | "12 Monkeys" Pilot (TV Episode 2015) |
| 1806 | 3 | "12 Monkeys" Shonin (TV Episode 2015) |
| 1543 | 3 | "12 Monkeys" The Keys (TV Episode 2015) |
| 1541 | 3 | "12 Monkeys" The Night Room (TV Episode 2015) |

| | Labels | Titles |
|---|---|---|
| 620 | 4 | "American Horror Story" Magical Thinking (TV E... |
| 621 | 4 | "American Horror Story" Show Stoppers (TV Epis... |
| 826 | 4 | "Better Call Saul" Alpine Shepherd Boy (TV Epi... |
| 854 | 4 | "Better Call Saul" Bingo (TV Episode 2015) |
| 799 | 4 | "Better Call Saul" Hero (TV Episode 2015) |
| 892 | 4 | "Better Call Saul" Marco (TV Episode 2015) |
| 770 | 4 | "Better Call Saul" Nacho (TV Episode 2015) |
| 891 | 4 | "Better Call Saul" Pimento (TV Episode 2015) |
| 890 | 4 | "Better Call Saul" RICO (TV Episode 2015) |
| 168 | 4 | "Better Call Saul" Uno (TV Episode 2015) |

| | Labels | Titles |
|---|---|---|
| 1740 | 5 | "Bosch" Chapter Eight: High Low (TV Episode 2015) |
| 1300 | 5 | "Bosch" Chapter Five: Mama's Boy (TV Episode 2... |
| 1230 | 5 | "Bosch" Chapter Four: Fugazi (TV Episode 2015) |
| 1713 | 5 | "Bosch" Chapter Nine: The Magic Castle (TV Epi... |
| 1650 | 5 | "Bosch" Chapter Seven: Lost Boys (TV Episode 2... |
| 1296 | 5 | "Bosch" Chapter Six: Donkey's Years (TV Episod... |
| 1289 | 5 | "Bosch" Chapter Three: Blue Religion (TV Episo... |
| 974 | 5 | "Bosch" Chapter Two: Lost Light (TV Episode 2015) |
| 514 | 5 | "Critical" Episode #1.6 (TV Episode 2015) |
| 1301 | 5 | "Mistresses" Gone Girl (TV Episode 2015) |

| | Labels | Titles |
|---|---|---|
| 1422 | 6 | "American Crime" Episode Five (TV Episode 2015) |
| 1045 | 6 | "American Crime" Episode Four (TV Episode 2015) |
| 1336 | 6 | "American Crime" Episode Nine (TV Episode 2015) |
| 824 | 6 | "American Crime" Episode One (TV Episode 2015) |
| 1328 | 6 | "American Crime" Episode Seven (TV Episode 2015) |
| 1180 | 6 | "American Crime" Episode Six (TV Episode 2015) |
| 941 | 6 | "American Crime" Episode Three (TV Episode 2015) |
| 236 | 6 | "Banished" Episode #1.1 (TV Episode 2015) |
| 235 | 6 | "Banished" Episode #1.2 (TV Episode 2015) |
| 237 | 6 | "Banished" Episode #1.3 (TV Episode 2015) |

| | Labels | Titles |
|---|---|---|
| 1569 | 7 | "2 Broke Girls" And the Fun Factory (TV Episod... |
| 1908 | 7 | "A to Z" J Is for Jan Vaughan (TV Episode 2015) |
| 1583 | 7 | "A to Z" K Is for Keep Out (TV Episode 2015) |
| 1708 | 7 | "A to Z" L Is for Likability (TV Episode 2015) |
| 1709 | 7 | "A to Z" M Is for Meant to Be (TV Episode 2015) |
| 1570 | 7 | "About a Boy" About a Boy Becoming a Man (TV E... |
| 1711 | 7 | "About a Boy" About a Hook (TV Episode 2015) |
| 1495 | 7 | "About a Boy" About a Manniversary (TV Episode... |
| 1164 | 7 | "Aquarius" Cease to Resist (TV Episode 2015) |
| 557 | 7 | "Austin & Ally" Buzzcuts & Beginnings (TV Epis... |

| | Labels | Titles |
|---|---|---|
| 1903 | 8 | "A.D. The Bible Continues" The Spirit Arrives ... |
| 1759 | 8 | "Forever" The King of Columbus Circle (TV Epis... |
| 1824 | 8 | "Galavant" Comedy Gold (TV Episode 2015) |
| 1826 | 8 | "Galavant" Completely Mad... Alena (TV Episode... |
| 1827 | 8 | "Galavant" Dungeons and Dragon Lady (TV Episod... |
| 1829 | 8 | "Galavant" It's All in the Executions (TV Epis... |
| 1823 | 8 | "Galavant" Joust Friends (TV Episode 2015) |
| 1828 | 8 | "Galavant" My Cousin Izzy (TV Episode 2015) |
| 715 | 8 | "Galavant" Pilot (TV Episode 2015) |
| 1825 | 8 | "Galavant" Two Balls (TV Episode 2015) |

| | Labels | Titles |
|---|---|---|
| 606 | 9 | |
| 1138 | 9 | "Arrow" Al Sah-Him (TV Episode 2015) |
| 1140 | 9 | "Arrow" Broken Arrow (TV Episode 2015) |
| 1130 | 9 | "Arrow" Canaries (TV Episode 2015) |
| 1132 | 9 | "Arrow" Left Behind (TV Episode 2015) |
| 1129 | 9 | "Arrow" Midnight City (TV Episode 2015) |
| 1141 | 9 | "Arrow" My Name Is Oliver Queen (TV Episode 2015) |
| 1134 | 9 | "Arrow" Nanda Parbat (TV Episode 2015) |
| 1137 | 9 | "Arrow" Public Enemy (TV Episode 2015) |
| 1142 | 9 | "Arrow" Suicidal Tendencies (TV Episode 2015) |
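
The same per-cluster listing can be produced without pandas by grouping titles under their labels. This is a small sketch with hypothetical labels and titles; `titles_by_cluster` is not part of the notebook.

```python
from collections import defaultdict


def titles_by_cluster(labels, titles):
    """Group titles by cluster label and sort the titles within each group."""
    groups = defaultdict(list)
    for label, title in zip(labels, titles):
        groups[label].append(title)
    return {label: sorted(members) for label, members in groups.items()}


# Hypothetical stand-ins for km.labels_ and the titles list.
labels = [1, 0, 1, 0, 1]
titles = ["Show B: Pilot", "Show A: Finale", "Show B: Episode 2", "Show A: Pilot", ""]

grouped = titles_by_cluster(labels, titles)
for label in sorted(grouped):
    print("Cluster %d (%d documents)" % (label, len(grouped[label])))
    for title in grouped[label]:
        print("  " + (title or "(no title)"))
```

Documents with missing metadata have an empty title and sort first within their cluster, as in the listing for cluster 9 above.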

