# K-means Clustering in scikit-learn

This example uses subtitle data downloaded from https://www.opensubtitles.org/en/search/vip, with the raw data at opus.lingfil.uu.se/OpenSubtitles2016/raw/en. Metadata such as title, actor, and director was scraped from IMDB and is not guaranteed to be complete. The example uses the 5000 most recent movies. The full archive (1.1 GB) is [here](https://www.dropbox.com/s/db9d6765zbjru5x/openSubtitles.json.zip?dl=0).

The code does the following:
1. Counts words
2. Builds a TFIDF weighted vocabulary
3. Applies the TFIDF weights to the word counts to create a sparse matrix
4. Runs K-means clustering on the sparse matrix
5. Prints top words for each cluster using the largest features in the cluster centroid
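
The five steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's actual run: the four sentences, `n_clusters=2`, and `random_state=0` are assumptions chosen so the example stays small and repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for the subtitle texts.
docs = [
    "the ship left the planet and the captain saluted",
    "captain orders the ship to leave orbit",
    "mom and dad drove the kids to school",
    "dad picked up the kids after school",
]

# Steps 1-2: count words and build the vocabulary.
vectorizer = CountVectorizer(stop_words="english")
term_counts = vectorizer.fit_transform(docs)

# Step 3: apply TF-IDF weights; the result is still a sparse matrix.
tf_idf = TfidfTransformer(norm="l2").fit_transform(term_counts)

# Step 4: cluster the weighted document vectors.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tf_idf)

# Step 5: for each cluster, list the terms with the largest centroid weights.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_terms = km.cluster_centers_.argsort()[:, ::-1][:, :3]
for cluster_id, term_ids in enumerate(top_terms):
    print(cluster_id, [vocab[j] for j in term_ids])
```

Cluster ids are arbitrary, so which group is labelled 0 and which is labelled 1 can differ between runs with other seeds.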

Be sure to install the following:
1. `pip3 install scikit-learn`
2. `pip3 install pandas`
3. `pip3 install scipy`


In [25]:
import pandas as pd 

import sys
sys.version 

'3.6.2 (default, Jul 30 2017, 14:53:19) \n[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)]'

## Unarchive

In [26]:
import tempfile
import zipfile
import os.path

zipFile = "./openSubtitles-5000.json.zip"

print("Unarchiving ...")
temp_dir = tempfile.mkdtemp()
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile(zipFile, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

openSubtitlesFile = os.path.join(temp_dir, "openSubtitles-5000.json")
print("file unarchived to:" + openSubtitlesFile)


Unarchiving ...
file unarchived to:/var/folders/9l/w4_vhqyn5rz64fh1x9zzcsvr0000gn/T/tmpyzzio205/openSubtitles-5000.json


## Tokenizing and Filtering a Vocabulary

In [27]:

import json
from sklearn.feature_extraction.text import CountVectorizer
#from log_progress import log_progress

maxDocsToload = 5000

titles = []
def make_corpus(file):
    """Yield the subtitle text of each JSON-lines document; titles are collected as a side effect."""
    with open(file) as f:
        for i, line in enumerate(f):
            # Stop before reading more than maxDocsToload documents.
            if i >= maxDocsToload:
                break
            doc = json.loads(line)
            titles.append(doc.get('Title',''))
            #if 'Sci-Fi' not in doc.get('Genre',''):
            #    continue
            if i % 100 == 0:
                print ("%d " % i, end='') 
            yield doc.get('Text','')
                
print ("Starting load ...")                
textGenerator = make_corpus(openSubtitlesFile)              
count_vectorizer = CountVectorizer(min_df=2, max_df=0.75, ngram_range=(1,2), max_features=50000,
                                   stop_words='english', analyzer="word", token_pattern="[a-zA-Z]{3,}")
term_freq_matrix = count_vectorizer.fit_transform(textGenerator)
print ("Done.")
print ( "term_freq_matrix shape = %s" % (term_freq_matrix.shape,) )
print ("term_freq_matrix = \n%s" % term_freq_matrix)


Starting load ...
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 Done.
term_freq_matrix shape = (5000, 50000)
term_freq_matrix = 
  (0, 43801)	1
  (0, 14746)	1
  (0, 44094)	1
  (0, 21796)	1
  (0, 4112)	1
  (0, 10559)	1
  (0, 17280)	1
  (0, 34971)	1
  (0, 38789)	1
  (0, 9338)	1
  (0, 29011)	1
  (0, 31198)	1
  (0, 49419)	1
  (0, 3751)	1
  (0, 9427)	1
  (0, 46392)	1
  (0, 24453)	1
  (0, 27305)	1
  (0, 24240)	1
  (0, 21301)	1
  (0, 25182)	1
  (0, 48467)	1
  (0, 26134)	1
  (0, 36028)	1
  (0, 41716)	1
  :	:
  (4999, 6237)	1
  (4999, 47667)	2
  (4999, 12628)	1
  (4999, 6734)	1
  (4999, 22751)	1
  (4999, 5372)	3
  (4999, 19080)	1
  (4999, 12840)	1
  (4999, 3713)	1
  (4999, 34455)	1
  (4999, 33739)	1
  (4999, 33125)	3
  (4999, 4065)	1
  (4999, 7763)	2
  (4999, 33163)	1
  (4999, 19771)	1
  (4999, 36837)	2
  

## Feature Vocabulary

In [28]:
print("Vocabulary length = ", len(count_vectorizer.vocabulary_))
word = "data"
wordIndex = count_vectorizer.vocabulary_[word]
print("token index for \"%s\" = %d" % (word, wordIndex))
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = count_vectorizer.get_feature_names()
print("feature_names[%d] = %s" % (wordIndex, feature_names[wordIndex]))


Vocabulary length =  50000
token index for "data" = 8419
feature_names[8419] = data


In [29]:
for i in range(0,1000):
    print( "feature_names[%d] = %s" % (i, feature_names[i]))

feature_names[0] = aaaaah
feature_names[1] = aaaah
feature_names[2] = aaah
feature_names[3] = aaargh
feature_names[4] = aafrin
feature_names[5] = aagh
feature_names[6] = aah
feature_names[7] = aah aah
feature_names[8] = aah did
feature_names[9] = aah don
feature_names[10] = aah god
feature_names[11] = aah grunting
feature_names[12] = aah grunts
feature_names[13] = aah hey
feature_names[14] = aah okay
feature_names[15] = aargh
feature_names[16] = aaron
feature_names[17] = abaddon
feature_names[18] = abalone
feature_names[19] = abandon
feature_names[20] = abandoned
feature_names[21] = abandoning
feature_names[22] = abandonment
feature_names[23] = abba
feature_names[24] = abbas
feature_names[25] = abbey
feature_names[26] = abbi
feature_names[27] = abbie
feature_names[28] = abbott
feature_names[29] = abbs
feature_names[30] = abby
feature_names[31] = abby abby
feature_names[32] = abdomen
feature_names[33] = abdominal
feature_names[34] = abduct
feature_names[35] = abducted
feature_names[36] 

feature_names[957] = analysts
feature_names[958] = analytical
feature_names[959] = analyze
feature_names[960] = analyzed
feature_names[961] = analyzing
feature_names[962] = anandi
feature_names[963] = anaphylaxis
feature_names[964] = anarchy
feature_names[965] = anatomy
feature_names[966] = ancestor
feature_names[967] = ancestors
feature_names[968] = ancestral
feature_names[969] = anchor
feature_names[970] = ancient
feature_names[971] = ancient astronaut
feature_names[972] = ancient history
feature_names[973] = ancient world
feature_names[974] = ancients
feature_names[975] = anders
feature_names[976] = andersen
feature_names[977] = anderson
feature_names[978] = anderssen
feature_names[979] = andes
feature_names[980] = andi
feature_names[981] = andie
feature_names[982] = andit
feature_names[983] = andr
feature_names[984] = andre
feature_names[985] = andrea
feature_names[986] = andreas
feature_names[987] = andrew
feature_names[988] = andrews
feature_names[989] = android
feature_names[990

## TFIDF Weighting
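This applies the TFIDF weights to the term-count matrix.

With the default `TfidfTransformer` settings (`smooth_idf=True`), the weight of a word in a document is:

tfidf value = word count * (ln((1 + number of documents) / (1 + number of documents word is in)) + 1)

The document vectors are then normalized so they have a Euclidean magnitude of 1.0.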
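
As a sanity check on the weighting and normalization described above, the sketch below reproduces the default `TfidfTransformer` output by hand, using the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization. The 3x3 count matrix is a made-up example, not data from the subtitle corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: rows are documents, columns are terms.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 1, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # number of documents containing each term

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale each row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# The library result on the same counts, converted to a dense array for comparison.
library = TfidfTransformer(norm="l2").fit_transform(counts).toarray()

print(np.allclose(manual, library))
print(np.linalg.norm(library, axis=1))
```

A term that appears in every document (the first column here) still gets a nonzero weight because of the `+ 1` in the formula, but it is down-weighted relative to rarer terms.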

In [30]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print( tf_idf_matrix)

  (0, 1052)	0.0959520615105
  (0, 47252)	0.0045643573485
  (0, 46375)	0.00987768752281
  (0, 47798)	0.0159646781073
  (0, 36780)	0.00584869223708
  (0, 888)	0.00733040914588
  (0, 28853)	0.0133945486906
  (0, 36925)	0.460728096203
  (0, 11666)	0.0747220017172
  (0, 48495)	0.00350520687676
  (0, 2586)	0.0213489141115
  (0, 5647)	0.00946811293935
  (0, 44936)	0.00973571035319
  (0, 28826)	0.0112220717293
  (0, 33208)	0.0129756461679
  (0, 48757)	0.0127876247595
  (0, 1992)	0.0101274111868
  (0, 3614)	0.0111653214653
  (0, 12677)	0.0154170967102
  (0, 21158)	0.500197844055
  (0, 19051)	0.012990005648
  (0, 37797)	0.033066465323
  (0, 41887)	0.0100586723564
  (0, 27604)	0.00969729004908
  (0, 49888)	0.00572522605098
  :	:
  (4999, 5359)	0.0121866541623
  (4999, 40315)	0.0109134798469
  (4999, 1533)	0.0121131749278
  (4999, 42664)	0.0104026648556
  (4999, 49858)	0.0378074287042
  (4999, 5875)	0.137630935462
  (4999, 18612)	0.0107636391144
  (4999, 1070)	0.0243733083245
  (4999, 31763)	0.026

## K-Means

In [31]:
%%time
from sklearn.cluster import KMeans
import numpy

num_clusters = 5
# Note: n_jobs was deprecated in scikit-learn 0.23 and removed in 1.0; omit it on newer versions.
km = KMeans(n_clusters=num_clusters, verbose=True, init='k-means++', n_init=3, n_jobs=-1)
km.fit(tf_idf_matrix)

clusters = km.labels_.tolist()
print ("cluster id for each document = %s" % clusters)

print()
# for each cluster, sort the feature indices by descending centroid weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

        


Initialization complete
Initialization complete
Initialization complete
Iteration  0, inertia 9385.937
Iteration  0, inertia 9338.260
Iteration  0, inertia 9380.384
Iteration  1, inertia 4794.372
Iteration  1, inertia 4794.096
Iteration  1, inertia 4787.500
Iteration  2, inertia 4779.824
Iteration  2, inertia 4783.607
Iteration  2, inertia 4777.659
Iteration  3, inertia 4775.038
Iteration  3, inertia 4777.820
Iteration  3, inertia 4773.711
Iteration  4, inertia 4771.939
Iteration  4, inertia 4773.842
Iteration  4, inertia 4772.229
Iteration  5, inertia 4770.008
Iteration  5, inertia 4771.600
Iteration  5, inertia 4771.371
Iteration  6, inertia 4769.064
Iteration  6, inertia 4771.283
Iteration  6, inertia 4769.633
Iteration  7, inertia 4768.715
Iteration  7, inertia 4771.142
Iteration  7, inertia 4768.549
Iteration  8, inertia 4768.511
Iteration  8, inertia 4771.083
Iteration  8, inertia 4768.131
Iteration  9, inertia 4768.399
Iteration  9, inertia 4771.034
Iteration  9, inertia 4767.94

In [32]:
labels = pd.DataFrame(clusters, columns=['Cluster Labels'])
counts = pd.DataFrame(labels['Cluster Labels'].value_counts().sort_index())
counts.columns=['Document Count']
display(counts)

| Cluster | Document Count |
|---|---|
| 0 | 722 |
| 1 | 1646 |
| 2 | 1699 |
| 3 | 520 |
| 4 | 413 |


In [33]:
topNWords = 50

df = pd.DataFrame()

for i in range(num_clusters):
    # order_centroids[i] lists feature indices from largest to smallest centroid weight
    clusterWords = [feature_names[ind] for ind in order_centroids[i, :topNWords]]
    df['Cluster %d' % i] = pd.Series(clusterWords)

df

| Rank | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|---|
| 0 | hulk | mom | sighs | sir | fuck |
| 1 | grunting | guys | chuckles | king | fucking |
| 2 | ship | dad | door | father | shit |
| 3 | grunts | guy | phone | lord | gotta |
| 4 | sir | baby | police | men | wanna |
| 5 | growling | ooh | guy | mary | guy |
| 6 | eggman | cause | car | majesty | money |
| 7 | narrator | school | killed | mother | fucked |
| 8 | president | wow | hell | queen | jesus |
| 9 | world | laughs | dad | brother | guys |
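
The top-words table is read off the cluster centroids with `argsort()[:, ::-1]`. The toy example below shows how that indexing picks the highest-weighted terms per cluster; the centroid values and the four-term vocabulary are invented for illustration.

```python
import numpy as np

# Hypothetical centroids: rows are clusters, columns are terms.
centers = np.array([[0.1, 0.7, 0.0, 0.2],
                    [0.5, 0.0, 0.4, 0.1]])
terms = ["ship", "king", "mom", "door"]

# argsort() sorts ascending, so reversing each row puts the largest weights first.
order = centers.argsort()[:, ::-1]

# Keep the two highest-weighted terms for each cluster.
top2 = [[terms[j] for j in row[:2]] for row in order]
print(top2)  # [['king', 'door'], ['ship', 'mom']]
```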


In [34]:

titlesFrame = pd.DataFrame()
titlesFrame['Labels']=km.labels_
titlesFrame['Titles']=titles

sort = titlesFrame.sort_values(by=['Labels','Titles'])
for i in range(num_clusters):
    display( sort.query('Labels == %d' % i) )

| Document | Labels | Titles |
|---|---|---|
| 4375 | 0 | |
| 4376 | 0 | |
| 4377 | 0 | |
| 4378 | 0 | |
| 4379 | 0 | |
| 4380 | 0 | |
| 4381 | 0 | |
| 4382 | 0 | |
| 4383 | 0 | |
| 1808 | 0 | "12 Monkeys" Arms of Mine (TV Episode 2015) |


| Document | Labels | Titles |
|---|---|---|
| 2849 | 1 | "2 Broke Girls" And the Crime Ring (TV Episode... |
| 3363 | 1 | "2 Broke Girls" And the Cupcake Captives (TV E... |
| 3909 | 1 | "2 Broke Girls" And the Disappointing Unit (TV... |
| 2663 | 1 | "2 Broke Girls" And the Fat Cat (TV Episode 2015) |
| 1569 | 1 | "2 Broke Girls" And the Fun Factory (TV Episod... |
| 3908 | 1 | "2 Broke Girls" And the Grate Expectations (TV... |
| 3064 | 1 | "2 Broke Girls" And the Great Unwashed (TV Epi... |
| 2761 | 1 | "2 Broke Girls" And the High Hook-Up (TV Episo... |
| 2951 | 1 | "2 Broke Girls" And the Knock-Off Knockout (TV... |
| 3391 | 1 | "2 Broke Girls" And the Look of the Irish (TV ... |


| Document | Labels | Titles |
|---|---|---|
| 606 | 2 | |
| 1479 | 2 | "Agent Carter" A Sin to Err (TV Episode 2015) |
| 1475 | 2 | "Agent Carter" Bridge and Tunnel (TV Episode 2... |
| 1060 | 2 | "Agent Carter" Now Is Not the End (TV Episode ... |
| 1480 | 2 | "Agent Carter" SNAFU (TV Episode 2015) |
| 1477 | 2 | "Agent Carter" The Blitzkrieg Button (TV Episo... |
| 1476 | 2 | "Agent Carter" Time and Tide (TV Episode 2015) |
| 1481 | 2 | "Agent Carter" Valediction (TV Episode 2015) |
| 3587 | 2 | "Agent X" The Enemy of My Enemy (TV Episode 2015) |
| 4216 | 2 | "Agents of S.H.I.E.L.D." A Wanted (Inhu)Man (T... |


| Document | Labels | Titles |
|---|---|---|
| 4282 | 3 | "A Place to Call Home" In the Heat of the Nigh... |
| 4278 | 3 | "A Place to Call Home" L'chaim, to Life (TV Ep... |
| 4281 | 3 | "A Place to Call Home" Living in the Shadow (T... |
| 4283 | 3 | "A Place to Call Home" Sins of the Father (TV ... |
| 4279 | 3 | "A Place to Call Home" Somewhere Beyond the Se... |
| 4280 | 3 | "A Place to Call Home" Too Old to Dream (TV Ep... |
| 2815 | 3 | "A.D. The Bible Continues" Brothers in Arms (T... |
| 2816 | 3 | "A.D. The Bible Continues" Rise Up (TV Episode... |
| 2814 | 3 | "A.D. The Bible Continues" Saul's Return (TV E... |
| 2817 | 3 | "A.D. The Bible Continues" The Abomination (TV... |


| Document | Labels | Titles |
|---|---|---|
| 2762 | 4 | "19-2" Babylon (TV Episode 2015) |
| 1633 | 4 | "19-2" Borders (TV Episode 2015) |
| 2553 | 4 | "19-2" Bridges (TV Episode 2015) |
| 870 | 4 | "19-2" Disorder (TV Episode 2015) |
| 2755 | 4 | "19-2" Orphans (TV Episode 2015) |
| 1815 | 4 | "19-2" Property Line (TV Episode 2015) |
| 1676 | 4 | "19-2" Rock Garden (TV Episode 2015) |
| 950 | 4 | "19-2" School (TV Episode 2015) |
| 1814 | 4 | "19-2" Tables (TV Episode 2015) |
| 1381 | 4 | "19-2" Tribes (TV Episode 2015) |


# The End ...