# K-means Clustering in scikit-learn

This example uses a dataset downloaded from https://www.opensubtitles.org/en/search/vip, with the raw data at opus.lingfil.uu.se/OpenSubtitles2016/raw/en. Metadata such as title, actor, and director was scraped from IMDb and is not guaranteed to be complete. This example uses the 5000 most recent movies. The full archive (1.1 GB) is [here](https://www.dropbox.com/s/db9d6765zbjru5x/openSubtitles.json.zip?dl=0).

The code does the following:
1. Counts words
2. Builds a TFIDF weighted vocabulary
3. Applies the TFIDF weights to the word counts to create a sparse matrix
4. Runs K-means clustering on the sparse matrix
5. Prints the top words for each cluster, using the largest features in the cluster centroid
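
The five steps above can be sketched end to end on a toy corpus. This is a minimal illustration rather than the notebook's code: the four documents, the cluster count of 2, and `random_state=0` are assumptions chosen so that it runs in well under a second.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for the subtitle files
docs = [
    "the king and the queen rule the castle",
    "the queen greets the king in the castle",
    "police detective solves the murder case",
    "the detective calls the police about a murder",
]

# Step 1: count words (this also fixes the vocabulary)
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Steps 2-3: learn the TFIDF weights and apply them, giving a sparse matrix
weighted = TfidfTransformer(norm="l2").fit_transform(counts)

# Step 4: run K-means on the sparse matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(weighted)

# Step 5: top words per cluster = largest features of each centroid
vocab = vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)      # terms ordered by column index
order = km.cluster_centers_.argsort()[:, ::-1]
for c in range(2):
    print("Cluster %d:" % c, [terms[j] for j in order[c, :3]])
```

On this corpus the two royal documents and the two police documents should land in separate clusters, although which group receives label 0 depends on the initialisation.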

Be sure to install the following:
1. `pip3 install scikit-learn`
2. `pip3 install pandas`
3. `pip3 install scipy`


In [1]:
import pandas as pd 

import sys
sys.version 

'3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'

## Unarchive

In [2]:
import tempfile
import zipfile
import os.path

zipFile = "./openSubtitles-5000.json.zip"

print( "Unarchiving ...")
temp_dir = tempfile.mkdtemp()
# The context manager closes the archive even if extraction fails
with zipfile.ZipFile(zipFile, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

openSubtitlesFile = os.path.join(temp_dir, "openSubtitles-5000.json")
print ("file unarchived to:" + openSubtitlesFile)


Unarchiving ...
file unarchived to:/var/folders/k1/ywpsl_ld2fj1bn5vp9bbgsr40000gn/T/tmp155tiu8f/openSubtitles-5000.json


## Tokenizing and Filtering a Vocabulary

In [31]:

import json
from sklearn.feature_extraction.text import CountVectorizer
#from log_progress import log_progress

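maxDocsToload = 50000  # upper bound on the number of documents read

titles = []
def make_corpus(file):
    """Yield the subtitle text of each JSON line and record its title in `titles`."""
    with open(file) as f:
        for i, line in enumerate(f):
            # stop before reading more than maxDocsToload documents
            if i >= maxDocsToload:
                break
            doc = json.loads(line)
            #if 'Sci-Fi' not in doc.get('Genre',''):
            #    continue
            # record the title only for documents that are yielded,
            # so titles stay aligned with the matrix rows
            titles.append(doc.get('Title',''))
            if i % 100 == 0:
                print ("%d " % i, end='')
            yield doc.get('Text','')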
                
print ("Starting load ...")                
textGenerator = make_corpus(openSubtitlesFile)              
count_vectorizer = CountVectorizer(min_df=2, max_df=0.75, ngram_range=(1,2), max_features=50000,
                                   stop_words='english', analyzer="word", token_pattern="[a-zA-Z]{3,}")
term_freq_matrix = count_vectorizer.fit_transform(textGenerator)
print ("Done.")
print ( "term_freq_matrix shape = %s" % (term_freq_matrix.shape,) )
print ("term_freq_matrix = \n%s" % term_freq_matrix)


Starting load ...
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 Done.
term_freq_matrix shape = (5000, 50000)
term_freq_matrix = 
  (0, 43801)	1
  (0, 14746)	1
  (0, 44094)	1
  (0, 21796)	1
  (0, 4112)	1
  (0, 10559)	1
  (0, 17280)	1
  (0, 34971)	1
  (0, 38789)	1
  (0, 9338)	1
  (0, 29011)	1
  (0, 31198)	1
  (0, 49419)	1
  (0, 3751)	1
  (0, 9427)	1
  (0, 46392)	1
  (0, 24453)	1
  (0, 27305)	1
  (0, 24240)	1
  (0, 21301)	1
  (0, 25182)	1
  (0, 48467)	1
  (0, 26134)	1
  (0, 36028)	1
  (0, 41716)	1
  :	:
  (4999, 6237)	1
  (4999, 47667)	2
  (4999, 12628)	1
  (4999, 6734)	1
  (4999, 22751)	1
  (4999, 5372)	3
  (4999, 19080)	1
  (4999, 12840)	1
  (4999, 3713)	1
  (4999, 34455)	1
  (4999, 33739)	1
  (4999, 33125)	3
  (4999, 4065)	1
  (4999, 7763)	2
  (4999, 33163)	1
  (4999, 19771)	1
  (4999, 36837)	2
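
The output above lists the sparse matrix as `(row, column)  count` entries. To see how `min_df`, `ngram_range`, and `token_pattern` shape the vocabulary, here is a small sketch on a hypothetical three-line corpus (the sentences are made up, and `stop_words` is left out to keep the result easy to follow).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; the real input is one subtitle file per document
docs = [
    "I am at the old house",
    "The old house is on fire",
    "Fire at 221B Baker Street",
]

# token_pattern keeps only runs of 3+ letters, so "am", "at" and "221B" are dropped;
# ngram_range=(1, 2) adds pairs of adjacent surviving tokens;
# min_df=2 keeps only terms that occur in at least two documents
vectorizer = CountVectorizer(min_df=2, ngram_range=(1, 2),
                             token_pattern="[a-zA-Z]{3,}")
matrix = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)   # terms ordered by column index
print(terms)             # ['fire', 'house', 'old', 'old house', 'the', 'the old']
print(matrix.toarray())  # one row per document, one column per term
```

Terms such as "baker" or "house fire" occur in only one document, so `min_df=2` removes them.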
  

## Feature Vocabulary

In [32]:
print( "Vocabulary length = ", len(count_vectorizer.vocabulary_))
word = "data"
wordIndex = count_vectorizer.vocabulary_[word]
print( "token index for \"%s\" = %d" % (word,wordIndex))
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
feature_names = count_vectorizer.get_feature_names_out()
print( "feature_names[%d] = %s" % (wordIndex, feature_names[wordIndex]))


Vocabulary length =  50000
token index for "data" = 8419
feature_names[8419] = data
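
The lookup above raises a `KeyError` when the word was filtered out of the vocabulary. A minimal sketch of a safer lookup and of the reverse mapping, using a hypothetical two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, only to get a small fitted vocabulary
vectorizer = CountVectorizer().fit(["big data", "small data set"])
vocab = vectorizer.vocabulary_            # maps term -> column index

# Invert the mapping to go from column index back to term
index_to_term = {idx: term for term, idx in vocab.items()}

idx = vocab.get("data")                   # None if the term is not in the vocabulary
missing = vocab.get("subtitles")          # this term never occurred, so None
print(idx, index_to_term[idx], missing)
```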


In [33]:
for i in range(0,1000):
    print( "feature_names[%d] = %s" % (i, feature_names[i]))

feature_names[0] = aaaaah
feature_names[1] = aaaah
feature_names[2] = aaah
feature_names[3] = aaargh
feature_names[4] = aafrin
feature_names[5] = aagh
feature_names[6] = aah
feature_names[7] = aah aah
feature_names[8] = aah did
feature_names[9] = aah don
feature_names[10] = aah god
feature_names[11] = aah grunting
feature_names[12] = aah grunts
feature_names[13] = aah hey
feature_names[14] = aah okay
feature_names[15] = aargh
feature_names[16] = aaron
feature_names[17] = abaddon
feature_names[18] = abalone
feature_names[19] = abandon
feature_names[20] = abandoned
feature_names[21] = abandoning
feature_names[22] = abandonment
feature_names[23] = abba
feature_names[24] = abbas
feature_names[25] = abbey
feature_names[26] = abbi
feature_names[27] = abbie
feature_names[28] = abbott
feature_names[29] = abbs
feature_names[30] = abby
feature_names[31] = abby abby
feature_names[32] = abdomen
feature_names[33] = abdominal
feature_names[34] = abduct
feature_names[35] = abducted
...

feature_names[535] = ago started
feature_names[536] = ago think
feature_names[537] = ago thought
feature_names[538] = ago time
feature_names[539] = ago told
feature_names[540] = ago wanted
feature_names[541] = ago way
feature_names[542] = ago went
feature_names[543] = ago yeah
feature_names[544] = ago years
feature_names[545] = ago yes
feature_names[546] = agonizing
feature_names[547] = agony
feature_names[548] = agos
feature_names[549] = agota
feature_names[550] = agree
feature_names[551] = agree agree
feature_names[552] = agree disagree
feature_names[553] = agree don
feature_names[554] = agree just
feature_names[555] = agree know
feature_names[556] = agree think
feature_names[557] = agreed
feature_names[558] = agreed come
feature_names[559] = agreed let
feature_names[560] = agreed meet
feature_names[561] = agreeing
feature_names[562] = agreement
feature_names[563] = agreements
feature_names[564] = agrees
feature_names[565] = agricultural
feature_names[566] = agriculture
...

## TFIDF Weighting
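This applies the TFIDF weights to the term-frequency matrix.

With scikit-learn's default settings, tfidf value = word count × (ln((1 + number of documents) / (1 + number of documents the word is in)) + 1), so words that appear in many documents are down-weighted.

The document vectors are also normalized so they have a Euclidean magnitude of 1.0.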
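
As a check on the weighting, the sketch below recomputes the TFIDF values by hand for a hypothetical 3x3 count matrix and compares them with `TfidfTransformer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`), under which idf = ln((1 + n) / (1 + df)) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)      # number of documents containing each term

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfTransformer(norm="l2").fit_transform(counts).toarray()
print(np.allclose(manual, sk))           # the two computations agree
print(np.linalg.norm(sk, axis=1))        # every row has length 1.0
```

Because every row has unit length, the squared Euclidean distance that K-means minimizes is a monotonic function of cosine similarity between documents.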

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print( tf_idf_matrix)

  (0, 1052)	0.0959520615105
  (0, 47252)	0.0045643573485
  (0, 46375)	0.00987768752281
  (0, 47798)	0.0159646781073
  (0, 36780)	0.00584869223708
  (0, 888)	0.00733040914588
  (0, 28853)	0.0133945486906
  (0, 36925)	0.460728096203
  (0, 11666)	0.0747220017172
  (0, 48495)	0.00350520687676
  (0, 2586)	0.0213489141115
  (0, 5647)	0.00946811293935
  (0, 44936)	0.00973571035319
  (0, 28826)	0.0112220717293
  (0, 33208)	0.0129756461679
  (0, 48757)	0.0127876247595
  (0, 1992)	0.0101274111868
  (0, 3614)	0.0111653214653
  (0, 12677)	0.0154170967102
  (0, 21158)	0.500197844055
  (0, 19051)	0.012990005648
  (0, 37797)	0.033066465323
  (0, 41887)	0.0100586723564
  (0, 27604)	0.00969729004908
  (0, 49888)	0.00572522605098
  :	:
  (4999, 5359)	0.0121866541623
  (4999, 40315)	0.0109134798469
  (4999, 1533)	0.0121131749278
  (4999, 42664)	0.0104026648556
  (4999, 49858)	0.0378074287042
  (4999, 5875)	0.137630935462
  (4999, 18612)	0.0107636391144
  (4999, 1070)	0.0243733083245
  (4999, 31763)	0.026

## K-Means

In [58]:
%%time
from sklearn.cluster import KMeans,MiniBatchKMeans
import numpy

num_clusters = 5
#km = KMeans(n_clusters=num_clusters, verbose=True, init='k-means++', n_init=3)  # n_jobs was removed in scikit-learn 1.0
km = MiniBatchKMeans(n_clusters=num_clusters, verbose=True, init='k-means++', n_init=25, batch_size=2000)

km.fit(tf_idf_matrix)

clusters = km.labels_.tolist()
print ("cluster id for each document = %s" % clusters)

print()
# for each centroid, sort the feature indices by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

        


Init 1/25 with method: k-means++
Inertia for init 1/25: 4784.462815
Init 2/25 with method: k-means++
Inertia for init 2/25: 4796.250436
Init 3/25 with method: k-means++
Inertia for init 3/25: 4784.292116
Init 4/25 with method: k-means++
Inertia for init 4/25: 4786.645619
Init 5/25 with method: k-means++
Inertia for init 5/25: 4798.048409
Init 6/25 with method: k-means++
Inertia for init 6/25: 4777.020875
Init 7/25 with method: k-means++
Inertia for init 7/25: 4791.242440
Init 8/25 with method: k-means++
Inertia for init 8/25: 4798.643628
Init 9/25 with method: k-means++
Inertia for init 9/25: 4794.832302
Init 10/25 with method: k-means++
Inertia for init 10/25: 4789.196281
Init 11/25 with method: k-means++
Inertia for init 11/25: 4783.763361
Init 12/25 with method: k-means++
Inertia for init 12/25: 4793.041913
Init 13/25 with method: k-means++
Inertia for init 13/25: 4794.125226
Init 14/25 with method: k-means++
Inertia for init 14/25: 4792.201854
Init 15/25 with method: k-means++
Iner
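
The log above reports an inertia for each initialisation. Inertia is the sum of squared distances from each sample to its nearest cluster centre, so lower values mean tighter clusters. The sketch below illustrates the definition with plain `KMeans` (not `MiniBatchKMeans`) on six hypothetical 2-D points, with `random_state=0` as an assumed setting for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Squared distance from every point to the centre of its assigned cluster
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = ((points - assigned_centres) ** 2).sum()

print(km.inertia_, manual_inertia)       # both are about 2.667 (= 8/3)
```

Running several initialisations and keeping the best one, as `n_init` does, guards against a poor random start.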

In [60]:
labels = pd.DataFrame(clusters, columns=['Cluster Labels'])
counts = pd.DataFrame(labels['Cluster Labels'].value_counts().sort_index())
counts.columns=['Document Count']
display(counts)

Unnamed: 0,Document Count
0,1756
1,415
2,1209
3,1057
4,563


In [61]:
topNWords = 50

df = pd.DataFrame()

for i in range(num_clusters):
    # feature names with the largest weights in centroid i
    clusterWords = [feature_names[ind] for ind in order_centroids[i, :topNWords]]
    df['Cluster %d' % i] = pd.Series(clusterWords)

# display the frame with right-aligned cells
df.style.set_properties(**{'text-align': 'right'})

Unnamed: 0,Cluster 0,Cluster 1,Cluster 2,Cluster 3,Cluster 4
0,guys,fuck,sighs,mom,sir
1,music,fucking,chuckles,dad,king
2,laughs,shit,police,guys,father
3,world,guy,phone,baby,men
4,guy,gotta,door,guy,lord
5,whoa,guys,guy,girl,mary
6,shit,money,killed,school,brother
7,huh,wanna,car,cause,majesty
8,sighs,jesus,detective,party,queen
9,hell,fucked,murder,house,mother
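
The table above comes from `argsort()[:, ::-1]` applied to the cluster centres. A minimal sketch of that step, using hypothetical centroids with four features each:

```python
import numpy as np

# Hypothetical vocabulary and centroids: 2 clusters x 4 features
feature_names = ["king", "police", "queen", "murder"]
centers = np.array([[0.7, 0.0, 0.5, 0.1],
                    [0.1, 0.6, 0.0, 0.8]])

# argsort sorts each row ascending; reversing the columns puts the largest weights first
order = centers.argsort()[:, ::-1]

top_n = 2
top_words = [[feature_names[j] for j in row[:top_n]] for row in order]
print(top_words)   # [['king', 'queen'], ['murder', 'police']]
```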


In [38]:

titlesFrame = pd.DataFrame()
titlesFrame['Labels']=km.labels_
titlesFrame['Titles']=titles

sort = titlesFrame.sort_values(by=['Labels','Titles'])
for i in range(num_clusters):
    display( sort.query('Labels == %d' % i) )

Unnamed: 0,Labels,Titles
2584,0,"""X Company"" Sixes & Sevens (TV Episode 2015)"


Unnamed: 0,Labels,Titles
2762,1,"""19-2"" Babylon (TV Episode 2015)"
1633,1,"""19-2"" Borders (TV Episode 2015)"
2553,1,"""19-2"" Bridges (TV Episode 2015)"
870,1,"""19-2"" Disorder (TV Episode 2015)"
2755,1,"""19-2"" Orphans (TV Episode 2015)"
1815,1,"""19-2"" Property Line (TV Episode 2015)"
1676,1,"""19-2"" Rock Garden (TV Episode 2015)"
950,1,"""19-2"" School (TV Episode 2015)"
1814,1,"""19-2"" Tables (TV Episode 2015)"
1381,1,"""19-2"" Tribes (TV Episode 2015)"


Unnamed: 0,Labels,Titles
4683,2,"""Rosewood"" Fireflies and Fidelity (TV Episode ..."


Unnamed: 0,Labels,Titles
3553,3,"""Elementary"" A Stitch in Time (TV Episode 2015)"
1967,3,"""Justified"" Alive Day (TV Episode 2015)"
2057,3,"""Justified"" Burned (TV Episode 2015)"
1654,3,"""Justified"" Cash Game (TV Episode 2015)"
2061,3,"""Justified"" Collateral (TV Episode 2015)"
2058,3,"""Justified"" Dark as a Dungeon (TV Episode 2015)"
388,3,"""Justified"" Fate's Right Hand (TV Episode 2015)"
2060,3,"""Justified"" Fugitive Number One (TV Episode 2015)"
1655,3,"""Justified"" Noblesse Oblige (TV Episode 2015)"
1656,3,"""Justified"" Sounding (TV Episode 2015)"


Unnamed: 0,Labels,Titles
1106,4,"""Glee"" Child Star (TV Episode 2015)"


Unnamed: 0,Labels,Titles
2483,5,"""Bad Judge"" Naked and Afraid (TV Episode 2015)"


Unnamed: 0,Labels,Titles
4050,6,"""American Dad!"" Seizures Suit Stanny (TV Episo..."


Unnamed: 0,Labels,Titles
2839,7,"""Salem"" Blood Kiss (TV Episode 2015)"
3421,7,"""Salem"" Book of Shadows (TV Episode 2015)"
3995,7,"""Salem"" Dead Birds (TV Episode 2015)"
3559,7,"""Salem"" Ill Met by Moonlight (TV Episode 2015)"
4119,7,"""Salem"" Midnight Never Come (TV Episode 2015)"
4040,7,"""Salem"" On Earth as in Hell (TV Episode 2015)"
3699,7,"""Salem"" The Beckoning Fair One (TV Episode 2015)"
3422,7,"""Salem"" The Wine Dark Sea (TV Episode 2015)"
4293,7,"""Salem"" The Witching Hour (TV Episode 2015)"
3996,7,"""Salem"" Til Death Do Us Part (TV Episode 2015)"


Unnamed: 0,Labels,Titles
606,8,
4375,8,
4376,8,
4377,8,
4378,8,
4379,8,
4380,8,
4381,8,
4382,8,
4383,8,


Unnamed: 0,Labels,Titles
3646,9,"""American Odyssey"" Oscar Mike (TV Episode 2015)"


Unnamed: 0,Labels,Titles
3069,10,"""Mr. Robinson"" School's Out for Summer (TV Epi..."


Unnamed: 0,Labels,Titles
4201,11,"""Nashville"" 'Til the Pain Outwears the Shame (..."
719,11,"""Nashville"" Before You Go Make Sure You Know (..."
4199,11,"""Nashville"" Can't Get Used to Losing You (TV E..."
4063,11,"""Nashville"" Can't Let Go (TV Episode 2015)"
4197,11,"""Nashville"" How Can I Help You Say Goodbye (TV..."
912,11,"""Nashville"" I Can't Keep Away from You (TV Epi..."
911,11,"""Nashville"" I'm Lost Between Right or Wrong (T..."
908,11,"""Nashville"" I'm Not That Good at Goodbye (TV E..."
909,11,"""Nashville"" I've Got Reasons to Hate You (TV E..."
918,11,"""Nashville"" Is the Better Part Over (TV Episod..."


Unnamed: 0,Labels,Titles
3795,12,"""Bluestone 42"" Episode #3.6 (TV Episode 2015)"


Unnamed: 0,Labels,Titles
3875,13,"""Saving Hope"" Start Me Up (TV Episode 2015)"


Unnamed: 0,Labels,Titles
2121,14,"""The Strain"" Dead End (TV Episode 2015)"


Unnamed: 0,Labels,Titles
626,15,"""Parks and Recreation"" Ron & Jammy (TV Episode..."


Unnamed: 0,Labels,Titles
4424,16,"""Rick and Morty"" Big Trouble in Little Sanchez..."
4427,16,"""Rick and Morty"" The Wedding Squanchers (TV Ep..."
4421,16,"""Rick and Morty"" Total Rickall (TV Episode 2015)"


Unnamed: 0,Labels,Titles
2066,17,"""Star Wars Rebels"" Fire Across the Galaxy (TV ..."
2065,17,"""Star Wars Rebels"" Rebel Resolve (TV Episode 2..."
2069,17,"""Star Wars Rebels"" The Lost Commanders (TV Epi..."


Unnamed: 0,Labels,Titles
160,18,Lila & Eve (2015)


Unnamed: 0,Labels,Titles
1539,19,"""12 Monkeys"" Cassandra Complex (TV Episode 2015)"
1537,19,"""12 Monkeys"" Mentally Divergent (TV Episode 2015)"
1538,19,"""12 Monkeys"" Pilot (TV Episode 2015)"
1542,19,"""12 Monkeys"" The Red Forest (TV Episode 2015)"
3391,19,"""2 Broke Girls"" And the Look of the Irish (TV ..."
1908,19,"""A to Z"" J Is for Jan Vaughan (TV Episode 2015)"
1583,19,"""A to Z"" K Is for Keep Out (TV Episode 2015)"
1708,19,"""A to Z"" L Is for Likability (TV Episode 2015)"
1709,19,"""A to Z"" M Is for Meant to Be (TV Episode 2015)"
1570,19,"""About a Boy"" About a Boy Becoming a Man (TV E..."
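
A shorter preview of the listings above can also be built as a single frame with `groupby(...).head(n)` instead of one `display` call per cluster. The sketch below uses hypothetical labels and titles in place of `km.labels_` and `titles`.

```python
import pandas as pd

# Hypothetical labels and titles standing in for km.labels_ and titles
frame = pd.DataFrame({
    "Labels": [1, 0, 1, 0, 1],
    "Titles": ["Salem", "Glee", "Justified", "Elementary", "Nashville"],
})

# Sort by cluster and title, then keep the first two titles of each cluster
preview = (frame.sort_values(by=["Labels", "Titles"])
                .groupby("Labels")
                .head(2))
print(preview)
```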


# The End ...