# K-means Clustering in sklearn

This example uses a dataset downloaded from https://www.opensubtitles.org/en/search/vip and the raw data at opus.lingfil.uu.se/OpenSubtitles2016/raw/en. Metadata such as title, actor, and director was scraped from IMDB and is not guaranteed to be complete. This example uses the 5000 most recent movies.

The code does the following:
1. Counts words
2. Builds a TF-IDF weighted vocabulary
3. Applies the TF-IDF weights to the word counts to create a sparse matrix
4. Runs K-means clustering on the sparse matrix
5. Prints the top words for each cluster using the largest features in the cluster centroid
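
The five steps above can be sketched end to end on a toy corpus. This is a minimal illustration rather than the notebook's own code: the four documents, `n_clusters=2`, `n_init=10`, `random_state=0`, and the default `CountVectorizer` settings are assumptions chosen to keep the example small. `TfidfTransformer` and `KMeans` are the scikit-learn classes assumed here for steps 2 to 4.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (an assumption for illustration): two "weather" and two "crime" documents.
docs = [
    "rain storm rain cloud",
    "storm cloud rain wind",
    "gun chase police gun",
    "police chase car gun",
]

# Step 1: count words into a sparse document-term matrix.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# Steps 2 and 3: learn IDF weights from the counts and apply them.
tfidf = TfidfTransformer().fit_transform(counts)

# Step 4: cluster the TF-IDF rows.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

# Step 5: list the terms with the largest weights in each centroid.
# Sorting the vocabulary by column index recovers the term for each column.
vocab = count_vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)
top_terms = {}
for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
    top_idx = centroid.argsort()[::-1][:3]
    top_terms[cluster_id] = [terms[j] for j in top_idx]
    print("cluster %d: %s" % (cluster_id, ", ".join(top_terms[cluster_id])))
```

Cluster numbering is arbitrary, so which topic ends up as cluster 0 can differ between runs with other seeds; only the grouping of documents is meaningful.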



## Unarchive

In [9]:
import sys
print sys.version

2.7.13 (default, Jul 30 2017, 14:48:40) 
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)]


In [17]:
import tempfile
import zipfile
import os.path

zipFile = "./openSubtitles-5000.json.zip"

print "Unarchiving ..."
# Extract into a fresh temporary directory; mkdtemp() does not remove it afterwards.
temp_dir = tempfile.mkdtemp()
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile(zipFile, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

openSubtitlesFile = os.path.join(temp_dir, "openSubtitles-5000.json")
print "file unarchived to:" + openSubtitlesFile


Unarchiving ...
file unarchived to:/var/folders/9l/w4_vhqyn5rz64fh1x9zzcsvr0000gn/T/tmpvmN7wd/openSubtitles-5000.json
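
Because `tempfile.mkdtemp()` leaves the directory in place until it is removed explicitly, a cleanup step is worth sketching. The example below builds a small throwaway archive so that it is self-contained; the file name `sample.json` and its contents are hypothetical stand-ins for the real archive.

```python
import os.path
import shutil
import tempfile
import zipfile

work_dir = tempfile.mkdtemp()
try:
    # Build a tiny archive to stand in for the real one (hypothetical file name).
    archive_path = os.path.join(work_dir, "sample.json.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("sample.json", '{"Text": "hello"}\n')

    # Extract it; the context manager closes the archive automatically.
    extract_dir = os.path.join(work_dir, "extracted")
    with zipfile.ZipFile(archive_path, "r") as archive:
        archive.extractall(extract_dir)

    extracted_file = os.path.join(extract_dir, "sample.json")
    with open(extracted_file) as f:
        first_line = f.readline()
    print("extracted: " + first_line.strip())
finally:
    # mkdtemp() never deletes the directory itself, so remove it explicitly.
    shutil.rmtree(work_dir)
```

In a notebook the cleanup would normally run in a final cell, after every cell that reads the extracted file has finished.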


In [29]:

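import json
from sklearn.feature_extraction.text import CountVectorizer

def make_corpus(path):
    """Yield the 'Text' field of each JSON document, one document per line."""
    with open(path) as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            # Progress indicator: print the line index every 100 lines.
            if i % 100 == 0:
                print "%d " % i,
            if 'Text' in doc:
                yield doc['Text']
            # Stop after line index 50, so at most 51 documents are read.
            if i == 50:
                break

print "Starting load ..."
textGenerator = make_corpus(openSubtitlesFile)
# Drop English stop words and terms that appear in fewer than 2 documents
# or in more than half of all documents.
count_vectorizer = CountVectorizer(min_df=2, max_df=0.5, stop_words='english')
term_freq_matrix = count_vectorizer.fit_transform(textGenerator)
print "Done."
# Each printed entry is (row, column index) followed by the count for that term.
print "term_freq_matrix[0] = \n%s" % term_freq_matrix[0]

print "done"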

Starting load ...
0  Done.
term_freq_matrix[0] = 
  (0, 912)	1
  (0, 3776)	1
  (0, 3632)	1
  (0, 4745)	1
  (0, 2717)	1
  (0, 3917)	1
  (0, 4676)	1
  (0, 2086)	1
  (0, 659)	1
  (0, 1664)	1
  (0, 543)	1
  (0, 698)	1
  (0, 5677)	1
  (0, 1133)	1
  (0, 5892)	1
  (0, 4098)	1
  (0, 2446)	1
  (0, 2609)	1
  (0, 703)	1
  (0, 3167)	1
  (0, 2253)	1
  (0, 5229)	1
  (0, 78)	1
  (0, 4097)	1
  (0, 5341)	1
  :	:
  (0, 1145)	2
  (0, 2601)	2
  (0, 1048)	1
  (0, 2040)	1
  (0, 748)	1
  (0, 1673)	7
  (0, 750)	1
  (0, 5381)	1
  (0, 1271)	1
  (0, 746)	1
  (0, 3237)	3
  (0, 4515)	12
  (0, 2833)	71
  (0, 654)	3
  (0, 3860)	3
  (0, 3346)	2
  (0, 5427)	2
  (0, 974)	3
  (0, 519)	5
  (0, 5837)	1
  (0, 4456)	58
  (0, 220)	1
  (0, 4424)	1
  (0, 5710)	1
  (0, 255)	14
done
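
The `min_df=2, max_df=0.5` arguments filter the vocabulary by document frequency, that is, by the number of documents a term appears in. The following is a simplified pure-Python sketch of that idea, not scikit-learn's implementation: the whitespace tokenizer, the toy documents, and the helper name `filter_by_document_frequency` are assumptions made for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=2, max_df=0.5):
    """Return the sorted terms whose document frequency is at least min_df
    documents and at most max_df (a fraction) of all documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

docs = [
    "it is raining again",
    "raining all night",
    "it was a quiet night",
    "it is late",
]
kept = filter_by_document_frequency(docs)
print(kept)  # ['is', 'night', 'raining']
```

Here "it" is dropped because it occurs in three of the four documents, which exceeds the 50% ceiling, while terms that occur in only one document fall below the `min_df` floor.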


In [36]:
print "Vocabulary length = ", len(count_vectorizer.vocabulary_)
# vocabulary_ maps each term to its column index in the count matrix, not to a count.
print "Column index for \"raining\" = %s" % count_vectorizer.vocabulary_["raining"]
feature_names = count_vectorizer.get_feature_names()
print "feature_names[10] = %s" % feature_names[10]


Vocabulary length =  5914
Column index for "raining" = 4136
feature_names[10] = 15
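
`vocabulary_` maps each term to its column index in the count matrix rather than to a count. A corpus-wide token count for a term can be obtained by summing that column, as sketched below on a toy corpus (the two documents are assumptions for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["rain rain sun", "sun rain"]  # toy corpus, for illustration only

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Column index assigned to the term (terms are indexed in sorted order).
rain_index = vectorizer.vocabulary_["rain"]

# Summing each column over all documents gives the total count per term.
totals = np.asarray(counts.sum(axis=0)).ravel()
rain_total = int(totals[rain_index])

print("column index = %d, total count = %d" % (rain_index, rain_total))
```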


In [4]:
from IPython.display import HTML, display

table = [["Sun",696000,1989100000],
         ["Earth",6371,5973.6],
         ["Moon",1737,73.5],
         ["Mars",3390,641.85]]
display(HTML(
    '<table><tr>{}</tr></table>'.format(
        '</tr><tr>'.join(
            '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in table)
        )
 ))


|       |        |            |
|-------|--------|------------|
| Sun   | 696000 | 1989100000 |
| Earth | 6371   | 5973.6     |
| Moon  | 1737   | 73.5       |
| Mars  | 3390   | 641.85     |
