## Table of Contents
<ul>
<li>[Part 1: Load Data](#Part-1:-Load-Data)
<li>[Part 2: Tokenizing and Stemming](#Part-2:-Tokenizing-and-Stemming)  
<li>[Part 3: TF-IDF](#Part-3:-TF-IDF)
<li>[Part 4: K-means Clustering](#Part-4:-K-means-Clustering)  
</ul>

## Part 1: Load Data

In [2]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import nltk
import re
import os

from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import lda

Read data from files. In summary, we have 100 titles and 100 synopses (each one combined from the IMDb and Wikipedia texts).

In [3]:
# import the titles; the Wikipedia and IMDb synopses are read further down
titles = open('./data/title_list.txt').read().split('\n')
titles = titles[:100] # ensures that only the first 100 are read in

# within each file, the synopses of different movies are separated by the keywords "BREAKS HERE".
# each synopsis may consist of multiple paragraphs.
synopses_wiki = open('./data/synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]

synopses_imdb = open('./data/synopses_list_imdb.txt').read().split('\n BREAKS HERE')
synopses_imdb = synopses_imdb[:100]

# combine imdb and wiki to get the full synopsis for each of the top 100 movies.
synopses = []
for i in range(len(synopses_wiki)):
    item = synopses_wiki[i] + synopses_imdb[i]
    synopses.append(item)

# because these synopses have already been ordered in popularity order,
# we just need to generate a list of ordered numbers for future usage.
ranks = range(len(titles))
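
To make the delimiter-based split concrete, here is a small sketch on an in-memory string. The text in `raw` is a made-up stand-in, and the assumption that the real files end with a trailing `BREAKS HERE` marker is ours, not something the notebook confirms.

```python
# Toy stand-in for the contents of a synopsis file (hypothetical text, not real data).
raw = "Plot of movie A.\nSecond paragraph.\n BREAKS HERE\nPlot of movie B.\n BREAKS HERE"

# Same split as in the cell above: one chunk per movie.
chunks = raw.split('\n BREAKS HERE')

# A delimiter at the very end leaves an empty last chunk,
# so drop blank chunks and trim surrounding whitespace.
synopses_demo = [chunk.strip() for chunk in chunks if chunk.strip()]

print(len(chunks), len(synopses_demo))
print(synopses_demo)
```

If the files do end with the marker, slicing to the first 100 entries (as done above) has the same effect of discarding the empty trailing chunk.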

## Part 2: Tokenizing and Stemming 

Load the stop-word list and the stemmer from the NLTK library. Stop words are words like "a", "the", or "in" that don't convey significant meaning.
<br/>
<b>Stemming</b> is the process of reducing a word to its root form (for example, "looked" becomes "look").

In [5]:
# Use nltk's English stopwords.
stopwords = nltk.corpus.stopwords.words("english")

print("We use " + str(len(stopwords)) + " stop-words from nltk library.")
print(stopwords[:10])

We use 179 stop-words from nltk library.
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u"you're"]


In [8]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def tokenization_and_stemming(text):
    # split into sentences, then into words; lowercase each word *before* the
    # stop-word check so that capitalized stop words ("The", "She") are dropped too
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)
              if word.lower() not in stopwords]
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    # reduce every remaining token to its stem
    stems = [stemmer.stem(token) for token in filtered_tokens]
    return stems

# tokenization without stemming (same filtering, so both outputs stay aligned position by position)
def tokenization(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)
              if word.lower() not in stopwords]
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    return filtered_tokens

In [12]:
# for example
print(tokenization_and_stemming("she looked at her father's arm."))
print(tokenization("she looked at her father's arm."))

[u'look', u'father', "'s", u'arm']
['looked', 'father', "'s", 'arm']


Use our defined functions to analyze (i.e. tokenize, stem) our synopses.

In [13]:
# flat lists over all synopses; position k in one list corresponds to position k in the other
docs_stemmed = []
docs_tokenized = []
for synopsis in synopses:
    tokenized_and_stemmed_results = tokenization_and_stemming(synopsis)
    docs_stemmed.extend(tokenized_and_stemmed_results)

    tokenized_results = tokenization(synopsis)
    docs_tokenized.extend(tokenized_results)

Create a mapping from stemmed words to original tokenized words for result interpretation. When several words share a stem, only the last one encountered is kept.

In [14]:
vocab_frame_dict = {docs_stemmed[i] : docs_tokenized[i] for i in range(len(docs_stemmed))}
print(vocab_frame_dict['angel'])

angeles
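
The lookup above returns a single surface form per stem. A minimal sketch of why, using hypothetical hand-written lists rather than the real `docs_stemmed` and `docs_tokenized`: a dict comprehension overwrites earlier entries, so the last token seen for a stem wins. The `first_wins` variant is an alternative added for comparison, not the notebook's method.

```python
# Hypothetical aligned lists: stems_demo[i] is assumed to be the stem of tokens_demo[i].
stems_demo = ["angel", "run", "angel", "run"]
tokens_demo = ["angels", "running", "angeles", "runs"]

# Dict comprehension as in the cell above: later positions overwrite earlier ones.
last_wins = {stems_demo[i]: tokens_demo[i] for i in range(len(stems_demo))}

# Alternative (an assumption, not the notebook's method): keep the first surface form seen.
first_wins = {}
for stem, token in zip(stems_demo, tokens_demo):
    first_wins.setdefault(stem, token)

print(last_wins)
print(first_wins)
```

Either choice is fine for labelling clusters; the point is only that the mapping is lossy, so a stem such as `angel` may display as a word that looks unrelated to most of its occurrences.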


## Part 3: TF-IDF 

In [15]:
# define vectorizer parameters: ignore terms that occur in more than 80% (max_df)
# or in fewer than 20% (min_df) of the synopses
tfidf_model = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop_words='english', use_idf=True,
                             tokenizer=tokenization_and_stemming, ngram_range=(1,1))
tfidf_matrix = tfidf_model.fit_transform(synopses) # fit the vectorizer to synopses

print("In total, there are " + str(tfidf_matrix.shape[0]) + \
     " synopses and " + str(tfidf_matrix.shape[1]) + " terms.")

In total, there are 100 synopses and 538 terms.
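
To show what the entries of `tfidf_matrix` represent, here is a pure-Python sketch of the weighting under the settings printed by `get_params()` (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`). The three-document corpus and its stems are made up for illustration, and the sketch skips stop-word removal and the `min_df`/`max_df` pruning.

```python
import math
from collections import Counter

# Tiny hypothetical corpus, already tokenized and stemmed.
docs = [["godfath", "famili", "kill"],
        ["famili", "love", "love"],
        ["kill", "soldier", "war"]]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw term counts times idf, scaled so the row has unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_row(doc) for doc in docs]
for row in matrix:
    print([round(w, 3) for w in row])
```

Terms that are frequent within one document but rare across the corpus (here `love` in the second document) receive the largest weights.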


In [16]:
tfidf_model.get_params()

{'analyzer': u'word',
 'binary': False,
 'decode_error': u'strict',
 'dtype': numpy.int64,
 'encoding': u'utf-8',
 'input': u'content',
 'lowercase': True,
 'max_df': 0.8,
 'max_features': 200000,
 'min_df': 0.2,
 'ngram_range': (1, 1),
 'norm': u'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': u'(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenization_and_stemming>,
 'use_idf': True,
 'vocabulary': None}
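
The `min_df` and `max_df` values above are fractions of the corpus size, which is why only a few hundred terms survive. A rough sketch of that pruning rule on a made-up ten-document corpus; `limit_vocabulary` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
from collections import Counter

# Hypothetical corpus of 10 tokenized documents with controlled document frequencies.
docs = []
for i in range(10):
    doc = ["film"]              # appears in all 10 documents
    if i < 8:
        doc.append("kill")      # 8 documents
    if i < 5:
        doc.append("love")      # 5 documents
    if i < 2:
        doc.append("famili")    # 2 documents
    if i < 1:
        doc.append("war")       # 1 document
    docs.append(doc)

def limit_vocabulary(docs, min_df=0.2, max_df=0.8):
    """Keep terms whose document frequency lies within [min_df * n, max_df * n]."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)

kept = limit_vocabulary(docs)
print(kept)
```

With 100 synopses, the same thresholds keep only terms that occur in at least 20 and at most 80 of them.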

Save the terms identified by TF-IDF. 

In [18]:
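# note: scikit-learn 1.0 introduced get_feature_names_out() as the replacement for get_feature_names()
tf_selected_words = tfidf_model.get_feature_names()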

### Calculate Document Similarity 

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
cos_matrix = cosine_similarity(tfidf_matrix)
print(cos_matrix)

[[ 1.          0.1996283   0.23342595 ...,  0.3581728   0.3277855
   0.17714484]
 [ 0.1996283   1.          0.26626879 ...,  0.26910321  0.22856581
   0.10870124]
 [ 0.23342595  0.26626879  1.         ...,  0.27738654  0.24669308
   0.12805055]
 ..., 
 [ 0.3581728   0.26910321  0.27738654 ...,  1.          0.55040973
   0.08151334]
 [ 0.3277855   0.22856581  0.24669308 ...,  0.55040973  1.          0.13005234]
 [ 0.17714484  0.10870124  0.12805055 ...,  0.08151334  0.13005234  1.        ]]
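
For reference, a minimal pure-Python sketch of what each entry of `cos_matrix` measures. The three rows below are hypothetical unit-length vectors, not values taken from `tfidf_matrix`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF rows for three documents over a four-term vocabulary.
rows = [[0.6, 0.8, 0.0, 0.0],
        [0.0, 0.6, 0.8, 0.0],
        [0.0, 0.0, 0.0, 1.0]]

# Pairwise similarity matrix, analogous to cos_matrix above.
sim = [[cosine(u, v) for v in rows] for u in rows]
for line in sim:
    print([round(s, 2) for s in line])
```

Because the TF-IDF rows are L2-normalized, the cosine reduces to a plain dot product. This is why the matrix printed above is symmetric with ones on the diagonal, and why documents that share no terms score 0.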


## Part 4: K-means Clustering 