# Using scikit-learn and NLTK to find the closest matching movie title

In this demonstration, I use natural language processing to find the movie title closest to an input text.
I use NLTK (the Natural Language Toolkit) to tokenize and stem the text, and scikit-learn to convert the titles into numeric vectors for analysis.

## The Movie Lens Data

The data is from [movielens](http://files.grouplens.org/datasets/movielens/ml-20m.zip). The full movies.csv file contains over 27k movie titles and genres. For this exercise I took about 30% of the original data to speed up computation.

In [1]:
from __future__ import print_function
import csv
import numpy as np
import pandas as pd

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer

import string

In [2]:
# Text processing functions

import re
def filter_non_ascii(text):
    return re.sub(r'[^\x00-\x7F]+',' ', text)

def filter_carriage_return(text):
    return re.sub(r'\n', ' ', text)

def filter_double_quote(text):
    return re.sub(r'\"', '', text)

# Add your domain-specific stopwords
def filter_custom_stopwords(text):
    custom_stopwords = ['episode', 'season', 'awesomenesstv', 'facebook', 'twitter', 
                        'instagram', 'tumblr', 'snapchat', 'google',
                        'network', 'subscribe', 'download', 'the app', 'follow',
                        'ios', 'android', 'http', 'https', 'swag', 'shop', 'youtube support',
                        'bitly', 'credits']
    for i in custom_stopwords:
        text = text.replace(i, ' ')
    return text

def filter_punctuations(text):
    # A regex works on both Python 2 and Python 3 strings;
    # translate() with string.maketrans("", "") is Python 2 only
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)

# Bulk of the transformation before running tf-idf
def parseOutText(text):
    # Return an empty string (rather than None) for very short input
    if len(text) <= 1:
        return ""

    ### remove punctuation
    text_no_punctuation = filter_punctuations(text)

    ### remove custom stopwords - domain specific
    text_string = filter_custom_stopwords(text_no_punctuation.lower())

    ### split the text string into individual words, stem each word,
    ### and append the stemmed word to words (a single space follows
    ### each stemmed word, including the last one)
    individual_words = word_tokenize(text_string)

    snowball_stemmer = SnowballStemmer("english")
    stemmed_text = ""
    for txt in individual_words:
        stemmed_text += snowball_stemmer.stem(txt) + " "

    return stemmed_text

In [3]:
movielens = pd.read_csv('movies.csv')

In [4]:
movielens.head()

Unnamed: 0,movieId,title,genres
0,94494,96 Minutes (2011),Drama|Thriller
1,94496,Columbus Circle (2012),Crime|Mystery|Thriller
2,94503,"Decoy Bride, The (2011)",Comedy|Romance
3,94531,"Headhunter's Sister, The (1997)",Drama
4,94537,Angels Crest (2011),Drama
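The `Unnamed: 0` column in the output above is what pandas typically shows when a CSV was saved together with its index and then read back without `index_col`. Here is a minimal sketch, assuming that is what happened with this file. The inline CSV text is only a stand-in for the real movies.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for movies.csv: the first header field is empty
# because the index was written out when the file was saved.
csv_text = u""",movieId,title,genres
0,94494,96 Minutes (2011),Drama|Thriller
1,94496,Columbus Circle (2012),Crime|Mystery|Thriller
2,94503,"Decoy Bride, The (2011)",Comedy|Romance
"""

# Default read: the unnamed first column becomes a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns that column back into the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

Either form works for the rest of the notebook, since only the `title` column and the row positions are used.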


In [5]:
len(movielens)

8278

In [6]:
movie_word_data = movielens['title']

## Bi-Gram Document Similarity

I did not remove stop words here, because stop words are part of many movie titles, such as "The Shining" or "The Martian".
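To illustrate the point, here is a small sketch of what stop-word removal would do to those two titles. It uses scikit-learn's built-in English stop-word list, which is an assumption for illustration. The notebook itself does no such filtering.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ["the shining", "the martian"]

# Keep every token.
keep_all = CountVectorizer().fit(titles)

# Same input, but drop scikit-learn's built-in English stop words.
drop_stop = CountVectorizer(stop_words='english').fit(titles)

print(sorted(keep_all.vocabulary_))
print(sorted(drop_stop.vocabulary_))
```

With stop words removed, "the" disappears from the vocabulary, so a query such as "The Shining" could no longer be told apart from "Shining".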

In [7]:
# Text processing: remove punctuation, tokenize and stem each title
all_bi_word_data = []

for line in movie_word_data:
    line1 = filter_non_ascii(line)
    line2 = parseOutText(line1)
    all_bi_word_data.append(line2)

In [8]:
len(all_bi_word_data)

8278

In [9]:
all_bi_word_data[0:5]

[u'96 minut 2011 ',
 u'columbus circl 2012 ',
 u'decoy bride the 2011 ',
 u'headhunt sister the 1997 ',
 u'angel crest 2011 ']
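The processed titles above show what stemming does: "Minutes" becomes "minut" and "Circle" becomes "circl". The following is a minimal sketch of the same normalization on its own. `normalize_title` is a hypothetical helper, and a plain regex split stands in for `word_tokenize` so that the example runs without downloading NLTK tokenizer data.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def normalize_title(title):
    """Lowercase, keep alphanumeric tokens, and stem each one (hypothetical helper)."""
    # Regex tokenization is an assumption made for this sketch;
    # the notebook uses nltk's word_tokenize instead.
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(normalize_title("96 Minutes (2011)"))
print(normalize_title("Columbus Circle (2012)"))
```

Because the regex splits on apostrophes, titles such as "Headhunter's Sister" would come out differently from the notebook's output, so this is not a drop-in replacement for `parseOutText`.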

## Create a Bi-Gram Vector

In [10]:
# Stop-word removal is optional in a bi-gram model; I skip it here.
# Setting ngram_range=(1, 2) produces both single-word (bag-of-words) and bi-gram features
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()

bigram_vec = bigram_vectorizer.fit_transform(all_bi_word_data).toarray()
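To make the effect of `ngram_range=(1, 2)` concrete, here is a small sketch that fits the same kind of vectorizer on a single toy string and lists the features it produces. The toy input is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b')
counts = vectorizer.fit_transform(["days of future past"])

# vocabulary_ maps each n-gram to its column index; sorting the keys
# lists the features in column order.
features = sorted(vectorizer.vocabulary_)
print(features)
print(counts.shape)
```

Four words yield four unigrams plus three bi-grams, which is why the feature space of the full title set grows well beyond the number of distinct words.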

In [11]:
# From occurrences to frequencies with TF-IDF weighting
bi_transformer = TfidfTransformer()

tfidf_bi_transform = bi_transformer.fit_transform(bigram_vec)
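As a check on what the transformer computes, the sketch below reproduces its default weighting by hand on a toy count matrix: a smoothed inverse document frequency, followed by scaling each row to unit length. The toy counts are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents (rows) x 3 terms (columns).
counts = np.array([[1, 1, 0],
                   [1, 0, 2],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency (the default smooth_idf=True).
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

weighted = counts * idf
# Scale each row to unit Euclidean length (the default norm='l2').
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

The unit-length rows matter later: the dot product of two such rows is their cosine similarity.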

In [12]:
# Transform a sparse matrix to a dense one so it can be vertically stacked in a later step
tfidf_vec = tfidf_bi_transform.todense()

In [13]:
# See sample vector features
# (on scikit-learn 1.0 and later, use get_feature_names_out() instead)
word_vector = bigram_vectorizer.get_feature_names()
word_vector[10000:10010]

[u'ferguson a',
 u'ferguson doe',
 u'ferguson im',
 u'ferm',
 u'ferm 2013',
 u'fernandez',
 u'fernandez 2012',
 u'ferngulli',
 u'ferngulli 2',
 u'fernsehen']

## Query 1: Days of Future Past

Let's test whether the algorithm can find the movie title closest to "Days of Future Past".

In [14]:
new_query = "Days of Future Past"

In [15]:
new_doc_processed = parseOutText(new_query)
new_doc_array = [new_doc_processed]

### Process the new query in the pre-trained Bi-Gram Vectorizer

In [16]:
# Vectorize new doc
new_doc_bigram_vec = bigram_vectorizer.transform(new_doc_array)

# From occurrences to frequencies with TF-IDF weighting
new_doc_tfidf_bi_transform = bi_transformer.transform(new_doc_bigram_vec)
new_doc_matrix = new_doc_tfidf_bi_transform.todense()

# Stack the new doc as the 1st row on top of the existing movie-title matrix
combo_matrix = np.vstack([new_doc_matrix, tfidf_vec])

# Calculate Similarity Score including new_doc
sim_score = (combo_matrix * combo_matrix.T).A
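Since every TF-IDF row has unit length, the product of the stacked matrix with its transpose holds the cosine similarity between every pair of rows. Only the first row of that product is used afterwards, so the same scores can be obtained by multiplying the title matrix by the query vector alone. The sketch below shows the equivalence on toy unit-length vectors. The numbers are made up for illustration.

```python
import numpy as np

# Toy stand-ins for the TF-IDF rows: each row has unit length.
doc_matrix = np.array([[1.0, 0.0, 0.0],
                       [0.6, 0.8, 0.0],
                       [0.0, 0.0, 1.0]])
query_vec = np.array([[0.8, 0.6, 0.0]])

# Full approach: stack, then multiply the stacked matrix by its transpose
# and keep the query row without its self-similarity entry.
combo = np.vstack([query_vec, doc_matrix])
full_scores = combo.dot(combo.T)[0, 1:]

# Shortcut: only the query row is needed, so multiply it directly.
query_scores = doc_matrix.dot(query_vec.T).ravel()

print(query_scores)
print(np.allclose(full_scores, query_scores))
```

On the full data set the shortcut avoids building a square matrix with one row and one column per title.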

### Find the document with the maximum similarity score

In [17]:
# Find the max similarity score and its index
# Note that I skip column 0, which is the query compared with itself
max_score = np.amax(sim_score[0,1:])
max_index = np.argmax(sim_score[0, 1:])
    
print ("max_score:", max_score)
print ("max_index:", max_index)

max_score: 0.76637582957
max_index: 4381
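Because the slice `sim_score[0, 1:]` drops the query itself, the position returned by `np.argmax` lines up with the row number in `movielens`. The same idea extends from a single best match to a ranked list. Here is a small sketch with made-up scores, where `top_k` is a hypothetical parameter.

```python
import numpy as np

# Hypothetical similarity scores of one query against five titles.
scores = np.array([0.10, 0.77, 0.00, 0.45, 0.30])

top_k = 3
# argsort is ascending, so reverse it and keep the first k positions.
top_indices = np.argsort(scores)[::-1][:top_k]

for idx in top_indices:
    print(idx, scores[idx])
```

A ranked list is useful when the top score is low and several titles are plausible matches.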


In [18]:
# The data from the index of the max similarity score
movielens[max_index:max_index+1]

Unnamed: 0,movieId,title,genres
4381,111362,X-Men: Days of Future Past (2014),Action|Adventure|Sci-Fi


## Query 2: A Cinderella Story

In [19]:
new_query = "A Cinderella Story"
new_doc_processed = parseOutText(new_query)
new_doc_array = [new_doc_processed]

### Process the new query in the pre-trained Bi-Gram Vectorizer

In [20]:
# Vectorize new doc
new_doc_bigram_vec = bigram_vectorizer.transform(new_doc_array)

# From occurrences to frequencies with TF-IDF weighting
new_doc_tfidf_bi_transform = bi_transformer.transform(new_doc_bigram_vec)
new_doc_matrix = new_doc_tfidf_bi_transform.todense()

# Stack the new doc as the 1st row on top of the existing movie-title matrix
combo_matrix = np.vstack([new_doc_matrix, tfidf_vec])

# Calculate Similarity Score including new_doc
sim_score = (combo_matrix * combo_matrix.T).A

### Find the document with the maximum similarity score

In [21]:
# Find the max similarity score and its index
# Note that I skip column 0, which is the query compared with itself
max_score = np.amax(sim_score[0,1:])
max_index = np.argmax(sim_score[0, 1:])
    
print ("max_score:", max_score)
print ("max_index:", max_index)

max_score: 0.586858382681
max_index: 6102


In [22]:
# The data from the index of the max similarity score
movielens[max_index:max_index+1]

Unnamed: 0,movieId,title,genres
6102,118496,A Cinderella Story: Once Upon a Song (2011),Children|Comedy|Romance
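Both queries repeat the same steps, so they could be wrapped in a function. Below is a compact sketch of that idea. The helpers `build_index` and `find_closest_title` are hypothetical names. The sketch uses `TfidfVectorizer`, which combines counting and TF-IDF weighting in one step, and it skips the stemming and custom stop-word filtering for brevity, so scores will differ from those above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def build_index(titles):
    """Fit a unigram + bi-gram TF-IDF model on the titles (hypothetical helper)."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b')
    matrix = vectorizer.fit_transform(titles)  # sparse, rows have unit length
    return vectorizer, matrix

def find_closest_title(query, titles, vectorizer, matrix):
    """Return (best_title, score) for the query (hypothetical helper)."""
    query_vec = vectorizer.transform([query])
    # Dot product of unit-length rows equals cosine similarity.
    scores = matrix.dot(query_vec.T).toarray().ravel()
    best = int(np.argmax(scores))
    return titles[best], float(scores[best])

titles = ["X-Men: Days of Future Past (2014)",
          "A Cinderella Story: Once Upon a Song (2011)",
          "Angels Crest (2011)"]
vectorizer, matrix = build_index(titles)

for query in ["Days of Future Past", "A Cinderella Story"]:
    print(query, "->", find_closest_title(query, titles, vectorizer, matrix))
```

Keeping the matrix sparse and fitting the vectorizer once means each new query costs only one transform and one matrix-vector product.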


## Conclusion

This demo shows that, using NLTK and scikit-learn, I can find the movie title closest to an input text.