We look at how plot similarity can be measured with cosine similarity. As a warm-up, we compute the cosine similarity of two small vectors.

In [1]:
import sklearn.metrics.pairwise as pw
import numpy as np

# Plain 2-D arrays (one row per vector); np.matrix is discouraged in current NumPy.
X = np.array([[1, -13]])
Y = np.array([[-34, 13]])
pw.cosine_similarity(X, Y)

array([[-0.42772402]])
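To make the definition concrete, the same number can be reproduced by hand as the dot product divided by the product of the two vector lengths. This is a minimal sketch using only the standard library; the helper name `cosine_similarity` is ours and is not the scikit-learn function.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Same vectors as in the cell above: dot = -203, norms = sqrt(170) and sqrt(1325).
sim = cosine_similarity([1, -13], [-34, 13])
print(sim)
```

A value near 1 means the vectors point in the same direction, 0 means they are orthogonal, and a negative value means they point in roughly opposite directions.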

We read the data that was previously created with the Data Prepare notebook.

In [2]:
import io, json
from collections import OrderedDict

def read_data(path):
    # Map (title, year) -> plot text plus cast and genre sets, in file order.
    with io.open(path, 'r', encoding='latin-1') as f:
        movies = json.load(f)
    return OrderedDict(
        ((movie['title'], int(movie['year'])),
         {'plot': movie['plot'],
          'cast': set(movie['cast']),
          'genres': set(movie['genres'])})
        for movie in movies)
    
movies = read_data('data.json')
print(len(movies))

96585


We use a convenience function to find all Fast and Furious movies. In our examples, we will look for movies similar to Tokyo Drift.

In [3]:
def find_movie(title):
    return [movie for movie in movies.keys() if title in movie[0]]

q = find_movie('Fast and')
print(q)

[("The Making of 'The Fast and the Furious'", 2002), ('High, Fast and Wonderful', 2003), ('The Fast and the Furious: Tokyo Drift', 2006), ('The Fast and the Furious', 2001), ('Tasmanian Devil: The Fast and Furious Life of Errol Flynn', 2007)]


Initially, we take a bag-of-words approach without any lemmatization, the way it was used in the paper. The LemmaTokenizer class is defined below but left disabled.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,  # pass LemmaTokenizer() here to enable lemmatization
                             preprocessor=None,
                             stop_words='english')
plots = [movie['plot'] for movie in movies.values()]
feat_vec = vectorizer.fit_transform(plots)
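# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead.
vocab = vectorizer.get_feature_names_out().tolist()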
print('Number of words in vocabulary:', len(vocab))

Number of words in vocabulary: 175129
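As a rough illustration of what the count matrix contains, here is a minimal bag-of-words sketch on two made-up sentences. It assumes a plain whitespace tokenizer and no stop-word removal, so it is simpler than what CountVectorizer does; `docs`, `counts`, and `matrix` are illustrative names only.

```python
from collections import Counter

# Two toy "plots"; these sentences are invented for the example.
docs = ["street racing in tokyo", "racing cars racing drivers"]

# Count how often each token occurs in each document.
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all tokens; each one becomes a column.
vocab = sorted({word for c in counts for word in c})

# One row per document, one column per vocabulary word.
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each plot becomes a row of word counts, which is why the real matrix has one column for every word in the 175129-word vocabulary.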


We calculate the cosine distance (1 minus the cosine similarity) between Tokyo Drift's plot vector and every plot vector in our movies, working in batches, and keep the n closest matches.

In [6]:
import math

def similar_vectors(X, v, n = 6, batch_size = 10000):
    # Find the n rows of X closest to v in cosine distance, one batch at a time.
    num_vecs = X.shape[0]
    num_batches = math.ceil(num_vecs/batch_size)
    # np.int and np.float were removed in NumPy 1.24; the builtins are equivalent.
    min_indices = np.zeros(num_batches*n, dtype=int)
    min_dists = np.ones(num_batches*n, dtype=float)
    for batch in range(num_batches):
        dists = pw.pairwise_distances(X[batch*batch_size:min((batch+1)*batch_size,num_vecs)],
                                      v, metric='cosine')
        # Keep the n smallest distances of this batch (in no particular order).
        ind_min = np.argpartition(dists[:,0], n)[:n]
        min_indices[batch*n:(batch+1)*n] = ind_min + batch*batch_size
        min_dists[batch*n:(batch+1)*n] = dists[ind_min][:,0]
    # Reduce the per-batch candidates to the overall n smallest.
    ind_min = np.argpartition(min_dists, n)[:n]
    best_ind = min_indices[ind_min]
    best_dist = min_dists[ind_min]
    return best_ind, best_dist
    
def similar_movies(title, feat_vec, vectorizer, n = 6, vtype = 'plot'):
    v = vectorizer.transform([movies[title][vtype]])
    best_ind, best_dist = similar_vectors(feat_vec, v, n)
    key_list = list(movies.keys())
    for i in range(n):
        sim_title = key_list[best_ind[i]]
        print(sim_title)
        print(best_dist[i])
        print(movies[sim_title])
    
similar_movies(q[2],feat_vec,vectorizer)

('The Fast and the Furious: Tokyo Drift', 2006)
-2.22044604925e-15
{'plot': 'An American teenager named Sean Boswell is a loner in school, however he challenges his rival for an illegal street racing, and he totals his car in the end of the race. To avoid time in prison he is sent to Tokyo to live with his father who is in the military. As soon as he arrives he discovers a new, fun but dangerous way of street racing in the underworld of the streets of Tokyo, Japan. Sean Boswell, an Alabama teenager with a record for street racing, moves to his father\'s resident city of Tokyo, Japan to avoid a prison sentence in America. Boswell quickly falls in love with the world of drift racing in Tokyo\'s underground and a Japanese girl named Neela. However, Boswell\'s presence and growing talent for drifting unsettles the Japanese Mafia, which makes thousands of dollars from the sport. Confrontations arise, and Sean is faced with a simple decision: drift or die. After totaling his car in an illega

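The closest match is Tokyo Drift itself, at a distance of essentially zero (the tiny negative value is floating-point round-off), and we can see that many of the other results are racing-related movies.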
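One detail of similar_vectors worth keeping in mind: np.argpartition only guarantees that the n smallest distances come first, not that they are sorted, so the printed results are not necessarily ranked from most to least similar. The sketch below shows this on made-up distances and adds an optional sort; the `dists` values are invented for illustration.

```python
import numpy as np

# Hypothetical cosine distances from a query to six documents.
dists = np.array([0.9, 0.1, 0.5, 0.0, 0.7, 0.3])
n = 3

# Indices of the n smallest distances, in no guaranteed order.
nearest = np.argpartition(dists, n)[:n]

# Optional extra step: order those n candidates by increasing distance.
nearest_sorted = nearest[np.argsort(dists[nearest])]

print(sorted(nearest.tolist()))
print(nearest_sorted.tolist())
```

Partitioning avoids fully sorting every batch, and sorting only the final n candidates costs very little.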

Let's try using tf-idf weights instead of simple counts.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(analyzer="word",
                                   tokenizer=None,
                                   preprocessor=None,
                                   stop_words='english')
tfidf_vec = tfidf_vectorizer.fit_transform(plots)

similar_movies(q[2],tfidf_vec,tfidf_vectorizer)

('Hot Wheels Highway 35 World Race', 2003)
0.79182873117
{'plot': 'The drivers enter the Neon Pipeline Realm where they have to master not only drafting, but also driving on the inside and outside of an elaborate network of pipes. The drivers discover the presence of the mysterious Silencerz team. In the Junk Realm, Kurt\'s brother, Wylde, gets captured by Gelorum and the evil Racing Drones. Dr. Peter Tezla uncovers the mysterious technology of the ancient ACCELERONS. He discovers that the Wheel of Power is the gateway to amazing new racing environments: the RACING REALMS, over a hundred different tracks, each more breathtaking than the original world of Highway 35. But the Racing Realms become a high-speed battleground as two teams of human racers compete with deadly robotic racers and secret undercover drivers to see if they have the skill and courage to win the ultimate race -- and the ultimate power that goes with it. DRIVE TO SURVIVE! The drivers enter the Water Realm where former

Again, the results are racing movies.
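To see why tf-idf changes the weighting, here is a minimal hand-rolled sketch on three toy documents. It assumes the smoothed formula idf = ln((1 + N) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents for TfidfVectorizer's default settings; the toy documents and the helper `tfidf` are illustrative and not part of the notebook.

```python
import math

# Three toy, pre-tokenized documents (invented for the example).
docs = [["street", "racing", "tokyo"],
        ["racing", "cars"],
        ["racing", "drivers", "tokyo"]]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})

# Document frequency: in how many documents each word appears.
df = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed inverse document frequency (assumed default formula, see lead-in).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf(doc):
    """Term count times idf, scaled to unit Euclidean length."""
    weights = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

for w in vocab:
    print(w, round(idf[w], 3))
print(tfidf(docs[0]))
```

A word that occurs in every document ("racing" here) gets the lowest idf, so rarer words contribute more to the similarity between plots.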

We now look at how latent Dirichlet allocation (LDA) behaves when given our plot data.

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

def similar_movies2(title, feat_vec, vectorizer, vectorizer2, n = 6, vtype = 'plot'):
    v = vectorizer2.transform(vectorizer.transform([movies[title][vtype]]))
    best_ind, best_dist = similar_vectors(feat_vec, v, n)
    key_list = list(movies.keys())
    for i in range(n):
        sim_title = key_list[best_ind[i]]
        print(sim_title)
        print(best_dist[i])
        print(movies[sim_title])

In [9]:
n_topics = 20
# The n_topics argument was renamed to n_components in scikit-learn 0.19.
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=2,
                                learning_method='online', learning_offset=10.,
                                random_state=0)
lda_vec = lda.fit_transform(feat_vec)
similar_movies2(q[2],lda_vec,vectorizer,lda)

('The Fast and the Furious: Tokyo Drift', 2006)
1.11022302463e-16
{'plot': 'An American teenager named Sean Boswell is a loner in school, however he challenges his rival for an illegal street racing, and he totals his car in the end of the race. To avoid time in prison he is sent to Tokyo to live with his father who is in the military. As soon as he arrives he discovers a new, fun but dangerous way of street racing in the underworld of the streets of Tokyo, Japan. Sean Boswell, an Alabama teenager with a record for street racing, moves to his father\'s resident city of Tokyo, Japan to avoid a prison sentence in America. Boswell quickly falls in love with the world of drift racing in Tokyo\'s underground and a Japanese girl named Neela. However, Boswell\'s presence and growing talent for drifting unsettles the Japanese Mafia, which makes thousands of dollars from the sport. Confrontations arise, and Sean is faced with a simple decision: drift or die. After totaling his car in an illegal

In [10]:
n_topics = 20
def list_topics(lda, vocab, n_topics, n_words = 10):
    for n in range(n_topics):
        top_inds = np.argpartition(lda.components_[n], -n_words)[-n_words:]
        topics = [vocab[i] for i in top_inds]
        print('Topic ',str(n+1)+':',topics)
        
list_topics(lda, vocab, n_topics)

Topic  1: ['battle', 'alien', 'simon', 'los', 'jake', 'fight', 'clark', 'powers', 'angeles', 'alex']
Topic  2: ['president', 'agent', 'new', 'war', 'john', 'american', 'government', 'men', 'years', 'group']
Topic  3: ['girls', 'george', 'friends', 'student', 'students', 'college', 'adam', 'high', 'rachel', 'school']
Topic  4: ['francis', 'wendy', 'jesus', 'canadian', 'farmer', 'harrison', 'brooke', 'al', 'elizabeth', 'michael']
Topic  5: ['julia', 'week', 'food', 'bob', 'joe', 'restaurant', 'tim', 'ryan', 'gay', 'party']
Topic  6: ['family', 'mysterious', 'max', 'father', 'years', 'home', 'house', 'young', 'town', 'evil']
Topic  7: ['crime', 'team', 'killed', 'killer', 'dead', 'death', 'man', 'police', 'case', 'murder']
Topic  8: ['wish', 'raj', 'christian', 'church', 'god', 'harry', 'heard', 'smith', 'simple', 'archer']
Topic  9: ['betty', 'greg', 'gary', 'tracy', 'lee', 'nick', 'tommy', 'danny', 'grace', 'alan']
Topic  10: ['teams', '000', 'race', 'world', 'team', 'challenge', 'time'

While the topic structure does seem to capture some semantic themes, there are also many groups that are not useful (e.g., topic 16).
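When judging topics by eye, it may help to list each topic's words from strongest to weakest, since np.argpartition in list_topics returns the top words in arbitrary order. Below is a minimal sketch of that ordering; the small `components` array is a made-up stand-in for lda.components_, and `top_words` is a hypothetical helper.

```python
import numpy as np

# Toy vocabulary and topic-word weights (2 topics x 5 words), invented for the example.
vocab = ["race", "team", "murder", "police", "school"]
components = np.array([[5.0, 3.0, 0.1, 0.2, 0.4],
                       [0.2, 0.9, 4.0, 6.0, 0.1]])

def top_words(weights, vocab, n_words=2):
    """Return the n_words highest-weighted words, strongest first."""
    order = np.argsort(weights)[::-1][:n_words]
    return [vocab[i] for i in order]

for k, weights in enumerate(components):
    print('Topic', k + 1, top_words(weights, vocab))
```

With the words ranked, a topic dominated by a few strong terms is easier to tell apart from one that is a loose collection of names.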