## Assignment 1 Solutions: Recommending Similar Movies (5 points)

- In this assignment, you will quantify the similarity between movies using plot summaries from Wikipedia and IMDb. To do this, we calculate the cosine similarity for all pairs of movies in the movies data set, which contains 100 movies with their titles, genres, and plot summaries from Wikipedia and IMDb.

- Numbered instructions are provided in every code block.

- Do not clear the outputs; you must print out your outputs.

- Write your (legal) full name here: Amaris Efthimiou

## 1. Import Data Set

In [11]:
import numpy as np
import pandas as pd
import nltk

#(0.5 points)

# (1) The data set comes as a CSV file. Import the data file (it is uploaded to the Blackboard folder).
movies_df = pd.read_csv('movies.csv')
print("Number of movies: ", (len(movies_df)))


# (2) Check the number of rows and columns and the column names, and print a few rows to see what the data looks like.
print(movies_df.columns)
print(movies_df.head())
print(movies_df.shape)

Number of movies:  100
Index(['rank', 'title', 'genre', 'wiki_plot', 'imdb_plot'], dtype='object')
   rank                     title                                    genre  \
0     0             The Godfather                   [u' Crime', u' Drama']   
1     1  The Shawshank Redemption                   [u' Crime', u' Drama']   
2     2          Schindler's List  [u' Biography', u' Drama', u' History']   
3     3               Raging Bull    [u' Biography', u' Drama', u' Sport']   
4     4                Casablanca        [u' Drama', u' Romance', u' War']   

                                           wiki_plot  \
0  On the day of his only daughter's wedding, Vit...   
1  In 1947, banker Andy Dufresne is convicted of ...   
2  In 1939, the Germans move Polish Jews into the...   
3  In a brief scene in 1964, an aging, overweight...   
4  It is early December 1941. American expatriate...   

                                           imdb_plot  
0  In late summer 1945, guests are gathe

## 2. Combine (concatenate) Wikipedia and IMDb plot summaries

In [12]:
# (3) Combine wiki_plot and imdb_plot into a single column. (0.5 points)
movies_df["plot"] = movies_df["wiki_plot"].astype(str) + movies_df["imdb_plot"].astype(str)
print(movies_df.head())
# (4) Make sure the concatenation worked properly. One simple check is to compare
# the lengths of wiki_plot and imdb_plot with the length of the combined plot:
# the combined length should equal the sum of the other two.
# (0.5 points)
print(len(movies_df["wiki_plot"][0]))
print(len(movies_df["imdb_plot"][0]))
print(len(movies_df["plot"][0]))

   rank                     title                                    genre  \
0     0             The Godfather                   [u' Crime', u' Drama']   
1     1  The Shawshank Redemption                   [u' Crime', u' Drama']   
2     2          Schindler's List  [u' Biography', u' Drama', u' History']   
3     3               Raging Bull    [u' Biography', u' Drama', u' Sport']   
4     4                Casablanca        [u' Drama', u' Romance', u' War']   

                                           wiki_plot  \
0  On the day of his only daughter's wedding, Vit...   
1  In 1947, banker Andy Dufresne is convicted of ...   
2  In 1939, the Germans move Polish Jews into the...   
3  In a brief scene in 1964, an aging, overweight...   
4  It is early December 1941. American expatriate...   

                                           imdb_plot  \
0  In late summer 1945, guests are gathered for t...   
1  In 1947, Andy Dufresne (Tim Robbins), a banker...   
2  The relocation of Polis
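
The length check in the cell above inspects only the first row. The following is a minimal sketch of the same check applied to every row, using a small made-up DataFrame (`toy_df` and its texts are illustrative, not taken from `movies.csv`). It also shows a variant joined with a single space, which is an assumption and not part of the original cell: a separator keeps the last word of one summary from fusing with the first word of the next, and the expected length then grows by one.

```python
import pandas as pd

# Toy stand-in for movies_df; the texts are made up for illustration.
toy_df = pd.DataFrame({
    "wiki_plot": ["A banker is convicted.", "A boxer ages."],
    "imdb_plot": ["He plans an escape.", "He recalls his fights."],
})

# Same concatenation as in the cell above (no separator).
toy_df["plot"] = toy_df["wiki_plot"].astype(str) + toy_df["imdb_plot"].astype(str)

# Check every row, not just row 0.
expected_len = toy_df["wiki_plot"].str.len() + toy_df["imdb_plot"].str.len()
all_rows_ok = bool((toy_df["plot"].str.len() == expected_len).all())

# Variant with a space separator (assumption: not part of the assignment code).
toy_df["plot_spaced"] = (
    toy_df["wiki_plot"].astype(str) + " " + toy_df["imdb_plot"].astype(str)
)
spaced_ok = bool((toy_df["plot_spaced"].str.len() == expected_len + 1).all())

print(all_rows_ok, spaced_ok)
```

Without the separator, the first toy row becomes `A banker is convicted.He plans an escape.`, so the tokenizer may treat `convicted.He` differently from two separate words.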

## 3. Tokenization

In [13]:
# (5) Tokenize the combined plot column. Before, during, or after tokenization, remove punctuation, non-words, and stop words.
# Some of these pre-processing steps can also be handled during vectorization.
# Keep the elements that contain letters only. (0.5 points)
from nltk.tokenize import word_tokenize

# See below: tokenization and stemming could be done as separate steps,
# but I combined them so that I pass through the data only once.

## 4. Stemming

In [14]:
# (6) Stem the tokenized combined plot column using a Snowball stemmer.
# Whether you use a simple lemmatizer or a stemmer at this step is up to you. (0.5 points)

# Import the SnowballStemmer.
from nltk.stem.snowball import SnowballStemmer
# Create an English language SnowballStemmer object
stemmer = SnowballStemmer("english")

import re
def tokenize_and_stem_together(text):
    # Tokenize the text into words and punctuation marks.
    tokens = nltk.word_tokenize(text)

    # Keep tokens that contain at least one letter (drops pure punctuation and numbers).
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    # Stem the filtered tokens.
    stems = [stemmer.stem(t) for t in filtered_tokens]

    return stems


movies_df["plot_token_stem"]=movies_df["plot"].apply(tokenize_and_stem_together)
print (movies_df.columns)
print (movies_df.head())
print (movies_df.shape)

Index(['rank', 'title', 'genre', 'wiki_plot', 'imdb_plot', 'plot',
       'plot_token_stem'],
      dtype='object')
   rank                     title                                    genre  \
0     0             The Godfather                   [u' Crime', u' Drama']   
1     1  The Shawshank Redemption                   [u' Crime', u' Drama']   
2     2          Schindler's List  [u' Biography', u' Drama', u' History']   
3     3               Raging Bull    [u' Biography', u' Drama', u' Sport']   
4     4                Casablanca        [u' Drama', u' Romance', u' War']   

                                           wiki_plot  \
0  On the day of his only daughter's wedding, Vit...   
1  In 1947, banker Andy Dufresne is convicted of ...   
2  In 1939, the Germans move Polish Jews into the...   
3  In a brief scene in 1964, an aging, overweight...   
4  It is early December 1941. American expatriate...   

                                           imdb_plot  \
0  In late summer 1945
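
The filter in `tokenize_and_stem_together` keeps any token that contains at least one letter, which is slightly looser than "letters only". As a small illustration on made-up tokens (the list below is an assumption, not output from the data set), the sketch compares that filter with a stricter `re.fullmatch` variant that keeps only tokens made entirely of letters.

```python
import re

# Made-up tokens of the kind a word tokenizer can produce.
tokens = ["Vito", "'s", "1945", "n't", ",", "wedding"]

# Filter used in the cell above: keep tokens with at least one letter.
contains_letter = [t for t in tokens if re.search("[a-zA-Z]", t)]

# Stricter variant (assumption): keep tokens made of letters only.
letters_only = [t for t in tokens if re.fullmatch("[a-zA-Z]+", t)]

print(contains_letter)
print(letters_only)
```

Which variant is preferable depends on whether fragments such as `'s` and `n't` should count as terms; the stricter one drops them before stemming.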

## 5. Generate TF-IDF Vectorizer

In [15]:
# 1 point

# (7)-1 Use a TF-IDF vectorizer to create TF-IDF vectors.
# (7)-2 You can adjust a few parameters of the TfidfVectorizer object, such as removing stop words and including bigrams,
# for efficient processing of text.
# (7)-3 To filter out terms that are too frequent or too infrequent, you may add the parameters max_df and min_df
# and set them to 0.8 and 0.2 respectively.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.2, stop_words='english',
                                 use_idf=True, ngram_range=(1,2))
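
With 100 movies, `min_df=0.2` keeps only terms that appear in at least 20 plots, and `max_df=0.8` drops terms that appear in more than 80. The sketch below mimics that document-frequency filter in plain Python on a made-up corpus of five short "plots". The helper name `filter_by_document_frequency` is hypothetical, and the toy threshold `min_df=0.4` is an assumption chosen because 0.2 of five documents would keep every term.

```python
def filter_by_document_frequency(docs, min_df, max_df):
    """Return the sorted terms whose document proportion lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )

# Made-up stemmed "plots"; "war" occurs in all five, "stori" and "boss" in one.
toy_docs = [
    "war soldier famili",
    "war soldier love",
    "war crime famili",
    "war love stori",
    "war crime boss",
]
kept = filter_by_document_frequency(toy_docs, min_df=0.4, max_df=0.8)
print(kept)
```

Here `war` is removed for being too common and `stori` and `boss` for being too rare, which is the effect the two parameters have on the vocabulary.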

## 6. Fit_transform TF-IDF Vectorizer

In [18]:
# (8) Fit and transform the tfidf_vectorizer with the summed plot column for each movie
# to create a vector representation of the plot summaries (0.5 points)

# The plot_token_stem column holds lists of tokens, but TfidfVectorizer expects strings and raises an error on lists.
# To avoid the error, I joined each token list back into a single space-separated string.
movies_df["plot_token_stem_string"] = movies_df["plot_token_stem"].apply(lambda x: " ".join(x))
print(movies_df.columns)
print(movies_df.head())
print(movies_df.shape)

# Then apply vectorization to the string-formatted column.
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_df["plot_token_stem_string"])

print(tfidf_matrix.shape)

Index(['rank', 'title', 'genre', 'wiki_plot', 'imdb_plot', 'plot',
       'plot_token_stem', 'plot_token_stem_string'],
      dtype='object')
   rank                     title                                    genre  \
0     0             The Godfather                   [u' Crime', u' Drama']   
1     1  The Shawshank Redemption                   [u' Crime', u' Drama']   
2     2          Schindler's List  [u' Biography', u' Drama', u' History']   
3     3               Raging Bull    [u' Biography', u' Drama', u' Sport']   
4     4                Casablanca        [u' Drama', u' Romance', u' War']   

                                           wiki_plot  \
0  On the day of his only daughter's wedding, Vit...   
1  In 1947, banker Andy Dufresne is convicted of ...   
2  In 1939, the Germans move Polish Jews into the...   
3  In a brief scene in 1964, an aging, overweight...   
4  It is early December 1941. American expatriate...   

                                           imdb_plot
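
To make the weights inside `tfidf_matrix` less opaque, here is a minimal NumPy sketch of the computation on a made-up 3 x 3 count matrix. It assumes the `TfidfVectorizer` defaults that the cell above leaves unchanged: raw term counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The variable names are illustrative.

```python
import numpy as np

# Toy term-count matrix: 3 documents (rows) x 3 terms (columns).
counts = np.array([
    [2, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed idf; a term found in every document gets idf = 1.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf               # tf * idf
row_norms = np.linalg.norm(weighted, axis=1, keepdims=True)
tfidf = weighted / row_norms          # L2-normalize each row

print(np.round(tfidf, 3))
```

Because every row ends up with unit length, the dot product of two rows is already their cosine similarity.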

## 7. Calculate Similarity Index

In [19]:
# 1 point

# (9)-1
# Import cosine_similarity to calculate similarity of movie plots.
from sklearn.metrics.pairwise import cosine_similarity

# (9)-2 Get the TF-IDF matrix as a dense 2D array.
tf_array = tfidf_matrix.toarray()

# (9)-3 Calculate cosine similarity scores for all pairs of movies.
# The result should be a 100 by 100 matrix.
cosine_sim = cosine_similarity(tf_array, tf_array)
print(cosine_sim[:5, :5])
print()
print(cosine_sim.shape)

[[1.         0.19324956 0.22727177 0.25819769 0.26345827]
 [0.19324956 1.         0.2543143  0.17559654 0.23020865]
 [0.22727177 0.2543143  1.         0.18700026 0.28888393]
 [0.25819769 0.17559654 0.18700026 1.         0.1836903 ]
 [0.26345827 0.23020865 0.28888393 0.1836903  1.        ]]

(100, 100)
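
The printed block is symmetric with 1.0 on the diagonal, which follows from the definition: the cosine similarity of two vectors is their dot product divided by the product of their lengths. A minimal NumPy sketch on made-up 2D vectors (the helper name `cosine_similarity_matrix` is hypothetical and is not the scikit-learn function):

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms            # assumes no all-zero rows
    return unit @ unit.T

toy_vectors = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

The first and third toy vectors are perpendicular, so their similarity is 0, and each vector has similarity 1 with itself regardless of its length.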


## 8. Build a Small Recommendation Algorithm based on Similarity Score

In [20]:
# Define a function that returns the 10 movies most similar (by similarity score) to a focal movie
# that you type in. This part has been done for you.

# Generate mapping between titles and index.
indices = pd.Series(movies_df.index, index=movies_df['title']).drop_duplicates()
print (indices)
print (indices.shape)

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that corresponds to a title.
    idx = indices[title]
    
    # Pair each movie index with its similarity score to the focal movie.
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the scores from highest to lowest (descending).
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the focal movie itself).
    sim_scores = sim_scores[1:11]

    # Get the movie indices.
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the 10 most similar movies.
    result = movies_df['title'].iloc[movie_indices]

    return result

title
The Godfather                0
The Shawshank Redemption     1
Schindler's List             2
Raging Bull                  3
Casablanca                   4
                            ..
Rebel Without a Cause       95
Rear Window                 96
The Third Man               97
North by Northwest          98
Yankee Doodle Dandy         99
Length: 100, dtype: int64
(100,)


## 9. Which movies are most similar to a focal movie?

In [23]:
# e.g. what are the top 10 similar movies for the movie 'Saving Private Ryan'?

print(get_recommendations('Saving Private Ryan', cosine_sim, indices))

55                                              Platoon
62                       All Quiet on the Western Front
61                                      The Deer Hunter
6                                    Gone with the Wind
24                         The Bridge on the River Kwai
48                                               Patton
79                                                Shane
0                                         The Godfather
29    Dr. Strangelove or: How I Learned to Stop Worr...
38                              Raiders of the Lost Ark
Name: title, dtype: object
