# Content-based movie recommendation using similarity measures

Goal: recommend movies based on how close (similar) two movies are to each other.

Recommendation systems suffer from the cold start problem: a new user has not rated any movies yet, so there is no rating history to build on.

One remedy is to use the data the new user has filled in about themselves (for example, that they like comedy, or a particular movie).

If we can measure how close two movies are and relate that to the user data, we can recommend a top 10 list of movies to that user.

Credits to https://www.geeksforgeeks.org/movie-recommender-based-on-plot-summary-using-tf-idf-vectorization-and-cosine-similarity/ for the code reference. The code is adapted here to our scenario and applied to both movie descriptions and genres.

**Author** : Akshaya , **Date** : 11/10/2020

In [1]:
import pandas as pd
import numpy as np
import nltk 
from nltk.stem import WordNetLemmatizer  
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer 
nltk.download('stopwords') 
nltk.download('punkt') 
nltk.download('averaged_perceptron_tagger') 
nltk.download('wordnet') 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\20204321\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\20204321\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\20204321\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\20204321\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
lemmatizer = WordNetLemmatizer() 
stop_words = set(stopwords.words('english')) 
VERB_CODES = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'} 
tfidfvec = TfidfVectorizer() 

In [3]:
df_movie = pd.read_csv("Movies.csv", delimiter =";")



# Based on movie description

In [4]:
# Work on an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
df = df_movie[["title", "description"]].copy()
# Note: astype(str) turns a missing description into the literal string "nan"
df["description"] = df["description"].astype(str)


Preprocess the movie descriptions: lowercase, tokenize, lemmatize, and remove stop words and punctuation.

In [5]:
def preprocess_sentences(text):
    """Lowercase, tokenize, and lemmatize text; drop stop words and non-alphabetic tokens."""
    text = text.lower()
    temp_sent = []
    words = nltk.word_tokenize(text)
    tags = nltk.pos_tag(words)
    for word, (_, tag) in zip(words, tags):
        # Lemmatize verbs as verbs; everything else uses the default (noun) lemmatizer
        if tag in VERB_CODES:
            lemmatized = lemmatizer.lemmatize(word, 'v')
        else:
            lemmatized = lemmatizer.lemmatize(word)
        if lemmatized not in stop_words and lemmatized.isalpha():
            temp_sent.append(lemmatized)

    # Contractions need no extra handling: word_tokenize splits them into pieces
    # such as "n't" and "'s", which the isalpha() check above already discards.
    return ' '.join(temp_sent)

df["desc_processed"] = df["description"].apply(preprocess_sentences)
df.head()




Unnamed: 0,title,description,desc_processed
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",lead woody andy toy live happily room andy bir...
1,Jumanji,When siblings Judy and Peter discover an encha...,sibling judy peter discover enchant board game...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,family wedding reignite ancient feud neighbor ...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",cheat mistreat step woman hold breath wait elu...
4,Father of the Bride Part II,Just when George Banks has recovered from his ...,george bank recover daughter wedding receive n...


Compute TF-IDF: the weight of a word in a movie is its term frequency in that movie's description multiplied by its inverse document frequency, so words that occur in many movies are weighted down.
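
As a rough illustration of that weighting, the sketch below recomputes the TF-IDF weights by hand for a toy corpus and compares them with `TfidfVectorizer`. The three documents are made up for the example. The formula assumes scikit-learn's default settings (raw term counts, smoothed IDF, L2 normalisation of each row); other settings would give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): three already-preprocessed "descriptions"
docs = [
    "toy cowboy toy room",
    "spy mission agent",
    "toy spy gadget",
]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
tfidf = vec.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column position in the TF-IDF matrix
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Recompute the same weights by hand
counts = np.array([[doc.split().count(term) for term in vocab] for doc in docs])
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)                # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
manual = counts * idf                              # term frequency x IDF
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

print(np.allclose(tfidf, manual))
```

A term such as "toy", which appears in two of the three documents, gets a lower IDF than "cowboy", which appears in only one.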

In [6]:
# Vectorize the pre-processed movie plots using TF-IDF
# Keep only the first 5000 movies: the full pairwise cosine-similarity matrix does not fit in memory
df_copy = df[0:5000]

# Fit on all descriptions so the IDF weights reflect the whole corpus, then keep the first 5000 rows
tfidf_movieid = tfidfvec.fit_transform(df["desc_processed"])
tfidf_movieid = tfidf_movieid[0:5000]

In [7]:
# Finding cosine similarity between vectors  
from sklearn.metrics.pairwise import cosine_similarity 
cos_sim = cosine_similarity(tfidf_movieid, tfidf_movieid) 
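
The 5000-movie limit comes from building the full pairwise matrix at once. One possible workaround, sketched here on made-up data, is to compute a single row of similarities only when a movie is queried. The helper name `top_k_similar` is hypothetical. The sketch relies on `TfidfVectorizer` L2-normalising its rows by default, so the dot product from `linear_kernel` equals the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy stand-in for the processed descriptions (hypothetical data)
docs = [
    "spy agent secret mission",
    "secret agent spy gadget",
    "toy cowboy room",
    "toy room birthday",
]
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix, rows are L2-normalised

def top_k_similar(row, matrix, k=10):
    """Return the positions of the k rows most similar to `row` (hypothetical helper)."""
    # One row of scores instead of an n x n matrix
    scores = linear_kernel(matrix[row], matrix).ravel()
    order = np.argsort(scores)[::-1]   # highest score first
    order = order[order != row]        # never recommend the query itself
    return order[:k].tolist()

print(top_k_similar(0, tfidf, k=1))
```

This trades one large up-front computation for a small one per query, which may or may not suit the intended use.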

In [8]:
# Titles of the first 5000 movies; the index labels match the row positions in cos_sim
indices = pd.Series(df_copy.title)

def recommendations(title, cosine_sim=cos_sim):
    # Row of the first movie with this title (raises IndexError if the title is not in df_copy)
    index = indices[indices == title].index[0]
    similarity_scores = pd.Series(cosine_sim[index]).sort_values(ascending=False)
    # Drop the queried movie itself, then keep the 10 most similar movies
    top_10_movies = similarity_scores.drop(index).iloc[:10].index
    return indices.iloc[top_10_movies].tolist()

In [9]:
recommendations("GoldenEye")

['Live and Let Die',
 'Licence to Kill',
 'Dr. No',
 'The Way of the Dragon',
 'Thunderball',
 'Phantasm',
 'Diamonds Are Forever',
 'From Russia with Love',
 'Into the Arms of Strangers: Stories of the Kindertransport',
 'A View to a Kill']

In [10]:
recommendations("Ace Ventura: When Nature Calls")

['The Adventures of Milo and Otis',
 'The Golden Child',
 'Diamonds',
 'Kandahar',
 'Princess Mononoke',
 'The Saltmen of Tibet',
 'Ace Ventura: Pet Detective',
 'Pearl Harbor',
 'Top Gun',
 'Jungle 2 Jungle']

In [11]:
recommendations("Toy Story")

['Toy Story 2',
 'Man on the Moon',
 'Rebel Without a Cause',
 'Condorman',
 'Bound for Glory',
 "Losin' It",
 'Malice',
 'The Sunchaser',
 'Indecent Proposal',
 "Child's Play 3"]

In [12]:
df_copy.head(30)

Unnamed: 0,title,description,desc_processed
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",lead woody andy toy live happily room andy bir...
1,Jumanji,When siblings Judy and Peter discover an encha...,sibling judy peter discover enchant board game...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,family wedding reignite ancient feud neighbor ...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",cheat mistreat step woman hold breath wait elu...
4,Father of the Bride Part II,Just when George Banks has recovered from his ...,george bank recover daughter wedding receive n...
5,Heat,"Obsessive master thief, Neil McCauley leads a ...",obsessive master thief neil mccauley lead crew...
6,Sabrina,An ugly duckling having undergone a remarkable...,ugly duckling undergone remarkable change stil...
7,Tom and Huck,"A mischievous young boy, Tom Sawyer, witnesses...",mischievous young boy tom sawyer witness murde...
8,Sudden Death,International action superstar Jean Claude Van...,international action superstar jean claude van...
9,GoldenEye,James Bond must unmask the mysterious head of ...,james bond must unmask mysterious head janus s...


Calculating based on genre

In [13]:
df_movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46733 entries, 0 to 46732
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    46733 non-null  object
 1   title                 46717 non-null  object
 2   tagline               20857 non-null  object
 3   description           45687 non-null  object
 4   genres                42310 non-null  object
 5   keywords              30515 non-null  object
 6   date                  44619 non-null  object
 7   collection            4466 non-null   object
 8   runtime               44450 non-null  object
 9   revenue               44712 non-null  object
 10  budget                44709 non-null  object
 11  director              43851 non-null  object
 12  cast                  42363 non-null  object
 13  production_companies  33038 non-null  object
 14  production_countries  38547 non-null  object
 15  popularity            44712 non-null

# Based on movie genres

In [14]:
df_genre =df_movie[['title','genres']]

In [15]:
df_genre.head()

Unnamed: 0,title,genres
0,Toy Story,"animation, comedy, family"
1,Jumanji,"adventure, fantasy, family"
2,Grumpier Old Men,"romance, comedy"
3,Waiting to Exhale,"comedy, drama, romance"
4,Father of the Bride Part II,comedy


In [16]:
df_genre = df_genre.astype(str)
df_genre['processed_genres'] = df_genre["genres"].apply(preprocess_sentences) 

In [17]:
df_genre.head()

Unnamed: 0,title,genres,processed_genres
0,Toy Story,"animation, comedy, family",animation comedy family
1,Jumanji,"adventure, fantasy, family",adventure fantasy family
2,Grumpier Old Men,"romance, comedy",romance comedy
3,Waiting to Exhale,"comedy, drama, romance",comedy drama romance
4,Father of the Bride Part II,comedy,comedy


In [18]:
tfidf_genreid = tfidfvec.fit_transform((df_genre['processed_genres']))

In [19]:
tfidf_genreid.shape

(46733, 182)

In [20]:
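tfidf_genreid_copy = tfidf_genreid[0:5000]
df_genre_copy = df_genre[0:5000]
cos_sim_genre = cosine_similarity(tfidf_genreid_copy, tfidf_genreid_copy)
# Titles aligned with the rows of cos_sim_genre
indices_genre = pd.Series(df_genre_copy.title)

def recommendations_genre(title, cosine_sim=cos_sim_genre):
    # Row of the first movie with this title (raises IndexError if the title is not in df_genre_copy)
    index = indices_genre[indices_genre == title].index[0]
    # Use the cosine_sim argument rather than the global matrix so the parameter takes effect
    similarity_scores = pd.Series(cosine_sim[index]).sort_values(ascending=False)
    # Drop the queried movie itself, then keep the 10 most similar movies
    top_10_movies = similarity_scores.drop(index).iloc[:10].index
    return indices_genre.iloc[top_10_movies].tolist()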

In [21]:
recommendations_genre("Toy Story")

['Chicken Run',
 'Meet the Deedles',
 'Oliver & Company',
 'A Close Shave',
 'The Wrong Trousers',
 'The Great Mouse Detective',
 'Toy Story 2',
 'Monsters, Inc.',
 'Creature Comforts',
 "Doug's 1st Movie"]

In [22]:
recommendations_genre("Jumanji")

['The Wizard of Oz',
 'The Indian in the Cupboard',
 'Labyrinth',
 "Harry Potter and the Philosopher's Stone",
 'Return to Oz',
 'The NeverEnding Story',
 'The Neverending Story II: The Next Chapter',
 'Hook',
 'Herbie Goes Bananas',
 'Little Monsters']

In [23]:
recommendations_genre("GoldenEye")

['The Spy Who Loved Me',
 'Speed 2: Cruise Control',
 'Thunderball',
 'Street Fighter',
 'Live and Let Die',
 'Knock Off',
 'Licence to Kill',
 'For Your Eyes Only',
 'Space Cowboys',
 'Firestorm']

In [24]:
recommendations_genre("Heat")

["Heaven's Burning",
 'Gloria',
 'Payback',
 'Original Gangstas',
 'Training Day',
 'Kill Me Again',
 'Mercury Rising',
 'Romeo Is Bleeding',
 'Code of Silence',
 'Get Carter']

In [27]:
df_movie[["title","genres"]].head(6)

Unnamed: 0,title,genres
0,Toy Story,"animation, comedy, family"
1,Jumanji,"adventure, fantasy, family"
2,Grumpier Old Men,"romance, comedy"
3,Waiting to Exhale,"comedy, drama, romance"
4,Father of the Bride Part II,comedy
5,Heat,"action, crime, drama, thriller"
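
The cold-start remedy described in the introduction (relating movie content to the data a new user has filled in) could be sketched roughly as follows. This is only an illustration under assumptions: the helper `recommend_for_new_user` and the four-row catalogue are hypothetical, and the user data is assumed to be a plain list of liked genres.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue in the same shape as df_genre (hypothetical rows)
movies = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Heat"],
    "genres": ["animation, comedy, family", "adventure, fantasy, family",
               "romance, comedy", "action, crime, drama, thriller"],
})

vectorizer = TfidfVectorizer()
genre_matrix = vectorizer.fit_transform(movies["genres"])

def recommend_for_new_user(liked_genres, top_n=10):
    """Rank movies by similarity to the genres a new user says they like (hypothetical helper)."""
    # Project the stated preferences into the same TF-IDF space as the movies
    user_vector = vectorizer.transform([" ".join(liked_genres)])
    scores = cosine_similarity(user_vector, genre_matrix).ravel()
    ranked = pd.Series(scores, index=movies["title"]).sort_values(ascending=False)
    # Keep only movies that share at least one genre with the user's preferences
    return ranked[ranked > 0].head(top_n).index.tolist()

print(recommend_for_new_user(["comedy", "romance"]))
```

The same idea would apply to a liked movie instead of liked genres: look up that movie's row and reuse the similarity functions defined earlier.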
