# Movie recommendation system

In this notebook, we present a movie recommendation system based on the *cosine similarity* of movie descriptions.  
The dataset consists of 15065 movie titles and their descriptions. We use it to find the movies that are most similar to *Harry Potter and the Sorcerer's Stone*.

To turn the processed text into numeric vectors, we use the `TfidfVectorizer` class from the `sklearn` package.

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import re
from collections import defaultdict
from langdetect import detect

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag

## Loading and preprocessing data

In [6]:
movies = pd.read_csv('MovieData.csv')

In [8]:
movies.head(10)

Unnamed: 0,title,desc
0,Sorrowful Jones,Sorrowful Jones is a New York bookie who keeps...
1,South of St. Louis,"During the Civil War, Kip Davis (Joel McCrea),..."
2,Stampede,"Two brothers, Mike McCall (Rod Cameron) and Ti..."
3,State Department: File 649,"Kenneth Seeley (William Lundigan), member of t..."
4,Strange Bargain,"Because the firm is bankrupt, bookkeeper Sam W..."
5,The Stratton Story,Texas farm boy Monty Stratton (Stewart) demons...
6,Streets of Laredo,"A trio of outlaws, Jim Dawkins (Holden), Loren..."
7,Streets of San Francisco,Frankie Fraser finds out his father Luke has c...
8,The Sun Comes Up,Ex-opera singer Helen Lorfield Winter (Jeanett...
9,Sword in the Desert,Freighter owner and captain Mike Dillon (Dana ...


From this data we keep only the movies whose descriptions are written in English and are sufficiently long (more than `text_limit` characters). Duplicate titles are dropped as well.

In [11]:
text_limit = 1000
lang = 'en'

In [13]:
movies['desc'] = movies['desc'].fillna('')
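# Keep descriptions longer than text_limit characters, one row per title
movies = movies[movies['desc'].str.len() > text_limit]
movies = movies.drop_duplicates(subset=['title']).copy()
# Detect each description's language and keep only the rows matching lang;
# copy() avoids SettingWithCopyWarning on the column edits around this step
movies['language'] = movies['desc'].apply(detect)
movies = movies[movies['language'] == lang].copy()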
movies.drop('language', axis=1, inplace=True)
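
The filtering steps above can be tried on a toy table first. The sketch below is only an illustration under stated assumptions: `filter_descriptions`, `safe_detect`, the toy rows, and the stand-in detector `fake_detect` are hypothetical names that do not appear in the notebook. The detector is passed in as a parameter, so `langdetect.detect` could be supplied in its place; the wrapper treats any exception from the detector as "unknown language" instead of stopping the whole run.

```python
import pandas as pd


def filter_descriptions(df, min_chars, detect_fn, lang="en"):
    """Keep rows with long enough descriptions, unique titles, and the wanted language."""
    out = df.copy()
    out["desc"] = out["desc"].fillna("")
    out = out[out["desc"].str.len() > min_chars]
    out = out.drop_duplicates(subset=["title"])

    def safe_detect(text):
        # A detector may raise on text it cannot classify;
        # such rows are treated as "unknown language" and dropped.
        try:
            return detect_fn(text)
        except Exception:
            return None

    keep = out["desc"].apply(safe_detect) == lang
    return out[keep]


# Toy data and a stand-in detector (both hypothetical, for illustration only).
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D"],
    "desc": [
        "a long english text",
        "duplicate title text",
        "ein langer deutscher text",
        None,
        "short",
    ],
})
fake_detect = lambda text: "de" if "deutscher" in text else "en"

filtered = filter_descriptions(toy, min_chars=10, detect_fn=fake_detect)
print(filtered["title"].tolist())
```

Only the first row survives here: the second shares its title with the first, the third is flagged as German by the stand-in detector, and the last two are too short.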

We load the reference description, on which the recommendations are based.

In [16]:
main_description = ''
with open('Harry Potter description.txt','r') as f:
    main_description = f.read()

print(main_description)

Ten years later, just before Harry's eleventh birthday, owls begin delivering letters addressed to him. When the abusive Dursleys adamantly refuse to allow Harry to open any and flee to an island hut, Hagrid arrives to personally deliver Harry's letter of acceptance to Hogwarts. Hagrid also reveals that Harry's late parents, James and Lily, were killed by a dark wizard named Lord Voldemort. The killing curse that Voldemort had cast towards Harry rebounded, destroying Voldemort's body and giving Harry the lightning-bolt scar on his forehead. Hagrid then takes Harry to Diagon Alley for school supplies and gives him a pet snowy owl whom he names Hedwig. Harry buys a wand that is connected to Voldemort's own wand. At King's Cross, Harry boards the Hogwarts Express train, and meets fellow first-years Ron Weasley and Hermione Granger during the journey. Arriving at Hogwarts, Harry also meets Draco Malfoy, who is from a wealthy wizard family; the two immediately form a rivalry. The students a

## Processing text

The main library that we use for text processing is `nltk`.

In [19]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Korisnik\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Korisnik\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Korisnik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Korisnik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
stops = set(stopwords.words('english'))  # a set makes the membership test in process_text fast
lemma = WordNetLemmatizer()

In [23]:
tag_map = defaultdict(lambda: 'n')
tag_map['J'] = 'a'
tag_map['V'] = 'v'
tag_map['R'] = 'r'
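
To make the role of `tag_map` concrete, here is a small sketch of how the first letter of a Penn Treebank tag (the tag set returned by `pos_tag` by default) is mapped to a WordNet part-of-speech code. The helper name `to_wordnet_pos` and the list of example tags are ours, chosen only for illustration.

```python
from collections import defaultdict

tag_map = defaultdict(lambda: "n")  # anything unlisted is treated as a noun
tag_map["J"] = "a"  # JJ, JJR, JJS -> adjective
tag_map["V"] = "v"  # VB, VBD, VBG, ... -> verb
tag_map["R"] = "r"  # RB, RBR, RBS -> adverb (RP, the particle tag, lands here too)


def to_wordnet_pos(penn_tag):
    # Only the first letter of the tag decides the WordNet category.
    return tag_map[penn_tag[0]]


examples = ["NNS", "VBD", "JJ", "RB", "IN", "RP"]
mapped = {tag: to_wordnet_pos(tag) for tag in examples}
print(mapped)
```

Tags without an explicit entry, such as `IN` (preposition), fall back to `'n'`, so the lemmatizer treats those tokens as nouns.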

In [25]:
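def get_wordnet_tags(tokens):
    # Tag each token, then map the first letter of its Penn Treebank tag
    # to a WordNet POS code (nouns by default)
    tagged_tokens = pos_tag(tokens)
    return [(token, tag_map[tag[0]]) for token, tag in tagged_tokens]

def process_text(text):
    # Lowercase and replace everything except a-z and spaces with a space
    text = text.lower()
    text = re.sub("[^a-z ]", " ", text)
    tokens = word_tokenize(text)
    # Lemmatize each token using its POS, then drop stop words and short tokens
    tokens = get_wordnet_tags(tokens)
    tokens = [lemma.lemmatize(word=word, pos=pos) for word, pos in tokens]
    tokens = [word for word in tokens if word not in stops and len(word) > 2]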
    return tokens
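
The effect of the cleaning steps in `process_text` is easiest to see on a short sentence. The sketch below is a simplified stand-in, not the notebook's function: it skips lemmatization, splits on whitespace instead of calling `word_tokenize`, and uses a tiny hypothetical stop list (`toy_stops`) in place of NLTK's English list.

```python
import re

# Tiny stand-in stop list (hypothetical); the notebook uses NLTK's English list.
toy_stops = {"the", "and", "his"}


def clean_tokens(text):
    text = text.lower()
    # Punctuation, digits, and apostrophes all become spaces.
    text = re.sub("[^a-z ]", " ", text)
    # Whitespace split used here instead of word_tokenize.
    tokens = text.split()
    # Drop stop words and fragments of one or two letters.
    return [t for t in tokens if t not in toy_stops and len(t) > 2]


cleaned = clean_tokens("Harry's 11th birthday, and his owl!")
print(cleaned)
```

Because the apostrophe and the digits are replaced by spaces, `Harry's` and `11th` leave the fragments `s` and `th` behind; the `len(t) > 2` filter is what removes them.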

In [27]:
movies['processed_desc'] = movies['desc'].apply(process_text)

In [29]:
processed_main_desc = process_text(main_description)

In [31]:
print(processed_main_desc)

['ten', 'year', 'later', 'harry', 'eleventh', 'birthday', 'owls', 'begin', 'deliver', 'letter', 'address', 'abusive', 'dursleys', 'adamantly', 'refuse', 'allow', 'harry', 'open', 'flee', 'island', 'hut', 'hagrid', 'arrive', 'personally', 'deliver', 'harry', 'letter', 'acceptance', 'hogwarts', 'hagrid', 'also', 'reveal', 'harry', 'late', 'parent', 'james', 'lily', 'kill', 'dark', 'wizard', 'name', 'lord', 'voldemort', 'kill', 'curse', 'voldemort', 'cast', 'towards', 'harry', 'rebound', 'destroy', 'voldemort', 'body', 'give', 'harry', 'lightning', 'bolt', 'scar', 'forehead', 'hagrid', 'take', 'harry', 'diagon', 'alley', 'school', 'supply', 'give', 'pet', 'snowy', 'owl', 'name', 'hedwig', 'harry', 'buy', 'wand', 'connect', 'voldemort', 'wand', 'king', 'cross', 'harry', 'board', 'hogwarts', 'express', 'train', 'meet', 'fellow', 'first', 'year', 'ron', 'weasley', 'hermione', 'granger', 'journey', 'arrive', 'hogwarts', 'harry', 'also', 'meet', 'draco', 'malfoy', 'wealthy', 'wizard', 'family'

## TF-IDF Vectorizer

In [34]:
vectorizer = TfidfVectorizer()

vectorised_data = vectorizer.fit_transform(movies['processed_desc'].apply(' '.join))
main_vector = vectorizer.transform([' '.join(processed_main_desc)])
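
The cell above fits the vectorizer on the movie descriptions and only *transforms* the reference description. A small sketch on a made-up corpus shows what that implies: the vocabulary and IDF weights come from the fitted corpus alone, so a query term that never occurs there is ignored. The corpus and the query string are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus standing in for the processed movie descriptions.
corpus = [
    "wizard school wand owl",
    "detective murder city night",
    "wizard dragon castle",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)  # learns vocabulary and IDF weights
query_vector = vectorizer.transform(["wizard wand spaceship"])  # reuses them

print(doc_vectors.shape)                      # (documents, distinct terms)
print("spaceship" in vectorizer.vocabulary_)  # unseen term is not in the vocabulary
print(query_vector.nnz)                       # non-zero entries in the query vector
```

Only `wizard` and `wand` contribute to the query vector; `spaceship` has no column. The same holds in the notebook for any word that appears in the reference description but in none of the movie descriptions.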

## Cosine Similarity

With all of the text processed, we are ready to compare movie descriptions.

In [37]:
cosine_similarities = cosine_similarity(main_vector, vectorised_data).flatten()

movies['cosine_similarities'] = cosine_similarities

recommended_movies = movies.sort_values(by='cosine_similarities', ascending=False)
recommended_movies[['title', 'cosine_similarities']].head(10)

Unnamed: 0,title,cosine_similarities
8863,Harry Potter and the Sorcerer's Stone,0.697224
11041,Harry Potter and the Deathly Hallows: Part 2,0.608518
9064,Harry Potter and the Chamber of Secrets,0.549352
10846,Harry Potter and the Deathly Hallows: Part 1,0.516672
10226,Harry Potter and the Order of the Phoenix,0.475953
10649,Harry Potter and the Half-Blood Prince,0.472367
9668,Harry Potter and the Goblet of Fire,0.424314
883,Houdini,0.402911
808,The Bigamist,0.387794
7957,Deconstructing Harry,0.384933
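
For reference, each score in the table is the cosine of the angle between two TF-IDF vectors, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. Since `TfidfVectorizer` L2-normalizes its rows by default, this reduces to a plain dot product. The sketch below computes the score by hand and splits it into per-term contributions, which is one way to check which shared words drive a match. The vocabulary and the weights are invented for illustration and are not taken from the dataset.

```python
import numpy as np


def cosine(u, v):
    # Cosine of the angle between u and v; 0.0 if either vector is all zeros.
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Hypothetical toy vocabulary and made-up term weights.
vocab = ["harry", "wizard", "wand", "marriage"]
query = np.array([0.8, 0.5, 0.3, 0.0])
movie = np.array([0.9, 0.0, 0.0, 0.4])

score = cosine(query, movie)

# Each term's share of the score; the shares add up to the score itself.
contrib = query * movie / (np.linalg.norm(query) * np.linalg.norm(movie))
top_term = vocab[int(np.argmax(contrib))]

print(round(score, 3), top_term)
```

In this toy case the whole score comes from the single shared term `harry`, which shows how a frequent character name can produce a fairly high similarity between otherwise unrelated descriptions.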


## Conclusion

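As the results show, the first seven places are taken by movies from the *Harry Potter* franchise, and the first installment, which was used as the reference, takes first place. The eighth place goes to *Houdini*, a movie that also has *magic* as its main theme. The last two entries, *The Bigamist* and *Deconstructing Harry*, are not fantasy movies; their scores may be driven by shared tokens such as the name *harry* rather than by theme. Overall, we conclude that cosine similarity over TF-IDF vectors of the descriptions gives strong results: it recovers the entire franchise ahead of every other title in the dataset.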