## Importing Libraries

Importing all libraries at the top so they can easily be listed in requirements.txt when setting up a new virtualenv.

In [1]:
import pandas as pd
#import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

base_path = 'data/'

## Working with Movies dataset

This is the main dataset, from which we only need the movie id and title. The other attributes are not required in this system.

This is an old dataset and the picture URLs no longer work; otherwise I could have used them to display movie posters.

In [2]:
movies = pd.read_table(base_path+'movies.dat', encoding="ISO-8859-1")
movies = movies[['id', 'title']]
print(movies)

          id                        title
0          1                    Toy story
1          2                      Jumanji
2          3               Grumpy Old Men
3          4            Waiting to Exhale
4          5  Father of the Bride Part II
...      ...                          ...
10192  65088              Bedtime Stories
10193  65091          Manhattan Melodrama
10194  65126                        Choke
10195  65130           Revolutionary Road
10196  65133      Blackadder Back & Forth

[10197 rows x 2 columns]


## Working with Movie Genre

One of the main datasets for our analysis.

Checking whether each genre is a single word or multiple words.

In [3]:
movie_genres = pd.read_table(base_path+'movie_genres.dat', encoding="ISO-8859-1")
movie_genres['genre'].str.contains(' ').value_counts()

False    20809
Name: genre, dtype: int64

Since each genre is a single word, we can easily convert the genres into a list for each movie.

Used groupby with an aggregate function (a lambda that builds a list), which collects the genres of each movie without an explicit Python loop.

In [4]:
movie_genres = movie_genres.groupby('movieID').agg(lambda x: list(x)).reset_index()
movies = movies.merge(movie_genres, left_on='id', right_on='movieID')
movies = movies.drop('movieID', axis=1)
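
As a minimal sketch of the grouping step, the toy frames below (hypothetical rows, not read from the dataset files) show how groupby with a list aggregate collapses one row per movie and genre pair into one list per movie, and how the merge attaches that list to the titles. Passing `list` directly is assumed here to be equivalent to the lambda used above.

```python
import pandas as pd

# Hypothetical toy data: one row per (movie, genre) pair
toy_genres = pd.DataFrame({
    "movieID": [1, 1, 2, 2, 2],
    "genre": ["Animation", "Comedy", "Adventure", "Children", "Fantasy"],
})
toy_movies = pd.DataFrame({"id": [1, 2], "title": ["Toy story", "Jumanji"]})

# Collapse to one row per movie, with the genres gathered into a list
grouped = toy_genres.groupby("movieID").agg(list).reset_index()

# Attach the genre lists to the titles and drop the duplicate key column
merged = toy_movies.merge(grouped, left_on="id", right_on="movieID")
merged = merged.drop("movieID", axis=1)
print(merged)
```

Because the merge is an inner join by default, a movie with no genre rows would not appear in the merged frame.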

## Working with Movie Actors dataset
I am not considering ranking in this analysis.

I am also using actorID instead of actorName because actorName has a few missing values. I could drop the missing values, but I chose actorID because both attributes carry the same information.

In [5]:
movie_actors = pd.read_table(base_path+'movie_actors.dat', encoding="ISO-8859-1")
movie_actors = movie_actors[['movieID', 'actorID']].copy()  # copy so the later column assignment does not warn
#print(movie_actors.dtypes)

No missing values

In [6]:
print(movie_actors.actorID.isna().sum())
#movie_actors.dropna(inplace=True)

0


A transform function that removes the separator (an underscore here) between the parts of a name.

It then capitalizes the parts and joins them into a single word.

If the parts of an actor's name were kept separate, each part would be treated as a separate name, which would affect the results.

In [7]:
def transform(val, sep):
    # Values without the separator are returned unchanged
    if sep not in val:
        return val
    # Capitalize each part and join the parts into a single word
    return ''.join(part.capitalize() for part in val.split(sep))
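
To make the behaviour concrete, here is a small sketch that restates the function so it runs on its own and applies it to a few made-up identifiers (the inputs are illustrative, not taken from the dataset).

```python
def transform(val, sep):
    """Join the capitalized parts of val when it contains sep; otherwise return val unchanged."""
    if sep not in val:
        return val
    return "".join(part.capitalize() for part in val.split(sep))

# Illustrative inputs (made up for this sketch)
examples = ["tom_hanks", "annmargret", "robert_de_niro"]
results = [transform(name, "_") for name in examples]
print(results)  # ['TomHanks', 'annmargret', 'RobertDeNiro']
```

Note that a value without the separator is returned as-is, so it is not capitalized, and that `str.capitalize` lowercases everything after the first character of each part.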

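Applying the transform function with the apply method, which calls it on each value without an explicit loop. Note that apply with a Python function runs element by element rather than as a vectorized operation.

Then grouping the actors into a list per movie, merging it with the movies dataset and removing the additional movieID column.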

In [8]:
movie_actors['actorID'] = movie_actors['actorID'].apply(transform, sep="_")
movie_actors = movie_actors.groupby('movieID').agg(lambda x: list(x)).reset_index()
movies = movies.merge(movie_actors, left_on='id', right_on='movieID')
movies = movies.drop('movieID', axis=1)


## Working with Movie Directors dataset

Similar transformations have been applied to this dataset as well.

Then merged into main movies dataset.

In [9]:
movie_directors = pd.read_csv(base_path+'movie_directors.dat', encoding="ISO-8859-1", sep="\t")
print(movie_directors.directorID.isna().sum())
print(movie_directors.directorName.isna().sum())

0
0


In [10]:
movie_directors = movie_directors[['movieID', 'directorID']].copy()  # copy so the column assignment below does not warn
movie_directors['directorID'] = movie_directors['directorID'].apply(transform, sep="_")
movie_directors = movie_directors.groupby('movieID').agg(lambda x: list(x)).reset_index()
movies = movies.merge(movie_directors, left_on='id', right_on='movieID')
movies = movies.drop('movieID', axis=1)


## Working with Movie Tags dataset

This is another main dataset, which has keywords related to movies. It is very important for finding similar movies.

In [11]:
tags = pd.read_csv(base_path+'tags.dat', encoding="ISO-8859-1", sep="\t")
movie_tags = pd.read_csv(base_path+'movie_tags.dat', encoding="ISO-8859-1", sep="\t")

movie_tags = movie_tags.merge(tags, left_on='tagID', right_on='id')
#movie_tags['tagWeight'].value_counts()
movie_tags = movie_tags[['movieID', 'value']].copy()  # copy so the later column assignment does not warn


Checking the number of unique movies in this dataset, and whether tags consist of a single word or multiple words.

As the output shows, only 7155 movies have tags, whereas the full dataset has more than 10 thousand movies.

Because tags are important data, I am merging them anyway, which leaves at most 7155 movie records for further analysis.

Alternatively, we could go forward with just the genre, actors and directors datasets and keep more movies.

In [12]:
print(movie_tags['movieID'].nunique())
print(movie_tags['value'].str.contains(' ').value_counts())

7155
False    26936
True     24859
Name: value, dtype: int64
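
The loss of untagged movies comes from the default inner join in merge. The sketch below uses hypothetical toy frames to contrast it with a left join, which would keep every movie and leave the tag list empty where none exists. This is an alternative shown for illustration, not what the notebook does.

```python
import pandas as pd

# Hypothetical toy data: movie 3 has no tags
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
toy_tags = pd.DataFrame({"movieID": [1, 2], "value": [["pixar"], ["boardgame"]]})

# Default merge is an inner join: movie 3 disappears
inner = toy_movies.merge(toy_tags, left_on="id", right_on="movieID")

# A left join keeps every movie; untagged ones get NaN, replaced here by an empty list
left = toy_movies.merge(toy_tags, left_on="id", right_on="movieID", how="left")
left["value"] = left["value"].apply(lambda v: v if isinstance(v, list) else [])

print(len(inner), len(left))  # 2 3
```

With the left join, untagged movies would still be compared on genres, actors and directors, at the cost of sparser detail strings for those rows.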


In [13]:
movie_tags['value'] = movie_tags['value'].apply(transform, sep=" ")
movie_tags = movie_tags.groupby('movieID').agg(lambda x: list(x))
movie_tags = movie_tags.reset_index()
movies = movies.merge(movie_tags, left_on='id', right_on='movieID')
movies = movies.drop('movieID', axis=1)
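
Joining multi-word tags matters because of how CountVectorizer tokenizes text later on. As a rough sketch with made-up tag strings, the default tokenizer splits a tag containing a space into two unrelated tokens, while the joined form survives as one token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tag strings: the same tag kept as two words and joined into one
docs = ["time travel", "TimeTravel"]

vec = CountVectorizer()
vec.fit(docs)

# The default tokenizer lowercases the text and keeps runs of two or more word characters
vocab = sorted(vec.vocabulary_)
print(vocab)  # ['time', 'timetravel', 'travel']
```

Without the join, a movie tagged "time travel" would match any other movie whose details contain "time" or "travel" on their own.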


###### Finally we are concatenating genres, actors, directors and movie tags as movie details.

###### And then converting it back to text from a list.

In [14]:
movies['details'] = movies['genre']+movies['actorID']+movies['directorID']+movies['value']
movies_final = movies.drop(['genre', 'actorID', 'directorID', 'value'], axis=1)
movies_final['details'] = movies_final['details'].apply(lambda x: ' '.join(x))
print(movies_final)

         id                        title  \
0         1                    Toy story   
1         2                      Jumanji   
2         3               Grumpy Old Men   
3         5  Father of the Bride Part II   
4         6                         Heat   
...     ...                          ...   
7114  64993       Byôsoku 5 senchimêtoru   
7115  65006                      Impulse   
7116  65037                        Ben X   
7117  65126                        Choke   
7118  65130           Revolutionary Road   

                                                details  
0     Adventure Animation Children Comedy Fantasy An...  
1     Adventure Children Fantasy 1135379-peterBryant...  
2     Comedy Romance annmargret BuckHenry BuffySedla...  
3     Comedy ann-walker AnnieMeyersShyer AprilOrtiz ...  
4     Action Crime Thriller AlPacino AmyBrenneman As...  
...                                                 ...  
7114  Animation Drama Romance AyakaOnoue KenjiMizuha...  
7115  M
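
The concatenation step relies on `+` being applied row by row to list-valued columns. A minimal sketch on a hypothetical two-row frame (column names borrowed from the notebook, values made up):

```python
import pandas as pd

# Hypothetical toy frame with list-valued columns
toy = pd.DataFrame({
    "genre": [["Comedy"], ["Action", "Crime"]],
    "actorID": [["TomHanks"], ["AlPacino"]],
})

# `+` on two object columns is applied row by row, so the lists are concatenated
toy["details"] = toy["genre"] + toy["actorID"]

# Turn each combined list into one space-separated string
toy["details"] = toy["details"].apply(" ".join)
print(toy["details"].tolist())  # ['Comedy TomHanks', 'Action Crime AlPacino']
```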

##### Converting our text data into vectors of numbers for modeling

In [15]:
count_vec = CountVectorizer(max_features=len(movies_final), stop_words='english')

words_vec = count_vec.fit_transform(movies_final['details']).toarray()
print(words_vec.shape)

(7119, 7119)
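
The square shape above follows from setting max_features to the number of movies; it is not a requirement of the method. The sketch below, on three made-up detail strings, shows how max_features caps the vocabulary at the most frequent terms and therefore fixes the number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical detail strings for three movies
docs = [
    "Comedy TomHanks pixar",
    "Comedy Romance TomHanks",
    "Action AlPacino heist",
]

# max_features=3 keeps only the three most frequent terms across the corpus
vec = CountVectorizer(max_features=3, stop_words="english")
matrix = vec.fit_transform(docs).toarray()

print(matrix.shape)             # rows = documents, columns = kept terms
print(sorted(vec.vocabulary_))  # the terms that survived the cap
```

Terms that fall outside the cap are dropped, so a very rare tag or actor contributes nothing to the similarity score.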


###### Calculating the cosine similarity matrix of every movie against every movie

In [16]:
similarity_score = cosine_similarity(words_vec)
print(similarity_score)

[[1.         0.15743507 0.05006262 ... 0.         0.05407381 0.        ]
 [0.15743507 1.         0.         ... 0.         0.         0.        ]
 [0.05006262 0.         1.         ... 0.         0.05143445 0.06299408]
 ...
 [0.         0.         0.         ... 1.         0.16666667 0.20412415]
 [0.05407381 0.         0.05143445 ... 0.16666667 1.         0.06804138]
 [0.         0.         0.06299408 ... 0.20412415 0.06804138 1.        ]]
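
For intuition about the numbers in this matrix, cosine similarity is the dot product of two count vectors divided by the product of their lengths. The hand-rolled sketch below uses made-up vectors and a helper named `cosine` that is introduced only for illustration; the notebook itself relies on scikit-learn's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used in this sketch for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over a four-term vocabulary
movie_a = [1, 1, 0, 1]
movie_b = [1, 0, 1, 1]
movie_c = [0, 0, 1, 0]

print(round(cosine(movie_a, movie_a), 4))  # 1.0, a vector compared with itself
print(round(cosine(movie_a, movie_b), 4))  # 0.6667, two of three terms shared
print(round(cosine(movie_a, movie_c), 4))  # 0.0, no terms in common
```

This is why the diagonal of the matrix is 1 and why movies with no shared terms score 0.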


###### A function which prints the top 5 movies most similar to the user's query

In [17]:
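def recommend_top_5_movies(movie):
    # Row of the first movie with this title (raises IndexError if the title is absent)
    index = movies_final[movies_final['title'] == movie].index[0]
    # Pair each row position with its similarity to the query, highest first
    distances = sorted(enumerate(similarity_score[index]), reverse=True, key=lambda x: x[1])
    # The first entry is normally the queried movie itself (similarity 1), so skip it
    for i, _ in distances[1:6]:
        print(movies_final.iloc[i].title)
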

###### Tried first example "Batman Begins" and results look amazing 

In [18]:
recommend_top_5_movies('Batman Begins')

The Dark Knight
Spider-Man
Batman
Batman: Mask of the Phantasm
Batman Returns


###### Tried second example "Superman" and results are still good 

In [19]:
recommend_top_5_movies('Superman')

Superman II
Superman IV: The Quest for Peace
Superman III
Batman & Robin
Superman Returns


###### Tried third example "Iron Man" and results are ok 

In [20]:
recommend_top_5_movies('Iron Man')

Hellboy II: The Golden Army
The Bourne Ultimatum
Indiana Jones and the Kingdom of the Crystal Skull
Serenity
The Incredible Hulk
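
The lookup-and-rank logic of the recommendation function can also be sketched without pandas. The version below is an illustrative variant, not the notebook's own function: the name `top_n_similar`, its parameters and the toy data are all assumptions. It returns a list instead of printing, excludes the query by position, and returns an empty list for an unknown title.

```python
def top_n_similar(titles, similarity, query, n=5):
    """Return up to n titles most similar to query, excluding the query itself.

    titles and similarity are hypothetical stand-ins for the title column
    and the similarity matrix.
    """
    if query not in titles:
        return []  # unknown title: nothing to recommend
    idx = titles.index(query)
    # Rank every other position by its similarity to the query, highest first
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: similarity[idx][i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:n]]

titles = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]
print(top_n_similar(titles, similarity, "A", n=2))  # ['C', 'B']
print(top_n_similar(titles, similarity, "Z"))       # []
```

Returning a list rather than printing would make the function easier to call from a web interface.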


###### Saving both the data and the similarity scores in pickle files for the web interface of the recommendation system

In [21]:
with open(base_path+'whole_data.pkl', 'wb') as f:
    pickle.dump(movies_final, f)
with open(base_path+'similarity_score.pkl', 'wb') as f:
    pickle.dump(similarity_score, f)
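
On the consuming side, the web interface would need to read these files back. A minimal round-trip sketch, using a temporary file and a made-up dictionary in place of the real objects, might look like this:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the saved objects
data = {"titles": ["A", "B"], "scores": [[1.0, 0.2], [0.2, 1.0]]}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# `with` closes the file even if dump or load raises
with open(path, "wb") as f:
    pickle.dump(data, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.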