- ## Content-Based Recommender System

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import ast 
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
#from surprise import Reader, Dataset, SVD, evaluate

import warnings; warnings.simplefilter('ignore')

- ### 2. Load dataset
- We have MovieLens datasets.

- #### The Full Dataset: Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.

- #### The Small Dataset: Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

- #### We will build our Simple Recommender using movies from the Full Dataset

In [6]:
credits = pd.read_csv(r'C:\Users\Akshay\Desktop\AKSHAY S\AI_STATICS\ML100 Movie\Movie Metadata/credits.csv')
keywords = pd.read_csv(r'C:\Users\Akshay\Desktop\AKSHAY S\AI_STATICS\ML100 Movie\Movie Metadata/keywords.csv')
links_small = pd.read_csv(r'C:\Users\Akshay\Desktop\AKSHAY S\AI_STATICS\ML100 Movie\Movie Metadata/links_small.csv')
md = pd.read_csv(r'C:\Users\Akshay\Desktop\AKSHAY S\AI_STATICS\ML100 Movie\Movie Metadata/movies_metadata.csv')
ratings = pd.read_csv(r'C:\Users\Akshay\Desktop\AKSHAY S\AI_STATICS\ML100 Movie\Movie Metadata/ratings_small.csv')

- ### 3. Understand dataset
Credits dataframe

In [7]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [None]:
credits.columns

- #### cast: Information about the cast: each actor's name, gender, and character name in the movie.
- #### crew: Information about crew members, such as who directed or edited the movie.
- #### id: The movie ID assigned by TMDb.

In [8]:
credits.shape

(45476, 3)

In [9]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
cast    45476 non-null object
crew    45476 non-null object
id      45476 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


In [10]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [11]:
keywords.columns

Index(['id', 'keywords'], dtype='object')

In [12]:
keywords.shape

(46419, 2)

In [13]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
id          46419 non-null int64
keywords    46419 non-null object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


- #### Link dataframe

In [14]:
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [15]:
links_small.columns

Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')

- movieId: The serial number of the movie within this dataset.
- imdbId: The movie ID on the IMDb platform.
- tmdbId: The movie ID on the TMDb platform.

In [16]:
links_small.shape

(9125, 3)

In [17]:
links_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 3 columns):
movieId    9125 non-null int64
imdbId     9125 non-null int64
tmdbId     9112 non-null float64
dtypes: float64(1), int64(2)
memory usage: 213.9 KB


- #### Metadata dataframe

In [18]:
md.iloc[0:3].transpose()

Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [19]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

- Features

- adult: Indicates if the movie is X-Rated or Adult.
- belongs_to_collection: A stringified dictionary that gives information on the movie series the particular film belongs to.
- budget: The budget of the movie in dollars.
- genres: A stringified list of dictionaries that list out all the genres associated with the movie.
- homepage: The Official Homepage of the movie.
- id: The ID of the movie.
- imdb_id: The IMDb ID of the movie.
- original_language: The language in which the movie was originally shot.
- original_title: The original title of the movie.
- overview: A brief blurb of the movie.
- popularity: The Popularity Score assigned by TMDB.
- poster_path: The URL of the poster image.
- production_companies: A stringified list of production companies involved with the making of the movie.
- production_countries: A stringified list of countries where the movie was shot/produced in.
- release_date: Theatrical Release Date of the movie.
- revenue: The total revenue of the movie in dollars.
- runtime: The runtime of the movie in minutes.
- spoken_languages: A stringified list of spoken languages in the film.
- status: The status of the movie (Released, To Be Released, Announced, etc.)
- tagline: The tagline of the movie.
- title: The Official Title of the movie.
- video: Indicates if there is a video present of the movie with TMDB.
- vote_average: The average rating of the movie.
- vote_count: The number of votes by users, as counted by TMDB.
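Several of the features above are described as stringified lists of dictionaries, so they need to be parsed before use. The sketch below shows one way this could be done with `literal_eval`, which the notebook already imports. The toy rows and the helper name `extract_names` are assumptions made for illustration, not part of the original notebook.

```python
from ast import literal_eval

import pandas as pd

# Toy rows that mimic the stringified format of md['genres'] (illustrative values).
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}]",
    ],
})


def extract_names(cell):
    """Hypothetical helper: parse a stringified list of dicts and keep each 'name'."""
    try:
        parsed = literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        # Missing or malformed cells yield an empty list instead of raising.
        return []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]


toy["genre_names"] = toy["genres"].apply(extract_names)
print(toy[["title", "genre_names"]])
```

The same helper could presumably be applied to other stringified columns such as `production_companies` or `spoken_languages`, since they share the list-of-dicts layout.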

In [20]:
md.shape

(45466, 24)

In [21]:
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object

- #### Ratings dataframe

In [22]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [23]:
ratings.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

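- userId: The ID of the user.
- movieId: The movie ID used in the links dataframe (the `movieId` column there), not the TMDb ID; use links to translate it to `tmdbId`.
- rating: The rating given to the movie by that user.
- timestamp: The time at which the user gave the rating.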
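Because the ratings and links dataframes share the `movieId` column, ratings can be connected to the TMDb metadata through a join. The following is a minimal sketch on toy data; the frame names `toy_ratings` and `toy_links` and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for ratings and links_small (illustrative values).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})
toy_links = pd.DataFrame({
    "movieId": [1, 2],
    "imdbId": [114709, 113497],
    "tmdbId": [862.0, 8844.0],
})

# A left join keeps every rating and attaches the matching TMDb ID where one exists.
rated = toy_ratings.merge(toy_links[["movieId", "tmdbId"]], on="movieId", how="left")

# tmdbId is stored as float because it can be missing; the nullable Int64 dtype
# keeps missing values while showing whole-number IDs.
rated["tmdbId"] = rated["tmdbId"].astype("Int64")
print(rated)
```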

In [24]:
ratings.shape

(100004, 4)

In [25]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
userId       100004 non-null int64
movieId      100004 non-null int64
rating       100004 non-null float64
timestamp    100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


- ## 4. Data Wrangling
The data comes from The Movie Database (TMDb).
It has already been converted from JSON to CSV format.
- ## 5. Pre-processing
We will perform pre-processing as and when needed throughout the notebook.
- ## 6. Build recommendation system




- ## 6.1. Simple recommendation system
Approach:

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre.

The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.

This model does not give personalized recommendations based on the user.

- ## What we are actually doing:

- The implementation of this model is extremely trivial.
- All we have to do is sort our movies based on ratings and popularity and display the top movies of our list.
- As an added step, we can pass in a genre argument to get the top movies of a particular genre.
- I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!
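The steps listed above can be sketched roughly as follows. This is only an illustration on toy data: the helper name `top_chart`, the `min_votes` cut-off, and the use of `vote_average` as the sort key are assumptions, and the notebook's actual chart may weight ratings differently.

```python
import pandas as pd

# Toy movie table; genres are assumed to be already parsed into Python lists.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [8.5, 9.0, 7.0, 8.0],
    "vote_count": [1200, 3, 800, 950],
    "genres": [["Drama"], ["Drama"], ["Comedy"], ["Comedy", "Drama"]],
})


def top_chart(df, genre=None, min_votes=500, n=250):
    """Hypothetical helper: rank movies by average rating, ignoring thinly rated titles."""
    # Require a minimum number of votes so a handful of ratings cannot top the chart.
    qualified = df[df["vote_count"] >= min_votes]
    if genre is not None:
        # Optional genre filter, as described in the steps above.
        qualified = qualified[qualified["genres"].apply(lambda g: genre in g)]
    return qualified.sort_values(["vote_average", "vote_count"], ascending=False).head(n)


print(top_chart(toy)["title"].tolist())
print(top_chart(toy, genre="Comedy")["title"].tolist())
```

Movie "B" has the highest average but only three votes, so the vote-count floor keeps it out of the chart.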

- ## 6.2 Content based recommendation system

In [26]:
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [27]:
## Pre-processing step

def convert_int(x):
    # Return x as an int, or NaN when the value cannot be parsed
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [28]:
md['id'] = md['id'].apply(convert_int)
md[md['id'].isnull()]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22,,,,,,,,,


In [29]:
md = md.drop([19730, 29503, 35587])

In [30]:
md['id'] = md['id'].astype('int')
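The clean-up above (convert, inspect the failures, drop them, cast to int) could also be expressed with `pd.to_numeric`. The sketch below is an alternative on toy data, not the notebook's method; the sample IDs and the malformed entry are made up.

```python
import pandas as pd

# Toy ID column with one malformed entry (illustrative values).
toy_ids = pd.Series(["862", "8844", "bad-id", "15602"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric_ids = pd.to_numeric(toy_ids, errors="coerce")

# Inspect the rows that failed before discarding them.
bad_rows = toy_ids[numeric_ids.isna()]
print(bad_rows)

# Drop the failures, then cast the survivors to a plain integer dtype.
clean_ids = numeric_ids.dropna().astype("int64")
print(clean_ids.tolist())
```

Note that `pd.to_numeric` also accepts decimal strings such as "1.5", so it is not a drop-in equivalent of `convert_int`.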

In [31]:
# .copy() gives smd its own data, so adding columns later does not act on a view of md
smd = md[md['id'].isin(links_small)].copy()
smd.shape

(9099, 24)

- #### Our small movies metadata dataset contains 9,099 movies, about one-fifth of the original dataset of roughly 45,000 movies.
- #### Content based recommendation system : Using movie description and taglines
- #### Let us first try to build a recommender using movie descriptions and taglines.
- #### We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [32]:
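# Note: '+' propagates NaN, so a movie missing either its overview or its tagline
# ends up with an empty description after fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')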
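One detail of the cell above is worth illustrating: adding two string columns yields a missing value whenever either side is missing, and it joins the two texts without a separating space. The sketch below contrasts that behaviour with filling each field first. The toy texts are made up, and the second variant is a suggested alternative rather than what the notebook ran, so it would change the matrix shape and the recommendations reported further down.

```python
import pandas as pd

# Toy overview/tagline columns with gaps (illustrative values).
toy = pd.DataFrame({
    "overview": ["A toy comes to life.", "A board game goes wrong.", None],
    "tagline": ["To infinity!", None, "Roll the dice."],
})

# Direct concatenation: a missing field on either side blanks the whole description.
naive = (toy["overview"] + toy["tagline"]).fillna("")

# Alternative: fill each field first and join with a space, so no text is lost.
filled = (toy["overview"].fillna("") + " " + toy["tagline"].fillna("")).str.strip()

print(naive.tolist())
print(filled.tolist())
```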

In [33]:
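# min_df=1 keeps every term, matching the intent of the original min_df=0 (recent scikit-learn versions reject an integer 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')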
tfidf_matrix = tf.fit_transform(smd['description'])

In [34]:
tfidf_matrix.shape

(9099, 217961)

- #### TfidfVectorizer L2-normalizes each row by default, so every description vector has unit length and the dot product of two rows equals their cosine similarity.
- #### Therefore, we will use sklearn's linear_kernel instead of cosine_similarity, since it skips the redundant normalization step and is faster.
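The equivalence between the dot product and cosine similarity can be checked on a tiny corpus. This is a small sanity check under the default `norm='l2'` setting of `TfidfVectorizer`; the three sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Tiny made-up corpus standing in for the movie descriptions.
docs = [
    "a mafia family fights for power",
    "a crime family in new york",
    "toys come to life when people leave",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# With the default norm='l2', every non-empty row has unit Euclidean length.
row_norms = np.sqrt(np.asarray(tfidf.power(2).sum(axis=1)).ravel())
print(row_norms)

# The plain dot product therefore matches the cosine similarity.
dot = linear_kernel(tfidf, tfidf)
cos = cosine_similarity(tfidf, tfidf)
print(np.allclose(dot, cos))
```

The first two sentences share the word "family", so their similarity is positive, while the third shares no terms with the first and scores zero.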

In [35]:

# http://scikit-learn.org/stable/modules/metrics.html#linear-kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [36]:
cosine_sim[0]
#cosine_sim.shape

array([0., 0., 0., ..., 0., 0., 0.])

- #### We now have a pairwise cosine similarity matrix for all the movies in our dataset.
- #### The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [39]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])
indices.head(2)

title
Toy Story    0
Jumanji      1
dtype: int64

In [40]:
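def get_recommendations(title):
    # Row position of the movie; assumes the title is unique in smd
    idx = indices[title]
    # Pair every movie position with its similarity to the query movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]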

We're all set...!
Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [41]:
get_recommendations('The Godfather').head(10)


973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
5667                       Fury
29               Shanghai Triad
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
618                     Thinner
Name: title, dtype: object
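The title lookup in `get_recommendations` assumes each title maps to exactly one row. When two movies share a title, `indices[title]` returns a Series rather than a single position. The sketch below shows one possible way to handle that case on toy data, and uses `np.argsort` for the ranking. The names `recommend`, `toy_titles`, and `toy_sim`, and the choice to fall back to the first match, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy titles with a deliberate duplicate, plus a made-up similarity matrix.
toy_titles = pd.Series(["Alpha", "Beta", "Gamma", "Beta"])
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.0],
    [0.1, 0.6, 0.0, 1.0],
])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)


def recommend(title, top_n=2):
    """Hypothetical variant of get_recommendations that tolerates duplicate titles."""
    idx = toy_indices[title]
    if isinstance(idx, pd.Series):
        # Duplicate titles yield several positions; assumption: use the first match.
        idx = idx.iloc[0]
    # Sort positions by descending similarity (stable, so ties keep their order).
    order = np.argsort(-toy_sim[idx], kind="stable")
    # Exclude the query movie by position rather than assuming it ranks first.
    order = order[order != idx][:top_n]
    return toy_titles.iloc[order]


print(recommend("Alpha").tolist())
print(recommend("Beta").tolist())
```

Excluding the query by position, instead of slicing off the first result, also covers the case where another movie ties with it at the top of the ranking.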