# Movies -Metadata Based- Recommender System

## Introduction
In this notebook I will build a recommender system based on movie metadata. Besides the movies data, we will use the crew and cast data from the credits dataset and the keywords of each movie from the keywords dataset. We will also filter the results using the vote counts and vote averages in the movies metadata, so that only well-rated movies show up as recommendations.
### Plan:
1- Data preprocessing and visualization: Exploring the data and understanding what it can offer is important before starting any data science project. We will need to clean the data before we can build the recommender function.
<br/>
2- Creating the recommender system.

` Note: I am using the small version of the dataset because of limited computing power; the same code could be applied to the larger datasets. `

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np 
import os

from scipy import stats
from ast import literal_eval

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/the-movies-dataset/ratings.csv
/kaggle/input/the-movies-dataset/links_small.csv
/kaggle/input/the-movies-dataset/credits.csv
/kaggle/input/the-movies-dataset/keywords.csv
/kaggle/input/the-movies-dataset/movies_metadata.csv
/kaggle/input/the-movies-dataset/ratings_small.csv
/kaggle/input/the-movies-dataset/links.csv


In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

We start by importing the first dataset, the movies metadata:

In [3]:
movies = pd.read_csv('/kaggle/input/the-movies-dataset/movies_metadata.csv')



## Data Visualization and Pre-processing

In [4]:
pd.set_option('display.max_columns', None)
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


Several columns in this dataset could be used to build a recommender system. In this notebook we will take into consideration the genres of the movie, the title, overview, tagline, vote_average and vote_count.

In [5]:
movies['adult'].unique()

array(['False', 'True', ' - Written by Ørnås',
       ' Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.',
       ' Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.'],
      dtype=object)

In [6]:
def fix_adult(x):
    if x != 'True' and x != 'False':
        x = 'Else'
    else:
        pass
    return x
movies['adult'] = movies['adult'].apply(fix_adult)
movies['adult'].unique()

array(['False', 'True', 'Else'], dtype=object)

In [7]:
movies['adult'].value_counts()

False    45454
True         9
Else         3
Name: adult, dtype: int64

Almost all of the movies are non-adult. We will not take this column into consideration, since the few adult cases can be treated as outliers.

At first glance we can see that genres is a list of dictionaries stored as a string, so I will use literal_eval from ast to preprocess this column.

In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

The id of a genre doesn't matter here, so we will transform this column into a plain list of genre names instead of a list of dictionaries.

In [9]:
def fix_genres(x):
    # Keep only the genre names; the ids are not needed.
    # Reading the 'name' key avoids depending on the order of the dict values.
    return [i['name'] for i in x]

movies['genresx'] = movies['genres'].fillna('[]').apply(literal_eval).apply(fix_genres)
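
As an illustration of this transformation, here is a minimal sketch on a hand-written string in the same shape as a raw genres cell. The sample string and the helper name `names_from` are made up for this example, and the sketch assumes every dictionary carries a `'name'` key.

```python
from ast import literal_eval

# Hand-written sample in the same shape as a raw `genres` cell (illustrative only)
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def names_from(cell):
    """Parse a stringified list of dicts and keep only each 'name' value."""
    return [d['name'] for d in literal_eval(cell)]

genre_names = names_from(raw)
empty_names = names_from('[]')   # what a missing value becomes after fillna('[]')
print(genre_names, empty_names)
```

The `'[]'` case shows why missing values are filled with the string `'[]'` first: it parses to an empty list, so the row still gets a valid (empty) list of genres.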

In [10]:
movies = movies.drop([19730, 29503, 35587])
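
The three index labels above are hard-coded. As a hedged alternative, rows like these could be located programmatically. The sketch below assumes the problem rows are the ones whose `id` is not numeric, which is my assumption and not something this notebook states; the toy frame and its values are invented.

```python
import pandas as pd

# Toy frame (invented values): one row has a date where the id should be
toy = pd.DataFrame({'id': ['862', '8844', '1997-08-20', '15602'],
                    'title': ['A', 'B', None, 'C']})

# Non-numeric ids become NaN when errors='coerce' is used
numeric_id = pd.to_numeric(toy['id'], errors='coerce')
bad_rows = toy.index[numeric_id.isna()].tolist()

# Drop the malformed rows, then the id column can be cast to int safely
clean = toy.drop(bad_rows)
clean['id'] = clean['id'].astype(int)
print(bad_rows, clean['id'].tolist())
```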

We will do the same thing with the spoken_languages column:

In [11]:
def fix_language(x):
    # Keep only the language names
    return [i['name'] for i in x]

movies['spoken_languages'] = movies['spoken_languages'].fillna('[]').apply(literal_eval).apply(fix_language)
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,genresx
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,[English],Released,,Toy Story,False,7.7,5415.0,"[Animation, Comedy, Family]"
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[English, Français]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[Adventure, Fantasy, Family]"
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,[English],Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[Romance, Comedy]"
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,[English],Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[Comedy, Drama, Romance]"
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,[English],Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,[Comedy]


In [12]:
links = pd.read_csv('/kaggle/input/the-movies-dataset/links_small.csv')
links = links[links['tmdbId'].notnull()]['tmdbId'].astype('int')
links.head()

0      862
1     8844
2    15602
3    31357
4    11862
Name: tmdbId, dtype: int64

Movie taglines are slogans or catchphrases for films. They can include a clever pun, a short phrase, or a sentence or two. Taglines can refer to the plot of the film or suggest the experience you'll have as a viewer. So we will concatenate this column with the overview column to get an overall description column.

In [13]:
movies['tagline'].unique()

array([nan, 'Roll the dice and unleash the excitement!',
       'Still Yelling. Still Fighting. Still Ready for Love.', ...,
       'Do you know what happened 50 years before "The Blair Witch Project"?',
       'Rising and falling between a man and woman',
       'A deadly game of wits.'], dtype=object)

In [14]:
movies['tagline'] = movies['tagline'].fillna('')
movies['overview'] = movies['overview'].fillna('')

In [15]:
# Join with a space so the last word of the overview does not fuse with the first word of the tagline
movies['description'] = (movies['overview'] + ' ' + movies['tagline']).str.strip()

Moving on, we will import the credits and keywords datasets.

In [16]:
credits = pd.read_csv('/kaggle/input/the-movies-dataset/credits.csv')
keywords = pd.read_csv('/kaggle/input/the-movies-dataset/keywords.csv')

In [17]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [18]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


In [19]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [20]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


In [21]:
print(movies['id'].dtype)
movies['id'] = movies['id'].astype(int)
print(movies['id'].dtype)

object
int64


As I mentioned in the introduction, I will be using crew, cast data and keywords, so let us merge these dataframes with the movies dataframe.

In [22]:
movies = movies.merge(credits, on='id')
movies = movies.merge(keywords, on='id')

We will keep only the movies whose TMDB id appears in the links_small dataset, and we will store them in a new dataframe so that the original movies dataframe is left unchanged.

In [23]:
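# .copy() makes recom an independent dataframe, so adding columns later does not trigger SettingWithCopyWarning
recom = movies[movies['id'].isin(links)].copy()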
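
Filtering with `isin` and then adding columns to the result is a pattern where pandas can emit `SettingWithCopyWarning`; taking an explicit `.copy()` of the filtered frame makes it an independent object. A minimal sketch with invented ids and titles:

```python
import pandas as pd

movies_toy = pd.DataFrame({'id': [1, 2, 3, 4], 'title': ['A', 'B', 'C', 'D']})
link_ids = pd.Series([2, 4, 99], name='tmdbId')   # 99 has no matching movie

# Keep only rows whose id appears in link_ids, as an independent copy
subset = movies_toy[movies_toy['id'].isin(link_ids)].copy()
subset['flag'] = True      # modifies the copy only, not movies_toy

print(subset['title'].tolist(), 'flag' in movies_toy.columns)
```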

As we did with the genres column in the movies dataframe, we will apply literal_eval to the cast, crew and keywords columns.

In [24]:
recom['cast'] = recom['cast'].apply(literal_eval)
recom['crew'] = recom['crew'].apply(literal_eval)
recom['keywords'] = recom['keywords'].apply(literal_eval)



In [25]:
# Uncomment this to see an example of a crew
#recom['crew'][0]

We can see that a crew of a movie contains dozens of people and roles, starting from simple tasks to the most complicated ones.
We will only pick the director since other roles don't contribute that much to the feel of the movie.

In [26]:
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
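
To show what this lookup does on a single record, here is a small standalone sketch. The crew entries, the names, and the helper `first_with_job` are hypothetical; real crew dictionaries contain more fields than the two shown here.

```python
import math

def first_with_job(crew, job='Director'):
    """Return the name of the first crew member with the given job, or NaN."""
    for member in crew:
        if member.get('job') == job:
            return member['name']
    return float('nan')

# Invented crew list with two directors
crew_sample = [
    {'job': 'Screenplay', 'name': 'Person A'},
    {'job': 'Director', 'name': 'Person B'},
    {'job': 'Director', 'name': 'Person C'},
]
found = first_with_job(crew_sample)
missing = first_with_job([{'job': 'Editor', 'name': 'Person D'}])
print(found, missing)
```

Note that only the first match is returned, so for a co-directed movie the second director is ignored.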

In [27]:
recom['director'] = recom['crew'].apply(director)



Again, each movie has dozens of actors and not all of them are equally important, so we will only keep the first three actors in the cast list.

In [28]:
recom['cast']

0        [{'cast_id': 14, 'character': 'Woody (voice)',...
1        [{'cast_id': 1, 'character': 'Alan Parrish', '...
2        [{'cast_id': 2, 'character': 'Max Goldman', 'c...
3        [{'cast_id': 1, 'character': 'Savannah 'Vannah...
4        [{'cast_id': 1, 'character': 'George Banks', '...
                               ...                        
40952    [{'cast_id': 1, 'character': 'Henry Cobb', 'cr...
41172    [{'cast_id': 0, 'character': 'Rustom Pavri', '...
41225    [{'cast_id': 0, 'character': 'Sarman', 'credit...
41391    [{'cast_id': 4, 'character': 'Rando Yaguchi : ...
41669    [{'cast_id': 0, 'character': 'Himself', 'credi...
Name: cast, Length: 9219, dtype: object

In [29]:
recom['cast'] = recom['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
recom['cast'] = recom['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)



In [30]:
recom['keywords'] = recom['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])



In [31]:
recom['cast'] = recom['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])



Directors arguably affect the quality of a movie more than any other role, so we will repeat the director's name three times to give it more weight.
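
To see why repetition adds weight: a count vectorizer counts every occurrence of a token, so a name listed three times contributes a count of 3 to the movie's vector. A minimal sketch with made-up token names:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented tokens: two actors once each, one director repeated three times
doc = ' '.join(['actorone', 'actortwo'] + ['somedirector'] * 3)

vec = CountVectorizer()
counts = vec.fit_transform([doc]).toarray()[0]

# Map each token to its count using the fitted vocabulary
by_token = {tok: int(counts[i]) for tok, i in vec.vocabulary_.items()}
print(by_token)
```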

In [32]:
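# Use an empty string for missing directors; astype('str') would turn NaN into the shared token 'nan'
recom['director'] = recom['director'].fillna('').apply(lambda x: x.replace(" ", "").lower())
# Repeat the name three times to weight it; movies without a director get an empty list
recom['director'] = recom['director'].apply(lambda x: [x, x, x] if x else [])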



In [33]:
# One row per (movie, keyword) pair; movies with no keywords produce NaN and are dropped
keyw = recom['keywords'].explode().dropna()



We will only keep keywords that appear more than twice across the movies.

In [34]:
keyw = keyw.value_counts()
keyw = keyw[keyw > 2]
keyw

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
                       ... 
florence                  3
slice of life             3
professional sports       3
fan culture               3
singing in a car          3
Length: 4539, dtype: int64
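
The frequency filter can be tried on a tiny example. This is a sketch on invented keyword lists, using `Series.explode` to flatten them; the variable names are mine and the threshold mirrors the `> 2` used above.

```python
import pandas as pd

# Invented keyword lists for five movies
toy_kw = pd.Series([['murder', 'heist'], ['murder'], [],
                    ['murder', 'heist'], ['heist', 'rare']])

# Flatten to one keyword per row and count occurrences
freq = toy_kw.explode().dropna().value_counts()
common = freq[freq > 2]          # keywords seen more than twice

# Keep only the common keywords in each movie's list
kept = toy_kw.apply(lambda ks: [k for k in ks if k in common.index])
print(sorted(common.index), kept.tolist())
```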

In [35]:
stemmer = SnowballStemmer('english')
def fix_keywords(x):
    keys = []
    for i in x:
        if i in keyw:
            keys.append(i)
    return keys
recom['keywords'] = recom['keywords'].apply(fix_keywords)
recom['keywords'] = recom['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
recom['keywords'] = recom['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])



Now we will create a general column that combines all the data we have gathered so far: the keywords, cast, director and genres of a movie.

In [36]:
recom['overall'] = recom['keywords'] + recom['cast'] + recom['director'] + recom['genresx']
recom['overall'] = recom['overall'].apply(lambda x: ' '.join(x))



Create a count vectorizer:

In [37]:
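# min_df=1 keeps every term, which is what min_df=0 was meant to do (recent scikit-learn releases may reject 0)
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')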
count_matrix = count.fit_transform(recom['overall'])
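
With `ngram_range=(1, 2)` the vectorizer creates a feature for every single token and for every pair of adjacent tokens. A minimal sketch on two invented documents, to show what ends up in the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented "overall" strings
docs = ['heist christophernolan thriller', 'heist thriller']

vec = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
matrix = vec.fit_transform(docs)

# Unigrams and bigrams learned from the two documents
vocab = sorted(vec.vocabulary_)
print(matrix.shape, vocab)
```

The bigram `heist thriller` only exists in the second document because the two words are adjacent there, so bigrams make the order of the concatenated tokens matter.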

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$$
\text{cosine}(x, y) = \frac{x \cdot y^{\intercal}}{\lVert x \rVert \cdot \lVert y \rVert}
$$

In [38]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)
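
As a sanity check of the definition, the sketch below computes the cosine similarity of two small invented count vectors by hand and compares it with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two invented count vectors (one row each)
x = np.array([[1, 1, 0, 3]])
y = np.array([[1, 0, 1, 3]])

# Dot product divided by the product of the Euclidean norms
manual = (x @ y.T).item() / (np.linalg.norm(x) * np.linalg.norm(y))
library = cosine_similarity(x, y)[0, 0]
print(manual, library)
```

Here the dot product is 10 and both norms are the square root of 11, so the similarity is 10/11.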

In [39]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [40]:
recom = recom.reset_index()
titles = recom['title']
indices = pd.Series(recom.index, index=recom['title'])
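
One caveat with a title-indexed lookup like `indices`: if two movies share a title, looking the title up returns several positions instead of one. The sketch below shows one possible way to handle this, assuming that taking the first match is acceptable; the titles and the helper `first_position` are invented.

```python
import pandas as pd

# Invented titles, with 'Alpha' appearing twice
titles_toy = pd.Series(['Alpha', 'Beta', 'Alpha'])
lookup = pd.Series(titles_toy.index, index=titles_toy)

def first_position(title):
    pos = lookup[title]
    # A repeated label returns a Series; take its first entry in that case
    return int(pos.iloc[0]) if isinstance(pos, pd.Series) else int(pos)

print(first_position('Alpha'), first_position('Beta'))
```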

In [41]:
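# pd.notnull is needed here: a comparison such as x != np.nan is always True, so missing dates would slip through
recom['year'] = pd.to_datetime(recom['release_date'], errors='coerce').apply(lambda x: str(x.year) if pd.notnull(x) else np.nan)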

In [42]:
recom['year'].unique()

array(['1995', '1994', '1996', '1976', '1992', '1967', '1993', '1964',
       '1977', '1965', '1982', '1990', '1991', '1989', '1937', '1940',
       '1969', '1981', '1974', '1970', '1960', '1955', '1959', '1968',
       '1988', '1948', '1950', '1997', '1956', '1958', '1972', '1953',
       '1943', '1952', '1951', '1957', '1961', '1954', '1934', '1944',
       '1963', '1942', '1941', '1939', '1946', '1945', '1938', '1947',
       '1935', '1936', '1926', '1949', '1932', '1985', '1975', '1971',
       '1973', '1979', '1987', '1986', '1980', '1978', '1966', '1962',
       '1983', '1984', '1933', '1931', '1922', '1998', '1927', '1929',
       '1930', '1928', '1999', '2000', '2012', '1925', '1923', '1918',
       '1921', '2001', '1924', '2013', '2002', '2003', '1920', '1915',
       '2004', '1916', '1917', '2005', '2006', '1902', '2007', '2008',
       '2009', '1919', '2010', '2011', '2016', '2014', '2015'],
      dtype=object)

Now we will use the vote data to filter the results: we don't want to show a bad movie to a user even if they like its genre and director. I will use IMDb's weighted rating formula to weight the ratings. Mathematically, it is represented as follows:

<a href="https://imgbb.com/"><img src="https://i.ibb.co/1m8Yzbq/eq.jpg" alt="eq" border="0"></a>

* v is the number of votes for the movie <br/>
* m is the minimum votes required to be listed in the chart <br/>
* R is the average rating of the movie <br/>
* C is the mean vote across the whole report
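
In LaTeX notation, and written to match the terms used in the `weighted_rating` function of this notebook, the formula reads:

$$
WR = \frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C
$$

When v is much larger than m, WR approaches the movie's own average R; when v is much smaller than m, WR approaches the overall mean C.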

In [43]:
vote_counts = recom[recom['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = recom[recom['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
m = vote_counts.quantile(0.95)
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
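
To get a feel for the formula, here is a small standalone sketch. The function `weighted` and all the numbers are invented for illustration; unlike `weighted_rating` above, it takes `m` and `C` as arguments instead of reading them from the notebook's variables.

```python
def weighted(v, R, m, C):
    """Weighted rating: blend the movie's average R with the overall mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Same average rating (9.0), very different numbers of votes
few_votes = weighted(v=10, R=9.0, m=1000, C=6.0)
many_votes = weighted(v=100000, R=9.0, m=1000, C=6.0)
equal_votes = weighted(v=1000, R=9.0, m=1000, C=6.0)
print(few_votes, many_votes, equal_votes)
```

A movie with only a handful of votes is pulled close to the overall mean, while a movie with many votes keeps a score close to its own average.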

In [44]:
def recommender_system(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the movie itself) and keep the 25 most similar movies
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = recom.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    # astype('int') truncates the average (e.g. 7.6 becomes 7)
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    # Note: weighted_rating() reads the global C and m defined earlier;
    # the local m is only used as the vote-count cutoff below, and the local C is unused.
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    # .copy() so the column assignments below do not trigger SettingWithCopyWarning
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [45]:
recommender_system('The Dark Knight')



Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.732796
8613,Interstellar,11187,8,2014,7.674561
6623,The Prestige,4510,8,2006,7.344034
3381,Memento,4168,8,2000,7.308037
8031,The Dark Knight Rises,9263,7,2012,6.801517
6218,Batman Begins,7511,7,2005,6.765201
1134,Batman Returns,1706,6,1992,5.951183
132,Batman Forever,1529,5,1995,5.523252
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.203189
1260,Batman & Robin,1447,4,1997,5.123362


## Recommendation:
I would recommend that anyone who is learning how to create recommender systems read this notebook: https://www.kaggle.com/rounakbanik/movie-recommender-systems
It is well explained and the writer uses multiple examples. They also use collaborative filtering, which makes it possible to tailor the suggestions to the user's preferences.

### Omar El Yousfi
### elyousfiomarr@gmail.com