## Knowledge-based recommender system

** Use the formula below to compute the score

<img src='Equation pic.png'>


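v = number of votes received by the movie

m = minimum number of votes required for the movie to be in scope

R = mean rating of the movie

C = mean rating of all the movies in the database (the recommendation function later in this notebook takes the mean over the movies that match the user's filters)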
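
Written out with the symbols defined above, the score computed by the scoring line in `rec_after_user_pref` has the following form. This is reconstructed from that code rather than from the image, so treat it as a sketch if the picture shows a different layout:

```latex
\text{score} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
```

When v is much larger than m the score approaches R, and when v is small relative to m it is pulled toward C. For v = m the score is exactly the midpoint (R + C)/2.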


In [21]:
import pandas as pd
import numpy as np

In [33]:
import warnings
warnings.filterwarnings('ignore')

In [264]:
df_metadata = pd.read_csv('movies_metadata.csv')

In [265]:
df_metadata.shape


(45466, 24)

In [266]:
df_metadata.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

## Check whether title can be used as the index to access each movie easily

In [337]:
df_metadata.groupby('title')['adult'].count().sort_values()

title
!Women Art Revolution                               1
Signed, Sealed, Delivered: The Impossible Dream     1
Signed, Sealed, Delivered: Truth Be Told            1
Signer's Suitcase                                   1
Signs & Wonders                                     1
                                                   ..
Les Misérables                                      8
Beauty and the Beast                                8
Hamlet                                              9
Alice in Wonderland                                 9
Cinderella                                         11
Name: adult, Length: 42277, dtype: int64
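
The counts above show that several titles occur more than once, so `title` cannot serve as a unique index by itself. A minimal sketch of how that could be checked directly, using a small made-up frame (`toy` and its values are hypothetical, standing in for `df_metadata`):

```python
import pandas as pd

# Toy data (hypothetical), standing in for df_metadata
toy = pd.DataFrame({
    'title': ['Cinderella', 'Hamlet', 'Cinderella', 'Toy Story'],
    'vote_average': [6.8, 7.4, 6.7, 7.7],
})

# A column is only safe as an index if every value is unique
title_is_unique = toy['title'].is_unique

# Rows whose title appears more than once (keep=False marks every copy)
repeated = toy[toy['title'].duplicated(keep=False)]

print(title_is_unique)
print(repeated)
```

If titles repeat, one option is to keep the default integer index and look movies up by filtering on `title` instead.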

In [268]:
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object

## Keep only the relevant columns from here on

In [338]:
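# .copy() gives an independent frame, so later column assignments do not write into a slice of df_metadata
df_metadata_s = df_metadata[['title','release_date','vote_average','vote_count','runtime','genres']].copy()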

In [339]:
df_metadata_s.head()

| | title | release_date | vote_average | vote_count | runtime | genres |
|---|---|---|---|---|---|---|
| 0 | Toy Story | 1995-10-30 | 7.7 | 5415.0 | 81.0 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... |
| 1 | Jumanji | 1995-12-15 | 6.9 | 2413.0 | 104.0 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... |
| 2 | Grumpier Old Men | 1995-12-22 | 6.5 | 92.0 | 101.0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... |
| 3 | Waiting to Exhale | 1995-12-22 | 6.1 | 34.0 | 127.0 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... |
| 4 | Father of the Bride Part II | 1995-02-10 | 5.7 | 173.0 | 106.0 | [{'id': 35, 'name': 'Comedy'}] |


In [340]:
df_metadata_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 6 columns):
title           45460 non-null object
release_date    45379 non-null object
vote_average    45460 non-null float64
vote_count      45460 non-null float64
runtime         45203 non-null float64
genres          45466 non-null object
dtypes: float64(3), object(3)
memory usage: 2.1+ MB


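** `budget` is stored as an object dtype in the raw data and would need converting to float, but it is not among the selected columns, so the frame is unchanged here.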

In [344]:
df_metadata_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 6 columns):
title           45460 non-null object
release_date    45379 non-null object
vote_average    45460 non-null float64
vote_count      45460 non-null float64
runtime         45203 non-null float64
genres          45466 non-null object
dtypes: float64(3), object(3)
memory usage: 2.1+ MB


## Extract the year from the release date

In [345]:
df_metadata_s.release_date.isnull().sum()

87

** Those 87 rows have no release date and are skipped.

In [346]:
df_metadata_s.release_date = pd.to_datetime(df_metadata_s['release_date'],errors  = 'coerce')

In [347]:
# `x != np.nan` is always True, so use .dt.year instead; it returns NaN wherever the date is NaT
df_metadata_s['year'] = df_metadata_s['release_date'].dt.year

In [348]:
df_metadata_s['year']

0        1995.0
1        1995.0
2        1995.0
3        1995.0
4        1995.0
          ...  
45461       NaN
45462    2011.0
45463    2003.0
45464    1917.0
45465    2017.0
Name: year, Length: 45466, dtype: float64
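
To make the behaviour of `errors='coerce'` concrete, here is a small sketch on made-up values (the `dates` series is hypothetical, not taken from the dataset). It shows that missing or unparseable dates become `NaT`, and that the extracted year is then `NaN`, which is why the `year` column above is float rather than int:

```python
import pandas as pd

# Toy release dates (hypothetical): one valid, one missing, one malformed
dates = pd.Series(['1995-10-30', None, 'not a date'])

# errors='coerce' turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')

# .dt.year yields NaN for NaT, so the result is a float column
years = parsed.dt.year
print(years)
```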

In [349]:
from ast import literal_eval
df_metadata_s['genres'] = df_metadata_s['genres'].apply(literal_eval)

In [350]:
df_metadata_s['genres_list'] = df_metadata_s['genres'].apply(lambda x : [l['name'] for l in x])

In [351]:
df_metadata_s['genres_list']

0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45461                 [Drama, Family]
45462                         [Drama]
45463       [Action, Drama, Thriller]
45464                              []
45465                              []
Name: genres_list, Length: 45466, dtype: object
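
The two cells above rely on the fact that each `genres` value is read from the CSV as a string, not as a list. A minimal sketch of the parsing step on a single value (the string is the one shown for "Father of the Bride Part II" in the earlier `head()` output):

```python
from ast import literal_eval

# One genres cell as stored in the CSV: a string that looks like a list of dicts
raw = "[{'id': 35, 'name': 'Comedy'}]"

# literal_eval only evaluates Python literals, so it is safer than eval()
parsed = literal_eval(raw)

# Keep just the genre names
names = [g['name'] for g in parsed]
print(names)
```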

## Expand the genres list into one row per genre

In [352]:
s = df_metadata_s.apply(lambda x : pd.Series(x['genres_list']),axis=1).stack().reset_index(level=1,drop=True)

In [353]:
# `== np.nan` is always False, so test for nulls with isna() instead
s.index[s.index.isna()]

Int64Index([], dtype='int64')

** No null values in the index.


In [354]:
s.name = 'Genres'

In [355]:
new = pd.DataFrame(s).join(df_metadata_s)

In [356]:
new.head()


| | Genres | title | release_date | vote_average | vote_count | runtime | genres | year | genres_list |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Animation | Toy Story | 1995-10-30 | 7.7 | 5415.0 | 81.0 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | 1995.0 | [Animation, Comedy, Family] |
| 0 | Comedy | Toy Story | 1995-10-30 | 7.7 | 5415.0 | 81.0 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | 1995.0 | [Animation, Comedy, Family] |
| 0 | Family | Toy Story | 1995-10-30 | 7.7 | 5415.0 | 81.0 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | 1995.0 | [Animation, Comedy, Family] |
| 1 | Adventure | Jumanji | 1995-12-15 | 6.9 | 2413.0 | 104.0 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | 1995.0 | [Adventure, Fantasy, Family] |
| 1 | Fantasy | Jumanji | 1995-12-15 | 6.9 | 2413.0 | 104.0 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | 1995.0 | [Adventure, Fantasy, Family] |
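
The same one-row-per-genre shape can probably be reached more directly with `DataFrame.explode`. The sketch below is an alternative to the `apply(pd.Series)` / `stack()` / `join` steps, not what the notebook runs; the `toy` frame and the "Untagged" row are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Toy Story', 'Jumanji', 'Untagged'],
    'genres_list': [['Animation', 'Comedy', 'Family'],
                    ['Adventure', 'Fantasy', 'Family'],
                    []],
})

# explode() repeats each row once per list element and keeps the original index
expanded = toy.explode('genres_list').rename(columns={'genres_list': 'Genres'})

# An empty list becomes a single NaN row under explode(), so drop those explicitly
expanded = expanded.dropna(subset=['Genres'])
print(expanded)
```

One difference to keep in mind: `explode` keeps movies with an empty genre list as a `NaN` row, hence the explicit `dropna`.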


In [357]:
new.drop(['genres','genres_list'],axis= 1,inplace = True)

In [381]:
new.head()

| | Genres | title | vote_average | vote_count | runtime | year |
|---|---|---|---|---|---|---|
| 0 | Animation | Toy Story | 7.7 | 5415.0 | 81.0 | 1995 |
| 0 | Comedy | Toy Story | 7.7 | 5415.0 | 81.0 | 1995 |
| 0 | Family | Toy Story | 7.7 | 5415.0 | 81.0 | 1995 |
| 1 | Adventure | Jumanji | 6.9 | 2413.0 | 104.0 | 1995 |
| 1 | Fantasy | Jumanji | 6.9 | 2413.0 | 104.0 | 1995 |


In [359]:
new.drop(['release_date'],axis=1,inplace=True)

In [360]:
new['year']= new['year'].astype('int32',errors='ignore')

In [371]:
new.dropna(axis=0,inplace=True)

In [379]:
new.year= new.year.astype('int32')

In [380]:
new.dtypes

Genres           object
title            object
vote_average    float64
vote_count      float64
runtime         float64
year              int32
dtype: object
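
The order of the cells above matters: the integer cast only succeeds after the rows with missing values have been dropped. A small sketch on made-up values (the `years` series is hypothetical) of why that is:

```python
import numpy as np
import pandas as pd

years = pd.Series([1995.0, np.nan, 2001.0])

# NaN has no integer representation, so a direct cast raises an error
try:
    years.astype('int32')
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Dropping the missing values first makes the cast safe
clean = years.dropna().astype('int32')

print(cast_failed)
print(clean)
```

With `errors='ignore'`, as in cell `In [360]`, the failed cast returns the original float column unchanged instead of raising, which is why the later `dropna` and second `astype` are still needed.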

## Get the user's preferences for the recommendation

In [384]:
def rec_after_user_pref(new):
    # Ask the user for a genre, a runtime range and a range of release years
    print('what genre you would like to see')
    genre = input()
    print('minimum duration of movie you would like to see')
    low_time = float(input())
    print('maximum duration of movie you would like to see')
    high_time = float(input())
    print('oldest time period from which you like see')
    low_year = int(input())
    print('latest time period from which you like see')
    high_year = int(input())
    
    # Keep only the movies that match every preference
    movies = new[(new['Genres']==genre) & 
                (new['runtime']>=low_time) &
                (new['runtime']<=high_time) &
                (new['year']>= low_year) &
                (new['year']<= high_year)]
    # C: mean rating of the matching movies
    C = movies['vote_average'].mean()
    # m: vote threshold, taken as the 80th percentile of the vote counts
    m = movies['vote_count'].quantile(.8)
    
    # Keep movies with more votes than m; .copy() avoids assigning into a slice
    movies = movies[movies['vote_count']>m].copy()
    
    # Weighted rating: v/(v+m) * R + m/(v+m) * C, computed column-wise
    v, R = movies['vote_count'], movies['vote_average']
    movies['score'] = v/(v+m) * R + m/(m+v) * C
    movies = movies.sort_values('score', ascending = False)
    return movies
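
Because `rec_after_user_pref` reads its inputs with `input()`, it is awkward to test or reuse. The sketch below is a hypothetical non-interactive variant (`recommend`, its parameter names and the `toy` data are all assumptions for illustration) that applies the same filter, threshold and weighted score:

```python
import pandas as pd

def recommend(movies, genre, low_time, high_time, low_year, high_year, quantile=0.8):
    """Hypothetical non-interactive variant of rec_after_user_pref."""
    # between() is inclusive on both ends, matching the >= / <= filters
    picked = movies[(movies['Genres'] == genre) &
                    movies['runtime'].between(low_time, high_time) &
                    movies['year'].between(low_year, high_year)]
    C = picked['vote_average'].mean()            # mean rating of the matches
    m = picked['vote_count'].quantile(quantile)  # vote-count threshold
    picked = picked[picked['vote_count'] > m].copy()
    v, R = picked['vote_count'], picked['vote_average']
    picked['score'] = v / (v + m) * R + m / (v + m) * C
    return picked.sort_values('score', ascending=False)

# Made-up data with the same columns as `new`
toy = pd.DataFrame({
    'Genres': ['Fantasy'] * 5 + ['Comedy'],
    'title': list('ABCDEF'),
    'vote_average': [8.0, 7.0, 6.0, 5.0, 9.0, 7.5],
    'vote_count': [1000.0, 100.0, 10.0, 50.0, 5.0, 300.0],
    'runtime': [120.0, 110.0, 100.0, 130.0, 90.0, 95.0],
    'year': [2001, 2002, 2003, 2004, 2005, 2001],
})

result = recommend(toy, 'Fantasy', 100, 200, 1990, 2005)
print(result[['title', 'score']])
```

In this toy run only movie A clears the vote threshold, and its score lands between the group mean C and its own rating R, which is the shrinkage the formula is meant to produce.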

In [385]:
rec_after_user_pref(new)

what genre you would like to see
Animation
minimum duration of movie you would like to see
100
maximum duration of movie you would like to see
120
oldest time period from which you like see
1998
latest time period from which you like see
2000


| | Genres | title | vote_average | vote_count | runtime | year | score |
|---|---|---|---|---|---|---|---|
| 7636 | Animation | Jin-Roh: The Wolf Brigade | 7.4 | 82.0 | 102.0 | 1999 | 6.883709 |
| 5015 | Animation | Vampire Hunter D: Bloodlust | 7.0 | 92.0 | 103.0 | 2000 | 6.700199 |


In [386]:
rec_after_user_pref(new)

what genre you would like to see
Fantasy
minimum duration of movie you would like to see
100
maximum duration of movie you would like to see
200
oldest time period from which you like see
1990
latest time period from which you like see
2005


| | Genres | title | vote_average | vote_count | runtime | year | score |
|---|---|---|---|---|---|---|---|
| 5481 | Fantasy | Spirited Away | 8.3 | 3968.0 | 125.0 | 2001 | 7.924345 |
| 3030 | Fantasy | The Green Mile | 8.2 | 4166.0 | 189.0 | 1999 | 7.856038 |
| 4863 | Fantasy | The Lord of the Rings: The Fellowship of the Ring | 8.0 | 8892.0 | 178.0 | 2001 | 7.84034 |
| 5814 | Fantasy | The Lord of the Rings: The Two Towers | 8.0 | 7641.0 | 179.0 | 2002 | 7.8168 |
| 9698 | Fantasy | Howl's Moving Castle | 8.2 | 2049.0 | 119.0 | 2004 | 7.604338 |
| 2884 | Fantasy | Princess Mononoke | 8.2 | 2041.0 | 134.0 | 1997 | 7.602686 |
| 7725 | Fantasy | Harry Potter and the Prisoner of Azkaban | 7.7 | 6037.0 | 141.0 | 2004 | 7.510892 |
| 4748 | Fantasy | Donnie Darko | 7.7 | 3574.0 | 113.0 | 2001 | 7.405466 |
| 6390 | Fantasy | Pirates of the Caribbean: The Curse of the Bla... | 7.5 | 7191.0 | 143.0 | 2003 | 7.35906 |
| 4766 | Fantasy | Harry Potter and the Philosopher's Stone | 7.5 | 7188.0 | 152.0 | 2001 | 7.359008 |


## Great job!!
