# Content-Based Recommender System Project

## Recommender System
A recommender system is a subclass of information filtering systems that suggests the items most pertinent to a particular user.

### Main Types (3)
1. Content-Based (Based on content similarity)
2. Collaborative-Filtering (Based on user similarity/interest)
3. Hybrid (Combination of both; typical at big companies)

### Project Flow
Data --> **Preprocessing --> ML Model --> convert2website --> Deploy**

### 1. Import datasets

In [None]:
import numpy as np
import pandas as pd

In [None]:
movies = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Data Sets/Movie-Recommender-System-Dataset/tmdb_5000_movies.csv")
credits = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Data Sets/Movie-Recommender-System-Dataset/tmdb_5000_credits.csv")

In [None]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [None]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [None]:
# Merge datasets
movies = movies.merge(credits,on="title")

In [None]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')
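
Merging on `title` assumes that titles are unique. The following is a minimal sketch with made-up rows (not the TMDB data) of how a repeated title multiplies rows, and how joining on the ID columns (`id` in the movies file, `movie_id` in the credits file, as the outputs above show) avoids that. The names `movies_toy`, `credits_toy`, `by_title` and `by_id` are illustrative only.

```python
import pandas as pd

# Toy data: two different films share the title "Batman" (hypothetical rows).
movies_toy = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Batman", "Batman", "Avatar"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Batman", "Batman", "Avatar"],
                            "cast": ["cast A", "cast B", "cast C"]})

# Merging on a non-unique key pairs every "Batman" with every "Batman".
by_title = movies_toy.merge(credits_toy, on="title")
print(len(by_title))  # 5 rows: 2 x 2 for "Batman" + 1 for "Avatar"

# Merging on the unique ID keeps one row per film.
by_id = movies_toy.merge(credits_toy.drop(columns="title"),
                         left_on="id", right_on="movie_id")
print(len(by_id))  # 3 rows
```

Whether this matters for a given dataset depends on how many titles repeat; comparing the row count before and after the merge is a quick check.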

### 2. Data Preprocessing

In [None]:
# Keep only the columns that are needed
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [None]:
# Missing values check
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [None]:
movies.dropna(inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies.dropna(inplace = True)
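
The warning above appears because `movies` was produced by selecting columns from another DataFrame, so pandas cannot tell whether the in-place change is meant to reach the original. One common way to avoid it, sketched here on toy data, is to take an explicit `.copy()` of the selection and reassign instead of using `inplace=True`. The frame `raw` and its values are made up for illustration.

```python
import warnings
import pandas as pd

raw = pd.DataFrame({"movie_id": [1, 2, 3],
                    "overview": ["A hero rises", None, "A heist goes wrong"],
                    "budget": [10, 20, 30]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning raised in this block
    # .copy() makes the column subset an independent DataFrame
    subset = raw[["movie_id", "overview"]].copy()
    # reassigning the result replaces dropna(inplace=True)
    subset = subset.dropna()
    subset["overview"] = subset["overview"].str.lower()

# Collect only the "copy of a slice" warnings, if any were raised
copy_warnings = [w for w in caught if "copy of a slice" in str(w.message)]
print(subset.shape)        # (2, 2)
print(len(copy_warnings))  # 0
```

The original frame `raw` keeps all three rows, since every change was made on the copy.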


In [None]:
# Duplicate data check
movies.duplicated().sum()

0

In [None]:
# Inspect the raw format of the genres column (a list of dicts stored as a string)
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

---> We need the data in this form: ['Action','Adventure','Fantasy','Science Fiction']

In [None]:
# 1. Convert string of list to list
import ast
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [None]:
# helper function
def convert(obj):
  l = []
  for i in ast.literal_eval(obj):
    l.append(i['name'])
  return l
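
As a quick check, the helper can be tried on a short string before applying it to a whole column. The sketch below restates `convert` as a list comprehension with the same behavior; the `sample` string is a shortened version of the genres value shown earlier.

```python
import ast

def convert(obj):
    """Parse a stringified list of dicts and return the 'name' values."""
    return [item["name"] for item in ast.literal_eval(obj)]

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(convert(sample))  # ['Action', 'Adventure']
print(convert("[]"))    # []
```

An empty list string yields an empty list, so rows without any genres or keywords would pass through without raising an error.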

In [None]:
# 1. genres
movies['genres'] = movies['genres'].apply(convert)

In [None]:
# 2. keywords (do the same thing)
movies['keywords'] = movies['keywords'].apply(convert)

In [None]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [None]:
# 3. Cast (we need the names of the top 3 cast members only)
movies.iloc[0].cast

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [None]:
def convert3(obj):
  # keep the names of the first 3 cast members only
  l = []
  for i in ast.literal_eval(obj)[:3]:
    l.append(i['name'])
  return l

In [None]:
movies['cast'] = movies['cast'].apply(convert3)

In [None]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [None]:
# 4. crew column (need only director name)
movies['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [None]:
def fetch_director(obj):
  # return a one-element list with the director's name (empty if no director is listed)
  l = []
  for i in ast.literal_eval(obj):
    if i['job'] == 'Director':
      l.append(i['name'])
      break
  return l

In [None]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [None]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [None]:
# 5. Overview column (convert string to list) for concatenation
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [None]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


### Remove spaces

In [None]:
# remove spaces inside each entry so that a multi-word name becomes a single token
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ","") for i in x])

In [None]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
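
The notebook does not say why the spaces are removed. A likely reason is that a whitespace-based tokenizer would otherwise split each name into separate words, so two different people who share a first name would look partly alike. The sketch below uses plain `str.split` as a stand-in for the vectorizer's tokenizer, and the second name is chosen only for illustration.

```python
def tokens(names, collapse):
    """Return the set of lowercase tokens produced from a list of names."""
    out = set()
    for name in names:
        text = name.replace(" ", "") if collapse else name
        out.update(text.lower().split())
    return out

film_a = ["Sam Worthington"]
film_b = ["Sam Mendes"]

# With spaces kept, both films share the token "sam".
print(tokens(film_a, collapse=False) & tokens(film_b, collapse=False))  # {'sam'}

# With spaces removed, each full name is a single distinct token.
print(tokens(film_a, collapse=True) & tokens(film_b, collapse=True))    # set()
```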


In [None]:
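# add new column (tags) by concatenating the 5 list columns
# (the outputs recorded below come from a run where 'genres' was added a second time in place of 'crew')
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']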

In [None]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [None]:
# remove columns
new_df = movies[['movie_id','title','tags']]

In [None]:
# convert tags from list to string
new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))


In [None]:
new_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver Action Adventure Fantasy ScienceFiction'

In [None]:
# convert all to lower case
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())


In [None]:
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver action adventure fantasy sciencefiction'

In [None]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [None]:
new_df.duplicated().sum()

0

In [None]:
new_df.shape

(4806, 3)

# Stemming
Stemming reduces inflected forms of a word to a common stem: [love][loving][loves] --> [love][love][love]


In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
ps.stem('loving')

'love'

In [None]:
# helper function
def stem(text):
  y=[]

  for i in text.split():   # string to list
    y.append(ps.stem(i))

  return " ".join(y)  # list to string

In [None]:
new_df['tags'] = new_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(stem)


In [None]:
new_df['tags'][0:5]

0    in the 22nd century, a parapleg marin is dispa...
1    captain barbossa, long believ to be dead, ha c...
2    a cryptic messag from bond’ past send him on a...
3    follow the death of district attorney harvey d...
4    john carter is a war-weary, former militari ca...
Name: tags, dtype: object

# Text Vectorization
  Technique used --> Bag of words
  Goal --> measure text similarity between movies
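
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch on three made-up tag strings. It is not how `CountVectorizer` is implemented; it only shows the kind of count matrix such a vectorizer produces. The names `docs`, `vocab` and `vectors_toy` are illustrative.

```python
from collections import Counter

docs = ["action space alien", "action romance", "space space war"]

# Vocabulary: every distinct word, in sorted order (one column per word).
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of word counts over that vocabulary.
vectors_toy = []
for doc in docs:
    counts = Counter(doc.split())
    vectors_toy.append([counts[word] for word in vocab])

print(vocab)        # ['action', 'alien', 'romance', 'space', 'war']
print(vectors_toy)  # [[1, 1, 0, 1, 0], [1, 0, 1, 0, 0], [0, 0, 0, 2, 1]]
```

In the cells that follow, `CountVectorizer` additionally drops English stop words (`stop_words='english'`) and keeps only the most frequent terms (`max_features=5000`).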

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

In [None]:
vectors = cv.fit_transform(new_df['tags']).toarray()

In [None]:
cv.get_feature_names_out()

array(['3d', 'accident', 'act', 'action', 'adventur', 'affair',
       'aftercreditssting', 'age', 'agent', 'alcohol', 'alien', 'alway',
       'america', 'american', 'angel', 'ani', 'anim', 'anoth', 'apart',
       'armi', 'arriv', 'art', 'artist', 'assassin', 'assign', 'attack',
       'attempt', 'author', 'away', 'babi', 'bad', 'band', 'bank', 'bar',
       'base', 'basedonnovel', 'battl', 'beauti', 'becom', 'befor',
       'begin', 'believ', 'best', 'big', 'biographi', 'black', 'blood',
       'bond', 'book', 'boss', 'boy', 'boyfriend', 'break', 'bring',
       'british', 'brother', 'brutal', 'busi', 'california', 'captain',
       'captur', 'car', 'career', 'case', 'caught', 'caus', 'celebr',
       'center', 'chanc', 'chang', 'charact', 'chase', 'chicago', 'child',
       'childhood', 'children', 'christian', 'christma', 'cia', 'citi',
       'city', 'class', 'close', 'coach', 'colleg', 'come', 'comedi',
       'commun', 'compani', 'competit', 'confront', 'conspiraci',
       'co

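# Finding cosine similarity
 --> Euclidean distance becomes unreliable on high-dimensional data, so we compare the angle between vectors instead

 --> Distance is inversely related to similarity: the smaller the cosine distance, the more similar the two movies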
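
A small pure-Python sketch of the difference, using made-up count vectors: a tag string that is ten times longer but has the same word proportions is far away in Euclidean terms, yet points in the same direction, so its cosine similarity is 1. The helper names `cosine_sim` and `euclidean` are illustrative, and returning 0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # sketch convention for empty vectors
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1, 1, 0]
long_doc = [10, 10, 0]   # same word proportions, ten times longer
other_doc = [0, 0, 1]    # no words in common with short_doc

print(round(cosine_sim(short_doc, long_doc), 6))  # 1.0
print(round(euclidean(short_doc, long_doc), 3))   # 12.728
print(cosine_sim(short_doc, other_doc))           # 0.0
```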

In [None]:
# checking similarity b/w vectors
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# calculate distance/similarity b/w vectors
similarity = cosine_similarity(vectors)

In [None]:
# similarity scores of the first movie against every movie
similarity[0]

array([1.        , 0.25928149, 0.19926335, ..., 0.0525105 , 0.        ,
       0.        ])

In [None]:
# list of (index, similarity) tuples
list(enumerate(similarity[0]))

[(0, 0.9999999999999999),
 (1, 0.2592814894208657),
 (2, 0.19926334924652145),
 (3, 0.10219250343291215),
 (4, 0.3181045051401759),
 (5, 0.2450490147049017),
 (6, 0.0491671162291597),
 (7, 0.3021019648197444),
 (8, 0.17937941173234997),
 (9, 0.2643392192172763),
 (10, 0.31372549019607837),
 (11, 0.19885291966710147),
 (12, 0.2762455533096785),
 (13, 0.14462030521243743),
 (14, 0.33333333333333326),
 (15, 0.16618267200585896),
 (16, 0.24510221106170005),
 (17, 0.27128043348732983),
 (18, 0.1711453436034234),
 (19, 0.21914608621714188),
 (20, 0.15725550770308747),
 (21, 0.22140372138502384),
 (22, 0.20802038699693748),
 (23, 0.2271554252121273),
 (24, 0.1669931442893732),
 (25, 0.04222003309207491),
 (26, 0.2970442628930023),
 (27, 0.2970442628930023),
 (28, 0.2776029241433383),
 (29, 0.1575315094531511),
 (30, 0.2151449915893437),
 (31, 0.3313667478318056),
 (32, 0.1867897929731142),
 (33, 0.28005601680560194),
 (34, 0.0),
 (35, 0.29704426289300223),
 (36, 0.31069425495224695),
 (37, 0.

In [None]:
# sort in descending order of similarity (tuple element [1])
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])

[(0, 0.9999999999999999),
 (61, 0.4434220266088697),
 (1194, 0.4373740168406954),
 (1444, 0.3920784235278427),
 (322, 0.38468652405641013),
 (2999, 0.3704792868174741),
 (260, 0.3673154433462264),
 (3730, 0.36407282184728257),
 (94, 0.3634486803394037),
 (507, 0.3619613829965133),
 (1216, 0.3603018792888359),
 (931, 0.3600720216072025),
 (972, 0.35887028128263665),
 (495, 0.3550358012483631),
 (46, 0.35486043161491804),
 (973, 0.35486043161491804),
 (539, 0.3529411764705882),
 (1204, 0.35239609324728294),
 (4048, 0.3472909948612533),
 (2075, 0.3435921354681384),
 (168, 0.34347322487953885),
 (466, 0.3380331288700234),
 (577, 0.3377602647365992),
 (1658, 0.3377602647365992),
 (72, 0.33665016461206926),
 (4348, 0.33606722016672236),
 (1938, 0.33519554521580996),
 (232, 0.3339862885787464),
 (14, 0.33333333333333326),
 (31, 0.3313667478318056),
 (859, 0.33011264588676237),
 (2169, 0.33011264588676237),
 (1831, 0.32748887033017876),
 (56, 0.32694704563810695),
 (91, 0.32410186177608213),
 

# Recommendation System
--> If someone enters a movie we have to recommend 5 movies related to it

In [None]:
# Given a movie title, print the 5 most similar movies
def recommend(movie):
  movie_index = new_df[new_df['title']==movie].index[0]
  distances = similarity[movie_index]
  # sort by similarity (descending) and skip the first entry, which is normally the movie itself
  movies_list = sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]

  # print the titles
  for i in movies_list:
    # i[0] is the row position, i[1] is the similarity score
    print(new_df.iloc[i[0]].title)

In [None]:
recommend('Batman')

Batman
Batman & Robin
Batman Returns
Batman Forever
Batman Begins
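
`recommend` looks the title up with `.index[0]`, which raises an `IndexError` for a title that is not in the dataset, and it uses an index label as a row position, which only agrees with the similarity matrix when the index has no gaps (rows were dropped earlier without resetting the index). The sketch below shows one possible defensive variant on toy data: it works on positions, excludes the query film explicitly, and uses `heapq.nlargest` instead of sorting every score. The titles list, the matrix values and the name `recommend_titles` are all hypothetical.

```python
import heapq

titles = ["Avatar", "Batman", "Batman Returns", "Spectre"]
# Hypothetical similarity matrix: row i holds the similarity of film i to every film.
similarity_toy = [
    [1.0, 0.1, 0.2, 0.3],
    [0.1, 1.0, 0.8, 0.2],
    [0.2, 0.8, 1.0, 0.1],
    [0.3, 0.2, 0.1, 1.0],
]

def recommend_titles(movie, k=2):
    """Return the k titles most similar to `movie`, or [] if it is unknown."""
    if movie not in titles:
        return []
    idx = titles.index(movie)
    # pair every other film's position with its similarity score
    scores = [(j, s) for j, s in enumerate(similarity_toy[idx]) if j != idx]
    top = heapq.nlargest(k, scores, key=lambda pair: pair[1])
    return [titles[j] for j, _ in top]

print(recommend_titles("Batman"))       # ['Batman Returns', 'Spectre']
print(recommend_titles("Not A Movie"))  # []
```

Returning a list rather than printing also makes the function easier to call from a web application.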


# Model building is complete
This is just a basic model.

Next, we convert it into a web application (product).