<div style="color:green; font-family: 'Arial'; font-size: 36px;"><b>🎬 Content Based Film Recommendation System </b></div>


# Content-Based Filtering:

Content-based filtering is a recommendation approach that matches suggestions to each user's interests. It analyzes the user's past interactions and preferences and recommends new items whose content resembles what they have already liked.

At its core, content-based filtering examines item attributes and uses a similarity measure to find the items closest to the user's taste. When recommending films, for instance, the algorithm might compare genre, lead actors, and director against the films the user has previously enjoyed, and then propose films that share those attributes.

These systems use techniques such as text analysis, tagging, and categorization to extract content features. For music recommendations, the features might include genre, instrumentation, tempo, and rhythm.

Because recommendations are derived from the user's own history, content-based systems produce personalized suggestions that stay close to the user's established interests.
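
The attribute comparison described above can be sketched with simple set overlap. This is only a minimal illustration, not the method used later in this notebook (which compares overview text): the film names, the attribute tags, and the helper functions `jaccard` and `most_similar` are all hypothetical.

```python
# Hypothetical catalogue: names and attribute tags are invented for illustration.
films = {
    "Film A": {"sci-fi", "action", "director_x"},
    "Film B": {"sci-fi", "thriller", "director_x"},
    "Film C": {"romance", "comedy", "director_y"},
}


def jaccard(a: set, b: set) -> float:
    """Share of attributes two items have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(liked: str, catalogue: dict) -> list:
    """Rank every other film by attribute overlap with the liked film."""
    scores = {
        title: jaccard(catalogue[liked], attrs)
        for title, attrs in catalogue.items()
        if title != liked
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = most_similar("Film A", films)
print(ranking)  # [('Film B', 0.5), ('Film C', 0.0)]
```

Film B shares two of the four distinct attributes with Film A, so it ranks first; Film C shares none.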

## Aim:
Our aim is to suggest films similar to the ones a visitor enjoys while browsing our movie website.

## Project Blueprint

**1-Data Acquisition with Python:** Gathering the necessary datasets.

**2-Data Manipulation and Preparation:** Refining and structuring the data for analysis.

**3-Construction of the TF-IDF Matrix:** Developing the matrix to understand the importance of words in the movie descriptions.

**4-Establishing the Cosine Similarity Framework:** Creating a matrix to measure the similarity between movies.

**5-Making Recommendations Based on Similarities:** Identifying and suggesting movies based on their similarities.

**6-Finalizing the Analysis Script:** Preparing the script for comprehensive study and insights.

# 1-Data Acquisition with Python
### Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Adjust pandas display settings so wide DataFrames print in full
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)

### Import data

In [3]:
df = pd.read_csv("/kaggle/input/movies-metadata/movies_metadata.csv", low_memory=False)

### Initial Analysis of the Dataset

In [4]:
def check_df(dataframe, head=5):
    print('##################### Shape #####################')
    print(dataframe.shape)
    print('##################### Types #####################')
    print(dataframe.dtypes)
    print('######################## Head ########################')
    print(dataframe.head(head))
    print('######################## Tail ########################')
    print(dataframe.tail(head))
    print('################## NA ##################')
    print(dataframe.isnull().sum())
    print('######################## Quantiles ########################')
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)

##################### Shape #####################
(45466, 24)
##################### Types #####################
adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object
######################## Head ########################
   adult                   

# 2-Data Manipulation and Preparation

In [5]:
df[df['overview'].isnull()]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
32,False,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 12, ...",,78802,tt0114952,fr,"Guillaumet, les ailes du courage",,0.745542,/k6ODtR38dKEfuzSGjggr8KDyAF4.jpg,"[{'name': 'Iwerks Entertainment', 'id': 70801}]","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",1996-09-18,0.0,50.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Wings of Courage,False,6.8,4.0
300,False,,22000000,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",,161495,tt0114296,sv,Roommates,,3.395867,/hvHNlMvWS2GBt7RR971bJ3k4bJc.jpg,"[{'name': 'Hollywood Pictures', 'id': 915}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-03-01,12400000.0,108.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Roommates,False,6.4,7.0
634,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,287305,tt0117312,de,Peanuts – Die Bank zahlt alles,,0.066123,/wpk30SvRHmjC2plgKHZXxG0FlKd.jpg,"[{'name': 'Westdeutscher Rundfunk (WDR)', 'id'...","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1996-03-21,0.0,,[],Released,,Peanuts – Die Bank zahlt alles,False,4.0,1.0
635,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,339428,tt0116485,de,Happy Weekend,,0.002229,,"[{'name': 'Senator Film Produktion', 'id': 191}]","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1996-03-14,65335.0,,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,,Happy Weekend,False,0.0,0.0
641,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,10801,tt0117788,de,Das Superweib,,0.821299,/AbhMKCh3fV5PY2B9uSPF1DWEvq2.jpg,"[{'name': 'Constantin Film', 'id': 47}]","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1996-03-06,0.0,86.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,,The Superwife,False,5.3,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45342,False,,0,"[{'id': 18, 'name': 'Drama'}]",,199887,tt1771636,en,Over/Under,,0.704642,/1xLaIBGGPE4APtBJdfeuyOWICZ0.jpg,"[{'name': 'Fox Television Studios', 'id': 6529...","[{'iso_3166_1': 'US', 'name': 'United States o...",2013-01-04,0.0,87.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Over/Under,False,4.0,2.0
45377,False,,0,"[{'id': 12, 'name': 'Adventure'}]",,317389,tt0070695,es,Simbad e il califfo di Bagdad,,0.006352,/izk7KbT6LZO9baEhCkOZYMgj60w.jpg,"[{'name': 'Roas Produzioni', 'id': 21137}, {'n...","[{'iso_3166_1': 'IT', 'name': 'Italy'}]",1973-07-22,0.0,,"[{'iso_639_1': 'it', 'name': 'Italiano'}]",Released,,Simbad e il califfo di Bagdad,False,0.0,0.0
45398,False,,1254040,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",http://lmtr.fi/,468707,tt5742932,fi,Lauri Mäntyvaaran tuuheet ripset,,0.347806,/rKOpJuwb7pTqYVShHM2tl25VxyF.jpg,"[{'name': 'Elokuvayhtiö Oy Aamu', 'id': 84883}]","[{'iso_3166_1': 'FI', 'name': 'Finland'}]",2017-07-28,0.0,90.0,"[{'iso_639_1': 'fi', 'name': 'suomi'}]",Released,,Thick Lashes of Lauri Mäntyvaara,False,8.0,1.0
45399,False,,750000,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",,280422,tt3805180,ru,Все и сразу,,0.201582,/hNsmPpl3zLG36jr4EIEd5P8I4pa.jpg,"[{'name': 'Кинокомпания «Lunapark»', 'id': 420...","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",2014-06-05,3.0,0.0,"[{'iso_639_1': 'ru', 'name': 'Pусский'}]",Released,,All at Once,False,6.0,4.0


In [6]:
df['overview'] = df['overview'].fillna('')
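
Filling missing overviews with an empty string keeps every film in the dataset, but those films then carry no text signal. The sketch below uses an invented three-row frame (the `toy` name, titles, and overviews are assumptions) to show that an empty overview becomes an all-zero TF-IDF row.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented data: the second film has no overview.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "overview": [
        "A heist goes wrong in Paris.",
        None,
        "A detective hunts a thief in Paris.",
    ],
})

missing_before = int(toy["overview"].isnull().sum())
toy["overview"] = toy["overview"].fillna("")

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
empty_row_nonzeros = toy_matrix[1].nnz  # stored non-zero weights for Film B

print(missing_before, empty_row_nonzeros)
```

A zero row has cosine similarity 0 with every film, itself included, so films without an overview cannot receive meaningful recommendations from this approach.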

# 3-Construction of the TF-IDF Matrix

In [7]:
#remove all English stop words (such as "the", "is", "in", etc.) 
tfidf = TfidfVectorizer(stop_words="english")

In [8]:
tfidf_matrix = tfidf.fit_transform(df['overview'])

In [9]:
tfidf_matrix.shape

(45466, 75827)

In [10]:
df['title'].shape

(45466,)

In [11]:
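# Dense view for inspection only: the matrix is stored sparse, and converting
# all 45466 x 75827 entries to float64 takes roughly 27 GB of RAM
tfidf_matrix.toarray()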

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

# 4-Establishing the Cosine Similarity Framework

In [12]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [13]:
cosine_sim.shape

(45466, 45466)
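
Entry (i, j) of this matrix is the cosine of the angle between the TF-IDF vectors of films i and j: the dot product of the two vectors divided by the product of their lengths. The sketch below checks that definition by hand on three made-up vectors (`a`, `b`, `c`, and the `cosine` helper are assumptions for illustration) and compares it with scikit-learn's result.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

print(cosine(a, b))                       # close to 1: same direction
print(cosine(a, c))                       # 0: nothing in common
print(cosine_similarity([a], [b])[0, 0])  # scikit-learn agrees with the manual value
```

Because the score depends on direction rather than length, a long overview and a short one can still be rated as very similar when they use the same terms in similar proportions.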

In [14]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])
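
The full 45466 x 45466 matrix holds about 2 billion float64 values, roughly 16.5 GB. When only one film's neighbours are needed, a single row can be computed on demand instead. The sketch below is one possible alternative, not what this notebook does, and it uses invented overviews to show that the on-demand row matches the corresponding row of the full matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented overviews, for illustration only.
overviews = [
    "A hacker discovers the truth about reality.",
    "A hacker breaks into a bank.",
    "Two friends travel across the country.",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

# Full pairwise matrix, as in the cell above.
full = cosine_similarity(toy_matrix, toy_matrix)

# One row only: compare film 1 against every film without building the full matrix.
single_row = cosine_similarity(toy_matrix[1], toy_matrix).ravel()

print(np.allclose(single_row, full[1]))
```

On the real data this would mean calling `cosine_similarity(tfidf_matrix[movie_index], tfidf_matrix)` per request, trading a little compute per query for a much smaller memory footprint.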

# 5-Making Recommendations Based on Similarities

In [15]:
indices = pd.Series(df.index, index=df['title']) # Map each title to its row index

indices.index.value_counts() # Count how many rows share each title

title
Cinderella              11
Hamlet                   9
Alice in Wonderland      9
Beauty and the Beast     8
Les Misérables           8
                        ..
Cluny Brown              1
Babies                   1
The Green Room           1
Captain Conan            1
Queerama                 1
Name: count, Length: 42277, dtype: int64

In [16]:
indices["Cinderella"] # "Cinderella" is the most duplicated title: it maps to 11 different rows

title
Cinderella      993
Cinderella    13076
Cinderella    23507
Cinderella    23518
Cinderella    28391
Cinderella    28664
Cinderella    34254
Cinderella    35593
Cinderella    35595
Cinderella    41074
Cinderella    45406
dtype: int64

In [17]:
indices = indices[~indices.index.duplicated(keep='last')] # Keep only the last row for each duplicated title

In [18]:
indices["Cinderella"] #Only the last record remained for Cinderella

45406
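
The de-duplication step can be checked on a tiny invented example (the four titles and the `toy_indices` and `deduped` names are assumptions). `duplicated(keep='last')` flags every occurrence of a title except the final one, so negating the mask leaves exactly one row per title.

```python
import pandas as pd

# Invented titles: "Cinderella" appears at rows 0 and 2.
titles = pd.Series(["Cinderella", "Hamlet", "Cinderella", "Cube"])

toy_indices = pd.Series(titles.index, index=titles)
deduped = toy_indices[~toy_indices.index.duplicated(keep="last")]

print(toy_indices["Cinderella"].tolist())  # [0, 2]
print(deduped["Cinderella"])               # 2
```

Keeping the last row is a simplification: a lookup by title alone cannot tell different films with the same name apart, so the recommender always answers for whichever "Cinderella" appears last in the file.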

In [19]:
indices["Seven Pounds"] #Let's look at my favourite film

13198

In [20]:
movie_index = indices["Seven Pounds"]

cosine_sim[movie_index] 

array([0.      , 0.      , 0.      , ..., 0.      , 0.010178, 0.      ])

In [21]:
# Similarity scores between "Seven Pounds" and every film
similarity_scores = pd.DataFrame(cosine_sim[movie_index],
                                 columns=["score"])

# Drop the film itself by index, then keep the 10 highest-scoring films
movie_indices = similarity_scores.drop(movie_index).sort_values("score", ascending=False).head(10).index

df['title'].iloc[movie_indices] # Films most similar to "Seven Pounds"


20798                                        To the Arctic
2121                                                  Cube
31835                                       The Last Light
40214                                                  Sky
8238                                     The Polar Express
38972                                                Seven
21174                                       The Last Seven
11784                                                21 Up
5770                                          Scarlet Diva
32158    The World is Big and Salvation Lurks Around th...
Name: title, dtype: object

In [22]:
# Wrap the steps above in a function
def content_based_recommender(title, cosine_sim, dataframe):
    # Map each title to its row index, keeping the last row for duplicated titles
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    indices = indices[~indices.index.duplicated(keep='last')]
    # Look up the row index of the requested title
    movie_index = indices[title]
    # Similarity scores between the requested film and every film
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
    # Drop the film itself by index (its own score is not guaranteed to sort first,
    # e.g. when its overview is empty), then keep the 10 most similar films
    movie_indices = similarity_scores.drop(movie_index).sort_values("score", ascending=False).head(10).index
    return dataframe['title'].iloc[movie_indices]

In [23]:
content_based_recommender("Seven Pounds", cosine_sim, df)

20798                                        To the Arctic
2121                                                  Cube
31835                                       The Last Light
40214                                                  Sky
8238                                     The Polar Express
38972                                                Seven
21174                                       The Last Seven
11784                                                21 Up
5770                                          Scarlet Diva
32158    The World is Big and Salvation Lurks Around th...
Name: title, dtype: object
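
It can help to see the similarity score next to each recommended title and to get a clear error for an unknown title. The variant below is a hypothetical sketch, not part of the original notebook: the name `recommend_with_scores`, the `top_n` parameter, and the hand-written 4 x 4 similarity matrix are assumptions, and it presumes the frame uses the default 0..n-1 index so that row labels match matrix positions.

```python
import numpy as np
import pandas as pd


def recommend_with_scores(title, cosine_sim, dataframe, top_n=10):
    """Return the top_n most similar films together with their scores."""
    indices = pd.Series(dataframe.index, index=dataframe["title"])
    indices = indices[~indices.index.duplicated(keep="last")]
    if title not in indices.index:
        raise KeyError(f"Title not found: {title!r}")
    movie_index = indices[title]
    scores = pd.Series(cosine_sim[movie_index], index=dataframe.index, name="score")
    # Remove the film itself by label, then rank the rest.
    top = scores.drop(movie_index).sort_values(ascending=False).head(top_n)
    return dataframe.loc[top.index, ["title"]].assign(score=top)


# Invented data: four films and a made-up symmetric similarity matrix.
toy_df = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.9, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
])

result = recommend_with_scores("A", toy_sim, toy_df, top_n=2)
print(result)
```

On the real data the call would look like `recommend_with_scores("Seven Pounds", cosine_sim, df)`; low scores in the output are a hint that the overview shares few terms with any other film.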

In [24]:
content_based_recommender("The Matrix", cosine_sim, df)

44161                        A Detective Story
44167                              Kid's Story
44163                             World Record
33854                                Algorithm
167                                    Hackers
20707    Underground: The Julian Assange Story
6515                                  Commando
24202                                 Who Am I
22085                           Berlin Express
9159                                  Takedown
Name: title, dtype: object

In [25]:
content_based_recommender("Lucy", cosine_sim, df)

43241    From a House on Willow Street
41490                              Pet
318               Swimming with Sharks
16212                   Lovers of Hate
3203                  The Waiting Game
17562                      Bear's Kiss
10143                    Ferpect Crime
10716                       Hallelujah
40978                  Girl in the Box
43342              Korgoth of Barbaria
Name: title, dtype: object
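
Step 6 of the blueprint calls for a final script. One way the steps above might be consolidated is sketched below. The function names `calculate_cosine_sim` and `recommend_titles`, the `text_column` and `top_n` parameters, and the four-row stand-in frame are assumptions for illustration; with the real data, `toy_df` would be replaced by the frame read from `movies_metadata.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def calculate_cosine_sim(dataframe, text_column="overview"):
    """Fill missing text, build the TF-IDF matrix, and compare every pair of rows."""
    text = dataframe[text_column].fillna("")
    tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(text)
    return cosine_similarity(tfidf_matrix, tfidf_matrix)


def recommend_titles(title, cosine_sim, dataframe, top_n=10):
    """Return the top_n titles most similar to `title`, excluding the film itself."""
    indices = pd.Series(dataframe.index, index=dataframe["title"])
    indices = indices[~indices.index.duplicated(keep="last")]
    movie_index = indices[title]
    scores = pd.Series(cosine_sim[movie_index], index=dataframe.index)
    best = scores.drop(movie_index).sort_values(ascending=False).head(top_n)
    return dataframe.loc[best.index, "title"]


# Invented stand-in for the real dataset so the sketch runs anywhere.
toy_df = pd.DataFrame({
    "title": ["Hacker", "Heist", "Road Trip", "No Text"],
    "overview": [
        "A hacker uncovers a secret computer network.",
        "A hacker breaks into a secret bank network.",
        "Two friends drive across the country.",
        None,
    ],
})

toy_sim = calculate_cosine_sim(toy_df)
top_titles = recommend_titles("Hacker", toy_sim, toy_df, top_n=1)
print(top_titles.tolist())  # ['Heist']
```

Splitting the expensive similarity computation from the cheap lookup means the matrix can be built once and reused for every recommendation request.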

# 🎯 Conclusion
In conclusion, this content-based film recommendation system suggests films by matching the keywords in the overview of a film the user likes against the overviews of all other films. It weights those keywords with TF-IDF and ranks candidates by cosine similarity, which helps users discover new films that align with their tastes and tailors recommendations to their specific preferences.
