# **Capstone Project - Netflix**

Customer behaviour and its prediction lie
at the core of every business model. From
stock exchanges, e-commerce and
automobiles to presidential elections,
predictions serve a great purpose. Most of
these predictions are based on the data
available about a person's activity, either
online or in person.

Recommendation engines are a much-needed
application of this predictability of user
activity. They go one step further: they not
only give information but also put forward
strategies to increase users' interaction
with the platform.

Today, OTT platforms and streaming services
make up a large share of the retail and
entertainment industry. Organizations such
as Netflix and Amazon analyse user activity
patterns and suggest products that better
suit users' needs and choices.

For this project we will build one such
recommendation engine from the ground up,
in which every user, based on their areas
of interest and ratings, is recommended a
list of movies best suited to them.

**Dataset Information:**

1. ID – Contains the separate keys for
customers and movies.
2. Rating – Contains the user ratings for
all the movies.
3. Genre – Gives the category of the
movie.
4. Movie Name – Name of the movie that
corresponds to the movie ID.

**Objectives:**

1. Find the most popular and most liked genres.
2. Create a model that finds the best-suited movie for a
user in every genre.
3. Find which genres have received the best and
worst ratings, based on user ratings.

## **Data Pre-processing Steps**

### **Step 1: Import the Libraries**

In [None]:
# Importing the Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **Step 2: Import the Dataset**

In [None]:
# Load the .csv file

df = pd.read_csv('/content/netflix_titles-1.csv')

## **Exploratory Data Analysis**

### **Step 3: Descriptive Analysis of the Dataset**

In [None]:
df.head()

| | ID | Movie Name | Rating | Genre |
|---|---|---|---|---|
| 0 | s1 | Dick Johnson Is Dead | PG-13 | Documentaries |
| 1 | s2 | Blood & Water | TV-MA | International TV Shows, TV Dramas, TV Mysteries |
| 2 | s3 | Ganglands | TV-MA | Crime TV Shows, International TV Shows, TV Act... |
| 3 | s4 | Jailbirds New Orleans | TV-MA | Docuseries, Reality TV |
| 4 | s5 | Kota Factory | TV-MA | International TV Shows, Romantic TV Shows, TV ... |


In [None]:
df.describe()

| | ID | Movie Name | Rating | Genre |
|---|---|---|---|---|
| count | 8807 | 8807 | 8803 | 8807 |
| unique | 8807 | 8804 | 17 | 514 |
| top | s1 | 15-Aug | TV-MA | Dramas, International Movies |
| freq | 1 | 2 | 3207 | 362 |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   ID          8807 non-null   object
 1   Movie Name  8807 non-null   object
 2   Rating      8803 non-null   object
 3   Genre       8807 non-null   object
dtypes: object(4)
memory usage: 275.3+ KB


In [None]:
df.isnull().sum()

| | 0 |
|---|---|
| ID | 0 |
| Movie Name | 0 |
| Rating | 4 |
| Genre | 0 |


In [None]:
df.duplicated().sum()

0
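
`df.duplicated()` compares whole rows, and because every `ID` is unique it reports 0. The `describe()` output above lists 8804 unique values among 8807 titles, so some titles do repeat. A minimal sketch of a column-level check follows; the frame `toy` and its rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same columns as the notebook; the values are made up
toy = pd.DataFrame({
    "ID": ["s1", "s2", "s3", "s4"],
    "Movie Name": ["Alpha", "Beta", "Alpha", "Gamma"],
    "Rating": ["PG", "R", "TV-MA", "PG"],
    "Genre": ["Dramas", "Comedies", "Dramas", "Thrillers"],
})

# Whole-row check: nothing is flagged because every ID differs
full_row_dupes = int(toy.duplicated().sum())

# Column-level check: keep=False marks every row that shares a title
repeated_titles = toy[toy.duplicated(subset="Movie Name", keep=False)]

print(full_row_dupes)
print(repeated_titles)
```

Whether repeated titles should be dropped depends on whether they are distinct productions that happen to share a name, so the sketch only lists them.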

### **Step 4: Data Cleaning - Handling Missing Values, Duplicates, etc.**

In [None]:
df.isnull().sum()['Rating']

4

In [None]:
df.dropna(inplace=True)
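
`dropna(inplace=True)` removes a row if any of its columns is missing. Only `Rating` has gaps here, so the result is the same, but restricting the drop to that column makes the intent explicit. A small sketch on made-up rows (the frame `toy` is hypothetical, and resetting the index is an optional choice that the notebook does not make):

```python
import numpy as np
import pandas as pd

# Made-up rows; one of them has no Rating
toy = pd.DataFrame({
    "ID": ["s1", "s2", "s3"],
    "Movie Name": ["Alpha", "Beta", "Gamma"],
    "Rating": ["PG", np.nan, "R"],
    "Genre": ["Dramas", "Comedies", "Thrillers"],
})

# Drop only the rows whose Rating is missing, then renumber the index
cleaned = toy.dropna(subset=["Rating"]).reset_index(drop=True)

print(len(toy), "->", len(cleaned))
```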

In [None]:
df

| | ID | Movie Name | Rating | Genre |
|---|---|---|---|---|
| 0 | s1 | Dick Johnson Is Dead | PG-13 | Documentaries |
| 1 | s2 | Blood & Water | TV-MA | International TV Shows, TV Dramas, TV Mysteries |
| 2 | s3 | Ganglands | TV-MA | Crime TV Shows, International TV Shows, TV Act... |
| 3 | s4 | Jailbirds New Orleans | TV-MA | Docuseries, Reality TV |
| 4 | s5 | Kota Factory | TV-MA | International TV Shows, Romantic TV Shows, TV ... |
| ... | ... | ... | ... | ... |
| 8802 | s8803 | Zodiac | R | Cult Movies, Dramas, Thrillers |
| 8803 | s8804 | Zombie Dumb | TV-Y7 | Kids' TV, Korean TV Shows, TV Comedies |
| 8804 | s8805 | Zombieland | R | Comedies, Horror Movies |
| 8805 | s8806 | Zoom | PG | Children & Family Movies, Comedies |


In [None]:
df.isnull().sum()

| | 0 |
|---|---|
| ID | 0 |
| Movie Name | 0 |
| Rating | 0 |
| Genre | 0 |


### **Step 5: Find the Most Popular and Most Liked Genres**

* **The most popular genre can be found by counting the number of ratings for each genre.**

* **The most liked genre can be found by calculating the average (mean) rating for each genre.**

In [None]:
# Popular genres by the number of ratings
popular_genres = df.groupby('Genre')['Rating'].count().sort_values(ascending=False)

# Liked genres by average rating
# Convert 'Rating' column to numeric, handling errors by setting them to NaN
df['Rating'] = pd.to_numeric(df['Rating'], errors = "coerce")
liked_genres = df.groupby('Genre')['Rating'].mean(numeric_only=True).sort_values(ascending=False)

print("Most Popular Genres:\n", popular_genres)
print("Most Liked Genres:\n", liked_genres)

Most Popular Genres:
 Genre
Action & Adventure                                      0
International TV Shows, Reality TV                      0
Horror Movies, International Movies, Romantic Movies    0
Horror Movies, International Movies                     0
Horror Movies, Independent Movies, Thrillers            0
                                                       ..
Classic Movies, Comedies, International Movies          0
Classic Movies, Comedies, Independent Movies            0
Classic Movies, Comedies, Dramas                        0
Classic Movies, Comedies, Cult Movies                   0
Thrillers                                               0
Name: Rating, Length: 514, dtype: int64
Most Liked Genres:
 Genre
Action & Adventure                                             NaN
Action & Adventure, Anime Features                             NaN
Action & Adventure, Anime Features, Children & Family Movies   NaN
Action & Adventure, Anime Features, Classic Movies             NaN
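
The `NaN` means above follow from the data: `Rating` holds maturity labels such as `PG-13` and `TV-MA`, which `pd.to_numeric(..., errors="coerce")` turns into `NaN`. The zero counts suggest the cell was run again after that coercion had already taken place. The `Genre` column also packs several genres into one comma-separated string, so grouping on it treats every combination as a separate genre. One possible way to count titles per individual genre is sketched below on made-up rows. It measures how often a genre appears in the catalogue, which is an assumption about what "popular" can mean when no numeric user ratings are available.

```python
import pandas as pd

# Made-up titles; Genre uses the same comma-separated layout as the dataset
toy = pd.DataFrame({
    "Movie Name": ["Alpha", "Beta", "Gamma"],
    "Genre": ["Dramas, International Movies", "Comedies, Dramas", "Dramas"],
})

# Split each string into a list, then give every genre its own row
genres = toy.assign(Genre=toy["Genre"].str.split(", ")).explode("Genre")

# Number of titles listed under each individual genre
genre_counts = genres["Genre"].value_counts()

print(genre_counts)
```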


## **Create a Model That Finds the Best-Suited Movie for One User in Every Genre**

### **Step 6: Create a Recommendation Model**

* **Collaborative filtering is used as the recommendation model because the data consists mostly of user-item (genre) interaction data.**

In [None]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357270 sha256=e90550b6c6f984da690ea5297220c845b65c6d232c1c8491d1995b32d32305e6
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a

In [None]:
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

# Prepare data for Surprise library
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['ID', 'Movie Name', 'Rating']], reader)

# Split data into train and test
trainset, testset = train_test_split(data, test_size=0.2)

# Train SVD model
model = SVD()
model.fit(trainset)

# Test the model
predictions = model.test(testset)
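# accuracy.rmse prints the score itself (verbose=True by default) and also
# returns it, so the value appears twice in the output
print("RMSE:", accuracy.rmse(predictions))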

# Recommend movies for a user
def recommend_movies(user_id, df, model, genre=None, top_n=5):
    # Titles already associated with this user ID
    user_movies = set(df.loc[df['ID'] == user_id, 'Movie Name'])
    all_movies = df['Movie Name'].unique()

    # Genre of the first row for each title, looked up once instead of per movie
    genre_of = df.drop_duplicates('Movie Name').set_index('Movie Name')['Genre']

    # Predict a rating for every title the user has not rated,
    # optionally keeping only titles whose Genre string equals `genre`
    predictions = [
        (movie, model.predict(user_id, movie).est)
        for movie in all_movies
        if movie not in user_movies and (genre is None or genre_of[movie] == genre)
    ]

    # Sort by predicted rating, highest first
    predictions.sort(key=lambda x: x[1], reverse=True)
    return predictions[:top_n]

user_id = 1  # Example user ID
recommended_movies = recommend_movies(user_id, df, model)
print("Recommended Movies:", recommended_movies)


RMSE: nan
RMSE: nan
Recommended Movies: [('Dick Johnson Is Dead', 5), ('Blood & Water', 5), ('Ganglands', 5), ('Jailbirds New Orleans', 5), ('Kota Factory', 5)]
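
Collaborative filtering learns from (user, item, rating) triples in which the rating is numeric. The frame passed to `Dataset.load_from_df` above has one row per title, with `ID` values such as `s1` and a `Rating` column that Step 5 coerced to `NaN`, which is consistent with the `RMSE: nan` output. The sketch below shows the expected input shape on a made-up ratings table and a simple item-based variant of collaborative filtering written with NumPy. It is not the SVD model used above, and the names `user_id`, `title`, `stars` and `recommend` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings: one row per (user, title) pair, stars on a 1-5 scale
ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "title":   ["A",  "B",  "C",  "A",  "B",  "A",  "C",  "D"],
    "stars":   [5,    4,    1,    5,    2,    1,    5,    4],
})

# User x title matrix; 0 marks "not rated"
matrix = ratings.pivot(index="user_id", columns="title", values="stars").fillna(0.0)
R = matrix.to_numpy()

# Cosine similarity between title columns, with self-similarity zeroed out
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

def recommend(user, top_n=2):
    """Score unseen titles by a similarity-weighted average of the user's ratings."""
    row = matrix.loc[user].to_numpy()
    rated = row > 0
    weights = sim[:, rated]
    denom = weights.sum(axis=1)
    scores = np.divide(weights @ row[rated], denom,
                       out=np.zeros_like(denom), where=denom > 0)
    scores[rated] = -np.inf  # never recommend a title the user already rated
    order = np.argsort(-scores)[:top_n]
    return [(matrix.columns[i], float(scores[i]))
            for i in order if np.isfinite(scores[i])]

print(recommend("u2"))
```

With real data, the same three-column table could be handed to `Dataset.load_from_df` together with a `Reader` whose `rating_scale` matches the stars.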


## **Find Which Genres Have Received the Best and Worst Ratings Based on User Rating**


### **Step 7: Find Genres with Best and Worst Ratings**

In [None]:
# Highest Rating value per genre (max, not an average), sorted in descending order
best_genres = df.groupby('Genre')['Rating'].max().sort_values(ascending=False)

# Lowest Rating value per genre (min, not an average), sorted in descending order
worst_genres = df.groupby('Genre')['Rating'].min().sort_values(ascending=False)

print("Best Rated Genres:\n", best_genres)
print("Worst Rated Genres:\n", worst_genres)


Best Rated Genres:
 Genre
Action & Adventure                                             NaN
Action & Adventure, Anime Features                             NaN
Action & Adventure, Anime Features, Children & Family Movies   NaN
Action & Adventure, Anime Features, Classic Movies             NaN
Action & Adventure, Anime Features, Horror Movies              NaN
                                                                ..
TV Horror, TV Mysteries, Teen TV Shows                         NaN
TV Horror, Teen TV Shows                                       NaN
TV Sci-Fi & Fantasy, TV Thrillers                              NaN
TV Shows                                                       NaN
Thrillers                                                      NaN
Name: Rating, Length: 514, dtype: float64
Worst Rated Genres:
 Genre
Action & Adventure                                             NaN
Action & Adventure, Anime Features                             NaN
Action & Adventure, Anime Features
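
Every value above is `NaN` because `Rating` was converted with `pd.to_numeric` in Step 5. On the raw strings, `max()` and `min()` compare labels lexicographically, which says nothing about what the labels mean. If the aim is to compare maturity labels per genre, one option is to rank the labels explicitly. In the sketch below the rows are made up, and the list `order` is an illustrative assumption rather than something defined by the dataset.

```python
import pandas as pd

# Made-up rows with string maturity labels
toy = pd.DataFrame({
    "Genre": ["Dramas", "Dramas", "Comedies", "Comedies"],
    "Rating": ["R", "NC-17", "G", "PG-13"],
})

# Plain string comparison: "R" sorts after "NC-17" alphabetically
alphabetical_max = toy.groupby("Genre")["Rating"].max()

# Assumed ordering, least to most restrictive (illustrative only)
order = ["G", "PG", "PG-13", "R", "NC-17"]
rank = {label: position for position, label in enumerate(order)}

# Compare by position in `order`, then map the winning position back to its label
toy["rank"] = toy["Rating"].map(rank)
ordered_max = toy.groupby("Genre")["rank"].max().map(lambda position: order[position])

print(alphabetical_max)
print(ordered_max)
```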

**OR**

In [None]:
df = pd.read_csv('/content/netflix_titles-1.csv')

In [None]:
df.dropna(inplace=True)

In [None]:
# Highest Rating label per genre (string max, compared alphabetically), sorted in descending order
best_genres = df.groupby('Genre')['Rating'].max().sort_values(ascending=False)

# Lowest Rating label per genre (string min, compared alphabetically), sorted in descending order
worst_genres = df.groupby('Genre')['Rating'].min().sort_values(ascending=False)

print("Best Rated Genres:\n", best_genres)
print("Worst Rated Genres:\n", worst_genres)


Best Rated Genres:
 Genre
Action & Adventure, Comedies                                      UR
Dramas, International Movies, Romantic Movies                     UR
Kids' TV, TV Action & Adventure, TV Sci-Fi & Fantasy        TV-Y7-FV
Action & Adventure, Children & Family Movies                TV-Y7-FV
Children & Family Movies, Comedies                          TV-Y7-FV
                                                              ...   
Children & Family Movies, Comedies, Faith & Spirituality          PG
Children & Family Movies, Classic Movies, Dramas                  PG
Action & Adventure, Classic Movies, Sci-Fi & Fantasy               G
Classic Movies, Music & Musicals                                   G
Children & Family Movies, Classic Movies                           G
Name: Rating, Length: 514, dtype: object
Worst Rated Genres:
 Genre
Anime Series, Kids' TV, TV Action & Adventure          TV-Y7
Children & Family Movies, Comedies, LGBTQ Movies       TV-Y7
Classic & Cult TV, Kids' 