<a href="https://colab.research.google.com/github/dsmukti/mukticapstone/blob/main/NETFLIX_MOVIES_AND_TV_SHOWS_CLUSTERING_Mukti.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes could also provide many interesting findings.

## <b>In this project, you are required to do the following:</b>
1. Exploratory Data Analysis

2. Understanding what type of content is available in different countries

3. Determining whether Netflix has increasingly focused on TV rather than movies in recent years

4. Clustering similar content by matching text-based features



# **Attribute Information**

1. show_id : Unique ID for every movie / TV show

2. type : Identifier - a Movie or TV Show

3. title : Title of the movie / TV show

4. director : Director of the movie

5. cast : Actors involved in the movie / show

6. country : Country where the movie / show was produced

7. date_added : Date it was added on Netflix

8. release_year : Actual release year of the movie / show

9. rating : TV rating of the movie / show

10. duration : Total duration - in minutes or number of seasons

11. listed_in : Genre

12. description : Summary description

# **Importing Libraries**

In [527]:
# Importing the Libraries
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# **Loading Data**

### Mounting the Drive

In [528]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Reading the Dataset 

In [529]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Module4/CapstoneProject_UnsupervisedLearning/Clustering/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

# **Data Sanity Check**

### First Look of given dataset

In [530]:
df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [531]:
df.info()
print('Number of rows i.e. titles in given dataset is: \033[1m',df.shape[0],'\n\033[0mNumber of columns i.e. variables or features in given dataset is: \033[1m',df.shape[1])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB
Number of rows i.e. titles in given dataset is: 7787
Number of columns i.e. variables or features in given dataset is: 12


Here **release_year** has been read as **"int64"**. Because it labels a year rather than measuring a quantity, it can be treated as a categorical variable, so the datatype needs to be changed for that kind of analysis.

Similarly, **date_added** has been read as object type. It needs to be converted to datetime format so that more detailed information, such as the month and year of adding, can be extracted.
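
The two conversions described above could be sketched as follows. This is a minimal illustration on three rows copied from the preview, not the notebook's actual cell: `sample` is a hypothetical stand-in for `df`, and casting `release_year` to `category` is only one possible way to treat it as categorical.

```python
import pandas as pd

# Hypothetical stand-in for df: three rows taken from the preview above.
sample = pd.DataFrame({
    "title": ["3%", "7:19", "23:59"],
    "date_added": ["August 14, 2020", "December 23, 2016", "December 20, 2018"],
    "release_year": [2020, 2016, 2011],
})

# Parse the date strings; errors="coerce" turns unparseable values into NaT.
sample["date_added"] = pd.to_datetime(sample["date_added"].str.strip(), errors="coerce")

# Treat the release year as a label rather than a quantity.
sample["release_year"] = sample["release_year"].astype("category")

print(sample.dtypes)
```

Stripping whitespace before parsing is a precaution, on the assumption that some date strings may carry stray spaces.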

In [532]:
# Checking characters in continuous numerical variables & unique values in categorical variables
for i in df:
    print('\033[1m',i,'\033[0m')
    print(df[i].unique())
    print('-'*120)

[1m show_id [0m
['s1' 's2' 's3' ... 's7785' 's7786' 's7787']
------------------------------------------------------------------------------------------------------------------------
[1m type [0m
['TV Show' 'Movie']
------------------------------------------------------------------------------------------------------------------------
[1m title [0m
['3%' '7:19' '23:59' ... 'Zulu Man in Japan' "Zumbo's Just Desserts"
 "ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS"]
------------------------------------------------------------------------------------------------------------------------
[1m director [0m
[nan 'Jorge Michel Grau' 'Gilbert Chan' ... 'Josef Fares' 'Mozez Singh'
 'Sam Dunn']
------------------------------------------------------------------------------------------------------------------------
[1m cast [0m
['João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Fratesc

# **Data Cleaning and Feature Engineering**

In [533]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [534]:
df.nunique()

show_id         7787
type               2
title           7787
director        4049
cast            6831
country          681
date_added      1565
release_year      73
rating            14
duration         216
listed_in        492
description     7769
dtype: int64

## **Missing Value and Treatment**

In [535]:
#Checking Null Values
(df.isnull().sum()/len(df))*100

show_id          0.000000
type             0.000000
title            0.000000
director        30.679337
cast             9.220496
country          6.510851
date_added       0.128419
release_year     0.000000
rating           0.089893
duration         0.000000
listed_in        0.000000
description      0.000000
dtype: float64
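
The same percentages can also be obtained with `isna().mean()`, since the mean of a boolean column is the fraction of `True` values. The sketch below uses a small hypothetical frame (`sample`) rather than the full dataset, and sorts the result so that the columns with the most missing data appear first.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with one missing director, as in the preview.
sample = pd.DataFrame({
    "title": ["3%", "7:19", "23:59", "9"],
    "director": [np.nan, "Jorge Michel Grau", "Gilbert Chan", "Shane Acker"],
    "country": ["Brazil", "Mexico", "Singapore", "United States"],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = sample.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```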

In [536]:
# Dropping the 'director' and 'cast' columns (about 30.7% and 9.2% missing, respectively)
df.drop(['director','cast'],axis=1,inplace=True)

In [537]:
# Checking and displaying the column names after dropping the two columns 'director' and 'cast'
df.columns

Index(['show_id', 'type', 'title', 'country', 'date_added', 'release_year',
       'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [538]:
# changing the datatype for "date_added" from object type to datetime form 
df["date_added"] = pd.to_datetime(df['date_added'])
df.head(3)

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Singapore,2018-12-20,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."


We can also extract the **month and year of adding** from `date_added`.

In [539]:
df['year_of_adding'] = df['date_added'].dt.year
df['month_of_adding'] = df['date_added'].dt.month

In [540]:
df.head(3)

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description,year_of_adding,month_of_adding
0,s1,TV Show,3%,Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...,2020.0,8.0
1,s2,Movie,7:19,Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,2016.0,12.0
2,s3,Movie,23:59,Singapore,2018-12-20,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow...",2018.0,12.0
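
The new columns appear as floats (`2020.0`, `8.0`) because rows with a missing `date_added` produce `NaN`, which forces a float dtype. One possible way to keep whole numbers is pandas' nullable `Int64` dtype. The sketch below assumes a small hypothetical frame (`sample`) with one missing date.

```python
import pandas as pd

# Hypothetical frame: two valid dates and one missing value (NaT).
sample = pd.DataFrame({
    "date_added": pd.to_datetime(["2020-08-14", "2016-12-23", None]),
})

# Nullable Int64 keeps integers and represents the missing entry as <NA>.
sample["year_of_adding"] = sample["date_added"].dt.year.astype("Int64")
sample["month_of_adding"] = sample["date_added"].dt.month.astype("Int64")

print(sample)
```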


In [542]:
# converting month number to month name
#df_month['month_of_adding'] = df_month['month'].replace({1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'June', 7:'July', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'})
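
An alternative to a manual replacement dictionary is the `dt.month_name()` accessor, which derives the name directly from a datetime column. This is a minimal sketch on a hypothetical frame, and the column name `month_name_of_adding` is an assumption. Note that it yields full English names such as `August`, not abbreviations such as `Aug`.

```python
import pandas as pd

# Hypothetical frame with the three dates shown in the preview.
sample = pd.DataFrame({
    "date_added": pd.to_datetime(["2020-08-14", "2016-12-23", "2018-12-20"]),
})

# Derive the month name from the datetime column.
sample["month_name_of_adding"] = sample["date_added"].dt.month_name()

print(sample)
```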


In [543]:
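# Replacing the null values with "NA"
# Note: filling 'date_added' with a string changes that column from datetime to object dtype
df['country'] = df['country'].fillna('NA')
df['date_added'] = df['date_added'].fillna('NA')
df['rating'] = df['rating'].fillna('NA')
#df.isnull().sum().sum()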

In [544]:
#Rechecking the null value
(df.isnull().sum()/len(df))*100

show_id            0.000000
type               0.000000
title              0.000000
country            0.000000
date_added         0.000000
release_year       0.000000
rating             0.000000
duration           0.000000
listed_in          0.000000
description        0.000000
year_of_adding     0.128419
month_of_adding    0.128419
dtype: float64

In [None]:
#Assigning the 'ratings' into grouped categories
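
One way this grouping could be approached is with a lookup dictionary and `Series.map`. The sketch below is only an illustration: the group labels and the assignment of each rating to a group are assumptions, not taken from the dataset, and `sample` is a hypothetical frame. Ratings missing from the dictionary, including the `'NA'` placeholder, fall into an `Unknown` group.

```python
import pandas as pd

# Hypothetical grouping; the audience labels below are an assumption.
rating_groups = {
    "TV-MA": "Adults", "R": "Adults", "NC-17": "Adults",
    "PG-13": "Teens", "TV-14": "Teens",
    "PG": "Older Kids", "TV-PG": "Older Kids",
    "G": "Kids", "TV-G": "Kids", "TV-Y": "Kids",
}

sample = pd.DataFrame({"rating": ["TV-MA", "R", "PG-13", "NA"]})

# map() returns NaN for ratings not in the dictionary; label those explicitly.
sample["rating_group"] = sample["rating"].map(rating_groups).fillna("Unknown")

print(sample)
```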

# **Exploratory Data Analysis**

# **Hypothesis based on Data Visualisation**

# **Modelling**


# **Prediction and Evaluation for model**


# **Final summary of conclusion**

# **Type of content available in different countries**
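
A possible starting point for this question is a country-by-type count. The sketch below assumes that a `country` value may list several countries separated by commas. The 681 distinct values suggest such combinations, although the preview shows only single countries. The multi-country row in `sample` is hypothetical.

```python
import pandas as pd

# Hypothetical rows; "United States, Canada" is invented for illustration.
sample = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie", "Movie"],
    "country": ["Brazil", "Mexico", "United States, Canada", "United States"],
})

# Split multi-country entries into one row per country.
exploded = sample.assign(country=sample["country"].str.split(",")).explode(
    "country", ignore_index=True
)
exploded["country"] = exploded["country"].str.strip()

# Count titles per country and content type.
counts = pd.crosstab(exploded["country"], exploded["type"])
print(counts)
```

With this approach a co-produced title is counted once for each listed country, which may or may not be the desired convention.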





# **Has Netflix been increasingly focusing on TV rather than movies in recent years?**
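
One way to examine this question is to compare the share of TV shows among titles added each year. The sketch below uses hypothetical rows in `sample`, not the real data. `pd.crosstab` with `normalize="index"` turns each year's counts into proportions.

```python
import pandas as pd

# Hypothetical rows: three titles added in 2016 and three in 2020.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year_of_adding": [2016, 2016, 2016, 2020, 2020, 2020],
})

# Share of each content type within every year of adding.
share = pd.crosstab(sample["year_of_adding"], sample["type"], normalize="index")
print(share.round(2))
```

A rising `TV Show` column across years would be consistent with a shift towards TV, though the count of titles added per year should be checked as well.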

# **Clustering similar content by matching text-based features**
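
A minimal sketch of text-based clustering, assuming TF-IDF vectors of the `description` column and k-means as the clustering method. The notebook itself does not specify the method. The descriptions in `sample` are invented for illustration, and `n_clusters=2` is an arbitrary choice for this toy example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions, written for illustration only.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": [
        "A detective investigates a murder in a small town",
        "A police detective hunts a killer after a murder",
        "Two friends fall in love during a summer wedding",
        "A wedding planner finds love with her best friend",
    ],
})

# Turn each description into a TF-IDF vector, ignoring English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sample["description"])

# Pairwise cosine similarity between descriptions (4 x 4 matrix).
similarity = cosine_similarity(tfidf)

# Group the descriptions into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
sample["cluster"] = kmeans.fit_predict(tfidf)

print(sample[["title", "cluster"]])
```

On the real dataset the number of clusters would need to be chosen with a criterion such as the elbow method or silhouette score.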