<a href="https://colab.research.google.com/github/anshu57/Netflix-Movies-and-TV-shows-Clustering-Unsupervised-Machine-Learning/blob/main/NETFLIX_MOVIES_AND_TV_SHOWS_CLUSTERING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes can also provide many interesting findings.

## <b>In this project, you are required to do the following</b>
1. Exploratory Data Analysis

2. Understanding what type of content is available in different countries

3. Determining whether Netflix has increasingly focused on TV shows rather than movies in recent years

4. Clustering similar content by matching text-based features



# **Attribute Information**

1. show_id : Unique ID for every Movie / TV Show

2. type : Identifier - A Movie or TV Show

3. title : Title of the Movie / TV Show

4. director : Director of the Movie

5. cast : Actors involved in the movie / show

6. country : Country where the movie / show was produced

7. date_added : Date it was added on Netflix

8. release_year : Actual release year of the movie / show

9. rating : TV Rating of the movie / show

10. duration : Total Duration - in minutes or number of seasons

11. listed_in : Genre

12. description : The Summary description
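
Because `duration` mixes two units (minutes for movies, seasons for TV shows), it cannot be compared numerically as a raw string. The following is a minimal sketch of one way to split it into a number and a unit. It runs on made-up rows rather than the real file, and the column names `duration_value` and `duration_unit` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows mimicking the `duration` column; the real data is not loaded here.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "1 Season", "3 Seasons"],
})

# Capture the leading number and the unit word that follows it.
parts = toy["duration"].str.extract(
    r"^(?P<duration_value>\d+)\s+(?P<duration_unit>\w+)$"
)

# Hypothetical helper columns: numeric value and a normalised unit label.
toy["duration_value"] = parts["duration_value"].astype(int)
toy["duration_unit"] = parts["duration_unit"].str.lower().str.rstrip("s")

print(toy)
```

Keeping the unit in its own column makes it possible to analyse movie runtimes and season counts separately instead of mixing them on one scale.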

In [68]:
import pandas as pd
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [69]:
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone Projects/Unsupervised ML (Netflix Movies and TV shows Clustering) /netflix_titles.csv')

In [71]:
dataset.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


# Exploratory Data Analysis

In [72]:
dataset.shape

(6234, 12)

In [73]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int64 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB


In [74]:
dataset.describe(include='all')

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,6234.0,6234,6234,4265,5664,5758,6223,6234.0,6224,6234,6234,6234
unique,,2,6172,3301,5469,554,1524,,14,201,461,6226
top,,Movie,The Silence,"Raúl Campos, Jan Suter",David Attenborough,United States,"January 1, 2020",,TV-MA,1 Season,Documentaries,A surly septuagenarian gets another chance at ...
freq,,4265,3,18,18,2032,122,,2027,1321,299,3
mean,76703680.0,,,,,,,2013.35932,,,,
std,10942960.0,,,,,,,8.81162,,,,
min,247747.0,,,,,,,1925.0,,,,
25%,80035800.0,,,,,,,2013.0,,,,
50%,80163370.0,,,,,,,2016.0,,,,
75%,80244890.0,,,,,,,2018.0,,,,


In [75]:
# Checking for fully duplicated rows
dataset.duplicated().sum()

0

In [76]:
dataset.nunique()

show_id         6234
type               2
title           6172
director        3301
cast            5469
country          554
date_added      1524
release_year      72
rating            14
duration         201
listed_in        461
description     6226
dtype: int64
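
The counts above show 6172 unique titles across 6234 rows, so some titles repeat even though no row is a full duplicate (every `show_id` is unique). A sketch of how such repeats could be inspected on a subset of columns is given below. It uses made-up rows, and the choice of `title`, `type`, and `release_year` as the comparison key is an assumption for illustration.

```python
import pandas as pd

# Toy rows: the same title appears three times with different show_id values.
toy = pd.DataFrame({
    "show_id": [1, 2, 3, 4],
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "title": ["The Silence", "The Silence", "The Silence", "Comedies Night"],
    "release_year": [2019, 2019, 2010, 2017],
})

# Full-row check: nothing is flagged because show_id differs on every row.
full_row_dupes = int(toy.duplicated().sum())

# Subset check (assumed key): flag every row that shares title, type and year.
key_cols = ["title", "type", "release_year"]
repeated = toy[toy.duplicated(subset=key_cols, keep=False)]

print(full_row_dupes)
print(repeated)
```

A repeated title is not necessarily an error, since different productions can share a name, so rows flagged this way are candidates for review rather than automatic removal.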

## Handling Missing Values

In [77]:
dataset.isnull().sum()

show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
dtype: int64

In [78]:
# Calculating the percentage of null values in each column
total = dataset.isnull().sum().sort_values(ascending=False)
percent = (dataset.isnull().sum() / len(dataset)).sort_values(ascending=False) * 100
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

Unnamed: 0,Total,Percent
director,1969,31.584857
cast,570,9.143407
country,476,7.635547
date_added,11,0.176452
rating,10,0.160411
show_id,0,0.0
type,0,0.0
title,0,0.0
release_year,0,0.0
duration,0,0.0


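The director column is about 31.6% null, so dropping those rows would discard nearly a third of the data. The cast (9.1%) and country (7.6%) columns also have a notable share of null values, while date_added (11 rows) and rating (10 rows) are missing only a handful.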
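
To illustrate the trade-off between dropping and filling, here is a minimal sketch on made-up rows. It compares how many rows survive when every row with any null is dropped against filling a heavily missing column with a placeholder and dropping only on a sparsely missing one. The toy values are assumptions and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame: `director` is missing often, `date_added` is missing once.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "director": ["X", np.nan, np.nan, "Y", "Z"],
    "date_added": ["January 1, 2020", "May 5, 2018", "June 2, 2019",
                   np.nan, "July 7, 2017"],
})

# Dropping on every column removes any row with a null anywhere.
rows_if_drop_all = len(toy.dropna())

# Dropping only on the sparsely missing column keeps far more rows.
rows_if_drop_date_only = len(toy.dropna(subset=["date_added"]))

# Fill the heavily missing column first, then drop on the sparse one.
filled = (
    toy.assign(director=toy["director"].fillna("No data"))
       .dropna(subset=["date_added"])
)

print(rows_if_drop_all, rows_if_drop_date_only, len(filled))
```

On the toy frame, dropping on all columns keeps 2 of 5 rows, while the fill-then-drop approach keeps 4 with no nulls left in `director`.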

In [79]:
# Replacing null values in 'director' column with value 'No data'
dataset['director'] = dataset['director'].fillna('No data')

In [80]:
# Replacing null values in 'cast' column with 'No data'
dataset['cast'] = dataset['cast'].fillna('No data')

In [85]:
# Replacing null values in 'country' column with the most frequent country, i.e. the mode
dataset['country'] = dataset['country'].fillna(dataset['country'].mode()[0])

In [86]:
# Replacing null values in 'rating' column with the most frequent rating, i.e. the mode
dataset['rating'] = dataset['rating'].fillna(dataset['rating'].mode()[0])

In [87]:
# Dropping the rows with null values in the 'date_added' column
dataset.dropna(subset=['date_added'], inplace=True)

In [88]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6223 entries, 0 to 6222
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6223 non-null   int64 
 1   type          6223 non-null   object
 2   title         6223 non-null   object
 3   director      6223 non-null   object
 4   cast          6223 non-null   object
 5   country       6223 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6223 non-null   int64 
 8   rating        6223 non-null   object
 9   duration      6223 non-null   object
 10  listed_in     6223 non-null   object
 11  description   6223 non-null   object
dtypes: int64(2), object(10)
memory usage: 632.0+ KB
