<a href="https://colab.research.google.com/github/babinash/NetflixShowsProject/blob/main/Netflix_Movies%26Shows_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**


Netflix is the world's largest online streaming service provider with over 220 million subscribers as of 2022-Q2.

Effective clustering of shows on the Netflix platform is crucial for enhancing the user experience and preventing subscriber churn.

The goal of the project is to classify/group Netflix shows into clusters based on similarity, allowing for personalized show suggestions to users.

The dataset used for this project consists of TV shows and movies available on Netflix as of 2019, collected from Flixable, a third-party Netflix search engine.

A report from 2018 revealed that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by over 2,000 titles.

Exploring the dataset can uncover further insights, such as understanding the types of content available in different countries and whether Netflix has shifted its focus more towards TV shows in recent years.

Integrating external datasets like IMDB ratings and Rotten Tomatoes can provide additional interesting findings.

Exploratory data analysis can help in understanding the distribution of content types, analyzing trends over time, and identifying patterns or relationships in the data.

Text-based features can be utilized to cluster similar content, allowing for a better understanding of the relationships between different shows and movies on Netflix.

Clustering shows on Netflix based on similarity can enable the platform to offer personalized recommendations to users, improving user engagement and satisfaction.

The dataset from Flixable provides an opportunity to explore the growth and evolution of Netflix's content library, particularly the shift towards producing and acquiring more TV shows.

Analyzing the distribution of content types in different countries can reveal regional preferences and help Netflix tailor its offerings to specific markets.

By integrating external datasets such as IMDB ratings and Rotten Tomatoes, Netflix can gain insights into the quality and popularity of its content, which can inform decision-making processes related to acquisitions and content production.

Clustering shows based on text-based features, such as genre, description, or cast information, can uncover hidden patterns and similarities among different titles, allowing for more targeted content recommendations and a better understanding of user preferences.

These points provide an overview of the objectives, dataset, insights, and methods involved in the project.

# **In this project, you are required to do**

Exploratory Data Analysis

Understanding what types of content are available in different countries

Whether Netflix has increasingly focused on TV shows rather than movies in recent years

Clustering similar content by matching text-based features

# **GitHub Link -**


https://github.com/babinash/NetflixShowsProject

# **Problem Statement**



This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes scores could also provide many interesting findings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # numpy.math was only an alias of the standard module and was removed in NumPy 2.0
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
df=pd.read_csv('/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head() #top 5 rows

In [None]:
#last 5 rows
df.tail()

In [None]:
# Checking the shape of the dataframe
df.shape

In [None]:
# Columns name
df.columns

There are 12 columns.

# **Attribute Information**

1. show_id : Unique ID for every Movie / TV Show

2. type : Identifier - A Movie or TV Show

3. title : Title of the Movie / TV Show

4. director : Director of the Movie

5. cast : Actors involved in the movie / show

6. country : Country where the movie / show was produced

7. date_added : Date it was added on Netflix

8. release_year : Actual release year of the movie / show

9. rating : TV Rating of the movie / show

10. duration : Total Duration - in minutes or number of seasons

11. listed_in : Genre

12. description : The summary description

## Checking Datatypes

In [None]:
# Information about the dataset
df.info()

# **Data processing**

In [None]:
# Check for duplicate rows (an empty result means there is nothing to drop)
df[df.duplicated()]

There are no duplicate rows.

**Handling Null values**

In [None]:
#Checking Null Values
df.isnull().sum()

In [None]:
#total null values
df.isnull().sum().sum()

The dataset contains 3,631 null values in total: 2,389 in the director column, 718 in cast, 507 in country, 10 in date_added, and 7 in rating. These need to be handled before the analysis.
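
Before choosing between filling and dropping, it can help to see each column's nulls as a share of all rows. The following is a minimal sketch on a made-up toy frame; `null_summary` is a hypothetical helper name, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are invented for illustration).
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "cast": ["x", "y", np.nan, "z"],
    "rating": ["TV-MA", "PG", "R", "TV-14"],
})

def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)

summary = null_summary(toy)
print(summary)
```

A column with a large null share (such as director here) is a candidate for dropping or for a placeholder value, while columns with only a handful of nulls can usually lose the affected rows.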

In [None]:
#Handling Null Values
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is unreliable under pandas copy-on-write
df['cast'] = df['cast'].fillna('No cast')
df['country'] = df['country'].fillna(df['country'].mode()[0])

In [None]:
#'date_added' and 'rating' have very few null values, so we drop the affected rows
df.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
#Dropping Director Column
df.drop(['director'],axis=1,inplace=True)

In [None]:
#Re-checking that no null values remain
df.isnull().sum()

# **Exploratory Data Analysis**

**1. Count of Movies and TV Shows**

In [None]:
df['type'].value_counts()

In [None]:
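#countplot to visualize the number of movies and tv_shows in type column
# pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x='type', data=df)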

* On Netflix, there are 5,372 movies and 2,398 TV shows available.

* Movies therefore outnumber TV shows on the platform by more than two to one.

**2. Which rating category has the most titles?**

In [None]:
df['rating']

In [None]:
#Assigning the Ratings into grouped categories
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
df['target_ages'] = df['rating'].replace(ratings)

In [None]:
# type should be a category
df['type'] = pd.Categorical(df['type'])
df['target_ages'] = pd.Categorical(df['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])
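
One caveat with `replace` is that a rating missing from the dictionary is silently kept as-is and only turns into NaN later, when the column is converted to a categorical. A minimal sketch of catching such values early with `Series.map` follows; the toy frame and the deliberately invalid rating string are assumptions made for illustration.

```python
import pandas as pd

# Subset of the rating-to-audience mapping used in this notebook.
ratings_map = {"TV-MA": "Adults", "TV-14": "Teens", "PG": "Older Kids", "TV-Y": "Kids"}

# The last value is a deliberately invalid rating (hypothetical).
toy = pd.DataFrame({"rating": ["TV-MA", "PG", "TV-Y", "66 min"]})

# Series.map returns NaN for keys missing from the dict, whereas replace
# would leave the original string in place.
toy["target_ages"] = toy["rating"].map(ratings_map)

# Collect the ratings that could not be mapped so they can be inspected.
unmapped = toy.loc[toy["target_ages"].isna(), "rating"].unique().tolist()
print(unmapped)
```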

In [None]:
df

In [None]:
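#splitting the data into two separate dataframes, one per content type
#.copy() lets us add columns to them later without a SettingWithCopyWarning
tv_shows=df[df['type']=='TV Show'].copy()
movies=df[df['type']=='Movie'].copy()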

In [None]:
#Rating based on rating system of all TV Shows
tv_ratings = tv_shows.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (14,7)
fig, ax = plt.subplots(figsize=fig_dims)
sns.pointplot(x='rating',y='count',data=tv_ratings)
plt.title('TV Show Ratings',size='20')
plt.show()

* For TV shows on Netflix, TV-MA is the rating category with the largest number of titles. TV-MA stands for "TV Mature Audience," indicating that the content is intended for adult audiences due to its mature themes, language, or content. This suggests that Netflix offers a significant amount of content targeted at adult viewers.

In [None]:
#Movie Ratings based on Target Age Groups
plt.figure(figsize=(14,6))
plt.title('movie ratings')
sns.countplot(x=movies['rating'],hue=movies['target_ages'],data=movies,order=movies['rating'].value_counts().index)

* TV-MA is also the most common rating among movies, so for both movies and TV shows the largest share of content is targeted at adult audiences. TV-MA is associated with mature themes, strong language, and explicit content, and is intended for viewers aged 17 or older.

**3.Release_year**

In which year were the most movies released?

In [None]:
movies_year =movies['release_year'].value_counts().sort_index(ascending=False)

In [None]:
movies_year

In [None]:
tvshows_year =tv_shows['release_year'].value_counts().sort_index(ascending=False)

In [None]:
# visualizing the movies and tv_shows based on the release year
sns.set(font_scale=1.4)
movies_year.plot(figsize=(12, 8), linewidth=2.5, color='maroon',label="Movies / year",ms=3)
tvshows_year.plot(figsize=(12, 8), linewidth=2.5, color='blue',label="TV Shows / year")
plt.xlabel("Years", labelpad=15)
plt.ylabel("Number", labelpad=15)
plt.title("Production growth yearly", y=1.02, fontsize=22);

In [None]:
#Number of movies per release year, for the 20 release years with the most movies
plt.figure(figsize=(15,5))
sns.countplot(y=movies['release_year'],data=movies,order=movies['release_year'].value_counts().index[0:20])

* The most common release years for movies in the catalogue are 2017 and 2018. Note that release_year records when a title was first released, not when it was added to Netflix, so this describes the age of the catalogue rather than the timing of additions. The exact count for each year is listed in movies_year above.

In [None]:
tvshows_year

In [None]:
#Number of TV shows per release year, for the 20 release years with the most TV shows
plt.figure(figsize=(15,5))
sns.countplot(y=tv_shows['release_year'],data=tv_shows,order=tv_shows['release_year'].value_counts().index[0:20])

*  The most movies were released in 2017 and 2018.
*  The most TV shows were released in 2020.
*  The number of movies on Netflix has grown significantly faster than the number of TV shows.
*   Both movies and TV shows increased sharply after 2015.
*   There is a significant drop in the number of movies and TV shows released after 2020.
*   It appears that Netflix has focused more on increasing movie content than TV shows. Movies have increased much more dramatically than TV shows.
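
Raw counts can hide a shift in mix, so one way to examine the movies-versus-TV question is to track the share of TV shows among the titles added each year. The sketch below uses invented rows, and `year_added` is an assumed column (the year parsed from `date_added`), so it illustrates the calculation rather than reporting a result.

```python
import pandas as pd

# Invented example rows; `year_added` is assumed to be derived from date_added.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year_added": [2016, 2016, 2016, 2019, 2019, 2019],
})

# The mean of a boolean flag per group is the fraction of TV shows in that year.
tv_share = (
    toy.assign(is_tv=toy["type"].eq("TV Show"))
       .groupby("year_added")["is_tv"]
       .mean()
)
print(tv_share)
```

A rising `tv_share` over time would point to a growing focus on TV shows even when movies still dominate in absolute numbers.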

In [None]:
df

In [None]:
#adding a column for the month in which each title was added

df['month'] = pd.DatetimeIndex(df['date_added']).month
df.head()

**4.Release_month**

In which month is the most content added to Netflix?

In [None]:
# Plotting the Countplot
plt.figure(figsize=(10,10))
ax=sns.countplot(x='month',data = df)

* The period from October to January experienced the highest influx of new movies and TV shows being added to the Netflix platform. During these months, there was a notable increase in the number of content additions, indicating a peak in new releases and updates to the streaming service's library. This timeframe is likely associated with the holiday season and colder months when people tend to spend more time indoors and engage in streaming entertainment.

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

sns.countplot(x='month', hue='type',lw=5, data=df, ax=ax)

* The above graph shows that the most content is added to Netflix from October to January.

* During this period, there is a significant spike in the number of additions, indicating a concentrated period of content updates and releases on the platform. This trend suggests that Netflix strategically focuses on introducing new movies and TV shows during the end-of-year holiday season and the beginning of the new year, potentially to cater to increased viewer demand during these months.

**5.genre**

Which genre is more popular.

In [None]:
#Analysing top10 genre of the movies
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of Movies',fontweight="bold")
sns.countplot(y=movies['listed_in'],data=movies,order=movies['listed_in'].value_counts().index[0:10])

* Documentaries are the most common movie genre on Netflix, followed by stand-up comedy and by dramas and international movies.

* Documentaries likely hold a significant position due to their educational and informative nature, appealing to a wide range of viewers. Stand-up comedy, known for its entertaining and comedic performances, is also highly popular. Dramas, with their compelling narratives and emotional depth, attract a substantial audience. Additionally, the inclusion of international movies highlights the global appeal and diverse content available on Netflix, catering to viewers' preferences from various regions and cultures.
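
Because `listed_in` stores several comma-separated genres in one string, counting the raw strings ranks genre combinations rather than individual genres. A minimal sketch of counting single genres with `str.split` and `explode`, on invented rows:

```python
import pandas as pd

# Invented example rows in the same shape as the listed_in column.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, International Movies", "Documentaries", "Dramas, Comedies"],
})

# Split each string into a list, put every genre on its own row, trim spaces.
genres = (
    toy["listed_in"]
    .str.split(",")
    .explode()
    .str.strip()
)
genre_counts = genres.value_counts()
print(genre_counts)
```

Under this counting a title contributes once to each genre it lists, so the totals exceed the number of titles.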

In [None]:
#Analysing top10 genres of TVSHOWS
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of TV Shows',fontweight="bold")
sns.countplot(y=tv_shows['listed_in'],data=tv_shows,order=tv_shows['listed_in'].value_counts().index[0:10])

* The top genre for TV shows on Netflix is "Kids TV", which offers age-appropriate content that is entertaining and educational for young viewers. Its popularity reflects the platform's effort to serve different audience segments, including children and families, and to provide a safe and enjoyable streaming experience for them.


**6.Duration**

How are durations distributed for movies and TV shows?

In [None]:
#Checking the distribution of Movie Durations
plt.figure(figsize=(10,7))
# Regex: \d matches a digit and + means "one or more", so (\d+) captures the leading number
# histplot replaces the deprecated distplot; expand=False makes extract return a Series
sns.histplot(pd.to_numeric(movies['duration'].str.extract(r'(\d+)', expand=False)), color='red')
plt.title('Distribution of Movie Durations',fontweight="bold")
plt.show()

* Most movies have a duration between 50 and 150 minutes.

* This range suggests that the majority of movies on the platform fall within standard feature-film length, offering a complete story within a reasonable time frame while still leaving options for both shorter and longer viewing sessions.
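
The duration column mixes two units: minutes for movies and seasons for TV shows. A small sketch of parsing both forms with the standard `re` module is shown here; `parse_duration` is a hypothetical helper, and the accepted formats ("93 min", "1 Season", "2 Seasons") are assumptions about how the strings look.

```python
import re

# One or more digits, optional spaces, then a unit word.
DURATION_RE = re.compile(r"(\d+)\s*(min|Seasons?)")

def parse_duration(text):
    """Return (value, unit) with unit 'min' or 'season', or None if unparseable."""
    match = DURATION_RE.fullmatch(text.strip())
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return value, ("min" if unit == "min" else "season")

print(parse_duration("93 min"))
print(parse_duration("2 Seasons"))
```

Keeping the unit alongside the number avoids accidentally averaging minutes and seasons together.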

In [None]:
#Checking the distribution of TV SHOWS
plt.figure(figsize=(30,6))
plt.title("Distribution of TV Shows duration",fontweight='bold')
sns.countplot(x=tv_shows['duration'],data=tv_shows,order = tv_shows['duration'].value_counts().index)

* Most TV shows consist of a single season.
* This indicates that there is a significant portion of TV shows available on the platform that were either intended to be a limited series or were discontinued after a single season. These TV shows might offer a concise and self-contained narrative within a single season, providing viewers with a complete story arc. It also suggests that Netflix offers a diverse range of TV shows, including both long-running series and shorter, self-contained ones, catering to different viewer preferences and providing a variety of content options.

In [None]:
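# Extract the runtime in minutes as a number (expand=False returns a Series)
movies['minute'] = pd.to_numeric(movies['duration'].str.extract(r'(\d+)', expand=False))
# Mean runtime per rating, sorted from shortest to longest
duration_year = movies.groupby(['rating'])['minute'].mean()
duration_df=pd.DataFrame(duration_year).sort_values('minute')
plt.figure(figsize=(12,6))
ax=sns.barplot(x=duration_df.index, y=duration_df.minute)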

* Movies with an NC-17 rating tend to have the longest average duration among all rating categories. The NC-17 rating signifies that the content is intended for mature audiences only, usually due to explicit or graphic content. These movies often explore adult themes and can have extended runtimes to accommodate complex storytelling or explicit scenes.

* On the other hand, movies with a TV-Y rating, which is intended for young children, have the shortest average runtime. TV-Y-rated movies are typically designed to be age-appropriate and suitable for all audiences, including young viewers. As a result, they often have shorter runtimes to match the attention span and viewing preferences of young children.

* These observations highlight the correlation between content rating and movie duration, with more mature content often being associated with longer runtimes, while movies for younger audiences tend to be shorter to cater to their needs and engagement levels.

**7.country**

Which country has the most content on Netflix?

In [None]:
#Analysing top15 countries with most content
plt.figure(figsize=(18,5))
sns.countplot(x=df['country'],order=df['country'].value_counts().index[0:15],hue=df['type'])
plt.xticks(rotation=50)
plt.title('Top 15 countries with most contents', fontsize=15, fontweight='bold')
plt.show()

* The United States has the most content on Netflix, followed by India.
* This indicates that the Netflix library offers a significant amount of content targeted at viewers in these two countries.
* The United States, being the home country of Netflix, has a diverse range of movies and TV shows available on the platform.
* India, with its large population and growing market for streaming services, has also emerged as a significant contributor to the Netflix content library.
* The presence of a substantial amount of content from both the United States and India suggests that Netflix aims to cater to the preferences and interests of viewers in these key markets.

In [None]:
#Number of titles per country, sorted from most to fewest
country=df['country'].value_counts().reset_index()
country

In [None]:
# Plotting a horizontal bar plot of the Movie / TV Show split for the 11 countries with the most content
country_order = df['country'].value_counts()[:11].index
content_data = df[['type', 'country']].groupby('country')['type'].value_counts().unstack().loc[country_order]
content_data['sum'] = content_data.sum(axis=1)
content_data_ratio = (content_data.T / content_data['sum']).T[['Movie', 'TV Show']].sort_values(by='Movie',ascending=False)[::-1]

# Plotting the barh
fig, ax = plt.subplots(1,1,figsize=(15, 8),)

ax.barh(content_data_ratio.index, content_data_ratio['Movie'],
        color='crimson', alpha=0.8, label='Movie')
ax.barh(content_data_ratio.index, content_data_ratio['TV Show'], left=content_data_ratio['Movie'],
        color='black', alpha=0.8, label='TV Show')

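* Among the top countries, India has the highest proportion of movies relative to TV shows. The chart shows each country's Movie / TV Show split rather than absolute counts.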
* This indicates that the Netflix library includes a significant collection of movies specifically targeted towards Indian viewers. Given the popularity of Indian cinema and the large film industry in India, it is not surprising to see a substantial number of Indian movies featured on the platform. This diverse selection of Indian movies allows Netflix to cater to the preferences and interests of its Indian audience and provides them with a wide range of choices from the Indian film industry.

In [None]:
# Preparing data for heatmap
df['count'] = 1
data = df.groupby('country')[['count']].sum().sort_values(by='count',ascending=False).reset_index()[:10]
top_countries = data['country']


df_heatmap = df[df['country'].isin(top_countries)]
df_heatmap = pd.crosstab(df_heatmap['country'],df_heatmap['target_ages'],normalize = "index").T
df_heatmap

In [None]:
# Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(10, 10))

country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain',
       'Mexico']

age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

sns.heatmap(df_heatmap.loc[age_order,country_order2],cmap="YlGnBu",square=True, linewidth=2.5,cbar=False, annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})
plt.show()

* The target age demographics for Netflix in the US and UK are closely aligned, indicating that the content offered by the platform caters to similar age groups in these countries. However, when compared to countries like India or Japan, the target age demographics differ significantly. This suggests that the content preferences and interests of viewers in India or Japan may vary significantly from those in the US and UK. Netflix likely tailors its content selection and recommendations to cater to the specific preferences and cultural nuances of each country.

* On the other hand, Mexico and Spain have similar content on Netflix, but for different age groups. This implies that while the content offered in these countries may align in terms of themes, genres, or overall style, the target age demographics within each country differ. Netflix recognizes the varying preferences of viewers within different regions and strives to provide tailored content that appeals to specific age groups in each market.

* These observations highlight how Netflix adapts its content strategy to cater to the preferences and cultural nuances of different countries and their respective target age demographics.
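
The percentages in the heatmap come from a row-normalised crosstab, which makes countries with very different catalogue sizes comparable. A minimal sketch of that normalisation on invented rows:

```python
import pandas as pd

# Invented example rows; country names and age groups are placeholders.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "US", "India", "India"],
    "target_ages": ["Adults", "Adults", "Teens", "Kids", "Teens", "Teens"],
})

# normalize="index" divides each row (country) by its own total, so after
# transposing, every country column sums to 1.
mix = pd.crosstab(toy["country"], toy["target_ages"], normalize="index").T
print(mix)
```

Each column of `mix` can then be read as that country's audience mix, independent of how many titles it has.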

**8.Originals**

Count of Netflix Originals

In [None]:
df['date_added'] = pd.to_datetime(df['date_added'], format ='%B %d, %Y', errors='coerce')
movies['year_added'] = df['date_added'].dt.year
df

* Some movies and TV shows on Netflix were originally released outside of the platform and were later added to the Netflix library. These titles are not considered Netflix Originals. They include content from various production studios and television networks that have distribution deals with Netflix.

* On the other hand, Netflix Originals refer to movies and TV shows that are produced or co-produced by Netflix itself. These titles are exclusively created for the platform and are not available on any other streaming service or traditional television networks. Netflix Originals encompass a wide range of genres and include both original movies and series that are developed and produced by Netflix's own production teams.

* By producing original content, Netflix aims to offer unique and exclusive programming to its subscribers, distinguishing itself from other streaming platforms and traditional television networks. Netflix Originals have gained significant popularity and critical acclaim, contributing to the platform's success in attracting and retaining subscribers.

In [None]:
movies['originals'] = np.where(movies['release_year'] == movies['year_added'], 'Yes', 'No')
# pie plot showing percentage of originals and others in movies
fig, ax = plt.subplots(figsize=(5,5),facecolor="#363336")
ax.patch.set_facecolor('#363336')
explode = (0, 0.1)
ax.pie(movies['originals'].value_counts(), explode=explode, autopct='%.2f%%', labels= ['Others', 'Originals'],
       shadow=True, startangle=90,textprops={'color':"black", 'fontsize': 20}, colors =['red','#F5E9F5'])

* Using the rule applied above, where a movie is treated as an original when its release year matches the year it was added to Netflix, approximately 30% of the movies are flagged as Netflix Originals. This rule is only a proxy: it cannot confirm that Netflix produced or co-produced a title, only that the title reached the platform in its release year.

* On the other hand, about 70% of the movies added to Netflix were previously released through different modes, such as theatrical releases, DVD releases, or releases on other streaming services or television networks. These movies were later acquired by Netflix and made available to their subscribers.

* This mix of content on Netflix, with a significant portion being movies released through traditional channels and a substantial percentage being Netflix Originals, allows the platform to offer a diverse range of movies to its audience. It enables them to enjoy both exclusive content created by Netflix and popular movies from various sources that have been curated and added to the Netflix library.

# **Making some Hypothesis**

**1.HYPOTHESIS TESTING**
*   H0: Movies rated for kids and older kids on Netflix have a duration of two hours or more.
*   H1: Movies rated for kids and older kids on Netflix have a duration of less than two hours.

In [None]:
movies

In [None]:
#making copy of df_clean_frame
df_hypothesis=df.copy()
#head of df_hypothesis
df_hypothesis.head()

In [None]:
#filtering movie from Type_of_show column
df_hypothesis = df_hypothesis[df_hypothesis["type"] == "Movie"]

In [None]:
#with respect to each ratings assigning it into group of categories
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

df_hypothesis['target_ages'] = df_hypothesis['rating'].replace(ratings_ages)
#let's see unique target ages
df_hypothesis['target_ages'].unique()

In [None]:
#Another category is target_ages (4 classes).
df_hypothesis['target_ages'] = pd.Categorical(df_hypothesis['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])
#extract the numeric part of the duration string and convert it from object to numeric
df_hypothesis['duration'] = pd.to_numeric(df_hypothesis['duration'].str.extract(r'(\d+)', expand=False))
#head of df_
df_hypothesis.head(3)

In [None]:
#group_by duration and target_ages
group_by_= df_hypothesis[['duration','target_ages']].groupby(by='target_ages')
#mean of group_by variable
group=group_by_.mean().reset_index()
group

In [None]:
#In A and B variable grouping values

A= group_by_.get_group('Kids')
B= group_by_.get_group('Older Kids')
#mean and std. calculation for the Kids and Older Kids groups
A['target_ages'] = A['target_ages'].astype('category').cat.codes.astype(float)
B['target_ages'] = B['target_ages'].astype('category').cat.codes.astype(float)

M1 = A.mean()
S1 = A.std()

M2 = B.mean()
S2 = B.std()

print('Mean for movies rated for Kids {} \n Mean for  movies rated for older kids {}'.format(M1,M2))
print('Std for  movies rated for Older Kids {} \n Std for  movies rated for kids {}'.format(S2,S1))

In [None]:
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

# Pooled variance: each group's variance is weighted by its own degrees of freedom
sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val['duration'])  # select by label; positional [0] on a labelled Series is deprecated

In [None]:
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

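Because the t-value lies outside the range between the two critical values, the null hypothesis is rejected at the 5% significance level.

As a result, we conclude that movies rated for kids and older kids are not at least two hours long. Note that the statistic computed here compares the mean durations of the Kids and Older Kids groups with each other rather than against the 120-minute threshold, so this conclusion should be read together with the group means shown earlier.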
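
A hypothesis about a fixed threshold such as 120 minutes is commonly checked with a one-sample, one-sided t-test. The following is a minimal sketch on synthetic durations, not this notebook's data; it assumes SciPy 1.6 or later, where `scipy.stats.ttest_1samp` accepts the `alternative` argument.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic durations in minutes; these are NOT taken from the Netflix data.
durations = rng.normal(loc=90, scale=15, size=200)

# H0: mean duration >= 120 minutes; H1: mean duration < 120 minutes.
result = stats.ttest_1samp(durations, popmean=120, alternative="less")

# Reject H0 when the one-sided p-value falls below the 5% level.
reject_h0 = result.pvalue < 0.05
print(result.statistic, result.pvalue, reject_h0)
```

Applied to the real data, `durations` would be the numeric duration column restricted to movies in the Kids and Older Kids groups.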

**2. HYPOTHESIS TESTING**
*   H0: There are no movies on Netflix with a duration of more than 90 minutes.
*   H1: Movies on Netflix have a duration of more than 90 minutes.

In [None]:
#making copy of df_clean_frame
df_hypothesis=df.copy()
#head of df_hypothesis
df_hypothesis.head()

In [None]:
# Extract the numeric part of the duration string and convert it to a number
df_hypothesis['duration'] = pd.to_numeric(df_hypothesis['duration'].str.extract(r'(\d+)', expand=False))

In [None]:
df_hypothesis['type'] = pd.Categorical(df_hypothesis['type'], categories=['Movie','TV Show'])
#from duration feature extractin string part and after extracting Changing the object type to numeric
#df_hypothesis['duration']= df_hypothesis['duration'].str.extract('(\d+)')
#df_hypothesis['duration'] = pd.to_numeric(df_hypothesis['duration'])
#head of df_
df_hypothesis.head(3)

In [None]:
#group_by duration and TYPE
group_by_= df_hypothesis[['duration','type']].groupby(by='type')
#mean of group_by variable
group=group_by_.mean().reset_index()
group

In [None]:
#In A and B variable grouping values
A= group_by_.get_group('Movie')
B= group_by_.get_group('TV Show')

A['type'] = A['type'].astype('category').cat.codes.astype(float)
B['type'] = B['type'].astype('category').cat.codes.astype(float)
#mean and std
M1 = A.mean()
S1 = A.std()

M2= B.mean()
S2 = B.std()

# One placeholder per value, in the same order (group A, then group B)
print('Mean  {} {}'.format(M1,M2))
print('Std  {} {}'.format(S1,S2))

In [None]:
#import stats
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

# Pooled variance: each group's variance is weighted by its own degrees of freedom
sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val['duration'])  # select by label; positional [0] on a labelled Series is deprecated

In [None]:
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

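Because the t-value lies outside the range between the two critical values, the null hypothesis is rejected.

As a result, we conclude that movies have durations of more than 90 minutes. Note, however, that this test compares movie durations (in minutes) with TV show durations (in number of seasons), so the two groups are measured in different units.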
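
A hand-computed pooled t-statistic is easy to get subtly wrong, so it is worth cross-checking against `scipy.stats.ttest_ind` with `equal_var=True`, which uses the same pooled-variance formulation. The sketch below uses small invented samples, not the notebook's groups.

```python
import numpy as np
from scipy import stats

# Invented sample values for two groups.
a = np.array([88.0, 95.0, 102.0, 110.0, 97.0])
b = np.array([70.0, 82.0, 75.0, 91.0])

n1, n2 = len(a), len(b)
s1, s2 = a.std(ddof=1), b.std(ddof=1)  # sample standard deviations

# Pooled variance: each group's variance is weighted by its own n - 1.
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Library result for the same equal-variance two-sample test.
t_scipy, p_value = stats.ttest_ind(a, b, equal_var=True)
print(t_manual, t_scipy, p_value)
```

If the two statistics disagree, the usual culprit is pairing a variance with the other group's sample size.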

# **Feature Engineering**

In [None]:
df.dtypes

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
nltk.download('punkt')

In [None]:
df.dtypes

In [None]:
# astype returns a new Series, so assign it back to keep the conversion
df['description'] = df['description'].astype(str)

In [None]:
# splitting each description into a list of words
df['description'] = df['description'].apply(lambda x: x.split(' '))

In [None]:
# joining the list of words back into a single string
df['description']= df['description'].apply(lambda x: " ".join(x))
# making all the words in text feature to lowercase
df['description']= df['description'].apply(lambda x: x.lower())

In [None]:
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
# applying above function on text feature
df['description']= df['description'].apply(remove_punctuation)

In [None]:
df['description'][0:10]

In [None]:
# using the nltk library to download the stopword list
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
# a set gives fast membership checks
sw = set(stopwords.words('english'))

# named remove_stopwords so it does not shadow the imported nltk `stopwords` corpus
def remove_stopwords(text):
    '''a function for removing the stopwords'''
    text = [word for word in text.split() if word not in sw]
    # joining the list of words with space separator
    return " ".join(text)
# applying above function on text feature
df['description'] = df['description'].apply(remove_stopwords)
# this is how a value in the text feature looks after removing stopwords
df['description'][0]

In [None]:
# importing TfidfVectorizer from sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
#Applying Tfidf Vectorizer
tfidfmodel = TfidfVectorizer(max_features=5000)
X_tfidf = tfidfmodel.fit_transform(df['description'])
X_tfidf.shape

In [None]:
# convert X into array form for clustering
X = X_tfidf.toarray()

# **Clustering Algorithms**

**1. K-Means**

Finding the optimal number of clusters using the elbow method

In [None]:
#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list= []  #Initializing the list for the values of WCSS

#Using a for loop over k from 1 to 29.
for i in range(1, 30):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(X)
    wcss_list.append(kmeans.inertia_)
plt.plot(range(1, 30), wcss_list)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters(k)')
plt.ylabel('WCSS')
plt.show()

In [None]:
from sklearn.metrics import silhouette_score
# silhouette score for each number of clusters
sill = []
for i in range(2,30):
    model = KMeans(n_clusters=i,init ='k-means++',random_state=51)
    # fit the model and get the cluster labels in one step
    y1 = model.fit_predict(X)
    score = silhouette_score(X,y1)
    sill.append(score)
    print('cluster: %d \t Silhouette: %0.4f'%(i,score))

In [None]:
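# Plotting the silhouette scores against the number of clusters
# (pass the x values explicitly so the points line up with the tick labels)
plt.plot(range(2,30), sill, 'bs--')
plt.xticks(list(range(2,30)),list(range(2,30)))
plt.grid()
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()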

* Based on the **elbow method and silhouette score analysis**, 26 clusters are suggested for this dataset. The elbow method helps choose the number of clusters by evaluating the inertia (the within-cluster sum of squares), while the silhouette score measures how cohesive and well separated the clusters are. Both methods indicate that 26 clusters are appropriate for this dataset.
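
The way the two criteria are read can be illustrated on a small synthetic example. This is only a sketch: the blob centres, sample size, and the names `X_demo`, `inertias`, `silhouettes`, and `best_k` are assumptions for illustration, and the result says nothing about the Netflix data itself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumed for illustration)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

inertias = {}
silhouettes = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X_demo)
    # WCSS tends to fall as k grows; the elbow is where the drop flattens
    inertias[k] = km.inertia_
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    silhouettes[k] = silhouette_score(X_demo, labels)

best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

Because inertia keeps shrinking as clusters are added, it is usually read for the bend in the curve, while the silhouette score can be compared directly across values of k.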

In [None]:
#training the K-means model on a dataset
kmeans = KMeans(n_clusters= 26, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(X)

**Evaluation**

In [None]:
#Predict the clusters and evaluate the silhouette score

score = silhouette_score(X, y_predict)
print("Silhouette score is {}".format(score))

In [None]:
#davies_bouldin_score of our clusters
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X, y_predict)

In [None]:
#Adding a separate column for the cluster
df["cluster"] = y_predict

In [None]:
df['cluster'].value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(15,6))
sns.countplot(x='cluster', hue='type',lw=5, data=df, ax=ax)

* **Cluster 20** has the highest number of data points

In [None]:
#SCATTER PLOT FOR CLUSTERS
fig = px.scatter(df, y="description", x="cluster",color="cluster")
fig.update_traces(marker_size=100)
fig.show()

**Dendrogram**

In [None]:
import scipy.cluster.hierarchy as shc
plt.figure(figsize =(8, 8))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram((shc.linkage(X, method ='ward')))

**2. Agglomerative Clustering**

In [None]:
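# Fitting Agglomerative Clustering on our feature matrix
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the old `affinity` parameter was removed in recent scikit-learn releases)
aggh = AgglomerativeClustering(n_clusters=6, linkage='ward')
# Fit the model and get the cluster label of each row in one step
y_hc = aggh.fit_predict(X)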

In [None]:
df_hierarchical =df.copy()
#creating a column where each row is assigned to their separate cluster
df_hierarchical['cluster'] = aggh.labels_
df_hierarchical.head()

**Evaluation**

In [None]:
#Silhouette Coefficient
print("Silhouette Coefficient: %0.3f"%silhouette_score(X,y_hc, metric='euclidean'))

In [None]:
#davies_bouldin_score of our clusters
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X, y_hc)
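
The two evaluation metrics used above can be collected side by side for both models. The sketch below runs on synthetic blobs rather than the TF-IDF matrix; the data, the cluster count of 4, and the names `X_cmp`, `models`, and `results` are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data with four well-separated groups (assumed for illustration)
X_cmp, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10], [0, 20]],
    cluster_std=1.0,
    random_state=0,
)

models = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_cmp)
    results[name] = {
        "silhouette": silhouette_score(X_cmp, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X_cmp, labels),  # lower is better
    }

for name, scores in results.items():
    print(name, scores)
```

For a like-for-like comparison, both models should be evaluated with the same number of clusters; the notebook uses 26 for K-Means and 6 for agglomerative clustering, so its scores are not directly comparable in that sense.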

# **Conclusion**

*   Based on the elbow method and silhouette scores, 26 clusters were formed. K-Means identifies the clusters better than hierarchical clustering, as the evaluation metrics also indicate. In the K-Means result, cluster 0 has the highest number of data points, and the remaining points are evenly distributed across the other clusters.
*   Netflix has 5372 movies and 2398 TV shows, so there are more movies on Netflix than TV shows.

*   TV-MA, the adult rating, is the most common rating for TV shows.

*   Highest Number of movies released in 2017 and 2018
Highest number of movies released in 2020
The number of movies on Netflix is growing significantly faster than the number of TV shows.
There was a huge increase in the number of movies and television episodes after 2015.
There is a significant drop in the number of movies and television episodes produced after 2020.
Netflix appears to have focused more on increasing its movie content than its TV shows: movies have increased much more dramatically than TV shows.

*   Most content is added to Netflix from October to January.

*   Documentaries are the top genre on Netflix, followed by stand-up comedy and by dramas and international movies.
*   Kids' TV is the top TV show genre on Netflix.


*   Most movies have a duration between 50 and 150 minutes.


*   The highest number of TV shows consist of a single season.


*   Movies rated NC-17 have the longest average duration, while movies rated TV-Y have the shortest average runtime.


*   The United States has the most content on Netflix, followed by India.

*   India has the highest number of movies on Netflix.
*   30% of the movies were released on Netflix; the other 70% of the movies added to Netflix were released earlier through other channels.
