# **Project Name**    -



##### **Project Type**    - Netflix Movies and TV Shows: Classification & Clustering
##### **Contribution**    - Individual
##### **Project created by -** ASTHA

# **Project Summary -**

The goal of this project is to perform Exploratory Data Analysis (EDA) and build Machine Learning models on the movies and TV shows available on Netflix. The analysis presents insights, trends, and patterns related to content type, title, country of origin, rating, release year, and duration.

**Key steps:**


**Data Collection and Cleaning:** The project uses a dataset that contains information about the titles and ratings of the movies and TV shows on Netflix.
The data was cleaned by handling missing values, checking for duplicates, and ensuring data consistency.

**EDA:**
* Univariate, bivariate, and multivariate analysis
* Distribution plots for categorical and numerical variables

**Feature Engineering:**
* Extracting duration in minutes or seasons
* Encoding categorical variables
* Creating genre-based binary features
* Extracting number of actors, title length, and description length
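
The feature engineering steps listed above can be sketched roughly as follows. This is a minimal illustration on a hypothetical two-row sample, not the notebook's actual implementation; the names `sample`, `duration_num`, `num_actors`, and the `genre_` prefix are assumptions made for the example.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the dataset's columns.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "title": ["Example Film", "Example Series"],
    "cast": ["Actor A, Actor B, Actor C", None],
    "duration": ["93 min", "2 Seasons"],
    "listed_in": ["Dramas, International Movies", "TV Dramas"],
    "description": ["A short synopsis.", "Another short synopsis."],
})

# Leading integer of `duration`: minutes for movies, seasons for TV shows.
sample["duration_num"] = (
    sample["duration"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Number of listed actors; a missing cast counts as zero.
sample["num_actors"] = sample["cast"].fillna("").apply(
    lambda s: len([name for name in s.split(", ") if name])
)

# Character lengths of the title and the description.
sample["title_length"] = sample["title"].str.len()
sample["description_length"] = sample["description"].str.len()

# One binary column per genre found in the comma-separated `listed_in` field.
genre_flags = sample["listed_in"].str.get_dummies(sep=", ").add_prefix("genre_")
features = pd.concat([sample, genre_flags], axis=1)

print(features[["duration_num", "num_actors", "title_length"]])
```

Keeping minutes and seasons in a single numeric column is only meaningful together with the `type` column, since the two units are not comparable.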

**Modeling:**
* Classification: Random Forest, Logistic Regression, SVM, Decision Tree
* Clustering: KMeans on scaled numerical features
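
As a rough sketch of the clustering step, the example below runs KMeans on scaled numerical features. The data is synthetic (two made-up groups standing in for columns such as release year and duration), and the choice of two clusters is an assumption for the illustration, not a result from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features; not the real Netflix data.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[2000, 90], scale=[3, 5], size=(50, 2))
group_b = rng.normal(loc=[2018, 150], scale=[3, 5], size=(50, 2))
X = np.vstack([group_a, group_b])

# Scale first so the feature with the larger range does not dominate distances.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is assumed here; in practice it would be tuned.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Silhouette score lies in [-1, 1]; higher means better-separated clusters.
score = silhouette_score(X_scaled, labels)
print(f"silhouette score: {score:.3f}")
```

On real data the number of clusters is usually chosen by comparing inertia (elbow method) or silhouette scores over several candidate values.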

**Evaluation:**
* Accuracy, classification report, confusion matrix for classification
* Visualizing clusters for unsupervised learning
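
The classification metrics named above can be computed as in the following sketch. It uses synthetic data with a single made-up duration feature and a Logistic Regression model; the labels (1 = Movie, 0 = TV Show) and all values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: movie runtimes in minutes, TV show lengths in seasons.
rng = np.random.default_rng(0)
movie_duration = rng.normal(100, 15, size=200)
show_duration = rng.integers(1, 6, size=100).astype(float)
X = np.concatenate([movie_duration, show_duration]).reshape(-1, 1)
y = np.array([1] * 200 + [0] * 100)  # 1 = Movie, 0 = TV Show

# Stratify so both classes keep their proportions in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted
report = classification_report(y_test, y_pred, target_names=["TV Show", "Movie"])

print(f"accuracy: {accuracy:.3f}")
print(cm)
print(report)
```

Because movie durations are recorded in minutes and TV show durations in seasons, a raw duration number separates the two classes almost trivially; a high accuracy from such a feature says little about the model itself.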


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The goal of this project is to analyze Netflix's content and build machine learning models to:
1. Classify whether a content item is a Movie or TV Show based on its metadata.
2. Cluster content into groups with similar characteristics using unsupervised learning.

This involves:

* Exploratory Data Analysis (EDA)
* Feature Engineering
* Supervised Learning (Classification)
* Unsupervised Learning (Clustering)

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from IPython.display import display

import scipy.stats as stats

# scikit-learn: model selection, models, preprocessing, and metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, silhouette_score

# scikit-learn: pipelines, imputation, and clustering
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans, AgglomerativeClustering

### Dataset Loading

In [None]:
# Load Dataset
my_data = pd.read_csv('NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
my_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
my_data.shape

### Dataset Information

In [None]:
# Dataset Info
my_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
my_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
my_data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(my_data.isnull(), cbar=False)

### What did you know about your dataset?

This dataset contains information about the TV shows and movies available on Netflix, including the country, rating, duration, genre, and description of each title. It has 7787 rows and 12 columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
my_data.columns

In [None]:
# Dataset Describe
my_data.describe()

### Variables Description

The dataset contains information about Movies and TV Shows available on Netflix.

It includes the following columns:
- **show_id**: Unique identifier for each title in the dataset.
- **type**: Content category (Movie or TV Show).
- **title**: Name of the movie or TV show.
- **director**: Name(s) of the director(s).
- **cast**: Names of main actors/actresses.
- **country**: Country of origin.
- **date_added**: Date the title was added to Netflix.
- **release_year**: Year of original release.
- **rating**: Content rating such as PG-13, TV-MA, R.
- **duration**: Runtime in minutes for movies; number of seasons for TV shows.
- **listed_in**: Comma-separated list of genres or categories.
- **description**: Short synopsis or description of the title.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
my_data.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Show columns ranked by number of missing values
display(my_data.isnull().sum().sort_values(ascending=False))

# Replace missing categorical values with 'Unknown'
# (assigning the result back avoids in-place fillna on a column,
# which newer pandas versions warn about and may not apply)
for col in ['director', 'cast', 'country']:
    my_data[col] = my_data[col].fillna('Unknown')

# Replace missing date_added with the mode
my_data['date_added'] = my_data['date_added'].fillna(my_data['date_added'].mode()[0])

# Replace missing rating with 'NR'
my_data['rating'] = my_data['rating'].fillna('NR')

In [None]:
# Check any missing values after performing Data Wrangling
my_data.isnull().sum()

### What all manipulations have you done and insights you found?

The data has now been cleaned and is ready for further analysis.
1. **Handling missing values:**
*   Null values of 'director', 'cast', and 'country' were replaced with 'Unknown'.
*   Null values of 'date_added' were replaced with the mode of that column.
*   Null values of 'rating' were replaced with 'NR'.
2. **Handling duplicate values:**
*   The dataset was checked for duplicate rows with `my_data.duplicated().sum()`, and the count was zero, so no rows were removed. Each column has a different number of unique values. Additionally, the 'date_added' column has been parsed to extract the day, month, and year.
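
The wrangling cell shown above does not include the date parsing step mentioned in the last point. A minimal sketch of how it might be done is given below, assuming `date_added` holds strings in a "Month day, year" style; the sample values and the new column names (`day_added`, `month_added`, `year_added`) are hypothetical.

```python
import pandas as pd

# Hypothetical sample in the "Month day, year" style assumed for `date_added`.
dates = pd.DataFrame({"date_added": ["August 14, 2020", " January 1, 2019", None]})

# Strip stray whitespace, then parse; unparseable or missing values become NaT.
parsed = pd.to_datetime(
    dates["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Split the parsed date into separate numeric columns.
dates["day_added"] = parsed.dt.day
dates["month_added"] = parsed.dt.month
dates["year_added"] = parsed.dt.year

print(dates)
```

Using `errors="coerce"` keeps the cell from failing on malformed dates, at the cost of silently producing missing values that should be counted afterwards.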






## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

In [None]:
my_data

In [None]:
# Separate movies and tv shows
movies = my_data[my_data['type'] == 'Movie'].copy()
tv_shows = my_data[my_data['type'] == 'TV Show'].copy()

# Extract duration in minutes for movies (raw string avoids an invalid-escape warning)
movies['duration_cleaned'] = movies['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Extract number of seasons for TV shows
tv_shows['duration_cleaned'] = tv_shows['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Add a new column to identify type
movies['content_type'] = 'Movie (min)'
tv_shows['content_type'] = 'TV Show (seasons)'

**Univariate Analysis:**

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Pie charts take a list of `colors` rather than a single `color`, so the default palette is used
my_data['type'].value_counts().plot(kind='pie', title='Distribution of Content Type', autopct='%1.1f%%')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a pie chart because it is effective for showing how a categorical variable with only a few categories splits into parts of a whole.
In this case, the chart shows the share of titles available on Netflix in each content type, "TV Show" and "Movie".

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can clearly see that Movies constitute the majority, accounting for 69.1% of the titles available on Netflix, while TV Shows make up the remaining 30.9%.
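
Shares like these can be read directly from the data rather than from the chart labels. The sketch below shows one way to do it on a hypothetical four-title sample; in the notebook the same call would be applied to `my_data['type']`.

```python
import pandas as pd

# Hypothetical sample; the notebook would use my_data['type'] instead.
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of raw counts.
share = types.value_counts(normalize=True).mul(100).round(1)
print(share)
```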

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

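The chart shows that Movies (69.1%) outnumber TV Shows (30.9%) in the catalog by more than two to one. Because the dataset describes catalog composition rather than viewing behaviour, this split alone does not show which format audiences prefer. Combined with engagement data, it can help Netflix judge whether the balance between movies and TV shows matches demand; a catalog that is mismatched with what subscribers want to watch could contribute to negative growth.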

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Top 10 ratings for movies
plt.figure(figsize=(10, 6))
sns.boxplot(x='rating', y='duration_cleaned', data=movies[movies['rating'].isin(movies['rating'].value_counts().head(10).index)])
plt.title('Movie Duration Distribution by Top 10 Content Ratings')
plt.xlabel('Rating')
plt.ylabel('Duration (min)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot has been chosen in order to visualize the distribution of movie durations across different content ratings because it effectively displays the median, quartiles, and potential outliers for each rating category. This allows for a clear comparison of the spread and central tendency of movie durations among the top content ratings.
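
The quantities a box plot draws can also be computed numerically, which helps when comparing ratings precisely. The sketch below uses hypothetical durations and the common 1.5 × IQR whisker rule; the helper `box_stats` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical movie durations (minutes) for two ratings.
movies_sample = pd.DataFrame({
    "rating": ["TV-MA"] * 6 + ["TV-Y"] * 5,
    "duration_cleaned": [90, 95, 100, 105, 110, 200, 40, 45, 50, 55, 60],
})

def box_stats(durations: pd.Series) -> pd.Series:
    """Return quartiles and the number of points beyond the 1.5 * IQR fences."""
    q1, median, q3 = durations.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    is_outlier = (durations < q1 - 1.5 * iqr) | (durations > q3 + 1.5 * iqr)
    return pd.Series(
        {"q1": q1, "median": median, "q3": q3, "n_outliers": int(is_outlier.sum())}
    )

# One row of statistics per rating.
summary = pd.DataFrame({
    rating: box_stats(group)
    for rating, group in movies_sample.groupby("rating")["duration_cleaned"]
}).T

print(summary)
```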

##### 2. What is/are the insight(s) found from the chart?

From the box plot showing movie duration distribution by the top 10 content ratings, we can observe several insights:

* Duration Range: Ratings like TV-MA and R tend to have a wider range of movie durations, including some longer films (outliers).
* Median Duration: The median duration varies across ratings. For instance, TV-PG and PG-13 movies seem to have a slightly higher median duration compared to TV-Y or G rated movies.
* Outliers: There are outliers (movies with significantly longer or shorter durations than typical for their rating) present in several rating categories.
* Children's Content Duration: Ratings aimed at younger audiences (TV-Y, TV-Y7) generally have shorter movie durations, with less variability.
* NR (Not Rated): The NR category shows a relatively wide range of durations, which is expected as it includes a mix of content that hasn't gone through a formal rating process.

These insights highlight how content ratings can be associated with typical movie lengths on Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact for Netflix. Understanding the typical duration distribution of movies across different content ratings can inform several business decisions:

* Content Acquisition: Netflix can use this information to guide their content acquisition strategy. For example, if data shows that audiences for TV-MA content prefer a wider range of movie lengths, Netflix might prioritize acquiring more movies with varied durations in this category. Conversely, if children's content audiences prefer shorter, consistent durations, this can inform acquisition for TV-Y and TV-Y7 rated movies.
* Content Recommendation: Knowing the typical durations associated with ratings can potentially be used to refine recommendation algorithms, suggesting movies of lengths that align with user preferences based on the types of content they watch.
* Content Production: For original content production, these insights can help in deciding the optimal runtime for movies targeting specific audience segments defined by content ratings.

There aren't direct insights from this chart that inherently lead to negative growth. However, a failure to understand and cater to audience preferences regarding content duration within specific rating categories could potentially lead to negative growth.

For instance, if Netflix primarily acquires very long movies for a rating category where audiences prefer shorter films, it might lead to lower engagement and viewership within that segment.
This chart helps in avoiding such mismatches by providing data on existing patterns.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Top 10 ratings for TV shows
plt.figure(figsize=(10, 6))
sns.boxplot(x='rating', y='duration_cleaned', data=tv_shows[tv_shows['rating'].isin(tv_shows['rating'].value_counts().head(10).index)])
plt.title('TV Show Seasons Distribution by Top 10 Content Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Seasons')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen to visualize the distribution of TV show seasons across different content ratings because it effectively displays the median, quartiles, and potential outliers for each rating category. This allows for a clear comparison of the spread and central tendency of the number of seasons among the top content ratings.


##### 2. What is/are the insight(s) found from the chart?

From the box plot showing TV show seasons distribution by the top 10 content ratings, we can observe several insights:

* Dominance of Short Series: Across most popular ratings, the majority of TV shows have a relatively low number of seasons, often centered around 1 or 2 seasons. The box plots are quite narrow at the lower end, indicating a concentration of short series.
* Outliers in Mature Ratings: While most shows are short, ratings like TV-MA and TV-14 show a wider spread and the presence of outliers with a significantly higher number of seasons (e.g., shows with 10+ seasons).
* Children's Content Length: Ratings aimed at younger audiences (TV-Y, TV-Y7, TV-G) also primarily consist of shows with a small number of seasons, with fewer extreme outliers compared to more mature ratings.
* Median Seasons: The median number of seasons is generally low (around 1 or 2) across most ratings, reinforcing the prevalence of shorter TV series on the platform.
* NR (Not Rated): The 'NR' category, although less populated, also shows a distribution primarily centered around a small number of seasons.

These insights suggest that Netflix's TV show catalog is heavily weighted towards shorter series, although longer-running shows do exist, particularly in ratings targeting older audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact for Netflix. Understanding the typical distribution of TV show seasons across different content ratings can inform several business decisions:

* Content Acquisition: Netflix can use this information to guide their content acquisition strategy. If data shows that audiences for most ratings prefer shorter series, Netflix might prioritize acquiring more TV shows with 1-2 seasons. Conversely, if there's a demand for longer series within specific mature ratings, this can inform acquisition decisions for those segments.
* Content Production: For original content production, these insights are crucial for deciding the optimal number of seasons for new TV shows targeting specific audience segments defined by content ratings.
* Content Recommendation: Knowing the typical number of seasons associated with ratings can potentially be used to refine recommendation algorithms, suggesting TV shows with a number of seasons that align with user preferences based on the types of content they watch.

There aren't direct insights from this chart that inherently lead to negative growth. However, a failure to understand and cater to audience preferences regarding the number of seasons within specific rating categories could potentially lead to negative growth.

For example, if Netflix consistently produces or acquires TV shows with many seasons for ratings where audiences prefer shorter series, it might lead to lower completion rates and reduced engagement within that segment. This chart provides data to help avoid such mismatches.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
if 'release_year' in my_data.columns:
    my_data['release_year'].value_counts().sort_index().plot(kind='line', marker='o', title='Titles released per year', color='red')
    plt.xlabel('Year')
    plt.ylabel('Count')
    plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen to visualize the number of titles released per year because it is effective in showing trends over time. The x-axis represents the release year, and the y-axis represents the count of titles, making it easy to observe how the volume of content released has changed annually.


##### 2. What is/are the insight(s) found from the chart?

From the line chart showing titles per release year, we can observe a strong increasing trend in the number of movies and TV shows over time. The volume increases gradually in the early years, accelerates significantly in the 2000s, and peaks around 2019-2020. There appears to be a drop in the most recent year (2021), which may simply reflect incomplete data for that year. Note that `release_year` is the year of original release, not the year a title was added to Netflix, so the chart shows that the catalog is dominated by recent releases. This suggests, but does not by itself prove, substantial growth in content production or acquisition by Netflix over the years covered.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can help create a positive business impact. The strong increasing trend in content releases over the years suggests Netflix's growth in either producing original content or acquiring licensed content. This expansion of the content library is crucial for attracting and retaining subscribers. A larger and more diverse catalog can cater to a wider range of tastes, increasing the platform's value proposition.

The slight dip in releases in 2021 could potentially lead to negative growth if it signifies a slowing down in content availability compared to previous peak years, especially if subscriber growth is tied to the freshness and volume of new content. However, without more context (e.g., why the drop occurred, such as production impacts from external factors like a global pandemic, or a shift in strategy towards quality over quantity), it's difficult to definitively label it as leading to negative growth. It's an insight that warrants further investigation.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
if 'rating' in my_data.columns:
    my_data['rating'].value_counts().head(15).plot(kind='bar', title='Top Ratings')
    plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen to visualize the top content ratings because it is effective for displaying the frequency or count of distinct categories. The height of each bar clearly represents the number of titles associated with each rating, making it easy to compare the popularity or prevalence of different content ratings in the dataset.



##### 2. What is/are the insight(s) found from the chart?

From the bar chart showing the top content ratings, we can see the most frequent ratings in the dataset. The ratings TV-MA, TV-14, and R are significantly more prevalent than others. This indicates that a large portion of the content available on Netflix is geared towards mature audiences. Other ratings like PG-13, TV-PG, TV-Y, and TV-Y7 are also present but in smaller quantities, suggesting a more limited offering for younger viewers compared to adult content.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. Understanding the distribution of content ratings is crucial for Netflix's content strategy:

* Content Acquisition and Production: Knowing which ratings are most prevalent (like TV-MA, TV-14, R) confirms that Netflix has a strong focus on content for mature audiences. If this aligns with their target demographic and subscriber base, they can continue to acquire and produce content within these ratings. If they aim to expand their audience, they might identify underserved rating categories (like G or UR) and strategically invest in content for those segments.

* Marketing and Targeting: This distribution helps in understanding the primary audience being served. Marketing efforts can be tailored to appeal to viewers interested in content with prevalent ratings.

Insights that could lead to negative growth would arise if the current content rating distribution does not align with subscriber demographics or growth goals. For example, if Netflix aims to significantly increase its family-friendly subscriber base but the catalog is heavily dominated by mature ratings, this mismatch could hinder growth in that segment.

The chart itself doesn't indicate negative growth, but it provides the data to assess if the content offering aligns with business objectives for different audience segments.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Get top 10 countries for TV shows and movies
top_tv_countries = tv_shows['country'].value_counts().head(10)
top_movie_countries = movies['country'].value_counts().head(10)

# Combine the dataframes for plotting
combined_countries = pd.concat([top_tv_countries, top_movie_countries], axis=1)
combined_countries.columns = ['TV Shows', 'Movies']

# Plot grouped bar chart
combined_countries.plot(kind='bar', figsize=(12, 6))
plt.title('Top 10 Countries with Most Content (TV Shows vs Movies)')
plt.xlabel('Country')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen to visualize the top 10 countries with the most content, differentiating between Movies and TV Shows. This chart type is effective for comparing the counts of two different categories (Movies and TV Shows) across multiple distinct groups (Countries). It allows for easy comparison of the content volume within each country and across content types.
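
The chart code above takes the top 10 countries for each type separately and then joins them, so the combined table can contain more than 10 countries, with gaps where a country is in only one list. As an alternative sketch, the counts for both types can be built in one table and ranked together. The example assumes that `country` may hold several comma-separated countries per title; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample; a title can list several co-producing countries.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "country": ["United States, India", "India", "South Korea", "United States"],
})

# One row per (title, country) pair.
exploded = sample.assign(country=sample["country"].str.split(", ")).explode("country")

# Count titles per country and content type in a single table.
counts = pd.crosstab(exploded["country"], exploded["type"])

# Rank countries by total titles and keep one shared top-N list.
order = counts.sum(axis=1).sort_values(ascending=False).index
top = counts.loc[order].head(10)

print(top)
```

Calling `top.plot(kind='bar')` on this table would give a grouped bar chart with a value for both content types in every country.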


##### 2. What is/are the insight(s) found from the chart?

From the grouped bar chart showing the top 10 countries with the most content, we can observe several insights:

* United States Dominance: The United States contributes the highest volume of content for both Movies and TV Shows, significantly more than any other country listed.
* Content Type Specialization: Some countries show a preference for one content type over the other. For example, India has a very high number of Movies but a relatively low number of TV Shows among the top countries. South Korea and Japan, while having fewer titles overall than the US or India, show a stronger presence in TV Shows compared to Movies in this top list.
* Global Reach: While the US leads, the presence of multiple countries in the top 10 indicates Netflix's global content acquisition and production strategy.
* "Unknown" Category: The "Unknown" category is significant for both content types, highlighting data completeness issues or content where the country of origin is not specified.

These insights reveal the geographical distribution of content on Netflix and the relative contributions of different countries in terms of movie and TV show volumes.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can help create a positive business impact for Netflix. Understanding the geographical distribution of content is crucial for several reasons:

* Content Strategy: Identifying the countries with the most content helps Netflix understand its existing strengths and potential gaps in different markets. They can decide to invest more in content from countries with high viewership or explore acquiring more content from underrepresented regions with growing subscriber bases. The insights about content type specialization can inform targeted acquisition strategies (e.g., acquiring more South Korean TV dramas or Indian movies).
* Localization: Knowing which countries are major content sources is essential for localization efforts (subtitles, dubbing) to make content accessible to a wider global audience.
* Marketing and Partnerships: This data can inform marketing campaigns targeting specific regions or lead to strategic partnerships with production houses in key countries.

Insights that could lead to negative growth are related to the "Unknown" country category and potential imbalances in content distribution. A large volume of content with unknown origin makes it harder to understand geographical content trends and target audiences effectively.

Also, if there's a significant disconnect between where subscribers are located and where content is sourced (e.g., a large subscriber base in a country with very little local content), it could lead to subscriber churn. However, the chart itself primarily provides data for strategic decision-making, which should lead to positive growth if acted upon effectively.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Split the comma-separated genre lists and count how often each genre appears
genres = my_data['listed_in'].dropna().str.split(', ')
genre_flat = [genre for sublist in genres for genre in sublist]
genre_counts = Counter(genre_flat)

pd.Series(genre_counts).sort_values(ascending=False).head(10).plot(kind='bar', title='Top 10 Genres', color='brown')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen to visualize the top 10 genres because it is effective for displaying the frequency or count of distinct categories. The height of each bar clearly represents how many titles belong to each genre, making it easy to compare the popularity or prevalence of different genres in the dataset.

##### 2. What is/are the insight(s) found from the chart?

From the bar chart showing the top 10 genres, we can see the most frequent genres in the dataset. "International Movies", "Dramas", and "Comedies" are the most prevalent genres. This indicates that Netflix has a large collection of content within these categories. Other popular genres include "International TV Shows", "Documentaries", and "Action & Adventure". The distribution shows that Netflix offers a diverse range of content, but with a significant focus on international films, dramas, and comedies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can help create a positive business impact. Understanding the most popular genres is crucial for Netflix's content strategy:

* Content Acquisition and Production: Knowing which genres are most prevalent helps Netflix understand its existing strengths and audience preferences. They can continue to invest in acquiring and producing content within these popular genres. They can also identify popular emerging genres or underserved niche genres to expand their catalog strategically.
* Marketing and Targeting: This distribution helps in understanding the primary content types being consumed. Marketing efforts can be tailored to highlight the extensive library within popular genres to attract subscribers.
* Audience Engagement: A strong offering in popular genres can lead to higher audience engagement and retention.

Insights that could lead to negative growth would arise if Netflix were to significantly underinvest in popular genres or overinvest in unpopular genres, leading to subscriber dissatisfaction or churn.

However, the chart itself primarily provides data on current content distribution, which if used wisely, should lead to positive growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
tv_shows['duration_cleaned'].hist(bins=20)
plt.title('Distribution of TV Show Seasons')
plt.xlabel('Number of Seasons')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen to visualize the distribution of TV show seasons because it is effective in showing the frequency of different numbers of seasons. It allows us to quickly see the most common number of seasons and how the count of shows decreases as the number of seasons increases.

##### 2. What is/are the insight(s) found from the chart?

From the histogram showing the distribution of TV show seasons, the most prominent insight is that a very large proportion of TV shows on Netflix have a small number of seasons, particularly 1 or 2 seasons. The frequency drops off sharply for shows with more seasons. This indicates that Netflix's TV show catalog is heavily dominated by shorter series.
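
Because the number of seasons is a small whole number, an exact tally can complement the histogram. The sketch below uses hypothetical season counts to show how the share of short series could be quantified; in the notebook the input would be `tv_shows['duration_cleaned']`.

```python
import pandas as pd

# Hypothetical season counts; the notebook would use tv_shows['duration_cleaned'].
seasons = pd.Series([1, 1, 1, 2, 1, 3, 1, 2, 9, 1])

# Exact number of shows for each season count.
season_counts = seasons.value_counts().sort_index()

# Fraction of shows with one or two seasons.
share_short = (seasons <= 2).mean()

print(season_counts)
print(f"share with 1-2 seasons: {share_short:.0%}")
```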


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can help create a positive business impact. Understanding that most TV shows are short (1-2 seasons) is valuable for Netflix's content strategy:

* Content Acquisition and Production: This insight reinforces that there's a strong presence of and likely demand for shorter TV series. Netflix can continue to acquire and produce shows with a limited number of seasons. This can be cost-effective and cater to viewers who prefer less commitment.
* Subscriber Engagement: Shorter series might be easier for subscribers to start and finish, potentially leading to higher completion rates and satisfaction for some audience segments.

An insight that could lead to negative growth would be if there is a significant segment of subscribers who prefer longer-running series, and Netflix's catalog is heavily skewed towards shorter ones.

While the current distribution shows a focus on short series, a failure to also provide a sufficient number of longer shows for viewers who desire them could lead to dissatisfaction and churn among that segment.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
movies['duration_cleaned'].hist(bins=20)
plt.title('Distribution of Movie Duration')
plt.xlabel('Number of Minutes')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen to visualize the distribution of movie durations because it is effective in showing the frequency of different movie lengths. It allows us to see the most common movie durations and the overall shape of the distribution.


##### 2. What is/are the insight(s) found from the chart?


From the histogram showing the distribution of movie duration, we can see that the majority of movies on Netflix have durations between approximately 80 and 120 minutes, with a peak around the 90-110 minute mark. There are fewer movies that are very short (under 60 minutes) or very long (over 150 minutes). The distribution appears somewhat skewed to the right, indicating a tail of longer movies. This suggests that Netflix's movie catalog is primarily composed of standard-length feature films.
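
The right skew described above can be checked numerically instead of by eye. The sketch below uses hypothetical durations; a positive sample skewness and a mean above the median are both consistent with a longer right tail.

```python
import pandas as pd

# Hypothetical movie durations in minutes, with one very long film.
durations = pd.Series([60, 85, 90, 92, 95, 98, 100, 105, 110, 180])

skewness = durations.skew()  # sample skewness; > 0 indicates a right tail
mean = durations.mean()
median = durations.median()

print(f"skewness: {skewness:.2f}, mean: {mean:.1f}, median: {median:.1f}")
```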


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. Understanding the distribution of movie durations is valuable for Netflix's content strategy:

* Content Acquisition and Production: Knowing the most common movie lengths can inform decisions about acquiring or producing movies that align with audience preferences for duration. If the majority of viewers prefer movies around 90-120 minutes, Netflix can prioritize content in this range.
* Content Recommendation: The typical duration can be a factor in recommendation algorithms, suggesting movies of lengths that a user is likely to finish based on their viewing history.
* User Experience: Understanding the duration distribution can help in designing the user interface, for example, by categorizing movies by length or providing filters.

An insight that could lead to negative growth would be if there is a significant demand for very short or very long movies that are currently underrepresented in the catalog, leading to unmet audience needs and potential churn among viewers who prefer those durations.

However, the chart primarily shows the current state, which can be used to make strategic decisions for positive growth.

**Bivariate Analysis:**

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Type vs Rating
pd.crosstab(my_data['type'], my_data['rating']).plot(kind='bar', stacked=True, colormap='viridis')
plt.title('Content Type vs Rating')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart was chosen to visualize the relationship between Content Type (Movie or TV Show) and Rating. This chart is effective for showing the distribution of a categorical variable (Rating) within each category of another categorical variable (Content Type). It allows for a clear comparison of how the prevalence of different ratings varies between movies and TV shows.



##### 2. What is/are the insight(s) found from the chart?


From the stacked bar chart showing Content Type vs Rating, we can observe several insights:

* Rating Distribution Differences: The distribution of ratings is significantly different between Movies and TV Shows. Movies have a wider range of ratings with substantial counts in categories like R, PG-13, and NR, in addition to TV-MA and TV-14.
* TV Show Rating Concentration: TV Shows are heavily concentrated in TV-MA and TV-14 ratings, with significant counts also in TV-PG, TV-Y7, and TV-Y. Other ratings are less common for TV shows.
* Mature Content Prevalence: For both Movies and TV Shows, mature ratings (TV-MA, TV-14, R for movies) represent a large portion of the content.
* Children's Content: While present, children's ratings (TV-Y, TV-Y7, TV-G) appear to constitute a smaller proportion of the overall content compared to mature ratings, and their distribution differs between movies and TV shows.

These insights highlight the distinct rating profiles of Movies and TV Shows on Netflix.

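Because Movies outnumber TV Shows in the catalog, raw counts in a stacked bar can make the two rating mixes hard to compare. A minimal sketch, using a small made-up DataFrame (the rows are illustrative and not taken from the Netflix data), shows how row-normalizing the crosstab turns counts into within-type shares:

```python
import pandas as pd

# Hypothetical toy data standing in for my_data[['type', 'rating']]
toy = pd.DataFrame({
    "type":   ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "rating": ["R", "TV-MA", "TV-MA", "PG-13", "TV-MA", "TV-14"],
})

# Raw counts, as in the stacked bar chart
counts = pd.crosstab(toy["type"], toy["rating"])

# Share of each rating within its content type (each row sums to 1)
shares = pd.crosstab(toy["type"], toy["rating"], normalize="index")

print(counts)
print(shares.round(2))
```

Plotting `shares` instead of `counts` with `kind='bar', stacked=True` would give bars of equal height, so differences in rating mix are easier to read.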

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can help create a positive business impact. Understanding the distinct rating distributions for movies and TV shows is crucial for Netflix's content strategy:

* Content Acquisition and Production: Knowing which ratings are dominant for each content type informs targeted acquisition and production efforts. For example, if Netflix aims to grow its TV show catalog, the chart indicates that focusing on TV-MA, TV-14, TV-PG, TV-Y7, and TV-Y series aligns with the current catalog composition and likely audience demand within those segments. For movies, they need to cater to a broader range of mature ratings.
* Audience Segmentation and Marketing: This insight helps in segmenting the audience based on the type of content they prefer and its associated ratings. Marketing campaigns can be tailored to promote content effectively to different demographic groups.

Insights that could lead to negative growth would arise if Netflix fails to balance its content acquisition and production across different ratings to match subscriber demographics and preferences.

For instance, if they disproportionately acquire R-rated movies while a growing subscriber base prefers TV-PG content, it could lead to dissatisfaction and churn. The chart provides data to assess if the content mix aligns with business goals for different audience segments.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
if set(['release_year','type']).issubset(my_data.columns):
    pivot = pd.crosstab(my_data['release_year'], my_data['type'])
    pivot.plot(kind='line', title='Release Year vs Type counts')
    plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen to visualize the relationship between release year and the count of Movies and TV Shows because it is effective in showing trends over time for two different categories simultaneously. It allows for a clear comparison of how the number of releases for both content types has changed annually.

##### 2. What is/are the insight(s) found from the chart?


From the line chart showing Release Year vs Type counts, we can observe several insights:

* Overall Growth: Both Movies and TV Shows show a significant increasing trend in releases over the years, indicating Netflix's expanding content library.
* Movies Dominance (Historically): Historically, the number of movie releases was consistently higher than TV show releases for many years.
* TV Show Surge: In recent years, particularly from around 2015 onwards, there is a sharp increase in the number of TV show releases, eventually catching up to and even slightly surpassing movie releases in the most recent years shown.
* Peak and Slight Decline: Similar to the overall trend seen earlier, the releases for both content types appear to peak around 2019-2020, with a slight decline in 2021.

These insights highlight the changing composition of Netflix's content library over time, with a growing emphasis on TV shows in recent years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact. Understanding the trends in content type releases is crucial for Netflix's content strategy:

* Content Strategy Alignment: The data shows a clear shift towards increasing TV show releases in recent years. If this aligns with subscriber viewing habits and preferences (e.g., a growing preference for binge-watching series), then this trend is positive and supports business growth.
* Market Trends: The increase in releases for both content types reflects Netflix's aggressive content acquisition and production strategy to remain competitive in the streaming market.
* Resource Allocation: Understanding the relative volume of movies vs. TV shows informs resource allocation for content acquisition, production, and platform features.

Insights that could lead to negative growth would arise if the observed trends do not match subscriber demand. For instance, if Netflix is heavily investing in TV shows but a significant portion of their subscriber base still primarily prefers movies, this mismatch could lead to dissatisfaction and churn.

The slight dip in releases in 2021, if it represents a sustained slowdown without a strategic shift, could also potentially impact growth if subscribers expect a continuous increase in new content.

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select only numeric columns for correlation

df_num = movies[['release_year', 'duration_cleaned']]

# Correlation matrix
corr = df_num.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen to visualize the relationship between numerical variables in the dataset. It provides a quick and intuitive way to see the pairwise correlation coefficients between features, helping to identify potential linear relationships.


##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap shows the correlation between 'release_year' and 'duration_cleaned' (movie duration). The correlation coefficient is approximately -0.2, which indicates a weak negative linear relationship between release year and movie duration. In simple terms, movies released in more recent years tend to be slightly shorter on average, but the relationship is not strong.

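One way to read a coefficient of about -0.2 is through its square: r² ≈ 0.04, so a straight-line fit on release year would account for roughly 4% of the variance in movie duration. The sketch below computes Pearson's r by hand on made-up values (not the Netflix data) and then squares the reported coefficient:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: duration falls exactly linearly with year, so r = -1
years = [2015, 2016, 2017, 2018]
minutes = [110, 105, 100, 95]
print(pearson_r(years, minutes))

# For the coefficient reported in the heatmap, the shared variance is r squared
r_reported = -0.2
r_squared = r_reported ** 2
print(f"r^2 = {r_squared:.2f}")
```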
##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A correlation of about -0.2 between release year and movie duration has limited direct business impact on its own. Release year is not a strong predictor of movie duration, so this relationship alone gives no clear, actionable guidance for content strategy or acquisition. A stronger correlation might have pointed to a trend in movie length over time that Netflix could capitalize on.

The lack of a strong correlation does not in itself lead to negative growth. It simply indicates that these two variables, taken in isolation, are not strongly linearly related in this dataset.

#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(movies[['release_year', 'duration_cleaned']])
plt.suptitle("Pair Plot of Numerical Features", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was chosen to visualize the relationships between numerical features and their individual distributions. It displays scatter plots for each pair of numerical features (showing their relationship) and histograms for each individual feature (showing their distribution). Even with only two numerical features, it provides a quick overview of their relationship and distribution in a single plot.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot of 'release_year' and 'duration_cleaned' for movies, we can observe:

* Scatter Plot (release_year vs. duration_cleaned): The scatter plot does not show a strong linear pattern, confirming the weak correlation observed in the heatmap. The points are spread out, indicating that release year is not a strong predictor of movie duration. There might be a slight trend towards shorter movies in more recent years, but it's not a clear relationship.
* Histogram (release_year): The histogram for 'release_year' shows the distribution of movie release years in the dataset, likely reflecting the increasing trend of content over time (similar to the line plot of releases per year).
* Histogram (duration_cleaned): The histogram for 'duration_cleaned' shows the distribution of movie durations, confirming the peak around 90-110 minutes and the skew to the right observed in the individual movie duration histogram.

Overall, the pair plot reinforces that there is no strong pairwise linear relationship between movie release year and duration.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Similar to the correlation heatmap, the insights from this pair plot regarding the weak relationship between release year and movie duration have limited direct business impact. The lack of a strong predictable relationship doesn't provide clear guidance for content acquisition or production based solely on these two factors.

It reinforces that Netflix's strategy isn't simply tied to acquiring shorter or longer movies based on their release year. Business impact would come from understanding more complex relationships or audience preferences related to duration and other factors, which this simple pair plot doesn't reveal. There are no direct insights from this plot that inherently lead to negative growth.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Here are three hypothetical statements based on the chart experiments:

* **Hypothesis 1**: Movies are, on average, older than TV shows on Netflix (comparing their release years).
* **Hypothesis 2**: The number of actors listed for movies is significantly different from the number of actors listed for TV shows.
* **Hypothesis 3**: There is an association between the content Type (Movie or TV Show) and the Rating.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis (H0)**: The mean release year for Movies is equal to the mean release year for TV Shows.

**Alternative hypothesis (H1)**: The mean release year for Movies is different from the mean release year for TV Shows.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Hypothesis 1: Release year difference between Movies and TV Shows

# Distinct names keep the `movies` DataFrame used by the earlier charts intact
movies_year = my_data[my_data['type']=='Movie']['release_year'].dropna()
tv_year = my_data[my_data['type']=='TV Show']['release_year'].dropna()
res = stats.ttest_ind(movies_year, tv_year, equal_var=False, nan_policy='omit')
print('t-statistic:', res.statistic)
print('p-value:', res.pvalue)
print('Movies mean release year:', movies_year.mean())
print('TV mean release year:', tv_year.mean())

##### Which statistical test have you done to obtain P-Value?

Welch's t-test (two-sample t-test with unequal variances).

##### Why did you choose the specific statistical test?

Release year is a continuous numeric variable and we compare two independent groups (Movies vs TV Shows). The sample sizes are large and variances may differ, so Welch's t-test is appropriate.

**Decision & Conclusion:**

Result: t-statistic = -18.678, p-value ≈ 4.45e-76. **Reject the null hypothesis**. There is a statistically significant difference in release year between Movies (mean ≈ 2012.92) and TV Shows (mean ≈ 2016.19). On average, Movies are about 3.3 years older than TV Shows, which is consistent with Hypothesis 1.

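To make the reported statistic easier to interpret, here is a minimal sketch of Welch's t computed from first principles. The function name `welch_t` and the toy samples are illustrative assumptions; the notebook itself relies on `stats.ttest_ind(..., equal_var=False)`.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of the first mean
    vb = variance(b) / len(b)  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical toy samples, not the Netflix release years
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {dof:.2f}")
```

The sign follows the argument order: a negative statistic, as in the result above, means the first sample (Movies) has the smaller mean.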
### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis (H0)**: The mean number of listed actors is the same for Movies and TV Shows.

**Alternative hypothesis (H1)**: The mean number of listed actors differs between Movies and TV Shows.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Hypothesis 2: Number of actors differs between Movies and TV Shows

# Create 'num_actors' column by counting actors in the 'cast' column
my_data['num_actors'] = my_data['cast'].apply(lambda x: len(x.split(', ')) if x != 'Unknown' else 0)
# Verify the new column is created
print(my_data[['cast', 'num_actors']].head())

movies_na = my_data[my_data['type']=='Movie']['num_actors'].dropna()
tv_na = my_data[my_data['type']=='TV Show']['num_actors'].dropna()
res2 = stats.ttest_ind(movies_na, tv_na, equal_var=False, nan_policy='omit')
print('t-statistic:', res2.statistic)
print('p-value:', res2.pvalue)
print('Movies mean num_actors:', movies_na.mean())
print('TV mean num_actors:', tv_na.mean())

##### Which statistical test have you done to obtain P-Value?

Welch's t-test (two-sample t-test with unequal variances).

##### Why did you choose the specific statistical test?

Number of actors is numeric; we compare two independent groups. Use Welch's t-test due to possible unequal variances.

**Decision & Conclusion:**

Result: t-statistic = -1.458, p-value ≈ 0.145. **Fail to reject the null hypothesis**. No statistically significant difference in the average number of listed actors between Movies and TV Shows at α=0.05 (Movies mean ≈ 7.13, TV mean ≈ 7.32).

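In the test above, titles whose cast is 'Unknown' are counted as 0 actors and still enter both group means. The sketch below, on made-up cast strings, shows how much such zeros can shift a mean. The helper `count_actors` is a hypothetical stand-in for the lambda used in the cell.

```python
from statistics import mean

def count_actors(cast):
    """Number of comma-separated names; 0 for a missing or 'Unknown' cast."""
    if cast is None or cast == "Unknown":
        return 0
    return len(cast.split(", "))

# Hypothetical cast strings, for illustration only
casts = ["A, B, C", "Unknown", "D, E", None, "F"]
counts = [count_actors(c) for c in casts]

mean_all = mean(counts)                          # zeros included
mean_listed = mean(c for c in counts if c > 0)   # only titles with a listed cast

print(counts)
print(mean_all, mean_listed)
```

If the share of unknown casts differs between Movies and TV Shows, comparing only titles with a listed cast may be a useful robustness check.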
### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis (H0)**: Content Type (Movie vs TV Show) is independent of Rating (no association).

**Alternative hypothesis (H1)**: Content Type is associated with Rating (they are not independent).



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Hypothesis 3: Rating distribution is associated with Type (Movie vs TV Show)

ct = pd.crosstab(my_data['type'], my_data['rating'])
chi2, p, dof, expected = stats.chi2_contingency(ct.fillna(0))
print('chi2:', chi2)
print('p-value:', p)
print('degrees of freedom:', dof)
print('\nContingency table (counts):\n', ct)

##### Which statistical test have you done to obtain P-Value?

Chi-square test of independence.

##### Why did you choose the specific statistical test?

Both variables are categorical (Type and Rating). Chi-square tests whether the two categorical variables are independent.

**Decision & Conclusion:**

Result: chi2 ≈ 931.90, p-value ≈ 6.31e-190. **Reject the null hypothesis**. Rating distribution is significantly associated with content type (Movie vs TV Show).

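A very small p-value at this sample size shows that an association exists, but not how strong it is. Cramér's V is one common effect-size measure for a chi-square test. The sketch below computes it in plain Python on two made-up contingency tables; the tables and helper names are assumptions, not the notebook's actual counts or code.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total count for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def cramers_v(table):
    """Cramér's V: 0 means no association, 1 means perfect association."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(stat / (n * k))

# Hypothetical 2x3 tables (rows: content type, columns: rating)
independent = [[20, 30, 50], [40, 60, 100]]
associated = [[90, 5, 5], [10, 45, 45]]
print(cramers_v(independent))
print(round(cramers_v(associated), 3))
```

With the real data, the same formula can be applied to the `chi2` value from `stats.chi2_contingency`, the total count `ct.to_numpy().sum()`, and `min(ct.shape) - 1`.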


## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Copy df to avoid overwriting original accidentally
data = my_data.copy()

# Parse date_added
if 'date_added' in data.columns:
    data['date_added'] = pd.to_datetime(data['date_added'], errors='coerce')
    data['added_year'] = data['date_added'].dt.year
    data['added_month'] = data['date_added'].dt.month
else:
    data['added_year'] = np.nan
    data['added_month'] = np.nan

# Parse duration: extract numeric part
if 'duration' in data.columns:
    data['duration_num'] = data['duration'].str.extract(r'(\d+)').astype(float)
else:
    data['duration_num'] = np.nan

# Type encoding
if 'type' in data.columns:
    data['type_encoded'] = data['type'].map({'Movie':0, 'TV Show':1})
else:
    data['type_encoded'] = np.nan

# Number of actors in cast
if 'cast' in data.columns:
    data['num_actors'] = data['cast'].apply(lambda x: len(str(x).split(', ')) if pd.notnull(x) else 0)
else:
    data['num_actors'] = 0

# Title and description lengths
data['title_length'] = data['title'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)
data['desc_length']  = data['description'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)

# Label encode rating (safe fallback: treat NaN as a string)
if 'rating' in data.columns:
    le = LabelEncoder()
    data['rating_filled'] = data['rating'].fillna('Unknown')
    data['rating_encoded'] = le.fit_transform(data['rating_filled'])
else:
    data['rating_encoded'] = 0

# Genre one-hot flags for top N genres
if 'listed_in' in data.columns:
    genres_series = data['listed_in'].dropna().str.split(', ')
    flat_genres = [g for sublist in genres_series for g in sublist]
    top_genres = pd.Series(Counter(flat_genres)).sort_values(ascending=False).head(10).index.tolist()
    for genre in top_genres:
        col_name = 'genre_' + genre.lower().replace(' ', '_').replace('-','_').replace('&','and')
        data[col_name] = data['listed_in'].apply(lambda x: 1 if pd.notnull(x) and genre in x else 0)
    print('Top genres used for features:', top_genres)
else:
    print('No listed_in column found; skipping genre features.')

# Show sample of engineered features
display(data[['type','type_encoded','duration','duration_num','release_year','added_year','num_actors','title_length','desc_length']].head())

**Feature Engineering Summary:**

*   Created a copy of the original dataframe (`my_data`).
*   Parsed `date_added` to extract `added_year` and `added_month`.
*   Extracted the numeric duration into `duration_num`.
*   Encoded the `type` column numerically (`type_encoded`).
*   Calculated the number of actors in the `cast` (`num_actors`).
*   Calculated the length of `title` and `description`.
*   Handled missing `rating` values and label encoded the column (`rating_encoded`).
*   Created binary features for the top 10 `listed_in` genres.

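When building genre flags from `listed_in`, a plain substring check and an exact label comparison can disagree, because some genre labels contain others. A small sketch with a made-up `listed_in` value (the helper names are hypothetical):

```python
def has_genre_substring(listed_in, genre):
    """Substring test: also fires when the genre is part of a longer label."""
    return genre in listed_in

def has_genre_exact(listed_in, genre):
    """Exact test: compares against each comma-separated label."""
    return genre in [g.strip() for g in listed_in.split(",")]

# Hypothetical listed_in value
listed_in = "TV Dramas, Crime TV Shows"

print(has_genre_substring(listed_in, "Dramas"))  # True: 'Dramas' is inside 'TV Dramas'
print(has_genre_exact(listed_in, "Dramas"))      # False: no label equals 'Dramas'
print(has_genre_exact(listed_in, "TV Dramas"))   # True
```

Which behaviour is preferable depends on whether a flag such as `genre_dramas` is meant to cover TV dramas as well.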
In [None]:
# Select candidate features (auto-include top genre flags if present)
feature_candidates = ['duration_num','release_year','added_year','added_month','num_actors','title_length','desc_length','rating_encoded']
# add genre flags found in data
genre_cols = [c for c in data.columns if c.startswith('genre_')]
feature_candidates += genre_cols

# Filter features that actually exist
features = [c for c in feature_candidates if c in data.columns]
print('Features used:', features)

# Drop rows with missing target
data_model = data.dropna(subset=['type_encoded']).copy()

# Fill missing numeric values with median (simple strategy)
for col in features:
    if pd.api.types.is_numeric_dtype(data_model[col]):
        data_model[col] = data_model[col].fillna(data_model[col].median())
    else:
        data_model[col] = data_model[col].fillna(0)

X = data_model[features]
y = data_model['type_encoded']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

**Data Preprocessing Summary**:

* Selected candidate features for modeling, including engineered numerical features and genre flags.
* Filtered the list to include only features present in the DataFrame.
* Dropped rows where the target variable (type_encoded) is missing.
* Filled remaining missing numerical feature values with the median.
* Separated features (X) and target variable (y).
* Split the data into training (80%) and testing (20%) sets using train_test_split.
* Ensured the proportion of target classes is maintained in both sets using stratify=y.
* Printed the shapes of the training and testing sets.

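In the cell above, medians are computed on all rows before the train-test split. One alternative, sketched here on synthetic data, is to put imputation and scaling inside a scikit-learn pipeline so that their statistics are learned from the training rows only. The arrays `X_demo` and `y_demo` are made up for illustration and are not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (hypothetical values, not the Netflix features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out about 10% of the values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Medians and scaling statistics are learned from the training rows only
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))
```

The same pipeline object can be passed to `GridSearchCV`, in which case the imputer and scaler are refit within each cross-validation fold.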
## ***7. ML Model Implementation***

### Step 7.0 – Overview
We will implement **two classification algorithms** (Logistic Regression and Random Forest) and **two clustering algorithms** (KMeans and Agglomerative Clustering). The target variable for classification is `type` (Movie vs TV Show).

In [None]:
my_data.columns = [c.strip() for c in my_data.columns]

print("Rows, Columns:", my_data.shape)
my_data.head(3)

### Step 7.1 – Data preparation
We remove ID and text-heavy columns that don't help much for modelling, identify numeric and categorical columns, and set up simple preprocessing steps (scaling for numbers, one-hot encoding for categories).

In [None]:
drop_cols = [c for c in ["show_id","title","description","cast","director"] if c in my_data.columns]

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Logistic Regression on the engineered features (no separate preprocessor step)
log_reg = LogisticRegression(max_iter=1000)

# Fit the Algorithm
log_reg.fit(X_train, y_train)

# Predict on the model
y_pred_lr = log_reg.predict(X_test)

acc_lr = accuracy_score(y_test, y_pred_lr)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart


# Display Classification Report
print(f"Logistic Regression Accuracy: {acc_lr:.4f}")
print("Classification Report (LR):\n", classification_report(y_test, y_pred_lr, zero_division=0))
print("Confusion Matrix (LR):\n", confusion_matrix(y_test, y_pred_lr))

# Display Confusion Matrix
print("\nConfusion Matrix (Logistic Regression):")
cm_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(6, 4))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix (Logistic Regression)')
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Fit the Algorithm

# Predict on the model

# Define the parameter grid for Logistic Regression
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller C = stronger regularization)
    'solver': ['liblinear', 'lbfgs'] # Algorithm to use in the optimization problem
}

# Initialize Logistic Regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params_lr = grid_search.best_params_
best_score_lr = grid_search.best_score_

print(f"Best parameters found: {best_params_lr}")
print(f"Best cross-validation accuracy: {best_score_lr:.4f}")

# Evaluate the best model on the test data
best_log_reg = grid_search.best_estimator_
y_pred_lr_tuned = best_log_reg.predict(X_test)

acc_lr_tuned = accuracy_score(y_test, y_pred_lr_tuned)
print(f"Tuned Logistic Regression Accuracy on Test Set: {acc_lr_tuned:.4f}")
print("Classification Report (Tuned LR):\n", classification_report(y_test, y_pred_lr_tuned, zero_division=0))
print("Confusion Matrix (Tuned LR):\n", confusion_matrix(y_test, y_pred_lr_tuned))

##### Which hyperparameter optimization technique have you used and why?

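GridSearchCV was used for hyperparameter tuning. This technique systematically searches through a predefined grid of hyperparameter values and, for each combination, trains and evaluates the model using cross-validation. It was chosen because the grid here is small (two hyperparameters, `C` and `solver`), so an exhaustive search is inexpensive and is guaranteed to find the best combination within the grid.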

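To make the size of the search concrete, the sketch below enumerates the same grid with `itertools.product`. It only counts candidates and does not train any model.

```python
from itertools import product

# Same grid as in the tuning cell above
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "solver": ["liblinear", "lbfgs"],
}
cv_folds = 5

# Every combination of values, one dict per candidate model
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds

print(len(combinations))  # 12 candidates
print(n_fits)             # 60 cross-validation fits, plus one final refit on the full training set
print(combinations[0])
```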

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Comparing the tuned Logistic Regression model with the initial model:

*   **Initial Model Accuracy:** 0.9981
*   **Tuned Model Accuracy:** 0.9987

**Confusion Matrix (Initial LR):**
[[1074    2]
 [   1  481]]

**Confusion Matrix (Tuned LR):**
[[1074    2]
 [   0  482]]

**Analysis:**

While the overall accuracy shows a slight improvement from 0.9981 to 0.9987 after tuning, the improvement is minimal due to the already very high performance of the initial model.

Looking at the confusion matrices:
*   The number of **True Positives** (correctly predicted TV Shows) increased from 481 to 482.
*   The number of **False Negatives** (TV Shows incorrectly predicted as Movies) decreased from 1 to 0.
*   The number of **True Negatives** (correctly predicted Movies) remained 1074.
*   The number of **False Positives** (Movies incorrectly predicted as TV Shows) remained 2.

The classification reports show that for both classes (0: Movie, 1: TV Show), precision, recall, and f1-score are already extremely high (mostly 1.00) even before tuning.

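In conclusion, hyperparameter tuning changed the prediction for a single test title: it removed the one false negative of the initial model, giving a recall of 1.00 for the TV Show class on the test set. With 1,558 test titles, a one-title difference is too small to show that the tuned model is genuinely better. The near-perfect scores of both models likely reflect how directly the features encode the target: `duration_num` holds minutes for Movies but season counts for TV Shows, so this feature alone nearly separates the two classes.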

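The accuracy figures can be reproduced directly from the two confusion matrices reported above. The helper `binary_metrics` is a hypothetical name used only for this sketch.

```python
def binary_metrics(cm):
    """Accuracy, precision and recall for the positive class (label 1).

    cm follows the scikit-learn layout: rows are true labels, columns are predictions.
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Confusion matrices reported above (0 = Movie, 1 = TV Show)
initial_lr = [[1074, 2], [1, 481]]
tuned_lr = [[1074, 2], [0, 482]]

for name, cm in [("initial", initial_lr), ("tuned", tuned_lr)]:
    acc, prec, rec = binary_metrics(cm)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f}")
```

Precision for the TV Show class is essentially unchanged (481/483 versus 482/484) because the two false positives remain after tuning.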
### ML Model - 2

In [None]:
# ML Model - 2 Implementation


# Fit the Algorithm

# Predict on the model
rf = RandomForestClassifier(random_state=42) # Removed the preprocessor step

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

acc_rf = accuracy_score(y_test, y_pred_rf)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(f"Random Forest Accuracy: {acc_rf:.4f}")
print("Classification Report (RF):\n", classification_report(y_test, y_pred_rf, zero_division=0))
print("Confusion Matrix (RF):\n", confusion_matrix(y_test, y_pred_rf))

# Display Confusion Matrix (built from the Random Forest predictions)
print("\nConfusion Matrix (Random Forest):")
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(6, 4))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix (Random Forest)')
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques


# Define the parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20],      # Maximum depth of the tree
    'min_samples_split': [2, 5, 10]   # Minimum number of samples required to split an internal node
}

# Initialize Random Forest model
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=3, scoring='accuracy', n_jobs=-1) # Using cv=3 for faster execution

# Fit GridSearchCV to the training data
grid_search_rf.fit(X_train, y_train)

# Get the best parameters and best score
best_params_rf = grid_search_rf.best_params_
best_score_rf = grid_search_rf.best_score_

print(f"Best parameters found: {best_params_rf}")
print(f"Best cross-validation accuracy: {best_score_rf:.4f}")

# Evaluate the best model on the test data
best_rf = grid_search_rf.best_estimator_
y_pred_rf_tuned = best_rf.predict(X_test)

acc_rf_tuned = accuracy_score(y_test, y_pred_rf_tuned)
print(f"Tuned Random Forest Accuracy on Test Set: {acc_rf_tuned:.4f}")
print("Classification Report (Tuned RF):\n", classification_report(y_test, y_pred_rf_tuned, zero_division=0))
print("Confusion Matrix (Tuned RF):\n", confusion_matrix(y_test, y_pred_rf_tuned))

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used for hyperparameter tuning of the Random Forest model. It was chosen because it systematically checks all combinations in the grid, guaranteeing that the best combination within that grid is found. This is suitable when the number of hyperparameters and their possible values are manageable, as was the case here with n_estimators, max_depth, and min_samples_split.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Comparing the tuned Random Forest model with the initial model:

*   **Initial Model Accuracy:** 0.9994
*   **Tuned Model Accuracy:** 0.9994

**Confusion Matrix (Initial RF):**
[[1075    1]
 [   0  482]]

**Confusion Matrix (Tuned RF):**
[[1075    1]
 [   0  482]]

**Analysis:**

* The overall accuracy on the test set remained the same (0.9994) after hyperparameter tuning for the Random Forest model.

* Both the initial and tuned models achieved perfect recall (1.00) for the TV Show class, correctly identifying all 482 TV Shows.

* For the Movie class, both models correctly identified 1075 out of 1076 movies, each with 1 false positive (a Movie incorrectly classified as a TV Show).

* The initial Random Forest model was already performing at a very high level, and tuning within the specified grid produced an identical confusion matrix. The default parameters were already near optimal for this dataset and feature set.

Since both the accuracy and the confusion matrix are unchanged, an updated evaluation metric chart would look identical to the initial one, with accuracy at 0.9994.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation


# Define columns to drop for clustering
# Exclude ID, text-heavy columns, and the target variable for classification ('type')
drop_cols = ['show_id', 'title', 'director', 'cast', 'country', 'date_added', 'duration', 'listed_in', 'description', 'type']

# For clustering we use all features except the target for classification and the dropped columns.
X_clust = data.drop(columns=drop_cols + ["type_encoded", "rating", "rating_filled"], errors="ignore")


# Identify numerical and binary columns for preprocessing
# Binary genre columns
binary_cols = [col for col in X_clust.columns if col.startswith('genre_')]
# Numerical columns are numeric columns that are not binary genre flags
# (a single condition avoids the `or`/`and` precedence trap that let genre flags into both lists)
numerical_cols = [col for col in X_clust.columns
                  if np.issubdtype(X_clust[col].dtype, np.number) and not col.startswith('genre_')]

# Create a preprocessor to handle imputation and scaling for numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')), # Add imputation
                          ('scaler', StandardScaler())]), numerical_cols), # Add scaling
        ('binary', 'passthrough', binary_cols) # Pass through binary columns
    ],
    remainder='passthrough' # Handle any other columns (though should be none here)
)

# Fit the preprocessor and transform the features into a numeric matrix for clustering
X_proc = preprocessor.fit_transform(X_clust)


# KMeans: choose k by silhouette (simple range)
best_k = None
best_score = -1
# Check if X_proc has more than 1 data point for silhouette score
if X_proc.shape[0] > 1:
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        labels = km.fit_predict(X_proc)
        # Check if there is more than 1 cluster before calculating silhouette score
        if len(np.unique(labels)) > 1:
            score = silhouette_score(X_proc, labels)
            if score > best_score:
                best_score, best_k = score, k
    if best_k is not None:
        print(f"Selected k for KMeans: {best_k} (silhouette={best_score:.4f})")

        # Final KMeans
        km_final = KMeans(n_clusters=best_k, n_init=25, random_state=42)
        labels_km = km_final.fit_predict(X_proc)
        print("KMeans final silhouette:", silhouette_score(X_proc, labels_km))

        # Agglomerative with same k
        agg = AgglomerativeClustering(n_clusters=best_k)
        labels_agg = agg.fit_predict(X_proc)
        print("Agglomerative silhouette:", silhouette_score(X_proc, labels_agg))
    else:
        print("Could not determine best k for KMeans. Data might not be suitable for clustering in this range of k.")

else:
    print("Data has insufficient samples for clustering.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Visualize silhouette scores for the candidate k values.
# The selection loop keeps only the best k, so the score for each k is
# recomputed here (same n_init and random_state) to plot the full curve.
if 'X_proc' in globals() and X_proc.shape[0] > 8:
    k_values = list(range(2, 9))
    sil_scores = []
    for k in k_values:
        labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_proc)
        sil_scores.append(silhouette_score(X_proc, labels_k))

    plt.figure(figsize=(6, 4))
    plt.plot(k_values, sil_scores, marker='o')
    # Mark the k chosen by the selection loop, if it is available
    if globals().get('best_k') is not None:
        plt.axvline(best_k, linestyle='--', color='gray', label=f'selected k={best_k}')
        plt.legend()
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Score by k (KMeans)')
    plt.show()
else:
    print("Processed feature matrix not available for visualization.")


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)


##### Which hyperparameter optimization technique have you used and why?

Hyperparameter tuning for clustering algorithms such as KMeans and Agglomerative Clustering differs from tuning in supervised learning. The main hyperparameter is the number of clusters (k), which was selected by a simple search over k = 2 to 8 using the silhouette score.

Techniques such as GridSearchCV scored with accuracy or F1-score do not apply directly to unsupervised learning tasks, because there are no true labels to score against.

Therefore, no separate cross-validated hyperparameter search was run for these models.
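The label-free search described above can be extended to more than one setting. The following is a minimal sketch, not the notebook's actual code: it assumes a synthetic stand-in matrix `X_demo` (hypothetical, generated with `make_blobs`) and scores every combination of k and linkage for Agglomerative Clustering by silhouette.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the processed feature matrix (assumption: not the Netflix data)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=42,
)

# Score every (k, linkage) combination with the silhouette score
results = []
for k in range(2, 7):
    for linkage in ("ward", "average", "complete"):
        labels = AgglomerativeClustering(n_clusters=k, linkage=linkage).fit_predict(X_demo)
        results.append((silhouette_score(X_demo, labels), k, linkage))

# Highest silhouette wins; ties are broken by k, then linkage name
best_sil, best_k_demo, best_linkage = max(results)
print(f"best k={best_k_demo}, linkage={best_linkage}, silhouette={best_sil:.3f}")
```

The silhouette score plays the role that accuracy plays in GridSearchCV, so the same loop could wrap KMeans settings such as `n_init` as well.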

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

None. No separate tuning step was run for the clustering models, so there is no updated score to report.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For the classification task (classifying content as Movie or TV Show), several evaluation metrics were considered, and their relevance to business impact is as follows:

* Accuracy: Overall correctness of the model's predictions. A high accuracy means the model is generally good at distinguishing between movies and TV shows. Business Impact: High accuracy is fundamental for reliable content categorization, which impacts content recommendations, search filters, and targeted marketing. Incorrect classifications can lead to a poor user experience and missed opportunities for engagement.
* Precision: Of all the instances predicted as positive (e.g., TV Show), what proportion were actually positive? Business Impact: High precision is important when the cost of a False Positive is high. For example, incorrectly recommending a TV Show to someone who only wants to watch movies could lead to frustration. For Netflix, ensuring that content categorized as a specific type is indeed that type helps maintain user trust and improves content discovery.
* Recall (Sensitivity): Of all the actual positive instances (e.g., TV Show), what proportion were correctly predicted as positive? Business Impact: High recall is important when the cost of a False Negative is high. For example, failing to classify a TV Show correctly means it won't be shown to users who are specifically looking for TV Shows, leading to missed viewing opportunities. For Netflix, high recall ensures that users can find all relevant content of a particular type.
* F1-Score: The harmonic mean of precision and recall. It provides a balance between the two metrics. Business Impact: The F1-Score is useful when you need to consider both false positives and false negatives. A high F1-Score indicates a good balance between accurately identifying positive cases and not incorrectly labeling negative cases as positive.
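The four metrics above can all be derived from the confusion matrix. The sketch below uses a small set of hypothetical labels (not the project's test set), with TV Show treated as the positive class, and checks the hand-computed values against scikit-learn's metric functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration only (1 = TV Show, 0 = Movie)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted TV Shows that are TV Shows
recall = tp / (tp + fn)                     # share of actual TV Shows that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```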

For the clustering task, evaluation metrics like the Silhouette Score were considered. This metric measures how similar an object is to its own cluster compared to other clusters. A higher silhouette score indicates that objects are well-matched to their own cluster and poorly matched to neighboring clusters.

Business Impact: While not a direct measure of classification performance, good clustering (indicated by metrics like silhouette score) can lead to insights about groups of content with similar characteristics. These insights can inform content bundling, genre categorization refinement, or understanding viewer segments, which can indirectly impact business through improved content strategy and targeted offerings.
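To make the silhouette score concrete: for each point, let a be its mean distance to the other points in its own cluster and b its mean distance to the points of the nearest other cluster; the point's score is (b - a) / max(a, b), and the overall score is the average over all points. The sketch below computes this by hand on a tiny hypothetical dataset and compares it with `silhouette_score`. It assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Tiny hypothetical 2-D dataset with two obvious groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X_toy)  # Euclidean distances by default


def silhouette_manual(D, labels):
    """Mean of (b - a) / max(a, b) over all points; assumes no singleton clusters."""
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = D[i, same].mean()  # mean distance to the rest of its own cluster
        b = min(D[i, labels == c].mean()  # mean distance to the nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


manual = silhouette_manual(D, labels_toy)
library = silhouette_score(X_toy, labels_toy)
print(f"manual={manual:.4f} sklearn={library:.4f}")
```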

Given the project's goals, high Accuracy, together with strong Precision and Recall for both classes (Movie and TV Show), is key for positive business impact in terms of effective content management and user experience.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Based on the performance evaluation, both the Logistic Regression and Random Forest models achieved very high accuracy (close to 100%) in classifying content as either a Movie or a TV Show.

Given the near-perfect performance of both models on this task, either could be considered the final prediction model. However, for this project, the **Random Forest Classifier** is chosen as the final prediction model.

**Reasoning for choosing Random Forest:**

*   **High Performance:** It achieved slightly higher accuracy on the test set compared to the tuned Logistic Regression model (though the difference was minimal).
*   **Robustness:** Random Forest is generally robust and less prone to overfitting compared to a single decision tree.
*   **Feature Importance:** Random Forest models can provide insights into the importance of different features in making predictions, which can be valuable for understanding which factors are most influential in distinguishing between movies and TV shows. This aligns with the project's goal of presenting insights and patterns.

While Logistic Regression also performed exceptionally well and is more interpretable, the ability of Random Forest to provide feature importance makes it slightly more valuable for gaining insights from the data in this context.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model chosen as the final prediction model for classifying content type is the Random Forest Classifier.

**Explanation of the Random Forest Model:**

As explained earlier, Random Forest is an ensemble learning algorithm that builds multiple decision trees during training and combines their predictions. It is effective for classification tasks, handles complex relationships, and provides insights into feature importance. Its ensemble nature makes it robust and less prone to overfitting compared to a single decision tree.
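As a small illustration of how the ensemble combines its trees: in scikit-learn, a `RandomForestClassifier` predicts class probabilities by averaging the probabilities predicted by its individual trees. The sketch below checks this on synthetic data (an assumption; it is not the Netflix feature matrix).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, for illustration only
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_syn, y_syn)

# Average the per-tree class probabilities by hand
tree_probs = np.stack([tree.predict_proba(X_syn) for tree in forest.estimators_])
avg_probs = tree_probs.mean(axis=0)

# Compare with the forest's own probability estimates
forest_probs = forest.predict_proba(X_syn)
print("matches forest output:", np.allclose(avg_probs, forest_probs))
```

Averaging many trees, each grown on a bootstrap sample with random feature subsets, is what makes the forest less sensitive to the quirks of any single tree.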

**Feature Importance:**

Using the feature_importances_ attribute of the trained Random Forest model (which is a built-in form of model explainability for tree-based models), we can identify which features were most influential in helping the model classify content as either a Movie or a TV Show.

Based on the calculated feature importances:

* `duration_num` (the numeric duration of the content: minutes for movies, seasons for TV shows) is by far the most important feature. Because the two content types are measured in different units, their values fall in largely separate ranges, which makes this feature a near-direct indicator of content type in this dataset.
* Genre-related features such as `genre_international_tv_shows`, `genre_international_movies`, and `genre_tv_dramas` are also highly important, confirming that genre information plays a significant role in classifying content type.
* Other features such as `rating_encoded`, `genre_children_and_family_movies`, and `genre_documentaries` show some importance, though less than duration and the top genres.
* Features such as `release_year`, `added_year`, `added_month`, `num_actors`, `title_length`, and `desc_length` have relatively low importance compared to duration and the key genres.

These feature importances provide valuable insights into what characteristics of content are most indicative of whether it is a movie or a TV show in this dataset. Understanding these can inform content categorization, recommendation systems, and content acquisition strategies.
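The ranking described above comes from the model's `feature_importances_` attribute. The following is a minimal sketch on hypothetical synthetic data, not the project's dataset: it borrows the column names `duration_num`, `release_year`, and `title_length`, and builds `duration_num` in seasons for TV shows and minutes for movies to mimic why that feature dominates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 400
is_tv = rng.integers(0, 2, size=n)  # hypothetical target: 1 = TV Show, 0 = Movie

# Hypothetical features: seasons (1-5) for TV shows, minutes (60-180) for movies;
# the other two columns are random noise unrelated to the target
demo = pd.DataFrame({
    "duration_num": np.where(is_tv == 1,
                             rng.integers(1, 6, size=n),
                             rng.integers(60, 181, size=n)),
    "release_year": rng.integers(1990, 2022, size=n),
    "title_length": rng.integers(5, 60, size=n),
})

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(demo, is_tv)

# Impurity-based importances, one per feature, normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on a held-out set is a common cross-check.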

# **Conclusion**

This project aimed to analyze Netflix's content and build machine learning models for classification (Movie vs TV Show) and clustering.

Through Exploratory Data Analysis (EDA), several key insights were gained:
*   Netflix's library is dominated by Movies, although the number of TV Show releases has significantly increased in recent years.
*   Content ratings like TV-MA, TV-14, and R are the most prevalent, indicating a focus on mature audiences.
*   The United States is the primary source of content, with significant contributions from other countries like India (movies) and the UK/Japan (TV shows).
*   Most TV shows on the platform have a small number of seasons (1-2), while movies have a duration distribution peaking around 90-120 minutes.
*   There is a strong association between Content Type and Rating.

Feature Engineering and Data Preprocessing involved creating numerical features from date, duration, cast, title/description length, and one-hot encoding genres. Missing values were handled.

For the classification task, Logistic Regression and Random Forest models were implemented. Both models achieved exceptionally high accuracy (near 100%) in classifying content type. The Random Forest model was selected as the final classifier, partly due to its ability to provide feature importance. Feature importance analysis revealed that `duration_num` and genre-related features are the most influential in distinguishing between movies and TV shows.

For the clustering task, KMeans and Agglomerative Clustering were implemented after scaling the numerical features and including binary genre features. The optimal number of clusters (k) was explored using the silhouette score.

Overall, the project successfully analyzed the Netflix content dataset, revealed interesting trends and patterns, and built highly accurate classification models capable of distinguishing between movies and TV shows based on their metadata and engineered features. The insights gained from the EDA and feature importance analysis can be valuable for Netflix's content strategy, acquisition, and recommendation systems.

**Key Takeaways:**
*   The length of content and its genre are strong indicators of whether a title is a Movie or a TV Show.
*   Netflix has a diverse international catalog but with a clear concentration in certain countries and content ratings.
*   The platform's content strategy has evolved over time, with a recent surge in TV show releases.

This project provides a solid foundation for understanding the Netflix content landscape and demonstrates the effectiveness of machine learning techniques for content classification and grouping.