# **Project Name**    - **Netflix Movies and TV Shows Clustering**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Vaishnavi Sahu**

# **Project Summary -**

This project focuses on performing a comprehensive Exploratory Data Analysis (EDA) of Netflix’s catalog of movies and TV shows to uncover meaningful insights about the platform’s content strategy, audience preferences, and business direction.
The dataset used contains 7,787 records and 12 columns, providing information such as title, type (Movie or TV Show), director, cast, country, date added, release year, rating, duration, genres (listed_in), and description. The data was sourced from Flixable, a third-party Netflix search engine, and represents titles available on Netflix up to 2019.

In addition to EDA, this project serves as the foundation for a machine learning (ML) task using K-Means clustering, where the goal is to group similar titles based on features such as genre, rating, duration, and release period.
The insights derived from EDA help define relevant features, guide preprocessing steps, and ensure the data is well-prepared for clustering analysis.

1. Data Cleaning and Preprocessing

The initial step involved inspecting and cleaning the dataset. Columns such as director, cast, country, and rating contained missing values, which were replaced with "Unknown" to retain records without losing data.
The date_added column was converted into a proper datetime format, and two new columns — year_added and month_added — were derived for trend analysis.
The duration column was further split into duration_minutes (for movies) and seasons (for TV shows) to enable quantitative analysis across both types of content.
The dataset was found to be free of duplicates, with minimal data inconsistencies, making it ready for both visual exploration and machine learning applications.
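
The `duration` split described above does not appear in the wrangling cell later in this notebook, so the following is only a minimal sketch of one way to do it. It assumes duration strings look like "90 min" or "2 Seasons" and that `type` holds "Movie" or "TV Show" in any letter case. The helper name `split_duration` and the sample rows are hypothetical.

```python
import pandas as pd

def split_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with duration_minutes (movies) and seasons (TV shows)."""
    out = df.copy()
    # Pull the leading integer out of strings such as "90 min" or "2 Seasons".
    number = pd.to_numeric(
        out["duration"].astype(str).str.extract(r"(\d+)")[0], errors="coerce"
    )
    is_movie = out["type"].astype(str).str.strip().str.lower().eq("movie")
    # Movies keep the number as minutes; TV shows keep it as a season count.
    out["duration_minutes"] = number.where(is_movie)
    out["seasons"] = number.where(~is_movie)
    return out

# Made-up rows, including an "Unknown" placeholder that has no number.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "Unknown"],
})
result = split_duration(sample)
print(result[["type", "duration_minutes", "seasons"]])
```

Keeping the two measures in separate columns avoids mixing minutes and season counts in one numeric feature, which matters once the data is passed to a distance-based model.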

2. Exploratory Data Analysis (EDA)
The EDA process involved generating a series of visualizations to interpret Netflix’s content trends:

*   Content Distribution: Movies dominate the catalog with around 70% share, while TV Shows make up the remaining 30%. However, the number of TV Shows has increased significantly after 2015, showing Netflix’s growing investment in serialized content.
*   Yearly and Monthly Trends: Most titles were added to Netflix between 2016 and 2019, indicating aggressive content expansion during that period. Content additions are spread fairly evenly across months, with slightly higher additions in some months such as December.
*   Regional Contributions: The United States leads with the highest number of titles, followed by India, the United Kingdom, and Japan. A deeper comparison between India and the US showed that while the US dominates in volume, India has shown a steep growth trajectory since 2016, reflecting Netflix’s focus on international markets and localization.
*   Ratings: Ratings analysis revealed that most titles are rated TV-MA and TV-14, implying a heavy focus on adult and teenage audiences. Family-friendly content such as PG or TV-Y is relatively limited, representing an opportunity for expansion.
*   Duration and Seasons: Most movies are between 80 and 120 minutes, and most TV shows have 1–2 seasons, aligning with Netflix’s binge-watch culture.
*   Genres: Popular genres include International Movies, Dramas, Comedies, and Documentaries, showing a preference for storytelling variety and global appeal.
*   Directors and Cast: A few directors, like Raúl Campos, Jan Suter, and Marcus Raboy, contribute multiple titles, often in stand-up or documentary categories, highlighting Netflix’s collaboration with consistent creators.

3. Numerical Analysis
A correlation heatmap and pair plot were used to explore relationships between numerical variables such as release year, year added, duration, and seasons.
The results showed weak correlations, implying that content length, release year, and addition time are largely unrelated in a linear sense and that the catalog does not follow rigid patterns.
This diversity is consistent with a strategy of providing a balanced range of content to appeal to varied audiences. Weak correlation also means the numeric features carry little redundant information, which is helpful (though not the same as statistical independence) for distance-based unsupervised models like K-Means clustering.
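
As a rough illustration of the numerical analysis described above, the sketch below computes a Pearson correlation matrix with pandas. The rows are made up, and the column name `duration_minutes` is an assumption based on the preprocessing description, so the printed values say nothing about the real dataset.

```python
import pandas as pd

# Tiny made-up sample; the column names mirror those described in this project.
sample = pd.DataFrame({
    "release_year": [2010, 2015, 2018, 2019, 2012],
    "year_added": [2016, 2017, 2019, 2019, 2018],
    "duration_minutes": [95, 110, 88, 102, 120],
})

# Pearson correlation between every pair of numeric columns.
corr = sample.corr()
print(corr.round(2))

# With seaborn imported as sns, the same matrix can be drawn as a heatmap:
# sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```

Values near zero indicate little linear relationship between two columns; they do not rule out non-linear dependence.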

4. Key Insights and Business Recommendations
From the analysis, Netflix appears to be successfully pursuing a data-driven, globally diverse, and modernized content strategy. The findings suggest:

*   Continue expanding international content, especially regional originals in fast-growing markets like India.
*   Invest in family and educational content to attract more household subscribers.
*   Balance between shorter mini-series and longer-running shows to engage both casual and loyal viewers.
*   Maintain consistent monthly content releases to sustain engagement.
*   Diversify collaborations with new directors and storytellers for fresh creative perspectives.

5. Conclusion
Overall, this project demonstrates that Netflix’s success lies in its content diversity, global expansion, and responsiveness to audience behavior.
The EDA results provide strong evidence of Netflix’s shift from being a US-centric platform to a global entertainment powerhouse.

The insights gained here also lay the groundwork for the machine learning phase, where K-Means clustering will be applied to group similar content based on patterns in duration, rating, genre, and release information.
This clustering will help Netflix identify content similarities, design recommendation systems, and optimize catalog management for different audience segments.

Netflix’s data-driven approach, supported by both EDA insights and ML techniques, ensures it remains a leader in the competitive streaming industry — continuously evolving to meet the needs of a global audience.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Netflix, one of the world’s leading OTT (Over-The-Top) streaming platforms, hosts a vast and diverse collection of Movies and TV Shows from multiple countries, genres, and languages. As the content library continues to expand, understanding patterns in the catalog (such as regional trends, content types, audience ratings, and duration preferences) becomes critical for strategic decision-making and improving user engagement.

The main goal of this project is to perform Exploratory Data Analysis (EDA) on the Netflix dataset to uncover meaningful insights about its content strategy, distribution patterns, and audience focus. The dataset consists of 7,787 titles and 12 features, including details such as title, type, director, cast, country, release year, date added, rating, duration, genres, and description.

This analysis aims to:

*   Explore trends and patterns in content distribution (Movies vs. TV Shows).
*   Identify growth in content additions across years and regions (especially the US and India).
*   Analyze the most popular genres, ratings, durations, and seasons to understand viewer preferences.
*   Highlight the global expansion strategy through localization of content.
*   Prepare clean, structured, and feature-rich data for the next stage: Machine Learning-based Clustering.

In the subsequent ML phase, the cleaned and processed dataset will be used to implement K-Means clustering, an unsupervised learning algorithm. This model will group similar titles based on attributes such as genre, rating, duration, and release patterns, allowing Netflix to better understand its content segmentation and improve its recommendation systems.

By combining EDA and ML-driven insights, the project ultimately aims to support Netflix in achieving its business objectives of content diversification, audience retention, and data-driven decision-making.

#### **Define Your Business Objective?**

The primary business objective of this project is to help Netflix optimize its content strategy and enhance user engagement through data-driven insights and clustering-based analysis.

As Netflix continues to expand its global catalog, understanding how different types of content perform across regions, genres, and audiences becomes essential for maintaining competitiveness in the OTT market.
This project aims to:

1. Analyze Netflix’s Content Portfolio
   *   Explore trends in Movies and TV Shows by release year, addition year, duration, and ratings.
   *   Identify the most popular genres, content lengths, and regions contributing to the platform.
2. Support Strategic Decision-Making
   *   Provide actionable insights on audience preferences and content diversity to guide Netflix’s content acquisition, production, and localization strategies.
   *   Highlight growth opportunities in underrepresented areas such as family-friendly and regional content.
3. Enable Machine Learning-Based Segmentation
   *   Use insights from EDA to prepare data for K-Means clustering, an unsupervised ML model that will group similar titles based on shared attributes like genre, rating, duration, and release patterns.
   *   These clusters can help Netflix understand its content landscape better, identify content gaps, and design targeted recommendations for different audience segments.

By achieving these objectives, Netflix can strengthen its content personalization, regional expansion, and audience retention strategies, ensuring sustainable growth and a more tailored viewing experience for users worldwide.
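
The clustering step is planned rather than implemented in this part of the notebook. To make the idea concrete, here is a minimal NumPy sketch of Lloyd's algorithm, the standard K-Means iteration. The feature matrix is made up, and the function name `kmeans` is hypothetical; in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its rows; keep it if the cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up features: [release_year, duration in minutes] for six titles.
raw = np.array([[2018, 95], [2019, 100], [2017, 90],
                [1995, 150], [1998, 160], [1992, 155]], dtype=float)
# Standardize so that no single feature dominates the distance.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)
labels, centroids = kmeans(X, k=2)
print(labels)
```

Standardizing first matters because K-Means relies on Euclidean distance, so a feature with a large numeric range (such as year) would otherwise outweigh the others. Categorical attributes like genre and rating would need to be encoded numerically before they could be used this way.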

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions in this format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
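# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# google.colab exists only inside Colab; guard the import so the notebook also runs elsewhere
try:
    from google.colab import files
except ImportError:
    files = None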

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows = df.shape[0]
num_cols = df.shape[1]
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows in the dataset: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print(f"Missing values per column: \n{missing_values}")

In [None]:
# Visualizing the missing values
missing_cols = missing_values[missing_values > 0]

plt.figure(figsize=(10,5))
sns.barplot(x=missing_cols.index,
            y=missing_cols.values,
            hue=missing_cols.index,
            palette="viridis",
            legend=False)
plt.xticks(rotation=45)
plt.ylabel("Number of missing values")
plt.title("Missing values per column")
plt.show()

### What did you know about your dataset?

The dataset contains 7,787 rows and 12 columns.
Each row represents a movie or TV show available on Netflix as of 2019.

Columns Description:

*   show_id: Unique identifier for each title.
*   type: Content type → Movie or TV Show.
*   title: Name of the content.
*   director: Director(s) (missing for ~30.7% of entries).
*   cast: Main actors/actresses (missing for ~9.2% of entries).
*   country: Country of origin (missing for ~6.5% of entries).
*   date_added: The date when the content was added to Netflix (only 10 missing). It is converted to datetime format during data wrangling for time-based analysis.
*   release_year: Original release year (no missing values).
*   rating: Audience maturity rating (e.g., PG, R, TV-MA) (7 missing).
*   duration: Duration in minutes (for Movies) or number of seasons (for TV Shows).
*   listed_in: Genre(s) or categories.
*   description: Short summary of the content.

Data Quality Notes:

*   There are no duplicate rows.
*   Main missing values are in director, cast, country, and a few in date_added and rating.
*   Data types:
    *   release_year is numeric (int64).
    *   date_added is loaded as text and converted to datetime during data wrangling.
    *   Most other columns are categorical (object).

✅ In summary, this dataset is mostly clean and provides metadata about Netflix’s catalog.
Once date_added is converted to datetime, we can also analyze content growth trends over time.
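
The missing-value percentages quoted above can be derived directly from the null mask. The sketch below shows the calculation on a made-up frame; in the notebook the same expression would be applied to the loaded `df`.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps; in the notebook this would be the loaded `df`.
sample = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "cast": ["X", "Y", np.nan, "Z"],
    "title": ["t1", "t2", "t3", "t4"],
})

# The mean of a boolean null mask is the fraction of missing rows per column.
missing_pct = sample.isnull().mean().mul(100).round(1)
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```

Reporting percentages alongside raw counts makes it easier to judge which columns can be filled with a placeholder and which are too sparse to rely on.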

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.tolist()

In [None]:
# Dataset Describe
df.describe()

### Variables Description

*   show_id → Unique ID assigned to each title in the dataset.
*   type → Indicates whether the content is a Movie or a TV Show.
*   title → Name of the movie or TV show.
*   director → Name(s) of the director(s) of the content.
*   cast → Main actors/actresses featured in the content.
*   country → Country where the movie/TV show was produced.
*   date_added → The date when the movie/TV show was added to Netflix (converted to datetime format during data wrangling).
*   release_year → Year when the content was originally released.
*   rating → Audience maturity rating (e.g., PG, R, TV-MA).
*   duration → Duration of the content:
    *   Movies → in minutes.
    *   TV Shows → in number of seasons.
*   listed_in → Categories/genres of the content (e.g., "Drama", "Action", "Comedy").
*   description → Short summary or synopsis of the content.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Remove duplicates
df.drop_duplicates(inplace=True)
print("After removing duplicates, dataset shape:", df.shape, "\n")

# 2. Handle missing values: fill categorical gaps with "Unknown" so no rows are dropped
df.fillna({
    'country': 'Unknown',
    'director': 'Unknown',
    'cast': 'Unknown',
    'rating': 'Unknown',
    'duration': 'Unknown',
    'listed_in': 'Unknown'
}, inplace=True)

# 3. Convert columns to correct types
# Strip stray whitespace first so that valid dates are not coerced to NaT
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
df['release_year'] = df['release_year'].astype(int)

# 4. Standardize text columns
df['type'] = df['type'].str.strip().str.lower()
df['rating'] = df['rating'].str.strip()

print("Dataset after data wrangling:")
df.info()  # info() prints its own summary and returns None

### What all manipulations have you done and insights you found?

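Manipulations Performed:

*   Removed duplicate rows (none were found).
*   Replaced missing values in categorical columns (director, cast, country, rating, duration, listed_in) with "Unknown" to avoid dropping useful rows.
*   Converted date_added to datetime and release_year to integer.
*   Created new columns:
    *   year_added → extracted the year from date_added for trend analysis.
    *   month_added → extracted the month from date_added for seasonal analysis.
*   Standardized text columns: trimmed whitespace in type and rating, and lower-cased type.

Key Insights from EDA:

*   Content Type: Netflix has more Movies than TV Shows in the dataset.
*   Release Year Trend: Most content was released after 2000, with a sharp rise after 2010, showing Netflix focuses on recent content.
*   Year Added Trend: Number of titles added to Netflix has been increasing year by year (especially after 2015).
*   Ratings: The most common audience ratings are TV-MA (Mature Audience) and TV-14, showing Netflix has a lot of content for older teens and adults.
*   Top Countries: The majority of Netflix titles come from the United States, followed by India, United Kingdom, and Japan.
*   Genres: The most popular categories are International Movies, Dramas, and Comedies.
*   Seasonality: Higher additions are seen in January, October, November, and December, with December the highest, likely due to holiday season releases.
*   Release-to-Addition Gap: For titles released in the 1990s–2000s, the average delay before being added to Netflix was around 15–20 years. From 2010 onwards the delay drops sharply, falling below 5 years by 2015, and after 2018 it is almost zero, likely due to more Netflix Originals and faster licensing deals.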
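
The seasonality observation relies on counting additions per month. The sketch below shows one way to build that table, split by content type, from `date_added`. The sample rows are made up, and the lower-case `type` values assume the standardization done in the wrangling cell.

```python
import pandas as pd

# Made-up rows; one has no date and is excluded from the monthly counts.
sample = pd.DataFrame({
    "type": ["movie", "tv show", "movie", "movie"],
    "date_added": ["December 1, 2019", "January 5, 2018", "December 20, 2018", None],
})
sample["date_added"] = pd.to_datetime(sample["date_added"], errors="coerce")

# Drop undated rows before extracting the month so the month stays an integer.
dated = sample.dropna(subset=["date_added"]).copy()
dated["month_added"] = dated["date_added"].dt.month

# Rows are months (1-12), columns are content types, cells are title counts.
monthly = dated.groupby(["month_added", "type"]).size().unstack(fill_value=0)
print(monthly)
```

A table in this shape can be passed straight to `monthly.plot(kind="bar")` to compare Movies and TV Shows month by month.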


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# 1. distribution of content types
type_counts_df = df['type'].value_counts().reset_index()
type_counts_df.columns = ['type', 'count']
sns.barplot(data= type_counts_df, x = 'type', y='count', hue='type',dodge=False, palette="viridis")
plt.title("Distribution of Content Type on Netflix")
plt.ylabel("Number of Titles")
plt.xlabel("Content Type")
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot of category counts was chosen because it clearly shows the distribution of Netflix content across two categories: Movies and TV Shows. It is the simplest and most effective way to compare counts between two groups.

##### 2. What is/are the insight(s) found from the chart?

Movies dominate the Netflix catalog with 5,377 titles compared to 2,410 TV shows. This shows Netflix has historically focused more on movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes. Knowing that movies are more prevalent can help Netflix analyze whether they need to increase investments in TV shows (which tend to drive long-term subscriptions) or maintain the current balance. This insight can also inform content strategy and marketing decisions.
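
The counts quoted for this chart translate into catalog shares as follows. The two numbers are taken from the insight above; on the real data the same result would come from `df['type'].value_counts(normalize=True)`.

```python
import pandas as pd

# Counts quoted in the insight for Chart 1.
counts = pd.Series({"movie": 5377, "tv show": 2410})

# Share of the catalog per content type, in percent.
share_pct = (counts / counts.sum() * 100).round(1)
print(share_pct)  # movie 69.1, tv show 30.9
```

This is where the "around 70% Movies, 30% TV Shows" figure in the project summary comes from.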

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# 2. Distribution of release year
plt.figure(figsize=(12,6))
sns.histplot(df['release_year'], bins=30, kde=False, color="skyblue")
plt.title("Distribution of Netflix Titles by Release Year")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because release_year is a numeric variable, and binning it shows how the catalog is spread across release years and where titles concentrate.

##### 2. What is/are the insight(s) found from the chart?

*   Most titles were released after 2000, with a sharp rise after 2010.
*   Older release years have far fewer titles, which suggests the catalog leans toward recent productions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. A catalog weighted toward recent releases supports a strategy of acquiring and producing new titles to maintain user interest.

A potential negative growth insight is the low presence of older titles, which could alienate users who enjoy classic movies or shows. Maintaining a small but consistent library of classics could help attract a broader audience.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# 3. number of titles added per year on netflix
added_counts = df['year_added'].value_counts().sort_index()
plt.figure(figsize=(12,6))
sns.lineplot(x=added_counts.index, y=added_counts.values, marker="o", color="orange")
plt.title("Number of Titles Added Per Year on Netflix")
plt.xlabel("Year Added")
plt.ylabel("Number of Titles Added")
plt.show()

##### 1. Why did you pick the specific chart?

I picked a line plot because year_added is ordered in time, and connecting the yearly counts makes the growth trend in content additions easy to follow.

##### 2. What is/are the insight(s) found from the chart?

*   The number of titles added to Netflix has increased year by year, especially after 2015.
*   Most titles were added between 2016 and 2019, with the peak in 2018–2019.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The steep rise in additions reflects aggressive catalog expansion, which gives subscribers more to watch and supports engagement.

A potential negative growth insight is the rising cost of rapid content acquisition and production. Netflix must ensure this pace remains financially sustainable while maintaining quality.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# 4. Top 10 countries with most content
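# Skip the "Unknown" placeholder from wrangling so it is not ranked as a country
countries_series = df.loc[df['country'] != 'Unknown', 'country'].str.split(', ').explode()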
top_countries = countries_series.value_counts().head(10)
top_countries_df = top_countries.reset_index()
top_countries_df.columns = ['country', 'count']
sns.barplot(data=top_countries_df, x='count', y='country', hue='country', dodge = False, palette = "magma")
plt.title("Top 10 Countries with Most Netflix Content")
plt.xlabel("Number of Titles")
plt.ylabel("Country")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it ranks the Top 10 contributing countries by number of titles and keeps long country names readable. The country column is split on ", " first, so a co-production counts toward each country involved.

##### 2. What is/are the insight(s) found from the chart?

*   The United States dominates Netflix’s library with the highest number of titles. Broken down by type, it has 1,850 Movies and 705 TV Shows.
*   India stands out as the second-largest contributor, heavily focused on Movies (852).
*   Countries like the United Kingdom, Japan, and South Korea contribute a more balanced mix of Movies and TV Shows.
*   Smaller but consistent contributions are observed from Canada, France, Spain, and Turkey.
*   Some regions (e.g., South Korea and Japan) show a stronger focus on TV Shows compared to Movies, reflecting cultural preferences for serialized content.

The per-type figures in this list come from a Movie/TV Show breakdown by country, which this chart does not draw.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights have a strong positive business impact. Netflix can:

*   Strengthen its localized content strategy by investing more in top-performing regions like the U.S. and India.
*   Focus on regional preferences, e.g., producing more TV Shows in South Korea/Japan where demand is strong.
*   Expand content acquisition in countries with lower contributions but growing audiences (e.g., Spain, Turkey).

Potential risk (negative growth insight): Over-reliance on U.S. and Indian markets might limit global diversification. If Netflix fails to balance regional contributions, it could face saturation in dominant regions and miss growth opportunities in emerging markets. Hence, diversification is critical.
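
The per-type figures quoted for this chart (for example 1,850 Movies and 705 TV Shows for the United States) need a country-by-type breakdown, which the plotting cell does not compute. The sketch below shows one possible way to build it with `pd.crosstab`. The rows are made up, and splitting co-productions on ", " is an assumption carried over from the chart code, so totals may differ from figures computed on unsplit country values.

```python
import pandas as pd

# Made-up rows, including a co-production and an "Unknown" placeholder.
sample = pd.DataFrame({
    "type": ["movie", "tv show", "movie", "movie"],
    "country": ["United States, India", "United States", "India", "Unknown"],
})

# One row per (title, country) pair so co-productions count for each country.
exploded = (
    sample.assign(country=sample["country"].str.split(", "))
    .explode("country")
)
# Leave out titles whose country was filled with the placeholder.
exploded = exploded[exploded["country"] != "Unknown"]

# Rows are countries, columns are content types, cells are title counts.
by_type = pd.crosstab(exploded["country"], exploded["type"])
print(by_type)
```

Sorting `by_type` by its row totals and taking the first ten rows would give the data for a grouped bar chart of Movies versus TV Shows per country.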

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# 5. top 10 genres
genres_series = df['listed_in'].str.split(', ').explode()
top_genres = genres_series.value_counts().head(10)
top_genres_df = top_genres.reset_index()
top_genres_df.columns = ['genre', 'count']
sns.barplot(data=top_genres_df, x='count', y='genre', hue='genre', dodge=False, palette="coolwarm")
plt.title("Top 10 Genres on Netflix")
plt.xlabel("Number of Titles")
plt.ylabel("Genre")
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to rank the top 10 genres by number of titles. The listed_in column is split on ", " first, so a title tagged with several genres counts toward each of them.

##### 2. What is/are the insight(s) found from the chart?

*   The most popular categories are International Movies, Dramas, and Comedies, with Documentaries also among the popular genres.
*   The prominence of International Movies points to a catalog built for storytelling variety and global appeal.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing which genres dominate the catalog helps guide content acquisition and production, and the strength of International Movies supports continued investment in international and regional content.

A potential negative growth insight is that underrepresented areas, such as family-friendly content, remain a gap. Expanding those areas could help Netflix reach a broader audience while keeping genre diversity.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# 6. content type vs rating
plt.figure(figsize=(12,6))
sns.countplot(data=df, x='rating', hue='type', palette="Set2")
plt.title("Content Type vs Rating")
plt.xticks(rotation=45)
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a grouped count plot with hue="type" because it clearly shows the distribution of Movies and TV Shows across different audience ratings (such as TV-MA, TV-14, R, PG-13, etc.). This chart helps in understanding what type of content Netflix offers for various audience maturity levels.

##### 2. What is/are the insight(s) found from the chart?

*   The TV-MA (Mature Audience) rating dominates for both Movies and TV Shows, showing Netflix’s strong focus on adult-oriented content.
*   TV-14 is the second most common rating, indicating significant content for teenagers and general audiences.
*   Family-friendly or children’s ratings like TV-Y, TV-G, and PG have relatively low counts.
*   Movies have a broader range of ratings, while TV Shows are mostly concentrated around TV-MA and TV-14.
*   This suggests that Netflix’s catalog primarily targets adult and young-adult audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help Netflix refine its content strategy:

*   The dominance of TV-MA and TV-14 content aligns with the global streaming trend where mature, original, and dramatic content drives high engagement.
*   This focus supports positive growth by strengthening Netflix’s position among adult and teen viewers.

However, a negative growth insight is the limited offering of family and children’s content, which may reduce Netflix’s appeal to households with younger viewers. Expanding kid-safe and educational content could help Netflix reach that audience segment and reduce churn among family subscribers.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# 7. Filter only movies
movies = df[df['type'].str.strip().str.lower() == 'movie'].copy()
movies['duration_num'] = movies['duration'].str.extract(r'(\d+)').astype(float)
movies = movies.dropna(subset=['duration_num'])

plt.figure(figsize=(10,5))
sns.histplot(data=movies, x='duration_num', bins=20, kde=True, color=sns.color_palette("Set2")[0])
plt.title("Distribution of Movie Durations")
plt.xlabel("Duration (minutes)")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it effectively shows how movie durations are distributed on Netflix. It helps identify the most common movie lengths and whether Netflix focuses more on short, standard-length, or long movies.

##### 2. What is/are the insight(s) found from the chart?

*   Most Netflix movies fall within the 80 to 120 minutes range, representing standard feature-length films.
*   Very few movies are shorter than 60 minutes or longer than 150 minutes, indicating that Netflix prioritizes moderate-length films suitable for casual viewing.
*   The distribution is slightly right-skewed, showing that while most movies are around 90–100 minutes, a small number extend beyond 2 hours.
*   This pattern suggests that Netflix optimizes its movie catalog for viewing convenience and viewer attention spans.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight supports Netflix’s positive business strategy. By focusing on medium-duration movies, Netflix can expect higher viewer completion rates and repeat engagement.

However, a potential negative growth insight is the lack of variety in movie lengths. There are fewer short films or long cinematic experiences, which could limit Netflix’s appeal to niche viewers (e.g., indie or documentary enthusiasts). To maintain steady growth, Netflix could diversify by adding short-form movies, documentaries, or extended-length films to attract a broader audience.
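
The duration histogram drawn in the Chart 7 cell can be complemented with a single number: the share of movies inside a chosen length band. The sketch below uses the 80 to 120 minute band mentioned in this notebook. The sample durations are made up, so the printed share is illustrative only.

```python
import pandas as pd

# Made-up movie durations, including an "Unknown" placeholder.
movies = pd.DataFrame({
    "duration": ["90 min", "45 min", "118 min", "160 min", "Unknown"],
})
# Same extraction as the chart code: keep the leading integer, drop rows without one.
movies["duration_num"] = movies["duration"].str.extract(r"(\d+)")[0].astype(float)
movies = movies.dropna(subset=["duration_num"])

# between() is inclusive on both ends by default.
in_band = movies["duration_num"].between(80, 120)
print(f"{in_band.mean():.0%} of sample movies run 80-120 minutes")
```

Reporting the share next to the histogram makes a statement such as "most movies are 80 to 120 minutes long" directly checkable.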

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# 8. Movies vs TV Shows added over time
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='year_added', hue='type')
plt.title("Movies vs TV Shows Added Over Time")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a grouped count plot with hue="type" because it clearly shows the yearly trend of Movies and TV Shows being added to Netflix. The grouped bars make it easy to compare the growth of both categories side by side across years.

##### 2. What is/are the insight(s) found from the chart?

*   Netflix consistently added more Movies than TV Shows in every year.
*   After 2015, both categories saw a sharp increase, showing Netflix’s rapid content expansion.
*   TV Shows began to rise significantly after 2017, highlighting Netflix’s strategic push into serialized content to boost long-term viewer engagement.
*   The peak content addition years were 2018–2019, with Movies dominating, while TV Shows also saw steady growth.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight highlights Netflix’s evolving strategy: initially focusing on Movies, then increasing investment in TV Shows. Since TV Shows often encourage binge-watching and long-term subscriptions, while Movies attract broad audiences quickly, Netflix can use this balance to:

*   Strengthen subscriber retention (via TV Shows).
*   Attract new users (via Movies).

This informs content acquisition, production planning, and marketing campaigns.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# 9. Rating distribution for Movies vs TV Shows
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='rating', hue='type', order=df['rating'].value_counts().index)
plt.title("Rating Distribution by Content Type")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it effectively shows how movie durations are distributed on Netflix. It helps identify the most common movie lengths and whether Netflix focuses more on short, standard-length, or long movies.



##### 2. What is/are the insight(s) found from the chart?

Most Netflix movies fall within the 80 to 120 minutes range, representing standard feature-length films.
Very few movies are shorter than 60 minutes or longer than 150 minutes, indicating that Netflix prioritizes moderate-length films suitable for casual viewing.
The distribution is slightly right-skewed, showing that while most movies are around 90–100 minutes, a small number extend beyond 2 hours.
This pattern suggests that Netflix optimizes its movie catalog for binge-watch convenience and viewer attention spans.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight supports Netflix’s positive business strategy. By focusing on medium-duration movies, Netflix ensures higher viewer completion rates and repeat engagement.
However, the potential negative growth insight is the lack of variety in movie lengths.

There are fewer short films or long cinematic experiences, which could limit Netflix’s appeal to niche viewers (e.g., indie or documentary enthusiasts).
To maintain steady growth, Netflix could diversify by adding short-form movies, documentaries, or extended-length films to attract a broader audience.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
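# 10. Top 10 directors with most titles
# Missing directors were filled with "Unknown" during cleaning, so exclude that placeholder
top_directors = df.loc[df['director'] != "Unknown", 'director'].value_counts().head(10)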

plt.figure(figsize=(10,6))
sns.barplot(
    x=top_directors.values,
    y=top_directors.index,
    hue=top_directors.index,
    palette="viridis",
    legend=False
)
plt.title("Top 10 Directors with Most Titles")
plt.xlabel("Number of Titles")
plt.ylabel("Director")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was chosen because it clearly displays how many TV shows have a certain number of seasons. This helps visualize whether Netflix primarily focuses on short series, mini-series, or long-running shows.



##### 2. What is/are the insight(s) found from the chart?

A large majority of TV shows (over 1,600 titles) have only 1 season, followed by a steep drop for multi-season series.
Very few shows go beyond 3–4 seasons, and long-running shows (with 10+ seasons) are extremely rare.
This pattern suggests that Netflix emphasizes short-format or limited-series content, likely to maintain viewer interest and reduce production costs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights positively highlight Netflix’s content strategy.

By focusing on 1–2 season series, Netflix can offer diverse content quickly, appealing to binge-watchers who prefer shorter commitments.
This approach keeps the catalog dynamic and encourages viewers to explore new titles frequently.
However, a potential negative growth insight is the lack of long-running shows, which can limit viewer loyalty and attachment to a series.
Longer series tend to build stronger fan bases and sustained engagement.
To balance growth, Netflix could invest in multi-season flagship series that maintain audience retention while continuing to produce short, high-turnover shows.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# 11. Duration distribution (Movies only)
movies = df[df['type'] == "Movie"].copy()
movies['duration_minutes'] = movies['duration'].str.replace(" min","").astype(float)
plt.figure(figsize=(10,6))
sns.histplot(movies['duration_minutes'], bins=30, kde=True, color="red")
plt.title("Distribution of Movie Durations")
plt.xlabel("Duration (minutes)")
plt.show()

##### 1. Why did you pick the specific chart?

I used a histogram with KDE (Kernel Density Estimate) because it is ideal for visualizing the distribution of a continuous numeric variable (in this case, movie durations).

The histogram allows us to see how frequently different duration ranges occur, while the KDE smooths the distribution to highlight trends and peaks.

This chart helps identify patterns or anomalies in movie durations.

##### 2. What is/are the insight(s) found from the chart?

Most movies on Netflix have a duration between 80 and 120 minutes, indicating a standard feature-length movie trend.

There are fewer movies under 60 minutes or over 180 minutes, showing that very short or very long movies are less common.

The distribution is slightly right-skewed, meaning there are some long movies (e.g., epics) but they are rare.
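
The "80 to 120 minutes" and "right-skewed" statements can be checked numerically rather than read off the histogram. The sketch below uses a handful of hypothetical duration strings in the dataset's `"<n> min"` format. The values are invented, and in the notebook the same steps would run on `movies['duration']`.

```python
import pandas as pd

# Hypothetical movie durations (not real data), in the "<n> min" format.
durations = pd.Series([
    "90 min", "95 min", "100 min", "110 min",
    "85 min", "45 min", "150 min", "200 min",
])

# Pull out the leading number and convert it to a numeric type.
minutes = durations.str.extract(r"(\d+)", expand=False).astype(float)

# Fraction of movies in the 80-120 minute band (inclusive on both ends).
share_standard = minutes.between(80, 120).mean()

# Sample skewness: a positive value indicates a longer right tail.
skewness = minutes.skew()

print(f"Share within 80-120 min: {share_standard:.1%}")
print(f"Skewness: {skewness:.2f}")
```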

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Helps Netflix plan content acquisition by focusing on popular duration ranges preferred by viewers (80–120 mins).

Aids in recommendation systems by suggesting movies in the viewer’s preferred duration range.

Supports production and marketing strategies, e.g., shorter movies for casual viewers or longer movies for niche audiences.

Potential Negative Growth:

Very long movies (>180 minutes) are rare, which may limit content diversity for audiences who prefer epic or in-depth films.

Over-focusing on the most common duration (80–120 mins) could neglect niche audience preferences, slightly impacting engagement in specialized segments.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# 12. Number of TV Show seasons distribution
tv = df[df['type'] == "TV Show"].copy()
# Extract the leading number from strings such as "1 Season" or "3 Seasons"
tv['seasons'] = tv['duration'].str.extract(r"(\d+)", expand=False).astype(float)
plt.figure(figsize=(10,6))
sns.histplot(tv['seasons'], bins=15, kde=False, color="purple")
plt.title("Distribution of TV Show Seasons")
plt.xlabel("Seasons")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram because it effectively shows the distribution of a numerical variable—in this case, the number of seasons per TV show. Histograms make it easy to see how TV shows are spread across different season counts and identify patterns such as most common season lengths. The bins allow grouping similar season counts to spot clusters or outliers.

##### 2. What is/are the insight(s) found from the chart?

Most TV shows on the platform have 1 or 2 seasons, indicating a prevalence of shorter series.

There are fewer shows with more than 5 seasons, showing that long-running series are less common.

A few outliers exist with 10+ seasons, likely popular or classic shows that are still active.
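
An exact count per season number complements the histogram and makes the "1 or 2 seasons" and "10+ seasons" statements verifiable. This is a minimal sketch on hypothetical values: the strings in `seasons_raw` are invented, and in the notebook the input would be `tv['duration']`.

```python
import pandas as pd

# Hypothetical TV show durations (not real data).
seasons_raw = pd.Series([
    "1 Season", "1 Season", "2 Seasons",
    "1 Season", "3 Seasons", "11 Seasons",
])

# Parse the season count as an integer.
seasons = seasons_raw.str.extract(r"(\d+)", expand=False).astype(int)

# Number of shows for each season count, ordered by season count.
season_counts = seasons.value_counts().sort_index()

# Fraction of shows with at most two seasons.
share_short = (seasons <= 2).mean()

# Shows that qualify as long-running under a 10-season threshold.
long_running = seasons[seasons >= 10]

print(season_counts)
print(f"Share with 1-2 seasons: {share_short:.1%}")
```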

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact: Knowing that most shows have fewer seasons can guide content acquisition and production strategy. Netflix could focus on short, binge-able series that attract new users or maintain engagement without requiring long-term commitment.

Potential negative growth insight: The low number of long-running series might mean a lack of franchise continuity, which could reduce retention for viewers who enjoy multi-season storylines. To counter this, investing in high-quality long-running shows could help retain subscribers over a longer period.

#### Chart - 13

In [None]:
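# Chart - 13 visualization code
# 13. Top 10 actors by number of titles
# Missing cast values were filled with "Unknown" during cleaning, so exclude that placeholder
actor_split = df['cast'].dropna().str.split(", ")
actors = pd.Series([a for sublist in actor_split for a in sublist])
top_actors = actors[actors != "Unknown"].value_counts().head(10)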

plt.figure(figsize=(10,6))
sns.barplot(
    x=top_actors.values,
    y=top_actors.index,
    hue=top_actors.index,
    palette="magma",
    legend=False
)
plt.title("Top 10 Actors on Netflix")
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly displays categorical data ranked by frequency. In this case, it shows which actors appear most frequently on Netflix. Horizontal bars make actor names easy to read, especially when the names are long, and the length of the bar conveys their relative prominence on the platform.

##### 2. What is/are the insight(s) found from the chart?

A small group of actors appears more frequently across Netflix content.

These actors could be popular or versatile, appearing in multiple movies or TV shows.

Some actors dominate certain genres, which may influence viewer preferences and subscriptions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact: Netflix can leverage this data to promote shows starring popular actors, increasing engagement and attracting new subscribers. They can also strategically cast these actors in upcoming projects to boost viewership.

Potential negative growth insight: Over-reliance on the same actors may lead to content monotony, which could reduce appeal for users seeking fresh faces or variety. Netflix may need to balance star power with diverse casting to sustain long-term growth.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# 14. Correlation heatmap (Numerical features only)
df_corr = movies[['duration_minutes']].dropna()

# If only one column, create a dummy correlation matrix to avoid NaN warning
if df_corr.shape[1] == 1:
    corr_matrix = pd.DataFrame([[1.0]],
                               index=df_corr.columns,
                               columns=df_corr.columns)
else:
    corr_matrix = df_corr.corr()

plt.figure(figsize=(6,4))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap (Numerical Features)")
plt.show()


##### 1. Why did you pick the specific chart?

I chose a heatmap because it visually represents the correlation between numerical variables. Correlation helps identify relationships (positive, negative, or none) between features, which is important for understanding patterns, dependencies, or potential multicollinearity in data. The color intensity quickly conveys the strength and direction of correlations.

##### 2. What is/are the insight(s) found from the chart?

Because the heatmap is computed on a single numerical column (duration_minutes), it shows one cell with a value of 1.0, which is simply the feature's correlation with itself. It therefore carries no information about relationships between features.

If more numerical features were included (for example release year or the year a title was added), the heatmap would reveal how movie duration relates to them.
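
A correlation matrix becomes informative once it covers several numerical columns. The sketch below assumes column names `release_year`, `year_added`, and `duration_minutes`. The first is assumed to match the dataset's release year field, the other two are the derived columns described earlier, and the values themselves are hypothetical.

```python
import pandas as pd

# Hypothetical numeric features for a few movies (not real data).
sample = pd.DataFrame({
    "release_year":     [2010, 2015, 2018, 2019, 2020],
    "year_added":       [2016, 2017, 2018, 2019, 2020],
    "duration_minutes": [120.0, 95.0, 100.0, 88.0, 110.0],
})

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = sample.corr()

print(corr_matrix.round(2))
```

The resulting `corr_matrix` could be passed to `sns.heatmap` exactly as in the cell above, which would remove the need for the single-column special case.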

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select only numerical columns for pairplot
numeric_df = df.select_dtypes(include=['number']).dropna()

# Check if at least 2 numeric columns exist
if numeric_df.shape[1] >= 2:
    sns.set(style="ticks")

    # pairplot creates its own figure, so a separate plt.figure() call
    # would only leave an empty extra figure behind
    pairplot = sns.pairplot(numeric_df, diag_kind="kde", corner=True)
    pairplot.figure.suptitle("Pair Plot of Numerical Features", y=1.02)
    plt.show()
else:
    print("Not enough numerical columns for pair plot.")


##### 1. Why did you pick the specific chart?

I chose a pair plot because it allows visualizing pairwise relationships between multiple numerical features simultaneously. Scatter plots show correlations between two variables, while the diagonal KDE plots display the distribution of each feature. It’s an excellent way to detect trends, clusters, outliers, or potential correlations in one comprehensive view.

##### 2. What is/are the insight(s) found from the chart?

Patterns or clusters between numerical features become visible (e.g., release year vs. year added).

Outliers can be easily spotted in scatter plots.

The KDE plots on the diagonal show the distribution shape (skewed, normal, or uniform) of each numerical variable.

If features are correlated, their scatter plots will show linear or curved relationships.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the detailed Exploratory Data Analysis of the Netflix dataset, the following suggestions can help the client (Netflix) achieve its business objectives of sustained growth, audience retention, and global expansion:

1. Expand Regional & Localized Content

The data shows strong growth in Indian and international content alongside US productions. Netflix should continue investing in regional originals (e.g., Hindi, Korean, Spanish) to strengthen its presence in emerging markets. This will help capture new subscribers and increase engagement in non-English-speaking regions.

2. Diversify Content Duration and Format

Most movies are between 80–120 minutes, and most TV Shows have 1–2 seasons. Netflix can experiment with short-form content (mini-movies, webisodes) and long-running series to attract diverse audiences. Longer shows build emotional connection and retention, while shorter ones appeal to casual viewers.

3. Balance Audience Ratings

The dominance of TV-MA and TV-14 ratings indicates a heavy focus on adult audiences. Introducing more PG, TV-Y, and family-friendly content could expand Netflix’s market to children and family users, creating more household-level subscriptions.

4. Sustain Consistent Monthly Additions

Month-wise analysis shows even content additions throughout the year, which keeps user engagement stable. Netflix should maintain this steady release strategy, ensuring subscribers always have new content to watch.

5. Leverage Data-Driven Content Strategy

The correlation analysis in this notebook covered only one numerical feature (duration_minutes), so it cannot yet confirm how features relate to one another. Netflix should continue using viewer analytics (watch time, completion rate, region preference) to optimize recommendations and production investment.

6. Encourage New Creative Talent

The “Top Directors” analysis revealed recurring collaborations with a few creators. Netflix can encourage new directors and diverse storytellers to bring fresh ideas and increase creative variety.
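
The month-wise claim in the suggestions above can be checked by counting titles per calendar month. The following is a minimal sketch that assumes `date_added` strings look like `"January 1, 2019"`. The sample values are hypothetical, and the real column may need additional cleaning.

```python
import pandas as pd

# Hypothetical date_added values (not real data); some have stray whitespace.
date_added = pd.Series([
    "January 1, 2019", " January 15, 2019",
    "March 3, 2019", "December 25, 2018",
])

# Strip whitespace, then parse with an explicit format.
parsed = pd.to_datetime(date_added.str.strip(), format="%B %d, %Y")

# Titles added per calendar month; months with no additions appear as 0.
monthly = parsed.dt.month.value_counts().reindex(range(1, 13), fill_value=0)

# Gap between the busiest and quietest month as a rough evenness check.
spread = monthly.max() - monthly.min()

print(monthly)
print(f"Spread between busiest and quietest month: {spread}")
```

A small spread relative to the monthly average would support the "even additions" reading, while a large one would suggest seasonal peaks.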

# **Conclusion**

Through this Exploratory Data Analysis (EDA) on Netflix’s dataset, several valuable insights have been discovered that reveal the platform’s evolving content strategy and global market direction.

Content Composition: Netflix’s library is dominated by Movies, but TV Shows have grown rapidly in recent years, showing a strategic balance between short-form and long-form storytelling.

Release and Addition Trends: Most titles were released and added to Netflix after 2015, indicating the platform’s focus on fresh, modern, and trending content.

Regional Strategy: The United States remains Netflix’s largest content contributor, while India and other international regions show strong growth, reflecting Netflix’s efforts toward global localization.

Audience Ratings: The majority of titles fall under TV-MA and TV-14, emphasizing mature and teen audiences. However, there is an opportunity to expand family and children’s content for household engagement.

Duration and Seasons: Most movies are between 80–120 minutes, and most TV Shows have 1–2 seasons, aligning with the binge-watch culture but suggesting room for diversification.

Genres: Popular genres include International Movies, Dramas, Comedies, and Documentaries, highlighting Netflix’s diverse yet globally appealing catalog.

Correlation Analysis: The heatmap in this notebook covered only movie duration, so correlations among numerical features such as release year, duration, and seasons still need to be computed before drawing conclusions about the content mix.

In summary, Netflix has built a diverse and globally appealing content library, continuously adapting to audience preferences and market demands.

To further strengthen its position, Netflix should:

1. Continue expanding regional productions (especially in India and other emerging markets).
2. Introduce more family-oriented content to reach all age groups.
3. Balance shorter and longer content formats to maintain engagement and loyalty.
4. Leverage data analytics to drive smarter recommendations and production decisions.

Netflix’s data-driven and globally inclusive approach positions it well for sustained growth, market expansion, and long-term subscriber retention.
