<a href="https://colab.research.google.com/github/deshm084/netflix-eda-portfolio/blob/main/Netflix_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Netflix Content Strategy Analysis (EDA)
**Author:** Sanskruti Deshmukh

**Objective:** Analyze the Netflix titles dataset to understand content trends over the last decade, specifically the ratio of Movies to TV Shows and the rise of specific genres.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # Interactive charts

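# Set the aesthetic style for matplotlib/seaborn plots
# (Plotly figures are styled separately and are not affected by these calls)
sns.set_style("whitegrid")
plt.style.use("dark_background")  # Applied last, so the dark theme takes precedence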

First, we load the data and check for missing values to ensure data integrity.

In [None]:
# Load the data (make sure the filename matches what you uploaded)
df = pd.read_csv('netflix_titles.csv')

# Look at the first few rows
print(df.head())

# Count missing values per column (this informs any later cleaning)
print(df.isnull().sum())

  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  September 24, 2021          2021  TV-MA  2 Seasons   
2  September 24, 2021        
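
Raw null counts are easier to judge as a share of all rows. The sketch below shows one way to do that; the `toy` frame and its values are invented for illustration and are not taken from `netflix_titles.csv`.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", None, "Japan", "Brazil"],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same expression applied to `df` would rank columns such as `director` and `cast` by how incomplete they are.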

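Next, we compare the share of Movies and TV Shows and track how much content was added each year. For genres, the 'listed_in' column contains multiple genres per row, so we 'explode' the data further below to count each genre accurately.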

In [None]:
# Count titles by type to compare Movies vs. TV Shows
type_counts = df['type'].value_counts()

# Create a pie chart
fig = px.pie(values=type_counts, names=type_counts.index, title='Netflix Content Distribution: Movies vs TV Shows')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
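
The pie chart shows the split visually; the same ratio can also be read off as numbers. A minimal sketch, assuming a made-up `toy_types` series rather than the real `type` column:

```python
import pandas as pd

# Illustrative values only; the real notebook would use df['type'].
toy_types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of raw counts.
share_pct = (toy_types.value_counts(normalize=True) * 100).round(1)
print(share_pct)
```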

In [None]:
# 1. Convert the 'date_added' column to datetime objects
# We use .str.strip() to remove accidental spaces before/after the date
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')

# 2. Extract the Year from the date
df['year_added'] = df['date_added'].dt.year

# 3. Group data by Year and Type (Movie vs TV Show)
growth_df = df.groupby(['year_added', 'type']).size().reset_index(name='count')

# 4. Keep 2010 onward (optional cleanup; rows with an unparseable date are already dropped by groupby)
growth_df = growth_df[growth_df['year_added'] >= 2010]

# 5. Plot the Growth
fig = px.line(growth_df,
              x='year_added',
              y='count',
              color='type',
              title='Netflix Content Growth: Content Added Per Year',
              markers=True)
fig.show()
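
Because `errors='coerce'` silently turns unparseable dates into `NaT`, it can be worth checking how many rows are lost at that step. The sketch below uses an invented `raw` series; passing an explicit `format` is an assumption made here for predictability, whereas the cell above lets pandas infer the format.

```python
import pandas as pd

# Made-up date strings, including stray spaces, a bad value, and a missing one.
raw = pd.Series([" September 25, 2021", "August 4, 2017 ", "not a date", None])

# Strip whitespace, then parse; anything that does not match becomes NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

n_unparsed = int(parsed.isna().sum())
years = parsed.dt.year.dropna().astype(int).tolist()
print(f"{n_unparsed} of {len(raw)} values could not be parsed; years found: {years}")
```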

In [None]:
# 1. Create a copy of the dataframe to avoid messing up the original
genres_df = df.copy()

# 2. Split the 'listed_in' column by the comma
# If a row has "Action, Comedy", it becomes a list ["Action", "Comedy"]
genres_df['listed_in'] = genres_df['listed_in'].str.split(', ')

# 3. "Explode" the list
# This creates a new row for EVERY genre in the list.
# The dataframe gets longer, but now each row is a single genre.
genres_df = genres_df.explode('listed_in')

# 4. Count the top 20 genres
top_genres = genres_df['listed_in'].value_counts().head(20).reset_index()
top_genres.columns = ['Genre', 'Count']

# 5. Plot the Bar Chart
fig = px.bar(top_genres,
             x='Count',
             y='Genre',
             orientation='h',
             title='Top 20 Genres on Netflix',
             color='Count',
             color_continuous_scale='Bluered_r') # Just for style

# Invert y-axis so the top genre is at the top
fig.update_layout(yaxis=dict(autorange="reversed"))
fig.show()
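
The split-and-explode step above is easiest to see on a tiny example. This is a sketch on an invented two-row `toy` frame, not the real dataset:

```python
import pandas as pd

# Two made-up titles; the first one carries two genres.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "listed_in": ["Dramas, International Movies", "Dramas"],
})

# Split each string into a list, then give every list element its own row.
exploded = toy.assign(listed_in=toy["listed_in"].str.split(", ")).explode("listed_in")
genre_counts = exploded["listed_in"].value_counts()

print(exploded)
print(genre_counts)
```

Title "A" appears twice after exploding, once per genre, which is why genre totals can exceed the number of titles.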

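Observation: Content added dips in 2020 and 2021, which may be related to global production halts during the pandemic. Note that the most recent rows shown above were added in September 2021, so 2021 is likely a partial year and its lower count should be read with caution. Additionally, 'International Movies' is a top genre, which is consistent with Netflix's global expansion strategy.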
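
One way to put a number on a year-over-year change is to pivot the per-year counts and apply `pct_change`. The sketch below uses invented counts in a `toy_growth` frame shaped like `growth_df`; the figures are not results from the dataset.

```python
import pandas as pd

# Made-up counts with the same columns as growth_df.
toy_growth = pd.DataFrame({
    "year_added": [2019, 2019, 2020, 2020],
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "count": [100, 50, 80, 60],
})

# One row per year, one column per content type.
wide = toy_growth.pivot(index="year_added", columns="type", values="count")

# Percentage change relative to the previous year (first year is NaN).
yoy_pct = (wide.pct_change() * 100).round(1)
print(yoy_pct)
```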