
# 🎬 Netflix Content Analysis: Step-by-Step Notebook

This notebook walks you through a complete analysis of **netflix_titles.csv** (from Kaggle).  
It covers loading the data, cleaning, feature engineering, exploratory analysis, and saving outputs.

> **How to use**  
> 1. Put `netflix_titles.csv` in the same folder as this notebook.  
> 2. Run each cell from top to bottom.  
> 3. Feel free to tweak charts or add your own questions at the end.


## 1) Setup & Imports

In [None]:

# If needed, uncomment to install packages in your environment:
# !pip install pandas matplotlib

import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
print("Libraries imported.")


## 2) Load Dataset

In [None]:

# Update the path if your CSV is elsewhere
CSV_PATH = "netflix_titles.csv"

df = pd.read_csv(CSV_PATH)
print(df.shape)
df.head()


## 3) Quick Scan: Structure & Missing Values

In [None]:

# df.info() prints its report directly and returns None, so it needs no display() wrapper
df.info()
print("\nMissing values per column:")
print(df.isna().sum())
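
Raw counts are hard to compare across columns, so a percentage view can help. The following is a minimal sketch on a made-up four-row frame (the `toy` frame and its values are illustrative, not taken from the dataset); in the notebook the same expression would be applied to `df`.

```python
import pandas as pd

# Illustrative rows only; in the notebook, use df instead of toy
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "Jane Doe", None, "John Roe"],
})

# isna() yields booleans, so the column mean is the fraction of missing values
missing_pct = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.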


## 4) Basic Cleaning

In [None]:

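# Strip whitespace from text columns. Using .str.strip() directly leaves missing
# values as NaN, so the fillna/dropna steps below can still find them.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.strip()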

# Parse dates
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# Fill simple nulls for text columns that are useful for grouping
for col in ['director', 'cast', 'country', 'rating']:
    if col in df.columns:
        df[col] = df[col].fillna("Unknown")

# Drop rows with completely missing titles or types (rare but safe)
df = df.dropna(subset=['title', 'type'])

# Preview after cleaning
print(df.shape)
df.head()
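
The order of the cleaning steps matters: missing values have to stay missing until `fillna` runs. Here is a small sketch on made-up values (the `directors` series is hypothetical) showing that stripping through the `.str` accessor keeps them intact.

```python
import pandas as pd

# Illustrative values only, not rows from the dataset
directors = pd.Series([" Jane Doe ", None, "John Roe  "])

# .str.strip() trims the text and leaves the missing entry missing,
# so fillna() can still find and replace it afterwards
cleaned = directors.str.strip().fillna("Unknown")
print(cleaned.tolist())
```

Casting a column to `str` before this point is worth avoiding, since on many pandas versions it turns missing values into the literal text `nan`, which `fillna` and `dropna` then ignore.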


## 5) Feature Engineering

In [None]:

# Release year already exists. We can also extract the year added to Netflix.
df['year_added'] = df['date_added'].dt.year

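# Normalize duration to a numeric column:
# For Movies: "90 min" -> 90; For TV Shows: "2 Seasons" -> 2
# Note: the units differ (minutes vs. seasons), so compare values within one type only.
def parse_duration(val):
    """Return the leading integer of a duration string, or None if it cannot be parsed."""
    if pd.isna(val):
        return None
    try:
        return int(str(val).split()[0])
    except (ValueError, IndexError):
        # Non-numeric first token or an empty string
        return None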

df['duration_value'] = df['duration'].apply(parse_duration)

# Also create a genre (listed_in) exploded view for later aggregations
genres_exploded = df.assign(genre=df['listed_in'].str.split(',')).explode('genre')
# .str.strip() keeps any missing genre as NaN instead of turning it into the text "nan"
genres_exploded['genre'] = genres_exploded['genre'].str.strip()

print("Engineered columns added: year_added, duration_value, and exploded genre table.")
genres_exploded[['title','type','genre']].head()
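
As an alternative to a row-wise `apply`, the same number can be pulled out with vectorized string methods. The sketch below runs on made-up values and assumes the `"<number> <unit>"` shape described above. Note that it takes the first run of digits anywhere in the string, which is slightly looser than splitting on whitespace. The `duration_unit` column is an illustrative addition, not part of the notebook.

```python
import pandas as pd

# Illustrative values only; in the notebook this would be df['duration']
duration = pd.Series(["90 min", "2 Seasons", None, "1 Season"])

# Pull out the first run of digits, then convert; missing or unparseable
# entries become NaN rather than raising
duration_value = pd.to_numeric(
    duration.str.extract(r"(\d+)", expand=False), errors="coerce"
)

# Keep the unit alongside the number so minutes and seasons are not mixed up
duration_unit = duration.str.extract(r"\d+\s*(\w+)", expand=False).str.lower()

print(pd.DataFrame({"value": duration_value, "unit": duration_unit}))
```

Keeping the unit next to the value makes it easier to filter to one kind of duration before averaging or plotting.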



## 6) High-Level Questions to Answer
1. **What is the split between Movies and TV Shows?**  
2. **How has content production changed over time (by `release_year`)?**  
3. **Which countries contribute the most content?**  
4. **What are the top genres?**  
5. **What are the most common ratings?**  
6. **Who are the top directors by title count?**


### 6.1 Movies vs TV Shows

In [None]:

counts = df['type'].value_counts().sort_index()
print(counts)

plt.figure()
counts.plot(kind='bar')
plt.title("Movies vs TV Shows")
plt.xlabel("Type")
plt.ylabel("Count")
plt.tight_layout()
plt.show()


### 6.2 Titles by Release Year

In [None]:

by_year = df.groupby('release_year')['show_id'].count().sort_index()
print(by_year.tail(10))

plt.figure()
by_year.plot(kind='line', marker='o')
plt.title("Titles by Release Year")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.tight_layout()
plt.show()


### 6.3 Top Countries

In [None]:

# Country field can contain multiple entries; split and explode like genres
countries_exploded = df.assign(country_split=df['country'].str.split(',')).explode('country_split')
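countries_exploded['country_split'] = countries_exploded['country_split'].str.strip()
# Drop empty entries left behind by stray leading or trailing commas
countries_exploded = countries_exploded[countries_exploded['country_split'] != '']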
top_countries = countries_exploded['country_split'].value_counts().head(15)
print(top_countries)

plt.figure()
top_countries.sort_values(ascending=True).plot(kind='barh')
plt.title("Top 15 Countries by Number of Titles")
plt.xlabel("Number of Titles")
plt.ylabel("Country")
plt.tight_layout()
plt.show()
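
To see what the split-and-explode pattern does to a multi-valued column, here is a minimal sketch on a three-row toy frame (the rows are invented for illustration). It also shows how a value that happens to start with a comma produces an empty entry that is worth filtering out.

```python
import pandas as pd

# Illustrative rows only, not taken from the dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", ", France", "India"],
})

# One list per row, then one row per list element
exploded = toy.assign(country_split=toy["country"].str.split(",")).explode("country_split")
exploded["country_split"] = exploded["country_split"].str.strip()

# The leading comma in ", France" yields an empty string; drop it
exploded = exploded[exploded["country_split"] != ""]

country_counts = exploded["country_split"].value_counts()
print(country_counts)
```

Because a title is repeated once per country, counts from the exploded table add up to more than the number of titles.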


### 6.4 Top Genres

In [None]:

top_genres = genres_exploded['genre'].value_counts().head(15)
print(top_genres)

plt.figure()
top_genres.sort_values(ascending=True).plot(kind='barh')
plt.title("Top 15 Genres on Netflix")
plt.xlabel("Number of Titles")
plt.ylabel("Genre")
plt.tight_layout()
plt.show()


### 6.5 Ratings Distribution

In [None]:

ratings = df['rating'].value_counts()
print(ratings)

plt.figure()
ratings.sort_values(ascending=True).plot(kind='barh')
plt.title("Ratings Distribution")
plt.xlabel("Number of Titles")
plt.ylabel("Rating")
plt.tight_layout()
plt.show()


### 6.6 Top Directors

In [None]:

# Some rows have 'Unknown' or multiple directors separated by commas
directors_exploded = df.assign(dir_split=df['director'].str.split(',')).explode('dir_split')
directors_exploded['dir_split'] = directors_exploded['dir_split'].astype(str).str.strip()
top_directors = directors_exploded[directors_exploded['dir_split'].str.lower() != 'unknown']['dir_split'].value_counts().head(15)
print(top_directors)

plt.figure()
top_directors.sort_values(ascending=True).plot(kind='barh')
plt.title("Top 15 Directors by Title Count")
plt.xlabel("Number of Titles")
plt.ylabel("Director")
plt.tight_layout()
plt.show()


## 7) Auto-Summary Helpers (Optional)

In [None]:

summary = {}
summary['total_titles'] = len(df)
summary['movie_count'] = int((df['type'] == 'Movie').sum())
summary['tv_count'] = int((df['type'] == 'TV Show').sum())
summary['earliest_release_year'] = int(df['release_year'].min())
summary['latest_release_year'] = int(df['release_year'].max())
summary['most_common_rating'] = df['rating'].mode().iloc[0] if not df['rating'].mode().empty else None
summary['top_genre'] = genres_exploded['genre'].value_counts().idxmax() if not genres_exploded.empty else None
summary['top_country'] = countries_exploded['country_split'].value_counts().idxmax() if not countries_exploded.empty else None

summary


## 8) Save Cleaned Data & Artifacts

In [None]:

# Save cleaned main table
df.to_csv("netflix_clean.csv", index=False)

# Save exploded helper tables (optional)
genres_exploded.to_csv("netflix_genres_exploded.csv", index=False)
countries_exploded.to_csv("netflix_countries_exploded.csv", index=False)

print("Saved: netflix_clean.csv, netflix_genres_exploded.csv, netflix_countries_exploded.csv")


## 9) Your Custom Questions (Add Cells Below)


- Which **genres** grew the most after 2015?  
- Are **TV Shows** typically released by fewer countries than **Movies**?  
- Does **duration_value** differ by **rating**?  
- For each country, what's the **Movie vs TV ratio**?
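
As a starting point for the last question, one possible approach is a cross-tabulation of country against type. The sketch below uses made-up rows standing in for `countries_exploded`. The `movie_share` column is an illustrative choice that reports a share rather than a ratio, which avoids dividing by zero for countries with no TV Shows.

```python
import pandas as pd

# Illustrative rows only; in the notebook, use countries_exploded instead of toy
toy = pd.DataFrame({
    "country_split": ["India", "India", "India", "Japan", "Japan", "France"],
    "type": ["Movie", "Movie", "TV Show", "TV Show", "TV Show", "Movie"],
})

# Rows are countries, columns are types, cells are title counts
mix = pd.crosstab(toy["country_split"], toy["type"])

# Share of a country's titles that are movies (0.0 to 1.0)
mix["movie_share"] = mix["Movie"] / (mix["Movie"] + mix["TV Show"])

print(mix.sort_values("movie_share", ascending=False))
```

On the full dataset it may be worth restricting the table to countries with a minimum number of titles, since shares based on one or two titles are not very informative.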
