# Day 1: Netflix Movies & TV Shows - Content Strategy Analysis

## Overview
Comprehensive exploratory data analysis of Netflix's content library to uncover unique insights about content strategy, naming patterns, geographic diversity, and creative partnerships.

## Dataset
- **Source**: Netflix Titles Dataset
- **Content**: Movies and TV Shows available on Netflix
- **Features**: show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, description

## Import Required Libraries

In [None]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## Load Dataset

In [None]:
# Load the dataset
netflix = pd.read_csv('../data/netflix_titles.csv')

netflix.head()

In [None]:
netflix.info()

The dataset claims to be cleaned, but it still leaves a lot to be desired.
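
To make that complaint concrete, here is a minimal sketch that ranks columns by how much data they are missing. The `sample` frame and the `missing_report` helper are hypothetical stand-ins, not part of the notebook; on the real data the call would be `missing_report(netflix)`.

```python
import pandas as pd

# Toy stand-in for the Netflix table; the values are made up for illustration.
sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', None, None, 'Y'],
    'country': ['US', 'IN', None, 'UK'],
})

def missing_report(frame):
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return report[report['missing'] > 0].sort_values('missing', ascending=False)

report = missing_report(sample)
print(report)
```

Reporting percentages next to raw counts makes it easier to decide which columns can be filled with a placeholder and which need more care.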

## Data Cleaning

In [None]:
print("Original Dataset Info:")
print(f"Shape: {netflix.shape}")
print(f"Missing values:\n{netflix.isnull().sum()}\n")

# Create a copy for cleaning
df = netflix.copy()

# 1. Handle missing values in directors
df['director'] = df['director'].fillna('Unknown')

# 2. Handle missing values in cast
df['cast'] = df['cast'].fillna('Unknown')

# 3. Handle missing values in countries
df['country'] = df['country'].fillna('Unknown')

# 4. Handle missing values in rating
df['rating'] = df['rating'].fillna('Not Rated')

# 5. Handle missing values in duration
df['duration'] = df['duration'].fillna('Unknown')

# 6. Convert date_added to datetime; missing or unparseable values become NaT.
#    Strip whitespace first so padded date strings are not coerced to NaT.
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')

# 7. Extract the numeric part of duration (minutes for movies, seasons for TV shows)
def extract_duration(dur_str):
    if dur_str == 'Unknown':
        return None
    # The number always comes first, e.g. "90 min" or "2 Seasons"
    return int(dur_str.split()[0])

df['duration_value'] = df['duration'].apply(extract_duration)

# 8. Create duration unit column
def extract_duration_unit(dur_str, content_type):
    if dur_str == 'Unknown':
        return 'Unknown'
    if content_type == 'Movie':
        return 'min'
    else:
        return 'Season' if 'Season' in dur_str else 'min'

df['duration_unit'] = df.apply(lambda row: extract_duration_unit(row['duration'], row['type']), axis=1)

# 9. Extract year from date_added
df['year_added'] = df['date_added'].dt.year
df['year_added'] = df['year_added'].fillna(0).astype(int)

# 10. Extract month from date_added
df['month_added'] = df['date_added'].dt.month
df['month_added'] = df['month_added'].fillna(0).astype(int)

# 11. Standardize string columns - strip leading/trailing whitespace
string_cols = ['type', 'title', 'rating']
for col in string_cols:
    df[col] = df[col].str.strip()

# 12. Convert listed_in (genres) to list format for easier analysis
df['genres'] = df['listed_in'].str.split(', ')

# 13. Convert cast and directors to list format
df['cast_list'] = df['cast'].str.split(', ')
df['director_list'] = df['director'].str.split(', ')

# 14. Convert countries to list format
df['country_list'] = df['country'].str.split(', ')

# 15. Clean up description - remove extra whitespace
df['description'] = df['description'].str.strip()

# 16. Create age category from rating
rating_age_map = {
    'G': 'All Ages',
    'PG': 'Parental Guidance',
    'PG-13': 'Parental Guidance (13+)',
    'R': 'Restricted (17+)',
    'NC-17': 'Adults Only',
    'TV-Y': 'Young Children',
    'TV-Y7': 'Children',
    'TV-G': 'General Audience',
    'TV-PG': 'Parental Guidance',
    'TV-14': 'Teens (14+)',
    'TV-MA': 'Mature (17+)',
    'Not Rated': 'Not Rated',
    'Unknown': 'Unknown'
}
# Ratings that are not in the map fall back to 'Unknown' instead of NaN
df['age_category'] = df['rating'].map(rating_age_map).fillna('Unknown')

# 17. Data type optimization
df['release_year'] = df['release_year'].astype('int16')
df['type'] = df['type'].astype('category')
df['rating'] = df['rating'].astype('category')
df['age_category'] = df['age_category'].astype('category')

print("\nCleaned Dataset Info:")
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}\n")
print("Data Types:")
print(df.dtypes)

# Display sample of cleaned data
print("\nSample of Cleaned Data:")
df[['show_id', 'type', 'title', 'release_year', 'rating', 'duration', 'duration_value', 'duration_unit', 'genres', 'date_added', 'year_added']].head(10)
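
As an alternative to the row-wise `apply` calls in steps 7 and 8, the value and unit can be pulled out of `duration` in one vectorized pass. This is a sketch on a made-up `sample` frame, assuming every duration looks like `"<number> <unit>"`; unlike the notebook's version it leaves missing durations as missing rather than `'Unknown'`.

```python
import pandas as pd

# Toy rows that mimic the format of the duration column (illustrative only)
sample = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'TV Show', 'Movie'],
    'duration': ['90 min', '2 Seasons', '1 Season', None],
})

# Capture the leading number and the unit word with named groups
parts = sample['duration'].str.extract(r'^(?P<value>\d+)\s+(?P<unit>\w+)')

# Nullable integer dtype keeps whole numbers even when some rows are missing
sample['duration_value'] = pd.to_numeric(parts['value']).astype('Int64')

# Normalize "Seasons" to "Season"; "min" has no trailing "s" and is unchanged
sample['duration_unit'] = parts['unit'].str.rstrip('s')

print(sample)
```

The nullable `Int64` dtype avoids the float conversion that a plain integer column goes through as soon as one value is missing.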

## Data Quality Validation and Summary

In [None]:
print("DATA CLEANING SUMMARY")
print('-' * 50)

print("\n1. Content Type Distribution:")
print(df['type'].value_counts())

print("\n2. Rating Distribution:")
print(df['rating'].value_counts())

print("\n3. Release Year Range:")
print(f"Earliest: {df['release_year'].min()}")
print(f"Latest: {df['release_year'].max()}")

print("\n4. Date Added Range:")
print(f"Earliest added: {df['date_added'].min()}")
print(f"Latest added: {df['date_added'].max()}")

print("\n5. Duration Statistics:")
print(f"Movies - Min: {df[df['type'] == 'Movie']['duration_value'].min()} min, Max: {df[df['type'] == 'Movie']['duration_value'].max()} min")
print(f"TV Shows - Min: {df[df['type'] == 'TV Show']['duration_value'].min()} seasons, Max: {df[df['type'] == 'TV Show']['duration_value'].max()} seasons")

print("\n6. Missing Values After Cleaning:")
missing = df.isnull().sum()
if missing.sum() == 0:
    print("No missing values! Data is clean.")
else:
    print(missing[missing > 0])

## Export Cleaned Data

In [None]:
# Save cleaned dataset
df.to_csv('../data/cleaned/netflix_cleaned_TADS.csv', index=False)

# Note: list columns (genres, cast_list, ...) are written to the CSV as strings
print("\nNew columns created during cleaning:")
new_cols = ['duration_value', 'duration_unit', 'year_added', 'month_added', 'genres', 
            'cast_list', 'director_list', 'country_list', 'age_category']
for col in new_cols:
    print(f"  - {col}")
    
print("\nCleaned data saved to: ../data/cleaned/netflix_cleaned_TADS.csv")
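
One caveat with the export: CSV has no list type, so columns such as `genres` come back as plain strings when the file is reloaded. The sketch below shows the round trip on a made-up two-row frame and restores the lists with `ast.literal_eval`; the in-memory `buffer` stands in for the real file path.

```python
import ast
import io

import pandas as pd

# Toy frame with a list column, similar in shape to df['genres']
sample = pd.DataFrame({
    'title': ['A', 'B'],
    'genres': [['Dramas', 'Comedies'], ['Documentaries']],
})

# Write to an in-memory buffer instead of disk to keep the example self-contained
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# After the round trip the cell holds the text "['Dramas', 'Comedies']", not a list
was_string = isinstance(reloaded.loc[0, 'genres'], str)

# literal_eval parses Python literals only, so it is safer than eval here
reloaded['genres'] = reloaded['genres'].apply(ast.literal_eval)

print(was_string, reloaded.loc[0, 'genres'])
```

If the list columns matter downstream, a format that preserves nested types (for example Parquet or pickle) may be a better fit than CSV.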

---
## We could stop here, but let's dig out some unique insights
---

## INSIGHT 1: Title Length Analysis
### Does Netflix favor short or long titles?

In [None]:
df['title_length'] = df['title'].str.len()
df['title_word_count'] = df['title'].str.split().str.len()

fig = sp.make_subplots(
    rows=1, cols=2,
    subplot_titles=('Title Length Distribution', 'Title Word Count by Content Type')
)

fig.add_trace(
    go.Histogram(x=df['title_length'], nbinsx=50, name='Title Length', marker_color='#E50914'),
    row=1, col=1
)

for content_type in df['type'].unique():
    data = df[df['type'] == content_type]['title_word_count']
    fig.add_trace(
        go.Box(y=data, name=content_type, boxmean='sd'),
        row=1, col=2
    )

fig.update_xaxes(title_text='Character Count', row=1, col=1)
fig.update_yaxes(title_text='Frequency', row=1, col=1)
fig.update_xaxes(title_text='Content Type', row=1, col=2)
fig.update_yaxes(title_text='Word Count', row=1, col=2)

fig.update_layout(height=500, width=1200, title_text='Netflix Title Naming Strategy Analysis')
fig.write_html('../viz/title_strategy_analysis.html')
fig.show()

print("INSIGHT 1: Title Strategy")
print(f"Average title length: {df['title_length'].mean():.1f} characters")
print(f"Most common title length: {df['title_length'].mode()[0]} characters")
print(f"Average title words: {df['title_word_count'].mean():.1f} words")
print(f"Single-word titles: {(df['title_word_count'] == 1).sum()} ({(df['title_word_count'] == 1).sum() / len(df) * 100:.1f}%)")

## INSIGHT 2: Description Marketing Strategy
### Does Netflix write longer descriptions for certain content?

In [None]:
df['description_length'] = df['description'].str.len()
df['description_word_count'] = df['description'].str.split().str.len()

# Analyze description patterns by type and rating
desc_analysis = df.groupby(['type', 'age_category'], observed=True).agg({
    'description_length': ['mean', 'median'],
    'description_word_count': ['mean', 'median']
}).round(1)

fig = px.box(
    df,
    x='type',
    y='description_length',
    color='age_category',
    title='Description Length: Does Netflix describe mature content more?',
    labels={'description_length': 'Description Length (characters)', 'type': 'Content Type'},
    height=500,
    width=1000
)
fig.write_html('../viz/description_strategy.html')
fig.show()

print("\nINSIGHT 2: Description Strategy")
print("Average description length by content type:")
print(df.groupby('type', observed=True)['description_length'].describe()[['mean', '50%', 'max']])
print("\nTitle with the longest description:")
longest_desc = df.nlargest(1, 'description_length')[['title', 'type', 'rating', 'description_length']]
print(longest_desc)
print(f"\nShortest average descriptions: {df.groupby('age_category', observed=True)['description_length'].mean().idxmin()}")
print(f"Longest average descriptions: {df.groupby('age_category', observed=True)['description_length'].mean().idxmax()}")
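
Before reading much into which age category has the longest descriptions, it helps to look at how many titles sit in each group, since a mean over a handful of rows is easy to over-interpret. A minimal sketch on invented numbers, assuming the same `age_category` and `description_length` column names:

```python
import pandas as pd

# Made-up values; the real analysis would use df from the cleaning step
sample = pd.DataFrame({
    'age_category': ['Mature (17+)', 'Mature (17+)', 'Mature (17+)', 'All Ages'],
    'description_length': [150, 140, 145, 90],
})

# Show group size next to the averages, longest mean first
summary = (
    sample.groupby('age_category')['description_length']
    .agg(['count', 'mean', 'median'])
    .sort_values('mean', ascending=False)
)

print(summary)
```

Here the `All Ages` group has a single title, so its average says little on its own.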

## INSIGHT 3: Netflix Director Favorites
### Who are the biggest directors on the platform?

In [None]:
# Flatten the directors list
directors_list = []
for directors in df['director_list']:
    if directors != ['Unknown']:
        for director in directors:
            directors_list.append(director.strip())

top_directors = pd.Series(directors_list).value_counts().head(15)

fig = px.bar(
    x=top_directors.values,
    y=top_directors.index,
    orientation='h',
    title='Biggest Directors on the Platform?',
    labels={'x': 'Number of Titles', 'y': 'Director'},
    color=top_directors.values,
    color_continuous_scale='Reds'
)
fig.update_layout(height=600, width=900)
fig.write_html('../viz/top_directors.html')
fig.show()

print("\nINSIGHT 3: Biggest Directors")
print(f"Total unique directors: {len(set(directors_list))}")
print(f"Top director has {top_directors.iloc[0]} titles on Netflix")
print(f"Average titles per director: {len(directors_list) / len(set(directors_list)):.2f}")
print("\nTop 10 Directors:")
print(top_directors.head(10))

print(f"\n{top_directors.index[0]} is the most prolific director on Netflix with {top_directors.iloc[0]} titles.")
print("A handful of directors with many titles may point to recurring partnerships,")
print("but title counts alone cannot confirm how Netflix selects directors.")
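
The nested loop that flattens `director_list` can also be written with `Series.explode`. The sketch below uses a made-up `sample` frame with the same comma-separated format and the same `'Unknown'` placeholder as the cleaning step.

```python
import pandas as pd

# Toy data in the same shape as the director column after fillna('Unknown')
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': ['Jane Doe, John Roe', 'Unknown', 'Jane Doe'],
})

director_counts = (
    sample['director']
    .str.split(',')                     # one list of names per title
    .explode()                          # one row per (title, director) pair
    .str.strip()                        # drop the space left after each comma
    .loc[lambda s: s != 'Unknown']      # ignore the missing-value placeholder
    .value_counts()
)

print(director_counts)
```

The same chain works for `cast` and `country`, which keeps the three flattening steps consistent.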

## INSIGHT 4: Geographic Production Hubs
### Which countries produce the most content?

In [None]:
countries_list = []
for countries in df['country_list']:
    if countries != ['Unknown']:
        for country in countries:
            countries_list.append(country.strip())

top_countries = pd.Series(countries_list).value_counts().head(20)

fig = go.Figure()
fig.add_trace(go.Bar(
    y=top_countries.index,
    x=top_countries.values,
    orientation='h',
    marker=dict(color=top_countries.values, colorscale='Viridis', showscale=True)
))

fig.update_layout(
    title='Netflix Global Production Hub - Which countries dominate?',
    xaxis_title='Number of Titles',
    yaxis_title='Country',
    height=600,
    width=900
)
fig.write_html('../viz/country_production.html')
fig.show()

print("\nINSIGHT 4: Geographic Diversity")
print(f"Total countries represented: {len(set(countries_list))}")
# Percentages are shares of country credits: a co-production counts once per country
us_count = top_countries.get('United States', 0)
print(f"Content from US: {us_count} titles ({us_count / len(countries_list) * 100:.1f}% of all country credits)")
if 'India' in top_countries.index and 'United Kingdom' in top_countries.index:
    big_three = us_count + top_countries['India'] + top_countries['United Kingdom']
    print(f"Combined US + India + UK: {big_three / len(countries_list) * 100:.1f}% of all country credits")
print("\nTop 15 Content-Producing Countries:")
print(top_countries.head(15))

print("\nThe United States leads Netflix content production by a significant margin,")
print("reflecting its dominance in the global entertainment industry.")
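
Because co-productions list several countries, a country's share of credits and its share of titles are different numbers. The sketch below computes both on a made-up `sample` frame, assuming the same comma-separated `country` format and `'Unknown'` placeholder.

```python
import pandas as pd

# Toy data: s1 is a co-production, s4 has no known country
sample = pd.DataFrame({
    'show_id': ['s1', 's2', 's3', 's4'],
    'country': ['United States, India', 'United States', 'India', 'Unknown'],
})

# One row per (title, country) credit, with the placeholder removed
credits = (
    sample.assign(country=sample['country'].str.split(','))
    .explode('country')
    .assign(country=lambda d: d['country'].str.strip())
    .query("country != 'Unknown'")
)

# Share of all credits: sums to 1 across countries
share_of_credits = credits['country'].value_counts(normalize=True)

# Share of titles with a known country: can sum to more than 1
titles_with_country = credits['show_id'].nunique()
share_of_titles = credits.groupby('country')['show_id'].nunique() / titles_with_country

print(share_of_credits)
print(share_of_titles)
```

In this toy case the United States holds half of the credits but appears in two thirds of the titles, which is why the label on each percentage matters.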

## Summary of Unexpected Insights

Based on this analysis, Netflix's content strategy shows more nuance than a simple catalog listing would suggest:

1. **Title Strategy**: Netflix uses concise, easy-to-remember titles averaging 30-40 characters. Single-word titles are relatively rare.

2. **Description Depth**: Mature-rated content tends to receive longer, more detailed descriptions, which may reflect different engagement strategies for different audiences.

3. **Director Concentration**: A small number of prolific directors account for a disproportionate number of titles. This may point to recurring partnerships rather than broad creator diversity.

4. **Geographic Localization**: The US dominates, but India and the UK also contribute substantial shares, pointing to Netflix's investment in international production.