# Netflix Data Analysis — Intermediate Project

This notebook detects the column names actually present in the uploaded CSV and maps them to a standard set of names. It then performs light cleaning, exploratory data analysis (EDA), and visualization using pandas and matplotlib.

## Detected columns

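The code uses the following standardized column names, each mapped to the matching column detected in your CSV. `None` means no matching column was found, so the standardized column is filled with NaN and the related plot is skipped: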

- **type** -> `Type`
- **title** -> `Title`
- **date_added** -> `None`
- **country** -> `Country`
- **listed_in** -> `None`
- **rating** -> `Rating`
- **duration** -> `Duration`
- **director** -> `Director`
- **cast** -> `Cast`
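
To illustrate how a mapping like the one above can arise, here is a minimal sketch of substring-based column detection. The header list and the `detect_columns` helper are hypothetical and chosen only for illustration; the real CSV header may differ.

```python
# Hypothetical header, assumed for illustration only
example_columns = ['Title', 'Type', 'Country', 'Rating', 'Duration', 'Director', 'Cast']

def detect_columns(columns, candidates_by_field):
    """Map each standardized field to the first column whose lowercased
    name contains one of the candidate substrings, or None if none match."""
    lowered = [c.lower() for c in columns]
    result = {}
    for field, candidates in candidates_by_field.items():
        result[field] = None
        for key in candidates:  # candidates are tried in priority order
            match = next((columns[i] for i, c in enumerate(lowered) if key in c), None)
            if match is not None:
                result[field] = match
                break
    return result

detected = detect_columns(example_columns, {
    'type': ['type'],
    'date_added': ['date_added', 'date added', 'added'],
    'listed_in': ['listed', 'genre'],
})
print(detected)
```

With this assumed header, `type` resolves to `Type`, while `date_added` and `listed_in` resolve to `None` because no column name contains any of their candidate substrings.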

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
pd.set_option('display.max_columns', 50)

In [None]:
csv_path = r'/mnt/data/netflix_project.csv'
df = pd.read_csv(csv_path, encoding='latin1')
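# Map each standardized name to the first CSV column whose lowercased name
# contains one of the candidate substrings (candidates are tried in order)
cols = list(df.columns)
cols_low = [c.lower() for c in cols]

def find_col(key_candidates):
    """Return the original column name for the first candidate that matches, else None."""
    for k in key_candidates:
        for i, c in enumerate(cols_low):
            if k in c:
                return cols[i]
    return None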

mapping = {
    'type': find_col(['type','show_type','program_type']),
    'title': find_col(['title','name']),
    'date_added': find_col(['date_added','date added','added']),
    'country': find_col(['country','countries']),
    'listed_in': find_col(['listed','genre','genres']),
    'rating': find_col(['rating']),
    'duration': find_col(['duration','runtime','length']),
    'director': find_col(['director']),
    'cast': find_col(['cast','actors','cast list'])
}

# Create standardized fields
for std, col in mapping.items():
    if col is not None:
        df[std] = df[col]
    else:
        df[std] = np.nan

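# Parse dates: values that are missing or cannot be parsed become NaT.
# year_added uses 0 as a placeholder for rows without a valid date.
if 'date_added' in df.columns:
    df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
    df['year_added'] = df['date_added'].dt.year.fillna(0).astype(int)

# Preview the first rows
df.head()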

## EDA

Each plot runs only if its related column was detected and contains at least one non-missing value. Otherwise the cell prints a short message instead.
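
The guard used before each plot can be sketched as a small reusable check. This is a minimal sketch, not part of the notebook itself: the `has_data` helper and the `demo` frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def has_data(frame, column):
    """True if the column exists and holds at least one non-missing value."""
    return column in frame.columns and bool(frame[column].notna().any())

# Hypothetical frame: 'listed_in' exists but is entirely NaN, 'rating' is absent
demo = pd.DataFrame({'type': ['Movie', 'TV Show'], 'listed_in': [np.nan, np.nan]})
for col in ['type', 'listed_in', 'rating']:
    print(col, has_data(demo, col))
```

A column that was never detected is filled with NaN by the loading cell, so it fails this check in the same way as a column that is missing altogether.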

In [None]:
# 1) Type counts
if df['type'].notna().any():
    ax = df['type'].fillna('Unknown').value_counts().plot(kind='bar', figsize=(6,4))
    ax.set_title('Count by Type')
    plt.tight_layout()
    plt.show()
else:
    print('No type-like column detected.')

In [None]:
# 2) Content added per year
if 'year_added' in df.columns and (df['year_added']>0).any():
    yc = df[df['year_added']>0]['year_added'].value_counts().sort_index()
    ax = yc.plot(kind='line', marker='o', figsize=(10,4))
    ax.set_title('Content Added Per Year')
    ax.grid(True)
    plt.tight_layout()
    plt.show()
else:
    print('No usable date_added column detected.')

In [None]:
# 3) Top countries (only the first country listed for each title is counted)
if df['country'].notna().any():
    countries = df['country'].dropna().astype(str).str.split(',').str[0].str.strip()
    top_countries = countries.value_counts().head(10)
    ax = top_countries.plot(kind='barh', figsize=(8,4))
    ax.invert_yaxis()  # largest count at the top
    ax.set_title('Top 10 Countries by Number of Titles')
    plt.tight_layout()
    plt.show()
else:
    print('No country-like column detected.')

In [None]:
# 4) Top genres (a title can list several comma-separated genres; each is counted)
if df['listed_in'].notna().any():
    genres_series = df['listed_in'].dropna().astype(str).str.split(',')
    c = Counter()
    for g in genres_series:
        for item in g:
            if item.strip():  # skip empty entries left by stray commas
                c[item.strip()] += 1
    most_common_genres = pd.Series(dict(c)).sort_values(ascending=False).head(15)
    ax = most_common_genres.plot(kind='bar', figsize=(10,4))
    ax.set_title('Top 15 Genres')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print('No genre-like column detected.')

In [None]:
# 5) Rating distribution (12 most frequent ratings; missing values shown as 'Unknown')
if df['rating'].notna().any():
    ax = df['rating'].fillna('Unknown').value_counts().head(12).plot(kind='bar', figsize=(8,4))
    ax.set_title('Rating Distribution')
    plt.tight_layout()
    plt.show()
else:
    print('No rating-like column detected.')

In [None]:
# 6) Duration parsing
if df['duration'].notna().any():
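    def parse_duration(x):
        # Keep only the digits of the value, e.g. '90 min' -> 90.
        # All digits are joined, so a value like '1 h 30 min' would parse as 130.
        try:
            if pd.isna(x):
                return np.nan
            s = str(x)
            num = ''.join(ch for ch in s if ch.isdigit())
            return int(num) if num else np.nan
        except (TypeError, ValueError):
            return np.nan
    df['duration_num'] = df['duration'].apply(parse_duration)
    # Match 'Movie' regardless of case or surrounding whitespace
    is_movie = df['type'].astype(str).str.strip().str.lower() == 'movie'
    movies = df[is_movie]['duration_num'].dropna()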
    if len(movies)>0:
        ax = movies.hist(bins=30, figsize=(8,4))
        plt.title('Distribution of Movie Durations (numeric part)')
        plt.tight_layout()
        plt.show()
    else:
        print('No movie durations found after parsing.')
else:
    print('No duration-like column detected.')
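
The digit-joining approach in the cell above treats every digit in a string as part of one number. A possible alternative, sketched here under the assumption that durations look like `90 min` or `2 Seasons`, extracts only the first run of digits with a regular expression. The `first_number` helper and the sample values are hypothetical.

```python
import re

import numpy as np
import pandas as pd

def first_number(value):
    """Return the first run of digits in value as an int, or NaN if there is none."""
    if pd.isna(value):
        return np.nan
    match = re.search(r'\d+', str(value))
    return int(match.group()) if match else np.nan

# Hypothetical sample values, not taken from the dataset
samples = pd.Series(['90 min', '2 Seasons', '1 h 30 min', None, 'unknown'])
parsed = samples.apply(first_number)
print(parsed.tolist())
```

Neither approach converts a mixed value such as `1 h 30 min` into 90 minutes: digit joining gives 130 and the first-number approach gives 1. If the data contains such values, they would need explicit unit handling.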

## Conclusions & next steps
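- The notebook adapts to your CSV column names and produces a plot for each column it detects. In this run no `date_added` or `listed_in` column was found, so the per-year and genre plots are skipped.
- Next steps: deeper NLP on titles and descriptions, clustering by features, and interactive dashboards.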