## 🧹 Step 3: Data Preprocessing, EDA & Feature Engineering

---
This step transforms raw Netflix metadata into a structured, analysis-ready format by addressing data quality issues, exploring catalog imbalances, and engineering content features required for similarity-based recommendation.

*   **Approach**: Data cleaning, exploratory analysis, and feature construction
*   **Input**: Raw Netflix title metadata
*   **Goal**: Prepare reliable content signals for cold-start recommendation
*   **Constraint**: No modeling or similarity computation performed

In [None]:
# =============================================================================
# STEP 3: DATA PREPROCESSING, EDA & FEATURE ENGINEERING
# =============================================================================

"""
Prepare Netflix title metadata for content-based recommendation by cleaning
raw data, exploring key patterns, and creating structured features that
support similarity-based modeling.

This step directly addresses the data issues identified in Step 2 and
translates them into practical preprocessing and feature engineering actions.

There were five key problems identified in Step 2, each addressed through
a specific subcomponent in this step.

Problem: Missing metadata (e.g., cast and director)
Solution: 3a. Data Cleaning – Missing Value Analysis

Problem: Catalog imbalance across content types, genres, and countries
Solution: 3b. Exploratory Data Analysis (EDA) – Catalog Distributions

Problem: Inconsistent text formats across metadata fields
Solution: 3c. Text Cleaning and Normalization

Problem: Lack of user interaction data for personalization
Solution: 3d. Feature Engineering – Content Profile Construction

Problem: Duplicate title names causing lookup ambiguity
Solution: 3e. Duplicate Handling Strategy
"""

# ---------------------------------------------------------------------------
# 3a. Data Cleaning – Missing Value Analysis
# ---------------------------------------------------------------------------
# Problem addressed: Missing metadata (e.g., cast and director).
# Purpose: Measuring missingness helps decide which fields can still be used
# safely without removing otherwise useful titles from the catalog.

print("DATA CLEANING: MISSING VALUE ANALYSIS")
print("\n")

def analyze_missing(df):                            # Define a reusable function to analyze missing values across all columns.
    """
    Analyze missing values across all columns.

    Purpose:
    Identify where metadata gaps exist so that preprocessing decisions
    remain practical and do not unnecessarily reduce catalog size.
    """
    missing = df.isnull().sum()                     # Count missing values per column.
    missing_pct = 100 * missing / len(df)           # Convert missing counts to percentages for severity assessment.

    missing_df = pd.DataFrame({                     # Combine counts and percentages into a summary table.
        "Missing Count": missing,
        "Missing %": missing_pct
    })

    return (
        missing_df[missing_df["Missing Count"] > 0] # Keep only columns with at least one missing value.
        .sort_values("Missing %", ascending=False)  # Sort columns by highest missing percentage.
    )

missing_summary = analyze_missing(df)               # Apply missing value analysis to the dataset.
display(missing_summary)                            # Display missing value summary for review.
print(
    f"INTERPRETATION:\n This table shows that several metadata fields have missing values, \n "
    f"with the highest missing rates observed in columns such as cast and director. \n "
    f"The most incomplete column is missing {missing_summary['Missing %'].max():.1f}% of its entries, \n "
    f"which indicates that dropping these records would significantly reduce the catalog size. \n "
    f"As a result, the recommender is designed to tolerate partial metadata rather than \n "
    f"exclude large portions of the dataset. \n"
)
print("\n")

# ---------------------------------------------------------------------------
# 3b. Exploratory Data Analysis (EDA) – Catalog Distributions
# ---------------------------------------------------------------------------
# Problem addressed: Catalog imbalance across content types, genres, and countries.
# Purpose: Visualizing these patterns explains potential recommendation bias and
# supports the later use of diversity-focused evaluation metrics.

print("EDA: CATALOG DISTRIBUTION")
print("\n")

def plot_distributions(df, columns, figsize=(15, 5)):           # Define a helper function for visualizing distributions.
    """
    Plot distributions for categorical or numerical fields.

    Purpose:
    Understand how content is distributed across the catalog and identify
    dominant patterns that may affect recommendation exposure.
    """
    n_cols = min(3, len(columns))                               # Limit the number of plots per row for readability.
    n_rows = (len(columns) + n_cols - 1) // n_cols              # Compute required number of rows dynamically.

    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)   # Create subplot grid.
    axes = axes.flatten() if n_rows * n_cols > 1 else [axes]    # Flatten axes for consistent indexing.

    for idx, col in enumerate(columns):                         # Iterate through selected columns.
        if pd.api.types.is_numeric_dtype(df[col]):              # Check if column is numerical (any integer or float width).
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor="black")  # Plot histogram for numerical data.
        else:
            df.loc[df[col] != "", col].value_counts().head(10).plot(      # Plot top 10 categories for categorical data.
                kind="bar", ax=axes[idx]
            )

        axes[idx].set_title(col)                                # Set plot title to column name.
        axes[idx].tick_params(axis="x", rotation=45)            # Rotate x-axis labels for readability.

    plt.tight_layout()                                          # Adjust layout to avoid overlap.
    plt.show()                                                  # Display the plots.

# Catalog composition: Movies vs TV Shows
# ---------------------------------------------------------------------------
# Purpose: Understanding the balance between movies and TV shows supports
# type-aware recommendation logic and helps avoid mismatched suggestions
# like recommending TV series when a user selects a movie.

print("CATALOG COMPOSITION: MOVIES VS TV SHOWS")
print("\n")

plt.figure(figsize=(6, 4))
df["type"].value_counts().plot(kind="bar")                      # Plot the count of Movies vs TV Shows in the catalog.
plt.title("Content Type Distribution (Movies vs TV Shows)")
plt.xlabel("Content Type")
plt.ylabel("Number of Titles")
plt.show()
type_counts = df["type"].value_counts()
print(
    f"INTERPRETATION:\n  This chart shows that the catalog contains {type_counts.get('Movie', 0)} Movies and {type_counts.get('TV Show', 0)} TV Shows. \n  "
    f"The higher number of movies indicates a natural imbalance in content availability, \n  "
    f"which could bias recommendations toward movies if content type is not considered. \n  "
    f"This supports the need for type-aware filtering in later recommendation stages. \n "
)
print("\n")


# High-level catalog distributions (Country)
# ---------------------------------------------------------------------------
# Purpose: Visualizing country dominance highlights exposure imbalance across
# regions and reinforces the need for diversity-aware evaluation metrics
# such as Catalog Coverage.

print("COUNTRY DISTRIBUTION")
print("\n")

plot_distributions(df, ["country"])                             # Visualize catalog balance by production country.

top_country = df["country"].value_counts().idxmax()
top_country_count = df["country"].value_counts().max()

print(
    f"INTERPRETATION:\n  The country distribution shows that content production is concentrated \n  "
    f"in a small number of regions. The most represented country, {top_country}, \n  "
    f"appears {top_country_count} times in the catalog. This concentration increases the risk that \n  "
    f"recommendations may repeatedly surface titles from dominant regions, reinforcing the \n  "
    f"need for coverage-aware evaluation later in the pipeline. \n "
)
print("\n")


# Genre distribution (Top categories)
# ---------------------------------------------------------------------------
# Purpose: Identifying dominant genres explains the risk of popularity bias and
# supports the use of Intra-list Diversity (ILD) as a success metric to
# encourage variety in recommendations.

print("GENRE DISTRIBUTION")
print("\n")

genre_counts = (                                        # Aggregate genre token frequencies.
    df["listed_in"]                                     # Access genre metadata field.
    .fillna("")                                         # Ensure no NaNs
    .str.split(" ")                                     # Split on spaces into word tokens (not full genre labels).
    .explode()                                          # Expand tokens into separate rows for counting.
    .loc[lambda x: x.notna() & (x != "") & (x != "&")]  # Remove blank and non-informative tokens
    .value_counts()                                     # Count frequency of each genre token.
    .head(10)                                           # Retain only the top 10 most frequent genres.
)

plt.figure(figsize=(8, 5))
genre_counts.plot(kind="bar")                       # Plot the most common genre tokens.
plt.title("Top 10 Genre Tokens in Catalog")
plt.xlabel("Genre")
plt.ylabel("Frequency")
plt.show()

top_genre = genre_counts.index[0]
top_genre_count = genre_counts.iloc[0]

print(
    f"INTERPRETATION:\n  This chart shows that a small number of genre tokens dominate the catalog.\n  "
    f"The most frequent token, '{top_genre}', appears {top_genre_count} times across all titles.\n  "
    f"Such dominance suggests that similarity-based recommendations may repeatedly surface \n  "
    f"very similar content unless diversity controls are applied. This directly motivates the \n  "
    f"use of Intra-list Diversity as a success metric.\n "
)
print("\n")


# ---------------------------------------------------------------------------
# 3c. Text Cleaning and Normalization
# ---------------------------------------------------------------------------
# Problem addressed: Inconsistent text formats across metadata fields.
# Purpose: Standardizing text ensures similarity calculations focus on meaning
# rather than formatting differences.

def clean_text(text):                               # Define a utility function for text standardization.
    """
    Clean and normalize text fields.

    Purpose:
    Standardize text inputs so that similarity calculations are consistent
    and comparable across titles.
    """
    if pd.isna(text):                               # Check if the value is missing.
        return ""                                   # Replace missing values with empty strings.
    return (
        text.lower()                                # Convert text to lowercase for consistency.
        .strip()                                    # Remove leading and trailing whitespace.
        .replace(",", " ")                          # Replace commas to standardize token separation.
    )

text_columns = ["listed_in", "description", "cast", "director", "country"]  # Define text-based metadata fields.

for col in text_columns:                            # Apply text cleaning to each selected column.
    df[col] = df[col].apply(clean_text)             # Clean text values in place.

# ---------------------------------------------------------------------------
# 3d. Feature Engineering – Content Profile Construction
# ---------------------------------------------------------------------------
# Problem addressed: Lack of user interaction data.
# Purpose: Combining metadata fields creates a richer content signal that
# supports recommendation in cold-start scenarios.

def build_content_profile(row):                     # Define a function to combine metadata into one profile.
    """
    Combine selected metadata fields into a single content profile.

    Purpose:
    Create a unified text representation that captures genre, theme,
    and key attributes needed for similarity-based recommendation.
    """
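    # Assumption: raw country values are separated by ", ". clean_text (3c) turned
    # each of those separators into a double space, so split on that to recover the
    # individual countries, then join multi-word names with underscores
    # (e.g., "united states" -> "united_states"). Each country stays a single token
    # and co-producing countries are not merged into one token.
    raw_country = row["country"] if isinstance(row["country"], str) else ""
    countries = [
        name.strip().replace(" ", "_")
        for name in raw_country.split("  ")
        if name.strip()
    ]
    return " ".join([                                # Concatenate relevant metadata fields into one string.
        row["listed_in"],
        row["description"],
        row["cast"],
        row["director"],
        " ".join(countries),
    ])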

df["content_profile"] = df.apply(build_content_profile, axis=1)  # Create a content profile for each title.

print("\nSample Content Profiles:\n")

display(df[["title", "content_profile"]].head(5))

print(
    "\nINTERPRETATION:\n"
    "The content_profile column demonstrates how multiple metadata fields like genres, descriptions, cast, \n"
    "director, and country are consolidated into a single unified text representation. This enriched profile \n"
    "serves as the foundation for similarity-based modeling, enabling the recommender system to identify \n"
    "relationships between titles even in the absence of user interaction data. By combining both thematic \n"
    "and categorical signals, the system is better equipped to generate meaningful recommendations in cold-start \n"
    "scenarios, where traditional collaborative filtering approaches would not be applicable.\n"
)

# ---------------------------------------------------------------------------
# 3e. Duplicate Handling Strategy
# ---------------------------------------------------------------------------
# Problem addressed: Duplicate title names.
# Purpose: Using index-based referencing avoids ambiguity during recommendation
# without removing valid titles from the dataset.

df = df.reset_index(drop=True)                      # Reset index to ensure unique and stable row identifiers.


# ---------------------------------------------------------------------------
# Export Cleaned Dataset
# ---------------------------------------------------------------------------
# Purpose: Persist the cleaned and feature-engineered dataset for reuse,
# reproducibility, and alignment with project structure.

processed_path = "data/processed/netflix_titles_cleaned.csv"

os.makedirs(os.path.dirname(processed_path), exist_ok=True)
df.to_csv(processed_path, index=False)

print(f"Cleaned dataset exported to: {processed_path}")
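
Step 3e keeps titles that share a name and relies on row indices to tell them apart. The sketch below shows one way such an index-based lookup could work downstream. The helper name `find_title_indices` and the sample rows are hypothetical and are not part of the project code.

```python
import pandas as pd

# Toy catalog with one repeated title (hypothetical rows for illustration only).
catalog = pd.DataFrame({
    "title": ["Love", "Sisters", "Love", "Zodiac"],
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
}).reset_index(drop=True)  # Same reset as in Step 3e: unique 0..n-1 row ids.


def find_title_indices(df, title):
    """Return every row index whose title matches, ignoring case and outer whitespace."""
    wanted = title.strip().lower()
    matches = df["title"].str.strip().str.lower() == wanted
    return df.index[matches].tolist()


love_rows = find_title_indices(catalog, "love")
print(love_rows)                                   # Row ids of both titles named "Love".
print(catalog.loc[love_rows, "type"].tolist())     # Their content types, in row order.
```

Returning all matching indices leaves the choice between same-named titles to the caller, for example by filtering on `type`, instead of silently picking the first match.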


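
The genre chart in 3b counts whitespace-delimited words, so a word such as "TV" can rank first even though it is only one part of a longer genre label. If full labels are of interest, one option is to split on the comma separator instead. This sketch assumes `listed_in` holds comma-separated labels and uses made-up rows, so it is an illustration and not a replacement for the counts reported in this step.

```python
import pandas as pd

# Made-up rows that mimic a comma-separated genre field.
toy = pd.DataFrame({
    "listed_in": [
        "TV Dramas, TV Comedies",
        "Dramas, International Movies",
        "TV Dramas",
        None,
    ]
})

label_counts = (
    toy["listed_in"]
    .fillna("")                        # Treat missing genres as empty strings.
    .str.split(",")                    # Split on the label separator, not on spaces.
    .explode()                         # One label per row.
    .str.strip()                       # Drop the space that follows each comma.
    .loc[lambda s: s != ""]            # Remove blanks left by missing values.
    .value_counts()                    # Count each full genre label.
)

print(label_counts.head(3))
```

Counting at the label level keeps "TV Dramas" and "TV Comedies" distinct, which may give a clearer picture of genre concentration than word-level tokens.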

### ìÇÉüñä Key Findings

Exploratory data analysis identified missing metadata, duplicate title strings, catalog imbalance, and concentration across geography and genres, requiring targeted handling without reducing catalog size. Fully empty columns were removed, while partially missing semantic fields were retained through tolerant preprocessing and feature selection to support cold-start scenarios, and duplicate titles were handled using unique row indices to prevent lookup ambiguity. Catalog imbalance informed type-aware similarity logic, geographic concentration motivated coverage-aware evaluation, and genre dominance justified the use of Intra-List Diversity. These findings guided feature engineering through semantic field selection, normalization, and unified content profile construction, while dimensionality reduction such as PCA was not applicable due to the need to preserve semantic meaning and explainability in text-based similarity modeling.
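
The missing-data handling summarized above can be sketched on a toy frame. This is one plausible way to do it, not necessarily the exact code used earlier in the project. The placeholder column name `Unnamed: 12` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: two real fields with gaps plus one fully empty placeholder column
# ("Unnamed: 12" is a hypothetical name used only for this illustration).
raw = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", np.nan, np.nan],
    "cast": [np.nan, "Actor One, Actor Two", "Actor Three"],
    "Unnamed: 12": [np.nan, np.nan, np.nan],
})

# Remove columns in which every value is missing.
trimmed = raw.dropna(axis=1, how="all")

# Keep every row; replace the remaining gaps in text fields with empty strings
# so that later string concatenation does not fail on NaN.
text_fields = ["director", "cast"]
trimmed = trimmed.assign(**{col: trimmed[col].fillna("") for col in text_fields})

print(list(trimmed.columns))        # The fully empty column is gone.
print(len(trimmed) == len(raw))     # No rows were dropped.
```

Filling gaps with empty strings keeps partially described titles in the catalog, at the cost of a sparser content profile for those rows.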

*   **Missing Data**
    *  Fully empty placeholder columns: **14 columns** (100% missing, 8,809 rows)
    *  Director missing: 29.9% (2,634 titles)
    *  Cast missing: 9.4% (825 titles)
    *  Country missing: 9.4% (831 titles) - partial metadata retained to preserve catalog coverage
    *  How this was addressed:
    	*  Removed fully empty placeholder columns as non-informative noise
    	*  Retained partially missing fields and applied missing-value tolerance rather than row deletion
    	*  Consolidated multiple text fields into a unified `content_profile` column, allowing similarity to be computed even when some metadata is absent
    	*  This preserves catalog size and ensures robustness in cold-start and incomplete-metadata scenarios.

*   **Duplicates**
    *  3 duplicate title strings found
    *  How this was addressed:
    	*  Did not remove duplicate rows to avoid discarding valid titles that share the same name
    	*  Reset and relied on unique row indices for all similarity computation and recommendation lookups
    	*  Enforced deterministic title resolution during recommendation retrieval
    	*  This prevents lookup ambiguity while preserving catalog completeness and recommendation integrity

*   **Catalog Imbalance**
    *  Movies: **6,132 titles**
    *  TV Shows: **2,677 titles**
    *  Supports type-aware filtering
    *  How this was addressed:
    	*  Enforced **type-aware similarity logic** so movies are compared only with movies and TV shows with TV shows
    	*  Prevented cross-type recommendations that would reduce relevance
    	*  This reduces structural bias toward movies and improves contextual relevance of recommendations.

*   **Geographic Concentration**
    *  United States: **2,819 titles** - dominant production region
    *  Motivates coverage-aware evaluation
    *  How this was addressed:
    	*  Measured **Catalog Coverage (CC)** during model evaluation to track how broadly recommendations surface the catalog
    	*  Applied balanced feature weighting to prevent country metadata from overpowering semantic content
    	*  This limits repeated exposure to dominant regions and supports fairer catalog representation.

*   **Genre Dominance**
    *  Top genre token **“TV” appears 5,230 times**
    *  Indicates heavy concentration in a small number of genres
    *  How this was addressed:
    	*  Introduced **Intra-List Diversity (ILD)** as a core evaluation metric
    	*  Tuned genre weighting in hybrid similarity models to balance relevance and variety
    	*  This prevents overly repetitive recommendations and improves perceived discovery value.

*   **Feature Engineering & Selection**
    *  Selected content signals: **description, listed_in (genres), cast, director, country**
    *  Excluded metadata: **release year, rating, duration, date added** (non-semantic)
    *  Text normalization: lowercasing, delimiter standardization, missing-value handling
    *  Feature construction: merged selected fields into a unified **content_profile** column
    *  Dimensionality Reduction: PCA was not applicable to this case because preserving semantic meaning and explainability is essential in text-based similarity modeling.
    *  How this was addressed:
    	*  Performed **explicit feature selection at the metadata level** rather than statistical dimensionality reduction
    	*  Preserved full semantic text for TF-IDF and embedding-based similarity
    	*  Avoided PCA to maintain interpretability and traceability of similarity decisions
    	*  This maintains explainability, semantic richness, and alignment with content-based retrieval objectives.

*   **Output**
    *  Cleaned, feature-ready dataset exported for similarity modeling
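
The findings above name Catalog Coverage (CC) and Intra-List Diversity (ILD) as evaluation metrics without defining them. The sketch below uses common definitions, which are an assumption here: CC as the share of catalog items that appear in at least one recommendation list, and ILD as the mean pairwise cosine distance between items in a single list. The function names and toy inputs are hypothetical.

```python
from itertools import combinations

import numpy as np


def catalog_coverage(recommendation_lists, catalog_size):
    """Share of catalog items appearing in at least one recommendation list."""
    recommended = {item for rec_list in recommendation_lists for item in rec_list}
    return len(recommended) / catalog_size


def intra_list_diversity(item_vectors):
    """Mean pairwise cosine distance (1 - cosine similarity) within one list."""
    vectors = np.asarray(item_vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)  # Guard against zero vectors.
    pairs = list(combinations(range(len(unit)), 2))
    if not pairs:
        return 0.0  # A list with fewer than two items has no pairwise diversity.
    distances = [1.0 - float(unit[i] @ unit[j]) for i, j in pairs]
    return sum(distances) / len(distances)


# Toy example: row indices as item ids, a 10-title catalog, two recommendation lists.
lists = [[0, 1, 2], [2, 3, 4]]
print(catalog_coverage(lists, catalog_size=10))    # 5 distinct items out of 10.

# Two identical vectors and one orthogonal vector.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(intra_list_diversity(vectors))
```

Higher CC means recommendations reach more of the catalog, and higher ILD means the items within one list differ more from each other. In practice the vectors would come from whatever representation the similarity model uses, such as TF-IDF rows built from `content_profile`.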

---