# Anime Dataset 2023 - Advanced Exploratory Data Analysis

This notebook performs a thorough and academically oriented EDA for the Anime Dataset 2023.

Role and goal:
- Role: Producer and data analyst
- Goal: Understand which factors are associated with anime success and prepare features for a later Success Score model

Target metric (conceptual model):
- Score = a*Type + b*Episodes + c*Genre_Action + ... + constant + error

The notebook is structured with clear sections, explicit assumptions and systematic diagnostics.


## 1. Data understanding and feature dictionary

This section documents the meaning of each column in the dataset and groups features into logical categories.
The content below is adapted from the original raw analysis file.


## Definition of each feature

### 1. Basic identification and description
- **anime_id**: Unique ID for each anime.
- **Name**: Name of the anime in its original language.
- **English name**: English name of the anime.
- **Other name**: Native name or title of the anime.
- **Synopsis**: Short description or summary of the anime's plot.
- **Genres**: Genres of the anime, separated by commas.
- **Image URL**: URL of the anime's image or poster.

### 2. Production and technical details
- **Type**: Type of the anime.
- **Source**: Source material of the anime.
- **Producers**: Production companies or producers of the anime.
- **Studios**: Animation studios that worked on the anime.
- **Licensors**: Licensors of the anime.
- **Episodes**: Number of episodes in the anime.
- **Duration**: Duration of each episode.

### 3. Release and airing information
- **Aired**: Dates on which the anime aired.
- **Premiered**: Season and year in which the anime premiered.
- **Status**: Status of the anime.

### 4. Audience engagement and performance metrics
- **Score**: Score given to the anime.
- **Rating**: Age rating of the anime.
- **Rank**: Rank of the anime based on popularity or other criteria.
- **Popularity**: Popularity rank of the anime.
- **Favorites**: Number of times users marked the anime as a favorite.
- **Scored By**: Number of users who scored the anime.
- **Members**: Number of members who added the anime to their list on the platform.

### Unnecessary features
- Drop features that are not needed from the dataframe.
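
The notebook does not state which features count as unnecessary. As a minimal sketch, assuming that free-text and URL columns such as `Image URL` and `Synopsis` are not needed for the Score analysis (the list `COLUMNS_TO_DROP` and the helper `drop_unneeded` are hypothetical), the drop step could look like this:

```python
import pandas as pd

# Hypothetical choice: these columns are assumed to be irrelevant for modeling Score.
COLUMNS_TO_DROP = ["Image URL", "Synopsis"]

def drop_unneeded(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of frame without the listed columns, ignoring absent ones."""
    present = [c for c in columns if c in frame.columns]
    return frame.drop(columns=present)

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "anime_id": [1, 2],
    "Name": ["A", "B"],
    "Image URL": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
})
cleaned = drop_unneeded(toy, COLUMNS_TO_DROP)
print(cleaned.columns.tolist())
```

Filtering on the columns that are actually present keeps the cell safe to re-run after the columns have already been removed.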

## 2. Setup and configuration


In [None]:
import os
import re
from itertools import chain

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 180)

PLOT_DIR = "plots"
os.makedirs(PLOT_DIR, exist_ok=True)

def save_plot(name: str):
    """Save the current matplotlib figure to PLOT_DIR in a consistent way.

    The figure is left open so that a following plt.show() can still display it.
    """
    plt.tight_layout()
    path = os.path.join(PLOT_DIR, f"{name}.png")
    plt.savefig(path, dpi=150)
    print(f"Saved plot to {path}")


## 3. EDA methodology

This EDA follows a structured workflow:

1. Structural overview
   - Inspect shape, schema and basic distributions.
2. Data quality assessment
   - Missingness, NA-like tokens, suspicious zeros, type conversions and outliers.
3. Univariate analysis
   - Distributions of key numeric and categorical features.
4. Bivariate analysis with respect to the target
   - Score vs numeric features
   - Score vs categorical features
5. Multivariate structure
   - Correlation matrix and simple multicollinearity diagnostics.
6. Segment-based and diagnostic analysis
   - Year, Type and Studio segments.
7. Feature readiness
   - Document which features can be safely used for modeling.


## 4. Load dataset and structural overview

In [None]:
DATA_PATH = "anime-dataset-2023.csv"  # update if needed

df = pd.read_csv(DATA_PATH)

print("Shape (rows, columns):", df.shape)
print("\nColumns:")
print(df.columns.tolist())


In [None]:
df.info()

In [None]:
print("First 10 rows:")
display(df.head(10))


In [None]:
print("Numeric columns summary:")
display(df.describe(include=[np.number]).T)

print("\nCategorical columns summary:")
display(df.describe(include=["object"]).T)


## 5. Data quality assessment

This section focuses on data quality:
- Missing values and NA-like tokens
- Duplicates
- Type conversions
- Suspicious zeros
- Outlier inspection


### 5.1 Missing values by column

In [None]:
missing_ratio = df.isna().mean().sort_values(ascending=False)
print("Missing value ratio by column:")
display(missing_ratio)


### 5.2 Duplicates by anime_id

In [None]:
dup_count = df.duplicated(subset=["anime_id"]).sum()
print("Number of duplicated anime_id:", dup_count)

if dup_count > 0:
    print("\nExamples of duplicated anime_id:")
    display(df[df.duplicated(subset=["anime_id"], keep=False)].sort_values("anime_id").head(10))


### 5.3 NA-like tokens in object columns

Some string tokens effectively represent missing values but are not coded as NaN.
We detect these tokens and then normalize them to proper NaN.


In [None]:
import numpy as np

NA_TOKENS = {
    "", " ", "NA", "N/A", "na", "n/a",
    "None", "NONE", "null", "NULL", "NaN", "nan",
    "-", "?", "Unknown", "unknown"
}

obj_cols = df.select_dtypes(include=["object"]).columns

na_like_rows = []
for col in obj_cols:
    vc = df[col].value_counts(dropna=False)
    top_vals = vc.head(20)
    na_like_vals = [v for v in top_vals.index if isinstance(v, str) and v.strip() in NA_TOKENS]
    if na_like_vals:
        na_like_rows.append({
            "column": col,
            "na_like_values": na_like_vals,
            "na_like_total": int(top_vals.loc[na_like_vals].sum())
        })

na_like_report = pd.DataFrame(na_like_rows)
print("Columns with NA like tokens (top 20 values inspected):")
display(na_like_report)


In [None]:
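# Normalize NA-like tokens to real NaN.
# Strip whitespace first so the replacement matches the detection rule above.

for col in obj_cols:
    stripped = df[col].str.strip()
    df[col] = df[col].mask(stripped.isin(NA_TOKENS), np.nan)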
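
The token set above is case-sensitive, so a placeholder written in another casing (for example an all-caps `UNKNOWN`) would not be matched. Whether the file contains such variants should be checked against the value counts. A minimal sketch of a case-insensitive variant, where `NA_TOKENS_FOLDED`, `is_na_like` and `normalize_na_tokens` are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd

# Lower-cased tokens; "unknown" then also covers "Unknown" and "UNKNOWN".
NA_TOKENS_FOLDED = {"", "na", "n/a", "none", "null", "nan", "-", "?", "unknown"}

def is_na_like(value) -> bool:
    """True for strings that match a token after stripping and lower-casing."""
    return isinstance(value, str) and value.strip().lower() in NA_TOKENS_FOLDED

def normalize_na_tokens(column: pd.Series) -> pd.Series:
    """Replace NA-like strings with NaN; all other values are left unchanged."""
    return column.mask(column.map(is_na_like), np.nan)

# Toy column standing in for a raw string column such as Score.
toy = pd.Series(["7.5", "UNKNOWN", " unknown ", "N/A", "8.1", None])
normalized = normalize_na_tokens(toy)
print(normalized.isna().tolist())
```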


### 5.4 Type conversion for Score and Episodes

In [None]:
# Convert Score to numeric
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

# Preserve original Episodes
df["Episodes_raw"] = df["Episodes"]
df["Episodes"] = pd.to_numeric(df["Episodes"], errors="coerce")

print("Episodes before and after conversion (sample):")
display(df[["Episodes_raw", "Episodes"]].head(10))


### 5.5 Suspicious zeros in numeric columns

Some numeric columns may use 0 as a placeholder for missing or not applicable values.
We generate a zero report to identify candidates for zero-to-NaN recoding.


In [None]:
num_cols = df.select_dtypes(include=[np.number]).columns

rows = []
n_total = len(df)
for c in num_cols:
    series = df[c]
    zero_count = int((series == 0).sum())
    nan_count = int(series.isna().sum())
    rows.append({
        "column": c,
        "zero_count": zero_count,
        "zero_pct": round(zero_count / n_total * 100, 2) if n_total else 0.0,
        "nan_count": nan_count,
    })

zero_report = pd.DataFrame(rows).sort_values(["zero_pct", "zero_count"], ascending=False)
print("Zero report for numeric columns:")
display(zero_report.head(30))


In [None]:
# Example: specify which columns should treat zero as missing if business logic supports it
cols_treat_zero_as_nan = []  # for example: ["Rank", "Popularity"]

for c in cols_treat_zero_as_nan:
    if c in df.columns:
        df[c] = df[c].replace(0, np.nan)


### 5.6 Outlier inspection for Score and Episodes

In [None]:
print("Score summary:")
display(df["Score"].describe())

print("\nEpisodes summary:")
display(df["Episodes"].describe())


In [None]:
plt.figure()
df["Score"].hist(bins=30)
plt.xlabel("Score")
plt.ylabel("Count")
plt.title("Distribution of Score")
save_plot("score_distribution")
plt.show()


In [None]:
plt.figure()
df["Episodes"].hist(bins=50)
plt.xlabel("Episodes")
plt.ylabel("Count")
plt.title("Distribution of Episodes")
save_plot("episodes_distribution")
plt.show()


### 5.7 IQR-based outlier analysis

We use the Interquartile Range (IQR) rule to identify extreme values for key numeric
variables. A value is considered an outlier if it lies outside:

[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
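
As a small worked example of the rule, using a toy series rather than the anime data: for the values 1, 2, 3, 4, 100 the quartiles are Q1 = 2 and Q3 = 4, so IQR = 2 and the bounds are [-1, 7]. Only the value 100 falls outside.

```python
import pandas as pd

# Toy values chosen so that the quartiles fall exactly on data points.
values = pd.Series([1, 2, 3, 4, 100], name="toy")

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0
lower = q1 - 1.5 * iqr       # -1.0
upper = q3 + 1.5 * iqr       # 7.0

outliers = values[(values < lower) | (values > upper)]
print(f"bounds = [{lower}, {upper}]")
print("outliers:", outliers.tolist())   # [100]
```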


In [None]:
def find_outliers_iqr(series: pd.Series):
    series = series.dropna()
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    mask = (series < lower) | (series > upper)
    return series[mask], lower, upper

key_numeric = ["Score", "Members", "Favorites", "Episodes"]
for col in key_numeric:
    if col in df.columns:
        print(f"\n=== IQR outliers for {col} ===")
        outliers, lower, upper = find_outliers_iqr(df[col])
        print(f"Bounds: [{lower:.3f}, {upper:.3f}]")
        print(f"Detected {len(outliers)} outliers out of {df[col].notna().sum()} non-missing values.")
        # Show a few extreme high outliers
        if len(outliers) > 0:
            extreme_high = outliers.sort_values(ascending=False).head(10)
            print("Top 10 extreme high values:")
            display(df.loc[extreme_high.index, ["Name", col, "Score", "Type"]].head(10))


### 5.8 String data consistency: fuzzy matching for studios

String fields such as Studios or Producers may contain near duplicates
due to inconsistent naming (for example, "Sunrise" vs "Sunrise Inc.").
We use fuzzy string matching to detect candidates for standardization.
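
A plain character-level ratio can miss pairs that differ only by a corporate suffix: with the standard library's `difflib`, "Sunrise" vs "Sunrise Inc." scores about 0.74, well below a 0.9 cut-off. The sketch below strips such suffixes before comparing. It uses `difflib.SequenceMatcher` as a stand-in for rapidfuzz, and the suffix list and helper names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical list of suffixes to ignore when comparing studio names.
SUFFIX_RE = re.compile(r"\b(inc|co|ltd|studio|studios)\b\.?", flags=re.IGNORECASE)

def normalize_name(name: str) -> str:
    """Lower-case a name, drop known suffixes and collapse punctuation to spaces."""
    cleaned = SUFFIX_RE.sub(" ", name)
    cleaned = re.sub(r"[^0-9a-z]+", " ", cleaned.lower())
    return cleaned.strip()

def similar_name_pairs(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for pairs whose normalized forms are similar."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        norm_a, norm_b = normalize_name(a), normalize_name(b)
        if not norm_a or not norm_b:
            continue  # skip names that consist only of suffixes
        ratio = SequenceMatcher(None, norm_a, norm_b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs

example_names = ["Sunrise", "Sunrise Inc.", "Madhouse", "MAPPA"]
print(similar_name_pairs(example_names))
```

Flagged pairs are candidates for manual review only; similar names can still belong to different companies.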


In [None]:
try:
    from rapidfuzz import fuzz
except ImportError:
    import sys
    !{sys.executable} -m pip install rapidfuzz -q
    from rapidfuzz import fuzz

if "Studios" in df.columns:
    unique_studios_raw = df["Studios"].dropna().unique()
    studio_tokens = set()
    for s in unique_studios_raw:
        for part in str(s).split(","):
            token = part.strip()
            if token:
                studio_tokens.add(token)

    studio_list = sorted(studio_tokens)
    print(f"Total unique studio tokens: {len(studio_list)}")

    from itertools import combinations

    similar_pairs = []
    max_studios = 500  # cap the number of names compared; the pair count grows quadratically
    subset = studio_list[:max_studios]  # note: an alphabetical prefix, not a random sample

    for s1, s2 in combinations(subset, 2):
        score = fuzz.ratio(s1, s2)
        if score > 90 and s1 != s2:
            similar_pairs.append((s1, s2, score))

    if similar_pairs:
        similar_pairs.sort(key=lambda x: x[2], reverse=True)
        print("Potential duplicate / inconsistent studio names (similarity > 90):")
        for s1, s2, sc in similar_pairs[:20]:
            print(f"- '{s1}' vs '{s2}' (Similarity: {sc:.1f})")
    else:
        print("No highly similar studio name pairs found in the checked subset.")
else:
    print("Studios column not available.")


### 5.9 Business rule and logical validation

We validate several basic business rules to detect logically inconsistent records:

- Aired end date should not be before the start date.
- The count of users who scored an anime (Scored By) should not exceed Members.
- Duration should be roughly consistent with the number of episodes.


In [None]:
# 1. Aired start and end dates

if "Aired" in df.columns:
    aired_split = df["Aired"].str.split(" to ", n=1, expand=True)
    start_dates = pd.to_datetime(aired_split[0], errors="coerce")
    end_dates = pd.to_datetime(aired_split[1], errors="coerce")

    invalid_dates_mask = (start_dates.notna()) & (end_dates.notna()) & (end_dates < start_dates)
    invalid_dates = df[invalid_dates_mask]

    print(f"Found {len(invalid_dates)} entries with end date before start date.")
    if len(invalid_dates) > 0:
        display(invalid_dates[["Name", "Aired", "Score", "Type"]].head(10))

# 2. Scored By vs Members

if {"Scored By", "Members"} <= set(df.columns):
    scored_by_num = pd.to_numeric(df["Scored By"], errors="coerce") if df["Scored By"].dtype == object else df["Scored By"]
    members_num = pd.to_numeric(df["Members"], errors="coerce") if df["Members"].dtype == object else df["Members"]

    invalid_scores_mask = scored_by_num > members_num
    invalid_scores = df[invalid_scores_mask]

    print(f"Found {len(invalid_scores)} entries where 'Scored By' > 'Members'.")
    if len(invalid_scores) > 0:
        display(invalid_scores[["Name", "Score", "Members", "Scored By"]].head(10))

# 3. Duration vs Episodes (simple heuristic)

if {"Duration", "Episodes"} <= set(df.columns):
    dur = df["Duration"].astype(str)
    ep = df["Episodes"]

    # Example rule: Episodes == 1 but duration string contains 'per ep'
    mask_single_with_per_ep = (ep == 1) & dur.str.contains("per ep", case=False, na=False)
    inconsistent_single = df[mask_single_with_per_ep]

    print(f"Found {len(inconsistent_single)} entries where Episodes == 1 but Duration looks like per-episode duration.")
    if len(inconsistent_single) > 0:
        display(inconsistent_single[["Name", "Episodes", "Duration"]].head(10))
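
The third rule can be checked more directly once Duration is numeric. The sketch below assumes that Duration strings look like "24 min per ep", "1 hr 30 min" or "30 sec". That format should be verified against the data, and `duration_to_minutes` is a hypothetical helper.

```python
import re
from typing import Optional

# Unit patterns and their conversion factors to minutes.
UNIT_PATTERNS = (
    (r"(\d+)\s*hr", 60.0),
    (r"(\d+)\s*min", 1.0),
    (r"(\d+)\s*sec", 1.0 / 60.0),
)

def duration_to_minutes(text) -> Optional[float]:
    """Convert a duration string to minutes; return None if no unit is found."""
    if not isinstance(text, str):
        return None
    total = 0.0
    found = False
    for pattern, factor in UNIT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            total += int(match.group(1)) * factor
            found = True
    return total if found else None

for sample in ["24 min per ep", "1 hr 30 min", "30 sec", "Unknown"]:
    print(sample, "->", duration_to_minutes(sample))
```

With a numeric duration, a rough total runtime is `Episodes * minutes`, which can then be compared across Type to flag implausible combinations.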


## 6. Univariate analysis

We study the distribution of individual variables to understand their scale, skewness and potential issues.


### 6.1 Numeric features

In [None]:
numeric_candidates = ["Score", "Episodes", "Members", "Scored By", "Favorites"]
numeric_existing = [c for c in numeric_candidates if c in df.columns]

df[numeric_existing].describe().T


In [None]:
for col in numeric_existing:
    plt.figure()
    df[col].hist(bins=40)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    save_plot(f"dist_{col.lower()}")
    plt.show()


### 6.2 Categorical features

In [None]:
categorical_candidates = ["Type", "Rating", "Status", "Source"]
categorical_existing = [c for c in categorical_candidates if c in df.columns]

for col in categorical_existing:
    print(f"\nValue counts for {col}:")
    display(df[col].value_counts(dropna=False).head(20))


In [None]:
for col in categorical_existing:
    vc = df[col].value_counts().head(15)
    plt.figure(figsize=(8, 4))
    vc.plot(kind="bar")
    plt.title(f"Top categories for {col}")
    plt.ylabel("Count")
    save_plot(f"cat_{col.lower()}")
    plt.show()


## 7. Target variable analysis: Score

We treat Score as the primary success metric.
We create ordered bands and inspect the distribution.


In [None]:
bins = [0, 6, 7, 8, 10]
labels = ["Low", "Medium", "High", "Top"]
df["Score_band"] = pd.cut(df["Score"], bins=bins, labels=labels, include_lowest=True)

print("Score band distribution (count):")
display(df["Score_band"].value_counts(dropna=False))

print("\nScore band distribution (ratio):")
display(df["Score_band"].value_counts(normalize=True, dropna=False))
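
Because `pd.cut` uses right-closed intervals by default, a score that sits exactly on a boundary falls into the lower band. A small check on toy scores makes the band edges explicit:

```python
import pandas as pd

bins = [0, 6, 7, 8, 10]
labels = ["Low", "Medium", "High", "Top"]

# Toy scores placed on and around the band boundaries.
toy_scores = pd.Series([0.0, 6.0, 7.0, 7.01, 8.0, 10.0])
toy_bands = pd.cut(toy_scores, bins=bins, labels=labels, include_lowest=True)

# Intervals are [0, 6], (6, 7], (7, 8], (8, 10].
print(pd.DataFrame({"Score": toy_scores, "Score_band": toy_bands}))
```

Here 6.0 is "Low", 7.0 is "Medium" and 8.0 is "High"; only scores strictly above 8 land in "Top".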


In [None]:
plt.figure()
df.boxplot(column="Score")
plt.title("Boxplot of Score")
save_plot("score_boxplot")
plt.show()


## 8. Bivariate analysis with Score

We now explore how Score relates to other variables:
- Numeric features (Episodes, Members, Scored By, Favorites)
- Categorical features (Type, Rating, Source, Score_band segments)


### 8.1 Score vs numeric features

In [None]:
for col in ["Episodes", "Members", "Scored By", "Favorites"]:
    if col in df.columns:
        fig = px.scatter(
            df,
            x=col,
            y="Score",
            title=f"Score vs {col}",
            opacity=0.5
        )
        fig.show()


### 8.2 Score by categorical features

In [None]:
for col in ["Type", "Rating", "Source"]:
    if col in df.columns:
        print(f"\nScore by {col}:")
        stats = df.groupby(col)["Score"].describe().sort_values("mean", ascending=False)
        display(stats.head(20))


In [None]:
for col in ["Type", "Rating", "Source"]:
    if col in df.columns:
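        # Pass an explicit Axes so that boxplot(by=...) does not open a second, empty figure
        fig, ax = plt.subplots(figsize=(10, 5))
        df.boxplot(column="Score", by=col, rot=45, ax=ax)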
        plt.suptitle("")
        plt.title(f"Score by {col}")
        plt.ylabel("Score")
        save_plot(f"score_by_{col.lower()}")
        plt.show()


### 8.3 Segmented distribution analysis

We examine how the distribution of key variables changes across important segments,
such as Type and Rating.


In [None]:
# Score distribution by Type (only Types with sufficient sample size)

if {"Score", "Type"} <= set(df.columns):
    df_seg = df.dropna(subset=["Score", "Type"]).copy()
    type_counts = df_seg["Type"].value_counts()
    major_types = type_counts[type_counts > 100].index
    df_seg = df_seg[df_seg["Type"].isin(major_types)]

    print("Types included in segmented analysis (n > 100):")
    display(major_types)

    fig = px.box(
        df_seg,
        x="Type",
        y="Score",
        title="Score distribution by Type",
        labels={"Type": "Anime Type", "Score": "User Score"}
    )
    fig.show()

# log(Members) distribution by Rating

if {"Members", "Rating"} <= set(df.columns):
    df_seg2 = df.dropna(subset=["Members", "Rating"]).copy()
    df_seg2 = df_seg2[df_seg2["Members"] > 0]
    df_seg2["log_Members"] = np.log10(df_seg2["Members"])

    rating_counts = df_seg2["Rating"].value_counts()
    major_ratings = rating_counts[rating_counts > 100].index
    df_seg2 = df_seg2[df_seg2["Rating"].isin(major_ratings)]

    print("Ratings included in segmented analysis (n > 100):")
    display(major_ratings)

    fig = px.box(
        df_seg2,
        x="Rating",
        y="log_Members",
        title="log10(Members) distribution by Rating",
        labels={"Rating": "Age Rating", "log_Members": "log10(Members)"}
    )
    fig.show()


## 9. Multivariate structure and correlation

Here we look at the joint structure of numeric features:
- Correlation matrix
- Simple multicollinearity diagnostics


In [None]:
num_for_corr = df.select_dtypes(include=[np.number]).copy()
corr_matrix = num_for_corr.corr()

print("Correlation matrix (numeric features):")
display(corr_matrix)


In [None]:
plt.figure(figsize=(10, 8))
im = plt.imshow(corr_matrix, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.xticks(range(len(corr_matrix.columns)), corr_matrix.columns, rotation=90)
plt.yticks(range(len(corr_matrix.columns)), corr_matrix.columns)
plt.title("Correlation matrix of numeric features")
plt.tight_layout()
save_plot("corr_matrix_numeric")
plt.show()
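
For the multicollinearity diagnostic listed at the start of this section, one common option is the variance inflation factor (VIF). As a minimal sketch on synthetic data (the column names and the helper `vif_from_correlation` are hypothetical), the VIF of each feature can be read from the diagonal of the inverse correlation matrix:

```python
import numpy as np
import pandas as pd

def vif_from_correlation(frame: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    complete = frame.dropna()
    corr = complete.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns, name="VIF").sort_values(ascending=False)

# Synthetic data: two strongly related engagement-style columns and one independent column.
rng = np.random.default_rng(0)
members = rng.lognormal(mean=8, sigma=1, size=500)
toy = pd.DataFrame({
    "log_members": np.log10(members),
    "log_scored_by": np.log10(members * 0.4) + rng.normal(0, 0.05, 500),
    "episodes": rng.integers(1, 50, 500).astype(float),
})

vif = vif_from_correlation(toy)
print(vif)
```

A frequently used rule of thumb treats VIF values above roughly 5 to 10 as a sign that a feature is largely explained by the others. On the real data, the same helper could be applied to columns such as Members, Scored By and Favorites after a log transform.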


## 10. Genres exploration

We transform the Genres string column into a list of genres and corresponding dummy variables, then study:
- Genre frequencies
- Score by genre
- Distribution of number of genres per title


In [None]:
def split_genres(value):
    if pd.isna(value):
        return []
    parts = [g.strip() for g in str(value).split(",")]
    return [p for p in parts if p]

df["Genres_list"] = df["Genres"].apply(split_genres)
all_genres = sorted(set(chain.from_iterable(df["Genres_list"])))
print("Number of unique genres:", len(all_genres))
print("Sample genres:", all_genres[:20])


In [None]:
for g in all_genres:
    col_name = f"Genre_{re.sub(r'[^0-9A-Za-z]+', '_', g)}"
    df[col_name] = df["Genres_list"].apply(lambda lst, genre=g: int(genre in lst))

genre_cols = [c for c in df.columns if c.startswith("Genre_")]
print("Number of genre dummy columns:", len(genre_cols))


In [None]:
genre_counts = df[genre_cols].sum().sort_values(ascending=False)
print("Top 30 genres by frequency:")
display(genre_counts.head(30))


In [None]:
genre_mean_score = pd.Series(
    {g: df.loc[df[g] == 1, "Score"].mean() for g in genre_cols}
).sort_values(ascending=False)

print("Top 20 genres by mean Score:")
display(genre_mean_score.head(20))

print("\nBottom 20 genres by mean Score:")
display(genre_mean_score.tail(20))


In [None]:
df["n_genres"] = df["Genres_list"].apply(len)

print("Number of genres per title summary:")
display(df["n_genres"].describe())

plt.figure()
df["n_genres"].hist(bins=range(1, 15))
plt.xlabel("Number of genres")
plt.ylabel("Count")
plt.title("Distribution of number of genres per title")
save_plot("dist_n_genres")
plt.show()


### 10.4 Genre co-occurrence and association analysis

We build a title-by-genre incidence matrix for the most common genres and compute
pairwise correlations between the genre indicators to understand which genres tend to appear together.


In [None]:
# Build a title-genre incidence matrix

if "Genres_list" in df.columns:
    # Explode genres by title; titles with an empty genre list yield NaN and are dropped
    df_genres_long = df[["anime_id", "Genres_list"]].explode("Genres_list")
    df_genres_long = df_genres_long.dropna(subset=["Genres_list"])
    df_genres_long["Genres_list"] = df_genres_long["Genres_list"].astype(str).str.strip()
    df_genres_long = df_genres_long[df_genres_long["Genres_list"] != ""]

    co_occurrence = pd.crosstab(df_genres_long["anime_id"], df_genres_long["Genres_list"])

    # Focus on top genres to keep the matrix interpretable
    top_genres = co_occurrence.sum().nlargest(15).index
    co_occurrence_top = co_occurrence[top_genres]

    genre_corr = co_occurrence_top.corr()

    print("Top genres used in co-occurrence matrix:")
    display(top_genres)

    fig = px.imshow(
        genre_corr,
        text_auto=True,
        aspect="auto",
        title="Co-occurrence correlation of top 15 genres"
    )
    fig.show()
else:
    print("Genres_list not available - ensure genre preprocessing section has been executed.")


## 11. Diagnostic EDA by year and studio

We perform segment-based analysis to understand temporal and studio-related patterns.


In [None]:
# Year extraction: take the first four-digit number in Aired (the start year)

if "Aired" in df.columns:
    df["Year"] = df["Aired"].str.extract(r"(\d{4})", expand=False).astype(float)

    print("Year value counts (first 20 years):")
    display(df["Year"].value_counts().sort_index().head(20))

    year_stats = df.groupby("Year")["Score"].agg(["count", "mean"]).dropna()
    print("\nScore by Year (head):")
    display(year_stats.head(20))

    plt.figure(figsize=(10, 5))
    year_stats["mean"].plot()
    plt.ylabel("Mean Score")
    plt.title("Mean Score by Year")
    save_plot("score_by_year_mean")
    plt.show()

    plt.figure(figsize=(10, 5))
    year_stats["count"].plot()
    plt.ylabel("Number of titles")
    plt.title("Number of anime titles by Year")
    save_plot("titles_by_year_count")
    plt.show()


In [None]:
# Studio primary

if "Studios" in df.columns:
    df["Studios_clean"] = df["Studios"].fillna("Unknown")
    df["Studio_primary"] = df["Studios_clean"].apply(lambda x: str(x).split(",")[0].strip())

    studio_stats = (
        df.groupby("Studio_primary")
        .agg(
            count=("Score", "count"),
            mean_score=("Score", "mean"),
        )
        .sort_values("count", ascending=False)
    )

    print("Top 20 studios by number of titles:")
    display(studio_stats.head(20))

    min_titles = 20
    big_studios = studio_stats[studio_stats["count"] >= min_titles].sort_values("mean_score", ascending=False)

    print(f"\nStudios with at least {min_titles} titles, sorted by mean Score:")
    display(big_studios.head(20))

    plt.figure(figsize=(10, 6))
    big_studios.head(15)["mean_score"].plot(kind="bar")
    plt.ylabel("Mean Score")
    plt.title(f"Top studios by mean Score (count >= {min_titles})")
    save_plot("studios_top_mean_score")
    plt.show()


## 12. Popularity and engagement metrics

We inspect standard popularity metrics and their relation to Score.


In [None]:
pop_cols = ["Score", "Members", "Scored By", "Favorites"]
existing_pop = [c for c in pop_cols if c in df.columns]

if len(existing_pop) == len(pop_cols):
    print("Correlation matrix for popularity metrics:")
    display(df[pop_cols].corr())

    fig = px.scatter(df, x="Members", y="Score", title="Score vs Members", opacity=0.5)
    fig.show()

    fig = px.scatter(df, x="Favorites", y="Score", title="Score vs Favorites", opacity=0.5)
    fig.show()

    print("High Score but relatively low Members (top 10):")
    high_score = df[df["Score"] >= 8.5]
    display(high_score.sort_values("Members").head(10)[["Name", "Score", "Members"]])

    print("\nTop 10 titles by Members:")
    display(df.sort_values("Members", ascending=False).head(10)[["Name", "Score", "Members"]])


## 13. Feature preparation summary

We summarize candidate features that are ready or nearly ready for modeling.
This is not a modeling step but a bridge between EDA and model development.


In [None]:
genre_cols = [c for c in df.columns if c.startswith("Genre_")]

feature_summary = {
    "numeric_features": ["Episodes", "Members", "Scored By", "Favorites", "Year"],
    "categorical_features": ["Type", "Rating", "Source", "Studio_primary"],
    "genre_features": genre_cols,
    "target": "Score",
}

print("Feature summary:")
for key, value in feature_summary.items():
    if isinstance(value, list):
        existing = [c for c in value if c in df.columns]
        print(f"{key}: {len(existing)} columns used")
    else:
        print(f"{key}: {value}")
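
To connect this summary to the conceptual model from the introduction (Score = a*Type + b*Episodes + c*Genre_Action + ... + constant + error), the sketch below builds a small design matrix and fits it by ordinary least squares. The toy rows are invented for illustration and are not taken from the dataset. The choice of `np.linalg.lstsq` is an assumption, not the notebook's modeling method.

```python
import numpy as np
import pandas as pd

# Invented toy rows; the real notebook would use df and the features listed above.
toy = pd.DataFrame({
    "Type": ["TV", "Movie", "TV", "OVA", "Movie", "TV"],
    "Episodes": [12, 1, 24, 2, 1, 13],
    "Genre_Action": [1, 0, 1, 0, 1, 0],
    "Score": [7.5, 8.0, 7.9, 6.5, 8.2, 7.0],
})

# One-hot encode Type; drop_first avoids perfect collinearity with the constant.
X = pd.get_dummies(
    toy[["Type", "Episodes", "Genre_Action"]],
    columns=["Type"],
    drop_first=True,
    dtype=float,
)
X.insert(0, "const", 1.0)
y = toy["Score"].to_numpy()

# Ordinary least squares: minimize ||X @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
fitted = pd.Series(coef, index=X.columns)
print(fitted)
```

With six rows and five coefficients the fit itself is meaningless; the point is only the shape of the design matrix that the prepared features would feed into.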
