## **I. Introduction**
**1. Narrative Context**

In the previous stage, we took on the role of a Producer seeking the formula behind high-scoring anime. However, our first obstacle was immediate and unavoidable: the Raw Dataset was in poor condition. It contained missing values, unstructured text, inconsistent formats, and fragmented categories — making meaningful analysis impossible.

This notebook presents the next stage of the workflow: **the transformation**.
After applying a complete Data Processing Pipeline to clean, standardize, and engineer features, we now visualize the impact of that transformation. This notebook serves as evidence of how structured data enables structured insights.

**2. Objective of This Analysis**

The focus of this notebook is **Comparative Analysis**.
For every major feature, we address two key questions:

1. **The Transformation Insight** — How did the feature evolve from **Raw** to **Processed**?
This shows why Data Preparation is not optional but foundational.

2. **The Business Insight** — Once the data is clean, what does it reveal about the drivers of **high Score**?

In short, the purpose is not only to clean the data, but to demonstrate how cleaning transforms noise into clarity and reveals the signals that matter.

**3. Analytical Roadmap**

This transformation is explored through four structured themes:

1. **The Foundation – Target Variable (Score)**
   Revealing the true distribution and statistical behavior of the target metric.

2. **Theme A – Market Factors**
   Understanding how Media Type (TV, Movie, OVA…) and Source Material (Manga, Original, Game…) influence performance.

3. **Theme B – Creative Factors**
   Unpacking Genres, Producers, and Studios to identify collaboration patterns, specialization strengths, and creative drivers.

4. **Theme C – Release Strategy**
   Converting unstructured Aired dates, Duration formats, and Episode structures into analyzable fields to uncover timing and format advantages.

**4. Deep Business Insights → Strategic Recommendations**

At the end of this notebook, we consolidate the strongest patterns discovered across all themes to form **Strategic Recommendations** for creating or selecting high-scoring anime.
These insights connect data evidence to real-world decision-making — guiding choices about **content strategy**, **production planning**, **studio partnerships**, and **release timing**.




In [83]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import pandas as pd
import numpy as np
import joblib
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
import ast
pio.templates.default = "plotly_white"

In [84]:
path_raw = r'..\data\raw\anime-dataset-2023.csv'
df_anime_dataset_2023 = pd.read_csv(path_raw)

path = r'..\data\processed\prepared_data.csv'
df_anime_dataset_2023_prep = pd.read_csv(path)

## **II. Target Variable: Score**

### **1. Issue Overview**


### **2. Solution**


### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**


In [85]:
import plotly.express as px
from plotly.subplots import make_subplots
import pandas as pd

# --- Prepare Data ---
df_raw_viz = df_anime_dataset_2023.copy()
df_raw_viz['Score'] = df_raw_viz['Score'].astype(str)

unknown_count = df_raw_viz[df_raw_viz['Score']=='UNKNOWN'].shape[0]

score_counts = df_raw_viz['Score'].value_counts().reset_index()
score_counts.columns = ['Score','Frequency']
score_counts['Color'] = score_counts['Score'].apply(lambda x: 'Misleading (Unknown)' if x=='UNKNOWN' else 'Valid Score')
top_scores = score_counts.head(15)

# --- Create Subplots ---
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Broken Histogram: Raw Score</b>", "<b>Top 15 Scores (Unknown Highlighted)</b>"),
    horizontal_spacing=0.15  # spacing between the two charts
)

# --- Chart 1: Histogram ---
hist_fig = px.histogram(df_raw_viz, x='Score', color_discrete_sequence=["#7814cf"])
for trace in hist_fig.data:
    fig.add_trace(trace, row=1, col=1)

fig.add_annotation(
    x='UNKNOWN', y=unknown_count,
    text=f"<b>UNKNOWN: {unknown_count}</b><br>(Dominates the data)",
    showarrow=True, arrowhead=2, ax=0, ay=-40,
    font=dict(color="#C3122F", size=14), arrowcolor="#C3122F",
    row=1, col=1
)

fig.update_xaxes(type='category', row=1, col=1, showticklabels=False)
fig.update_yaxes(title_text="Count", row=1, col=1)

# --- Chart 2: Horizontal Bar ---
bar_fig = px.bar(
    top_scores,
    x='Frequency',
    y='Score',
    orientation='h',
    color='Color',
    color_discrete_map={'Misleading (Unknown)': '#C3122F', 'Valid Score': '#95a5a6'},
    text='Frequency'
)
for trace in bar_fig.data:
    fig.add_trace(trace, row=1, col=2)

# --- Critical Settings to keep text outside ---
fig.update_traces(textposition='outside', textfont=dict(size=14, color='black'), row=1, col=2)

fig.update_yaxes(categoryorder='total ascending', row=1, col=2, showgrid=False, showline=False)
fig.update_xaxes(showticklabels=False, showgrid=False, zeroline=False, showline=False,
                 row=1, col=2, automargin=True, range=[0, max(top_scores['Frequency'])*1.3])

# --- General layout ---
fig.update_layout(
    template='plotly_white',
    showlegend=False,
    height=500,
    width=1300,  # increase the overall width
    margin=dict(l=60, r=150, t=80, b=50),
    title_text="<b>Raw Score Analysis: Broken vs Unknown Dominance</b>"
)

fig.show()


# 1. Drop NaN just for plotting (Storytelling purpose: Show the valid distribution)
# In prepared_data.csv, Score is already float, but might have NaNs. 
# We visualize the AVAILABLE valid scores.
valid_scores = df_anime_dataset_2023_prep['Score'].dropna()

# 2. Calculate Statistics for Annotation
mean_score = valid_scores.mean()
median_score = valid_scores.median()

# 3. Plot Histogram
fig = px.histogram(
    x=valid_scores,
    nbins=40, # Granular bins
    title='<b>True Anime Score Distribution (Cleaned)</b><br><i>(Gaussian-like distribution after removing noise)</i>',
    color_discrete_sequence=["#7C2EC1"], # Purple
    opacity=0.8
)

# 4. Add Mean Line (Vertical Line)
fig.add_vline(
    x=mean_score, 
    line_width=3, 
    line_dash="dash", 
    line_color="#C3122F",
    annotation_text=f"Mean: {mean_score:.2f}", 
    annotation_position="top right"
)

# 5. Declutter
fig.update_layout(
    template='plotly_white',
    xaxis_title="Score (0-10)",
    yaxis_title="Frequency",
    bargap=0.1,
    height=500
)

fig.show()

##### **The Transformation Insight:** 

**The Misleading View:** The raw data suggested that a large portion of anime had undefined or unknown scores, obscuring the true distribution and making it impossible to assess typical performance.

**The Truth:** The clean data reveals a roughly Gaussian distribution, with most scores clustering around 6.38, and extreme low or high scores being rare.

**The Strategic Value:** Instead of noise, we now have a reliable numeric feature to analyze how Score varies across Types, Sources, and other factors, enabling accurate business insights.
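
The conversion from a text column polluted with `UNKNOWN` to a numeric feature can be sketched as follows. This is a minimal illustration on invented toy values, not the actual Data Processing Pipeline (which is not shown in this notebook); it assumes the placeholder is the literal string `UNKNOWN` and uses `pd.to_numeric` with `errors="coerce"`.

```python
import pandas as pd

# Hypothetical toy column mimicking the raw 'Score' field (values invented for illustration)
raw_score = pd.Series(["8.75", "UNKNOWN", "6.4", "UNKNOWN", "7.1"])

# Coerce to numeric: anything that cannot be parsed (e.g. 'UNKNOWN') becomes NaN
clean_score = pd.to_numeric(raw_score, errors="coerce")

# Count what was lost and summarize what remains
n_missing = int(clean_score.isna().sum())
mean_valid = clean_score.mean()  # NaN values are skipped by default

print(f"Missing after coercion: {n_missing}")
print(f"Mean of valid scores: {mean_valid:.2f}")
```

Once the column is numeric, histograms, means, and medians operate on real values instead of treating every distinct string as its own category.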

## **III. Theme A — Market Factors (Type & Source)**

### **1. Issue Overview**


### **2. Solution**


### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**

##### **The Transformation Insight:**
Apply the following template:

The Misleading View: The raw data suggested that [Misleading Insight] (or obscured the pattern due to [Specific Noise]).

The Truth: The clean data corrects this by revealing that [Correct Insight/Real Pattern].

The Strategic Value: Instead of noise, we now have a reliable feature to analyze [How this feature impacts Score].

### **4. Business Insights**

#### **A. [Name]**
**Question**:

**Insight:**

**Business Takeaway:**

## **IV. Theme B — Creative & Production Factors (Genres, Studios, Producers)**



### **1. Issue Overview**

During the **Exploratory Data Analysis (EDA)** phase, we identified common **data quality issues** across `Genres`, `Studios`, and `Producers`:

*   **Missing Values:** A portion of the dataset contained **null** or empty entries.

*   **Aggregated Lists (Non-Atomic Data):** Values were stored as **stringified lists** (e.g., `['Action', 'Comedy']`) within a single cell. Performing a direct count on this format would incorrectly tally **specific combinations** rather than the frequency of individual items, leading to inaccurate statistical results.
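
To make the miscount concrete, here is a small sketch on an invented toy column (the values are hypothetical, not taken from the dataset). It compares a direct `value_counts()` with the number of rows that actually mention a given genre.

```python
import pandas as pd

# Hypothetical toy column of stringified lists (values invented for illustration)
toy = pd.Series([
    "['Action', 'Comedy']",
    "['Action']",
    "['Action', 'Comedy']",
    "['Comedy', 'Drama']",
])

# Direct count: each distinct string (i.e. each combination) becomes its own category
combo_counts = toy.value_counts()

# Rows that mention 'Action' at all vs. rows tallied under the exact label "['Action']"
rows_with_action = int(toy.str.contains("'Action'", regex=False).sum())
rows_counted_as_action = int(combo_counts.get("['Action']", 0))

print(combo_counts)
print(rows_with_action, rows_counted_as_action)
```

In this toy case, "Action" appears in three rows but the direct count credits it with only one, because the other two are filed under the combination.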

### **2. Solution**

To resolve these inconsistencies, we implemented the following **data cleaning pipeline**:

*   **Handling Nulls:** Missing entries were temporarily labeled as **"UNKNOWN"** to track data completeness before being filtered out for detailed analysis.

*   **Data Normalization:** We transformed the text data using the following steps:
    *  **Regex Cleaning:** Used regular expressions to remove formatting characters like **brackets** (`[]`) and **quotes** (`''`).
    *  **Split & Explode:** Split the comma-separated strings and **expanded** (exploded) the list so that each individual element occupies its own row.
    *  **Trimming:** Stripped leading and trailing **whitespace** to ensure data consistency.
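
The steps above can be sketched as one reusable helper. This is an illustrative sketch, not the notebook's actual pipeline code: the function name `explode_list_column`, the `PLACEHOLDERS` set, and the toy data are assumptions introduced here.

```python
import pandas as pd

# Assumed placeholder labels to drop after cleaning (compared in upper case)
PLACEHOLDERS = {"UNKNOWN", "NONE", "NAN", "NULL", ""}

def explode_list_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of df with one row per individual item in `col`."""
    out = df.copy()
    out[col] = (
        out[col]
        .fillna("UNKNOWN")                          # label nulls so they can be filtered later
        .astype(str)
        .str.replace(r"[\[\]'\"]", "", regex=True)  # regex cleaning: brackets and quotes
        .str.split(",")                             # split on commas
    )
    out = out.explode(col)                          # one item per row
    out[col] = out[col].str.strip()                 # trim whitespace
    return out[~out[col].str.upper().isin(PLACEHOLDERS)]

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "Genres": ["['Action', 'Comedy']", "UNKNOWN", " Comedy ,Drama", None],
    "Score": [7.5, 6.0, 8.0, 5.0],
})
exploded = explode_list_column(toy, "Genres")
genre_counts = exploded["Genres"].value_counts()
print(genre_counts)
```

Because `explode` repeats the other columns for each item, a title's `Score` stays attached to every genre it belongs to, which is what later per-genre averages rely on.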

### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**

#### **3.1. Genres**

In [86]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- A. RAW DATA (LEFT CHART) ---
# Count occurrences and take Top 10
genre_counts_raw = df_anime_dataset_2023['Genres'].value_counts().head(10).reset_index()
genre_counts_raw.columns = ['Genres', 'Count']

# Sort by Count ascending (for horizontal bar chart to show largest at top)
genre_counts_raw = genre_counts_raw.sort_values('Count', ascending=True)

# Define Color Logic for Raw Data:
# - Red (#C3122F) for Noise (Combinations with ',' or 'UNKNOWN')
# - Light Grey (#D3D3D3) for Neutral Context (Single valid genres in raw data)
colors_raw = [
    '#C3122F' if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else '#D3D3D3'
    for x in genre_counts_raw['Genres']
]

# --- B. CLEAN DATA (RIGHT CHART) ---
# Create a copy and ensure string format
df_genres_exploded = df_anime_dataset_2023_prep[['Genres']].copy()
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].astype(str)

# REGEX CLEANING: Remove brackets and quotes
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE: Separate genres and create individual rows
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.split(',')
df_genres_exploded = df_genres_exploded.explode('Genres')

# TRIM: Remove leading/trailing whitespace
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.strip()

# FILTER: Remove garbage values (Noise)
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_genres_exploded = df_genres_exploded[~df_genres_exploded['Genres'].str.upper().isin(exclude_list)]
df_genres_exploded = df_genres_exploded[df_genres_exploded['Genres'] != ""]

# Count occurrences and take Top 10
genre_counts_clean = df_genres_exploded['Genres'].value_counts().head(10).reset_index()
genre_counts_clean.columns = ['Genres', 'Count']
genre_counts_clean = genre_counts_clean.sort_values('Count', ascending=True)

# Define Color Logic for Clean Data:
# - Vibrant Pink (#ea568c) for Valid Data
colors_clean = ['#ea568c'] * len(genre_counts_clean)


# ==========================================
# 2. VISUALIZATION (PHASE 2: TECHNICAL COMPARISON)
# ==========================================

# Initialize Subplots (1 Row, 2 Columns)
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Genres Raw Data Distribution</b>", "<b>Genres Cleaned Data Distribution</b>"), 
    horizontal_spacing=0.15 # Adjust spacing to prevent label overlap
)

# --- TRACE 1: RAW DATA (LEFT) ---
fig.add_trace(
    go.Bar(
        y=genre_counts_raw['Genres'],
        x=genre_counts_raw['Count'],
        orientation='h',
        marker_color=colors_raw,
        text=genre_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}', # Format: 15k, 2.5k
        hovertemplate='<b>Raw Genre:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False 
    ),
    row=1, col=1
)

# --- TRACE 2: CLEAN DATA (RIGHT) ---
fig.add_trace(
    go.Bar(
        y=genre_counts_clean['Genres'],
        x=genre_counts_clean['Count'],
        orientation='h',
        marker_color=colors_clean, # Pink for Valid Data
        text=genre_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}', # Format: 15k, 2.5k
        hovertemplate='<b>Clean Genre:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False # Rule: 1-2 traces -> hide legend
    ),
    row=1, col=2
)


# ==========================================
# 3. LAYOUT SETTINGS (STRICT COMPLIANCE)
# ==========================================

fig.update_layout(
    # General Settings
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    
    # Margin & Dimensions
    margin=dict(l=40, r=40, t=100, b=60),
    width=1550,
    height=550,
    
    # Backgrounds
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # Title Convention (Position x=0.02, 2-line structure)
    title=dict(
        text="<b>Removing Aggregation & 'Unknown' Values Unlocks True Genres Distribution</b><br><span style='color:#555555; font-size:14px'>(Top 10 Most Aired Genres By Number of Anime)</span>",
        x=0.02,
        y=0.95, # Adjust vertical position slightly to fit top margin
        xanchor='left',
        yanchor='top'
    )
)

# --- AXIS SETTINGS (DECLUTTER) ---
# Rule: Hide Gridlines, Hide X-axis Labels (focus on Shape/Trend + Direct Labeling)

# Update X-axes (Values)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

# Update Y-axes (Categories)
# Keep labels for readability, no grid
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", row=1, col=2)

fig.show()

#### **3.2. Producers**

In [87]:
# ==========================================
# 1. DATA PREPARATION (LOGIC)
# ==========================================

# --- A. RAW DATA PREP (LEFT CHART) ---
# Count occurrences
producer_counts_raw = df_anime_dataset_2023['Producers'].value_counts().head(15).reset_index()
producer_counts_raw.columns = ['Producers', 'Count']
# Sort by Count ascending for horizontal bar chart
producer_counts_raw = producer_counts_raw.sort_values('Count', ascending=True)

# --- COLOR LOGIC: RAW DATA ---
# Rule: Use Red (#C3122F) if the data is "Complex" (contains ',') or "Unknown"
# Rule: Use Light Grey (#D3D3D3) for simple/neutral entries
colors_raw = [
    '#C3122F' if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else '#D3D3D3'
    for x in producer_counts_raw['Producers']
]

# --- B. CLEAN DATA PREP (RIGHT CHART) ---
# Create a copy and ensure string format
df_producers_clean = df_anime_dataset_2023_prep[['Producers']].copy()
df_producers_clean['Producers'] = df_producers_clean['Producers'].astype(str)

# REGEX: Remove brackets and quotes
df_producers_clean['Producers'] = df_producers_clean['Producers'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE: "Aniplex, Sony Music" -> 2 separate rows
df_producers_clean['Producers'] = df_producers_clean['Producers'].str.split(',')
df_producers_clean = df_producers_clean.explode('Producers')

# TRIM & FILTER: Remove whitespace and garbage values
df_producers_clean['Producers'] = df_producers_clean['Producers'].str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_producers_clean = df_producers_clean[~df_producers_clean['Producers'].str.upper().isin(exclude_list)]
df_producers_clean = df_producers_clean[df_producers_clean['Producers'] != ""]

# Aggregation for Clean Data (Top 15)
producer_counts_clean = df_producers_clean['Producers'].value_counts().head(15).reset_index()
producer_counts_clean.columns = ['Producers', 'Count']
producer_counts_clean = producer_counts_clean.sort_values('Count', ascending=True)

# --- COLOR LOGIC: CLEAN DATA ---
# Rule: Use Vibrant Pink (#ea568c) for Valid/Clean Data
colors_clean = ['#ea568c'] * len(producer_counts_clean)

# ==========================================
# 2. VISUALIZATION (TECHNICAL COMPARISON)
# ==========================================

# Initialize Subplots
# Rule: rows=1, cols=2
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Producers Raw Data Distribution</b>", "<b>Producers Cleaned Data Distribution</b>"),
    horizontal_spacing=0.15 
)

# --- TRACE 1: RAW DATA (LEFT) ---
fig.add_trace(
    go.Bar(
        y=producer_counts_raw['Producers'],
        x=producer_counts_raw['Count'],
        orientation='h',
        marker_color=colors_raw, 
        
        # Direct Labeling
        text=producer_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Raw Producer:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False 
    ),
    row=1, col=1
)

# --- TRACE 2: CLEAN DATA (RIGHT) ---
fig.add_trace(
    go.Bar(
        y=producer_counts_clean['Producers'],
        x=producer_counts_clean['Count'],
        orientation='h',
        marker_color=colors_clean, 
        
        # Direct Labeling
        text=producer_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Producer:</b> %{y}<br><b>Projects:</b> %{x:,}<extra></extra>',
        
        showlegend=False
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS 
# ==========================================

fig.update_layout(
    # --- FONTS & GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=1550,
    height=550,
    
    # --- BACKGROUND ---
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Parsing Producers Data Reveals True Market Leaders</b><br><span style='color:#555555; font-size:14px'>(Top 15 Most Active Producers)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# ==========================================
# 4. AXIS DECLUTTERING
# ==========================================
# Rule: Hide Gridlines, Hide X-axis Labels (Use Direct Labeling), Keep Y-axis Labels

# Update X-axes
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

# Update Y-axes
# Rule: Use 'automargin=True' to ensure long producer names are not cropped
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=2)

fig.show()

#### **3.3. Studios**

In [88]:
# ==========================================
# 1. DATA PROCESSING
# ==========================================

# Create a copy and clean specific characters
df_studios = df_anime_dataset_2023_prep[['Studios']].copy()
df_studios['Studios'] = df_studios['Studios'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)

# Function to classify studio types
def classify_studio(studio_str):
    s = studio_str.strip().upper()
    if s in ["UNKNOWN", "NAN", "","NULL"]:
        return 'UNKNOWN'
    if ',' in studio_str:
        return 'Collaboration'
    return 'Solo'

# Apply classification and count frequencies
df_studios['Production_Type'] = df_studios['Studios'].apply(classify_studio)
studio_counts = df_studios['Production_Type'].value_counts().reset_index()
studio_counts.columns = ['Type', 'Count']
total_projects = studio_counts['Count'].sum()

# Sort Data: Largest to Smallest to ensure consistent clockwise rendering
studio_counts = studio_counts.sort_values(by='Count', ascending=False)

# ==========================================
# 2. COLOR STRATEGY & MAPPING
# ==========================================
color_map = {
    'UNKNOWN': '#C3122F',
    'Collaboration': '#ea568c',
    'Solo': '#D3D3D3'
}

# Map colors to the sorted dataframe
colors = [color_map[t] for t in studio_counts['Type']]

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = go.Figure(data=[go.Pie(
    labels=studio_counts['Type'],
    values=studio_counts['Count'],
    
    # --- DONUT SETTINGS ---
    hole=0.4, # Creates the donut shape
    
    # --- COLOR & BORDER ---
    marker=dict(colors=colors, line=dict(color='white', width=2)),
    
    # --- LABELS & FORMATTING ---
    # Use 'percent' for the chart slices to avoid clutter
    textinfo='percent',
    textposition='inside',
    textfont=dict(family='Tahoma', size=13),
    
    # --- SLICE ORDER (keep the pre-sorted order, drawn clockwise) ---
    sort=False,
    direction='clockwise'
)])

# ==========================================
# 4. LAYOUT SETTINGS (STRICT COMPLIANCE)
# ==========================================

fig.update_layout(
    # GENERAL SETTINGS
    font=dict(family="Tahoma", size=13, color="#000000"),
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    
    # BACKGROUND 
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # TITLE 
    title=dict(
        text="<b>Raw Data: UNKNOWN Dominance and Aggregation </b>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),
    
    # LEGEND
    showlegend=True,
    legend=dict(
        x=1.02, 
        y=0.98, 
        xanchor='left', 
        yanchor='top',
        bgcolor='rgba(0,0,0,0)', # Transparent background
        font=dict(family='Tahoma', size=12)
    ),
    
    # CENTRAL ANNOTATION 
    # Displaying the Total count in the center of the Donut
    annotations=[dict(
        text=f'<b>{total_projects:,.0f}</b><br><span style="font-size:11px; color:gray">Total</span>',
        x=0.5, y=0.5, 
        font=dict(family='Tahoma', size=18, color='black'),
        showarrow=False
    )]
)

fig.show()


In [89]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- A. RAW DATA (LEFT) ---
# Count occurrences and take Top 10
studio_counts_raw = df_anime_dataset_2023['Studios'].value_counts().head(10).reset_index()
studio_counts_raw.columns = ['Studios', 'Count']
studio_counts_raw = studio_counts_raw.sort_values('Count', ascending=True)

# Color Logic: Red (#C3122F) for "UNKNOWN" OR "Collaborations" (Commas), else Grey (#D3D3D3)
colors_raw = [
    '#C3122F' if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else '#D3D3D3'
    for x in studio_counts_raw['Studios']
]

# --- B. CLEAN DATA (RIGHT) ---
df_studios_clean = df_anime_dataset_2023_prep[['Studios']].copy()
df_studios_clean['Studios'] = df_studios_clean['Studios'].astype(str)
df_studios_clean['Studios'] = df_studios_clean['Studios'].str.replace(r"[\[\]\'\"]", "", regex=True)
df_studios_clean['Studios'] = df_studios_clean['Studios'].str.split(',')
df_studios_clean = df_studios_clean.explode('Studios')
df_studios_clean['Studios'] = df_studios_clean['Studios'].str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studios_clean = df_studios_clean[~df_studios_clean['Studios'].str.upper().isin(exclude_list)]
df_studios_clean = df_studios_clean[df_studios_clean['Studios'] != ""]

# Aggregation
studio_counts_clean = df_studios_clean['Studios'].value_counts().head(10).reset_index()
studio_counts_clean.columns = ['Studios', 'Count']
studio_counts_clean = studio_counts_clean.sort_values('Count', ascending=True)

# Color Logic: Pink (#ea568c) for Clean Data
colors_clean = ['#ea568c'] * len(studio_counts_clean)

# ==========================================
# 2. VISUALIZATION (SUBPLOT STRATEGY)
# ==========================================

# Rule: Use make_subplots(rows=1, cols=2)
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Studios Raw Data Distribution</b>", "<b>Studios Cleaned Data Distribution</b>"),
    horizontal_spacing=0.2 # spacing to prevent long label overlap
)

# --- TRACE 1: RAW DATA (LEFT) ---
fig.add_trace(
    go.Bar(
        y=studio_counts_raw['Studios'],
        x=studio_counts_raw['Count'],
        orientation='h',
        marker_color=colors_raw,
        text=studio_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Raw Entry:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=1
)

# --- TRACE 2: CLEAN DATA (RIGHT) ---
fig.add_trace(
    go.Bar(
        y=studio_counts_clean['Studios'],
        x=studio_counts_clean['Count'],
        orientation='h',
        marker_color=colors_clean,
        text=studio_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Studio:</b> %{y}<br><b>Total Projects:</b> %{x:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS (DECLUTTER & FORMAT)
# ==========================================

fig.update_layout(
    # General
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    margin=dict(l=40, r=40, t=100, b=60),
    width=1500,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # Title Convention
    title=dict(
        text="<b>Parsing Studio Data Reveals True Studios Market Leaders</b><br><span style='color:#555555; font-size:14px'>(Top 10 Most Active Studios)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# --- AXIS SETTINGS ---
# Rule: Hide X-axis labels (use direct labeling), Hide Gridlines
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

# Rule: Keep Y-axis labels, ensure they fit (automargin)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=2)

fig.show()

##### **The Transformation Insight:**

*   **The Misleading View:** The raw data suggested that **"UNKNOWN"** was the **dominant category** across all features (leading with **13k** in Producers, **11k** in Studios, and **4.9k** in Genres), effectively **masking the actual market structure**. Additionally, **aggregated text** (e.g., "Action, Comedy" or "Madhouse, MAPPA") **fragmented the data**, preventing individual entities from receiving proper credit for their works.

*   **The Truth:** The clean data corrects this by revealing the **true market leaders** that were previously hidden by noise. By **splitting collaborations** and **removing unknown credits**, we identified that **Comedy** dominates genres (7.1k), **NHK** leads producers, and **Toei Animation** is the top studio. Crucially, this process allows us to **accurately record the total volume of works** for each entity, ensuring that projects are **correctly attributed** rather than lost in combined strings.

*   **The Strategic Value:** Instead of noise, we now have **precise, reliable features** to analyze **how Genres, Producers, and Studios drive anime Scores**.

### **4. Business Insights**

#### **A. Genres Score Analysis** 
**Question**: Beyond the popular genres with many released titles, **which less common genres earn high average scores**?

In [90]:
# ==========================================
# 1. STEP A: IDENTIFY TOP 10 POPULAR GENRES
# ==========================================
df_pop = df_anime_dataset_2023_prep[['Genres']].copy()
df_pop['Genres'] = df_pop['Genres'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)
df_pop['Genres'] = df_pop['Genres'].str.split(',')
df_pop = df_pop.explode('Genres')
df_pop['Genres'] = df_pop['Genres'].str.strip()

exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_pop = df_pop[~df_pop['Genres'].str.upper().isin(exclude_list)]
df_pop = df_pop[df_pop['Genres'] != ""]
# Exclude Award Winning
df_pop = df_pop[df_pop['Genres'] != 'Award Winning']

top_10_popular_list = df_pop['Genres'].value_counts().head(10).index.tolist()

# ==========================================
# 2. STEP B: CALCULATE TOP 10 HIGHEST RATED
# ==========================================
df_score = df_anime_dataset_2023_prep[['Genres', 'Score']].copy()
df_score = df_score.dropna(subset=['Score'])
df_score['Genres'] = df_score['Genres'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)
df_score['Genres'] = df_score['Genres'].str.split(',')
df_score = df_score.explode('Genres')
df_score['Genres'] = df_score['Genres'].str.strip()
df_score = df_score[~df_score['Genres'].str.upper().isin(exclude_list)]
df_score = df_score[df_score['Genres'] != ""]
# Exclude Award Winning
df_score = df_score[df_score['Genres'] != 'Award Winning']

# Aggregation
genre_avg_score = (
    df_score.groupby('Genres')
    .agg(
        average_score=('Score', 'mean'), 
        anime_count=('Score', 'count')
    )
    .reset_index()
)

# Keep only genres with at least 50 scored titles
genre_avg_score = genre_avg_score[genre_avg_score['anime_count'] >= 50]

# Take the Top 10
top_genres_score = genre_avg_score.sort_values('average_score', ascending=True).tail(10)

# ==========================================
# 3. STEP C: COLOR LOGIC
# ==========================================
colors = []
for genre in top_genres_score['Genres']:
    if genre in top_10_popular_list:
        colors.append('#D3D3D3') # Mainstream
    else:
        colors.append('#6A1B9A') # Niche Gem

# ==========================================
# 4. VISUALIZATION
# ==========================================
fig = go.Figure(go.Bar(
    y=top_genres_score['Genres'],
    x=top_genres_score['average_score'],
    orientation='h',
    marker_color=colors,
    text=top_genres_score['average_score'],
    textposition='outside',
    texttemplate='%{x:.2f}',
    hovertemplate='<b>Genre:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Animes:</b> %{customdata}<extra></extra>',
    customdata=top_genres_score['anime_count']
))

# ==========================================
# 5. LAYOUT SETTINGS
# ==========================================
fig.update_layout(
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="y unified",
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    title=dict(
        text="<b>Niche Genres Dominate Audience Satisfaction Rankings</b><br><span style='color:#555555; font-size:14px'>(Purple: Niche Genres | Grey: Mainstream Genres found in Top 10 Most Aired)</span>",
        x=0.02, y=0.95, xanchor='left', yanchor='top'
    ),
    showlegend=False
)

fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ")

fig.show()

**Insight:**

* **The Niche Score Premium and Mainstream Risk**

    The data suggests a **score premium** for **Niche Genres**, plausibly driven by the high **satisfaction** of their dedicated fanbases. Genres like **Mystery** (7.00) and **Suspense** (6.96) occupy the top ranks, outperforming **Mainstream Genres** such as **Drama** (6.85) and **Action** (6.67). Conversely, the lower scores for **Action** and **Adventure** (both 6.67) suggest that appealing to the **broader mainstream audience** is a higher-risk endeavor; one possible explanation is that general audiences judge new titles against a large catalog of established classics.

* **Strategic Genre Positioning: The Niche Landscape**

    The positioning of various niche genres highlights specific investment opportunities:

    *   **Mind-Game Dominance:** **Mystery** and **Suspense** are the clear **"Satisfaction Kings,"** being the only genres near or at the 7.00 mark. This signals that audiences highly value **intelligent plotting**, **narrative depth**, and dramatic tension.

    *   **Sports Potential:** **Sports** (6.72) is a niche genre with strong **Mainstream Potential**, as its score is comparable to popular categories. This makes it a **safer investment** for producers with compelling source material.

    *   **Gourmet's Low Barrier to Entry:** **Gourmet** (6.63) registers the second-lowest average score among the ranked niche genres. (Gourmet anime typically focuses on cooking, food tasting, or culinary competitions, emphasizing the visual presentation of food and the characters' emotional reactions.) Although the score is low, the genre's narrow focus suggests a low **competitive barrier to entry**. It can be a cost-effective market to enter, provided the producer delivers the **visual quality** this focused audience expects.

**Business Takeaway:**

Producers aiming to maximize their project's final score should strategically target **Niche Genres** to capitalize on high fan loyalty. The highest returns on satisfaction are found in **Mystery and Suspense**, rewarding investments in **complex, engaging narratives**. The data advises a cautious approach to high-volume **Action/Adventure** titles, which carry a higher risk of underperforming due to the demanding nature of the mainstream audience.
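
The niche/mainstream split behind this ranking can be sketched on a toy table. This is a minimal illustration, not the notebook's actual cell: the data is hypothetical, and "mainstream" is assumed to mean "among the N most frequent genres" (N = 10 in the chart subtitle, N = 2 here to keep the example small).

```python
import pandas as pd

# Hypothetical toy data: one row per title, genres as comma-separated text.
toy = pd.DataFrame({
    "Genres": ["Action, Adventure", "Action", "Mystery, Suspense",
               "Action, Mystery", "Adventure", "Suspense", "Adventure"],
    "Score": [6.5, 6.8, 7.4, 7.0, 6.6, 7.1, 6.4],
})

# One row per (title, genre) pair.
exploded = toy.assign(Genre=toy["Genres"].str.split(",")).explode("Genre", ignore_index=True)
exploded["Genre"] = exploded["Genre"].str.strip()

stats = exploded.groupby("Genre").agg(
    average_score=("Score", "mean"),
    anime_count=("Score", "count"),
)

# Assumption: "mainstream" = among the N most frequent genres (N = 2 in this toy case).
mainstream = set(stats.nlargest(2, "anime_count").index)
stats["segment"] = ["Mainstream" if g in mainstream else "Niche" for g in stats.index]

# Rank by average score; the segment column then drives the bar colors.
ranked = stats.sort_values("average_score", ascending=False)
print(ranked)
```

Keeping `anime_count` next to `average_score` makes it easy to check whether a high-ranking niche genre rests on only a handful of titles.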




In [91]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# Helper function
def parse_list_field(x):
    if pd.isna(x): return []
    if isinstance(x, list): return [str(s).strip() for s in x]
    try:
        val = ast.literal_eval(x)
        if isinstance(val, (list, tuple)): return [str(s).strip() for s in val]
    except Exception: pass
    return [s.strip() for s in str(x).split(',') if s.strip()]

# Initialize
df_matrix = df_anime_dataset_2023_prep[['Type', 'Genres', 'Score']].copy()
df_matrix['Score'] = pd.to_numeric(df_matrix['Score'], errors='coerce')
# Drop rows only after coercion so non-numeric scores (now NaN) are removed too
df_matrix = df_matrix.dropna(subset=['Type', 'Genres', 'Score'])

# Parse Genres
df_matrix['Genres_list'] = df_matrix['Genres'].apply(parse_list_field)

# Explode Genres
df_exploded = df_matrix.explode('Genres_list')
df_exploded.rename(columns={'Genres_list': 'Genre'}, inplace=True)
df_exploded['Genre'] = df_exploded['Genre'].str.strip()
df_exploded['Type'] = df_exploded['Type'].str.strip()

# --- FILTERING ---
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
# Clean Genres
df_exploded = df_exploded[~df_exploded['Genre'].str.upper().isin(exclude_list)]
df_exploded = df_exploded[df_exploded['Genre'] != ""]
df_exploded = df_exploded[df_exploded['Genre'] != "Award Winning"] # Rule: Remove Award Winning

# Clean Types
df_exploded = df_exploded[~df_exploded['Type'].str.upper().isin(exclude_list)]

# --- STEP A: IDENTIFY TOP 10 GENRES (Highest Avg Score, Min 20 Titles) ---
genre_stats = df_exploded.groupby('Genre').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
)
# Filter count >= 20 and Sort Descending
top_10_genres = genre_stats[genre_stats['count'] >= 20].sort_values('avg_score', ascending=False).head(10).index.tolist()

# --- STEP B: CREATE HEATMAP DATA (Type vs. Top 10 Genres) ---
# Filter dataset to only Top 10 Genres
df_final = df_exploded[df_exploded['Genre'].isin(top_10_genres)]

# Create Pivot Table
heatmap_data = df_final.groupby(['Type', 'Genre'])['Score'].mean().unstack()

# Reorder Columns (Genres) by Rank (Highest score left)
heatmap_data = heatmap_data.reindex(columns=top_10_genres)

# Optional: Reorder Rows (Type) by Average Score of that Type (for better visual flow)
type_order = df_final.groupby('Type')['Score'].mean().sort_values(ascending=False).index.tolist()
heatmap_data = heatmap_data.reindex(index=type_order)

# ==========================================
# 2. COLOR STRATEGY (PHASE 3)
# ==========================================
# Gradient: Grey (Low) -> Pink (Mid) -> Purple (High)
custom_colorscale = [
    [0.0, "#D3D3D3"], 
    [0.5, '#ea568c'], 
    [1.0, "#6A1B9A"]
]

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = px.imshow(
    heatmap_data,
    labels=dict(x="Top Genres", y="Format (Type)", color="Avg Score"),
    x=heatmap_data.columns,
    y=heatmap_data.index,
    aspect="auto",
    color_continuous_scale=custom_colorscale,
    text_auto=".2f"
)

# ==========================================
# 4. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family='Tahoma', size=12, color="#000000"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    width=1000,
    height=600, 
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Format Matters: How Anime Type Affects Genre Scores</b><br><span style='color:#555555; font-size:14px'>(Average Score Matrix: Anime Types vs. Top 10 Highest-Rated Genres)</span>",
        x=0.02, 
        y=0.97,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- MARGINS ---
    margin=dict(t=100, l=100, r=40, b=60),
    
    # --- COLORBAR ---
    coloraxis_colorbar=dict(
        title=dict(text="Score", side="top"),
        thickness=15,
        len=0.5,
        yanchor="top",
        y=1,
        tickfont=dict(family="Tahoma", size=11)
    )
)

# --- AXIS REFINEMENTS ---
# Move X-axis to top
fig.update_xaxes(side="top", tickfont=dict(family="Tahoma", size=12), title=None)
fig.update_yaxes(tickfont=dict(family="Tahoma", size=12), title=None, ticksuffix="  ")
fig.update_traces(xgap=1, ygap=1)

fig.show()

#### **B. Studios vs Genres** 
**Question**: Studios are directly responsible for drawing, animation, compositing, coloring, editing, and post-production. In other words, they are the teams that produce the visual content we see on screen. We therefore want to discover **which Genres top Studios consistently excel in.**

In [92]:

# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- COMMON CLEANING (SCORE & STUDIOS) ---
df_studio = df_anime_dataset_2023_prep[['Studios', 'Score']].copy()

# Ensure strictly string type & valid scores
df_studio['Score'] = pd.to_numeric(df_studio['Score'], errors='coerce')
# Drop rows only after coercion so non-numeric scores (now NaN) are removed too
df_studio = df_studio.dropna(subset=['Score', 'Studios'])
df_studio['Studios'] = df_studio['Studios'].astype(str)

# Standard Cleaning: Remove brackets/quotes
df_studio['Studios'] = df_studio['Studios'].str.replace(r"[\[\]\'\"]", "", regex=True)

# Split & Explode
df_studio['Studios'] = df_studio['Studios'].str.split(',')
df_studio = df_studio.explode('Studios')

# Trim & Filter Noise
df_studio['Studios'] = df_studio['Studios'].str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studio = df_studio[~df_studio['Studios'].str.upper().isin(exclude_list)]
df_studio = df_studio[df_studio['Studios'] != ""]

# --- AGGREGATE STATS (Count & Mean) ---
studio_stats = df_studio.groupby('Studios').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
).reset_index()

# --- GROUP A: TOP 10 BY VOLUME (MOST ACTIVE) ---
# Logic: Sort by Count Descending -> Take Top 10 -> Sort by Score for Plotting
top_volume_list = studio_stats.sort_values('count', ascending=False).head(10)
top_volume_plot = top_volume_list.sort_values('avg_score', ascending=True) # Sort Ascending for Horizontal Bar

# --- GROUP B: TOP 10 BY SCORE (QUALITY LEADERS) ---
# Logic: Filter Count >= 50 -> Sort by Score Descending -> Take Top 10 -> Sort Ascending for Plotting
threshold = 50
qualified_studios = studio_stats[studio_stats['count'] >= threshold]
top_quality_list = qualified_studios.sort_values('avg_score', ascending=False).head(10)
top_quality_plot = top_quality_list.sort_values('avg_score', ascending=True)

# ==========================================
# 2. VISUALIZATION (SUBPLOT STRATEGY)
# ==========================================

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=(
        "<b>Avg Score of Top 10 Most Active Studios</b>",
        f"<b>Avg Score of Top 10 Highest Rated Studios (≥{threshold} titles)</b>"
    ),
    horizontal_spacing=0.15
)

# --- TRACE 1: VOLUME LEADERS SCORE (PINK) ---
fig.add_trace(
    go.Bar(
        y=top_volume_plot['Studios'],
        x=top_volume_plot['avg_score'],
        orientation='h',
        marker_color='#ea568c', # Vibrant Pink (Volume Context)
        
        # Direct Labeling
        text=top_volume_plot['avg_score'],
        textposition='outside',
        texttemplate='%{x:.2f}',
        
        # Tooltip
        hovertemplate='<b>Studio:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Projects:</b> %{customdata}<extra></extra>',
        customdata=top_volume_plot['count']
    ),
    row=1, col=1
)

# --- TRACE 2: QUALITY LEADERS SCORE (PURPLE) ---
fig.add_trace(
    go.Bar(
        y=top_quality_plot['Studios'],
        x=top_quality_plot['avg_score'],
        orientation='h',
        marker_color='#6A1B9A', # Deep Purple (Quality Context)
        
        # Direct Labeling
        text=top_quality_plot['avg_score'],
        textposition='outside',
        texttemplate='%{x:.2f}',
        
        # Tooltip
        hovertemplate='<b>Studio:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Projects:</b> %{customdata}<extra></extra>',
        customdata=top_quality_plot['count']
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="y unified",
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=1400, 
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Mass Production vs. Consistent Excellence: A Score Comparison</b>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),

    showlegend=False
)

# ==========================================
# 4. AXIS DECLUTTER
# ==========================================

# Use one fixed x-range for both panels so bar lengths are directly comparable
range_x = [0, 8]

# Left Chart Axis
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, range=range_x, row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=1)

# Right Chart Axis
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, range=range_x, row=1, col=2)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=2)

fig.show()

**Insight**

* **The Strategic Trade-off: Volume vs. Score**

    The comparison explicitly highlights a fundamental **trade-off** between **mass production** and **consistent scoring excellence**. 

    * The **Top 10 Most Active Studios** generally maintain average scores below 7.00 (with the lowest at 6.59), a pattern consistent with prioritizing high **volume** for stable revenue and resource utilization.
    * In contrast, the **Top 10 Highest Rated Studios** (all at 7.09 or above) focus on achieving superior **quality** and brand value. This disparity suggests that studios adopt a clear **strategic segmentation**: either optimize for **cash flow stability** (volume) or strive for a **premium brand reputation** (score).

* **The Significance of the Quality Floor Gap**

    * The gap between the lowest average scores of the two groups is **0.50 points** (6.59 vs. 7.09), a substantial margin in the competitive anime market. It marks the difference between a project being merely *average* and being firmly in the *high-quality* tier. Studios must understand that achieving a reputation for **Consistent Excellence** (i.e., making the "Highest Rated" list) requires a commitment to a **higher quality floor** across their entire output. This performance difference is consistent with the view that specialized studios like Bones, ufotable, and Kyoto Animation command a **score premium** due to their consistently high quality standards.

* **The Exceptional Case and Producer Strategy**

    * **A-1 Pictures** stands out as a unique exception, appearing at the top of the **Most Active** list while maintaining an average score of 7.12, high enough to make the **Highest Rated** list. This demonstrates a successful, albeit rare, **ideal business model** that combines **scale and quality**. For Producers, the data suggests a clear selection strategy: if the goal is the best chance of a **high score** and a critical hit, the **Highest Rated Studios** are the natural **premium partners**. If the goal is a balance between moderate quality and reliable production output, the **Most Active Studios** offer a broader, more accessible range of options.

**Business Takeaway**

Success in the anime industry is no longer a simple matter of **Volume** or **Score**, but of **strategic alignment**. The **Highest Rated Studios** suggest that a focus on **technical quality** goes hand in hand with higher audience scores and, plausibly, greater long-term **brand equity**. Producers should choose their Studio based on their ultimate objective: **Cash Flow Stability (Active Studios)** or **Reputation & High Score (Rated Studios)**.
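
The two rankings and their overlap (the "A-1 Pictures" case) can be reproduced with a few lines. The sketch below uses hypothetical studio names and numbers, shaped like the `studio_stats` table above, and a top-2 cut instead of top-10 to keep it short.

```python
import pandas as pd

# Hypothetical per-studio aggregates (names and values are illustrative only).
studio_stats = pd.DataFrame({
    "Studios": ["Studio A", "Studio B", "Studio C", "Studio D", "Studio E"],
    "avg_score": [7.10, 6.60, 7.40, 6.90, 7.80],
    "count": [300, 280, 60, 120, 12],
})

top_n = 2       # the notebook uses 10
threshold = 50  # minimum number of titles to qualify for the quality ranking

# Group A: most active studios, regardless of score.
most_active = studio_stats.nlargest(top_n, "count")

# Group B: highest-rated studios among those with enough titles.
qualified = studio_stats[studio_stats["count"] >= threshold]
highest_rated = qualified.nlargest(top_n, "avg_score")

# Studios that appear in both lists combine scale and quality.
both = sorted(set(most_active["Studios"]) & set(highest_rated["Studios"]))

# "Quality floor gap": difference between the lowest averages of the two groups.
floor_gap = highest_rated["avg_score"].min() - most_active["avg_score"].min()

print(both, round(floor_gap, 2))
```

Note that "Studio E" has the best average but is excluded by the title threshold, which is the reason the threshold exists: a dozen titles is too few to call a studio consistent.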

In [None]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# Helper function
def parse_list_field(x):
    if pd.isna(x): return []
    if isinstance(x, list): return [str(s).strip() for s in x]
    try:
        val = ast.literal_eval(x)
        if isinstance(val, (list, tuple)): return [str(s).strip() for s in val]
    except Exception: pass
    return [s.strip() for s in str(x).split(',') if s.strip()]

# Initialize
df_matrix = df_anime_dataset_2023_prep[['anime_id', 'Studios', 'Genres', 'Score']].copy()
df_matrix['Score'] = pd.to_numeric(df_matrix['Score'], errors='coerce')
# Drop rows only after coercion so non-numeric scores (now NaN) are removed too
df_matrix = df_matrix.dropna(subset=['Studios', 'Genres', 'Score'])

# Parse Lists
df_matrix['Studios_list'] = df_matrix['Studios'].apply(parse_list_field)
df_matrix['Genres_list'] = df_matrix['Genres'].apply(parse_list_field)

# --- STEP A: IDENTIFY TOP 10 ELITE STUDIOS (Avg Score, Min 20 Titles) ---
df_studios = df_matrix.explode('Studios_list')
df_studios['Studio'] = df_studios['Studios_list'].astype(str).str.strip()

# Remove Garbage
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studios = df_studios[~df_studios['Studio'].str.upper().isin(exclude_list)]
df_studios = df_studios[df_studios['Studio'] != ""]

# Aggregation Studio
studio_stats = df_studios.groupby('Studio').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
)
# Filter & Sort
top_10_studios = studio_stats[studio_stats['count'] >= 20].sort_values('avg_score', ascending=False).head(10).index.tolist()

# --- STEP B: IDENTIFY TOP 10 ELITE GENRES (Avg Score, Min 50 Titles) ---
# We explode Genres from the FULL dataset (not just top studios) to find global top quality genres
df_genres = df_matrix.explode('Genres_list')
df_genres['Genre'] = df_genres['Genres_list'].astype(str).str.strip()
df_genres = df_genres[~df_genres['Genre'].str.upper().isin(exclude_list)]
df_genres = df_genres[df_genres['Genre'] != ""]

# --- UPDATE: REMOVE 'AWARD WINNING' ---
df_genres = df_genres[df_genres['Genre'] != 'Award Winning']

# Aggregation Genre
genre_stats = df_genres.groupby('Genre').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
)
# Filter & Sort
top_10_genres = genre_stats[genre_stats['count'] >= 50].sort_values('avg_score', ascending=False).head(10).index.tolist()

# --- STEP C: CREATE INTERSECTION MATRIX ---
# Filter original exploded data to keep only Elite Studios AND Elite Genres
df_final = df_studios.copy() # Starts with exploded studios
df_final['Genres_list'] = df_final['Genres_list'].apply(lambda x: [g for g in x if g in top_10_genres]) # Keep only top genres in list
df_final = df_final.explode('Genres_list') # Explode Genres
df_final.rename(columns={'Genres_list': 'Genre'}, inplace=True)

# Filter rows
df_final = df_final[
    (df_final['Studio'].isin(top_10_studios)) & 
    (df_final['Genre'].notna())
]

# Create Pivot Table (Heatmap Data)
heatmap_data = df_final.groupby(['Studio', 'Genre'])['Score'].mean().unstack()

# Reorder indices for visual logic (Best at Top-Left)
heatmap_data = heatmap_data.reindex(index=top_10_studios, columns=top_10_genres)

# ==========================================
# 2. COLOR STRATEGY (PHASE 3)
# ==========================================
# Gradient: Grey (Low) -> Pink (Mid) -> Purple (High)
custom_colorscale = [
    [0.0, "#D3D3D3"], 
    [0.5, '#ea568c'], 
    [1.0, "#6A1B9A"]
]

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = px.imshow(
    heatmap_data,
    labels=dict(x="Top Genres", y="Top Studios", color="Avg Score"),
    x=heatmap_data.columns,
    y=heatmap_data.index,
    aspect="auto",
    color_continuous_scale=custom_colorscale,
    text_auto=".2f"
)

# ==========================================
# 4. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family='Tahoma', size=12, color="#000000"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    width=1000,
    height=800, 
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Elite Quality Intersection: Top Studios vs. Top Genres</b><br><span style='color:#555555; font-size:14px'>(Performance Matrix of the 10 Highest-Rated Studios across the 10 Highest-Rated Genres)</span>",
        x=0.02, 
        y=0.97,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- MARGINS ---
    margin=dict(t=100, l=150, r=40, b=60),
    
    # --- COLORBAR ---
    coloraxis_colorbar=dict(
        title=dict(text="Score", side="top"),
        thickness=15,
        len=0.5,
        yanchor="top",
        y=1,
        tickfont=dict(family="Tahoma", size=11)
    )
)

# --- AXIS REFINEMENTS ---
fig.update_xaxes(side="top", tickfont=dict(family="Tahoma", size=12), title=None)
fig.update_yaxes(tickfont=dict(family="Tahoma", size=12), title=None, ticksuffix="  ")
fig.update_traces(xgap=1, ygap=1)

fig.show()

**Insight**

* **The Crucial Role of Core Competency Alignment**

    The heatmap reveals that **Consistent Excellence** is not uniform across all genres; it is highly **specialized**. Top scores (8.0+) are achieved almost exclusively at specific **Studio-Genre intersections**, validating the necessity of **Core Competency Alignment**. For instance, **Wit Studio** is top-tier in "Suspense" (8.19), while **White Fox** peaks in "Romance" (8.31) and "Mystery" (8.20). Producers must treat the heatmap as a **strategic blueprint**, matching the project's genre to the studio's **proven, data-backed strength** to maximize the final score, rather than simply choosing a popular name.

* **Emerging Niche Markets: Gourmet and Sports (Strategic Gap Filling)**

    The genres of **Gourmet** and **Sports** currently present clear **Niche Market Opportunities** because no studio has yet achieved the highly exclusive 8.0+ score tier, despite strong performances from leaders.

    *   **Gourmet:** MAPPA (7.70) and ufotable (7.38) lead the pack in a genre that depends on high-quality visual presentation.
    *   **Sports:** Kyoto Animation (7.56) and 8bit (7.42) are the benchmarks in a genre that demands **complex motion animation**.

    This lack of absolute dominance creates a clear call for **Strategic Gap Filling**: **Producers/Studios should proactively look for adjacent genres that align with their core strengths.** For example, a studio with a strong track record in **Action/Supernatural (like ufotable)** could leverage its superior **dynamic motion animation** to break the 8.0+ ceiling in **Sports**, becoming the quality leader in a currently underserved market.

* **Versatility vs. Specialization and Risk Assessment**

    The matrix provides crucial insight into studio risk profiles. **MAPPA** demonstrates superior **Versatility** by maintaining high scores (mostly above 7.00) across nearly all tested genres, with a peak in "Suspense" (8.54). This makes MAPPA a strong candidate for complex, **long-running series** that need stable quality across multiple genre arcs. Conversely, studios like **CloverWorks** show the risk of **Inconsistent Quality**, scoring high in "Romance" (7.97) but markedly lower in "Suspense" (6.17). Producers should **avoid Studio Miscasting** by ensuring the studio's specialization aligns with the project's primary genre to protect the investment.

**Business Takeaway**

The Heatmap is a practical guide to achieving **Elite Quality Intersection**. The most reliable path to the prestigious **8.0+ score tier** requires a precise strategic match between the project's **Genre** and the Studio's demonstrated **Core Competency**. This means **choosing a studio for its record in the target genre rather than for its overall reputation** when an 8.0+ score is the goal, and actively pursuing **Gap Filling** opportunities where a studio's established technical strength can dominate a less-competitive niche.
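
Each heatmap cell is an average, and some Studio-Genre intersections may rest on very few titles. A hedged way to check this is to compute the title count per cell alongside the mean and blank out thin cells. The sketch below uses hypothetical rows, and the `min_titles` cut-off is an assumption for illustration, not a value used in this notebook.

```python
import pandas as pd

# Hypothetical exploded rows: one row per (title, studio, genre) combination.
rows = pd.DataFrame({
    "Studio": ["Studio A"] * 4 + ["Studio B"] * 3,
    "Genre": ["Suspense", "Suspense", "Suspense", "Romance",
              "Suspense", "Romance", "Romance"],
    "Score": [8.0, 8.2, 8.4, 8.9, 6.0, 7.8, 8.0],
})

# Mean score and number of titles for every Studio x Genre cell.
cell = rows.groupby(["Studio", "Genre"])["Score"].agg(["mean", "count"])
means = cell["mean"].unstack()
counts = cell["count"].unstack()

# Assumed cut-off: hide cells backed by fewer than `min_titles` titles.
min_titles = 2
reliable = means.where(counts >= min_titles)

print(counts)
print(reliable.round(2))
```

Here the single 8.9 "Romance" title for Studio A and the single 6.0 "Suspense" title for Studio B are masked, so only averages with some backing remain. The `counts` table could also be passed to the heatmap as hover data.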

#### **C. Producers Committee** 
**Question**: A Producer can be understood as an “investor,” and an anime is essentially an “investment project.” Producers provide funding, planning, and coordination for the production.
Given the need to ensure quality and manage financial risk, **should producers operate alone (or in very small teams), or should they form a larger producer committee?**


In [94]:
# ==========================================
# 1. DATA PROCESSING
# ==========================================

# Filter relevant columns
df_trend = df_anime_dataset_2023_prep[['Producers', 'Aired Year']].copy()

# 1.1. CLEANING
df_trend['Aired Year'] = pd.to_numeric(df_trend['Aired Year'], errors='coerce')
df_trend.dropna(subset=['Aired Year', 'Producers'], inplace=True)

# Remove Unknowns
exclude_list = {"UNKNOWN", "NONE", "NAN", "NULL", ""}
df_trend = df_trend[~df_trend['Producers'].astype(str).str.strip().str.upper().isin(exclude_list)]

# 1.2. FILTER LAST 10 YEARS (2014 - 2023)
max_year = 2023
min_year = max_year - 9 
df_trend = df_trend[(df_trend['Aired Year'] >= min_year) & (df_trend['Aired Year'] <= max_year)]

# 1.3. GROUPING LOGIC
df_trend['num_producers'] = df_trend['Producers'].astype(str).str.count(',') + 1

def group_producers(n):
    if n <= 2: return '1-2 producers'
    if n <= 5: return '3-5 producers'
    return '6+ producers'

df_trend['Collab_Group'] = df_trend['num_producers'].apply(group_producers)

# 1.4. AGGREGATION
trend_data = (
    df_trend.groupby(['Aired Year', 'Collab_Group'])
    .size()
    .reset_index(name='Anime_Count')
    .sort_values('Aired Year')
)

# Logical Order
order_list = ['1-2 producers', '3-5 producers', '6+ producers']

# ==========================================
# 2. COLOR STRATEGY
# ==========================================
# Map specific groups to the requested colors
color_map = {
    '1-2 producers': '#6A1B9A', 
    '3-5 producers': '#ea568c', 
    '6+ producers':  '#2E86C1'  
}

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = go.Figure()

for group in order_list:
    # Filter data
    group_data = trend_data[trend_data['Collab_Group'] == group]
    
    # Get color
    line_color = color_map[group]

    line_width = 3

    fig.add_trace(go.Scatter(
        x=group_data['Aired Year'],
        y=group_data['Anime_Count'],
        mode='lines+markers',
        name=group,
        line=dict(color=line_color, width=line_width),
        hovertemplate=f'<b>{group}</b><br>Year: %{{x}}<br>Volume: %{{y}}<extra></extra>'
    ))

# ==========================================
# 4. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE ---
    title=dict(
        text=f"<b>Small Teams Consistently Drive Anime Volume</b><br><span style='color:#555555; font-size:14px'>(Annual Production Volume by Collaboration Size, {min_year} - {max_year})</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- LEGEND  ---
    showlegend=True,
    legend=dict(
        x=1.02, 
        y=0.98, 
        xanchor='left', 
        yanchor='top',
        bgcolor='rgba(0,0,0,0)',
        font=dict(size=12)
    )
)

# ==========================================
# 5. AXIS DECLUTTER
# ==========================================

fig.update_xaxes(
    
    showgrid=False, 
    showline=True, 
    linecolor='black',
    tickmode='linear',
    dtick=1
)

fig.update_yaxes(
    title="Number of Anime",
    showgrid=True, 
    showline=True, 
    ticksuffix="  "
)

fig.show()

**Insight:**

* **The Dominance and Decline of Small Producer Teams (1-2 Producers)**

    * The anime market remains overwhelmingly **dominated by projects involving only one or two producers**. This segment consistently drives the highest annual production volume, typically several times that of either other group. However, the data reveals an **overall declining trend** in this group's output, which drops from a peak of around 410 titles in 2014 to approximately 230 in 2023, a decline of roughly 44%.

    * **Strategic Implication:** Small-scale producers are the main market driver, but their overall volume is shrinking. Producers and studios working primarily in this segment should treat this as a sign of possible market consolidation and **focus on project quality and efficiency** to offset the declining trend and year-to-year swings in volume.

* **Stability and Consistency of Larger Collaboration Models**

    * In sharp contrast to the high volatility of the small teams, the **mid-sized (3-5 Producers)** and **large collaboration (6+ Producers) models** maintain a significantly lower but remarkably **stable production volume**, generally ranging between 50 and 100 titles annually throughout the decade.

    * **Strategic Implication:** These stable output numbers suggest that larger committees are primarily involved in **resource-intensive, complex, or high-profile projects** (like major franchises) that demand more coordinated financial backing. For animation studios seeking reliable, year-to-year contracts and stable cash flow, partnerships with these larger production committees (especially the **6+ Producers** group) look like the more dependable option.

**Business Takeaway:** The volume data suggests the anime market is shifting **from a very large volume of small-scale projects towards the steadier output of larger-scale projects.**

*   **For Small-Scale Producers/Studios:** Caution is advised due to the overall declining trend in volume. Success requires a strategic pivot to **prioritize and enhance project quality** to effectively compete with the stability offered by larger projects.

*   **For Large-Scale Producers/Studios:** It is recommended to continue **maintaining or expanding the Large Collaboration Models** as they provide a consistent output, better stability, and superior resilience against general market fluctuations.
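
The team-size grouping behind the trend chart can be sketched in plain Python. The records below are hypothetical, and `count_producers` is an illustrative helper rather than the notebook's method: the cell above uses `str.count(',') + 1`, which this variant matches on clean input but which would count a trailing comma as an extra producer.

```python
from collections import Counter

def count_producers(raw: str) -> int:
    """Count comma-separated names, ignoring empty fragments."""
    return len([name for name in raw.split(",") if name.strip()])

def group_producers(n: int) -> str:
    """Same bins as the notebook: 1-2, 3-5, 6+."""
    if n <= 2:
        return "1-2 producers"
    if n <= 5:
        return "3-5 producers"
    return "6+ producers"

# Hypothetical (year, producers) records.
records = [
    (2014, "Producer A"),
    (2014, "Producer A, Producer B, Producer C"),
    (2023, "Producer D, Producer A"),
    (2023, "P1, P2, P3, P4, P5, P6"),
    (2023, "Producer E,"),  # trailing comma: still one producer here
]

# Titles per (year, team-size group), the quantity plotted in the line chart.
volume = Counter(
    (year, group_producers(count_producers(producers)))
    for year, producers in records
)

for key in sorted(volume):
    print(key, volume[key])
```

Producer names that themselves contain a comma would still be over-counted by either approach, so the bins are best read as approximate.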

In [95]:
# ==========================================
# 1. DATA PROCESSING
# ==========================================

# 1. Copy data
df_producers_collab = df_anime_dataset_2023_prep[['Producers', 'Score']].copy()

# 2. Cleaning
df_producers_collab['Score'] = pd.to_numeric(df_producers_collab['Score'], errors='coerce')
df_producers_collab = df_producers_collab.dropna(subset=['Score', 'Producers'])

# 3. Filter Unknowns
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]  # same placeholder list as the trend cell
df_producers_collab = df_producers_collab[~df_producers_collab['Producers'].astype(str).str.strip().str.upper().isin(exclude_list)]

# 4. Calculate Number of Producers
df_producers_collab['num_producers'] = df_producers_collab['Producers'].astype(str).str.count(',') + 1

# 5. Binning Function
def group_size(n):
    if n <= 2: return '1-2 producers'
    if n <= 5: return '3-5 producers'
    return '6+ producers'

df_producers_collab['Collaboration_Group'] = df_producers_collab['num_producers'].apply(group_size)

# 6. Define Order
order_list = ['1-2 producers', '3-5 producers', '6+ producers']

# 7. Calculate Insight Stats for Title
group_stats = df_producers_collab.groupby('Collaboration_Group')['Score'].mean().reindex(order_list)
baseline_score = group_stats['1-2 producers']
top_score = group_stats.max()
improvement = top_score - baseline_score

# ==========================================
# 2. COLOR STRATEGY (CONSISTENT WITH LINE CHART)
# ==========================================
custom_palette = {
    '1-2 producers': '#6A1B9A', 
    '3-5 producers': '#ea568c',
    '6+ producers':  '#2E86C1'  
}

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = px.box(
    df_producers_collab,
    x='Collaboration_Group',
    y='Score',
    color='Collaboration_Group', 
    color_discrete_map=custom_palette, 
    category_orders={'Collaboration_Group': order_list} 
)

# Style Traces
fig.update_traces(
    width=0.4,           # Adjust box width
    marker_size=4,       # Smaller outlier points
    marker_opacity=0.3,  # Reduce outlier visual noise
    line_width=1.5
)

# ==========================================
# 4. LAYOUT SETTINGS (STRICT COMPLIANCE)
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE CONVENTION (ACTION-ORIENTED) ---
    title=dict(
        text=f"<b>Large Collaboration Teams Correlate with Higher Scores</b><br><span style='color:#555555; font-size:14px'>(Avg Score Increases by +{improvement:.2f} points when comparing {group_stats.idxmax()} vs 1-2 producers)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- LEGEND ---
    # Rule: Box plot categories are clear on X-axis, hide legend to declutter
    showlegend=False
)

# ==========================================
# 5. AXIS DECLUTTER
# ==========================================

# X-Axis
fig.update_xaxes(
    title="Producer Team Size",
    showgrid=False,
    showline=True,
    linecolor='black'
)

# Y-Axis
fig.update_yaxes(
    title="Anime Score",
    showgrid=True,
    showline=True,
    linecolor='black',
    ticksuffix="  "
)

fig.show()

**Insight:**
* **Larger Producer Teams Deliver Better Results**

    There is a clear **positive association** between the number of producers involved and the final Anime Score: teams of three or more producers score higher on average than small teams.

    *   The chart shows the mid-sized group clearly outperforming the smallest: **1-2 Producers average score: 6.33** vs. **3-5 Producers average: 7.19**, an improvement of **+0.86** points.

    *   This performance gap of nearly a full point is consistent with the notion that **larger budgets, pooled resources, and stricter quality control** in multi-party production committees lead to a better-received product, although the chart shows correlation rather than causation.

* **Risk Mitigation and Quality Floor Elevation**

    Collaboration acts as an effective mechanism for risk management, substantially raising the quality floor for the investment.

    *   The **"1-2 Producers"** group exhibits extremely high volatility, with a wide score range and a significant cluster of "disaster" **outliers** plummeting down to the 2.0 – 3.0 score range. This configuration carries the highest risk of catastrophic failure.

    *   In sharp contrast, the presence of **3 Producers or more** significantly elevates the **"quality floor."** The lower whisker for both the 3-5 and 6+ producer groups rests at a much higher score (~ 5.1), demonstrating that collaboration effectively protects the investment from critically low-scoring failure.

**Business Takeaway:** On this evidence, the best balance between maximizing quality and controlling risk is a **Production Committee** of **3 to 5 partners**. This structure offers robust financing and stringent quality oversight while keeping coordination manageable.
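
The "quality floor" can be read from the data rather than from the chart by computing the lower whisker directly. The sketch below uses hypothetical scores and the common 1.5 × IQR rule. Quartile conventions differ between tools, so the values are not guaranteed to match the box plot exactly.

```python
import statistics

def box_summary(scores):
    """Return (q1, median, q3, lower_whisker) using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    fence = q1 - 1.5 * (q3 - q1)
    # The lower whisker is the smallest observation that is not an outlier.
    lower_whisker = min(s for s in scores if s >= fence)
    return q1, median, q3, lower_whisker

# Hypothetical score samples per collaboration group.
groups = {
    "1-2 producers": [2.5, 5.8, 6.0, 6.2, 6.4, 6.6, 6.8, 7.0, 7.2],
    "3-5 producers": [6.4, 6.8, 7.0, 7.2, 7.4, 7.6, 8.0],
}

for name, scores in groups.items():
    q1, median, q3, whisker = box_summary(scores)
    outliers = [s for s in scores if s < whisker]
    print(f"{name}: median={median:.2f}, lower whisker={whisker:.2f}, low outliers={outliers}")
```

In this toy sample the 2.5 score falls below the fence and is reported as an outlier, while the whisker itself sits at 5.8. Comparing whiskers and outlier counts per group is a more direct test of the "quality floor" claim than comparing means.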

## **V. Theme C — Release Strategy (Aired, Episodes, Duration)**

### **1. Issue Overview**


### **2. Solution**


### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**


##### **The Transformation Insight:**

**Before**:

**After**

### **4. Business Insights**

#### **A.** 
**Question**:

**Insight:**

**Business Takeaway:**

## **VI. Deep Business Insights & Strategic Recommendations**

**1. Key Patterns Across All Themes**

**2. Strategic Recommendations**

