## **I. Introduction**
**1. Narrative Context**

In the previous stage, we took on the role of a Producer seeking the formula behind high-scoring anime. However, our first obstacle was immediate and unavoidable: the Raw Dataset was in poor condition. It contained missing values, unstructured text, inconsistent formats, and fragmented categories — making meaningful analysis impossible.

This notebook presents the next stage of the workflow: **the transformation**.
After applying a complete Data Processing Pipeline to clean, standardize, and engineer features, we now visualize the impact of that transformation. This notebook serves as evidence of how structured data enables structured insights.

**2. Objective of This Analysis**

The focus of this notebook is **Comparative Analysis**.
For every major feature, we address two key questions:

1. **The Transformation Insight** — How did the feature evolve from **Raw** to **Processed**?
This shows why Data Preparation is not optional but foundational.

2. **The Business Insight** — Once the data is clean, what does it reveal about the drivers of **high Score**?

In short, the purpose is not only to clean the data, but to demonstrate how cleaning transforms noise into clarity and reveals the signals that matter.

**3. Analytical Roadmap**

This transformation is explored through four structured themes:

1. **The Foundation – Target Variable (Score):** Revealing the true distribution and statistical behavior of the target metric.

2. **Theme A – Market Factors:** Understanding how Media Type (TV, Movie, OVA…) and Source Material (Manga, Original, Game…) influence performance.

3. **Theme B – Creative Factors:** Unpacking Genres, Producers, and Studios to identify collaboration patterns, specialization strengths, and creative drivers.

4. **Theme C – Release Strategy:** Converting unstructured Aired dates, Duration formats, and Episode structures into analyzable fields to uncover timing and format advantages.

**4. Deep Business Insights → Strategic Recommendations**

At the end of this notebook, we consolidate the strongest patterns discovered across all themes to form **Strategic Recommendations** for creating or selecting high-scoring anime.
These insights connect data evidence to real-world decision-making — guiding choices about **content strategy**, **production planning**, **studio partnerships**, and **release timing**.




In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import pandas as pd
import numpy as np
import joblib
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
import ast
import os 
import re
from scipy import stats
import kaleido
pio.templates.default = "plotly_white"

In [2]:
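# Build paths with os.path.join so they resolve on any operating system
path_raw = os.path.join('..', 'data', 'raw', 'anime-dataset-2023.csv')
df_anime_dataset_2023 = pd.read_csv(path_raw)

path = os.path.join('..', 'data', 'processed', 'prepared_data.csv')
df_anime_dataset_2023_prep = pd.read_csv(path)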

In [3]:
def save_chart(fig, step, stt, title):
    """
    Saves a COPY of the chart with transparent background, 
    leaving the original chart (on screen) untouched.
    """
    # 1. Path Setup
    folder_name = f"Plot_{step}"
    output_dir = os.path.join("plots", folder_name) 
    
    # Create the output folder if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    # 2. CLONE the figure
    # This ensures we don't mess up the chart currently displayed in the notebook
    fig_export = go.Figure(fig)

    # 3. Apply Transparent Background ONLY to the exported version
    fig_export.update_layout(
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
    )

    # 4. Filename Cleaning
    safe_title = re.sub(r'[\\/*?:"<>|]', '', title).strip()
    filename_base = f"[{step}]_[{stt}]_{safe_title}"
    
    path_svg = os.path.join(output_dir, f"{filename_base}.svg")
    path_png = os.path.join(output_dir, f"{filename_base}.png")

    # 5. Export
    try:
        fig_export.write_image(path_svg)
        fig_export.write_image(path_png, scale=3, width=1500, height=500)
        print(f"✅ Saved: {filename_base} (Transparent)")
    except Exception as e:
        print(f"❌ Error: {e}")

## **II. Target Variable: Score**

### **1. Issue Overview**
The `Score` variable is the heartbeat of our analysis, yet the raw data is heavily compromised. Diagnostic EDA revealed that approximately **37% of the records** (~9,200 rows) had no valid score, represented either as the literal string "UNKNOWN" or as missing values. Additionally, the data types were inconsistent (strings mixed with numbers), creating a "dirty" distribution that skews any attempt at calculating averages or identifying top performers.

### **2. Solution**
Since imputing the target variable can introduce significant bias into a predictive model, we adopted a **strict cleaning strategy**:
1.  **Standardization:** Converted the `Score` column to numeric, coercing errors (like "UNKNOWN") to NaN.
2.  **Filtration:** Removed all rows where `Score` was missing. We prioritized data quality over quantity for the target variable to ensure the model learns from ground truth, not synthetic guesses.
3.  **Result:** A clean, numeric dataset ready for high-fidelity modeling.
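
The two cleaning steps above can be sketched as follows. This is a minimal illustration on a toy frame, not the pipeline's actual code; the frame `raw` and its values are made up for the example.

```python
import pandas as pd

# Toy frame standing in for the raw dataset (values are illustrative only)
raw = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Score": ["8.75", "UNKNOWN", "6.1", None],
})

# 1. Standardization: non-numeric placeholders such as "UNKNOWN" become NaN
raw["Score"] = pd.to_numeric(raw["Score"], errors="coerce")

# 2. Filtration: drop every row that has no valid target value
clean = raw.dropna(subset=["Score"]).reset_index(drop=True)

# Share of rows lost to the strict cleaning strategy
missing_share = raw["Score"].isna().mean()
print(f"Dropped {missing_share:.0%} of rows; {len(clean)} remain")
```

Coercing before dropping matters: if rows were dropped first, text placeholders would survive the null check and only turn into NaN afterwards.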

### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**

In [4]:
# Note: `os` (used by save_chart) is already imported in the first cell

# --- 1. DATA PREPARATION ---

# A. RAW DATA
raw_counts = df_anime_dataset_2023['Score'].value_counts().head(10).reset_index()
raw_counts.columns = ['Value', 'Count']

# --- FIX BOLD Y-AXIS HERE ---
# Wrap text in <b> tag to force Bold styling
raw_counts['Label'] = raw_counts['Value'].apply(lambda x: f"<b>'{x}'</b>") 

raw_counts['Color'] = raw_counts['Value'].apply(lambda x: '#C3122F' if x == 'UNKNOWN' else '#ef9a9a')
max_raw_count = raw_counts['Count'].max()

# B. CLEAN DATA
clean_scores = df_anime_dataset_2023_prep['Score'].dropna()
clean_mean = clean_scores.mean()

# --- 2. SETUP PLOT ---
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=(
        "<b>Raw Data: Interpreted as Text (Object)</b>", 
        "<b>Clean Data: Interpreted as Numbers (Float)</b>"
    ),
    horizontal_spacing=0.18, 
    column_widths=[0.4, 0.6]
)

# --- 3. LEFT CHART: CATEGORICAL BAR ---
fig.add_trace(
    go.Bar(
        y=raw_counts['Label'], 
        x=raw_counts['Count'],
        orientation='h',
        marker_color=raw_counts['Color'],
        text=raw_counts['Count'],
        textposition='outside', 
        hovertemplate='String Value: %{y}<br>Count: %{x}'
    ),
    row=1, col=1
)


# --- 4. RIGHT CHART: NUMERIC HISTOGRAM ---
fig.add_trace(
    go.Histogram(
        x=clean_scores,
        xbins=dict(start=0, end=10, size=0.2),
        marker_color='#ea568c', 
        opacity=0.8,
        hovertemplate='Score: %{x}<br>Count: %{y}'
    ),
    row=1, col=2
)

# Mean Line
fig.add_vline(
    x=clean_mean, line_width=2, line_dash="dash", line_color="#333333",
    annotation_text=f"Mean: {clean_mean:.2f}", 
    annotation_position="top right",
    row=1, col=2
)

# --- 5. VISUAL POLISH ---
fig.update_layout(
    template='plotly_white',
    font=dict(family="Tahoma", size=12, color="black"),
    title=dict(
        text="<b>Converting 'Text Strings' into 'Calculable Metrics'</b><br>"
             "<span style='font-size:14px; color:#555555'>Left: Raw Score is treated as discrete categories (Text) | Right: Clean Score is a continuous variable (Numbers)</span>",
        x=0.02,
        y=0.95
    ),
    margin=dict(t=100, l=60, r=40, b=60),
    height=500,
    bargap=0.1,
    showlegend=False
)

# Axis Styling
axis_style = dict(showline=True, linewidth=1, linecolor='black', ticks='outside')

# Axis Left: Adjusted range + Bold Font in Ticks
fig.update_xaxes(**axis_style, title_text="Frequency (Count)", range=[0, max_raw_count*1.3], row=1, col=1)
fig.update_yaxes(
    **axis_style, 
    title_text="Object Values", 
    categoryorder='total ascending', 
    tickfont=dict(family="Courier New", size=12, color="black"),  # Bold weight comes from the <b> tags in the label data
    row=1, col=1
)

# Axis Right
fig.update_xaxes(**axis_style, title_text="Score (0-10)", range=[1, 10.5], row=1, col=2)
fig.update_yaxes(visible=False, row=1, col=2)
fig.update_xaxes(visible=False, row=1, col=1)


# --- 6. SAFE SAVE & SHOW ---
try:
    save_chart(fig, step="03", stt="Act2_1", title="Score_Object_vs_Numeric")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

fig.show()

✅ Saved: [03]_[Act2_1]_Score_Object_vs_Numeric (Transparent)


##### **The Transformation Insight:** 

**The Misleading View:** The raw `Score` column was a mixed-type mess, dominated by text-based "Unknown" placeholders (~37%). This created a "Zero Trap," making it impossible to calculate averages or identify market trends without severe bias.

**The Truth:** The clean data reveals a roughly Gaussian distribution, with most scores clustering around 6.38, and extreme low or high scores being rare.

**The Strategic Value:** Instead of noise, we now have a reliable numeric feature to analyze how Score varies across Types, Sources, and other factors, enabling accurate business insights.

## **III. Theme A — Market Factors (Type & Source)**

### **1. Issue Overview**

**a. Non-Standard Missing Values (“Unknown”, “Not available”)**

Instead of a proper NaN, missing values appear as literal text, so "Unknown" is counted as a valid value.

- **Impact:**  
  "Unknown" shows up in charts as a major category and distorts the distribution.

**b. Source categories are excessively fragmented**  

A single origin type is split into multiple subcategories (e.g., *Manga vs Web manga vs 4-koma manga*, *Novel vs Light novel vs Web novel*).

- **Impact:**  
  This fragmentation breaks a unified category into many small pieces, diluting the true market share of major source types.


### **2. Solution**
To resolve these inconsistencies, we implemented the following **data cleaning pipeline**:

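*   **Handling Nulls:** Missing entries were temporarily labeled as **"UNKNOWN"** to track data completeness before being filtered out for detailed analysis.

*   **Consolidating Sources:** Fragmented `Source` subcategories were merged into a small set of parent categories (for example, novel variants into **Literature**) so that each origin type is counted once.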
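
A minimal sketch of how placeholder labeling and the merging of fragmented `Source` subcategories could look. The mapping `SOURCE_PARENT`, the set `PLACEHOLDERS`, and the function `consolidate_source` are illustrative assumptions, not the pipeline's actual code; the real groupings may differ.

```python
import pandas as pd

# Hypothetical parent mapping; the real pipeline's groupings may differ
SOURCE_PARENT = {
    "Web manga": "Manga",
    "4-koma manga": "Manga",
    "Novel": "Literature",
    "Light novel": "Literature",
    "Web novel": "Literature",
}
# Literal strings that should be treated as missing (assumed list)
PLACEHOLDERS = {"UNKNOWN", "NOT AVAILABLE"}


def consolidate_source(series: pd.Series) -> pd.Series:
    """Label missing values as 'UNKNOWN', then merge subcategories into parents."""
    s = series.fillna("UNKNOWN").astype(str).str.strip()
    # Normalize every placeholder spelling to a single 'UNKNOWN' label
    s = s.where(~s.str.upper().isin(PLACEHOLDERS), "UNKNOWN")
    return s.replace(SOURCE_PARENT)


raw = pd.Series(
    ["Manga", "Web manga", "4-koma manga", "Light novel", "Unknown", None, "Original"]
)
labeled = consolidate_source(raw)

# Track completeness first, then filter the placeholder out for analysis
completeness = (labeled != "UNKNOWN").mean()
counts = labeled[labeled != "UNKNOWN"].value_counts()
print(counts)
```

Keeping the `"UNKNOWN"` label until the last step makes it possible to report how much of the column was usable before those rows are excluded from the charts.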

### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**
#### **3.1 Type**

In [5]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- A. RAW DATA (LEFT CHART) ---
type_counts_raw = df_anime_dataset_2023['Type'].value_counts().head(10).reset_index()
type_counts_raw.columns = ['Type', 'Count']
type_counts_raw = type_counts_raw.sort_values('Count', ascending=False)

colors_raw = [
    '#C3122F' if str(x).strip().upper() == 'UNKNOWN' else '#D3D3D3'
    for x in type_counts_raw['Type']
]

# --- B. CLEAN DATA (RIGHT CHART) ---
df_type_clean = df_anime_dataset_2023_prep[['Type']].copy()
df_type_clean['Type'] = df_type_clean['Type'].astype(str).str.strip()

exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_type_clean = df_type_clean[~df_type_clean['Type'].str.upper().isin(exclude_list)]
df_type_clean = df_type_clean[df_type_clean['Type'] != ""]

type_counts_clean = df_type_clean['Type'].value_counts().head(10).reset_index()
type_counts_clean.columns = ['Type', 'Count']
type_counts_clean = type_counts_clean.sort_values('Count', ascending=False)

# --- PINK GRADIENT FOR CLEAN CHART ---
colors_clean = [
    "#ea568c",
    "#f279a9",
    "#f793ba",
    "#f9a8c8",
    "#fbb7d2",
    "#fdc8dd",
    "#fed7e7",
    "#ffe4f0",
    "#fff0f7",
    "#ffffff"
][:len(type_counts_clean)]

# ==========================================
# 2. VISUALIZATION
# ==========================================

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Type Raw Data Distribution</b>",
                    "<b>Type Cleaned Data Distribution</b>"), 
    horizontal_spacing=0.07
)

# --- RAW DATA ---
fig.add_trace(
    go.Bar(
        x=type_counts_raw['Type'],
        y=type_counts_raw['Count'],
        marker_color=colors_raw,
        text=type_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{y:.2s}',
        hovertemplate='<b>Raw Type:</b> %{x}<br><b>Count:</b> %{y:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=1
)

# --- CLEAN DATA ---
fig.add_trace(
    go.Bar(
        x=type_counts_clean['Type'],
        y=type_counts_clean['Count'],
        marker_color=colors_clean,   # <-- pink gradient
        text=type_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{y:.2s}',
        hovertemplate='<b>Clean Type:</b> %{x}<br><b>Count:</b> %{y:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    font=dict(family="Tahoma", size=13, color="#000"),
    hovermode="x unified",
    margin=dict(l=40, r=40, t=100, b=60),
    width=1550,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",

    title=dict(
        text="<b>Removing 'Unknown' Values Reveals True Type Distribution</b>"
             "<br><span style='color:#555555; font-size:14px'>(Top 10 Most Common Anime Types)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# X-Axis
fig.update_xaxes(showgrid=False, showline=False, tickangle=0, row=1, col=1)
fig.update_xaxes(showgrid=False, showline=False, tickangle=0, row=1, col=2)

# Y-Axis
fig.update_yaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_yaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

try:
    save_chart(fig, step="03", stt="III_A_3.1.Type", title="Removing 'Unknown' Values Reveals True Type Distribution")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

fig.show()


✅ Saved: [03]_[III_A_3.1.Type]_Removing 'Unknown' Values Reveals True Type Distribution (Transparent)


##### **The Transformation Insight:**

**The Misleading View:** The raw data suggested that TV was the overwhelmingly dominant anime type (7.6k), while non-standard "Unknown" values acted as visible noise that masked a deeper unreliability in the dataset's categorization.

**The Truth:** The clean data corrects this by removing invalid entries and re-balancing the categories, revealing a significant market shift where OVA (6.1k) is actually the most common release format, overtaking TV (5.5k).

**The Strategic Value:** Instead of categorical noise, we now have a reliable "Type" feature to accurately analyze how different release formats (such as the mass-market TV vs. the niche-focused OVA) specifically impact the Audience Score.

#### **3.2 Source**

In [6]:
from plotly.colors import n_colors
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- A. RAW DATA (LEFT CHART) ---
source_counts_raw = df_anime_dataset_2023['Source'].value_counts().reset_index()
source_counts_raw.columns = ['Source', 'Count']
source_counts_raw = source_counts_raw.sort_values('Count', ascending=True)

fragmented_subcategories = [
    'VISUAL NOVEL', 'LIGHT NOVEL', 'WEB NOVEL',
    'WEB MANGA', '4-KOMA MANGA', 'DIGITAL MANGA'
]

colors_raw = []
for source, count in zip(source_counts_raw['Source'], source_counts_raw['Count']):
    s = str(source).strip().upper()

    if s == "UNKNOWN":
        colors_raw.append("#C3122F")
    elif count < 500:
        colors_raw.append("#C3122F")
    elif s in fragmented_subcategories:
        colors_raw.append("#C3122F")
    else:
        colors_raw.append("#D3D3D3")  # Major valid sources


# --- B. CLEAN DATA (RIGHT CHART) ---
df_source_clean = df_anime_dataset_2023_prep[['Source']].copy()
df_source_clean['Source'] = df_source_clean['Source'].astype(str).str.strip()

exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_source_clean = df_source_clean[~df_source_clean['Source'].str.upper().isin(exclude_list)]

source_counts_clean = df_source_clean['Source'].value_counts().reset_index()
source_counts_clean.columns = ['Source', 'Count']
source_counts_clean = source_counts_clean.sort_values('Count', ascending=True)

# ====== DISTINCT GRADIENT FOR CLEANED DATA ======
# Light → dark, matching the ascending sort: the largest bar (drawn at the top) is darkest
colors_clean = n_colors(
    'rgb(249,216,226)',    # light: smallest count (bottom bar)
    'rgb(234,86,140)',     # dark: largest count (top bar)
    len(source_counts_clean),
    colortype='rgb'
)


# ==========================================
# 2. VISUALIZATION
# ==========================================

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=(
        "<b>Source Raw Data Distribution</b>",
        "<b>Source Cleaned Data Distribution</b>"
    ),
    horizontal_spacing=0.15
)

# --- RAW DATA ---
fig.add_trace(
    go.Bar(
        y=source_counts_raw['Source'],
        x=source_counts_raw['Count'],
        orientation='h',
        marker_color=colors_raw,
        text=source_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Raw Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=1
)

# --- CLEANED DATA ---
fig.add_trace(
    go.Bar(
        y=source_counts_clean['Source'],
        x=source_counts_clean['Count'],
        orientation='h',
        marker_color=colors_clean,   # distinct gradient
        text=source_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Clean Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    margin=dict(l=40, r=40, t=100, b=60),
    width=1550,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    title=dict(
        text="<b>Consolidating Fragmented Sources Reveals True Market Structure</b>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# X-axis
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

# Y-axis
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", row=1, col=2)
try:
    save_chart(fig, step="03", stt="III_A_3.2.Source", title="Consolidating Fragmented Sources Reveals True Market Structure")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

fig.show()


✅ Saved: [03]_[III_A_3.2.Source]_Consolidating Fragmented Sources Reveals True Market Structure (Transparent)


##### **The Transformation Insight:**

**The Misleading View:** The raw data presented a highly fragmented landscape where "Original" works appeared to dwarf all else, while the collective strength of adapted media was diluted by granular micro-categories (splitting "Visual Novel," "Game," and "Card Game") and clouded by a massive block of 3.7k "Unknown" entries.

**The Truth:** The clean data corrects this by consolidating these splintered tags into meaningful parent categories. This reveals the true market structure, showing specifically that Manga (rising to 5.4k) and Literature (consolidating novels and books to 2.1k) are far more significant market pillars than the raw counts initially suggested.

**The Strategic Value:** Instead of scattered metadata, we now have five distinct "Source Archetypes," allowing us to reliably analyze how the narrative origin (e.g., the structural difference between a Game adaptation vs. a Literary adaptation) directly dictates the potential for a high Score.

### **4. Business Insights**

#### **A. Type Anime Comparison**
**Question**: Which release formats maximize the potential for high audience approval, and where do we see diminishing returns in quality perception?

In [7]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# Filter and clean data
df_type_score = df_anime_dataset_2023_prep[['Type', 'Score']].copy()

# Clean Score: coerce to numeric first so unparseable values are dropped too
df_type_score['Score'] = pd.to_numeric(df_type_score['Score'], errors='coerce')
df_type_score = df_type_score.dropna(subset=['Score', 'Type'])

# Clean Type
df_type_score['Type'] = df_type_score['Type'].astype(str).str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_type_score = df_type_score[~df_type_score['Type'].str.upper().isin(exclude_list)]
df_type_score = df_type_score[df_type_score['Type'] != ""]

# Calculate average score by Type
type_avg_score = df_type_score.groupby('Type').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
).reset_index()

# Sort by average score (ascending for horizontal bar chart - highest at top)
type_avg_score_plot = type_avg_score.sort_values('avg_score', ascending=True)

# ==========================================
# 2. VISUALIZATION
# ==========================================

fig = go.Figure()

# Add horizontal bar chart
fig.add_trace(
    go.Bar(
        y=type_avg_score_plot['Type'],
        x=type_avg_score_plot['avg_score'],
        orientation='h',
        marker_color='#6A1B9A',
        text=type_avg_score_plot['avg_score'],
        textposition='outside',
        texttemplate='%{x:.2f}',
        hovertemplate='<b>Type:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Titles:</b> %{customdata}<extra></extra>',
        customdata=type_avg_score_plot['count'],
        showlegend=False
    )
)

# ==========================================
# 3. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # General Settings
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="y unified",
    
    # Margin & Dimensions
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    
    # Backgrounds
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # Title Convention
    title=dict(
        text="<b>Movies and TV Series Lead in Average Audience Scores</b><br><span style='color:#555555; font-size:14px'>(Average Score by Anime Type - All Formats Compared)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# --- AXIS SETTINGS ---
# Update X-axis (Values)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, range=[0, 8])

# Update Y-axis (Categories)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True)

try:
    save_chart(fig, step="04", stt="III_A_4.A", title="Movies_and_TV_Series_Lead_in_Average_Audience_Scores")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

fig.show()

✅ Saved: [04]_[III_A_4.A]_Movies_and_TV_Series_Lead_in_Average_Audience_Scores (Transparent)


**Insight:**
- **The "Mainstream Prestige" Gap**

    The data establishes a definitive **"Format Tiering"** in audience scoring. There is a sharp divide between primary narrative formats (**TV and Movies**) and supplementary formats. **TV Series** (6.96) and **Movies** (6.90) enjoy a substantial **score premium**, hovering near the 7.00 threshold. In contrast, secondary formats like **Specials, OVAs, and ONAs** face a **"relevance penalty,"** consistently scoring significantly lower (dropping below 6.30). This indicates that audiences reserve their highest critical praise for "main event" storytelling, perceiving other formats as lower-stakes or lower-quality content.

- **Strategic Format Positioning: The Score Ceiling**

    Analyzing specific formats reveals the structural limits on potential scores:

    *   **The Twin Pillars of Quality:** **TV** (6.96) and **Movies** (6.90) are effectively tied as the **safest vehicles for high scores**. The marginal lead of TV suggests that audiences slightly favor the depth of long-form storytelling over the condensed spectacle of film, but both are the only reliable paths to critical acclaim.

    *   **The ONA/OVA Plateau:** **OVAs** (6.06) and **ONAs** (6.04) hit a distinct **score ceiling**. Despite being direct-to-video or web releases often targeted at hardcore fans, they fail to generate broad acclaim. This suggests that the lack of "broadcast prestige" or perceived budget limitations directly dampens the audience's quality assessment.

    *   **The "Music" Niche Floor:** **Music** entries (5.86) sit at the bottom of the hierarchy. While visually creative, their short duration and lack of narrative substance result in the lowest perceived value, making them poor drivers for increasing a franchise's average score.

**Business Takeaway:**

To maximize a project's critical reception and average **Score**, resources must be concentrated on **TV Series or Movie** productions. These formats command the audience respect and engagement levels necessary to break the **7.00 score barrier**. Producers should view **OVAs, ONAs, and Specials** strictly as engagement tools for existing fans rather than vehicles for critical success, as they structurally suffer from a lower **quality perception cap**.

#### **B. Top Source by Score of TV and Movie**
**Question**: Beyond the choice of format (TV vs. Movie), which source materials consistently deliver the highest quality perception (Score)?

In [8]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# Filter TV and Movie only
df_type_source = df_anime_dataset_2023_prep[df_anime_dataset_2023_prep['Type'].isin(['TV', 'Movie'])].copy()

# Clean Score: coerce to numeric first so unparseable values are dropped too
df_type_source['Score'] = pd.to_numeric(df_type_source['Score'], errors='coerce')
df_type_source = df_type_source.dropna(subset=['Score', 'Source', 'Type'])

# Clean Source
df_type_source['Source'] = df_type_source['Source'].astype(str).str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_type_source = df_type_source[~df_type_source['Source'].str.upper().isin(exclude_list)]
df_type_source = df_type_source[df_type_source['Source'] != ""]

# --- A. TOP 5 SOURCES FOR TV ---
tv_data = df_type_source[df_type_source['Type'] == 'TV']
tv_source_stats = tv_data.groupby('Source').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
).reset_index()

# Filter: at least 30 titles
tv_source_stats = tv_source_stats[tv_source_stats['count'] >= 30]
top5_tv = tv_source_stats.sort_values('avg_score', ascending=False).head(5)

# --- B. TOP 5 SOURCES FOR MOVIE ---
movie_data = df_type_source[df_type_source['Type'] == 'Movie']
movie_source_stats = movie_data.groupby('Source').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
).reset_index()

# Filter: at least 30 titles
movie_source_stats = movie_source_stats[movie_source_stats['count'] >= 30]
top5_movie = movie_source_stats.sort_values('avg_score', ascending=False).head(5)

# Sort ascending for horizontal bar (highest at top)
top5_tv_plot = top5_tv.sort_values('avg_score', ascending=True)
top5_movie_plot = top5_movie.sort_values('avg_score', ascending=True)

# ==========================================
# 2. VISUALIZATION
# ==========================================

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Top 5 Sources by Score - TV Series</b>", "<b>Top 5 Sources by Score - Movies</b>"),
    horizontal_spacing=0.15
)

# --- TRACE 1: TV SERIES (LEFT) ---
fig.add_trace(
    go.Bar(
        y=top5_tv_plot['Source'],
        x=top5_tv_plot['avg_score'],
        orientation='h',
        marker_color='#ae017e',
        text=top5_tv_plot['avg_score'],
        textposition='outside',
        texttemplate='%{x:.2f}',
        hovertemplate='<b>Source:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Titles:</b> %{customdata}<extra></extra>',
        customdata=top5_tv_plot['count'],
        showlegend=False
    ),
    row=1, col=1
)

# --- TRACE 2: MOVIES (RIGHT) ---
fig.add_trace(
    go.Bar(
        y=top5_movie_plot['Source'],
        x=top5_movie_plot['avg_score'],
        orientation='h',
        marker_color='#6A1B9A',
        text=top5_movie_plot['avg_score'],
        textposition='outside',
        texttemplate='%{x:.2f}',
        hovertemplate='<b>Source:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Titles:</b> %{customdata}<extra></extra>',
        customdata=top5_movie_plot['count'],
        showlegend=False
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # General Settings
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="y unified",
    
    # Margin & Dimensions
    margin=dict(l=40, r=40, t=100, b=60),
    width=1400,
    height=550,
    
    # Backgrounds
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # Title Convention
    title=dict(
        text="<b>Manga and Novel Adaptations Lead Quality Rankings Across Both Formats</b><br><span style='color:#555555; font-size:14px'>(Top 5 Sources with Highest Average Scores - Minimum 30 Titles)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# --- AXIS SETTINGS ---
# Update X-axes (Values)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, range=[0, 8], row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, range=[0, 8], row=1, col=2)

# Update Y-axes (Categories)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=2)

fig.show()

try:
    save_chart(fig, step="04", stt="III_A_4.B", title="Manga_and_Novel_Adaptions_Lead_Quality_Rankings_Across_Both_Formats")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")


✅ Saved: [04]_[III_A_4.B]_Manga_and_Novel_Adaptions_Lead_Quality_Rankings_Across_Both_Formats (Transparent)


**Insight:**
- **The "Proven Narrative" Premium vs. The "Creation" Risk**

    The data reveals a distinct **"Adaptation Quality Premium."** Established narrative sources, specifically **Manga and Literature**, consistently outperform Original works and Game adaptations across both TV and Movie formats. Both categories maintain an average score above **7.00** (Manga peaking at 7.24 for TV), suggesting that pre-validated storylines and existing fanbases translate directly into higher audience satisfaction and perceived quality compared to stories created from scratch or adapted from interactive media.

- **Strategic Source Positioning: Quality Drivers**

    Breaking down the sources reveals specific insights into how origin impacts reception:

    *   **The Manga & Literature Safety Net:** **Manga** is the undisputed leader (7.24 TV / 7.20 Movie), closely followed by **Literature** (7.09 TV / 7.06 Movie). These sources act as a **quality guarantee.** The depth of world-building and character development inherent in books and long-running manga provides a solid foundation that resonates strongly with audiences, minimizing the risk of narrative failure.

    *   **The Original Format Divergence:** There is a notable dynamic for **Original** works. While they struggle to compete with adaptations in **TV Series** (6.68), they perform significantly better as **Movies** (6.97). This suggests that original scripts fare better as **contained, high-production cinematic experiences** rather than long-form serialized content, where maintaining plot consistency without source material is more challenging.

    *   **The "Gameplay Translation" Hurdle:** **Game** adaptations consistently lag behind narrative sources (6.65 TV / 6.89 Movie). This highlights a **structural challenge**: translating "player agency" and gameplay loops into passive storytelling often results in weaker narratives, negatively impacting the final score.

**Business Takeaway:**

To maximize the probability of achieving a high **Score**, producers should prioritize securing rights to **Manga or Literary properties**, as they offer the highest baseline for audience satisfaction. If pursuing **Original IP**, the data suggests the **Movie format** is a safer strategic vehicle than a TV series. Finally, stakeholders should exercise extreme caution with **Game adaptations**, investing heavily in screenwriting to overcome the historical trend of lower quality assessments.

## **IV. Theme B — Creative & Production Factors (Genres, Studios, Producers)**


### **1. Issue Overview**

During the **Exploratory Data Analysis (EDA)** phase, we identified common **data quality issues** across `Genres`, `Studios`, and `Producers`:

*   **Missing Values:** A portion of the dataset contained **null** or empty entries.

*   **Aggregated Lists (Non-Atomic Data):** Values were stored as **stringified lists** (e.g., `['Action', 'Comedy']`) within a single cell. Performing a direct count on this format would incorrectly tally **specific combinations** rather than the frequency of individual items, leading to inaccurate statistical results.
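
The counting problem described above can be reproduced on a few rows. The sketch below uses hypothetical toy data (not the anime dataset) to show how a direct `value_counts()` tallies whole strings, while a split-and-explode count tallies individual genres:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"Genres": ["Action, Comedy", "Comedy", "Action, Comedy", "Action"]})

# Direct count: each distinct string is treated as one category,
# so "Action, Comedy" is counted as a combination
combo_counts = toy["Genres"].value_counts()

# Atomic count: split on commas, give each item its own row, trim, then tally
item_counts = (
    toy["Genres"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(combo_counts)
print(item_counts)
```

In the direct count, "Action, Comedy" appears as its own category with 2 rows; in the atomic count, Action and Comedy each receive credit for 3 titles.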

### **2. Solution**

To resolve these inconsistencies, we implemented the following **data cleaning pipeline**:

*   **Handling Nulls:** Missing entries were temporarily labeled as **"UNKNOWN"** to track data completeness before being filtered out for detailed analysis.

*   **Data Normalization:** We transformed the text data using the following steps:
    *  **Regex Cleaning:** Used regular expressions to remove formatting characters like **brackets** (`[]`) and **quotes** (`''`).
    *  **Split & Explode:** Split the comma-separated strings and **expanded** (exploded) the list so that each individual element occupies its own row.
    *  **Trimming:** Stripped leading and trailing **whitespace** to ensure data consistency.
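
The steps above can be sketched as one reusable helper. This is a minimal illustration, not the notebook's actual implementation: the function name `explode_list_column` and the `EXCLUDE` constant are hypothetical, and the sketch assumes that no entity name contains a comma of its own.

```python
import pandas as pd

# Placeholder labels treated as noise (assumed list, mirroring the cells below)
EXCLUDE = {"UNKNOWN", "NONE", "NAN", "NULL", ""}

def explode_list_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return one row per individual item of a stringified-list column."""
    out = df[[column]].copy()
    # Handling nulls: label missing entries before converting to text
    out[column] = out[column].fillna("UNKNOWN").astype(str)
    # Regex cleaning: drop brackets and quotes
    out[column] = out[column].str.replace(r"[\[\]'\"]", "", regex=True)
    # Split & explode: one element per row
    out[column] = out[column].str.split(",")
    out = out.explode(column)
    # Trimming: remove leading/trailing whitespace
    out[column] = out[column].str.strip()
    # Filter out placeholder labels before detailed analysis
    return out[~out[column].str.upper().isin(EXCLUDE)]

# Usage on hypothetical toy data
toy = pd.DataFrame({"Studios": ["['Madhouse', 'MAPPA']", "UNKNOWN", None, "Madhouse"]})
clean = explode_list_column(toy, "Studios")
print(clean["Studios"].value_counts())
```

Because `explode` keeps the original row index, each item can still be joined back to its source title if other columns are needed later.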

### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**

#### **3.1. Genres**

In [9]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- A. RAW DATA (LEFT CHART) ---
# Count occurrences and take Top 10
genre_counts_raw = df_anime_dataset_2023['Genres'].value_counts().head(10).reset_index()
genre_counts_raw.columns = ['Genres', 'Count']

# Sort by Count ascending (for horizontal bar chart to show largest at top)
genre_counts_raw = genre_counts_raw.sort_values('Count', ascending=True)

# Define Color Logic for Raw Data:
# - Red (#C3122F) for Noise (Combinations with ',' or 'UNKNOWN')
# - Light Grey (#D3D3D3) for Neutral Context (Single valid genres in raw data)
colors_raw = [
    '#C3122F' if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else '#D3D3D3'
    for x in genre_counts_raw['Genres']
]

# --- B. CLEAN DATA (RIGHT CHART) ---
# Create a copy and ensure string format
df_genres_exploded = df_anime_dataset_2023_prep[['Genres']].copy()
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].astype(str)

# REGEX CLEANING: Remove brackets and quotes
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE: Separate genres and create individual rows
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.split(',')
df_genres_exploded = df_genres_exploded.explode('Genres')

# TRIM: Remove leading/trailing whitespace
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.strip()

# FILTER: Remove garbage values (Noise)
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_genres_exploded = df_genres_exploded[~df_genres_exploded['Genres'].str.upper().isin(exclude_list)]
df_genres_exploded = df_genres_exploded[df_genres_exploded['Genres'] != ""]

# Count occurrences and take Top 10
genre_counts_clean = df_genres_exploded['Genres'].value_counts().head(10).reset_index()
genre_counts_clean.columns = ['Genres', 'Count']
genre_counts_clean = genre_counts_clean.sort_values('Count', ascending=True)

# Define Color Logic for Clean Data:
# - Vibrant Pink (#ea568c) for Valid Data

# Pink gradient: dark at the top → light at the bottom
colors_clean = n_colors(
    'rgb(249,216,226)',   # light (bottom bar)
    'rgb(234,86,140)',    # dark (top bar)
    len(genre_counts_clean),
    colortype='rgb'
)  # Rows are sorted ascending, so the last row (drawn at the top) gets the darkest shade

# ==========================================
# 2. VISUALIZATION (PHASE 2: TECHNICAL COMPARISON)
# ==========================================

# Initialize Subplots (1 Row, 2 Columns)
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Genres Raw Data Distribution</b>", "<b>Genres Cleaned Data Distribution</b>"), 
    horizontal_spacing=0.15 # Adjust spacing to prevent label overlap
)

# --- TRACE 1: RAW DATA (LEFT) ---
fig.add_trace(
    go.Bar(
        y=genre_counts_raw['Genres'],
        x=genre_counts_raw['Count'],
        orientation='h',
        marker_color=colors_raw,
        text=genre_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}', # Format: 15k, 2.5k
        hovertemplate='<b>Raw Genre:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False 
    ),
    row=1, col=1
)

# --- TRACE 2: CLEAN DATA (RIGHT) ---
fig.add_trace(
    go.Bar(
        y=genre_counts_clean['Genres'],
        x=genre_counts_clean['Count'],
        orientation='h',
        marker_color=colors_clean, # Pink for Valid Data
        text=genre_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}', # Format: 15k, 2.5k
        hovertemplate='<b>Clean Genre:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False # Rule: 1-2 traces -> hide legend
    ),
    row=1, col=2
)


# ==========================================
# 3. LAYOUT SETTINGS (STRICT COMPLIANCE)
# ==========================================

fig.update_layout(
    # General Settings
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    
    # Margin & Dimensions
    margin=dict(l=40, r=40, t=100, b=60),
    width=1550,
    height=550,
    
    # Backgrounds
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # Title Convention (Position x=0.02, 2-line structure)
    title=dict(
        text="<b>Removing Aggregation & 'Unknown' Values Unlocks True Genres Distribution</b><br><span style='color:#555555; font-size:14px'>(Top 10 Most Aired Genres By Number of Anime)</span>",
        x=0.02,
        y=0.95, # Adjust vertical position slightly to fit top margin
        xanchor='left',
        yanchor='top'
    )
)

# --- AXIS SETTINGS (DECLUTTER) ---
# Rule: Hide Gridlines, Hide X-axis Labels (focus on Shape/Trend + Direct Labeling)

# Update X-axes (Values)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

# Update Y-axes (Categories)
# Keep labels for readability, no grid
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", row=1, col=2)

fig.show()

#--- 6. SAFE SAVE & SHOW ---    
try:
    save_chart(fig, step="03", stt="IV_B_3.1.Genres", title="Removing_Aggregation_and_'Unknown'_Values_Unlocks_True_Genres_Distribution")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_3.1.Genres]_Removing_Aggregation_and_'Unknown'_Values_Unlocks_True_Genres_Distribution (Transparent)


#### **3.2. Producers**

In [10]:
# ==========================================
# 1. DATA PREPARATION (LOGIC)
# ==========================================

# --- A. RAW DATA PREP (LEFT CHART) ---
# Count occurrences
producer_counts_raw = df_anime_dataset_2023['Producers'].value_counts().head(15).reset_index()
producer_counts_raw.columns = ['Producers', 'Count']
# Sort by Count ascending for horizontal bar chart
producer_counts_raw = producer_counts_raw.sort_values('Count', ascending=True)

# --- COLOR LOGIC: RAW DATA ---
# Rule: Use Red (#C3122F) if the data is "Complex" (contains ',') or "Unknown"
# Rule: Use Light Grey (#D3D3D3) for simple/neutral entries
colors_raw = [
    '#C3122F' if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else '#D3D3D3'
    for x in producer_counts_raw['Producers']
]

# --- B. CLEAN DATA PREP (RIGHT CHART) ---
# Create a copy and ensure string format
df_producers_clean = df_anime_dataset_2023_prep[['Producers']].copy()
df_producers_clean['Producers'] = df_producers_clean['Producers'].astype(str)

# REGEX: Remove brackets and quotes
df_producers_clean['Producers'] = df_producers_clean['Producers'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE: "Aniplex, Sony Music" -> 2 separate rows
df_producers_clean['Producers'] = df_producers_clean['Producers'].str.split(',')
df_producers_clean = df_producers_clean.explode('Producers')

# TRIM & FILTER: Remove whitespace and garbage values
df_producers_clean['Producers'] = df_producers_clean['Producers'].str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_producers_clean = df_producers_clean[~df_producers_clean['Producers'].str.upper().isin(exclude_list)]
df_producers_clean = df_producers_clean[df_producers_clean['Producers'] != ""]

# Aggregation for Clean Data (Top 15)
producer_counts_clean = df_producers_clean['Producers'].value_counts().head(15).reset_index()
producer_counts_clean.columns = ['Producers', 'Count']
producer_counts_clean = producer_counts_clean.sort_values('Count', ascending=True)

# --- COLOR LOGIC: CLEAN DATA ---
# Rule: Use Vibrant Pink (#ea568c) for Valid/Clean Data
# Pink gradient: dark at the top → light at the bottom
colors_clean = n_colors(
    'rgb(249,216,226)',   # light (bottom bar)
    'rgb(234,86,140)',    # dark (top bar)
    len(producer_counts_clean),
    colortype='rgb'
)  # Rows are sorted ascending, so the top row receives the darkest shade

# ==========================================
# 2. VISUALIZATION (TECHNICAL COMPARISON)
# ==========================================

# Initialize Subplots
# Rule: rows=1, cols=2
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Producers Raw Data Distribution</b>", "<b>Producers Cleaned Data Distribution</b>"),
    horizontal_spacing=0.15 
)

# --- TRACE 1: RAW DATA (LEFT) ---
fig.add_trace(
    go.Bar(
        y=producer_counts_raw['Producers'],
        x=producer_counts_raw['Count'],
        orientation='h',
        marker_color=colors_raw, 
        
        # Direct Labeling
        text=producer_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Raw Producer:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False 
    ),
    row=1, col=1
)

# --- TRACE 2: CLEAN DATA (RIGHT) ---
fig.add_trace(
    go.Bar(
        y=producer_counts_clean['Producers'],
        x=producer_counts_clean['Count'],
        orientation='h',
        marker_color=colors_clean, 
        
        # Direct Labeling
        text=producer_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Producer:</b> %{y}<br><b>Projects:</b> %{x:,}<extra></extra>',
        
        showlegend=False
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS 
# ==========================================

fig.update_layout(
    # --- FONTS & GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=1550,
    height=550,
    
    # --- BACKGROUND ---
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Parsing Producers Data Reveals True Market Leaders</b><br><span style='color:#555555; font-size:14px'>(Top 15 Most Active Producers)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# ==========================================
# 4. AXIS DECLUTTERING
# ==========================================
# Rule: Hide Gridlines, Hide X-axis Labels (Use Direct Labeling), Keep Y-axis Labels

# Update X-axes
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

# Update Y-axes
# Rule: Use 'automargin=True' to ensure long producer names are not cropped
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=2)

fig.show()

try:
    save_chart(fig, step="03", stt="IV_B_3.1.Producers", title="Parsing_Producers_Data_Reveals_True_Market_Leaders")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_3.1.Producers]_Parsing_Producers_Data_Reveals_True_Market_Leaders (Transparent)


#### **3.3. Studios**

In [11]:
# ==========================================
# 1. DATA PROCESSING
# ==========================================

# Create a copy and clean specific characters
df_studios = df_anime_dataset_2023_prep[['Studios']].copy()
df_studios['Studios'] = df_studios['Studios'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)

# Function to classify studio types
def classify_studio(studio_str):
    s = studio_str.strip().upper()
    if s in ["UNKNOWN", "NAN", "","NULL"]:
        return 'UNKNOWN'
    if ',' in studio_str:
        return 'Collaboration'
    return 'Solo'

# Apply classification and count frequencies
df_studios['Production_Type'] = df_studios['Studios'].apply(classify_studio)
studio_counts = df_studios['Production_Type'].value_counts().reset_index()
studio_counts.columns = ['Type', 'Count']
total_projects = studio_counts['Count'].sum()

# Sort Data: Largest to Smallest to ensure consistent clockwise rendering
studio_counts = studio_counts.sort_values(by='Count', ascending=False)

# ==========================================
# 2. COLOR STRATEGY & MAPPING
# ==========================================
color_map = {
    'UNKNOWN': '#C3122F',
    'Collaboration': '#ea568c',
    'Solo': '#D3D3D3'
}

# Map colors to the sorted dataframe
colors = [color_map[t] for t in studio_counts['Type']]

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = go.Figure(data=[go.Pie(
    labels=studio_counts['Type'],
    values=studio_counts['Count'],
    
    # --- DONUT SETTINGS ---
    hole=0.4, # Creates the donut shape
    
    # --- COLOR & BORDER ---
    marker=dict(colors=colors, line=dict(color='white', width=2)),
    
    # --- LABELS & FORMATTING ---
    # Use 'percent' for the chart slices to avoid clutter
    textinfo='percent',
    textposition='inside',
    textfont=dict(family='Tahoma', size=13),
    
    # --- SLICE ORDER (keep the pre-sorted order, drawn clockwise) ---
    sort=False,
    direction='clockwise'
)])

# ==========================================
# 4. LAYOUT SETTINGS (STRICT COMPLIANCE)
# ==========================================

fig.update_layout(
    # GENERAL SETTINGS
    font=dict(family="Tahoma", size=13, color="#000000"),
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    
    # BACKGROUND 
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # TITLE 
    title=dict(
        text="<b>Raw Data: UNKNOWN Dominance and Aggregation In Studio's Proportion</b>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),
    
    # LEGEND
    showlegend=True,
    legend=dict(
        x=1.02, 
        y=0.98, 
        xanchor='left', 
        yanchor='top',
        bgcolor='rgba(0,0,0,0)', # Transparent background
        font=dict(family='Tahoma', size=12)
    ),
    
    # CENTRAL ANNOTATION 
    # Displaying the Total count in the center of the Donut
    annotations=[dict(
        text=f'<b>{total_projects:,.0f}</b><br><span style="font-size:11px; color:gray">Total</span>',
        x=0.5, y=0.5, 
        font=dict(family='Tahoma', size=18, color='black'),
        showarrow=False
    )]
)

fig.show()

try:
    save_chart(fig, step="03", stt="IV_B_3.1.Studio1", title="Raw_Data_UNKNOWN_Dominance_and_Aggregation_In_Studios_Proportion")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")


✅ Saved: [03]_[IV_B_3.1.Studio1]_Raw_Data_UNKNOWN_Dominance_and_Aggregation_In_Studios_Proportion (Transparent)


In [12]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- A. RAW DATA (LEFT) ---
# Count occurrences and take Top 10
studio_counts_raw = df_anime_dataset_2023['Studios'].value_counts().head(10).reset_index()
studio_counts_raw.columns = ['Studios', 'Count']
studio_counts_raw = studio_counts_raw.sort_values('Count', ascending=True)

# Color Logic: Red (#C3122F) for "UNKNOWN" OR "Collaborations" (Commas), else Grey (#D3D3D3)
colors_raw = [
    '#C3122F' if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else '#D3D3D3'
    for x in studio_counts_raw['Studios']
]

# --- B. CLEAN DATA (RIGHT) ---
df_studios_clean = df_anime_dataset_2023_prep[['Studios']].copy()
df_studios_clean['Studios'] = df_studios_clean['Studios'].astype(str)
df_studios_clean['Studios'] = df_studios_clean['Studios'].str.replace(r"[\[\]\'\"]", "", regex=True)
df_studios_clean['Studios'] = df_studios_clean['Studios'].str.split(',')
df_studios_clean = df_studios_clean.explode('Studios')
df_studios_clean['Studios'] = df_studios_clean['Studios'].str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studios_clean = df_studios_clean[~df_studios_clean['Studios'].str.upper().isin(exclude_list)]
df_studios_clean = df_studios_clean[df_studios_clean['Studios'] != ""]

# Aggregation
studio_counts_clean = df_studios_clean['Studios'].value_counts().head(10).reset_index()
studio_counts_clean.columns = ['Studios', 'Count']
studio_counts_clean = studio_counts_clean.sort_values('Count', ascending=True)

# Color Logic: Pink (#ea568c) for Clean Data
# Pink gradient: dark at the top → light at the bottom
colors_clean = n_colors(
    'rgb(249,216,226)',   # light (bottom bar)
    'rgb(234,86,140)',    # dark (top bar)
    len(studio_counts_clean),  # size the palette from the studio data, not the genre data
    colortype='rgb'
)

# ==========================================
# 2. VISUALIZATION (SUBPLOT STRATEGY)
# ==========================================

# Rule: Use make_subplots(rows=1, cols=2)
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Studios Raw Data Distribution</b>", "<b>Studios Cleaned Data Distribution</b>"),
    horizontal_spacing=0.2 # spacing to prevent long label overlap
)

# --- TRACE 1: RAW DATA (LEFT) ---
fig.add_trace(
    go.Bar(
        y=studio_counts_raw['Studios'],
        x=studio_counts_raw['Count'],
        orientation='h',
        marker_color=colors_raw,
        text=studio_counts_raw['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Raw Entry:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=1
)

# --- TRACE 2: CLEAN DATA (RIGHT) ---
fig.add_trace(
    go.Bar(
        y=studio_counts_clean['Studios'],
        x=studio_counts_clean['Count'],
        orientation='h',
        marker_color=colors_clean,
        text=studio_counts_clean['Count'],
        textposition='outside',
        texttemplate='%{x:.2s}',
        hovertemplate='<b>Studio:</b> %{y}<br><b>Total Projects:</b> %{x:,}<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS (DECLUTTER & FORMAT)
# ==========================================

fig.update_layout(
    # General
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    margin=dict(l=40, r=40, t=100, b=60),
    width=1500,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # Title Convention
    title=dict(
        text="<b>Parsing Studio Data Reveals True Studios Market Leaders</b><br><span style='color:#555555; font-size:14px'>(Top 10 Most Active Studios)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    )
)

# --- AXIS SETTINGS ---
# Rule: Hide X-axis labels (use direct labeling), Hide Gridlines
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=1)
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, row=1, col=2)

# Rule: Keep Y-axis labels, ensure they fit (automargin)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=2)

fig.show()

try:
    save_chart(fig, step="03", stt="IV_B_3.1.Studio2", title="Parsing_Studio_Data_Reveals_True_Studios_Market_Leaders")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_3.1.Studio2]_Parsing_Studio_Data_Reveals_True_Studios_Market_Leaders (Transparent)


##### **The Transformation Insight:**

*   **The Misleading View:** The raw data suggested that **"UNKNOWN"** was the **dominant category** across all features (leading with **13k** in Producers, **11k** in Studios, and **4.9k** in Genres), effectively **masking the actual market structure**. Additionally, **aggregated text** (e.g., "Action, Comedy" or "Madhouse, MAPPA") **fragmented the data**, preventing individual entities from receiving proper credit for their works.

*   **The Truth:** The clean data corrects this by revealing the **true market leaders** that were previously hidden by noise. By **splitting collaborations** and **removing unknown credits**, we identified that **Comedy** dominates genres (7.1k), **NHK** leads producers, and **Toei Animation** is the top studio. Crucially, this process allows us to **accurately record the total volume of works** for each entity, ensuring that projects are **correctly attributed** rather than lost in combined strings.

*   **The Strategic Value:** Instead of noise, we now have **precise, reliable features** to analyze **how Genres, Producers, and Studios drive Anime Scores**.

### **4. Business Insights**

#### **A. Genres Score Analysis** 
**Question:** Beyond the popular genres with many released titles, **which less common genres earn high audience scores**?

In [13]:
# ==========================================
# 1. STEP A: IDENTIFY TOP 10 POPULAR GENRES
# ==========================================
df_pop = df_anime_dataset_2023_prep[['Genres']].copy()
df_pop['Genres'] = df_pop['Genres'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)
df_pop['Genres'] = df_pop['Genres'].str.split(',')
df_pop = df_pop.explode('Genres')
df_pop['Genres'] = df_pop['Genres'].str.strip()

exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_pop = df_pop[~df_pop['Genres'].str.upper().isin(exclude_list)]
df_pop = df_pop[df_pop['Genres'] != ""]
# Remove Award Winning
df_pop = df_pop[df_pop['Genres'] != 'Award Winning']

top_10_popular_list = df_pop['Genres'].value_counts().head(10).index.tolist()

# ==========================================
# 2. STEP B: CALCULATE TOP 10 HIGHEST RATED
# ==========================================
df_score = df_anime_dataset_2023_prep[['Genres', 'Score']].copy()
df_score = df_score.dropna(subset=['Score'])
df_score['Genres'] = df_score['Genres'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)
df_score['Genres'] = df_score['Genres'].str.split(',')
df_score = df_score.explode('Genres')
df_score['Genres'] = df_score['Genres'].str.strip()
df_score = df_score[~df_score['Genres'].str.upper().isin(exclude_list)]
df_score = df_score[df_score['Genres'] != ""]
# Remove Award Winning
df_score = df_score[df_score['Genres'] != 'Award Winning']

# Aggregation
genre_avg_score = (
    df_score.groupby('Genres')
    .agg(
        average_score=('Score', 'mean'), 
        anime_count=('Score', 'count')
    )
    .reset_index()
)

# Keep only genres with at least 50 released titles
genre_avg_score = genre_avg_score[genre_avg_score['anime_count'] >= 50]

# Take the Top 10
top_genres_score = genre_avg_score.sort_values('average_score', ascending=True).tail(10)

# ==========================================
# 3. STEP C: COLOR LOGIC
# ==========================================
colors = []
for genre in top_genres_score['Genres']:
    if genre in top_10_popular_list:
        colors.append('#D3D3D3') # Mainstream
    else:
        colors.append('#6A1B9A') # Niche Gem

# ==========================================
# 4. VISUALIZATION
# ==========================================
fig = go.Figure(go.Bar(
    y=top_genres_score['Genres'],
    x=top_genres_score['average_score'],
    orientation='h',
    marker_color=colors,
    text=top_genres_score['average_score'],
    textposition='outside',
    texttemplate='%{x:.2f}',
    hovertemplate='<b>Genre:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Animes:</b> %{customdata}<extra></extra>',
    customdata=top_genres_score['anime_count']
))

# ==========================================
# 5. LAYOUT SETTINGS
# ==========================================
fig.update_layout(
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="y unified",
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    title=dict(
        text="<b>Niche Genres Dominate Audience Satisfaction Rankings</b><br><span style='color:#555555; font-size:14px'>(Purple: Niche Genres | Grey: Mainstream Genres found in Top 10 Most Aired)</span>",
        x=0.02, y=0.95, xanchor='left', yanchor='top'
    ),
    showlegend=False
)

fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ")

fig.show()

#--- 6. SAFE SAVE & SHOW ---    
try:
    save_chart(fig, step="03", stt="IV_B_4.A", title="Niche_Genres_Dominate_Audience_Satisfaction_Rankings")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_4.A]_Niche_Genres_Dominate_Audience_Satisfaction_Rankings (Transparent)


The data compares audience satisfaction scores for different anime genres.
*   **Purple:** Niche Genres
*   **Grey:** Mainstream Genres (Top 10 Most Aired)

**Insights**

* **Niche Dominance at the Top:** **Mystery (7.00)** and **Suspense (6.96)**, both Niche genres, secure the highest satisfaction scores, confirming that specialized content can generate peak audience engagement.
* **Mainstream Stability:** Genres like **Drama (6.85)** and **Romance (6.80)** are safe bets, providing consistently high scores, though they do not reach the very top.
* **Polarized Niche Performance:** Niche is a **high-risk, high-reward** field. While Mystery/Suspense excel, **Gourmet (6.63)** and **Girls Love (6.59)** are at the bottom of the list.
* **Action/Adventure Saturation:** Highly popular Mainstream genres like **Action** and **Adventure** (both 6.67) score surprisingly low, suggesting market saturation makes it difficult to achieve exceptional satisfaction.

**Business Takeaways**

* **Strategy 1: Maximized Score via Elite Niche Focus**
    *   **Action:** Prioritize projects in **Mystery (7.00)** or **Suspense (6.96)**.
    *   **Rationale:** The clearest path to the highest possible audience satisfaction score.

* **Strategy 2: Safe & Stable Investment in High-Ranking Mainstream**
    *   **Action:** Invest in **Drama (6.85)** or **Romance (6.80)** for a broad-reach, high-satisfaction balance.
    *   **Rationale:** Low-risk, high-return strategy compared to other Mainstream genres like Action/Adventure.

* **Strategy 3: Develop "Super-Niche" Titles**
    *   **Action:** Combine top **Niche** elements (Mystery, Suspense) with **stable Mainstream** themes (e.g., Supernatural).
    *   **Rationale:** Uses Niche depth to boost satisfaction while leveraging Mainstream familiarity for broader appeal.
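
Before committing to a specific pairing under Strategy 3, it may help to check how titles carrying both tags have actually scored. The sketch below is illustrative only: `pair_score` is a hypothetical helper, the toy rows are invented, and the column names `Genres` and `Score` are assumed to match the processed dataset.

```python
import pandas as pd

def pair_score(df, genre_a, genre_b, genres_col="Genres", score_col="Score"):
    """Return (mean score, title count) for rows tagged with both genres."""
    # Turn each cell into a set of trimmed genre names
    tags = (
        df[genres_col].fillna("").astype(str)
        .str.replace(r"[\[\]'\"]", "", regex=True)
        .str.split(",")
        .apply(lambda items: {s.strip() for s in items})
    )
    # Keep rows whose tag set contains both genres
    mask = tags.apply(lambda s: genre_a in s and genre_b in s)
    scores = pd.to_numeric(df.loc[mask, score_col], errors="coerce").dropna()
    return scores.mean(), len(scores)

# Usage on hypothetical toy data
toy = pd.DataFrame({
    "Genres": ["Mystery, Supernatural", "Mystery", "['Mystery', 'Supernatural']"],
    "Score": [8.0, 7.0, 7.0],
})
mean_score, n_titles = pair_score(toy, "Mystery", "Supernatural")
print(mean_score, n_titles)
```

A pairing backed by only a handful of titles should be read with caution, in the same spirit as the minimum-title threshold applied to the genre ranking above.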

#### **B. Format-Genre Alignment Strategy**
**Question:** For each of the highest-rated genres, **which broadcasting format should we choose to optimize the project's audience satisfaction score**?

In [14]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

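# Helper function: normalize a cell into a list of trimmed strings
def parse_list_field(x):
    # Check for real lists first: pd.isna() on a list returns an array, not a bool
    if isinstance(x, list):
        return [str(s).strip() for s in x]
    if pd.isna(x):
        return []
    try:
        # Stringified lists such as "['Action', 'Comedy']"
        val = ast.literal_eval(x)
        if isinstance(val, (list, tuple)):
            return [str(s).strip() for s in val]
    except Exception:
        pass
    # Fallback: plain comma-separated text
    return [s.strip() for s in str(x).split(',') if s.strip()]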

# Initialize
df_matrix = df_anime_dataset_2023_prep[['Type', 'Genres', 'Score']].copy()
df_matrix = df_matrix.dropna(subset=['Type', 'Genres', 'Score'])
df_matrix['Score'] = pd.to_numeric(df_matrix['Score'], errors='coerce')

# Parse Genres
df_matrix['Genres_list'] = df_matrix['Genres'].apply(parse_list_field)

# Explode Genres
df_exploded = df_matrix.explode('Genres_list')
df_exploded.rename(columns={'Genres_list': 'Genre'}, inplace=True)
df_exploded['Genre'] = df_exploded['Genre'].str.strip()
df_exploded['Type'] = df_exploded['Type'].str.strip()

# --- FILTERING ---
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
# Clean Genres
df_exploded = df_exploded[~df_exploded['Genre'].str.upper().isin(exclude_list)]
df_exploded = df_exploded[df_exploded['Genre'] != ""]
df_exploded = df_exploded[df_exploded['Genre'] != "Award Winning"] # Rule: Remove Award Winning

# Clean Types
df_exploded = df_exploded[~df_exploded['Type'].str.upper().isin(exclude_list)]

# --- STEP A: IDENTIFY TOP 10 GENRES (Highest Avg Score, Min 20 Titles) ---
genre_stats = df_exploded.groupby('Genre').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
)
# Filter count >= 20 and Sort Descending
top_10_genres = genre_stats[genre_stats['count'] >= 20].sort_values('avg_score', ascending=False).head(10).index.tolist()

# --- STEP B: CREATE HEATMAP DATA (Type vs. Top 10 Genres) ---
# Filter dataset to only Top 10 Genres
df_final = df_exploded[df_exploded['Genre'].isin(top_10_genres)]

# Create Pivot Table
heatmap_data = df_final.groupby(['Type', 'Genre'])['Score'].mean().unstack()

# Reorder Columns (Genres) by Rank (Highest score left)
heatmap_data = heatmap_data.reindex(columns=top_10_genres)

# Optional: Reorder Rows (Type) by Average Score of that Type (for better visual flow)
type_order = df_final.groupby('Type')['Score'].mean().sort_values(ascending=False).index.tolist()
heatmap_data = heatmap_data.reindex(index=type_order)

# ==========================================
# 2. COLOR STRATEGY (PHASE 3)
# ==========================================
# Gradient: Grey (Low) -> Pink (Mid) -> Purple (High)
custom_colorscale = [
    [0.0, "#D3D3D3"], 
    [0.5, '#ea568c'], 
    [1.0, "#6A1B9A"]
]

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = px.imshow(
    heatmap_data,
    labels=dict(x="Top Genres", y="Format (Type)", color="Avg Score"),
    x=heatmap_data.columns,
    y=heatmap_data.index,
    aspect="auto",
    color_continuous_scale=custom_colorscale,
    text_auto=".2f"
)

# ==========================================
# 4. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family='Tahoma', size=12, color="#000000"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    width=1000,
    height=600, 
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Format Matters: How Anime Type Affects Genre Scores</b><br><span style='color:#555555; font-size:14px'>(Average Score Matrix: Anime Types vs. Top 10 Highest-Rated Genres)</span>",
        x=0.02, 
        y=0.97,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- MARGINS ---
    margin=dict(t=100, l=100, r=40, b=60),
    
    # --- COLORBAR ---
    coloraxis_colorbar=dict(
        title=dict(text="Score", side="top"),
        thickness=15,
        len=0.5,
        yanchor="top",
        y=1,
        tickfont=dict(family="Tahoma", size=11)
    )
)

# --- AXIS REFINEMENTS ---
# Move X-axis to top
fig.update_xaxes(side="top", tickfont=dict(family="Tahoma", size=12), title=None)
fig.update_yaxes(tickfont=dict(family="Tahoma", size=12), title=None, ticksuffix="  ")
fig.update_traces(xgap=1, ygap=1)

fig.show()

#--- 6. SAFE SAVE & SHOW ---    
try:
    save_chart(fig, step="03", stt="IV_B_4.B", title="Format_Matters_How_Anime_Type_Affects_Genre_Scores")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_4.B]_Format_Matters_How_Anime_Type_Affects_Genre_Scores (Transparent)


**Insight**
* **The Dominance of TV and Movie Formats**

    The chart shows that **TV Series and Movies** are the only formats that regularly reach average scores above 7.00; most of the high-scoring cells sit in these two rows. This points toward a **Long-form Production Investment Strategy**: Producers should allocate their most significant resources to **TV Series and Movies**, formats that allow more room for **narrative development** and typically carry the **production quality** needed for high scores. Conversely, shorter formats (Special, ONA, OVA) consistently average in the 6.xx range, which makes them weak candidates as the primary vehicle for a major "hit."

* **Strategic Genre Specialization by Format**

    Choosing the right format must be a function of the genre's structural needs:

    *   **TV's Strength in Tension:** **TV Series** are the optimal format for genres requiring sustained tension and atmosphere, peaking in **Suspense** (7.57) and **Gourmet** (7.22). The episodic nature allows for effective slow-burn mystery and world-building that retains viewer engagement over time.

    *   **Movie's Strength in Narrative Density:** **Movies** excel in genres requiring narrative density, such as **Mystery** (7.44) and **Romance** (7.18). The focused runtime allows for an impactful, self-contained story that can deliver a complex plot without the "pacing issues" often seen in TV series.

* **Short-Form Content for Niche Exploration**

    Shorter formats (ONA, OVA, Special) should be primarily utilized for **Niche Market Exploration** and **Supplementary Content**. The highest scores in the ONA and OVA rows are found in specific niche genres like **Gourmet** (ONA: 6.72; OVA: 6.59). The **Music** format also shows an interesting anomaly, scoring surprisingly high in **Suspense** (6.86). This may point to its usefulness as a low-budget tool for **tension-building marketing** and promotional material, although a single cell average should be read with caution. Producers should still expect that short-form content is not a primary vehicle for maximizing the final audience score.

**Business Takeaway**

For maximizing project score, the production decision is binary: 
* **TV Series** for **sustained narrative tension and world-building**
* **Movies** for **concentrated, high-impact storytelling**.
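
Because every heatmap cell is a mean, it can help to check how many titles sit behind a cell before reading much into a single value, such as an unexpectedly high score in a small format. The sketch below is illustrative only: the data is invented, the column names simply mirror `Type`, `Genre`, and `Score`, and `min_titles` is an assumed cut-off that this notebook does not use.

```python
import pandas as pd

# Toy stand-in for the exploded (Type, Genre, Score) frame; all values are invented.
toy = pd.DataFrame({
    "Type":  ["TV", "TV", "TV", "Movie", "Movie", "Music"],
    "Genre": ["Suspense", "Suspense", "Gourmet", "Suspense", "Suspense", "Suspense"],
    "Score": [7.8, 7.4, 7.2, 7.5, 7.1, 6.9],
})

min_titles = 2  # assumed minimum number of titles for a cell to be shown

# Mean and title count for every Type x Genre cell
cell_stats = toy.groupby(["Type", "Genre"])["Score"].agg(["mean", "count"])
mean_matrix = cell_stats["mean"].unstack()
count_matrix = cell_stats["count"].unstack()

# Blank out cells backed by fewer than min_titles titles
masked = mean_matrix.where(count_matrix >= min_titles)
print(masked.round(2))
```

Passing a matrix like `masked` to the heatmap instead of the raw means would leave thinly supported cells empty rather than colored.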

#### **C. Studios vs Genres** 
**Question**: Studios are directly responsible for drawing, animation, compositing, coloring, editing, and post-production. In other words, they are the teams that produce the actual visual content we see on the screen. Therefore, we want to discover **which Genres top Studios consistently excel in.**

In [15]:

# ==========================================
# 1. DATA PREPARATION
# ==========================================

# --- COMMON CLEANING (SCORE & STUDIOS) ---
df_studio = df_anime_dataset_2023_prep[['Studios', 'Score']].copy()

# Coerce scores to numeric first so invalid values become NaN, then drop incomplete rows
df_studio['Score'] = pd.to_numeric(df_studio['Score'], errors='coerce')
df_studio = df_studio.dropna(subset=['Score', 'Studios'])
df_studio['Studios'] = df_studio['Studios'].astype(str)

# Standard Cleaning: Remove brackets/quotes
df_studio['Studios'] = df_studio['Studios'].str.replace(r"[\[\]\'\"]", "", regex=True)

# Split & Explode
df_studio['Studios'] = df_studio['Studios'].str.split(',')
df_studio = df_studio.explode('Studios')

# Trim & Filter Noise
df_studio['Studios'] = df_studio['Studios'].str.strip()
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studio = df_studio[~df_studio['Studios'].str.upper().isin(exclude_list)]
df_studio = df_studio[df_studio['Studios'] != ""]

# --- AGGREGATE STATS (Count & Mean) ---
studio_stats = df_studio.groupby('Studios').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
).reset_index()

# --- GROUP A: TOP 10 BY VOLUME (MOST ACTIVE) ---
# Logic: Sort by Count Descending -> Take Top 10 -> Sort by Score for Plotting
top_volume_list = studio_stats.sort_values('count', ascending=False).head(10)
top_volume_plot = top_volume_list.sort_values('avg_score', ascending=True) # Sort Ascending for Horizontal Bar

# --- GROUP B: TOP 10 BY SCORE (QUALITY LEADERS) ---
# Logic: Filter Count >= 50 -> Sort by Score Descending -> Take Top 10 -> Sort Ascending for Plotting
threshold = 50
qualified_studios = studio_stats[studio_stats['count'] >= threshold]
top_quality_list = qualified_studios.sort_values('avg_score', ascending=False).head(10)
top_quality_plot = top_quality_list.sort_values('avg_score', ascending=True)

# ==========================================
# 2. VISUALIZATION (SUBPLOT STRATEGY)
# ==========================================

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=(
        "<b>Avg Score of Top 10 Most Active Studios</b>", 
        f"<b>Avg Score of Top 10 Highest Rated Studios (≥{threshold} titles)</b>"
    ),
    horizontal_spacing=0.15
)

# --- TRACE 1: VOLUME LEADERS SCORE (PINK) ---
fig.add_trace(
    go.Bar(
        y=top_volume_plot['Studios'],
        x=top_volume_plot['avg_score'],
        orientation='h',
        marker_color='#ea568c', # Vibrant Pink (Volume Context)
        
        # Direct Labeling
        text=top_volume_plot['avg_score'],
        textposition='outside',
        texttemplate='%{x:.2f}',
        
        # Tooltip
        hovertemplate='<b>Studio:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Projects:</b> %{customdata}<extra></extra>',
        customdata=top_volume_plot['count']
    ),
    row=1, col=1
)

# --- TRACE 2: QUALITY LEADERS SCORE (PURPLE) ---
fig.add_trace(
    go.Bar(
        y=top_quality_plot['Studios'],
        x=top_quality_plot['avg_score'],
        orientation='h',
        marker_color='#6A1B9A', # Deep Purple (Quality Context)
        
        # Direct Labeling
        text=top_quality_plot['avg_score'],
        textposition='outside',
        texttemplate='%{x:.2f}',
        
        # Tooltip
        hovertemplate='<b>Studio:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<br><b>Total Projects:</b> %{customdata}<extra></extra>',
        customdata=top_quality_plot['count']
    ),
    row=1, col=2
)

# ==========================================
# 3. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="y unified",
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=1400, 
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Mass Production vs. Consistent Excellence: A Score Comparison</b>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),

    showlegend=False
)

# ==========================================
# 4. AXIS DECLUTTER
# ==========================================

# Use one fixed x-range for both panels so bar lengths are directly comparable
range_x = [0, 8]

# Left Chart Axis
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, range=range_x, row=1, col=1)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=1)

# Right Chart Axis
fig.update_xaxes(showgrid=False, showticklabels=False, zeroline=False, range=range_x, row=1, col=2)
fig.update_yaxes(showgrid=False, showline=False, ticksuffix="  ", automargin=True, row=1, col=2)

fig.show()

#--- 6. SAFE SAVE & SHOW ---    
try:
    save_chart(fig, step="03", stt="IV_B_C.1.1", title="Mass_Production_vs_Consistent_Excellence_A_Score_Comparison")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_C.1.1]_Mass_Production_vs_Consistent_Excellence_A_Score_Comparison (Transparent)


**Insight**

* **The Strategic Trade-off: Volume vs. Score**

    The comparison explicitly highlights a fundamental **trade-off** between **mass production** and **consistent scoring excellence**. 

    * The **Top 10 Most Active Studios** generally maintain average scores below 7.00 (with the lowest at 6.59), a pattern consistent with prioritizing high **volume** to ensure stable revenue and resource utilization.
    * In contrast, the **Top 10 Highest Rated Studios** (all above 7.09) focus on achieving superior **quality** and brand value. This disparity shows that studios must adopt a clear **strategic segmentation**: either optimize for **cash flow stability** (volume) or strive for a **premium brand reputation** (score).

* **The Significance of the Quality Floor Gap**

    * The difference in the lowest average score between the two groups, **0.50 points** (6.59 vs. 7.09), is substantial in the competitive anime market. This gap represents the difference between a project being merely *average* and being firmly in the *high-quality* tier. Studios must understand that achieving a reputation for **Consistent Excellence** (i.e., making the "Highest Rated" list) requires a commitment to a **higher quality floor** across their entire output. This performance difference supports the notion that specialized studios like Bones, ufotable, and Kyoto Animation command a **score premium** due to their consistently high quality standards.

* **The Exceptional Case and Producer Strategy**

    * **A-1 Pictures** stands out as a unique exception, appearing at the top of the **Most Active** list while maintaining an average score of 7.12, high enough to make the **Highest Rated** list. This demonstrates a successful, albeit rare, **ideal business model** that has mastered both **scale and quality**. For Producers, the data suggests a clear selection strategy: if the goal is to maximize the chance of a **high score** and a critical hit, the **Highest Rated Studios** are the natural **premium partners**. If the goal is a balance between moderate quality and reliable production output, the **Most Active Studios** offer a broader, more accessible range of options.

**Business Takeaway**

Success in the anime industry is no longer a simple matter of **Volume** or **Score**, but of **strategic alignment**. **Highest Rated Studios** prove that focusing on **technical quality** yields both higher audience scores and greater long-term **brand equity**. Producers should choose their Studio based on their ultimate objective: **Cash Flow Stability (Active Studios)** or **Reputation & High Score (Rated Studios)**.
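
The two rankings and the "quality floor gap" discussed above can be reproduced on a small scale. This is a minimal sketch with invented studio names and numbers; `top_n` and `threshold` are placeholder values chosen for the toy data, not the 10 and 50 used in the cell above.

```python
import pandas as pd

# Invented per-studio stats; names and numbers are placeholders, not notebook results.
stats = pd.DataFrame({
    "Studios":   ["S1", "S2", "S3", "S4", "S5"],
    "avg_score": [6.6, 7.1, 7.4, 6.9, 7.8],
    "count":     [300, 250, 60, 120, 10],
})

top_n, threshold = 2, 50  # assumed sizes for this toy example

# Group A: most active studios by title count
most_active = stats.nlargest(top_n, "count")
# Group B: highest average score among studios with enough titles
highest_rated = stats[stats["count"] >= threshold].nlargest(top_n, "avg_score")

# "Quality floor" = lowest average score inside each group
floor_gap = highest_rated["avg_score"].min() - most_active["avg_score"].min()
# Studios that appear in both lists (the "scale and quality" case)
overlap = set(most_active["Studios"]) & set(highest_rated["Studios"])

print(f"floor gap: {floor_gap:.2f}, in both lists: {sorted(overlap)}")
```

Note how the minimum-title filter removes `S5`: its average is the highest in the toy data, but it rests on only 10 titles.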

In [29]:
# ==========================================
# 1. DATA PREPARATION
# ==========================================

# Helper function
def parse_list_field(x):
    if pd.isna(x): return []
    if isinstance(x, list): return [str(s).strip() for s in x]
    try:
        val = ast.literal_eval(x)
        if isinstance(val, (list, tuple)): return [str(s).strip() for s in val]
    except Exception: pass
    return [s.strip() for s in str(x).split(',') if s.strip()]

# Initialize
df_matrix = df_anime_dataset_2023_prep[['anime_id', 'Studios', 'Genres', 'Score']].copy()
df_matrix['Score'] = pd.to_numeric(df_matrix['Score'], errors='coerce')
df_matrix = df_matrix.dropna(subset=['Studios', 'Genres', 'Score'])

# Parse Lists
df_matrix['Studios_list'] = df_matrix['Studios'].apply(parse_list_field)
df_matrix['Genres_list'] = df_matrix['Genres'].apply(parse_list_field)

# --- STEP A: IDENTIFY TOP 10 ELITE STUDIOS (Avg Score, Min 20 Titles) ---
df_studios = df_matrix.explode('Studios_list')
df_studios['Studio'] = df_studios['Studios_list'].astype(str).str.strip()

# Remove Garbage
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studios = df_studios[~df_studios['Studio'].str.upper().isin(exclude_list)]
df_studios = df_studios[df_studios['Studio'] != ""]

# Aggregation Studio
studio_stats = df_studios.groupby('Studio').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
)
# Filter & Sort
top_10_studios = studio_stats[studio_stats['count'] >= 20].sort_values('avg_score', ascending=False).head(10).index.tolist()

# --- STEP B: IDENTIFY TOP 10 ELITE GENRES (Avg Score, Min 50 Titles) ---
# We explode Genres from the FULL dataset (not just top studios) to find global top quality genres
df_genres = df_matrix.explode('Genres_list')
df_genres['Genre'] = df_genres['Genres_list'].astype(str).str.strip()
df_genres = df_genres[~df_genres['Genre'].str.upper().isin(exclude_list)]
df_genres = df_genres[df_genres['Genre'] != ""]

# --- UPDATE: REMOVE 'AWARD WINNING' ---
df_genres = df_genres[df_genres['Genre'] != 'Award Winning']

# Aggregation Genre
genre_stats = df_genres.groupby('Genre').agg(
    avg_score=('Score', 'mean'),
    count=('Score', 'count')
)
# Filter & Sort
top_10_genres = genre_stats[genre_stats['count'] >= 50].sort_values('avg_score', ascending=False).head(10).index.tolist()

# --- STEP C: CREATE INTERSECTION MATRIX ---
# Filter original exploded data to keep only Elite Studios AND Elite Genres
df_final = df_studios.copy() # Starts with exploded studios
df_final['Genres_list'] = df_final['Genres_list'].apply(lambda x: [g for g in x if g in top_10_genres]) # Keep only top genres in list
df_final = df_final.explode('Genres_list') # Explode Genres
df_final.rename(columns={'Genres_list': 'Genre'}, inplace=True)

# Filter rows
df_final = df_final[
    (df_final['Studio'].isin(top_10_studios)) & 
    (df_final['Genre'].notna())
]

# Create Pivot Table (Heatmap Data)
heatmap_data = df_final.groupby(['Studio', 'Genre'])['Score'].mean().unstack()

# Reorder indices for visual logic (Best at Top-Left)
heatmap_data = heatmap_data.reindex(index=top_10_studios, columns=top_10_genres)

# ==========================================
# 2. COLOR STRATEGY (PHASE 3)
# ==========================================
# Gradient: Grey (Low) -> Pink (Mid) -> Purple (High)
custom_colorscale = [
    [0.0, "#D3D3D3"], 
    [0.5, '#ea568c'], 
    [1.0, "#6A1B9A"]
]

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = px.imshow(
    heatmap_data,
    labels=dict(x="Top Genres", y="Top Studios", color="Avg Score"),
    x=heatmap_data.columns,
    y=heatmap_data.index,
    aspect="auto",
    color_continuous_scale=custom_colorscale,
    text_auto=".2f"
)

# ==========================================
# 4. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family='Tahoma', size=12, color="#000000"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    width=1000,
    height=800, 
    
    # --- TITLE CONVENTION ---
    title=dict(
        text="<b>Elite Quality Intersection: Top Studios vs. Top Genres</b><br><span style='color:#555555; font-size:14px'>(Performance Matrix of the 10 Highest-Rated Studios across the 10 Highest-Rated Genres)</span>",
        x=0.02, 
        y=0.97,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- MARGINS ---
    margin=dict(t=100, l=150, r=40, b=60),
    
    # --- COLORBAR ---
    coloraxis_colorbar=dict(
        title=dict(text="Score", side="top"),
        thickness=15,
        len=0.5,
        yanchor="top",
        y=1,
        tickfont=dict(family="Tahoma", size=11)
    )
)

# --- AXIS REFINEMENTS ---
fig.update_xaxes(side="top", tickfont=dict(family="Tahoma", size=12), title=None)
fig.update_yaxes(tickfont=dict(family="Tahoma", size=12), title=None, ticksuffix="  ")
fig.update_traces(xgap=1, ygap=1)

fig.show()

#--- 6. SAFE SAVE & SHOW ---    
try:
    save_chart(fig, step="03", stt="IV_B_C.1.2", title="Elite_Quality_Intersection_Top_Studios_vs_Top_Genres")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_C.1.2]_Elite_Quality_Intersection_Top_Studios_vs_Top_Genres (Transparent)


**Insight**

The chart "Elite Quality Intersection: Top Studios vs. Top Genres" (A performance matrix of the 10 Highest-Rated Studios across the 10 Highest-Rated Genres) provides critical data for strategic investment.

* **Absolute Peak Performance is in High-Tension Genres:**
    *   The absolute highest score in the entire matrix is **8.54** (**MAPPA** in **Suspense**).
    *   Scores above 8.10 are heavily concentrated in high-tension genres: Suspense (MAPPA 8.54, Wit Studio 8.19, White Fox 8.12) and Mystery (White Fox 8.20). A notable exception is White Fox in Romance (8.31).

*  **White Fox Demonstrates Versatility:**
    *   **White Fox** is the only studio to achieve scores above 8.00 in two fundamentally different genres: **Romance (8.31)** and **Mystery (8.20)**. This indicates exceptional directorial and production skill capable of mastering both deep emotional narrative and intellectual, thrilling plotlines.

*  **Core Strengths Define Top Specialist Studios:**
    *   High-technical-requirement genres have clear specialists: **MAPPA** leads **Suspense**, **David Production** excels in **Adventure (8.04)**, and **Ufotable** dominates **Supernatural (8.01)**.
    *   **Kyoto Animation (KyoAni)** confirms its emotional core with the highest stable score in **Drama (7.98)**, showing mastery over character-driven narratives.

*  **Hidden Risks in Studio-Genre Mismatches:**
    *   Even top studios have clear weaknesses, representing significant investment risk:
        *   **CloverWorks** has the lowest score on the entire matrix in **Suspense (6.17)**.
        *   **White Fox** shows a significant dip in **Supernatural (6.33)**.
        *   **Wit Studio** performs poorly in **Mystery (6.67)**.

*   **Lack of Studio Presence in Niche Genres**

    * **Sports** and **Gourmet** have very few studios achieving scores above 7.50 (only 4 out of 10 studios score above 7.50 in *Sports*, and only 3 out of 10 in *Gourmet*).
    * This suggests that few of the leading studios have established strength in these two genres, creating a potential market opportunity for studios with compatible strengths to expand into these underserved areas.
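
Counts such as "4 out of 10 studios above 7.50" can be derived directly from the pivoted matrix. The sketch below uses an invented three-studio matrix rather than `heatmap_data`, and it also counts how many studios have any title in each genre, since an empty cell (`NaN`) and a low score both fail the cut-off but mean different things.

```python
import numpy as np
import pandas as pd

# Invented studio-by-genre matrix; NaN means the studio has no title in that genre.
matrix = pd.DataFrame(
    {
        "Suspense": [8.5, 8.2, 6.2],
        "Sports":   [7.6, np.nan, 7.2],
        "Gourmet":  [np.nan, 7.7, np.nan],
    },
    index=["Studio A", "Studio B", "Studio C"],
)

cutoff = 7.5  # same cut-off the text uses

above_cutoff = (matrix > cutoff).sum()  # studios scoring above the cut-off, per genre
present = matrix.notna().sum()          # studios with at least one title in the genre

summary = pd.DataFrame({"above_cutoff": above_cutoff, "studios_present": present})
print(summary)
```

Reading the two columns side by side separates "few studios score well here" from "few studios are present here at all".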

**Business Takeaway**

* **Strategy 1: Target the "Triple-A Gold" for Maximum Score**

    Focus all resources on the top three proven combinations for peak performance:

    *   **MAPPA x Suspense:** The highest-risk, highest-reward investment for maximum tension and psychological drama delivery.
    *   **White Fox x Romance:** Invest here to aim for an industry-leading emotional masterpiece; this pairing scores 8.31 in the matrix.
    *   **White Fox x Mystery:** Utilize White Fox's specialized directorial skill for complex, high-quality intellectual thrillers.

* **Strategy 2: Studio Specialized Strengths to Enter Niche Genres**

    This strategy focuses on leveraging each studio’s core strengths to expand into specialized genres where competition is relatively low.

    * **Action-to-Sports Crossover:** 
        * Encourage **Studio Bones** (7.42 in Action, 7.21 in Sports) and **MAPPA** (7.54 in Action) to increase their investment in the *Sports* genre.

        * **Rationale:** Sports anime requires highly technical character-motion animation, which is closely aligned with the demands of the Action genre. Studios with strong Action/technical capabilities can quickly build an advantage in the Sports market, where there are fewer strong competitors.

    * **Drama-to-Gourmet Crossover:** 
        * Encourage **Kyoto Animation** (7.98 in Drama) and **Lerche** (7.69 in Drama, 7.69 in Gourmet) to invest more deeply in the *Gourmet* genre.

        * **Rationale:** Success in Gourmet storytelling requires strong attention to detail and emotional presentation—qualities similar to those in Drama. Studios known for expressive character work and detailed animation can effectively transfer these strengths to dominate the underdeveloped Gourmet market.

* **Strategy 3: Stable Quality Assurance and Genre Elevation**

    For consistency and to overcome market saturation:

    *   **Adventure/Supernatural:** Only partner with specialists like **David Production (Adventure)** and **Ufotable (Supernatural)** for large-scale, high-fidelity projects.
    *   **Drama:** Partner with **Kyoto Animation (KyoAni)**. KyoAni is the safest, most reliable bet for consistently high scores in character-driven narratives.

#### **D. Producers Committee** 
**Question**: A Producer can be understood as an “investor,” and an anime is essentially an “investment project.” Producers provide funding, planning, and coordination for the production.
Given the need to ensure quality and manage financial risk, **should producers operate alone (or in very small teams), or should they form a larger producer committee?**


In [17]:
# ==========================================
# 1. DATA PROCESSING
# ==========================================

# Filter relevant columns
df_trend = df_anime_dataset_2023_prep[['Producers', 'Aired Year']].copy()

# 1.1. CLEANING
df_trend['Aired Year'] = pd.to_numeric(df_trend['Aired Year'], errors='coerce')
df_trend.dropna(subset=['Aired Year', 'Producers'], inplace=True)

# Remove Unknowns
exclude_list = {"UNKNOWN", "NONE", "NAN", "NULL", ""}
df_trend = df_trend[~df_trend['Producers'].astype(str).str.strip().str.upper().isin(exclude_list)]

# 1.2. FILTER LAST 10 YEARS (2014 - 2023)
max_year = 2023
min_year = max_year - 9 
df_trend = df_trend[(df_trend['Aired Year'] >= min_year) & (df_trend['Aired Year'] <= max_year)]

# 1.3. GROUPING LOGIC
df_trend['num_producers'] = df_trend['Producers'].astype(str).str.count(',') + 1

def group_producers(n):
    if n <= 2: return '1-2 producers'
    if n <= 5: return '3-5 producers'
    return '6+ producers'

df_trend['Collab_Group'] = df_trend['num_producers'].apply(group_producers)

# 1.4. AGGREGATION
trend_data = (
    df_trend.groupby(['Aired Year', 'Collab_Group'])
    .size()
    .reset_index(name='Anime_Count')
    .sort_values('Aired Year')
)

# Logical Order
order_list = ['1-2 producers', '3-5 producers', '6+ producers']

# ==========================================
# 2. COLOR STRATEGY
# ==========================================
# Map specific groups to the requested colors
color_map = {
    '1-2 producers': '#6A1B9A', 
    '3-5 producers': '#ea568c', 
    '6+ producers':  '#2E86C1'  
}

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = go.Figure()

for group in order_list:
    # Filter data
    group_data = trend_data[trend_data['Collab_Group'] == group]
    
    # Get color
    line_color = color_map[group]

    line_width = 3

    fig.add_trace(go.Scatter(
        x=group_data['Aired Year'],
        y=group_data['Anime_Count'],
        mode='lines+markers',
        name=group,
        line=dict(color=line_color, width=line_width),
        hovertemplate=f'<b>{group}</b><br>Year: %{{x}}<br>Volume: %{{y}}<extra></extra>'
    ))

# ==========================================
# 4. LAYOUT SETTINGS
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    hovermode="x unified",
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE ---
    title=dict(
        text=f"<b>Small Teams Consistently Drive Anime Volume</b><br><span style='color:#555555; font-size:14px'>(Annual Production Volume by Collaboration Size, {min_year} - {max_year})</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- LEGEND  ---
    showlegend=True,
    legend=dict(
        x=1.02, 
        y=0.98, 
        xanchor='left', 
        yanchor='top',
        bgcolor='rgba(0,0,0,0)',
        font=dict(size=12)
    )
)

# ==========================================
# 5. AXIS DECLUTTER
# ==========================================

fig.update_xaxes(
    
    showgrid=False, 
    showline=True, 
    linecolor='black',
    tickmode='linear',
    dtick=1
)

fig.update_yaxes(
    title="Number of Anime",
    showgrid=True, 
    showline=True, 
    ticksuffix="  "
)

fig.show()

#--- 6. SAFE SAVE & SHOW ---    
try:
    save_chart(fig, step="03", stt="IV_B_D.1", title="Small_Teams_Consistently_Drive_Anime_Volume")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_D.1]_Small_Teams_Consistently_Drive_Anime_Volume (Transparent)


**Insight:**

* **The Dominance and Decline of Small Producer Teams (1-2 Producers)**

    * The anime market remains overwhelmingly **dominated by projects involving only one or two producers**. This segment consistently drives the highest volume of annual production, often several times larger than either of the other groups. However, the data reveals a critical **overall declining trend** in the total output of this dominant group, dropping sharply from a peak of around 410 titles in 2014 to approximately 230 in 2023.

    * **Strategic Implication:** Small-scale producers are the main market driver, but their overall volume is shrinking. Producers and studios working primarily in this segment must recognize the market consolidation and **aggressively focus on project quality and efficiency** to counteract the declining trend and high production volatility.

* **Stability and Consistency of Larger Collaboration Models**

    * In sharp contrast to the high volatility of the small teams, the **mid-sized (3-5 Producers)** and **large collaboration (6+ Producers) models** maintain a significantly lower but remarkably **stable production volume**, generally ranging between 50 and 100 titles annually throughout the decade.

    * **Strategic Implication:** These stable output numbers suggest that larger committees are primarily involved in **resource-intensive, complex, or high-profile projects** (like major franchises) that demand more coordinated financial backing. For animation studios seeking reliable, year-to-year contracts and stable cash flow, prioritizing partnerships with these larger production committees (especially the **6+ Producers** group) is the optimal strategy.

**Business Takeaway:** The balance of the anime market appears to be shifting **from the enormous volume of small-scale projects towards the steadier output of larger-scale projects.**

*   **For Small-Scale Producers/Studios:** Caution is advised due to the overall declining trend in volume. Success requires a strategic pivot to **prioritize and enhance project quality** to effectively compete with the stability offered by larger projects.

*   **For Large-Scale Producers/Studios:** It is recommended to continue **maintaining or expanding the Large Collaboration Models** as they provide a consistent output, better stability, and superior resilience against general market fluctuations.
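
The size of the decline and the "volatility" contrast can be put into numbers. In the sketch below, only the two endpoints (about 410 and about 230) come from the text; the yearly series are invented purely to show how a coefficient of variation could compare groups that operate at very different volumes.

```python
import pandas as pd

# Approximate endpoints quoted in the text for the 1-2 producer group
start_count, end_count = 410, 230
pct_change = (end_count - start_count) / start_count * 100
print(f"2014 -> 2023 change: {pct_change:.1f}%")

# Invented yearly counts, only to illustrate the volatility comparison
yearly = pd.DataFrame({
    "1-2 producers": [410, 380, 300, 350, 230],
    "6+ producers":  [60, 70, 65, 75, 70],
})

# Coefficient of variation: standard deviation relative to the mean level
cv = yearly.std() / yearly.mean()
print(cv.round(3))
```

A relative measure is used because a swing of 50 titles means something very different for a group producing 400 titles a year than for one producing 70.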

In [18]:
# ==========================================
# 1. DATA PROCESSING
# ==========================================

# 1. Copy data
df_producers_collab = df_anime_dataset_2023_prep[['Producers', 'Score']].copy()

# 2. Cleaning
df_producers_collab['Score'] = pd.to_numeric(df_producers_collab['Score'], errors='coerce')
df_producers_collab = df_producers_collab.dropna(subset=['Score', 'Producers'])

# 3. Filter Unknowns
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_producers_collab = df_producers_collab[~df_producers_collab['Producers'].astype(str).str.strip().str.upper().isin(exclude_list)]

# 4. Calculate Number of Producers
df_producers_collab['num_producers'] = df_producers_collab['Producers'].astype(str).str.count(',') + 1

# 5. Binning Function
def group_size(n):
    if n <= 2: return '1-2 producers'
    if n <= 5: return '3-5 producers'
    return '6+ producers'

df_producers_collab['Collaboration_Group'] = df_producers_collab['num_producers'].apply(group_size)

# 6. Define Order
order_list = ['1-2 producers', '3-5 producers', '6+ producers']

# 7. Calculate Insight Stats for Title
group_stats = df_producers_collab.groupby('Collaboration_Group')['Score'].mean().reindex(order_list)
baseline_score = group_stats['1-2 producers']
top_score = group_stats.max()
improvement = top_score - baseline_score

# ==========================================
# 2. COLOR STRATEGY (CONSISTENT WITH LINE CHART)
# ==========================================
custom_palette = {
    '1-2 producers': '#6A1B9A', 
    '3-5 producers': '#ea568c',
    '6+ producers':  '#2E86C1'  
}

# ==========================================
# 3. VISUALIZATION
# ==========================================

fig = px.box(
    df_producers_collab,
    x='Collaboration_Group',
    y='Score',
    color='Collaboration_Group', 
    color_discrete_map=custom_palette, 
    category_orders={'Collaboration_Group': order_list} 
)

# Style Traces
fig.update_traces(
    width=0.4,           # Adjust box width
    marker_size=4,       # Smaller outlier points
    marker_opacity=0.3,  # Reduce outlier visual noise
    line_width=1.5
)

# ==========================================
# 4. LAYOUT SETTINGS (STRICT COMPLIANCE)
# ==========================================

fig.update_layout(
    # --- GENERAL ---
    font=dict(family="Tahoma", size=13, color="#000000"),
    
    # --- MARGINS ---
    margin=dict(l=40, r=40, t=100, b=60),
    width=800,
    height=550,
    plot_bgcolor="white",
    paper_bgcolor="white",
    
    # --- TITLE CONVENTION (ACTION-ORIENTED) ---
    title=dict(
        text=f"<b>Large Collaboration Teams Correlate with Higher Scores</b><br><span style='color:#555555; font-size:14px'>(Avg Score Increases by +{improvement:.2f} points when comparing {group_stats.idxmax()} vs 1-2 producers)</span>",
        x=0.02,
        y=0.95,
        xanchor='left',
        yanchor='top'
    ),
    
    # --- LEGEND ---
    # Rule: Box plot categories are clear on X-axis, hide legend to declutter
    showlegend=False
)

# ==========================================
# 5. AXIS DECLUTTER
# ==========================================

# X-Axis
fig.update_xaxes(
    title="Producer Team Size",
    showgrid=False,
    showline=True,
    linecolor='black'
)

# Y-Axis
fig.update_yaxes(
    title="Anime Score",
    showgrid=True,
    showline=True,
    linecolor='black',
    ticksuffix="  "
)

fig.show()

#--- 6. SAFE SAVE & SHOW ---    
try:
    save_chart(fig, step="03", stt="IV_B_D.2", title="Large_Collaboration_Teams_Correlate_with_Higher_Scores")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [03]_[IV_B_D.2]_Large_Collaboration_Teams_Correlate_with_Higher_Scores (Transparent)


**Insight:**
* **Larger Producer Teams Deliver Better Results**

    There is a clear **positive association** between the number of producers involved and the final Anime Score: scores are noticeably higher for larger producer teams than for teams of one or two.

    *   The chart shows the mid-sized group clearly outperforming the smallest: **1-2 Producers Average score: 6.33** vs. **3-5 Producers Average: 7.19**, an improvement of roughly **+0.8** points.

    *   This performance gap of nearly a full point is consistent with the notion that **larger budgets, pooled resources, and stricter quality control** inherent in multi-party production committees tend to lead to a better-received product.

* **Risk Mitigation and Quality Floor Elevation**

    Collaboration acts as an effective mechanism for risk management, substantially raising the quality floor for the investment.

    *   The **"1-2 Producers"** group exhibits extremely high volatility, with a wide score range and a significant cluster of "disaster" **outliers** plummeting down to the 2.0 – 3.0 score range. This configuration carries the highest risk of catastrophic failure.

    *   In sharp contrast, the presence of **3 Producers or more** significantly elevates the **"quality floor."** The lower whisker for both the 3-5 and 6+ producer groups rests at a much higher score (~ 5.1), demonstrating that collaboration effectively protects the investment from critically low-scoring failure.

**Business Takeaway:** The optimal strategic balance for maximizing quality and controlling risk is to form a **Production Committee** consisting of **3 to 5 partners**. This structure successfully balances the need for robust financing and stringent quality oversight while maintaining effective management efficiency.
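
The "quality floor" discussed above is the lower whisker of each box. By default, Plotly draws it at the lowest observation that still lies within 1.5 × IQR below the first quartile. The sketch below reproduces that rule on invented scores; the `lower_whisker` helper and the toy data are assumptions for illustration, not part of the notebook, and quartile conventions differ slightly between libraries, so values may not match the chart exactly.

```python
import pandas as pd

def lower_whisker(scores: pd.Series) -> float:
    """Lowest observation at or above Q1 - 1.5 * IQR (Tukey's rule)."""
    q1, q3 = scores.quantile([0.25, 0.75])
    fence = q1 - 1.5 * (q3 - q1)
    return scores[scores >= fence].min()

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Team": ["1-2"] * 6 + ["3-5"] * 6,
    "Score": [2.5, 6.0, 6.2, 6.4, 6.6, 7.0,
              6.6, 6.9, 7.1, 7.2, 7.4, 7.8],
})

# One lower-whisker value per team-size group.
floors = toy.groupby("Team")["Score"].apply(lower_whisker)
print(floors)
```

In this toy example the 2.5 score falls below the fence, so it would be drawn as an outlier point rather than extending the whisker.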

## **V. Theme C — Release Strategy (Aired, Episodes, Duration)**  
### **1. Issue Overview**

**a. Non-Numeric Data Stored as Strings**

Several features that should be numeric are stored as strings, blocking quantitative analysis.

- **Episodes**: 
  Stored as strings (“12.0”, “2.0”).  
  → Causes fragmentation, prevents correct numeric sorting, and breaks grouping into meaningful episode ranges.

- **Duration**: 
  Numeric runtimes expressed as free-form strings (“24 min per ep”, “1 hr”, “59 sec”).  
  → Prevents converting values into consistent minutes → bars or distributions show string categories, not real lengths.

- **Aired**: 
  Dates stored as strings instead of datetime.  
  → Plotly cannot interpret a timeline → monthly/seasonal charts become impossible.

- **Impact:**  
  → No meaningful numeric aggregation (ranges, averages, continuous plots). Charts reflect *string labels*, not real values.


**b. Inconsistent and Unstructured Formats**

The dataset contains multiple incompatible string styles within the same feature.

- **Duration**:  
  Examples: “1 hr”, “1h 0m”, “60 min”, “3 days”.  
  → Runtime durations overlap across multiple categories (seconds, hours, days), preventing accurate comparison.

- **Aired**:  
  Examples: “Jan 10, 2000”, “Jan 2005”, “Winter 2005”, “Apr 2005 to Jun 2005”.  
  → Impossible to extract a reliable *month*, *year*, or *season*.


- **Impact:**  
  → Plots become misleading because many unique strings represent the *same underlying value*, splitting true counts into multiple false categories.


**c. Non-Standard Missing Values (“Unknown”, “Not available”)**

Instead of proper NaN, missing values appear as literal text, so "Unknown" is counted as a valid value.

- **Impact:**  
  → "Unknown" appears in charts as a major category and distorts the distribution.

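A minimal sketch of the effect, assuming the placeholder spellings named in the heading above (`MISSING_TOKENS` is a hypothetical name and the sample values are invented): while the placeholders remain text they are counted like any other category, and once replaced with NaN they drop out of `value_counts()`.

```python
import numpy as np
import pandas as pd

# Placeholder strings assumed to mean "missing"; extend to match the data.
MISSING_TOKENS = ["Unknown", "Not available"]

raw = pd.Series(["12.0", "Unknown", "24.0", "Not available", "12.0"])

before = raw.value_counts()                   # placeholders counted as real values
clean = raw.replace(MISSING_TOKENS, np.nan)   # convert placeholders to NaN
after = clean.value_counts()                  # NaN is excluded by default

print(before, after, clean.isna().sum(), sep="\n\n")
```
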
**d. Overall Consequence — No Reliable Release-Strategy Insights**

These cross-feature issues mean:

- Temporal analysis (monthly/seasonal trends) **cannot be computed**.  
- Runtime patterns **cannot be compared**.  
- Episode ranges **cannot be analyzed** for Score impact.  
- Charts built on raw strings show **noise**, not real production trends.

Before evaluating how release strategy affects anime success, all three features require **full cleaning, type conversion, and categorical standardization**.


### **2. Solution**

#### **a. Standardizing Aired Dates**
To resolve the inconsistencies in the *Aired* field, the raw date strings were:
- parsed into standardized `datetime` objects,
- decomposed into analytical components such as **Year**, **Month**, **Quarter**, and **Season**, and mapped to canonical anime release seasons (Winter, Spring, Summer, Fall).
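
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual pipeline: `parse_aired`, `season_of`, and the output column names other than `Aired Month` are hypothetical, only the start of a date range is kept, and seasons are assumed to follow the usual convention of starting in January, April, July, and October.

```python
import pandas as pd

# Assumed convention: anime seasons start in Jan, Apr, Jul, and Oct.
SEASONS = ["Winter", "Spring", "Summer", "Fall"]

def season_of(quarter):
    """Map a calendar quarter (1-4) to a season name; missing stays missing."""
    return SEASONS[int(quarter) - 1] if pd.notna(quarter) else None

def parse_aired(aired: pd.Series) -> pd.DataFrame:
    """Parse the start date of each Aired string; unparseable values become NaT."""
    start = aired.str.split(" to ").str[0].str.strip()
    # Parse element by element so mixed formats do not interfere with each other.
    dates = pd.to_datetime(start.apply(lambda s: pd.to_datetime(s, errors="coerce")))
    out = pd.DataFrame({"Aired Date": dates})
    out["Aired Year"] = dates.dt.year
    out["Aired Month"] = dates.dt.month
    out["Aired Quarter"] = dates.dt.quarter
    out["Aired Season"] = out["Aired Quarter"].map(season_of)
    return out

raw = pd.Series(["Jan 10, 2000", "Jan 2005", "Apr 2005 to Jun 2005", "Not available"])
parsed = parse_aired(raw)
print(parsed)
```

Strings that carry no parseable date end up as `NaT`, so they stay visible as missing rather than being silently assigned to a month.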

#### **b. Structuring Duration into Numeric Format**
The *Duration* values were cleaned by:
- Extracting numeric runtime from textual strings,
- Converting all formats into a unified measure (minutes per episode), and categorizing them into meaningful groups such as  
  **TV Short**, **Standard TV (24 min)**, **Long-Form Special**, and **Movie Runtime**.
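
One possible way to extract minutes from the free-form strings is a regular expression that sums every number-unit pair. The helper `duration_to_minutes`, the unit table, and the sample strings are assumptions for illustration; other spellings found in the data (for example "1h 0m" or "3 days") would need additional entries.

```python
import re
import pandas as pd

# Minutes per unit; the unit spellings are assumptions based on the examples above.
UNIT_MINUTES = {"sec": 1 / 60, "min": 1.0, "hr": 60.0}
PATTERN = re.compile(r"(\d+)\s*(hr|min|sec)")

def duration_to_minutes(text) -> float:
    """Sum every '<number> <unit>' pair in the string; return NaN if none is found."""
    parts = PATTERN.findall(str(text).lower())
    if not parts:
        return float("nan")
    return sum(int(value) * UNIT_MINUTES[unit] for value, unit in parts)

raw = pd.Series(["24 min per ep", "1 hr", "1 hr 30 min", "59 sec", "Unknown"])
minutes = raw.map(duration_to_minutes)
print(minutes.round(2).tolist())
```

Returning NaN for unmatched strings keeps "Unknown" out of the numeric distribution instead of turning it into a zero-minute runtime.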

#### **c. Converting Episodes to Meaningful Ranges**
The *Episodes* field was standardized by:
- Converting string values to numeric format and removing invalid entries like "Unknown",
- Grouping episode counts into **seven production length categories** ranging from mini-series (1–5 episodes) to long-running series (500+ episodes).
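
A short sketch of this conversion, assuming the same bin edges and labels that the plotting cells in this notebook use (the sample values are invented, and the open-ended last edge is a choice made for this sketch): `pd.to_numeric` with `errors="coerce"` turns non-numeric text into NaN, and `pd.cut` assigns each count to a range.

```python
import pandas as pd

# Left-closed bins: [1, 6) -> "1–5", [6, 13) -> "6–12", and so on.
bins = [1, 6, 13, 26, 51, 101, 501, float("inf")]
labels = ["1–5", "6–12", "13–25", "26–50", "51–100", "101–500", "500+"]

raw = pd.Series(["12.0", "2.0", "Unknown", "26.0", "1000.0"])

episodes = pd.to_numeric(raw, errors="coerce")   # "Unknown" becomes NaN
groups = pd.cut(episodes, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"raw": raw, "Episodes": episodes, "Episode_Group": groups}))
```

Because the result is an ordered categorical, sorting or plotting by `Episode_Group` follows the numeric order of the ranges rather than the alphabetical order of the labels.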


### **3. Visual Evidence** 
#### **3.1. Raw vs Cleaned: Number of Anime released per `Aired Month`**

In [60]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def combine_raw_and_clean(df_raw_viz, df_clean_prep, top_n=7):

    # ==================== RAW DATA ====================
    raw_dates = df_raw_viz["Aired"].fillna("Unknown").astype(str)
    freq = raw_dates.value_counts().nlargest(top_n)

    raw_colors = ["#C3122F"] + ["#D3D3D3"]*(len(freq)-1)

    # ==================== CLEAN MONTH ====================
    clean_counts = df_clean_prep['Aired Month'].value_counts().sort_index()
    month_map = {1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
    clean_month_names = [month_map[m] for m in clean_counts.index]

    clean_colors = [
        '#ea568c' if m in ['Apr', 'Jul', 'Oct'] else '#f9accc'
        for m in clean_month_names
    ]

    # ==================== SUBPLOTS ====================
    fig = make_subplots(
        rows=1,
        cols=2,
        column_widths=[0.55, 0.45],
        horizontal_spacing=0.12
    )

    # ==================== RAW BAR ====================
    fig.add_trace(go.Bar(
        x=freq.index,
        y=freq.values,
        marker_color=raw_colors,
        text=freq.values,
        textposition="outside",
        hovertemplate="<b>Raw Aired:</b> %{x}<br><b>Count:</b> %{y}<extra></extra>",
        showlegend=False
    ), row=1, col=1)

    # ==================== CLEAN BAR ====================
    fig.add_trace(go.Bar(
        x=clean_month_names,
        y=clean_counts.values,
        marker_color=clean_colors,
        marker_line_width=1,
        marker_line_color="white",
        text=clean_counts.values,
        texttemplate="%{text:,}",
        textposition="outside",
        hovertemplate="<b>Month:</b> %{x}<br><b>Count:</b> %{y:,}<extra></extra>",
        showlegend=False
    ), row=1, col=2)

    # ==================== LAYOUT ====================
    fig.update_layout(
        template="plotly_white",
        height=550,
        width=1500,
        margin=dict(l=40, r=40, t=120, b=60),
        font=dict(family="Tahoma")
    )

    # ==================== RAW X/Y ====================
    fig.update_xaxes(
        title="",
        tickangle=0,
        tickfont=dict(color="red", family="Tahoma", size=13, weight="bold"),
        showline=False,
        showgrid=False,
        row=1, col=1
    )
    fig.update_yaxes(showticklabels=False, showline=False, showgrid=False, row=1, col=1)

    # ==================== CLEAN X/Y ====================
    fig.update_xaxes(
        title="",
        tickangle=0,
        tickfont=dict(family="Tahoma", size=13, color="black", weight="bold"),
        showgrid=False,
        showline=False,
        row=1, col=2
    )
    fig.update_yaxes(showticklabels=False, showgrid=False, showline=False, row=1, col=2)

    # ==================== TITLES ====================
    fig.add_annotation(
        text="<b>Aired Distribution is Distorted</b><br>"
             "<span style='font-size:15px; color:#555'>Top 7 Raw Aired Strings</span>",
        x=0.01, y=1.18, xref='paper', yref='paper',
        showarrow=False, align="left",
        font=dict(family="Tahoma", size=18)
    )

    fig.add_annotation(
        text="<b>Anime released per Aired Month distribution</b><br>"
             "<span style='font-size:15px; color:#555'>True seasonal pattern</span>",
        x=0.88, y=1.18, xref='paper', yref='paper',
        showarrow=False, align="left",
        font=dict(family="Tahoma", size=18)
    )

    # Bar width
    fig.update_traces(width=0.75)
    try:
        save_chart(fig, step="03", stt="V_C_3.1", title="Aired_Month_Raw_vs_Cleaned")
    except Exception as e:
        print(f"⚠️ Could not save chart due to Kaleido error: {e}")
        print("✅ Chart is displayed below regardless.")

    fig.show()


# Run
combine_raw_and_clean(df_anime_dataset_2023, df_anime_dataset_2023_prep)


✅ Saved: [03]_[V_C_3.1]_Aired_Month_Raw_vs_Cleaned (Transparent)


#### **The Transformation Insight — Aired Month**

- **The Misleading View:**  
String-based date formats and alphabetical month sorting completely scrambled the temporal pattern. Raw data showed months in alphabetical rather than calendar order (Apr, Aug, Dec...), making seasonal trends invisible and monthly comparisons impossible.

- **The Truth:**  
Once converted to proper datetime format and cleaned, a clear **triple-peak seasonal rhythm** emerges: anime releases peak in **January, March, and October**, while hitting clear lows during **June-August summer months**. This reveals the industry's strategic release calendar aligned with television seasons and audience availability.

- **The Strategic Value:**  
With accurate temporal sequencing, Aired Month becomes a powerful feature for identifying optimal release windows, understanding audience engagement patterns, and planning production cycles around proven high-demand seasons.
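
The ordering problem described under "The Misleading View" can be reproduced in a few lines. In this sketch (invented labels, hypothetical variable names), month names stored as plain strings sort alphabetically, while an ordered categorical restores calendar order.

```python
import pandas as pd

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
labels = pd.Series(["Oct", "Jan", "Apr", "Jan", "Jul", "Apr", "Jan"])

# Plain strings: the index is sorted alphabetically.
alphabetical = labels.value_counts().sort_index()

# Ordered categorical: the index follows the calendar.
calendar_order = (
    labels.astype(pd.CategoricalDtype(month_order, ordered=True))
    .value_counts()
    .sort_index()
)

print(alphabetical.index.tolist())
print(calendar_order[calendar_order > 0].index.tolist())
```

Storing the month as an integer, as the cleaned `Aired Month` column does, achieves the same ordering without a categorical type.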

#### **3.2. Duration**

In [20]:
def plot_duration_before_after_combined(df_anime_dataset_2023, df_anime_dataset_2023_prep):

    #============================================================
    # 1. RAW DURATION DATA
    #============================================================
    duration_counts = (
        df_anime_dataset_2023['Duration']
        .astype(str)
        .value_counts()
        .head(7)
        .reset_index()
    )
    duration_counts.columns = ['Duration', 'Count']

    invalid_values = ["UNKNOWN", "?", "NONE", "N/A", "0"]
    colors_raw = [
        "#C3122F" if str(v).strip().upper() in invalid_values else "#D3D3D3"
        for v in duration_counts['Duration']
    ]

    #============================================================
    # 2. CLEANED NUMERIC DURATION DATA
    #============================================================
    x_min = df_anime_dataset_2023_prep['Duration Minutes'].min()
    x_max = df_anime_dataset_2023_prep['Duration Minutes'].max()
    x_padding = (x_max - x_min) * 0.01

    #============================================================
    # 3. CREATE SUBPLOTS (1×2)
    #============================================================
    fig = make_subplots(
        rows=1, cols=2,
        horizontal_spacing=0.1,
        subplot_titles=("", "")
    )

    #============================================================
    # 4. LEFT PLOT — RAW
    #============================================================
    fig.add_trace(
        go.Bar(
            x=duration_counts["Duration"],
            y=duration_counts["Count"],
            marker_color=colors_raw,
            marker_line_color="white",
            marker_line_width=1,
            width=0.65,
            text=duration_counts["Count"],
            texttemplate="%{text:,}",
            textposition="outside",
            showlegend=False,
            hovertemplate="<b>Duration:</b> %{x}<br><b>Count:</b> %{y:,}<extra></extra>"
        ),
        row=1, col=1
    )

    #============================================================
    # 5. RIGHT PLOT — CLEANED
    #============================================================
    fig.add_trace(
        go.Histogram(
            x=df_anime_dataset_2023_prep['Duration Minutes'],
            nbinsx= 30,
            marker_color="#ea568c",
            hovertemplate="<b>Duration:</b> %{x}<br><b>Count:</b> %{y}<extra></extra>",
            showlegend=False
        ),
        row=1, col=2
    )

    #============================================================
    # 6. LAYOUT (KEEP THE SAME STYLE)
    #============================================================
    fig.update_layout(
        template="plotly_white",
        height=520,
        width=1500,
        margin=dict(l=40, r=40, t=120, b=60),
        hovermode="x unified",
        font=dict(family="Tahoma"),
        bargap=0.2,
        plot_bgcolor="white",
        paper_bgcolor="white"
    )

    #============================================================
    # X / Y AXES
    #============================================================
    # RAW – left
    fig.update_xaxes(
        title="",
        tickangle=0,
        showgrid=False,
        tickfont=dict(family="Tahoma", size=11, weight="bold", color="#C3122F"),
        row=1, col=1
    )
    fig.update_yaxes(title="", showticklabels=False, showgrid=False, row=1, col=1)

    # CLEANED – right
    fig.update_xaxes(
        title="",
        showgrid=False,
        tickangle=0,
        tickfont=dict(family="Tahoma", size=13),
        range=[x_min - x_padding, x_max + x_padding],
        showline=False,
        ticklen=6,
        tickwidth=1,
        tickcolor='black',
        row=1, col=2
    )
    fig.update_yaxes(
        title="",
        showgrid=False,
        tickfont=dict(family="Tahoma", size=13),
        tickmode='array',
        tickvals=[1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
        ticktext=['1k', '2k', '3k', '4k', '5k', '6k','7k','8k','9k'],
        ticks='inside',
        ticklen=6,
        tickwidth=0.5,
        tickcolor='#CCCCCC',
        showline=True,
        linewidth=1,
        linecolor='#CCCCCC',
        row=1, col=2
    )

    #============================================================
    # TITLES / SUBTITLES (IDENTICAL TO THE TEMPLATE)
    #============================================================
    fig.add_annotation(
        text="<b>Duration Distribution is Distorted (Raw Data)</b><br>"
             "<span style='color:#555555; font-size:14px'>(Top 7 Duration Label Frequencies)</span>",
        x=-0.025, y=1.25,
        xref="paper", yref="paper",
        showarrow=False,
        align="left",
        font=dict(family="Tahoma", size=18)
    )

    fig.add_annotation(
        text="<b>True Duration Distribution is Right-Skewed</b><br>"
             "<span style='color:#555555; font-size:14px'>(Cleaned Numeric Duration Distribution)</span>",
        x=0.75, y=1.25,
        xref="paper", yref="paper",
        showarrow=False,
        align="left",
        font=dict(family="Tahoma", size=18)
    )

    try:
        save_chart(fig, step="03", stt="V_C_3.2.Duration", title="Raw_Cleaned_Duration_Distribution")
    except Exception as e:
        print(f"⚠️ Could not save chart due to Kaleido error: {e}")
        print("✅ Chart is displayed below regardless.")

    fig.show()

plot_duration_before_after_combined(df_anime_dataset_2023, df_anime_dataset_2023_prep)


✅ Saved: [03]_[V_C_3.2.Duration]_Raw_Cleaned_Duration_Distribution (Transparent)


#### **The Transformation Insight — Duration Minutes**

- **The Misleading View:**  
String-based duration values created artificial fragmentation, where identical runtimes appeared as separate categories due to inconsistent formatting ("24 min" vs "24 min per ep"). "Unknown" incorrectly appeared as a major duration type, while the true distribution of episode lengths was completely obscured by formatting noise.

- **The Truth:**  
Once standardized to numeric minutes, the pattern emerges clearly: **the actual distribution is heavily right-skewed**. This skew reflects the composition of anime types: most TV series, OVAs, and ONAs follow standard short formats (~24 min), while longer durations (>60 min) come primarily from Movies, which are fewer in number but form a visible long tail to the right.

- **The Strategic Value:**  
With clean numeric durations, runtime becomes a reliable feature for analyzing optimal episode length strategies **separately by anime type**, audience preference patterns, and how duration correlates with genre preferences and production budgets.
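
Right skew can also be checked numerically once the column is numeric. In a right-skewed sample the mean typically sits above the median and the sample skewness is positive. The values below are invented to mimic a cluster near 24 minutes with a few long runtimes; they are not taken from the dataset.

```python
import pandas as pd

# Invented runtimes: most near 24 minutes, a few movie-length values.
minutes = pd.Series([5, 12, 23, 24, 24, 24, 24, 25, 30, 45, 90, 120], dtype=float)

summary = {
    "mean": minutes.mean(),
    "median": minutes.median(),
    "skew": minutes.skew(),   # sample skewness; > 0 indicates a longer right tail
}
print(summary)
```

The same three statistics could be computed on the cleaned `Duration Minutes` column, overall or per anime type.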

#### **3.3. Episodes**

In [21]:
def plot_episode_before_after_combined(df_anime_dataset_2023, df_anime_dataset_2023_prep):
    # ============================
    # 1. Prepare Before Data
    # ============================
    ep_counts = (
        df_anime_dataset_2023['Episodes']
        .apply(lambda x: str(int(float(x))) if str(x).replace('.0','').isdigit() else str(x))
        .value_counts()
        .head(10)
        .reset_index()
    )
    ep_counts.columns = ['Episodes', 'Count']

    invalid_values = ["UNKNOWN", "?", "0", "NONE", "N/A"]
    colors_before = [
        "#C3122F" if str(v).strip().upper() in invalid_values else "#D3D3D3"
        for v in ep_counts['Episodes']
    ]

    # ============================
    # 2. Prepare After Data
    # ============================
    bins = [1, 6, 13, 26, 51, 101, 501, 10000]
    labels = ["1–5", "6–12", "13–25", "26–50", "51–100", "101–500", "500+"]

    df_anime_dataset_2023_prep["Episode_Group"] = pd.cut(
        df_anime_dataset_2023_prep["Episodes"],
        bins=bins,
        labels=labels,
        right=False
    )

    episode_group_dist = (
        df_anime_dataset_2023_prep["Episode_Group"]
        .value_counts()
        .sort_index()
        .reset_index()
    )
    episode_group_dist.columns = ["Episode_Range", "Count"]

    colors_after = [
        "#ea568c",  # 1–5
        "#f69ac0",  # 6–12
        "#f8b2d3",  # 13–25
        "#f9cae2",  # 26–50
        "#fdd3e9",  # 51–100
        "#f7d5e6",  # 101–500
        "#fae9ee",  # 500+
    ]

    # ============================
    # 3. Create Subplots (1 row x 2 cols)
    # ============================
    fig = make_subplots(
        rows=1, cols=2,
        horizontal_spacing=0.15,
        subplot_titles=("", "")
    )

    # --- Add Before Bar ---
    fig.add_trace(go.Bar(
        x=ep_counts['Episodes'],
        y=ep_counts['Count'],
        marker_color=colors_before,
        marker_line_color="white",
        marker_line_width=1,
        text=ep_counts['Count'],
        texttemplate="%{text:,}",
        textposition="outside",
        showlegend=False,
        hovertemplate="<b>Episode:</b> %{x}<br><b>Count:</b> %{y:,}<extra></extra>"
    ), row=1, col=1)

    # --- Add After Bar ---
    fig.add_trace(go.Bar(
        x=episode_group_dist['Episode_Range'],
        y=episode_group_dist['Count'],
        marker_color=colors_after,
        marker_line_color="white",
        marker_line_width=1,
        text=episode_group_dist['Count'],
        texttemplate="%{text:,}",
        textposition="outside",
        showlegend=False,
        hovertemplate="<b>Episode Group:</b> %{x}<br><b>Count:</b> %{y:,}<extra></extra>"
    ), row=1, col=2)

    # ============================
    # 4. Layout
    # ============================
    fig.update_layout(
        template="plotly_white",
        height=500,
        width=1400,
        margin=dict(l=40, r=40, t=120, b=60),
        hovermode="x unified",
        font=dict(family="Tahoma"),
        bargap=0.6
    )

    # X & Y axes settings
    fig.update_xaxes(title="", tickangle=0, showgrid=False, tickfont=dict(family="Tahoma", size=13, weight="bold"), row=1, col=1)
    fig.update_yaxes(title="", showticklabels=False, showgrid=False, row=1, col=1)
    fig.update_xaxes(title="", tickangle=0, showgrid=False, tickfont=dict(family="Tahoma", size=13, weight="bold"), row=1, col=2)
    fig.update_yaxes(title="", showticklabels=False, showgrid=False, row=1, col=2)

    # Title / Subtitle annotations
    fig.add_annotation(
        text="<b>Episode Distribution is Distorted (Raw Data)</b><br><span style='color:#555555; font-size:14px'>(Top 10 Episode Label Frequencies)</span>",
        x=-0.025, y=1.25, xref='paper', yref='paper',
        showarrow=False, align="left",
        font=dict(family="Tahoma", size=18, color="black")
    )
    fig.add_annotation(
        text="<b>Mini-Series Dominate Anime (Cleaned & Binned)</b><br><span style='color:#555555; font-size:14px'>(Episode Frequencies Across Groups)</span>",
        x=0.85, y=1.25, xref='paper', yref='paper',
        showarrow=False, align="left",
        font=dict(family="Tahoma", size=18, color="black")
    )

    # Bar width
    fig.update_traces(width=0.7, textfont=dict(size=13))
    try:
        save_chart(fig, step="03", stt="V_C_3.3", title="Episodes_Raw_Cleaned")
    except Exception as e:
        print(f"⚠️ Could not save chart due to Kaleido error: {e}")
        print("✅ Chart is displayed below regardless.")

    fig.show()


# Call the function
plot_episode_before_after_combined(df_anime_dataset_2023, df_anime_dataset_2023_prep)


✅ Saved: [03]_[V_C_3.3]_Episodes_Raw_Cleaned (Transparent)


#### **The Transformation Insight — Episodes**

- **The Misleading View:**  
String-based episode values made the distribution appear fragmented, with each unique string forming its own bar and “UNKNOWN” seeming like a valid episode type. This distorted the perception of episode length patterns and even scrambled the ordering, making shorter series appear mixed with longer ones.

- **The Truth:**  
Once converted to numeric and cleaned, the true distribution of series length becomes clear: **mini-series (1–5 episodes)** contain the highest number of titles, and the frequencies steadily decrease as episode counts increase, with the 500+ group containing fewer than 20 anime.

- **The Strategic Value:**  
With the noise removed, Episodes becomes a dependable numeric feature for analyzing how episode length relates to Score, Types, Genres, and broader production strategy.

### **4. Business Insights**

#### **A. Medium to Long Duration with Short Range Episodes delivers the Highest Scores** 
**Question**: Which combination of Duration Minutes and Episode Range maximizes the Score?

In [22]:
def plot_duration_ep(df_anime_dataset_2023, df_anime_dataset_2023_prep):
   # -----------------------------
   # 1. Episode bins (7 groups)
   # -----------------------------
   episode_bins = [1, 6, 13, 26, 51, 101, 501, 10000]
   episode_labels = ["1–5", "6–12", "13–25", "26–50", "51–100", "101–500", "500+"]

   df = df_anime_dataset_2023_prep.copy()

   df["Episode_Group"] = pd.cut(
      df["Episodes"],
      bins=episode_bins,
      labels=episode_labels,
      right=False
   ).astype(str)

   # -----------------------------
   # 2. Duration bins (converted to range labels)
   # -----------------------------
   duration_bins = [0, 10, 20, 30, 60, 200]
   duration_labels = [
      "0–10 minutes",
      "10–20 minutes",
      "20–30 minutes",
      "30–60 minutes",
      "60–180 minutes"
   ]

   df["Duration_Group"] = pd.cut(
      df["Duration Minutes"],
      bins=duration_bins,
      labels=duration_labels,
      right=False,
      include_lowest=True
   ).astype(str)

   # -----------------------------
   # 3. Compute average Score
   # -----------------------------
   heatmap_data = (
      df
      .dropna(subset=["Episode_Group", "Duration_Group", "Score"])
      .groupby(["Episode_Group", "Duration_Group"], as_index=False)
      .agg(Average_Score=("Score", "mean"))
   )

   # -----------------------------
   # 4. Ensure all combinations exist
   # -----------------------------
   all_combinations = pd.MultiIndex.from_product(
      [episode_labels, duration_labels],
      names=["Episode_Group", "Duration_Group"]
   ).to_frame(index=False)

   heatmap_data = all_combinations.merge(
      heatmap_data,
      on=["Episode_Group", "Duration_Group"],
      how="left"
   )

   # Build the matrix from heatmap_data
   heatmap_matrix = heatmap_data.pivot(
      index="Duration_Group", 
      columns="Episode_Group", 
      values="Average_Score"
   )

   # Reindex to guarantee the intended row and column order
   heatmap_matrix = heatmap_matrix.reindex(
      index=duration_labels,
      columns=episode_labels
   )

   # Build the cell labels, leaving cells with a value of 0 or NaN blank
   text_matrix = heatmap_matrix.map(
      lambda x: f"{x:.2f}" if pd.notna(x) and x != 0 else ""
   )

   # -----------------------------
   # 5. Custom purple gradient based on a threshold of 7
   # -----------------------------
   custom_color_scale = [
      [0.00, "#FFF4F4"],   # light gray
      [0.33, "#F8C2FB"],   # light pink
      [0.55, "#E287EC"],   # main pink
      [0.75, "#9B2A88"],   # pink-purple
      [1.00, "#6A1B9A"]    # dark purple
   ]


   # -----------------------------
   # 6. Plot the heatmap with imshow
   # -----------------------------
   fig = px.imshow(
      heatmap_matrix,
      x=heatmap_matrix.columns,
      y=heatmap_matrix.index,
      color_continuous_scale=custom_color_scale,
      aspect="auto",
      labels=dict(color="Avg Score")
   )

   # -----------------------------
   # 7. Layout & style following all the common rules
   # -----------------------------
   fig.update_layout(
      template="plotly_white",
      title=dict(
         text="<b>Mini-Series with over 60 Minutes per Episode Maximize Score</b><br><span style='color:#555555; font-size:14px'>(Average Score group by Duration Minutes and Episode Range)</span>",
         x=0.02,
         y=0.95,
         font=dict(family="Tahoma", size=20, color="black"),
         xanchor='left',
         yanchor='top'
      ),
      margin=dict(l=40, r=40, t=100, b=60),
      font=dict(family="Tahoma"),
      xaxis=dict(
         title="",
         tickfont=dict(family="Tahoma", size=13),
         showgrid=False
      ),
      yaxis=dict(
         title="",
         tickfont=dict(family="Tahoma", size=13),
         showgrid=False
      ),
      hovermode="x unified",
      height=550,
      width=1000,
      # Legend added as requested
      legend=dict(
         x=1.02, 
         y=0.98, 
         xanchor='left', 
         yanchor='top', 
         bgcolor='rgba(0,0,0,0)'
      )
   )

   # Update the colorbar to format the numbers
   fig.update_coloraxes(
      colorbar=dict(
         title="Average Score",
         tickformat=".2f",
         tickfont=dict(family="Tahoma", size=11)
      )
   )

   # Add custom text, shown only for non-zero values
   fig.update_traces(
      text=text_matrix.values,
      texttemplate="%{text}",
      textfont=dict(family="Tahoma", size=13),
      hovertemplate='<b>Episodes:</b> %{x}<br><b>Duration:</b> %{y}<br><b>Average Score:</b> %{z:.2f}<extra></extra>'
   )

   # Turn off gridlines
   fig.update_xaxes(showgrid=False)
   fig.update_yaxes(showgrid=False)
   fig.update_yaxes(
      ticklabelposition="outside",
      ticklabelstandoff=10   # increase to 12–18 for more spacing
   )
   try:
      save_chart(fig, step="04", stt="V_C_4.A", title="Combination_Duration_Episode_maximize_score")
   except Exception as e:
      print(f"⚠️ Could not save chart due to Kaleido error: {e}")
      print("✅ Chart is displayed below regardless.")

   fig.show()


plot_duration_ep(df_anime_dataset_2023, df_anime_dataset_2023_prep)

✅ Saved: [04]_[V_C_4.A]_Combination_Duration_Episode_maximize_score (Transparent)


**Insight**

- **Short series (6–12 episodes) with long episodes (60–180 min)** achieve the highest average score (8.21), suggesting that viewers reward compact, high-quality limited series.

- **Standard episodes (20–30 min) perform best when kept between 6–25 episodes**, showing stable scores around 6.9–7.2.

- **Very short episodes (0–10 min) consistently receive the lowest scores**, regardless of episode count.

- **More episodes ≠ higher score** unless the series is a major franchise. Quality and duration matter more than length.


**Business Takeaways**

- **Prioritize premium short series**: The combination of 6–12 episodes at 60+ minutes each shows the highest average score and the strongest audience reception.

- **For standard-length episodes (20–30 min), keep the series between 6 and 25 episodes**: This range holds the most stable scores (around 6.9–7.2).

- **Avoid producing ultra-short formats** (<10 min per episode) as main projects.

- **Invest in quality over quantity**: Longer episode duration with fewer episodes yields better ratings than many short episodes.
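
A heatmap of averages does not show how many titles sit behind each cell, so a cell with very few titles can produce an extreme mean. The sketch below, on invented data, computes the title count alongside the average so that thin cells can be flagged; the `MIN_TITLES` threshold and the `Reliable` column are hypothetical choices for this illustration, not something the notebook applies.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Episode_Group": ["6–12", "6–12", "6–12", "13–25", "13–25", "6–12"],
    "Duration_Group": ["20–30 minutes"] * 5 + ["60–180 minutes"],
    "Score": [7.0, 6.8, 7.2, 7.1, 6.9, 8.2],
})

# Average score and number of titles for every duration/episode cell.
cells = (
    toy.groupby(["Duration_Group", "Episode_Group"])["Score"]
    .agg(Average_Score="mean", Titles="count")
    .reset_index()
)

MIN_TITLES = 2  # hypothetical threshold for treating a cell as reliable
cells["Reliable"] = cells["Titles"] >= MIN_TITLES
print(cells)
```

In this toy example the highest average comes from a cell with a single title, which is the kind of case worth checking before acting on a top cell.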




#### **B. TV Anime Episode Strategy: How Length Affects Score Consistency and Risk**
**Question**

- What is the optimal episode count range for TV anime to achieve both high and consistent audience scores?

- Which TV anime episode range carries the highest score volatility, and how can studios manage creative and financial risk within it?

- Should studios prioritize shorter (6–12 episodes), mid-length (13–25 episodes), or long-running (>50 episodes) formats when planning new TV anime series?


In [23]:
# ===================================================
# 1. FILTER TV DATA AND CREATE EPISODE GROUPS
# ===================================================
df_tv_only = df_anime_dataset_2023_prep[
    df_anime_dataset_2023_prep['Type'] == 'TV'
].copy()

# Open-ended last bin so series with more than 100 episodes still fall into "> 50"
bins = [-1, 5, 12, 25, 50, float("inf")]
labels = ["1–5", "6–12", "13–25", "26–50", "> 50"]

df_tv_only.loc[:, 'Episode_Group'] = pd.cut(
    df_tv_only['Episodes'],
    bins=bins,
    labels=labels
)

category_order = ["1–5", "6–12", "13–25", "26–50", "> 50"]

# ===================================================
# 2. FIND THE GROUP WITH THE HIGHEST MEDIAN
# ===================================================
median_values = df_tv_only.groupby('Episode_Group')['Score'].median()
highest_median_group = median_values.idxmax()

# ===================================================
# 3. CREATE THE BOX PLOT
# ===================================================
fig = go.Figure()

for category in category_order:
    group_data = df_tv_only[df_tv_only['Episode_Group'] == category]['Score']

    fig.add_trace(go.Box(
        y=group_data,
        name=category,
        marker_color="#6A1B9A" if category == highest_median_group else "#D3D3D3",
        boxmean=True,
        showlegend=False
    ))

# ===================================================
# 4. LAYOUT FOLLOWING THE COMMON SETUP + Y-AXIS WITH SHORT TICK MARKS
# ===================================================
fig.update_layout(
    template="plotly_white",
    font=dict(family="Tahoma"),
    margin=dict(l=40, r=40, t=100, b=60),
    title= dict(
        text="<b>13–25 Episode TV Series Deliver the Strongest Overall Performance</b><br>"
            "<span style='color:#555555; font-size:14px'>(Boxplot comparison of Anime TV series score across different episode ranges)</span>",
        x=0.02,            # horizontal position (0 = left, 1 = right)
        y=0.92,            # push the title/subtitle higher
        xanchor='left',    # anchor left on x
        yanchor='top',     # anchor top on y
        font=dict(size=18, family="Tahoma")
    ),
    xaxis=dict(
        title="Episode Range",
        tickfont=dict(size=13, family="Tahoma"),
        showgrid=False,
        showticklabels=True,
        showline=False,
        ticklen=6,
        tickwidth=1,
        tickcolor='black'
    ),
    yaxis=dict(
        title="",
        range=[0, 10],       # keep the Y axis fixed
        showgrid= True,
        tickfont=dict(size=13, family="Tahoma"),
        showticklabels=True,
        showline=True,
        linewidth=1,
        linecolor='#CCCCCC',
        ticklen=8,
        tickwidth=0.5,
        tickcolor='#CCCCCC',
        ticks='inside'
    ),
    hovermode="x unified",
    height=550,
    width=900
)
fig.show()

try:
    save_chart(fig, step="04", stt="V_C_4.B", title="Episode_Range_maximize_Anime_TV")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

✅ Saved: [04]_[V_C_4.B]_Episode_Range_maximize_Anime_TV (Transparent)


#### **Insight**

- **13–25 episode series continue to deliver the strongest median performance**, standing out as the most consistently high-quality range despite the presence of some low outliers.

- **The 6–12 episode range shows the widest score variability**: it includes the **lowest outliers (down to ~2.9)** but also contains **high-scoring outliers (~9.05)**. This makes it the most high-risk, high-reward episode range.

- **The 26–50 episode range has a median similar to 6–12**, but with **no outliers at all**, indicating highly stable but modest results. This range produces predictable mid-level performance but lacks both high-scoring standouts and low-scoring failures.

- **Series with more than 50 episodes have the most interesting outlier profile**:  
  - **No low outliers**, showing strong baseline stability.  
  - **The highest positive outlier in the dataset (~9.1)**, likely driven by long-running flagship franchises.  
  This range is generally steady but occasionally produces exceptional top-tier titles.
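
The medians and outlier counts quoted above are read off the boxplot. As a cross-check, a minimal sketch along these lines could tabulate them directly. It assumes Tukey's 1.5×IQR rule with pandas' default quantile interpolation (Plotly may compute quartiles slightly differently, so borderline points can differ). The helper `boxplot_summary`, the column name `Episode Range`, and the toy data are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

def boxplot_summary(df, group_col, value_col):
    """Per-group median plus counts of points outside the 1.5*IQR fences."""
    rows = []
    for name, s in df.groupby(group_col, sort=False)[value_col]:
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        rows.append({
            group_col: name,
            "n": int(s.size),
            "median": float(s.median()),
            "low_outliers": int((s < low_fence).sum()),
            "high_outliers": int((s > high_fence).sum()),
        })
    return pd.DataFrame(rows).set_index(group_col)

# Toy data (hypothetical); replace with the processed TV-series frame.
toy = pd.DataFrame({
    "Episode Range": ["6-12"] * 7 + ["13-25"] * 5,
    "Score": [6.4, 6.5, 6.6, 6.7, 6.8, 2.9, 9.0, 7.0, 7.1, 7.2, 7.3, 7.4],
})
summary = boxplot_summary(toy, "Episode Range", "Score")
print(summary)
```

Reading the table next to the chart makes it easier to tell whether "no outliers" reflects genuine stability or simply a small group size (`n`).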




#### **Business Takeaways**

- **Invest primarily in 13–25 episode productions**: This range offers the strongest and reasonably consistent return on quality, making it the most efficient format for high-performing TV anime.

- **Use 6–12 episode series strategically due to their high volatility**:  
  - They can create hits (as seen from the ~9.05 outliers),  
  - but also carry the highest risk of underperformance.  
  These projects require strong creative direction, tight writing, and careful quality control.

- **Leverage the 26–50 episode range for safe, stable output**: This format delivers predictable mid-tier scores with minimal downside, suitable for secondary projects or steady programming without high-stakes expectations.

- **Approach >50 episode productions as franchise-driven investments**:  
  - They are low-risk in terms of poor scoring (no low outliers).  
  - They occasionally produce top-tier successes (the highest outlier at ~9.1).  
  However, they require large budgets and are only viable when supported by strong IP demand.



### **C. Anime Movie Duration Strategy: How Length Influences Audience Scores**

**Question**

- How does movie duration affect average audience scores for anime films?  
- What is the optimal duration range for maximizing ratings and audience satisfaction?  


In [24]:
# ===================================================
# PREPARE DATA - KEEP ONLY Type == 'Movie'
# ===================================================
df_movies = df_anime_dataset_2023_prep[df_anime_dataset_2023_prep['Type'] == 'Movie'].copy()

# Drop rows where Duration Minutes (or Score) is NaN, or Duration Minutes <= 0
df_movies = df_movies.dropna(subset=['Duration Minutes', 'Score'])
df_movies = df_movies[df_movies['Duration Minutes'] > 0]

# Drop movies shorter than 40 minutes or longer than 180 minutes
df_movies = df_movies[(df_movies['Duration Minutes'] >= 40) & (df_movies['Duration Minutes'] <= 180)]

# ===================================================
# CREATE BINS FOR DURATION
# ===================================================
# Use finer-grained bins
bins = [40, 60, 75, 90, 105, 120, 135, 150, 165, 180]
labels = ["40-60", "60-75", "75-90", "90-105", "105-120", "120-135", "135-150", "150-165", "165-180"]
df_movies['DurationBin'] = pd.cut(df_movies['Duration Minutes'], bins=bins, labels=labels, include_lowest=True)

# Compute the mean score for each bin (observed=False keeps empty bins on the axis)
df_mean = df_movies.groupby('DurationBin', observed=False)['Score'].mean().reset_index()

# ===================================================
# DRAW LINE PLOT (Mean Score by Duration Bin) - UPDATED
# ===================================================
fig_line = go.Figure(go.Scatter(
    x=df_mean['DurationBin'],
    y=df_mean['Score'],
    mode='lines+markers',
    line=dict(color='#6A1B9A', width=3),
    marker=dict(size=10)
))

fig_line.update_layout(
    template="plotly_white",
    title=dict(
        text=(
            "<b>Longer Anime Movies Deliver Higher Average Scores</b><br>"
            "<span style='color:#555555; font-size:14px'>(Average Score by Duration Bin shows overall upward trend from short to long movies)</span>"
        ),
        x=0.02,            # horizontal position (0=left, 1=right)
        y=0.92,            # push the title/subtitle higher
        xanchor='left',    # left-align on x
        yanchor='top',     # top-align on y
        font=dict(size=18, family="Tahoma")
    ),
    xaxis=dict(
        title="Duration Bin (Minutes)",
        showgrid=True,
        tickfont=dict(family="Tahoma", size=12, color="black")
    ),
    yaxis=dict(
        title="",  # no y-axis label
        range=[0, 10],
        showgrid= True,
        tickfont=dict(size=13, family="Tahoma"),
        showticklabels=True,
        showline=True,
        linewidth=2,
        linecolor='#CCCCCC',
        ticklen=8,
        tickwidth=0.5,
        tickcolor='#CCCCCC',
        ticks='inside'
    ),
    hovermode="x unified",
    showlegend=False,
    height=550,
    width=900
)
try:
    # Save the line chart built in this cell (fig_line), not the earlier boxplot (fig)
    save_chart(fig_line, step="04", stt="V_C_4.C", title="dUration_Range_Maiximize_Anime_Movies")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")

fig_line.show()



✅ Saved: [04]_[V_C_4.C]_dUration_Range_Maiximize_Anime_Movies (Transparent)



#### **Insight**

- **Longer movies tend to achieve higher average scores**, with the highest average observed for movies of 165–180 minutes (~8.2).  
- **Short movies (under 90 minutes) have the lowest average scores** (~6.5 to 6.7), suggesting that limited runtime may constrain audience engagement and storytelling.  
- **Mid-length movies (90–120 minutes) maintain stable, decent scores** (~7–7.35), representing a safe and predictable option.  
- **Minor fluctuations appear in the 135–150 minute range**, but they do not break the overall upward trend, indicating that moderate flexibility in runtime is acceptable.
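
A line of per-bin means hides how many titles sit behind each point, and the longest bins are likely to be sparse. The sketch below is one hedged way to check this: it reuses the bin edges from the cell above and reports count, mean, and median per bin. The helper `duration_bin_stats`, the `min_count` threshold, and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Same bin edges as the plotting cell; labels are derived from the edges.
bins = [40, 60, 75, 90, 105, 120, 135, 150, 165, 180]
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]

def duration_bin_stats(df, min_count=30):
    """Count, mean and median Score per duration bin, flagging thin bins."""
    binned = pd.cut(df["Duration Minutes"], bins=bins, labels=labels,
                    include_lowest=True)
    stats = (
        df.groupby(binned, observed=False)["Score"]
          .agg(n="count", mean="mean", median="median")
    )
    # min_count is an arbitrary illustrative threshold, not a statistical rule.
    stats["small_sample"] = stats["n"] < min_count
    return stats

# Toy rows (hypothetical); replace with df_movies from the notebook.
toy = pd.DataFrame({
    "Duration Minutes": [45, 50, 95, 100, 170],
    "Score": [6.0, 7.0, 7.0, 7.5, 8.2],
})
stats = duration_bin_stats(toy, min_count=2)
print(stats)
```

If a bin such as 165–180 turns out to be flagged as `small_sample`, its high average is better read as a handful of standout films than as a general effect of runtime.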

#### **Business Takeaways**

- **Consider producing long movies (150–180 minutes) to maximize ratings**: A longer runtime allows deeper storytelling, which the audience scores suggest viewers favor.  
- **Approach short movies (<90 min) with care**: This range has the lowest average scores, suggesting that a stronger plot and higher production quality are needed to improve performance.  
- **Mid-length movies (90–120 min) are a reliable format**: A safe choice for studios seeking consistent quality without the risks associated with very short or very long films.


### **D. Anime released in April, July and October have the highest number of releases and the highest average scores**
**Question**: Which months should studios target for releases to achieve high average scores and maximize production impact?

In [32]:
def score_by_month_clean(df_anime_dataset_2023_prep):
    import plotly.graph_objects as go

    # --- 1. Extract month and score ---
    month_map = {
        1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
        7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'
    }

    score_by_month = df_anime_dataset_2023_prep.groupby("Aired Month")["Score"].mean().sort_index()

    month_names = [month_map[m] for m in score_by_month.index]
    scores = score_by_month.values

    # --- 2. Coloring ---
    colors = []
    for month in month_names:
        if month in ["Apr", "Jul", "Oct"]:     # peak months = purple
            colors.append("#6A1B9A")
        else:                                  # others = gray
            colors.append("#cfcfcf")

    # --- 3. Build figure ---
    fig = go.Figure()

    fig.add_trace(go.Bar(
        x=month_names,
        y=scores,
        marker_color=colors,
        marker_line_width=1,
        marker_line_color="white",
        text=[f"{s:.2f}" for s in scores],
        texttemplate="%{text}",
        textposition="outside",
        width=0.5,                # bar width
        hovertemplate="<b>Month:</b> %{x}<br><b>Avg Score:</b> %{y:.2f}<extra></extra>",
        showlegend=False
    ))

    # --- 4. Layout ---
    fig.update_layout(
        template="plotly_white",
        height=500,
        width=1100,
        margin=dict(l=0, r=40, t=120, b=60),  # title flush left
        font=dict(family="Tahoma"),
        bargap=0.5                           # gap = half the bar width
    )

    fig.update_xaxes(
        title="",
        tickangle=0,
        showgrid=False,
        tickfont=dict(family="Tahoma", size=13, color="black", weight="bold")
    )

    fig.update_yaxes(
        title="",
        showticklabels=False,
        showgrid=False
    )

    # --- 5. Title flush with the left margin ---
    fig.add_annotation(
        text=(
            "<b>Average Anime Score by Month (Clean Data)</b><br>"
            "<span style='color:#555555; font-size:15px'>Seasonal cycles may influence anime quality</span>"
        ),
        x=0.0, y=1.18, xref="paper", yref="paper",   # x = 0 → flush left
        showarrow=False, align="left",
        font=dict(family="Tahoma", size=18)
    )

    # --- 6. Save (optional) ---
    try:
        save_chart(fig, step="04", stt="V_C_4.D", title="Score_by_Month_Clean")
    except Exception as e:
        print("⚠️ Cannot save due to Kaleido error:", e)

    fig.show()


# Run
score_by_month_clean(df_anime_dataset_2023_prep)


✅ Saved: [04]_[V_C_4.D]_Score_by_Month_Clean (Transparent)


#### **Insight**

The analysis of average anime scores by release month (clean data) reveals a clear seasonal pattern:

* **Peak Months:** April, July, and October show the **highest average scores**, indicating that anime released during these months tend to receive better reception.
* **Low Months:** January, February, March, June, August, September, November, and December have comparatively lower average scores, possibly reflecting fewer releases or less competitive seasons (the chart shows averages only, not release counts).
* **Seasonal Pattern:** The peak months mark the start of three major anime seasons, **Spring (Apr), Summer (Jul), and Fall (Oct)**, which suggests that studios may concentrate their stronger titles in these slots.
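
The section heading mentions both release volume and average score, while the chart plots averages only. A small sketch like the following could put the two side by side. It assumes the same `Aired Month` and `Score` columns used above; the helper `month_summary`, the `season_start` flag, and the toy rows are hypothetical.

```python
import pandas as pd

# Months in which the four broadcast seasons begin (Winter, Spring, Summer, Fall).
SEASON_START_MONTHS = [1, 4, 7, 10]

def month_summary(df):
    """Number of scored releases and mean Score per airing month."""
    out = (
        df.groupby("Aired Month")["Score"]
          .agg(releases="count", avg_score="mean")   # count ignores NaN scores
          .sort_index()
    )
    out["season_start"] = out.index.isin(SEASON_START_MONTHS)
    return out

# Toy rows (hypothetical); replace with df_anime_dataset_2023_prep.
toy = pd.DataFrame({
    "Aired Month": [1, 4, 4, 4, 7, 7, 2],
    "Score": [6.0, 7.0, 7.2, 6.8, 7.1, 6.9, 6.1],
})
monthly = month_summary(toy)
print(monthly)
```

Including January in the flag is deliberate: it also opens a season, so comparing it with April, July, and October helps show whether the pattern is about season starts in general or about those three months specifically.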



#### **Business Takeaway**

* **Studio Planning:** Anime production and release schedules should prioritize **April, July, and October** to maximize both audience engagement and critical reception.
* **Marketing Strategy:** Promotional campaigns can be focused around peak months to leverage higher audience interest and potential viewership.
* **Content Strategy:** Understanding the seasonal scoring pattern allows studios to **optimize resource allocation**, releasing their strongest titles during months with historically higher scores.

### **E. Anime score trend analysis (1920 - 2020+) for producers**
**Question**: How can we leverage the continually increasing score trend (currently near 7.0) to ensure the quality and market success of new anime projects?


In [33]:
score_by_year = df_anime_dataset_2023_prep.groupby('Aired Year')['Score'].mean().dropna()
x_years = score_by_year.index
y_scores = score_by_year.values

# Create Line Chart — Average Score by Release Year (Cleaned)
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=x_years,
    y=y_scores,
    mode='lines+markers',
    line=dict(color='#6A1B9A', width=3),
    marker=dict(size=7, color='#6A1B9A'),
    name='Clean data',
    showlegend=False,
    text=[f"{score:.2f}" for score in y_scores],
    hovertemplate='Year=%{x}<br>Average Score=%{y:.2f}<extra></extra>'
))

fig.update_layout(
    template='plotly_white',
    font=dict(family="Tahoma"),
    margin=dict(l=40, r=40, t=100, b=60),
    title=dict(
        text="<b>Complete Timeline Shows Steady Average Score Improvement</b><br><span style='color:#555555; font-size:14px'>(Average Score by Release Year — Cleaned Data)</span>",
        x=0.02
    ),
    hovermode="x unified",
    xaxis=dict(
        title="",  # no x-axis label
        showgrid=False,
        tickformat="d",
        nticks=15
    ),
    yaxis=dict(
        title="",  # no y-axis label
        showgrid=False,
        tickformat=".2f",
        range=[0, max(y_scores) * 1.1]  # start at 0 and end at max + 10%
    ),
    legend=dict(
        x=1.02, y=0.98, xanchor='left', yanchor='top', bgcolor='rgba(0,0,0,0)'
    )
)
try:
    save_chart(fig, step="04", stt="V_C_4.E", title="Average_Score_by_Release_Year_Cleaned")
except Exception as e:
    print(f"⚠️ Could not save chart due to Kaleido error: {e}")
    print("✅ Chart is displayed below regardless.")
fig.show()


✅ Saved: [04]_[V_C_4.E]_Average_Score_by_Release_Year_Cleaned (Transparent)


#### **Insight**
- **Overall Upward Trend**
Anime average scores have shown a clear long-term increase, moving from **≈ 5.0–5.3** (1920s–1940s) to a high of **≈ 6.8–6.9** (recent years). This indicates rising perceived quality and audience satisfaction.

- **Slow Growth Era (1920–1970)**
Scores were stable, mostly below **5.5**. This was the foundational development period for the industry.

- **The Leap (1970–1980)**
Scores saw a significant and sustained increase, surpassing **6.0** for the first time (around 1975). This marks the maturation of filmmaking techniques and the emergence of classic titles.

- **Modern Peak (2000 – Present)**
Scores continued to climb, reaching an all-time high (near **7.0**). Likely drivers include:  
  - Higher production values (Digital/CGI).  
  - An expanded global market, in which a larger and more selective audience rewards genuinely strong works with higher ratings.
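
Yearly averages for the earliest decades may rest on very few titles, so the era boundaries above are easier to judge with counts and a smoothed line alongside the raw mean. The sketch below is one possible way to do that, assuming the same `Aired Year` and `Score` columns; the helper `yearly_trend`, the window size, and the toy rows are hypothetical.

```python
import pandas as pd

def yearly_trend(df, window=5):
    """Titles per year, mean Score per year, and a rolling mean of those means."""
    by_year = (
        df.groupby("Aired Year")["Score"]
          .agg(n="count", avg="mean")
          .sort_index()
    )
    # The window spans rows (years present in the data), not calendar years,
    # and it is an unweighted mean of yearly means.
    by_year["rolling_avg"] = by_year["avg"].rolling(window, min_periods=1).mean()
    return by_year

# Toy rows (hypothetical); replace with df_anime_dataset_2023_prep.
toy = pd.DataFrame({
    "Aired Year": [2000, 2000, 2001, 2002],
    "Score": [6.0, 7.0, 7.0, 8.0],
})
trend = yearly_trend(toy, window=2)
print(trend)
```

Years with a small `n` can then be read with caution, or dropped, before drawing conclusions about when scores first crossed a given level.
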

#### **Business Takeaways**
**1. Focus on Premium Quality**  
- **Action:** Target a minimum score above **6.5** to be considered successful in the current competitive market. Allocate larger budgets to top-tier animation studios and experienced scriptwriting teams.

**2. Leverage Modern Storytelling**  
- **Action:** Develop new IPs focusing on unique or complex themes, or reboot/remake older works with a more mature storytelling approach to meet increasingly sophisticated audience tastes, in line with the post-2000 rise in scores.

**3. Risk Management for Legacy Projects**  
- **Action:** Any project based on works from low-scoring periods (**1920–1970**) must be completely restructured to meet modern quality standards. Do not rely solely on brand recognition.

**4. Strategic Investment**  
- **Action:** Analyze the specific drivers of the score spike around **2020** (e.g., streaming platform boom, successful genres like Isekai/Fantasy) and prioritize funding for studios with a stable track record of high-scoring projects (**>6.5**) in these trending (meta) genres.

## **VI. Deep Business Insights & Strategic Recommendations**

**1. Key Patterns Across All Themes**

**2. Strategic Recommendations**

