# 1.Context
The team is acting in the role of Producer. Our goal is to develop a release strategy for anime that maximizes the chance of success. The team wants to leverage the current dataset to extract useful insights and build a machine learning model to quantify the impact of various factors on the success of an anime title.

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
import os
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import kagglehub
import ast

In [63]:
# Download latest version
path = kagglehub.dataset_download("dbdmobile/myanimelist-dataset")

print("Path to dataset files:", path)

# List all files in the dataset directory
folder = os.listdir(path)
print("Files in dataset directory:", folder)
for file in folder:
    print(file)

Path to dataset files: C:\Users\Huy\.cache\kagglehub\datasets\dbdmobile\myanimelist-dataset\versions\5
Files in dataset directory: ['anime-dataset-2023.csv', 'anime-filtered.csv', 'anime_dataset_2023.csv', 'final_animedataset.csv', 'user-filtered.csv', 'users-details-2023.csv', 'users-score-2023.csv']
anime-dataset-2023.csv
anime-filtered.csv
anime_dataset_2023.csv
final_animedataset.csv
user-filtered.csv
users-details-2023.csv
users-score-2023.csv


In [64]:
current_dir = Path.cwd()

project_root = current_dir.parent 

raw_data_path = project_root / "data" / "raw" / "anime-dataset-2023.csv"
processed_data_path = project_root / "data" / "processed" / "prepared_data.csv"

df_anime_dataset_2023 = pd.read_csv(raw_data_path)
df_anime_dataset_2023_prep = pd.read_csv(processed_data_path)

In [65]:
pio.templates.default = "plotly_white"

In [66]:
csv_file2 = df_anime_dataset_2023_prep

In [67]:
print(df_anime_dataset_2023.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   anime_id      24905 non-null  int64 
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popu

In [68]:
print(df_anime_dataset_2023_prep.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   anime_id          24905 non-null  int64  
 1   Name              24905 non-null  object 
 2   Score             15692 non-null  float64
 3   Genres            19976 non-null  object 
 4   Type              24831 non-null  object 
 5   Episodes          24294 non-null  float64
 6   Status            24905 non-null  object 
 7   Producers         11555 non-null  object 
 8   Studios           14379 non-null  object 
 9   Source            21216 non-null  object 
 10  Rating            24236 non-null  object 
 11  Rank              20293 non-null  float64
 12  Popularity        24718 non-null  float64
 13  Favorites         24905 non-null  int64  
 14  Scored By         15692 non-null  float64
 15  Members           24905 non-null  int64  
 16  Aired Date Start  20090 non-null  object

# 2.Data Frame Introduction

1. Basic Identification & Description
- **anime_id**: Unique ID for each anime.
- **Name**: Name of the anime in its original language.
- **English name**: English name of the anime.
- **Other name**: Native-language name or alternative title of the anime.
- **Synopsis**: Short description or summary of the anime's plot.
- **Genres**: Genres of the anime, separated by commas.
- **Image URL**: URL of the anime's image or poster.

2. Production & Technical Details
- **Type**: Type of anime.
- **Source**: Source material of the anime.
- **Producers**: Production companies or producers of the anime.
- **Studios**: Animation studios that worked on the anime.
- **Licensors**: Licensors of the anime.
- **Episodes**: Number of episodes in the anime.
- **Duration**: Length of each episode.

3. Release & Airing Information
- **Aired**: Dates on which the anime aired.
- **Premiered**: Season and year in which the anime premiered.
- **Status**: Status of the anime.

4. Audience Engagement & Performance Metrics
- **Score**: Score given to the anime.
- **Rating**: Age rating of the anime.
- **Rank**: Rank of the anime based on popularity or other criteria.
- **Popularity**: Popularity rank of the anime.
- **Favorites**: Number of times users marked the anime as a favorite.
- **Scored By**: Number of users who scored the anime.
- **Members**: Number of members who added the anime to their list on the platform.

*Unnecessary Feature*
- Drop the features that are not needed from the dataframe.

In [69]:
df_anime_dataset_2023 = df_anime_dataset_2023.drop(columns=['English name', 'Other name', 'Synopsis', 'Image URL'])

# 3.Data Insight

## 3.1.Insight Question 1:
Understanding the **'Score'** feature is paramount for anyone aspiring to create highly-rated anime or to analyze the factors contributing to critical and audience success. A deep dive into how anime are scored, the typical distribution of these scores, and what values are most common can provide invaluable preliminary insights into audience preferences and industry trends. However, drawing accurate conclusions hinges on the quality of our data. For our initial exploration, we will directly examine the raw 'Score' column, aiming to uncover its inherent characteristics and, more importantly, to identify where the unprocessed data might lead us astray, thereby underscoring the critical role of data preprocessing.

### 3.1.1.Misleading Insight
- What is the distribution of anime scores within our dataset?

In [70]:
# Check whether the 'Score' column exists in the DataFrame
if 'Score' in df_anime_dataset_2023.columns:
    try:
        # Plot the distribution
        fig = px.histogram(
            df_anime_dataset_2023,
            x='Score',
            title='Anime Score Distribution',
            labels={'Score': 'Score'}
        )

        # Find the mode (the most frequent value)
        mode_value = df_anime_dataset_2023['Score'].mode()[0]
        fig.add_annotation(
            x=mode_value,
            y=0,
            text=f"Mode: {mode_value}",
            showarrow=True,
            arrowhead=2,
            ax=60,  # Offset the arrow tail and label to the right (positive value)
            ay=-80,  # Offset the arrow tail and label upward (negative value)
            font=dict(color="#800080", size=11),  # Purple label text
            arrowcolor="black"  # Black arrow
        )

        # Show the figure
        fig.show()
    except Exception as e:
        print("Could not draw this chart because the data is unsuitable:", str(e))
else:
    print("Could not draw this chart because the 'Score' column does not exist in the DataFrame.")

# Count how often each value appears in the 'Score' column
score_counts = df_anime_dataset_2023['Score'].value_counts().reset_index()

# Rename the columns for readability
score_counts.columns = ['Score', 'Frequency']

# Keep the 15 most common values
top_15_scores = score_counts.head(15)

# Draw a horizontal bar chart
fig = px.bar(
    top_15_scores,
    x='Frequency',
    y='Score',
    orientation='h',  # Horizontal bars
    title='Frequency of the 15 Most Common Score Values',
    labels={'Score': 'Score Value', 'Frequency': 'Number of Anime'},
    text='Frequency'
)

# Customize the appearance of the chart
fig.update_traces(
    textposition='outside',
    marker_color='purple',  # Bar color
    marker_line_color='white',  # Bar outline color
    marker_line_width=1.2  # Bar outline width
)

fig.update_layout(
    yaxis=dict(categoryorder='total ascending'),  # Order the y-axis categories by ascending total
    bargap=0.3  # Gap between bars
)

# Show the figure
fig.show()

**Conflict**: The initial insights derived from the 'Score' feature, as presented in Q1 and Q2, are significantly skewed and misleading due to critical data integrity issues. Without proper preprocessing, the visualizations fail to convey an accurate representation of anime scores.
- Error 1: Mismatched Data Type - 'Score' as an Object (String) instead of Numeric.
  - Impact on Q1 (Histogram): When the 'Score' column is treated as a generic object type, Plotly is unable to interpret these values as continuous numerical data. Instead, it might treat each unique string (e.g., '8.1', '7.5', '6') as a distinct categorical bin. This prevents the generation of a meaningful numerical distribution, resulting in a histogram that either appears fragmented, displays an excessive number of individual bars for each distinct score string, or even raises a TypeError, completely misrepresenting the intended continuous nature of scores.
  - Impact on Q2 (Bar Chart of Unique Values): While a bar chart of unique values can still be generated, the underlying issue persists. The chart will correctly show the frequency of unique strings, but it will not allow for numerical analysis or proper sorting/grouping based on the true score value. It reinforces the idea that these are distinct categories rather than points on a numerical scale.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
  - Impact on Q1 (Histogram): The presence of the string 'UNKNOWN' within the 'Score' column further corrupts any attempt at numerical analysis. If a numeric conversion is attempted without prior cleaning, 'UNKNOWN' values will cause errors or be implicitly dropped, potentially leading to an incomplete dataset for the plot. If the data is treated categorically, 'UNKNOWN' will appear as a prominent bar in the distribution, falsely suggesting it's a valid "score" and dominating the view, thus obscuring the actual score distribution.
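
The two errors above can be reproduced on a tiny stand-in column. The sketch below is only an illustration under stated assumptions: the values are made up, and `pd.to_numeric(..., errors="coerce")` is one possible way to obtain a numeric column, not necessarily the step used to build `prepared_data.csv`.

```python
import pandas as pd

# Toy stand-in for the raw 'Score' column (illustrative values, not taken from the dataset)
raw = pd.DataFrame({"Score": ["8.1", "7.5", "UNKNOWN", "6", "UNKNOWN"]})

# On the raw strings, the placeholder is the most frequent "score"
raw_mode = raw["Score"].mode()[0]

# errors="coerce" turns every string that cannot be parsed (here 'UNKNOWN') into NaN
clean = raw.copy()
clean["Score"] = pd.to_numeric(clean["Score"], errors="coerce")

print("Mode of raw column:", raw_mode)
print("dtype after conversion:", clean["Score"].dtype)
print("Missing scores:", clean["Score"].isna().sum())
print("Mean of valid scores:", clean["Score"].mean())
```

Once the column is numeric, a histogram bins the values on a continuous axis and the missing entries no longer appear as a category.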

### 3.1.2.Correct Insight

In [71]:
# Check whether the 'Score' column exists in the DataFrame
if 'Score' in df_anime_dataset_2023_prep.columns:
    try:
        # Plot the distribution
        fig = px.histogram(
            df_anime_dataset_2023_prep,
            x='Score',
            title='Anime Score Distribution',
            labels={'Score': 'Score'}
        )
        # Show the figure
        fig.show()
    except Exception as e:
        print("Could not draw this chart because the data is unsuitable:", str(e))
else:
    print("Could not draw this chart because the 'Score' column does not exist in the DataFrame.")

# Count how often each value appears in the 'Score' column
score_counts = df_anime_dataset_2023_prep['Score'].value_counts().reset_index()

# Rename the columns for readability
score_counts.columns = ['Score', 'Frequency']


## 3.2.Insight Question 2:
Understanding the different types of anime available in the dataset—such as TV series, movies, OVAs, or specials—is fundamental for grasping the landscape of anime production and consumption. This feature provides a high-level categorization that can influence viewership, production cycles, and audience expectations. Our initial exploration of the **Type** column will aim to quantify the prevalence of each type. However, similar to the 'Score' feature, we will first analyze the raw, unprocessed data to highlight how subtle inconsistencies or non-standard entries can distort our understanding and necessitate robust data cleaning.

### 3.2.1.Misleading Insight
- What is the proportional distribution of different anime types within our dataset?

In [90]:
# Recalculate type_counts to include 'UNKNOWN' if present in the data
type_counts_full = (
    df_anime_dataset_2023['Type']
    .value_counts(dropna=False)
    .reset_index()
)
type_counts_full.columns = ['Type', 'Count']

# If there are missing values (NaN), replace them with 'UNKNOWN' for clarity
type_counts_full['Type'] = type_counts_full['Type'].fillna('UNKNOWN')

# Calculate percentage
type_counts_full['Percentage'] = (type_counts_full['Count'] / type_counts_full['Count'].sum()) * 100

# Find the type with the smallest percentage
min_idx_full = type_counts_full['Percentage'].idxmin()
min_type_full = type_counts_full.loc[min_idx_full, 'Type']
min_percentage_full = type_counts_full.loc[min_idx_full, 'Percentage']

# Create gradient color scheme
colors = ['#8B5CF6' if x != 'UNKNOWN' else '#CBD5E1' for x in type_counts_full['Type']]

fig = px.bar(
    type_counts_full,
    x='Type',
    y='Count',
    title='Distribution of Anime Types (Including UNKNOWN)',
    labels={'Type': 'Anime Type', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale=['#DDD6FE', '#8B5CF6', '#6D28D9']
)

fig.update_traces(
    marker_line_color='white',
    marker_line_width=2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>%{x}</b><br>Count: %{y:,}<br>Percentage: %{customdata:.2f}%<extra></extra>',
    customdata=type_counts_full['Percentage']
)

fig.add_annotation(
    x=min_type_full,
    y=type_counts_full.loc[min_idx_full, 'Count'],
    text=f"<b>Smallest</b><br>{min_type_full}<br>{min_percentage_full:.2f}%",
    showarrow=True,
    arrowhead=2,
    arrowsize=1,
    arrowwidth=3,
    arrowcolor="#EF4444",
    ax=70,
    ay=-90,
    font=dict(size=12, color="#DC2626", family="Inter", weight="bold"),
    align="center",
    bgcolor="rgba(254, 242, 242, 0.95)",
    bordercolor="#EF4444",
    borderwidth=2.5,
    borderpad=8,
    opacity=1
)

fig.update_layout(
    font=dict(family="Inter", size=13, color="#1F2937"),
    title={
        'text': '<b>Distribution of Anime Types</b><br><sup>Including UNKNOWN values</sup>',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, color="#111827")
    },
    showlegend=False,
    coloraxis_showscale=False,
    height=650,
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    margin=dict(t=100, l=60, r=220, b=80),
    xaxis=dict(
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB',
        tickfont=dict(size=12, color="#4B5563")
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        tickfont=dict(size=11, color="#6B7280"),
        tickformat=','
    )
)

fig.show()

**Conflict**: The primary conflict distorting our initial insights for the 'Type' feature stems from the non-standard representation of missing values.
- Error: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
    - Impact on Insights (Pie Chart and Bar Chart): Instead of being treated as a missing entry that visualization tools typically ignore or handle separately, 'UNKNOWN' is interpreted by Plotly as a legitimate, distinct category within the 'Type' feature. This leads to 'UNKNOWN' appearing as a segment in the pie chart or a bar in the bar chart, falsely suggesting it is an actual anime type. Consequently, the true distribution and prevalence of valid anime types are misrepresented, as 'UNKNOWN' consumes a portion of the visual space and frequency count that should solely belong to actual categorical types.

*Note*: *For categorical data like **Type**, even if its dtype is object (which typically means it contains strings), Plotly will generally interpret and plot it correctly as categories. When creating bar charts or pie charts with a column of object dtype, Plotly treats each unique string value as a distinct category. The issues typically arise when an object column is supposed to be numeric but isn't (like 'Score'), or when it contains non-standard representations of missing values or erroneous strings that get treated as valid categories (like 'UNKNOWN' in 'Type' or in 'Score'). So, for Type, the object dtype itself is not the core problem for displaying categories, but the content of those categories (e.g., 'UNKNOWN') is.*
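
To make the effect of the placeholder concrete, here is a minimal sketch on made-up values. It assumes the only cleaning needed is turning the 'UNKNOWN' string into NaN; `Series.mask` is used here as one way to do that.

```python
import pandas as pd

# Toy stand-in for the raw 'Type' column (illustrative values only)
raw_type = pd.Series(["TV", "Movie", "UNKNOWN", "TV", "OVA", "UNKNOWN"], name="Type")

# Raw counts: the placeholder is counted as if it were a real anime type
raw_counts = raw_type.value_counts()

# mask() replaces matching entries with NaN; value_counts() skips NaN by default
clean_type = raw_type.mask(raw_type == "UNKNOWN")
clean_counts = clean_type.value_counts()
clean_share = clean_type.value_counts(normalize=True) * 100

print(raw_counts)
print(clean_counts)
print(clean_share.round(1))
```

With the placeholder removed, the percentages are computed over valid types only, so each share reflects the real mix of formats.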


### 3.2.2.Correct Insight

In [92]:
# Recalculate type_counts, excluding UNKNOWN/NaN values
type_counts_full = (
    df_anime_dataset_2023_prep['Type']
    .value_counts(dropna=True)  # dropna=True excludes NaN
    .reset_index()
)
type_counts_full.columns = ['Type', 'Count']

# Filter out 'UNKNOWN' if it exists
type_counts_full = type_counts_full[type_counts_full['Type'].str.upper() != 'UNKNOWN']

# Calculate percentage
type_counts_full['Percentage'] = (type_counts_full['Count'] / type_counts_full['Count'].sum()) * 100

# Find the type with the smallest percentage
min_idx_full = type_counts_full['Percentage'].idxmin()
min_type_full = type_counts_full.loc[min_idx_full, 'Type']
min_percentage_full = type_counts_full.loc[min_idx_full, 'Percentage']

fig = px.bar(
    type_counts_full,
    x='Type',
    y='Count',
    title='Distribution of Anime Types',
    labels={'Type': 'Anime Type', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale=['#DDD6FE', '#8B5CF6', '#6D28D9']
)

fig.update_traces(
    marker_line_color='white',
    marker_line_width=2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>%{x}</b><br>Count: %{y:,}<br>Percentage: %{customdata:.2f}%<extra></extra>',
    customdata=type_counts_full['Percentage']
)

fig.add_annotation(
    x=min_type_full,
    y=type_counts_full.loc[min_idx_full, 'Count'],
    text=f"<b>Smallest</b><br>{min_type_full}<br>{min_percentage_full:.2f}%",
    showarrow=True,
    arrowhead=2,
    arrowsize=1,
    arrowwidth=3,
    arrowcolor="#EF4444",
    ax=70,
    ay=-90,
    font=dict(size=12, color="#DC2626", family="Inter", weight="bold"),
    align="center",
    bgcolor="rgba(254, 242, 242, 0.95)",
    bordercolor="#EF4444",
    borderwidth=2.5,
    borderpad=8,
    opacity=1
)

fig.update_layout(
    font=dict(family="Inter", size=13, color="#1F2937"),
    title={
        'text': '<b>Distribution of Anime Types</b><br><sup>Cleaned Data</sup>',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, color="#111827")
    },
    showlegend=False,
    coloraxis_showscale=False,
    height=650,
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    margin=dict(t=100, l=60, r=220, b=80),
    xaxis=dict(
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB',
        tickfont=dict(size=12, color="#4B5563")
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        tickfont=dict(size=11, color="#6B7280"),
        tickformat=','
    )
)

fig.show()

**Deep Insight**

- What is the average score for each anime Type?

In [74]:
# import pandas as pd
# import plotly.express as px
# import plotly.io as pio

# pio.templates.default = "plotly_white"

# # Convert Score to numeric
# df_anime_dataset_2023["Score"] = pd.to_numeric(
#     df_anime_dataset_2023["Score"],
#     errors="coerce"
# )

# # Average score by Type
# average_score_by_type = (
#     df_anime_dataset_2023
#     .groupby("Type")["Score"]
#     .mean()
#     .reset_index()
# )
# average_score_by_type.columns = ["Type", "Average Score"]

# base_color = "#C6A3FF"

# fig = px.bar(
#     average_score_by_type,
#     x="Type",
#     y="Average Score",
#     title="Average Score by Anime Type",
#     labels={"Type": "Anime Type", "Average Score": "Average Score"},
#     text="Average Score",
#     category_orders={"Type": sorted(average_score_by_type["Type"].unique())}
# )

# fig.update_traces(
#     marker_color=base_color,
#     marker_line_color="white",
#     marker_line_width=1.2,
#     textposition="outside",
#     texttemplate="%{text:.2f}",
#     hovertemplate="<b>Type:</b> %{x}<br><b>Average Score:</b> %{y:.2f}<extra></extra>",
# )

# fig.update_xaxes(
#     showgrid=False,
#     zeroline=False,
#     tickangle=-20
# )

# fig.update_yaxes(
#     showgrid=False,
#     gridcolor="#EEEEEE",
#     zeroline=False,
#     tickformat=".1f"
# )

# fig.update_layout(
#     title={"text": "Average Score by Anime Type", "x": 0.5, "xanchor": "center"},
#     bargap=0.3,
#     height=550,
#     font=dict(family="Inter", size=13),
#     title_font=dict(size=20),
#     margin=dict(t=80, l=60, r=40, b=80),
#     hoverlabel=dict(bgcolor="white", bordercolor="#888", font_size=12),
# )

# fig.show()
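
The commented-out cell above converts 'Score' in place on the raw frame. A non-mutating variant could look like the sketch below; it runs on made-up rows, and the column names `average_score` and `anime_count` are illustrative choices rather than part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the raw frame (illustrative values only)
raw = pd.DataFrame({
    "Type":  ["TV", "TV", "Movie", "Movie", "UNKNOWN", "OVA"],
    "Score": ["8.0", "7.0", "6.5", "UNKNOWN", "5.0", "UNKNOWN"],
})

# assign() returns a new frame, so the raw string column stays untouched
work = raw.assign(Score=pd.to_numeric(raw["Score"], errors="coerce"))

# Keep only rows with a valid type and a valid numeric score
work = work[work["Type"] != "UNKNOWN"].dropna(subset=["Score"])

# Average score and number of scored titles per type
avg_by_type = (
    work.groupby("Type")["Score"]
    .agg(average_score="mean", anime_count="count")
    .reset_index()
)
print(avg_by_type)
```

Reporting the count next to the mean helps spot types whose average rests on very few titles.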


In [75]:
print(df_anime_dataset_2023_prep.shape)

(24905, 21)


## 3.3.Insight Question 3
The **Genres** feature is incredibly rich, offering deep insights into the thematic elements, target audience, and overall feel of an anime. Understanding the prevalence of different genres is crucial for identifying trends, audience preferences, and even predicting the potential success of new titles. However, unlike **Score** or **Type** which typically hold a single value, **Genres** often contains multiple categories listed together for a single anime. Our initial exploration will examine this raw, combined genre data to understand its direct distribution, while simultaneously setting the stage to reveal how such an aggregated format can significantly complicate accurate analysis and lead to misleading conclusions if not properly preprocessed.

### 3.3.1.Misleading Insight:
- What are the most frequently appearing genre combinations or genre strings as they are originally listed in the dataset?

In [94]:
genre_counts = df_anime_dataset_2023['Genres'].value_counts().head(15).reset_index()
genre_counts.columns = ['Genres', 'Count']

# Sort by Count for readability
genre_counts = genre_counts.sort_values('Count', ascending=True)

fig = px.bar(
    genre_counts,
    x='Count',
    y='Genres',
    orientation='h',
    title='Top 15 Most Popular Anime Genre Combinations (Original Data)',
    labels={'Genres': 'Genre Combination', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale='RdPu',
    range_color=[0, 7000]  # Only counts of 7,000 or more reach the darkest purple
)

fig.update_traces(
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Genres:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

fig.update_xaxes(
    showgrid=False,
    zeroline=False,
    tickformat=','
)
fig.update_yaxes(
    showgrid=False,
    zeroline=False
)

fig.update_layout(
    title={'text': 'Top 15 Most Popular Anime Genre Combinations (Original Data)', 'x': 0.5, 'xanchor': 'center'},
    bargap=0.2,
    height=600,
    font=dict(family='Inter', size=13),
    title_font=dict(size=20),
    plot_bgcolor='white',
    paper_bgcolor='white',
    margin=dict(t=80, l=250, r=40, b=80),
    hoverlabel=dict(bgcolor='white', bordercolor='#888', font_size=12),
    coloraxis_colorbar=dict(title="Count", thickness=15, len=0.7)
)

fig.show()

**Conflict**: The initial insight derived from the **Genres** feature (identifying top genre combinations) is severely flawed and misleading due to two significant data integrity issues. Without proper preprocessing, the visualization fails to reveal the true popularity of individual genres or accurate genre groupings.
- Error 1: Aggregated Genre Strings (Multiple Genres in One Entry).
  - Impact on Insight (Top N Genre Combinations Bar Chart): The most critical issue is that the **Genres** column often contains multiple genres concatenated into a single string (e.g., "Action, Comedy, Shounen"). When we count the frequency of these strings, we are actually counting combinations of genres, not individual genres themselves. This completely obscures the true popularity of any single genre. For example, "Action" might be present in many combinations, but its individual popularity isn't reflected. The bar chart will show unique, often lengthy, genre combinations as categories, making it impossible to ascertain which core genres are truly prevalent across the dataset. We are getting insight into bundles of genres, rather than the constituent elements.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
  - Impact on Insight (Top N Genre Combinations Bar Chart): Similar to the 'Type' feature, the presence of the string 'UNKNOWN' for missing genre information is treated as a valid and distinct genre combination. This causes 'UNKNOWN' to often rank highly among the "top genre combinations" in the bar chart. Its inclusion as a prominent category inflates its perceived importance, distorts the true ranking of actual genre combinations, and detracts from focusing on meaningful genre data. A correctly handled missing value (standard NaN) would typically be excluded from such frequency counts by default, providing a cleaner and more accurate insight into actual genre prevalence.
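
The difference between counting combinations and counting individual genres can be sketched on a few made-up rows. This assumes the raw column holds comma-separated strings, as described in the data frame introduction; the prepared file used later stores list literals instead and is parsed differently.

```python
import pandas as pd

# Toy stand-in for the raw 'Genres' column (illustrative values only)
raw_genres = pd.Series(
    ["Action, Comedy", "Action, Fantasy", "Comedy", "UNKNOWN", "Action, Comedy"],
    name="Genres",
)

# Counting whole strings ranks genre combinations, not genres
combo_counts = raw_genres.value_counts()

# Drop the placeholder, split on commas, and give each genre its own row
genres = (
    raw_genres[raw_genres != "UNKNOWN"]
    .str.split(",")
    .explode()
    .str.strip()
)
genre_counts = genres.value_counts()

print(combo_counts)
print(genre_counts)
```

In this toy data "Comedy" appears alone only once, yet it is tied with "Action" once combinations are split apart.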

### 3.3.2.Correct Insight


In [None]:
# Parse Genres from its string representation into a real list, then explode
df_genres_exploded = df_anime_dataset_2023_prep[['Genres']].copy()

# Parse each string into a list (anything that is not a list literal becomes an empty list)
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) and x.startswith('[') else []
)

# Explode so that each genre gets its own row
df_genres_exploded = df_genres_exploded.explode('Genres')

# Drop empty or null values
df_genres_exploded = df_genres_exploded[df_genres_exploded['Genres'].notna()]
df_genres_exploded = df_genres_exploded[df_genres_exploded['Genres'] != '']

# Count how often each genre appears
genre_counts = df_genres_exploded['Genres'].value_counts().head(15).reset_index()
genre_counts.columns = ['Genres', 'Count']

# Sort by Count for readability
genre_counts = genre_counts.sort_values('Count', ascending=True)

fig = px.bar(
    genre_counts,
    x='Count',
    y='Genres',
    orientation='h',
    title='Top 15 Most Popular Individual Genres (Preprocessed Data)',
    labels={'Genres': 'Genre', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale='RdPu'
)

fig.update_traces(
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Genre:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

fig.update_xaxes(
    showgrid=False,
    zeroline=False,
    tickformat=','
)
fig.update_yaxes(
    showgrid=False,
    zeroline=False
)

fig.update_layout(
    title={'text': 'Top 15 Most Popular Individual Genres (Preprocessed Data)', 'x': 0.5, 'xanchor': 'center'},
    bargap=0.2,
    height=600,
    font=dict(family='Inter', size=13),
    title_font=dict(size=20),
    plot_bgcolor='white',
    paper_bgcolor='white',
    margin=dict(t=80, l=250, r=40, b=80),
    hoverlabel=dict(bgcolor='white', bordercolor='#888', font_size=12),
    coloraxis_colorbar=dict(title="Count", thickness=15, len=0.7)
)

fig.show()

print(f"Shape after explode: {df_genres_exploded.shape}")

Shape after explode: (38461, 1)


**Deep Insight**:
- Display the top 10 Genres with the highest average score
- Distribution of media types (TV, Movie, OVA, etc.) in the top 5 highest-scoring Genres (recommended: pie chart)
- Average score of each media type within the top 5 Genres (recommended: stacked bar chart)
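
For the second item, the share of each media type within a genre can be computed with a row-normalised crosstab before any chart is drawn. The sketch below uses a hypothetical, already-exploded frame and a two-genre stand-in for the top-5 list; both are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical exploded frame: one row per (anime, genre) pair
exploded = pd.DataFrame({
    "Genres": ["Drama", "Drama", "Drama", "Mystery", "Mystery", "Comedy"],
    "Type":   ["TV", "TV", "Movie", "TV", "OVA", "TV"],
})
top_genres_example = ["Drama", "Mystery"]  # stand-in for the top-5 genre list

subset = exploded[exploded["Genres"].isin(top_genres_example)]

# normalize="index" makes each row sum to 1, so values are shares within a genre
type_share = pd.crosstab(subset["Genres"], subset["Type"], normalize="index") * 100

print(type_share.round(1))
```

Each row of `type_share` could then feed one pie chart per genre, since the values in a row add up to 100.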

In [77]:
# Display the top 10 Genres with the highest average score
# Extract individual genres from the 'Genres' column and explode
df_genre_score = csv_file2[['Genres', 'Score']].dropna(subset=['Genres', 'Score']).copy()
df_genre_score['Genres'] = df_genre_score['Genres'].apply(lambda x: [g.strip() for g in ast.literal_eval(x)] if isinstance(x, str) and x.startswith('[') else [])
df_genre_score = df_genre_score.explode('Genres')

# Remove 'UNKNOWN' and empty genres
df_genre_score = df_genre_score[df_genre_score['Genres'].notna() & (df_genre_score['Genres'] != 'UNKNOWN') & (df_genre_score['Genres'] != '')]

# Calculate average score and count for each genre
genre_avg_score = (
    df_genre_score.groupby('Genres')
    .agg(average_score=('Score', 'mean'), anime_count=('Score', 'count'))
    .reset_index()
)

# Get top 10 genres by average score (with at least 20 anime for reliability)
top_genres = genre_avg_score[genre_avg_score['anime_count'] >= 20].sort_values('average_score', ascending=False).head(10)

# Plot with vibrant color palette
fig = px.bar(
    top_genres,
    x='Genres',
    y='average_score',
    color='average_score',
    color_continuous_scale='Turbo',
    text='average_score',
    title='Top 10 Genres with Highest Average Score',
    labels={'Genres': 'Genre', 'average_score': 'Average Score'}
)

fig.update_traces(
    texttemplate='%{text:.2f}',
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.2
)

fig.update_layout(
    font=dict(family='Inter', size=13),
    title_font=dict(size=22, color='#1F2937'),
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.25,
    height=500,
    margin=dict(t=80, l=40, r=40, b=80),
    hoverlabel=dict(bgcolor='white', bordercolor='#FF6B6B', font_size=13)
)

fig.show()

In [None]:
# Average score of each media type within the top 5 Genres (recommended: stacked bar chart)
# Top 5 genres by average score (already calculated in top_genres)
top5_genres = top_genres.head(5)['Genres'].tolist()

# Prepare dataframe: explode genres, filter top 5, and group by Type and Genre
df = csv_file2[['Genres', 'Type', 'Score']].dropna(subset=['Genres', 'Type', 'Score']).copy()
df['Genres'] = df['Genres'].apply(lambda x: [g.strip() for g in ast.literal_eval(x)] if isinstance(x, str) and x.startswith('[') else [])
df = df.explode('Genres')
df = df[df['Genres'].isin(top5_genres)]

# Remove UNKNOWN and empty types
df = df[df['Type'].notna() & (df['Type'] != 'UNKNOWN') & (df['Type'] != '')]

# Group by Genre and Type, calculate average score and count
grouped = (
    df.groupby(['Genres', 'Type'])
    .agg(average_score=('Score', 'mean'), anime_count=('Score', 'count'))
    .reset_index()
)

# Only show media types with at least 10 anime for reliability
grouped = grouped[grouped['anime_count'] >= 10]

# Sort genres for stacked bar chart order
genre_order = top5_genres

fig = px.bar(
    grouped,
    x='Genres',
    y='average_score',
    color='Type',
    text='average_score',
    category_orders={'Genres': genre_order},
    title='Average Score of Each Media Type within Top 5 Highest-Scoring Genres',
    labels={'Genres': 'Genre', 'average_score': 'Average Score', 'Type': 'Media Type'}
)

fig.update_traces(
    texttemplate='%{text:.2f}',
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.2
)

fig.update_layout(
    font=dict(family='Inter', size=13),
    title_font=dict(size=20, color='#1F2937'),
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.18,
    height=550,
    margin=dict(t=80, l=60, r=40, b=80),
    hoverlabel=dict(bgcolor='white', bordercolor='#8B5CF6', font_size=13),
    legend_title_text='Media Type'
)

fig.show()

## 3.4.Insight Question 4
The **Duration** feature is a crucial indicator of an anime's length, directly influencing its format, target audience, and consumption patterns. Understanding typical durations is far more meaningful when considered in conjunction with the **Type** of anime (e.g., TV series, Movie, OVA). A 24-minute duration means something entirely different for a TV episode compared to a short film. Our initial exploration will therefore combine these two features, aiming to identify the various ways duration is represented within each anime type in the raw dataset. By working with this unprocessed, combined data, we intend to highlight how its current string-based format, inconsistent notations, and non-standard missing values can severely obstruct direct quantitative analysis and lead to ambiguous or uninterpretable insights, thereby demonstrating the necessity for extensive conversion and standardization.

### 3.4.1.Misleading Insight
- What are the most frequent raw **Duration** descriptions within each anime **Type** (TV, Movie, OVA, etc.)?

**Conflict**
The initial insight derived from the 'Duration' feature, even when viewed in conjunction with 'Type' (as a frequency of textual descriptions), is severely hampered and misleading due to multiple fundamental data quality issues. The raw state of this column makes it impossible to conduct accurate quantitative analysis or draw reliable conclusions about anime lengths.
- Error 1: Mismatched Data Type - 'Duration' as an Object (String) instead of Numeric.
    - Impact on Insight (Bar Chart of Duration Descriptions): Since the 'Duration' column is stored as an object (string) type, any attempt at numerical operations (like calculating averages, minimums, maximums, or plotting a continuous distribution) is impossible without prior conversion. While the bar chart for textual descriptions can be generated, this underlying data type issue prevents any deeper quantitative understanding. The lack of a numeric type means we cannot perform meaningful time-based comparisons or aggregations, immediately limiting the scope of our analysis to mere string counts.
- Error 2: Inconsistent String Formats and Units.
    - Impact on Insight (Bar Chart of Duration Descriptions): The 'Duration' column contains a highly inconsistent mix of string formats and units (e.g., "24 min per ep", "1 hr", "1h 30min", "3 days", "59 sec", etc.). This means that values representing the same actual duration (e.g., 60 minutes) might appear as several different unique strings (e.g., "1 hr", "60 min"). When analyzing the "most common textual descriptions," these inconsistencies lead to a fragmented and inaccurate view. The bar chart will show many distinct entries for what should be a unified duration, drastically underrepresenting the true frequency of certain anime lengths and making it impossible to identify real patterns in animation runtimes.
- Error 3: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
    - Impact on Insight (Bar Chart of Duration Descriptions): The presence of the string 'UNKNOWN' for missing duration information is treated by Plotly as a legitimate, distinct textual description of duration. Consequently, 'UNKNOWN' will often appear as a prominent category in the bar chart (especially when faceted by 'Type'), falsely implying it's a common duration rather than an indicator of missing data. This distorts the perceived distribution of actual durations for each anime type and overstates the prevalence of unrecorded information.
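
As a rough illustration of the conversion these three errors call for, the sketch below normalizes raw `Duration` strings into minutes. The helper `parse_duration_minutes` and its unit table are hypothetical and only cover the formats quoted above; the preprocessing that actually produced the `Duration Minutes` column may differ.

```python
import re
from typing import Optional

# Minutes per unit. The unit spellings are assumptions based on the
# examples quoted above ("hr", "h", "min", "sec", "days").
_UNIT_MINUTES = {
    "day": 24 * 60, "days": 24 * 60,
    "hr": 60, "hrs": 60, "h": 60,
    "min": 1, "mins": 1, "m": 1,
    "sec": 1 / 60, "secs": 1 / 60, "s": 1 / 60,
}
# A number followed (optionally after spaces) by a run of letters, e.g. "30min" or "1 hr".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+)")


def parse_duration_minutes(text: object) -> Optional[float]:
    """Return the duration in minutes, or None when nothing can be parsed."""
    if not isinstance(text, str) or text.strip().upper() == "UNKNOWN":
        return None  # placeholder strings become real missing values
    total, matched = 0.0, False
    for value, unit in _TOKEN.findall(text.lower()):
        if unit in _UNIT_MINUTES:  # ignore words such as "per" or "ep"
            total += float(value) * _UNIT_MINUTES[unit]
            matched = True
    return total if matched else None


for raw in ["24 min per ep", "1 hr", "60 min", "1h 30min", "3 days", "UNKNOWN"]:
    print(f"{raw!r:>18} -> {parse_duration_minutes(raw)}")
```

Mapped over the raw column (for example with `df_anime_dataset_2023['Duration'].map(parse_duration_minutes)`), "1 hr" and "60 min" would collapse into the same numeric value, and 'UNKNOWN' would no longer appear as a category of its own.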

In [None]:
# No outlier filtering yet: use the prepared data as-is so extreme durations remain visible
df_cleaned = df_anime_dataset_2023_prep

# Keep only rows where Type is TV, Movie, or OVA
filtered_df = df_cleaned[
    df_cleaned['Type'].isin(['TV', 'Movie', 'OVA'])
].copy()

# Round Duration to the nearest minute
filtered_df['Duration_rounded'] = filtered_df['Duration Minutes'].round(0)

# Create the violin chart
fig = px.violin(
    filtered_df,
    x='Type',
    y='Duration_rounded',
    color='Type',
    title='Distribution of Duration by Type (TV, Movie, OVA) - Cleaned Data',
    labels={'Duration_rounded': 'Duration (minutes)', 'Type': 'Anime Type'},
    box=True,  # Show a box plot inside each violin
    points='outliers',  # Show only the outlier points
    color_discrete_sequence=['#8B5CF6', '#EC4899', '#10B981']
)

# Customize the layout
fig.update_layout(
    title={
        'text': 'Distribution of Duration by Type (TV, Movie, OVA) - Cleaned Data',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=20, color='#1F2937', family='Inter', weight='bold')
    },
    xaxis=dict(
        title='Anime Type',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB'
    ),
    yaxis=dict(
        title='Duration (minutes)',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB'
    ),
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    height=600,
    margin=dict(t=80, l=60, r=40, b=80),
    showlegend=False,
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=12, family='Inter', color='#1F2937')
    )
)

fig.show()

# # Report how many rows were removed
# removed_count = len(df_anime_dataset_2023_prep) - len(df_cleaned)
# print(f"\nRemoved {removed_count} anime with Type=TV and Duration > 1000 minutes")
# print(f"Number of anime remaining: {len(df_cleaned)}")

**Conflict**: 
- Error 4: Undetectable Numerical Outliers due to String Format.
    - Impact on Insight (Bar Chart of Duration Descriptions): The existence of a numerical outlier, such as a 'Movie' with a 'Duration' of '1500' (presumably minutes, implying 25 hours), remains completely hidden in the raw string format. As a string, '1500' is just another unique duration description, indistinguishable from '24 min' or '1 hr'. Without converting the 'Duration' column to a consistent numerical format, it's impossible to identify, flag, or analyze such extreme values. This inability to detect outliers before preprocessing means we miss crucial data quality anomalies that would otherwise heavily skew any statistical summary (like average duration) or distort the visual scale of a proper numerical distribution plot.
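
Once durations are numeric, extreme values can be flagged mechanically. The notebook itself uses a fixed cut-off (TV titles longer than 1000 minutes); the sketch below shows a generic alternative, the 1.5 × IQR rule applied per `Type`. The function name `flag_iqr_outliers` and the toy values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy data: the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Type": ["TV", "TV", "TV", "TV", "TV", "Movie", "Movie", "Movie"],
    "Duration Minutes": [23.0, 24.0, 24.0, 25.0, 1500.0, 80.0, 90.0, 100.0],
})


def flag_iqr_outliers(df, group_col="Type", value_col="Duration Minutes", k=1.5):
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR] per group."""
    grouped = df.groupby(group_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)


toy["is_outlier"] = flag_iqr_outliers(toy)
print(toy)
```

Flagging rather than dropping keeps the decision reviewable: a 1500-minute TV entry can be inspected before it is removed, which is harder to do with a silent filter.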

In [38]:
# Drop rows where Type is TV and Duration Minutes exceeds 1000
df_cleaned = df_anime_dataset_2023_prep[
    ~((df_anime_dataset_2023_prep['Type'] == 'TV') &
      (df_anime_dataset_2023_prep['Duration Minutes'] > 1000))
].copy()

# Keep only rows where Type is TV, Movie, or OVA
filtered_df = df_cleaned[
    df_cleaned['Type'].isin(['TV', 'Movie', 'OVA'])
].copy()

# Round Duration to the nearest minute
filtered_df['Duration_rounded'] = filtered_df['Duration Minutes'].round(0)

# Create the violin chart
fig = px.violin(
    filtered_df,
    x='Type',
    y='Duration_rounded',
    color='Type',
    title='Distribution of Duration by Type (TV, Movie, OVA) - Cleaned Data',
    labels={'Duration_rounded': 'Duration (minutes)', 'Type': 'Anime Type'},
    box=True,  # Show a box plot inside each violin
    points='outliers',  # Show only the outlier points
    color_discrete_sequence=['#8B5CF6', '#EC4899', '#10B981']
)

# Customize the layout
fig.update_layout(
    title={
        'text': 'Distribution of Duration by Type (TV, Movie, OVA) - Cleaned Data',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=20, color='#1F2937', family='Inter', weight='bold')
    },
    xaxis=dict(
        title='Anime Type',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB'
    ),
    yaxis=dict(
        title='Duration (minutes)',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB'
    ),
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    height=600,
    margin=dict(t=80, l=60, r=40, b=80),
    showlegend=False,
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=12, family='Inter', color='#1F2937')
    )
)

fig.show()

# Report how many rows were removed
removed_count = len(df_anime_dataset_2023_prep) - len(df_cleaned)
print(f"\nRemoved {removed_count} TV anime with an invalid Duration")
print(f"Number of anime remaining: {len(df_cleaned)}")


Removed 1 TV anime with an invalid Duration
Number of anime remaining: 24904


In [39]:
# Drop rows where Type is TV and Duration Minutes exceeds 1000
df_cleaned = df_anime_dataset_2023_prep[
    ~((df_anime_dataset_2023_prep['Type'] == 'TV') &
      (df_anime_dataset_2023_prep['Duration Minutes'] > 1000))
].copy()

# Keep only rows where Type is TV, Movie, or OVA
filtered_df = df_cleaned[
    df_cleaned['Type'].isin(['TV', 'Movie', 'OVA'])
].copy()

# Group Duration into 2-minute bins (round down to a multiple of 2)
filtered_df['Duration_grouped'] = (filtered_df['Duration Minutes'] // 2) * 2

# Create the histogram of the 2-minute bins
fig = px.histogram(
    filtered_df,
    x='Duration_grouped',
    color='Type',
    title='Distribution of Duration by Type (Grouped by 2 minutes)',
    labels={'Duration_grouped': 'Duration (minutes)', 'count': 'Number of Anime'},
    barmode='overlay',
    nbins=100,
    color_discrete_sequence=['#8B5CF6', '#EC4899', '#10B981'],
    opacity=0.7
)

# Customize the layout
fig.update_layout(
    title={
        'text': 'Distribution of Duration by Type (Grouped by 2 minutes)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=20, color='#1F2937', family='Inter', weight='bold')
    },
    xaxis=dict(
        title='Duration (minutes)',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB'
    ),
    yaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB'
    ),
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    height=600,
    margin=dict(t=80, l=60, r=40, b=80),
    legend=dict(
        title=dict(text='Type', font=dict(size=14, color='#4B5563')),
        font=dict(size=12, color='#6B7280'),
        bgcolor='rgba(255, 255, 255, 0.8)',
        bordercolor='#E5E7EB',
        borderwidth=1
    ),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=12, family='Inter', color='#1F2937')
    )
)

fig.show()

# Print summary statistics for each Type
print("\n=== Duration Statistics by Type (grouped by 2 minutes) ===")
for type_name in ['TV', 'Movie', 'OVA']:
    type_data = filtered_df[filtered_df['Type'] == type_name]['Duration_grouped']
    print(f"\n{type_name}:")
    print(f"  - Number of anime: {len(type_data)}")
    print(f"  - Average duration: {type_data.mean():.2f} minutes")
    print(f"  - Most common duration: {type_data.mode().values[0] if len(type_data.mode()) > 0 else 'N/A'} minutes")
    print(f"  - Min duration: {type_data.min():.2f} minutes")
    print(f"  - Max duration: {type_data.max():.2f} minutes")


=== Duration Statistics by Type (grouped by 2 minutes) ===

TV:
  - Number of anime: 6215
  - Average duration: 20.20 minutes
  - Most common duration: 24.0 minutes
  - Min duration: 10.00 minutes
  - Max duration: 54.00 minutes

Movie:
  - Number of anime: 2311
  - Average duration: 83.47 minutes
  - Most common duration: 90.0 minutes
  - Min duration: 40.00 minutes
  - Max duration: 168.00 minutes

OVA:
  - Number of anime: 4076
  - Average duration: 27.32 minutes
  - Most common duration: 28.0 minutes
  - Min duration: 0.00 minutes
  - Max duration: 126.00 minutes


### 3.4.2.Correct Insight

**Deep Insight**:
- For each Type (TV, Movie, OVA, etc.), what is the average Duration of the Top 3 highest-scoring Genres?
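
A possible starting point for this question is sketched below on toy data. It assumes that `Genres` is stored as a comma-separated string and that `Score` and `Duration Minutes` are numeric; the toy values are invented for illustration, and the real prepared data may already store one genre per row.

```python
import pandas as pd

# Toy rows: column names mirror the notebook, values are illustrative only.
toy = pd.DataFrame({
    "Type": ["TV", "TV", "TV", "Movie", "Movie"],
    "Genres": ["Action, Drama", "Comedy", "Drama", "Drama", "Action, Comedy"],
    "Score": [8.0, 6.0, 7.0, 7.5, 6.5],
    "Duration Minutes": [24.0, 12.0, 23.0, 100.0, 90.0],
})

# One row per (anime, genre) pair.
exploded = toy.assign(Genres=toy["Genres"].str.split(", ")).explode("Genres")

# Average score and average duration for every Type/Genre combination.
per_genre = (
    exploded.groupby(["Type", "Genres"])
    .agg(average_score=("Score", "mean"),
         average_duration=("Duration Minutes", "mean"))
    .reset_index()
)

# Keep the three highest-scoring genres within each Type.
top3 = (
    per_genre.sort_values(["Type", "average_score"], ascending=[True, False])
    .groupby("Type")
    .head(3)
)
print(top3)
```

On real data it would be prudent to require a minimum number of titles per Type/Genre pair before ranking, in the same spirit as the "at least 10 anime" filter used earlier in the notebook.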

## 3.5.Insight Question 5
The **Aired** feature provides critical temporal context, allowing us to track anime release patterns, seasonal trends, and historical production volumes. Understanding how many anime titles are released each month is fundamental for gauging industry activity and identifying peak periods. Our initial examination will focus on extracting this monthly release count directly from the raw **Aired** column. By intentionally avoiding any prior data cleaning or type conversion, we aim to immediately encounter the challenges posed by its current string format. This approach will demonstrate how unstructured date information prevents straightforward temporal aggregation, thereby highlighting the absolute necessity of robust date parsing and standardization for meaningful time-series analysis.

### 3.5.1.Misleading Insight
- Based on the raw 'Aired' column, what is the count of anime released in each distinct month (as represented in the original string format) across the entire dataset?

**Conflict**
The initial attempt to visualize the number of anime released per month from the **Aired** feature yields highly inaccurate and potentially uninterpretable results due to fundamental issues with its raw data format. The current state of the **Aired** column completely undermines any meaningful temporal analysis.
- Error 1: Mismatched Data Type - **Aired** as an Object (String) instead of Datetime.
    - Impact on Insight (Monthly Release Line Chart): Since the 'Aired' column is an object (string) type, Plotly cannot intrinsically understand these entries as dates or extract chronological months. When attempting to plot months on the x-axis, the chart will treat each unique month string (e.g., "Jan", "Feb", "Mar", but also potentially "Apr 2005", "May 1999 to Jun 1999") as a discrete categorical label. This results in an x-axis that is not chronologically ordered (it will likely be alphabetical or based on string appearance) and fails to represent a continuous timeline. Consequently, the line chart cannot accurately show trends or seasonality in anime releases, as the data points are disjointed and incorrectly positioned.
- Error 2: Inconsistent and Unstructured String Formats.
    - Impact on Insight (Monthly Release Line Chart): The **Aired** column contains a wide variety of string formats (e.g., "Jan 10, 2000", "Winter 2005", "Apr 2005 to Jun 2005", "Not available"). Extracting a consistent 'month' from these disparate strings for aggregation is impossible without prior parsing. For instance, "Jan 10, 2000" and "Jan 2005" both represent January releases but are unique strings. Furthermore, date ranges ("Apr 2005 to Jun 2005") present an ambiguity: should this count towards April, June, or all months in between? Without a standardized format, any attempt to count "anime per month" will be based on arbitrary string matching, leading to an incomplete, fragmented, and inaccurate count, drastically undercounting releases for actual months while showing counts for unparsed, full date strings.
- Error 3: Non-standard Missing Value Representation and Non-Date Entries.
    - Impact on Insight (Monthly Release Line Chart): Values like "Not available" or "Unknown" within the **Aired** column further corrupt the data. If they are not explicitly filtered out during the crude string-based month extraction, they might inadvertently be processed or cause errors. More critically, if a string-parsing method were to mistakenly extract a "month" from such entries or if they are simply ignored, the total count of anime included in the chart becomes less reliable, and the underlying issue of unaddressable missing temporal information persists, leading to an incomplete and potentially biased view of release trends.

&rArr; A meaningful line chart cannot be drawn from the raw **Aired** feature.
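
To make the parsing problem concrete, the sketch below extracts a start month from raw `Aired` strings. It is a minimal sketch under stated assumptions: the helper `parse_start_month` is hypothetical, only the two layouts "Jan 10, 2000" and "Jan 2005" are handled, and a date range is attributed to its starting month, which is just one possible answer to the ambiguity raised in Error 2. How the notebook's `Aired Month` column was actually derived is not shown here.

```python
from datetime import datetime
from typing import Optional

# Assumed layouts: "Jan 10, 2000" and "Jan 2005".
# Note: %b relies on English month abbreviations (the default C locale).
_FORMATS = ("%b %d, %Y", "%b %Y")


def parse_start_month(aired: object) -> Optional[int]:
    """Return the month (1-12) in which airing started, or None if unknown."""
    if not isinstance(aired, str):
        return None
    # Assumption: a range such as "Apr 2005 to Jun 2005" counts toward its first month.
    start = aired.split(" to ")[0].strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(start, fmt).month
        except ValueError:
            continue  # try the next layout
    return None  # seasons, bare years, and placeholders carry no month


for raw in ["Jan 10, 2000", "Apr 2005 to Jun 2005", "Winter 2005", "Not available"]:
    print(f"{raw!r:>24} -> {parse_start_month(raw)}")
```

Entries that return `None` stay explicitly missing instead of being counted under an arbitrary string label, so the monthly totals only reflect titles with a known start month.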

### 3.5.2.Correct Insight

In [19]:
# Check that the 'Aired Month' column exists in the DataFrame
if 'Aired Month' in df_anime_dataset_2023_prep.columns:
    # Group by 'Aired Month' and count the anime in each month
    anime_per_month = df_anime_dataset_2023_prep.groupby('Aired Month').size().reset_index(name='Anime Count')

    # Sort in calendar-month order
    anime_per_month = anime_per_month.sort_values('Aired Month')
else:
    # Stop here: the code below needs anime_per_month, which would otherwise be undefined
    raise KeyError("Column 'Aired Month' does not exist in the DataFrame.")

# Map numeric months to month names
anime_per_month['Month'] = anime_per_month['Aired Month'].map({
    1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'Apr', 5.0: 'May', 6.0: 'Jun',
    7.0: 'Jul', 8.0: 'Aug', 9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'
})
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Sort by month order
anime_per_month['Month'] = pd.Categorical(anime_per_month['Month'], categories=month_order, ordered=True)
anime_per_month = anime_per_month.sort_values('Month')

# Create line chart with gradient color scheme
fig = px.line(
    anime_per_month,
    x='Month',
    y='Anime Count',
    title='Number of Anime Released Per Month',
    labels={'Month': 'Month', 'Anime Count': 'Number of Anime'},
    markers=True
)

# Customize traces with modern gradient colors
fig.update_traces(
    line=dict(color='#8B5CF6', width=3),  # Vibrant purple
    marker=dict(
        size=10,
        color="#FFA3AE",  # Light pink for markers
        line=dict(color='#7C3AED', width=2)  # Darker purple border
    ),
    hovertemplate='<b>%{x}</b><br>Anime Count: %{y:,}<extra></extra>'
)

# Enhanced layout with modern styling
fig.update_layout(
    title={
        'text': 'Number of Anime Released Per Month',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, color='#1F2937', family='Inter', weight='bold')
    },
    xaxis=dict(
        title='Month',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB',
        linewidth=1
    ),
    yaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        gridwidth=1,
        showline=True,
        linecolor='#E5E7EB',
        linewidth=1,
        tickformat=','
    ),
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    height=550,
    margin=dict(t=80, l=60, r=40, b=80),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter', color='#1F2937')
    )
)

fig.show()

**Deep Insight**:
- During major occasions such as Halloween, Christmas, etc., is there a sudden spike in the number of anime releases? (You can consider anime released within 15 days before and after the official date of the event as belonging to that holiday release window.)
- Which Genres tend to be released during holidays like Halloween, Christmas, etc.?
- What is the average score of the top Genres when released during these holiday periods?
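
The release-window rule in the first question can be sketched as follows. The sketch assumes that a full start date is available for each title (the prepared data shown here only confirms an `Aired Month` column), and the `HOLIDAYS` table and `holiday_window` helper are hypothetical names introduced for illustration.

```python
from datetime import date
from typing import Optional

# Fixed-date holidays as (month, day). Extend as needed.
HOLIDAYS = {"Halloween": (10, 31), "Christmas": (12, 25)}


def holiday_window(aired: date, window_days: int = 15) -> Optional[str]:
    """Return the holiday whose +/- window_days window contains `aired`, else None."""
    for name, (month, day) in HOLIDAYS.items():
        # Check the neighbouring years too, so that a window crossing New Year
        # (e.g. early January after Christmas) is still matched.
        for year in (aired.year - 1, aired.year, aired.year + 1):
            if abs((aired - date(year, month, day)).days) <= window_days:
                return name
    return None


print(holiday_window(date(2020, 10, 20)))  # inside the Halloween window
print(holiday_window(date(2021, 1, 5)))    # inside the Christmas window of 2020
print(holiday_window(date(2020, 7, 1)))    # outside every window
```

With such a label in place, the remaining two questions reduce to grouping by holiday and genre and then comparing counts and average scores against titles released outside any window.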

## 3.6.Insight Question 6
The **Source** feature is invaluable for understanding the origins of anime, revealing whether a title is adapted from manga, light novels, games, or is an entirely original work. This information can shed light on industry trends, adaptation strategies, and even predict potential fan bases or production approaches. Our initial analysis will examine the raw 'Source' column to identify the most prevalent origins of anime within our dataset. By intentionally bypassing any preprocessing, we aim to uncover the immediate distribution of these source categories and, crucially, to identify if any non-standard entries or placeholder values are present, which could distort our understanding of the true source landscape and necessitate data cleaning.

### 3.6.1.Misleading Insight
- What is the frequency of each unique source category present in the raw **Source** column of the dataset?

In [20]:
# 1. Set the default theme to a white background
pio.templates.default = "plotly_white"

# 2. Prepare the data
source_counts = df_anime_dataset_2023['Source'].value_counts().reset_index()
source_counts.columns = ['Source', 'Count']

# Sort ascending so that the largest bar ends up on top of the horizontal chart
source_counts = source_counts.sort_values('Count', ascending=True)

# 3. Build the color logic
# 'Unknown' is drawn in gray; every other source in purple (#8B5CF6)
base_color = "#8B5CF6"
unknown_color = "#B0B0B0"

# Build a color list that follows the sorted row order
colors = [
    unknown_color if str(s).strip().upper() == "UNKNOWN" else base_color
    for s in source_counts["Source"]
]

# 4. Create the horizontal bar chart
fig = px.bar(
    source_counts,
    x='Count',      # Counts on the horizontal axis
    y='Source',     # Source names on the vertical axis
    orientation='h', # Horizontal bars
    title='Frequency of Anime Sources (Raw Data)',
    labels={'Source': 'Source Type', 'Count': 'Number of Anime'},
    text='Count'
)

# 5. Style the traces
fig.update_traces(
    marker_color=colors,           # Apply the color list built above
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 6. Style the layout
fig.update_layout(
    title={
        'text': 'Frequency of Anime Sources (Raw Data)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', weight='bold', color='#1F2937')
    },
    # X axis: counts
    xaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        showgrid=False,       # Vertical grid lines are turned off
        gridcolor='#F3F4F6',  # Very light grid color (only used if the grid is enabled)
        showline=False,       # No heavy axis line
        tickformat=',',
        zeroline=False
    ),
    # Y axis: source names (this is where ticksuffix is needed)
    yaxis=dict(
        title='',             # No Y-axis title; the source names are self-explanatory
        tickfont=dict(size=13, color='#4B5563', family='Inter'),
        ticksuffix="  ",      # Add spacing between the labels and the bars
        showgrid=False,
        showline=False,       # No axis line, for a cleaner look
        zeroline=False
    ),
    bargap=0.25,
    height=600,
    margin=dict(t=80, l=120, r=40, b=80),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter')
    )
)

fig.show()

**Conflict**
The initial insight derived from the **Source** feature (showing the frequency of anime origins) is skewed and misleading due to issues related to its data representation. The raw state of this column prevents an accurate understanding of the true prevalence of each source.
- Error 1: Mismatched Data Type - **Source** as a Generic Object (String) Type.
  - Impact on Insight (Bar/Pie Chart): While Plotly can render categories from an object (string) column, the generic nature of this data type means there's no inherent enforcement of valid source categories. This allows for the inclusion of arbitrary string values, which are then treated as legitimate categories by the plotting library. This lack of type specificity contributes to the problem of non-standard entries being interpreted as meaningful data, rather than flagging them as potential errors or missing information.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
  - Impact on Insight (Bar/Pie Chart): The presence of the string 'UNKNOWN' within the **Source** column is a significant distortion. Instead of being recognized as a missing value (which Plotly would typically ignore or handle gracefully), 'UNKNOWN' is treated as a distinct and valid source category. Consequently, in both a bar chart and a pie chart, 'UNKNOWN' will appear as a prominent category, often ranking highly in frequency. This falsely inflates its importance and misrepresents the true distribution of actual anime sources, making it difficult to discern the genuine origins of anime in the dataset.
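
The effect of Error 2 can be reproduced on a few toy values. The sketch below converts placeholder strings to real missing values before counting; the placeholder set and the sample values are assumptions chosen for illustration, not an inventory of what the raw column contains.

```python
import numpy as np
import pandas as pd

# Toy column: the values are illustrative only.
raw = pd.Series(
    ["Manga", "UNKNOWN", "Original", "Unknown", "Manga", "Game"], name="Source"
)

# Assumed placeholder spellings, compared case-insensitively.
placeholders = {"UNKNOWN", "NOT AVAILABLE"}
is_placeholder = raw.str.strip().str.upper().isin(placeholders)

# Keep genuine sources and turn placeholders into NaN.
cleaned = raw.where(~is_placeholder, np.nan)

raw_counts = raw.value_counts()        # placeholders appear as categories
clean_counts = cleaned.value_counts()  # NaN is excluded by default

print(raw_counts, clean_counts, sep="\n\n")
print(f"\nMissing sources: {cleaned.isna().sum()}")
```

Reporting the number of missing entries next to the cleaned counts keeps the information that some sources are unrecorded without letting it pose as a source category.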


### 3.6.2.Correct Insight

In [23]:
# 1. Keep the default theme
pio.templates.default = "plotly_white"

# 2. Prepare the data (cleaned data)
# Count the frequency of each source
source_counts = df_anime_dataset_2023_prep['Source'].value_counts().reset_index()
source_counts.columns = ['Source', 'Count']

# Sort so that the longest bar ends up on top
source_counts = source_counts.sort_values('Count', ascending=True)

# 3. Create the horizontal bar chart
fig = px.bar(
    source_counts,
    x='Count',
    y='Source',
    orientation='h',
    title='Frequency of Anime Sources (Cleaned Data)', 
    labels={'Source': 'Source Type', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale=['#DDD6FE', '#8B5CF6', '#4C1D95']
)

# 4. Style the traces (consistent with the raw-data chart)
fig.update_traces(
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 5. Style the layout (same styling as the raw-data chart)
fig.update_layout(
    title={
        'text': 'Frequency of Anime Sources (Cleaned Data)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', weight='bold', color='#1F2937')
    },
    # X axis: number formatting, grid disabled
    xaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=11, color='#6B7280'),
        showgrid=False,
        gridcolor='#F3F4F6',  # Light grid color (only used if the grid is enabled)
        showline=False,       # No heavy axis line
        zeroline=False,
        tickformat=','        # Comma as thousands separator
    ),
    # Y axis: source labels with ticksuffix
    yaxis=dict(
        title='',             # No axis title, for a cleaner look
        tickfont=dict(size=13, color='#4B5563', family='Inter'),
        ticksuffix="  ",      # Important: adds spacing between the labels and the bars
        showgrid=False,       # No horizontal grid lines
        showline=False,
        zeroline=False
    ),
    coloraxis_showscale=False,
    plot_bgcolor='white',
    paper_bgcolor='white',    # Match the plot background
    bargap=0.25,              # Gap between bars
    height=600,
    margin=dict(t=80, l=120, r=40, b=80), # Wide left margin (120) for long source names
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter', color='#1F2937')
    )
)

fig.show()

##### Initial Analysis
After removing the **Unknown** category, the distribution of anime by Source is more reliable and gives a clearer picture of the market. The anime industry shows a strong reliance on a few traditional content sources: **Original** creations hold the largest share, followed by **Manga** at a significant distance.

Game-based and Visual Novel adaptations occupy the next major tiers, while sources such as Light Novels, Web Manga, Music, and niche formats like 4-koma Manga or Picture Books contribute much smaller proportions.

$\implies$ The Anime industry tends to produce a high number of Original works to avoid dependence on licensing, or relies on the safety of existing popular Manga. ***However, does “a large quantity” necessarily mean “good quality”?***


**Deep Insight**:
- What is the average score of each Source type?
- Which Sources are anime types such as Movie and TV series most often adapted from?

In [22]:

# --- 1. DATA PROCESSING ---
avg_score_by_source = df_anime_dataset_2023_prep.groupby('Source', dropna=True)['Score'].mean().reset_index()
avg_score_by_source = avg_score_by_source.dropna(subset=['Score']).sort_values('Score', ascending=False)

# --- 2. PLOTTING ---
fig_avg = px.bar(
    avg_score_by_source,
    x='Score',
    y='Source',
    orientation='h',
    text='Score',
    color='Score', 
    color_continuous_scale=[ '#DDD6FE', '#4C1D95' ],
)

# --- 3. STYLING TRACES ---
fig_avg.update_traces(
    marker_line_width=0, 
    textposition='outside',
    texttemplate='%{text:.2f}',
    hovertemplate='<b>Source:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<extra></extra>'
)

# --- 4. LAYOUT & AXES ---
fig_avg.update_layout(
    title={
        'text': '<b>Average Score by Source</b>', 
        'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'
    },
    font=dict(family='inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=600,
    margin=dict(t=80, l=150, r=50, b=50), 
    showlegend=False,
    coloraxis_showscale=False,

    # X axis: hide it completely
    xaxis=dict(visible=False),

    # Y axis: set the tick font size to 14
    yaxis=dict(
        title=None,
        showgrid=False,
        showline=False,
        ticklen=0,
        autorange="reversed",
        ticksuffix="  ",
        tickfont=dict(size=14) 
    )
)

fig_avg.show()

#### Deep Analysis
##### **A. The Paradox of "Quantity" vs. "Quality"** 
*   **The Fall of "Original":** Despite ranking **#1** in release volume, *Original* works have an average score in the low tier (**6.07**).
    *   **Reason:** Producing an original script is inherently risky: without an existing fanbase, the storyline has not been tested in the market. The data suggests that a **large number of low-quality or short-lived original anime are being mass-produced**, which drags down the overall average score.
*   **The Rise of the "Novel" Category:**
    *   **Web Novels (7.00)** and **Light Novels (6.96)** lead the rankings in score, despite their modest production numbers.
    *   **Manga (6.83)** maintains stable performance: High volume combined with high scores.
    *   **Reason - The "Selection Bias" Effect:** A Web Novel or Light Novel typically needs to reach high popularity and demonstrate strong narrative quality before it is selected for anime adaptation. As a result, works chosen from this source category tend to have a higher probability of critical success (e.g., *Re:Zero, Sword Art Online*).

$\implies$ **Strategy for the Producer:** For safety and guaranteed high ratings, prioritize adaptations from famous **Web Novels, Light Novels, or Manga**. If choosing to produce an **Original**, accept high risks and invest heavily in the scriptwriting team to avoid the "low-score trap."
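
Because the argument above compares volume with average score, it helps to read both from a single table. The sketch below does this on invented toy values; the column names mirror the notebook, while the numbers are not taken from the dataset.

```python
import pandas as pd

# Toy rows: values are illustrative and do not come from the dataset.
toy = pd.DataFrame({
    "Source": ["Original", "Original", "Original", "Original",
               "Web novel", "Manga", "Manga"],
    "Score": [5.5, 6.0, 6.5, None, 7.0, 6.5, 7.1],
})

summary = (
    toy.groupby("Source")
    .agg(
        anime_count=("Score", "size"),    # all titles, scored or not
        scored_count=("Score", "count"),  # titles that actually have a score
        average_score=("Score", "mean"),  # NaN scores are skipped
    )
    .sort_values("average_score", ascending=False)
)
print(summary)
```

In this toy table the top-ranked source rests on a single scored title, which illustrates why a minimum `scored_count` threshold (similar to the "at least 10 anime" filter used earlier in the notebook) is worth applying before a source ranking is treated as a recommendation.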

In [None]:
# --- 1. SETUP SUBPLOTS ---
# Create a figure with 1 row and 2 columns
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=["<b>Top Sources for TV</b>", "<b>Top Sources for Movie</b>"],
    horizontal_spacing=0.12  # Spacing between the two charts
)

types_of_interest = ['TV', 'Movie'] # Display order: left to right

# Color scale (light purple to deep purple)
custom_colorscale = [ '#DDD6FE', '#4C1D95' ]

# --- 2. DATA PROCESSING & PLOTTING ---
for i, t in enumerate(types_of_interest):
    # Data Processing
    df_t = df_anime_dataset_2023_prep[df_anime_dataset_2023_prep['Type'] == t]
    counts = df_t['Source'].value_counts().reset_index()
    counts.columns = ['Source', 'Count']
    
    # Take the top 10, sorted in descending order (the reversed Y axis then lists them top-down)
    counts = counts.head(10).sort_values('Count', ascending=False)

    # Create the trace (bar chart)
    fig.add_trace(
        go.Bar(
            x=counts['Count'],
            y=counts['Source'],
            orientation='h',
            text=counts['Count'],
            texttemplate='%{text:,}', # Format numbers with thousands separators
            textposition='outside',
            hovertemplate='<b>Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',

            # Color logic: map the bar color to the count
            marker=dict(
                color=counts['Count'],
                colorscale=custom_colorscale,
                line_width=0
            ),
            name=t # Legend name (the legend itself is hidden)
        ),
        row=1, col=i+1 # Place the trace in the matching column (1 or 2)
    )

# --- 3. GLOBAL STYLING (DECLUTTER) ---

# Update the overall layout
fig.update_layout(
    title={
        'text': '<b>Source Volume Distribution According to Type: TV vs Movie</b>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=600,
    width=1400, # Wide enough for two charts
    margin=dict(t=80, l=50, r=50, b=50),
    showlegend=False
)

# X axes: hide tick labels and grid for both charts
fig.update_xaxes(showticklabels=False, showgrid=False, showline=False, visible=False)

# Y axes: keep the Source labels, hide the grid, and reverse the direction so the top entry comes first
fig.update_yaxes(
    showgrid=False, 
    showline=False, 
    ticklen=0, 
    autorange="reversed", # Reverse the Y axis so the largest value is at the top
    ticksuffix="  "      # Small gap between the label and the bar
)

# Extend each X axis slightly so the outside text labels are not clipped
# Loop over the two types to get each subplot's max value and set its range
for i, t in enumerate(types_of_interest):
    df_t = df_anime_dataset_2023_prep[df_anime_dataset_2023_prep['Type'] == t]
    max_val = df_t['Source'].value_counts().max()
    # Subplot columns are 1-based, so column i + 1 matches the trace added above
    fig.update_xaxes(range=[0, max_val * 1.25], row=1, col=i + 1)

fig.show()

##### **B.Format and Source Compatibility (Movie vs. TV)**

According to the distribution of release types, TV and Movie lead the market. The choice of source material changes markedly with the release format:

*   **For TV Series:**
    *   **Top 3 Structure:** **Original > Manga > Light Novel**.
    *   *Analysis:* TV Series require long-form content to sustain broadcasting for 3-6 months (12-24 episodes). **Manga** and **Light Novels**, with their episodic structures and expansive world-building, are highly suitable for this format. Light Novels reaching the Top 3 for TV series (488 titles) suggests that TV is "fertile ground" for serialized content.
*   **For Movies:**
    *   **Top 3 Structure:** **Original > Manga > Other**.
    *   *The Disappearance of Light Novels:* Light Novels drop significantly in the Movie category.
    *   *Analysis:* Movies have limited runtime (90-120 minutes), demanding concise, dramatic plots with clear conclusions.
        *   **Originals** dominate Movies (1,908 titles) as screenwriters can easily tailor a complete story specifically for the 2-hour format.
        *   **Manga** movies are often spin-offs of major franchises (e.g., *Conan, Doraemon*) or short one-shots.
        *   **Light/Web Novels** are often too complex and lengthy to compress into a single film, making them less prioritized for the Movie format.
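
Raw counts are hard to compare across formats because TV and Movie have different totals. The following is a minimal sketch, not the notebook's own method: it uses a small made-up frame (the rows are illustrative, not taken from the dataset) and assumes the same `Type` and `Source` column names, then normalizes the counts within each format so that source shares can be compared directly.

```python
import pandas as pd

# Toy data for illustration only; the notebook itself would use df_anime_dataset_2023_prep.
toy = pd.DataFrame({
    "Type":   ["TV", "TV", "TV", "TV", "Movie", "Movie", "Movie", "Movie"],
    "Source": ["Original", "Manga", "Manga", "Light novel",
               "Original", "Original", "Original", "Manga"],
})

# Share of each source within its own format (each row sums to 1).
source_share = pd.crosstab(toy["Type"], toy["Source"], normalize="index")
print(source_share.round(2))
```

With `normalize="index"`, a source that is absent from one format shows up as a share of 0 instead of being dropped, which makes the kind of Light Novel gap described above easy to read off.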

## 3.7.Insight Question 7
Producers can be understood as "investors," and an anime as an "investment project." Producers are the entities that provide funding, plan, and coordinate the anime project. Producing an anime normally requires a large budget, so multiple producers often collaborate on a single title to limit the financial risk if it fails. Consequently, the 'Producers' feature in our dataset often lists several production companies or individuals within a single entry, separated by commas. Our initial exploration of this raw column aims to identify the most frequently appearing combinations of producers, which can hint at common collaborations or dominant production groups. Analyzing this unprocessed data also shows how its string format impedes the identification of individual producers' prevalence and why significant parsing and cleaning are needed.

### 3.7.1.Misleading Insight
- What are the most frequently appearing producer combinations or producer strings as they are originally listed in the dataset?

In [25]:
# 2. Data processing (top 15 raw producer strings)
producer_counts = df_anime_dataset_2023['Producers'].value_counts().head(15).reset_index()
producer_counts.columns = ['Producers', 'Count']

# Sort ascending so the largest bar ends up at the top of the horizontal chart
producer_counts = producer_counts.sort_values('Count', ascending=True)

# 3. Color logic (gray for UNKNOWN, purple for everything else)
base_color = "#8B5CF6"
unknown_color = "#B0B0B0"

# Build the color list from the producer strings
colors = [
    unknown_color if str(p).strip().upper() == "UNKNOWN" else base_color
    for p in producer_counts['Producers']
]

# 4. Create the horizontal bar chart
fig = px.bar(
    producer_counts,
    x='Count',       # Counts on the horizontal axis
    y='Producers',   # Names on the vertical axis
    orientation='h', # Horizontal bars
    title='Top 15 Producer Combinations (Raw Data)',
    labels={'Producers': 'Producer Combination', 'Count': 'Number of Anime'},
    text='Count'
)

# 5. Adjust traces (apply colors and formatting)
fig.update_traces(
    marker_color=colors,  # Apply the color list built in step 3
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Producers:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 6. Adjust layout (consistent styling)
fig.update_layout(
    title={
        'text': 'Top 15 Producer Combinations (Raw Data)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', weight='bold', color='#1F2937')
    },
    # X axis: counts (grid hidden; the bar labels carry the values)
    xaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=11, color='#6B7280'),
        showgrid=False,
        gridcolor='#F3F4F6',
        showline=False,
        zeroline=False,
        tickformat=','
    ),
    # Y axis: producer names (needs ticksuffix)
    yaxis=dict(
        title='',  # No Y-axis title; the names are self-explanatory
        tickfont=dict(size=14, color='#4B5563', family='Inter'),
        ticksuffix="  ",  # Add spacing between label and bar
        showgrid=False,
        showline=False,
        zeroline=False
    ),
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.25,
    height=700,  # Taller figure to fit 15 rows
    # Large left margin (250) because producer combination strings are often long
    margin=dict(t=80, l=250, r=40, b=80),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter', color='#1F2937')
    )
)

fig.show()

**Conflict**
The initial insights derived from the 'Producers' feature (showing the frequency of producer combinations) are significantly flawed and misleading due to critical data integrity and structural issues. The raw state of this column prevents an accurate understanding of individual producers' contributions or true prevalence.
- Error 1: Aggregated Producer Strings (Multiple Producers in One Entry).
    - Impact on Insights (Bar Charts of Producer Combinations): The primary issue is that the 'Producers' column often contains multiple producers listed within a single string entry (e.g., "Aniplex, Lantis, MAGES."). When we count the frequency of these strings, we are tallying specific combinations of producers, not the individual producers themselves. This fundamentally obscures the true involvement and popularity of any single production company. For example, "Aniplex" might be a highly active producer, but its individual prevalence is hidden within numerous unique combinations. The bar charts will show these distinct, often lengthy, combination strings as categories, making it impossible to determine which individual producers are most active or involved across the dataset. We are gaining insight into producer teams or bundles, rather than the independent entities.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
    - Impact on Insights (Bar Charts of Producer Combinations): The string 'UNKNOWN', used for missing producer information, is counted by `value_counts()` as a valid and distinct producer combination, so it appears as one of the most frequent "producer combinations" in the bar chart. Its inclusion as a prominent category inflates its perceived importance, distorts the true ranking of actual producer collaborations, and distracts from analyzing meaningful production data. A correctly encoded missing value (standard NaN) is excluded from such frequency counts by default, giving a clearer and more accurate view of actual producer involvement.
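
Both errors can be reproduced on a handful of rows. The sketch below uses a made-up `Producers` series (the values are illustrative, not dataset rows) to contrast counting the raw strings with counting individual producers after the placeholder is treated as missing.

```python
import pandas as pd

# Toy 'Producers' column for illustration only (not real dataset rows).
raw = pd.Series([
    "Aniplex, Lantis",
    "Aniplex, Kodansha",
    "Aniplex",
    "UNKNOWN",
    "UNKNOWN",
    "UNKNOWN",
])

# Error 1 + Error 2: counting raw strings ranks the placeholder first
# and splits Aniplex across three different combinations.
raw_counts = raw.value_counts()

# Treat the placeholder as missing (NaN), then split and explode to one producer per row.
cleaned = raw.where(raw != "UNKNOWN")
individual = cleaned.dropna().str.split(",").explode().str.strip()
individual_counts = individual.value_counts()

print(raw_counts.idxmax())         # UNKNOWN
print(individual_counts.idxmax())  # Aniplex
```

Once `UNKNOWN` is replaced by NaN, `value_counts()` drops it without an explicit filter, which is the default behavior the second error description refers to.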

### 3.7.2.Correct Insight

In [26]:
import plotly.express as px
import plotly.io as pio
import pandas as pd

# 1. Setup Theme
pio.templates.default = "plotly_white"

# --- 2. DATA PROCESSING (DEEP CLEANING - KEEP AS IS) ---
prod_df = df_anime_dataset_2023_prep[['Producers']].copy()
prod_df['Producers'] = prod_df['Producers'].astype(str)

# REGEX CLEANING
prod_df['Producers'] = prod_df['Producers'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE
prod_df['Producers'] = prod_df['Producers'].str.split(',')
prod_df = prod_df.explode('Producers')

# TRIM
prod_df['Producers'] = prod_df['Producers'].str.strip()

# FILTER
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL"]
prod_df = prod_df[~prod_df['Producers'].str.upper().isin(exclude_list)]
prod_df = prod_df[prod_df['Producers'] != ""]

# --- 3. AGGREGATION ---
top_producers = prod_df['Producers'].value_counts().head(20).reset_index()
top_producers.columns = ['Producers', 'Count']
top_producers = top_producers.sort_values('Count', ascending=True)

# --- 4. VISUALIZATION ---
fig = px.bar(
    top_producers,
    x='Count',
    y='Producers',
    orientation='h',
    text='Count',
    title='<b>Market Leaders: Top 20 Most Active Individual Producers</b>',
    color='Count',
    color_continuous_scale=[ '#DDD6FE', '#4C1D95' ]
)

# Adjust traces
fig.update_traces(
    texttemplate='%{text:,}',
    textposition='outside',
    hovertemplate='<b>Producer:</b> %{y}<br><b>Projects:</b> %{x:,}<extra></extra>'
)

# --- 5. LAYOUT ---
fig.update_layout(
    title={
        'text': '<b>Top 20 Most Active Individual Producers</b>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=750, 
    coloraxis_showscale=False,
    margin=dict(t=80, l=150, r=50, b=80),

    # --- X AXIS CONFIGURATION (counts) ---
    xaxis=dict(
        title="Number of Anime Produced",
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        showticklabels=True, 
        tickfont=dict(size=14, color='#6B7280'),
        
        # Grid settings
        showgrid=False,        
        gridcolor='#F3F4F6',
        zeroline=False,
        
        # --- IMPORTANT: TICK STEP OF 200 ---
        tick0=0,     
        dtick=200,   
        tickformat=',' 
    ),

    # --- Y AXIS CONFIGURATION (producer names) ---
    yaxis=dict(
        title="",
        tickfont=dict(size=14, color='#333333', family='Inter'),
        ticksuffix="  ",    # Spacing between label and bar
        showgrid=False,
        showline=False,
        zeroline=False
    )
)

fig.show()


**Deep Insight**

- Compare the average Score between anime with `Producers = "UNKNOWN"` and anime with at least one known producer, to test whether projects with clearly identified producer organizations systematically perform better.  
- Compare the average Score across different collaboration structures based on `num_producers` (for example: 1 producer vs 2 producers vs 3 or more), to quantify whether producer collaboration is associated with higher success than single-producer projects.  
- Identify the top 20 producers with the highest average Score among producers with at least 20 scored titles (the Score of a co-produced anime is fully credited to each producer involved), to highlight which producer brands tend to be associated with successful titles.  
- Restrict to producer combinations that have co-produced at least 5 anime and compare their average Score, in order to find producer teams or recurring producer combinations that consistently deliver high-performing projects.


#### Deep insight 1

In [27]:

# 1. Setup Theme
pio.templates.default = "plotly_white"

# --- 2. DATA PREPARATION ---
# Copy the data
df_compare = df_anime_dataset_2023_prep[['Producers', 'Score']].copy()

# Convert Score to numeric and drop rows without a score
df_compare['Score'] = pd.to_numeric(df_compare['Score'], errors='coerce')
df_compare = df_compare.dropna(subset=['Score'])

# --- CATEGORIZATION HELPER (ALSO HANDLES EMPTY CELLS) ---
def categorize_producer(x):
    if pd.isna(x): return 'Unknown'          # Catch NaN/null
    if str(x).strip() == "": return 'Unknown' # Catch empty strings
    if str(x).strip().upper() == "UNKNOWN": return 'Unknown' # Catch the 'UNKNOWN' placeholder
    return 'Known Producers'

# Apply the helper
df_compare['Producer_Type'] = df_compare['Producers'].apply(categorize_producer)

# Compute the mean score per group
mean_scores = df_compare.groupby('Producer_Type')['Score'].mean()
unknown_mean = mean_scores.get('Unknown', 0)
known_mean = mean_scores.get('Known Producers', 0)
diff = known_mean - unknown_mean

# --- 3. VISUALIZATION ---
fig = px.box(
    df_compare,
    x='Producer_Type',
    y='Score',
    color='Producer_Type',
    # Color mapping: gray for Unknown, purple for Known
    color_discrete_map={
        'Unknown': '#B0B0B0',       
        'Known Producers': '#8B5CF6' 
    }
)

# --- TRACE STYLING ---
fig.update_traces(
    width=0.4,          # Box width
    marker_size=5,      # Outlier point size
    marker_opacity=0.4, # Outlier point opacity
    # marker_color is intentionally not set; forcing it to black would turn the whole chart black
    line_width=2      # Box outline thickness
)

# --- 4. LAYOUT CONFIGURATION ---
fig.update_layout(
    title={
        'text': f'<b>Score Comparison: Known vs. Unknown Producers</b><br>'
                f'<span style="font-size: 13px; color: #555;">'
                f'Known Average Score: <b>{known_mean:.2f}</b> vs Unknown Average Score: <b>{unknown_mean:.2f}</b> '
                f'(Difference: <span style="color:green">{diff:+.2f}</span>)</span>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=800,
    width=1200,
    showlegend=False,
    margin=dict(t=120, l=80, r=50, b=80),

    # --- X AXIS ---
    xaxis=dict(
        title="",
        tickfont=dict(size=16, weight='bold', color='#333333'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB',
        zeroline=False
    ),

    # --- Y AXIS ---
    yaxis=dict(
        title="Anime Score",
        title_font=dict(size=16, color='#4B5563'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        range=[0, 10], 
        dtick=1 
    )
)

fig.show()



**1. Clear Performance Gap (Quantitative Gap)**

  * Average Score (Mean):

    * **Known Producers:** 6.55
    * **Unknown Producers:** 6.11
    * **Difference:** +0.45 points (on a 10-point anime scale, this is a large gap)
  * Median:

    * The purple box (Known) is visibly higher than the grey box (Unknown), meaning most titles with a known producer land in a safer scoring range.

**2. Risk Distribution**

  * **Quality Floor (Worst Cases):**

    * Unknown group shows many low-score outliers (below **3.0–4.0**), suggesting weak funding, low control, or informal production.
  * **Quality Ceiling (Best Cases):**

    * Known group reaches up to **9.0**, showing that major successes (“masterpieces”) almost always come from projects with professional producers involved.

**3. Strategic Understanding (Producer Value)**

  * A producer is not only “money support” but also a **quality filter** providing:

    * Proper production workflow
    * Clear market direction
    * Marketing and distribution resources
  * The “Unknown” label acts as a **warning**, indicating a higher project failure risk and unstable results.

**4. Conclusion**

  * Data shows that having a recognized producer is a critical factor for achieving a stable baseline score and improving the chance of success. Serious projects should not leave this role empty.
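
One way to judge whether a gap in mean scores of this size could be a sampling artifact is a percentile bootstrap of the difference in means. The following is only a sketch under stated assumptions: the helper `bootstrap_mean_diff` is hypothetical (it is not part of the notebook), and the two score arrays are synthetic stand-ins for the Known and Unknown groups, so the printed numbers say nothing about the real data.

```python
import numpy as np

def bootstrap_mean_diff(a, b, n_boot=2000, seed=0):
    """Return (observed difference, 95% percentile interval) for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping the group sizes fixed.
        diffs[i] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    low, high = np.percentile(diffs, [2.5, 97.5])
    return a.mean() - b.mean(), (low, high)

# Synthetic scores for illustration only.
rng = np.random.default_rng(42)
known = rng.normal(6.5, 0.9, size=500)
unknown = rng.normal(6.1, 0.9, size=300)

observed, (low, high) = bootstrap_mean_diff(known, unknown)
print(f"diff={observed:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

In the notebook, the two arrays would presumably come from `df_compare.loc[df_compare['Producer_Type'] == ..., 'Score']`; an interval that stays above zero would support reading the gap as more than noise, although it would still not establish that producers cause the higher scores.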


#### Deep insight 2

In [28]:
# --- 2. DATA PROCESSING ---
# Copy the data
df_producers_collab = df_anime_dataset_2023_prep[['Producers', 'Score']].copy()

# Convert Score to numeric and drop rows without a score
df_producers_collab['Score'] = pd.to_numeric(df_producers_collab['Score'], errors='coerce')
df_producers_collab = df_producers_collab.dropna(subset=['Score'])

# Drop Unknown/empty rows so the counts are accurate
exclude_list = ["UNKNOWN", "", "NONE"]
df_producers_collab = df_producers_collab[~df_producers_collab['Producers'].astype(str).str.strip().str.upper().isin(exclude_list)]
df_producers_collab = df_producers_collab.dropna(subset=['Producers'])

# COUNT THE PRODUCERS PER TITLE
# Number of commas + 1 (e.g. "A, B" has 1 comma -> 2 producers)
df_producers_collab['num_producers'] = df_producers_collab['Producers'].astype(str).str.count(',') + 1

# BIN INTO GROUPS TO KEEP THE CHART COMPACT
def group_size(n):
    if n == 1: return '1 Producer'
    if n == 2: return '2 Producers'
    if n == 3: return '3 Producers'
    if 4 <= n <= 5: return '4-5 Producers'
    return '6+ Producers'

df_producers_collab['Collaboration_Group'] = df_producers_collab['num_producers'].apply(group_size)

# Fix the category order so the chart runs from 1 to 6+
order_list = ['1 Producer', '2 Producers', '3 Producers', '4-5 Producers', '6+ Producers']

# Compute the mean score for each group
group_means = df_producers_collab.groupby('Collaboration_Group')['Score'].mean().reindex(order_list)
# Compare the solo baseline (1 Producer) with the best-scoring group (usually a multi-producer group)
solo_score = group_means['1 Producer']
max_group_score = group_means.max()
best_group = group_means.idxmax()
diff = max_group_score - solo_score

# --- 3. VISUALIZATION ---
fig = px.box(
    df_producers_collab,
    x='Collaboration_Group',
    y='Score',
    color_discrete_sequence=['#8B5CF6'],
    category_orders={'Collaboration_Group': order_list} # Force the intended category order
)

fig.update_traces(
    width=0.5,
    marker_size=5,
    marker_opacity=0.3,
    line_width=2
)

# --- 4. LAYOUT CONFIGURATION ---
fig.update_layout(
    title={
        'text': f'<b>Impact of Collaboration Size on Anime Score</b><br>'
                f'<span style="font-size: 14px; color: #555;">'
                f'Solo Projects Avg: <b>{solo_score:.2f}</b> vs {best_group} Avg: <b>{max_group_score:.2f}</b> '
                f'(Improvement: <span style="color:green">+{diff:.2f}</span>)</span>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=14, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    width=1200,
    height=800,
    
    showlegend=False, # Hide the legend; the X axis already labels the groups
    margin=dict(t=120, l=100, r=80, b=100),

    # X AXIS
    xaxis=dict(
        title="Number of Producers Involved",
        title_font=dict(size=16, color='#4B5563'),
        tickfont=dict(size=14, weight='bold', color='#333333'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB'
    ),

    # Y AXIS
    yaxis=dict(
        title="Anime Score",
        title_font=dict(size=16, color='#4B5563'),
        tickfont=dict(size=14, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        range=[0, 10],
        dtick=1
    )
)

fig.show()



**Insight: The "Production Committee" Effect**

The data show that the collaborative (joint venture) model clearly outperforms solo production.

*   **Collaboration benefit (+0.98 points):** There is a clear positive correlation: the more producers involved, the higher the score. The gap of nearly **1.0 point** between "Solo" projects (6.23) and the "4-5 Producers" group (7.21) is evidence that **larger budgets and tighter oversight** go together with better products.
*   **Risk mitigation:** The **"1 Producer"** group has a very wide spread, with many "disaster" outliers falling to scores of 2.0-3.0. By contrast, with **3 or more producers**, the quality floor rises considerably (scores rarely fall below 5.0), which helps protect the investment from catastrophic failure.
*   **The sweet spot:** Performance peaks in the **4-5 Producers** group. The **6+ Producers** group still performs well but shows no clear further gain in score (saturation), suggesting that adding too many parties may bring *diminishing returns* because of management complexity.

**👉 Conclusion:** The optimal strategy is to form a Production Committee of **3-5 partners** to balance funding, quality control, and management efficiency.
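
The positive relationship between producer count and score can be summarized with a rank correlation, which does not assume a linear trend and is therefore compatible with the saturation seen at 6+ producers. A minimal sketch on made-up rows follows; the column names mirror the cell above, and Spearman's coefficient is computed as the Pearson correlation of the ranks so that only pandas is needed.

```python
import pandas as pd

# Toy rows for illustration only; column names mirror the notebook cell above.
toy = pd.DataFrame({
    "num_producers": [1, 1, 2, 3, 4, 5, 6, 8],
    "Score":         [5.9, 6.4, 6.3, 6.8, 7.2, 7.1, 7.0, 7.3],
})

# Spearman correlation is the Pearson correlation of the (average) ranks.
rho = toy["num_producers"].rank().corr(toy["Score"].rank())

# Per-group size, mean, and median give context for the box plot.
summary = toy.groupby("num_producers")["Score"].agg(["count", "mean", "median"])

print(f"Spearman rho = {rho:.2f}")
print(summary)
```

Reporting the group sizes next to the means matters here, because a high average in a small group is less reliable than the same average in a large one.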

In [None]:
# 1. AGGREGATION
# Count the rows in each group
group_counts = df_producers_collab['Collaboration_Group'].value_counts()

# Reorder logically (1 -> 6+) instead of by size
order_list = ['1 Producer', '2 Producers', '3 Producers', '4-5 Producers', '6+ Producers']
group_counts = group_counts.reindex(order_list).reset_index()
group_counts.columns = ['Collaboration_Group', 'Anime_Count']

# Add percentage shares for the hover text
total_anime = group_counts['Anime_Count'].sum()
group_counts['Percentage'] = (group_counts['Anime_Count'] / total_anime * 100).round(1)

# 2. BAR CHART
fig = px.bar(
    group_counts,
    x='Collaboration_Group',
    y='Anime_Count',
    text='Anime_Count', # Show the count above each bar
    color_discrete_sequence=['#8B5CF6'] # Main purple color
)

# 3. STYLING
fig.update_traces(
    texttemplate='%{text:,}', # Thousands separator (e.g. 1,200)
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.5,
    # Hover shows both the count and the percentage share
    hovertemplate='<b>%{x}</b><br>Count: %{y:,}<br>Share: %{customdata[0]}%<extra></extra>',
    customdata=group_counts[['Percentage']] # Pass the percentage to the hover template
)

fig.update_layout(
    title={
        'text': f'<b>Distribution of Anime by Collaboration Size</b><br>'
                f'<span style="font-size: 14px; color: #555;">'
                f'Total Analyzed Anime: <b>{total_anime:,}</b></span>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=14, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    width=1200,
    height=700,
    
    showlegend=False,
    margin=dict(t=100, l=80, r=80, b=100),

    # X AXIS
    xaxis=dict(
        title="Number of Producers Involved",
        title_font=dict(size=16, color='#4B5563'),
        tickfont=dict(size=14, weight='bold', color='#333333'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB'
    ),

    # Y AXIS
    yaxis=dict(
        title="Number of Anime",
        title_font=dict(size=16, color='#4B5563'),
        tickfont=dict(size=14, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        tickformat=',' # Thousands separator on the Y axis
    )
)

fig.show()

#### Deep Insight 3

In [30]:
# --- 2. DATA PROCESSING ---
# Copy the data for processing
df_quality = df_anime_dataset_2023_prep[['Producers', 'Score']].copy()

# Convert Score to numeric and drop rows without a score
df_quality['Score'] = pd.to_numeric(df_quality['Score'], errors='coerce')
df_quality = df_quality.dropna(subset=['Score'])

# --- DEEP CLEANING ---
df_quality['Producers'] = df_quality['Producers'].astype(str)
# Remove stray characters [ ] ' "
df_quality['Producers'] = df_quality['Producers'].str.replace(r"[\[\]\'\"]", "", regex=True)
# Split the string and explode to one producer per row
df_quality['Producers'] = df_quality['Producers'].str.split(',')
df_quality = df_quality.explode('Producers')
# Trim whitespace
df_quality['Producers'] = df_quality['Producers'].str.strip()
# Drop placeholder values
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_quality = df_quality[~df_quality['Producers'].str.upper().isin(exclude_list)]

# --- 3. AGGREGATION & FILTERING (IMPORTANT) ---
# Group by producer and compute two metrics: mean score and number of titles
producer_stats = df_quality.groupby('Producers')['Score'].agg(['mean', 'count']).reset_index()

# SET THE FILTER THRESHOLD
# Keep only producers with at least 20 titles
min_anime_count = 20
top_quality = producer_stats[producer_stats['count'] >= min_anime_count].copy()

# Take the top 20 by mean score
top_quality = top_quality.sort_values('mean', ascending=False).head(20)

# Re-sort ascending for the horizontal chart (the highest score ends up at the top)
top_quality = top_quality.sort_values('mean', ascending=True)

# --- 4. VISUALIZATION ---
fig = px.bar(
    top_quality,
    x='mean',
    y='Producers',
    orientation='h',
    text='mean', # Show the score on each bar
    
    title=f'<b>Top 20 High-Quality Producers</b> (Min. {min_anime_count} Titles)',
    
    # Color by score (the higher, the darker)
    color='mean',
    color_continuous_scale=['#DDD6FE', '#4C1D95']
)

# Adjust traces
fig.update_traces(
    texttemplate='%{text:.2f}', # Show scores with two decimals
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.5,
    # Hover shows the name, the average score, and the number of titles
    hovertemplate='<b>%{y}</b><br>' +
                  'Avg Score: <b>%{x:.2f}</b><br>' +
                  'Total Anime: %{customdata[0]}<extra></extra>',
    customdata=top_quality[['count']] # Pass the title count to the hover template
)

# --- 5. LAYOUT CONFIGURATION ---
fig.update_layout(
    title={
        'text': f'<b>Top 20 High-Quality Producers (Quality over Quantity)</b><br>'
                f'<span style="font-size: 14px; color: #555;">'
                f'Criteria: Producers with at least {min_anime_count} releases</span>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=14, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    width=1200,
    height=700,
    
    showlegend=False,
    coloraxis_showscale=False, # Hide the color bar on the right
    margin=dict(t=100, l=250, r=80, b=80), # Large left margin for producer names

    # X AXIS: scores
    xaxis=dict(
        title="Average Anime Score",
        title_font=dict(size=16, color='#4B5563'),
        tickfont=dict(size=14, weight='bold', color='#333333'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        range=[0, 10], # Keep the full 10-point scale so differences are not exaggerated
        dtick=1
    ),

    # Y AXIS: producer names
    yaxis=dict(
        title="",
        tickfont=dict(size=14, color='#4B5563'),
        ticksuffix="  "
    )
)

fig.show()


**DEEP INSIGHT: THE "QUALITY" FORMULA OF THE TOP PRODUCERS**

From the list of the Top 20 producers with the highest average score (criterion: at least 20 releases):

**1. The power of "IP holding" (owning the original intellectual property)**

*   **Evidence:** The **top spot belongs to Notes (7.65)**.
*   **Analysis:** *Notes* is the company that manages the **Type-Moon** brand (known for the *Fate series* and *Kara no Kyoukai*).
    *   They do not mass-produce. They own the original IP. When they act as producer, it is to protect and elevate their flagship brand.
    *   **Lesson:** When the producer is the creator of the work (or holds full control over the IP), they are willing to commit a large budget and enforce strict quality control. This leads to markedly higher scores and a highly loyal fan base.

**2. The publishers' "cherry-picking" strategy**
*   **Evidence:** Major manga publishers appear repeatedly: **Hakusensha (#3), Kodansha (#10), Houbunsha (#12), Shueisha (#14)**.
*   **Analysis:** Why do publishers score so highly?
    *   They own thousands of manga titles, but they cherry-pick only the **best-selling titles with the strongest content** for anime adaptation.
    *   The source content is already market-validated $\rightarrow$ low script risk $\rightarrow$ the anime reaches high scores more easily.
    *   *Comparison:* They make fewer titles than the broadcasters (such as NHK and TV Tokyo in the volume ranking), but their hit rate is considerably higher.

**3. Large scale with high quality**
*   **Evidence:** 
    * Aniplex - Top 20 Active Producers (580 projects) & Top 20 High-Quality Producers (7.34 avg score)
    * Kodansha - Top 20 Active Producers (310 projects) & Top 20 High-Quality Producers (7.43 avg score)
    * Dentsu - Top 20 Active Producers (370 projects) & Top 20 High-Quality Producers (7.41 avg score)
    * Shueisha - Top 20 Active Producers (307 projects) & Top 20 High-Quality Producers (7.39 avg score)
    * Mainichi Broadcasting System - Top 20 Active Producers (262 projects) & Top 20 High-Quality Producers (7.38 avg score)
*   **Cross-reference:**
    * While other volume leaders such as *NHK, TV Tokyo, Lantis* do not appear in the quality ranking, the five names above remain in it.
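
The cross-reference between the volume ranking and the quality ranking can be made explicit with a set intersection. The sketch below uses made-up per-producer statistics with the same `mean` and `count` columns as `producer_stats`; the cutoff `top_n = 3` is a hypothetical value chosen to keep the toy example small, whereas the notebook uses 20 for both rankings.

```python
import pandas as pd

# Toy per-producer statistics for illustration only (values are made up).
stats = pd.DataFrame({
    "Producers": ["A", "B", "C", "D", "E", "F"],
    "count":     [500, 400, 300, 40, 30, 25],
    "mean":      [7.3, 6.5, 7.4, 7.6, 6.0, 7.5],
})

top_n = 3  # hypothetical cutoff for the toy data
by_volume = set(stats.nlargest(top_n, "count")["Producers"])
by_quality = set(stats.nlargest(top_n, "mean")["Producers"])

# Producers present in both rankings combine scale with high average scores.
both = sorted(by_volume & by_quality)
print(both)  # ['C']
```

Deriving the overlap in code avoids copying names between two charts by hand and keeps the list correct if either threshold changes.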



#### Deep Insight 4


In [None]:
# --- 2. DATA PROCESSING ---
# Copy the data
df_producers_quality_collab = df_anime_dataset_2023_prep[['Producers', 'Score']].copy()

# Convert Score to numeric and drop rows without a score
df_producers_quality_collab['Score'] = pd.to_numeric(df_producers_quality_collab['Score'], errors='coerce')
df_producers_quality_collab = df_producers_quality_collab.dropna(subset=['Score'])

# --- CLEAN THE STRING BUT DO NOT SPLIT IT (KEEP COMBINATIONS) ---
# Remove only stray characters and keep the commas so combinations stay identifiable
df_producers_quality_collab['Producers'] = df_producers_quality_collab['Producers'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)
df_producers_quality_collab['Producers'] = df_producers_quality_collab['Producers'].str.strip()

# Drop Unknown/empty rows
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_producers_quality_collab = df_producers_quality_collab[~df_producers_quality_collab['Producers'].str.upper().isin(exclude_list)]

# (Optional) Keep only rows with at least one comma (i.e. a collaboration between >= 2 parties)
# Remove the line below to include single producers as well.
# "Synergy" (collaboration) usually implies two or more parties.
df_producers_quality_collab = df_producers_quality_collab[df_producers_quality_collab['Producers'].str.contains(',')]

# --- 3. AGGREGATION ---
# Group by the exact combination string
combo_stats = df_producers_quality_collab.groupby('Producers')['Score'].agg(['mean', 'count']).reset_index()

# --- FILTERING (CRITERIA) ---
# 1. The team must have collaborated at least 5 times (stability)
min_collabs = 5
top_combos = combo_stats[combo_stats['count'] >= min_collabs].copy()

# 2. Take the top 20 combinations by mean score
top_combos = top_combos.sort_values('mean', ascending=False).head(20)

# Re-sort ascending for the horizontal chart
top_combos = top_combos.sort_values('mean', ascending=True)

# --- 4. VISUALIZATION ---
fig = px.bar(
    top_combos,
    x='mean',
    y='Producers',
    orientation='h',
    text='mean',
    
    title=f'<b>Top 20 "Dream Teams": Best Producer Collaborations</b>',
    
    # Color by score
    color='mean',
    color_continuous_scale=['#DDD6FE', '#4C1D95']
)

# --- 5. STYLING ---
fig.update_traces(
    texttemplate='%{text:.2f}', 
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.5,
    # Hover shows the combination, the average score, and the number of collaborations
    hovertemplate='<b>Combination:</b> %{y}<br>' +
                  'Avg Score: <b>%{x:.2f}</b><br>' +
                  'Collabs Count: %{customdata[0]}<extra></extra>',
    customdata=top_combos[['count']]
)

fig.update_layout(
    title={
        'text': f'<b>Top 15 Best Producer Collaborations (Quality Focus)</b><br>'
                f'<span style="font-size: 14px; color: #555;">'
                f'Criteria: Combinations with ≥ {min_collabs} joint projects</span>',
        'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'
    },
    font=dict(family='Inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    width=1400,
    height=800,
    
    showlegend=False,
    coloraxis_showscale=False,
    
    # Very large left margin because combination names tend to be very long
    margin=dict(t=100, l=400, r=60, b=80),

    # X AXIS
    xaxis=dict(
        title="Average Anime Score",
        title_font=dict(size=14, color='#4B5563'),
        tickfont=dict(size=12, color='#333333'),
        showgrid=True,
        gridcolor='#F3F4F6',
        zeroline=False,
        # Zoom into the high-score range so differences are visible
        range=[top_combos['mean'].min() - 0.5, 10]
    ),

    # Y AXIS
    yaxis=dict(
        title="",
        tickfont=dict(size=12, color='#4B5563'),
        ticksuffix="  "
    )
)

fig.show()

## 3.8.Insight Question 8
Unlike Producers, Studios are hired by Producers to create the anime itself. They handle drawing, animation, compositing, coloring, editing, and post-production; in other words, they produce the visual content that appears on screen. Because animation production can be a massive undertaking, multiple studios often collaborate on a single anime, especially for larger projects or under tight deadlines. As a result, the 'Studios' feature in our dataset often lists several studios in a single comma-separated entry. Our initial exploration of this raw column identifies the most frequent studio strings. By analyzing the unprocessed data, we show how the string format hides the prevalence of individual studios and why the column must be parsed and cleaned before we can identify the true contributors to anime production.
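
To make the parsing problem concrete, here is a minimal sketch on toy data. The studio names and rows are illustrative assumptions, not values taken from the dataset. It contrasts counting the raw strings with counting individual studios after a split and explode.

```python
import pandas as pd

# Toy data; the rows are illustrative, not taken from the dataset.
toy = pd.DataFrame(
    {"Studios": ["Madhouse", "Madhouse, MAPPA", "MAPPA", "Madhouse, Bones", "UNKNOWN"]}
)

# Counting the raw strings treats every distinct combination as its own category.
raw_counts = toy["Studios"].value_counts()

# Splitting on commas and exploding yields one row per individual studio.
individual_counts = (
    toy["Studios"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(raw_counts)
print(individual_counts)
```

In the raw counts every string appears once, so "Madhouse" looks no more active than any single combination. After exploding, it is credited with all three of its titles.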

### 3.8.1.Misleading Insight
- What are the most frequently appearing studio combinations or studio strings as they are originally listed in the dataset?

In [None]:
# --- 2. DATA PROCESSING ---
df_studios = df_anime_dataset_2023_prep[['Studios']].copy()

# Cast to string
df_studios['Studios'] = df_studios['Studios'].astype(str)

# Strip junk characters (brackets, quotes) so counts are accurate
df_studios['Studios'] = df_studios['Studios'].str.replace(r"[\[\]\'\"]", "", regex=True)

# Filter out Unknown/None
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studios = df_studios[~df_studios['Studios'].str.upper().isin(exclude_list)]

# --- CLASSIFICATION LOGIC (SOLO VS COLLAB) ---
def classify_studio(studio_str):
    # Count the elements after splitting on commas
    # Example: "Madhouse" -> 1 element -> Solo
    # Example: "Madhouse, MAPPA" -> 2 elements -> Collab
    count = len(studio_str.split(','))
    if count == 1:
        return 'Solo Studio'
    else:
        return 'Collaboration'

df_studios['Production_Type'] = df_studios['Studios'].apply(classify_studio)

# --- 3. AGGREGATION ---
studio_type_counts = df_studios['Production_Type'].value_counts().reset_index()
studio_type_counts.columns = ['Type', 'Count']

# Compute the percentage manually for use in the title (quick insight)
total = studio_type_counts['Count'].sum()
solo_count = studio_type_counts[studio_type_counts['Type'] == 'Solo Studio']['Count'].values[0]
collab_count = studio_type_counts[studio_type_counts['Type'] == 'Collaboration']['Count'].values[0]
solo_pct = (solo_count / total) * 100

# --- 4. VISUALIZATION (DONUT CHART) ---
fig = px.pie(
    studio_type_counts,
    values='Count',
    names='Type',
    hole=0.5, # Hole in the middle (donut chart)
    
    # Colors: dark purple for Collab (signals complexity), light purple for Solo
    color='Type',
    color_discrete_map={
        'Solo Studio': '#C4B5FD',    # Light purple
        'Collaboration': '#6D28D9'   # Dark purple
    },
    
    title='<b>Production Structure: Solo vs. Collaboration Studios</b>'
)

# --- 5. STYLING ---
fig.update_traces(
    textposition='outside', 
    textinfo='percent+label', # Show label + percentage
    textfont=dict(size=13, family='Inter', color='#333333'),
    marker=dict(line=dict(color='white', width=3)) # White border separating the slices
)

fig.update_layout(
    font=dict(family='Inter', size=14, color="#333333"),
    
    # Annotate the donut hole to emphasize the total count
    annotations=[dict(
        text=f'<b>{total:,}</b><br><span style="font-size:12px; color:gray">Total Projects</span>', 
        x=0.5, y=0.5, font_size=20, showarrow=False
    )],
    
    title={
        'text': f'<b>Studio Production Structure</b><br>'
                f'<span style="font-size: 14px; color: #555;">'
                f'Solo Projects dominate with <b>{solo_pct:.1f}%</b> of the market</span>',
        'y': 0.95, 'x': 0.5, 'xanchor': 'center'
    },
    
    showlegend=False, # Hide the legend because the slices are labeled directly
    height=600,
    width=800,
    margin=dict(t=100, b=50, l=50, r=50)
)

fig.show()

In [None]:
# 1. Data processing (Top 15 studio strings, raw data)
# Count how often each raw string appears in the Studios column
studio_counts = df_anime_dataset_2023['Studios'].value_counts().head(15).reset_index()
studio_counts.columns = ['Studios', 'Count']

# Sort ascending so the largest bar (Top 1) ends up at the top of the horizontal chart
studio_counts = studio_counts.sort_values('Count', ascending=True)

# 2. Color logic (gray for Unknown, purple for everything else)
base_color = "#8B5CF6"
unknown_color = "#B0B0B0"

# Build the color list from the studio names
colors = [
    unknown_color if str(s).strip().upper() == "UNKNOWN" else base_color
    for s in studio_counts['Studios']
]

# 3. Build the horizontal bar chart
fig = px.bar(
    studio_counts,
    x='Count',       # Counts on the horizontal axis
    y='Studios',     # Names on the vertical axis
    orientation='h', # Horizontal orientation
    title='Top 15 Studio Combinations (Raw Data)',
    labels={'Studios': 'Studio Combination', 'Count': 'Number of Anime'},
    text='Count'
)

# 4. Refine traces
fig.update_traces(
    marker_color=colors,  # Apply the color list built above
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Studios:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 5. Refine layout (consistent styling)
fig.update_layout(
    title={
        'text': 'Top 15 Studio Combinations (Raw Data)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', weight='bold', color='#1F2937')
    },
    # X axis: counts
    xaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=11, color='#6B7280'),
        showgrid=False,       # Grid disabled
        gridcolor='#F3F4F6',
        showline=False,
        zeroline=False,
        tickformat=','
    ),
    # Y axis: studio names
    yaxis=dict(
        title='',  # No axis title
        tickfont=dict(size=14, color='#4B5563', family='Inter'),
        ticksuffix="  ",  # Spacing between labels and bars
        showgrid=False,
        showline=False,
        zeroline=False
    ),
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.25,
    height=800,
    
    # A left margin (l) of 250 leaves room for studio names (usually shorter than producer names, but still long)
    margin=dict(t=80, l=250, r=40, b=80),
    
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter', color='#1F2937')
    )
)

fig.show()

**Conflict**

The initial insights derived from the **'Studios'** feature (showing the frequency of studio combinations) are significantly flawed and misleading due to critical data integrity and structural issues. The raw state of this column prevents an accurate understanding of individual studios' contributions or true prevalence.

*   **Error 1: Aggregated Studio Strings (Multiple Studios in One Entry).**
    *   **Impact on Insights (Bar Charts of Studio Combinations):** The primary issue is that the **'Studios'** column often contains multiple studios listed within a single string entry (e.g., **"Madhouse, MAPPA"**). When we count the frequency of these strings, we are tallying specific combinations of studios, not the individual studios themselves. This fundamentally obscures the true involvement and popularity of any single animation studio. For example, **"Madhouse"** might be a highly active studio, but its individual prevalence is hidden within numerous unique combinations. The bar charts will show these distinct, often lengthy, combination strings as categories, making it impossible to determine which individual studios are most active or involved across the dataset. We are gaining insight into **studio collaborations**, rather than the independent entities.

*   **Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.**
    *   **Impact on Insights (Bar Charts of Studio Combinations):** The string 'UNKNOWN', used for missing studio information, is counted by pandas' `value_counts()` as a valid and distinct studio combination, and Plotly then plots it like any other category. As a result, 'UNKNOWN' will likely appear as one of the most frequent "studio combinations" in the bar charts. Its prominence inflates its perceived importance, distorts the true ranking of actual **studio collaborations**, and distracts from analyzing meaningful production data. A correctly handled missing value (standard NaN) is excluded from `value_counts()` by default, which gives a clearer and more accurate picture of actual studio involvement.
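
The effect described in Error 2 can be reproduced with a small sketch. The series below is toy data (an assumption for illustration). It shows that a string placeholder is counted like any other category, while a real NaN is dropped by `value_counts()` because `dropna=True` is the default.

```python
import pandas as pd

# Toy column; the values are illustrative only.
studios = pd.Series(["Madhouse", "UNKNOWN", "UNKNOWN", "MAPPA", "UNKNOWN"])

# As a plain string, 'UNKNOWN' is counted like any other category.
with_placeholder = studios.value_counts()

# Masking the placeholder turns it into a missing value,
# which value_counts() drops by default (dropna=True).
cleaned = studios.mask(studios == "UNKNOWN")
without_placeholder = cleaned.value_counts()

print(with_placeholder)
print(without_placeholder)
```

Here the placeholder tops the first ranking purely because it is the most common string, which is the distortion we expect in the raw bar chart.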

### 3.8.2.Correct Insight

In [None]:
# --- 2. DATA PROCESSING (DEEP CLEANING FOR STUDIOS) ---
# Use the Studios column instead of Producers
studio_df = df_anime_dataset_2023_prep[['Studios']].copy()
studio_df['Studios'] = studio_df['Studios'].astype(str)

# REGEX CLEANING
studio_df['Studios'] = studio_df['Studios'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE (separate collaborating studios so each one is counted individually)
studio_df['Studios'] = studio_df['Studios'].str.split(',')
studio_df = studio_df.explode('Studios')

# TRIM
studio_df['Studios'] = studio_df['Studios'].str.strip()

# FILTER GARBAGE
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL"]
studio_df = studio_df[~studio_df['Studios'].str.upper().isin(exclude_list)]
studio_df = studio_df[studio_df['Studios'] != ""]

# --- 3. AGGREGATION ---
# Count occurrences and take the Top 25
top_studios = studio_df['Studios'].value_counts().head(25).reset_index()
top_studios.columns = ['Studios', 'Count']
top_studios = top_studios.sort_values('Count', ascending=True)

# --- 4. VISUALIZATION ---
fig = px.bar(
    top_studios,
    x='Count',
    y='Studios',
    orientation='h',
    text='Count',
    title='<b>Market Leaders: Top 25 Most Active Individual Studios</b>',
    color='Count',
    # Keep the purple color scale
    color_continuous_scale=['#DDD6FE', '#4C1D95'] 
)

# Refine traces
fig.update_traces(
    texttemplate='%{text:,}',
    textposition='outside',
    hovertemplate='<b>Studio:</b> %{y}<br><b>Projects:</b> %{x:,}<extra></extra>'
)

# --- 5. LAYOUT ---
fig.update_layout(
    title={
        'text': '<b>Top 25 Most Active Individual Studios</b>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=750, 
    coloraxis_showscale=False,
    margin=dict(t=80, l=150, r=50, b=80),

    # --- X AXIS CONFIGURATION (counts) ---
    xaxis=dict(
        title="Number of Anime Produced",
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        showticklabels=True, 
        tickfont=dict(size=14, color='#6B7280'),
        
        # Grid settings
        showgrid=False, # Grid is off; set to True for a faint grid that makes values easier to read
        gridcolor='#F3F4F6',
        zeroline=False,
        
        # --- IMPORTANT: TICK STEP TUNED FOR STUDIO COUNTS ---
        tick0=0,     
        dtick=200,  
        tickformat=',' 
    ),

    # --- Y AXIS CONFIGURATION (studio names) ---
    yaxis=dict(
        title="",
        tickfont=dict(size=14, color='#333333', family='Inter'),
        ticksuffix="  ",    # Spacing between labels and bars
        showgrid=False,
        showline=False,
        zeroline=False
    )
)

fig.show()

In [None]:
# Convert 'Genres' column from string representation of list to actual list
# Note: ast.literal_eval raises on malformed strings; non-string values (e.g., NaN) pass through unchanged
df_anime_dataset_2023_prep['Genres_list'] = df_anime_dataset_2023_prep['Genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
genre_score_pairs = []

# Iterate through each row of the DataFrame
for index, row in df_anime_dataset_2023_prep.iterrows():
    genres = row['Genres_list']
    score = row['Score']

    # Check if both genres and score are not null
    if isinstance(genres, list) and not pd.isna(score):
        for genre in genres:
            genre_score_pairs.append({'Genre': genre, 'Score': score})

# Convert the list of dictionaries to a DataFrame
genre_scores_df = pd.DataFrame(genre_score_pairs)

# Group by genre and calculate the mean score
average_score_per_genre = genre_scores_df.groupby('Genre')['Score'].mean().reset_index()

# Sort by average score in descending order
average_score_per_genre = average_score_per_genre.sort_values(by='Score', ascending=False)

print("Top 10 Genres by Average Score:")
display(average_score_per_genre.head(10))

Top 10 Genres by Average Score:


| | Genre | Score |
| :--- | :--- | :--- |
| 3 | Award Winning | 7.296308 |
| 14 | Mystery | 6.995093 |
| 20 | Suspense | 6.962963 |
| 6 | Drama | 6.850645 |
| 15 | Romance | 6.804509 |
| 19 | Supernatural | 6.7446 |
| 18 | Sports | 6.722046 |
| 0 | Action | 6.674112 |
| 1 | Adventure | 6.673997 |
| 11 | Gourmet | 6.627664 |


In [None]:
# --- 1. DATA PREPARATION ---

# Safe list-parsing helper
def parse_list_field(x):
    if pd.isna(x): return []
    if isinstance(x, list): return [str(s).strip() for s in x]
    try:
        val = ast.literal_eval(x)
        if isinstance(val, (list, tuple)): return [str(s).strip() for s in val]
    except Exception: pass
    return [s.strip() for s in str(x).split(',') if s.strip()]

# Build the working dataframe
df_matrix = df_anime_dataset_2023_prep[['anime_id', 'Studios', 'Genres', 'Score']].copy()
df_matrix = df_matrix.dropna(subset=['Studios', 'Genres', 'Score'])

# Parse columns
df_matrix['Studios_list'] = df_matrix['Studios'].apply(parse_list_field)
df_matrix['Genres_list'] = df_matrix['Genres'].apply(parse_list_field)

# --- STEP 1: FILTER TOP 20 STUDIOS (Market Leaders) ---
df_studios = df_matrix.explode('Studios_list')
df_studios['Studio'] = df_studios['Studios_list'].astype(str).str.strip()

top_20_studio_names = df_studios['Studio'].value_counts().head(20).index.tolist()
df_top_studios = df_studios[df_studios['Studio'].isin(top_20_studio_names)].copy()

# --- STEP 2: FILTER TOP 10 GENRES (Major Markets) ---
df_final = df_top_studios.explode('Genres_list')
df_final['Genre'] = df_final['Genres_list'].astype(str).str.strip()

# Keep only the 10 most frequent genres among the Top 20 studios' titles
top_10_genres = df_final['Genre'].value_counts().head(10).index.tolist()
df_final = df_final[df_final['Genre'].isin(top_10_genres)]

# --- STEP 3: BUILD THE SCORE MATRIX (PIVOT TABLE) ---
heatmap_data = df_final.groupby(['Studio', 'Genre'])['Score'].mean().unstack()

# Reindex for a tidy ordering
heatmap_data = heatmap_data.reindex(index=top_20_studio_names, columns=top_10_genres)

# --- 2. VISUALIZATION (HEATMAP) ---

# Custom purple color scale
custom_colorscale = [
    [0.0, "#F3E5F5"], 
    [0.6, "#9575CD"], 
    [1.0, "#4A235A"]  
]

fig = px.imshow(
    heatmap_data,
    # Label the color axis "Average Score"
    labels=dict(x="Genre", y="Studio", color="Average Score"),
    x=heatmap_data.columns,
    y=heatmap_data.index,
    aspect="auto",
    color_continuous_scale=custom_colorscale,
    text_auto=".2f",
    title="<b>Top 20 Active Individual Studios Specialization: Performance in Top 10 Genres</b>"
)

# --- 3. STYLING ---
fig.update_layout(
    height=800, 
    width=1000,
    
    font=dict(family='Roboto', size=11, color="#2c3e50"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    # Position the title so it does not collide with the genre labels
    title=dict(
        text="<b>Top 20 Active Individual Studios Specialization: Performance in Top 10 Genres</b>",
        y=0.95,
        x=0.5,
        xanchor='center',
        yanchor='top',
        pad=dict(b=30) # Push the content below (genre axis) 30px away from the title
    ),
    
    # Larger top margin to make room for the title padding
    margin=dict(t=100, l=150, r=50, b=50),
    
    coloraxis_colorbar=dict(
        title="Average Score", # Colorbar title
        title_font=dict(family="Inter", size=11),
        tickfont=dict(family="Inter", size=10),
        thickness=10, 
        len=0.5,
        outlinecolor="white", 
        outlinewidth=0
    )
)

# Move the x axis to the top (matrix style)
fig.update_xaxes(side="top", tickfont=dict(family="Inter", size=11))
fig.update_yaxes(tickfont=dict(family="Inter", size=11), title=None, ticksuffix="  ")

# Add gaps between cells
fig.update_traces(xgap=1, ygap=1)

fig.show()

#### Deep insight 1
**1. Kyoto Animation (KyoAni): the most well-rounded studio**
*   **Observation:** KyoAni's row (the last row) is the **darkest purple** and the most uniform across the whole heatmap, especially in these genres:
    *   **Action (7.98), Drama (7.98), Mystery (7.82)**.
*   **Strategic Insight:** KyoAni is not only the leader in traditional *Slice of Life* and *Drama* titles; it also executes complex genres such as *Action* and *Mystery* at the highest level (**~8.00** points).
    *   **Decision:** If the goal is to **maximize score** and **brand prestige** with an unlimited budget, KyoAni is the first choice, ahead of both Production I.G and Madhouse.
*   **A-1 Pictures** also looks well-rounded, with average scores above 7 in every genre. The flip side is that the studio does not truly stand out in any single genre.

**2. Studio Tiering by Genre**
The other major studios show clear specialization:

| Studio | Strongest genres (dark purple) |
| :--- | :--- |
| **Bones** | Action (**7.42**), Fantasy (**7.53**) |
| **Shaft** | Mystery (**8.07**), Supernatural (**7.60**) |
| **TMS Entertainment** | Supernatural (**7.57**), Romance (**7.46**), Drama (**7.40**) |
| **Production I.G** | Sports (**7.59**), Mystery (**7.40**) |
| **Madhouse** | Sports (**7.57**) |

*   **Actionable Insight:**
    *   A Producer planning an **Action** title should prioritize **Bones** (specialist), **KyoAni** (top overall quality), or **Madhouse**.
    *   A Producer planning a **Drama/Romance** title should choose **TMS Entertainment** or **KyoAni**.
    *   **Pitfall:** Avoid assigning *Slice of Life* projects to **Bones** and *Sci-Fi/Fantasy* projects to **Pierrot** (low scores of 6.51 - 7.13), since these genres are not their strengths.
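
A tiering table like the one above can also be derived programmatically rather than read off the heatmap. The sketch below assumes a studio-by-genre matrix of mean scores shaped like `heatmap_data`. The matrix here is toy data with made-up studio names and numbers.

```python
import numpy as np
import pandas as pd

# Toy studio x genre matrix of mean scores; the numbers are made up for illustration.
toy_matrix = pd.DataFrame(
    {"Action": [7.4, 6.9, np.nan], "Mystery": [7.0, 8.1, 6.5]},
    index=["Studio A", "Studio B", "Studio C"],
)

# For each genre (column), find the studio (row label) with the highest mean score.
# NaN cells (no titles for that pair) are skipped by default.
best_studio = toy_matrix.idxmax(axis=0)
best_score = toy_matrix.max(axis=0)

summary = pd.DataFrame({"Best studio": best_studio, "Mean score": best_score})
print(summary)
```

Transposing the matrix first (`toy_matrix.T.idxmax(axis=0)`) would answer the complementary question: which genre is each studio's strongest.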

**3. The Quality Floor (Baseline Risk)**
*   **Observation:** Studios such as **Tatsunoko Production, AIC, and OLM** have many light-purple cells (average scores around 6.50 - 6.80) or white cells (missing data or too few projects).
*   **Insight:** These studios are safe choices for **mid-budget** or **speculative** projects, but we should not expect them to deliver blockbusters. Their weakness is a lack of consistency.
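
Because a cell mean can rest on very few titles, one option is to blank out cells with too little support before reading the heatmap. The sketch below is not part of the notebook's pipeline. The threshold `min_titles` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy long-format data: one row per (studio, genre, title score); values are illustrative.
toy = pd.DataFrame({
    "Studio": ["A", "A", "A", "B", "B", "B"],
    "Genre":  ["Action", "Action", "Drama", "Action", "Drama", "Drama"],
    "Score":  [7.0, 8.0, 9.0, 6.0, 6.5, 7.5],
})

min_titles = 2  # hypothetical threshold, not a value used in the notebook

# Mean score and number of titles for every studio-genre pair.
stats = toy.groupby(["Studio", "Genre"])["Score"].agg(["mean", "count"])

# Keep the mean only where enough titles support it; other cells become NaN
# and would render as blank cells in a heatmap.
reliable_mean = stats["mean"].where(stats["count"] >= min_titles).unstack()
print(reliable_mean)
```

With this guard, a single highly rated title (such as Studio A's lone Drama entry in the toy data) no longer appears as a studio strength.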



In [37]:

# --- 1. DATA PREPARATION ---

# Safe list-parsing helper (same as above)
def parse_list_field(x):
    if pd.isna(x): return []
    if isinstance(x, list): return [str(s).strip() for s in x]
    try:
        val = ast.literal_eval(x)
        if isinstance(val, (list, tuple)): return [str(s).strip() for s in val]
    except Exception: pass
    return [s.strip() for s in str(x).split(',') if s.strip()]

# Build the working dataframe
df_matrix = df_anime_dataset_2023_prep[['anime_id', 'Studios', 'Type', 'Score']].copy()
df_matrix['Score'] = pd.to_numeric(df_matrix['Score'], errors='coerce')
df_matrix = df_matrix.dropna(subset=['Studios', 'Type', 'Score'])

# Parse Studios list
df_matrix['Studios_list'] = df_matrix['Studios'].apply(parse_list_field)

# Drop empty/UNKNOWN/NONE lists after parsing
def filter_empty_or_unknown(lst):
    cleaned_list = [s.strip().upper() for s in lst if s.strip()]
    return len(cleaned_list) > 0 and 'UNKNOWN' not in cleaned_list and 'NONE' not in cleaned_list

df_matrix = df_matrix[df_matrix['Studios_list'].apply(filter_empty_or_unknown)]


# --- STEP 1: FILTER TOP 20 STUDIOS (Market Leaders) ---
df_studios = df_matrix.explode('Studios_list')
df_studios['Studio'] = df_studios['Studios_list'].astype(str).str.strip()

# Take the Top 20 studios by number of anime produced
top_20_studio_names = df_studios['Studio'].value_counts().head(20).index.tolist()
df_final = df_studios[df_studios['Studio'].isin(top_20_studio_names)].copy()


# --- STEP 2: BUILD THE SCORE MATRIX (PIVOT TABLE) ---
# Mean score for each studio-type pair
heatmap_data = df_final.groupby(['Studio', 'Type'])['Score'].mean().unstack()

# Order the columns logically (TV, Movie, OVA, ...)
type_order = ['TV', 'Movie', 'OVA', 'Special', 'ONA', 'Music']
# Keep only the Type columns present in the data
type_columns = [t for t in type_order if t in heatmap_data.columns]

# Reindex for a tidy ordering
heatmap_data = heatmap_data.reindex(index=top_20_studio_names, columns=type_columns)

# --- 2. VISUALIZATION (HEATMAP) ---
custom_colorscale = [
    [0.0, "#F3E5F5"], 
    [0.6, "#9575CD"], 
    [1.0, "#4A235A"]  
]

fig = px.imshow(
    heatmap_data,
    labels=dict(x="Anime Type", y="Studio", color="Average Score"),
    x=heatmap_data.columns,
    y=heatmap_data.index,
    aspect="auto", # Proportions work well because there are only a few types
    color_continuous_scale=custom_colorscale,
    text_auto=".2f",
    title="<b>Studio Specialization: Performance by Anime Type</b>"
)

# --- 3. STYLING ---
fig.update_layout(
    height=800, 
    width=1000, 
    
    font=dict(family='Inter', size=12, color="#2c3e50"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    title=dict(
        text="<b>Top 20 Studios Specialization: Performance by Anime Type</b>",
        y=0.95,
        x=0.5,
        xanchor='center',
        yanchor='top',
        pad=dict(b=30)
    ),
    
    margin=dict(t=100, l=150, r=50, b=50),
    
    coloraxis_colorbar=dict(
        title="Average Score", 
        title_font=dict(family="Inter", size=11),
        tickfont=dict(family="Inter", size=10),
        thickness=10, 
        len=0.5,
        outlinecolor="white", 
        outlinewidth=0
    )
)

# Move the x axis to the top (matrix style)
fig.update_xaxes(
    side="top", 
    tickfont=dict(family="Inter", size=13), 
    tickangle=0, # No rotation needed because type names are short
    title="Anime Type"
)

fig.update_yaxes(
    tickfont=dict(family="Inter", size=11), 
    title="Studio",
    ticksuffix="  "
)

# Add gaps between cells
fig.update_traces(xgap=1, ygap=1)

fig.show()

<!-- **Conflict**: possibly only errors caused by missing values -->


# 4.Deep Insight