# 1.Context
The team is acting in the role of Producer. Our goal is to develop a release strategy for anime that maximizes the chance of success. The team wants to leverage the current dataset to extract useful insights and build a machine learning model to quantify the impact of various factors on the success of an anime title.

In [269]:
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
import os
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import kagglehub
import ast

In [270]:
# Download latest version
path = kagglehub.dataset_download("dbdmobile/myanimelist-dataset")

print("Path to dataset files:", path)

# List all files in the dataset directory
folder = os.listdir(path)
print("Files in dataset directory:", folder)
for file in folder:
    print(file)

Path to dataset files: C:\Users\ChooChoo\.cache\kagglehub\datasets\dbdmobile\myanimelist-dataset\versions\5
Files in dataset directory: ['anime-dataset-2023.csv', 'anime-filtered.csv', 'final_animedataset.csv', 'user-filtered.csv', 'users-details-2023.csv', 'users-score-2023.csv']
anime-dataset-2023.csv
anime-filtered.csv
final_animedataset.csv
user-filtered.csv
users-details-2023.csv
users-score-2023.csv


In [271]:
current_dir = Path.cwd()

project_root = current_dir.parent 

raw_data_path = project_root / "data" / "raw" / "anime-dataset-2023.csv"
processed_data_path = project_root / "data" / "processed" / "prepared_data.csv"

df_anime_dataset_2023 = pd.read_csv(raw_data_path)
df_anime_dataset_2023_prep = pd.read_csv(processed_data_path)

In [272]:
pio.templates.default = "plotly_white"

In [273]:
csv_file2 = df_anime_dataset_2023_prep

In [274]:
print(df_anime_dataset_2023.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   anime_id      24905 non-null  int64 
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popu

In [275]:
print(df_anime_dataset_2023_prep.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   anime_id          24905 non-null  int64  
 1   Name              24905 non-null  object 
 2   Score             15692 non-null  float64
 3   Genres            19976 non-null  object 
 4   Type              24831 non-null  object 
 5   Episodes          24294 non-null  float64
 6   Status            24905 non-null  object 
 7   Producers         11555 non-null  object 
 8   Studios           14379 non-null  object 
 9   Source            24905 non-null  object 
 10  Rating            24236 non-null  object 
 11  Rank              20293 non-null  float64
 12  Popularity        24718 non-null  float64
 13  Favorites         24905 non-null  int64  
 14  Scored By         15692 non-null  float64
 15  Members           24905 non-null  int64  
 16  Aired Date Start  20090 non-null  object

# 2.Data Frame Introduction

1. Basic Identification & Description
- **anime_id**: Unique ID for each anime.
- **Name**: Name of the anime in its original language.
- **English name**: English name of the anime.
- **Other name**: Native name or alternative title of the anime.
- **Synopsis**: Short description or summary of the anime's plot.
- **Genres**: Genres of the anime, separated by commas.
- **Image URL**: URL of the anime's image or poster.

2. Production & Technical Details
- **Type**: Type of the anime.
- **Source**: Source material of the anime.
- **Producers**: Production companies or producers of the anime.
- **Studios**: Animation studios that made the anime.
- **Licensors**: Licensors of the anime.
- **Episodes**: Number of episodes in the anime.
- **Duration**: Duration of each episode.

3. Release & Airing Information
- **Aired**: Dates on which the anime aired.
- **Premiered**: Season and year in which the anime premiered.
- **Status**: Status of the anime.

4. Audience Engagement & Performance Metrics
- **Score**: Score given to the anime.
- **Rating**: Age rating of the anime.
- **Rank**: Rank of the anime based on popularity or other criteria.
- **Popularity**: Popularity rank of the anime.
- **Favorites**: Number of times users marked the anime as a favorite.
- **Scored By**: Number of users who scored the anime.
- **Members**: Number of members who added the anime to their list on the platform.

*Unnecessary Feature*
- Drop the features that are not needed from the dataframe.

In [276]:
df_anime_dataset_2023 = df_anime_dataset_2023.drop(columns=['English name', 'Other name', 'Synopsis', 'Image URL'])

# 3.Data Insight

## 3.1.Insight Question 1:
Understanding the **'Score'** feature is paramount for anyone aspiring to create highly-rated anime or to analyze the factors contributing to critical and audience success. A deep dive into how anime are scored, the typical distribution of these scores, and what values are most common can provide invaluable preliminary insights into audience preferences and industry trends. However, drawing accurate conclusions hinges on the quality of our data. For our initial exploration, we will directly examine the raw 'Score' column, aiming to uncover its inherent characteristics and, more importantly, to identify where the unprocessed data might lead us astray, thereby underscoring the critical role of data preprocessing.

### 3.1.1.Misleading Insight
- What is the distribution of anime scores within our dataset?

In [277]:
# Check whether the 'Score' column exists in the DataFrame
if 'Score' in df_anime_dataset_2023.columns:
    try:
        # Plot the distribution
        fig = px.histogram(
            df_anime_dataset_2023,
            x='Score',
            title='Anime Score Distribution',
            labels={'Score': 'Score'}
        )

        # Find the mode (the most frequent value)
        mode_value = df_anime_dataset_2023['Score'].mode()[0]
        mode_count = df_anime_dataset_2023[df_anime_dataset_2023['Score'] == mode_value].shape[0]
        fig.add_annotation(
            x=mode_value,
            y=mode_count,
            text=f"Mode: {mode_value}",
            showarrow=True,
            arrowhead=2,
            ax=100,  # Offset the arrow tail and label to the right (positive value)
            ay=-50,  # Offset the arrow tail and label upward (negative value)
            font=dict(color="#C3122F", size=17),  # Red label text
            arrowcolor="black"  # Black arrow
        )
        fig.update_layout(
            xaxis=dict(
                showticklabels=False,
                title=""
            )
        )
        # Show the chart
        fig.show()
    except Exception as e:
        print("Cannot draw this chart because the data is not suitable:", str(e))
else:
    print("Cannot draw this chart because the 'Score' column does not exist in the DataFrame.")

# Count how often each value appears in the 'Score' column
score_counts = df_anime_dataset_2023['Score'].value_counts().reset_index()

# Rename the columns for readability
score_counts.columns = ['Score', 'Frequency']

# Keep the 15 most frequent values
top_15_scores = score_counts.head(15)

# Find the highest frequency (top 1)
max_frequency = top_15_scores['Frequency'].max()

# Build the color list: red (#C3122F) for the top value, purple (#7A0177) for the rest
colors = ['#C3122F' if freq == max_frequency else '#7A0177' 
          for freq in top_15_scores['Frequency']]

# Draw a horizontal bar chart
fig = px.bar(
    top_15_scores,
    x='Frequency',
    y='Score',
    orientation='h',
    title='Frequency of the 15 Most Common Score Values',
    labels={'Score': 'Score Value', 'Frequency': 'Number of Anime'},
    text='Frequency'
)

# Customize the chart appearance
fig.update_traces(
    textposition='outside',
    marker_color=colors,  # Apply the color list
    marker_line_color='white',
    marker_line_width=1.2
)

fig.update_layout(
    yaxis=dict(categoryorder='total ascending'),
    bargap=0.3
)

fig.show()

**Conflict**: The initial insights derived from the 'Score' feature, as presented in Q1 (the histogram) and Q2 (the bar chart of the most frequent values), are skewed and misleading because of two data integrity issues. Without preprocessing, the visualizations do not give an accurate picture of anime scores.
- Error 1: Mismatched Data Type - 'Score' as an Object (String) instead of Numeric.
  - Impact on Q1 (Histogram): Because the 'Score' column has the generic object dtype, Plotly cannot interpret its values as continuous numerical data. It treats each unique string (e.g., '8.1', '7.5', '6') as a distinct category, so the histogram shows one bar per distinct score string instead of numeric bins. The result looks fragmented and misrepresents the continuous nature of scores.
  - Impact on Q2 (Bar Chart of Unique Values): A bar chart of unique values can still be generated, but the underlying issue persists. The chart correctly shows the frequency of each unique string, yet it does not allow numerical analysis or sorting and grouping by the true score value. It reinforces the idea that these are distinct categories rather than points on a numerical scale.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
  - Impact on Q1 (Histogram): The string 'UNKNOWN' in the 'Score' column further corrupts any attempt at numerical analysis. If a numeric conversion is attempted without prior cleaning, 'UNKNOWN' values either raise an error or are coerced to missing values, depending on the conversion settings, which can leave an incomplete dataset for the plot. If the data is treated categorically, 'UNKNOWN' appears as a prominent bar in the distribution, falsely suggesting it is a valid "score" and dominating the view, which obscures the actual score distribution.
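
The two errors above can be illustrated on a small toy column. This is a minimal sketch, not the notebook's actual preprocessing: the values in `raw_scores` are made up for illustration, and `pd.to_numeric(..., errors="coerce")` is one possible way to turn the placeholder into a standard missing value.

```python
import pandas as pd

# Toy stand-in for the raw 'Score' column: numeric strings plus the 'UNKNOWN' placeholder.
# These values are illustrative only, not taken from the dataset.
raw_scores = pd.Series(["8.5", "7.5", "UNKNOWN", "6.5", "UNKNOWN"], name="Score")

# On the raw strings, the most frequent "score" is the placeholder itself.
raw_mode = raw_scores.mode()[0]

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
clean_scores = pd.to_numeric(raw_scores, errors="coerce")

print(raw_mode)                   # UNKNOWN
print(clean_scores.dtype)         # float64
print(clean_scores.isna().sum())  # 2
print(clean_scores.mean())        # 7.5
```

Once the column is numeric, the mean and a binned histogram become meaningful, and the placeholder no longer competes with real scores for the mode.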

### 3.1.2.Correct Insight

In [278]:
# Check whether the 'Score' column exists in the DataFrame
if 'Score' in df_anime_dataset_2023_prep.columns:
    try:
        # Plot the distribution
        fig = px.histogram(
            df_anime_dataset_2023_prep,
            x='Score',
            title='Anime Score Distribution',
            labels={'Score': 'Score'}
        )
        # Show the chart
        fig.show()
    except Exception as e:
        print("Cannot draw this chart because the data is not suitable:", str(e))
else:
    print("Cannot draw this chart because the 'Score' column does not exist in the DataFrame.")

# Count how often each value appears in the 'Score' column
score_counts = df_anime_dataset_2023_prep['Score'].value_counts().reset_index()

# Rename the columns for readability
score_counts.columns = ['Score', 'Frequency']


## 3.2.Insight Question 2:
Understanding the different types of anime available in the dataset—such as TV series, movies, OVAs, or specials—is fundamental for grasping the landscape of anime production and consumption. This feature provides a high-level categorization that can influence viewership, production cycles, and audience expectations. Our initial exploration of the **Type** column will aim to quantify the prevalence of each type. However, similar to the 'Score' feature, we will first analyze the raw, unprocessed data to highlight how subtle inconsistencies or non-standard entries can distort our understanding and necessitate robust data cleaning.

### 3.2.1.Misleading Insight
- What is the proportional distribution of different anime types within our dataset?

In [279]:
# Recalculate type_counts to include 'UNKNOWN' if present in the data
type_counts_full = (
    df_anime_dataset_2023['Type']
    .value_counts(dropna=False)
    .reset_index()
)
type_counts_full.columns = ['Type', 'Count']

# If there are missing values (NaN), replace them with 'UNKNOWN' for clarity
type_counts_full['Type'] = type_counts_full['Type'].fillna('UNKNOWN')

# Calculate percentage
type_counts_full['Percentage'] = (type_counts_full['Count'] / type_counts_full['Count'].sum()) * 100

# Find the type with the smallest percentage
min_idx_full = type_counts_full['Percentage'].idxmin()
min_type_full = type_counts_full.loc[min_idx_full, 'Type']
min_percentage_full = type_counts_full.loc[min_idx_full, 'Percentage']

# Create gradient color scheme
colors = ['#8B5CF6' if x != 'UNKNOWN' else '#CBD5E1' for x in type_counts_full['Type']]

fig = px.bar(
    type_counts_full,
    x='Type',
    y='Count',
    title='Distribution of Anime Types (Including UNKNOWN)',
    labels={'Type': 'Anime Type', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale=['#DDD6FE', '#8B5CF6', '#6D28D9']
)

fig.update_traces(
    marker_line_color='white',
    marker_line_width=2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>%{x}</b><br>Count: %{y:,}<br>Percentage: %{customdata:.2f}%<extra></extra>',
    customdata=type_counts_full['Percentage']
)

fig.add_annotation(
    x=min_type_full,
    y=type_counts_full.loc[min_idx_full, 'Count'],
    text=f"<b>Smallest</b><br>{min_type_full}<br>{min_percentage_full:.2f}%",
    showarrow=True,
    arrowhead=2,
    arrowsize=1,
    arrowwidth=3,
    arrowcolor="#EF4444",
    ax=70,
    ay=-90,
    font=dict(size=12, color="#DC2626", family="Inter", weight="bold"),
    align="center",
    bgcolor="rgba(254, 242, 242, 0.95)",
    bordercolor="#EF4444",
    borderwidth=2.5,
    borderpad=8,
    opacity=1
)

fig.update_layout(
    font=dict(family="Inter", size=13, color="#1F2937"),
    title={
        'text': '<b>Distribution of Anime Types</b><br><sup>Including UNKNOWN values</sup>',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, color="#111827")
    },
    showlegend=False,
    coloraxis_showscale=False,
    height=650,
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    margin=dict(t=100, l=60, r=220, b=80),
    xaxis=dict(
        showgrid=False,
        showline=False,
        linecolor='#E5E7EB',
        tickfont=dict(size=12, color="#4B5563")
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        tickfont=dict(size=11, color="#6B7280"),
        tickformat=','
    )
)

fig.show()

**Conflict**: The primary conflict distorting our initial insights for the 'Type' feature stems from the non-standard representation of missing values.
- Error: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
    - Impact on Insights (Bar Chart): Instead of being treated as a missing entry that visualization tools typically ignore or handle separately, 'UNKNOWN' is interpreted by Plotly as a legitimate, distinct category within the 'Type' feature. This leads to 'UNKNOWN' appearing as a bar in the bar chart (or as a segment in a pie chart), falsely suggesting it is an actual anime type. Consequently, the true distribution and prevalence of valid anime types are misrepresented, as 'UNKNOWN' consumes a portion of the visual space and frequency count that should belong only to actual categorical types.

*Note*: *For categorical data like **Type**, even if its dtype is object (which typically means it contains strings), Plotly will generally interpret and plot it correctly as categories. When creating bar charts or pie charts with a column of object dtype, Plotly treats each unique string value as a distinct category. The issues typically arise when an object column is supposed to be numeric but isn't (like 'Score'), or when it contains non-standard representations of missing values or erroneous strings that get treated as valid categories (like 'UNKNOWN' in 'Type' or in 'Score'). So, for Type, the object dtype itself is not the core problem for displaying categories, but the content of those categories (e.g., 'UNKNOWN') is.*
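
To make the effect of the placeholder concrete, here is a minimal sketch on a toy column. The values in `raw_types` are invented for illustration, and `Series.mask` is only one of several ways to map the placeholder to a missing value; it is not necessarily how the prepared file was produced.

```python
import pandas as pd

# Toy stand-in for the raw 'Type' column; the values are illustrative only.
raw_types = pd.Series(["TV", "Movie", "UNKNOWN", "TV", "OVA", "UNKNOWN"], name="Type")

# Raw counts treat the placeholder as a real category.
raw_counts = raw_types.value_counts()

# mask() replaces the matching entries with a missing value,
# which value_counts() then skips by default (dropna=True).
clean_types = raw_types.mask(raw_types == "UNKNOWN")
clean_counts = clean_types.value_counts()

print("UNKNOWN" in raw_counts.index)    # True
print("UNKNOWN" in clean_counts.index)  # False
print(clean_counts.sum())               # 4
```

With the placeholder mapped to a missing value, percentages computed from `clean_counts` are shares of titles with a known type only.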


### 3.2.2.Correct Insight

In [280]:
# Recalculate type_counts, excluding UNKNOWN/NaN values
type_counts_full = (
    df_anime_dataset_2023_prep['Type']
    .value_counts(dropna=True)  # dropna=True drops NaN values
    .reset_index()
)
type_counts_full.columns = ['Type', 'Count']

# Filter out 'UNKNOWN' if it exists; .copy() avoids a SettingWithCopyWarning when adding columns below
type_counts_full = type_counts_full[type_counts_full['Type'].str.upper() != 'UNKNOWN'].copy()

# Calculate percentage
type_counts_full['Percentage'] = (type_counts_full['Count'] / type_counts_full['Count'].sum()) * 100

# Find the type with the smallest percentage
min_idx_full = type_counts_full['Percentage'].idxmin()
min_type_full = type_counts_full.loc[min_idx_full, 'Type']
min_percentage_full = type_counts_full.loc[min_idx_full, 'Percentage']

fig = px.bar(
    type_counts_full,
    x='Type',
    y='Count',
    title='Distribution of Anime Types',
    labels={'Type': 'Anime Type', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale=['#DDD6FE', '#8B5CF6', '#6D28D9']
)

fig.update_traces(
    marker_line_color='white',
    marker_line_width=2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>%{x}</b><br>Count: %{y:,}<br>Percentage: %{customdata:.2f}%<extra></extra>',
    customdata=type_counts_full['Percentage']
)

fig.add_annotation(
    x=min_type_full,
    y=type_counts_full.loc[min_idx_full, 'Count'],
    text=f"<b>Smallest</b><br>{min_type_full}<br>{min_percentage_full:.2f}%",
    showarrow=True,
    arrowhead=2,
    arrowsize=1,
    arrowwidth=3,
    arrowcolor="#EF4444",
    ax=70,
    ay=-90,
    font=dict(size=12, color="#DC2626", family="Inter", weight="bold"),
    align="center",
    bgcolor="rgba(254, 242, 242, 0.95)",
    bordercolor="#EF4444",
    borderwidth=2.5,
    borderpad=8,
    opacity=1
)

fig.update_layout(
    font=dict(family="Inter", size=13, color="#1F2937"),
    title={
        'text': '<b>Distribution of Anime Types</b><br><sup>Cleaned Data</sup>',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, color="#111827")
    },
    showlegend=False,
    coloraxis_showscale=False,
    height=650,
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    margin=dict(t=100, l=60, r=220, b=80),
    xaxis=dict(
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB',
        tickfont=dict(size=12, color="#4B5563")
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        tickfont=dict(size=11, color="#6B7280"),
        tickformat=','
    )
)

fig.show()

**Deep Insight**

- What is the average score for each anime Type?

In [281]:
# import pandas as pd
# import plotly.express as px
# import plotly.io as pio

# pio.templates.default = "plotly_white"

# # Convert Score to numeric
# df_anime_dataset_2023["Score"] = pd.to_numeric(
#     df_anime_dataset_2023["Score"],
#     errors="coerce"
# )

# # Average score by Type
# average_score_by_type = (
#     df_anime_dataset_2023
#     .groupby("Type")["Score"]
#     .mean()
#     .reset_index()
# )
# average_score_by_type.columns = ["Type", "Average Score"]

# base_color = "#C6A3FF"

# fig = px.bar(
#     average_score_by_type,
#     x="Type",
#     y="Average Score",
#     title="Average Score by Anime Type",
#     labels={"Type": "Anime Type", "Average Score": "Average Score"},
#     text="Average Score",
#     category_orders={"Type": sorted(average_score_by_type["Type"].unique())}
# )

# fig.update_traces(
#     marker_color=base_color,
#     marker_line_color="white",
#     marker_line_width=1.2,
#     textposition="outside",
#     texttemplate="%{text:.2f}",
#     hovertemplate="<b>Type:</b> %{x}<br><b>Average Score:</b> %{y:.2f}<extra></extra>",
# )

# fig.update_xaxes(
#     showgrid=False,
#     zeroline=False,
#     tickangle=-20
# )

# fig.update_yaxes(
#     showgrid=False,
#     gridcolor="#EEEEEE",
#     zeroline=False,
#     tickformat=".1f"
# )

# fig.update_layout(
#     title={"text": "Average Score by Anime Type", "x": 0.5, "xanchor": "center"},
#     bargap=0.3,
#     height=550,
#     font=dict(family="Inter", size=13),
#     title_font=dict(size=20),
#     margin=dict(t=80, l=60, r=40, b=80),
#     hoverlabel=dict(bgcolor="white", bordercolor="#888", font_size=12),
# )

# fig.show()


In [282]:
print(df_anime_dataset_2023_prep.shape)

(24905, 21)


## 3.3.Insight Question 3
The **Genres** feature is incredibly rich, offering deep insights into the thematic elements, target audience, and overall feel of an anime. Understanding the prevalence of different genres is crucial for identifying trends, audience preferences, and even predicting the potential success of new titles. However, unlike **Score** or **Type** which typically hold a single value, **Genres** often contains multiple categories listed together for a single anime. Our initial exploration will examine this raw, combined genre data to understand its direct distribution, while simultaneously setting the stage to reveal how such an aggregated format can significantly complicate accurate analysis and lead to misleading conclusions if not properly preprocessed.

### 3.3.1.Misleading Insight:
- What are the most frequently appearing genre combinations or genre strings as they are originally listed in the dataset?

In [283]:
# 1. Data Processing
genre_counts = df_anime_dataset_2023['Genres'].value_counts().head(15).reset_index()
genre_counts.columns = ['Genres', 'Count']

# Sort by Count ascending so the largest bar appears at the top
genre_counts = genre_counts.sort_values('Count', ascending=True)

# --- BOLD TEXT LOGIC ---
# 1. It is a combination (contains ',')
# 2. OR the name is "UNKNOWN"
genre_counts['Display_Name'] = genre_counts['Genres'].apply(
    lambda x: f"<b>{x}</b>" if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else x
)

# --- COLOR LOGIC ---
# Highlight in RED if:
# 1. It is a combination (contains ',')
# 2. OR the name is "UNKNOWN"
# Otherwise, use Light Pink
red_color = '#A82516'
pink_color = '#FDE0DD'

colors = [
    red_color if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else pink_color
    for x in genre_counts['Genres']
]

# 2. Create Horizontal Bar Chart
fig = px.bar(
    genre_counts,
    x='Count',
    y='Display_Name',  # Use the bolded display names
    orientation='h',
    title='Top 15 Most Popular Anime Genre Combinations (Original Data)',
    labels={'Display_Name': 'Genre Combination', 'Count': 'Number of Anime'},
    text='Count',
    # Pass original name to custom_data for clean tooltips
    custom_data=['Genres']
)

# 3. Visual Refinements (Traces)
fig.update_traces(
    marker_color=colors,           # Apply the color logic
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    # Use customdata[0] to show the original name in the tooltip
    hovertemplate='<b>Genres:</b> %{customdata[0]}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 4. Layout & Axis Refinements
fig.update_xaxes(
    showgrid=False,
    zeroline=False,
    tickformat=',',
    dtick=1000,
    tick0=0,
    range=[0, 5000] 
)

fig.update_yaxes(
    showgrid=False,
    zeroline=False,
    title=None # Hide Y-axis title for cleaner look
)

fig.update_layout(
    title={
        'text': '<b>Top 15 Most Popular Anime Genre Combinations (Raw Data)</b>', 
        'x': 0.5, 
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', color='#1F2937')
    },
    bargap=0.2,
    height=600,
    font=dict(family='Inter', size=13),
    title_font=dict(size=20),
    plot_bgcolor='white',
    paper_bgcolor='white',
    margin=dict(t=80, l=250, r=40, b=80),
    hoverlabel=dict(bgcolor='white', bordercolor='#888', font_size=12)
)

fig.show()

**Conflict**: The initial insight derived from the **Genres** feature (identifying top genre combinations) is severely flawed and misleading due to two significant data integrity issues. Without proper preprocessing, the visualization fails to reveal the true popularity of individual genres or accurate genre groupings.
- Error 1: Aggregated Genre Strings (Multiple Genres in One Entry).
  - Impact on Insight (Top N Genre Combinations Bar Chart): The most critical issue is that the **Genres** column often contains multiple genres concatenated into a single string (e.g., "Action, Comedy, Shounen"). When we count the frequency of these strings, we are actually counting combinations of genres, not individual genres themselves. This completely obscures the true popularity of any single genre. For example, "Action" might be present in many combinations, but its individual popularity isn't reflected. The bar chart will show unique, often lengthy, genre combinations as categories, making it impossible to ascertain which core genres are truly prevalent across the dataset. We are getting insight into bundles of genres, rather than the constituent elements.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
  - Impact on Insight (Top N Genre Combinations Bar Chart): Similar to the 'Type' feature, the presence of the string 'UNKNOWN' for missing genre information is treated as a valid and distinct genre combination. This causes 'UNKNOWN' to often rank highly among the "top genre combinations" in the bar chart. Its inclusion as a prominent category inflates its perceived importance, distorts the true ranking of actual genre combinations, and detracts from focusing on meaningful genre data. A correctly handled missing value (standard NaN) would typically be excluded from such frequency counts by default, providing a cleaner and more accurate insight into actual genre prevalence.
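
The difference between counting genre strings and counting individual genres can be shown on a toy column. This is a minimal sketch assuming comma-separated genre strings as in the raw file; the values in `raw_genres` and the `toy_*` names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the raw 'Genres' column (comma-separated strings, illustrative values).
raw_genres = pd.Series(
    ["Action, Comedy", "Action, Drama", "Comedy", "UNKNOWN", "Action, Comedy"],
    name="Genres",
)

# Counting raw strings ranks combinations, not individual genres.
toy_combo_counts = raw_genres.value_counts()

# Drop the placeholder, split on commas, and give each genre its own row.
toy_individual = (
    raw_genres[raw_genres != "UNKNOWN"]
    .str.split(",")
    .explode()
    .str.strip()
)
toy_genre_counts = toy_individual.value_counts()

print(toy_combo_counts.index[0])   # Action, Comedy
print(toy_genre_counts["Action"])  # 3
print(toy_genre_counts["Comedy"])  # 3
```

In the combination view "Action, Comedy" leads with two titles, while the exploded view shows that "Action" and "Comedy" each appear in three titles.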

### 3.3.2.Correct Insight


In [284]:
# --- 1. DATA PROCESSING (DEEP CLEANING) ---
# Create a copy and ensure string format
df_genres_exploded = df_anime_dataset_2023_prep[['Genres']].copy()
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].astype(str)

# REGEX CLEANING: Remove brackets and quotes
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE: Separate genres and create individual rows
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.split(',')
df_genres_exploded = df_genres_exploded.explode('Genres')

# TRIM: Remove leading/trailing whitespace
df_genres_exploded['Genres'] = df_genres_exploded['Genres'].str.strip()

# FILTER: Remove garbage values
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_genres_exploded = df_genres_exploded[~df_genres_exploded['Genres'].str.upper().isin(exclude_list)]
df_genres_exploded = df_genres_exploded[df_genres_exploded['Genres'] != ""]

# --- 2. AGGREGATION ---
# Count occurrences and take Top 15
genre_counts = df_genres_exploded['Genres'].value_counts().head(15).reset_index()
genre_counts.columns = ['Genres', 'Count']
genre_counts = genre_counts.sort_values('Count', ascending=True)

# --- 3. VISUALIZATION ---
fig = px.bar(
    genre_counts,
    x='Count',
    y='Genres',
    orientation='h',
    text='Count',
    color='Count',
    # Pink to Red gradient (Same theme as Producers chart)
    color_continuous_scale=['#FCC5C0', '#C3122F'] 
)

# --- 4. VISUAL REFINEMENTS ---
fig.update_traces(
    texttemplate='%{text:,}',
    textposition='outside',
    hovertemplate='<b>Genre:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

fig.update_layout(
    template='plotly_white',
    
    # Title Configuration (Font Size 22)
    title={
        'text': '<b>Top 15 Most Popular Individual Genres</b>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(size=22, family='Inter', color='#333333')
    },
    
    # Global Font Settings
    font=dict(family='Inter', size=13, color="#333333"),
    
    # Layout Dimensions & Color
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=600, 
    coloraxis_showscale=False, # Hide the color bar
    margin=dict(t=80, l=150, r=50, b=80),

    # --- X-AXIS CONFIGURATION ---
    xaxis=dict(
        title="Number of Anime",
        title_font=dict(size=14, color='#333333'),
        showticklabels=True, 
        tickfont=dict(size=14, color='#333333'),
        
        # Grid settings
        showgrid=False,        
        zeroline=False,
        
        # Tick Step Configuration
        tick0=0,     
        dtick=2000,   # Step size of 2000
        range=[0, 8000],
        tickformat=',' 
    ),

    # --- Y-AXIS CONFIGURATION ---
    yaxis=dict(
        title=None,
        tickfont=dict(size=14, color='#333333'),
        ticksuffix="  ",    
        showgrid=False,
        showline=False
    )
)

fig.show()

print(f"Shape after explode: {df_genres_exploded.shape}")

Shape after explode: (38461, 1)


**Deep Insight**:
- Display the top 10 Genres with the highest average score
- Distribution of media types (TV, Movie, OVA, etc.) in the top 5 highest-scoring Genres (recommended: pie chart)
- Average score of each media type within the top 5 Genres (recommended: stacked bar chart)

In [285]:
# Display the top 10 Genres with the highest average score
# Extract individual genres from the 'Genres' column and explode
df_genre_score = csv_file2[['Genres', 'Score']].dropna(subset=['Genres', 'Score']).copy()
df_genre_score['Genres'] = df_genre_score['Genres'].apply(lambda x: [g.strip() for g in ast.literal_eval(x)] if isinstance(x, str) and x.startswith('[') else [])
df_genre_score = df_genre_score.explode('Genres')

# Remove 'UNKNOWN' and empty genres
df_genre_score = df_genre_score[df_genre_score['Genres'].notna() & (df_genre_score['Genres'] != 'UNKNOWN') & (df_genre_score['Genres'] != '')]

# Calculate average score and count for each genre
genre_avg_score = (
    df_genre_score.groupby('Genres')
    .agg(average_score=('Score', 'mean'), anime_count=('Score', 'count'))
    .reset_index()
)

# Get top 10 genres by average score (with at least 20 anime for reliability)
top_genres = genre_avg_score[genre_avg_score['anime_count'] >= 20].sort_values('average_score', ascending=False).head(10)

# Plot with vibrant color palette
fig = px.bar(
    top_genres,
    x='Genres',
    y='average_score',
    color='average_score',
    color_continuous_scale='Turbo',
    text='average_score',
    title='Top 10 Genres with Highest Average Score',
    labels={'Genres': 'Genre', 'average_score': 'Average Score'}
)

fig.update_traces(
    texttemplate='%{text:.2f}',
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.2
)

fig.update_layout(
    font=dict(family='Inter', size=13),
    title_font=dict(size=22, color='#1F2937'),
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.25,
    height=500,
    margin=dict(t=80, l=40, r=40, b=80),
    hoverlabel=dict(bgcolor='white', bordercolor='#FF6B6B', font_size=13)
)

fig.show()

In [286]:
# Average score of each media type within the top 5 Genres (recommended: stacked bar chart)
# Top 5 genres by average score (already calculated in top_genres)
top5_genres = top_genres.head(5)['Genres'].tolist()

# Prepare dataframe: explode genres, filter top 5, and group by Type and Genre
df = csv_file2[['Genres', 'Type', 'Score']].dropna(subset=['Genres', 'Type', 'Score']).copy()
df['Genres'] = df['Genres'].apply(lambda x: [g.strip() for g in ast.literal_eval(x)] if isinstance(x, str) and x.startswith('[') else [])
df = df.explode('Genres')
df = df[df['Genres'].isin(top5_genres)]

# Remove UNKNOWN and empty types
df = df[df['Type'].notna() & (df['Type'] != 'UNKNOWN') & (df['Type'] != '')]

# Group by Genre and Type, calculate average score and count
grouped = (
    df.groupby(['Genres', 'Type'])
    .agg(average_score=('Score', 'mean'), anime_count=('Score', 'count'))
    .reset_index()
)

# Only show media types with at least 10 anime for reliability
grouped = grouped[grouped['anime_count'] >= 10]

# Keep the genres in top-5 order; bars are grouped rather than stacked,
# because stacking would add the average scores together
genre_order = top5_genres

fig = px.bar(
    grouped,
    x='Genres',
    y='average_score',
    color='Type',
    barmode='group',
    text='average_score',
    category_orders={'Genres': genre_order},
    title='Average Score of Each Media Type within Top 5 Highest-Scoring Genres',
    labels={'Genres': 'Genre', 'average_score': 'Average Score', 'Type': 'Media Type'}
)

fig.update_traces(
    texttemplate='%{text:.2f}',
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.2
)

fig.update_layout(
    font=dict(family='Inter', size=13),
    title_font=dict(size=20, color='#1F2937'),
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.18,
    height=550,
    margin=dict(t=80, l=60, r=40, b=80),
    hoverlabel=dict(bgcolor='white', bordercolor='#8B5CF6', font_size=13),
    legend_title_text='Media Type'
)

fig.show()

## 3.4.Insight Question 4
The **Duration** feature is a crucial indicator of an anime's length, directly influencing its format, target audience, and consumption patterns. Understanding typical durations is far more meaningful when considered in conjunction with the **Type** of anime (e.g., TV series, Movie, OVA). A 24-minute duration means something entirely different for a TV episode compared to a short film. Our initial exploration will therefore combine these two features, aiming to identify the various ways duration is represented within each anime type in the raw dataset. By working with this unprocessed, combined data, we intend to highlight how its current string-based format, inconsistent notations, and non-standard missing values can severely obstruct direct quantitative analysis and lead to ambiguous or uninterpretable insights, thereby demonstrating the necessity for extensive conversion and standardization.

### 3.4.1.Misleading Insight
- What are the most frequent raw duration descriptions for each anime type (TV, Movie, OVA)?

**Conflict**
The initial insight derived from the 'Duration' feature, even when viewed in conjunction with 'Type' (as a frequency of textual descriptions), is severely hampered and misleading due to multiple fundamental data quality issues. The raw state of this column makes it impossible to conduct accurate quantitative analysis or draw reliable conclusions about anime lengths.
- Error 1: Mismatched Data Type - 'Duration' as an Object (String) instead of Numeric.
    - Impact on Insight (Bar Chart of Duration Descriptions): Since the 'Duration' column is stored as an object (string) type, any attempt at numerical operations (like calculating averages, minimums, maximums, or plotting a continuous distribution) is impossible without prior conversion. While the bar chart for textual descriptions can be generated, this underlying data type issue prevents any deeper quantitative understanding. The lack of a numeric type means we cannot perform meaningful time-based comparisons or aggregations, immediately limiting the scope of our analysis to mere string counts.
- Error 2: Inconsistent String Formats and Units.
    - Impact on Insight (Bar Chart of Duration Descriptions): The 'Duration' column contains a highly inconsistent mix of string formats and units (e.g., "24 min per ep", "1 hr", "1h 30min", "3 days", "59 sec", etc.). This means that values representing the same actual duration (e.g., 60 minutes) might appear as several different unique strings (e.g., "1 hr", "60 min"). When analyzing the "most common textual descriptions," these inconsistencies lead to a fragmented and inaccurate view. The bar chart will show many distinct entries for what should be a unified duration, drastically underrepresenting the true frequency of certain anime lengths and making it impossible to identify real patterns in animation runtimes.
- Error 3: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
    - Impact on Insight (Bar Chart of Duration Descriptions): The presence of the string 'UNKNOWN' for missing duration information is treated by Plotly as a legitimate, distinct textual description of duration. Consequently, 'UNKNOWN' will often appear as a prominent category in the bar chart (especially when faceted by 'Type'), falsely implying it's a common duration rather than an indicator of missing data. This distorts the perceived distribution of actual durations for each anime type and overstates the prevalence of unrecorded information.
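
The conversion that produced the `Duration Minutes` column of the prepared dataset is not shown in this notebook. As a minimal sketch of the idea, assuming durations are written with `hr`/`h`, `min`, and `sec` units, a helper such as the hypothetical `duration_to_minutes` below could normalize the strings. The sample values are illustrative and are not taken from the dataset.

```python
import re

import numpy as np
import pandas as pd


def duration_to_minutes(text):
    """Convert a raw duration string to minutes; return NaN when nothing is recognized.

    Hypothetical helper: the unit spellings handled here are assumptions.
    """
    if not isinstance(text, str) or text.strip().upper() == "UNKNOWN":
        return np.nan
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*sec", text)
    if not (hours or minutes or seconds):
        return np.nan  # e.g. "3 days" is left as missing in this sketch
    total = 0.0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    if seconds:
        total += int(seconds.group(1)) / 60
    return total


# Illustrative strings only
sample = pd.Series(["24 min per ep", "1 hr 30 min", "1h 30min", "59 sec", "UNKNOWN"])
parsed = sample.map(duration_to_minutes)
print(parsed)
```

With a numeric column like this, averages, minimums, and histograms become possible, and placeholder strings turn into real missing values instead of a bar in the chart.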

In [287]:
# Dictionary holding the duration frequency table for each Type
duration_by_type = {}

for type_name in ['TV', 'Movie', 'OVA']:
    # Filter rows by Type
    type_df = df_anime_dataset_2023[df_anime_dataset_2023['Type'] == type_name]

    # Count how often each raw duration string occurs
    counts = type_df['Duration'].value_counts().reset_index()
    counts.columns = ['Duration', 'Count']
    counts = counts.sort_values('Count', ascending=False)

    # Store the result in the dictionary
    duration_by_type[type_name] = counts

# Color for each Type
# colors = {
#     'TV': '#8B5CF6',
#     'Movie': '#EC4899',
#     'OVA': '#10B981'
# }

colors = {
    'TV': '#49006A',
    'Movie': '#F768A1',
    'OVA': '#AE017E'
}

# Create a subplot grid with 1 row and 3 columns
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=(
        '<b>Top 15 Duration Distribution - TV</b>',
        '<b>Top 15 Duration Distribution - Movie</b>',
        '<b>Top 15 Duration Distribution - OVA</b>'
    ),
    horizontal_spacing=0.12,
    vertical_spacing=0.15
)

# Draw one panel per Type
for idx, type_name in enumerate(['TV', 'Movie', 'OVA'], start=1):
    data = duration_by_type[type_name].head(15)

    # Build the color list: red for UNKNOWN, the Type's base color otherwise
    bar_colors = []
    for duration in data['Duration']:
        if str(duration).upper().strip() == 'UNKNOWN':
            bar_colors.append('#C3122F')  # Red for UNKNOWN
        else:
            bar_colors.append(colors[type_name])  # Base color for all other values

    fig.add_trace(
        go.Bar(
            x=data['Count'],
            y=data['Duration'],
            orientation='h',
            text=data['Count'],
            texttemplate='%{text:,}',
            textposition='auto',
            marker=dict(
                color=bar_colors,  # Use the color list built above
                line=dict(color='white', width=1.2)
            ),
            hovertemplate='<b>Duration:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
            showlegend=False
        ),
        row=1, col=idx
    )

# Update the overall layout
fig.update_layout(
    title={
        'text': '<b>Duration Distribution by Type</b>',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=20, color='#1F2937', family='Inter')
    },
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=500,
    width=1400,
    margin=dict(t=80, l=180, r=50, b=60)
)

# Update the axes
fig.update_xaxes(
    showgrid=True,
    gridcolor='#F3F4F6',
    tickformat=','
)

fig.update_yaxes(
    showgrid=False,
    ticksuffix='  ',
    categoryorder='total ascending'
)

fig.show()

### 3.4.2.Correct Insight

In [288]:
# Keep only Type = TV, Movie, OVA
filtered_df = df_anime_dataset_2023_prep[
    df_anime_dataset_2023_prep['Type'].isin(['TV', 'Movie', 'OVA'])
].copy()

# Bin Duration into 2-minute intervals
filtered_df['Duration_grouped'] = (filtered_df['Duration Minutes'] // 2) * 2

# Create a histogram of the 2-minute bins
fig = px.histogram(
    filtered_df,
    x='Duration_grouped',
    color='Type',
    title='Distribution of Duration by Type (Grouped by 2 minutes)',
    labels={'Duration_grouped': 'Duration (minutes)', 'count': 'Number of Anime'},
    barmode='overlay',
    nbins=100,
    # color_discrete_sequence=['#8B5CF6', '#EC4899', '#10B981'],
    color_discrete_sequence=['#49006A', '#F768A1', '#AE017E'],
    opacity=0.7
)

# Customize the layout
fig.update_layout(
    title={
        'text': 'Distribution of Duration by Type (Grouped by 2 minutes)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=20, color='#1F2937', family='Inter', weight='bold')
    },
    xaxis=dict(
        title='Duration (minutes)',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB'
    ),
    yaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB'
    ),
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    height=600,
    margin=dict(t=80, l=60, r=40, b=80),
    legend=dict(
        title=dict(text='Type', font=dict(size=14, color='#4B5563')),
        font=dict(size=12, color='#6B7280'),
        bgcolor='rgba(255, 255, 255, 0.8)',
        bordercolor='#E5E7EB',
        borderwidth=1
    ),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=12, family='Inter', color='#1F2937')
    )
)

fig.show()

# Print summary statistics for each Type
print("\n=== Duration Statistics by Type (grouped by 2 minutes) ===")
for type_name in ['TV', 'Movie', 'OVA']:
    type_data = filtered_df[filtered_df['Type'] == type_name]['Duration_grouped']
    print(f"\n{type_name}:")
    print(f"  - Number of anime: {len(type_data)}")
    print(f"  - Average duration: {type_data.mean():.2f} minutes")
    print(f"  - Most common duration: {type_data.mode().values[0] if len(type_data.mode()) > 0 else 'N/A'} minutes")
    print(f"  - Min duration: {type_data.min():.2f} minutes")
    print(f"  - Max duration: {type_data.max():.2f} minutes")


=== Duration Statistics by Type (grouped by 2 minutes) ===

TV:
  - Number of anime: 6216
  - Average duration: 20.20 minutes
  - Most common duration: 24.0 minutes
  - Min duration: 10.00 minutes
  - Max duration: 54.00 minutes

Movie:
  - Number of anime: 2311
  - Average duration: 83.47 minutes
  - Most common duration: 90.0 minutes
  - Min duration: 40.00 minutes
  - Max duration: 168.00 minutes

OVA:
  - Number of anime: 6146
  - Average duration: 22.22 minutes
  - Most common duration: 28.0 minutes
  - Min duration: 0.00 minutes
  - Max duration: 126.00 minutes


**Deep Insight**:
- For each Type (TV, Movie, OVA, etc.), what is the average Duration of the Top 3 highest-scoring Genres?

## 3.5.Insight Question 5
The **Aired** feature provides critical temporal context, allowing us to track anime release patterns, seasonal trends, and historical production volumes. Understanding how many anime titles are released each month is fundamental for gauging industry activity and identifying peak periods. Our initial examination will focus on extracting this monthly release count directly from the raw **Aired** column. By intentionally avoiding any prior data cleaning or type conversion, we aim to immediately encounter the challenges posed by its current string format. This approach will demonstrate how unstructured date information prevents straightforward temporal aggregation, thereby highlighting the absolute necessity of robust date parsing and standardization for meaningful time-series analysis.

### 3.5.1.Misleading Insight
- Based on the raw 'Aired' column, what is the count of anime released in each distinct month (as represented in the original string format) across the entire dataset?

**Conflict**
The initial attempt to visualize the number of anime released per month from the **Aired** feature yields highly inaccurate and potentially uninterpretable results due to fundamental issues with its raw data format. The current state of the **Aired** column completely undermines any meaningful temporal analysis.
- Error 1: Mismatched Data Type - **Aired** as an Object (String) instead of Datetime.
    - Impact on Insight (Monthly Release Line Chart): Since the 'Aired' column is an object (string) type, Plotly cannot intrinsically understand these entries as dates or extract chronological months. When attempting to plot months on the x-axis, the chart will treat each unique month string (e.g., "Jan", "Feb", "Mar", but also potentially "Apr 2005", "May 1999 to Jun 1999") as a discrete categorical label. This results in an x-axis that is not chronologically ordered (it will likely be alphabetical or based on string appearance) and fails to represent a continuous timeline. Consequently, the line chart cannot accurately show trends or seasonality in anime releases, as the data points are disjointed and incorrectly positioned.
- Error 2: Inconsistent and Unstructured String Formats.
    - Impact on Insight (Monthly Release Line Chart): The **Aired** column contains a wide variety of string formats (e.g., "Jan 10, 2000", "Winter 2005", "Apr 2005 to Jun 2005", "Not available"). Extracting a consistent 'month' from these disparate strings for aggregation is impossible without prior parsing. For instance, "Jan 10, 2000" and "Jan 2005" both represent January releases but are unique strings. Furthermore, date ranges ("Apr 2005 to Jun 2005") present an ambiguity: should this count towards April, June, or all months in between? Without a standardized format, any attempt to count "anime per month" will be based on arbitrary string matching, leading to an incomplete, fragmented, and inaccurate count, drastically undercounting releases for actual months while showing counts for unparsed, full date strings.
- Error 3: Non-standard Missing Value Representation and Non-Date Entries.
    - Impact on Insight (Monthly Release Line Chart): Values like "Not available" or "Unknown" within the **Aired** column further corrupt the data. If they are not explicitly filtered out during the crude string-based month extraction, they might inadvertently be processed or cause errors. More critically, if a string-parsing method were to mistakenly extract a "month" from such entries or if they are simply ignored, the total count of anime included in the chart becomes less reliable, and the underlying issue of unaddressable missing temporal information persists, leading to an incomplete and potentially biased view of release trends.

&rArr; A line chart for the **Aired** feature cannot be drawn from the raw data.
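
The parsing that produced the `Aired Month` column of the prepared dataset is not shown here. As a rough sketch of one possible approach, the hypothetical helper `aired_start_month` below looks for a three-letter month abbreviation in the part of the string before `" to "`. Counting a date range toward its start month is an assumption of this sketch, not a rule stated in this notebook, and full month names are not handled.

```python
import re

import numpy as np
import pandas as pd

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}
MONTH_PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r")\b")


def aired_start_month(text):
    """Return the month number of the start date, or NaN if no month is present."""
    if not isinstance(text, str):
        return np.nan
    start = text.split(" to ")[0]  # keep only the start of a date range
    match = MONTH_PATTERN.search(start)
    return MONTHS[match.group(1)] if match else np.nan


# Example strings quoted in the discussion above
sample = pd.Series(["Jan 10, 2000", "Apr 2005 to Jun 2005", "Winter 2005", "Not available"])
months = sample.map(aired_start_month)
print(months)
```

Entries without a recognizable month become NaN, so they drop out of a later `groupby` instead of forming their own category.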

### 3.5.2.Correct Insight

In [289]:
# Check that the 'Aired Month' column exists in the DataFrame
if 'Aired Month' in df_anime_dataset_2023_prep.columns:
    # Group by 'Aired Month' and count the number of anime
    anime_per_month = df_anime_dataset_2023_prep.groupby('Aired Month').size().reset_index(name='Anime Count')

    # Sort in calendar-month order
    anime_per_month = anime_per_month.sort_values('Aired Month')
else:
    # Stop here: the code below needs anime_per_month to be defined
    raise KeyError("Column 'Aired Month' does not exist in the DataFrame.")

# Map numeric months to month names
anime_per_month['Month'] = anime_per_month['Aired Month'].map({
    1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'Apr', 5.0: 'May', 6.0: 'Jun',
    7.0: 'Jul', 8.0: 'Aug', 9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'
})
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Sort by month order
anime_per_month['Month'] = pd.Categorical(anime_per_month['Month'], categories=month_order, ordered=True)
anime_per_month = anime_per_month.sort_values('Month')

# Create line chart with gradient color scheme
fig = px.line(
    anime_per_month,
    x='Month',
    y='Anime Count',
    title='Number of Anime Released Per Month',
    labels={'Month': 'Month', 'Anime Count': 'Number of Anime'},
    markers=True
)

# Customize traces with modern gradient colors
fig.update_traces(
    line=dict(color='#8B5CF6', width=3),  # Vibrant purple
    marker=dict(
        size=10,
        color="#FFA3AE",  # Light pink for markers
        line=dict(color='#7C3AED', width=2)  # Darker purple border
    ),
    hovertemplate='<b>%{x}</b><br>Anime Count: %{y:,}<extra></extra>'
)

# Enhanced layout with modern styling
fig.update_layout(
    title={
        'text': 'Number of Anime Released Per Month',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, color='#1F2937', family='Inter', weight='bold')
    },
    xaxis=dict(
        title='Month',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB',
        linewidth=1
    ),
    yaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        gridwidth=1,
        showline=True,
        linecolor='#E5E7EB',
        linewidth=1,
        tickformat=','
    ),
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    height=550,
    margin=dict(t=80, l=60, r=40, b=80),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter', color='#1F2937')
    )
)

fig.show()

In [290]:
# Keep only Type = Movie
df_movie = df_anime_dataset_2023_prep[df_anime_dataset_2023_prep['Type'] == 'Movie' ].copy()

# Check that the 'Aired Month' column exists in the DataFrame
if 'Aired Month' in df_movie.columns:
    # Group by 'Aired Month' and count the number of anime
    anime_per_month = df_movie.groupby('Aired Month').size().reset_index(name='Anime Count')

    # Sort in calendar-month order
    anime_per_month = anime_per_month.sort_values('Aired Month')
else:
    # Stop here: the code below needs anime_per_month to be defined
    raise KeyError("Column 'Aired Month' does not exist in the DataFrame.")

# Map numeric months to month names
anime_per_month['Month'] = anime_per_month['Aired Month'].map({
    1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'Apr', 5.0: 'May', 6.0: 'Jun',
    7.0: 'Jul', 8.0: 'Aug', 9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'
})
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Sort by month order
anime_per_month['Month'] = pd.Categorical(anime_per_month['Month'], categories=month_order, ordered=True)
anime_per_month = anime_per_month.sort_values('Month')

# Create line chart with gradient color scheme
fig = px.line(
    anime_per_month,
    x='Month',
    y='Anime Count',
    title='Number of Movie Anime Released Per Month',
    labels={'Month': 'Month', 'Anime Count': 'Number of Movies'},
    markers=True
)

# Customize traces with modern gradient colors
fig.update_traces(
    line=dict(color='#8B5CF6', width=3),
    marker=dict(
        size=10,
        color="#FFA3AE",
        line=dict(color='#7C3AED', width=2)
    ),
    hovertemplate='<b>%{x}</b><br>Movie Count: %{y:,}<extra></extra>'
)

# Enhanced layout with modern styling
fig.update_layout(
    title={
        'text': 'Number of Movie Anime Released Per Month',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, color='#1F2937', family='Inter', weight='bold')
    },
    xaxis=dict(
        title='Month',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB',
        linewidth=1
    ),
    yaxis=dict(
        title='Number of Movies',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=12, color='#6B7280'),
        showgrid=True,
        gridcolor='#F3F4F6',
        gridwidth=1,
        showline=True,
        linecolor='#E5E7EB',
        linewidth=1,
        tickformat=','
    ),
    plot_bgcolor='white',
    paper_bgcolor='#FAFAFA',
    height=550,
    margin=dict(t=80, l=60, r=40, b=80),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter', color='#1F2937')
    )
)

fig.show()

# Print summary statistics
print(f"\nTotal Movie Anime: {df_movie.shape[0]:,}")
print(f"Movies with Aired Month data: {anime_per_month['Anime Count'].sum():,}")


Total Movie Anime: 2,311
Movies with Aired Month data: 2,124


**Deep Insight**:
- During major occasions such as Halloween, Christmas, etc., is there a sudden spike in the number of anime releases? (You can consider anime released within 15 days before and after the official date of the event as belonging to that holiday release window.)
- Which Genres tend to be released during holidays like Halloween, Christmas, etc.?
- What is the average score of the top Genres when released during these holiday periods?
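
The 15-day window from the first question can be sketched as follows. This assumes a full release date is available for each title, which the `Aired Month` column alone does not provide. The function name `in_holiday_window` and the sample dates are hypothetical.

```python
import pandas as pd


def in_holiday_window(dates, month, day, half_width_days=15):
    """Flag dates within +/- half_width_days of a fixed-date holiday.

    The previous and next year's holiday are also checked, so that a release
    on 5 January still counts toward the Christmas window of the year before.
    Missing dates should be dropped before calling this sketch.
    """
    dates = pd.to_datetime(dates)
    window = pd.Timedelta(days=half_width_days)
    result = pd.Series(False, index=dates.index)
    for year_shift in (-1, 0, 1):
        holiday = pd.to_datetime(pd.DataFrame({
            "year": dates.dt.year + year_shift,
            "month": month,
            "day": day,
        }))
        result = result | ((dates - holiday).abs() <= window)
    return result


# Illustrative release dates only
release_dates = pd.Series(["2020-12-20", "2021-01-05", "2020-10-31", "2020-07-01"])
christmas = in_holiday_window(release_dates, month=12, day=25)
halloween = in_holiday_window(release_dates, month=10, day=31)
print(pd.DataFrame({"date": release_dates, "christmas": christmas, "halloween": halloween}))
```

The resulting boolean columns can then be used to filter titles before grouping by Genres or averaging Score for each holiday window.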

## 3.6.Insight Question 6
The **Source** feature is invaluable for understanding the origins of anime, revealing whether a title is adapted from manga, light novels, games, or is an entirely original work. This information can shed light on industry trends, adaptation strategies, and even predict potential fan bases or production approaches. Our initial analysis will examine the raw 'Source' column to identify the most prevalent origins of anime within our dataset. By intentionally bypassing any preprocessing, we aim to uncover the immediate distribution of these source categories and, crucially, to identify if any non-standard entries or placeholder values are present, which could distort our understanding of the true source landscape and necessitate data cleaning.

### 3.6.1.Misleading Insight
- What is the frequency of each unique source category present in the raw **Source** column of the dataset?

In [291]:

pio.templates.default = "plotly_white"

# 2. Prepare the data
source_counts = df_anime_dataset_2023['Source'].value_counts().reset_index()
source_counts.columns = ['Source', 'Count']

# Sort ascending so that the largest bar ends up at the top of the horizontal chart (declutter)
source_counts = source_counts.sort_values('Count', ascending=True)

# 3. Color logic
# 'UNKNOWN' bars are gray; all other bars are purple (#8B5CF6)
base_color = "#8B5CF6"
unknown_color = "#B0B0B0"

# Build the color list in the same order as the sorted data
colors = [
    unknown_color if str(s).strip().upper() == "UNKNOWN" else base_color
    for s in source_counts["Source"]
]

# 4. Create the horizontal bar chart
fig = px.bar(
    source_counts,
    x='Count',      # Counts on the horizontal axis
    y='Source',     # Source names on the vertical axis
    orientation='h', # Horizontal bars
    title='Frequency of Anime Sources (Raw Data)',
    labels={'Source': 'Source Type', 'Count': 'Number of Anime'},
    text='Count'
)

# 5. Style the traces
fig.update_traces(
    marker_color=colors,           # Apply the color list built above
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 6. Layout tweaks (modern and clean)
fig.update_layout(
    title={
        'text': 'Frequency of Anime Sources (Raw Data)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', weight='bold', color='#1F2937')
    },
    # X axis: counts
    xaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        showgrid=False,        # Hide vertical gridlines
        gridcolor='#F3F4F6',  # Very light grid color (only used if showgrid=True)
        showline=False,       # Hide the axis line
        tickformat=',',
        zeroline=False
    ),
    # Y axis: source names (padded with ticksuffix)
    yaxis=dict(
        title='',             # No axis title; the source names are self-explanatory
        tickfont=dict(size=13, color='#4B5563', family='Inter'),
        ticksuffix="  ",      # Add a gap between the tick labels and the bars
        showgrid=False,
        showline=False,       # Hide the axis line for a cleaner look
        zeroline=False
    ),
    bargap=0.25,
    height=600,
    margin=dict(t=80, l=120, r=40, b=80),
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter')
    )
)

fig.show()

**Conflict**
The initial insight derived from the **Source** feature (showing the frequency of anime origins) is skewed and misleading due to issues related to its data representation. The raw state of this column prevents an accurate understanding of the true prevalence of each source.
- Error 1: Mismatched Data Type - **Source** as a Generic Object (String) Type.
  - Impact on Insight (Bar Chart): While Plotly can render categories from an object (string) column, the generic data type does not enforce a set of valid source categories. Arbitrary string values are therefore treated as legitimate categories by the plotting library, so non-standard entries are shown as meaningful data instead of being flagged as potential errors or missing information.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
  - Impact on Insight (Bar Chart): The string 'UNKNOWN' in the **Source** column is a significant distortion. A true missing value (NaN) would be dropped by `value_counts()` by default, but 'UNKNOWN' is counted as a distinct and valid source category. Consequently, 'UNKNOWN' appears as a prominent bar in the chart, often ranking highly in frequency. This falsely inflates its importance and misrepresents the true distribution of actual anime sources, making it difficult to discern the genuine origins of anime in the dataset.
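
The effect of the placeholder can be illustrated on a toy Series. The values below are made up for illustration, and the cleaning shown is only one plausible way to reach a column like the prepared `Source`, not necessarily the preprocessing actually used.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the dataset
raw_source = pd.Series(
    ["Manga", "UNKNOWN", "Original", "Unknown", "Manga", "Original", "UNKNOWN"]
)

# Treat any spelling of the placeholder as a real missing value
is_placeholder = raw_source.str.strip().str.upper() == "UNKNOWN"
clean_source = raw_source.mask(is_placeholder, np.nan)

raw_counts = raw_source.value_counts()      # placeholder strings are counted as categories
clean_counts = clean_source.value_counts()  # NaN is dropped by default

print(raw_counts)
print(clean_counts)
```

After the replacement, the placeholder no longer competes with real categories, and the number of missing entries can still be reported separately with `clean_source.isna().sum()`.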


### 3.6.2.Correct Insight

In [292]:
# 1. Keep the default theme
pio.templates.default = "plotly_white"

# 2. Prepare the data (cleaned data)
# Count frequencies
source_counts = df_anime_dataset_2023_prep['Source'].value_counts().reset_index()
source_counts.columns = ['Source', 'Count']

# Sort so that the longest bar is at the top
source_counts = source_counts.sort_values('Count', ascending=True)

# 3. Create the horizontal bar chart
fig = px.bar(
    source_counts,
    x='Count',
    y='Source',
    orientation='h',
    title='Frequency of Anime Sources (Cleaned Data)', 
    labels={'Source': 'Source Type', 'Count': 'Number of Anime'},
    text='Count',
    color='Count',
    color_continuous_scale=['#DDD6FE', '#8B5CF6', '#4C1D95']
)

# 4. Style the traces (consistent with the raw-data chart)
fig.update_traces(
    marker_line_color='white',
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',
    hovertemplate='<b>Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 5. Layout tweaks (same style as the raw-data chart, for consistency)
fig.update_layout(
    title={
        'text': 'Frequency of Anime Sources (Cleaned Data)',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', weight='bold', color='#1F2937')
    },
    # X axis: number formatting (gridlines hidden)
    xaxis=dict(
        title='Number of Anime',
        title_font=dict(size=14, color='#4B5563', family='Inter'),
        tickfont=dict(size=11, color='#6B7280'),
        showgrid=False,
        gridcolor='#F3F4F6',  # Light grid color (only used if showgrid=True)
        showline=False,       # Hide the axis line
        zeroline=False,
        tickformat=','        # Comma as thousands separator
    ),
    # Y axis: source labels, padded with ticksuffix
    yaxis=dict(
        title='',             # No axis title, for a cleaner look
        tickfont=dict(size=13, color='#4B5563', family='Inter'),
        ticksuffix="  ",      # Important: adds a gap between the labels and the bars
        showgrid=False,       # Hide horizontal gridlines
        showline=False,
        zeroline=False
    ),
    coloraxis_showscale=False,
    plot_bgcolor='white',
    paper_bgcolor='white',    # Match the plot background
    bargap=0.25,              # Gap between bars
    height=600,
    margin=dict(t=80, l=120, r=40, b=80), # Wide left margin (120) for long source names
    hoverlabel=dict(
        bgcolor='white',
        bordercolor='#8B5CF6',
        font=dict(size=13, family='Inter', color='#1F2937')
    )
)

fig.show()

##### Initial Analysis
After removing the **Unknown** category, the distribution of anime by Source is more reliable and gives a different picture of the market. The anime industry relies heavily on a few content sources: **Original** works hold the largest share, followed by **Manga** at a significant distance.

Game-based and Visual Novel adaptations occupy the next major tiers, while sources such as Light Novels, Web Manga, Music, and niche formats like 4-koma Manga or Picture Books contribute much smaller proportions.

$\implies$ The Anime industry tends to produce a high number of Original works to avoid dependence on licensing, or relies on the safety of existing popular Manga. ***However, does “a large quantity” necessarily mean “good quality”?***


**Deep Insight**:
- What is the average score for each Source type?
- Which Sources are anime types such as Movie and TV series most often adapted from?

In [293]:

# --- 1. DATA PROCESSING ---
avg_score_by_source = df_anime_dataset_2023_prep.groupby('Source', dropna=True)['Score'].mean().reset_index()
avg_score_by_source = avg_score_by_source.dropna(subset=['Score']).sort_values('Score', ascending=False)

# --- 2. PLOTTING ---
fig_avg = px.bar(
    avg_score_by_source,
    x='Score',
    y='Source',
    orientation='h',
    text='Score',
    color='Score', 
    color_continuous_scale=[ '#DDD6FE', '#4C1D95' ],
)

# --- 3. STYLING TRACES ---
fig_avg.update_traces(
    marker_line_width=0, 
    textposition='outside',
    texttemplate='%{text:.2f}',
    hovertemplate='<b>Source:</b> %{y}<br><b>Avg Score:</b> %{x:.2f}<extra></extra>'
)

# --- 4. LAYOUT & AXES ---
fig_avg.update_layout(
    title={
        'text': '<b>Average Score by Source</b>', 
        'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'
    },
    font=dict(family='inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=600,
    margin=dict(t=80, l=150, r=50, b=50), 
    showlegend=False,
    coloraxis_showscale=False,

    # X axis: hidden entirely
    xaxis=dict(visible=False),

    # Y axis: larger tick font (size 14)
    yaxis=dict(
        title=None,
        showgrid=False,
        showline=False,
        ticklen=0,
        autorange="reversed",
        ticksuffix="  ",
        tickfont=dict(size=14) 
    )
)

fig_avg.show()

#### Deep Analysis
##### **A. The Paradox of "Quantity" vs. "Quality"** 
*   **The Fall of "Original":** Despite ranking **#1** in release volume, *Original* works have a low average score (**6.07**).
    *   **Likely reason:** An original script is risky: with no existing fanbase, the storyline has not yet been tested in the market. The data is consistent with a **large number of low-quality or short-lived original anime being mass-produced**, which pulls the overall average down.
*   **The Rise of the "Novel" Category:**
    *   **Web Novels (7.00)** and **Light Novels (6.96)** lead the rankings in score, despite their modest production numbers.
    *   **Manga (6.83)** maintains stable performance: High volume combined with high scores.
    *   **Reason - The "Selection Bias" Effect:** A Web Novel or Light Novel typically needs to reach high popularity and demonstrate strong narrative quality before it is selected for anime adaptation. As a result, works chosen from this source category tend to have a higher probability of critical success (e.g., *Re:Zero, Sword Art Online*).

$\implies$ **Strategy for the Producer:** For lower risk and more reliable ratings, prioritize adaptations from well-known **Web Novels, Light Novels, or Manga**. If choosing to produce an **Original**, accept the higher risk and invest heavily in the scriptwriting team to avoid the "low-score trap."
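
One way to check the quantity-versus-quality pattern is to put title counts and mean scores side by side for each source. The sketch below is illustrative only: `source_summary` is a hypothetical helper, and `toy` is made-up data rather than rows from the dataset. Only the column names `Source` and `Score` follow the notebook.

```python
import pandas as pd

def source_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return title count and mean score per source, sorted by mean score."""
    # Non-numeric scores (e.g. 'UNKNOWN') become NaN and are dropped
    scores = pd.to_numeric(df["Score"], errors="coerce")
    out = (
        df.assign(Score=scores)
        .dropna(subset=["Score"])
        .groupby("Source")["Score"]
        .agg(Titles="count", Avg_Score="mean")
        .sort_values("Avg_Score", ascending=False)
    )
    return out.round({"Avg_Score": 2})

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Source": ["Original", "Original", "Original", "Manga", "Manga", "Web novel"],
    "Score": [5.0, 6.0, "UNKNOWN", 7.0, 6.5, 7.2],
})
summary = source_summary(toy)
print(summary)
```

Reading both columns together makes it easier to spot a source that is high in volume but low in average score, which is the pattern described for *Original* works.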

In [294]:
# --- 1. SETUP SUBPLOTS ---
# Create a grid with 1 row and 2 columns
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=["<b>Top Sources for TV</b>", "<b>Top Sources for Movie</b>"],
    horizontal_spacing=0.12  # Gap between the two charts
)

types_of_interest = ['TV', 'Movie'] # Display order: left -> right

# Define the color gradient (Light Purple -> Deep Purple)
custom_colorscale = [ '#DDD6FE', '#4C1D95' ]

# --- 2. DATA PROCESSING & PLOTTING ---
for i, t in enumerate(types_of_interest):
    # Data Processing
    df_t = df_anime_dataset_2023_prep[df_anime_dataset_2023_prep['Type'] == t]
    counts = df_t['Source'].value_counts().reset_index()
    counts.columns = ['Source', 'Count']
    
    # Take the Top 10, sorted in descending order (so they display top to bottom)
    counts = counts.head(10).sort_values('Count', ascending=False)
    
    # Create the trace (bar chart)
    fig.add_trace(
        go.Bar(
            x=counts['Count'],
            y=counts['Source'],
            orientation='h',
            text=counts['Count'],
            texttemplate='%{text:,}', # Format numbers with thousands separators
            textposition='outside',
            hovertemplate='<b>Source:</b> %{y}<br><b>Count:</b> %{x:,}<extra></extra>',
            
            # Color logic: map bar color to the count
            marker=dict(
                color=counts['Count'],
                colorscale=custom_colorscale,
                line_width=0
            ),
            name=t # Legend name (the legend itself is hidden below)
        ),
        row=1, col=i+1 # Place in the matching column (1 or 2)
    )
    )

# --- 3. GLOBAL STYLING (DECLUTTER) ---

# Update the shared layout
fig.update_layout(
    title={
        'text': '<b>Source Volume Distribution According to Type: TV vs Movie</b>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=13, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=600,
    width=1400, # Wide enough for two charts
    margin=dict(t=80, l=50, r=50, b=50),
    showlegend=False
)

# Tune the X-axes (hide tick labels and grid) for both charts
fig.update_xaxes(showticklabels=False, showgrid=False, showline=False, visible=False)

# Tune the Y-axes (keep Source labels, hide grid, reverse so the top entry is first)
fig.update_yaxes(
    showgrid=False, 
    showline=False, 
    ticklen=0, 
    autorange="reversed", # Reverse the Y-axis so the highest value sits at the top
    ticksuffix="  "      # Small gap between the label and the bar
)

# Widen each X-axis slightly so the outside text labels are not clipped
# For each subplot, take the largest count and set the range from it
for i, t in enumerate(types_of_interest):
    df_t = df_anime_dataset_2023_prep[df_anime_dataset_2023_prep['Type'] == t]
    max_val = df_t['Source'].value_counts().max()
    # Update the matching subplot axis (column 1 or 2)
    fig.update_xaxes(range=[0, max_val * 1.25], row=1, col=i + 1)

fig.show()

##### **B. Format and Source Compatibility (Movie vs. TV)**

According to the distribution of release types, TV and Movie lead the market. The choice of script source changes markedly depending on the release format:

*   **For TV Series:**
    *   **Top 3 Structure:** **Original > Manga > Light Novel**.
    *   *Analysis:* TV Series require long-form content to sustain broadcasting for 3-6 months (12-24 episodes). **Manga** and **Light Novels**, with their episodic structures and expansive world-building, are highly suitable for this format. Light Novels reaching the Top 3 for TV series (488 titles) suggests this is a "fertile ground" for serialized content.
*   **For Movies:**
    *   **Top 3 Structure:** **Original > Manga > Other**.
    *   *The Disappearance of Light Novels:* Light Novels drop significantly in the Movie category.
    *   *Analysis:* Movies have limited runtime (90-120 minutes), demanding concise, dramatic plots with clear conclusions.
        *   **Originals** dominate Movies (1,908 titles) as screenwriters can easily tailor a complete story specifically for the 2-hour format.
        *   **Manga** movies are often spin-offs of major franchises (e.g., *Conan, Doraemon*) or short one-shots.
        *   **Light/Web Novels** are often too complex and lengthy to compress into a single film, making them less prioritized for the Movie format.
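
The format-versus-source comparison above can also be expressed as shares rather than raw counts, which makes TV and Movie directly comparable despite their different totals. This is a minimal sketch on made-up rows (`toy`); only the column names `Type` and `Source` come from the notebook.

```python
import pandas as pd

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Type": ["TV", "TV", "TV", "TV", "Movie", "Movie", "Movie", "Movie"],
    "Source": ["Original", "Manga", "Light novel", "Manga",
               "Original", "Original", "Manga", "Other"],
})

# Share of each source within each format (each row sums to 100)
share = (pd.crosstab(toy["Type"], toy["Source"], normalize="index") * 100).round(1)
print(share)
```

With `normalize="index"`, a source that is absent from one format shows up as 0.0 in that row instead of being silently missing.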

## 3.7.Insight Question 7
Producers can be understood as "investors," and an anime as an "investment project." Producers are the entities that fund, plan, and coordinate an anime project. Because producing an anime normally requires a large budget, multiple producers often collaborate on a single title to limit each party's financial exposure if it fails. Consequently, the 'Producers' feature in our dataset often lists several production companies or individuals within a single entry, separated by commas. Our initial exploration of this raw column identifies the most frequently appearing producer strings, which can hint at common collaborations or dominant production groups. Analyzing this unprocessed, aggregated data also shows how the current string format hides the prevalence of individual producers and why the column needs significant parsing and cleaning.

### 3.7.1.Misleading Insight
- What are the most frequently appearing producer combinations or producer strings as they are originally listed in the dataset?

In [307]:
# 1. Data Processing (Top 15 Producers - Raw Data)
# Count frequency of Producers and take the top 15 (matching the chart title)
producer_counts = df_anime_dataset_2023['Producers'].value_counts().head(15).reset_index()
producer_counts.columns = ['Producers', 'Count']

# Sort by Count ascending so the largest bar appears at the top
producer_counts = producer_counts.sort_values('Count', ascending=True)

# --- BOLD TEXT LOGIC (UPDATED) ---
# Bold text (add <b> tags) if:
# 1. It is a collaboration (contains ',')
# 2. OR the name is "UNKNOWN"
producer_counts['Display_Name'] = producer_counts['Producers'].apply(
    lambda x: f"<b>{x}</b>" if (',' in str(x) or str(x).strip().upper() == 'UNKNOWN') else x
)

# --- COLOR LOGIC ---
# Highlight in RED if:
# 1. It is a collaboration (contains ',')
# 2. OR the name is "UNKNOWN"
# Otherwise, use Light Pink
red_color = '#A82516'
pink_color = '#FDE0DD'

colors = []
for p in producer_counts['Producers']:
    p_str = str(p).strip().upper()
    if p_str == 'UNKNOWN' or ',' in str(p):
        colors.append(red_color)
    else:
        colors.append(pink_color)

# 3. Create Horizontal Bar Chart
fig = px.bar(
    producer_counts,
    x='Count',
    y='Display_Name', 
    orientation='h',
    title='Top 15 Producer Combinations By Number of Animes (Raw Data)',
    labels={'Display_Name': 'Producer Combination', 'Count': 'Number of Anime'},
    text='Count',
    # Pass original name to custom_data for clean tooltips
    custom_data=['Producers'] 
)

# 4. Visual Refinements (Traces)
fig.update_traces(
    marker_color=colors,           # Apply the color logic
    marker_line_color='white',     
    marker_line_width=1.2,
    textposition='outside',
    texttemplate='%{text:,}',      
    hovertemplate='<b>Producers:</b> %{customdata[0]}<br><b>Count:</b> %{x:,}<extra></extra>'
)

# 5. Layout & Axis Refinements
fig.update_layout(
    title={
        'text': '<b>Top 15 Producer Combinations By Number of Animes (Raw Data)</b>',
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=22, family='Inter', color='#1F2937')
    },
    
    # X-Axis Configuration
    xaxis=dict(
        showgrid=False,    
        zeroline=False,
        tickformat=',',
        dtick=2000,        
        title=None,
        showticklabels=False        
    ),
    
    # Y-Axis Configuration
    yaxis=dict(
        showgrid=False,
        zeroline=True,
        title=None,
        tickfont=dict(size=13, family='Inter', color='#4B5563'),
        ticksuffix="  "    
    ),
    
    # General Styling
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.2,
    height=700,            
    margin=dict(t=80, l=250, r=40, b=50), 
    
    hoverlabel=dict(
        bgcolor='white', 
        bordercolor='#888', 
        font=dict(size=12, family='Inter')
    )
)

fig.show()

**Conflict**
The initial insights derived from the 'Producers' feature (showing the frequency of producer combinations) are significantly flawed and misleading due to critical data integrity and structural issues. The raw state of this column prevents an accurate understanding of individual producers' contributions or true prevalence.
- Error 1: Aggregated Producer Strings (Multiple Producers in One Entry).
    - Impact on Insights (Bar Charts of Producer Combinations): The primary issue is that the 'Producers' column often contains multiple producers listed within a single string entry (e.g., "Aniplex, Lantis, MAGES."). When we count the frequency of these strings, we are tallying specific combinations of producers, not the individual producers themselves. This fundamentally obscures the true involvement and popularity of any single production company. For example, "Aniplex" might be a highly active producer, but its individual prevalence is hidden within numerous unique combinations. The bar charts will show these distinct, often lengthy, combination strings as categories, making it impossible to determine which individual producers are most active or involved across the dataset. We are gaining insight into producer teams or bundles, rather than the independent entities.
- Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.
    - Impact on Insights (Bar Charts of Producer Combinations): The string 'UNKNOWN', used for missing producer information, is treated as a valid and distinct producer combination by pandas' `value_counts()` and therefore by the chart. This means 'UNKNOWN' will likely appear as one of the most frequent "producer combinations" in the bar charts. Its inclusion as a prominent category inflates its perceived importance, distorts the true ranking of actual producer collaborations, and distracts from analyzing meaningful production data. A correctly handled missing value (standard NaN) is excluded from `value_counts()` by default, providing a clearer and more accurate insight into actual producer involvement.
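
Both errors can be reproduced on a handful of rows. The sketch below uses a made-up `raw` series, not the dataset, and assumes names are separated by commas and the placeholder is exactly `'UNKNOWN'`.

```python
import pandas as pd

# Toy raw column for illustration only (not taken from the dataset)
raw = pd.Series(
    ["Aniplex, Lantis", "Aniplex", "UNKNOWN", "UNKNOWN", "UNKNOWN", "Aniplex, TV Tokyo"],
    name="Producers",
)

# Error 1 + 2: counting raw strings ranks 'UNKNOWN' first and counts
# 'Aniplex' only where it appears alone
combo_counts = raw.value_counts()

# Turn the placeholder into NaN, then split on commas and count individuals
as_missing = raw.mask(raw == "UNKNOWN")
individual_counts = (
    as_missing.dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(combo_counts.head(3))
print(individual_counts)
```

In this toy case Aniplex goes from one solo credit in the raw count to three credits once the strings are split, and the placeholder disappears from the ranking.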

### 3.7.2.Correct Insight

In [315]:
# --- 1. DATA PROCESSING (DEEP CLEANING) ---
# Create a copy and ensure string format
prod_df = df_anime_dataset_2023_prep[['Producers']].copy()
prod_df['Producers'] = prod_df['Producers'].astype(str)

# REGEX CLEANING: Remove brackets and quotes
prod_df['Producers'] = prod_df['Producers'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE: Separate producers and create individual rows
prod_df['Producers'] = prod_df['Producers'].str.split(',')
prod_df = prod_df.explode('Producers')

# TRIM: Remove leading/trailing whitespace
prod_df['Producers'] = prod_df['Producers'].str.strip()

# FILTER: Remove garbage values
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
prod_df = prod_df[~prod_df['Producers'].str.upper().isin(exclude_list)]

# --- 2. AGGREGATION ---
# Count occurrences and take Top 20
top_producers = prod_df['Producers'].value_counts().head(20).reset_index()
top_producers.columns = ['Producers', 'Count']
top_producers = top_producers.sort_values('Count', ascending=True)

# --- 3. VISUALIZATION ---
fig = px.bar(
    top_producers,
    x='Count',
    y='Producers',
    orientation='h',
    text='Count',
    color='Count',
    # Pink to Red gradient
    color_continuous_scale=['#FCC5C0', '#C3122F'] 
)

# --- 4. VISUAL REFINEMENTS ---
fig.update_traces(
    texttemplate='%{text:,}',
    textposition='outside',
    hovertemplate='<b>Producer:</b> %{y}<br><b>Projects:</b> %{x:,}<extra></extra>'
)

fig.update_layout(
    template='plotly_white', # Set theme directly in layout
    
    # Title Configuration (Font Size 22)
    title={
        'text': '<b>Top 20 Most Active Individual Producers By Number of Animes Produced</b>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(size=22, family='Inter', color='#333333')
    },
    
    # Global Font Settings
    font=dict(family='Inter', size=13, color="#333333"),
    
    # Layout Dimensions & Color
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=800, 
    coloraxis_showscale=False, # Hide the color bar
    margin=dict(t=80, l=250, r=40, b=80),

    # --- X-AXIS CONFIGURATION (Count) ---
    xaxis=dict(
        title=None,
        showticklabels=False, 
        tickfont=dict(size=14, color='#333333'),
        
        # Grid settings
        showgrid=False,        
        zeroline=False,
        
        # Tick Step Configuration
        tick0=0,     
        dtick=200,   # Step size of 200
        tickformat=',' 
    ),

    # --- Y-AXIS CONFIGURATION (Names) ---
    yaxis=dict(
        title=None,
        tickfont=dict(size=15, color='#333333'),
        ticksuffix="  ",    
        showgrid=False,
        showline=True
    )
)

fig.show()


**Initial Analysis**

*   The raw data significantly understated the activity of major players by treating every unique production team (e.g., "Aniplex, Lantis") as a separate entry. Once the data was cleaned (by splitting the strings and counting each individual producer), the reported production volume for most major companies **skyrocketed**. For example, **Aniplex** and **TV Tokyo** saw their credit counts rise to roughly 4.7 and 5.0 times the raw figures (from 123 to **580** and from 115 to **573**, increases of about 372% and 398%, respectively). This dramatic increase indicates that these entities are not isolated producers but **extremely active members** of a vast number of different **Production Committees**.
*   Conversely, companies like **Pink Pineapple** (which went from 259 to 274, an increase of about 6%) saw only a negligible change, indicating that their volume is primarily derived from **solo production or fixed-partner collaborations** rather than widespread committee involvement.

$\implies$ The corrected data confirms that the largest production volumes are driven by a few powerful entities—broadcasters and major content conglomerates. The colossal increase in their individual credit counts validates the understanding that their primary role is one of **financial partner and IP facilitator** across a multitude of anime projects. The raw data, therefore, presented a false picture of fragmentation, while the clean data reveals a market characterized by **high centralized involvement** from a few dominant, highly collaborative producers.
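
The size of each jump can be checked with a few lines of arithmetic. The helper `growth` below is hypothetical; the counts are the ones quoted in this analysis.

```python
def growth(raw_count: int, clean_count: int) -> tuple[float, float]:
    """Return (multiple of the raw count, percentage increase)."""
    factor = clean_count / raw_count
    return round(factor, 2), round((factor - 1) * 100, 1)

# Counts quoted in the analysis above: (name, raw count, cleaned count)
for name, raw_count, clean_count in [
    ("Aniplex", 123, 580),
    ("TV Tokyo", 115, 573),
    ("Pink Pineapple", 259, 274),
]:
    factor, pct = growth(raw_count, clean_count)
    print(f"{name}: x{factor} ({pct:+.1f}%)")
```

Note the difference between "times the raw count" and "percentage increase": a count that grows to 4.72 times its original value has increased by 372%, not 472%.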



**The corrected chart provides a clear and accurate view of which individual entities are driving anime production:**

1.  **Broadcasters Dominate the Top Ranks:**
    *   The most active producer is **NHK** (Japan Broadcasting Corporation) with **827** titles.
    *   Other major broadcasters like **TV Tokyo** (573) and **Fuji TV** (372) are also highly ranked, demonstrating that television networks are central players, often financing and securing broadcast slots for the majority of anime.

2.  **Major Conglomerates and Studios are Core Producers:**
    *   **Aniplex** (580) stands as the second most active producer, confirming its status as a core production, distribution, and licensing giant.
    *   Companies like **Lantis** (481), **Bandai Visual** (422), and **Pony Canyon** (390) are highly active, highlighting the critical role of music, licensing, and merchandising conglomerates in funding anime.

3.  **Key Publishing Houses are Highly Active:**
    *   Publishers like **Kadokawa** (334, plus Kadokawa Shoten 213), **Kodansha** (310), and **Shueisha** (307) are key players, reinforcing the fact that anime production is often a part of a larger strategy to monetize existing manga or light novel intellectual property (IP).

### 3.7.3.Deep Insight


In [319]:
# --- 1. DATA PROCESSING (FROM SCRATCH TO EXCLUDE GARBAGE) ---
# Create a working copy
df_collab = df_anime_dataset_2023_prep[['anime_id', 'Producers']].copy()
df_collab['Producers'] = df_collab['Producers'].astype(str)

# REGEX CLEANING
df_collab['Producers'] = df_collab['Producers'].str.replace(r"[\[\]\'\"]", "", regex=True)

# SPLIT & EXPLODE
df_collab['Producers'] = df_collab['Producers'].str.split(',')
df_collab = df_collab.explode('Producers')

# TRIM
df_collab['Producers'] = df_collab['Producers'].str.strip()

# --- FILTER: REMOVE GARBAGE VALUES ---
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_collab = df_collab[~df_collab['Producers'].str.upper().isin(exclude_list)]
df_collab = df_collab[df_collab['Producers'] != ""]

# COUNT PRODUCERS PER ANIME
anime_prod_counts = df_collab.groupby('anime_id')['Producers'].count().reset_index()
anime_prod_counts.columns = ['anime_id', 'Producer_Count']

# --- DEFINE NEW GROUPS (1-2, 3-5, 6+) ---
def get_collab_group(n):
    if n <= 2: return '1-2 Producers'
    if n <= 5: return '3-5 Producers'
    return '6+ Producers'

anime_prod_counts['Collaboration_Group'] = anime_prod_counts['Producer_Count'].apply(get_collab_group)

# --- 2. AGGREGATION ---
# Count the number of rows in each group
group_counts = anime_prod_counts['Collaboration_Group'].value_counts()

# Reindex to the specific 3 categories
order_list = ['1-2 Producers', '3-5 Producers', '6+ Producers']
group_counts = group_counts.reindex(order_list).reset_index()
group_counts.columns = ['Collaboration_Group', 'Anime_Count']

# Calculate total and percentage
total_anime = group_counts['Anime_Count'].sum()
group_counts['Percentage'] = (group_counts['Anime_Count'] / total_anime * 100).round(1)

# Identify the group with the maximum count for highlighting
max_group = group_counts.loc[group_counts['Anime_Count'].idxmax(), 'Collaboration_Group']

# Create a 'Highlight' column for conditional coloring
group_counts['Highlight'] = group_counts['Collaboration_Group'].apply(
    lambda x: 'Max Count' if x == max_group else 'Other'
)

# --- 3. PLOT THE BAR CHART ---
fig = px.bar(
    group_counts,
    x='Collaboration_Group',
    y='Anime_Count',
    text='Anime_Count', 
    color='Highlight',
    color_discrete_map={ 
        'Max Count': '#C3122F', # Red for the maximum bar
        'Other': '#FCC5C0'      # Pink for others
    }
)

# --- 4. LAYOUT AND TRACES CUSTOMIZATION ---
fig.update_traces(
    texttemplate='%{text:,}', 
    textposition='outside',
    marker_line_color='white',
    marker_line_width=1.5,
    width=0.6, # Narrower bars since there are only 3 columns
    # Hover template
    hovertemplate='<b>%{x}</b><br>Count: %{y:,}<br>Share: %{customdata[0]}%<extra></extra>',
    customdata=group_counts[['Percentage']] 
)

fig.update_layout(
    template='plotly_white',

    # Chart Title Settings (Size 22)
    title={
        'text': f'<b>Producer Collaboration Size By Number of Animes</b><br>'
                f'<span style="font-size: 14px; color: #555;">'
                f'Total Anime: {total_anime:,}</span>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(size=22, family='Inter', color='#1F2937')
    },
    
    font=dict(family='Inter', size=14, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    width=1000,
    height=700,
    
    showlegend=False,
    margin=dict(t=120, l=70, r=50, b=100),

    # X-AXIS Configuration
    xaxis=dict(
        title="Number of Producers Involved",
        title_font=dict(size=16, color='#333333'),
        tickfont=dict(size=15, color='#333333'), 
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB'
    ),

    # Y-AXIS Configuration
    yaxis=dict(
        title=None,
        tickfont=dict(size=14, color='#333333'),
        showgrid=False, 
        gridcolor='#F3F4F6',
        showline=False,
        linecolor='#E5E7EB',
        tickformat=',',
        ticksuffix="  ",
        showticklabels=False
    )
)

fig.show()

**Initial Analysis**

The chart illustrates the collaboration size for a total of **11,555** anime titles, emphasizing a heavy concentration on smaller producer teams.

* **Overwhelming Dominance of Small Teams (1-2 Producers):**
    *   The single largest category is the **1-2 producers** group, which accounts for **8,800** anime titles, or approximately **76.2%** of the total.
    *   **Conclusion:** This strongly indicates that the vast majority of anime production is handled either by a single entity or a simple two-party partnership, highlighting a preference for lean production models.

* **Mid-Sized Collaboration (3-5 Producers) is the Second Largest:**
    *   The **3-5 producers** category is the second-largest, with **1,819** titles (**15.7%**).
    *   **Conclusion:** This suggests that projects requiring a moderate level of shared investment or expertise often form production teams within this range.

* **Large-Scale Production Committees (6+ Producers) are the Least Common:**
    *   The largest collaboration group, **6+ producers**, is the smallest category, with **936** titles (**8.1%**).
    *   **Conclusion:** Very large production committees, which are typically formed for high-budget or high-risk projects to distribute financial burden, are a significant minority in the overall anime market structure.
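
The shares follow directly from the three counts reported above. The sketch below recomputes them and, as an assumed alternative to the `if`/`else` helper in the cell, shows the same 1-2 / 3-5 / 6+ binning with `pd.cut`; `producer_counts` is made-up input.

```python
import pandas as pd

# Counts quoted in the analysis above
group_counts = pd.Series(
    {"1-2 Producers": 8800, "3-5 Producers": 1819, "6+ Producers": 936},
    name="Anime_Count",
)
shares = (group_counts / group_counts.sum() * 100).round(1)
print(shares)

# Assumed alternative to the if/else helper: right-inclusive bins
# (0, 2], (2, 5], (5, inf) map to the three collaboration groups
producer_counts = pd.Series([1, 2, 3, 5, 6, 12])
groups = pd.cut(
    producer_counts,
    bins=[0, 2, 5, float("inf")],
    labels=["1-2 Producers", "3-5 Producers", "6+ Producers"],
)
print(groups.tolist())
```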



<!-- **Deep Insight**:
- Which is more effective in terms of average score: collaborating or producing alone?
- Which top 10 Producers achieve the highest average Score (the score of a collaborative anime also counts toward each participating producer)?
- Which Producer team has the highest average score (counting only teams that have collaborated on more than 3 anime movies)? -->

In [322]:


# --- 1. DATA PROCESSING: CALCULATION AND FILTERING ---

# Filter relevant columns and create a copy
df_trend = df_anime_dataset_2023_prep[['Producers', 'Aired Year']].copy()

# 1.1. CLEANING AND FILTERING
# Convert 'Aired Year' to numeric, dropping unconvertible rows
df_trend['Aired Year'] = pd.to_numeric(df_trend['Aired Year'], errors='coerce')
df_trend.dropna(subset=['Aired Year', 'Producers'], inplace=True)

# Remove rows with Unknown/None/Empty producers
exclude_list = {"UNKNOWN", "NONE", "NAN", "NULL", ""}
df_trend = df_trend[
    ~df_trend['Producers'].astype(str).str.strip().str.upper().isin(exclude_list)
]

# 1.2. FILTER FOR THE LAST 10 YEARS
max_year = 2023 
min_year = max_year - 9 # 10 years including max_year (e.g., 2014-2023)
df_trend = df_trend[
    (df_trend['Aired Year'] >= min_year) & 
    (df_trend['Aired Year'] <= max_year)
]

# 1.3. CALCULATE NUMBER OF PRODUCERS
# The number of producers is count of commas + 1
df_trend['num_producers'] = df_trend['Producers'].astype(str).str.count(',') + 1

# 1.4. GROUPING FUNCTION (Optimized grouping logic)
def group_producers(n):
    if n <= 2: 
        return '1-2 producers'
    if n <= 5: # No need to check 3 <= n
        return '3-5 producers'
    return '6+ producers'

df_trend['Collab_Group'] = df_trend['num_producers'].apply(group_producers)

# 1.5. AGGREGATION
# Group by Year and Group, then count the number of Anime
trend_data = (
    df_trend.groupby(['Aired Year', 'Collab_Group'])
    .size()
    .reset_index(name='Anime_Count')
)

# Set the categorical order for correct sorting and legend display
order_list = ['1-2 producers', '3-5 producers', '6+ producers']
trend_data['Collab_Group'] = pd.Categorical(
    trend_data['Collab_Group'], 
    categories=order_list, 
    ordered=True
)
trend_data.sort_values(['Aired Year', 'Collab_Group'], inplace=True)


# --- 2. VISUALIZATION (LINE CHART) ---

# Define the custom color palette
custom_palette = {
    '1-2 producers': '#7A0177',
    '3-5 producers': "#DBAB0D",
    '6+ producers': '#C3122F'
}

fig = px.line(
    trend_data,
    x='Aired Year',
    y='Anime_Count',
    color='Collab_Group',
    
    color_discrete_map=custom_palette,
    category_orders={'Collab_Group': order_list},
    
    title=f'<b>Trend of Anime Production Volume by Collaboration Size ({min_year} - {max_year})</b>'
)

# Update line and marker styles
fig.update_traces(
    mode='lines+markers',
    marker=dict(size=8, opacity=0.8),
    line=dict(width=3)
)

# --- 3. LAYOUT CONFIGURATION ---
fig.update_layout(
    font=dict(family='Inter', size=14, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    width=1000,
    height=600,
    
    # Title positioning
    title={'y': 0.95, 'x': 0.5, 'xanchor': 'center'},
    
    # Legend: Horizontal, positioned above the plot area
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),

    # X-AXIS (Year) Configuration
    xaxis=dict(
        title="Aired Year",
        tickmode='linear',
        dtick=1, # Tick every year
        showgrid=False,
        gridcolor='#F3F4F6',
        tickformat='.0f', # Display year as integer

    ),

    # Y-AXIS (Anime Count) Configuration
    yaxis=dict(
        title="Number of Anime",
        showgrid=True,
        gridcolor='#F3F4F6',
        tickformat=',' # Thousands separator format
    )
)

fig.show()

1.  **Dominance and Volatility of Small Teams (1-2 Producers):**
    *   **Overwhelming Size:** The **1-2 producers** group consistently represents the largest volume of anime production throughout the entire 2014–2023 period, often producing roughly 3 to 6 times as many anime as either of the other two groups.
    *   **Overall Declining Trend:** Despite significant fluctuation, the general trend for the 1-2 producers group is downward, starting at over 400 anime in 2014 and dropping to approximately 230 anime by 2023.
    *   **High Volatility:** The line shows significant peaks and troughs, notably a starting peak in 2014 (~410), a sharp decline into 2019, a rebound peak in 2021 (~355), and a final sharp decline into 2023.

2.  **Relative Stability of Larger Collaboration Groups (3-5 and 6+ Producers):**
    *   **Smaller Scale:** Both the **3-5 producers** and **6+ producers** groups maintain a significantly lower volume, generally staying at or below roughly 100-150 anime per year.
    *   **6+ Producers (Large Committees):** This group shows a slight general upward trend from 2014 to 2017 before leveling off and fluctuating. The overall volume in 2023 is roughly similar to the volume in the 2016-2017 period, suggesting a relatively stable, though small, commitment to large-scale production committees.
    *   **3-5 Producers (Mid-sized Teams):** This group starts high (around 90) in 2014 and shows a gentle downward drift and oscillation, generally maintaining a stable range of 70–100 titles annually, demonstrating less sensitivity to the yearly fluctuations seen in the largest group.

$\implies$ The anime market remains overwhelmingly **dominated** by projects involving **one or two producers**. However, the data for 2014-2023 indicates a **decline in the total production volume of this dominant group** over the decade. Conversely, **larger collaboration models**, though much smaller in volume, show greater year-to-year stability, suggesting that while small teams drive the overall volume, production committees **maintain a consistent output for complex, resource-intensive projects.**
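
Because the total number of titles changes from year to year, it can help to look at each group's share within a year alongside the raw counts. The sketch below uses a made-up frame `toy_trend` with the same columns as `trend_data`; the numbers are illustrative, not read from the chart.

```python
import pandas as pd

# Toy yearly counts for illustration only (same columns as `trend_data`)
toy_trend = pd.DataFrame({
    "Aired Year": [2014, 2014, 2014, 2023, 2023, 2023],
    "Collab_Group": ["1-2 producers", "3-5 producers", "6+ producers"] * 2,
    "Anime_Count": [400, 90, 60, 230, 80, 90],
})

# Share of each group within its year, so a shrinking total does not
# hide a shift in the mix of collaboration sizes
yearly_total = toy_trend.groupby("Aired Year")["Anime_Count"].transform("sum")
toy_trend["Share_Pct"] = (toy_trend["Anime_Count"] / yearly_total * 100).round(1)
print(toy_trend)
```

Plotting `Share_Pct` instead of `Anime_Count` would show whether larger committees are gaining ground relative to small teams, independent of the overall market volume.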

In [299]:
# --- 1. DATA PROCESSING ---
# 1. Copy data for processing
df_producers_collab = df_anime_dataset_2023_prep[['Producers', 'Score']].copy()

# 2. Convert Score to numeric and drop rows with missing Score
df_producers_collab['Score'] = pd.to_numeric(df_producers_collab['Score'], errors='coerce')
df_producers_collab = df_producers_collab.dropna(subset=['Score'])

# 3. Filter out rows with Unknown/Empty Producers (for accurate counting)
exclude_list = ["UNKNOWN", "", "NONE"]
df_producers_collab = df_producers_collab[~df_producers_collab['Producers'].astype(str).str.strip().str.upper().isin(exclude_list)]
df_producers_collab = df_producers_collab.dropna(subset=['Producers'])

# 4. CALCULATE NUMBER OF PRODUCERS
df_producers_collab['num_producers'] = df_producers_collab['Producers'].astype(str).str.count(',') + 1

# 5. BINNING FUNCTION to categorize collaboration size (NEW LOGIC)
def group_size(n):
    if n <= 2: return '1-2 producers' # Group 1 and 2
    if 3 <= n <= 5: return '3-5 producers'
    return '6+ producers'

# 6. Apply binning function
df_producers_collab['Collaboration_Group'] = df_producers_collab['num_producers'].apply(group_size)

# 7. Define the display order for categories (NEW ORDER)
order_list = ['1-2 producers', '3-5 producers', '6+ producers']

# 8. Calculate Insight (Mean Score for each group)
group_means = df_producers_collab.groupby('Collaboration_Group')['Score'].mean().reindex(order_list)
# 9. Get scores for comparison
solo_score = group_means['1-2 producers'] # The baseline is now '1-2 producers'
max_group_score = group_means.max()
best_group = group_means.idxmax()
diff = max_group_score - solo_score

# 10. Define Custom Color Palette
custom_palette = {
    '1-2 producers': '#7A0177',
    '3-5 producers': '#DBAB0D',
    '6+ producers': '#C3122F'
}

# --- 2. VISUALIZATION (BOX PLOT) ---
# 11. Create the Box Plot
fig = px.box(
    df_producers_collab,
    x='Collaboration_Group',
    y='Score',
    color='Collaboration_Group', 
    color_discrete_map=custom_palette, # <--- APPLY CUSTOM COLORS
    category_orders={'Collaboration_Group': order_list} 
)

# 12. Update traces (Box appearance)
fig.update_traces(
    width=0.3,
    marker_size=5,
    marker_opacity=0.2,
    line_width=2
)

# --- 3. LAYOUT CONFIGURATION ---
# 13. Update layout with Title, Size, and Margin
fig.update_layout(
    title={
        'text': f'<b>Impact of Collaboration Size on Anime Score</b><br>'
                f'<span style="font-size: 14px; color: #555;">'
                f'1-2 Producers Avg: <b>{solo_score:.2f}</b> vs {best_group} Avg: <b>{max_group_score:.2f}</b> '
                f'(Improvement: <span style="color:green">+{diff:.2f}</span>)</span>',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    font=dict(family='Inter', size=14, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    width=700,
    height=800,
    
    showlegend=False, 
    margin=dict(t=120, l=100, r=80, b=100),

    # 14. X-axis configuration
    xaxis=dict(
        title="Number of Producers Involved",
        title_font=dict(size=16, color='#333333'),
        tickfont=dict(size=14, weight='bold', color='#333333'),
        showgrid=False,
        showline=True,
        linecolor='#E5E7EB'
    ),

    # 15. Y-axis configuration
    yaxis=dict(
        title="Anime Score",
        title_font=dict(size=16, color='#333333'),
        tickfont=dict(size=14, color='#333333'),
        showgrid=True,
        gridcolor='#F3F4F6',
        showline=True,
        linecolor='#E5E7EB',
        range=[0, 10],
        dtick=1
    )
)

fig.show()

**1. Collaborative Efficiency and Performance Gap**

Across the three groups, the average Anime Score rises with the number of producers involved: larger collaborations score higher on average.

*   The chart explicitly shows the largest groups significantly outperforming the smallest: **1-2 Producers Avg: 6.34** vs. **6+ Producers Avg: 7.15**, representing an improvement of **+0.81** points.
*   This gap of roughly 0.8 points is consistent with the notion that **larger budgets, pooled resources, and stricter quality control** in multi-party production committees tend to yield a better-received product, although a difference in group averages alone does not establish causation.

**2. Risk Mitigation and Quality Floor Elevation**

Collaboration acts as an effective mechanism for risk management, substantially raising the quality floor for the investment.

*   The **"1-2 Producers"** group exhibits extremely high volatility, with a wide score range and a significant cluster of "disaster" **outliers** plummeting down to the 2.0 – 3.0 score range. This configuration carries the highest risk of catastrophic failure.
*   In sharp contrast, the presence of **3 Producers or more** significantly elevates the "quality floor." The lower whisker for both the 3-5 and 6+ producer groups rests at a much higher score (around 5.1), demonstrating that collaboration effectively protects the investment from critically low-scoring failure.

$\implies$ The optimal strategic balance for maximizing quality and controlling risk is to form a **Production Committee** consisting of **3 to 5 partners**. This structure successfully balances the need for robust financing and stringent quality oversight while maintaining effective management efficiency.
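
The group averages and "quality floor" discussed above can be recomputed directly from a table of scores and producer counts. The following is a minimal sketch on made-up data: the `n_producers` column and the toy scores are assumptions for illustration, not the notebook's actual data, and the lower whisker follows the usual box plot convention (smallest score at or above Q1 - 1.5 * IQR).

```python
import pandas as pd

# Toy data: hypothetical scores and producer counts, for illustration only
toy = pd.DataFrame({
    "n_producers": [1, 1, 2, 2, 3, 4, 5, 6, 7, 8],
    "Score": [2.5, 6.0, 6.5, 7.0, 6.8, 7.0, 7.2, 7.1, 7.3, 7.4],
})

# Bin producer counts into the same three groups used in the chart
toy["group"] = pd.cut(
    toy["n_producers"],
    bins=[0, 2, 5, float("inf")],
    labels=["1-2 Producers", "3-5 Producers", "6+ Producers"],
)

def lower_whisker(scores: pd.Series) -> float:
    """Smallest score at or above Q1 - 1.5 * IQR (box plot convention)."""
    q1, q3 = scores.quantile([0.25, 0.75])
    fence = q1 - 1.5 * (q3 - q1)
    return scores[scores >= fence].min()

# Mean score and lower whisker ("quality floor") per group
summary = toy.groupby("group", observed=True)["Score"].agg(
    mean="mean", floor=lower_whisker
)
print(summary.round(2))
```

Comparing the `mean` and `floor` columns side by side separates two different claims: a higher average for larger groups, and a higher floor once extreme low outliers are set aside.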

## 3.8.Insight Question 8
### a. Question
Unlike Producers, Studios are hired by Producers to create the anime itself. They are responsible for drawing, animation, compositing, coloring, editing, and post-production; in other words, they are the teams that produce the visual content seen on screen. Because animation production can be a massive undertaking, multiple studios often collaborate on a single anime, especially for larger projects or to meet tight deadlines. As a result, the 'Studios' feature in our dataset often lists several animation studios in a single entry, separated by commas.

Our initial exploration of this raw column aims to identify the most frequently appearing combinations of studios. By analyzing this unprocessed, aggregated data, we intend to show how its current string format impedes identifying the prevalence of individual studios, and why significant parsing and cleaning are needed to understand the true contributors to anime production.

### b. Misleading Insight
- What are the most frequently appearing studio combinations or studio strings as they are originally listed in the dataset?

In [300]:
import pandas as pd
import plotly.express as px

# --- 1. Data Processing & Classification ---
# Create a copy and clean specific characters
df_studios = df_anime_dataset_2023_prep[['Studios']].copy()
df_studios['Studios'] = df_studios['Studios'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)

# Function to classify studio types
def classify_studio(studio_str):
    s = studio_str.strip().upper()
    if s in ["UNKNOWN", "NAN", ""]:
        return 'UNKNOWN'
    if ',' in studio_str:
        # If a comma exists, it implies multiple studios involved
        return 'Collaboration'
    return 'Solo Studio'

# Apply classification and count frequencies
df_studios['Production_Type'] = df_studios['Studios'].apply(classify_studio)
studio_counts = df_studios['Production_Type'].value_counts().reset_index()
studio_counts.columns = ['Type', 'Count']
total_projects = studio_counts['Count'].sum()

# --- 2. Sort Data (Largest to Smallest) ---
# Sorting the DataFrame here ensures the chart follows this order
studio_counts = studio_counts.sort_values(by='Count', ascending=False)

# --- 3. Visualization (Donut Chart) ---
fig = px.pie(
    studio_counts,
    values='Count',
    names='Type',
    hole=0.4,
    color='Type',
    color_discrete_map={
        'Solo Studio': '#FCC5C0',
        'Collaboration': '#FDE0DD',
        'UNKNOWN': '#C3122F'
    },
    title='<b>Studio Proportion Structure (Raw Data)</b>'
)

# --- 4. Visual Refinements ---
fig.update_traces(
    # Display Percentage Inside
    textposition='inside',      # Place text inside the slices
    textinfo='percent',         # Show only percentage (labels are in the legend)
    
    textfont=dict(size=14, family='Inter', color='white'), # White text for better contrast
    marker=dict(line=dict(color='white', width=2)),
    
    # Configure Order: 0h start, Largest to Smallest
    sort=False,            # Disable Plotly's auto-sort (forces use of DataFrame order)
    rotation=0,            # Start drawing at 12 o'clock (0 degrees)
    direction='clockwise'  # Draw in clockwise direction
)

fig.update_layout(
    font=dict(family='Inter', size=16, color="#333333"),
    
    # Display Legend on the Right
    showlegend=True,
    legend=dict(
        title='Production Type',
        yanchor="middle",
        y=0.5,  # Vertically centered
        xanchor="left",
        x=1  # Positioned to the right of the chart
    ),
    
    height=600,
    width=900, # Increased width to accommodate the legend
    margin=dict(t=80, b=50, l=50, r=50),
    
    # Central annotation showing total count
    annotations=[dict(
        text=f'<b>{total_projects:,}</b><br><span style="font-size:12px; color:gray">Total Projects</span>',
        x=0.5, y=0.5, font_size=20, showarrow=False
    )],
    
    title={
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center'
    }
)

fig.show()



| Category | Proportion | Observation |
| :--- | :--- | :--- |
| **Solo Studio** | **53.6%** | This is the single largest group, indicating that over half of all anime projects are primarily credited to just one animation studio. |
| **UNKNOWN** | **42.3%** | This is the second-largest category, covering more than two-fifths of the dataset. These are projects where the main studio information is missing or uncredited. |
| **Collaboration** | **4.12%** | Explicit collaborations (where 2 or more main studios are officially listed together) are a rare event, making up a tiny fraction of the total projects. |

**Initial Analysis** based on the chart

1.  **The Common Perception (The Easy Insight):** Many people often assume that **Studio Collaboration** is common, especially for major projects, due to the massive workload of anime production.
2.  **The Misleading Raw Data (The Trap):** The chart suggests the opposite. Only **4.12%** of projects are classified as "Collaboration." Conversely, the **Solo Studio** type represents a clear majority at **53.6%**.
3.  **The Result:** A quick glance at the chart leads to the easy, but potentially incorrect, conclusion that a single studio completing a project by itself ("Solo Studio") is the dominant model and that collaboration is a rare exception, which contradicts common industry knowledge.

**In Addition:** The large **42.3%** chunk of **"Unknown"** projects further complicates any strong conclusion about "Solo Studio" dominance. This massive unknown segment could potentially contain many instances of collaboration that the raw data failed to capture, making the **53.6%** figure an unreliable basis for claiming majority dominance across all projects.
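
To gauge how much the "Unknown" slice matters, the shares can be re-normalised over titles with a known studio, and the collaboration share can be bounded under the two extreme assumptions about what the unknown titles contain. This is a quick arithmetic sketch using only the percentages reported in the table above; the variable names are illustrative.

```python
# Shares reported in the table above (percent of all titles)
shares = {"Solo Studio": 53.6, "UNKNOWN": 42.3, "Collaboration": 4.12}

# Re-normalise over titles whose studio is actually known
known_total = shares["Solo Studio"] + shares["Collaboration"]
solo_among_known = 100 * shares["Solo Studio"] / known_total
collab_among_known = 100 * shares["Collaboration"] / known_total

# Bounds on the collaboration share of all titles:
# lower = no UNKNOWN title is a collaboration, upper = every UNKNOWN title is one
collab_lower = shares["Collaboration"]
collab_upper = shares["Collaboration"] + shares["UNKNOWN"]

print(f"Solo among known: {solo_among_known:.1f}%")
print(f"Collaboration among known: {collab_among_known:.1f}%")
print(f"Collaboration share of all titles: {collab_lower:.2f}% to {collab_upper:.2f}%")
```

With these figures, solo studios account for about 93% of titles with a known studio, while the overall collaboration share could lie anywhere between 4.12% and 46.42%, which is why the raw chart cannot settle the question on its own.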

In [323]:

# --- 1. Data Processing ---
# Count the frequency of studio combinations and select the Top 15 (based on chart title)
# Note: Ensure the raw 'Studios' column is used to treat combinations as unique entities
studio_counts = df_anime_dataset_2023['Studios'].value_counts().head(15).reset_index()
studio_counts.columns = ['Studios', 'Count']

# Sort by Count ascending so the largest bar appears at the top when plotted horizontally
studio_counts = studio_counts.sort_values('Count', ascending=True)

# --- 2. Color Logic ---
# Highlight 'UNKNOWN' in red, others in pink
base_color = '#FCC5C0'
highlight_color = '#C3122F'

colors = [
    highlight_color if str(s).strip().upper() == "UNKNOWN" else base_color
    for s in studio_counts['Studios']
]

# --- 3. Create Horizontal Bar Chart ---
fig = px.bar(
    studio_counts,
    x='Count',
    y='Studios',
    orientation='h',
    title='<b>Top 15 Studio Combinations By Number of Anime Produced (Raw Data)</b>',
    text='Count'
)

# --- 4. Visual Refinements ---
fig.update_traces(
    marker_color=colors,         # Apply the custom color list
    marker_line_color='white',   # Border color (not drawn while the width below is 0)
    marker_line_width=0,
    textposition='outside',      # Place numbers next to the bars
    texttemplate='%{text:,}',    # Format numbers with commas (e.g., 1,000)
    hovertemplate='<b>%{y}</b><br>Count: %{x:,}<extra></extra>'
)

fig.update_layout(
    # Typography & Title
    font=dict(family='Inter', size=14, color='#333333'),
    title={
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(size=24, color='#333333')
    },
    
    # X-Axis (Count)
    xaxis=dict(
        title=None,
        tickfont=dict(size=12, color='#333333'),
        showgrid=False,
        zeroline=False,
        showline=False,
        tickformat=',',
        showticklabels=False
    ),
    
    # Y-Axis (Studio Names)
    yaxis=dict(
        title=None,
        tickfont=dict(size=14, color='#333333'),
        ticksuffix="  ", # Spacing between text and bar
        showgrid=False,
        showline=False
    ),
    
    # General Layout
    plot_bgcolor='white',
    paper_bgcolor='white',
    bargap=0.3,
    height=800,
    margin=dict(t=100, l=250, r=50, b=80) # Large left margin for long names
)

fig.show()

### **c. Error**

*   **Error 1: Aggregated Studio Strings (Multiple Studios in One Entry).**
    *   **Impact on Insights (Bar Charts of Studio Combinations):** The primary issue is that the **'Studios'** column often contains multiple studios listed within a single string entry (e.g., **"Madhouse, MAPPA"**). When we count the frequency of these strings, we are tallying specific combinations of studios, not the individual studios themselves. This fundamentally obscures the true involvement and popularity of any single animation studio. For example, **"Madhouse"** might be a highly active studio, but its individual prevalence is hidden within numerous unique combinations. The bar charts will show these distinct, often lengthy, combination strings as categories, making it impossible to determine which individual studios are most active or involved across the dataset. We are gaining insight into **studio collaborations**, rather than the independent entities.

*   **Error 2: Non-standard Missing Value Representation - 'UNKNOWN' instead of NaN.**
    *   **Impact on Insights (Bar Charts of Studio Combinations):** The string 'UNKNOWN' used for missing studio information is counted by pandas' `value_counts()` as a valid and distinct studio combination, and Plotly then plots it like any other category. As a result, 'UNKNOWN' appears as one of the most frequent "studio combinations" in the bar charts. Its prominence inflates its perceived importance, distorts the true ranking of actual **studio collaborations**, and distracts from analyzing meaningful production data. A correctly handled missing value (standard NaN) is excluded from `value_counts()` by default, giving a clearer and more accurate picture of actual studio involvement.
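
Both errors can be reproduced on a small Series. The sketch below uses made-up rows rather than the real dataset: counting the raw strings ranks 'UNKNOWN' first and treats a combination as its own category, while masking the placeholder to NaN and exploding the comma-separated entries recovers per-studio counts.

```python
import pandas as pd

# Toy 'Studios' column (made-up rows, not the real dataset)
raw = pd.Series(
    ["Madhouse", "Madhouse, MAPPA", "MAPPA", "UNKNOWN", "UNKNOWN", "UNKNOWN"]
)

# Errors 1 and 2: counting raw strings ranks 'UNKNOWN' first and
# treats "Madhouse, MAPPA" as a separate category
raw_counts = raw.value_counts()

# Fix: turn the placeholder into NaN, split on commas, explode, then count.
# value_counts() drops NaN by default (dropna=True).
clean_counts = (
    raw.mask(raw == "UNKNOWN")
       .str.split(",")
       .explode()
       .str.strip()
       .value_counts()
)

print(raw_counts)
print(clean_counts)
```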

### **d. Correct Insight**

In [328]:
# --- 1. Data Processing (Deep Cleaning) ---
# Create a copy and ensure string format
df_studios = df_anime_dataset_2023_prep[['Studios']].copy()
df_studios['Studios'] = df_studios['Studios'].astype(str)

# Clean characters: Remove brackets and quotes
df_studios['Studios'] = df_studios['Studios'].str.replace(r"[\[\]\'\"]", "", regex=True)

# Split collaborations into individual studios and explode rows
# e.g., "Madhouse, Mappa" becomes two separate rows
df_studios['Studios'] = df_studios['Studios'].str.split(',')
df_studios = df_studios.explode('Studios')

# Trim whitespace
df_studios['Studios'] = df_studios['Studios'].str.strip()

# Filter out garbage values (Unknown, empty strings, etc.)
exclude_list = ["UNKNOWN", "NONE", "NAN", "NULL", ""]
df_studios = df_studios[~df_studios['Studios'].str.upper().isin(exclude_list)]

# --- 2. Aggregation ---
# Count occurrences and select Top 20
top_studios = df_studios['Studios'].value_counts().head(20).reset_index()
top_studios.columns = ['Studios', 'Count']

# Sort ascending so the largest bar appears at the top in the chart
top_studios = top_studios.sort_values('Count', ascending=True)

# --- 3. Visualization ---
fig = px.bar(
    top_studios,
    x='Count',
    y='Studios',
    orientation='h',
    text='Count',
    title='<b>Top 20 Most Active Individual Studios By Number of Anime Produced (Processed Data)</b>',
    color='Count',
    # Light pink to dark red gradient
    color_continuous_scale=['#FCC5C0', '#C3122F'] 
)


# --- 4. Visual Refinements ---
fig.update_traces(
    texttemplate='%{text:,}',
    textposition='outside',
    hovertemplate='<b>Studio:</b> %{y}<br><b>Projects:</b> %{x:,}<extra></extra>'
)

fig.update_layout(
    title={
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(size=22, family='Inter', color='#1F2937')
    },
    
    # Global Font Settings
    font=dict(family='Inter', size=13, color="#333333"),
    
    # Background & Sizing
    plot_bgcolor='white',
    paper_bgcolor='white',
    height=750, 
    coloraxis_showscale=False, # Hide the color legend bar
    margin=dict(t=100, l=250, r=50, b=50),

    # X-Axis Configuration (Count)
    xaxis=dict(
        title=None,
        tickfont=dict(size=14, color='#333333'),
        showgrid=False,
        zeroline=False,
        tickformat=',',
        showticklabels=False
    ),

    # Y-Axis Configuration (Studio Names)
    yaxis=dict(
        title=None,
        tickfont=dict(size=15, color='#333333'),
        ticksuffix="  ", # Add padding between text and bar
        showgrid=False,
        showline=False
    )
)

fig.show()

**Data Processed Correction**
1.  The **"UNKNOWN" category has been removed**, allowing the chart to focus entirely on valid production entities.
2.  The **counts are more accurate**. Toei Animation has risen to **864** (up from 834), and Madhouse has risen to **371** (up from 333). This indicates the data was split successfully: the bar chart now counts a studio whether it worked alone or in a partnership.
    *   **Toei Animation** keeps first place, followed by **Sunrise** and **J.C.Staff**. These three studios remain the top three giants of the industry.
    *   **Production I.G** jumped in the ranking (passing Shanghai Animation Film Studio) because its count rose from **269** to **346** (+77). This suggests that the studio has many co-productions with other studios.
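
The increase in a studio's count after splitting corresponds to the number of titles where it is listed alongside a partner. A minimal sketch of that comparison, using made-up rows (the pairings below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy 'Studios' column (made-up rows, for illustration only)
studios = pd.Series([
    "Production I.G",
    "Production I.G, Wit Studio",
    "Production I.G, Xebec",
    "Toei Animation",
    "Toei Animation",
])

# Raw count: rows where the whole string is exactly that studio
solo = studios.value_counts()

# Processed count: every row the studio appears in, alone or with partners
total = studios.str.split(",").explode().str.strip().value_counts()

# Align on individual studio names; studios never listed alone get a solo count of 0
comparison = pd.DataFrame({
    "solo": solo.reindex(total.index, fill_value=0),
    "total": total,
})
comparison["co_productions"] = comparison["total"] - comparison["solo"]
print(comparison)
```

Sorting `comparison` by `co_productions` would surface the studios whose activity is most understated by the raw string counts.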



### **e. Strategy**

In [303]:
# Convert the 'Genres' column from a string representation of a list to an actual list
# Non-string values (e.g., NaN) pass through unchanged; ast.literal_eval raises on malformed strings
df_anime_dataset_2023_prep['Genres_list'] = df_anime_dataset_2023_prep['Genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
genre_score_pairs = []

# Iterate through each row of the DataFrame
for index, row in df_anime_dataset_2023_prep.iterrows():
    genres = row['Genres_list']
    score = row['Score']

    # Check if both genres and score are not null
    if isinstance(genres, list) and not pd.isna(score):
        for genre in genres:
            genre_score_pairs.append({'Genre': genre, 'Score': score})

# Convert the list of dictionaries to a DataFrame
genre_scores_df = pd.DataFrame(genre_score_pairs)

# Group by genre and calculate the mean score
average_score_per_genre = genre_scores_df.groupby('Genre')['Score'].mean().reset_index()

# Sort by average score in descending order
average_score_per_genre = average_score_per_genre.sort_values(by='Score', ascending=False)

print("Top 10 Genres by Average Score:")
display(average_score_per_genre.head(10))

Top 10 Genres by Average Score:


| | Genre | Score |
| :--- | :--- | :--- |
| 3 | Award Winning | 7.296308 |
| 14 | Mystery | 6.995093 |
| 20 | Suspense | 6.962963 |
| 6 | Drama | 6.850645 |
| 15 | Romance | 6.804509 |
| 19 | Supernatural | 6.7446 |
| 18 | Sports | 6.722046 |
| 0 | Action | 6.674112 |
| 1 | Adventure | 6.673997 |
| 11 | Gourmet | 6.627664 |


In [304]:

# --- 1. DATA PREPARATION ---

# Helper function to safely parse lists from strings or mixed types
def parse_list_field(x):
    if pd.isna(x): return []
    if isinstance(x, list): return [str(s).strip() for s in x]
    try:
        # Attempt to parse string representation of list e.g., "['Action', 'Comedy']"
        val = ast.literal_eval(x)
        if isinstance(val, (list, tuple)): return [str(s).strip() for s in val]
    except Exception: pass
    # Fallback: Split by comma for simple strings e.g., "Action, Comedy"
    return [s.strip() for s in str(x).split(',') if s.strip()]

# Initialize working dataframe
df_matrix = df_anime_dataset_2023_prep[['anime_id', 'Studios', 'Genres', 'Score']].copy()
df_matrix = df_matrix.dropna(subset=['Studios', 'Genres', 'Score'])

# Apply parsing logic
df_matrix['Studios_list'] = df_matrix['Studios'].apply(parse_list_field)
df_matrix['Genres_list'] = df_matrix['Genres'].apply(parse_list_field)

# --- STEP 1: FILTER MARKET LEADERS (TOP 20 STUDIOS) ---
# Explode studios to count individual studio participation
df_studios = df_matrix.explode('Studios_list')
df_studios['Studio'] = df_studios['Studios_list'].astype(str).str.strip()

# Identify and filter the Top 20 most active studios
top_20_studio_names = df_studios['Studio'].value_counts().head(20).index.tolist()
df_top_studios = df_studios[df_studios['Studio'].isin(top_20_studio_names)].copy()

# --- STEP 2: IDENTIFY MAJOR MARKETS (TOP 10 GENRES) ---
# Explode genres specifically within the dataset of the Top 20 Studios
df_final = df_top_studios.explode('Genres_list')
df_final['Genre'] = df_final['Genres_list'].astype(str).str.strip()

# Select the Top 10 Genres based on frequency in this subset
top_10_genres = df_final['Genre'].value_counts().head(10).index.tolist()
df_final = df_final[df_final['Genre'].isin(top_10_genres)]

# --- STEP 3: CREATE PERFORMANCE MATRIX (HEATMAP DATA) ---
# Calculate Average Score for every (Studio, Genre) pair
heatmap_data = df_final.groupby(['Studio', 'Genre'])['Score'].mean().unstack()

# Reorder rows and columns to match the "Top" lists for better organization
heatmap_data = heatmap_data.reindex(index=top_20_studio_names, columns=top_10_genres)

# --- 2. VISUALIZATION (HEATMAP) ---

# Custom pink-to-red color scale
custom_colorscale = [
    [0.0, "#FFF7F3"], # Light
    [0.6, '#F768A1'], # Medium
    [1.0, "#C3122F"]  # Dark
]

fig = px.imshow(
    heatmap_data,
    labels=dict(x="Genre", y="Studio", color="Average Score"),
    x=heatmap_data.columns,
    y=heatmap_data.index,
    aspect="auto",
    color_continuous_scale=custom_colorscale,
    text_auto=".2f", # Show scores with 2 decimal places
    title="<b>Top 20 Most Active Individual Studios Specialization: Performance in Top 10 Genres</b>"
)

# --- 3. STYLING ---
fig.update_layout(
    height=800, 
    width=1000,
    font=dict(family='Inter', size=12, color="#333333"),
    plot_bgcolor='white',
    paper_bgcolor='white',
    
    # Title configuration with padding
    title=dict(
        text="<b>Top 20 Most Active Individual Studios Specialization: Performance in Top 10 Genres</b>",
        y=0.97,
        x=0.5,
        xanchor='center',
        yanchor='top',
        font=dict(size=22, family='Inter', color='#1F2937'),
        pad=dict(b=30) # Add padding to separate title from the chart area
    ),
    
    # Adjust margins to accommodate title and top x-axis labels
    margin=dict(t=100, l=150, r=50, b=50),
    
    # Colorbar styling
    coloraxis_colorbar=dict(
        title="Average Score",
        title_font=dict(family="Inter", size=11),
        tickfont=dict(family="Inter", size=10),
        thickness=10, 
        len=0.5,
        outlinecolor="white", 
        outlinewidth=0
    )
)

# Move X-axis labels to the top (Matrix style)
fig.update_xaxes(side="top", tickfont=dict(family="Inter", size=13, color='#333333'))
fig.update_yaxes(tickfont=dict(family="Inter", size=12), title=None, ticksuffix="  ")

# Add whitespace gaps between cells for clarity
fig.update_traces(xgap=1, ygap=1)

fig.show()

#### **Visual Quality with Genre**
The visual quality, including art and animation, constitutes the primary factor attracting audience attention in an anime. Accordingly, producers should first determine the genre of anime they intend to invest in, and subsequently identify and collaborate with studios that demonstrate particular expertise in that genre. For a rigorous evaluation, it is necessary to establish the top 10 genres and assess each studio’s performance within these categories.

**Initial Analysis**
* **Shanghai Animation Film Studio and DLE** are among the top 20 most active individual studios, yet they do not appear in the current heatmap because some of their productions are **missing Genre or Score values**, and those rows are dropped before the ranking is recomputed.
* **Kyoto Animation** is not in the top 20 by quantity; it **focuses on quality over quantity**, releasing fewer titles per year. Its overall performance nevertheless surpasses many other studios, with top-tier average scores in every one of the top 10 genres. In **Drama** specifically, it ranks first with an average score of **7.98**.
* Although **A-1 Pictures** does not dominate any single genre, its performance is still top-tier, with an **average score greater than 7 in every genre.**


**Deep Insight**

**1. The "Masters of Emotion" (Drama, Romance, Slice of Life, Comedy)**

If your script is designed to make the audience cry, feel warm and fuzzy, or laugh, these are your targets:

*   **Kyoto Animation:**
    *   They have deep red blocks for **Drama (7.98)**, **Romance (7.51)**, **Slice of Life (7.28)**, and **Comedy (7.24)**.
    *   **Strategy:** They are the #1 choice for "premium" emotional storytelling. However, they are very exclusive and expensive, so we need to weigh our anime's plot and budget before approaching this studio.
*   **TMS Entertainment:**
    *   Although they are an old studio, they score very high in **Drama (7.40)** and **Romance (7.46)** (likely due to modern hits like *Fruits Basket*).
    *   **Strategy:** A perfect alternative if you need high-quality romance but with a faster, more industrial production speed than Kyoto Animation.

**2. The "Blockbuster Makers" (Action, Fantasy, Sci-Fi, Adventure)**

If you are making a Shounen anime with big battles and special effects:

*   **Bones:**
    *   They are consistent high-performers. Their scores in **Fantasy (7.53)** and **Action (7.42)** are very reliable.
    *   **Strategy:** The safest bet for a high-budget action series. Bones rarely makes a "bad" action movie.
*   **A-1 Pictures:**
    *   The "All-Rounder." Their row is a solid dark pink across the board. They perform well in almost every genre, including **Adventure (7.27)** and **Sci-Fi (7.07)**.
    *   **Strategy:** If Bones is busy, A-1 Pictures is the best reliable alternative. They manage projects well and maintain good quality.

**3. The "Mysterious" Niche (Mystery, Supernatural)**
*   **Shaft:**
    *   Look at the **Mystery** column. Shaft has the highest score on the entire board (**8.07**).
    *   **Strategy:** Only hire them for unique, psychological, or "weird" anime (like *Monogatari*). Do not hire them for standard, generic action anime; their style is too specific.

**4. The "Mass Production" Factories**
*   **Toei Animation, Sunrise, J.C.Staff:**
    *   In the top 20 by quantity, Toei is #1 by a wide margin (864 projects). In the heatmap, however, the cells for these studios are light pink (scores of 6.4 to 6.9).
    *   **Likely reason:** They produce long-running shows (100+ episodes, like *One Piece*). Long shows often have budget dips, which can lower the average score.
    *   **Strategy:** Choose them if you want a long-term partner for a franchise that runs for years. Don't choose them if you want a short, high-quality 12-episode masterpiece.

**5. The "Danger Zones" (White Spaces)**
*   **Studios like T-Rex, Nippon Animation:**
    *   You will see white boxes in their rows. A white box means the studio has **no scored titles** in that genre within this subset, so no average can be computed.
    *   **Strategy:** Avoid asking *Nippon Animation* to make a *Sci-Fi* movie. They have no track record there, making it a high risk for investors.
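
A white cell appears whenever a (Studio, Genre) pair has no rows, and a colored cell can still rest on very few titles. One way to check this is to build a count matrix alongside the mean matrix. The sketch below uses made-up data, and the `min_titles` threshold is a hypothetical choice rather than something the notebook defines.

```python
import pandas as pd

# Toy exploded table: one row per (title, studio, genre); values are made up
toy = pd.DataFrame({
    "Studio": ["Studio A", "Studio A", "Studio A", "Studio B", "Studio B"],
    "Genre":  ["Drama", "Drama", "Sci-Fi", "Drama", "Drama"],
    "Score":  [8.0, 7.0, 6.0, 6.5, 7.5],
})

grouped = toy.groupby(["Studio", "Genre"])["Score"]

# Mean score per pair; pairs with no rows become NaN (the white cells)
mean_matrix = grouped.mean().unstack()

# Number of scored titles behind each cell; missing pairs become 0
count_matrix = grouped.count().unstack(fill_value=0)

# Hypothetical threshold: hide averages backed by fewer than `min_titles` titles
min_titles = 2
reliable = mean_matrix.where(count_matrix >= min_titles)

print(mean_matrix)
print(count_matrix)
print(reliable)
```

Passing `reliable` instead of `mean_matrix` to the heatmap would blank out averages that rest on a single title, which makes the remaining cells easier to compare.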
