## **I. Introduction**
**1. Narrative Context**

In the previous stage, we took on the role of a Producer seeking the formula behind high-scoring anime. However, our first obstacle was immediate and unavoidable: the Raw Dataset was in poor condition. It contained missing values, unstructured text, inconsistent formats, and fragmented categories — making meaningful analysis impossible.

This notebook presents the next stage of the workflow: **the transformation**.
After applying a complete Data Processing Pipeline to clean, standardize, and engineer features, we now visualize the impact of that transformation. This notebook serves as evidence of how structured data enables structured insights.

**2. Objective of This Analysis**

The focus of this notebook is **Comparative Analysis**.
For every major feature, we address two key questions:

1. **The Transformation Insight** — How did the feature evolve from **Raw** to **Processed**?
This shows why Data Preparation is not optional but foundational.

2. **The Business Insight** — Once the data is clean, what does it reveal about the drivers of **high Score**?

In short, the purpose is not only to clean the data, but to demonstrate how cleaning transforms noise into clarity and reveals the signals that matter.

**3. Analytical Roadmap**

This transformation is explored through four structured themes:

1. **The Foundation – Target Variable (Score)**
   Revealing the true distribution and statistical behavior of the target metric.

2. **Theme A – Market Factors**
   Understanding how Media Type (TV, Movie, OVA…) and Source Material (Manga, Original, Game…) influence performance.

3. **Theme B – Creative Factors**
   Unpacking Genres, Producers, and Studios to identify collaboration patterns, specialization strengths, and creative drivers.

4. **Theme C – Release Strategy**
   Converting unstructured Aired dates, Duration formats, and Episode structures into analyzable fields to uncover timing and format advantages.
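
To make "converting unstructured fields" concrete, here is a minimal sketch of turning a free-text duration into minutes. The string formats (`"24 min per ep"`, `"1 hr 30 min"`, `"Unknown"`) and the helper name `duration_to_minutes` are assumptions for illustration. They are not confirmed by this notebook and may differ from what the actual Data Processing Pipeline does.

```python
import re

import numpy as np
import pandas as pd


def duration_to_minutes(text):
    """Hypothetical helper: parse '1 hr 30 min' style strings into minutes (NaN if nothing parses)."""
    if not isinstance(text, str):
        return np.nan
    hours = re.search(r"(\d+)\s*hr", text)
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*sec", text)
    if not (hours or minutes or seconds):
        return np.nan  # e.g. "Unknown" carries no usable duration
    total = 0.0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    if seconds:
        total += int(seconds.group(1)) / 60
    return total


# Assumed example values, not rows taken from the dataset.
sample = pd.Series(["24 min per ep", "1 hr 30 min", "45 sec", "Unknown"])
parsed = sample.apply(duration_to_minutes)
print(parsed)
```

Returning NaN instead of 0 for unparseable text keeps missing durations from being mistaken for very short ones in later aggregations.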

**4. Deep Business Insights → Strategic Recommendations**

At the end of this notebook, we consolidate the strongest patterns discovered across all themes to form **Strategic Recommendations** for creating or selecting high-scoring anime.
These insights connect data evidence to real-world decision-making — guiding choices about **content strategy**, **production planning**, **studio partnerships**, and **release timing**.




In [4]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import pandas as pd
import numpy as np
import joblib
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_white"

In [18]:
path_raw = r'..\data\raw\anime-dataset-2023.csv'
df_anime_dataset_2023 = pd.read_csv(path_raw)

path = r'..\data\processed\prepared_data.csv'
df_anime_dataset_2023_prep = pd.read_csv(path)

## **II. Target Variable: Score**

### **1. Issue Overview**


### **2. Solution**


### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**


In [19]:
import plotly.express as px
from plotly.subplots import make_subplots
import pandas as pd

# --- Prepare Data ---
df_raw_viz = df_anime_dataset_2023.copy()
df_raw_viz['Score'] = df_raw_viz['Score'].astype(str)

unknown_count = df_raw_viz[df_raw_viz['Score']=='UNKNOWN'].shape[0]

score_counts = df_raw_viz['Score'].value_counts().reset_index()
score_counts.columns = ['Score','Frequency']
score_counts['Color'] = score_counts['Score'].apply(lambda x: 'Misleading (Unknown)' if x=='UNKNOWN' else 'Valid Score')
top_scores = score_counts.head(15)

# --- Create Subplots ---
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("<b>Broken Histogram: Raw Score</b>", "<b>Top 15 Scores (Unknown Highlighted)</b>"),
    horizontal_spacing=0.15  # spacing between the two charts
)

# --- Chart 1: Histogram ---
hist_fig = px.histogram(df_raw_viz, x='Score', color_discrete_sequence=["#7814cf"])
for trace in hist_fig.data:
    fig.add_trace(trace, row=1, col=1)

fig.add_annotation(
    x='UNKNOWN', y=unknown_count,
    text=f"<b>UNKNOWN: {unknown_count}</b><br>(Dominates the data)",
    showarrow=True, arrowhead=2, ax=0, ay=-40,
    font=dict(color="#C3122F", size=14), arrowcolor="#C3122F",
    row=1, col=1
)

fig.update_xaxes(type='category', row=1, col=1, showticklabels=False)
fig.update_yaxes(title_text="Count", row=1, col=1)

# --- Chart 2: Horizontal Bar ---
bar_fig = px.bar(
    top_scores,
    x='Frequency',
    y='Score',
    orientation='h',
    color='Color',
    color_discrete_map={'Misleading (Unknown)': '#C3122F', 'Valid Score': '#95a5a6'},
    text='Frequency'
)
for trace in bar_fig.data:
    fig.add_trace(trace, row=1, col=2)

# --- Critical Settings to keep text outside ---
fig.update_traces(textposition='outside', textfont=dict(size=14, color='black'), row=1, col=2)

fig.update_yaxes(categoryorder='total ascending', row=1, col=2, showgrid=False, showline=False)
fig.update_xaxes(showticklabels=False, showgrid=False, zeroline=False, showline=False,
                 row=1, col=2, automargin=True, range=[0, max(top_scores['Frequency'])*1.3])

# --- Overall layout ---
fig.update_layout(
    template='plotly_white',
    showlegend=False,
    height=500,
    width=1300,  # increase the overall width
    margin=dict(l=60, r=150, t=80, b=50),
    title_text="<b>Raw Score Analysis: Broken vs Unknown Dominance</b>"
)

fig.show()


# 1. Drop NaN for plotting only (storytelling purpose: show the valid distribution).
# In prepared_data.csv, Score is already float but may still contain NaNs,
# so we visualize the available valid scores from the processed DataFrame loaded above.
valid_scores = df_anime_dataset_2023_prep['Score'].dropna()

# 2. Calculate Statistics for Annotation
mean_score = valid_scores.mean()
median_score = valid_scores.median()

# 3. Plot Histogram
fig = px.histogram(
    x=valid_scores,
    nbins=40, # Granular bins
    title='<b>True Anime Score Distribution (Cleaned)</b><br><i>(Gaussian-like distribution after removing noise)</i>',
    color_discrete_sequence=["#7C2EC1"],  # purple
    opacity=0.8
)

# 4. Add Mean Line (Vertical Line)
fig.add_vline(
    x=mean_score, 
    line_width=3, 
    line_dash="dash", 
    line_color="#C3122F",
    annotation_text=f"Mean: {mean_score:.2f}", 
    annotation_position="top right"
)

# 5. Declutter
fig.update_layout(
    template='plotly_white',
    xaxis_title="Score (0-10)",
    yaxis_title="Frequency",
    bargap=0.1,
    height=500
)

fig.show()

##### **The Transformation Insight:** 

**The Misleading View:** The raw data suggested that a large portion of anime had undefined or unknown scores, obscuring the true distribution and making it impossible to assess typical performance.
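
As a rough illustration of how the `UNKNOWN` placeholder can be measured and then neutralized, the sketch below uses a tiny made-up Series rather than the real dataset. It assumes the raw column mixes numeric strings with the literal `"UNKNOWN"`, as the raw charts suggest. The actual pipeline may handle this differently.

```python
import pandas as pd

# Hypothetical miniature of the raw Score column: numeric strings plus the "UNKNOWN" placeholder.
raw_scores = pd.Series(["8.75", "UNKNOWN", "6.4", "UNKNOWN", "7.1"], name="Score")

# Share of rows whose score is the placeholder rather than a number.
unknown_share = (raw_scores == "UNKNOWN").mean()

# errors="coerce" turns anything non-numeric (here "UNKNOWN") into NaN,
# so the column becomes float and can be summarized or plotted as numbers.
clean_scores = pd.to_numeric(raw_scores, errors="coerce")

print(f"Unknown share: {unknown_share:.0%}")
print(clean_scores.describe())
```

Coercing to NaN rather than dropping rows keeps the other columns of those anime available for analyses that do not depend on Score.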

**The Truth:** The clean data reveals a roughly Gaussian distribution, with most scores clustering around 6.38, and extreme low or high scores being rare.
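
One way to sanity-check a "roughly Gaussian" reading is to compare the mean with the median and look at skewness and excess kurtosis. The sketch below runs on synthetic data generated for illustration. The location and scale passed to the generator are assumptions, not statistics taken from the cleaned dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned Score column (assumed values, not the real data).
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(loc=6.4, scale=0.9, size=5000)).clip(0, 10)

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "skew": scores.skew(),             # near 0 for a symmetric distribution
    "excess_kurtosis": scores.kurt(),  # near 0 for a normal distribution
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

On the real data, the same four numbers could be computed from `valid_scores`. A mean close to the median and a skew near zero would support the symmetric, bell-shaped interpretation.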

**The Strategic Value:** Instead of noise, we now have a reliable numeric feature to analyze how Score varies across Types, Sources, and other factors, enabling accurate business insights.

## **III. Theme A — Market Factors (Type & Source)**

### **1. Issue Overview**


### **2. Solution**


### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**

##### **The Transformation Insight:**
Apply the template:

**The Misleading View:** The raw data suggested that [Misleading Insight] (or obscured the pattern due to [Specific Noise]).

**The Truth:** The clean data corrects this by revealing that [Correct Insight/Real Pattern].

**The Strategic Value:** Instead of noise, we now have a reliable feature to analyze [How this feature impacts Score].

### **4. Business Insights**

#### **A. [Name]**
**Question**:

**Insight:**

**Business Takeaway:**

## **IV. Theme B — Creative & Production Factors (Genres, Producers, Studios)**

### **1. Issue Overview**


### **2. Solution**


### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**


##### **The Transformation Insight:**

**Before**:

**After**:

### **4. Business Insights**

#### **A. Producer Count Analysis** 
**Question**:

**Insight:**

**Business Takeaway:**

## **V. Theme C — Release Strategy (Aired, Episodes, Duration)**

### **1. Issue Overview**


### **2. Solution**


### **3. Visual Evidence: Misleading (Raw) vs. Correct (Clean) Insights**


##### **The Transformation Insight:**

**Before**:

**After**:

### **4. Business Insights**

#### **A.** 
**Question**:

**Insight:**

**Business Takeaway:**

## **VI. Deep Business Insights & Strategic Recommendations**

**1. Key Patterns Across All Themes**

**2. Strategic Recommendations**

