# 01. Data Exploration & Visualization

**Objective**: Explore the data, perform EDA, and build visualizations that justify the preprocessing decisions.

**Corresponds to Report Section 2**: Data Visualization & Evidence-Based Decisions

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries loaded successfully!")

## 1.1 Load Raw Data

In [None]:
# Load raw data
df = pd.read_csv('../data/raw/Global_Data_filtered.csv')
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

## 1.2 Missing Values Analysis

**Report Section 2.1**: Missing Data Analysis

In [None]:
# Missing values per column
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percent': missing_pct}).sort_values('Percent', ascending=False)
print(missing_df[missing_df['Percent'] > 0])

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
missing_df[missing_df['Percent'] > 0].plot(kind='barh', y='Percent', ax=ax, color='coral')
ax.set_xlabel('Missing Percentage (%)')
ax.set_title('Missing Values by Column')
plt.tight_layout()
plt.show()
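
The missing-value table motivates imputation. A minimal sketch of median imputation follows, assuming numeric columns only; `impute_median`, its `reference` parameter, and the toy values are illustrative and not part of the project code. Passing the training split as `reference` keeps test-period statistics out of the fill values.

```python
from typing import Optional

import numpy as np
import pandas as pd


def impute_median(frame: pd.DataFrame, reference: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    """Fill NaNs in numeric columns with medians computed on `reference`.

    `reference` defaults to `frame` itself (hypothetical helper, not project code).
    """
    source = frame if reference is None else reference
    numeric = frame.select_dtypes(include=[np.number]).columns
    medians = source[numeric].median()  # NaNs are skipped when computing medians
    filled = frame.copy()
    filled[numeric] = filled[numeric].fillna(medians)
    return filled


# Toy data (hypothetical values, not taken from Global_Data_filtered.csv)
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "gdp": [1.0, np.nan, 3.0, 100.0],
    "co2": [10.0, 20.0, np.nan, 40.0],
})
imputed = impute_median(toy)
print(imputed)
```

The median is less sensitive than the mean to the heavy right tails seen in this dataset, which is one reason it is a common default here.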

## 1.3 Skewness Analysis

**Report Section 2.3**: Skewness Analysis: Justifying the Log Transform

In [None]:
# Numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Calculate skewness
skewness = df[numeric_cols].skew().sort_values(key=abs, ascending=False)
print("Top 10 Skewed Features:")
print(skewness.head(10))

# Visualize top skewed feature
top_skewed = skewness.index[0]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before log
axes[0].hist(df[top_skewed].dropna(), bins=50, color='steelblue', edgecolor='black')
axes[0].set_title(f'{top_skewed}\nSkewness: {skewness[top_skewed]:.2f}')
axes[0].set_xlabel('Value')

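# After log (log1p needs values > -1; check the feature's minimum before trusting this panel)
log_values = np.log1p(df[top_skewed].dropna())
axes[1].hist(log_values, bins=50, color='seagreen', edgecolor='black')
# Use pandas' bias-corrected skew so both panels report the same estimator
axes[1].set_title(f'log1p({top_skewed})\nSkewness: {log_values.skew():.2f}')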
axes[1].set_xlabel('Log Value')

plt.tight_layout()
plt.show()
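
The before/after histogram covers a single feature. One way to extend the idea to every strongly right-skewed column is sketched below. The helper name `log1p_skewed` and the threshold of 1.0 are assumptions chosen for illustration, and the toy values are made up. Only non-negative columns are transformed, because `log1p` is undefined at or below -1 and a log transform does not help left-skewed data.

```python
import numpy as np
import pandas as pd


def log1p_skewed(frame: pd.DataFrame, threshold: float = 1.0):
    """Apply log1p to non-negative numeric columns whose skew exceeds `threshold`.

    Returns the transformed copy and the list of transformed column names.
    (Hypothetical helper; the threshold is an assumed rule of thumb.)
    """
    numeric = frame.select_dtypes(include=[np.number])
    skew = numeric.skew()
    non_negative = numeric.min() >= 0
    selected = skew[(skew > threshold) & non_negative].index.tolist()
    out = frame.copy()
    if selected:
        out[selected] = np.log1p(out[selected])
    return out, selected


# Toy data (hypothetical values)
toy = pd.DataFrame({
    "emissions": [1.0, 2.0, 3.0, 4.0, 1000.0],  # one extreme value, strong right skew
    "share": [10.0, 20.0, 30.0, 40.0, 50.0],    # symmetric, left untouched
})
transformed, selected = log1p_skewed(toy)
print(selected)
print(transformed)
```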

## 1.4 Outlier Analysis

**Report Section 2.4**: Outlier Analysis: Signal vs. Noise

In [None]:
# Target column
target = 'Value_co2_emissions_kt_by_country'

# Top emitters
top_emitters = df.groupby('Entity')[target].mean().sort_values(ascending=False).head(10)
print("Top 10 CO2 Emitters (Avg):")
print(top_emitters)

# Box plot
fig, ax = plt.subplots(figsize=(12, 6))
df.boxplot(column=target, ax=ax)
ax.set_title('CO2 Emissions Distribution (Global)')
ax.set_ylabel('kt CO2')
plt.show()

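# Name the top emitters from the data instead of hardcoding them
top3 = ', '.join(top_emitters.index[:3].astype(str))
print(f"\n⚠️ 'Outliers' are actually major economies ({top3}) - NOT noise!")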
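
To check which entities the box plot's outlier points belong to, the rows outside the whiskers can be counted per entity. The sketch below uses the conventional 1.5 × IQR rule, which is also matplotlib's default whisker length. The helper name `iqr_outliers_by_entity` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outliers_by_entity(frame, value_col, entity_col="Entity", k=1.5):
    """Count rows per entity lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = frame[value_col]
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    flagged = frame[(values < q1 - k * iqr) | (values > q3 + k * iqr)]
    return flagged[entity_col].value_counts()


# Toy data (hypothetical values): one large emitter among small ones
toy = pd.DataFrame({
    "Entity": ["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S4", "Big", "Big"],
    "co2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0, 1100.0],
})
counts = iqr_outliers_by_entity(toy, "co2")
print(counts)
```

If the flagged rows concentrate in a handful of large economies across many years, that supports treating them as signal rather than measurement noise.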

## 1.5 Correlation & Multicollinearity

**Report Section 2.5**: Multicollinearity Analysis

In [None]:
# Correlation matrix
corr_matrix = df[numeric_cols].corr()

# Heatmap
fig, ax = plt.subplots(figsize=(14, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Correlation Matrix (Numeric Features)')
plt.tight_layout()
plt.show()

# High correlations
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.9:
            high_corr.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

print(f"\nHigh Correlation Pairs (|r| > 0.9): {len(high_corr)}")
for c1, c2, r in high_corr[:5]:
    print(f"  {c1} <-> {c2}: r = {r:.2f}")
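
Pairwise correlations miss collinearity that involves three or more features, which is why a VIF analysis is the usual follow-up. A minimal sketch is given below. It relies on the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix. The helper name `vif_from_correlation` and the synthetic data are assumptions, and rows with missing values are simply dropped here.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(frame: pd.DataFrame) -> pd.Series:
    """VIF per numeric column, read off the diagonal of the inverse correlation matrix.

    Raises numpy.linalg.LinAlgError if two columns are perfectly collinear.
    """
    corr = frame.dropna().corr()
    vif = np.diag(np.linalg.inv(corr.to_numpy()))
    return pd.Series(vif, index=corr.columns, name="VIF").sort_values(ascending=False)


# Synthetic data (hypothetical): x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})
vif = vif_from_correlation(toy)
print(vif)
```

A common rule of thumb treats VIF above 5 or 10 as problematic, though the cutoff is a convention rather than a hard rule.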

## 1.6 Time Series Analysis

In [None]:
# Global CO2 over time
global_co2 = df.groupby('Year')[target].sum()

fig, ax = plt.subplots(figsize=(12, 5))
global_co2.plot(ax=ax, marker='o', linewidth=2, color='darkred')
ax.axvline(x=2015, color='gray', linestyle='--', label='Train/Test Split')
ax.set_xlabel('Year')
ax.set_ylabel('Total CO2 (kt)')
ax.set_title('Global CO2 Emissions Over Time')
ax.legend()
plt.tight_layout()
plt.show()

print("\nTrain Period: 2001-2014")
print("Test Period: 2015-2019")
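
The split marked on the plot can be sketched as a simple year-based filter. The helper name `split_by_year` and the toy rows are hypothetical. The 2015 boundary follows the train and test periods printed above.

```python
import pandas as pd


def split_by_year(frame, split_year=2015, year_col="Year"):
    """Rows before `split_year` go to train; rows from `split_year` onward go to test."""
    train = frame[frame[year_col] < split_year]
    test = frame[frame[year_col] >= split_year]
    return train, test


# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2013, 2014, 2015, 2016, 2013, 2014, 2015, 2016],
    "co2": [1.0, 1.1, 1.2, 1.3, 5.0, 5.2, 5.4, 5.6],
})
train, test = split_by_year(toy)
print(len(train), len(test))
```

Splitting on the year rather than at random ensures that no future observation of a country leaks into the training set.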

## Summary

**Conclusions from the EDA**:
1. **Missing Values**: Imputation is required (median)
2. **Skewness**: Some features need a log transform
3. **Outliers**: Top emitters (China, USA) are signal and must NOT be removed
4. **Multicollinearity**: Many features are highly correlated, so a VIF analysis is needed
5. **Time-Series**: A time-series split must be used to avoid data leakage