# Home Credit Default Risk — Preprocessing

**Project:** AI-Powered Intelligent Risk Management System  
**Dataset:** [Kaggle - Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk)  
**Input:** `data/train_featured.parquet` (307,511 × 212) | `data/test_featured.parquet` (48,744 × 211)  
**Objective:** Validate full-data statistics, handle missing values, encode categoricals, remove multicollinearity, and prepare the final model-ready dataset.

---

## Table of Contents
1. [Libraries & Configuration](#1)
2. [Data Loading](#2)
3. [Full-Data Validation (EDA Sanity Check)](#3)
4. [Missing Value Strategy](#4)
5. [Outlier Handling](#5)
6. [Categorical Encoding](#6)
7. [Multicollinearity Check](#7)
8. [Final Dataset & Export](#8)

<a id='1'></a>
## 1. Libraries & Configuration

In [13]:
import sys
import os
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add project root to path for local imports
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

# Paths
DATA_DIR = os.path.join(PROJECT_ROOT, 'data')
PLOT_DIR = os.path.join(PROJECT_ROOT, 'notebooks', 'plots')
os.makedirs(PLOT_DIR, exist_ok=True)

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.4f}'.format)

# Plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print(f"Data directory : {DATA_DIR}")
print(f"Plot directory : {PLOT_DIR}")

Data directory : c:\Users\busra\Projects\Ai_Credit_Risk\data
Plot directory : c:\Users\busra\Projects\Ai_Credit_Risk\notebooks\plots


<a id='2'></a>
## 2. Data Loading

We load the parquet files produced by the feature engineering pipeline.  
Train: 307,511 × 212 | Test: 48,744 × 211 (no TARGET column).

In [12]:
train = pd.read_parquet(os.path.join(DATA_DIR, 'train_featured.parquet'))
test  = pd.read_parquet(os.path.join(DATA_DIR, 'test_featured.parquet'))

print(f"Train shape: {train.shape}")
print(f"Test  shape: {test.shape}")
print(f"\nTrain memory: {train.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"Test  memory: {test.memory_usage(deep=True).sum() / 1e6:.1f} MB")

ArrowKeyError: A type extension with name pandas.period already defined
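
Since the test set is expected to have exactly one column fewer than the training set (211 vs 212), it can help to confirm that `TARGET` is the only difference before any preprocessing. The sketch below is one way to do that. It uses two small made-up frames in place of the real parquet files, and the helper name `column_diff` is hypothetical.

```python
import pandas as pd


def column_diff(train_df: pd.DataFrame, test_df: pd.DataFrame) -> dict:
    """Return the columns that appear in only one of the two frames."""
    train_cols, test_cols = set(train_df.columns), set(test_df.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
    }


# Toy stand-ins for the train/test frames (not the real data)
toy_train = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [1.0, 2.0], "TARGET": [0, 1]})
toy_test = pd.DataFrame({"SK_ID_CURR": [3], "AMT_CREDIT": [3.0]})

diff = column_diff(toy_train, toy_test)
print(diff)
```

On the real data the same call would be `column_diff(train, test)`; anything other than `TARGET` under `only_in_train`, or any entry under `only_in_test`, would suggest the two feature pipelines have drifted apart.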

<a id='3'></a>
## 3. Full-Data Validation (EDA Sanity Check)

The EDA was carried out on a 20% sample. Here we confirm the main findings on the full data:  
- Class distribution (TARGET rate)  
- Missing value map  
- Numerical distribution statistics (outlier detection)  
- Categorical column distributions

### 3.1 — Target Distribution (Class Balance)

In [None]:
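# Class distribution: the EDA on the 20% sample found a default rate of about 8%
# sort_index() fixes the order as 0, 1 so it matches the tick labels and colors below
target_counts = train['TARGET'].value_counts().sort_index()
target_pct    = train['TARGET'].value_counts(normalize=True).sort_index() * 100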

print("=" * 50)
print("TARGET DISTRIBUTION (Full Data)")
print("=" * 50)
for val in [0, 1]:
    print(f"  TARGET={val}: {target_counts[val]:>7,} ({target_pct[val]:.2f}%)")
print(f"  Ratio: 1:{target_counts[0] / target_counts[1]:.1f}")
print(f"  Total: {len(train):,}")
print("=" * 50)

# Visualization
fig, ax = plt.subplots(figsize=(6, 4))
colors = ['#2ecc71', '#e74c3c']
target_counts.plot(kind='bar', color=colors, edgecolor='black', ax=ax)
ax.set_title('Target Distribution (Full Data)', fontsize=14, fontweight='bold')
ax.set_xlabel('TARGET')
ax.set_ylabel('Count')
ax.set_xticklabels(['0 (No Default)', '1 (Default)'], rotation=0)
for i, (count, pct) in enumerate(zip(target_counts, target_pct)):
    ax.text(i, count + 2000, f'{count:,}\n({pct:.1f}%)', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
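
The comparison above assumes the 20% EDA sample is representative of the full data. The notebook does not say how that sample was drawn, so the following is only an illustration of one option: a class-stratified sample, which keeps the default rate of the sample equal to that of the full data up to rounding. The data here is made up.

```python
import numpy as np
import pandas as pd

# Toy target with an 8% positive rate (made-up data, not the real TARGET column)
toy = pd.DataFrame({"TARGET": np.array([1] * 80 + [0] * 920)})

# Draw 20% from each class separately so the class ratio is preserved
sample = toy.groupby("TARGET").sample(frac=0.2, random_state=42)

full_rate = toy["TARGET"].mean()
sample_rate = sample["TARGET"].mean()
print(f"Full rate:   {full_rate:.4f}")
print(f"Sample rate: {sample_rate:.4f}")
```

A plain `toy.sample(frac=0.2)` would also work, but its default rate varies from draw to draw, which is the kind of gap this validation step is meant to surface.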

### 3.2 — Missing Value Map

In [None]:
# Missing value analysis on the full data
missing = train.isnull().sum()
missing_pct = (missing / len(train)) * 100
missing_df = pd.DataFrame({
    'missing_count': missing,
    'missing_pct': missing_pct
}).query('missing_count > 0').sort_values('missing_pct', ascending=False)

print(f"Total columns with missing values: {len(missing_df)} / {train.shape[1]}")
print(f"Columns with >50% missing: {(missing_df['missing_pct'] > 50).sum()}")
print(f"Columns with 20-50% missing: {((missing_df['missing_pct'] > 20) & (missing_df['missing_pct'] <= 50)).sum()}")
print(f"Columns with <20% missing: {(missing_df['missing_pct'] <= 20).sum()}")
print()

# Report columns grouped by missing-rate bucket
print("=" * 70)
print("COLUMNS WITH >50% MISSING (drop candidates)")
print("=" * 70)
high_missing = missing_df[missing_df['missing_pct'] > 50]
if len(high_missing) > 0:
    for col, row in high_missing.iterrows():
        print(f"  {col:<45} {row['missing_pct']:>6.1f}%  ({row['missing_count']:>7,.0f})")
else:
    print("  None")

print()
print("=" * 70)
print("COLUMNS WITH 20-50% MISSING")
print("=" * 70)
mid_missing = missing_df[(missing_df['missing_pct'] > 20) & (missing_df['missing_pct'] <= 50)]
if len(mid_missing) > 0:
    for col, row in mid_missing.iterrows():
        print(f"  {col:<45} {row['missing_pct']:>6.1f}%  ({row['missing_count']:>7,.0f})")
else:
    print("  None")

print()
print("=" * 70)
print("COLUMNS WITH <20% MISSING")
print("=" * 70)
low_missing = missing_df[missing_df['missing_pct'] <= 20]
if len(low_missing) > 0:
    for col, row in low_missing.iterrows():
        print(f"  {col:<45} {row['missing_pct']:>6.1f}%  ({row['missing_count']:>7,.0f})")
else:
    print("  None")
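
The three buckets used above (at most 20%, 20 to 50%, above 50%) can also be assigned in a single step with `pd.cut`, which avoids repeating the boolean masks. This is a minimal sketch on a made-up frame, and the bucket labels are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with a different missing rate per column (made-up data)
toy = pd.DataFrame({
    "a": [np.nan] * 8 + [1.0] * 2,   # 80% missing
    "b": [np.nan] * 4 + [1.0] * 6,   # 40% missing
    "c": [np.nan] * 1 + [1.0] * 9,   # 10% missing
    "d": [1.0] * 10,                 # complete
})

missing_pct = toy.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0]  # keep only columns with gaps

# Right-closed bins reproduce the (0, 20], (20, 50], (50, 100] split
buckets = pd.cut(missing_pct, bins=[0, 20, 50, 100],
                 labels=["<=20%", "20-50%", ">50%"])
summary = buckets.value_counts().sort_index()
print(summary)
```

Because `pd.cut` uses right-closed intervals by default, a column with exactly 20% or exactly 50% missing falls into the lower bucket, matching the `<=` comparisons in the cell above.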

In [None]:
# Missing-value bar chart: the 30 columns with the highest missing percentage
top_missing = missing_df.head(30)

fig, ax = plt.subplots(figsize=(14, 8))
colors = ['#e74c3c' if pct > 50 else '#f39c12' if pct > 20 else '#3498db'
          for pct in top_missing['missing_pct']]
bars = ax.barh(range(len(top_missing)), top_missing['missing_pct'], color=colors, edgecolor='black', linewidth=0.5)
ax.set_yticks(range(len(top_missing)))
ax.set_yticklabels(top_missing.index, fontsize=9)
ax.set_xlabel('Missing %')
ax.set_title('Top 30 Columns by Missing Percentage (Full Data)', fontsize=14, fontweight='bold')
ax.axvline(x=50, color='red', linestyle='--', alpha=0.7, label='50% threshold')
ax.axvline(x=20, color='orange', linestyle='--', alpha=0.7, label='20% threshold')
ax.legend()
ax.invert_yaxis()

for i, pct in enumerate(top_missing['missing_pct']):
    ax.text(pct + 0.5, i, f'{pct:.1f}%', va='center', fontsize=8)

plt.tight_layout()
plt.show()

### 3.3 — Numerical Distribution Statistics (Outlier Detection)

In [None]:
# Summary statistics for numerical columns, used to spot outliers
num_cols = train.select_dtypes(include=[np.number]).columns.tolist()
num_cols = [c for c in num_cols if c not in ['SK_ID_CURR', 'TARGET']]

print(f"Numerical columns (excluding SK_ID_CURR, TARGET): {len(num_cols)}")
print()

# Percentile table
desc = train[num_cols].describe(percentiles=[.01, .05, .25, .5, .75, .95, .99]).T
desc['iqr'] = desc['75%'] - desc['25%']
desc['range'] = desc['max'] - desc['min']
desc['skew'] = train[num_cols].skew()

# Columns that are extremely skewed or have an extremely wide range
print("=" * 70)
print("HIGHLY SKEWED COLUMNS (|skew| > 5) — log transform candidates")
print("=" * 70)
skewed = desc[desc['skew'].abs() > 5].sort_values('skew', ascending=False)
if len(skewed) > 0:
    print(skewed[['mean', 'std', 'min', '50%', '99%', 'max', 'skew']].to_string())
else:
    print("  None")

print()
print("=" * 70)
print("EXTREME OUTLIER COLUMNS (max > 10x 99th percentile)")
print("=" * 70)
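# Require a positive 99th percentile; for zero or negative values the 10x comparison is not meaningful
extreme = desc[(desc['99%'] > 0) & (desc['max'] > desc['99%'] * 10)]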
if len(extreme) > 0:
    print(extreme[['mean', '99%', 'max', 'skew']].to_string())
else:
    print("  None")
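
The cell above labels highly skewed columns as log transform candidates. As a rough illustration of why, the sketch below applies `np.log1p` to made-up, right-skewed, non-negative values and compares the skewness before and after. It does not use the real columns, and columns that contain negative values would need different handling.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed, non-negative values standing in for an amount-like column
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=10.0, sigma=1.5, size=10_000))

skew_raw = amounts.skew()
skew_log = np.log1p(amounts).skew()  # log1p is defined at 0, unlike log

print(f"Skew before log1p: {skew_raw:.2f}")
print(f"Skew after  log1p: {skew_log:.2f}")
```

Whether a transform is worth applying depends on the downstream model: tree-based models are largely insensitive to monotonic transforms, while linear models and distance-based methods usually benefit.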

### 3.4 — Categorical Column Overview

In [None]:
# Detect categorical columns
cat_cols = train.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"Categorical columns: {len(cat_cols)}")
print()

if len(cat_cols) > 0:
    cat_summary = pd.DataFrame({
        'dtype': train[cat_cols].dtypes,
        'nunique': train[cat_cols].nunique(),
        'missing_pct': (train[cat_cols].isnull().sum() / len(train) * 100).round(2),
        'top_value': [train[c].mode().iloc[0] if not train[c].mode().empty else 'N/A' for c in cat_cols],
        'top_freq_pct': [(train[c].value_counts(normalize=True).iloc[0] * 100).round(1)
                         if train[c].notna().any() else 0 for c in cat_cols]
    }).sort_values('nunique', ascending=False)

    print(cat_summary.to_string())
    print()

    # Columns with very few categories (binary or 3-4 levels): Label Encoding candidates
    print("=" * 70)
    print("BINARY / LOW-CARDINALITY COLUMNS (nunique <= 4) — Label Encoding candidates")
    print("=" * 70)
    low_card = cat_summary[cat_summary['nunique'] <= 4]
    for col in low_card.index:
        vals = train[col].value_counts(dropna=False).head(5)
        print(f"\n  {col} (nunique={low_card.loc[col, 'nunique']}):")
        for v, cnt in vals.items():
            print(f"    {str(v):<30} {cnt:>7,} ({cnt/len(train)*100:.1f}%)")

    print()
    print("=" * 70)
    print("HIGH-CARDINALITY COLUMNS (nunique > 4) — Target/Frequency Encoding candidates")
    print("=" * 70)
    high_card = cat_summary[cat_summary['nunique'] > 4]
    for col in high_card.index:
        print(f"  {col}: {high_card.loc[col, 'nunique']} unique values")
else:
    print("  No categorical columns found (all may have been encoded already).")
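
The overview above marks higher-cardinality columns as target or frequency encoding candidates. The sketch below shows one possible form of frequency encoding, assuming the frequencies are learned on the training set only and that categories unseen in training are mapped to 0. The column name `cat_feature` and its values are made up.

```python
import pandas as pd

# Toy train/test columns (made-up values)
toy_train = pd.DataFrame({"cat_feature": ["A", "A", "B", "C", "A", "B"]})
toy_test = pd.DataFrame({"cat_feature": ["A", "C", "D"]})  # "D" never appears in train

# Learn relative frequencies from the training data only
freq_map = toy_train["cat_feature"].value_counts(normalize=True)

# Map both sets with the train frequencies; unseen categories get 0.0
toy_train["cat_feature_freq"] = toy_train["cat_feature"].map(freq_map)
toy_test["cat_feature_freq"] = toy_test["cat_feature"].map(freq_map).fillna(0.0)

print(toy_test)
```

Fitting the mapping on the training set and reusing it for the test set keeps the two encodings consistent and avoids leaking test-set information into the features.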

### 3.5 — EDA vs Full Data Comparison Summary

In [None]:
# Summary comparison against the EDA findings
print("=" * 70)
print("EDA (20% SAMPLE) vs FULL DATA — COMPARISON")
print("=" * 70)

default_rate = train['TARGET'].mean() * 100
print(f"\n  Default rate:          EDA ~8.0%  |  Full data: {default_rate:.2f}%")

# EXT_SOURCE correlations with TARGET
for col in ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']:
    if col in train.columns:
        corr = train[col].corr(train['TARGET'])
        print(f"  {col} corr:  EDA ~-0.16  |  Full data: {corr:.4f}")

# Number of columns with missing values
n_missing_cols = (train.isnull().sum() > 0).sum()
n_high_missing = (train.isnull().sum() / len(train) > 0.50).sum()
print(f"\n  Columns with any missing:  {n_missing_cols}")
print(f"  Columns with >50% missing: {n_high_missing}")
print(f"  Total columns:             {train.shape[1]}")
print(f"  Total rows:                {train.shape[0]:,}")

print("\n" + "=" * 70)
print("VALIDATION COMPLETE — Ready for preprocessing.")
print("=" * 70)

---

<a id='4'></a>
## 4. Missing Value Strategy

*TODO — Next step*

<a id='5'></a>
## 5. Outlier Handling

*TODO*

<a id='6'></a>
## 6. Categorical Encoding

*TODO*

<a id='7'></a>
## 7. Multicollinearity Check

*TODO*

<a id='8'></a>
## 8. Final Dataset & Export

*TODO*