# Credit Risk Data Exploration

## Overview
This notebook performs comprehensive exploratory data analysis (EDA) on the credit risk dataset. We will:

1. **Data Loading**: Load the training dataset from CSV
2. **Data Summary**: Generate descriptive statistics and data quality metrics
3. **Data Visualization**: Create various charts and graphs to understand data distributions and relationships
4. **Insights**: Extract key insights to inform model development

## Dataset
- **Source**: `/mnt/data/Credit-Risk-Model/data/train_data_10.csv`
- **Purpose**: Credit risk modeling and default prediction

Let's begin our exploration!

## 1. Data Loading and Initial Setup

First, let's import the necessary libraries and load our dataset.

In [15]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
plt.style.use('default')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


In [16]:
# Load the dataset
data_path = "/mnt/data/Credit-Risk-Model/data/train_data_10.csv"

try:
    df = pd.read_csv(data_path)
    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
except FileNotFoundError:
    print(f"Error: File not found at {data_path}")
    # Show available files for debugging
    import os
    if os.path.exists("/mnt/data/Credit-Risk-Model/data/"):
        print("Available files in data directory:")
        for file in os.listdir("/mnt/data/Credit-Risk-Model/data/"):
            if file.endswith('.csv'):
                print(f"  - {file}")
except Exception as e:
    print(f"Error loading data: {e}")

Dataset loaded successfully!
Shape: (100000, 62)
Columns: ['duration', 'credit_amount', 'installment_rate', 'residence', 'age', 'credits', 'dependents', 'checking_account_A11', 'checking_account_A12', 'checking_account_A13', 'checking_account_A14', 'credit_history_A30', 'credit_history_A31', 'credit_history_A32', 'credit_history_A33', 'credit_history_A34', 'purpose_A40', 'purpose_A41', 'purpose_A410', 'purpose_A42', 'purpose_A43', 'purpose_A44', 'purpose_A45', 'purpose_A46', 'purpose_A48', 'purpose_A49', 'savings_A61', 'savings_A62', 'savings_A63', 'savings_A64', 'savings_A65', 'employment_since_A71', 'employment_since_A72', 'employment_since_A73', 'employment_since_A74', 'employment_since_A75', 'status_A91', 'status_A92', 'status_A93', 'status_A94', 'debtors_guarantors_A101', 'debtors_guarantors_A102', 'debtors_guarantors_A103', 'property_A121', 'property_A122', 'property_A123', 'property_A124', 'other_installments_A141', 'other_installments_A142', 'other_installments_A143', 'housing_

## 2. Data Summary and Quality Assessment

Let's examine the structure, quality, and basic statistics of our dataset.

In [17]:
# Basic dataset information
print("=== DATASET OVERVIEW ===")
print(f"Dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print()

# Display first few rows
print("=== FIRST 5 ROWS ===")
display(df.head())

print("\n=== LAST 5 ROWS ===")
display(df.tail())

=== DATASET OVERVIEW ===
Dataset shape: (100000, 62)
Memory usage: 47.30 MB

=== FIRST 5 ROWS ===


Unnamed: 0,duration,credit_amount,installment_rate,residence,age,credits,dependents,checking_account_A11,checking_account_A12,checking_account_A13,checking_account_A14,credit_history_A30,credit_history_A31,credit_history_A32,credit_history_A33,credit_history_A34,purpose_A40,purpose_A41,purpose_A410,purpose_A42,purpose_A43,purpose_A44,purpose_A45,purpose_A46,purpose_A48,purpose_A49,savings_A61,savings_A62,savings_A63,savings_A64,savings_A65,employment_since_A71,employment_since_A72,employment_since_A73,employment_since_A74,employment_since_A75,status_A91,status_A92,status_A93,status_A94,debtors_guarantors_A101,debtors_guarantors_A102,debtors_guarantors_A103,property_A121,property_A122,property_A123,property_A124,other_installments_A141,other_installments_A142,other_installments_A143,housing_A151,housing_A152,housing_A153,job_A171,job_A172,job_A173,job_A174,telephone_A191,telephone_A192,foreign_worker_A201,foreign_worker_A202,credit
0,0.294118,0.178167,0.666667,1.0,0.071429,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1
1,0.294118,0.246836,0.333333,1.0,0.107143,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0
2,0.205882,0.179322,1.0,0.0,0.107143,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1
3,0.073529,0.091834,1.0,0.666667,0.285714,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1
4,0.294118,0.209475,0.666667,0.666667,0.428571,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0



=== LAST 5 ROWS ===


Unnamed: 0,duration,credit_amount,installment_rate,residence,age,credits,dependents,checking_account_A11,checking_account_A12,checking_account_A13,checking_account_A14,credit_history_A30,credit_history_A31,credit_history_A32,credit_history_A33,credit_history_A34,purpose_A40,purpose_A41,purpose_A410,purpose_A42,purpose_A43,purpose_A44,purpose_A45,purpose_A46,purpose_A48,purpose_A49,savings_A61,savings_A62,savings_A63,savings_A64,savings_A65,employment_since_A71,employment_since_A72,employment_since_A73,employment_since_A74,employment_since_A75,status_A91,status_A92,status_A93,status_A94,debtors_guarantors_A101,debtors_guarantors_A102,debtors_guarantors_A103,property_A121,property_A122,property_A123,property_A124,other_installments_A141,other_installments_A142,other_installments_A143,housing_A151,housing_A152,housing_A153,job_A171,job_A172,job_A173,job_A174,telephone_A191,telephone_A192,foreign_worker_A201,foreign_worker_A202,credit
99995,0.294118,0.036591,1.0,0.333333,0.017857,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0
99996,0.470588,0.108672,1.0,1.0,0.678571,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0
99997,0.294118,0.037911,1.0,0.333333,0.053571,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0
99998,0.117647,0.025806,0.333333,0.333333,0.214286,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0
99999,0.117647,0.025806,0.333333,0.333333,0.214286,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0


In [18]:
# Data types and basic info
print("=== DATA TYPES AND INFO ===")
df.info()

print("\n=== COLUMN DETAILS ===")
dtype_summary = df.dtypes.value_counts()
print("Data type distribution:")
for dtype, count in dtype_summary.items():
    print(f"  {dtype}: {count} columns")
    
print(f"\nTotal columns: {len(df.columns)}")
print(f"Numeric columns: {len(df.select_dtypes(include=[np.number]).columns)}")
print(f"Categorical/Object columns: {len(df.select_dtypes(include=['object']).columns)}")

# Define column types for use in later cells
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object']).columns

=== DATA TYPES AND INFO ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 62 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration                 100000 non-null  float64
 1   credit_amount            100000 non-null  float64
 2   installment_rate         100000 non-null  float64
 3   residence                100000 non-null  float64
 4   age                      100000 non-null  float64
 5   credits                  100000 non-null  float64
 6   dependents               100000 non-null  float64
 7   checking_account_A11     100000 non-null  float64
 8   checking_account_A12     100000 non-null  float64
 9   checking_account_A13     100000 non-null  float64
 10  checking_account_A14     100000 non-null  float64
 11  credit_history_A30       100000 non-null  float64
 12  credit_history_A31       100000 non-null  float64
 13  credit_history_A32       100000 

In [19]:
# Missing values analysis
print("=== MISSING VALUES ANALYSIS ===")
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': df.columns,
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
}).sort_values('Missing Count', ascending=False)

print("Missing values summary:")
display(missing_df[missing_df['Missing Count'] > 0])

# Store missing count for use in later cells
missing_count = missing_df['Missing Count'].sum()
if missing_count == 0:
    print("✓ No missing values found in the dataset!")
else:
    print(f"Total missing values: {missing_count}")
    print(f"Columns with missing values: {(missing_df['Missing Count'] > 0).sum()}")

=== MISSING VALUES ANALYSIS ===
Missing values summary:


Unnamed: 0,Column,Missing Count,Missing Percentage


✓ No missing values found in the dataset!


In [20]:
# Descriptive statistics for numerical columns
print("=== DESCRIPTIVE STATISTICS ===")

if len(numeric_cols) > 0:
    print("Numerical columns statistics:")
    display(df[numeric_cols].describe().round(3))
    
    print("\n=== ADDITIONAL STATISTICS ===")
    additional_stats = pd.DataFrame({
        'Skewness': df[numeric_cols].skew(),
        'Kurtosis': df[numeric_cols].kurtosis(),
        'Min': df[numeric_cols].min(),
        'Max': df[numeric_cols].max(),
        'Range': df[numeric_cols].max() - df[numeric_cols].min()
    }).round(3)
    display(additional_stats)
else:
    print("No numerical columns found for statistical analysis.")

=== DESCRIPTIVE STATISTICS ===
Numerical columns statistics:


Unnamed: 0,duration,credit_amount,installment_rate,residence,age,credits,dependents,checking_account_A11,checking_account_A12,checking_account_A13,checking_account_A14,credit_history_A30,credit_history_A31,credit_history_A32,credit_history_A33,credit_history_A34,purpose_A40,purpose_A41,purpose_A410,purpose_A42,purpose_A43,purpose_A44,purpose_A45,purpose_A46,purpose_A48,purpose_A49,savings_A61,savings_A62,savings_A63,savings_A64,savings_A65,employment_since_A71,employment_since_A72,employment_since_A73,employment_since_A74,employment_since_A75,status_A91,status_A92,status_A93,status_A94,debtors_guarantors_A101,debtors_guarantors_A102,debtors_guarantors_A103,property_A121,property_A122,property_A123,property_A124,other_installments_A141,other_installments_A142,other_installments_A143,housing_A151,housing_A152,housing_A153,job_A171,job_A172,job_A173,job_A174,telephone_A191,telephone_A192,foreign_worker_A201,foreign_worker_A202,credit
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,0.25,0.141,0.668,0.581,0.247,0.102,0.12,0.29,0.295,0.039,0.363,0.027,0.037,0.561,0.073,0.212,0.173,0.056,0.012,0.154,0.207,0.015,0.017,0.039,0.007,0.083,0.61,0.071,0.049,0.054,0.151,0.032,0.159,0.268,0.159,0.185,0.034,0.278,0.51,0.059,0.924,0.037,0.032,0.22,0.183,0.332,0.139,0.107,0.029,0.827,0.161,0.732,0.095,0.012,0.168,0.617,0.102,0.576,0.381,0.973,0.027,0.5
std,0.179,0.156,0.36,0.351,0.205,0.178,0.324,0.454,0.456,0.194,0.481,0.162,0.188,0.496,0.26,0.409,0.378,0.23,0.11,0.361,0.405,0.12,0.13,0.194,0.085,0.276,0.488,0.256,0.215,0.225,0.358,0.175,0.365,0.443,0.365,0.389,0.182,0.448,0.5,0.235,0.264,0.188,0.175,0.414,0.387,0.471,0.346,0.31,0.169,0.378,0.368,0.443,0.293,0.11,0.374,0.486,0.303,0.494,0.486,0.162,0.162,0.5
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.118,0.049,0.333,0.333,0.089,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,0.206,0.072,0.667,0.667,0.196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
75%,0.294,0.179,1.0,1.0,0.321,0.333,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
max,0.824,0.864,1.0,1.0,0.982,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0



=== ADDITIONAL STATISTICS ===


| | Skewness | Kurtosis | Min | Max | Range |
|---|---|---|---|---|---|
| duration | 1.105 | 0.686 | 0.0 | 0.824 | 0.824 |
| credit_amount | 2.313 | 5.853 | 0.0 | 0.864 | 0.864 |
| installment_rate | -0.533 | -1.164 | 0.0 | 1.000 | 1.000 |
| residence | -0.065 | -1.345 | 0.0 | 1.000 | 1.000 |
| age | 1.325 | 1.234 | 0.0 | 0.982 | 0.982 |
| ... | ... | ... | ... | ... | ... |
| telephone_A191 | -0.306 | -1.906 | 0.0 | 1.000 | 1.000 |
| telephone_A192 | 0.492 | -1.758 | 0.0 | 1.000 | 1.000 |
| foreign_worker_A201 | -5.855 | 32.287 | 0.0 | 1.000 | 1.000 |
| foreign_worker_A202 | 5.855 | 32.287 | 0.0 | 1.000 | 1.000 |
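
The extreme skewness and kurtosis reported for `foreign_worker_A201` and `foreign_worker_A202` are what a rare 0/1 indicator produces, not evidence of heavy tails. For a binary column with mean p, the population skewness is (1 - 2p) / sqrt(p(1 - p)) and the excess kurtosis is (1 - 6p(1 - p)) / (p(1 - p)). The sketch below checks this on a synthetic indicator column. The helper name `bernoulli_shape` and the synthetic data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def bernoulli_shape(p):
    """Population skewness and excess kurtosis of a 0/1 variable with mean p."""
    q = 1.0 - p
    skewness = (q - p) / np.sqrt(p * q)
    excess_kurtosis = (1.0 - 6.0 * p * q) / (p * q)
    return skewness, excess_kurtosis


# Synthetic indicator with 97.3% ones, similar to the mean reported
# for foreign_worker_A201 in the descriptive statistics.
n_rows = 100_000
n_ones = 97_300
indicator = pd.Series(np.concatenate([np.ones(n_ones), np.zeros(n_rows - n_ones)]))

theory_skew, theory_kurt = bernoulli_shape(indicator.mean())
print(f"theory: skew={theory_skew:.3f}, kurtosis={theory_kurt:.3f}")
print(f"pandas: skew={indicator.skew():.3f}, kurtosis={indicator.kurtosis():.3f}")
```

For one-hot columns, skewness and kurtosis therefore carry no information beyond the column mean, so they are mainly worth reading for the continuous features.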


In [21]:
# Categorical columns analysis
print("=== CATEGORICAL COLUMNS ANALYSIS ===")

if len(categorical_cols) > 0:
    for col in categorical_cols[:10]:  # Show first 10 categorical columns
        print(f"\n--- {col} ---")
        value_counts = df[col].value_counts()
        print(f"Unique values: {df[col].nunique()}")
        print("Most frequent values:")
        display(value_counts.head())
        
        if df[col].nunique() > 20:
            print(f"... and {df[col].nunique() - 5} more unique values")
else:
    print("No categorical columns found in the dataset.")

=== CATEGORICAL COLUMNS ANALYSIS ===
No categorical columns found in the dataset.
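
Although no object-typed columns exist, the column names suggest that the categorical variables were one-hot encoded with a `<variable>_A<code>` suffix (for example `purpose_A40`). A minimal sketch of grouping such columns back into their source variables follows. The regular expression and the helper name `one_hot_groups` are assumptions based on the column names shown above, and the toy frame stands in for `df`.

```python
import re
from collections import defaultdict

import pandas as pd

# Assumed naming pattern: "<variable>_A<digits>", e.g. "purpose_A410".
ONE_HOT_PATTERN = re.compile(r"^(?P<variable>.+)_A\d+$")


def one_hot_groups(columns):
    """Map each source variable to the list of its one-hot indicator columns."""
    groups = defaultdict(list)
    for name in columns:
        match = ONE_HOT_PATTERN.match(name)
        if match:
            groups[match.group("variable")].append(name)
    return dict(groups)


# Toy stand-in for df: two encoded variables and one continuous feature.
toy = pd.DataFrame({
    "credit_amount": [0.18, 0.25, 0.09, 0.21],
    "housing_A151": [0.0, 0.0, 1.0, 0.0],
    "housing_A152": [1.0, 1.0, 0.0, 1.0],
    "telephone_A191": [1.0, 0.0, 1.0, 0.0],
    "telephone_A192": [0.0, 1.0, 0.0, 1.0],
})

for variable, cols in one_hot_groups(toy.columns).items():
    # The mean of each indicator is the share of rows in that category.
    print(variable, toy[cols].mean().round(2).to_dict())
```

Calling `one_hot_groups(df.columns)` on the real data would give one entry per encoded variable, which can then be summarized per category instead of per column.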


## 3. Data Visualizations

Now let's create various visualizations to better understand the data distributions, relationships, and patterns.

### 3.1 Target Variable Analysis

First, let's examine the target variable (if it exists) and understand the class distribution.

In [22]:
# Identify potential target variable (common names for credit risk)
potential_targets = ['target', 'default', 'class', 'label', 'outcome', 'risk', 'y']
target_col = None

for col in df.columns:
    if col.lower() in potential_targets:
        target_col = col
        break

if target_col:
    print(f"Identified target variable: {target_col}")
    
    # Target variable distribution
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Count plot
    df[target_col].value_counts().plot(kind='bar', ax=axes[0])
    axes[0].set_title(f'Distribution of {target_col}')
    axes[0].set_xlabel(target_col)
    axes[0].set_ylabel('Count')
    
    # Pie chart
    df[target_col].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
    axes[1].set_title(f'Proportion of {target_col}')
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\nTarget variable statistics:")
    target_stats = df[target_col].value_counts()
    for value, count in target_stats.items():
        percentage = (count / len(df)) * 100
        print(f"  {value}: {count} ({percentage:.1f}%)")
        
else:
    print("No obvious target variable found. Examining all binary/categorical columns:")
    binary_cols = [col for col in df.columns if df[col].nunique() == 2]
    if binary_cols:
        print(f"Binary columns that might be targets: {binary_cols}")
    else:
        print("No binary columns found.")

No obvious target variable found. Examining all binary/categorical columns:
Binary columns that might be targets: ['dependents', 'checking_account_A11', 'checking_account_A12', 'checking_account_A13', 'checking_account_A14', 'credit_history_A30', 'credit_history_A31', 'credit_history_A32', 'credit_history_A33', 'credit_history_A34', 'purpose_A40', 'purpose_A41', 'purpose_A410', 'purpose_A42', 'purpose_A43', 'purpose_A44', 'purpose_A45', 'purpose_A46', 'purpose_A48', 'purpose_A49', 'savings_A61', 'savings_A62', 'savings_A63', 'savings_A64', 'savings_A65', 'employment_since_A71', 'employment_since_A72', 'employment_since_A73', 'employment_since_A74', 'employment_since_A75', 'status_A91', 'status_A92', 'status_A93', 'status_A94', 'debtors_guarantors_A101', 'debtors_guarantors_A102', 'debtors_guarantors_A103', 'property_A121', 'property_A122', 'property_A123', 'property_A124', 'other_installments_A141', 'other_installments_A142', 'other_installments_A143', 'housing_A151', 'housing_A152', '
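
The name-based search above comes up empty because `credit`, the last column in the dataset, is not in `potential_targets`. The correlation cell later in the notebook treats `credit` as the target, so one hedged alternative is to set the target explicitly and check its class balance. The toy frame below is a stand-in for `df`, and the helper `class_balance` is illustrative.

```python
import pandas as pd


def class_balance(frame, target_col):
    """Return the count and percentage of rows in each class of target_col."""
    counts = frame[target_col].value_counts().sort_index()
    return pd.DataFrame({
        "count": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })


# Toy stand-in for df; in the notebook, pass df itself.
toy = pd.DataFrame({"age": [0.07, 0.11, 0.11, 0.29], "credit": [1, 0, 1, 0]})

# Assumption: `credit` is the label column, as in the correlation cell.
target_col = "credit"
if target_col not in toy.columns:
    raise KeyError(f"Expected target column {target_col!r} not found")

print(class_balance(toy, target_col))
```

The descriptive statistics report a mean of 0.5 for `credit`, which suggests the two classes are roughly balanced in this sample.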

### 3.2 Numerical Variable Distributions

Let's examine the distributions of numerical variables using histograms and box plots.

In [None]:
# Summary distribution analysis for numerical variables
if len(numeric_cols) > 0:
    # Create a single summary plot with key numerical features
    print("=== NUMERICAL FEATURES SUMMARY ===")
    
    # Select key continuous features (first 6 columns with more than 10 unique values)
    continuous_cols = []
    for col in numeric_cols:
        unique_vals = df[col].nunique()
        if unique_vals > 10:  # Not binary/categorical
            continuous_cols.append(col)
        if len(continuous_cols) >= 6:  # Limit to 6 for readability
            break
    
    if continuous_cols:
        fig, axes = plt.subplots(2, 3, figsize=(18, 10))
        axes = axes.flatten()
        
        for i, col in enumerate(continuous_cols):
            # Histogram with KDE
            df[col].hist(bins=30, ax=axes[i], alpha=0.7, density=True, color='skyblue', edgecolor='black')
            df[col].plot(kind='kde', ax=axes[i], color='red', linewidth=2)
            axes[i].set_title(f'{col}\n(μ={df[col].mean():.2f}, σ={df[col].std():.2f})', fontsize=10)
            axes[i].set_xlabel(col)
            axes[i].set_ylabel('Density')
            axes[i].grid(True, alpha=0.3)
            
            # Add skewness info
            skew_val = df[col].skew()
            axes[i].text(0.7, 0.8, f'Skew: {skew_val:.2f}', 
                        transform=axes[i].transAxes, 
                        bbox=dict(boxstyle='round', facecolor='white', alpha=0.7))
        
        plt.suptitle('Distribution Analysis - Key Numerical Features', fontsize=16, y=1.02)
        plt.tight_layout()
        plt.show()
        
        # Summary statistics table
        print("\nSummary Statistics for Key Numerical Features:")
        summary_stats = df[continuous_cols].describe().round(2)
        summary_stats.loc['skewness'] = df[continuous_cols].skew().round(2)
        summary_stats.loc['kurtosis'] = df[continuous_cols].kurtosis().round(2)
        display(summary_stats)
    
    # Create a separate chart for binary/categorical numerical features
    binary_cols = [col for col in numeric_cols if df[col].nunique() == 2]
    if binary_cols:
        print(f"\n=== BINARY FEATURES ANALYSIS ({len(binary_cols)} features) ===")
        
        # Count how many 1s vs 0s for each binary feature
        binary_summary = {}
        for col in binary_cols:
            counts = df[col].value_counts()
            binary_summary[col] = counts.get(1.0, 0) / len(df)
        
        # Plot top 10 binary features by proportion of 1s
        top_binary = sorted(binary_summary.items(), key=lambda x: x[1], reverse=True)[:10]
        
        plt.figure(figsize=(12, 6))
        cols_plot = [item[0] for item in top_binary]
        props_plot = [item[1] for item in top_binary]
        
        bars = plt.bar(range(len(cols_plot)), props_plot, color='lightcoral', alpha=0.7, edgecolor='black')
        plt.title('Top 10 Binary Features - Proportion of Positive Values', fontsize=14)
        plt.xlabel('Features')
        plt.ylabel('Proportion of 1s')
        plt.xticks(range(len(cols_plot)), [col.replace('_', '\n') for col in cols_plot], rotation=45, ha='right')
        plt.grid(True, alpha=0.3, axis='y')
        
        # Add value labels on bars
        for i, bar in enumerate(bars):
            height = bar.get_height()
            plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                    f'{height:.2f}', ha='center', va='bottom')
        
        plt.tight_layout()
        plt.show()
        
        print(f"Total binary features: {len(binary_cols)}")
        print(f"Features with >50% positive values: {len([x for x in binary_summary.values() if x > 0.5])}")
        print(f"Features with <10% positive values: {len([x for x in binary_summary.values() if x < 0.1])}")

else:
    print("No numerical columns found for distribution analysis.")

In [None]:
# Outlier Analysis Summary
if len(numeric_cols) > 0:
    print("=== OUTLIER ANALYSIS SUMMARY ===")
    
    # Focus on continuous variables for outlier analysis
    continuous_cols = [col for col in numeric_cols if df[col].nunique() > 10][:8]
    
    if continuous_cols:
        # Create box plots for key continuous variables
        fig, axes = plt.subplots(2, 4, figsize=(20, 10))
        axes = axes.flatten()
        
        outlier_summary = []
        
        for i, col in enumerate(continuous_cols):
            if i < len(axes):
                # Box plot
                bp = axes[i].boxplot(df[col].values, patch_artist=True)
                bp['boxes'][0].set_facecolor('lightblue')
                bp['boxes'][0].set_alpha(0.7)
                
                axes[i].set_title(f'{col}')
                axes[i].set_ylabel('Values')
                axes[i].grid(True, alpha=0.3)
                
                # Calculate outlier statistics
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                
                outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
                outlier_percentage = (len(outliers) / len(df)) * 100
                
                outlier_summary.append({
                    'Feature': col,
                    'Outliers_Count': len(outliers),
                    'Outliers_Percentage': outlier_percentage,
                    'Q1': Q1,
                    'Q3': Q3,
                    'IQR': IQR
                })
                
                # Add outlier count to plot
                axes[i].text(0.5, 0.95, f'Outliers: {outlier_percentage:.1f}%', 
                            transform=axes[i].transAxes, ha='center',
                            bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))
        
        plt.suptitle('Outlier Analysis - Key Continuous Features', fontsize=16)
        plt.tight_layout()
        plt.show()
        
        # Outlier summary table
        outlier_df = pd.DataFrame(outlier_summary)
        outlier_df = outlier_df.sort_values('Outliers_Percentage', ascending=False)
        
        print("Outlier Analysis Summary (sorted by percentage):")
        display(outlier_df.round(2))
        
        # Overall outlier insights
        high_outlier_features = outlier_df[outlier_df['Outliers_Percentage'] > 5]
        print("\n📊 OUTLIER INSIGHTS:")
        print(f"• Features with >5% outliers: {len(high_outlier_features)}")
        if len(high_outlier_features) > 0:
            print("  High outlier features:")
            for _, row in high_outlier_features.head(3).iterrows():
                print(f"    - {row['Feature']}: {row['Outliers_Percentage']:.1f}% outliers")
        
        avg_outlier_rate = outlier_df['Outliers_Percentage'].mean()
        print(f"• Average outlier rate: {avg_outlier_rate:.1f}%")
        
        if avg_outlier_rate > 10:
            print("  ⚠️  High overall outlier rate - consider robust preprocessing")
        elif avg_outlier_rate < 5:
            print("  ✅ Reasonable outlier rate")
    
    else:
        print("No continuous variables found for outlier analysis.")
        
else:
    print("No numerical columns found for outlier analysis.")
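
The per-column loop above applies Tukey's 1.5 × IQR rule. The same rates can be computed for all columns at once, which may be handy outside of plotting code. This is a sketch on synthetic data; `iqr_outlier_rate` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def iqr_outlier_rate(frame, k=1.5):
    """Percentage of values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Compare every row against the per-column bounds.
    below = frame.lt(q1 - k * iqr, axis="columns")
    above = frame.gt(q3 + k * iqr, axis="columns")
    return (below | above).mean() * 100


# Synthetic stand-in: one symmetric and one right-skewed feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.uniform(0, 1, 10_000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=10_000),
})

print(iqr_outlier_rate(toy).round(2))
```

A 0/1 indicator with fewer than 25% ones has Q1 = Q3 = 0, so the rule would flag every 1 as an outlier. That is one reason to restrict this analysis to columns with more than 10 unique values, as the cell above does.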

### 3.3 Correlation Analysis

Let's examine correlations between numerical variables.

In [None]:
# Correlation Analysis - Summary View
if len(numeric_cols) >= 2:
    print("=== CORRELATION ANALYSIS SUMMARY ===")
    
    # Calculate correlation matrix for all numerical features
    correlation_matrix = df[numeric_cols].corr()
    
    # Create a focused correlation heatmap with most important correlations
    plt.figure(figsize=(16, 12))
    
    # Use a mask to show only the lower triangle (avoid redundancy)
    mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
    
    # Create heatmap with better color scheme
    sns.heatmap(correlation_matrix, mask=mask, annot=False, cmap='RdBu_r', center=0,
                square=True, linewidths=0.1, cbar_kws={"shrink": .8})
    plt.title('Correlation Matrix - All Numerical Features', fontsize=16, pad=20)
    plt.tight_layout()
    plt.show()
    
    # Find and highlight high correlation pairs
    print("\n=== HIGH CORRELATION ANALYSIS ===")
    high_corr_pairs = []
    moderate_corr_pairs = []
    
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            corr_val = correlation_matrix.iloc[i, j]
            col1, col2 = correlation_matrix.columns[i], correlation_matrix.columns[j]
            
            if abs(corr_val) > 0.8:
                high_corr_pairs.append((col1, col2, corr_val))
            elif abs(corr_val) > 0.5:
                moderate_corr_pairs.append((col1, col2, corr_val))
    
    # Display high correlations
    if high_corr_pairs:
        print(f"🔴 STRONG CORRELATIONS (|r| > 0.8): {len(high_corr_pairs)} pairs")
        high_corr_df = pd.DataFrame(high_corr_pairs, columns=['Feature_1', 'Feature_2', 'Correlation'])
        high_corr_df = high_corr_df.sort_values('Correlation', key=abs, ascending=False)
        display(high_corr_df.head(10))
    else:
        print("🟢 No strong correlations found (|r| > 0.8)")
    
    # Display moderate correlations
    if moderate_corr_pairs:
        print(f"\n🟡 MODERATE CORRELATIONS (0.5 < |r| < 0.8): {len(moderate_corr_pairs)} pairs")
        moderate_corr_df = pd.DataFrame(moderate_corr_pairs, columns=['Feature_1', 'Feature_2', 'Correlation'])
        moderate_corr_df = moderate_corr_df.sort_values('Correlation', key=abs, ascending=False)
        display(moderate_corr_df.head(10))
    else:
        print("\nNo moderate correlations found")
    
    # Correlation with target variable (if identified as 'credit')
    if 'credit' in df.columns:
        print("\n=== TARGET CORRELATION ANALYSIS ===")
        target_correlations = correlation_matrix['credit'].drop('credit').sort_values(key=abs, ascending=False)
        
        # Plot top correlations with target
        top_target_corr = target_correlations.head(15)
        
        plt.figure(figsize=(12, 8))
        colors = ['red' if x < 0 else 'green' for x in top_target_corr.values]
        bars = plt.barh(range(len(top_target_corr)), top_target_corr.values, color=colors, alpha=0.7)
        plt.yticks(range(len(top_target_corr)), [name.replace('_', '\n') for name in top_target_corr.index])
        plt.xlabel('Correlation with Target (Credit)')
        plt.title('Top 15 Features - Correlation with Target Variable', fontsize=14)
        plt.grid(True, alpha=0.3, axis='x')
        
        # Add value labels
        for i, bar in enumerate(bars):
            width = bar.get_width()
            plt.text(width + (0.01 if width > 0 else -0.01), bar.get_y() + bar.get_height()/2,
                    f'{width:.3f}', ha='left' if width > 0 else 'right', va='center')
        
        plt.tight_layout()
        plt.show()
        
        print("Top 10 features most correlated with target:")
        display(pd.DataFrame({
            'Feature': top_target_corr.head(10).index,
            'Correlation': top_target_corr.head(10).values
        }).round(3))
    
    # Summary statistics
    print("\n📊 CORRELATION SUMMARY:")
    all_corrs = correlation_matrix.values[np.triu(np.ones_like(correlation_matrix, dtype=bool), k=1)]
    all_corrs = all_corrs[~np.isnan(all_corrs)]  # Remove NaN values
    
    print(f"• Total feature pairs analyzed: {len(all_corrs)}")
    print(f"• Mean absolute correlation: {np.mean(np.abs(all_corrs)):.3f}")
    print(f"• Strong correlations (|r| > 0.8): {len(high_corr_pairs)}")
    print(f"• Moderate correlations (0.5 < |r| < 0.8): {len(moderate_corr_pairs)}")
    
    if len(high_corr_pairs) > 5:
        print("  ⚠️  High multicollinearity detected - consider feature selection")
    elif len(high_corr_pairs) == 0:
        print("  ✅ No multicollinearity issues detected")
    
elif len(numeric_cols) == 1:
    print(f"Only one numerical column found: {numeric_cols[0]}")
else:
    print("No numerical columns found for correlation analysis.")
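
The nested loop over column pairs can also be written with an upper-triangle mask and `stack`. The sketch below is one such variant on synthetic data; `top_correlated_pairs` and its default threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, threshold=0.5):
    """Return feature pairs with |r| above threshold, strongest first."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs.index.names = ["feature_1", "feature_2"]
    pairs = pairs.rename("correlation").reset_index()
    # NaN entries (masked cells, if kept by stack) fail this comparison and drop out.
    strong = pairs[pairs["correlation"].abs() > threshold]
    order = strong["correlation"].abs().sort_values(ascending=False).index
    return strong.loc[order].reset_index(drop=True)


# Synthetic stand-in: a complementary indicator pair plus independent noise.
rng = np.random.default_rng(0)
flag = rng.integers(0, 2, size=1_000).astype(float)
toy = pd.DataFrame({
    "worker_A201": flag,
    "worker_A202": 1.0 - flag,  # complement of the first column, so r = -1
    "noise": rng.normal(size=1_000),
})

print(top_correlated_pairs(toy, threshold=0.8))
```

Complementary one-hot columns are perfectly negatively correlated by construction, so some strong pairs in the real data may reflect the encoding, not a relationship between distinct variables.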

### 3.4 Categorical Variable Analysis

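The dataset has no object-typed columns because its categorical variables are already one-hot encoded. We therefore summarize the features by type (continuous, binary, and discrete) and visualize each group.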

In [None]:
# Feature Type Summary Analysis
print("=== FEATURE TYPE ANALYSIS ===")

# Since this dataset has mostly binary encoded categorical variables, let's analyze them as categories
print("\n📊 DATASET COMPOSITION:")
print(f"• Total features: {df.shape[1]}")

# Identify different types of features
continuous_features = [col for col in numeric_cols if df[col].nunique() > 10]
binary_features = [col for col in numeric_cols if df[col].nunique() == 2]
discrete_features = [col for col in numeric_cols if 2 < df[col].nunique() <= 10]

print(f"• Continuous features: {len(continuous_features)}")
print(f"• Binary features: {len(binary_features)}")
print(f"• Discrete features: {len(discrete_features)}")

# Create summary visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Feature type distribution
feature_types = ['Continuous', 'Binary', 'Discrete']
feature_counts = [len(continuous_features), len(binary_features), len(discrete_features)]

axes[0,0].pie(feature_counts, labels=feature_types, autopct='%1.1f%%', startangle=90)
axes[0,0].set_title('Distribution of Feature Types')

# 2. Binary features value distribution
if binary_features:
    binary_proportions = []
    for col in binary_features[:10]:  # First 10 binary features only
        prop_ones = df[col].mean()
        binary_proportions.append(prop_ones)
    
    axes[0,1].hist(binary_proportions, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0,1].set_title('Binary Features - Distribution of Positive Values')
    axes[0,1].set_xlabel('Proportion of 1s')
    axes[0,1].set_ylabel('Number of Features')
    axes[0,1].grid(True, alpha=0.3)

# 3. Continuous features range comparison
if continuous_features:
    ranges = []
    names = []
    for col in continuous_features[:8]:  # First 8 continuous features only
        col_range = df[col].max() - df[col].min()
        ranges.append(col_range)
        names.append(col.replace('_', '\n')[:15])  # Wrap at underscores, truncate long names
    
    axes[1,0].barh(range(len(names)), ranges, color='lightcoral', alpha=0.7)
    axes[1,0].set_yticks(range(len(names)))
    axes[1,0].set_yticklabels(names)
    axes[1,0].set_title('Continuous Features - Value Ranges')
    axes[1,0].set_xlabel('Range (Max - Min)')
    axes[1,0].grid(True, alpha=0.3, axis='x')

# 4. Feature variance analysis
if len(numeric_cols) > 0:
    # Calculate normalized variance for comparison
    variances = []
    feature_names = []
    
    for col in numeric_cols:
        if df[col].std() > 0:  # Skip constant columns
            mean_val = df[col].mean()
            # var / mean^2 is the squared coefficient of variation; guard against a zero mean
            normalized_var = df[col].var() / (mean_val ** 2) if mean_val != 0 else 0
            variances.append(normalized_var)
            feature_names.append(col)
    
    # Plot top 15 most variable features
    if variances:
        sorted_vars = sorted(zip(feature_names, variances), key=lambda x: x[1], reverse=True)[:15]
        names_plot = [item[0].replace('_', '\n')[:15] for item in sorted_vars]
        vars_plot = [item[1] for item in sorted_vars]
        
        axes[1,1].barh(range(len(names_plot)), vars_plot, color='lightgreen', alpha=0.7)
        axes[1,1].set_yticks(range(len(names_plot)))
        axes[1,1].set_yticklabels(names_plot)
        axes[1,1].set_title('Top 15 Features - Normalized Variance')
        axes[1,1].set_xlabel('Variance / Mean² (squared coefficient of variation)')
        axes[1,1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Print detailed insights
print("\n🔍 DETAILED INSIGHTS:")

if continuous_features:
    print(f"\n📈 CONTINUOUS FEATURES ({len(continuous_features)}):")
    for col in continuous_features[:5]:
        mean_val = df[col].mean()
        std_val = df[col].std()
        skew_val = df[col].skew()
        print(f"  • {col}: μ={mean_val:.2f}, σ={std_val:.2f}, skew={skew_val:.2f}")

if binary_features:
    print(f"\n🔗 BINARY FEATURES ({len(binary_features)}):")
    balanced_features = [col for col in binary_features if 0.3 <= df[col].mean() <= 0.7]
    imbalanced_features = [col for col in binary_features if df[col].mean() < 0.1 or df[col].mean() > 0.9]
    
    print(f"  • Balanced features (30-70%): {len(balanced_features)}")
    print(f"  • Highly imbalanced features (<10% or >90%): {len(imbalanced_features)}")
    
    if imbalanced_features[:3]:
        print("  • Most imbalanced features:")
        for col in imbalanced_features[:3]:
            print(f"    - {col}: {df[col].mean():.3f} positive rate")

if discrete_features:
    print(f"\n🎯 DISCRETE FEATURES ({len(discrete_features)}):")
    for col in discrete_features:
        print(f"  • {col}: {df[col].nunique()} unique values")
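
One caveat when reading the normalized-variance panel: for a 0/1 indicator with positive rate p, the variance is p(1 - p), so variance divided by squared mean reduces to (1 - p) / p. The ranking therefore mostly surfaces the rarest indicator columns and says little about continuous features. A small check on synthetic data (the column name `rare_flag` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic indicator column with exactly 10% ones
flag = pd.Series(np.r_[np.ones(100), np.zeros(900)], name="rare_flag")

p = flag.mean()
# ddof=0 gives the population variance, which equals p * (1 - p)
normalized_var = flag.var(ddof=0) / p ** 2
closed_form = (1 - p) / p

print(f"p = {p:.2f}")
print(f"var / mean^2 = {normalized_var:.3f}")
print(f"(1 - p) / p  = {closed_form:.3f}")
```

pandas defaults to the sample variance (`ddof=1`), which differs from the population value by a factor of n / (n - 1); with a large number of rows that difference is negligible.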

### 3.5 Relationship Analysis

Let's examine relationships between variables, especially with respect to the target variable if identified.

In [None]:
# Feature Relationship Analysis - Key Insights
if len(numeric_cols) >= 2:
    print("=== FEATURE RELATIONSHIP INSIGHTS ===")
    
    # Focus on the most interesting relationships
    # 1. Continuous vs Continuous relationships
    continuous_cols = [col for col in numeric_cols if df[col].nunique() > 10][:4]
    
    if len(continuous_cols) >= 2:
        print("\n📈 CONTINUOUS FEATURE RELATIONSHIPS")
        
        # Create a 2x2 subplot for key relationships
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()
        
        plot_count = 0
        relationships = []
        
        # Calculate correlations to find most interesting pairs
        if len(continuous_cols) >= 4:
            corr_matrix = df[continuous_cols].corr()
            
            # Find pairs with moderate to high correlation
            pairs_to_plot = []
            for i in range(len(continuous_cols)):
                for j in range(i+1, len(continuous_cols)):
                    corr_val = abs(corr_matrix.iloc[i, j])
                    if corr_val > 0.1:  # Only plot if some relationship exists
                        pairs_to_plot.append((continuous_cols[i], continuous_cols[j], corr_val))
            
            # Sort by correlation and take top 4
            pairs_to_plot.sort(key=lambda x: x[2], reverse=True)
            pairs_to_plot = pairs_to_plot[:4]
        else:
            # If fewer than 4 features, plot all possible pairs
            pairs_to_plot = []
            for i in range(len(continuous_cols)):
                for j in range(i+1, len(continuous_cols)):
                    pairs_to_plot.append((continuous_cols[i], continuous_cols[j], 0))
                    if len(pairs_to_plot) >= 4:
                        break
        
        for col1, col2, corr_val in pairs_to_plot[:4]:
            if plot_count < 4:
                # Scatter plot with trend line
                axes[plot_count].scatter(df[col1], df[col2], alpha=0.5, s=20)
                
                # Add trend line
                z = np.polyfit(df[col1], df[col2], 1)
                p = np.poly1d(z)
                axes[plot_count].plot(df[col1].sort_values(), p(df[col1].sort_values()), "r--", alpha=0.8)
                
                axes[plot_count].set_xlabel(col1)
                axes[plot_count].set_ylabel(col2)
                axes[plot_count].set_title(f'{col1} vs {col2}\n(r = {df[col1].corr(df[col2]):.3f})')
                axes[plot_count].grid(True, alpha=0.3)
                
                relationships.append({
                    'Feature_1': col1,
                    'Feature_2': col2,
                    'Correlation': df[col1].corr(df[col2]),
                    'Relationship': 'Positive' if df[col1].corr(df[col2]) > 0 else 'Negative'
                })
                
                plot_count += 1
        
        # Hide unused subplots
        for i in range(plot_count, 4):
            axes[i].set_visible(False)
        
        plt.suptitle('Key Feature Relationships - Continuous Variables', fontsize=16)
        plt.tight_layout()
        plt.show()
    
    # 2. Feature categories analysis (group similar features)
    print("\n📊 FEATURE CATEGORY ANALYSIS")
    
    # Group features by common prefixes/themes
    feature_groups = {}
    
    for col in numeric_cols:
        # Extract feature category from column name
        if '_A' in col:
            category = col.rsplit('_A', 1)[0]  # e.g. 'checking_account_A11' -> 'checking_account'
            if category not in feature_groups:
                feature_groups[category] = []
            feature_groups[category].append(col)
        else:
            # Standalone numerical features
            if 'numerical' not in feature_groups:
                feature_groups['numerical'] = []
            feature_groups['numerical'].append(col)
    
    # Display feature grouping
    print("Feature categories found:")
    category_summary = []
    for category, features in feature_groups.items():
        print(f"• {category}: {len(features)} features")
        
        # Calculate some summary stats for this category
        if len(features) > 1:
            # Pairwise correlations within the category (strict upper triangle only)
            category_corr = df[features].corr().values
            internal_corrs = category_corr[np.triu(np.ones_like(category_corr, dtype=bool), k=1)]
            internal_corrs = internal_corrs[~np.isnan(internal_corrs)]
            
            category_summary.append({
                'Category': category,
                'Feature_Count': len(features),
                'Avg_Internal_Correlation': np.mean(np.abs(internal_corrs)) if len(internal_corrs) > 0 else 0,
                'Features': ', '.join(features[:3]) + ('...' if len(features) > 3 else '')
            })
    
    # Display category summary
    if category_summary:
        category_df = pd.DataFrame(category_summary)
        category_df = category_df.sort_values('Avg_Internal_Correlation', ascending=False)
        
        print("\nFeature Category Summary:")
        display(category_df.round(3))
        
        # Visualize category correlations
        if len(category_summary) > 1:
            plt.figure(figsize=(12, 6))
            categories = category_df['Category'].tolist()
            correlations = category_df['Avg_Internal_Correlation'].tolist()
            counts = category_df['Feature_Count'].tolist()
            
            # Create bubble chart
            plt.scatter(range(len(categories)), correlations, s=[c*20 for c in counts], alpha=0.6, c=correlations, cmap='viridis')
            plt.xticks(range(len(categories)), categories, rotation=45)
            plt.ylabel('Average Internal Correlation')
            plt.xlabel('Feature Categories')
            plt.title('Feature Categories - Internal Correlation and Size')
            plt.colorbar(label='Avg Correlation')
            plt.grid(True, alpha=0.3)
            
            # Add text annotations
            for i, (cat, corr, count) in enumerate(zip(categories, correlations, counts)):
                plt.annotate(f'{count} features', (i, corr), xytext=(5, 5), 
                           textcoords='offset points', fontsize=8)
            
            plt.tight_layout()
            plt.show()
        
        # Insights
        high_internal_corr = [item for item in category_summary if item['Avg_Internal_Correlation'] > 0.5]
        if high_internal_corr:
            print(f"\n⚠️  Categories with high internal correlation ({len(high_internal_corr)}):")
            for item in high_internal_corr:
                print(f"   • {item['Category']}: {item['Avg_Internal_Correlation']:.3f} avg correlation")
            print("   Consider dimensionality reduction within these categories")

else:
    print("Not enough numerical columns for relationship analysis.")
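
Because each prefix group appears to be a one-hot encoding of a single categorical variable, a group can be collapsed back into one column for category-level plots. This is a sketch under that assumption (exactly one indicator set per row); the helper `collapse_one_hot` and the toy frame are hypothetical, and the toy columns only mimic the notebook's naming scheme.

```python
import pandas as pd


def collapse_one_hot(frame: pd.DataFrame, prefix: str) -> pd.Series:
    """Rebuild a categorical column from indicator columns named `<prefix>_<code>`."""
    cols = [c for c in frame.columns if c.startswith(prefix + "_")]
    block = frame[cols]
    # Assumption: every row has exactly one indicator set within the group
    if not (block.sum(axis=1) == 1).all():
        raise ValueError(f"'{prefix}' is not a clean one-hot group")
    # idxmax returns the name of the column holding the 1; strip the prefix to get the code
    return block.idxmax(axis=1).str[len(prefix) + 1:].rename(prefix)


toy = pd.DataFrame({
    "savings_A61": [1, 0, 0, 0],
    "savings_A62": [0, 1, 1, 0],
    "savings_A65": [0, 0, 0, 1],
    "age": [23, 41, 35, 58],
})
savings = collapse_one_hot(toy, "savings")
print(savings.value_counts())
```

The explicit row-sum check is deliberate: if a group turns out not to be a clean one-hot encoding, failing loudly is safer than silently assigning the first matching column.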

In [28]:
# Target variable analysis (if found)
if target_col and target_col in df.columns:
    print(f"=== ANALYSIS BY TARGET VARIABLE: {target_col} ===")
    
    # Numerical variables vs target
    if len(numeric_cols) > 0:
        numeric_analysis_cols = [col for col in numeric_cols if col != target_col][:4]
        
        if numeric_analysis_cols:
            fig, axes = plt.subplots(2, 2, figsize=(16, 12))
            axes = axes.flatten()
            
            for i, col in enumerate(numeric_analysis_cols):
                if i < 4:
                    # Box plots by target
                    df.boxplot(column=col, by=target_col, ax=axes[i])
                    axes[i].set_title(f'{col} by {target_col}')
                    axes[i].set_xlabel(target_col)
            
            # Hide unused subplots
            for i in range(len(numeric_analysis_cols), 4):
                axes[i].set_visible(False)
            
            plt.tight_layout()
            plt.show()
    
    # Categorical variables vs target
    if len(categorical_cols) > 0:
        cat_analysis_cols = [col for col in categorical_cols if col != target_col][:4]
        
        if cat_analysis_cols:
            fig, axes = plt.subplots(2, 2, figsize=(16, 12))
            axes = axes.flatten()
            
            # Track the next free subplot separately so skipped columns leave no gaps
            plot_idx = 0
            for col in cat_analysis_cols:
                if df[col].nunique() <= 10:  # Only for low cardinality
                    # Stacked bar chart
                    crosstab = pd.crosstab(df[col], df[target_col])
                    crosstab.plot(kind='bar', ax=axes[plot_idx], stacked=True)
                    axes[plot_idx].set_title(f'{col} vs {target_col}')
                    axes[plot_idx].set_xlabel(col)
                    axes[plot_idx].tick_params(axis='x', rotation=45)
                    plot_idx += 1
            
            # Hide unused subplots
            for i in range(plot_idx, 4):
                axes[i].set_visible(False)
            
            plt.tight_layout()
            plt.show()
            
else:
    print("No target variable identified for relationship analysis.")

No target variable identified for relationship analysis.


## 4. Key Insights and Conclusions

Based on our comprehensive exploratory data analysis, let's summarize the key findings and insights from the credit risk dataset.

In [29]:
# Generate comprehensive insights report
print("=" * 60)
print("         CREDIT RISK DATA ANALYSIS INSIGHTS")
print("=" * 60)

# Dataset Overview Insights
print("\n🔍 DATASET OVERVIEW:")
print(f"• Dataset contains {df.shape[0]:,} records with {df.shape[1]} features")
print(f"• Memory footprint: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"• Data types: {len(numeric_cols)} numerical, {len(categorical_cols)} categorical")

# Data Quality Insights
if missing_count == 0:
    print(f"• ✅ Data quality: Excellent - No missing values detected")
else:
    missing_percentage = (missing_count / (df.shape[0] * df.shape[1])) * 100
    print(f"• ⚠️  Data quality: {missing_count:,} missing values ({missing_percentage:.1f}%)")

# Feature Distribution Insights
print("\n📊 FEATURE DISTRIBUTION INSIGHTS:")

if len(numeric_cols) > 0:
    print(f"• {len(numeric_cols)} numerical features available for modeling")
    
    # Skewness analysis
    skewed_features = []
    for col in numeric_cols:
        skewness = df[col].skew()
        if abs(skewness) > 1:
            skewed_features.append((col, skewness))
    
    if skewed_features:
        print(f"• {len(skewed_features)} features show high skewness (may need transformation):")
        for col, skew in sorted(skewed_features, key=lambda x: abs(x[1]), reverse=True)[:3]:
            print(f"  - {col}: {skew:.2f}")
    
    # Outlier insights
    total_outliers = 0
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
        total_outliers += len(outliers)
    
    outlier_percentage = (total_outliers / (len(df) * len(numeric_cols))) * 100
    print(f"• Outlier analysis: {outlier_percentage:.1f}% of numerical data points are outliers")

if len(categorical_cols) > 0:
    print(f"• {len(categorical_cols)} categorical features for analysis")
    
    # Cardinality insights
    high_cardinality = [col for col in categorical_cols if df[col].nunique() / len(df) > 0.8]
    if high_cardinality:
        print(f"• ⚠️  {len(high_cardinality)} features have high cardinality (may be identifiers):")
        for col in high_cardinality[:3]:
            print(f"  - {col}: {df[col].nunique()} unique values")

print("\n🔗 RELATIONSHIP INSIGHTS:")

# Correlation insights
if len(numeric_cols) >= 2:
    correlation_matrix = df[numeric_cols].corr()
    high_corr_pairs = []
    
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            corr_val = abs(correlation_matrix.iloc[i, j])
            if corr_val > 0.7:
                high_corr_pairs.append((correlation_matrix.columns[i], 
                                       correlation_matrix.columns[j], 
                                       correlation_matrix.iloc[i, j]))
    
    if high_corr_pairs:
        print(f"• Found {len(high_corr_pairs)} highly correlated feature pairs (|r| > 0.7)")
        print("  This suggests potential multicollinearity issues")
    else:
        print("• No strong correlations found between numerical features")

# Target variable insights
if target_col and target_col in df.columns:
    print(f"\n🎯 TARGET VARIABLE INSIGHTS ({target_col}):")
    target_dist = df[target_col].value_counts(normalize=True)
    print(f"• Class distribution:")
    for class_val, proportion in target_dist.items():
        print(f"  - {class_val}: {proportion:.1%}")
    
    # Check for class imbalance
    minority_class = target_dist.min()
    if minority_class < 0.1:
        print("• ⚠️  Severe class imbalance detected - consider rebalancing techniques")
    elif minority_class < 0.3:
        print("• ⚠️  Moderate class imbalance - may need special handling")
    else:
        print("• ✅ Reasonably balanced classes")

else:
    print("• No clear target variable identified - this might be unsupervised learning data")

print("\n" + "=" * 60)

         CREDIT RISK DATA ANALYSIS INSIGHTS

🔍 DATASET OVERVIEW:
• Dataset contains 100,000 records with 62 features
• Memory footprint: 47.3 MB
• Data types: 62 numerical, 0 categorical
• ✅ Data quality: Excellent - No missing values detected

📊 FEATURE DISTRIBUTION INSIGHTS:
• 62 numerical features available for modeling
• 48 features show high skewness (may need transformation):
  - purpose_A48: 11.56
  - purpose_A410: 8.89
  - job_A171: 8.89
• Outlier analysis: 6.6% of numerical data points are outliers

🔗 RELATIONSHIP INSIGHTS:
• Found 5 highly correlated feature pairs (|r| > 0.7)
  This suggests potential multicollinearity issues
• No clear target variable identified - this might be unsupervised learning data
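
The skewness and outlier figures above are computed over all numerical columns, most of which are 0/1 indicators. For a sparse indicator both quartiles are 0, so the IQR is 0 and the 1.5 × IQR rule flags every 1 as an outlier. Such columns are also skewed by construction, and a monotone transform such as log leaves a two-valued column two-valued. It may be more informative to restrict both checks to continuous columns. A minimal sketch on synthetic data (the helper `iqr_outlier_share` and the toy columns are hypothetical):

```python
import numpy as np
import pandas as pd


def iqr_outlier_share(series: pd.Series) -> float:
    """Fraction of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return float(outside.mean())


rng = np.random.default_rng(42)
toy = pd.DataFrame({
    # Sparse indicator: 5% ones, so Q1 = Q3 = 0 and IQR = 0
    "sparse_flag": np.r_[np.ones(50), np.zeros(950)],
    # Continuous feature drawn from a normal distribution
    "amount": rng.normal(loc=3000, scale=500, size=1000),
})

# Same nunique() > 10 cut-off the notebook uses for continuous features
continuous = [c for c in toy.columns if toy[c].nunique() > 10]

flag_share = iqr_outlier_share(toy["sparse_flag"])
print(f"sparse_flag flagged as outliers: {flag_share:.1%}")
for col in continuous:
    print(f"{col}: skew={toy[col].skew():.2f}, outliers={iqr_outlier_share(toy[col]):.1%}")
```

In the toy frame, every 1 in `sparse_flag` is flagged even though nothing about those rows is anomalous, which is why an aggregate outlier percentage across indicator columns should be read with caution.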



### 4.1 Recommendations for Model Development

Based on the analysis above, here are specific recommendations for the next steps in the credit risk modeling pipeline:

In [30]:
# Model development recommendations
print("🚀 RECOMMENDATIONS FOR MODEL DEVELOPMENT:")
print("-" * 50)

print("\n1. DATA PREPROCESSING:")
if missing_count > 0:
    print("   • Handle missing values using appropriate imputation strategies")
else:
    print("   • ✅ No missing value handling required")

if len(numeric_cols) > 0:
    skewed_count = len([col for col in numeric_cols if abs(df[col].skew()) > 1])
    if skewed_count > 0:
        print(f"   • Apply transformations (log, Box-Cox) to {skewed_count} skewed features")
    
    print("   • Consider feature scaling/normalization for numerical variables")

if len(categorical_cols) > 0:
    high_card_count = len([col for col in categorical_cols if df[col].nunique() / len(df) > 0.8])
    if high_card_count > 0:
        print(f"   • Handle {high_card_count} high-cardinality categorical variables")
        print("     (consider target encoding or dimensionality reduction)")
    
    print("   • Encode categorical variables (one-hot, label, or target encoding)")

print("\n2. FEATURE ENGINEERING:")
print("   • Create interaction features between important variables")
print("   • Consider polynomial features for non-linear relationships")
print("   • Derive domain-specific features (debt-to-income ratio, etc.)")

if len(numeric_cols) >= 2:
    correlation_matrix = df[numeric_cols].corr()
    high_corr_count = 0
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            if abs(correlation_matrix.iloc[i, j]) > 0.7:
                high_corr_count += 1
    
    if high_corr_count > 0:
        print(f"   • Address multicollinearity in {high_corr_count} feature pairs")
        print("     (use VIF analysis, PCA, or feature selection)")

print("\n3. MODEL SELECTION:")
if target_col and target_col in df.columns and df[target_col].nunique() == 2:
    print("   • Binary classification problem detected")
    print("   • Consider: Logistic Regression, Random Forest, XGBoost, LightGBM")
    
    # Class balance check
    target_dist = df[target_col].value_counts(normalize=True)
    minority_class = target_dist.min()
    if minority_class < 0.3:
        print("   • Use stratified sampling and appropriate evaluation metrics")
        print("   • Consider rebalancing techniques (SMOTE, undersampling)")
elif target_col and target_col in df.columns and df[target_col].nunique() > 2:
    print("   • Multi-class classification or regression problem")
else:
    print("   • Problem type unclear - investigate target variable")

print("\n4. MODEL EVALUATION:")
print("   • Use cross-validation for robust performance estimation")
if target_col and target_col in df.columns and df[target_col].nunique() == 2:
    minority_class = df[target_col].value_counts(normalize=True).min()
    if minority_class < 0.3:
        print("   • Focus on Precision, Recall, F1-score, and AUC-ROC")
        print("   • Consider cost-sensitive learning approaches")
    else:
        print("   • Standard classification metrics (Accuracy, Precision, Recall)")

print("   • Implement feature importance analysis")
print("   • Perform residual analysis and model diagnostics")

print("\n5. BUSINESS CONSIDERATIONS:")
print("   • Ensure model interpretability for regulatory compliance")
print("   • Implement proper model governance and monitoring")
print("   • Consider fairness and bias assessment")
print("   • Plan for model deployment and real-time scoring")

print("\n" + "=" * 50)
print("End of Data Exploration Analysis")
print("=" * 50)

🚀 RECOMMENDATIONS FOR MODEL DEVELOPMENT:
--------------------------------------------------

1. DATA PREPROCESSING:
   • ✅ No missing value handling required
   • Apply transformations (log, Box-Cox) to 48 skewed features
   • Consider feature scaling/normalization for numerical variables

2. FEATURE ENGINEERING:
   • Create interaction features between important variables
   • Consider polynomial features for non-linear relationships
   • Derive domain-specific features (debt-to-income ratio, etc.)
   • Address multicollinearity in 5 feature pairs
     (use VIF analysis, PCA, or feature selection)

3. MODEL SELECTION:
   • Problem type unclear - investigate target variable

4. MODEL EVALUATION:
   • Use cross-validation for robust performance estimation
   • Implement feature importance analysis
   • Perform residual analysis and model diagnostics

5. BUSINESS CONSIDERATIONS:
   • Ensure model interpretability for regulatory compliance
   • Implement proper model governance and monito