# Exploratory Data Analysis: ROSE Women's Foundation Microfinance Loan Dataset

This notebook provides a comprehensive exploratory data analysis (EDA) of the ROSE Women's Foundation microfinance loan dataset. The analysis covers data quality, demographic patterns, loan characteristics, financial metrics, and factors associated with loan defaults.

## Table of Contents
1. Data Loading and Overview
2. Missing Value Analysis
3. Target Variable Analysis
4. Demographic Analysis
5. Loan Analysis
6. Financial Analysis
7. Business Analysis
8. Correlation Analysis
9. Bivariate Analysis
10. Key Insights and Findings

## Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

# Configuration
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Set visualization style - colorblind-friendly palette
sns.set_style('whitegrid')
sns.set_palette('colorblind')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

print("Libraries imported successfully!")

## 1. Data Loading and Overview

In this section, we load the microfinance loan dataset and examine its basic structure.

In [None]:
# Load the dataset
df = pd.read_csv('../data/rose.csv', encoding='latin-1')

# Display basic information
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"\nDataset Shape: {df.shape[0]} rows, {df.shape[1]} columns")

In [None]:
# Display column names and data types
print("\n" + "=" * 60)
print("COLUMN NAMES AND DATA TYPES")
print("=" * 60)
print(df.dtypes)

In [None]:
# Display first few rows
print("\n" + "=" * 60)
print("FIRST 5 ROWS OF THE DATASET")
print("=" * 60)
df.head()

In [None]:
# Display basic statistics for numerical columns
print("\n" + "=" * 60)
print("STATISTICAL SUMMARY - NUMERICAL COLUMNS")
print("=" * 60)
df.describe()

In [None]:
# Display summary for categorical columns
print("\n" + "=" * 60)
print("STATISTICAL SUMMARY - CATEGORICAL COLUMNS")
print("=" * 60)
df.describe(include=['object'])

In [None]:
# Memory usage
print("\n" + "=" * 60)
print("MEMORY USAGE")
print("=" * 60)
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
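
The later sections index columns by name (for example `Defaulted`, `Age Group`, and `Loan Given`), so a quick presence check right after loading can surface renamed or mis-encoded headers early. The following is a minimal sketch; `missing_columns` and the toy frame are hypothetical names used for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the names in `required` that are absent from `frame`, in order."""
    return [name for name in required if name not in frame.columns]

# Toy frame standing in for the real dataset (illustrative only).
toy = pd.DataFrame({"Defaulted": [0, 1], "Age": [31, 45]})
absent = missing_columns(toy, ["Defaulted", "Age", "Loan Given"])
print(absent)
```

On the real data, the same call would take `df` and the list of column names used in the sections below.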

## 2. Missing Value Analysis

Understanding missing data is crucial for data quality assessment and preprocessing decisions.

In [None]:
# Calculate missing values
missing_count = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing_count.index,
    'Missing Count': missing_count.values,
    'Missing Percentage': missing_percent.values
}).sort_values('Missing Percentage', ascending=False)

# Display missing values summary
print("=" * 60)
print("MISSING VALUES SUMMARY")
print("=" * 60)
print(f"\nTotal columns with missing values: {(missing_count > 0).sum()}")
print(f"Total missing values in dataset: {missing_count.sum()}")

# Show columns with missing values
print("\nColumns with Missing Values:")
missing_df[missing_df['Missing Count'] > 0]

In [None]:
# Identify columns with high missing rates (>50%)
high_missing = missing_df[missing_df['Missing Percentage'] > 50]
print("\n" + "=" * 60)
print("COLUMNS WITH HIGH MISSING RATES (>50%)")
print("=" * 60)
if len(high_missing) > 0:
    print(high_missing)
else:
    print("No columns have more than 50% missing values.")
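
As a cross-check on the summary above, the per-column missing percentage can also be written as the mean of the boolean missingness mask. The sketch below uses a small made-up frame (`toy` is an assumption, not the loan data) so the expected percentages are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Toy frame: column "a" has 2 of 4 values missing, "b" has 1 of 4, "c" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# isna() gives a True/False mask; its column mean is the missing fraction.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

This should agree with the `Missing Percentage` column computed from `isnull().sum() / len(df)`.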

In [None]:
# Visualize missing values
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Bar chart of missing values (top 20 columns)
top_missing = missing_df[missing_df['Missing Count'] > 0].head(20)
if len(top_missing) > 0:
    ax1 = axes[0]
    bars = ax1.barh(top_missing['Column'], top_missing['Missing Percentage'], color='coral')
    ax1.set_xlabel('Missing Percentage (%)')
    ax1.set_title('Top 20 Columns with Missing Values')
    ax1.axvline(x=50, color='red', linestyle='--', label='50% threshold')
    ax1.legend()
else:
    axes[0].text(0.5, 0.5, 'No missing values in dataset', ha='center', va='center', fontsize=14)
    axes[0].set_title('Missing Values Bar Chart')

# Heatmap of missing values
ax2 = axes[1]
cols_with_missing = missing_df[missing_df['Missing Count'] > 0]['Column'].tolist()[:15]
if len(cols_with_missing) > 0:
    sample_df = df[cols_with_missing].isnull().astype(int)
    sns.heatmap(sample_df.head(50), cmap='YlOrRd', cbar_kws={'label': 'Missing (1) / Present (0)'}, ax=ax2)
    ax2.set_title('Missing Values Heatmap (First 50 Rows)')
else:
    ax2.text(0.5, 0.5, 'No missing values to display', ha='center', va='center', fontsize=14)

plt.tight_layout()
plt.show()

## 3. Target Variable Analysis

The **Defaulted** column is our main target variable for predictive modeling.

In [None]:
# Analyze the Defaulted column
print("=" * 60)
print("TARGET VARIABLE ANALYSIS: DEFAULTED")
print("=" * 60)

# Check unique values
print("\nUnique values in 'Defaulted' column:")
print(df['Defaulted'].value_counts())

# Calculate default rate
default_counts = df['Defaulted'].value_counts()
default_rate = (default_counts.get(1, 0) / len(df)) * 100
non_default_rate = (default_counts.get(0, 0) / len(df)) * 100

print(f"\nDefault Rate: {default_rate:.2f}%")
print(f"Non-Default Rate: {non_default_rate:.2f}%")

In [None]:
# Visualize target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

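# value_counts() sorts by frequency, so fix the order to 0 then 1;
# otherwise the tick labels and colors below can be attached to the wrong bars
default_counts = df['Defaulted'].value_counts().reindex([0, 1], fill_value=0)
colors = ['#2ecc71', '#e74c3c']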

# Bar chart
ax1 = axes[0]
default_counts.plot(kind='bar', ax=ax1, color=colors, edgecolor='black')
ax1.set_xlabel('Defaulted Status')
ax1.set_ylabel('Count')
ax1.set_title('Loan Default Distribution (Bar Chart)')
ax1.set_xticklabels(['Not Defaulted (0)', 'Defaulted (1)'], rotation=0)
for i, v in enumerate(default_counts.values):
    ax1.text(i, v + 5, str(v), ha='center', fontsize=12, fontweight='bold')

# Pie chart
ax2 = axes[1]
labels = ['Not Defaulted', 'Defaulted']
sizes = [default_counts.get(0, 0), default_counts.get(1, 0)]
ax2.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
ax2.set_title('Loan Default Distribution (Pie Chart)')

plt.tight_layout()
plt.show()

# Check for class imbalance
default_rate = df['Defaulted'].mean() * 100
print("\nClass Imbalance Assessment:")
if default_rate < 30:
    print("Warning: The dataset shows CLASS IMBALANCE - Defaulted class is underrepresented.")
elif default_rate > 70:
    print("Warning: The dataset shows CLASS IMBALANCE - Non-defaulted class is underrepresented.")
else:
    print("The classes are relatively balanced.")
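
The cells above compute the default rate in two ways: as a count divided by `len(df)`, and as `df['Defaulted'].mean()`. These agree only when no label is missing, because `mean()` skips missing values while `len(df)` counts every row. The sketch below illustrates the difference on made-up labels; `default_rate_pct` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def default_rate_pct(target: pd.Series) -> float:
    """Percentage of labelled rows equal to 1; rows with a missing label are ignored."""
    labelled = target.dropna()
    if labelled.empty:
        return float("nan")
    return float((labelled == 1).mean() * 100)

# Toy labels with one missing value (illustrative only).
toy_target = pd.Series([0, 0, 1, np.nan, 0, 1])
rate = default_rate_pct(toy_target)                       # 2 defaults / 5 labelled rows
naive = (toy_target == 1).sum() / len(toy_target) * 100   # 2 defaults / 6 rows
print(f"labelled-only: {rate:.1f}%  all-rows: {naive:.1f}%")
```

If the `Defaulted` column has no missing values, both numbers coincide and either form can be used.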

## 4. Demographic Analysis

This section explores the demographic characteristics of loan applicants.

In [None]:
# Age Distribution
print("=" * 60)
print("AGE DISTRIBUTION ANALYSIS")
print("=" * 60)

print("\nAge Statistics:")
print(df['Age'].describe())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1 = axes[0]
df['Age'].dropna().hist(bins=20, ax=ax1, color='steelblue', edgecolor='black', alpha=0.7)
ax1.set_xlabel('Age')
ax1.set_ylabel('Frequency')
ax1.set_title('Age Distribution (Histogram)')
ax1.axvline(df['Age'].mean(), color='red', linestyle='--', label=f"Mean: {df['Age'].mean():.1f}")
ax1.axvline(df['Age'].median(), color='green', linestyle='--', label=f"Median: {df['Age'].median():.1f}")
ax1.legend()

# Box plot
ax2 = axes[1]
df.boxplot(column='Age', ax=ax2)
ax2.set_ylabel('Age')
ax2.set_title('Age Distribution (Box Plot)')

plt.tight_layout()
plt.show()

In [None]:
# Age Group Breakdown
print("\n" + "=" * 60)
print("AGE GROUP DISTRIBUTION")
print("=" * 60)

age_group_counts = df['Age Group'].value_counts()
print(age_group_counts)

fig, ax = plt.subplots(figsize=(12, 6))
age_group_counts.plot(kind='bar', ax=ax, color='teal', edgecolor='black')
ax.set_xlabel('Age Group')
ax.set_ylabel('Count')
ax.set_title('Distribution by Age Group')
plt.xticks(rotation=45, ha='right')
for i, v in enumerate(age_group_counts.values):
    ax.text(i, v + 2, str(v), ha='center', fontsize=10)
plt.tight_layout()
plt.show()

In [None]:
# Gender Distribution
print("\n" + "=" * 60)
print("GENDER DISTRIBUTION")
print("=" * 60)

gender_counts = df["Respondent's sex"].value_counts()
print(gender_counts)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax1 = axes[0]
gender_counts.plot(kind='bar', ax=ax1, color=['pink', 'lightblue'], edgecolor='black')
ax1.set_xlabel('Gender')
ax1.set_ylabel('Count')
ax1.set_title('Gender Distribution')
plt.sca(ax1)
plt.xticks(rotation=0)

ax2 = axes[1]
ax2.pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%',
        colors=['pink', 'lightblue'], startangle=90)
ax2.set_title('Gender Distribution (Pie Chart)')

plt.tight_layout()
plt.show()

In [None]:
# Marital Status Distribution
print("\n" + "=" * 60)
print("MARITAL STATUS DISTRIBUTION")
print("=" * 60)

marital_counts = df['Marital status'].value_counts()
print(marital_counts)

fig, ax = plt.subplots(figsize=(10, 6))
colors = sns.color_palette('Set2', len(marital_counts))
marital_counts.plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Marital Status')
ax.set_ylabel('Count')
ax.set_title('Distribution by Marital Status')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Education Level Distribution
print("\n" + "=" * 60)
print("EDUCATION LEVEL DISTRIBUTION")
print("=" * 60)

education_counts = df['Education'].value_counts()
print(education_counts)

fig, ax = plt.subplots(figsize=(12, 6))
colors = sns.color_palette('viridis', len(education_counts))
education_counts.plot(kind='barh', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Count')
ax.set_ylabel('Education Level')
ax.set_title('Distribution by Education Level')
plt.tight_layout()
plt.show()

In [None]:
# Geographic Distribution
print("\n" + "=" * 60)
print("GEOGRAPHIC DISTRIBUTION")
print("=" * 60)

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# County distribution
county_counts = df['County'].value_counts().head(10)
print("\nTop 10 Counties:")
print(county_counts)
ax1 = axes[0]
county_counts.plot(kind='bar', ax=ax1, color='darkblue', edgecolor='black')
ax1.set_xlabel('County')
ax1.set_ylabel('Count')
ax1.set_title('Top 10 Counties')
plt.sca(ax1)
plt.xticks(rotation=45, ha='right')

# Constituency distribution
constituency_counts = df['Constituency'].value_counts().head(10)
print("\nTop 10 Constituencies:")
print(constituency_counts)
ax2 = axes[1]
constituency_counts.plot(kind='bar', ax=ax2, color='darkgreen', edgecolor='black')
ax2.set_xlabel('Constituency')
ax2.set_ylabel('Count')
ax2.set_title('Top 10 Constituencies')
plt.sca(ax2)
plt.xticks(rotation=45, ha='right')

# Ward distribution
ward_counts = df['Ward'].value_counts().head(10)
print("\nTop 10 Wards:")
print(ward_counts)
ax3 = axes[2]
ward_counts.plot(kind='bar', ax=ax3, color='darkred', edgecolor='black')
ax3.set_xlabel('Ward')
ax3.set_ylabel('Count')
ax3.set_title('Top 10 Wards')
plt.sca(ax3)
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

## 5. Loan Analysis

This section analyzes the characteristics of loans including amounts, purposes, sources, and repayment patterns.

In [None]:
# Loan Given Amount Distribution
print("=" * 60)
print("LOAN AMOUNT ANALYSIS")
print("=" * 60)

print("\nLoan Given Statistics:")
print(df['Loan Given'].describe())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
df['Loan Given'].dropna().hist(bins=30, ax=ax1, color='gold', edgecolor='black', alpha=0.7)
ax1.set_xlabel('Loan Amount')
ax1.set_ylabel('Frequency')
ax1.set_title('Loan Amount Distribution (Histogram)')
ax1.axvline(df['Loan Given'].mean(), color='red', linestyle='--', label=f"Mean: {df['Loan Given'].mean():,.0f}")
ax1.axvline(df['Loan Given'].median(), color='blue', linestyle='--', label=f"Median: {df['Loan Given'].median():,.0f}")
ax1.legend()

ax2 = axes[1]
df.boxplot(column='Loan Given', ax=ax2)
ax2.set_ylabel('Loan Amount')
ax2.set_title('Loan Amount Distribution (Box Plot)')

plt.tight_layout()
plt.show()

In [None]:
# Loan Purpose Analysis
print("\n" + "=" * 60)
print("LOAN PURPOSE ANALYSIS")
print("=" * 60)

loan_purposes = df['Loan Purpose'].dropna().str.split(';').explode().str.strip()
purpose_counts = loan_purposes.value_counts()
print(purpose_counts)

fig, ax = plt.subplots(figsize=(10, 6))
purpose_counts.plot(kind='bar', ax=ax, color='mediumpurple', edgecolor='black')
ax.set_xlabel('Loan Purpose')
ax.set_ylabel('Count')
ax.set_title('Loan Purpose Distribution')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
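
The split-and-explode step above treats `Loan Purpose` as a semicolon-separated multi-value field. A small worked example, using made-up purpose strings rather than values from the dataset, shows what the chain does to each row.

```python
import pandas as pd

# Toy values: one loan lists two purposes, one is missing, one has stray spaces.
toy_purposes = pd.Series(["School fees; Business", "Business", None, " Farming ;Business"])

# Drop missing rows, split on ';', give each purpose its own row, trim whitespace.
exploded = toy_purposes.dropna().str.split(";").explode().str.strip()
counts = exploded.value_counts()
print(counts)
```

Because a single loan can contribute several purposes, the counts can sum to more than the number of loans (here 5 purposes from 3 non-missing loans).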

In [None]:
# Loan Source Analysis
print("\n" + "=" * 60)
print("LOAN SOURCE ANALYSIS")
print("=" * 60)

loan_source_counts = df['Loan source'].value_counts()
print(loan_source_counts)

fig, ax = plt.subplots(figsize=(10, 6))
colors = sns.color_palette('Set3', len(loan_source_counts))
loan_source_counts.plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Loan Source')
ax.set_ylabel('Count')
ax.set_title('Loan Source Distribution')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Loan Status Breakdown
print("\n" + "=" * 60)
print("LOAN STATUS ANALYSIS")
print("=" * 60)

print("\nDetailed Loan Status:")
loan_status_counts = df['Loan Status'].value_counts()
print(loan_status_counts)

print("\nSummarised Loan Status:")
summarised_status = df['Summarised Loan Status'].value_counts()
print(summarised_status)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

ax1 = axes[0]
loan_status_counts.plot(kind='barh', ax=ax1, color='steelblue', edgecolor='black')
ax1.set_xlabel('Count')
ax1.set_ylabel('Loan Status')
ax1.set_title('Detailed Loan Status Distribution')

ax2 = axes[1]
summarised_status.plot(kind='bar', ax=ax2, color='coral', edgecolor='black')
ax2.set_xlabel('Summarised Status')
ax2.set_ylabel('Count')
ax2.set_title('Summarised Loan Status Distribution')
plt.sca(ax2)
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

In [None]:
# Loan Paid Status
print("\n" + "=" * 60)
print("LOAN PAID STATUS ANALYSIS")
print("=" * 60)

paid_status_counts = df['Loan Paid Status'].value_counts()
print(paid_status_counts)

fig, ax = plt.subplots(figsize=(8, 6))
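# str() guards against non-string index values, matching the other color rules in this notebook
colors = ['#27ae60' if 'Paid' in str(status) else '#e74c3c' for status in paid_status_counts.index]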
paid_status_counts.plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Payment Status')
ax.set_ylabel('Count')
ax.set_title('Loan Payment Status Distribution')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

## 6. Financial Analysis

This section explores the financial characteristics of loan applicants.

In [None]:
# Monthly Income Distribution
print("=" * 60)
print("MONTHLY INCOME ANALYSIS")
print("=" * 60)

print("\nMonthly Income Statistics:")
print(df['Monthly income'].describe())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
df['Monthly income'].dropna().hist(bins=30, ax=ax1, color='green', edgecolor='black', alpha=0.7)
ax1.set_xlabel('Monthly Income')
ax1.set_ylabel('Frequency')
ax1.set_title('Monthly Income Distribution')
ax1.axvline(df['Monthly income'].mean(), color='red', linestyle='--', label=f"Mean: {df['Monthly income'].mean():,.0f}")
ax1.legend()

ax2 = axes[1]
df.boxplot(column='Monthly income', ax=ax2)
ax2.set_ylabel('Monthly Income')
ax2.set_title('Monthly Income (Box Plot)')

plt.tight_layout()
plt.show()

In [None]:
# Income Brackets Analysis
print("\n" + "=" * 60)
print("INCOME BRACKETS ANALYSIS")
print("=" * 60)

income_bracket_counts = df['Income Brackets'].value_counts()
print(income_bracket_counts)

fig, ax = plt.subplots(figsize=(10, 6))
income_bracket_counts.plot(kind='bar', ax=ax, color='forestgreen', edgecolor='black')
ax.set_xlabel('Income Bracket')
ax.set_ylabel('Count')
ax.set_title('Distribution by Income Bracket')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Total Savings Distribution
print("\n" + "=" * 60)
print("TOTAL SAVINGS ANALYSIS")
print("=" * 60)

print("\nTotal Savings Statistics:")
print(df['Total Savings'].describe())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
savings_nonzero = df[df['Total Savings'] > 0]['Total Savings']
savings_nonzero.hist(bins=30, ax=ax1, color='teal', edgecolor='black', alpha=0.7)
ax1.set_xlabel('Total Savings')
ax1.set_ylabel('Frequency')
ax1.set_title('Total Savings Distribution (Excluding Zero)')

ax2 = axes[1]
df.boxplot(column='Total Savings', ax=ax2)
ax2.set_ylabel('Total Savings')
ax2.set_title('Total Savings (Box Plot)')

plt.tight_layout()
plt.show()

no_savings_pct = (df['Total Savings'] == 0).sum() / len(df) * 100
print(f"\nPercentage of clients with no savings: {no_savings_pct:.1f}%")

In [None]:
# Savings Frequency Patterns
print("\n" + "=" * 60)
print("SAVINGS FREQUENCY ANALYSIS")
print("=" * 60)

savings_freq_counts = df['Savings frequency'].value_counts()
print(savings_freq_counts)

fig, ax = plt.subplots(figsize=(10, 6))
savings_freq_counts.plot(kind='bar', ax=ax, color='darkcyan', edgecolor='black')
ax.set_xlabel('Savings Frequency')
ax.set_ylabel('Count')
ax.set_title('Distribution by Savings Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Business Expenses Analysis
print("\n" + "=" * 60)
print("BUSINESS EXPENSES ANALYSIS")
print("=" * 60)

print("\nBusiness Expenses Statistics:")
print(df['Business expenses'].describe())

print("\nExpense Relative to Income:")
expense_ratio = df['Expense Relative to Income'].value_counts()
print(expense_ratio)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
df['Business expenses'].dropna().hist(bins=30, ax=ax1, color='salmon', edgecolor='black', alpha=0.7)
ax1.set_xlabel('Business Expenses')
ax1.set_ylabel('Frequency')
ax1.set_title('Business Expenses Distribution')

ax2 = axes[1]
expense_ratio.plot(kind='bar', ax=ax2, color='indianred', edgecolor='black')
ax2.set_xlabel('Expense Relative to Income')
ax2.set_ylabel('Count')
ax2.set_title('Expense Relative to Income Distribution')
plt.sca(ax2)
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

In [None]:
# Profit/Loss Analysis
print("\n" + "=" * 60)
print("PROFIT/LOSS ANALYSIS")
print("=" * 60)

print("\nProfit-Loss Statistics:")
print(df['Profit Loss'].value_counts())

print("\nProfit-Expenses Statistics:")
print(df['Profit- Expenses'].describe())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
profit_loss_counts = df['Profit Loss'].value_counts()
colors = ['#27ae60' if 'Profit' in str(x) else '#e74c3c' for x in profit_loss_counts.index]
profit_loss_counts.plot(kind='bar', ax=ax1, color=colors, edgecolor='black')
ax1.set_xlabel('Profit/Loss Status')
ax1.set_ylabel('Count')
ax1.set_title('Profit/Loss Distribution')
plt.sca(ax1)
plt.xticks(rotation=45, ha='right')

ax2 = axes[1]
df['Profit- Expenses'].dropna().hist(bins=30, ax=ax2, color='gold', edgecolor='black', alpha=0.7)
ax2.set_xlabel('Profit - Expenses')
ax2.set_ylabel('Frequency')
ax2.set_title('Profit minus Expenses Distribution')
ax2.axvline(0, color='red', linestyle='--', label='Break-even')
ax2.legend()

plt.tight_layout()
plt.show()

In [None]:
# Affordability Analysis
print("\n" + "=" * 60)
print("AFFORDABILITY ANALYSIS")
print("=" * 60)

print("\nAffordability Status:")
affordability_counts = df['Affordability'].value_counts()
print(affordability_counts)

print("\nAffordability (HH) Status:")
affordability_hh_counts = df['Affordability (HH)'].value_counts()
print(affordability_hh_counts)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
colors = ['#27ae60' if 'Affordable' in str(x) else '#e74c3c' for x in affordability_counts.index]
affordability_counts.plot(kind='bar', ax=ax1, color=colors, edgecolor='black')
ax1.set_xlabel('Affordability Status')
ax1.set_ylabel('Count')
ax1.set_title('Affordability Distribution')
plt.sca(ax1)
plt.xticks(rotation=45, ha='right')

ax2 = axes[1]
colors = ['#27ae60' if 'Affordable' in str(x) else '#e74c3c' for x in affordability_hh_counts.index]
affordability_hh_counts.plot(kind='bar', ax=ax2, color=colors, edgecolor='black')
ax2.set_xlabel('Affordability (HH) Status')
ax2.set_ylabel('Count')
ax2.set_title('Affordability (HH) Distribution')
plt.sca(ax2)
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

## 7. Business Analysis

This section analyzes the business characteristics of loan applicants.

In [None]:
# Business Type Distribution
print("=" * 60)
print("BUSINESS TYPE ANALYSIS")
print("=" * 60)

business_type_counts = df['Business type'].value_counts()
print(business_type_counts)

fig, ax = plt.subplots(figsize=(10, 6))
business_type_counts.plot(kind='bar', ax=ax, color='purple', edgecolor='black')
ax.set_xlabel('Business Type')
ax.set_ylabel('Count')
ax.set_title('Distribution by Business Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Top Business Categories (Specific Business)
print("\n" + "=" * 60)
print("TOP SPECIFIC BUSINESS CATEGORIES")
print("=" * 60)

specific_biz_counts = df['Specific Biz'].value_counts().head(15)
print(specific_biz_counts)

fig, ax = plt.subplots(figsize=(12, 8))
specific_biz_counts.plot(kind='barh', ax=ax, color='mediumorchid', edgecolor='black')
ax.set_xlabel('Count')
ax.set_ylabel('Specific Business')
ax.set_title('Top 15 Specific Business Categories')
plt.tight_layout()
plt.show()

In [None]:
# Business Type vs Default Rate
print("\n" + "=" * 60)
print("BUSINESS TYPE VS DEFAULT RATE")
print("=" * 60)

business_default = df.groupby('Business type')['Defaulted'].agg(['sum', 'count'])
business_default['default_rate'] = (business_default['sum'] / business_default['count']) * 100
business_default = business_default.sort_values('default_rate', ascending=False)

print(business_default)

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if rate > df['Defaulted'].mean() * 100 else '#3498db' 
          for rate in business_default['default_rate']]
business_default['default_rate'].plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Business Type')
ax.set_ylabel('Default Rate (%)')
ax.set_title('Default Rate by Business Type')
ax.axhline(y=df['Defaulted'].mean() * 100, color='red', linestyle='--', 
           label=f'Overall Default Rate: {df["Defaulted"].mean() * 100:.1f}%')
ax.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
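
The group-then-rate pattern used here (sum and count of `Defaulted` per group, then a percentage) recurs for several other columns in this notebook. One way to factor it out is sketched below; `default_rate_by` and the toy frame are hypothetical and shown only to illustrate the calculation.

```python
import pandas as pd

def default_rate_by(frame: pd.DataFrame, group_col: str, target_col: str = "Defaulted") -> pd.DataFrame:
    """Per-group default counts, group sizes, and default rate in percent."""
    grouped = frame.groupby(group_col)[target_col].agg(defaults="sum", loans="count")
    grouped["default_rate"] = grouped["defaults"] / grouped["loans"] * 100
    return grouped.sort_values("default_rate", ascending=False)

# Toy data (illustrative only): three business types with 2, 3, and 1 loans.
toy = pd.DataFrame({
    "Business type": ["Retail", "Retail", "Farming", "Farming", "Farming", "Services"],
    "Defaulted": [1, 0, 1, 1, 0, 0],
})
summary = default_rate_by(toy, "Business type")
print(summary)
```

Keeping the `loans` column next to the rate helps when reading the charts: a rate based on only a handful of loans is much noisier than one based on hundreds.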

## 8. Correlation Analysis

This section examines the relationships between numerical variables.

In [None]:
# Select numerical columns for correlation analysis
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

exclude_cols = ['ID Number', 'Mobile', 'Phone']
numerical_cols = [col for col in numerical_cols if col not in exclude_cols]

print("=" * 60)
print("NUMERICAL COLUMNS FOR CORRELATION ANALYSIS")
print("=" * 60)
print(f"\nNumber of numerical columns: {len(numerical_cols)}")
print(numerical_cols)

In [None]:
# Calculate correlation matrix
correlation_matrix = df[numerical_cols].corr()

plt.figure(figsize=(16, 14))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=False, cmap='RdBu_r', center=0,
            square=True, linewidths=0.5, cbar_kws={'shrink': 0.8})
plt.title('Correlation Matrix Heatmap', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Top correlations with Defaulted variable
print("\n" + "=" * 60)
print("TOP CORRELATIONS WITH DEFAULTED VARIABLE")
print("=" * 60)

if 'Defaulted' in correlation_matrix.columns:
    defaulted_correlations = correlation_matrix['Defaulted'].drop('Defaulted').sort_values(key=abs, ascending=False)
    print("\nCorrelations with Defaulted (sorted by absolute value):")
    print(defaulted_correlations.head(15))
    
    fig, ax = plt.subplots(figsize=(10, 8))
    top_corr = defaulted_correlations.head(15)
    colors = ['#e74c3c' if x < 0 else '#27ae60' for x in top_corr.values]
    top_corr.plot(kind='barh', ax=ax, color=colors, edgecolor='black')
    ax.set_xlabel('Correlation Coefficient')
    ax.set_ylabel('Feature')
    ax.set_title('Top 15 Features Correlated with Defaulted')
    ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
    plt.tight_layout()
    plt.show()

In [None]:
# Top feature correlations (all pairs)
print("\n" + "=" * 60)
print("TOP FEATURE CORRELATIONS (ALL PAIRS)")
print("=" * 60)

upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
correlations = upper_tri.stack().sort_values(key=abs, ascending=False)
print("\nTop 20 Feature Correlations:")
print(correlations.head(20))
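
`DataFrame.corr()` defaults to the Pearson coefficient, which measures linear association. Amounts such as income, savings, and loan size are often skewed, so a rank-based coefficient can be a useful complement. The toy example below (made-up numbers, not the loan data) shows how the two differ on a monotonic but non-linear relationship.

```python
import pandas as pd

# y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")
```

On the real data, passing `method='spearman'` to `df[numerical_cols].corr()` would give the rank-based matrix; whether it changes the ranking of features is something to check rather than assume.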

## 9. Bivariate Analysis

This section explores the relationships between pairs of variables.

In [None]:
# Default Rate by Age Group
print("=" * 60)
print("DEFAULT RATE BY AGE GROUP")
print("=" * 60)

age_group_default = df.groupby('Age Group')['Defaulted'].agg(['sum', 'count'])
age_group_default['default_rate'] = (age_group_default['sum'] / age_group_default['count']) * 100
age_group_default = age_group_default.sort_values('default_rate', ascending=False)

print(age_group_default)

fig, ax = plt.subplots(figsize=(12, 6))
colors = ['#e74c3c' if rate > df['Defaulted'].mean() * 100 else '#3498db' 
          for rate in age_group_default['default_rate']]
age_group_default['default_rate'].plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Age Group')
ax.set_ylabel('Default Rate (%)')
ax.set_title('Default Rate by Age Group')
ax.axhline(y=df['Defaulted'].mean() * 100, color='red', linestyle='--', 
           label=f'Overall Default Rate: {df["Defaulted"].mean() * 100:.1f}%')
ax.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Default Rate by Education Level
print("\n" + "=" * 60)
print("DEFAULT RATE BY EDUCATION LEVEL")
print("=" * 60)

education_default = df.groupby('Education')['Defaulted'].agg(['sum', 'count'])
education_default['default_rate'] = (education_default['sum'] / education_default['count']) * 100
education_default = education_default.sort_values('default_rate', ascending=False)

print(education_default)

fig, ax = plt.subplots(figsize=(12, 6))
colors = ['#e74c3c' if rate > df['Defaulted'].mean() * 100 else '#3498db' 
          for rate in education_default['default_rate']]
education_default['default_rate'].plot(kind='barh', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Default Rate (%)')
ax.set_ylabel('Education Level')
ax.set_title('Default Rate by Education Level')
ax.axvline(x=df['Defaulted'].mean() * 100, color='red', linestyle='--', 
           label=f'Overall Default Rate: {df["Defaulted"].mean() * 100:.1f}%')
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Default Rate by Income Bracket
print("\n" + "=" * 60)
print("DEFAULT RATE BY INCOME BRACKET")
print("=" * 60)

income_default = df.groupby('Income Brackets')['Defaulted'].agg(['sum', 'count'])
income_default['default_rate'] = (income_default['sum'] / income_default['count']) * 100
income_default = income_default.sort_values('default_rate', ascending=False)

print(income_default)

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if rate > df['Defaulted'].mean() * 100 else '#3498db' 
          for rate in income_default['default_rate']]
income_default['default_rate'].plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Income Bracket')
ax.set_ylabel('Default Rate (%)')
ax.set_title('Default Rate by Income Bracket')
ax.axhline(y=df['Defaulted'].mean() * 100, color='red', linestyle='--', 
           label=f'Overall Default Rate: {df["Defaulted"].mean() * 100:.1f}%')
ax.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Default Rate by Loan Purpose
print("\n" + "=" * 60)
print("DEFAULT RATE BY LOAN PURPOSE")
print("=" * 60)

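# Use only the first listed purpose so each loan falls into exactly one group;
# loans with no recorded purpose are grouped as 'Unknown'
df['Loan_Purpose_Clean'] = df['Loan Purpose'].fillna('Unknown').str.split(';').str[0].str.strip()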

purpose_default = df.groupby('Loan_Purpose_Clean')['Defaulted'].agg(['sum', 'count'])
purpose_default['default_rate'] = (purpose_default['sum'] / purpose_default['count']) * 100
purpose_default = purpose_default.sort_values('default_rate', ascending=False)

print(purpose_default)

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if rate > df['Defaulted'].mean() * 100 else '#3498db' 
          for rate in purpose_default['default_rate']]
purpose_default['default_rate'].plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Loan Purpose')
ax.set_ylabel('Default Rate (%)')
ax.set_title('Default Rate by Loan Purpose')
ax.axhline(y=df['Defaulted'].mean() * 100, color='red', linestyle='--', 
           label=f'Overall Default Rate: {df["Defaulted"].mean() * 100:.1f}%')
ax.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Income vs Loan Amount
print("\n" + "=" * 60)
print("INCOME VS LOAN AMOUNT ANALYSIS")
print("=" * 60)

fig, ax = plt.subplots(figsize=(12, 8))

scatter = ax.scatter(df['Monthly income'], df['Loan Given'], 
                     c=df['Defaulted'], cmap='RdYlGn_r', alpha=0.6, edgecolors='black', linewidth=0.5)
ax.set_xlabel('Monthly Income')
ax.set_ylabel('Loan Given')
ax.set_title('Monthly Income vs Loan Amount (Colored by Default Status)')

cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Defaulted (0=No, 1=Yes)')

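# Fit the trend line only on rows where both values are present;
# dropping NaNs from each column separately would misalign x and y
paired = df[['Monthly income', 'Loan Given']].dropna()
z = np.polyfit(paired['Monthly income'], paired['Loan Given'], 1)
p = np.poly1d(z)
x_line = np.linspace(paired['Monthly income'].min(), paired['Monthly income'].max(), 100)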
ax.plot(x_line, p(x_line), "r--", alpha=0.8, label='Trend Line')
ax.legend()

plt.tight_layout()
plt.show()

corr = df['Monthly income'].corr(df['Loan Given'])
print(f"\nCorrelation between Monthly Income and Loan Given: {corr:.4f}")

In [None]:
# Savings vs Default Rate
print("\n" + "=" * 60)
print("SAVINGS VS DEFAULT RATE ANALYSIS")
print("=" * 60)

df['Savings_Category'] = pd.cut(df['Total Savings'], 
                                 bins=[-1, 0, 5000, 20000, 50000, float('inf')],
                                 labels=['No Savings', 'Low (1-5K)', 'Medium (5K-20K)', 
                                        'High (20K-50K)', 'Very High (>50K)'])

# observed=False keeps all five savings categories in the result, even empty ones
savings_default = df.groupby('Savings_Category', observed=False)['Defaulted'].agg(['sum', 'count'])
savings_default['default_rate'] = (savings_default['sum'] / savings_default['count']) * 100

print(savings_default)

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if rate > df['Defaulted'].mean() * 100 else '#3498db' 
          for rate in savings_default['default_rate']]
savings_default['default_rate'].plot(kind='bar', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('Savings Category')
ax.set_ylabel('Default Rate (%)')
ax.set_title('Default Rate by Savings Category')
ax.axhline(y=df['Defaulted'].mean() * 100, color='red', linestyle='--', 
           label=f'Overall Default Rate: {df["Defaulted"].mean() * 100:.1f}%')
ax.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
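
The bin edges passed to `pd.cut` above are right-inclusive by default, so it is worth confirming where boundary values land. The sketch below applies the same bins and labels to a few made-up savings amounts.

```python
import pandas as pd

bins = [-1, 0, 5000, 20000, 50000, float("inf")]
labels = ["No Savings", "Low (1-5K)", "Medium (5K-20K)", "High (20K-50K)", "Very High (>50K)"]

# Boundary values plus one negative amount (illustrative only).
toy_savings = pd.Series([0, 1, 5000, 5001, 50000, 60000, -500])
categories = pd.cut(toy_savings, bins=bins, labels=labels)
print(pd.DataFrame({"savings": toy_savings, "category": categories}))
```

Exactly 5,000 falls in the low bracket and exactly 50,000 in the high bracket, because each interval includes its right edge. A value below -1 matches no bin and becomes missing, so negative savings, if any exist in the data, would silently drop out of the grouped rates.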

In [None]:
# Statistical Tests
print("\n" + "=" * 60)
print("STATISTICAL TESTS")
print("=" * 60)

print("\nChi-Square Tests for Independence with Defaulted:")
print("-" * 50)

categorical_cols = ['Age Group', 'Education', 'Income Brackets', 'Business type', 'Marital status']

for col in categorical_cols:
    try:
        contingency_table = pd.crosstab(df[col], df['Defaulted'])
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
        significance = "Significant" if p_value < 0.05 else "Not Significant"
        print(f"\n{col}:")
        print(f"  Chi-square statistic: {chi2:.4f}")
        print(f"  P-value: {p_value:.4f}")
        print(f"  Result: {significance} (alpha=0.05)")
    except Exception as e:
        print(f"\n{col}: Could not perform test - {str(e)}")
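
A chi-square p-value indicates whether an association is detectable, not how strong it is, and with many rows even weak associations can come out significant. Cramér's V is a common effect-size companion. The sketch below is one way to compute it; `cramers_v` and both toy tables are assumptions for illustration and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for an r x c contingency table (0 = no association, 1 = perfect)."""
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    if n == 0 or min_dim == 0:
        return float("nan")
    # correction=False so the statistic is the plain Pearson chi-square
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy tables (rows: category, columns: Defaulted 0/1), illustrative only.
perfect = pd.DataFrame([[10, 0], [0, 10]])
independent = pd.DataFrame([[5, 5], [5, 5]])
print(cramers_v(perfect), cramers_v(independent))
```

In the loop above, the same function could be called on `contingency_table` to report V next to each p-value.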

In [None]:
# T-test for numerical variables
print("\n" + "=" * 60)
print("T-TESTS FOR NUMERICAL VARIABLES")
print("=" * 60)

numerical_test_cols = ['Age', 'Monthly income', 'Loan Given', 'Total Savings', 'Business expenses']

print("\nIndependent Samples T-Tests (Defaulted vs Non-Defaulted):")
print("-" * 60)

for col in numerical_test_cols:
    try:
        group_0 = df[df['Defaulted'] == 0][col].dropna()
        group_1 = df[df['Defaulted'] == 1][col].dropna()
        
        if len(group_0) > 1 and len(group_1) > 1:
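            # Welch's t-test: does not assume equal variances, which is safer
            # when the defaulted and non-defaulted groups differ in size
            t_stat, p_value = stats.ttest_ind(group_0, group_1, equal_var=False)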
            significance = "Significant" if p_value < 0.05 else "Not Significant"
            print(f"\n{col}:")
            print(f"  Mean (Non-Defaulted): {group_0.mean():,.2f}")
            print(f"  Mean (Defaulted): {group_1.mean():,.2f}")
            print(f"  T-statistic: {t_stat:.4f}")
            print(f"  P-value: {p_value:.4g}")
            print(f"  Result: {significance} (alpha=0.05)")
    except Exception as e:
        print(f"\n{col}: Could not perform test - {str(e)}")
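
Financial variables such as income and savings are often right-skewed, in which case a rank-based test can be a useful cross-check on the t-test. The sketch below compares the two on synthetic log-normal data; the group sizes and distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for a variable such as 'Monthly income'
rng = np.random.default_rng(42)
non_defaulted = rng.lognormal(mean=10.0, sigma=0.8, size=300)
defaulted = rng.lognormal(mean=9.8, sigma=0.8, size=80)

# Welch's t-test compares means without assuming equal variances
t_stat, t_p = stats.ttest_ind(non_defaulted, defaulted, equal_var=False)

# Mann-Whitney U compares ranks and makes no normality assumption
u_stat, u_p = stats.mannwhitneyu(non_defaulted, defaulted, alternative='two-sided')

print(f"Welch t-test:   t = {t_stat:.3f}, p = {t_p:.4g}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {u_p:.4g}")
```

If the two tests disagree for a column, the distribution plots from the earlier sections are a good place to look for outliers or heavy tails.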

## 10. Key Insights and Findings

This section summarizes the key findings from the exploratory data analysis.

In [None]:
# Summary Statistics for Key Findings
print("=" * 70)
print("SUMMARY OF KEY INSIGHTS AND FINDINGS")
print("=" * 70)

# 1. Dataset Overview
print("\n1. DATASET OVERVIEW")
print("-" * 50)
print(f"   Total records: {len(df)}")
print(f"   Total features: {len(df.columns)}")
print(f"   Numerical features: {len(df.select_dtypes(include=[np.number]).columns)}")
# Include 'category' dtype so binned columns such as Savings_Category are counted
print(f"   Categorical features: {len(df.select_dtypes(include=['object', 'category']).columns)}")

# 2. Target Variable
print("\n2. TARGET VARIABLE (Defaulted)")
print("-" * 50)
default_rate = df['Defaulted'].mean() * 100
print(f"   Default Rate: {default_rate:.1f}%")
print(f"   Non-Default Rate: {100 - default_rate:.1f}%")

# 3. Demographics
print("\n3. DEMOGRAPHIC INSIGHTS")
print("-" * 50)
print(f"   Average Age: {df['Age'].mean():.1f} years")
print(f"   Most Common Age Group: {df['Age Group'].mode().iloc[0]}")
print(f"   Most Common Education Level: {df['Education'].mode().iloc[0]}")
print(f"   Most Common Marital Status: {df['Marital status'].mode().iloc[0]}")
print(f"   Primary County: {df['County'].mode().iloc[0]}")

# 4. Loan Characteristics
print("\n4. LOAN CHARACTERISTICS")
print("-" * 50)
print(f"   Average Loan Amount: KES {df['Loan Given'].mean():,.0f}")
print(f"   Median Loan Amount: KES {df['Loan Given'].median():,.0f}")

# 5. Financial Profile
print("\n5. FINANCIAL PROFILE")
print("-" * 50)
print(f"   Average Monthly Income: KES {df['Monthly income'].mean():,.0f}")
print(f"   Median Monthly Income: KES {df['Monthly income'].median():,.0f}")
print(f"   Average Total Savings: KES {df['Total Savings'].mean():,.0f}")
no_savings_pct = (df['Total Savings'] == 0).sum() / len(df) * 100
print(f"   Clients with No Savings: {no_savings_pct:.1f}%")

# 6. Business Profile
print("\n6. BUSINESS PROFILE")
print("-" * 50)
print(f"   Most Common Business Type: {df['Business type'].mode().iloc[0]}")
print(f"   Number of Unique Specific Businesses: {df['Specific Biz'].nunique()}")

# 7. Data Quality Issues
print("\n7. DATA QUALITY ISSUES")
print("-" * 50)
missing_cols = (df.isnull().sum() > 0).sum()
high_missing = (df.isnull().sum() / len(df) * 100 > 50).sum()
print(f"   Columns with missing values: {missing_cols}")
print(f"   Columns with >50% missing: {high_missing}")
print(f"   Total missing values: {df.isnull().sum().sum()}")

In [None]:
# Top 10 Key Findings Summary
print("\n" + "=" * 70)
print("TOP 10 KEY FINDINGS")
print("=" * 70)

findings = [
    f"1. The dataset contains {len(df)} loan records with {len(df.columns)} features.",
    f"2. Default Rate: {df['Defaulted'].mean() * 100:.1f}% of loans defaulted.",
    f"3. The majority of clients are from {df['County'].mode().iloc[0]} county.",
    f"4. Most clients have {df['Education'].mode().iloc[0]} education level.",
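    # Findings 5 and 6 are computed from the data rather than hard-coded
    f"5. {df['Business type'].value_counts().index[0]} is the most common business type, followed by {df['Business type'].value_counts().index[1]}.",
    f"6. Average loan amount is KES {df['Loan Given'].mean():,.0f} (SD: KES {df['Loan Given'].std():,.0f}).",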
    f"7. {(df['Total Savings'] == 0).sum() / len(df) * 100:.1f}% of clients have no savings.",
    f"8. Monthly income averages KES {df['Monthly income'].mean():,.0f} (median: KES {df['Monthly income'].median():,.0f}).",
    f"9. Default rates vary across age groups and education levels.",
    f"10. Digital Lenders are the primary loan source."
]

for finding in findings:
    print(f"\n{finding}")

In [None]:
# Risk Factors and Recommendations
print("\n" + "=" * 70)
print("POTENTIAL RISK FACTORS FOR LOAN DEFAULT")
print("=" * 70)

print('''
Based on the analysis, the following factors may increase default risk:

1. SAVINGS STATUS
   - Clients with no savings may show a higher default tendency
   - Recommendation: Consider savings history in credit scoring

2. BUSINESS TYPE
   - Certain business types show higher default rates
   - Recommendation: Apply differential risk assessment by business category

3. INCOME-TO-LOAN RATIO
   - High loan amounts relative to income may increase default risk
   - Recommendation: Enforce stricter loan-to-income limits

4. EXPENSE PATTERNS
   - High expense-to-income ratios may strain repayment capacity
   - Recommendation: Include expense analysis in credit assessment

5. AGE AND EXPERIENCE
   - Younger or less experienced borrowers may show different default patterns
   - Recommendation: Consider age-adjusted risk models

6. EDUCATION LEVEL
   - Education may correlate with financial literacy and default risk
   - Recommendation: Offer financial literacy programs alongside loans
''')
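
The ratio-based risk factors listed above can be turned into columns for later inspection. This is a minimal sketch under assumptions: `add_ratio_features` and the three output column names are hypothetical, the input column names follow the ones used elsewhere in this notebook, and the demo rows are made up.

```python
import numpy as np
import pandas as pd

def add_ratio_features(frame):
    """Return a copy of `frame` with illustrative ratio columns added."""
    out = frame.copy()
    # Treat zero denominators as missing so ratios become NaN instead of inf
    income = out['Monthly income'].replace(0, np.nan)
    loan = out['Loan Given'].replace(0, np.nan)
    out['loan_to_income'] = out['Loan Given'] / income
    out['expense_to_income'] = out['Business expenses'] / income
    out['savings_to_loan'] = out['Total Savings'] / loan
    return out

# Made-up rows for illustration only
demo = pd.DataFrame({
    'Monthly income': [20000, 0, 15000],
    'Loan Given': [40000, 10000, 0],
    'Business expenses': [5000, 2000, 7500],
    'Total Savings': [10000, 0, 3000],
})
result = add_ratio_features(demo)
print(result)
```

Calling `add_ratio_features(df)` on the loaded dataset would allow the same group-by-default comparisons used in the earlier sections to be applied to these ratios.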

In [None]:
# Recommendations for Predictive Modeling
print("\n" + "=" * 70)
print("RECOMMENDATIONS FOR PREDICTIVE MODELING")
print("=" * 70)

print('''
DATA PREPROCESSING NEEDS:
1. Handle missing values (imputation or exclusion based on analysis)
2. Address class imbalance using SMOTE, undersampling, or class weights
   (apply any resampling to the training split only, never to the test data)
3. Encode categorical variables appropriately
4. Scale numerical features for algorithms sensitive to magnitude
5. Create derived features (e.g., loan-to-income ratio, savings-to-loan ratio)

SUGGESTED FEATURES FOR MODELING:
- Age, Age Group
- Monthly income, Income Brackets
- Total Savings, Savings frequency
- Loan Given, Loan Purpose
- Business type, Specific Biz
- Business expenses, Expense Relative to Income
- Affordability metrics
- Education level
- Marital status
- Geographic features (County, Constituency)

RECOMMENDED ALGORITHMS:
1. Logistic Regression (baseline, interpretable)
2. Random Forest (handles non-linear relationships)
3. XGBoost/LightGBM (high performance; supports class weighting, e.g. scale_pos_weight)
4. Neural Networks (for complex patterns)

EVALUATION METRICS:
- Due to class imbalance, prioritize:
  - Area Under ROC Curve (AUC-ROC)
  - Precision-Recall curves
  - F1-Score
  - Matthews Correlation Coefficient
''')

print("\n" + "=" * 70)
print("END OF EXPLORATORY DATA ANALYSIS")
print("=" * 70)
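
To illustrate why the evaluation metrics listed above matter under class imbalance, the sketch below computes accuracy, F1, and the Matthews Correlation Coefficient from a confusion matrix. `imbalance_metrics` is a hypothetical helper written with NumPy only, and the labels are made up: a model that predicts "no default" for everyone looks accurate but scores zero on F1 and MCC.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC for binary labels (1 = defaulted)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # MCC is defined as 0 here when any row or column of the confusion matrix is empty
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1, 'mcc': mcc}

# Made-up labels: 2 defaults out of 10 loans
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10  # a "model" that never predicts default

print(imbalance_metrics(y_true, always_negative))
print(imbalance_metrics(y_true, y_true))
```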