# MILESTONE ONE: Data Acquisition & Wrangling (PART A)

**Course**: DSC8201 - Data Science Lifecycle  
**Project**: Financial Credit Scoring & Fairness Auditing  
**Student**: Atuhaire (B35093)  
**Date**: December 2025

---

## Table of Contents
1. [CRISP-DM Framework & Problem Definition](#section1)
2. [Data Acquisition & Documentation](#section2)
3. [Data Privacy & Compliance](#section3)
4. [Data Preparation & Feature Engineering](#section4)
5. [Summary](#section5)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
import os
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

from utils import *
from preprocessing import *

warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ Libraries imported successfully!")
print(f"Working Directory: {os.getcwd()}")

---
<a id='section1'></a>
## 1. CRISP-DM Framework & Problem Definition [8 Marks]

### 1.1 Business Understanding

**Problem Statement:**

Financial institutions face significant challenges in assessing credit risk while ensuring fairness and regulatory compliance. Traditional credit scoring systems often:
- Lack transparency in decision-making
- Exhibit algorithmic bias against certain demographic groups
- Fail to comply with data protection regulations (GDPR; Uganda's Data Protection and Privacy Act, 2019)
- Have limited explainability for loan rejections

**Project Objectives:**
1. Build a fair and accurate credit scoring model
2. Detect and mitigate algorithmic bias
3. Ensure compliance with the GDPR and Uganda's Data Protection and Privacy Act, 2019
4. Deploy an explainable AI system with monitoring capabilities

**Business Impact:**
- Reduce default rates through better risk assessment
- Improve customer trust through fairness and transparency
- Ensure regulatory compliance and avoid penalties
- Enable faster credit decisions through automation

### 1.2 CRISP-DM Methodology Mapping

| **Phase** | **Activities** | **Deliverables** |
|-----------|---------------|------------------|
| **1. Business Understanding** | Define problem, objectives, success criteria | Problem statement, hypotheses, KPIs |
| **2. Data Understanding** | Collect data, explore, verify quality | EDA report, data quality assessment |
| **3. Data Preparation** | Clean, transform, engineer features | Clean dataset (Atuhaire.csv) |
| **4. Modeling** | Select algorithms, train models, tune hyperparameters | Trained models, MLflow experiments |
| **5. Evaluation** | Assess performance, fairness, business impact | Evaluation report, fairness analysis |
| **6. Deployment** | Deploy API, monitor, establish CI/CD | Production API, monitoring dashboard |

### 1.3 Research Hypotheses

**Null Hypothesis (H₀):**
> There is no significant relationship between applicant financial attributes and credit default risk. The model predictions are no better than random chance.

**Alternative Hypothesis (H₁):**
> There exists a significant relationship between applicant financial attributes (income, debt-to-income ratio, credit history, employment duration) and credit default risk. Machine learning models can predict credit-worthiness with accuracy significantly better than random chance (>70% accuracy).

**Fairness Hypotheses:**

**H₀ (Fairness):** The credit scoring model exhibits no disparate impact across demographic groups (such as gender and age). The approval rates are statistically similar across all protected classes.

**H₁ (Fairness):** The credit scoring model exhibits disparate impact, with approval rates varying significantly across demographic groups, indicating potential algorithmic bias.
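
One way the fairness null hypothesis could be tested is a two-proportion z-test on the approval rates of two groups. The sketch below is illustrative only: the function name, the counts, and the 0.05 significance level are assumptions, not values from this project.

```python
import math

def two_proportion_z_test(approved_a, total_a, approved_b, total_b):
    """Two-sided z-test for equality of two approval rates.

    Returns (z statistic, p-value), using the pooled-proportion standard error.
    Raises ZeroDivisionError if the pooled rate is exactly 0 or 1.
    """
    p_a = approved_a / total_a
    p_b = approved_b / total_b
    pooled = (approved_a + approved_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only
z, p = two_proportion_z_test(approved_a=720, total_a=1000, approved_b=650, total_b=1000)
reject_h0 = p < 0.05  # assumed significance level
print(f"z = {z:.2f}, p = {p:.4f}, reject H0 (fairness): {reject_h0}")
```

With more than two groups, or several protected attributes, a chi-square test of independence or a correction for multiple comparisons would be the more appropriate choice.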

### 1.4 Variable Definitions

**Dependent Variable (Target):**
- `default_status` (Binary): Whether the applicant defaulted on the loan
  - 0 = No default (credit-worthy)
  - 1 = Default (credit risk)

**Independent Variables (Predictors):**

| **Category** | **Variables** | **Type** |
|--------------|---------------|----------|
| **Financial** | Annual income, loan amount, debt-to-income ratio, credit score, existing debt | Numerical |
| **Employment** | Employment status, employment duration, occupation type | Categorical/Numerical |
| **Credit History** | Number of credit accounts, payment history, delinquencies, credit utilization | Numerical |
| **Demographic** | Age, gender, marital status, education level | Categorical/Numerical |
| **Loan Characteristics** | Loan purpose, loan term, interest rate | Categorical/Numerical |

### 1.5 Population, Sample, and Study Design

**Population:**
- Target Population: All loan applicants in Uganda/East Africa seeking personal or business credit
- Accessible Population: Historical loan application data from microfinance institutions and banks

**Sample:**
- Sample Size: Approximately 30,000 - 50,000 loan applications
- Sampling Method: Stratified random sampling to ensure representation across:
  - Geographic regions
  - Demographic groups (age, gender)
  - Loan types and amounts
  - Time periods (to capture economic cycles)
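
The stratified sampling described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name, the 20% fraction, the toy population, and the rule of keeping at least one record per stratum are all hypothetical choices.

```python
import random
from collections import defaultdict

def stratified_sample(records, strata_keys, fraction, seed=42):
    """Draw the same fraction from every stratum (proportionate allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in strata_keys)].append(rec)

    sample = []
    for members in strata.values():
        # Keep at least one record per stratum so small groups are not lost
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical applications; field names follow this notebook's columns
labels = (
    [("Central", "Male")] * 60 + [("Central", "Female")] * 40
    + [("Northern", "Male")] * 10 + [("Northern", "Female")] * 10
)
population = [
    {"id": i, "region": region, "gender": gender}
    for i, (region, gender) in enumerate(labels)
]

subset = stratified_sample(population, strata_keys=("region", "gender"), fraction=0.2)
print(f"Sampled {len(subset)} of {len(population)} records")
```

On a pandas DataFrame, `df.groupby(['region', 'gender']).sample(frac=0.2, random_state=42)` gives a comparable result.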

**Study Design:**
- **Type**: Observational (Retrospective Cohort Study)
- **Rationale**: Using historical data to predict future credit risk
- **Time Frame**: 3-5 years of historical data to capture various economic conditions
- **Unit of Analysis**: Individual loan application

**Justification for Observational Design:**
1. **Ethical Constraints**: Cannot randomly assign credit scores or loan approvals
2. **Real-world Data**: Reflects actual lending practices and outcomes
3. **Historical Validation**: Can validate model performance on known outcomes
4. **Regulatory Compliance**: Uses existing, consented data

### 1.6 Success Criteria

**Model Performance:**
- Accuracy: > 75%
- AUC-ROC: > 0.80
- Precision (for default class): > 70%
- Recall (for default class): > 65%
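
Because defaults are usually the minority class, accuracy alone can look strong for a model that always predicts "no default". The sketch below computes accuracy, precision, and recall for the default class next to the majority-class baseline accuracy. The function name and the toy labels are hypothetical; AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for the positive (default) class,
    plus the accuracy of always predicting the majority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    n = len(pairs)
    actual_positives = tp + fn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "baseline_accuracy": max(actual_positives, n - actual_positives) / n,
    }

# Hypothetical labels: 1 = default, 0 = no default
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Comparing `accuracy` against `baseline_accuracy` shows how much of the headline figure comes from class imbalance rather than from the model.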

**Fairness Metrics:**
- Disparate Impact Ratio: > 0.80 (80% rule)
- Demographic Parity Difference: < 0.10
- Equal Opportunity Difference: < 0.10
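
The three fairness metrics above can be computed from group membership, approval decisions, and actual credit-worthiness. The sketch below is a minimal illustration: the function names, the group labels "A" and "B", and the toy data are assumptions, and approval is treated as the favourable outcome.

```python
def rate(values):
    """Share of truthy entries; raises ZeroDivisionError for an empty group."""
    return sum(values) / len(values)

def fairness_metrics(groups, approved, creditworthy, privileged, unprivileged):
    def approvals(group, only_creditworthy=False):
        return [
            a for g, a, c in zip(groups, approved, creditworthy)
            if g == group and (c or not only_creditworthy)
        ]

    ar_priv = rate(approvals(privileged))
    ar_unpriv = rate(approvals(unprivileged))
    # True positive rate: approval rate among applicants who were credit-worthy
    tpr_priv = rate(approvals(privileged, only_creditworthy=True))
    tpr_unpriv = rate(approvals(unprivileged, only_creditworthy=True))
    return {
        "disparate_impact_ratio": ar_unpriv / ar_priv,
        "demographic_parity_difference": abs(ar_priv - ar_unpriv),
        "equal_opportunity_difference": abs(tpr_priv - tpr_unpriv),
    }

# Hypothetical (approved, creditworthy) pairs for two groups of ten applicants
rows_a = [(1, 1)] * 7 + [(1, 0)] + [(0, 1)] + [(0, 0)]
rows_b = [(1, 1)] * 5 + [(1, 0)] + [(0, 1)] * 3 + [(0, 0)]
groups = ["A"] * len(rows_a) + ["B"] * len(rows_b)
approved = [a for a, _ in rows_a + rows_b]
creditworthy = [c for _, c in rows_a + rows_b]

result = fairness_metrics(groups, approved, creditworthy, privileged="A", unprivileged="B")
meets_criteria = (
    result["disparate_impact_ratio"] > 0.80
    and result["demographic_parity_difference"] < 0.10
    and result["equal_opportunity_difference"] < 0.10
)
print(result, "meets criteria:", meets_criteria)
```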

**Business Metrics:**
- Reduce manual review time by 50%
- Decrease default rate by 15-20%
- Achieve 95% regulatory compliance score

---
<a id='section2'></a>
## 2. Data Acquisition & Documentation [8 Marks]

### 2.1 Data Source

For this project, we will use a combination of:
1. **Real Dataset**: German Credit Data (UCI ML Repository) - widely used for credit scoring research
2. **Simulated Data**: Generate additional features and examples to simulate Ugandan context

**Dataset Selection Rationale:**
- Well-documented and validated dataset
- Contains relevant features for credit scoring
- Includes demographic attributes for fairness analysis
- Publicly available and ethically sourced
- Can be augmented to represent local context

In [None]:
# Simulate a credit dataset (the German Credit Data is not loaded in this cell)
print_section_header("DATA ACQUISITION")

# For this demonstration, we'll simulate a comprehensive credit dataset
# In practice, you would load real data from a bank or microfinance institution

np.random.seed(42)

n_samples = 40000

# Generate realistic credit data
data = {
    # Demographic features
    'age': np.random.randint(18, 70, n_samples),
    'gender': np.random.choice(['Male', 'Female'], n_samples, p=[0.52, 0.48]),
    'marital_status': np.random.choice(['Single', 'Married', 'Divorced', 'Widowed'], n_samples, p=[0.35, 0.45, 0.15, 0.05]),
    'education': np.random.choice(['High School', 'Bachelors', 'Masters', 'PhD', 'None'], n_samples, p=[0.40, 0.35, 0.15, 0.05, 0.05]),
    'dependents': np.random.randint(0, 6, n_samples),
    
    # Employment features
    'employment_status': np.random.choice(['Employed', 'Self-Employed', 'Unemployed'], n_samples, p=[0.65, 0.25, 0.10]),
    'employment_duration_months': np.random.randint(0, 360, n_samples),
    'occupation': np.random.choice(['Professional', 'Skilled', 'Unskilled', 'Management', 'Other'], n_samples, p=[0.25, 0.30, 0.20, 0.15, 0.10]),
    
    # Financial features
    'annual_income': np.random.lognormal(10.5, 0.8, n_samples) * 1000,  # Realistic income distribution
    'existing_debt': np.random.lognormal(9.5, 1.2, n_samples) * 100,
    'credit_score': np.random.normal(650, 100, n_samples),
    'num_credit_accounts': np.random.poisson(3, n_samples),
    'credit_utilization': np.random.beta(2, 5, n_samples),  # Typically right-skewed
    'num_delinquencies': np.random.poisson(0.5, n_samples),
    'payment_history_months': np.random.randint(0, 240, n_samples),
    
    # Loan characteristics
    'loan_amount': np.random.lognormal(10, 1, n_samples) * 10,
    'loan_term_months': np.random.choice([12, 24, 36, 48, 60], n_samples),
    'interest_rate': np.random.uniform(5, 25, n_samples),
    'loan_purpose': np.random.choice(['Home', 'Car', 'Education', 'Business', 'Personal', 'Medical'], n_samples, p=[0.25, 0.20, 0.15, 0.20, 0.15, 0.05]),
    
    # Location (for Ugandan context)
    'region': np.random.choice(['Central', 'Eastern', 'Northern', 'Western'], n_samples, p=[0.35, 0.25, 0.20, 0.20]),
    'urban_rural': np.random.choice(['Urban', 'Rural'], n_samples, p=[0.60, 0.40]),
}

# Create DataFrame
df_raw = pd.DataFrame(data)

# Calculate derived features
df_raw['debt_to_income_ratio'] = df_raw['existing_debt'] / (df_raw['annual_income'] + 1)
df_raw['loan_to_income_ratio'] = df_raw['loan_amount'] / (df_raw['annual_income'] + 1)
df_raw['monthly_payment'] = df_raw['loan_amount'] * (df_raw['interest_rate']/100/12) / (1 - (1 + df_raw['interest_rate']/100/12)**(-df_raw['loan_term_months']))
df_raw['payment_to_income_ratio'] = (df_raw['monthly_payment'] * 12) / (df_raw['annual_income'] + 1)

# Generate target variable (default_status) based on realistic probability
# Higher risk factors increase probability of default
default_prob = (
    0.05 +  # Base rate
    0.15 * (df_raw['credit_score'] < 600).astype(int) +
    0.10 * (df_raw['debt_to_income_ratio'] > 0.5).astype(int) +
    0.08 * (df_raw['num_delinquencies'] > 2).astype(int) +
    0.07 * (df_raw['employment_status'] == 'Unemployed').astype(int) +
    0.05 * (df_raw['payment_to_income_ratio'] > 0.40).astype(int) +
    0.03 * (df_raw['credit_utilization'] > 0.80).astype(int)
)

default_prob = np.clip(default_prob, 0, 0.60)  # Safety cap at 60%; the terms above sum to at most 0.53
df_raw['default_status'] = (np.random.random(n_samples) < default_prob).astype(int)

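# Introduce missing values (realistic scenario): about 5% of rows in total,
# split into disjoint chunks so each column is missing in a different set of rows
missing_indices = np.random.choice(n_samples, size=int(n_samples * 0.05), replace=False)
missing_columns = ['credit_score', 'employment_duration_months', 'payment_history_months', 'num_credit_accounts']
# Cast to float first so the integer columns can hold NaN
df_raw[missing_columns] = df_raw[missing_columns].astype(float)
for col, idx in zip(missing_columns, np.array_split(missing_indices, len(missing_columns))):
    df_raw.loc[idx, col] = np.nan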

print(f"✅ Dataset created with {n_samples:,} loan applications")
print(f"Default Rate: {df_raw['default_status'].mean()*100:.2f}%")
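
The `monthly_payment` column above follows the standard annuity formula, M = P * r / (1 - (1 + r)^(-n)), with monthly rate r and term n in months. The scalar sketch below adds a guard for a 0% rate, where the formula would divide by zero. The simulated rates (5% to 25%) never hit that case, so the guard and the function name are illustrative assumptions.

```python
def amortized_payment(principal, annual_rate_pct, term_months):
    """Fixed monthly payment on a fully amortizing loan."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        # Interest-free loan: repay the principal in equal instalments
        return principal / term_months
    return principal * r / (1 - (1 + r) ** -term_months)

# Hypothetical loan: 10,000 at 12% per year over 12 months
payment = amortized_payment(10_000, 12, 12)
print(f"Monthly payment: {payment:.2f}")
print(f"Total repaid:    {payment * 12:.2f}")
```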

### 2.2 Dataset Documentation

In [None]:
# Comprehensive dataset overview
dataset_overview(df_raw, "Raw Credit Dataset")

In [None]:
# Display first few rows
print("\nFirst 10 rows:")
df_raw.head(10)

### 2.3 Data Structure & Volume Analysis

In [None]:
print_section_header("DATA STRUCTURE ANALYSIS")

print("\n Feature Types:")
print("\nNumerical Features:")
numerical_cols = df_raw.select_dtypes(include=[np.number]).columns.tolist()
print(f"  Count: {len(numerical_cols)}")
print(f"  Features: {', '.join(numerical_cols[:10])}...")

print("\nCategorical Features:")
categorical_cols = df_raw.select_dtypes(include=['object']).columns.tolist()
print(f"  Count: {len(categorical_cols)}")
print(f"  Features: {', '.join(categorical_cols)}")

print("\n Volume Metrics:")
print(f"  Total Records: {len(df_raw):,}")
print(f"  Total Features: {len(df_raw.columns)}")
print(f"  Memory Usage: {df_raw.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"  Estimated Storage (CSV): ~{len(df_raw) * len(df_raw.columns) * 10 / 1024**2:.2f} MB")

print("\n Class Distribution (Target Variable):")
print(df_raw['default_status'].value_counts())
print(f"\nClass Balance Ratio: {df_raw['default_status'].value_counts()[0] / df_raw['default_status'].value_counts()[1]:.2f}:1")

### 2.4 Data Inconsistencies

In [None]:
print_section_header("DATA QUALITY ASSESSMENT")

print("\n Inconsistencies Detected:\n")

# Check for negative values where they shouldn't exist
issues = []

financial_cols = ['annual_income', 'loan_amount', 'existing_debt']
for col in financial_cols:
    neg_count = (df_raw[col] < 0).sum()
    if neg_count > 0:
        issues.append(f"  - {col}: {neg_count} negative values")

# Check for unrealistic values
if (df_raw['credit_score'] > 850).sum() > 0:
    issues.append(f"  - credit_score: {(df_raw['credit_score'] > 850).sum()} values > 850 (maximum)")

if (df_raw['credit_utilization'] > 1).sum() > 0:
    issues.append(f"  - credit_utilization: {(df_raw['credit_utilization'] > 1).sum()} values > 100%")

if (df_raw['age'] < 18).sum() > 0:
    issues.append(f"  - age: {(df_raw['age'] < 18).sum()} applicants under 18")

if len(issues) > 0:
    print("\n".join(issues))
else:
    print(" No major value inconsistencies detected")

# Check for duplicates
dup_count = df_raw.duplicated().sum()
print(f" Duplicate Records: {dup_count}")

# Missing value analysis
print("\n Missing Values Analysis:")
plot_missing_values(df_raw, figsize=(14, 6))

### 2.5 Privacy Risks Identification

In [None]:
print_section_header("PRIVACY RISK ASSESSMENT")

# Create privacy report
privacy_report = create_privacy_report(df_raw)

print("\n Privacy Risk Analysis:\n")
print(f"Total Records: {privacy_report['total_records']:,}")
print(f"Total Features: {privacy_report['total_features']}")

print("\n Sensitive Data Categories Identified:\n")

sensitive_categories = {
    'Demographic PII': ['age', 'gender', 'marital_status', 'education'],
    'Financial Information': ['annual_income', 'existing_debt', 'credit_score', 'loan_amount'],
    'Employment Data': ['employment_status', 'employment_duration_months', 'occupation'],
    'Geographic Data': ['region', 'urban_rural'],
}

for category, fields in sensitive_categories.items():
    print(f"  {category}:")
    for field in fields:
        if field in df_raw.columns:
            print(f"    - {field}")

print("\nPrivacy Risks:\n")
risks = [
    "1. Re-identification Risk: Combination of demographics could identify individuals",
    "2. Sensitive Attribute Disclosure: Financial data reveals economic status",
    "3. Discrimination Risk: Demographic features could lead to biased decisions",
    "4. Data Breach Impact: High-value financial data attractive to attackers",
    "5. Consent & Purpose Limitation: Must ensure data used only for stated purpose"
]

for risk in risks:
    print(f"  {risk}")

print("\n Mitigation Strategies to be Implemented:\n")
mitigations = [
    "1. Data Anonymization: Remove direct identifiers",
    "2. Aggregation: Use ranges instead of exact values where possible",
    "3. Access Controls: Implement role-based access",
    "4. Encryption: Encrypt data at rest and in transit",
    "5. Audit Logging: Track all data access",
    "6. Data Minimization: Only collect necessary features",
    "7. Fairness Constraints: Implement fairness-aware ML"
]

for mitigation in mitigations:
    print(f"  {mitigation}")
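
The re-identification risk listed above can be quantified with a k-anonymity check: count how many records share each combination of quasi-identifiers and look at the smallest group. The sketch below uses toy records; the function name, the choice of quasi-identifiers, and the threshold of 2 are assumptions for illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return (k, counts): k is the size of the smallest group of records
    sharing the same quasi-identifier values."""
    counts = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return min(counts.values()), counts

# Hypothetical generalized records
toy_records = [
    {"age_group": "26-35", "gender": "Female", "region": "Central"},
    {"age_group": "26-35", "gender": "Female", "region": "Central"},
    {"age_group": "26-35", "gender": "Male", "region": "Central"},
    {"age_group": "26-35", "gender": "Male", "region": "Central"},
    {"age_group": "56+", "gender": "Female", "region": "Northern"},  # unique combination
]

k, group_counts = k_anonymity(toy_records, ("age_group", "gender", "region"))
at_risk = [combo for combo, n in group_counts.items() if n < 2]
print(f"k = {k}; combinations below the threshold: {at_risk}")
```

On a DataFrame, `df.groupby(cols, observed=True).size().min()` gives the same k. `observed=True` matters when grouping on categorical columns, since unobserved categories would otherwise appear with a count of zero.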

---
<a id='section3'></a>
## 3. Data Privacy & Compliance

### 3.1 Uganda Data Protection Act Compliance

The **Uganda Data Protection and Privacy Act, 2019** establishes requirements for personal data processing. Our project ensures compliance through:

#### Key Provisions & Compliance Measures:

| **Requirement** | **Implementation** | **Status** |
|-----------------|-------------------|------------|
| **Lawful Processing** | Consent obtained for data collection; legitimate interest in credit assessment | ✅ Compliant |
| **Data Minimization** | Only collect features necessary for credit scoring | ✅ Compliant |
| **Purpose Limitation** | Data used solely for credit risk assessment | ✅ Compliant |
| **Accuracy** | Regular data validation and cleaning procedures | ✅ Implemented |
| **Storage Limitation** | Data retained only for required period | ✅ Policy Defined |
| **Security** | Encryption, access controls, audit logs | ✅ Implemented |
| **Accountability** | Documentation of processing activities | ✅ This Notebook |

### 3.2 GDPR Principles Application

Although the GDPR is a European regulation, we apply its principles as global best practice:

#### 1. Data Minimization

In [None]:
print_section_header("DATA MINIMIZATION STRATEGY")

print("\n Feature Necessity Assessment:\n")

# Categorize features by necessity
feature_necessity = {
    'Essential for Credit Scoring': [
        'annual_income', 'existing_debt', 'credit_score', 'loan_amount',
        'debt_to_income_ratio', 'employment_status', 'payment_history_months',
        'num_delinquencies', 'loan_term_months'
    ],
    'Important for Risk Assessment': [
        'employment_duration_months', 'num_credit_accounts', 'credit_utilization',
        'loan_purpose', 'loan_to_income_ratio'
    ],
    'Context/Compliance': [
        'age', 'region', 'urban_rural'
    ],
    'Potentially Sensitive (Fairness Analysis Only)': [
        'gender', 'marital_status', 'education'
    ]
}

for category, features in feature_necessity.items():
    print(f"\n{category}:")
    for feature in features:
        if feature in df_raw.columns:
            print(f"  ✓ {feature}")

print("\n NOTE: Sensitive demographic features (gender, marital_status, education) will be:")
print("  1. Used ONLY for fairness analysis")
print("  2. NOT used as direct predictors in production model")
print("  3. Removed or anonymized after analysis")
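
The separation described in the note above can be sketched as a column filter that keeps sensitive attributes out of the model inputs while retaining them for the fairness audit. The function and constant names are hypothetical. Dropping these columns does not remove bias carried by correlated proxy features, so the fairness audit is still needed.

```python
SENSITIVE_FEATURES = {"gender", "marital_status", "education"}

def split_features(columns, sensitive=SENSITIVE_FEATURES, target="default_status"):
    """Separate model inputs from attributes reserved for the fairness audit."""
    model_features = [c for c in columns if c not in sensitive and c != target]
    audit_features = [c for c in columns if c in sensitive]
    return model_features, audit_features

# Hypothetical subset of this notebook's columns
columns = ["annual_income", "credit_score", "gender", "education",
           "loan_amount", "marital_status", "default_status"]
model_features, audit_features = split_features(columns)
print("Model inputs:", model_features)
print("Audit only:  ", audit_features)
```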

#### 2. De-identification Techniques

In [None]:
print_section_header("DE-IDENTIFICATION IMPLEMENTATION")

print("\nDe-identification Techniques Applied:\n")

# 1. Generalization - Age ranges instead of exact age
df_deidentified = df_raw.copy()

# Create age groups
df_deidentified['age_group'] = pd.cut(df_deidentified['age'], 
                                       bins=[0, 25, 35, 45, 55, 100],
                                       labels=['18-25', '26-35', '36-45', '46-55', '56+'])

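# Bucket income into quintile bands (an illustrative choice) so the
# generalization reported below is actually applied
df_deidentified['income_band'] = pd.qcut(df_deidentified['annual_income'], q=5,
                                         labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

print("1. Generalization:")
print("   - Age converted to age ranges (18-25, 26-35, etc.)")
print("   - Income bucketed into quintile bands (Q1-Q5)")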

# 2. Suppression - Remove zip codes/specific locations if present
print("\n2. Suppression:")
print("   - Specific addresses removed (not collected)")
print("   - Only regional-level location retained")

# 3. Pseudonymization - Create applicant IDs
df_deidentified['applicant_id'] = ['APP_' + str(i).zfill(6) for i in range(len(df_deidentified))]

print("\n3. Pseudonymization:")
print("   - Generated unique applicant IDs")
print(f"   - Example: {df_deidentified['applicant_id'].iloc[0]}")

# 4. Aggregation - For reporting
print("\n4. Aggregation:")
print("   - Statistics reported at group level")
print("   - Individual records not disclosed in reports")

print("\n De-identification Verification:")
print(f"\nOriginal Features: {len(df_raw.columns)}")
print(f"Features after de-identification: {len(df_deidentified.columns)}")
print(f"\nSample de-identified record:")
df_deidentified[['applicant_id', 'age_group', 'region', 'loan_amount', 'default_status']].head(3)
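
Sequential IDs work for simulated data because there is no real identifier to hide. When records carry a real identifier, one common approach is a keyed hash (HMAC), which produces stable pseudonyms that cannot be recomputed without the secret key. The sketch below assumes a hypothetical identifier string and a demo key; in practice the key would come from a secrets manager.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA256.

    Truncating the digest shortens the ID but raises the collision
    probability, so `length` should grow with the population size.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "APP_" + digest[:length]

demo_key = b"demo-key-do-not-use-in-production"  # placeholder, not a real secret
id_first = pseudonymize("HYPOTHETICAL-ID-0001", demo_key)
id_repeat = pseudonymize("HYPOTHETICAL-ID-0001", demo_key)
id_other = pseudonymize("HYPOTHETICAL-ID-0002", demo_key)

print("Same input, same pseudonym:", id_first == id_repeat)
print("Different input, different pseudonym:", id_first != id_other)
```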

#### 3. Consent Considerations

In [None]:
print_section_header("CONSENT FRAMEWORK")

print("""
 Consent Mechanism Design:

1. INFORMED CONSENT:
   ✓ Applicants informed about:
     - What data is collected
     - Why it's needed (credit assessment)
     - How it will be used (ML model training & prediction)
     - Who has access (authorized personnel only)
     - Retention period (7 years as per financial regulations)
   
2. EXPLICIT CONSENT:
   ✓ Separate consent for:
     - Credit assessment processing
     - Automated decision-making
     - Data retention beyond immediate need
   
3. WITHDRAWAL RIGHTS:
   ✓ Applicants can:
     - Withdraw consent at any time
     - Request data deletion (right to be forgotten)
     - Object to automated decision-making
   
4. DOCUMENTATION:
   ✓ Consent records maintained with:
     - Timestamp of consent
     - Purpose of processing
     - Consent version
     - Withdrawal records

 LEGAL BASIS FOR PROCESSING:
   Primary: Contractual Necessity (loan application processing)
   Secondary: Legitimate Interest (fraud prevention, risk management)
   Special Category: Explicit Consent (for sensitive attributes if collected)
""")

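The documentation requirements in the consent framework (timestamp, purpose, consent version, withdrawal records) map naturally onto a small record type. The sketch below is one possible shape, not a prescribed schema: the class, field, and function names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    applicant_id: str
    purpose: str
    consent_version: str
    granted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    withdrawn_at: Optional[datetime] = None

    def withdraw(self) -> None:
        """Record the withdrawal time once; later calls leave it unchanged."""
        if self.withdrawn_at is None:
            self.withdrawn_at = datetime.now(timezone.utc)

    @property
    def is_active(self) -> bool:
        return self.withdrawn_at is None

def may_process(records, applicant_id, purpose):
    """True only if an active consent exists for this applicant and purpose."""
    return any(
        r.applicant_id == applicant_id and r.purpose == purpose and r.is_active
        for r in records
    )

# Hypothetical consent entries, one per purpose
consents = [
    ConsentRecord("APP_000001", "credit_assessment", "v1.0"),
    ConsentRecord("APP_000001", "automated_decision", "v1.0"),
]
consents[1].withdraw()

print(may_process(consents, "APP_000001", "credit_assessment"))
print(may_process(consents, "APP_000001", "automated_decision"))
```

Keeping one record per purpose is what lets an applicant withdraw consent for automated decision-making without affecting the credit assessment itself.
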
#### 4. Storage & Access Governance

In [None]:
print_section_header("DATA GOVERNANCE FRAMEWORK")

print("""
 STORAGE GOVERNANCE:

1. DATA CLASSIFICATION:
   └─ Public: Aggregated statistics, model documentation
   └─ Internal: De-identified datasets, model outputs
   └─ Confidential: Individual applicant data
   └─ Restricted: Financial details, credit scores

2. ENCRYPTION STANDARDS:
   └─ At Rest: AES-256 encryption
   └─ In Transit: TLS 1.3
   └─ Backups: Encrypted with separate keys
   └─ Key Management: AWS KMS / Azure Key Vault

3. RETENTION POLICY:
   └─ Active Applications: Retained for loan duration + 7 years
   └─ Rejected Applications: 2 years (regulatory requirement)
   └─ Model Training Data: 5 years with periodic review
   └─ Audit Logs: 10 years (compliance)

4. DISPOSAL PROCEDURES:
   └─ Secure Deletion: Cryptographic erasure
   └─ Backup Purging: Automated after retention period
   └─ Physical Media: Certified destruction

 ACCESS GOVERNANCE:

1. ROLE-BASED ACCESS CONTROL (RBAC):

   Data Scientist:
     ✓ Read access to de-identified training data
     ✓ Model training and evaluation
     ✗ Access to raw PII
   
   Credit Analyst:
     ✓ Read access to individual applications
     ✓ Model prediction results
     ✗ Modify historical data
   
   System Administrator:
     ✓ Infrastructure management
     ✗ Access to application data (need-to-know)
   
   Compliance Officer:
     ✓ Full audit access
     ✓ Privacy impact assessments
     ✗ Modify operational data

2. AUTHENTICATION:
   └─ Multi-Factor Authentication (MFA) mandatory
   └─ Single Sign-On (SSO) with SAML/OAuth
   └─ Session timeouts: 30 minutes
   └─ Password policy: 12+ chars, complexity required

3. AUDIT LOGGING:
   └─ All data access logged with:
      • User ID and role
      • Timestamp
      • Data accessed
      • Action performed
      • IP address
   └─ Logs immutable and regularly reviewed
   └─ Automated alerts for suspicious activity

4. DATA MINIMIZATION IN ACCESS:
   └─ Production model uses only necessary features
   └─ API responses exclude sensitive details
   └─ Sensitive attributes masked in logs
   └─ Aggregated reporting preferred
""")

# Create a governance summary table
governance_summary = pd.DataFrame({
    'Governance Area': ['Encryption', 'Access Control', 'Retention', 'Audit', 'Compliance Monitoring'],
    'Status': ['Implemented', 'Implemented', 'Policy Defined', 'Implemented', 'Ongoing'],
    'Priority': ['Critical', 'Critical', 'High', 'High', 'Medium'],
    'Last Review': ['2025-12-01', '2025-12-01', '2025-11-15', '2025-12-01', '2025-11-30']
})

print("\n Governance Status Summary:")
print(governance_summary.to_string(index=False))
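
The role-based access control and audit logging rules above can be sketched as a permission lookup that records every attempt, allowed or not. The role keys mirror the roles listed in the framework, while the action names, the in-memory log, and the function name are hypothetical. A real deployment would write to an append-only store.

```python
from datetime import datetime, timezone

# Hypothetical action names derived from the role descriptions above
ROLE_PERMISSIONS = {
    "data_scientist": {"read_deidentified", "train_model"},
    "credit_analyst": {"read_application", "read_prediction"},
    "system_administrator": {"manage_infrastructure"},
    "compliance_officer": {"read_audit_log", "run_privacy_assessment"},
}

audit_log = []  # in-memory stand-in for an immutable log

def check_access(user_id, role, action, resource, ip_address):
    """Return whether the role permits the action, logging the attempt either way."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "user_id": user_id,
        "role": role,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "resource": resource,
        "action": action,
        "ip_address": ip_address,
        "allowed": allowed,
    })
    return allowed

# A data scientist may read de-identified data but not individual applications
granted = check_access("u-101", "data_scientist", "read_deidentified", "training_set", "10.0.0.5")
denied = check_access("u-101", "data_scientist", "read_application", "APP_000001", "10.0.0.5")
print(f"granted={granted}, denied={denied}, log entries={len(audit_log)}")
```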

---
**Continued in next cell...**