# Group Assignment Preparation
Unpack the file "IS453 Group Assignment - Data.zip", then use the data files and the data dictionary to answer the questions below.

**DIY Q4: Explore Assignment Data**
- How many rows and columns are in each data file?
- Through which variable is the data in the two files linked?
- Is the relationship between the records in the files one-to-one or one-to-many? In which direction?
- Are all of the records in each file linked to the other? If not, which file has unlinked records?
- Which variables would potentially conflict with fair lending principles?

Application Data.csv: 307511 rows, 120 columns
Bureau Data.csv: 1716428 rows, 17 columns

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [27]:
app_df = pd.read_csv('IS453 Group Assignment - Application Data.csv')
bureau_df = pd.read_csv('IS453 Group Assignment - Bureau Data.csv')
data_df = pd.read_excel('IS453 Group Assignment - Data Dict.xlsx', sheet_name=['Application Data', 'Bureau Data'])

In [28]:
print(f"Application Data: {app_df.shape[0]} rows, {app_df.shape[1]} columns")
print(f"Bureau Data: {bureau_df.shape[0]} rows, {bureau_df.shape[1]} columns")

Application Data: 307511 rows, 120 columns
Bureau Data: 1716428 rows, 17 columns


In [29]:
common_cols = set(app_df.columns) & set(bureau_df.columns)
print("Common columns:", common_cols)

Common columns: {'SK_ID_CURR', 'AMT_ANNUITY'}


In [30]:
print(f"Unique SK_ID_CURR in app_df: {app_df['SK_ID_CURR'].nunique()}")
print(f"Total rows in app_df: {len(app_df)}")
print(f"Duplicates in app_df: {app_df['SK_ID_CURR'].duplicated().sum()}")

print(f"\nUnique SK_ID_CURR in bureau_df: {bureau_df['SK_ID_CURR'].nunique()}")
print(f"Total rows in bureau_df: {len(bureau_df)}")
print(f"Duplicates in bureau_df: {bureau_df['SK_ID_CURR'].duplicated().sum()}")

Unique SK_ID_CURR in app_df: 307511
Total rows in app_df: 307511
Duplicates in app_df: 0

Unique SK_ID_CURR in bureau_df: 305811
Total rows in bureau_df: 1716428
Duplicates in bureau_df: 1410617
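
The counts above show `SK_ID_CURR` is unique in the application file but repeated in the bureau file, which points to a one-to-many relationship from application records to bureau records. A minimal sketch of how that reading could be checked explicitly, assuming small made-up frames (`apps` and `bureau` are illustrative stand-ins, not the assignment data) and the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for app_df and bureau_df (all values invented for illustration)
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 2, 9], "SK_ID_BUREAU": [10, 11, 12, 13]})

# Number of bureau records per applicant ID
records_per_id = bureau.groupby("SK_ID_CURR").size()
print(records_per_id)

# validate="one_to_many" raises pandas.errors.MergeError if the key is
# duplicated on the left side, so a clean run supports the 1:N reading
linked = apps.merge(bureau, on="SK_ID_CURR", how="inner", validate="one_to_many")
print(f"Linked rows: {len(linked)}")

# A one-to-one check fails because the bureau key is not unique
try:
    apps.merge(bureau, on="SK_ID_CURR", validate="one_to_one")
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False
print(f"One-to-one: {is_one_to_one}")
```

On the real data the same calls would use `app_df` and `bureau_df` in place of the toy frames.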


In [31]:
app_ids = set(app_df['SK_ID_CURR'])
bureau_ids = set(bureau_df['SK_ID_CURR'])

print(f"IDs only in app_df: {len(app_ids - bureau_ids)}")
print(f"IDs only in bureau_df: {len(bureau_ids - app_ids)}")
print(f"IDs in both: {len(app_ids & bureau_ids)}")

IDs only in app_df: 44020
IDs only in bureau_df: 42320
IDs in both: 263491
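
The set arithmetic above can also be expressed as an outer merge with `indicator=True`, which labels each ID as present in the left file only, the right file only, or both. This is a sketch on invented toy frames (`apps` and `bureau` are hypothetical stand-ins for `app_df` and `bureau_df`):

```python
import pandas as pd

# Toy stand-ins (values invented for illustration)
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 2, 9]})

# De-duplicate the bureau IDs first so each ID is counted once
coverage = apps.merge(
    bureau[["SK_ID_CURR"]].drop_duplicates(),
    on="SK_ID_CURR",
    how="outer",
    indicator=True,  # adds a _merge column: left_only / right_only / both
)

counts = coverage["_merge"].value_counts()
print(counts)
```

Here `left_only` corresponds to "IDs only in app_df" and `right_only` to "IDs only in bureau_df".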


In [32]:
# Check column names for potentially sensitive variables
sensitive_keywords = ['GENDER', 'AGE', 'BIRTH', 'FAMILY', 'CHILDREN', 'RACE', 'RELIGION', 'ETHNICITY']

potentially_sensitive = [col for col in app_df.columns if any(keyword in col.upper() for keyword in sensitive_keywords)]
print("Potentially sensitive variables:")
for col in potentially_sensitive:
    print(f"  - {col}")

Potentially sensitive variables:
  - CODE_GENDER
  - CNT_CHILDREN
  - NAME_FAMILY_STATUS
  - DAYS_BIRTH
  - OWN_CAR_AGE
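
The keyword search matches on column names only, so it flags `OWN_CAR_AGE` (the age of a car, not of the applicant) and can miss variables whose names do not contain a keyword. One possible refinement, sketched below, is to search the data dictionary descriptions as well and keep a manual exclusion list. The toy dictionary is an assumption: its first two descriptions appear in the printed dictionary, the other two are invented wording, and `manual_exclusions` is a hypothetical review list.

```python
import pandas as pd

# Toy data dictionary; the last two descriptions are invented for illustration
data_dict = pd.DataFrame({
    "Row": ["CODE_GENDER", "FLAG_OWN_CAR", "DAYS_BIRTH", "OWN_CAR_AGE"],
    "Description": [
        "Gender of the client",
        "Flag if the client owns a car",
        "Age of the client in days (illustrative wording)",
        "Age of the client's car (illustrative wording)",
    ],
})

keywords = ["GENDER", "AGE", "BIRTH"]
# Columns judged on review to describe an object rather than the applicant
manual_exclusions = {"OWN_CAR_AGE"}

pattern = "|".join(keywords)
name_hit = data_dict["Row"].str.contains(pattern, case=False)
desc_hit = data_dict["Description"].str.contains(pattern, case=False)

candidates = data_dict.loc[name_hit | desc_hit, "Row"]
sensitive = [col for col in candidates if col not in manual_exclusions]
print(sensitive)
```

The keyword list and the exclusions still need a human decision; the code only narrows down what to review.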


In [33]:
print(data_df)

{'Application Data':                             Row  \
0                    SK_ID_CURR   
1                        TARGET   
2            NAME_CONTRACT_TYPE   
3                   CODE_GENDER   
4                  FLAG_OWN_CAR   
..                          ...   
115   AMT_REQ_CREDIT_BUREAU_DAY   
116  AMT_REQ_CREDIT_BUREAU_WEEK   
117   AMT_REQ_CREDIT_BUREAU_MON   
118   AMT_REQ_CREDIT_BUREAU_QRT   
119  AMT_REQ_CREDIT_BUREAU_YEAR   

                                           Description Special  
0                             ID of loan in our sample     NaN  
1    Target variable (1 - client with payment diffi...     NaN  
2          Identification if loan is cash or revolving     NaN  
3                                 Gender of the client     NaN  
4                        Flag if the client owns a car     NaN  
..                                                 ...     ...  
115  Number of enquiries to Credit Bureau about the...     NaN  
116  Number of enquiries to Credit Bur

In [34]:
bureau_df.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


In [35]:
# Application Data columns
print("Application Data Columns:")
print(app_df.columns.tolist())

# Bureau Data columns
print("\nBureau Data Columns:")
print(bureau_df.columns.tolist())

Application Data Columns:
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS

In [36]:
# app_df.head()
app_df['AMT_INCOME_TOTAL'].max()

117000000.0

In [37]:
app_df['NAME_INCOME_TYPE'].value_counts()

NAME_INCOME_TYPE
Working                 158774
Commercial associate     71617
Pensioner                55362
State servant            21703
Unemployed                  22
Student                     18
Businessman                 10
Maternity leave              5
Name: count, dtype: int64

In [38]:
# Boolean masks summed directly; each result is a plain count
age_count = app_df['DAYS_BIRTH'].between(-16425, -9125, inclusive='right').sum()  # ages 25 up to (not including) 45
income_count = app_df['AMT_INCOME_TOTAL'].between(36000, 72000).sum()
employment_count = (app_df['DAYS_EMPLOYED'] <= -365).sum()  # employed for at least one year
contract_count = (app_df['NAME_CONTRACT_TYPE'] == 'Cash loans').sum()

print(f"Count of applicants age: {age_count}")
print(f"Count of applicants income: {income_count}")
print(f"Count of applicants employment: {employment_count}")
print(f"Count of applicants contract: {contract_count}")


Count of applicants age: 156565
Count of applicants income: 23229
Count of applicants employment: 224233
Count of applicants contract: 278232
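
The day thresholds used in this notebook (-9125, -16425, -20075) correspond to 25, 45 and 55 years at 365 days per year, with the sign flipped because `DAYS_BIRTH` is stored as a negative day count. A small sketch of that arithmetic on invented values; `years_to_days` is a hypothetical helper, and the 365-day year is the same simplification the notebook uses:

```python
import pandas as pd

def years_to_days(years: int) -> int:
    """Convert an age in years to the negative day count used by DAYS_BIRTH."""
    return -years * 365

# Invented sample values around the band edges
toy = pd.DataFrame({"DAYS_BIRTH": [-9125, -9124, -12000, -16425, -16426]})

# Age in years: flip the sign before dividing
toy["AGE_YEARS"] = -toy["DAYS_BIRTH"] / 365

# Same band as the age filter above: 25 <= age < 45
lower, upper = years_to_days(45), years_to_days(25)
in_band = toy["DAYS_BIRTH"].between(lower, upper, inclusive="right")
age_count = int(in_band.sum())

print(lower, upper, years_to_days(55))
print(toy.assign(IN_BAND=in_band))
```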


In [39]:
import pandas as pd
import numpy as np

# Load your datasets (assuming you already have them loaded)
# app_df = pd.read_csv('IS453 Group Assignment - Application Data.csv')
# bureau_df = pd.read_csv('IS453 Group Assignment - Bureau Data.csv')

print("="*80)
print("DATA QUALITY CHECK - APPLICATION DATA")
print("="*80)

# 1. Check for missing values in KEY columns needed for filtering
print("\n1. MISSING VALUES IN KEY FILTERING COLUMNS:")
key_columns = ['SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'AMT_INCOME_TOTAL', 
               'NAME_INCOME_TYPE', 'NAME_CONTRACT_TYPE', 'TARGET']

missing_summary = {}
for col in key_columns:
    if col in app_df.columns:
        missing_count = app_df[col].isna().sum()
        missing_pct = (missing_count / len(app_df)) * 100
        missing_summary[col] = {
            'Missing Count': missing_count,
            'Missing %': f"{missing_pct:.2f}%"
        }
        print(f"  {col}: {missing_count} missing ({missing_pct:.2f}%)")
    else:
        print(f"  {col}: COLUMN NOT FOUND")

# 2. Check for duplicates in SK_ID_CURR
print("\n2. DUPLICATE CHECK:")
duplicates = app_df['SK_ID_CURR'].duplicated().sum()
print(f"  Duplicate SK_ID_CURR in app_df: {duplicates}")

# 3. Check for invalid/outlier values in filtering columns
print("\n3. OUTLIER/INVALID VALUE CHECK:")

# Check DAYS_BIRTH (should be negative, typically between -7300 and -36500)
print("\n  DAYS_BIRTH (Age):")
print(f"    Min: {app_df['DAYS_BIRTH'].min()} (Age: {abs(app_df['DAYS_BIRTH'].min())/365:.1f} years)")
print(f"    Max: {app_df['DAYS_BIRTH'].max()} (Age: {abs(app_df['DAYS_BIRTH'].max())/365:.1f} years)")
invalid_birth = app_df[(app_df['DAYS_BIRTH'] > 0) | (app_df['DAYS_BIRTH'] < -36500)]
print(f"    Invalid/Extreme values: {len(invalid_birth)}")

# Check DAYS_EMPLOYED
print("\n  DAYS_EMPLOYED:")
print(f"    Min: {app_df['DAYS_EMPLOYED'].min()}")
print(f"    Max: {app_df['DAYS_EMPLOYED'].max()}")
# Check for positive values (invalid) or extremely high values
invalid_employed = app_df[(app_df['DAYS_EMPLOYED'] > 0) | (app_df['DAYS_EMPLOYED'] < -18250)]
print(f"    Suspicious values: {len(invalid_employed)}")
# Check for the common placeholder value 365243
placeholder_employed = (app_df['DAYS_EMPLOYED'] == 365243).sum()
print(f"    Placeholder values (365243): {placeholder_employed}")

# Check AMT_INCOME_TOTAL
print("\n  AMT_INCOME_TOTAL:")
print(f"    Min: {app_df['AMT_INCOME_TOTAL'].min()}")
print(f"    Max: {app_df['AMT_INCOME_TOTAL'].max()}")
print(f"    Mean: {app_df['AMT_INCOME_TOTAL'].mean():.2f}")
print(f"    Median: {app_df['AMT_INCOME_TOTAL'].median():.2f}")
# Check for zero or negative income
invalid_income = app_df[app_df['AMT_INCOME_TOTAL'] <= 0]
print(f"    Zero/Negative income: {len(invalid_income)}")

# Check TARGET variable
print("\n  TARGET (Default Indicator):")
print(f"    Value counts:\n{app_df['TARGET'].value_counts()}")
print(f"    Default rate: {app_df['TARGET'].mean()*100:.2f}%")

print("\n" + "="*80)
print("DATA QUALITY CHECK - BUREAU DATA")
print("="*80)

# 4. Check bureau data
print("\n4. BUREAU DATA STRUCTURE:")
print(f"  Total records: {len(bureau_df)}")
print(f"  Unique customers: {bureau_df['SK_ID_CURR'].nunique()}")
print(f"  Average records per customer: {len(bureau_df)/bureau_df['SK_ID_CURR'].nunique():.2f}")

# Check for missing SK_ID_CURR
missing_ids = bureau_df['SK_ID_CURR'].isna().sum()
print(f"  Missing SK_ID_CURR: {missing_ids}")

# Check key bureau columns for missing values
print("\n5. MISSING VALUES IN BUREAU KEY COLUMNS:")
bureau_key_cols = ['SK_ID_CURR', 'SK_ID_BUREAU', 'CREDIT_ACTIVE', 'CREDIT_TYPE', 
                   'DAYS_CREDIT', 'AMT_CREDIT_SUM']
for col in bureau_key_cols:
    if col in bureau_df.columns:
        missing = bureau_df[col].isna().sum()
        missing_pct = (missing / len(bureau_df)) * 100
        print(f"  {col}: {missing} ({missing_pct:.2f}%)")

print("\n" + "="*80)
print("RECOMMENDATIONS")
print("="*80)

print("\nBased on the checks above:")
print("\n1. CRITICAL ISSUES TO ADDRESS BEFORE MERGING:")

# Check if any critical columns have issues
critical_issues = []

if missing_summary.get('SK_ID_CURR', {}).get('Missing Count', 0) > 0:
    critical_issues.append("   ⚠ SK_ID_CURR has missing values - these records cannot be merged")

if missing_summary.get('TARGET', {}).get('Missing Count', 0) > 0:
    critical_issues.append("   ⚠ TARGET has missing values - cannot calculate bad rates accurately")

if placeholder_employed > 0:
    critical_issues.append(f"   ⚠ {placeholder_employed} records have placeholder DAYS_EMPLOYED value (365243)")
    critical_issues.append("      Consider: Treat as missing or create separate category")

if len(invalid_income) > 0:
    critical_issues.append(f"   ⚠ {len(invalid_income)} records have zero/negative income")

if len(critical_issues) > 0:
    for issue in critical_issues:
        print(issue)
else:
    print("   ✓ No critical issues found in key merge columns")

print("\n2. RECOMMENDED ACTIONS BEFORE FILTERING:")
print("   • Keep all records for now - filtering will handle data quality")
print("   • Document any known data quality issues")
print("   • Decide on bureau aggregation strategy")

print("\n3. DATA CLEANING TO DO AFTER MERGING:")
print("   • Handle missing values systematically")
print("   • Treat outliers using z-score method or domain knowledge")
print("   • Create engineered features from bureau data")
print("   • Perform feature selection for modeling")

DATA QUALITY CHECK - APPLICATION DATA

1. MISSING VALUES IN KEY FILTERING COLUMNS:
  SK_ID_CURR: 0 missing (0.00%)
  DAYS_BIRTH: 0 missing (0.00%)
  DAYS_EMPLOYED: 0 missing (0.00%)
  AMT_INCOME_TOTAL: 0 missing (0.00%)
  NAME_INCOME_TYPE: 0 missing (0.00%)
  NAME_CONTRACT_TYPE: 0 missing (0.00%)
  TARGET: 0 missing (0.00%)

2. DUPLICATE CHECK:
  Duplicate SK_ID_CURR in app_df: 0

3. OUTLIER/INVALID VALUE CHECK:

  DAYS_BIRTH (Age):
    Min: -25229 (Age: 69.1 years)
    Max: -7489 (Age: 20.5 years)
    Invalid/Extreme values: 0

  DAYS_EMPLOYED:
    Min: -17912
    Max: 365243
    Suspicious values: 55374
    Placeholder values (365243): 55374

  AMT_INCOME_TOTAL:
    Min: 25650.0
    Max: 117000000.0
    Mean: 168797.92
    Median: 147150.00
    Zero/Negative income: 0

  TARGET (Default Indicator):
    Value counts:
TARGET
0    282686
1     24825
Name: count, dtype: int64
    Default rate: 8.07%

DATA QUALITY CHECK - BUREAU DATA

4. BUREAU DATA STRUCTURE:
  Total records: 1716428
  U
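
The check above reports 55,374 rows with `DAYS_EMPLOYED == 365243`, and the recommendations suggest treating them as missing or as a separate category. A minimal sketch of doing both at once on invented values; the flag column name `DAYS_EMPLOYED_IS_PLACEHOLDER` is an assumption, not something defined in the assignment data:

```python
import pandas as pd

# Invented sample values; 365243 is the placeholder reported by the check above
toy = pd.DataFrame({"DAYS_EMPLOYED": [-1200, 365243, -365, 365243]})

PLACEHOLDER = 365243

# Keep the information in a separate flag, then blank out the placeholder
is_placeholder = toy["DAYS_EMPLOYED"].eq(PLACEHOLDER)
toy["DAYS_EMPLOYED_IS_PLACEHOLDER"] = is_placeholder
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].mask(is_placeholder)  # becomes NaN

print(toy)
```

With the placeholder removed, summary statistics such as the maximum reflect real employment durations.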

In [40]:
import pandas as pd
import numpy as np

# Assume you have already loaded:
# app_df = pd.read_csv('IS453 Group Assignment - Application Data.csv')

print("="*80)
print("APPLYING DATA SELECTION CRITERIA FOR LIFELONG LEARNING LOANS")
print("="*80)

# Store original dataset metrics
total_original = len(app_df)
original_bad_count = app_df['TARGET'].sum()
original_bad_rate = app_df['TARGET'].mean() * 100

print(f"\nORIGINAL DATASET:")
print(f"  Total records: {total_original:,}")
print(f"  Goods (TARGET=0): {(app_df['TARGET']==0).sum():,}")
print(f"  Bads (TARGET=1): {original_bad_count:,}")
print(f"  Bad rate: {original_bad_rate:.2f}%")

print("\n" + "="*80)
print("APPLYING 5 FILTERING CRITERIA")
print("="*80)

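# Apply the active criteria together; the employment filter and the lower
# income bound are commented out, so four conditions are in effect
filtered_df = app_df[
    (app_df['DAYS_BIRTH'] >= -20075) & 
    (app_df['DAYS_BIRTH'] <= -9125) &
    # (app_df['AMT_INCOME_TOTAL'] >= 0) & 
    (app_df['AMT_INCOME_TOTAL'] <= 96000) &
    # (app_df['DAYS_EMPLOYED'] <= -365) &
    (app_df['NAME_CONTRACT_TYPE'] == 'Cash loans') &
    # Note: value_counts() above lists this category as 'Maternity leave',
    # so the 'Maternity Leave' entry below matches no rows
    (app_df['NAME_INCOME_TYPE'].isin(['Working', 'Commercial associate', 'State servant','Unemployed','Maternity Leave']))
]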

# Calculate filtered dataset metrics
total_filtered = len(filtered_df)
filtered_bad_count = filtered_df['TARGET'].sum()
filtered_bad_rate = filtered_df['TARGET'].mean() * 100

percentage_retained = (total_filtered / total_original) * 100

print("\nFILTERED DATASET (Target Segment):")
print(f"  Total records: {total_filtered:,}")
print(f"  Goods (TARGET=0): {(filtered_df['TARGET']==0).sum():,}")
print(f"  Bads (TARGET=1): {filtered_bad_count:,}")
print(f"  Bad rate: {filtered_bad_rate:.2f}%")

print("\n" + "="*80)
print("COMPARISON METRICS")
print("="*80)

# Calculate differences
bad_rate_difference = filtered_bad_rate - original_bad_rate
retention_rate = percentage_retained

print(f"\nRETENTION:")
print(f"  Records retained: {total_filtered:,} out of {total_original:,}")
print(f"  Retention rate: {retention_rate:.2f}%")

print(f"\nBAD RATE COMPARISON:")
print(f"  Original bad rate: {original_bad_rate:.2f}%")
print(f"  Filtered bad rate: {filtered_bad_rate:.2f}%")
print(f"  Difference: {bad_rate_difference:+.2f} percentage points")

if bad_rate_difference < 0:
    print(f"  → Target segment is LOWER RISK (better than average)")
elif bad_rate_difference > 0:
    print(f"  → Target segment is HIGHER RISK (worse than average)")
else:
    print(f"  → Target segment has SIMILAR RISK to overall portfolio")

print("\n" + "="*80)
print("BREAKDOWN BY INDIVIDUAL CRITERIA")
print("="*80)

# Show impact of each filter
print("\nHow each criterion narrows the dataset:")

criteria = [
    ('1. Age (25-55 years)', 
     app_df[(app_df['DAYS_BIRTH'] >= -20075) & (app_df['DAYS_BIRTH'] <= -9125)]),
    # ('2. Employment (1+ years)', 
    #  app_df[app_df['DAYS_EMPLOYED'] <= -365]),
    # ('3. Income (36K-72K)', 
    #  app_df[(app_df['AMT_INCOME_TOTAL'] >= 36000) & (app_df['AMT_INCOME_TOTAL'] <= 72000)]),
    ('4. Income Type (Working/Commercial/State)', 
     app_df[app_df['NAME_INCOME_TYPE'].isin(['Working', 'Commercial associate', 'State servant','Unemployed', 'Maternity Leave'])]),
    ('5. Contract Type (Cash loans)', 
     app_df[app_df['NAME_CONTRACT_TYPE'] == 'Cash loans'])
]

for name, subset in criteria:
    count = len(subset)
    pct = (count / total_original) * 100
    bad_rate = subset['TARGET'].mean() * 100
    print(f"\n{name}")
    print(f"  Records: {count:,} ({pct:.2f}%)")
    print(f"  Bad rate: {bad_rate:.2f}%")


print(f"""
DATA SELECTION RESULTS:

1. ORIGINAL DATASET SIZE: {total_original:,} applicants

2. FILTERED DATASET SIZE: {total_filtered:,} applicants
   - Retention rate: {retention_rate:.2f}%

3. BAD RATE COMPARISON:
   - Original dataset: {original_bad_rate:.2f}%
   - Target segment: {filtered_bad_rate:.2f}%
   - Difference: {bad_rate_difference:+.2f} percentage points

4. INTERPRETATION:
   The target segment for lifelong learning loans (cash-loan applicants
   aged 25-55 with income up to 96,000 and a qualifying income type)
   represents {retention_rate:.2f}% of the original dataset and shows a 
   {"LOWER" if bad_rate_difference < 0 else "HIGHER" if bad_rate_difference > 0 else "SIMILAR"} 
   default risk compared to the overall portfolio.
""")




APPLYING DATA SELECTION CRITERIA FOR LIFELONG LEARNING LOANS

ORIGINAL DATASET:
  Total records: 307,511
  Goods (TARGET=0): 282,686
  Bads (TARGET=1): 24,825
  Bad rate: 8.07%

APPLYING 5 FILTERING CRITERIA

FILTERED DATASET (Target Segment):
  Total records: 31,125
  Goods (TARGET=0): 28,016
  Bads (TARGET=1): 3,109
  Bad rate: 9.99%

COMPARISON METRICS

RETENTION:
  Records retained: 31,125 out of 307,511
  Retention rate: 10.12%

BAD RATE COMPARISON:
  Original bad rate: 8.07%
  Filtered bad rate: 9.99%
  Difference: +1.92 percentage points
  → Target segment is HIGHER RISK (worse than average)

BREAKDOWN BY INDIVIDUAL CRITERIA

How each criterion narrows the dataset:

1. Age (25-55 years)
  Records: 226,662 (73.71%)
  Bad rate: 8.71%

4. Income Type (Working/Commercial/State)
  Records: 252,116 (81.99%)
  Bad rate: 8.66%

5. Contract Type (Cash loans)
  Records: 278,232 (90.48%)
  Bad rate: 8.35%

DATA SELECTION RESULTS:

1. ORIGINAL DATASET SIZE: 307,511 applicants

2. FILTERED D
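
The breakdown above evaluates each criterion on the full dataset, so it does not show how the segment shrinks as the filters are combined. A cumulative funnel is one way to see that. The sketch below uses an invented toy frame and only three of the criteria; the rows and the `'Revolving loans'` value are assumptions for illustration:

```python
import pandas as pd

# Invented toy rows (not assignment data)
toy = pd.DataFrame({
    "DAYS_BIRTH": [-10000, -10000, -10000, -8000, -22000, -12000],
    "AMT_INCOME_TOTAL": [50000, 120000, 60000, 50000, 50000, 90000],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans",
                           "Cash loans", "Cash loans", "Cash loans"],
    "TARGET": [0, 1, 0, 1, 0, 1],
})

steps = [
    ("Age 25-55", toy["DAYS_BIRTH"].between(-20075, -9125)),
    ("Income <= 96,000", toy["AMT_INCOME_TOTAL"] <= 96000),
    ("Cash loans", toy["NAME_CONTRACT_TYPE"] == "Cash loans"),
]

# Apply the conditions one after another and record what is left each time
mask = pd.Series(True, index=toy.index)
funnel = []
for label, condition in steps:
    mask = mask & condition
    funnel.append({
        "step": label,
        "remaining": int(mask.sum()),
        "bad_rate_pct": round(toy.loc[mask, "TARGET"].mean() * 100, 2),
    })

funnel_df = pd.DataFrame(funnel)
print(funnel_df)
```

The last row of `funnel_df` matches what a single combined filter would return, while the earlier rows show which criterion removes the most records.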

## Flattening Bureau Data


In [50]:


import os

output_file = "IS453_Group_Assignment_Bureau_Flattened.csv"

# === EARLY EXIT GUARD ===
if os.path.exists(output_file):
    print("=" * 80)
    print(f"⚙️  Skipping flattening pipeline: '{output_file}' already exists.")
    print("If you want to re-run the flattening process, delete the existing file first.")
    print("=" * 80)
    # Reload the cached result so bureau_flat is defined for the merge step
    bureau_flat = pd.read_csv(output_file)
    
    
else:
    print("="*80)
    print("PHASE 1: LOAD AND VALIDATE BUREAU DATA")
    print("="*80)

    # Bureau data should already be loaded as bureau_df
    print(f"\nBureau Data loaded:")
    print(f"  Shape: {bureau_df.shape}")
    print(f"  Unique customers: {bureau_df['SK_ID_CURR'].nunique()}")
    print(f"  Average records per customer: {len(bureau_df)/bureau_df['SK_ID_CURR'].nunique():.2f}")

    # Verify key columns exist
    required_cols = ['SK_ID_CURR', 'CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE',
                    'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT',
                    'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG', 'AMT_CREDIT_SUM', 
                    'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE',
                    'DAYS_CREDIT_UPDATE', 'AMT_ANNUITY']

    missing_cols = [col for col in required_cols if col not in bureau_df.columns]
    if missing_cols:
        print(f"  ⚠ Missing columns: {missing_cols}")
    else:
        print(f"  ✓ All required columns present")

    print("\n" + "="*80)
    print("PHASE 2: AGGREGATE NUMERICAL COLUMNS (12 columns × 4 functions = 48 features)")
    print("="*80)

    # Define numerical columns to aggregate
    numerical_cols = [
        'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT',
        'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG', 'AMT_CREDIT_SUM', 
        'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE',
        'DAYS_CREDIT_UPDATE', 'AMT_ANNUITY'
    ]

    # Group by SK_ID_CURR and aggregate
    print("\nAggregating numerical columns...")
    numerical_agg = bureau_df.groupby('SK_ID_CURR')[numerical_cols].agg(['min', 'max', 'mean', 'sum'])

    # Flatten the multi-level column names
    numerical_agg.columns = [f"{col}_{func}".upper() for col, func in numerical_agg.columns]

    print(f"  ✓ Created {len(numerical_agg.columns)} numerical aggregate columns")
    print(f"  ✓ Result shape: {numerical_agg.shape}")
    print(f"\n  Sample numerical columns:")
    print(f"    {list(numerical_agg.columns[:8])}")

    print("\n" + "="*80)
    print("PHASE 3: AGGREGATE CATEGORICAL COLUMNS (3 columns → counts)")
    print("="*80)

    # Define categorical columns
    categorical_cols = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']

    # Create categorical aggregations
    categorical_agg_dict = {}

    for col in categorical_cols:
        print(f"\nProcessing {col}:")
        
        # Get unique values for this column
        unique_values = bureau_df[col].unique()
        print(f"  Unique values: {len(unique_values)}")
        print(f"  Examples: {list(unique_values[:5])}")
        
        # Count occurrences of each category for each customer
        for category in unique_values:
            col_name = f"COUNT_{col}_{str(category).upper().replace(' ', '_')}"
            categorical_agg_dict[col_name] = (
                bureau_df.groupby('SK_ID_CURR')[col].apply(lambda x: (x == category).sum())
            )

    # Create dataframe from categorical aggregations
    categorical_agg = pd.DataFrame(categorical_agg_dict)

    print(f"\n✓ Created {len(categorical_agg.columns)} categorical count columns")
    print(f"✓ Result shape: {categorical_agg.shape}")
    print(f"\nSample categorical columns (first 10):")
    print(f"  {list(categorical_agg.columns[:10])}")

    print("\n" + "="*80)
    print("PHASE 4: COMBINE ALL AGGREGATIONS")
    print("="*80)

    # Reset index to make SK_ID_CURR a column
    numerical_agg_reset = numerical_agg.reset_index()
    categorical_agg_reset = categorical_agg.reset_index()

    # Merge numerical and categorical aggregations
    bureau_flat = numerical_agg_reset.merge(categorical_agg_reset, on='SK_ID_CURR', how='inner')

    print(f"\n✓ Merged datasets:")
    print(f"  Rows: {len(bureau_flat)}")
    print(f"  Columns: {len(bureau_flat.columns)}")
    print(f"    - 1 SK_ID_CURR (identifier)")
    print(f"    - {len(numerical_agg.columns)} numerical aggregates")
    print(f"    - {len(categorical_agg.columns)} categorical counts")
    print(f"  Total: {1 + len(numerical_agg.columns) + len(categorical_agg.columns)}")

    print(f"\n✓ Final flattened dataset shape: {bureau_flat.shape}")

    # Show first few rows
    print("\nFirst 3 rows of flattened data (sample columns):")
    sample_cols = ['SK_ID_CURR', 'DAYS_CREDIT_MIN', 'DAYS_CREDIT_MAX', 'DAYS_CREDIT_MEAN', 
                'AMT_CREDIT_SUM_SUM', 'CNT_CREDIT_PROLONG_MAX', 'COUNT_CREDIT_ACTIVE_ACTIVE', 
                'COUNT_CREDIT_TYPE_CONSUMER_CREDIT']
    sample_cols_present = [col for col in sample_cols if col in bureau_flat.columns]
    print(bureau_flat[sample_cols_present].head(3).to_string())

    print("\n" + "="*80)
    print("PHASE 5: VALIDATE FLATTENED OUTPUT")
    print("="*80)

    # Validation checks
    print("\n1. Uniqueness Check:")
    print(f"   Unique SK_ID_CURR: {bureau_flat['SK_ID_CURR'].nunique()}")
    print(f"   Total rows: {len(bureau_flat)}")
    print(f"   ✓ No duplicates" if bureau_flat['SK_ID_CURR'].nunique() == len(bureau_flat) else "   ⚠ Duplicates found!")

    print("\n2. Column Naming Check:")
    numerical_cols_count = sum(1 for col in bureau_flat.columns if '_MIN' in col or '_MAX' in col or '_MEAN' in col or '_SUM' in col)
    count_cols_count = sum(1 for col in bureau_flat.columns if col.startswith('COUNT_'))
    print(f"   Numerical aggregate columns (with _MIN/_MAX/_MEAN/_SUM): {numerical_cols_count}")
    print(f"   Categorical count columns (COUNT_*): {count_cols_count}")
    print(f"   Total: {len(bureau_flat.columns)}")

    print("\n3. Missing Values Check (categorical counts):")
    count_cols = [col for col in bureau_flat.columns if col.startswith('COUNT_')]
    nan_in_counts = bureau_flat[count_cols].isna().sum().sum()
    print(f"   NaN values in COUNT_* columns: {nan_in_counts}")
    print(f"   ✓ No NaN in categorical counts" if nan_in_counts == 0 else "   ⚠ Found NaN values!")

    print("\n4. Data Value Check:")
    print(f"   Min value across all aggregates: {bureau_flat.iloc[:, 1:].min().min():.2f}")
    print(f"   Max value across all aggregates: {bureau_flat.iloc[:, 1:].max().max():.2f}")
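    # DAYS_* aggregates are negative in this data, so only the dtype is checked here
    non_numeric_cols = bureau_flat.iloc[:, 1:].select_dtypes(exclude='number').columns
    print("   ✓ All aggregate columns are numeric" if len(non_numeric_cols) == 0 else f"   ⚠ Non-numeric columns: {list(non_numeric_cols)}")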

    print("\n" + "="*80)
    print("PHASE 6: EXPORT TO CSV")
    print("="*80)

    # Export to CSV
    output_file = 'IS453_Group_Assignment_Bureau_Flattened.csv'
    bureau_flat.to_csv(output_file, index=False)

    print(f"\n✓ Exported flattened Bureau data to: {output_file}")
    print(f"  File size: {os.path.getsize(output_file) / (1024**2):.2f} MB")

    print("\n" + "="*80)
    print("SUMMARY STATISTICS")
    print("="*80)

    print("\n📊 Dataset Transformation:")
    print(f"  Original Bureau Data:   1,716,428 rows × 17 columns")
    print(f"  Flattened Bureau Data:  {len(bureau_flat):,} rows × {len(bureau_flat.columns)} columns")
    print(f"  Compression ratio:      {100 * len(bureau_flat) / 1716428:.2f}%")

    print("\n📈 Feature Breakdown:")
    print(f"  Identifier column:      1 (SK_ID_CURR)")
    print(f"  Numerical aggregates:   {numerical_cols_count} (MIN, MAX, MEAN, SUM for 12 columns)")
    print(f"  Categorical counts:     {count_cols_count} (per-category record counts)")
    print(f"  Total output columns:   {len(bureau_flat.columns)}")

    print("\n✓ Bureau data flattening complete!")

⚙️  Skipping flattening pipeline: 'IS453_Group_Assignment_Bureau_Flattened.csv' already exists.
If you want to re-run the flattening process, delete the existing file first.
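
The categorical phase of the pipeline runs one `groupby(...).apply(...)` per category value, which can be slow on 1.7 million rows. As an alternative sketch (not the approach used in the cell above), `pd.crosstab` produces the same kind of per-customer counts in one call per column. The toy frame is invented, and note that `crosstab` drops rows where the category is missing, whereas the loop above would create a column for a missing-value category.

```python
import pandas as pd

# Invented toy bureau records
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "CREDIT_ACTIVE": ["Closed", "Active", "Active", "Closed"],
    "CREDIT_TYPE": ["Consumer credit", "Credit card", "Consumer credit", "Credit card"],
})

parts = []
for col in ["CREDIT_ACTIVE", "CREDIT_TYPE"]:
    # Rows: customers, columns: category values, cells: record counts
    counts = pd.crosstab(toy["SK_ID_CURR"], toy[col])
    # Same naming scheme as the pipeline: COUNT_<COLUMN>_<CATEGORY>
    counts.columns = [
        f"COUNT_{col}_{str(cat).upper().replace(' ', '_')}" for cat in counts.columns
    ]
    parts.append(counts)

categorical_counts = pd.concat(parts, axis=1).reset_index()
print(categorical_counts)
```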


## Merge Application Data with Flattened Bureau Data

This section merges the application data with the flattened bureau data using a left join, flags applicants who have no bureau records, and fills the resulting missing values.
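
Before the full cell, here is a compact sketch of the same idea on invented toy frames. It differs from the cell below in two labelled ways: the history flag comes from the merge indicator rather than from a NaN check on `DAYS_CREDIT_MIN`, and only the `COUNT_*` columns are filled with 0, leaving numeric aggregates as NaN. Both are design choices for illustration, and `bureau_flat_toy` is a hypothetical stand-in for `bureau_flat`.

```python
import pandas as pd

# Invented toy frames
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_flat_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "DAYS_CREDIT_MIN": [-497.0, -203.0],
    "AMT_ANNUITY_MEAN": [None, 1500.0],  # missing even though customer 1 has records
    "COUNT_CREDIT_ACTIVE_ACTIVE": [2, 1],
})

# validate guards against accidental row duplication; indicator records the match
merged = apps.merge(
    bureau_flat_toy, on="SK_ID_CURR", how="left",
    validate="one_to_one", indicator=True,
)

# Flag is based on whether a bureau row matched, not on any single column
merged["HAS_CREDIT_HISTORY"] = merged["_merge"].eq("both")
merged = merged.drop(columns="_merge")

# A count of 0 is meaningful for applicants without bureau records
count_cols = [col for col in merged.columns if col.startswith("COUNT_")]
merged[count_cols] = merged[count_cols].fillna(0)

print(merged)
```

Keeping NaN in the numeric aggregates avoids reading a filled 0 in a `DAYS_*` column as "credit opened on the application date"; whether that trade-off suits the scorecard is a modelling decision.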

In [52]:
print("="*80)
print("LEFT JOIN: Application Data (LEFT) ← Bureau Data (RIGHT)")
print("="*80)

print("\nDataset Configuration:")
print("  LEFT table (app_df):      307,511 rows × 120 columns")
print("  RIGHT table (bureau_flat): 305,811 rows × 72 columns")
print("  Join key: SK_ID_CURR")
print("  Join type: LEFT JOIN")

print("\nWhat a LEFT JOIN does:")
print("  ✓ Keeps ALL rows from the LEFT table (app_df)")
print("  ✓ Adds matching data from the RIGHT table (bureau_flat)")
print("  ✓ For non-matching rows: fills with NaN")

print("\n" + "="*80)
print("EXECUTING LEFT JOIN")
print("="*80)

# LEFT JOIN: Application Data on LEFT, Bureau Data on RIGHT
# This keeps ALL applications and adds bureau data where available
merged_df = app_df.merge(bureau_flat, on='SK_ID_CURR', how='left')

print(f"\n✓ Merge complete!")
print(f"\nResult Dataset:")
print(f"  Rows: {len(merged_df):,}")
print(f"  Columns: {len(merged_df.columns)}")
print(f"\nColumn breakdown:")
print(f"  Application columns: {len(app_df.columns)}")
print(f"  Bureau columns (from right): {len(bureau_flat.columns) - 1} (SK_ID_CURR not duplicated)")
print(f"  Total: {len(merged_df.columns)}")

print("\n" + "="*80)
print("MERGE RESULT ANALYSIS")
print("="*80)

# Check how many applicants have matching bureau data
applicants_with_bureau = merged_df['SK_ID_CURR'].isin(bureau_flat['SK_ID_CURR']).sum()
applicants_without_bureau = len(merged_df) - applicants_with_bureau

print(f"\nApplicants with Bureau data:")
print(f"  Count: {applicants_with_bureau:,}")
print(f"  Percentage: {100 * applicants_with_bureau / len(merged_df):.2f}%")

print(f"\nApplicants WITHOUT Bureau data (new-to-credit):")
print(f"  Count: {applicants_without_bureau:,}")
print(f"  Percentage: {100 * applicants_without_bureau / len(merged_df):.2f}%")
print(f"  Status: Will have NaN values in all Bureau columns")

print("\n" + "="*80)
print("IDENTIFYING BUREAU COLUMNS")
print("="*80)

# Identify all bureau-derived columns directly from bureau_flat
# (more robust than suffix matching, which could also catch application columns)
bureau_columns = [col for col in bureau_flat.columns if col != 'SK_ID_CURR']

print(f"\nTotal bureau-derived columns: {len(bureau_columns)}")
print(f"\nColumn types:")
numerical_agg_cols = [col for col in bureau_columns if any(s in col for s in ['_MIN', '_MAX', '_MEAN', '_SUM'])]
categorical_count_cols = [col for col in bureau_columns if col.startswith('COUNT_')]
print(f"  Numerical aggregates (_MIN/_MAX/_MEAN/_SUM): {len(numerical_agg_cols)}")
print(f"  Categorical counts (COUNT_*): {len(categorical_count_cols)}")

print(f"\nFirst 10 bureau columns:")
for col in bureau_columns[:10]:
    print(f"  - {col}")

print("\n" + "="*80)
print("CHECKING FOR MISSING VALUES")
print("="*80)

# Check NaN values in bureau columns
print(f"\nMissing values in bureau columns:")
nan_counts = merged_df[bureau_columns].isna().sum()
cols_with_nan = nan_counts[nan_counts > 0]

if len(cols_with_nan) > 0:
    print(f"  Found NaN values in {len(cols_with_nan)} bureau columns")
    print(f"  Expected: {applicants_without_bureau:,} NaN per column (new-to-credit applicants)")
    print(f"\n  Sample of columns with NaN:")
    for col in list(cols_with_nan.head(5).index):
        print(f"    - {col}: {cols_with_nan[col]:,} NaN")
else:
    print(f"  No NaN values found (unexpected!)")

print("\n" + "="*80)
print("STEP 1: CREATE CREDIT HISTORY FLAG")
print("="*80)

# Create a flag for customers with credit history
# Use DAYS_CREDIT_MIN as indicator (it will be NaN only for those without bureau data)
merged_df['HAS_CREDIT_HISTORY'] = ~merged_df['DAYS_CREDIT_MIN'].isna()

has_history_count = merged_df['HAS_CREDIT_HISTORY'].sum()
no_history_count = (~merged_df['HAS_CREDIT_HISTORY']).sum()

print(f"\nCreated new column: HAS_CREDIT_HISTORY")
print(f"  Type: Boolean (1 = has credit history, 0 = new-to-credit)")
print(f"\nDistribution:")
print(f"  Has credit history (1): {has_history_count:,} ({100*has_history_count/len(merged_df):.2f}%)")
print(f"  No credit history (0):  {no_history_count:,} ({100*no_history_count/len(merged_df):.2f}%)")

print("\n" + "="*80)
print("STEP 2: FILL MISSING VALUES IN BUREAU COLUMNS")
print("="*80)

print(f"\nFilling strategy:")
print(f"  All {len(bureau_columns)} bureau columns: Fill NaN with 0")
print(f"  Rationale:")
print(f"    - COUNT_* columns: 0 means 'no accounts of this type'")
print(f"    - *_MIN/*_MAX/*_MEAN/*_SUM: 0 is a stand-in for 'no credit activity'")
print(f"    - HAS_CREDIT_HISTORY separates new-to-credit applicants from genuine zero values")

# Fill missing values with 0
merged_df[bureau_columns] = merged_df[bureau_columns].fillna(0)

print(f"\n✓ Filled {len(bureau_columns)} columns with 0")
print(f"✓ Total NaN values remaining: {merged_df.isna().sum().sum()}")

print("\n" + "="*80)
print("FINAL MERGED DATASET SUMMARY")
print("="*80)

print(f"\n✓ Merge successful!")
print(f"\nFinal Dataset Specifications:")
print(f"  Rows: {len(merged_df):,}")
print(f"  Columns: {len(merged_df.columns)}")
print(f"  Data types: {merged_df.dtypes.nunique()} different types")

print(f"\nColumn inventory:")
print(f"  Original Application columns: {len(app_df.columns)}")
print(f"  Bureau aggregate columns: {len(bureau_columns)}")
print(f"  New derived column (HAS_CREDIT_HISTORY): 1")
print(f"  Total: {len(merged_df.columns)}")

print(f"\nKey statistics:")
print(f"  All applicants retained: {len(merged_df) == len(app_df)}")
print(f"  No duplicate SK_ID_CURR: {merged_df['SK_ID_CURR'].nunique() == len(merged_df)}")
print(f"  No NaN values in bureau columns: {merged_df[bureau_columns].isna().sum().sum() == 0}")
print(f"  Default rate (TARGET): {merged_df['TARGET'].mean()*100:.2f}%")

print("\n" + "="*80)
print("SAMPLE DATA: First 3 Applicants")
print("="*80)

# Show sample data with key columns
sample_cols = ['SK_ID_CURR', 'TARGET', 'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 
               'HAS_CREDIT_HISTORY', 'DAYS_CREDIT_MIN', 'COUNT_CREDIT_ACTIVE_ACTIVE', 
               'AMT_CREDIT_SUM_SUM']
sample_cols_present = [col for col in sample_cols if col in merged_df.columns]

print("\nApplication + Bureau Data (selected columns):")
print(merged_df[sample_cols_present].head(3).to_string())

print("\n" + "="*80)
print("✓ LEFT JOIN COMPLETE - READY FOR ANALYSIS")
print("="*80)

print(f"\nMerged dataset (merged_df) is ready for:")
print(f"  1. Filtering for Lifelong Learning segment")
print(f"  2. Feature engineering and analysis")
print(f"  3. Exploratory Data Analysis (EDA)")
print(f"  4. Credit scorecard model development")

LEFT JOIN: Application Data (LEFT) ← Bureau Data (RIGHT)

Dataset Configuration:
  LEFT table (app_df):      307,511 rows × 120 columns
  RIGHT table (bureau_flat): 305,811 rows × 72 columns
  Join key: SK_ID_CURR
  Join type: LEFT JOIN

What a LEFT JOIN does:
  ✓ Keeps ALL rows from the LEFT table (app_df)
  ✓ Adds matching data from the RIGHT table (bureau_flat)
  ✓ For non-matching rows: fills with NaN

EXECUTING LEFT JOIN

✓ Merge complete!

Result Dataset:
  Rows: 307,511
  Columns: 191

Column breakdown:
  Application columns: 120
  Bureau columns (from right): 71 (SK_ID_CURR not duplicated)
  Total: 191

MERGE RESULT ANALYSIS

Applicants with Bureau data:
  Count: 263,491
  Percentage: 85.69%

Applicants WITHOUT Bureau data (new-to-credit):
  Count: 44,020
  Percentage: 14.31%
  Status: Will have NaN values in all Bureau columns

IDENTIFYING BUREAU COLUMNS

Total bureau-derived columns: 71

Column types:
  Numerical aggregates (_MIN/_MAX/_MEAN/_SUM): 48
  Categorical counts (COU
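
The merge cell above performs three steps: a left join, a credit-history flag, and a zero fill. The sketch below replays them on toy data to make the order of operations explicit. The frames `app` and `bureau` and their values are hypothetical stand-ins for `app_df` and `bureau_flat`, and the `validate` and `indicator` arguments are optional safeguards that the notebook itself does not necessarily use.

```python
import pandas as pd

# Toy stand-ins for app_df and bureau_flat (hypothetical values)
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 3, 4],  # ID 4 has no application row; ID 2 has no bureau row
    "DAYS_CREDIT_MIN": [-1200.0, -30.0, -500.0],
    "COUNT_CREDIT_ACTIVE_ACTIVE": [2, 0, 1],
})

# validate= raises MergeError if either side has duplicate keys;
# indicator= adds a _merge column recording where each row matched.
merged = app.merge(bureau, on="SK_ID_CURR", how="left",
                   validate="one_to_one", indicator=True)

# Set the flag BEFORE filling, so filled zeros stay distinguishable
# from genuine zeros (applicant 3 really has 0 active accounts).
merged["HAS_CREDIT_HISTORY"] = merged["_merge"] == "both"

bureau_cols = [c for c in bureau.columns if c != "SK_ID_CURR"]
merged[bureau_cols] = merged[bureau_cols].fillna(0)
merged = merged.drop(columns="_merge")

print(merged)
```

Applicant 2 keeps its row but gets zeros in the bureau columns and `HAS_CREDIT_HISTORY = False`, while bureau ID 4 is dropped because it has no match in the left table.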

In [53]:
import os
from datetime import datetime

print("\n" + "="*80)
print("EXPORTING MERGED DATASET TO CSV")
print("="*80)

output_file = 'IS453_Group_Assignment_Merged_Data.csv'

# Guard: Check if file already exists
if os.path.exists(output_file):
    print(f"\n⚠ WARNING: File '{output_file}' already exists!")
    print(f"\nOptions:")
    print(f"  1. Skip export (SAFE - default)")
    print(f"  2. Overwrite (uncomment line below and re-run)")
    print(f"  3. Create timestamped version (uncomment alternative below)")
    print(f"\nTo OVERWRITE, uncomment the line below:")
    print(f"  # merged_df.to_csv(output_file, index=False)")
    print(f"\nTo CREATE TIMESTAMPED FILE, uncomment:")
    print(f"  # timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')")
    print(f"  # output_file = f'IS453_Group_Assignment_Merged_Data_{timestamp}.csv'")
    print(f"  # merged_df.to_csv(output_file, index=False)")
    print(f"\n✓ Export skipped to prevent accidental overwrite")
    
else:
    # File doesn't exist, safe to export
    merged_df.to_csv(output_file, index=False)
    
    # memory_usage() reports the DataFrame's in-memory footprint, not the CSV size on disk
    memory_mb = merged_df.memory_usage(deep=True).sum() / (1024**2)
    
    # Derive the coverage figures from the data instead of hardcoding them
    n_with_bureau = int(merged_df['HAS_CREDIT_HISTORY'].sum())
    n_new_to_credit = len(merged_df) - n_with_bureau
    
    print(f"\n✓ Exported merged dataset to: {output_file}")
    print(f"\nDataset Details:")
    print(f"  Rows: {len(merged_df):,}")
    print(f"  Columns: {len(merged_df.columns)}")
    print(f"  In-memory size: ~{memory_mb:.2f} MB")
    print(f"\nColumn Summary:")
    print(f"  Application columns: {len(app_df.columns)}")
    print(f"  Bureau columns: {len(bureau_columns)} (from flattened data)")
    print(f"  Derived columns: 1 (HAS_CREDIT_HISTORY)")
    print(f"  Total columns: {len(merged_df.columns)}")
    print(f"\nData Coverage:")
    print(f"  Applicants with bureau data: {n_with_bureau:,} ({100*n_with_bureau/len(merged_df):.1f}%)")
    print(f"  New-to-credit applicants: {n_new_to_credit:,} ({100*n_new_to_credit/len(merged_df):.1f}%)")
    print(f"  All applicants retained: YES ✓")
    print(f"\n✓ File saved successfully in current working directory!")


EXPORTING MERGED DATASET TO CSV

✓ Exported merged dataset to: IS453_Group_Assignment_Merged_Data.csv

Dataset Details:
  Rows: 307,511
  Columns: 192
  In-memory size: ~698.87 MB

Column Summary:
  Application columns: 120
  Bureau columns: 71 (from flattened data)
  Derived columns: 1 (HAS_CREDIT_HISTORY)
  Total columns: 192

Data Coverage:
  Applicants with bureau data: 263,491 (85.7%)
  New-to-credit applicants: 44,020 (14.3%)
  All applicants retained: YES ✓

✓ File saved successfully in current working directory!
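
The size printed by the export cell is computed with `DataFrame.memory_usage`, which measures the DataFrame's in-memory footprint and can differ noticeably from the size of the CSV that was written. A minimal sketch of how the two figures could be reported separately is shown below; the helper `export_once`, the toy frame, and the temporary path are hypothetical and only illustrate the idea.

```python
import os
import tempfile

import pandas as pd


def export_once(df, path):
    """Write df to path unless the file already exists; return its on-disk size in MB."""
    if os.path.exists(path):
        print(f"Skipped: '{path}' already exists")
    else:
        df.to_csv(path, index=False)
    return os.path.getsize(path) / (1024 ** 2)


# Hypothetical toy frame standing in for merged_df
toy_df = pd.DataFrame({"SK_ID_CURR": range(1000), "TARGET": [0] * 1000})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "toy_export.csv")
    disk_mb = export_once(toy_df, out_path)    # first call writes the file
    second_mb = export_once(toy_df, out_path)  # second call skips the write
    memory_mb = toy_df.memory_usage(deep=True).sum() / (1024 ** 2)

print(f"On disk: {disk_mb:.4f} MB, in memory: {memory_mb:.4f} MB")
```

`os.path.getsize` reads the size from the file system, so it reflects what was actually saved, whereas `memory_usage(deep=True)` depends on dtypes and object overhead.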


In [54]:
merged_df.shape

(307511, 192)
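
The segment filter in the next cell expresses the 25 to 55 age window as `DAYS_BIRTH` bounds of -20075 and -9125. The sketch below shows where those numbers come from, assuming `DAYS_BIRTH` counts days before the application as a negative number and that a year is taken as 365 days (leap days ignored). The helper name and the toy values are hypothetical.

```python
import pandas as pd

DAYS_PER_YEAR = 365  # assumption: the notebook's bounds ignore leap days


def age_window_to_days_birth(min_age, max_age):
    """Return (lower, upper) DAYS_BIRTH bounds for an inclusive age window."""
    # Older applicants have MORE negative DAYS_BIRTH, so max_age gives the lower bound.
    return -max_age * DAYS_PER_YEAR, -min_age * DAYS_PER_YEAR


lower, upper = age_window_to_days_birth(25, 55)
print(lower, upper)  # -20075 -9125

# Toy DAYS_BIRTH values: two outside the window, three inside (bounds inclusive)
toy = pd.DataFrame({"DAYS_BIRTH": [-9000, -9125, -15000, -20075, -21000]})
in_window = toy["DAYS_BIRTH"].between(lower, upper)
print(toy[in_window])
```

`Series.between` is inclusive on both ends by default, which matches the `>=` and `<=` comparisons used in the filter.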

In [55]:
print("="*80)
print("FILTERING MERGED DATA FOR LIFELONG LEARNING LOAN SEGMENT")
print("="*80)

# Store original dataset metrics (using merged_df)
total_original = len(merged_df)
original_bad_count = merged_df['TARGET'].sum()
original_bad_rate = merged_df['TARGET'].mean() * 100

print(f"\nORIGINAL MERGED DATASET:")
print(f"  Total records: {total_original:,}")
print(f"  Columns: {len(merged_df.columns)}")
print(f"  Goods (TARGET=0): {(merged_df['TARGET']==0).sum():,}")
print(f"  Bads (TARGET=1): {original_bad_count:,}")
print(f"  Bad rate: {original_bad_rate:.2f}%")
print(f"\nBureau data coverage:")
print(f"  With bureau history: {merged_df['HAS_CREDIT_HISTORY'].sum():,}")
print(f"  New-to-credit: {(~merged_df['HAS_CREDIT_HISTORY']).sum():,}")

print("\n" + "="*80)
print("APPLYING FILTERING CRITERIA TO MERGED DATA")
print("="*80)

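# Apply all four criteria (five conditions, since age has two bounds) simultaneously to merged_df
# DAYS_BIRTH is negative: -20075 = 55 × 365 days, -9125 = 25 × 365 days
filtered_merged = merged_df[
    (merged_df['DAYS_BIRTH'] >= -20075) & 
    (merged_df['DAYS_BIRTH'] <= -9125) &
    (merged_df['AMT_INCOME_TOTAL'] <= 96000) &
    (merged_df['NAME_CONTRACT_TYPE'] == 'Cash loans') &
    # Category labels are case-sensitive; 'Maternity leave' matches the breakdown list below
    (merged_df['NAME_INCOME_TYPE'].isin(['Working', 'Commercial associate', 'State servant', 'Unemployed', 'Maternity leave']))
]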

# Calculate filtered dataset metrics
total_filtered = len(filtered_merged)
filtered_bad_count = filtered_merged['TARGET'].sum()
filtered_bad_rate = filtered_merged['TARGET'].mean() * 100

percentage_retained = (total_filtered / total_original) * 100

print("\nFILTERED SEGMENT (Lifelong Learning Loans):")
print(f"  Total records: {total_filtered:,}")
print(f"  Columns: {len(filtered_merged.columns)}")
print(f"  Goods (TARGET=0): {(filtered_merged['TARGET']==0).sum():,}")
print(f"  Bads (TARGET=1): {filtered_bad_count:,}")
print(f"  Bad rate: {filtered_bad_rate:.2f}%")

print(f"\nBureau data in filtered segment:")
print(f"  With bureau history: {filtered_merged['HAS_CREDIT_HISTORY'].sum():,} ({100*filtered_merged['HAS_CREDIT_HISTORY'].sum()/len(filtered_merged):.2f}%)")
print(f"  New-to-credit: {(~filtered_merged['HAS_CREDIT_HISTORY']).sum():,} ({100*(~filtered_merged['HAS_CREDIT_HISTORY']).sum()/len(filtered_merged):.2f}%)")

print("\n" + "="*80)
print("COMPARISON METRICS")
print("="*80)

# Calculate differences
bad_rate_difference = filtered_bad_rate - original_bad_rate
retention_rate = percentage_retained

print(f"\nRETENTION:")
print(f"  Records retained: {total_filtered:,} out of {total_original:,}")
print(f"  Retention rate: {retention_rate:.2f}%")

print(f"\nBAD RATE COMPARISON:")
print(f"  Original merged data: {original_bad_rate:.2f}%")
print(f"  Filtered segment: {filtered_bad_rate:.2f}%")
print(f"  Difference: {bad_rate_difference:+.2f} percentage points")

if bad_rate_difference < 0:
    print(f"  → Target segment is LOWER RISK (better than average)")
elif bad_rate_difference > 0:
    print(f"  → Target segment is HIGHER RISK (worse than average)")
else:
    print(f"  → Target segment has SIMILAR RISK to overall portfolio")

print("\n" + "="*80)
print("BREAKDOWN BY INDIVIDUAL CRITERIA")
print("="*80)

# Show impact of each filter
print("\nHow each criterion narrows the merged dataset:")

criteria = [
    ('1. Age (25-55 years)', 
     merged_df[(merged_df['DAYS_BIRTH'] >= -20075) & (merged_df['DAYS_BIRTH'] <= -9125)]),
    ('2. Income Type (Working/Commercial/State/Unemployed/Maternity leave)', 
     merged_df[merged_df['NAME_INCOME_TYPE'].isin(['Working', 'Commercial associate', 'State servant', 'Unemployed', 'Maternity leave'])]),
    ('3. Contract Type (Cash loans)', 
     merged_df[merged_df['NAME_CONTRACT_TYPE'] == 'Cash loans']),
    ('4. Income (≤ $96,000)', 
     merged_df[merged_df['AMT_INCOME_TOTAL'] <= 96000])
]

for name, subset in criteria:
    count = len(subset)
    pct = (count / total_original) * 100
    bad_rate = subset['TARGET'].mean() * 100
    print(f"\n{name}")
    print(f"  Records: {count:,} ({pct:.2f}%)")
    print(f"  Bad rate: {bad_rate:.2f}%")

print("\n" + "="*80)
print("FILTERED SEGMENT SUMMARY")
print("="*80)

print(f"""
✓ Filtering Complete!

ORIGINAL MERGED DATASET:
  - Records: {total_original:,}
  - Columns: {len(merged_df.columns)} (120 app + 71 bureau + 1 flag)

FILTERED SEGMENT (Lifelong Learning):
  - Records: {total_filtered:,} ({retention_rate:.2f}% retention)
  - Columns: {len(filtered_merged.columns)} (same as original)
  - Default rate: {filtered_bad_rate:.2f}% ({bad_rate_difference:+.2f} pp vs original)

BUREAU DATA AVAILABILITY:
  - Applicants with bureau history: {filtered_merged['HAS_CREDIT_HISTORY'].sum():,}
  - New-to-credit applicants: {(~filtered_merged['HAS_CREDIT_HISTORY']).sum():,}
  - Ready for analysis with {len([c for c in filtered_merged.columns if any(s in c for s in ['_MIN', '_MAX', '_MEAN', '_SUM']) or c.startswith('COUNT_')])} bureau features!

NEXT STEPS:
  1. Use 'filtered_merged' for Lifelong Learning segment analysis
  2. Analyze bureau features impact on default rate
  3. Build credit scorecard using merged data with bureau aggregates
  4. Engineer additional features from bureau data
""")

print("Variable name: filtered_merged")
print(f"Shape: {filtered_merged.shape}")

FILTERING MERGED DATA FOR LIFELONG LEARNING LOAN SEGMENT

ORIGINAL MERGED DATASET:
  Total records: 307,511
  Columns: 192
  Goods (TARGET=0): 282,686
  Bads (TARGET=1): 24,825
  Bad rate: 8.07%

Bureau data coverage:
  With bureau history: 263,491
  New-to-credit: 44,020

APPLYING FILTERING CRITERIA TO MERGED DATA

FILTERED SEGMENT (Lifelong Learning Loans):
  Total records: 31,125
  Columns: 192
  Goods (TARGET=0): 28,016
  Bads (TARGET=1): 3,109
  Bad rate: 9.99%

Bureau data in filtered segment:
  With bureau history: 25,380 (81.54%)
  New-to-credit: 5,745 (18.46%)

COMPARISON METRICS

RETENTION:
  Records retained: 31,125 out of 307,511
  Retention rate: 10.12%

BAD RATE COMPARISON:
  Original merged data: 8.07%
  Filtered segment: 9.99%
  Difference: +1.92 percentage points
  → Target segment is HIGHER RISK (worse than average)

BREAKDOWN BY INDIVIDUAL CRITERIA

How each criterion narrows the merged dataset:

1. Age (25-55 years)
  Records: 226,662 (73.71%)
  Bad rate: 8.71%

2. 