
**Reviewing the competition and creating a baseline model**

**Goal**: Predict the probability that a loan applicant will default on their loan.

**Type of Task**: Binary classification

TARGET = 1: Applicant will have difficulty repaying

TARGET = 0: Applicant will repay the loan

**What to Submit to Kaggle**

I’ll submit a CSV file with two columns:

1. SK_ID_CURR: Unique ID for each applicant in the test set

2. TARGET: Predicted probability of default (a float between 0 and 1)

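**Evaluation Metric**

Kaggle uses Area Under the ROC Curve (AUC) to evaluate submissions.

AUC measures how well my model ranks defaulters above non-defaulters: it is the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter.

Higher AUC = better model performance (0.5 corresponds to random guessing, 1.0 to a perfect ranking)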
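
To make the metric concrete, here is a small sketch on made-up labels and scores (the four values are illustrative, not taken from the competition data). It computes AUC with `roc_auc_score` and again by counting correctly ordered (defaulter, non-defaulter) pairs, which shows that only the ranking of the predictions matters.

```python
from itertools import product

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc_sklearn = roc_auc_score(y_true, y_score)

# Pairwise definition: fraction of (positive, negative) pairs in which
# the positive example receives the higher score (ties count as half)
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = list(product(pos, neg))
auc_pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

# A strictly increasing transform of the scores leaves the ranking, and so the AUC, unchanged
auc_transformed = roc_auc_score(y_true, y_score ** 3)

print(auc_sklearn, auc_pairwise, auc_transformed)
```

Three of the four (positive, negative) pairs are ordered correctly, so all three values are 0.75.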

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Loading data
train = pd.read_csv('application_train.csv')
test = pd.read_csv('application_test.csv')

# Selecting basic features
features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_BIRTH', 'DAYS_EMPLOYED']
X = train[features]
y = train['TARGET']
X_test = test[features]

# Preprocessing
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)

X_test_imputed = imputer.transform(X_test)
X_test_scaled = scaler.transform(X_test_imputed)

# Training the model
model = LogisticRegression()
model.fit(X_scaled, y)

# Predict
preds = model.predict_proba(X_test_scaled)[:, 1]

# Create submission file
submission = pd.DataFrame({
    'SK_ID_CURR': test['SK_ID_CURR'],
    'TARGET': preds
})
submission.to_csv('baseline_submission.csv', index=False)

# Evaluate on a held-out 20% validation split
# (this refits the model on the remaining 80% of the training rows)
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
val_preds = model.predict_proba(X_val)[:, 1]
print("Validation AUC:", roc_auc_score(y_val, val_preds))

Validation AUC: 0.6258083065422515
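
The cell above fits the imputer and scaler on all training rows before splitting, so the validation rows influence the preprocessing statistics. A minimal sketch of one way to avoid that is shown here, assuming synthetic stand-in data (the `make_classification` settings, including the `weights` value that makes the target imbalanced, are illustrative). It wraps the steps in a `Pipeline` and scores it with stratified cross-validation, so each fold refits the preprocessing on its own training portion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four numeric features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.92], random_state=42,
)

# Knock out some values so the imputer has work to do
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

baseline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Each fold fits imputer, scaler and model on its training part only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_aucs = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring='roc_auc')

print("Fold AUCs:", np.round(fold_aucs, 3))
print("Mean AUC:", fold_aucs.mean())
```

Averaging over several folds also gives a rough sense of how much a single-split AUC can vary.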


**Learning and Verification**

In [3]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load training and test data
train = pd.read_csv('application_train.csv')
test = pd.read_csv('application_test.csv')

print(train.shape)
print(train[['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']].describe())
print(train['TARGET'].value_counts(normalize=True))

missing = train.isnull().sum().sort_values(ascending=False)
print("Top missing features:\n", missing.head())

# Correlation with target (sorted descending, so head() shows only the most positive values)
corr = train.corr(numeric_only=True)['TARGET'].sort_values(ascending=False)
print("Top correlated features:\n", corr.head(10))

# Preprocessing
features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_CREDIT', 'AMT_INCOME_TOTAL', 'DAYS_BIRTH']
X = train[features]
y = train['TARGET']
X_test = test[features]

# Impute and scale
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)

X_test_imputed = imputer.transform(X_test)
X_test_scaled = scaler.transform(X_test_imputed)

# Splitting for validation
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Training model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predicting probabilities
val_preds = model.predict_proba(X_val)[:, 1]

# Evaluating
auc = roc_auc_score(y_val, val_preds)
print(f"Validation AUC: {auc:.4f}")

# Predict on test set
test_preds = model.predict_proba(X_test_scaled)[:, 1]

# Creating submission file
submission = pd.DataFrame({
    'SK_ID_CURR': test['SK_ID_CURR'],
    'TARGET': test_preds
})
submission.to_csv('baseline_submission.csv', index=False)

(242712, 122)
              TARGET  AMT_INCOME_TOTAL    AMT_CREDIT
count  242712.000000      2.427120e+05  2.427110e+05
mean        0.080973      1.688283e+05  5.986425e+05
std         0.272793      2.604745e+05  4.020240e+05
min         0.000000      6.750000e+02  4.500000e+04
25%         0.000000      1.125000e+05  2.700000e+05
50%         0.000000      1.458000e+05  5.124465e+05
75%         0.000000      2.025000e+05  8.086500e+05
max         1.000000      1.170000e+08  4.050000e+06
TARGET
0    0.919027
1    0.080973
Name: proportion, dtype: float64
Top missing features:
 COMMONAREA_AVG              169738
COMMONAREA_MODE             169738
COMMONAREA_MEDI             169738
NONLIVINGAPARTMENTS_MEDI    168582
NONLIVINGAPARTMENTS_MODE    168582
dtype: int64
Top correlated features:
 TARGET                         1.000000
DAYS_BIRTH                     0.077927
REGION_RATING_CLIENT_W_CITY    0.061065
REGION_RATING_CLIENT           0.059101
DAYS_LAST_PHONE_CHANGE         0.054688
DAYS
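
Because the correlations are sorted in descending order, `head(10)` lists only the most positive ones; a feature with a strong negative correlation would sit at the bottom of the list. A small sketch, using a made-up frame (the column names `pos_feat`, `neg_feat` and `noise` are hypothetical), ranks by absolute value instead:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
target = rng.integers(0, 2, size=n)

# Made-up features: one rises with the target, one falls strongly, one is noise
demo = pd.DataFrame({
    'TARGET': target,
    'pos_feat': 0.3 * target + rng.normal(size=n),
    'neg_feat': -1.0 * target + rng.normal(size=n),
    'noise': rng.normal(size=n),
})

corr = demo.corr(numeric_only=True)['TARGET'].drop('TARGET')

# Descending sort puts the strongly negative feature last
by_value = corr.sort_values(ascending=False)

# Sorting on |r| surfaces the strongest relationships of either sign
by_strength = corr.sort_values(key=lambda s: s.abs(), ascending=False)

print(by_value)
print(by_strength)
```

In this toy frame `neg_feat` is last in the first listing and first in the second, even though it has the strongest relationship with the target.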

**Estimation on test data**

In [8]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Loading data
train = pd.read_csv('application_train.csv')
test = pd.read_csv('application_test.csv')

# Features for baseline
features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
            'AMT_CREDIT', 'AMT_INCOME_TOTAL', 'DAYS_BIRTH']

X = train[features]
y = train['TARGET']
X_test = test[features]

# Impute and scale baseline features
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)

X_test_imputed = imputer.transform(X_test)
X_test_scaled = scaler.transform(X_test_imputed)

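# Polynomial features for EXT_SOURCE vars
# (this refits the same imputer on the three EXT_SOURCE columns only)
ext_sources_train = imputer.fit_transform(train[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']])
ext_sources_test = imputer.transform(test[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']])

# include_bias defaults to True, so the output has a constant column '1'
# plus the three first-order terms and the three pairwise products (7 columns)
poly = PolynomialFeatures(degree=2, interaction_only=True)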
X_poly_train = poly.fit_transform(ext_sources_train)
X_poly_test = poly.transform(ext_sources_test)

# Converting to DataFrames with feature names
poly_feature_names = poly.get_feature_names_out(['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3'])
X_poly_train_df = pd.DataFrame(X_poly_train, columns=poly_feature_names, index=train.index)
X_poly_test_df = pd.DataFrame(X_poly_test, columns=poly_feature_names, index=test.index)

# Displaying interaction features
print("Polynomial features shape:", X_poly_train_df.shape)
print("First 5 rows:\n", X_poly_train_df.head())

# Merging polynomial features back into X
X_full = pd.concat([pd.DataFrame(X_scaled, index=train.index, columns=features),
                    X_poly_train_df], axis=1)
X_test_full = pd.concat([pd.DataFrame(X_test_scaled, index=test.index, columns=features),
                         X_poly_test_df], axis=1)

# Training/validating Logistic Regression
X_train, X_val, y_train, y_val = train_test_split(X_full, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

val_preds = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_preds)
print(f"Validation AUC with polynomial features: {auc:.4f}")

# Predicting on test set and creating submission
test_preds = model.predict_proba(X_test_full)[:, 1]

submission = pd.DataFrame({
    'SK_ID_CURR': test['SK_ID_CURR'],
    'TARGET': test_preds
})
submission.to_csv('submission.csv', index=False)
print("✅ Submission file saved: submission.csv")

Polynomial features shape: (307511, 7)
First 5 rows:
      1  EXT_SOURCE_1  EXT_SOURCE_2  EXT_SOURCE_3  EXT_SOURCE_1 EXT_SOURCE_2  \
0  1.0      0.083037      0.262949      0.139376                   0.021834   
1  1.0      0.311267      0.622246      0.535276                   0.193685   
2  1.0      0.505998      0.555912      0.729567                   0.281290   
3  1.0      0.505998      0.650442      0.535276                   0.329122   
4  1.0      0.505998      0.322738      0.535276                   0.163305   

   EXT_SOURCE_1 EXT_SOURCE_3  EXT_SOURCE_2 EXT_SOURCE_3  
0                   0.011573                   0.036649  
1                   0.166614                   0.333073  
2                   0.369159                   0.405575  
3                   0.270849                   0.348166  
4                   0.270849                   0.172754  
Validation AUC with polynomial features: 0.7208
✅ Submission file saved: submission.csv


**Feature Engineering**

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Loading Data
train = pd.read_csv('application_train.csv')
test = pd.read_csv('application_test.csv')

# Creating Target
y = train['TARGET']
train.drop(columns=['TARGET'], inplace=True)

# Splitting for validation
X_train, X_val, y_train, y_val = train_test_split(train, y, test_size=0.2, random_state=42)

# Helper: validation AUC from the predicted probability of the positive class
def evaluate_model(model, X_val, y_val):
    preds = model.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, preds)

# Pattern 1: Basic Numeric Features
features_1 = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_BIRTH', 'DAYS_EMPLOYED']
pipeline_1 = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])
model_1 = pipeline_1.fit(X_train[features_1], y_train)
auc_1 = evaluate_model(model_1, X_val[features_1], y_val)

# Pattern 2: Ratio Features
X_train = X_train.copy()
X_val = X_val.copy()
X_train['CREDIT_INCOME_RATIO'] = X_train['AMT_CREDIT'] / X_train['AMT_INCOME_TOTAL']
X_train['EMPLOYED_AGE_RATIO'] = X_train['DAYS_EMPLOYED'] / X_train['DAYS_BIRTH']
X_train['ANNUITY_INCOME_RATIO'] = X_train['AMT_ANNUITY'] / X_train['AMT_INCOME_TOTAL']
X_val['CREDIT_INCOME_RATIO'] = X_val['AMT_CREDIT'] / X_val['AMT_INCOME_TOTAL']
X_val['EMPLOYED_AGE_RATIO'] = X_val['DAYS_EMPLOYED'] / X_val['DAYS_BIRTH']
X_val['ANNUITY_INCOME_RATIO'] = X_val['AMT_ANNUITY'] / X_val['AMT_INCOME_TOTAL']

features_2 = ['CREDIT_INCOME_RATIO', 'EMPLOYED_AGE_RATIO', 'ANNUITY_INCOME_RATIO']
pipeline_2 = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])
model_2 = pipeline_2.fit(X_train[features_2], y_train)
auc_2 = evaluate_model(model_2, X_val[features_2], y_val)

# Pattern 3: External Source Aggregates
for df in [X_train, X_val]:
    df['EXT_SOURCES_MEAN'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
    df['EXT_SOURCES_STD'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)

features_3 = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'EXT_SOURCES_MEAN', 'EXT_SOURCES_STD']
pipeline_3 = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])
model_3 = pipeline_3.fit(X_train[features_3], y_train)
auc_3 = evaluate_model(model_3, X_val[features_3], y_val)

# Pattern 4: Polynomial Interactions
imputer = SimpleImputer(strategy='median')
ext_train = imputer.fit_transform(X_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']])
ext_val = imputer.transform(X_val[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']])

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly_train = poly.fit_transform(ext_train)
X_poly_val = poly.transform(ext_val)

model_4 = LogisticRegression(max_iter=1000)
model_4.fit(X_poly_train, y_train)
auc_4 = evaluate_model(model_4, X_poly_val, y_val)

# Pattern 5: Categorical + Numeric Mix
# (uses a random forest, unlike the logistic regression in Patterns 1-4)
cat_features = ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_EDUCATION_TYPE']
num_features = ['AMT_CREDIT', 'DAYS_BIRTH', 'EXT_SOURCE_2']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)
])

pipeline_5 = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
model_5 = pipeline_5.fit(X_train[cat_features + num_features], y_train)
auc_5 = evaluate_model(model_5, X_val[cat_features + num_features], y_val)

# Results
print(f'Pattern 1 AUC: {auc_1:.3f}')
print(f'Pattern 2 AUC: {auc_2:.3f}')
print(f'Pattern 3 AUC: {auc_3:.3f}')
print(f'Pattern 4 AUC: {auc_4:.3f}')
print(f'Pattern 5 AUC: {auc_5:.3f}')

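# Final Submission (generated from the Pattern 5 pipeline)
# Fills EXT_SOURCE_2 with the test-set median; without this line the
# pipeline's imputer would fill it with the training median instead
test['EXT_SOURCE_2'] = test['EXT_SOURCE_2'].fillna(test['EXT_SOURCE_2'].median())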

submission = pd.DataFrame({
    'SK_ID_CURR': test['SK_ID_CURR'],
    'TARGET': pipeline_5.predict_proba(test[cat_features + num_features])[:, 1]
})
submission.to_csv('final_submission.csv', index=False)
print("✅ Submission saved as final_submission.csv")

Pattern 1 AUC: 0.592
Pattern 2 AUC: 0.566
Pattern 3 AUC: 0.721
Pattern 4 AUC: 0.719
Pattern 5 AUC: 0.630
✅ Submission saved as final_submission.csv
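
The submission above is generated from the Pattern 5 pipeline, although Pattern 3 scores higher on validation. A hedged sketch of how a Pattern 3 submission could be produced follows. It runs on small synthetic frames, and the helpers `add_ext_source_aggregates` and `make_demo_frame` are hypothetical names introduced here so the same aggregates are built for the training and test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

EXT_COLS = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
FEATURES_3 = EXT_COLS + ['EXT_SOURCES_MEAN', 'EXT_SOURCES_STD']


def add_ext_source_aggregates(df):
    """Return a copy of df with the row-wise mean and std of the EXT_SOURCE columns."""
    out = df.copy()
    out['EXT_SOURCES_MEAN'] = out[EXT_COLS].mean(axis=1)
    # Rows with fewer than two non-missing sources get NaN here (sample std);
    # the pipeline's imputer fills those afterwards
    out['EXT_SOURCES_STD'] = out[EXT_COLS].std(axis=1)
    return out


def make_demo_frame(n_rows, seed):
    """Synthetic stand-in for the application tables (not the real data)."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame(rng.random((n_rows, 3)), columns=EXT_COLS)
    frame = frame.mask(rng.random((n_rows, 3)) < 0.2)  # inject missing values
    frame.insert(0, 'SK_ID_CURR', np.arange(100000, 100000 + n_rows))
    return frame


train_demo = add_ext_source_aggregates(make_demo_frame(500, seed=0))
test_demo = add_ext_source_aggregates(make_demo_frame(100, seed=1))

# Made-up target: lower external scores mean a higher chance of TARGET = 1
noise = np.random.default_rng(2).random(500)
y_demo = (noise > train_demo['EXT_SOURCES_MEAN'].fillna(0.5)).astype(int)

pipeline_3 = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipeline_3.fit(train_demo[FEATURES_3], y_demo)

submission_demo = pd.DataFrame({
    'SK_ID_CURR': test_demo['SK_ID_CURR'],
    'TARGET': pipeline_3.predict_proba(test_demo[FEATURES_3])[:, 1],
})
print(submission_demo.head())
```

Applying one helper to both frames keeps the engineered columns identical between fitting and prediction, and leaving imputation to the pipeline means test rows are filled with statistics learned from the training rows.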


**Summary**

I compared five feature engineering strategies on the Home Credit Default Risk dataset, scoring each by AUC on the same 20% validation split. Patterns 1 to 4 use Logistic Regression; Pattern 5 uses a Random Forest, so its score reflects a change of model as well as of features:

| Pattern | Strategy                    | Validation AUC |
|---------|-----------------------------|----------------|
| 1️⃣      | Basic numeric features       | 0.592          |
| 2️⃣      | Ratio-based features         | 0.566          |
| 3️⃣      | External source aggregates   | 0.721          |
| 4️⃣      | Polynomial interactions      | 0.719          |
| 5️⃣      | Categorical + numeric mix    | 0.630          |


Pattern 3 builds on the three external credit scores, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3, and adds two row-wise aggregates:

EXT_SOURCES_MEAN: the average of the available scores, a single summary of the creditworthiness signal

EXT_SOURCES_STD: the spread across the scores, which indicates how much the sources disagree

Fed through the same median-imputation, scaling, and Logistic Regression pipeline as Patterns 1 and 2, this feature set gives the highest validation AUC (0.721). Pattern 4, which uses pairwise products of the same three scores, is close behind at 0.719, so most of the gain over Patterns 1 and 2 appears to come from the EXT_SOURCE features themselves rather than from the particular transformation. All scores come from a single 80/20 split, so small differences between patterns should be read with caution.

**Downloading the final_submission.csv file to submit on Kaggle**

In [12]:
from google.colab import files
files.download("final_submission.csv")
