# HW 5 Cecily Wang Spring 2024 BA 476

The final homework is a Kaggle competition. You will predict whether loan applicants repay their loans or default. The target is binary (1 represents default) and you will be asked for probabilistic predictions, i.e. you have to predict the probability of default. Submissions will be evaluated using Brier scores, and your grade will depend on the accuracy of your predictions.

This dataset is synthetically generated from a well-known public dataset (available here). The dataset used for the competition is (obviously) not identical to the original; you are welcome to use the original in whatever way you want.

**Goal**: Your goal is to predict the probability a loan applicant defaults on the loan (outcome=1) or repays it (outcome=0). A predicted probability of 0.3 means a 30% chance of default.
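
To make the evaluation metric concrete, here is a minimal sketch of how a Brier score is computed. The outcomes and probabilities are made-up illustration values, not competition data. The score is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical outcomes (1 = default) and predicted default probabilities.
y_true = np.array([0, 0, 1, 0, 1])
p_pred = np.array([0.1, 0.3, 0.8, 0.2, 0.4])

# Brier score: mean squared difference between probability and outcome.
manual_brier = np.mean((p_pred - y_true) ** 2)
sklearn_brier = brier_score_loss(y_true, p_pred)

# Baseline for comparison: always predict the observed default rate.
base_rate = y_true.mean()
baseline_brier = np.mean((base_rate - y_true) ** 2)

print('manual:', manual_brier)
print('sklearn:', sklearn_brier)
print('constant baseline:', baseline_brier)
```

A model is only useful here if it beats the constant baseline, which is a handy sanity check before submitting.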

Hint 1: Preprocess your data
We discussed standardizing/normalizing and simple transformations like polynomial features. The sklearn documentation can be a good starting point for further reading.
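
As a rough illustration of Hint 1, the sketch below chains standardization and degree-2 polynomial features in a scikit-learn `Pipeline`. The array `X_demo` and the name `numeric_prep` are hypothetical and only stand in for numeric columns such as income and loan amount.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical numeric features: income and loan amount for four applicants.
X_demo = np.array([[30000.0, 5000.0],
                   [60000.0, 20000.0],
                   [90000.0, 10000.0],
                   [120000.0, 40000.0]])

# Standardize first, then add squares and the pairwise interaction term.
numeric_prep = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])
X_prepped = numeric_prep.fit_transform(X_demo)

# Columns are: x0, x1, x0^2, x0*x1, x1^2
print(X_prepped.shape)
```

Scaling before expanding keeps the squared and interaction terms on a comparable scale, which matters for linear models more than for tree-based ones.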

Hint 2: Probabilistic classifiers
Most of the classifiers in sklearn can be used to predict probabilities by using predict_proba() instead of predict().
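
The difference between `predict()` and `predict_proba()` can be sketched as follows. The data is randomly generated and `LogisticRegression` is just a convenient example classifier, not a recommendation for the competition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf_demo = LogisticRegression().fit(X_demo, y_demo)

# predict() returns hard 0/1 labels.
hard_labels = clf_demo.predict(X_demo[:5])

# predict_proba() returns one column per class, ordered like clf_demo.classes_.
proba = clf_demo.predict_proba(X_demo[:5])
p_default = proba[:, 1]  # probability of class 1

print(clf_demo.classes_)
print(hard_labels)
print(p_default)
```

For a Brier-scored submission, the second column (`[:, 1]`) is the one to submit, since it is the probability of outcome 1.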

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer



from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss


In [None]:
df = pd.read_csv('/content/Loan_default.csv')


In [None]:
df

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255342,8C6S86ESGC,19,37979,210682,541,109,4,14.11,12,0.85,Bachelor's,Full-time,Married,No,No,Other,No,0
255343,98R4KDHNND,32,51953,189899,511,14,2,11.55,24,0.21,High School,Part-time,Divorced,No,No,Home,No,1
255344,XQK1UUUNGP,56,84820,208294,597,70,3,5.29,60,0.50,High School,Self-employed,Married,Yes,Yes,Auto,Yes,0
255345,JAO28CPL4H,42,85109,60575,809,40,1,20.90,48,0.44,High School,Part-time,Single,Yes,Yes,Other,No,0


In [None]:
# train and test data
train_df = pd.read_csv('/content/train.csv')
test_df = pd.read_csv('/content/test.csv')


In [None]:
train_df.head()

Unnamed: 0,ID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,0,21,78304,168713,653,60,1,8.8,60,0.59,High School,Part-time,Single,No,Yes,Home,Yes,0
1,1,28,63751,84674,681,58,1,4.91,48,0.21,PhD,Part-time,Married,Yes,Yes,Auto,No,0
2,2,57,96676,167540,467,98,4,16.78,36,0.63,High School,Unemployed,Single,No,Yes,Business,Yes,0
3,3,24,79289,61546,358,63,4,6.4,60,0.83,Master's,Full-time,Single,Yes,Yes,Business,Yes,0
4,4,31,98586,232342,692,10,2,19.97,60,0.51,PhD,Unemployed,Married,Yes,Yes,Education,Yes,0


In [None]:
test_df.head()

Unnamed: 0,ID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner
0,150000,62,131102,240256,538,17,1,7.13,60,0.82,PhD,Self-employed,Single,No,No,Home,No
1,150001,23,35766,97204,812,38,4,12.5,12,0.84,High School,Part-time,Married,No,No,Education,No
2,150002,61,28925,89471,381,81,1,2.43,36,0.2,Master's,Part-time,Single,Yes,Yes,Home,Yes
3,150003,54,32569,28820,562,96,4,5.8,48,0.75,Bachelor's,Full-time,Single,Yes,Yes,Auto,Yes
4,150004,41,136460,213437,558,80,3,13.76,36,0.21,Bachelor's,Full-time,Single,No,No,Education,Yes


In [None]:
# assigning features and target from training data
X = train_df.drop(['ID', 'Default'], axis=1)
y = train_df['Default']


Preprocessing:

In [None]:
# preprocessing for numerical data
# missing values are replaced with the column median, then the columns are standardized
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# preprocessing for categorical data
# missing values are replaced with the most frequent value, then one-hot encoded
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]),
        ('cat', categorical_transformer, [cname for cname in X.columns if X[cname].dtype == 'object'])])


My first model choice is a random forest, since I think it makes the most sense for estimating how likely someone is to repay their loan.

In [None]:
model = RandomForestClassifier(n_estimators=250, random_state=0)  # tried n_estimators of 100 and 300 before settling on 250


In [None]:
#pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

# splitting data into training and validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

#fitting
clf.fit(X_train, y_train)

#predictions
preds = clf.predict_proba(X_valid)[:, 1]


In [None]:
# model eval
print('Brier score:', brier_score_loss(y_valid, preds))

Brier score: 0.11438821592592592


In [None]:
# predict default probabilities for the test data with the fitted model
preds_test = clf.predict_proba(test_df.drop(['ID'], axis=1))[:, 1]

output = pd.DataFrame({'ID': test_df.ID,
                       'TARGET': preds_test})



In [None]:
output

Unnamed: 0,ID,TARGET
0,150000,0.156667
1,150001,0.290000
2,150002,0.130000
3,150003,0.066667
4,150004,0.040000
...,...,...
49995,199995,0.130000
49996,199996,0.326667
49997,199997,0.050000
49998,199998,0.250000


In [None]:
# Save test predictions to file
output = pd.DataFrame({'ID': test_df.ID,
                       'TARGET': preds_test})
output.to_csv('cecily_wang_submission.csv', index=False)

# Part 2 (using grid search to see if I can improve accuracy)

Trying to get a better score with grid search.

I changed the n_estimators candidates from 10, 15, 35 to 15, 75, 125 to see if I could get a better score. The first round of grid search got me a submission score of 0.11417. Before that, without grid search (the code above), I got 0.11445.

In [None]:
# Define a grid of hyperparameter ranges
param_grid = {
    'model__n_estimators': [15, 75, 130],
    'model__max_depth': [5, 10, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}


In [None]:
#grid search object
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=4, scoring='neg_brier_score', verbose=2, n_jobs=-1)

#fitting to data
grid_search.fit(X_train, y_train)

Fitting 4 folds for each of 81 candidates, totalling 324 fits


In [None]:
print("Best parameters:", grid_search.best_params_)

## It makes sense that the n_estimators value chosen was 125, which makes me wonder whether, if I did this again, I should try candidate values closer to 100 or closer to 150.
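
One way to decide where to place the next round of candidates is to look at the whole results table rather than only `best_params_`. The sketch below is an assumption-laden toy: it uses synthetic data from `make_classification`, a bare `RandomForestClassifier` (so parameter names have no `model__` prefix), and a deliberately tiny grid so it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the loan dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 40], 'max_depth': [3, None]},
    cv=3,
    scoring='neg_brier_score',
)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate; flip the sign to get Brier scores.
results = pd.DataFrame(search.cv_results_)
summary = (
    results[['param_n_estimators', 'param_max_depth',
             'mean_test_score', 'std_test_score']]
    .assign(mean_brier=lambda d: -d['mean_test_score'])
    .sort_values('mean_brier')
)
print(summary)
```

If the differences between neighbouring `n_estimators` values are smaller than `std_test_score`, a finer grid around the winner is unlikely to change the score much.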

In [None]:
# here using the best estimator to make predictions
best_clf = grid_search.best_estimator_
preds = best_clf.predict_proba(X_valid)[:, 1]


In [None]:
#eval
print('Brier score of the best model:', brier_score_loss(y_valid, preds))

Brier score of the best model: 0.11417937413894808


In [None]:

preds_test = best_clf.predict_proba(test_df.drop(['ID'], axis=1))[:, 1]

In [None]:
output2 = pd.DataFrame({'ID': test_df.ID, 'TARGET': preds_test})

In [None]:
output2.to_csv('2submission.csv', index=False)




# Part 3: trying to get a better score

In [None]:
# Define a grid of hyperparameter ranges
param_grid = {
    'model__n_estimators': [25, 100, 130],
    'model__max_depth': [5, 10, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}


In [None]:
#grid search object
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3, scoring='neg_brier_score', verbose=2, n_jobs=-1)



In [None]:
#fitting to data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 81 candidates, totalling 243 fits


KeyboardInterrupt: 

In [None]:
print("Best parameters:", grid_search.best_params_)

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

In [None]:
# here using the best estimator to make predictions
best_clf = grid_search.best_estimator_
preds = best_clf.predict_proba(X_valid)[:, 1]

AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'

In [None]:
#eval
print('Brier score of the best model:', brier_score_loss(y_valid, preds))

Brier score of the best model: 0.11417937413894808


In [None]:

preds_test = best_clf.predict_proba(test_df.drop(['ID'], axis=1))[:, 1]

### Part 4

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier


# Define a larger grid of hyperparameters to explore
param_grid = {
    'model__n_estimators': [15,75 , 135],
    'model__max_depth': [5, 15, 25, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__max_features': ['sqrt', 'log2'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'model__bootstrap': [True, False]
}

# the preprocessor defined above already applies SimpleImputer() to numerical and categorical columns

model = RandomForestClassifier(random_state=0)

# pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

# randomized search to sample from the param_grid
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_grid,
    n_iter=100,  # number of parameter settings that are sampled
    cv=5,
    scoring='neg_brier_score',
    verbose=2,
    random_state=0,
    n_jobs=-1
)

# fit random search model
random_search.fit(X_train, y_train)

# best model
best_clf = random_search.best_estimator_

# predictions on the validation set
preds = best_clf.predict_proba(X_valid)[:, 1]

print('Brier score of the best model:', brier_score_loss(y_valid, preds))

# predictions on test set
preds_test = best_clf.predict_proba(test_df.drop(['ID'], axis=1))[:, 1]



Fitting 5 folds for each of 100 candidates, totalling 500 fits


ValueError: Input contains NaN.

In [None]:

# Save test predictions to file
output = pd.DataFrame({'ID': test_df.ID,
                       'TARGET': preds_test})
output.to_csv('3submission.csv', index=False)

Explanation and walkthrough of my code:

OK, so doing the grid search with 'model__n_estimators': [15, 75, 125] improved my score compared with the other two attempts.

I immediately thought of decision trees and random forests for this problem, since we were trying to find a 'threshold' that could separate loans that get repaid from loans that default. Decision trees are good at handling non-linear relationships and interactions between multiple features. I decided to start with scikit-learn's RandomForestClassifier so I could guard against overfitting, especially given the complexity and potential noise in the dataset.

As for the parameter tuning, I used GridSearchCV to optimize the random forest parameters. I did two rounds over n_estimators, max_depth, min_samples_split, and min_samples_leaf. This helped me find the combination of parameters that minimized overfitting while maximizing the model's predictive accuracy. This part of the code took about 7 minutes to run each time because of the size of the data, but this is where I was able to improve my accuracy the most.

I also think the best part of this process was the preprocessing pipeline (imputation, scaling, and encoding steps tailored to the different data types). It improved model performance by making sure the model received clean, well-structured input: standard scaling put the numerical features on a common scale, and one-hot encoding turned the categorical variables into numeric columns. The difficult part was making sure the preprocessing steps were applied the same way during both training and prediction. Initially, discrepancies in how missing values and categorical variables were handled between training and testing led to inconsistent model performance. Once I figured this out, I fixed it by integrating the preprocessing steps into the pipeline, so the same transformations are applied in both phases.
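
The point about keeping preprocessing inside the pipeline can be sketched on toy data. Everything here is hypothetical: the two columns only borrow names from the loan table, and `LogisticRegression` is a placeholder model. The new rows contain a missing income and an employment type that never appeared during fitting, and the fitted pipeline still handles both.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training data (not the competition data).
train_demo = pd.DataFrame({
    'Income': [40000.0, 85000.0, np.nan, 52000.0, 61000.0, 30000.0],
    'EmploymentType': ['Full-time', 'Part-time', 'Full-time',
                       'Unemployed', 'Part-time', 'Unemployed'],
})
y_demo = np.array([0, 0, 0, 1, 0, 1])

# New rows: a missing income and a category unseen during fitting.
new_demo = pd.DataFrame({
    'Income': [np.nan, 70000.0],
    'EmploymentType': ['Self-employed', 'Full-time'],
})

prep_demo = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())]), ['Income']),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                          ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
         ['EmploymentType']),
    ],
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)
pipe_demo = Pipeline([('preprocessor', prep_demo),
                      ('model', LogisticRegression())])
pipe_demo.fit(train_demo, y_demo)

# The same fitted medians, scales, and categories are reused at prediction time.
encoded_new = pipe_demo.named_steps['preprocessor'].transform(new_demo)
proba_new = pipe_demo.predict_proba(new_demo)[:, 1]
print(encoded_new)
print(proba_new)
```

The unseen category becomes an all-zero one-hot row because of `handle_unknown='ignore'`, and the missing income is filled with the median learned from the training rows.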

To bring it up again, the main issue I still wish I could fix is the computational cost and time of training, especially the parameter tuning with grid search. Even so, the trade-off was acceptable, since I was able to heavily improve predictive accuracy and robustness against overfitting.
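
The cost of a grid search can be estimated before running it. This sketch counts the candidates in the Part 2 grid with `ParameterGrid` and compares the number of fits with a randomized search. The `n_iter = 20` budget is a hypothetical value chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in Part 2.
param_grid = {
    'model__n_estimators': [15, 75, 130],
    'model__max_depth': [5, 10, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
}
cv_folds = 4

# Grid search fits every combination once per fold.
n_candidates = len(ParameterGrid(param_grid))
grid_fits = n_candidates * cv_folds

# Randomized search fits only n_iter sampled combinations per fold.
n_iter = 20  # hypothetical budget
random_fits = n_iter * cv_folds

print('grid candidates:', n_candidates)
print('grid search fits:', grid_fits)
print('randomized search fits:', random_fits)
```

The grid total matches the "81 candidates, totalling 324 fits" message printed by `GridSearchCV` above, so a smaller `n_iter` is one direct way to trade search coverage for run time.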





In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import brier_score_loss
from xgboost import XGBClassifier

df = pd.read_csv('/content/Loan_default.csv')

train_df = pd.read_csv('/content/train.csv')
test_df = pd.read_csv('/content/test.csv')

X = train_df.drop(['ID', 'Default'], axis=1)
y = train_df['Default']

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]),
        ('cat', categorical_transformer, [cname for cname in X.columns if X[cname].dtype == 'object'])])

model = XGBClassifier(n_estimators=24, random_state=0, use_label_encoder=False, eval_metric='logloss')

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)

preds = clf.predict_proba(X_valid)[:, 1]

print('Brier score:', brier_score_loss(y_valid, preds))

preds_test = clf.predict_proba(test_df.drop(['ID'], axis=1))[:, 1]
output = pd.DataFrame({'ID': test_df.ID, 'TARGET': preds_test})


Brier score: 0.11320250487632796


In [None]:
output = pd.DataFrame({'ID': test_df.ID,
                       'TARGET': preds_test})
output.to_csv('9cecily_wang_submission.csv', index=False)

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import brier_score_loss
from xgboost import XGBClassifier


numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]),
    ('cat', categorical_transformer, [cname for cname in X.columns if X[cname].dtype == 'object'])])

#classifier with early stopping
model = XGBClassifier(
    n_estimators=555,
    max_depth=5,
    learning_rate=0.04,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1,
    random_state=0,
    use_label_encoder=False,
    eval_metric='logloss',
    enable_categorical=True,
    early_stopping_rounds=10
)

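# Preprocess
# note: fitting the preprocessor on all of X before splitting lets the validation rows
# influence the imputation and scaling statistics
X_preprocessed = preprocessor.fit_transform(X)
# note: train_size + test_size = 0.9, so 10% of the rows end up in neither set
X_train, X_valid, y_train, y_valid = train_test_split(X_preprocessed, y, train_size=0.75, test_size=0.15, random_state=0)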

#  XGBClassifier
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

preds = model.predict_proba(X_valid)[:, 1]
print('Brier score:', brier_score_loss(y_valid, preds))

X_test_preprocessed = preprocessor.transform(test_df.drop(['ID'], axis=1))
preds_test = model.predict_proba(X_test_preprocessed)[:, 1]
output = pd.DataFrame({'ID': test_df['ID'], 'TARGET': preds_test})


[0]	validation_0-logloss:0.43806
[1]	validation_0-logloss:0.43473
[2]	validation_0-logloss:0.43230
[3]	validation_0-logloss:0.42934
[4]	validation_0-logloss:0.42660
[5]	validation_0-logloss:0.42408
[6]	validation_0-logloss:0.42171
[7]	validation_0-logloss:0.42011
[8]	validation_0-logloss:0.41799
[9]	validation_0-logloss:0.41596
[10]	validation_0-logloss:0.41429
[11]	validation_0-logloss:0.41265
[12]	validation_0-logloss:0.41119
[13]	validation_0-logloss:0.40976
[14]	validation_0-logloss:0.40828
[15]	validation_0-logloss:0.40690
[16]	validation_0-logloss:0.40586
[17]	validation_0-logloss:0.40461
[18]	validation_0-logloss:0.40345
[19]	validation_0-logloss:0.40251
[20]	validation_0-logloss:0.40155
[21]	validation_0-logloss:0.40047
[22]	validation_0-logloss:0.39943
[23]	validation_0-logloss:0.39858
[24]	validation_0-logloss:0.39773
[25]	validation_0-logloss:0.39687
[26]	validation_0-logloss:0.39625
[27]	validation_0-logloss:0.39545
[28]	validation_0-logloss:0.39468
[29]	validation_0-loglos

In [None]:
from sklearn.metrics import make_scorer, brier_score_loss


train_df = pd.read_csv('/content/train.csv')
test_df = pd.read_csv('/content/test.csv')

X = train_df.drop(['ID', 'Default'], axis=1)
y = train_df['Default']
X_test = test_df.drop(['ID'], axis=1)

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]),
    ('cat', categorical_transformer, [cname for cname in X.columns if X[cname].dtype == 'object'])])

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', XGBClassifier(
        n_estimators=1100,
        max_depth=5,
        learning_rate=0.03,
        subsample=0.6,
        colsample_bytree=0.8,
        reg_alpha=0.7,
        reg_lambda=0.7,
        random_state=0,
        use_label_encoder=False,
        eval_metric='logloss',
        enable_categorical=True))
])

# 'neg_brier_score' is scikit-learn's built-in scorer: it calls predict_proba and
# returns the negated Brier score (make_scorer's needs_proba argument is deprecated)

# k-fold cross-validation
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='neg_brier_score')

print("Cross-validated Brier scores:", scores)
print("Mean Brier score:", -scores.mean())





Cross-validated Brier scores: [-0.10928798 -0.10899046 -0.10944336 -0.10891281 -0.10917109]
Mean Brier score: 0.10916114086867942


In [None]:
# full pipeline on all training data
full_pipeline.fit(X, y)

# Predict test set
test_predictions = full_pipeline.predict_proba(X_test)[:, 1]

test_output = pd.DataFrame({'ID': test_df['ID'], 'TARGET': test_predictions})

In [None]:

test_output.to_csv('37cecily_wang_submission.csv', index=False)

# OK, this part is a resubmission, since I got better scores throughout the week. I was initially stuck on the previous model, so I decided to incorporate XGBoost. I first tried more grid search, but it always took too long to run, so I reran the code by hand with different parameters and landed on this submission (37) as the best model. The parameters I adjusted the most were n_estimators and learning_rate, and I also increased reg_alpha and reg_lambda.
