
# Assignment 2: Hyperparameter Optimization For the Customer's Credit Scoring Model

This notebook contains a set of exercises that will guide you through the different steps of this assignment. Solutions must be code-based, _i.e._ hard-coded or manually computed results will not be accepted. Remember to write your solution to each exercise in its dedicated cell, and do not modify or remove the test cells. Once you have completed all the exercises, submit this same notebook to Moodle in **.ipynb** format.
<div class="alert alert-success">

The dataset consists of data about 1000 customers, encompassing 84 features extracted from their financial transactions and current financial status. The main aim is to utilize this dataset for credit risk assessment and forecasting potential defaults.

Included within are two target variables, one designed for classification and the other for regression analysis:

- **DEFAULT**: Binary target variable indicating if the customer has defaulted (1) or not (0)
- **CREDIT_SCORE**: Numerical target variable representing the customer's credit score (integer)

and these features:

- **INCOME**: Total income in the last 12 months
- **SAVINGS**: Total savings in the last 12 months
- **DEBT**: Total existing debt
- **R_SAVINGS_INCOME**: Ratio of savings to income
- **R_DEBT_INCOME**: Ratio of debt to income
- **R_DEBT_SAVINGS**: Ratio of debt to savings

Transactions are grouped into categories (**GROCERIES**, **CLOTHING**, **HOUSING**, **EDUCATION**, **HEALTH**, **TRAVEL**, **ENTERTAINMENT**, **GAMBLING**, **UTILITIES**, **TAX**, **FINES**). For each group, the dataset provides the following features, where `[GROUP]` stands for the group name:

- **`T_[GROUP]_6`**: Total expenditure in that group in the last 6 months
- **`T_[GROUP]_12`**: Total expenditure in that group in the last 12 months
- **`R_[GROUP]`**: Ratio of `T_[GROUP]_6` to `T_[GROUP]_12`
- **`R_[GROUP]_INCOME`**: Ratio of `T_[GROUP]_12` to `INCOME`
- **`R_[GROUP]_SAVINGS`**: Ratio of `T_[GROUP]_12` to `SAVINGS`
- **`R_[GROUP]_DEBT`**: Ratio of `T_[GROUP]_12` to `DEBT`

Categorical Features:

- **CAT_GAMBLING**: Gambling category (none, low, high)
- **CAT_DEBT**: 1 if the customer has debt; 0 otherwise
- **CAT_CREDIT_CARD**: 1 if the customer has a credit card; 0 otherwise
- **CAT_MORTGAGE**: 1 if the customer has a mortgage; 0 otherwise
- **CAT_SAVINGS_ACCOUNT**: 1 if the customer has a savings account; 0 otherwise
- **CAT_DEPENDENTS**: 1 if the customer has any dependents; 0 otherwise
- **CAT_LOCATION**: Location (San Francisco, Philadelphia, Los Angeles, etc.)
- **CAT_MARITAL_STATUS**: Marital status (Married, Widowed, Divorced or Single)
- **CAT_EDUCATION**: Level of Education (Postgraduate, College, High School or Graduate)

</div>
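The per-group naming scheme in the dataset description can be expanded mechanically. The following is a minimal sketch, assuming that every pattern uses underscore separators (for example `T_TRAVEL_6` and `R_TRAVEL_INCOME`); the helper `group_columns` is a hypothetical name introduced only for illustration, and the result should be checked against `df.columns` once the data is loaded.

```python
# Transaction groups listed in the dataset description.
GROUPS = [
    "GROCERIES", "CLOTHING", "HOUSING", "EDUCATION", "HEALTH", "TRAVEL",
    "ENTERTAINMENT", "GAMBLING", "UTILITIES", "TAX", "FINES",
]


def group_columns(group):
    """Return the six column names described for one transaction group.

    Assumption: underscore separators throughout, e.g. R_TRAVEL_INCOME.
    """
    return [
        f"T_{group}_6",         # expenditure, last 6 months
        f"T_{group}_12",        # expenditure, last 12 months
        f"R_{group}",           # ratio of the 6-month to the 12-month total
        f"R_{group}_INCOME",    # 12-month total relative to INCOME
        f"R_{group}_SAVINGS",   # 12-month total relative to SAVINGS
        f"R_{group}_DEBT",      # 12-month total relative to DEBT
    ]


# Flatten the names for all groups: 11 groups x 6 patterns.
transaction_columns = [name for g in GROUPS for name in group_columns(g)]
print(len(transaction_columns))
print(group_columns("TRAVEL"))
```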

<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, March 17th, 23:55</div>

In [None]:
import pandas as pd
from sklearn import set_config

set_config(transform_output="pandas")

<div class="alert alert-info"><b>Exercise 1</b>
    
Load the data from the link: https://raw.githubusercontent.com/jnin/information-systems/main/data/AI2_23_24_credit_score.csv in a DataFrame called ```df```. This time, drop only the column ```CUST_ID```.

Then, write the code to create the feature matrix ```X``` and the target array ```y``` (```DEFAULT```), and split them into separate training and test sets with relative sizes of 0.75 and 0.25. Store the training and test feature matrices in variables called ```X_train``` and ```X_test```, and the training and test label arrays in ```y_train``` and ```y_test```.    
<br><i>[0.5 points]</i>
</div>
<div class="alert alert-warning">
    
Don't forget to remove the column ```CREDIT_SCORE``` from ```X```. This variable serves as the target array for the regression exercises.

</div>
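The splitting step can be illustrated on a small stand-in frame. This is a sketch under stated assumptions, not the graded solution: the toy values are made up, only a handful of columns are included, and `stratify` and `random_state` are optional choices that the exercise does not require.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "CUST_ID": [f"C{i:03d}" for i in range(20)],
    "INCOME": [1000 + 50 * i for i in range(20)],
    "CAT_GAMBLING": ["none", "low", "high", "none"] * 5,
    "CREDIT_SCORE": [500 + 5 * i for i in range(20)],
    "DEFAULT": [0, 0, 0, 1] * 5,
})

# Drop the identifier, then separate the features from both targets.
toy = toy.drop(columns=["CUST_ID"])
X_toy = toy.drop(columns=["DEFAULT", "CREDIT_SCORE"])
y_toy = toy["DEFAULT"]

# 75/25 split; stratify keeps the class ratio similar in both parts (assumption).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)
print(X_tr.shape, X_te.shape)
```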

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 2</b>

Write the code to construct a two-branched `Pipeline`: one branch for categorical attributes and another for numerical attributes. For categorical attributes, use a `SimpleImputer` with the most frequent strategy followed by a `OneHotEncoder`. For numerical attributes, use a `SimpleImputer` with the mean strategy followed by a `StandardScaler`. The pipeline must conclude by training a `DecisionTreeClassifier` without hyperparameter tuning. Save the resulting pipeline in a variable named `pipe`.
<br><i>[1.5 points]</i>
</div>


In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3</b>

Write the code to estimate the performance of the model using cross-validation with **five** stratified folds. Store the five test score values in a dictionary called ```fold_scores```.
<br><i>[0.5 points]</i>
</div>
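As a rough illustration of stratified five-fold cross-validation, the sketch below uses `cross_validate`, which returns a dictionary whose `"test_score"` entry holds one score per fold. The synthetic data from `make_classification`, the bare decision tree, and the `shuffle` and `random_state` settings are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the credit dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, weights=[0.7, 0.3], random_state=0
)

# Five stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate returns a dict with fit times, score times and test scores.
demo_scores = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv
)
print(demo_scores["test_score"])
```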

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4</b>

Write the code to instantiate a ```GridSearchCV``` object named `grid`, fitting it with **only three folds**. This ```GridSearchCV``` object should incorporate the previous pipeline and explore diverse hyperparameters to enhance the predictive capability of the previous `DecisionTreeClassifier`. Employ the grid search wisely, avoiding testing an excessive number of alternatives.

Lastly, save the score (accuracy) achieved by the best hyperparameter combination in a variable named `score`.
<br><i>[1.5 points]</i>
</div>
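A grid search over a pipeline addresses hyperparameters through the `step__parameter` naming convention. The sketch below is one possible setup on synthetic data; the step names and the candidate values for `max_depth` and `min_samples_leaf` are illustrative assumptions, not recommended settings for the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Keys follow "<step name>__<parameter>"; 3 x 3 = 9 candidates in total.
param_grid = {
    "model__max_depth": [3, 5, None],
    "model__min_samples_leaf": [1, 5, 20],
}

# cv=3 means each candidate is fitted three times (27 fits plus the final refit).
demo_grid = GridSearchCV(demo_pipe, param_grid, cv=3, scoring="accuracy")
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(demo_grid.best_score_)  # mean cross-validated accuracy of the best candidate
```

Keeping the grid small matters because the number of fits grows multiplicatively with each added parameter list.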

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5</b>

    
Write the code to compute the generalization accuracy for the best model of the ```GridSearchCV``` object, and store this score in a variable called ```generalization_score```.
<br><i>[1 point]</i>
</div>
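The generalization estimate comes from data that played no part in the search. As a minimal sketch on synthetic data (all names and values below are assumptions), a fitted `GridSearchCV` with the default `refit=True` exposes `score`, which evaluates the refitted best estimator on the held-out set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8]},
    cv=3,
    scoring="accuracy",
)
demo_grid.fit(X_tr, y_tr)  # the search only ever sees the training part

# score() applies the search's scorer to the best estimator, refit on all of X_tr.
held_out = demo_grid.score(X_te, y_te)

# Equivalent explicit computation, shown for comparison.
explicit = accuracy_score(y_te, demo_grid.best_estimator_.predict(X_te))
print(held_out, explicit)
```

The cross-validated `best_score_` is usually somewhat optimistic because it was used to select the model, which is why the held-out score is reported separately.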

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 6</b>
    
The previous grid search is incomplete because it only optimizes the hyperparameters of the decision tree classifier. Now, replicate the same process but expand its scope to test parameters of all the steps within the pipeline, this time for a regression task with the `CREDIT_SCORE` attribute as the target array. This exercise is open-ended, enabling you to explore any hyperparameters from the scaler, imputer, transformer, encoder, or model components. Do not limit yourself to linear models; instead, employ two diverse models: `XGBRegressor` and `SVR`. Finally, return the estimated generalization capability of both models.
<br><i>[5 points]</i>
</div>

In [None]:
# YOUR CODE HERE