
# Assignment 1: Predicting Customers' Credit Scores

This notebook contains a set of exercises that will guide you through the different steps of this assignment. Solutions must be code-based, _i.e._ hard-coded or manually computed results will not be accepted. Remember to write your solution to each exercise in its dedicated cell, and do not modify or remove the test cells. Once you have completed all the exercises, submit this same notebook to Moodle in **.ipynb** format.
<div class="alert alert-success">

The dataset consists of data about 1000 customers, encompassing 84 features extracted from their financial transactions and current financial status. The main aim is to utilize this dataset for credit risk assessment and forecasting potential defaults.

Included within are two target variables, one designed for classification and the other for regression analysis:

- **DEFAULT**: Binary target variable indicating if the customer has defaulted (1) or not (0)
- **CREDIT_SCORE**: Numerical target variable representing the customer's credit score (integer)

and these features:

- **INCOME**: Total income in the last 12 months
- **SAVINGS**: Total savings in the last 12 months
- **DEBT**: Total existing debt
- **R_SAVINGS_INCOME**: Ratio of savings to income
- **R_DEBT_INCOME**: Ratio of debt to income
- **R_DEBT_SAVINGS**: Ratio of debt to savings

Transactions are categorized into groups (**GROCERIES**, **CLOTHING**, **HOUSING**, **EDUCATION**, **HEALTH**, **TRAVEL**, **ENTERTAINMENT**, **GAMBLING**, **UTILITIES**, **TAX**, **FINES**). For each group, the dataset provides the following features:

- **T_[GROUP]_6**: Total expenditure in that group in the last 6 months
- **T_[GROUP]_12**: Total expenditure in that group in the last 12 months
- **R_[GROUP]**: Ratio of T_[GROUP]_6 to T_[GROUP]_12
- **R_[GROUP]_INCOME**: Ratio of T_[GROUP]_12 to INCOME
- **R_[GROUP]_SAVINGS**: Ratio of T_[GROUP]_12 to SAVINGS
- **R_[GROUP]_DEBT**: Ratio of T_[GROUP]_12 to DEBT

Categorical Features:

- **CAT_GAMBLING**: Gambling category (none, low, high)
- **CAT_DEBT**: 1 if the customer has debt; 0 otherwise
- **CAT_CREDIT_CARD**: 1 if the customer has a credit card; 0 otherwise
- **CAT_MORTGAGE**: 1 if the customer has a mortgage; 0 otherwise
- **CAT_SAVINGS_ACCOUNT**: 1 if the customer has a savings account; 0 otherwise
- **CAT_DEPENDENTS**: 1 if the customer has any dependents; 0 otherwise
- **CAT_LOCATION**: Location (San Francisco, Philadelphia, Los Angeles, etc.)
- **CAT_MARITAL_STATUS**: Marital status (Married, Widowed, Divorced or Single)
- **CAT_EDUCATION**: Level of Education (Postgraduate, College, High School or Graduate)

</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Sunday, February 25th, 23:55</div>


In [None]:
import pandas as pd
import numpy as np
from sklearn import set_config

set_config(transform_output="pandas")

<div class="alert alert-info"><b>Exercise 1</b>

Load the data from the link: https://raw.githubusercontent.com/jnin/information-systems/main/data/AI2_23_24_credit_score.csv in a DataFrame called ```df```. Then, select the following columns:

```INCOME, SAVINGS, DEBT, T_CLOTHING_12, T_CLOTHING_6, R_CLOTHING, R_CLOTHING_INCOME, R_CLOTHING_SAVINGS, R_CLOTHING_DEBT, T_EDUCATION_12, T_EDUCATION_6, R_EDUCATION, R_EDUCATION_INCOME, R_EDUCATION_SAVINGS, R_EDUCATION_DEBT, T_GROCERIES_12, T_GROCERIES_6, R_GROCERIES, R_GROCERIES_INCOME, R_GROCERIES_SAVINGS, R_GROCERIES_DEBT, T_HEALTH_12, T_HEALTH_6, R_HEALTH, R_HEALTH_INCOME, R_HEALTH_SAVINGS, R_HEALTH_DEBT, T_HOUSING_12, T_HOUSING_6, R_HOUSING, R_HOUSING_INCOME, R_HOUSING_SAVINGS, R_HOUSING_DEBT, CAT_GAMBLING, CAT_DEBT, CAT_CREDIT_CARD, CAT_MORTGAGE, CAT_SAVINGS_ACCOUNT, CAT_DEPENDENTS, CREDIT_SCORE, DEFAULT```
<br><i>[0.25 points]</i>
</div>
<div class="alert alert-warning">
    
Remember, Python is case-sensitive. The resulting DataFrame ```df``` should contain 41 columns.  

Do not download the dataset. Instead, read the data directly from the provided link.

</div>
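
As a minimal sketch of the loading-and-selecting pattern (not the solution itself), the snippet below reads a small CSV and keeps a subset of its columns. The in-memory buffer, the toy column values, and the names `csv_source`, `toy_df`, and `selected_columns` are assumptions made for illustration; `pd.read_csv` accepts a URL string in the same position as the buffer.

```python
import io

import pandas as pd

# Hypothetical stand-in for a remote CSV file: pd.read_csv accepts a URL
# string in exactly the same position as this in-memory buffer.
csv_source = io.StringIO(
    "INCOME,SAVINGS,DEBT,CAT_LOCATION,DEFAULT\n"
    "1000,200,50,CityA,0\n"
    "2500,0,900,CityB,1\n"
)

toy_df = pd.read_csv(csv_source)

# Column names are case-sensitive: asking for "income" would raise a KeyError.
selected_columns = ["INCOME", "DEBT", "DEFAULT"]
toy_df = toy_df[selected_columns]

print(toy_df.shape)
```

Selecting with a list of names returns a new DataFrame whose columns follow the order of the list.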

In [None]:
# YOUR CODE HERE

In [None]:
#LEAVE BLANK

<div class="alert alert-info"><b>Exercise 2</b> 
    
Write the code to create the feature matrix ```X``` and the target array ```y``` (```DEFAULT```), then split them into separate training and test sets with relative sizes of 0.75 and 0.25. Store the training and test feature matrices in variables called ```X_train``` and ```X_test```, and the training and test label arrays in ```y_train``` and ```y_test```.
<br><i>[0.75 points]</i>
</div>

<div class="alert alert-warning">
    
Don't forget to remove the column ```CREDIT_SCORE``` from ```X```. This variable serves as the target array for exercises 6 and 7.

</div>
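
The following sketch illustrates the general pattern on a toy DataFrame: drop every target column from the features, then split. All names prefixed with `toy_`, the toy columns, and the choice of `random_state=0` are assumptions for illustration only; the assignment does not prescribe a seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with two feature columns and two target-like columns.
toy_df = pd.DataFrame({
    "feature_a": range(8),
    "feature_b": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
    "other_target": [10, 20, 30, 40, 50, 60, 70, 80],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Both targets are removed from the features; leaving one in would leak
# information about the outcome into the model.
toy_X = toy_df.drop(columns=["label", "other_target"])
toy_y = toy_df["label"]

# test_size=0.25 keeps 75% of the rows for training.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

print(toy_X_train.shape, toy_X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.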

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3 </b> 

    
The resulting feature matrix contains a categorical variable (```CAT_GAMBLING```). Write the code to create a ```ColumnTransformer``` to encode it using the one-hot encoding method. Store the transformer in a variable called ```transformer```. At this stage, you do not need to run it.
<br><i>[1 point]</i>
</div>

<div class='alert alert-warning'>

Ensure that the rest of the attributes remain intact.
</div>

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4 </b> 

Some of the attributes contain missing values. Write the code to create a ```Pipeline``` consisting of a ```SimpleImputer``` with the most frequent strategy, the previous transformer, a standard scaler, and a logistic regression model. Store the resulting pipeline in a variable called ```pipe```.
<br><i>[2 points]</i>
</div>

<div class='alert alert-warning'>

Be sure you apply the data transformations in the correct order.

For the sake of simplicity, use the same `SimpleImputer` for all attributes, regardless of whether they have missing values or whether another imputer might be more suitable.
</div>

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5 </b> 
    
Write the code to store the achieved accuracy, recall, precision, and F1 score in four variables called ```accuracy```, ```recall```, ```precision```, and ```f1```, respectively.
<br><i>[1 point]</i>
</div>

<div class='alert alert-warning'>

Be sure you use the train and test datasets correctly.
</div>
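
As a quick reminder of how these four metrics are computed with `sklearn.metrics`, the sketch below uses hand-written toy labels (the `toy_` names and values are made up for illustration). In practice, the predictions would come from a model fitted on the training set and evaluated on the test set.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Toy ground truth and predictions: 3 TP, 3 TN, 1 FP, 1 FN.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)    # (TP + TN) / total
toy_recall = recall_score(toy_true, toy_pred)        # TP / (TP + FN)
toy_precision = precision_score(toy_true, toy_pred)  # TP / (TP + FP)
toy_f1 = f1_score(toy_true, toy_pred)                # harmonic mean of the two

print(toy_accuracy, toy_recall, toy_precision, toy_f1)
```

All four scorers take the true labels as the first argument and the predicted labels as the second.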

In [None]:
# YOUR CODE HERE

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

In [None]:
# LEAVE BLANK

<div class="alert alert-info"><b>Execute the following cell </b>
    
As explained in the S2 video, logistic regression, like all linear models, classifies a sample using a weighted sum of its features, so the learned weights indicate feature importance. Execute the following cell to visualize the weights and see which attributes contribute most to the decision.
</div>

In [None]:
import plotly.express as px

coef = pd.Series(pipe[-1].coef_.ravel(), index=pipe[-1].feature_names_in_).sort_values()
fig = px.bar(x=coef.index, y=coef.values, labels={'x': 'Feature', 'y': 'Coefficient'}, title='Feature importance')
fig.show()

<div class="alert alert-info"><b>Exercise 6 </b> 
    
The preceding exercises were focused on solving a classification task. Now, let's repeat the same process but this time utilize the `CREDIT_SCORE` target variable to address a regression task. This exercise is open-ended, allowing you to employ any scaler, imputer, transformer, or encoder of your choice. The only requirement is to train a linear regression model using the same columns used in the previous exercises.
<br><i>[4 points]</i>
</div>
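
The general shape of a regression pipeline can be sketched on toy data as shown below. This is only an illustration under stated assumptions: the toy features, the exact linear target, the step names, and the chosen metrics are hypothetical, and the model is scored on its own training data purely for brevity. A real evaluation should use a held-out test set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
})
# Noise-free linear target, so a linear model can fit it exactly.
toy_y = 3.0 * toy_X["x1"] - 2.0 * toy_X["x2"] + 10.0

toy_reg = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
toy_reg.fit(toy_X, toy_y)

toy_reg_pred = toy_reg.predict(toy_X)
toy_r2 = r2_score(toy_y, toy_reg_pred)
toy_mae = mean_absolute_error(toy_y, toy_reg_pred)
print(toy_r2, toy_mae)
```

Classification metrics such as accuracy do not apply to a continuous target; regression metrics such as R² or mean absolute error are the usual replacements.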



In [None]:
# YOUR CODE HERE

<div class="alert alert-info"><b>Exercise 7 </b> 

Let's now compare both models. Execute the following cell to visualize the new weights, then provide insightful observations about how they differ from those of the classification model. Make sure your insights are well supported; unsubstantiated descriptions will receive 0 points.
<br><i>[1 point]</i>
</div>

In [None]:
import plotly.express as px

coef = pd.Series(pipe[-1].coef_.ravel(), index=pipe[-1].feature_names_in_).sort_values()
fig = px.bar(x=coef.index, y=coef.values, labels={'x': 'Feature', 'y': 'Coefficient'}, title='Feature importance')
fig.show()

In [None]:
# Elaborate your justified explanation in this cell
# YOUR CODE HERE