In [139]:
from sklearn.datasets import fetch_openml

In [140]:
X, y = fetch_openml('credit-g', version=2, as_frame=False, return_X_y=True)

In [141]:
df = fetch_openml('credit-g', version=2, as_frame=True)#, return_X_y=True)

In [142]:
df.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [143]:
print(df['DESCR'])

**Author**: Dr. Hans Hofmann  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) - 1994    
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

**German Credit dataset**  
This dataset classifies people described by a set of attributes as good or bad credit risks.

This dataset comes with a cost matrix: 
``` 
Good  Bad (predicted)  
Good   0    1   (actual)  
Bad    5    0  
```

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).  

### Attribute description  

1. Status of existing checking account, in Deutsche Mark.  
2. Duration in months  
3. Credit history (credits taken, paid back duly, delays, critical accounts)  
4. Purpose of the credit (car, television,...)  
5. Credit amount  
6. Status of savings account/bonds, in Deutsche Mark.  
7. Present employment, in number of years.  
8. Installment rate in percentage of disposable income  
9. Perso

In [144]:
df['data']['purpose'].value_counts()

purpose
radio/tv               280
new car                234
furniture/equipment    181
used car               103
business                97
education               50
repairs                 22
domestic appliance      12
other                   12
retraining               9
Name: count, dtype: int64

In [145]:
df['data']

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,residence_since,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker
0,<0,6,critical/other existing credit,radio/tv,1169.0,no known savings,>=7,4,male single,none,4,real estate,67,none,own,2,skilled,1,yes,yes
1,0<=X<200,48,existing paid,radio/tv,5951.0,<100,1<=X<4,2,female div/dep/mar,none,2,real estate,22,none,own,1,skilled,1,none,yes
2,no checking,12,critical/other existing credit,education,2096.0,<100,4<=X<7,2,male single,none,3,real estate,49,none,own,1,unskilled resident,2,none,yes
3,<0,42,existing paid,furniture/equipment,7882.0,<100,4<=X<7,2,male single,guarantor,4,life insurance,45,none,for free,1,skilled,2,none,yes
4,<0,24,delayed previously,new car,4870.0,<100,1<=X<4,3,male single,none,4,no known property,53,none,for free,2,skilled,2,none,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,no checking,12,existing paid,furniture/equipment,1736.0,<100,4<=X<7,3,female div/dep/mar,none,4,real estate,31,none,own,1,unskilled resident,1,none,yes
996,<0,30,existing paid,used car,3857.0,<100,1<=X<4,4,male div/sep,none,4,life insurance,40,none,own,1,high qualif/self emp/mgmt,1,yes,yes
997,no checking,12,existing paid,radio/tv,804.0,<100,>=7,4,male single,none,4,car,38,none,own,1,skilled,1,none,yes
998,<0,45,existing paid,radio/tv,1845.0,<100,1<=X<4,4,male single,none,4,no known property,23,none,for free,1,skilled,1,yes,yes


In [146]:
df['data']['other_parties'].value_counts()

other_parties
none            907
guarantor        52
co applicant     41
Name: count, dtype: int64

In [147]:
from sklearn.ensemble import RandomForestClassifier

In [148]:
rndclf = RandomForestClassifier()

In [149]:
from sklearn.model_selection import train_test_split

In [150]:
X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)

## Work on dataframe

The actual data points can't be used directly and needs to be feature engineered. So I'll be using the `frame` content. It includes all data including the `target` column. After cleaning up, I can move it to a different column.

These columns needs to be corrected:

* credit_history, purpose, personal_status, other_parties, property_magnitude, other_payment_plans, housing, job - one hot enc
* checking_status, savings_status, employment,  - binning
* num_dependents - convert type to `int`
* own_telephone, foreign_worker, class - binary

In [261]:
import pandas as pd
credit = df['frame']

In [262]:
# one-hot encoding
for col in ['credit_history', 'purpose', 'personal_status', 'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job']:
    one_hot = pd.get_dummies(credit[col])
    credit = pd.concat(
        [credit.drop(col,axis=1),
        one_hot],
        axis=1
    )

This naive approach failed. Some of the values were `yes` and `no`. So I now have multiple columns with `yes` and `no` making it meaningless. I need to work on data a bit more.

Initially I had `own_telephone` and `foreign_worker` as one-hot encoding. I had to move it to binary category. After that, the data became cleaner.

In [263]:
# binning
credit['checking_status'].unique()

['<0', '0<=X<200', 'no checking', '>=200']
Categories (4, object): ['no checking', '0<=X<200', '<0', '>=200']

In [264]:
checking_bin_mapping = {
    'no checking': 0,
    '<0': 1,
    '0<=X<200': 2,
    '>=200': 3    
}

In [265]:
savings_bin_mapping = {
    'no known savings': 0,
    '<100': 1,
    '100<=X<500': 3,
    '500<=X<1000': 4,
    '>=1000': 5
}

In [266]:
employment_bin_mapping = {
    'unemployed': 0,
    '<1': 1,
    '1<=X<4': 2,
    '4<=X<7': 3,
    '>=7': 4    
}

In [267]:
def create_binning(data, bin_mapping):
    return [bin_mapping[x] for x in data]

In [268]:
credit['checking_bin'] = create_binning(credit['checking_status'], checking_bin_mapping)
credit['savings_bin'] = create_binning(credit['savings_status'], savings_bin_mapping)
credit['employment_bin'] =  create_binning(credit['employment'], employment_bin_mapping)

In [269]:
# drop the old columns
credit.drop(['checking_status', 'savings_status', 'employment'], axis=1, inplace=True)

In [270]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 49 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   duration                        1000 non-null   int64   
 1   credit_amount                   1000 non-null   float64 
 2   installment_commitment          1000 non-null   int64   
 3   residence_since                 1000 non-null   int64   
 4   age                             1000 non-null   int64   
 5   existing_credits                1000 non-null   int64   
 6   num_dependents                  1000 non-null   int64   
 7   own_telephone                   1000 non-null   category
 8   foreign_worker                  1000 non-null   category
 9   class                           1000 non-null   category
 10  all paid                        1000 non-null   bool    
 11  critical/other existing credit  1000 non-null   bool    
 12  delayed previously   

At this point, there are 3 columns - `class`, `own_telephone` and `foreign_worker` which are of type `category`. These are boolean values. So we need to convert it to bool.

In [271]:
credit['num_dependents'].unique()

array([1, 2])

In [272]:
credit['own_telephone'].unique()

['yes', 'none']
Categories (2, object): ['none', 'yes']

In [273]:
credit['foreign_worker'].unique()

['yes', 'no']
Categories (2, object): ['no', 'yes']

In [274]:
credit['class'].unique()

['good', 'bad']
Categories (2, object): ['bad', 'good']

In [275]:
# convert to int
credit['num_dependents'] = credit['num_dependents'].astype('int32')

In [276]:
# convert to bool
credit['own_telephone'] = credit['own_telephone'].map({ 'yes': True, 'none': False})
credit['own_telephone'] = credit['own_telephone'].astype('bool')

In [277]:
# convert to bool
credit['foreign_worker'] = credit['foreign_worker'].map({ 'yes': True, 'no': False})
credit['foreign_worker'] = credit['foreign_worker'].astype('bool')

In [278]:
# convert to bool
credit['class'] = credit['class'].map({ 'good': True, 'bad': False})
credit['class'] = credit['class'].astype('bool')

In [279]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 49 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   duration                        1000 non-null   int64  
 1   credit_amount                   1000 non-null   float64
 2   installment_commitment          1000 non-null   int64  
 3   residence_since                 1000 non-null   int64  
 4   age                             1000 non-null   int64  
 5   existing_credits                1000 non-null   int64  
 6   num_dependents                  1000 non-null   int32  
 7   own_telephone                   1000 non-null   bool   
 8   foreign_worker                  1000 non-null   bool   
 9   class                           1000 non-null   bool   
 10  all paid                        1000 non-null   bool   
 11  critical/other existing credit  1000 non-null   bool   
 12  delayed previously              100

We now have a clean dataset. Now, convert this to an `ndarray` so that it can used for training.

In [280]:
X = credit.drop('class', axis=1).values

In [281]:
y = credit['class'].values

In [282]:
# now retry the randomforest classifier

In [283]:
X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)

In [284]:
rndclf.fit(X_train, y_train)

In [285]:
preds = rndclf.predict(X_test)

In [286]:
# let's get the quality of our predictions
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                           f1_score, confusion_matrix, classification_report,
                           roc_auc_score, precision_recall_curve)

In [287]:
print(f"Precision: {precision_score(y_test, y_pred=preds)}\nRecall: {recall_score(y_test, y_pred=preds)}")

Precision: 0.7634408602150538
Recall: 0.9301310043668122


In [288]:
# let's see how each column helps.
import numpy as np
np.round(credit.corr()['class'], 3)

duration                         -0.215
credit_amount                    -0.155
installment_commitment           -0.072
residence_since                  -0.003
age                               0.091
existing_credits                  0.046
num_dependents                    0.003
own_telephone                     0.036
foreign_worker                   -0.082
class                             1.000
all paid                         -0.134
critical/other existing credit    0.182
delayed previously               -0.012
existing paid                    -0.044
no credits/all paid              -0.145
domestic appliance               -0.008
new car                          -0.097
used car                          0.100
business                         -0.036
education                        -0.070
furniture/equipment              -0.021
other                            -0.028
radio/tv                          0.107
repairs                          -0.021
retraining                        0.039


Precision went down after _cleaning_ up the data. So this is not a good idea. 

In [289]:
credit['class'].value_counts()

class
True     700
False    300
Name: count, dtype: int64

In [290]:
pd.Series(y_train).value_counts()

True     471
False    199
Name: count, dtype: int64

In [291]:
pd.Series(y_test).value_counts()

True     229
False    101
Name: count, dtype: int64

Perhaps there is a class imbalance. Maybe we should oversample on "bad" credits? Let's try to use SMOTE

In [292]:
!pip install -q imblearn

In [293]:
from imblearn.over_sampling import SMOTE

In [294]:
# Create the SMOTE object
smote = SMOTE(random_state=42)

# Fit and apply SMOTE
# X is your feature matrix, y is your target (credit column)
X_resampled, y_resampled = smote.fit_resample(X, y)

In [295]:
# Check the new class distribution
print("Original class distribution:")
print(pd.Series(y).value_counts())
print("\nResampled class distribution:")
print(pd.Series(y_resampled).value_counts())

Original class distribution:
True     700
False    300
Name: count, dtype: int64

Resampled class distribution:
True     700
False    700
Name: count, dtype: int64


In [296]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.33, random_state=42)
rndclf.fit(X_train, y_train)
preds = rndclf.predict(X_test)
print(f"Precision: {precision_score(y_test, y_pred=preds)}\nRecall: {recall_score(y_test, y_pred=preds)}")

Precision: 0.806949806949807
Recall: 0.8744769874476988


Precision has increased from 76% to over 80% now. That is good progress. Let's see if tuning the hyperparameters can have an effect.

In [299]:
rndclf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [322]:
def fit_and_train(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    rndclf.fit(X_train, y_train)
    preds = rndclf.predict(X_test)
    print(f"Precision: {np.round(precision_score(y_test, y_pred=preds),4)}\nRecall: {np.round(recall_score(y_test, y_pred=preds),4)}\nF1 Score: {np.round(f1_score(y_test, y_pred=preds),4)}")

In [323]:
rndclf = RandomForestClassifier(n_estimators=200)
fit_and_train(rndclf, X_resampled, y_resampled)

Precision: 0.8
Recall: 0.887
F1 Score: 0.8413


Looks like just increasing the number of estimators is not helping. Let's prune the tree.

In [324]:
for max_dept in range(5,45,5):
    print(f"------ Max Depth: {max_dept} ------")
    rndclf = RandomForestClassifier(max_depth=10)
    fit_and_train(rndclf, X_resampled, y_resampled)

------ Max Depth: 5 ------
Precision: 0.8068
Recall: 0.8912
F1 Score: 0.8469
------ Max Depth: 10 ------
Precision: 0.7947
Recall: 0.8745
F1 Score: 0.8327
------ Max Depth: 15 ------
Precision: 0.7984
Recall: 0.8619
F1 Score: 0.829
------ Max Depth: 20 ------
Precision: 0.7909
Recall: 0.8703
F1 Score: 0.8287
------ Max Depth: 25 ------
Precision: 0.8106
Recall: 0.8954
F1 Score: 0.8509
------ Max Depth: 30 ------
Precision: 0.8135
Recall: 0.8577
F1 Score: 0.835
------ Max Depth: 35 ------
Precision: 0.8023
Recall: 0.8828
F1 Score: 0.8406
------ Max Depth: 40 ------
Precision: 0.7926
Recall: 0.8954
F1 Score: 0.8409


Looks like `max_depth` of 25 with `n_estimators=200 (default)` is the best hyper parameters using RandomForestClassifier and SMOTE over sampling.

## Feature Engineering
`RandomForestClassifier` is normally used as a feature engineering tool. It can show the features that contribute most to the decision. So let's expore and see if that does any better than simply using the correlation operator.

In [328]:
len(rndclf.feature_importances_)

48

In [330]:
credit.columns

Index(['duration', 'credit_amount', 'installment_commitment',
       'residence_since', 'age', 'existing_credits', 'num_dependents',
       'own_telephone', 'foreign_worker', 'class', 'all paid',
       'critical/other existing credit', 'delayed previously', 'existing paid',
       'no credits/all paid', 'domestic appliance', 'new car', 'used car',
       'business', 'education', 'furniture/equipment', 'other', 'radio/tv',
       'repairs', 'retraining', 'female div/dep/mar', 'male div/sep',
       'male mar/wid', 'male single', 'co applicant', 'guarantor', 'none',
       'life insurance', 'no known property', 'real estate', 'car', 'bank',
       'none', 'stores', 'for free', 'own', 'rent',
       'high qualif/self emp/mgmt', 'unemp/unskilled non res',
       'unskilled resident', 'skilled', 'checking_bin', 'savings_bin',
       'employment_bin'],
      dtype='object')

In [332]:
columns = list(credit.columns)


In [336]:
columns.pop(9)

'class'

In [338]:
len(columns)

48

In [340]:
for (col, val) in zip(columns, rndclf.feature_importances_):
    print( col, np.round(val,2))

duration 0.06374912258057969
credit_amount 0.05480287034327893
installment_commitment 0.03288516466304427
residence_since 0.03578878360421085
age 0.04778140032778424
existing_credits 0.01974767792784613
num_dependents 0.008644630973581634
own_telephone 0.016655301521635975
foreign_worker 0.002832819475428355
all paid 0.009145029641835632
critical/other existing credit 0.055192423026459464
delayed previously 0.011683455358985519
existing paid 0.032069017817403817
no credits/all paid 0.0066522785562702635
domestic appliance 0.0010629877597333399
new car 0.027295585686189446
used car 0.014125000811842729
business 0.009445832655943077
education 0.005604771869410638
furniture/equipment 0.011061458581972366
other 0.001083188061993686
radio/tv 0.035060319112146646
repairs 0.003403854422822285
retraining 0.0006140350033514953
female div/dep/mar 0.01857296578843669
male div/sep 0.0061168671368218375
male mar/wid 0.005258570354611371
male single 0.01758026615148396
co applicant 0.005005626592808

In [343]:
feature_importance = sorted(zip(columns, rndclf.feature_importances_), 
                          key=lambda x: x[1], 
                          reverse=True)

# Print the sorted features and their importance scores
for feature, importance in feature_importance:
    print(f"{feature:<30} {np.round(importance, 3):>6}")

checking_bin                    0.135
duration                        0.064
critical/other existing credit  0.055
credit_amount                   0.055
age                             0.048
savings_bin                     0.041
real estate                     0.038
employment_bin                  0.037
residence_since                 0.036
radio/tv                        0.035
installment_commitment          0.033
existing paid                   0.032
own                              0.03
new car                         0.027
none                             0.02
existing_credits                 0.02
female div/dep/mar              0.019
rent                            0.018
male single                     0.018
skilled                         0.017
own_telephone                   0.017
bank                            0.017
car                             0.015
used car                        0.014
for free                        0.012
delayed previously              0.012
no known pro

Looks like these 5 features are not that important:
* repairs
* foreign_worker
* unemp/unskilled non res
* other
* domestic appliance
* retraining

Let's see if dropping them will improve the performance.


In [344]:
credit.drop(['repairs', 'foreign_worker', 'unemp/unskilled non res', 'other', 'domestic appliance', 'retraining'], axis=1, inplace=True)

In [345]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Create the SMOTE object
smote = SMOTE(random_state=42)
# Fit and apply SMOTE
# X is your feature matrix, y is your target (credit column)
X_resampled, y_resampled = smote.fit_resample(X, y)

for max_dept in range(5,45,5):
    print(f"------ Max Depth: {max_dept} ------")
    rndclf = RandomForestClassifier(max_depth=10)
    fit_and_train(rndclf, X_resampled, y_resampled)

------ Max Depth: 5 ------
Precision: 0.7849
Recall: 0.8703
F1 Score: 0.8254
------ Max Depth: 10 ------
Precision: 0.8287
Recall: 0.8703
F1 Score: 0.849
------ Max Depth: 15 ------
Precision: 0.8269
Recall: 0.8996
F1 Score: 0.8617
------ Max Depth: 20 ------
Precision: 0.8053
Recall: 0.8828
F1 Score: 0.8423
------ Max Depth: 25 ------
Precision: 0.7901
Recall: 0.8661
F1 Score: 0.8263
------ Max Depth: 30 ------
Precision: 0.8137
Recall: 0.8954
F1 Score: 0.8526
------ Max Depth: 35 ------
Precision: 0.8075
Recall: 0.8954
F1 Score: 0.8492
------ Max Depth: 40 ------
Precision: 0.8086
Recall: 0.8661
F1 Score: 0.8364


In the previous case, this was the best result:

```
------ Max Depth: 25 ------
Precision: 0.8106
Recall: 0.8954
F1 Score: 0.8509
```

With the feature engineered data set, the results got better at an even lower depth of just 15.

```
------ Max Depth: 15 ------
Precision: 0.8269
Recall: 0.8996
F1 Score: 0.8617
```

At this point, I think we've done enough feature engineering. Now is the time to move on from RandomForest and see if changing algorigthm can provide a better performance.

### XGBoost

In the AWS Exams, I felt that XGBoost was given a lot of importance. It was considered as a decent enough algorithm to help in a lot of classification tasks. So let's start with it.

In [346]:
!pip install --upgrade -q pip

In [352]:
!pip uninstall -q xgboost -y

In [353]:
!pip install -q xgboost

In [354]:
import xgboost as xgb
print(xgb.__version__)

2.1.3


In [376]:
# Convert data into DMatrix
dtrain = xgb.DMatrix(X_resampled, label=y_resampled)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters
params = {
    "objective": "binary:logistic",
    "max_depth": 15,
    "eta": 0.1,
    "eval_metric": "logloss",
}

# Train model
num_round = 57
bst = xgb.train(params, dtrain, num_round)

In [377]:
# Predict
preds = bst.predict(dtest)
predictions = [1 if p > 0.5 else 0 for p in preds]

# Evaluate
print(f"Precision: {np.round(precision_score(y_test, y_pred=predictions),4)}\nRecall: {np.round(recall_score(y_test, y_pred=predictions),4)}\nF1 Score: {np.round(f1_score(y_test, y_pred=predictions),4)}")
print("Accuracy:", accuracy)

Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Accuracy: 1.0


This is amazing! At 57 rounds and  depth of 15, this algorithm gives me a perfect score! There is no point in trying a neural network as this algorithm is blazingly fast.

## Conclusion

In this exercise, my objective was to take a new data set that I had never seen before. I wanted to go through the steps of data cleaning, encoding and feature engineering to arrive at a dataset that I could use for training. After running a training using RandomForest, I proceeded to enhance the data set using `SMOTE`. The purpose was to synthetically add more samples for the case when the credit was rejected. After this, I dropped some additional columns that `RandomForest` detected as features that contributed the least to predictions. After dropping them, I used `XGBoost`. I set the depth to same level as the best performing depth of `RandomForest`. With a few rounds of training, the algorithm reached a perfect F1-accuracy. 