# Students Do: Credit Risk Classification with Amazon SageMaker

* **Dataset:** German Credit Risk Dataset - Prof. Dr. Hans Hofmann (original source: [ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)), download: [Kaggle](https://www.kaggle.com/uciml/german-credit))

* **Goal:** Classify the credit risk of a person as described by a given set of input features.

**Note:** Import this notebook into your notebook instance on Amazon SageMaker and run it there.

In [1]:
# Initial imports
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt

%matplotlib inline

## Loading the data

In [2]:
# Upload the german_credit_data.csv file (located in ../Resources/) through JupyterLab
file_path = Path("Data/german_credit_data.csv")
df = pd.read_csv(file_path, index_col=0)
df.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good
3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,53,male,2,free,little,little,4870,24,car,bad


In [3]:
# Creating a DataFrame with the features (include all columns except "Risk")
features_df = df.drop("Risk", axis=1)
features_df.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,67,male,2,own,,little,1169,6,radio/TV
1,22,female,2,own,little,moderate,5951,48,radio/TV
2,49,male,1,own,little,,2096,12,education
3,45,male,2,free,little,little,7882,42,furniture/equipment
4,53,male,2,free,little,little,4870,24,car


In [4]:
# Creating a DataFrame with the target data (The "Risk" column)
target_df = pd.DataFrame(df["Risk"])
target_df.head()

Unnamed: 0,Risk
0,good
1,bad
2,good
3,good
4,bad


## Data Preparation

A logistic regression model will be trained using all the input features.

* `X` is the predictor variable vector with the values of all features.
* `Y` is the target variable vector with the risk result.

In [5]:
# Encode the categorical features (One-hot encode)
features_enc = pd.get_dummies(features_df)
features_enc.head()

Unnamed: 0,Age,Job,Credit amount,Duration,Sex_female,Sex_male,Housing_free,Housing_own,Housing_rent,Saving accounts_little,...,Checking account_moderate,Checking account_rich,Purpose_business,Purpose_car,Purpose_domestic appliances,Purpose_education,Purpose_furniture/equipment,Purpose_radio/TV,Purpose_repairs,Purpose_vacation/others
0,67,2,1169,6,0,1,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
1,22,2,5951,48,1,0,0,1,0,1,...,1,0,0,0,0,0,0,1,0,0
2,49,1,2096,12,0,1,0,1,0,1,...,0,0,0,0,0,1,0,0,0,0
3,45,2,7882,42,0,1,1,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,53,2,4870,24,0,1,1,0,0,1,...,0,0,0,1,0,0,0,0,0,0
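
The raw data contains missing values in `Saving accounts` and `Checking account`. The following minimal sketch, which uses a hypothetical three-row sample rather than the real dataset, illustrates how `pd.get_dummies` treats them by default: a missing value produces no dummy column of its own, so that row simply has zeros in every column derived from the feature. The `dtype=int` argument is an assumption added here so that the output shows 0/1 values on recent pandas versions as well.

```python
import pandas as pd

# Hypothetical sample mirroring two categorical columns of the credit data
sample = pd.DataFrame(
    {
        "Housing": ["own", "free", "rent"],
        "Saving accounts": [None, "little", "rich"],  # first row is missing
    }
)

# One-hot encode; missing values are ignored unless dummy_na=True is passed
encoded = pd.get_dummies(sample, dtype=int)
print(encoded)

# The row with the missing value has zeros in all "Saving accounts_*" columns
missing_row = encoded.filter(like="Saving accounts").iloc[0]
print(int(missing_row.sum()))
```

If missing values should be modelled explicitly, passing `dummy_na=True` adds a separate indicator column per feature.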


In [6]:
# ignore DataConversionWarning messages
import warnings
from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings(action="ignore", category=DataConversionWarning)

In [7]:
# Scale the features
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(features_enc)
scaler
print(scaler.mean_[:5])
print(scaler.scale_[:5])

[3.554600e+01 1.904000e+00 3.271258e+03 2.090300e+01 3.100000e-01]
[1.13697794e+01 6.53287073e-01 2.82132515e+03 1.20527835e+01
 4.62493243e-01]


In [8]:
features_enc_scaled = scaler.transform(features_enc)
features_enc_scaled

array([[ 2.76645648,  0.14694918, -0.74513141, ...,  1.60356745,
        -0.14998296, -0.11020775],
       [-1.19140394,  0.14694918,  0.94981679, ...,  1.60356745,
        -0.14998296, -0.11020775],
       [ 1.18331231, -1.38377145, -0.41656241, ..., -0.62360956,
        -0.14998296, -0.11020775],
       ...,
       [ 0.21583532,  0.14694918, -0.87450324, ...,  1.60356745,
        -0.14998296, -0.11020775],
       [-1.10345149,  0.14694918, -0.50552769, ...,  1.60356745,
        -0.14998296, -0.11020775],
       [-0.75164167,  0.14694918,  0.46245715, ..., -0.62360956,
        -0.14998296, -0.11020775]])
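
As a quick check of what `StandardScaler` computes, the sketch below applies z = (x - mean) / scale by hand. It uses the means and scales printed above for the first two columns (`Age` and `Job`) and the first row of the raw data (Age 67, Job 2). The variable names are illustrative only.

```python
# Means and scales copied from the printed scaler.mean_ and scaler.scale_ values
age_mean, age_scale = 35.546, 11.3697794
job_mean, job_scale = 1.904, 0.653287073

# First row of the raw data: Age = 67, Job = 2
age_z = (67 - age_mean) / age_scale
job_z = (2 - job_mean) / job_scale

print(round(age_z, 4), round(job_z, 4))
```

The two results agree with the first two entries of the scaled array above (2.76645648 and 0.14694918) to about four decimal places.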

In [9]:
target_enc = pd.get_dummies(target_df)
target_enc.head()

Unnamed: 0,Risk_bad,Risk_good
0,0,1
1,1,0
2,0,1
3,0,1
4,1,0


In [10]:
X = features_enc_scaled
Y = target_enc["Risk_bad"].values.reshape(-1)  # 0 = Good, 1 = Bad Risk
print(X[:5])
print(Y[:5])

[[ 2.76645648  0.14694918 -0.74513141 -1.23647786 -0.67028006  0.67028006
  -0.3479601   0.63444822 -0.4669334  -1.23243324 -0.33886163 -0.25929878
  -0.22454436  1.62776996 -0.60662102 -0.25929878 -0.32774947 -0.71294854
  -0.11020775 -0.2503982  -0.47010767  1.60356745 -0.14998296 -0.11020775]
 [-1.19140394  0.14694918  0.94981679  2.24819436  1.49191369 -1.49191369
  -0.3479601   0.63444822 -0.4669334   0.81140298 -0.33886163 -0.25929878
  -0.22454436 -0.61433742  1.6484757  -0.25929878 -0.32774947 -0.71294854
  -0.11020775 -0.2503982  -0.47010767  1.60356745 -0.14998296 -0.11020775]
 [ 1.18331231 -1.38377145 -0.41656241 -0.73866754 -0.67028006  0.67028006
  -0.3479601   0.63444822 -0.4669334   0.81140298 -0.33886163 -0.25929878
  -0.22454436 -0.61433742 -0.60662102 -0.25929878 -0.32774947 -0.71294854
  -0.11020775  3.99363901 -0.47010767 -0.62360956 -0.14998296 -0.11020775]
 [ 0.8315025   0.14694918  1.63424694  1.75038404 -0.67028006  0.67028006
   2.8738927  -1.57617276 -0.466933

In [11]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
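
The split above is random, so the exact results change between runs, and the proportion of bad-risk cases in each set can drift. A possible variant, sketched here on hypothetical stand-in data (`X_demo` and `y_demo` are not part of the notebook), fixes the seed with `random_state` and keeps the class ratio equal in both sets with `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 30% in the positive ("bad") class
X_demo = np.arange(2000, dtype="float32").reshape(1000, 2)
y_demo = np.array([1] * 300 + [0] * 700)

# random_state makes the split reproducible; stratify preserves the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

A stricter workflow would also fit the scaler on the training portion only, so that no statistics from the test set influence the features.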

## Machine Learning Model Creation

In [12]:
bucket = "sagemaker-20200312-1736"  #'<your_s3_bucket_name_here>'
prefix = "german-credit-risk"

# Amazon SageMaker and related imports
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker import get_execution_role
import boto3  # AWS Python sdk

import os
import io
import json

# AWS IAM role
role = get_execution_role()

### Uploading Training Data to Amazon S3

To train your machine learning model with Amazon SageMaker, the training data should be passed through an Amazon S3 bucket in the [protobuf recordIO format](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#td-serialization).

The protobuf recordIO format is a method to serialize structured data (similar to `JSON`), so that different applications can exchange or store it.

Using the protobuf recordIO format allows you to take advantage of _Pipe mode_ when training the algorithms that support it. In _Pipe mode_, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput.

The following code encodes the training data as a Protocol Buffer and then uploads it to the Amazon S3 bucket.

In [13]:
# Encode the training data as Protocol Buffer
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(
    buf, np.array(X_train).astype("float32"), np.array(Y_train).astype("float32")
)
buf.seek(0)

# Upload encoded training data to Amazon S3
key = "train.data"
boto3.resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train", key)
).upload_fileobj(buf)
s3_train_data = "s3://{}/{}/train/{}".format(bucket, prefix, key)
print("Training data uploaded to: {}".format(s3_train_data))

Training data uploaded to: s3://sagemaker-20200312-1736/german-credit-risk/train/train.data


#### Upload Test Data to Amazon S3

If you provide test data, the algorithm logs include the test score for the final model.

In [14]:
# Encode the testing data as Protocol Buffer
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(
    buf, np.array(X_test).astype("float32"), np.array(Y_test).astype("float32")
)
buf.seek(0)

# Upload encoded testing data to Amazon S3
key = "test.data"
boto3.resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "test", key)
).upload_fileobj(buf)
s3_test_data = "s3://{}/{}/test/{}".format(bucket, prefix, key)
print("Testing data uploaded to: {}".format(s3_test_data))

Testing data uploaded to: s3://sagemaker-20200312-1736/german-credit-risk/test/test.data


### Training the Machine Learning Model

Once you have uploaded your data to Amazon S3, it's time to train the machine learning model. In this activity, you will use Amazon SageMaker's [_linear learner algorithm_](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) to train a binary classification model.
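
To make the idea of a linear binary classifier concrete, here is a minimal sketch of how a logistic model turns a feature vector into a score and a label. The weights, bias, and the 0.5 threshold are hypothetical values chosen for illustration; the trained linear learner model determines its own parameters and may select a different threshold.

```python
import math


def predict_label(features, weights, bias, threshold=0.5):
    """Return (score, label) for one sample using a logistic model."""
    # Linear combination of the inputs
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # The sigmoid squashes z into a score between 0 and 1
    score = 1.0 / (1.0 + math.exp(-z))
    # Scores above the threshold are labelled 1 (bad risk)
    return score, int(score > threshold)


# Hypothetical two-feature example
score, label = predict_label([2.0, -1.0], weights=[0.5, 0.25], bias=0.25)
print(round(score, 4), label)
```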

You can learn more about the different Amazon SageMaker built-in algorithms [on this page](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).

First, the container image of the linear learner algorithm is retrieved for the current AWS region.

In [15]:
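# Retrieve the container image URI of the linear learner algorithm
# Note: get_image_uri belongs to version 1 of the SageMaker Python SDK;
# version 2 provides sagemaker.image_uris.retrieve for the same purpose
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, "linear-learner")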


Next, the estimator is created and trained on a single `ml.m4.xlarge` instance.

**Note:** This step might take a few minutes.

In [16]:
# Start the Amazon SageMaker session
sess = sagemaker.Session()

# Create an instance of the linear learner estimator
linear = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)

# Get the dimension of the feature-input vector (number of columns in X)
feature_dim = X.shape[1]


# Define linear learner hyperparameters
# Note how in this case we use: predictor_type='binary_classifier' # (credit risk: good or bad)
linear.set_hyperparameters(
    feature_dim=feature_dim, mini_batch_size=200, predictor_type="binary_classifier"
)

# Fitting the linear learner model with the training data
linear.fit({"train": s3_train_data, "test": s3_test_data})

2020-03-13 12:38:48 Starting - Starting the training job...
2020-03-13 12:38:49 Starting - Launching requested ML instances......
2020-03-13 12:40:17 Starting - Preparing the instances for training......
2020-03-13 12:41:15 Downloading - Downloading input data......
2020-03-13 12:42:11 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[03/13/2020 12:42:15 INFO 140398256052032] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_schedul

### Deploying the Model to Make Predictions

In this section, the `linear-learner` model that was trained will be used to make predictions of credit risk. Deploy the model using a `ml.t2.medium` instance type.

**Note:** This step might take a few minutes.

In [17]:
# An instance of the linear-learner predictor is created
linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.t2.medium")

-------------------!

In [18]:
# Linear predictor configurations
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

In [19]:
# Making some predictions using the test data.
result = linear_predictor.predict(X_test)
y_predictions = np.array(
    [np.uint8(r["predicted_label"]) for r in result["predictions"]]
)
y_predictions

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0], dtype=uint8)
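
For a binary classifier, each entry in the JSON response carries a `score` alongside the `predicted_label`. The sketch below uses a hypothetical response dictionary (not actual endpoint output) to show how the labels are extracted and how a custom threshold could be applied to the scores, for example to flag more applicants as bad risk. The 0.4 cut-off is an arbitrary assumption.

```python
# Hypothetical response with the same shape as the endpoint's JSON output
response = {
    "predictions": [
        {"score": 0.12, "predicted_label": 0},
        {"score": 0.81, "predicted_label": 1},
        {"score": 0.44, "predicted_label": 0},
    ]
}

# Labels as returned by the model
labels = [int(p["predicted_label"]) for p in response["predictions"]]

# Labels recomputed with a lower, hand-picked threshold (assumption: 0.4)
custom_threshold = 0.4
custom_labels = [int(p["score"] >= custom_threshold) for p in response["predictions"]]

print(labels)
print(custom_labels)
```

Lowering the threshold trades precision for recall on the bad-risk class, which may matter when missing a bad risk is the costlier error.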

### Model Evaluation

To evaluate the model, we create a confusion matrix that compares each predicted value with the expected test value, showing where they match and where they differ.

In [20]:
import pandas as pd

# Encode the prediction results as 0 = Good, 1 = Bad Risk
cat_test = np.array(["Good" if x == 0 else "Bad" for x in Y_test])
cat_pred = np.array(["Good" if x == 0 else "Bad" for x in y_predictions])

# Create the confusion matrix
pd.crosstab(cat_test, cat_pred, rownames=["actuals"], colnames=["predictions"])

actuals,predicted Bad,predicted Good
Bad,28,36
Good,16,120


The model is also evaluated using the `sklearn` metrics module. The following metrics are calculated:

* [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

* [`confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

* [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [21]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(f"accuracy_score: {accuracy_score(Y_test,y_predictions)}\n\n")
print(
    f"sklearn's confusion_matrix: \n\n{confusion_matrix(Y_test, y_predictions, labels=[0, 1])}\n\n"
)
print(
    f"classification_report: \n\n{classification_report(Y_test, y_predictions, target_names=['Good', 'Bad'])}\n\n"
)

accuracy_score: 0.74


sklearn's confusion_matrix: 

[[120  16]
 [ 36  28]]


classification_report: 

              precision    recall  f1-score   support

        Good       0.77      0.88      0.82       136
         Bad       0.64      0.44      0.52        64

   micro avg       0.74      0.74      0.74       200
   macro avg       0.70      0.66      0.67       200
weighted avg       0.73      0.74      0.72       200
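
The figures in the report can be recomputed by hand from the confusion matrix above. The sketch below does so for the overall accuracy and for the "Bad" class, using only the four counts; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above (actual class -> predicted class)
good_as_good, good_as_bad = 120, 16
bad_as_good, bad_as_bad = 36, 28

total = good_as_good + good_as_bad + bad_as_good + bad_as_bad

# Accuracy: share of all cases that were classified correctly
accuracy = (good_as_good + bad_as_bad) / total

# Treat "Bad" as the positive class
precision_bad = bad_as_bad / (bad_as_bad + good_as_bad)  # of predicted Bad, how many are Bad
recall_bad = bad_as_bad / (bad_as_bad + bad_as_good)  # of actual Bad, how many were found
f1_bad = 2 * precision_bad * recall_bad / (precision_bad + recall_bad)

print(
    f"accuracy={accuracy:.2f} precision={precision_bad:.2f} "
    f"recall={recall_bad:.2f} f1={f1_bad:.2f}"
)
```

The recall of 0.44 for the "Bad" class means the model misses more than half of the actual bad risks, even though the overall accuracy is 0.74.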





Finally, the endpoint is deleted to avoid consuming additional AWS resources.

In [22]:
# Delete the Amazon SageMaker endpoint
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)