<img src="../img/logo.png" align='left' width=200>


## Glue Job Preprocessing with PySpark - Model Training - Kueski Challenge

AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load (ETL) jobs. In this notebook we use this dialect to create an ETL script that runs as a Glue job.

<a id='contents' />

## Table of contents

1. [Loading libraries](#loading)
2. [Data Preprocessing. Data Pipelines](#etl)
3. [Model Training.](#s3)

<a id='loading' />

## Loading libraries
[(back to top)](#contents)

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt
%matplotlib inline

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
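from sklearn.metrics import (
    accuracy_score, confusion_matrix, recall_score,
    precision_score,
    # plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
    # the Display classes below are their replacements.
    ConfusionMatrixDisplay, RocCurveDisplay
)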

from sklearn.ensemble import RandomForestClassifier

In [2]:
# Read data from the offline feature store.
df = pd.read_csv('../data/train_model.csv')
# Subset of columns to be used in the model.
my_cols = ['age', 'years_on_the_job','nb_previous_loans','avg_amount_loans_previous','flag_own_car']

In [3]:
# NaN count per column.
df.isnull().sum()

id                              0
age                             0
years_on_the_job             6135
nb_previous_loans               0
avg_amount_loans_previous     343
flag_own_car                    0
status                          0
dtype: int64

In [4]:
df[df.avg_amount_loans_previous.isnull()]

Unnamed: 0,id,age,years_on_the_job,nb_previous_loans,avg_amount_loans_previous,flag_own_car,status
39,5008852,56,12.0,0.0,,1,0
120,5008939,42,9.0,0.0,,0,0
130,5008949,42,4.0,0.0,,0,0
213,5009058,20,2.0,0.0,,1,0
1070,5010174,24,2.0,0.0,,0,0
...,...,...,...,...,...,...,...
35968,5149672,45,6.0,0.0,,0,0
36009,5149722,29,0.0,0.0,,0,0
36115,5149854,25,0.0,0.0,,0,0
36353,5150237,52,0.0,0.0,,1,0


<a id='etl' />

## Data Preprocessing. Data Pipelines.
[(back to top)](#contents)


In [10]:
# train test split
Y = df['status'].astype('int')
X = df[my_cols]
X_train, X_test, y_train, y_test = train_test_split(X,Y, 
                                                    stratify=Y, test_size=0.3,
                                                    random_state = 12345)

In [11]:
# Pipeline
# Use the Synthetic Minority Over-sampling Technique (SMOTE) to address the class imbalance problem.
pipeline = imbpipeline(steps = [
    ["selector", ColumnTransformer([("selector", "passthrough", my_cols)], remainder="drop")],
    ["imputer", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0 )],
    ['smote', SMOTE(random_state=11)],
    ['model', RandomForestClassifier()]])

<a id='s3' />

## Model Training.
[(back to top)](#contents)

In [12]:
# Model training.
model = pipeline.fit(X_train,y_train)

In [13]:
# Try a single example to see how the model does.
example = pd.DataFrame({'age':[43.0],
 'years_on_the_job':[np.nan],
 'nb_previous_loans':[4.0],
 'avg_amount_loans_previous':[np.nan],
 'flag_own_car':[0]})
display(example)


Unnamed: 0,age,years_on_the_job,nb_previous_loans,avg_amount_loans_previous,flag_own_car
0,43.0,,4.0,,0


In [14]:
model.predict(example)

array([0])
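
A single prediction says little about overall model quality. As a hedged sketch, the helper below computes hold-out metrics with the scikit-learn functions imported at the top of the notebook. The `evaluate` function, the synthetic data, and the `demo_model` stand-in are all assumptions made for illustration; they are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate(fitted_model, X_eval, y_eval):
    """Return hold-out metrics for a fitted binary classifier (hypothetical helper)."""
    y_pred = fitted_model.predict(X_eval)
    return {
        "precision": precision_score(y_eval, y_pred, zero_division=0),
        "recall": recall_score(y_eval, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_eval, y_pred, labels=[0, 1]),
    }


# Synthetic, imbalanced stand-in data (NOT the dataset used in this notebook).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=12345
)

# Stand-in for the fitted pipeline.
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

metrics = evaluate(demo_model, X_te, y_te)
print(metrics)
```

In the notebook itself, the equivalent call would be `evaluate(model, X_test, y_test)`, using the split created in the preprocessing section.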

In [15]:
# Model persistence

from joblib import dump, load
# dump model
dump(model, 'model_risk.joblib') 

['model_risk.joblib']
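
The cell above imports `load` but never uses it. A minimal round-trip sketch follows, assuming a small stand-in estimator and a temporary directory (both hypothetical, chosen so the example does not overwrite `model_risk.joblib`).

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model; in the notebook this would be the fitted `model` pipeline.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_risk.joblib")
    dump(demo_model, path)   # writes the file and returns the list of paths written
    restored = load(path)    # reads the estimator back into memory

# The restored estimator should reproduce the original predictions.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), restored.predict(X_demo)
)
print(same_predictions)
```

joblib files are pickle-based, so they should only be loaded from trusted sources and with library versions compatible with those used when the model was saved.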

## Notes
1) Imputing with zero is not correct because it changes the data distribution, for example in the case of age. A capping strategy could be better (discussed in the README.md file).

2) Imputing with zero is not correct for the number of years worked either.

3) Using SMOTE before the train/test split is not correct either. SMOTE is a technique for balancing the dataset at the training stage, so that the model learns from more instances of the minority class. If synthetic samples reach the test set, the metrics are distorted, so be careful.

4) No cross-validation or hyperparameter search is done.

5) For credit scoring, a random forest is hard to audit because it is not an explainable model.
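
Notes 1, 2 and 4 can be illustrated together. The sketch below swaps the zero imputation for a median imputation and adds a small cross-validated grid search. It is a sketch under stated assumptions: the data is synthetic, the parameter grid and `roc_auc` scoring are arbitrary choices, and the SMOTE step is left out so that the example depends only on scikit-learn. With the `imbpipeline` used in this notebook the same `GridSearchCV` call applies, and resampling happens only when the pipeline is fitted on the training folds.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with missing values (NOT the notebook's dataset).
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "age": rng.integers(20, 65, size=n).astype(float),
    "years_on_the_job": rng.integers(0, 30, size=n).astype(float),
    "nb_previous_loans": rng.integers(0, 10, size=n).astype(float),
})
X_demo.loc[rng.random(n) < 0.2, "years_on_the_job"] = np.nan
y_demo = (rng.random(n) < 0.2).astype(int)

# Median imputation instead of a constant zero (assumed alternative).
search_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(random_state=0)),
])

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {
    "model__n_estimators": [25, 50],
    "model__max_depth": [3, None],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=11)
search = GridSearchCV(search_pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the imputer sits inside the pipeline, its median is recomputed on each training fold, which keeps validation data out of the preprocessing statistics.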

Also, the notebook implementation does not handle the case of negative years.
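
One possible way to handle negative years, sketched under the assumption that a negative `years_on_the_job` value is invalid and should be treated as missing. The `clean_years` helper is hypothetical and does not come from the notebook or the README.

```python
import numpy as np
import pandas as pd


def clean_years(df, col="years_on_the_job"):
    """Treat negative values as missing, then fill missing values with the median."""
    out = df.copy()
    # Keep values >= 0; negatives become NaN and existing NaN values stay NaN.
    out[col] = out[col].where(out[col] >= 0)
    # Median of the remaining valid values (assumed imputation choice).
    out[col] = out[col].fillna(out[col].median())
    return out


demo = pd.DataFrame({"years_on_the_job": [12.0, -3.0, np.nan, 4.0, 2.0]})
cleaned = clean_years(demo)
print(cleaned["years_on_the_job"].tolist())
```

In a real pipeline the median would be computed on the training split only and reused at prediction time, for example through a `SimpleImputer` step.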