## 4 Pre-processing and Training<a id='4_Pre-processing_and_Training'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-processing and Training](#4_Pre-processing_and_Training)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Loading the Data](#4.4_Loading_the_Data)
  * [4.5 Extracting Borrower Data](#4.5_Extracting_Borrower_Data)
  * [4.6 Splitting Indicator and Numerical Features](#4.6_Splitting_Indicator_and_Numerical_Features)
  * [4.7 Creating Dummy Features](#4.7_Creating_Dummy_Features)
  * [4.8 Train/Test Split](#4.8_Train/Test_Split)
  * [4.9 Initial Not-Even-A-Model](#4.9_Initial_Not-Even-A-Model)
  * [4.10 Metrics](#4.10_Metrics)
      * [4.10.1 R-squared, or coefficient of determination](#4.10.1_R-squared,_or_coefficient_of_determination)
      * [4.10.2 Mean Absolute Error](#4.10.2_Mean_Absolute_Error)
      * [4.10.3 Mean Squared Error](#4.10.3_Mean_Squared_Error)
  * [4.11 Initial Models](#4.11_Initial_Models)
  * [4.12 Pipelines](#4.12_Pipelines)
      * [4.12.1 Define the pipeline](#4.12.1_Define_the_pipeline)
      * [4.12.2 Fit the pipeline](#4.12.2_Fit_the_pipeline)
      * [4.12.3 Making Predictions on Train and Test Sets](#4.12.2_Making_Predictions_on_Train_and_Test_Sets)
      * [4.12.4 Assess Performance](#4.12.4_Assess_Performance)
  * [4.13 Summary](#4.13_Summary)


## 4.2 Introduction<a id='4.2_Introduction'></a>

In this notebook we'll begin building machine learning models. Before training anything, let's consider how useful the mean value is as a predictor. Think of this first model as a baseline against which any subsequent model can be compared. We'll then build up a process for creating and assessing models efficiently and robustly.

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score,f1_score,precision_score, recall_score
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import matplotlib.pyplot as plt
from sklearn import __version__ as sklearn_version
import datetime
import warnings
warnings.filterwarnings('ignore')

In [2]:
os.chdir(r'C:\Users\daenj\OneDrive\Desktop\Datasets\Capstone 2021')
os.getcwd()

'C:\\Users\\daenj\\OneDrive\\Desktop\\Datasets\\Capstone 2021'

## 4.4 Loading the Data<a id='4.4_Loading_the_Data'></a>

After loading the cleaned data, we set the numerical columns aside and create a new data frame containing only the columns with the `object` data type. This will be used later when creating dummy features.

In [3]:
data = pd.read_csv('cleaned_data.csv')
data = data.drop(columns='Unnamed: 0')
data_df = pd.DataFrame(data)

In [4]:
object_data = data_df.select_dtypes(include = ['object'])
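As a minimal sketch of what `select_dtypes` does here, the toy frame below separates object columns from numeric ones. The frame and its values are made up for illustration; only the column names echo the loan data.

```python
import pandas as pd

# Hypothetical toy frame: column names mimic the loan data, values are invented.
toy = pd.DataFrame({
    "AMT_CREDIT": [100000.0, 250000.0, 80000.0],
    "CODE_GENDER": pd.Series(["F", "M", "F"], dtype="object"),
    "FLAG_OWN_CAR": pd.Series(["Y", "N", "N"], dtype="object"),
})

# Keep only the object (categorical text) columns.
toy_object = toy.select_dtypes(include=["object"])

# Keep only the numeric columns.
toy_numeric = toy.select_dtypes(include=["number"])

print(list(toy_object.columns))
print(list(toy_numeric.columns))
```

The object columns are the ones that will later be passed to `pd.get_dummies`.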

## 4.5 Extracting Borrower Data<a id='4.5_Extracting_Borrower_Data'></a>

Here we split the borrowers by whether they have payment issues (`TARGET == 1`) or not (`TARGET == 0`).

In [5]:
payment_issues = data_df[data_df.TARGET == 1]
no_issues = data_df[data_df.TARGET == 0]

In [6]:
payment_issues.shape

(24825, 41)

In [7]:
no_issues.shape

(282686, 41)

## 4.6 Splitting Indicator and Numerical Features<a id='4.6_Splitting_Indicator_and_Numerical_Features'></a>

In this section, the indicator (object) and numerical features are split out for borrowers with a target of 1 (payment issues) and 0 (no payment issues). Since the object columns were already extracted from the whole data set, we won't use the object variables created here, only the float variables.

In [8]:
payment_issues_float = payment_issues.select_dtypes(include = ['int', 'float'])
payment_issues_object = payment_issues.select_dtypes(include=['object'])

In [9]:
no_issues_float = no_issues.select_dtypes(include = ['int', 'float'])
no_issues_object = no_issues.select_dtypes(include=['object'])

In [10]:
print(no_issues_float.columns)
print(no_issues_object.columns)

Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION',
       'DAYS_ID_PUBLISH', 'CNT_FAM_MEMBERS', 'DAYS_LAST_PHONE_CHANGE'],
      dtype='object')
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
       'DAYS_BIRTH_BINS', 'INCOME_VAL'],
      dtype='object')


## 4.7 Creating Dummy Features<a id='4.7_Creating_Dummy_Features'></a>

Let's create dummy features for our categorical data. We'll use the object variable created at the beginning of the notebook.

In [11]:
# Replace the object columns with their one-hot (dummy) encodings
data_df = pd.concat([data_df.drop(columns=object_data.columns), pd.get_dummies(object_data)], axis=1)
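To make the effect of the dummy encoding concrete, here is a small sketch on a made-up frame. The column names are borrowed from the loan data, but the values and the `cat_cols` helper list are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "AMT_CREDIT": [100000.0, 250000.0, 80000.0],
    "FLAG_OWN_CAR": pd.Series(["Y", "N", "N"], dtype="object"),
})
cat_cols = ["FLAG_OWN_CAR"]

# Drop the categorical column and append one indicator column per category.
encoded = pd.concat(
    [toy.drop(columns=cat_cols), pd.get_dummies(toy[cat_cols])],
    axis=1,
)

print(encoded.columns.tolist())
```

Each category becomes its own column named `<original column>_<category>`, so a frame with many high-cardinality columns (such as `ORGANIZATION_TYPE`) can grow considerably wider.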

This next step concatenates the float variables from both groups of borrowers. 

In [12]:
frames = [no_issues_float, payment_issues_float]
float_data = pd.concat(frames)

In [13]:
float_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 307511 entries, 1 to 307509
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AMT_INCOME_TOTAL            307511 non-null  float64
 1   AMT_CREDIT                  307511 non-null  float64
 2   AMT_ANNUITY                 307511 non-null  float64
 3   AMT_GOODS_PRICE             307511 non-null  float64
 4   REGION_POPULATION_RELATIVE  307511 non-null  float64
 5   DAYS_EMPLOYED               307511 non-null  float64
 6   DAYS_REGISTRATION           307511 non-null  float64
 7   DAYS_ID_PUBLISH             307511 non-null  float64
 8   CNT_FAM_MEMBERS             307511 non-null  float64
 9   DAYS_LAST_PHONE_CHANGE      307511 non-null  float64
dtypes: float64(10)
memory usage: 25.8 MB


## 4.8 Train/Test Split<a id='4.8_Train/Test_Split'></a>

The train/test split holds out 30% of the data as the test set. As stated earlier, the target is either 1 or 0 for every borrower.

In [14]:
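X = float_data
# Caution: float_data stacks the no-issue rows ahead of the payment-issue rows, so its row
# order differs from data_df. train_test_split pairs X and target by position, not by index
# label; data_df.loc[X.index, 'TARGET'] would give a target in the same row order as X.
target = data_df['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=0)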

In [15]:
X_train.shape, X_test.shape

((215257, 10), (92254, 10))

In [16]:
y_train.shape, y_test.shape

((215257,), (92254,))
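Because `train_test_split` pairs its inputs row by row in positional order, it is worth confirming that the features and the target are in the same order before splitting. The sketch below reproduces the split-by-class-then-concatenate pattern on a made-up five-row frame and shows how the row order changes; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: a 0/1 target plus one numeric feature.
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 1, 0],
    "AMT_CREDIT": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Same pattern as in the notebook: split by class, then stack the two groups.
stacked = pd.concat([toy[toy.TARGET == 0], toy[toy.TARGET == 1]])
X_toy = stacked[["AMT_CREDIT"]]

# The stacked frame keeps the original index labels, but in a new order.
print(X_toy.index.tolist())

# Target taken in the original row order versus re-ordered to follow X_toy.
y_by_position = toy["TARGET"]
y_by_label = toy.loc[X_toy.index, "TARGET"]

true_labels = stacked["TARGET"].to_numpy()
rows_match_by_position = bool((y_by_position.to_numpy() == true_labels).all())
rows_match_by_label = bool((y_by_label.to_numpy() == true_labels).all())

print(rows_match_by_position, rows_match_by_label)
```

In the toy case only the label-aligned target lines up with the stacked features. For the notebook's variables the analogous alignment would be `data_df.loc[X.index, 'TARGET']`; whether to adopt it is a judgment call, since it would change the downstream numbers.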

## 4.9 Initial Not-Even-A-Model<a id='4.9_Initial_Not-Even-A-Model'></a>

A good place to start is to see how good the mean is as a predictor. Because the target is 0 or 1, the training mean is the share of training borrowers with payment issues, so it is our best constant guess at the probability that a borrower will have payment issues.

In [17]:
train_mean = y_train.mean()
train_mean

0.08122848502023162

In [18]:
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
dumb_reg.constant_

array([[0.08122849]])

## 4.10 Metrics<a id='4.10_Metrics'></a>

#### 4.10.1 R-squared, or coefficient of determination<a id='4.10.1_R-squared,_or_coefficient_of_determination'></a>

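One measure is $R^2$, the coefficient of determination. This is the proportion of variance in the dependent variable (payment issues or not) that is explained by our "model". A value of 0 means the model does no better than always predicting the mean of the observed values, and a negative value means it does worse.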

In [19]:
# Calculate R^2 from the total and residual sums of squares
def r_squared(y, ypred):
    """R-squared score.
    
    Calculate the R-squared, or coefficient of determination, of the input.
    
    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    ybar = np.sum(y) / len(y) #yes, we could use np.mean(y)
    sum_sq_tot = np.sum((y - ybar)**2) #total sum of squares error
    sum_sq_res = np.sum((y - ypred)**2) #residual sum of squares error
    R2 = 1.0 - sum_sq_res / sum_sq_tot
    return R2

In [20]:
y_tr_pred_ = train_mean * np.ones(len(y_train))
y_tr_pred_[:5]

array([0.08122849, 0.08122849, 0.08122849, 0.08122849, 0.08122849])

In [21]:
y_tr_pred = dumb_reg.predict(X_train)
y_tr_pred[:5]

array([0.08122849, 0.08122849, 0.08122849, 0.08122849, 0.08122849])

In [22]:
r_squared(y_train, y_tr_pred)

0.0

In [23]:
y_te_pred = train_mean * np.ones(len(y_test))
r_squared(y_test, y_te_pred)

-3.787954817524586e-05
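The two results above (exactly 0 on the training set, very slightly negative on the test set) follow directly from the definition. As a hedged illustration on made-up labels, the sketch below repeats the calculation and cross-checks the hand-written function against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y, ypred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    ybar = np.mean(y)
    return 1.0 - np.sum((y - ypred) ** 2) / np.sum((y - ybar) ** 2)


# Hypothetical 0/1 labels; the "test" set has a slightly different mean.
y_tr = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])  # mean 0.2
y_te = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])  # mean 0.1

train_mean = y_tr.mean()
pred_tr = np.full(len(y_tr), train_mean)
pred_te = np.full(len(y_te), train_mean)

# On the training set the residuals equal the deviations from the mean, so R^2 is 0.
r2_train = r_squared(y_tr, pred_tr)

# On other data the training mean is not quite that data's mean, so R^2 dips below 0.
r2_test = r_squared(y_te, pred_te)

print(r2_train, r2_test, r2_score(y_te, pred_te))
```

The gap between the two means is exaggerated here; in the notebook the train and test means are nearly identical, which is why the test $R^2$ is only about $-4 \times 10^{-5}$.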

#### 4.10.2 Mean Absolute Error<a id='4.10.2_Mean_Absolute_Error'></a>

In [24]:
# Calculate the MAE: the mean of the absolute differences between observed and predicted values
def mae(y, ypred):
    """Mean absolute error.
    
    Calculate the mean absolute error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    abs_error = np.abs(y - ypred)
    mae = np.mean(abs_error)
    return mae

In [25]:
mae(y_train, y_tr_pred)

0.14926083648360142

In [26]:
mae(y_test, y_te_pred)

0.14786587570123505

Mean absolute error is arguably the most intuitive of all the metrics. Here it tells you that, if you guessed the training mean (about 0.08) for every borrower, your guess would be off from the borrower's actual 0/1 outcome by about 0.15 on average.
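For a 0/1 target, the MAE of the mean predictor has a simple closed form. If a share $p$ of the labels are 1, those rows are off by $1 - p$ and the remaining rows are off by $p$, which gives $2p(1 - p)$. The sketch below checks this against the training figures reported above and against a made-up label vector (`y_toy` is hypothetical).

```python
import numpy as np

p = 0.08122848502023162  # training mean reported in section 4.9

# A share p of borrowers (the 1s) are off by 1 - p; the rest (the 0s) are off by p.
mae_closed_form = p * (1 - p) + (1 - p) * p

# Numerical check on hypothetical labels: 81 positives out of 1,000.
y_toy = np.zeros(1000)
y_toy[:81] = 1
p_toy = y_toy.mean()
mae_toy = np.mean(np.abs(y_toy - p_toy))

print(round(mae_closed_form, 6))  # 0.149261, matching the training MAE above
print(np.isclose(mae_toy, 2 * p_toy * (1 - p_toy)))
```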

#### 4.10.3 Mean Squared Error<a id='4.10.3_Mean_Squared_Error'></a>

In [27]:
# Calculate the MSE: the mean of the squared differences between observed and predicted values
def mse(y, ypred):
    """Mean square error.
    
    Calculate the mean square error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    sq_error = (y - ypred)**2
    mse = np.mean(sq_error)
    return mse

In [28]:
mse(y_train, y_tr_pred)

0.07463041824157111

In [29]:
mse(y_test, y_te_pred)

0.07323545745987244

In [30]:
np.sqrt([mse(y_train, y_tr_pred), mse(y_test, y_te_pred)])

array([0.27318568, 0.2706205 ])
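When the prediction is the mean of the labels being scored, the MSE is simply the variance of those labels, and its square root is their standard deviation. For a 0/1 target with positive share $p$, that variance is $p(1 - p)$. The following sketch, using the training mean from above and a made-up label vector, is one way to sanity-check the numbers.

```python
import numpy as np

p = 0.08122848502023162  # training mean reported in section 4.9

# Variance of a 0/1 variable with positive share p.
mse_closed_form = p * (1 - p)
rmse_closed_form = np.sqrt(mse_closed_form)

# Numerical check on hypothetical labels: 81 positives out of 1,000.
y_toy = np.zeros(1000)
y_toy[:81] = 1
mse_toy = np.mean((y_toy - y_toy.mean()) ** 2)

print(mse_closed_form, rmse_closed_form)
print(np.isclose(mse_toy, np.var(y_toy)))
```

Taking the square root puts the error back on the same scale as the target, which makes it easier to compare with the MAE.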

## 4.11 Initial Models<a id='4.11_Initial_Models'></a>

Let's briefly build a logistic regression model that uses the median to impute any missing values.

In [31]:
# These are the values we'll use to fill in any missing values
X_defaults_median = X_train.median()
X_defaults_median

AMT_INCOME_TOTAL              144000.00000
AMT_CREDIT                    513531.00000
AMT_ANNUITY                    24907.50000
AMT_GOODS_PRICE               450000.00000
REGION_POPULATION_RELATIVE         0.01885
DAYS_EMPLOYED                   2215.00000
DAYS_REGISTRATION               4508.00000
DAYS_ID_PUBLISH                 3255.00000
CNT_FAM_MEMBERS                    2.00000
DAYS_LAST_PHONE_CHANGE           758.00000
dtype: float64

In [32]:
X_tr = X_train.fillna(X_defaults_median)
X_te = X_test.fillna(X_defaults_median)

In [33]:
scaler = StandardScaler()
scaler.fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
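The scaler is fitted on the training data only, and the same learned mean and standard deviation are then applied to the test data. A tiny sketch with made-up numbers shows what that means in practice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
X_tr_toy = np.array([[1.0], [2.0], [3.0]])
X_te_toy = np.array([[4.0]])

# Learn the mean and standard deviation from the training rows only.
toy_scaler = StandardScaler().fit(X_tr_toy)

# Apply the training statistics to both sets.
X_tr_toy_scaled = toy_scaler.transform(X_tr_toy)
X_te_toy_scaled = toy_scaler.transform(X_te_toy)

print(toy_scaler.mean_, toy_scaler.scale_)
print(X_te_toy_scaled)
```

The scaled training column has mean 0, but the scaled test value does not have to: it is expressed relative to the training distribution, which keeps information about the test set from leaking into preprocessing.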

In [34]:
lm = LogisticRegression().fit(X_tr_scaled, y_train)

In [35]:
y_tr_pred = lm.predict(X_tr_scaled)
y_te_pred = lm.predict(X_te_scaled)

In [36]:
# r^2 - train, test
median_r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
median_r2

(-0.08840988613150458, -0.08644039852085639)

In [37]:
median_mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
median_mae

(0.08122848502023162, 0.07956294578012878)

In [38]:
median_mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
median_mse

(0.08122848502023162, 0.07956294578012878)
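The MAE and MSE are identical here because `predict` returns hard 0/1 labels: every error is exactly 0 or 1, so absolute and squared errors coincide and both equal the misclassification rate. The training value also equals the training mean (0.0812), and the training $R^2$ equals $-p/(1-p)$, which is consistent with a model that predicts 0 for every training row. The sketch below illustrates these identities on made-up labels; it does not inspect the fitted model itself.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical labels: 8 borrowers with payment issues out of 100.
y_true = np.array([0] * 92 + [1] * 8)

# A predictor that always answers "no payment issues".
y_pred = np.zeros_like(y_true)

mae_val = mean_absolute_error(y_true, y_pred)
mse_val = mean_squared_error(y_true, y_pred)
acc_val = accuracy_score(y_true, y_pred)
r2_val = r2_score(y_true, y_pred)
p = y_true.mean()

print(mae_val, mse_val, 1 - acc_val)  # all equal the positive share p
print(r2_val, -p / (1 - p))           # R^2 of the all-zeros predictor
```

For a classifier, metrics such as accuracy, recall, or ROC AUC are usually easier to interpret than regression metrics applied to hard labels.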

## 4.12 Pipelines<a id='4.12_Pipelines'></a>

Let's compare the model built above to a model built by a pipeline. Will the results be different?

#### 4.12.1 Define the pipeline<a id='4.12.1_Define_the_pipeline'></a>

In [39]:
pipe_all = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler(),
    SelectKBest(f_regression, k='all'),
    LogisticRegression()
)

In [40]:
type(pipe_all)

sklearn.pipeline.Pipeline

In [41]:
hasattr(pipe_all, 'fit'), hasattr(pipe_all, 'predict')

(True, True)

#### 4.12.2 Fit the pipeline<a id='4.12.2_Fit_the_pipeline'></a>

In [42]:
pipe_all.fit(X_train, y_train)

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('selectkbest',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x0000020109D7E700>)),
                ('logisticregression', LogisticRegression())])
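Once a pipeline is fitted, each step can be inspected through `named_steps`, using the lowercase class names that `make_pipeline` assigns. The sketch below uses a small made-up array with one missing value (the data and the shorter three-step pipeline are assumptions for illustration) to show the medians the imputer learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value in the second column.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
    [6.0, 60.0],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
toy_pipe.fit(X_toy, y_toy)

# Step names are generated from the class names.
step_names = list(toy_pipe.named_steps)
print(step_names)

# Per-column medians learned by the imputer during fit.
learned_medians = toy_pipe.named_steps["simpleimputer"].statistics_
print(learned_medians)
```

Because the imputer and scaler are fitted inside the pipeline, calling `fit` on training data and `predict` on test data applies the training medians and scaling automatically.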

#### 4.12.3 Making Predictions on Train and Test Sets<a id='4.12.2_Making_Predictions_on_Train_and_Test_Sets'></a>

In [43]:
y_tr_pred = pipe_all.predict(X_train)
y_te_pred = pipe_all.predict(X_test)

#### 4.12.4 Assess Performance<a id='4.12.4_Assess_Performance'></a>

In [44]:
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

(-0.08840988613150458, -0.08644039852085639)

In [45]:
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)

(0.08122848502023162, 0.07956294578012878)

In [46]:
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)

(0.08122848502023162, 0.07956294578012878)

In [47]:
cv_results = cross_validate(pipe_all, X_train, y_train, cv=5)

In [48]:
cv_scores = cv_results['test_score']
cv_scores

array([0.91877265, 0.91877265, 0.91877076, 0.91877076, 0.91877076])

In [49]:
np.mean(cv_scores), np.std(cv_scores)

(0.9187715149692501, 9.243249537556939e-07)

In [50]:
np.round((np.mean(cv_scores) - 2 * np.std(cv_scores), np.mean(cv_scores) + 2 * np.std(cv_scores)), 2)

array([0.92, 0.92])
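By default, `cross_validate` scores a classifier by accuracy, which can look high on an imbalanced target even when no positives are found. One way to put the 92% figure in context is to cross-validate a majority-class baseline alongside the model and request additional scorers. The sketch below does this on synthetic data from `make_classification`; the data set, class weights, and variable names are assumptions, not the loan data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: roughly 8% positives, loosely mirroring the target rate.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=0
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
model = make_pipeline(StandardScaler(), LogisticRegression())

scoring = ["accuracy", "recall", "roc_auc"]
base_cv = cross_validate(baseline, X_syn, y_syn, cv=5, scoring=scoring)
model_cv = cross_validate(model, X_syn, y_syn, cv=5, scoring=scoring)

for name in scoring:
    key = f"test_{name}"
    print(name, round(base_cv[key].mean(), 3), round(model_cv[key].mean(), 3))
```

The baseline reaches high accuracy with zero recall and an ROC AUC of 0.5, so a model is only adding value to the extent that it improves on those figures.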

## 4.13 Summary<a id='4.13_Summary'></a>

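The pipeline and the manually built median-imputed model produce identical scores. Cross-validation puts the average prediction accuracy at about 92%. That figure is essentially the share of borrowers without payment issues (1 - 0.081 ≈ 0.919), so at this stage the model scores about the same as always predicting "no payment issues".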