**All references to Project, COS Bucket, and API keys must be replaced before running this notebook in your project. Search for "replace".**

<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Build and Save a Sci-Kit Learn model to Watson Machine Learning (WML)</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
</table>

This notebook walks you through these steps:
 
- Access the data
- Cleanse data for analysis
- Explore data
- Build a classification model
- Save the model in the ML repository with associated metadata


## Step 1: Install and Import Required Libraries

In [None]:
!pip install -U ibm-watson-machine-learning

In [None]:
# Import pandas for data handling, seaborn and matplotlib for visualisation, and datetime for date manipulation

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime

pd.options.display.max_columns = 999
%matplotlib inline
sns.set(style="darkgrid")

In [None]:
# Import the required scikit-learn libraries

import numpy as np
import urllib3, requests, json
import sklearn
import warnings
warnings.filterwarnings('ignore')

from scipy.stats import chi2_contingency,ttest_ind

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_curve, roc_auc_score
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer


## Step 2: Import Data from Cloud Object Storage as Pandas Dataframe

<font color=red><b>ACTION:</b> Ensure that you have the correct bucket referenced below. It may be easier to delete the next four cells and then use "Insert to Code" to add the three data sets.</font>

In [None]:
# Instead of replacing the bucket and COS API keys, you can regenerate the code with "Insert to Code".
# In that case, rename the generated pandas dataframes as follows:
#   customer - for Mortgage_Customer.csv
#   property - for Mortgage_Property.csv
#   default  - for Mortgage_Default.csv

In [None]:
# Connect to Cloud Object Storage to access the data
import types
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_11c9f875c35a4c2c979b35121b858b43 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='replace-with-ibm-cos-api-key',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')


In [None]:
# Load the Customer Data

body = client_11c9f875c35a4c2c979b35121b858b43.get_object(Bucket='replace-with-your-cos-bucket',Key='Mortgage_Customer.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

customer = pd.read_csv(body)

In [None]:
# Load the Property Data

body = client_11c9f875c35a4c2c979b35121b858b43.get_object(Bucket='replace-with-your-cos-bucket',Key='Mortgage_Property.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

property = pd.read_csv(body)

In [None]:
# Load the Default data

body = client_11c9f875c35a4c2c979b35121b858b43.get_object(Bucket='replace-with-your-cos-bucket',Key='Mortgage_Default.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

default = pd.read_csv(body)



In [None]:
# This is the project bucket, taken from the cell above, which can be used later if assets need to be moved from the project to a deployment space
project_bucket = "replace-with-your-cos-bucket"

In [None]:
# Check that the dataframes have the desired columns

print ("Customer dataframe:")
print (list(customer))
print ("")
print ("Property dataframe:")
print (list(property))
print ("")
print ("Default dataframe:")
print (list(default))

## Step 3: Merge the Data Files

In [None]:
merged = pd.merge(pd.merge(customer, property, on='ID'),default,on='ID')
merged.head(3)
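The inner joins above silently drop any `ID` that is missing from one of the three files. The following is a minimal sketch, using small made-up frames rather than the real mortgage data, of how the `validate` and `indicator` options of `DataFrame.merge` could be used to check for duplicate or unmatched IDs. The `*_demo` names and all values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the customer, property and default tables (hypothetical values)
customer_demo = pd.DataFrame({"ID": [1, 2, 3], "Income": [40000, 52000, 61000]})
property_demo = pd.DataFrame({"ID": [1, 2, 3], "SalePrice": [150000, 210000, 180000]})
default_demo = pd.DataFrame({"ID": [1, 2], "MortgageDefault": ["NO", "YES"]})

# validate="one_to_one" raises a MergeError if an ID is duplicated on either side
merged_demo = customer_demo.merge(property_demo, on="ID", validate="one_to_one")

# how="left" with indicator=True keeps every row and flags those without a match
checked = merged_demo.merge(default_demo, on="ID", how="left",
                            validate="one_to_one", indicator=True)
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "ID"].tolist()
print("IDs with no default record:", unmatched_ids)
```

If the list of unmatched IDs is empty for the real data, the inner join used in this step loses no records.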

## Step 4: Simple Data Preparation - Rename some columns and ensure correct data types
#### Remove spaces from column names

In [None]:
# Rename fields to remove spaces
merged.rename(columns={
    "Yrs at Current Address":"YearCurrentAddress", 
    "Yrs with Current Employer":"YearsCurrentEmployer",
    "Number of Cards":"NumberOfCards",
    "Creditcard Debt":"CCDebt",
    "Loan Amount":"LoanAmount"}, 
              inplace=True)

#### Check the Data Types and correct any that require it

In [None]:
merged.dtypes

In [None]:
# Loop through the Decimal (Float) fields and change to Integers

float_col = merged.select_dtypes(include = ['float64']) # This will select float columns only
# list(float_col.columns.values)
for col in float_col.columns.values:
    merged[col] = merged[col].astype('int64')

In [None]:
merged.dtypes
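Casting a float column to `int64` raises an error if the column contains missing values, and it truncates any fractional part. The sketch below, on a small made-up frame, shows one way to cast only the float columns that have no missing values and to report the rest. The frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: "Income" has a missing value, "Loans" does not (hypothetical values)
demo = pd.DataFrame({"Income": [40000.0, np.nan, 61000.0], "Loans": [1.0, 2.0, 0.0]})

float_cols = demo.select_dtypes(include=["float64"]).columns
cols_with_nan = [col for col in float_cols if demo[col].isna().any()]
safe_cols = [col for col in float_cols if col not in cols_with_nan]

# Cast only the columns that can be converted without error
demo = demo.astype({col: "int64" for col in safe_cols})

print("Left as float because of missing values:", cols_with_nan)
print(demo.dtypes)
```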

## Step 5: Data Exploration

1) Obtain some data shape summaries (number of fields and records) <br>
2) Perform some exploratory analysis of distributions and relationships using appropriate visualisation libraries


In [None]:
print ("There are " + str(merged.shape[0]) + " records and " + str(merged.shape[1]) + " fields in the dataset.")

In [None]:
g1 = sns.countplot(data=merged, x='MortgageDefault', order=merged.MortgageDefault.value_counts().index)
plt.title('Mortgage Default Rates')
plt.ylabel('Count of Default')
plt.ylim(0, 800)
#Add percentages to the graph
total = float(len(merged)) #one person per row
for p in g1.patches:
    height = p.get_height()
    g1.text(p.get_x()+p.get_width()/2.,
            height + 1,
            '{0:.0%}'.format(height/total),
            ha="center") 
plt.show()


In [None]:
sns.catplot(x="MortgageDefault", y="YearCurrentAddress",
                 hue="Residence", col="AppliedOnline",
                 data=merged, kind="bar",
                 height=7, aspect=.81,capsize=.05);

In [None]:
sns.catplot(x="MortgageDefault", y="SalePrice",
                 hue="Residence", col="AppliedOnline",
                 data=merged, kind="bar",
                 height=7, aspect=.81,capsize=.05);

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(15,8))
g = sns.lineplot(x="YearCurrentAddress", y="CCDebt", hue="Residence",data=merged)


## Step 6: Build the Sci-Kit Learn pipeline using a Random Forest model


### Step 6.1: Create the Input Data for Modelling

In [None]:
# Convert the Target/Label column to a numeric

le = LabelEncoder()
merged.loc[:,'MortgageDefault']= le.fit_transform(merged.loc[:,'MortgageDefault'])
merged.head()

In [None]:
# Check the values for MortgageDefault

merged.groupby(['MortgageDefault']).size()
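`LabelEncoder` assigns integer codes in sorted order of the original labels, so it is worth confirming which label became 0 and which became 1 before interpreting predictions. A minimal sketch follows, assuming hypothetical "NO"/"YES" labels; the actual values in the mortgage data may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels, used only for illustration
demo_labels = ["NO", "YES", "NO", "NO"]

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(demo_labels)

# classes_ holds the original labels; the position of each label is its integer code
label_mapping = {str(label): int(code) for code, label in enumerate(demo_le.classes_)}
print(label_mapping)
```

In this notebook the same check would use the fitted `le` object, for example by inspecting `le.classes_`.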

In [None]:
# Split the columns into "Input Features" and "Label"

y = np.float32(merged.MortgageDefault)
x = merged.drop(['MortgageDefault','ID'], axis = 1)

In [None]:
list(x)

### Step 6.2: Split the data into Training & Test samples

In [None]:
# split the data to training and testing set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
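The `stratify=y` argument keeps the proportion of defaults the same in the training and test samples. The sketch below illustrates this on synthetic labels (80 non-defaults and 20 defaults, an assumed ratio that is not taken from the mortgage data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 zeros and 20 ones (illustrative only)
y_demo = np.array([0] * 80 + [1] * 20)
x_demo = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# With stratification both samples keep the 20% share of the positive class
print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```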

### Step 6.3: Transform Input Fields - LabelEncoding or OneHotEncoding for Categorical & Scaling for Numerics

In [None]:
# Split the input features into numeric/categorical features

numeric_features = ['Income','YearCurrentAddress','YearsCurrentEmployer','NumberOfCards','CCDebt','Loans','LoanAmount','SalePrice']
categorical_features = ['AppliedOnline','Residence','Location']


In [None]:
# The definition of a Numeric transformation is shown here, but commented out in the pipeline creation below
#   - the numeric transformation fills missing values (SimpleImputer) and then scales them to a standardised score (StandardScaler)
# The definition of a Categorical transformer is shown here - and used in the pipeline creation
#   - the categorical transformer uses a OneHotEncoder but it could have been a label encoder

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Do not include the numeric transforms in the pipeline creation, and ensure that non-categorical features are "passed through" and still available in their raw format
preprocessor = ColumnTransformer(
    transformers=[
 #       ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
        remainder='passthrough')
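With `remainder='passthrough'`, a `ColumnTransformer` places the output of the listed transformers first and appends the untouched columns afterwards, in their original order. This matters whenever names are attached to the transformed columns. A small sketch on a made-up frame (the column values are assumptions) shows the resulting order:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with the categorical column in the middle (hypothetical values)
demo = pd.DataFrame({
    "Income": [40000, 52000, 61000],
    "Residence": ["Owner", "Renter", "Owner"],
    "Loans": [1, 2, 0],
})

demo_preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Residence"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)
transformed = demo_preprocessor.fit_transform(demo)

# Columns are: Residence=Owner, Residence=Renter, then Income and Loans
print(transformed)
```

Although `Residence` is the second column of the input, its one-hot columns come first in the output, followed by `Income` and `Loans`.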

### Step 6.4: Define the classifier, the pipeline steps and then create the actual model

In [None]:
# Specify the classifier function to be used for the model creation

classifier_function = RandomForestClassifier()
#classifier_function = DecisionTreeClassifier()
#classifier_function = MLPClassifier()
#classifier_function = LogisticRegression()

In [None]:
# Define the pipeline as a series of steps

pipeline = Pipeline(steps=[('preprocessor', preprocessor),('classifier', classifier_function)])  

In [None]:
# Create the model based upon the defined pipeline

model = pipeline.fit(x_train,y_train)

In [None]:
#print(model)

## Step 7: Evaluate the Model and Check the Accuracy and Performance

### Step 7.1: Look at the various Accuracy Measures

In [None]:
# Call pipeline.predict() on the x_test data to make a set of test predictions, which are written to y_prediction

y_prediction = pipeline.predict(x_test)
y_probability = pipeline.predict_proba(x_test)

# Evaluate the model using sklearn.classification_report()
report = sklearn.metrics.classification_report(y_test, y_prediction )
accuracy = sklearn.metrics.accuracy_score(y_test, y_prediction )
print(report)
print("Overall Model Accuracy: " + str(accuracy))
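Overall accuracy can look good on an imbalanced target even when few defaults are identified, so a threshold-independent measure such as ROC AUC is a useful complement. `roc_auc_score` is already imported in Step 1. The sketch below uses synthetic data from `make_classification` (an assumption, not the mortgage data) to show the call pattern with predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the mortgage features (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# ROC AUC is computed from the probability of the positive class (column 1)
positive_proba = demo_model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, positive_proba)
print("ROC AUC: " + str(round(auc, 3)))
```

For the pipeline in this notebook, the equivalent call would be `roc_auc_score(y_test, y_probability[:, 1])`.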

### Step 7.2: Score the Test Data through the Model Pipeline and view the Actual & Predicted Results

In [None]:
#Reset the index on the x_test data so that the join will match record by record and not require a key
x_test.reset_index(drop=True, inplace=True)

#Write the Actual and Predicted Mortgage Default values in to dataframes 
y_test_df = pd.DataFrame(y_test,columns=['MortgageDefault'])
y_pred_df = pd.DataFrame(y_prediction,columns=['Pred Default'])
y_prob_df = pd.DataFrame(y_probability,columns=['Prob Non-Default','Prob Default'])

# Combine the three dataframes by index value rather than key field
scored_df = pd.concat([x_test, y_test_df, y_pred_df, y_prob_df], axis=1)
scored_df.head()


## Step 8: Understand the Model which has been created
Feature importance must be calculated on the transformed data, not the original data. The final field names used by the model are therefore taken from the first step of the pipeline (post transformation).

In [None]:
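# Obtain the fields submitted to the model after the first step of the pipeline (post transformation)
# Note: on scikit-learn 1.0 or later, use get_feature_names_out instead of get_feature_names
fitted_preprocessor = pipeline.named_steps['preprocessor']
onehot_columns = fitted_preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names(input_features=categorical_features)

# The ColumnTransformer outputs the one-hot columns first, followed by the passthrough (remainder) columns in their original order
passthrough_features = [col for col in x_train.columns if col not in categorical_features]
feature_names = list(onehot_columns) + passthrough_features

# Create a pandas series which contains both the transformed feature list and the calculated importance based upon the model fit
X_values = fitted_preprocessor.transform(x_train)
df_from_array_pipeline = pd.DataFrame(X_values, columns=feature_names)
feature_imp_series = pd.Series(data=pipeline.named_steps['classifier'].feature_importances_, index=np.array(feature_names))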

In [None]:
# Convert the pandas series to a pandas dataframe - rename the columns and sort in descending order of Feature Importance

feature_imp = feature_imp_series.to_frame()
feature_imp['Feature'] = feature_imp.index
feature_imp = feature_imp.rename(columns = {0: 'Importance'})
feature_imp = feature_imp.sort_values(by=['Importance'], ascending=False)
feature_imp.head(25)

In [None]:
# Visualise the Feature Importance in descending order

sns.catplot(y="Feature", x="Importance", data=feature_imp, kind="bar", palette="Blues", height=6, aspect=2)
plt.title('Mortgage Default - Feature Importance')

## Step 9: Save Model to the Project

### Step 9.1: Obtain Credentials for WML and initiate the WML Client API, then choose the Model Name

In [None]:
from ibm_watson_machine_learning import APIClient

# IMPORTANT
# Replace with your Cloud API key and location
api_key = 'replace-with-your-cloud-api-key'
location = 'https://us-south.ml.cloud.ibm.com'  # For example, Dallas location is 'https://us-south.ml.cloud.ibm.com'


wml_credentials = {
    "apikey": api_key,
    "url": location
}

client = APIClient(wml_credentials)

In [None]:
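# `pc` is the project context. It is assumed to be defined by the project token cell
# (for example, pc = project.project_context); add that cell before running this one.
client.set.default_project(pc.projectID)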

### Step 9.2: Store the Model to the Project

In [None]:
# Provide metadata and save the model into the repository. After running this cell, the model will be displayed in the Assets view

# Model Metadata
model_name = 'mortgage_default_model'
software_spec_uid = client.software_specifications.get_uid_by_name('default_py3.8')

metadata = {
    client.repository.ModelMetaNames.NAME: model_name,
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    client.repository.ModelMetaNames.TYPE: "scikit-learn_0.23"
}

stored_model_details = client.repository.store_model(pipeline,
                                               meta_props=metadata,
                                               training_data=x_train,
                                               training_target=y_train)

### Step 9.3: Save Model in the Project (as an asset) - optional
While we recommend saving models with the WML API, in some cases a model may need to be saved as a file. Two typical reasons for using this approach are:
1. The model framework is not supported by Watson Studio. The list of supported frameworks is in the documentation: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.0?topic=functions-supported-deployment-frameworks
2. A customer has an established deployment process that works with exported model files.

*In this notebook we demonstrate how to save the model to the project (it will be displayed under Data Assets). You can also save the model file to any other storage type, such as a Storage Volume (shared file system), Git, or Object Storage.*

In [None]:
model_name = 'mortgage_default_model'

In [None]:
import pickle
from io import BytesIO

# Save the model to working directory. Pickle is one of the most frequently used options for saving Python models, but you can also use other approaches to save the model file. 
# Modify model name to make it easier to distinguish between the model saved with WML and regular save file.
model_name = model_name + '_custom'

# save model object as in-memory bytes buffer
buffer = BytesIO()
pickle.dump(model, buffer, pickle.HIGHEST_PROTOCOL)
buffer.seek(0)

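# `project` is assumed to be the Project object created by the project token cell; define it before running this cell
project.save_data(model_name, buffer, overwrite=True)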
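Before relying on a pickled model file, it can be worth checking that the serialised bytes load back into a model that gives the same predictions. The sketch below does this round trip in memory with a small random forest trained on random data. The `demo_model` and the data are assumptions for illustration and are unrelated to the mortgage model.

```python
import pickle
from io import BytesIO

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic training set (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Serialise to an in-memory buffer, as done for the project asset
buffer = BytesIO()
pickle.dump(demo_model, buffer, pickle.HIGHEST_PROTOCOL)
buffer.seek(0)

# Load the model back and compare predictions with the original
restored_model = pickle.load(buffer)
same_predictions = np.array_equal(demo_model.predict(X_demo), restored_model.predict(X_demo))
print("Round trip preserved predictions:", same_predictions)
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that created it, and pickle files should only be loaded from trusted sources.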

---------------------------------------------------------------------------------
<u>Author Information:</u><br>
**Stephen Groves** and **Elena Lowery** <br/>
<i> Data Science & AI Technical Sales, IBM </i><br>
9 October 2020, updated in December 2021