# Prepare the real-time scoring model

The team at Woodgrove Bank has provided you with exported CSV copies of historical data for you to train your model against. Run the following cell to load required libraries and download the data sets from the Azure ML datastore.

In [None]:
# Ignore the error
! pip install --force-reinstall joblib==0.14.1 scikit-learn==0.22.2.post1

In [None]:
# Restart the kernel 

In [None]:
import sklearn
import joblib

print(sklearn.__version__)
print(joblib.__version__)

# Make sure joblib version == 0.14.1 and sklearn == 0.22.2.post1

In [None]:
!pip install --upgrade azureml-train-automl-runtime
!pip install --upgrade azureml-automl-runtime
# !pip install --upgrade scikit-learn
!pip install --upgrade numpy

In [None]:
from azureml.core import Workspace, Environment, Datastore, Dataset
from azureml.core.experiment import Experiment
from azureml.core.run import Run
from azureml.core.model import Model

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# sklearn.externals.joblib was deprecated in 0.21
from sklearn import __version__ as sklearnver
from packaging.version import Version
if Version(sklearnver) < Version("0.21.0"):
    from sklearn.externals import joblib
else:
    import joblib

import numpy as np
import pandas as pd

ws = Workspace.from_config()

# Load data
ds = Datastore.get(ws, "woodgrovestorage")

account_ds = Dataset.Tabular.from_delimited_files(path = [(ds, 'synapse/Account_Info.csv')])
fraud_ds = Dataset.Tabular.from_delimited_files(path = [(ds, 'synapse/Fraud_Transactions.csv')])
untagged_ds = Dataset.Tabular.from_delimited_files(path = [(ds, 'synapse/Untagged_Transactions.csv')])

# Create pandas dataframes from datasets
account_df = account_ds.to_pandas_dataframe()
fraud_df = fraud_ds.to_pandas_dataframe()
untagged_df = untagged_ds.to_pandas_dataframe()

View the fraud dataframe. 

NOTE: The schema documentation for the data used here is available at:
https://microsoft.github.io/r-server-fraud-detection/input_data.html

In [None]:
fraud_df

View the account info dataframe.

In [None]:
account_df

View the untagged transactions dataframe.

In [None]:
# Sort the dataframe columns alphabetically
cols=untagged_df.columns.tolist()
cols.sort()
untagged_df=untagged_df[cols]

untagged_df

## Prepare data

The raw data has some issues that we need to clean up before we can use it to train a model. We do this in the following cells.

### Prepare accounts

Begin by cleaning the accounts data set.
Remove the columns that have very few or no values: `accountOwnerName`, `accountAddress`, `accountCity`, and `accountOpenDate`. The following cell does this by selecting only the columns we want to keep.

In [None]:
account_df_clean = account_df[["accountID", "transactionDate", "transactionTime", 
                               "accountPostalCode", "accountState", "accountCountry", 
                               "accountAge", "isUserRegistered", "paymentInstrumentAgeInAccount", 
                               "numPaymentRejects1dPerUser"]]

Create a copy of the dataframe so our data manipulation does not affect the original.

In [None]:
account_df_clean = account_df_clean.copy()

Next, convert any values that are not numeric (for example, incorrect string values or garbage data) to NaN, and then fill those NaN values with 0.

In [None]:
account_df_clean['paymentInstrumentAgeInAccount'] = pd.to_numeric(account_df_clean['paymentInstrumentAgeInAccount'], errors='coerce')
account_df_clean['paymentInstrumentAgeInAccount'] = account_df_clean['paymentInstrumentAgeInAccount'].fillna(0)
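
To see what the coerce-then-fill step does, here is a minimal sketch on a small made-up series. The sample values are assumptions for illustration and are not taken from the Woodgrove data.

```python
import pandas as pd

# Hypothetical sample values, not taken from the lab data
raw = pd.Series(["12.5", "abc", None, "3"])

# errors="coerce" turns unparseable strings and missing values into NaN
as_numbers = pd.to_numeric(raw, errors="coerce")

# Replace every NaN with 0
cleaned = as_numbers.fillna(0)

print(cleaned.tolist())
```

Both the garbage string and the missing value end up as 0, which is the behavior the cell above relies on.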

Next, let's convert the `numPaymentRejects1dPerUser` column so that it has a data type of `float` instead of `object`.

In [None]:
account_df_clean["numPaymentRejects1dPerUser"] = account_df_clean["numPaymentRejects1dPerUser"].astype(float)

Let's take a look at the results of our cleanup of this one column. Most users have either zero or one payment decline or reject in a given day, and the counts trail off quickly after that. By 5 rejects per user per day, the count is down to 136.

In [None]:
account_df_clean["numPaymentRejects1dPerUser"].value_counts()

`account_df_clean` is now ready for use in modeling.

### Prepare untagged transactions

Next, clean up the untagged transactions data set. There are 16 columns in the untagged transactions dataframe whose values are all null. Let's drop these columns to simplify our dataset.

In [None]:
untagged_df_clean = untagged_df.dropna(axis=1, how="all").copy()
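
As a quick illustration of `dropna(axis=1, how="all")`, the sketch below uses a made-up three-row dataframe. The column `unusedColumn` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column is partly null, one is entirely null
df = pd.DataFrame({
    "transactionID": ["t1", "t2", "t3"],
    "localHour": [10.0, np.nan, 14.0],
    "unusedColumn": [np.nan, np.nan, np.nan],
})

# axis=1 works on columns; how="all" drops a column only if every value is null
trimmed = df.dropna(axis=1, how="all")

print(trimmed.columns.tolist())
```

The partly null `localHour` column survives, which is why the remaining nulls still need to be handled in the next steps.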

We can examine the count of non-null values and view the inferred data type for each column by running the following cell. Looking at the output of the cell, we have some work to do. For a start, we have columns with fewer than 200,000 non-null values. This means there are some null values in those columns that we need to fix.

In [None]:
untagged_df_clean.info()

Let's clean up the `localHour` field.

Replace null values in `localHour` with `-99`. Also replace values of `-1` with `-99`.

In [None]:
untagged_df_clean["localHour"] = untagged_df_clean["localHour"].fillna(-99)
untagged_df_clean.loc[untagged_df_clean.loc[:,"localHour"] == -1, "localHour"] = -99

Confirm the values now look good.

In [None]:
untagged_df_clean["localHour"].value_counts()

Clean up the remaining null fields:
- Fix missing values in the location fields by setting them to `NA` for unknown.
- Set missing `isProxyIP` values to `False`.
- Set missing `cardType` values to `U` for unknown (which is a new level).
- Set missing `cvvVerifyResult` values to `N`. That is, where the transaction failed because the wrong CVV2 number was entered or no CVV2 number was entered, treat it as if there was no CVV2 match.

In [None]:
untagged_df_clean = untagged_df_clean.fillna(value={"ipState": "NA", "ipPostcode": "NA", "ipCountryCode": "NA", 
                               "isProxyIP":False, "cardType": "U", 
                               "paymentBillingPostalCode" : "NA", "paymentBillingState":"NA",
                               "paymentBillingCountryCode" : "NA", "cvvVerifyResult": "N"
                              })
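
Passing a dictionary to `fillna` applies a different fill value to each named column and leaves all other columns alone. A minimal sketch, using made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the lab data
df = pd.DataFrame({
    "cardType": ["VISA", None],
    "cvvVerifyResult": [None, "M"],
    "localHour": [np.nan, 5.0],
})

# Only the columns named in the dictionary are filled
filled = df.fillna(value={"cardType": "U", "cvvVerifyResult": "N"})

print(filled)
```

Here `localHour` keeps its null because it is not listed in the dictionary.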

Confirm all null values have been addressed.

In [None]:
untagged_df_clean.info()

The `transactionScenario` column provides no insights because all rows have the same `A` value. Let's drop that column. Same idea for the `transactionType` column.

In [None]:
del untagged_df_clean["transactionScenario"]

In [None]:
del untagged_df_clean["transactionType"]

`untagged_df_clean` is now ready for use in modeling.

### Prepare fraud transactions

Now move on to preparing the fraud transactions data set.

The `transactionDeviceId` has no meaningful values, so we will drop it.

In [None]:
fraud_df_clean = fraud_df.copy()
del fraud_df_clean['transactionDeviceId']

The fraud data set has a `localHour` field with missing values that we need to fill, just as we did for the untagged transactions data set.

In [None]:
fraud_df_clean["localHour"] = fraud_df_clean["localHour"].fillna(-99)

Examine your work. You should have 8640 non-null values in each column.

In [None]:
fraud_df_clean.info()

`fraud_df_clean` is now ready for use in modeling.

## Create labels

The goal is to create a dataframe with all transactions, where each transaction is tagged via the `isFraud` column with a value of `0` - no fraud or `1` - fraudulent. 

Any transactions in the untagged transactions dataframe that also appear in the fraud dataframe will be marked as fraudulent.

The remaining transactions will be marked as not fraudulent. 

Run the following cells to create the labels series.

In [None]:
all_labels = untagged_df_clean["transactionID"].isin(fraud_df_clean["transactionID"])

In [None]:
all_transactions = untagged_df_clean
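
The labeling step relies on `Series.isin`, which returns one boolean per row. A minimal sketch with made-up transaction IDs:

```python
import pandas as pd

# Hypothetical IDs for illustration only
untagged = pd.DataFrame({"transactionID": ["a", "b", "c", "d"]})
fraud = pd.DataFrame({"transactionID": ["b", "d", "z"]})

# True where the untagged transaction also appears in the fraud data
labels = untagged["transactionID"].isin(fraud["transactionID"])

print(labels.tolist())  # [False, True, False, True]
```

Note that a fraud ID with no match in the untagged data (`z` here) simply has no effect on the labels.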

## Create feature engineering pipeline

In the following cells, we will define two custom estimators that will be used in the pipeline to prepare the data.

We collect these estimators in a module and then save this module file in the models directory. Then we will use the classes in this module during both model training and model scoring. During deployment, when you register the model using `Model.register` (as we do later in `deployModelAsWebService`), all files in the models directory are uploaded with the model.

The module containing the estimators used to transform the data before scoring needs to be deployed with the model. This ensures that the code can be executed in the context of the webservice, when the serialized pipeline is loaded and used for scoring. 

First we need to create the models directory if it does not already exist.

In [None]:
import uuid
import os

# Create a temporary folder to store locally relevant content for this notebook
tempFolderName = 'FileStore/mcw_cdb_{0}'.format(uuid.uuid4())

models_dir = tempFolderName + "/models/"
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

Then we can save our estimators module.

In [None]:
# write out to models/customestimators.py
scoring_service = """
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
class NumericCleaner(BaseEstimator, TransformerMixin):
    # Stateless transformer, so no __init__ is needed
    def fit(self, X, y=None):
        print("NumericCleaner.fit called")
        return self
    def transform(self, X):
        print("NumericCleaner.transform called")
        X = X.copy()  # work on a copy so the caller's data is not modified
        X["localHour"] = X["localHour"].fillna(-99)
        X.loc[X["localHour"] == -1, "localHour"] = -99
        return X

class CategoricalCleaner(BaseEstimator, TransformerMixin):
    # Stateless transformer, so no __init__ is needed
    def fit(self, X, y=None):
        print("CategoricalCleaner.fit called")
        return self
    def transform(self, X):
        print("CategoricalCleaner.transform called")
        # fillna returns a new dataframe, so the input is left untouched
        X = X.fillna(value={"cardType":"U","cvvVerifyResult": "N"})
        return X
""" 

with open(models_dir + "customestimators.py", "w") as file:
    file.write(scoring_service)

Add the directory containing the new module to the Python search path so the module can be found and loaded in this notebook environment:

In [None]:
import sys
sys.path.append(models_dir)
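
If you want to see the search-path mechanism in isolation, the sketch below writes a tiny module to a temporary folder and imports it. The module name `demo_helpers` and its `double` function are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Write a small, hypothetical module to a temporary directory
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "demo_helpers.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

# Make the directory importable, then import the module by name
sys.path.append(module_dir)
importlib.invalidate_caches()
demo_helpers = importlib.import_module("demo_helpers")

print(demo_helpers.double(21))  # 42
```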

Next, load the estimators.

In [None]:
from customestimators import NumericCleaner, CategoricalCleaner
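
The pattern behind these estimators is small: inherit from `BaseEstimator` and `TransformerMixin`, return `self` from `fit`, and do the work in `transform`. The sketch below is a hypothetical, parameterized variant (`SentinelFiller` is not part of the lab code) that shows the same idea on a three-row dataframe.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SentinelFiller(BaseEstimator, TransformerMixin):
    # Hypothetical transformer: replaces NaN and -1 in one column with a sentinel
    def __init__(self, column="localHour", sentinel=-99):
        self.column = column
        self.sentinel = sentinel

    def fit(self, X, y=None):
        # Nothing to learn, so just return self
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        X[self.column] = X[self.column].fillna(self.sentinel).replace(-1, self.sentinel)
        return X

frame = pd.DataFrame({"localHour": [3.0, None, -1.0]})
result = SentinelFiller().fit_transform(frame)

print(result["localHour"].tolist())  # [3.0, -99.0, -99.0]
```

`TransformerMixin` supplies `fit_transform`, which is what lets a class like this be used as a pipeline step.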

Now build the pipeline that will prepare the data. 

The gist of the following cell is to split the data preparation into two paths, splitting the data sets vertically, and then combine the result. The `ColumnTransformer` will effectively concatenate the data frame that results from the numeric transformations with the data frame resulting from the categorical transformations. 

- Numeric Transformer Pipeline: We use the custom transformer created previously to clean up the numeric columns. Since the model you will train in this notebook is a Support Vector Machine classifier, we need to standardize the scale of the numeric values, which is what the `StandardScaler` provides.
- Categorical Transformer Pipeline: We use the custom transformer created previously to clean up the categorical columns. Then we one-hot encode each value of each categorical column, resulting in a wider data frame with one column for each possible value (and 1 appearing in rows that had that value).

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features=["transactionAmountUSD", "transactionDate", "transactionTime", "localHour", 
                  "transactionIPaddress", "digitalItemCount", "physicalItemCount"]

categorical_features=["transactionCurrencyCode", "browserLanguage", "paymentInstrumentType", "cardType", "cvvVerifyResult"]                           

numeric_transformer = Pipeline(steps=[
    ('cleaner', NumericCleaner()),
    ('scaler', StandardScaler())
])
                               
categorical_transformer = Pipeline(steps=[
    ('cleaner', CategoricalCleaner()),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
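
To make the "split vertically, then concatenate" idea concrete, here is a minimal sketch on a made-up two-column dataframe. It leaves out the custom cleaners, and the column name `amount` is an assumption used only for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "cardType": ["VISA", "MC", "VISA", "AMEX"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cardType"]),
])

encoded = toy_preprocessor.fit_transform(toy)

# One scaled numeric column plus one column per distinct card type
print(encoded.shape)  # (4, 4)
```

The output width is the number of numeric features plus the number of distinct values across the categorical features, which is why the real result is much wider than the input.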

In [None]:
import sklearn
print(sklearn.__version__)

Let's confirm that we can run all our historical data through this transformation pipeline and observe the resulting shape.

In [None]:
preprocessed_result = preprocessor.fit_transform(all_transactions)

In [None]:
preprocessed_result.shape

In [None]:
pd.DataFrame(preprocessed_result.todense())

## Create pipeline and train a simple model

Now you will build upon the transformation pipeline you created previously to train a model to classify rows as fraudulent or not fraudulent.

Run the following cells to make sure you've imported the dependencies for the pipeline (you probably already have, but having them clearly loaded here will help you when porting your code to a web service).

In [None]:
from customestimators import NumericCleaner, CategoricalCleaner
from sklearn.model_selection import train_test_split

As might be obvious, our data has a lot of samples that are not fraudulent. If we proceed to train a model, we will effectively train the model to predict non-fraud. This situation where one class (non-fraud) appears much more often than the others (fraud) is called a class imbalance, and to mitigate its effect we can reduce the number of non-fraud samples so that we have the same number of non-fraud and fraud samples. 

Run the following cells to downsample the non-fraud rows by randomly sampling 1,151 of them, and then union these rows with our 1,151 fraud rows.

> Feel free to ignore any `SettingWithCopyWarning` warnings in the cell output below.

In [None]:
# Copy each slice before adding the label column to avoid SettingWithCopyWarning
only_fraud_samples = all_transactions.loc[all_labels].copy()
only_fraud_samples["label"] = True
only_non_fraud_samples = all_transactions.loc[~all_labels].copy()
only_non_fraud_samples["label"] = False
random_non_fraud_samples = only_non_fraud_samples.sample(n=1151, replace=False, random_state=42)
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
balanced_transactions = pd.concat([random_non_fraud_samples, only_fraud_samples])

balanced_transactions["label"].value_counts()
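
The same downsampling idea can be sketched on a tiny made-up data set, where the size of the minority class decides how many majority rows to keep. The data and the `amount` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical imbalanced data: 3 fraud rows and 7 non-fraud rows
data = pd.DataFrame({
    "amount": range(10),
    "label": [True] * 3 + [False] * 7,
})

fraud = data[data["label"]]
non_fraud = data[~data["label"]]

# Sample as many non-fraud rows as there are fraud rows, without replacement
sampled = non_fraud.sample(n=len(fraud), replace=False, random_state=42)
balanced = pd.concat([sampled, fraud])

print(balanced["label"].value_counts())
```

Fixing `random_state` makes the sample repeatable, which helps when comparing model runs.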

Next, you need to separate out the label column from the dataframe so the labels are not used as input features:

In [None]:
balanced_labels = balanced_transactions["label"]
del balanced_transactions["label"]

Now you will create two subsets of the balanced data frame: one that will be used for training the model (`X_train` and `y_train`) and another that is reserved for testing its performance (`X_test` and `y_test`).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(balanced_transactions, balanced_labels, 
                                                    test_size=0.2, random_state=42)
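
As a small sanity check of what `test_size=0.2` means, the sketch below splits ten made-up samples and inspects the resulting sizes.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy features and labels
features = list(range(10))
labels = [i % 2 for i in features]

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# 80% of the rows go to training and 20% to testing
print(len(f_train), len(f_test))  # 8 2
```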

Now train the model. In this case, you will use the `LinearSVC` class.

> Feel free to ignore any `ConvergenceWarning` warnings in the cell output below

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Pass the steps as a list of (name, estimator) pairs
svm_clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("linear_svc", LinearSVC(C=1, loss="hinge"))
])
svm_clf.fit(X_train, y_train)

Test the model by predicting against a single row from the test set.

In [None]:
svm_clf.predict(X_test[0:1])

Next, evaluate the model by examining how well it is predicting against all data in the training set.

In [None]:
y_train_preds = svm_clf.predict(X_train)

Use a confusion matrix to see how often your model correctly predicted non-fraud and fraud (the top left and bottom right values). Also, examine how your model made mistakes (the bottom left and top right values). In the output below, the columns are predicted non-fraud and predicted fraud, and the rows are actually non-fraud and actually fraud (i.e., as described by the training data).

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
confusion_matrix(y_train, y_train_preds)
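
If the layout of the matrix is hard to keep straight, this minimal sketch on five made-up labels shows which cell is which.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for illustration only
y_true = [False, False, False, True, True]
y_pred = [False, True, False, True, False]

matrix = confusion_matrix(y_true, y_pred)

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

For this toy input the matrix is `[[2, 1], [1, 1]]`: two true negatives, one false positive, one false negative, and one true positive.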

Take a look at the performance of your model using the common set of metrics for a classifier. Do you think this is good or bad?

In [None]:
print("Accuracy:", accuracy_score(y_train, y_train_preds))
print("Precision:", precision_score(y_train, y_train_preds))
print("Recall:", recall_score(y_train, y_train_preds))
print("F1:", f1_score(y_train, y_train_preds))
print("AUC:", roc_auc_score(y_train, y_train_preds))
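
Each of these metrics can be derived from the four confusion matrix counts. The sketch below uses made-up counts (they are not results from this lab) to show the arithmetic. It also shows that when ROC AUC is computed from hard class predictions, as in the cell above, it reduces to the average of recall and specificity.

```python
# Hypothetical confusion matrix counts, not results from this lab
tn, fp, fn, tp = 80, 20, 10, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted frauds, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# With hard predictions the ROC curve has a single operating point,
# so the area under it is the mean of recall and specificity
specificity = tn / (tn + fp)
auc_from_labels = (recall + specificity) / 2

print("Accuracy:", accuracy)
print("Precision:", round(precision, 3))
print("Recall:", recall)
print("F1:", round(f1, 3))
print("AUC:", auc_from_labels)
```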

Given that this is just a parsimonious model, it provides a start that performs better than random (as indicated by the AUC being greater than 0.5). There is more work (such as additional feature engineering) that you would want to do to improve its performance before deploying it in production, but that is out of scope for this lab. A parsimonious model helps us see whether the desired classification is possible given the data, and it allows us to quickly get to something we can deploy as a service to enable integration early on. Then we can iterate by deploying improved versions of the model.

Now, evaluate the model in the same way using the test data set, which the trained model has not seen. How does it perform?

In [None]:
y_test_preds = svm_clf.predict(X_test)
print(confusion_matrix(y_test, y_test_preds))
print("Accuracy:", accuracy_score(y_test, y_test_preds))
print("Precision:", precision_score(y_test, y_test_preds))
print("Recall:", recall_score(y_test, y_test_preds))
print("F1:", f1_score(y_test, y_test_preds))
print("AUC:", roc_auc_score(y_test, y_test_preds))

The overall performance of the model against data it has not seen (the test data) is similar to how it performs with the training data. That's a good sign, indicating we did not overfit the model to the training data.

Next, let's look at the steps to prepare the model for deployment as a web service.

## Save the model to disk

In preparation for deploying the model, you need to save the model to disk.

In [None]:
joblib.dump(svm_clf, models_dir + 'fraud_score.pkl')
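
A save-and-reload round trip can be sketched on a tiny stand-in pipeline. The data, the step names, and the file name `demo_model.pkl` below are assumptions for illustration and are unrelated to the fraud model.

```python
import os
import tempfile

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical, trivially separable toy data
X = [[0.0], [1.0], [9.0], [10.0]]
y = [0, 0, 1, 1]

model = Pipeline(steps=[("scaler", StandardScaler()), ("svc", LinearSVC(C=1))])
model.fit(X, y)

# Serialize the fitted pipeline, then load it back from disk
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored pipeline should make the same predictions as the original
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Because the whole pipeline is pickled, any custom classes it contains must be importable wherever the file is loaded, which is why the estimators module is saved alongside the model.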

## Test loading the model

Next, simulate reloading the model from disk, just as the web service (which you will create in a moment) will have to do.

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from customestimators import NumericCleaner, CategoricalCleaner

# sklearn.externals.joblib was deprecated in 0.21
from sklearn import __version__ as sklearnver
from packaging.version import Version
if Version(sklearnver) < Version("0.21.0"):
    from sklearn.externals import joblib
else:
    import joblib

desired_cols = ['accountID',
 'browserLanguage',
 'cardType',
 'cvvVerifyResult',
 'digitalItemCount',
 'ipCountryCode',
 'ipPostcode',
 'ipState',
 'isProxyIP',
 'localHour',
 'paymentBillingCountryCode',
 'paymentBillingPostalCode',
 'paymentBillingState',
 'paymentInstrumentType',
 'physicalItemCount',
 'transactionAmount',
 'transactionAmountUSD',
 'transactionCurrencyCode',
 'transactionDate',
 'transactionID',
 'transactionIPaddress',
 'transactionTime']

scoring_pipeline = joblib.load(models_dir + 'fraud_score.pkl')

In [None]:
untagged_ds = Dataset.Tabular.from_delimited_files(path = [(ds, 'synapse/Untagged_Transactions.csv')])
untagged_df_fresh = untagged_ds.to_pandas_dataframe()
untagged_df_fresh=untagged_df_fresh[desired_cols]

test_pipeline_preds = scoring_pipeline.predict(untagged_df_fresh)
test_pipeline_preds

In [None]:
one_row = untagged_df_fresh.iloc[:1]
test_pipeline_preds2 = scoring_pipeline.predict(one_row)
test_pipeline_preds2

## Register the model in the Azure ML workspace

In [None]:
import azureml
from azureml.core import Workspace, Webservice
from azureml.core.model import Model
from azureml.exceptions import WebserviceException
from azureml.core.resource_configuration import ResourceConfiguration

In [None]:
print(sklearnver)

Register the model, providing details on the framework that was used to create the model. The cell output above shows the version of scikit-learn we are using. Additionally, we specify the desired resources (CPU and Memory) to be allocated for the deployment of the model.

In [None]:
from azureml.core.resource_configuration import ResourceConfiguration

# Register the model with the workspace
registered_model = Model.register(
    model_path=models_dir,
    model_name="fraud-score",
    workspace=ws,
    model_framework=Model.Framework.SCIKITLEARN,
    model_framework_version=sklearnver,
    resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5),
    description='Fraud Detection Model'
)

> **Note**: Executing the next few cells can take between **7** and **10** minutes.

Deploy the model to Azure Container Instances. Once the deployment is complete, you will see the message **ACI Service creation operation finished, operation "Succeeded"**

In [None]:
# Ignore 
# service_name = "scoringservice"

# # delete any existing service with the same name
# try:
#   Webservice(ws, service_name).delete()
# except WebserviceException:
#   pass

# # deploy the registered model to Azure Container Instances
# service = Model.deploy(ws, service_name, [registered_model])
# service.wait_for_deployment(show_output=True)

Finally, test your deployed web service.

In [None]:
#ignore 
# # test the web service
# import json
# untagged_ds = Dataset.Tabular.from_delimited_files(path = [(ds, 'synapse/Untagged_Transactions.csv')])
# untagged_df_fresh = untagged_ds.to_pandas_dataframe()
# untagged_df_fresh=untagged_df_fresh[desired_cols]
# input_df = untagged_df_fresh.iloc[:5]
# # Convert dataframe to JSON, setting index=False so we don't add the index column
# json_df = input_df.to_json(orient='table', index=False)
# result = service.run(input_data=json_df)
# result

# # Uncomment the line below to output the JSON data for testing the service endpoint from CURL or other external application
# # json.dumps(json_df)
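
To see what a payload produced by `to_json(orient='table', index=False)` looks like without calling a deployed service, here is a minimal sketch on two made-up rows. The values are assumptions for illustration.

```python
import json

import pandas as pd

# Hypothetical rows, not taken from the lab data
frame = pd.DataFrame({
    "transactionID": ["t1", "t2"],
    "transactionAmountUSD": [12.5, 40.0],
})

# orient="table" emits a schema plus the records; index=False omits the index
payload = frame.to_json(orient="table", index=False)
decoded = json.loads(payload)

print(sorted(decoded.keys()))
print(decoded["data"])
```

The records under `data` contain only the dataframe's own columns, so the service does not receive an extra index field.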