# Predictive Modelling for Classification

## Import Libraries

In [None]:
import sys
# Add the parent folder to the module search path (session-local; system variables are not changed)
sys.path.append("..")
from datetime import datetime
import numpy as np  # required for np.load() in the preparation step
import pandas as pd
from scripts.helper import reduce_mem_usage
from pycaret.classification import *
from pycaret.utils import check_metric
from pycaret.datasets import get_data
import pickle
pd.set_option("display.max_columns", 120)

from model.config import TRACKING_URI, EXPERIMENT_NAME
import mlflow

## Import Datasets

In [None]:
# Create a date stamp (DDMMYYYY), used later in the saved model's file name
today = datetime.today()
d1 = today.strftime("%d%m%Y")

In [None]:
# Test dataset from pycaret library for classification modelling
#dataset_pycaret = get_data('credit')

# Capstone dataset!
dataset = pd.read_csv('data/feat_train_v2.csv')
dataset_test = pd.read_csv('data/feat_test_v2.csv')

In [None]:
print("There are {} observations and {} features in this featured train dataset. \n".format(dataset.shape[0],dataset.shape[1]))

In [None]:
print("There are {} observations and {} features in this featured test dataset. \n".format(dataset_test.shape[0],dataset_test.shape[1]))

In [None]:
#dataset_pycaret.info()

## Preparation for Modelling

In [None]:
numerical_cols = np.load("data/Numerical_Columns.npy")
categorical_cols = np.load("data/Categorical_Columns.npy")

In [None]:
type(numerical_cols)

In [None]:
numerical_cols = numerical_cols.tolist()
categorical_cols = categorical_cols.tolist()

In [None]:
type(numerical_cols)

In [None]:
# Create the binary target for the classification model (0 = zero revenue, 1 = otherwise)
# .copy() avoids pandas' SettingWithCopyWarning when the Target column is added
class_train = dataset[categorical_cols+numerical_cols].copy()
class_train['Target'] = dataset['totals.transactionRevenue'].apply(lambda x: 0 if x == 0 else 1)

class_test = dataset_test[categorical_cols+numerical_cols].copy()
class_test['Target'] = dataset_test['totals.transactionRevenue'].apply(lambda x: 0 if x == 0 else 1)

### Removing some zeros!

In [None]:
# Keep a random 25% of the zero-target rows and all non-zero rows to reduce class imbalance
totals_transactionRevenue_zero = class_train[class_train['Target'] == 0].sample(frac=0.25, random_state=123)
totals_transactionRevenue_nonzero = class_train[class_train['Target'] != 0]
class_train = pd.concat([totals_transactionRevenue_zero, totals_transactionRevenue_nonzero], axis=0)
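The effect of this downsampling on the class balance can be worked out with simple arithmetic. The sketch below is only an illustration: the helper `positive_share` and the row counts are hypothetical and are not taken from the capstone dataset.

```python
def positive_share(n_zero, n_nonzero, keep_frac=1.0):
    """Share of non-zero (positive) rows after keeping `keep_frac` of the zero rows."""
    kept_zero = round(n_zero * keep_frac)
    return n_nonzero / (kept_zero + n_nonzero)

# Hypothetical counts, chosen only to illustrate the calculation
before = positive_share(n_zero=9_900, n_nonzero=100)
after = positive_share(n_zero=9_900, n_nonzero=100, keep_frac=0.25)
print(f"positive share before: {before:.4f}, after: {after:.4f}")
```

Keeping a quarter of the zero rows raises the positive share by a little less than a factor of four, because the positive rows themselves remain in the denominator.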

In [None]:
class_train.head()

## Binary Classification

Binary classification is a supervised machine learning technique where the goal is to predict categorical class labels which are discrete and unordered, such as Pass/Fail, Positive/Negative, Default/Not-Default, etc. A few real-world use cases for classification are listed below:

- Medical testing to determine if a patient has a certain disease or not: the classification property is the presence of the disease.
- A "pass or fail" test method or quality control in factories, i.e. deciding if a specification has or has not been met: a go/no-go classification.
- Information retrieval, namely deciding whether a page or an article should be in the result set of a search or not: the classification property is the relevance of the article, or the usefulness to the user.

To demonstrate the predict_model() function on unseen data, the separate test dataset (class_test) is held back and used only for predictions. This should not be confused with a train/test split, as this particular split is performed to simulate a real-life scenario. Another way to think about this is that these records are not available at the time when the machine learning experiment is performed.

In [None]:
data_unseen = class_test
#data.reset_index(inplace=True, drop=True)
#data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(class_train.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

## 1.0 Setting up environment in PyCaret

The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline.

When setup() is executed, PyCaret's inference algorithm automatically infers the data types of all features based on certain properties. The data types are usually inferred correctly, but not always. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, press enter to continue; otherwise type quit to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret, as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These tasks are performed differently for each data type, so it is very important that the types are correctly configured.

In [None]:
print(class_train.info())
# class_train[categorical_cols] = class_train[categorical_cols].astype('category')
# class_test[categorical_cols] = class_test[categorical_cols].astype('category')

In [None]:
start_time = datetime.now() # Set start point for time analysis

exp_clf101 = setup(data = class_train, target = 'Target', session_id=123, data_split_stratify = True, fold_strategy = 'stratifiedkfold', fix_imbalance = True, numeric_features = categorical_cols+numerical_cols)

end_time = datetime.now() # Set end point for time analysis
print('Duration: {}'.format(end_time - start_time)) # Print out the elapsed time

## 2.0 Comparing all models

Comparing all models to evaluate performance is the recommended starting point for modeling once the setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using stratified cross-validation for metric evaluation. The output prints a score grid that shows average Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC across the folds (10 by default) along with training times.
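To make "stratified" concrete, here is a simplified sketch of how rows can be assigned to folds so that every fold keeps roughly the same class proportions. It is not PyCaret's implementation (PyCaret delegates to scikit-learn's splitters); the function `stratified_folds` and the toy labels are assumptions made for illustration.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign row indices to folds round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Toy labels: 10 positives and 40 negatives (illustrative only)
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels, n_folds=5)
for fold_no, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {fold_no}: {len(fold)} rows, {positives} positives")
```

With an imbalanced target such as the one in this notebook, stratification helps ensure that no fold ends up with too few positive rows to evaluate on.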

In [None]:
models()

## Comparing models - Don't run if you have just one model, skip to next part

Run if you need to compare only!

In [None]:
#start_time = datetime.now()

#best_model = compare_models()

#end_time = datetime.now()
#print('Duration: {}'.format(end_time - start_time))

When the cell above is run, two simple words of code (not even a full line) train and evaluate over 15 models using cross-validation. The resulting score grid highlights the best value of each metric for comparison purposes only. By default the grid is sorted by 'Accuracy' (highest to lowest), which can be changed by passing the sort parameter. For example, compare_models(sort = 'Recall') sorts the grid by Recall instead of Accuracy. To change the number of folds from the default of 10, use the fold parameter. For example, compare_models(fold = 5) compares all models on 5-fold cross-validation. Reducing the number of folds reduces the training time. By default, compare_models returns the best-performing model based on the sort order, but it can return a list of the top N models via the n_select parameter.
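The three parameters mentioned above can be combined in one call. The wrapper below is a minimal sketch: the name `compare_top_models` and its default values are hypothetical, and it assumes setup() has already been run in the session.

```python
def compare_top_models(sort="Recall", fold=5, n_select=3):
    """Return the top `n_select` models ranked by `sort` using `fold`-fold CV."""
    # Imported here so the helper can be defined before PyCaret is loaded
    from pycaret.classification import compare_models
    return compare_models(sort=sort, fold=fold, n_select=n_select)

# Usage, after setup() has been executed:
# top_models = compare_top_models()   # list of 3 models, best Recall first
```

Sorting by Recall rather than Accuracy may be the more informative choice here, since the target is imbalanced and accuracy can look high even when few positives are found.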

In [None]:
#print(best_model)

## 3.0 Create a Model

create_model is the most granular function in PyCaret and is often the foundation behind most of the PyCaret functionalities. As the name suggests this function trains and evaluates a model using cross validation that can be set with fold parameter. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold.

There are 18 classifiers available in the PyCaret model library. To see the list of all classifiers, either check the docstring or call the models() function.

### 3.1 LGBM

In [None]:
lgbm = create_model('lightgbm')

Hyperparameter tuning on GPU with an XGBoost classifier is also possible.

## 4. Tune a Model

When a model is created using the create_model() function, it is trained with the default hyperparameters. To tune the hyperparameters, the tune_model() function is used. This function automatically tunes the hyperparameters of a model using a random grid search over a pre-defined search space. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by fold for the best model. To use a custom search grid, pass the custom_grid parameter to the tune_model() function.
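As an illustration of custom_grid, the sketch below defines a small search space for a LightGBM model and passes it to tune_model(). The grid values, the name `lgbm_grid`, and the wrapper `tune_with_custom_grid` are assumptions chosen for illustration, not settings used in this project.

```python
import math

# Hypothetical search space; values are illustrative, not tuned for this dataset
lgbm_grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
}

# Size of the full grid; a random search samples only n_iter of these
n_combinations = math.prod(len(values) for values in lgbm_grid.values())
print(f"{n_combinations} possible combinations")

def tune_with_custom_grid(model, grid, n_iter=10, optimize="Recall"):
    """Tune `model` over `grid`, sampling `n_iter` candidate settings."""
    # Imported here so the helper can be defined before PyCaret is loaded
    from pycaret.classification import tune_model
    return tune_model(model, custom_grid=grid, n_iter=n_iter, optimize=optimize)

# Usage, after create_model('lightgbm'):
# tuned_lgbm = tune_with_custom_grid(lgbm, lgbm_grid)
```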

### 4.1 LGBM Tuning

In [None]:
#tuned_lgbm = tune_model(lgbm)

In [None]:
#print(tuned_lgbm)

### ...

## 5. Plot a Model

Before model finalization, the plot_model() function can be used to analyze the performance across different aspects such as AUC, confusion_matrix, decision boundary etc. This function takes a trained model object and returns a plot based on the test / hold-out set.

There are 15 different plots available, please see the plot_model() docstring for the list of available plots.

### 5.1 AUC Plot

In [None]:
plot_model(lgbm, plot = 'auc')

### 5.2 Precision-Recall Curve

In [None]:
plot_model(lgbm, plot = 'pr')

### 5.3 Feature Importance Plot

In [None]:
plot_model(lgbm, plot='feature')

### 5.4 Confusion Matrix

In [None]:
plot_model(lgbm, plot = 'confusion_matrix')

Another way to analyze the performance of models is to use the evaluate_model() function which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.

In [None]:
evaluate_model(lgbm)

## 6 Predict on test / hold-out Sample

Before finalizing the model, it is advisable to perform one final check by predicting the test/hold-out set and reviewing the evaluation metrics. Now, using the trained model stored in the lgbm variable, we will predict against the hold-out sample and evaluate the metrics to see if they are materially different from the CV results.

In [None]:
predict_model(lgbm)

## 7 Finalize Model for Deployment

Model finalization is the last step in the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by comparing all models using compare_models() and shortlisting a few candidate models (based on the metric of interest) to perform several modeling techniques such as hyperparameter tuning, ensembling, stacking etc. This workflow will eventually lead you to the best model for use in making predictions on new and unseen data. The finalize_model() function fits the model onto the complete dataset including the test/hold-out sample. The purpose of this function is to train the model on the complete dataset before it is deployed in production.

In [None]:
final_lgbm = finalize_model(lgbm)

In [None]:
# Final LightGBM model parameters for deployment
print(final_lgbm)

Caution: One final word of caution. Once the model is finalized using finalize_model(), the entire dataset including the test/hold-out set is used for training. As such, if the model is used for predictions on the hold-out set after finalize_model() is used, the information grid printed will be misleading as you are trying to predict on the same data that was used for modeling.

In [None]:
predict_model(final_lgbm);

Notice how the AUC reported for final_lgbm (0.9868) differs from the hold-out value in section 6 (0.9871), even though the model is the same. This is because the final_lgbm variable has been trained on the complete dataset including the test/hold-out set.

## 8. Predict on unseen data

The predict_model() function is also used to predict on the unseen dataset. The only difference from section 6 above is that this time we will pass data_unseen through the data parameter. data_unseen is the variable created at the beginning.

In [None]:
unseen_predictions = predict_model(lgbm, data=data_unseen)

In [None]:
#unseen_predictions.Label.describe()
#unseen_predictions.head()

The Label and Score columns are added to the data_unseen set. Label is the prediction and Score is the probability of the prediction. Notice that the predicted results are concatenated to the original dataset while all the transformations are performed automatically in the background. You can also check the metrics on this set, since the actual Target column is available. To do that we will use the pycaret.utils module. See the example below:

In [None]:
check_metric(unseen_predictions['Target'], unseen_predictions['Label'], metric = 'Recall')
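For reference, recall is the share of actual positives that the model finds: TP / (TP + FN). The hand-rolled version below is a sketch for understanding only; the function `recall` and the toy labels are illustrative and are not part of PyCaret.

```python
def recall(actual, predicted, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: 4 actual positives, 3 of them predicted correctly
actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 1]
print(recall(actual, predicted))  # 0.75
```

The false positive in the fifth position does not affect recall; it would lower precision instead.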

## 9. Saving the model

We have now finished the experiment by finalizing the lgbm model, which is now stored in the final_lgbm variable. We have also used the trained model to predict data_unseen. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no. PyCaret's built-in function save_model() allows you to save the model along with the entire transformation pipeline for later use.

In [None]:
save_model(final_lgbm,'model/Class_lgbm_Model_{}'.format(d1))

## 10. Loading the saved model

To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's load_model() function and then easily apply the saved model on new unseen data for prediction.

In [None]:
saved_final_lgbm = load_model('model/Class_lgbm_Model_{}'.format(d1))

In [None]:
plot_model(final_lgbm, plot='confusion_matrix')

## 11. Creating submission file

In [None]:
sub_class = unseen_predictions['Label']
sub_class.head()

In [None]:
sub_class.to_csv("model/sub_class.csv",index=False)

In [None]:
pd.read_csv("model/sub_class.csv")

Once the model is loaded in the environment, you can use it to predict on any new data with the same predict_model() function, for example on the same data_unseen that we used in section 8 above.
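A minimal sketch of that load-and-predict step is shown here. The helper name `predict_with_saved_model` is hypothetical; it assumes the model was saved with save_model() and that the new data has the same feature columns as the training data.

```python
def predict_with_saved_model(model_path, new_data):
    """Load a saved PyCaret pipeline and score `new_data` with it."""
    # Imported here so the helper can be defined before PyCaret is loaded
    from pycaret.classification import load_model, predict_model

    # load_model() expects the path without the .pkl extension
    pipeline = load_model(model_path)
    return predict_model(pipeline, data=new_data)

# Usage, with the path written by save_model() in section 9:
# new_predictions = predict_with_saved_model('model/Class_lgbm_Model_{}'.format(d1), data_unseen)
```

Because the saved object contains the whole transformation pipeline, the raw feature frame can be passed in without repeating the pre-processing steps.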

## 12. MLflow implementation

In [None]:
#!mlflow ui
TRACKING_URI

In [None]:
mlflow.set_tracking_uri(TRACKING_URI)
mlflow.set_experiment(EXPERIMENT_NAME)

In [None]:
client = mlflow.tracking.MlflowClient()

In [None]:
mlflow.start_run()

In [None]:
mlflow.log_metric("train -" + "MSE", 1)

In [None]:
mlflow.end_run()
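The start/log/end sequence above can also be written with a context manager, which closes the run even if a logging call raises. This is a sketch under assumptions: the helper `log_classification_run` and the parameter and metric names in the usage comment are hypothetical, and the tracking URI and experiment are assumed to be set as in the cells above.

```python
def log_classification_run(params, metrics, run_name=None):
    """Log a dict of parameters and a dict of metrics in a single MLflow run."""
    # Imported here so the helper can be defined before MLflow is configured
    import mlflow

    # The context manager ends the run automatically, even on error
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)

# Usage (names are illustrative; pass the metric value actually computed above):
# log_classification_run({"model": "lightgbm"}, {"unseen_recall": recall_value})
```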