# Customer Churn Prediction using Auto-SKLearn
_**Using Auto-SKLearn to Predict Mobile Customer Departure**_

---


Kernel `conda_mxnet_latest_p37` works well with this notebook.

## Contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Classification](#Classification)
1. [Data Loading](#Dataloading)
1. [Auto-SKLearn Results](#Results)
1. [Host](#Host)
1. [Cleanup](#Cleanup)


---

## Introduction<a name="Introduction"></a>

Auto-Sklearn is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Bayesian Optimization search procedure to efficiently discover a top-performing model pipeline for a given dataset. 

This notebook demonstrates AutoML development with Auto-SKLearn as a comparison to SageMaker AutoPilot. SageMaker AutoPilot is currently not available in GovCloud, so we use Auto-SKLearn to run AutoML there.

---
## Setup<a name="Setup"></a>

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role ARN string(s).

Next, let's install auto-sklearn.

In [None]:
! pip3 install auto-sklearn==0.13.0


## Classification<a name="Classification"></a>

The following example shows how to fit a simple classification model with
*auto-sklearn*.


In [None]:
import sagemaker
import boto3
import pandas as pd

import sklearn.datasets
import sklearn.metrics
import autosklearn.classification

# You can modify the following to use a bucket of your choosing
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = "sagemaker/autopilot-cust-churn"


## Data Loading<a name="Dataloading"></a>

We already have a prepared dataset from the AutoPilot lab we just completed. The training file is expected at the following S3 location:

In [None]:
print(f"s3://{bucket}/{prefix}/train/train_data.csv")

The next cell runs a quick test to see whether the dataset files are in S3. If they are not, the AutoPilot lab was not run ahead of time, so the files are re-created and stored in S3. This situation occurs when running the labs in the GovCloud region, since AutoPilot is not available there yet. The cell lets you run the auto-sklearn labs without having to run the AutoPilot lab first.

In [None]:
import boto3
import botocore

src_prefix = "sagemaker/autopilot-cust-churn"

s3 = boto3.resource('s3')

print(f"Testing if file exists at s3://{bucket}/{src_prefix}/train/train_data.csv")

try:
    s3.Object(f'{bucket}', f'{src_prefix}/train/train_data.csv').load()
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        # The object does not exist.
        churn = pd.read_csv("../dataset/churn.txt", index_col=False)
        churn.drop(columns=['Area Code', 'Phone'],inplace=True)
        
        train_data = churn.sample(frac=0.8, random_state=200)
        test_data = churn.drop(train_data.index)
        test_data_no_target = test_data.drop(columns=["Churn?"])
        
        train_file = "train_data.csv"
        train_data.to_csv(train_file, index=False, header=True)
        train_data_s3_path = session.upload_data(path=train_file, key_prefix=src_prefix + "/train")
        print("Train data uploaded to: " + train_data_s3_path)

        test_file = "test_data.csv"
        test_data_no_target.to_csv(test_file, index=False, header=False)
        test_data_s3_path = session.upload_data(path=test_file, key_prefix=src_prefix + "/test")
        print("Test data uploaded to: " + test_data_s3_path)

        test_file_wt = "test_data_w_target.csv"
        test_data.to_csv(test_file_wt, index=False, header=True)
        test_data_s3_path = session.upload_data(path=test_file_wt, key_prefix=src_prefix + "/test")
        print("Test data with target uploaded to: " + test_data_s3_path)
    else:
        # Something else has gone wrong.
        raise
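
The fallback branch above builds the split by sampling 80% of the rows and dropping those rows to obtain the test set. The following is a minimal sketch of that pattern on a tiny synthetic frame (the column values are hypothetical stand-ins, not the real churn data), showing that the two partitions are disjoint and together cover every row.

```python
import pandas as pd

# Tiny synthetic stand-in for the churn data (hypothetical values).
churn = pd.DataFrame({
    "Account Length": range(10),
    "Churn?": ["True." if i % 5 == 0 else "False." for i in range(10)],
})

# Sample 80% of the rows reproducibly; the remaining rows form the test set.
train_data = churn.sample(frac=0.8, random_state=200)
test_data = churn.drop(train_data.index)

# The partitions share no index labels and together cover the whole frame.
overlap = set(train_data.index) & set(test_data.index)
print(len(train_data), len(test_data), len(overlap))
```

Because `drop` works on index labels, this pattern relies on the frame having a unique index, which is the case for a frame freshly read with `pd.read_csv`.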


Let us now load the training dataset from S3.

In [None]:
s3_client = boto3.client("s3")

response = s3_client.get_object(Bucket=bucket, Key=f"{prefix}/train/train_data.csv")

status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    print(f"Successful S3 get_object response. Status - {status}")
    data = pd.read_csv(response.get("Body"),index_col=False)
else:
    print(f"Unsuccessful S3 get_object response. Status - {status}")

Using the loaded dataframe, we now create our training dataset (train_X) and our target dataset (train_Y)

In [None]:
train_X = data[[c for c in data.columns if c != 'Churn?']]
train_Y = data[['Churn?']]

In [None]:
train_X = train_X.copy()

Looking at the training dataset, the columns "State", "Int'l Plan" and "VMail Plan" are strings. So we set their type as 'category' and auto-sklearn will perform categorical encoding on them. NOTE: Without this step, auto-sklearn will fail. This was not needed when we used AutoPilot since it is able to analyze and apply the correct encoding as required.

In [None]:
train_X["State"] = train_X["State"].astype('category')
train_X["Int'l Plan"] = train_X["Int'l Plan"].astype('category')
train_X["VMail Plan"] = train_X["VMail Plan"].astype('category')
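
Any frame passed to the model later should carry the same dtypes as the training data, so it can help to keep the list of categorical columns in one place. The sketch below is one possible way to do that; the helper name `cast_to_category` and the sample values are hypothetical and not part of the original lab.

```python
import pandas as pd

def cast_to_category(df, columns):
    """Return a copy of df with the given columns cast to the 'category' dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out

# Hypothetical sample rows using column names from the churn dataset.
sample = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "Int'l Plan": ["no", "yes", "no"],
    "Day Mins": [265.1, 161.6, 243.4],
})

categorical_columns = ["State", "Int'l Plan"]
converted = cast_to_category(sample, categorical_columns)
print(converted.dtypes)
```

The helper returns a copy, so the original frame is left untouched and numeric columns keep their dtype.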

## Build and fit a classifier

Now that our dataset is ready, we set up an auto-sklearn classifier, since this is a binary classification problem.
The parameters are as follows:

**time_left_for_this_task** = Time limit in seconds for the search for appropriate models. Increasing this value gives auto-sklearn a higher chance of finding better models.

**per_run_time_limit** = Time limit in seconds for a single call to the machine learning model.

**tmp_folder** = Folder in which to store configuration output and log files.

We have set these values low to ensure the jobs complete within the workshop timeframe.
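
To get a feel for how the two time limits interact, here is a rough back-of-the-envelope sketch. It assumes candidates are evaluated one after another and ignores search overhead, ensemble construction, and any parallelism, so it is only an estimate; the function name is hypothetical.

```python
def max_full_length_runs(time_left_for_this_task, per_run_time_limit):
    """Number of sequential runs that fit if every run uses its whole limit."""
    if per_run_time_limit <= 0:
        raise ValueError("per_run_time_limit must be positive")
    return time_left_for_this_task // per_run_time_limit

# Workshop settings used in this notebook: 120 s total, 30 s per run.
workshop = max_full_length_runs(120, 30)

# A hypothetical one-hour budget with six minutes per run, for comparison.
longer = max_full_length_runs(3600, 360)

print(workshop, longer)
```

With the workshop settings only a handful of slow candidates would fit in the budget, although runs that finish before their limit leave room for more candidates.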

In [None]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_classification_example_tmp2',
)

Now we call the **fit** method with the training and target datasets. (This process takes a few minutes to run)

In [None]:
%%time
automl.fit(train_X, train_Y)

## View the models found by auto-sklearn

Now that the auto-sklearn fit process has finished, let's evaluate what it found. We can list the models in the ensemble it created, along with their relevant statistics, by calling the **leaderboard** method.

In [None]:
print(automl.leaderboard())

## Print the final ensemble constructed by auto-sklearn

The **show_models** method returns a representation of the final ensemble found by auto-sklearn.

In [None]:
print(automl.show_models())

## Get the Score of the final ensemble

Now let us load the test dataset that we created and stored in S3. We use the version that includes the target column, since we will need that column later to evaluate the accuracy of the predictions.


In [None]:
response = s3_client.get_object(Bucket=bucket, Key=f"{prefix}/test/test_data_w_target.csv")

status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    print(f"Successful S3 get_object response. Status - {status}")
    test_data = pd.read_csv(response.get("Body"),index_col=False)
else:
    print(f"Unsuccessful S3 get_object response. Status - {status}")

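We now create the test features (**X_test**) by dropping the target column, and keep the target column (**Churn?**) separately in **y_test**. The string columns are cast to 'category' just as they were for the training data, so that the dtypes passed to **predict** match those seen during **fit**.

In [None]:
cat_cols = ["State", "Int'l Plan", "VMail Plan"]
X_test = test_data[[c for c in data.columns if c != 'Churn?']]  # same column order as training
X_test = X_test.astype({c: 'category' for c in cat_cols})  # same dtypes as train_X
y_test = test_data[['Churn?']]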

Let us take a look at the structure of each dataset

In [None]:
X_test.info()

In [None]:
y_test.info()

Now let's make predictions using the **X_test** dataset and then compare the results against **y_test**.

In [None]:
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
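
Accuracy on its own can look good when one class dominates, so it can be worth checking recall for the class of interest as well. The following is a small plain-Python sketch of both metrics; the labels `"yes"` and `"no"` and the helper name are hypothetical and are not taken from the churn dataset.

```python
def accuracy_and_recall(y_true, y_pred, positive):
    """Return (accuracy, recall for the `positive` label)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    actual_positive = sum(t == positive for t, _ in pairs)
    true_positive = sum(t == positive and p == positive for t, p in pairs)
    # Recall is undefined when the positive label never occurs in y_true.
    recall = true_positive / actual_positive if actual_positive else float("nan")
    return correct / len(pairs), recall

# Hypothetical labels: 8 negatives, 2 positives, and a model that always says "no".
y_true = ["no"] * 8 + ["yes"] * 2
y_pred = ["no"] * 10

acc, rec = accuracy_and_recall(y_true, y_pred, positive="yes")
print(acc, rec)
```

In this toy case the always-negative model scores well on accuracy while finding none of the positives. With real predictions, `sklearn.metrics.recall_score` and `sklearn.metrics.classification_report` compute the same kind of per-class figures.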

## How to save the best performing ensemble model

The `automl` variable contains the trained model. We are going to save the model and its artifacts, including the best-performing ensemble, in pickle format.

In [None]:
import os
import pickle

In [None]:
with open(os.path.join(".", 'auto-sklearn.pkl'), 'wb') as out:
    pickle.dump(automl, out)
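
Loading the model back is the mirror image of saving it. The sketch below shows the round trip with a plain dictionary as a stand-in for the fitted estimator (an assumption made so the example runs without a trained model) and a temporary directory instead of the notebook folder.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted estimator (hypothetical); any picklable object works.
model = {"name": "stand-in-model", "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "auto-sklearn.pkl")

    # Save the object in binary mode, as in the cell above.
    with open(model_path, "wb") as out:
        pickle.dump(model, out)

    # Load it back; the result is an equal but separate object.
    with open(model_path, "rb") as src:
        restored = pickle.load(src)

print(restored == model)
```

When unpickling a real model, the environment generally needs the same library versions that were used to save it, and pickle files should only be loaded from trusted sources.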

## Conclusion of Lab 2

In this lab we:

1) Took the dataset, did some data engineering (removing some columns), and provided hints to auto-sklearn (indicating the categorical columns).

2) Let auto-sklearn analyze the dataset and evaluate different combinations of algorithms to create the best-performing model it could find within the time limit.

3) Used a test dataset, once the process was finished, to make predictions and assess the accuracy of our model.

Since this is a workshop, all of the work was done in the notebook using small datasets and constrained execution time limits. This approach works well for development, but it is not suited to productionizing a process. In the next lab we will show how to use the power of SageMaker to turn auto-sklearn training and inference into fully managed services.