# Customer Churn Prediction using Auto-SKLearn
_**Using Auto-SKLearn to Predict Mobile Customer Departure**_

---


Kernel `conda_mxnet_latest_p37` works well with this notebook.

## Contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Classification](#Classification)
1. [Data Loading](#Dataloading)
1. [Auto-SKLearn Results](#Results)
1. [Host](#Host)
1. [Cleanup](#Cleanup)


---

## Introduction<a name="Introduction"></a>

Auto-Sklearn is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Bayesian Optimization search procedure to efficiently discover a top-performing model pipeline for a given dataset. 

This notebook demonstrates AutoML with Auto-SKLearn as an alternative to SageMaker AutoPilot. SageMaker AutoPilot is currently not available in GovCloud, so we use Auto-SKLearn to do AutoML there.

---
## Setup<a name="Setup"></a>

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role ARN string(s).

Next, let's install auto-sklearn.

In [63]:
! pip3 install auto-sklearn==0.13.0




## Classification<a name="Classification"></a>

The following example shows how to fit a simple classification model with
*auto-sklearn*.


In [64]:
import sagemaker
import boto3
import pandas as pd

import sklearn.datasets
import sklearn.metrics
import autosklearn.classification

# You can modify the following to use a bucket of your choosing
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = "sagemaker/autopilot-cust-churn"


## Data Loading<a name="Dataloading"></a>

We already have a prepared dataset from the AutoPilot lab we just completed. Now we will load it.

In [65]:
print(f"s3://{bucket}/{prefix}/train/train_data.csv")

s3://sagemaker-us-east-1-650687152614/sagemaker/autopilot-cust-churn/train/train_data.csv
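The rest of the notebook reads and writes several objects under the same bucket and prefix. As a minimal sketch of that layout, the helper below builds the `s3://` URIs from their parts. The function name `s3_uri` and the bucket name `example-bucket` are assumptions made for illustration; they are not part of the SageMaker SDK or of this lab.

```python
from posixpath import join as s3_join


def s3_uri(bucket: str, *key_parts: str) -> str:
    """Return an s3:// URI built from a bucket name and key segments."""
    # Strip stray slashes so segments always join with exactly one "/".
    key = s3_join(*(part.strip("/") for part in key_parts))
    return f"s3://{bucket}/{key}"


# Hypothetical bucket name, used only for illustration.
bucket = "example-bucket"
prefix = "sagemaker/autopilot-cust-churn"

train_uri = s3_uri(bucket, prefix, "train", "train_data.csv")
test_uri = s3_uri(bucket, prefix, "test", "test_data_w_target.csv")
print(train_uri)
print(test_uri)
```

Keeping the key layout in one place makes it harder for the upload and download cells to drift apart.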


Quick test to see if the dataset files are in S3. If they are not, the AutoPilot lab was not run ahead of time, so the files are re-created and stored in S3. This situation occurs when running the labs in the GovCloud region, since AutoPilot is not available there yet. The next cell lets you run the auto-sklearn labs without having to run the AutoPilot lab.

In [66]:
import boto3
import botocore

src_prefix = "sagemaker/autopilot-cust-churn"

s3 = boto3.resource('s3')

print(f"Testing if file exists at s3://{bucket}/{src_prefix}/train/train_data.csv")

try:
    s3.Object(f'{bucket}', f'{src_prefix}/train/train_data.csv').load()
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        # The object does not exist.
        churn = pd.read_csv("../dataset/churn.txt", index_col=False)
        churn.drop(columns=['Area Code', 'Phone'],inplace=True)
        train_data = churn.sample(frac=0.8, random_state=200)
        test_data = churn.drop(train_data.index)
        test_data_no_target = test_data.drop(columns=["Churn?"])
        train_file = "train_data.csv"
        train_data.to_csv(train_file, index=False, header=True)
        # Use the `session` object created earlier (there is no `sess` variable in this notebook).
        train_data_s3_path = session.upload_data(path=train_file, key_prefix=src_prefix + "/train")
        print("Train data uploaded to: " + train_data_s3_path)

        test_file = "test_data.csv"
        test_data_no_target.to_csv(test_file, index=False, header=False)
        test_data_s3_path = session.upload_data(path=test_file, key_prefix=src_prefix + "/test")
        print("Test data uploaded to: " + test_data_s3_path)

        test_file_wt = "test_data_w_target.csv"
        test_data.to_csv(test_file_wt, index=False, header=True)
        test_data_s3_path = session.upload_data(path=test_file_wt, key_prefix=src_prefix + "/test")
        print("Test data with target uploaded to: " + test_data_s3_path)
    else:
        # Something else has gone wrong.
        raise


Testing if file exists at s3://sagemaker-us-east-1-650687152614/sagemaker/autopilot-cust-churn/train/train_data.csv


In [67]:
s3_client = boto3.client("s3")

response = s3_client.get_object(Bucket=bucket, Key=f"{prefix}/train/train_data.csv")

status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    print(f"Successful S3 get_object response. Status - {status}")
    data = pd.read_csv(response.get("Body"),index_col=False)
else:
    print(f"Unsuccessful S3 get_object response. Status - {status}")

Successful S3 get_object response. Status - 200


In [68]:
train_X = data[[c for c in data.columns if c != 'Churn?']]
train_Y = data[['Churn?']]

In [69]:
train_X = train_X.copy()

In [70]:
train_X["State"] = train_X["State"].astype('category')
train_X["Int'l Plan"] = train_X["Int'l Plan"].astype('category')
train_X["VMail Plan"] = train_X["VMail Plan"].astype('category')

## Build and fit a classifier



In [71]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_classification_example_tmp2',
)

In [72]:
%%time
automl.fit(train_X, train_Y)

CPU times: user 1.05 s, sys: 366 ms, total: 1.42 s
Wall time: 2min 4s


AutoSklearnClassifier(per_run_time_limit=30, time_left_for_this_task=120,
                      tmp_folder='/tmp/autosklearn_classification_example_tmp2')
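The two time limits passed to `AutoSklearnClassifier` interact: `time_left_for_this_task` is the overall search budget in seconds and `per_run_time_limit` caps a single candidate pipeline. The sketch below is a rough sanity check on those two values before starting a search. The helper `check_time_budget` is hypothetical and not part of auto-sklearn, and the arithmetic ignores any overhead the library itself adds.

```python
def check_time_budget(time_left_for_this_task: int, per_run_time_limit: int) -> int:
    """Validate the two limits and return how many full-length runs fit back to back."""
    if time_left_for_this_task <= 0 or per_run_time_limit <= 0:
        raise ValueError("both limits must be positive")
    if per_run_time_limit > time_left_for_this_task:
        raise ValueError("per_run_time_limit exceeds the total budget")
    return time_left_for_this_task // per_run_time_limit


# Same values as the classifier configured in this notebook.
automl_kwargs = {"time_left_for_this_task": 120, "per_run_time_limit": 30}
min_full_runs = check_time_budget(**automl_kwargs)
print(f"Runs that fit if each one uses its full limit: {min_full_runs}")
```

Candidates that finish well under the per-run limit leave room for more evaluations, so in practice the search can try more pipelines than this figure suggests.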

## View the models found by auto-sklearn



In [73]:
print(automl.leaderboard())

          rank  ensemble_weight               type      cost   duration
model_id                                                               
12           1             0.22                mlp  0.068939  19.899640
8            2             0.38        extra_trees  0.081061   4.948165
2            3             0.10      random_forest  0.081818   3.585315
7            4             0.14  gradient_boosting  0.083333   2.048503
4            5             0.08                lda  0.143182   1.107202
5            6             0.06        gaussian_nb  0.159091   2.233084
3            7             0.02        gaussian_nb  0.170455   0.988530
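To help read the leaderboard, the sketch below re-enters the rows printed above and derives two things: the total of the ensemble weights, and an approximate validation accuracy per model. The conversion assumes that `cost` is the validation loss, that is, 1 minus accuracy under the default accuracy metric. Treat that as an assumption to confirm against the auto-sklearn documentation for your version.

```python
# (model type, ensemble_weight, cost) copied from the leaderboard output above.
leaderboard = [
    ("mlp", 0.22, 0.068939),
    ("extra_trees", 0.38, 0.081061),
    ("random_forest", 0.10, 0.081818),
    ("gradient_boosting", 0.14, 0.083333),
    ("lda", 0.08, 0.143182),
    ("gaussian_nb", 0.06, 0.159091),
    ("gaussian_nb", 0.02, 0.170455),
]

# The weights describe each model's share of the final ensemble.
total_weight = sum(weight for _, weight, _ in leaderboard)

# Assumption: cost = 1 - validation accuracy.
validation_accuracy = [(name, round(1.0 - cost, 4)) for name, _, cost in leaderboard]

# The model with the largest share of the ensemble.
heaviest = max(leaderboard, key=lambda row: row[1])

print(f"Total ensemble weight: {total_weight:.2f}")
print(f"Heaviest model: {heaviest[0]} (weight {heaviest[1]})")
for name, accuracy in validation_accuracy:
    print(f"{name:>18}: {accuracy}")
```

Note that the rank-1 model (lowest cost) does not carry the largest weight here: the ensemble weights reflect how much each model contributes to the combined prediction, not its individual rank.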


## Print the final ensemble constructed by auto-sklearn



In [74]:
print(automl.show_models())

[(0.380000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'extra_trees', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'robust_scaler', 'feature_preprocessor:__choice__': 'fast_ica', 'classifier:extra_trees:bootstrap': 'False', 'classifier:extra_trees:criterion': 'gini', 'classifier:extra_trees:max_depth': 'None', 'classifier:extra_trees:max_features': 0.4986263167495816, 'classifier:extra_trees:max_leaf_nodes': 'None', 'classifier:extra_trees:min_impurity_decrease': 0.0, 'classifier:extra_trees:min_samples_leaf': 1, 'classifier:extra_trees:min_samples_split': 4, 'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'data_preprocessing:categorical_transformer:ca

## Get the Score of the final ensemble

Now let us load the test dataset that we created and stored in S3. We are going to use the test dataset with the target column, since we will use this column later to evaluate the accuracy of the predictions.


In [75]:
response = s3_client.get_object(Bucket=bucket, Key=f"{prefix}/test/test_data_w_target.csv")

status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    print(f"Successful S3 get_object response. Status - {status}")
    test_data = pd.read_csv(response.get("Body"),index_col=False)
else:
    print(f"Unsuccessful S3 get_object response. Status - {status}")

Successful S3 get_object response. Status - 200


We are now going to split the test dataset into the features (X_test) and the target column (the Churn? column, stored in y_test).

In [76]:
X_test = test_data[[c for c in test_data.columns if c != 'Churn?']]
y_test = test_data[['Churn?']]

In [77]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   State           1000 non-null   object 
 1   Account Length  1000 non-null   int64  
 2   Int'l Plan      1000 non-null   object 
 3   VMail Plan      1000 non-null   object 
 4   VMail Message   1000 non-null   int64  
 5   Day Mins        1000 non-null   float64
 6   Day Calls       1000 non-null   int64  
 7   Day Charge      1000 non-null   float64
 8   Eve Mins        1000 non-null   float64
 9   Eve Calls       1000 non-null   int64  
 10  Eve Charge      1000 non-null   float64
 11  Night Mins      1000 non-null   float64
 12  Night Calls     1000 non-null   int64  
 13  Night Charge    1000 non-null   float64
 14  Intl Mins       1000 non-null   float64
 15  Intl Calls      1000 non-null   int64  
 16  Intl Charge     1000 non-null   float64
 17  CustServ Calls  1000 non-null   in

In [78]:
y_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Churn?  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


In [79]:
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

Accuracy score: 0.94
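Accuracy alone can hide how well the model finds the customers who actually churn, especially if churners are a minority of the test set. As a minimal sketch, the helper below derives precision and recall for the positive class from two label sequences. The function name `binary_report`, the toy labels, and the label strings `"True."` and `"False."` are assumptions for illustration; substitute the values found in your own `Churn?` column.

```python
from collections import Counter


def binary_report(y_true, y_pred, positive):
    """Return accuracy, precision and recall for the given positive label."""
    counts = Counter((t == positive, p == positive) for t, p in zip(y_true, y_pred))
    tp = counts[(True, True)]    # churners predicted as churners
    fp = counts[(False, True)]   # non-churners predicted as churners
    fn = counts[(True, False)]   # churners the model missed
    tn = counts[(False, False)]  # non-churners predicted correctly
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels (hypothetical), not taken from the lab's test set.
y_true = ["False."] * 8 + ["True."] * 2
y_pred = ["False."] * 8 + ["True.", "False."]

report = binary_report(y_true, y_pred, positive="True.")
print(report)
```

In this toy case the accuracy is high even though half of the churners are missed, which is why recall is worth checking alongside the accuracy score. With the notebook's own data you would pass `y_test['Churn?']` and `predictions` instead of the toy lists.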


In [80]:
type(predictions)

numpy.ndarray

In [81]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   State           1000 non-null   object 
 1   Account Length  1000 non-null   int64  
 2   Int'l Plan      1000 non-null   object 
 3   VMail Plan      1000 non-null   object 
 4   VMail Message   1000 non-null   int64  
 5   Day Mins        1000 non-null   float64
 6   Day Calls       1000 non-null   int64  
 7   Day Charge      1000 non-null   float64
 8   Eve Mins        1000 non-null   float64
 9   Eve Calls       1000 non-null   int64  
 10  Eve Charge      1000 non-null   float64
 11  Night Mins      1000 non-null   float64
 12  Night Calls     1000 non-null   int64  
 13  Night Charge    1000 non-null   float64
 14  Intl Mins       1000 non-null   float64
 15  Intl Calls      1000 non-null   int64  
 16  Intl Charge     1000 non-null   float64
 17  CustServ Calls  1000 non-null   in

In [82]:
print(f"{prefix}/test/test_data_w_target.csv")

sagemaker/autopilot-cust-churn/test/test_data_w_target.csv


## How to save the best performing ensemble model

The `automl` variable contains the trained model. We are going to save the model and model artifacts of the best performing ensemble model in pickle format.

In [83]:
import os
import pickle

In [84]:
with open(os.path.join(".", 'auto-sklearn.pkl'), 'wb') as out:
    pickle.dump(automl, out)
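To use the saved model later, the pickle file has to be read back with `pickle.load`. The sketch below shows the full round trip with a plain dictionary standing in for the fitted `automl` object, so that it runs without auto-sklearn installed. The stand-in contents and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted `automl` object; any picklable object follows the same steps.
model_stand_in = {"name": "auto-sklearn-ensemble", "weights": [0.38, 0.22]}

with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "auto-sklearn.pkl")

    # Save: same pattern as the cell above.
    with open(model_path, "wb") as out:
        pickle.dump(model_stand_in, out)

    # Load: open in binary read mode and unpickle.
    with open(model_path, "rb") as src:
        restored = pickle.load(src)

print(restored == model_stand_in)
```

When loading the real model, the environment needs auto-sklearn installed, ideally at the same version used for training (0.13.0 in this notebook), and you should only unpickle files from a source you trust, since unpickling can execute arbitrary code.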