# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

<hr>

# Assignment 5 - Charles Alders

## Imports and reading data

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit

In [37]:
df = pd.read_csv("prepped_churn_data.csv", index_col=0)
df.head()

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_charges_tenure_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,29.85
5575-GNVDE,34,1,1,1,56.95,1889.5,0,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075
7795-CFOCW,45,0,1,2,42.3,1840.75,0,40.905556
9237-HQITU,2,1,0,0,70.7,151.65,1,75.825


## Splitting data

Split the data into features and targets, then into train and test sets for our model. Passing `stratify=targets` keeps the proportion of each target class the same in both sets.

In [38]:
features = df.drop("Churn", axis=1)
targets = df.Churn

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=26)
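
To see what `stratify` does, here is a minimal sketch on synthetic labels. The 26% positive rate and the array sizes are made up for illustration and are not taken from the churn data. It checks that the share of positives is the same in the train and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 26% positives (illustrative values only).
y = np.array([1] * 260 + [0] * 740)
X = np.arange(len(y)).reshape(-1, 1)

# Same call pattern as the notebook cell: default 25% test size, stratified split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=26)

# With stratify, both splits keep the original class balance.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall: {y.mean():.3f}, train: {train_rate:.3f}, test: {test_rate:.3f}")
```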

## Using TPOT to find the best model

In [39]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, random_state=26, scoring='accuracy', verbosity=2, n_jobs=-1)
tpot.fit(x_train, y_train)

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7961695009757457

Generation 2 - Current best internal CV score: 0.7965470291464698

Generation 3 - Current best internal CV score: 0.7965470291464698

Generation 4 - Current best internal CV score: 0.7965470291464698

Generation 5 - Current best internal CV score: 0.7969290538413807

Best pipeline: ExtraTreesClassifier(CombineDFs(input_matrix, input_matrix), bootstrap=True, criterion=entropy, max_features=0.25, min_samples_leaf=3, min_samples_split=7, n_estimators=100)
Wall time: 2min 36s


TPOTClassifier(generations=5, n_jobs=-1, population_size=50, random_state=26,
               scoring='accuracy', verbosity=2)
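
The progress bar's total of 300 follows from the TPOT settings: TPOT evaluates the initial population plus one batch of offspring per generation, and `offspring_size` defaults to `population_size`. A quick arithmetic check (the variable names are only for illustration):

```python
population_size = 50
generations = 5
offspring_size = population_size  # TPOT's default when offspring_size is not set

# Initial population plus one batch of offspring per generation.
total_pipelines = population_size + generations * offspring_size
print(total_pipelines)  # 300
```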

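It looks like ExtraTreesClassifier is the best algorithm for this dataset. The cell below compares TPOT's predictions with the actual values in the test set. The printed samples look pretty good. The tail of the predictions contains a 1 where the printed tail of `y_test` has only 0s, but because the slice `[-6:-1]` skips the final prediction, the two tails are offset by one row and cannot be compared position by position.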

In [49]:
# For some reason this gave me a warning when I ran it on my MacBook, but not when I ran it on Windows 10.
# I fixed it on macOS by fitting the model with x_train.values, but that caused the best model to be far more complex.

predictions = tpot.predict(x_test)

# Comparing predictions to the test set.
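# Printing the whole array shows only the first and last 3 values, so print slices instead.
# Note: [-6:-1] omits the final prediction; use [-5:] to get the last five.
print(predictions[0:5], "...", predictions[-6:-1])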
print(y_test)

from sklearn.metrics import accuracy_score
print(f'\n\nAccuracy of TPOT predictions: {accuracy_score(y_test, predictions)}')

[0 0 0 0 1] ... [1 0 0 0 0]
customerID
2969-WGHQO    0
8034-RYTVV    0
7025-WCBNE    0
6137-MFAJN    0
1792-UXAFY    1
             ..
6967-QIQRV    0
9761-XUJWD    0
3705-RHRFR    0
4801-KFYKL    0
9357-UJRUN    0
Name: Churn, Length: 1758, dtype: int64


Accuracy of TPOT predictions: 0.7957906712172924


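According to the output above, TPOT's best pipeline, an ExtraTreesClassifier, reached an internal cross-validation accuracy of about 79.69%. Its accuracy on the held-out test set, which is the accuracy of TPOT's predictions printed above, is about 79.58%.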

## Exporting and running pipeline

Exporting the best algorithm to a Python file for easy access and reproducibility.

In [41]:
tpot.export('tpot_churn_pipeline.py')

I modified the exported file (saved as `tpot_churn_pipeline_new.py`) to include my own file path, and changed the target column from "target" to "Churn".
For reference, here is the code from the file:

In [42]:
from IPython.display import Code
Code('tpot_churn_pipeline_new.py')

Running the file with Jupyter's `%run` magic command.

In [43]:
%run tpot_churn_pipeline_new.py

[0 1 0 ... 0 1 0]


## Testing predictions with new data

In [44]:
new_data = pd.read_csv('new_churn_data.csv', index_col=0)
new_data.head()

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806
6348-TACGU,10,0,0,1,51.15,3440.97,344.097
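
Note that the last column here is named `charge_per_tenure`, while the training data calls it `total_charges_tenure_ratio`. In the rows shown, both equal `TotalCharges / tenure`. Depending on the scikit-learn version, mismatched feature names can trigger a warning or an error at prediction time. Below is a minimal sketch of aligning the names before predicting, assuming the two columns really are the same quantity. The helper `align_columns` is hypothetical, and the tiny frame reuses two rows from the output above.

```python
import pandas as pd

# Feature columns in the order used for training (from prepped_churn_data.csv, minus Churn).
TRAIN_FEATURES = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "total_charges_tenure_ratio",
]

def align_columns(df, train_features=TRAIN_FEATURES):
    """Rename the ratio column and reorder columns to match the training data."""
    renamed = df.rename(columns={"charge_per_tenure": "total_charges_tenure_ratio"})
    missing = [c for c in train_features if c not in renamed.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    return renamed[train_features]

# Two rows copied from the new-data output, for illustration.
sample = pd.DataFrame(
    {
        "tenure": [22, 8],
        "PhoneService": [1, 0],
        "Contract": [0, 1],
        "PaymentMethod": [2, 1],
        "MonthlyCharges": [97.4, 77.3],
        "TotalCharges": [811.7, 1701.95],
        "charge_per_tenure": [36.895455, 212.74375],
    },
    index=pd.Index(["9305-CKSKC", "1452-KNGVK"], name="customerID"),
)
aligned = align_columns(sample)
print(list(aligned.columns))
```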


In [45]:
tpot.predict(new_data)

array([0, 0, 0, 0, 0], dtype=int64)
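
The assignment asks for the probability of churn rather than just the class label. Here is a minimal sketch of such a function, assuming the fitted model exposes scikit-learn's `predict_proba` (ExtraTreesClassifier does). The function name `churn_probabilities` and the small synthetic training set are hypothetical stand-ins, not the notebook's actual model or data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def churn_probabilities(model, df):
    """Return P(Churn == 1) for each row of df as a Series indexed like df."""
    proba = model.predict_proba(df)
    # Look up which column of predict_proba corresponds to class 1.
    churn_col = list(model.classes_).index(1)
    return pd.Series(proba[:, churn_col], index=df.index, name="churn_probability")

# Stand-in model trained on synthetic data (illustration only).
rng = np.random.default_rng(26)
X_demo = pd.DataFrame({
    "tenure": rng.integers(1, 72, 200),
    "MonthlyCharges": rng.uniform(20, 110, 200),
})
y_demo = (X_demo["tenure"] < 12).astype(int)
demo_model = ExtraTreesClassifier(n_estimators=50, random_state=26).fit(X_demo, y_demo)

probs = churn_probabilities(demo_model, X_demo.head())
print(probs)
```

With the notebook's objects, something like `churn_probabilities(tpot.fitted_pipeline_, new_data)` should work in the same way, provided the feature names match those used in training.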

Trying predictions on the same data, this time using the exported Python file. Both approaches give the same answers, so I am assuming this is correct. Please let me know if this is not right!

In [46]:
from tpot_churn_pipeline_new import exported_pipeline
exported_pipeline.predict(new_data)

array([0, 0, 0, 0, 0], dtype=int64)
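
The assignment lists the true values for the new data as [1, 0, 0, 1, 0], so the all-zero predictions can be scored directly. A short sketch in plain Python:

```python
true_values = [1, 0, 0, 1, 0]   # from the assignment text
predicted = [0, 0, 0, 0, 0]     # output of both predict calls above

# Accuracy: share of rows where the prediction matches the true value.
correct = sum(t == p for t, p in zip(true_values, predicted))
accuracy = correct / len(true_values)

# Recall for the churn class: share of actual churners predicted as churn.
true_positives = sum(t == 1 and p == 1 for t, p in zip(true_values, predicted))
recall = true_positives / sum(true_values)

print(f"accuracy = {accuracy:.2f}, churn recall = {recall:.2f}")
```

Three of the five predictions are correct, but neither of the two churners is caught, which accuracy alone hides.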

## Summary

This assignment used TPOT for automated machine learning. After splitting the data into features and targets, and then into train and test sets (as in previous weeks), I fit a TPOTClassifier, which selected an ExtraTreesClassifier as the best pipeline. Pipelines were scored by accuracy: the best one reached an internal cross-validation accuracy of about 79.7% and a test-set accuracy of about 79.6%. The pipeline was then exported as a Python file, which is useful because it lets others rebuild the same model for churn data with the same features. Lastly, I ran predictions on the new data using both `tpot.predict` in the notebook and the exported pipeline file. Both predicted that none of the five customers would churn, whereas the true values given in the assignment are [1, 0, 0, 1, 0], so the two actual churners were missed. Keep in mind that the no-information rate is about 74%, so the best model's accuracy of about 79.6% is only a modest improvement over always predicting the majority class.
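
The no-information rate is the accuracy obtained by always predicting the most common class. Below is a minimal sketch of how it can be computed, using made-up labels with 74% non-churners to mirror the figure quoted above. The helper name `no_information_rate` is hypothetical.

```python
from collections import Counter

def no_information_rate(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Illustrative labels only: 74 non-churners and 26 churners.
labels = [0] * 74 + [1] * 26
nir = no_information_rate(labels)
print(f"no-information rate = {nir:.2f}")

# Gain of a model with 79.6% accuracy over this baseline, in percentage points.
gain_points = (0.796 - nir) * 100
print(f"gain over baseline = {gain_points:.1f} points")
```

In the notebook itself, `targets.value_counts(normalize=True).max()` would give the same quantity for the real churn labels.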