# DS Automation Assignment - Courtney Drysdale

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
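
The percentile challenge above can be sketched with the standard library alone. This is only an illustration: the helper name `percentile_of` and the list of training probabilities are made up for the example, and a real version would use the probabilities the fitted model predicts on the training set.

```python
from bisect import bisect_right


def percentile_of(train_probs, new_prob):
    """Return the percent of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    # bisect_right counts how many sorted values are <= new_prob.
    rank = bisect_right(ordered, new_prob)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.78, 0.90]

print(percentile_of(train_probs, 0.78))
```

With a real model, the same function could be applied to each new prediction so that the probability and its percentile are reported together.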

In [53]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit 

In [54]:
df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df.head()

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,TotalChargesTenureRatio
7590-VHVEG,1,0,0,1,29.85,29.85,0,29.85
5575-GNVDE,34,1,1,0,56.95,1889.5,0,55.573529
3668-QPYBK,2,1,0,0,53.85,108.15,1,54.075
7795-CFOCW,45,0,1,2,42.3,1840.75,0,40.905556
9237-HQITU,2,1,0,1,70.7,151.65,1,75.825


In [55]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)

In [56]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, random_state=42, scoring='accuracy', verbosity=2, n_jobs=-1)

tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7963592542964288

Generation 2 - Current best internal CV score: 0.7963592542964288

Generation 3 - Current best internal CV score: 0.797306761873072

Generation 4 - Current best internal CV score: 0.7993924296518791

Generation 5 - Current best internal CV score: 0.7993924296518791

Best pipeline: XGBClassifier(MultinomialNB(MinMaxScaler(input_matrix), alpha=1.0, fit_prior=True), learning_rate=0.1, max_depth=1, min_child_weight=13, n_estimators=100, n_jobs=1, subsample=0.3, verbosity=0)
0.7849829351535836
CPU times: user 1min 10s, sys: 12.5 s, total: 1min 23s
Wall time: 9min 24s




## Note

I tried several ways to clear this FutureWarning and could not. I updated sklearn, looked for other ways to score the model, and imported other packages, but nothing removed it. I also looked into ignoring a warning in a single cell, but I did not want to affect anything else in the notebook.
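
One way to silence a warning for a single block of code without touching the rest of the notebook is the standard library's `warnings.catch_warnings()` context manager, which restores the previous filters on exit. The sketch below uses a made-up function, `noisy_step`, as a stand-in for the call that raises the warning. It may not help if the warning is raised in worker processes (for example with `n_jobs=-1`), since those do not necessarily share the notebook's warning filters.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for a library call that emits a FutureWarning.
    warnings.warn("this behaviour will change", FutureWarning)
    return 0.78


# Filters set inside the block are discarded when the block exits,
# so other cells keep their normal warning behaviour.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    score = noisy_step()

print(score)
```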

In [57]:
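from sklearn.metrics import accuracy_score

# Predict on the held-out test set with the fitted TPOT pipeline
predictions = tpot.predict(x_test)
print(f'Accuracy of the TPOT predictions: {accuracy_score(y_test, predictions)}')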

Accuracy of the TPOT predictions: 0.7849829351535836


In [58]:
tpot.export('tpot_churn_pipeline5.py')

In [65]:
from IPython.display import Code

# Pass the path as filename= so the file's contents are displayed
Code(filename='tpot_churn_pipeline5_filledin.py')

## Running Python File

This step confused me at first because the script was trying to convert the customerID strings to floats. Adding the index_col parameter to the read_csv call fixed it. The results are printed below, but because the predictions array is large, the output is abbreviated and shows only the first and last three values.

In [66]:
%run tpot_churn_pipeline5_filledin.py

[0 0 1 ... 0 0 0]
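
The abbreviated output above comes from NumPy's default print threshold. A minimal sketch of two ways around it follows, using a made-up `preds` array rather than the real predictions: `np.array2string` with a larger `threshold` prints every element, and `np.unique` with `return_counts=True` gives a compact count per class.

```python
import numpy as np

# Hypothetical predictions array, long enough for NumPy to abbreviate it.
preds = np.zeros(1500, dtype=int)
preds[::4] = 1

# Default formatting abbreviates long arrays with "...".
short_text = np.array2string(preds)

# Raising the threshold to the array size prints every element.
full_text = np.array2string(preds, threshold=preds.size)

# A per-class count is often easier to read than the full array.
values, counts = np.unique(preds, return_counts=True)
summary = {int(v): int(c) for v, c in zip(values, counts)}
print(summary)
```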


# Summary

The prepped data was imported, split into stratified train and test sets, and fit with TPOT. TPOT ran 5 generations with a population size of 50 and selected a stacked pipeline: MinMaxScaler, then MultinomialNB, with XGBClassifier as the final estimator. The pipeline was exported to a Python file, which I updated with the correct file path and target column name and then ran again on the prepped data. The test-set accuracy was about 78.5%, which is better than the "no information" rate we looked at earlier in the data analysis. The best internal 5-fold cross-validation score on the training set was about 80% for the selected pipeline. Overall, the performance is about the same as the other models we have tried in recent weeks.
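
For reference, the "no information" rate is the accuracy of always predicting the most common class. The sketch below shows the calculation on made-up labels; the helper name `no_information_rate` and the label list are illustrative, not taken from the churn data.

```python
from collections import Counter


def no_information_rate(labels):
    """Accuracy obtained by always predicting the majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels, for illustration only (not the churn data).
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

baseline = no_information_rate(labels)
print(f"No-information rate: {baseline:.2f}")
```

A model's test accuracy is only informative to the extent that it exceeds this baseline on the same labels.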