# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [4]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [2]:
df = pd.read_csv('prepped_churn_data.csv', index_col='Customer_ID')
df

Unnamed: 0_level_0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totalCharges_tenure_ratio
Customer_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,1,0,0,0,29.85,29.85,0,29.850000
1,34,1,1,1,56.95,1889.50,0,55.573529
2,2,1,0,1,53.85,108.15,1,54.075000
3,45,0,1,2,42.30,1840.75,0,40.905556
4,2,1,0,0,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
7038,24,1,1,1,84.80,1990.50,0,82.937500
7039,72,1,1,3,103.20,7362.90,0,102.262500
7040,11,0,0,0,29.60,346.45,0,31.495455
7041,4,1,0,1,74.40,306.60,1,76.650000


In [12]:
automl = setup(df, target='Churn', silent=True, fold_shuffle=True, imputation_type='iterative')

Unnamed: 0,Description,Value
0,session_id,1433
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


AttributeError: 'Make_Time_Features' object has no attribute 'list_of_features'

Seems like this is an error that a lot of people are getting where pycaret and sklearn are reacting to each other. Could not find a solution digging through multiple stack overflow conversations. Still functioning, just have error message in notebook

In [38]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7976,0.8401,0.4717,0.6486,0.5451,0.4194,0.4287,0.115
lr,Logistic Regression,0.7964,0.8353,0.5039,0.632,0.5592,0.4295,0.4349,0.015
ada,Ada Boost Classifier,0.795,0.838,0.4827,0.6353,0.5469,0.4181,0.4255,0.051
lda,Linear Discriminant Analysis,0.795,0.8262,0.5244,0.6225,0.5683,0.4354,0.4387,0.006
ridge,Ridge Classifier,0.7942,0.0,0.4496,0.6462,0.5289,0.4029,0.4145,0.006
catboost,CatBoost Classifier,0.7911,0.8365,0.4622,0.6299,0.5323,0.4022,0.4106,0.847
lightgbm,Light Gradient Boosting Machine,0.7893,0.8272,0.4835,0.6173,0.5417,0.4077,0.4131,0.121
xgboost,Extreme Gradient Boosting,0.7848,0.8187,0.485,0.6039,0.5372,0.3994,0.4039,0.128
rf,Random Forest Classifier,0.7727,0.7946,0.4638,0.5734,0.5115,0.3659,0.3701,0.115
knn,K Neighbors Classifier,0.7722,0.7452,0.4394,0.5788,0.4979,0.3546,0.3611,0.011


In [34]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0_level_0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totalCharges_tenure_ratio,Label,Score
Customer_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
7041,4,1,0,1,74.4,306.6,1,76.65,1,0.5246


In [None]:
save_model(best_model, 'LR')

Model was exported in Google Colab due to errors and no workaround within Jupyter Lab. Please see LR Pickle.ipynb file for file export

In [25]:
import pickle

with open('LR.pkl', 'wb') as f:
    pickle.dump(best_model, f)

In [26]:
with open('LR.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [36]:
new_data = pd.read_csv('new_churn_data.csv')
new_data['totalCharges_tenure_ratio'] = df['TotalCharges']/df['tenure']
loaded_LR = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [37]:
predict_model(loaded_LR, new_data)

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,totalCharges_tenure_ratio,Label,Score
0,9305-CKSKC,22,1,0,2,97.4,811.7,36.895455,29.85,0,0.9117
1,1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375,55.573529,0,0.9086
2,6723-OKKJM,28,1,0,0,28.25,250.9,8.960714,54.075,0,0.8458
3,7832-POPKP,62,1,0,2,101.7,3106.56,50.105806,40.905556,0,0.9078
4,6348-TACGU,10,0,0,1,51.15,3440.97,344.097,75.825,1,0.5829


# Summary

This ended up being a complicated scenario, because it seems like there are compatibility issues between pycaret and other libraries. I ended up having to use pycaret to create the model in Google Colab as Jupyter Lab had multiple errors that I found were common place on forums but without a solution. Someone recommended using Google Colab which did eventually work to export the pickle file and bring into Jupyter Lab. In the end, this data set still doesn't seem to have strong correlation to be reliable using a predictive ml model. The highest accuracy model is .79 so when using the model on the new data that was imported,the true values were [1, 0, 0, 1, 0] but the model predicted [0, 0, 0, 0, 1]. Only 40% accurate when it came to the new dataset. I would be interested to see if this is what everyone in the class is getting or if I have messed something up on my end.