# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
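
The metric choice in the first task can be made concrete with a small worked example. The sketch below computes accuracy, precision, and recall for the positive (churn) class against the true values listed above, `[1, 0, 0, 1, 0]`. The predicted labels are hypothetical and are not the output of any model in this notebook, and `binary_metrics` is an illustrative helper, not a pycaret function.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct non-churn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 1, 0, 0]  # hypothetical predictions, for illustration only
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.6, 'precision': 0.5, 'recall': 0.5}
```

Here accuracy looks acceptable even though half of the churners are missed, which is one reason recall or AUC may be a better selection metric when the classes are imbalanced.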

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
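
The percentile challenge above can be sketched with the standard library alone. This is a minimal sketch, assuming the training-set churn probabilities have already been collected into a sorted list; the values in `train_probs` are hypothetical placeholders, and `percentile_of` is an illustrative name.

```python
from bisect import bisect_right


def percentile_of(prob, train_probs_sorted):
    """Percent of training predictions that are <= prob (0 to 100)."""
    if not train_probs_sorted:
        raise ValueError("need at least one training prediction")
    # bisect_right counts how many sorted values are <= prob
    rank = bisect_right(train_probs_sorted, prob)
    return 100.0 * rank / len(train_probs_sorted)


# Hypothetical training-set churn probabilities (not taken from this notebook)
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.78, 0.91])

print(percentile_of(0.78, train_probs))  # 90.0
```

With real data, the sorted list would come from running the saved model over the training set once and storing the resulting probabilities next to the model file.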

In [32]:
import pandas as pd

df = pd.read_csv('NewChurn.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Total_Charges_ratio,Monthly_Charges_ratio,Average_Monthly_Charges_ratio
7590-VHVEG,1,No,Month-to-month,Electronic check,29.85,29.85,No,29.850000,1.000000,29.850000
5575-GNVDE,34,Yes,One year,Mailed check,56.95,1889.50,No,55.573529,0.030140,55.573529
3668-QPYBK,2,Yes,Month-to-month,Mailed check,53.85,108.15,Yes,54.075000,0.497920,54.075000
7795-CFOCW,45,No,One year,Bank transfer (automatic),42.30,1840.75,No,40.905556,0.022980,40.905556
9237-HQITU,2,Yes,Month-to-month,Electronic check,70.70,151.65,Yes,75.825000,0.466205,75.825000
...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,Yes,One year,Mailed check,84.80,1990.50,No,82.937500,0.042602,82.937500
2234-XADUH,72,Yes,One year,Credit card (automatic),103.20,7362.90,No,102.262500,0.014016,102.262500
4801-JZAZL,11,No,Month-to-month,Electronic check,29.60,346.45,No,31.495455,0.085438,31.495455
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.40,306.60,Yes,76.650000,0.242661,76.650000


In [33]:
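# Import only the pycaret functions this notebook uses
from pycaret.classification import (
    setup, compare_models, predict_model, save_model, load_model
)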


In [34]:
automl = setup(df, target='Churn')

index,Description,Value
0,Session id,5147
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(7043, 10)"
5,Transformed data shape,"(7043, 15)"
6,Transformed train set shape,"(4930, 15)"
7,Transformed test set shape,"(2113, 15)"
8,Ordinal features,1
9,Numeric features,6


In [35]:
best_model = compare_models()

ID,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7961,0.838,0.7961,0.7853,0.7867,0.4351,0.4421,0.153
lda,Linear Discriminant Analysis,0.7953,0.835,0.7953,0.7848,0.7868,0.4364,0.4422,0.014
ridge,Ridge Classifier,0.7933,0.0,0.7933,0.7799,0.7796,0.411,0.423,0.013
lr,Logistic Regression,0.7911,0.8377,0.7911,0.7807,0.783,0.4272,0.4322,0.285
ada,Ada Boost Classifier,0.7911,0.8325,0.7911,0.78,0.7819,0.423,0.4293,0.048
lightgbm,Light Gradient Boosting Machine,0.7886,0.8255,0.7886,0.7787,0.7812,0.4235,0.4278,0.482
rf,Random Forest Classifier,0.7757,0.8044,0.7757,0.7645,0.7676,0.3878,0.3918,0.089
et,Extra Trees Classifier,0.7655,0.7795,0.7655,0.757,0.7602,0.3733,0.375,0.066
knn,K Neighbors Classifier,0.7592,0.7372,0.7592,0.7423,0.7463,0.3256,0.3322,0.016
dummy,Dummy Classifier,0.7347,0.5,0.7347,0.5398,0.6223,0.0,0.0,0.013
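
The Dummy Classifier row is a useful reading aid for the table above. The sketch below is a worked check under two assumptions that this notebook does not state: that the dummy always predicts the majority class, and that the Recall and Prec. columns are support-weighted averages over both classes. The counts are hypothetical, chosen so the majority share matches the table's 0.7347.

```python
def weighted_scores(y_true, y_pred):
    """Accuracy, support-weighted precision/recall, and Cohen's kappa."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = chance = 0.0
    for c in sorted(set(y_true)):
        support = sum(t == c for t in y_true)    # rows truly in class c
        predicted = sum(p == c for p in y_pred)  # rows predicted as class c
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        weight = support / n
        recall += weight * (hits / support)
        precision += weight * (hits / predicted if predicted else 0.0)
        chance += weight * (predicted / n)       # agreement expected by chance
    kappa = (accuracy - chance) / (1 - chance) if chance < 1 else 0.0
    return accuracy, precision, recall, kappa


# Hypothetical counts: 7347 non-churn (0) and 2653 churn (1) out of 10000
y_true = [0] * 7347 + [1] * 2653
y_pred = [0] * 10000  # always predict the majority class

acc, prec, rec, kappa = weighted_scores(y_true, y_pred)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(kappa, 4))
# 0.7347 0.5398 0.7347 0.0
```

Under these assumptions the sketch reproduces the dummy row (accuracy 0.7347, precision 0.5398, Kappa 0.0), and it shows why weighted recall equals accuracy in every row. The practical reading is that the best model's 0.7961 accuracy is about six points above a model that never predicts churn, so Kappa, MCC, or AUC separate the candidates more clearly than accuracy does.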


In [36]:
best_model

In [37]:
# Select the second-to-last row, kept as a one-row DataFrame
df.iloc[-2:-1]

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Total_Charges_ratio,Monthly_Charges_ratio,Average_Monthly_Charges_ratio
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.4,306.6,Yes,76.65,0.242661,76.65


In [38]:
predict_model(best_model, df.iloc[-2:-1])


index,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Total_Charges_ratio,Monthly_Charges_ratio,Average_Monthly_Charges_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.400002,306.600006,76.650002,0.242661,76.650002,Yes,Yes,0.5264
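
The assignment asks for the probability of churn, while `prediction_score` in the output above appears to be the model's confidence in whichever label it predicted. Under that assumption (worth checking against the pycaret version in use), a score attached to a `No` prediction has to be flipped before it can be read as a churn probability. A minimal sketch with plain Python values; the function name `churn_probability` is hypothetical:

```python
def churn_probability(prediction_label, prediction_score, positive_label="Yes"):
    """Convert a (label, score) pair into the probability of churn.

    Assumes the score is the probability of the predicted label.
    """
    if not 0.0 <= prediction_score <= 1.0:
        raise ValueError("prediction_score must lie in [0, 1]")
    if prediction_label == positive_label:
        return prediction_score
    # Score belongs to the negative class, so churn gets the remainder
    return 1.0 - prediction_score


# The row shown above: predicted 'Yes' with score 0.5264
print(churn_probability("Yes", 0.5264))        # 0.5264
# A hypothetical confident 'No' prediction
print(round(churn_probability("No", 0.9), 4))  # 0.1
```

The same rule can be applied column-wise to the DataFrame that `predict_model` returns. Alternatively, `predict_model` accepts `raw_score=True`, which returns a score for each class and makes the flip unnecessary.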


In [39]:
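# Note: the file name says 'XGBoost', but best_model is scikit-learn's
# Gradient Boosting Classifier (the top row of the comparison table above).
save_model(best_model, 'XGBoost')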

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'MonthlyCharges',
                                              'TotalCharges',
                                              'Total_Charges_ratio',
                                              'Monthly_Charges_ratio',
                                              'Average_Monthly_Charges_ratio'],
                                     transformer=SimpleIm...
                                             criterion='friedman_mse', init=None,
                                             learning_rate=0.1, loss='log_loss',
                                             max_depth=3, max_features=None,
                       

In [40]:
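import pickle

# pickle stores only the estimator held in best_model; the save_model call
# above also stores the preprocessing pipeline.
with open('XGBoost.pk', 'wb') as f:
    pickle.dump(best_model, f)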

In [41]:
with open('XGBoost.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [42]:
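# load_model returns the full pipeline (preprocessing + Gradient Boosting Classifier)
loaded_pipeline = load_model('XGBoost')

Transformation Pipeline and Model Successfully Loaded


In [43]:
# Reuse the second-to-last row as stand-in "new" data
new_data = df.iloc[-2:-1]

In [44]:
predict_model(loaded_pipeline, new_data)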

index,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Total_Charges_ratio,Monthly_Charges_ratio,Average_Monthly_Charges_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.400002,306.600006,76.650002,0.242661,76.650002,Yes,Yes,0.5264


In [45]:
from IPython.display import Code

# Pass the path via filename= so the file's contents are displayed
Code(filename='predict_churn.py')

# Summary

Write a short summary of the process and results here.