# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - choose the metric you think is best for selecting the model; the default is accuracy, but AUC, precision, recall, etc. are also options. The week 3 FTE has some information on these metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare their performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
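
The percentile challenge in the list above can be approached along these lines. This is a minimal sketch, assuming the churn probabilities predicted for the training data have already been collected into a plain list; the function name `percentile_of_score` and the sample values are hypothetical and are not taken from the churn data.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    # bisect_right gives the count of sorted values that are <= new_prob
    return 100.0 * bisect_right(ordered, new_prob) / len(ordered)


# Made-up training probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.55, 0.60, 0.70, 0.90]
print(percentile_of_score(train_probs, 0.78))  # 90.0
```

Sorting once and reusing the sorted list would be cheaper when many new rows are scored; the sort is kept inside the function here only to keep the sketch short.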

In [1]:
import pandas as pd

df = pd.read_csv('data/prepped_churn_data_unmodified.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,2,29.85,29.85,0
5575-GNVDE,34,1,1,3,56.95,1889.50,0
3668-QPYBK,2,1,0,3,53.85,108.15,1
7795-CFOCW,45,0,1,0,42.30,1840.75,0
9237-HQITU,2,1,0,2,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,3,84.80,1990.50,0
2234-XADUH,72,1,1,1,103.20,7362.90,0
4801-JZAZL,11,0,0,2,29.60,346.45,0
8361-LTMKD,4,1,0,3,74.40,306.60,1


In [None]:
conda update -n base conda

In [None]:
conda install -c conda-forge pycaret -y

In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [10]:
automl = setup(df, target='Churn')

,Description,Value
0,session_id,660
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 7)"
5,Missing Values,0
6,Numeric Features,3
7,Categorical Features,3
8,Ordinal Features,0
9,High Cardinality Features,0


In [22]:
automl[3]

customerID,tenure,MonthlyCharges,TotalCharges,PhoneService_1,Contract_0,Contract_1,Contract_2,PaymentMethod_0,PaymentMethod_1,PaymentMethod_2,PaymentMethod_3
4018-KJYUY,22.0,20.150000,432.500000,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
0771-WLCLA,16.0,112.949997,1882.550049,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
0596-BQCEQ,62.0,100.150002,6283.299805,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
7503-ZGUZJ,1.0,84.650002,84.650002,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3932-IJWDZ,45.0,103.650002,4747.850098,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
7032-LMBHI,37.0,64.650002,2347.850098,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
3926-YZVVX,41.0,50.049999,2029.050049,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4324-AHJKS,5.0,55.799999,300.399994,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
8984-EYLLL,64.0,105.250000,6823.399902,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [24]:
automl = setup(df, target='Churn', preprocess=False, numeric_features=['Contract','PhoneService','PaymentMethod'])

,Description,Value
0,session_id,6712
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 7)"
5,Missing Values,0
6,Numeric Features,6
7,Categorical Features,0
8,Transformed Train Set,"(4922, 6)"
9,Transformed Test Set,"(2110, 6)"


In [25]:
automl[3]

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges
3466-WAESX,16.0,1.0,0.0,2.0,69.099998,1083.699951
5600-PDUJF,6.0,1.0,0.0,1.0,49.500000,312.700012
3251-YMVWZ,53.0,1.0,1.0,0.0,24.049999,1301.900024
6888-SBYAI,1.0,1.0,0.0,3.0,50.700001,50.700001
2378-YIZKA,68.0,1.0,2.0,1.0,85.000000,5607.750000
...,...,...,...,...,...,...
4393-RYCRE,44.0,1.0,1.0,2.0,106.050003,4510.799805
3955-JBZZM,20.0,1.0,0.0,2.0,78.800003,1641.300049
4835-YSJMR,39.0,1.0,2.0,0.0,49.799999,1971.150024
5214-CHIWJ,27.0,1.0,1.0,3.0,20.299999,595.049988


In [44]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
catboost,CatBoost Classifier,0.7962,0.8404,0.5103,0.6498,0.5707,0.4398,0.4459,0.369
gbc,Gradient Boosting Classifier,0.7958,0.839,0.5027,0.6514,0.5667,0.4361,0.4429,0.045
ada,Ada Boost Classifier,0.7924,0.8374,0.4866,0.6465,0.5547,0.4229,0.4305,0.021
ridge,Ridge Classifier,0.7903,0.0,0.4484,0.6588,0.5322,0.4034,0.4166,0.004
lr,Logistic Regression,0.7897,0.8339,0.5195,0.6291,0.568,0.4308,0.435,0.008
lightgbm,Light Gradient Boosting Machine,0.7887,0.8271,0.5088,0.6263,0.5607,0.4238,0.4282,2.043
lda,Linear Discriminant Analysis,0.7844,0.8197,0.495,0.622,0.5499,0.4107,0.4163,0.004
xgboost,Extreme Gradient Boosting,0.7836,0.8192,0.5057,0.6131,0.5535,0.4127,0.4164,0.834
rf,Random Forest Classifier,0.7818,0.8169,0.4905,0.613,0.5435,0.4028,0.4079,0.062
et,Extra Trees Classifier,0.7733,0.7992,0.5012,0.5878,0.5405,0.3913,0.3938,0.056
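
The leaderboard above is sorted by accuracy, the default. Ranking the same rows by a different metric can change the winner, which is why the assignment asks for a deliberate metric choice. The sketch below re-ranks a few rows copied from the table using plain Python; the helper name `best_by` is hypothetical. In pycaret itself, `compare_models` takes a `sort` argument (for example `sort='AUC'`) for the same purpose.

```python
# A subset of rows copied from the compare_models() output above
leaderboard = {
    'catboost': {'Accuracy': 0.7962, 'AUC': 0.8404, 'Recall': 0.5103, 'Prec.': 0.6498, 'F1': 0.5707},
    'gbc':      {'Accuracy': 0.7958, 'AUC': 0.8390, 'Recall': 0.5027, 'Prec.': 0.6514, 'F1': 0.5667},
    'ridge':    {'Accuracy': 0.7903, 'AUC': 0.0,    'Recall': 0.4484, 'Prec.': 0.6588, 'F1': 0.5322},
    'lr':       {'Accuracy': 0.7897, 'AUC': 0.8339, 'Recall': 0.5195, 'Prec.': 0.6291, 'F1': 0.5680},
}


def best_by(metric):
    """Return the model id with the highest value for the given metric."""
    return max(leaderboard, key=lambda name: leaderboard[name][metric])


for metric in ('Accuracy', 'AUC', 'Recall', 'Prec.', 'F1'):
    print(metric, '->', best_by(metric))
```

On these rows, CatBoost leads on accuracy, AUC, and F1, while logistic regression has the highest recall and the ridge classifier the highest precision. Ridge's AUC of 0.0 most likely reflects that the estimator does not output probabilities, so it should not be read as a real score.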


In [45]:
best_model

<catboost.core.CatBoostClassifier at 0x7fda3a47bc40>

In [46]:
df.iloc[-2:-1].shape

(1, 7)

In [48]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4,1,0,3,74.4,306.6,1,0,0.5163
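
In the output above, `Label` is 0 while `Score` is 0.5163, which suggests that `Score` is the probability of the predicted label rather than the probability of churn. Under that assumption (it matches the output shown here, but is worth verifying for other pycaret versions), the churn probability that the assignment asks for could be derived as below. `churn_probability` is a hypothetical helper name.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into P(Churn = 1).

    Assumes Score is the probability of the predicted Label,
    so a 'no churn' prediction needs to be flipped.
    """
    return score if int(label) == 1 else 1.0 - score


# The row shown above: Label 0 with Score 0.5163
print(round(churn_probability(0, 0.5163), 4))  # 0.4837
```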


In [51]:
save_model(best_model, 'CATBOOST')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['Contract',
                                                           'PhoneService',
                                                           'PaymentMethod'],
                                       target='Churn', time_features=[])),
                 ['trained_model',
                  <catboost.core.CatBoostClassifier object at 0x7fda3a47bc40>]],
          verbose=False),
 'CATBOOST.pkl')

In [52]:
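import pickle

# Alternative to save_model: pickle only the fitted estimator,
# without the pycaret preprocessing pipeline saved above
with open('CATBOOST_model.pk', 'wb') as f:
    pickle.dump(best_model, f)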

In [53]:
with open('CATBOOST_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [54]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([0])

In [55]:
loaded_pipeline = load_model('CATBOOST')

Transformation Pipeline and Model Successfully Loaded


In [56]:
predict_model(loaded_pipeline, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4,1,0,3,74.4,306.6,0,0.5163


In [57]:
pip install ipython

Note: you may need to restart the kernel to use updated packages.


In [58]:
from IPython.display import Code

Code('predict_churn.py')

In [62]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU    No Churn
Name: Churn_prediction, dtype: object
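
The assignment gives the true values for the new data as [1, 0, 0, 1, 0]. Assuming the five rows printed above are in the same order as those true values, the predictions can be scored by hand. This is a small sketch with hypothetical variable names and is not part of `predict_churn.py`.

```python
true_values = [1, 0, 0, 1, 0]  # from the assignment text
predicted_labels = ['Churn', 'No Churn', 'No Churn', 'No Churn', 'No Churn']
predicted = [1 if label == 'Churn' else 0 for label in predicted_labels]

pairs = list(zip(true_values, predicted))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')
```

With only five rows these numbers are anecdotal, but under the ordering assumption they show one missed churner (the fourth customer) and no false alarms.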


# Summary

Write a short summary of the process and results here.