# DS Automation Assignment

Using our prepared churn data from week 2:
- use PyCaret to find the ML algorithm that performs best on the data
    - Choose the metric you think is best for selecting the model. The default is accuracy, but AUC, precision, recall, and others are also available. The week 3 FTE has some information on these metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, along with the percentile of that prediction within the distribution of predicted probabilities from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare their performance and features with PyCaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
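
The percentile challenge above can be sketched in plain Python. This is only an illustration: `percentile_of_score` is a hypothetical helper name, and the `train_probs` list is made-up data standing in for the churn probabilities predicted on the training set.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right counts how many sorted values are <= prob
    rank = bisect_right(ordered, prob)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.78, 0.91]
print(percentile_of_score(train_probs, 0.78))
```

With real data, `train_probs` would be replaced by the churn probabilities the saved model produces for the training rows.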

In [1]:
# import libraries
import pandas as pd
import os
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model


In [2]:
# load data
df = pd.read_csv('./data/prepared_churn_data.csv', index_col='customerID')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   tenure             7043 non-null   float64
 1   PhoneService       7043 non-null   int64  
 2   Contract           7043 non-null   int64  
 3   PaymentMethod      7043 non-null   int64  
 4   MonthlyCharges     7043 non-null   float64
 5   TotalCharges       7043 non-null   float64
 6   Churn              7043 non-null   int64  
 7   charge_per_tenure  7043 non-null   float64
dtypes: float64(4), int64(4)
memory usage: 495.2+ KB


In [3]:
# Setup pycaret, ignore customerID
automl = setup(df, target='Churn', preprocess=False, ignore_features=['customerID'])

,Description,Value
0,Session id,2812
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 8)"
4,Transformed data shape,"(7043, 8)"
5,Transformed train set shape,"(4930, 8)"
6,Transformed test set shape,"(2113, 8)"
7,Ignore features,1
8,Numeric features,7


In [4]:
# have pycaret compare models
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7961,0.8404,0.5344,0.6391,0.5812,0.4481,0.4517,0.254
gbc,Gradient Boosting Classifier,0.7959,0.8422,0.516,0.6448,0.5718,0.4404,0.4459,0.054
lda,Linear Discriminant Analysis,0.7939,0.8268,0.516,0.6381,0.5695,0.4363,0.4411,0.004
ridge,Ridge Classifier,0.7931,0.8268,0.4548,0.66,0.5372,0.4101,0.4225,0.004
ada,Ada Boost Classifier,0.7925,0.8402,0.5115,0.6372,0.5659,0.432,0.4374,0.018
lightgbm,Light Gradient Boosting Machine,0.7858,0.8329,0.5298,0.6113,0.5669,0.4258,0.4281,0.315
rf,Random Forest Classifier,0.7805,0.8099,0.4985,0.6055,0.5462,0.4034,0.407,0.049
et,Extra Trees Classifier,0.7675,0.7827,0.4916,0.5722,0.5284,0.3754,0.3776,0.032
knn,K Neighbors Classifier,0.7629,0.7433,0.4342,0.5703,0.4919,0.3413,0.3472,0.068
qda,Quadratic Discriminant Analysis,0.7491,0.8242,0.737,0.5196,0.6092,0.4326,0.4472,0.004
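
The table is sorted by accuracy, so the top row is only the best model under that one metric. The sketch below re-ranks the same rows by each metric in plain Python. The numbers are copied from the table above; the `results`, `columns`, and `best_by` names are introduced here for illustration only.

```python
# (model id, Accuracy, AUC, Recall, Prec., F1) copied from the comparison table
results = [
    ("lr", 0.7961, 0.8404, 0.5344, 0.6391, 0.5812),
    ("gbc", 0.7959, 0.8422, 0.516, 0.6448, 0.5718),
    ("lda", 0.7939, 0.8268, 0.516, 0.6381, 0.5695),
    ("ridge", 0.7931, 0.8268, 0.4548, 0.66, 0.5372),
    ("ada", 0.7925, 0.8402, 0.5115, 0.6372, 0.5659),
    ("lightgbm", 0.7858, 0.8329, 0.5298, 0.6113, 0.5669),
    ("rf", 0.7805, 0.8099, 0.4985, 0.6055, 0.5462),
    ("et", 0.7675, 0.7827, 0.4916, 0.5722, 0.5284),
    ("knn", 0.7629, 0.7433, 0.4342, 0.5703, 0.4919),
    ("qda", 0.7491, 0.8242, 0.737, 0.5196, 0.6092),
]
# position of each metric within a result tuple
columns = {"Accuracy": 1, "AUC": 2, "Recall": 3, "Prec.": 4, "F1": 5}


def best_by(metric):
    """Return the id of the model with the highest value for the metric."""
    idx = columns[metric]
    return max(results, key=lambda row: row[idx])[0]


for metric in columns:
    print(metric, best_by(metric))
```

Within PyCaret itself, `compare_models` accepts a `sort` argument (for example `compare_models(sort='AUC')`), which should rank the models by the chosen metric instead of accuracy.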


In [5]:
notebook_dir = os.getcwd()
models_dir = os.path.join(notebook_dir, 'models')
# create the models directory if it does not exist yet
os.makedirs(models_dir, exist_ok=True)

In [6]:
# save the best model under its class name, so that different models can be saved side by side
model_name = best_model.__class__.__name__
# pass the full path instead of changing the working directory; save_model appends '.pkl'
save_model(best_model, os.path.join(models_dir, model_name))


Transformation Pipeline and Model Successfully Saved


In [7]:
# test the pickle file by loading it back from the models directory
loaded_model = load_model(os.path.join(models_dir, model_name))


Transformation Pipeline and Model Successfully Loaded


In [8]:
# test the loaded model by predicting the second-to-last row, with the target column removed
unseen_data = df.drop('Churn', axis=1)
predict_model(loaded_model, data=unseen_data.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
8361-LTMKD,4.0,1,0,2,74.400002,306.600006,76.650002,1,0.5717


In [9]:
# verify the prediction against the true label for the same row
df.iloc[-2:-1]

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
8361-LTMKD,4.0,1,0,2,74.4,306.6,1,76.65


In [10]:
from src.predict_churn import list_models, make_prediction
list_models(models_dir)

['GradientBoostingClassifier.pkl', 'LogisticRegression.pkl']

In [11]:

# load the new data and predict with the saved logistic regression model
new_data = pd.read_csv('./data/new_churn_data.csv', index_col='customerID')
make_prediction(models_dir, 'LogisticRegression', new_data)

Transformation Pipeline and Model Successfully Loaded


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
9305-CKSKC,22,1,0,2,97.400002,811.700012,36.895454,1,0.501
1452-KNGVK,8,0,1,1,77.300003,1701.949951,212.743744,1,0.5833
6723-OKKJM,28,1,0,0,28.25,250.899994,8.960714,0,0.9368
7832-POPKP,62,1,0,2,101.699997,3106.560059,50.105808,0,0.8075
6348-TACGU,10,0,0,1,51.150002,3440.969971,344.096985,1,0.7838
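
The assignment asks for the probability of churn, whereas `prediction_score` appears to be the probability of whichever label was predicted (note the 0.9368 score next to a label of 0). Assuming that reading is correct, the churn probability for rows labeled 0 is one minus the score. The sketch below applies that conversion to the values in the output above; `churn_probability` is a hypothetical helper name.

```python
def churn_probability(label, score):
    """Convert (predicted label, score of that label) into P(churn).

    Assumes score is the probability of the predicted label, not of class 1.
    """
    return score if label == 1 else 1.0 - score


# (customerID, prediction_label, prediction_score) copied from the output above
predictions = [
    ("9305-CKSKC", 1, 0.501),
    ("1452-KNGVK", 1, 0.5833),
    ("6723-OKKJM", 0, 0.9368),
    ("7832-POPKP", 0, 0.8075),
    ("6348-TACGU", 1, 0.7838),
]

churn_probs = {
    customer_id: round(churn_probability(label, score), 4)
    for customer_id, label, score in predictions
}
for customer_id, prob in churn_probs.items():
    print(customer_id, prob)
```

PyCaret's `predict_model` also takes a `raw_score=True` argument, which should return one score column per class and make this conversion unnecessary.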


# Summary

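PyCaret's `setup` was run on the prepared churn data (7,043 rows, 7 features plus the `Churn` target) with preprocessing disabled, which produced a 4,930-row training set and a 2,113-row test set. `compare_models` ranked the candidate models by accuracy, its default metric. Logistic regression came out on top (accuracy 0.7961, AUC 0.8404), narrowly ahead of the gradient boosting classifier (accuracy 0.7959, AUC 0.8422). Quadratic discriminant analysis had the highest recall (0.737) and F1 (0.6092), so choosing a different metric would have selected a different model.

The best model was saved to the `models` directory under its class name, reloaded, and checked on a known row: customer 8361-LTMKD was predicted as churn (label 1, score 0.5717), which matches the true label. The `make_prediction` function from `src/predict_churn.py` then loaded the `LogisticRegression` model and scored `new_churn_data.csv`. The predicted labels were [1, 1, 0, 0, 1] against true values of [1, 0, 0, 1, 0], so 2 of the 5 new rows were classified correctly. Five rows are too few to draw conclusions about the quality of the model.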
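
The predictions for the new data can be compared with the true values in a few lines of plain Python. The sketch below takes the true values from the assignment text and the predicted labels from the `prediction_label` column in the output above, and computes the metrics by hand from the confusion-matrix counts.

```python
true_labels = [1, 0, 0, 1, 0]       # true values given in the assignment text
predicted_labels = [1, 1, 0, 0, 1]  # prediction_label column for the new data

pairs = list(zip(true_labels, predicted_labels))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-churners wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```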