# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [133]:
# Check that the week 2 churn data (prepped_churn_data_2.csv) has the same features as the professor's new churn data (new_churn_data.csv)

In [134]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [135]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Churn Data/prepped_churn_data_2.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_MonthlyCharges_ratio,charges_per_month
7590-VHVEG,1,0,0,0,29.85,29.85,0,0.033501,29.850000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,0.597015,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,0.037140,54.075000
7795-CFOCW,45,0,1,2,42.30,1840.75,0,1.063830,40.905556
9237-HQITU,2,1,0,0,70.70,151.65,1,0.028289,75.825000
...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,0.283019,82.937500
2234-XADUH,72,1,1,3,103.20,7362.90,0,0.697674,102.262500
4801-JZAZL,11,0,0,0,29.60,346.45,0,0.371622,31.495455
8361-LTMKD,4,1,0,1,74.40,306.60,1,0.053763,76.650000


# AutoML with pycaret

In [136]:
import subprocess
import sys
def install_pycaret():
    try:
        import pycaret
    except ImportError:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pycaret'])

install_pycaret()

In [137]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [138]:
automl = setup(df, target = 'Churn')

,Description,Value
0,Session id,2299
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 9)"
4,Transformed data shape,"(7032, 9)"
5,Transformed train set shape,"(4922, 9)"
6,Transformed test set shape,"(2110, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


In [139]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7911,0.8331,0.5122,0.6315,0.5653,0.43,0.4342,0.363
ridge,Ridge Classifier,0.7903,0.823,0.4472,0.654,0.5311,0.4022,0.4143,0.051
lda,Linear Discriminant Analysis,0.7901,0.823,0.5069,0.6307,0.5617,0.426,0.4306,0.029
ada,Ada Boost Classifier,0.7877,0.8283,0.4908,0.6293,0.5507,0.4147,0.4206,0.239
gbc,Gradient Boosting Classifier,0.7861,0.834,0.4709,0.6318,0.5389,0.4036,0.4113,0.906
lightgbm,Light Gradient Boosting Machine,0.7783,0.82,0.4939,0.6006,0.5417,0.3974,0.4009,0.563
xgboost,Extreme Gradient Boosting,0.7706,0.8083,0.487,0.5836,0.5306,0.3805,0.3834,0.135
rf,Random Forest Classifier,0.7694,0.7998,0.4679,0.5828,0.5184,0.3694,0.3736,0.639
knn,K Neighbors Classifier,0.7649,0.746,0.4549,0.5737,0.5068,0.3553,0.3598,0.065
et,Extra Trees Classifier,0.7627,0.7804,0.4847,0.5629,0.5201,0.3639,0.3661,0.407



In [140]:
best_model

Ranked by accuracy, the best model is Logistic Regression (0.7911), closely followed by the Ridge Classifier (0.7903) and Linear Discriminant Analysis (0.7901).
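Which model counts as "best" depends on the metric. As a small illustration, the sketch below ranks the top five models using only the cross-validated scores reported in the table above (copied by hand, nothing is re-trained). The names `scores` and `best_by` are made up for this example.

```python
# Cross-validated scores copied from the compare_models() table above
# (top five rows only).
scores = {
    "lr":    {"Accuracy": 0.7911, "AUC": 0.8331, "Recall": 0.5122, "Prec.": 0.6315, "F1": 0.5653},
    "ridge": {"Accuracy": 0.7903, "AUC": 0.8230, "Recall": 0.4472, "Prec.": 0.6540, "F1": 0.5311},
    "lda":   {"Accuracy": 0.7901, "AUC": 0.8230, "Recall": 0.5069, "Prec.": 0.6307, "F1": 0.5617},
    "ada":   {"Accuracy": 0.7877, "AUC": 0.8283, "Recall": 0.4908, "Prec.": 0.6293, "F1": 0.5507},
    "gbc":   {"Accuracy": 0.7861, "AUC": 0.8340, "Recall": 0.4709, "Prec.": 0.6318, "F1": 0.5389},
}


def best_by(metric):
    """Return the model ID with the highest score for the given metric."""
    return max(scores, key=lambda model_id: scores[model_id][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(f"{metric:>8}: {best_by(metric)}")
```

On these numbers, gradient boosting has the highest AUC (0.8340 vs. 0.8331 for logistic regression) and the ridge classifier the highest precision, while logistic regression leads on accuracy, recall, and F1. In PyCaret, `compare_models` accepts a `sort` argument (for example `sort='AUC'`) to rank by a metric other than accuracy.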

In [141]:
df.iloc[-2:-1].shape

(1, 9)

In [142]:
predict_model(best_model, df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,tenure_MonthlyCharges_ratio,charges_per_month,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,1,74.400002,306.600006,0.053763,76.650002,1,1,0.5671


# Saving and loading our model

In [143]:
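# Note: the file is named 'LDA', but the model saved here is best_model,
# which compare_models() returned as Logistic Regression.
save_model(best_model, 'LDA')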

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'tenure_MonthlyCharges_ratio',
                                              'charges_per_month'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_val...
                                                               fill_value=None,
            

In [144]:
import pickle

with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [145]:
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [146]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1], dtype=int8)

In [147]:
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [148]:
predict_model(loaded_lda, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,tenure_MonthlyCharges_ratio,charges_per_month,prediction_label,prediction_score
8361-LTMKD,4,1,0,1,74.400002,306.600006,0.053763,76.650002,1,0.5671
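The assignment asks for the probability of churn, while the output above gives a label and a score. A minimal sketch of the conversion follows, assuming that `prediction_score` is the probability of the predicted label and that label 1 means churn (worth confirming in the PyCaret documentation for the installed version). The helper name `churn_probability` is hypothetical.

```python
def churn_probability(prediction_label, prediction_score):
    """Convert a (label, score) pair into P(churn).

    Assumption: prediction_score is the probability of the predicted
    label, and label 1 means the customer churns.
    """
    if prediction_label == 1:
        return prediction_score
    return 1.0 - prediction_score


# Row shown above: label 1 with score 0.5671.
print(churn_probability(1, 0.5671))

# Hypothetical non-churn prediction with score 0.80.
print(round(churn_probability(0, 0.80), 4))
```

PyCaret's `predict_model` also has a `raw_score` option that returns a score per class, which may avoid the conversion entirely.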


# Making a Python module to make predictions

In [149]:
from IPython.display import Code

Code('/content/drive/MyDrive/Colab Notebooks/Week 5/predict_churn.py')

In [150]:
%run predict_churn.py

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


,Description,Value
0,Session id,3784
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 9)"
4,Transformed data shape,"(7032, 9)"
5,Transformed train set shape,"(4922, 9)"
6,Transformed test set shape,"(2110, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7899,0.833,0.5107,0.6295,0.5637,0.4274,0.4315,0.286
ada,Ada Boost Classifier,0.7897,0.8318,0.4793,0.643,0.5481,0.4149,0.4232,0.257
gbc,Gradient Boosting Classifier,0.7891,0.8359,0.4847,0.6356,0.5498,0.4154,0.422,0.874
ridge,Ridge Classifier,0.7881,0.823,0.4388,0.6504,0.5238,0.3941,0.4069,0.028
lda,Linear Discriminant Analysis,0.7879,0.8231,0.4962,0.6279,0.554,0.4175,0.4226,0.031
lightgbm,Light Gradient Boosting Machine,0.7816,0.8254,0.4985,0.6097,0.5481,0.4061,0.4099,0.37
rf,Random Forest Classifier,0.7779,0.8072,0.4748,0.6049,0.5313,0.3888,0.3941,0.615
xgboost,Extreme Gradient Boosting,0.7749,0.8147,0.4878,0.5954,0.5356,0.389,0.3928,0.209
et,Extra Trees Classifier,0.7686,0.7917,0.4779,0.5787,0.5225,0.3719,0.3754,0.401
knn,K Neighbors Classifier,0.7647,0.7395,0.4266,0.5794,0.4909,0.3425,0.3496,0.099



,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Loaded


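In the following code, I add the two engineered features that are missing from new_churn_data.csv (tenure_MonthlyCharges_ratio and charges_per_month) so that its columns match prepped_churn_data_2.csv from week 2. Both are set to 0 as placeholders so that the model can run on the new data and its predictions can be compared with the true values.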

In [160]:
import pandas as pd
from pycaret.classification import load_model, predict_model

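def predict_churn(data_filepath):
    """Load the saved PyCaret pipeline and print churn predictions for a CSV file."""
    # 'LDA' is the file name passed to save_model(); it holds the best model found above.
    model = load_model('LDA')
    new_data = pd.read_csv(data_filepath)
    # Add the two engineered columns the pipeline expects, filled with 0 as placeholders.
    new_data['tenure_MonthlyCharges_ratio'] = 0
    new_data['charges_per_month'] = 0
    predictions = predict_model(model, data=new_data)
    predictions = predictions.rename(columns={'prediction_label': 'churn_prediction'})
    print(predictions['churn_prediction'])


if __name__ == "__main__":
    data_filepath = 'new_churn_data.csv'
    predict_churn(data_filepath)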

Transformation Pipeline and Model Successfully Loaded


0    0
1    0
2    0
3    0
4    1
Name: churn_prediction, dtype: int64
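Filling the two engineered columns with 0 gives the model inputs unlike those it was trained on. A hedged alternative is to compute them from the raw columns. The sketch below assumes the week 2 formulas were `tenure / MonthlyCharges` and `TotalCharges / tenure`; this is inferred from the training rows displayed at the top of the notebook (for example 1 / 29.85 ≈ 0.033501), not confirmed from the week 2 code. The function name `add_derived_features` is hypothetical.

```python
import pandas as pd


def add_derived_features(df):
    """Return a copy of df with the two ratio features added.

    Assumed formulas (inferred from the displayed training rows):
      tenure_MonthlyCharges_ratio = tenure / MonthlyCharges
      charges_per_month           = TotalCharges / tenure
    """
    out = df.copy()
    out["tenure_MonthlyCharges_ratio"] = out["tenure"] / out["MonthlyCharges"]
    out["charges_per_month"] = out["TotalCharges"] / out["tenure"]
    return out


# Two training rows copied from the table at the top of the notebook.
sample = pd.DataFrame(
    {
        "tenure": [1, 34],
        "MonthlyCharges": [29.85, 56.95],
        "TotalCharges": [29.85, 1889.50],
    },
    index=pd.Index(["7590-VHVEG", "5575-GNVDE"], name="customerID"),
)
print(add_derived_features(sample).round(6))
```

Rows with a tenure of 0 would produce infinite or missing values in `charges_per_month` and would need separate handling before prediction.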


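The true values are (1, 0, 0, 1, 0), so the model gets only two of the five rows right: there are two false negatives (rows 0 and 3) and one false positive (row 4). The script runs end to end, but the predictions leave clear room for improvement.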
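A quick way to tally the errors is to compare the two lists element by element. The sketch below uses the true values from the assignment and the predictions printed above; the variable names are chosen for this example only.

```python
y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [0, 0, 0, 0, 1]  # predictions printed above

pairs = list(zip(y_true, y_pred))

# Count each cell of the confusion matrix, treating 1 (churn) as positive.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy:.2f}")
# prints: TP=0 TN=2 FP=1 FN=2 accuracy=0.40
```

With only five rows, these counts are a sanity check on the script rather than a reliable estimate of model quality.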

# Summary

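For this assignment, I used PyCaret to automatically compare classifiers for predicting customer churn. Ranked by the default metric, accuracy, Logistic Regression came out best (cross-validated accuracy 0.7911, AUC 0.8331).

I then saved this model and created a Python script, predict_churn.py, that loads the model and makes predictions on new customer data.

I tested the script with the provided new_churn_data.csv file and compared the predictions to the actual churn values. Two of the five predictions were correct. Five rows are too few to judge the model, but the result suggests that filling the two engineered features with 0 may be hurting the predictions on new data.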