# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
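
To make the metric choice above more concrete, here is a minimal sketch that computes accuracy, precision, and recall by hand for the five true labels listed above. The predicted labels are invented purely for illustration (they are not output from any model in this notebook), and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct non-churn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 0, 0, 0]  # made-up predictions, for illustration only
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these made-up predictions, accuracy is 0.8 even though only half of the churners are caught (recall 0.5), which is one reason accuracy alone may not be the best metric for churn.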

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
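
For the percentile challenge above, one possible approach is to sort the training-set probabilities once and count how many fall at or below a new prediction. This is only a sketch: `percentile_of_probability` is a hypothetical helper name, and the training probabilities below are made-up values, not results from this notebook.

```python
from bisect import bisect_right


def percentile_of_probability(train_probs, prob):
    """Percentage of training-set probabilities that are <= prob."""
    if not train_probs:
        raise ValueError("train_probs must not be empty")
    ordered = sorted(train_probs)
    # bisect_right counts how many sorted values are <= prob
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Made-up training probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
pct = percentile_of_probability(train_probs, 0.78)
print(pct)
```

In a real script, `train_probs` would be the churn probabilities predicted for the training rows.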

In [8]:
import pandas as pd

df = pd.read_csv('prepped_churn_data2.csv', index_col='customerID')
print(df)

            tenure  PhoneService  Contract              PaymentMethod  \
customerID                                                              
7590-VHVEG       1             0         0           Electronic check   
5575-GNVDE      34             1         1               Mailed check   
3668-QPYBK       2             1         0               Mailed check   
7795-CFOCW      45             0         1  Bank transfer (automatic)   
9237-HQITU       2             1         0           Electronic check   
...            ...           ...       ...                        ...   
6840-RESVB      24             1         1               Mailed check   
2234-XADUH      72             1         1    Credit card (automatic)   
4801-JZAZL      11             0         0           Electronic check   
8361-LTMKD       4             1         0               Mailed check   
3186-AJIEK      66             1         2  Bank transfer (automatic)   

            MonthlyCharges  TotalCharges  Churn  t

In [5]:
%pip install pycaret




In [6]:
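# Import only the PyCaret functions used in this notebook
from pycaret.classification import (
    setup,
    compare_models,
    predict_model,
    save_model,
    load_model,
)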

In [4]:
import sys
print(sys.version)

3.10.14 | packaged by Anaconda, Inc. | (main, May  6 2024, 19:44:50) [MSC v.1916 64 bit (AMD64)]


In [9]:
automl = setup(df, target='Churn')

|   | Description | Value |
|---|---|---|
| 0 | Session id | 687 |
| 1 | Target | Churn |
| 2 | Target type | Binary |
| 3 | Original data shape | (7032, 8) |
| 4 | Transformed data shape | (7032, 11) |
| 5 | Transformed train set shape | (4922, 11) |
| 6 | Transformed test set shape | (2110, 11) |
| 7 | Numeric features | 6 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |


In [10]:
best_model = compare_models()

|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.7905 | 0.8375 | 0.5137 | 0.6279 | 0.5634 | 0.428 | 0.4326 | 1.027 |
| lda | Linear Discriminant Analysis | 0.7877 | 0.8241 | 0.4961 | 0.626 | 0.552 | 0.4159 | 0.4214 | 0.052 |
| gbc | Gradient Boosting Classifier | 0.7875 | 0.8352 | 0.4885 | 0.6272 | 0.5469 | 0.4118 | 0.4183 | 0.401 |
| ada | Ada Boost Classifier | 0.7863 | 0.8339 | 0.5023 | 0.6206 | 0.5532 | 0.4154 | 0.4204 | 0.155 |
| ridge | Ridge Classifier | 0.7857 | 0.8241 | 0.4296 | 0.6433 | 0.5132 | 0.3836 | 0.3972 | 0.044 |
| lightgbm | Light Gradient Boosting Machine | 0.7846 | 0.8249 | 0.4992 | 0.6153 | 0.5494 | 0.4106 | 0.4153 | 0.231 |
| knn | K Neighbors Classifier | 0.7649 | 0.7429 | 0.4449 | 0.574 | 0.5005 | 0.3504 | 0.3555 | 0.067 |
| rf | Random Forest Classifier | 0.7625 | 0.8016 | 0.4586 | 0.5637 | 0.5052 | 0.3514 | 0.3549 | 0.299 |
| et | Extra Trees Classifier | 0.7597 | 0.7796 | 0.4754 | 0.5538 | 0.5106 | 0.353 | 0.3552 | 0.359 |
| nb | Naive Bayes | 0.7515 | 0.8136 | 0.7125 | 0.5243 | 0.6031 | 0.4285 | 0.4402 | 0.047 |
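
The comparison above is sorted by accuracy, so a different metric can pick a different winner. As a rough illustration, the sketch below copies five metric columns from that table into a plain dictionary and looks up the top model for each metric. The `results` dictionary and `best_by` helper are hypothetical names for this example only. PyCaret's `compare_models` also accepts a `sort` argument (for example `sort='AUC'`) to do the ranking directly.

```python
# Scores copied from the compare_models() output above
results = {
    "lr":       {"Accuracy": 0.7905, "AUC": 0.8375, "Recall": 0.5137, "Prec.": 0.6279, "F1": 0.5634},
    "lda":      {"Accuracy": 0.7877, "AUC": 0.8241, "Recall": 0.4961, "Prec.": 0.6260, "F1": 0.5520},
    "gbc":      {"Accuracy": 0.7875, "AUC": 0.8352, "Recall": 0.4885, "Prec.": 0.6272, "F1": 0.5469},
    "ada":      {"Accuracy": 0.7863, "AUC": 0.8339, "Recall": 0.5023, "Prec.": 0.6206, "F1": 0.5532},
    "ridge":    {"Accuracy": 0.7857, "AUC": 0.8241, "Recall": 0.4296, "Prec.": 0.6433, "F1": 0.5132},
    "lightgbm": {"Accuracy": 0.7846, "AUC": 0.8249, "Recall": 0.4992, "Prec.": 0.6153, "F1": 0.5494},
    "knn":      {"Accuracy": 0.7649, "AUC": 0.7429, "Recall": 0.4449, "Prec.": 0.5740, "F1": 0.5005},
    "rf":       {"Accuracy": 0.7625, "AUC": 0.8016, "Recall": 0.4586, "Prec.": 0.5637, "F1": 0.5052},
    "et":       {"Accuracy": 0.7597, "AUC": 0.7796, "Recall": 0.4754, "Prec.": 0.5538, "F1": 0.5106},
    "nb":       {"Accuracy": 0.7515, "AUC": 0.8136, "Recall": 0.7125, "Prec.": 0.5243, "F1": 0.6031},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda model_id: results[model_id][metric])


winners = {metric: best_by(metric) for metric in ["Accuracy", "AUC", "Recall", "Prec.", "F1"]}
print(winners)
```

Logistic regression leads on accuracy and AUC, but Naive Bayes has the highest recall and F1, and the ridge classifier has the highest precision.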


In [11]:
best_model

In [12]:
# Select only the second-to-last row (the slice stops before the last row)
df.iloc[-2:-1]

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn | total_tenure_ratio |
|---|---|---|---|---|---|---|---|---|
| 8361-LTMKD | 4 | 1 | 0 | Mailed check | 74.4 | 306.6 | 1 | 76.65 |


In [13]:
# Score the single row selected above; the metrics table is computed from that one row only
predict_model(best_model, df.iloc[-2:-1])

|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | total_tenure_ratio | Churn | prediction_label | prediction_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 8361-LTMKD | 4 | 1 | 0 | Mailed check | 74.400002 | 306.600006 | 76.650002 | 1 | 0 | 0.5167 |


In [14]:
# Note: 'XGBoost' is only the file name; the saved model is the logistic regression found above
save_model(best_model, 'XGBoost')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'MonthlyCharges',
                                              'TotalCharges',
                                              'total_tenure_ratio'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_impu...
                                                          

In [15]:
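import pickle

# Also pickle the estimator object on its own; save_model() above stores
# the transformation pipeline together with the model
with open('XGBoost.pk', 'wb') as f:
    pickle.dump(best_model, f)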

In [16]:
# Read the pickled estimator back to confirm the file loads
with open('XGBoost.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [17]:
# Despite the variable name, this loads the saved logistic regression pipeline
loaded_lda = load_model('XGBoost')

Transformation Pipeline and Model Successfully Loaded


In [18]:
new_data = df.iloc[-2:-1]

In [19]:
predict_model(loaded_lda, new_data)

|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | total_tenure_ratio | Churn | prediction_label | prediction_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 8361-LTMKD | 4 | 1 | 0 | Mailed check | 74.400002 | 306.600006 | 76.650002 | 1 | 0 | 0.5167 |
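
In the output above, the row is labelled 0 (no churn) with a score of 0.5167, so the score appears to be the probability of the predicted label rather than the probability of churn. A function that returns the probability of churn therefore needs to flip the score for rows labelled 0. The sketch below shows one way this might look. `churn_probability` and `predict_churn` are hypothetical names, the second example row is made up, and the sketch assumes the 0/1 `Churn` labels used in this notebook.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert prediction_label / prediction_score columns into P(churn = 1)."""
    label = predictions["prediction_label"]
    score = predictions["prediction_score"]
    # The score belongs to the predicted label, so use 1 - score when the label is 0
    prob = score.where(label == 1, 1 - score)
    return prob.rename("churn_probability")


def predict_churn(df: pd.DataFrame, model_name: str = "XGBoost") -> pd.Series:
    """Load the saved PyCaret pipeline and return P(churn) for each row of df."""
    # Imported lazily so churn_probability() can be used without PyCaret installed
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    return churn_probability(predict_model(model, data=df))


# First row mirrors the notebook output above; the second row is made up
example = pd.DataFrame(
    {"prediction_label": [0, 1], "prediction_score": [0.5167, 0.9]},
    index=["8361-LTMKD", "hypothetical-row"],
)
probs = churn_probability(example)
print(probs)
```

For the row shown above this gives a churn probability of about 0.4833. In a script, `predict_churn` could be called on a DataFrame read from new_churn_data.csv, assuming that file has the same columns as the training data.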


In [None]:
# Switched over to VS Code for the following lines

In [25]:
from IPython.display import Code

# Display the contents of the prediction script
Code(filename='predict_Churn.py')

In [None]:
%run predict_Churn.py

Write a short summary of the process and results here.

Using the prepped churn data from week 2, I used pycaret's AutoML tools (`setup` and `compare_models`) to fit and compare several classifiers.
Logistic regression ranked first by accuracy (0.7905) and AUC (0.8375), although the ranking could change if the data or the comparison metric changes.
I then used `predict_model` to score a single row of the data and saved the model to disk, both with `save_model` and with pickle.
I then tried to finish the assignment in VS Code (predict_Churn.py) but had some difficulty.