# DS Automation Assignment

Using our prepared churn data from week 2:
- use PyCaret to find the ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with PyCaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
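
For the percentile challenge above, one possible approach is to rank a new churn probability against the sorted churn probabilities predicted for the training set. The sketch below is only an illustration: the helper name `percentile_of` and the `train_probs` values are hypothetical and do not come from the churn data.

```python
from bisect import bisect_right


def percentile_of(score, train_scores):
    """Return the percentage of training-set probabilities that are <= score."""
    ordered = sorted(train_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.70, 0.91]

# Nine of the ten illustrative values are <= 0.78.
print(percentile_of(0.78, train_probs))  # 90.0
```

With real data, `train_probs` would be replaced by the churn probabilities that the saved model predicts for the training rows.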

In [15]:
import pandas as pd

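# Load the prepared churn data from week 2, indexed by customer ID
df = pd.read_csv('Week_2_result_churn_data.csv', index_col='customerID')
# Drop the MonthlyChargesToTotalChargesRatio column before modeling
df = df.drop('MonthlyChargesToTotalChargesRatio', axis=1)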

df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,1,29.85,29.85,0
5575-GNVDE,34,1,1,0,56.95,1889.50,0
3668-QPYBK,2,1,0,0,53.85,108.15,1
7795-CFOCW,45,0,1,2,42.30,1840.75,0
9237-HQITU,2,1,0,1,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,0,84.80,1990.50,0
2234-XADUH,72,1,1,3,103.20,7362.90,0
4801-JZAZL,11,0,0,1,29.60,346.45,0
8361-LTMKD,4,1,0,0,74.40,306.60,1


In [16]:
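# Import only the PyCaret functions used in this notebook
from pycaret.classification import (
    setup, compare_models, predict_model, save_model, load_model
)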

In [17]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,7561
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


In [18]:
# Cross-validate the available classifiers; results are ranked by accuracy by default
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7927,0.8331,0.4649,0.655,0.5432,0.4141,0.4247,0.161
ridge,Ridge Classifier,0.7911,0.8149,0.4496,0.6564,0.533,0.4045,0.4169,0.019
lr,Logistic Regression,0.7905,0.8263,0.4962,0.6352,0.5566,0.4222,0.4281,0.036
ada,Ada Boost Classifier,0.786,0.8284,0.4702,0.6309,0.5379,0.4026,0.4105,0.084
lda,Linear Discriminant Analysis,0.7826,0.8149,0.4862,0.6151,0.5424,0.4025,0.4077,0.015
lightgbm,Light Gradient Boosting Machine,0.7819,0.8178,0.4786,0.6154,0.5376,0.3981,0.4039,0.126
rf,Random Forest Classifier,0.7718,0.7961,0.4725,0.5883,0.5226,0.3754,0.3801,0.17
et,Extra Trees Classifier,0.7615,0.7754,0.4786,0.558,0.5142,0.3577,0.3601,0.146
knn,K Neighbors Classifier,0.757,0.7283,0.4044,0.5577,0.4683,0.3162,0.3233,0.034
qda,Quadratic Discriminant Analysis,0.7467,0.8182,0.7362,0.5164,0.6068,0.4284,0.4433,0.015
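
The choice of metric matters for this table: different metrics put different models first. As a rough illustration, the sketch below copies the scores from the table above into a DataFrame (the `scores` and `best_by_metric` names are just for this example) and looks up the top model per metric.

```python
import pandas as pd

# Scores copied from the compare_models() output above (a subset of the columns).
scores = pd.DataFrame(
    {
        "Accuracy": [0.7927, 0.7911, 0.7905, 0.7860, 0.7826,
                     0.7819, 0.7718, 0.7615, 0.7570, 0.7467],
        "AUC": [0.8331, 0.8149, 0.8263, 0.8284, 0.8149,
                0.8178, 0.7961, 0.7754, 0.7283, 0.8182],
        "Recall": [0.4649, 0.4496, 0.4962, 0.4702, 0.4862,
                   0.4786, 0.4725, 0.4786, 0.4044, 0.7362],
        "Prec.": [0.6550, 0.6564, 0.6352, 0.6309, 0.6151,
                  0.6154, 0.5883, 0.5580, 0.5577, 0.5164],
        "F1": [0.5432, 0.5330, 0.5566, 0.5379, 0.5424,
               0.5376, 0.5226, 0.5142, 0.4683, 0.6068],
    },
    index=["gbc", "ridge", "lr", "ada", "lda",
           "lightgbm", "rf", "et", "knn", "qda"],
)

# For each metric, find the model ID with the highest score.
best_by_metric = scores.idxmax()
print(best_by_metric)
```

Here `gbc` leads on accuracy and AUC, `ridge` on precision, and `qda` on recall and F1. In PyCaret, the ranking metric can be changed with the `sort` argument, for example `compare_models(sort='AUC')`.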


In [19]:
best_model

In [20]:
# Sanity check: predict a single row (the second-to-last customer)
predict_model(best_model, data=df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,0,74.400002,306.600006,1,1,0.5928


In [21]:
save_model(best_model, 'trained_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_imputer',...
                                             criterion='friedman_mse', init=None,
                      

In [22]:
# import pickle
# with open('Ridge.pk', 'wb') as f:
#     pickle.dump(best_model, f)

In [23]:
# with open('Ridge.pk', 'rb') as f:
#     loaded_model = pickle.load(f)

In [24]:
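# Note: despite the variable name, the saved model is the Gradient Boosting Classifier selected above
loaded_lda = load_model('trained_model')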

Transformation Pipeline and Model Successfully Loaded


In [25]:
new_df = pd.read_csv('new_churn_data.csv', index_col='customerID')
new_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806
6348-TACGU,10,0,0,1,51.15,3440.97,344.097


In [26]:
# new_data=new_df.iloc[-2:-1]
# print(new_data)

In [27]:
# prediction_score is the probability of the predicted label, not necessarily of churn
prediction = predict_model(loaded_lda, data=new_df)
prediction

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
9305-CKSKC,22,1,0,2,97.400002,811.700012,36.895454,1,0.6264
1452-KNGVK,8,0,1,1,77.300003,1701.949951,212.743744,0,0.8426
6723-OKKJM,28,1,0,0,28.25,250.899994,8.960714,0,0.92
7832-POPKP,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6583
6348-TACGU,10,0,0,1,51.150002,3440.969971,344.096985,0,0.7059
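
In this output, `prediction_score` is the probability of whichever label was predicted, so a score of 0.92 next to label 0 means a 0.92 probability of *no* churn. Since the assignment asks for the probability of churn, one way to convert the scores is sketched below. The helper name `churn_probability` is hypothetical, and the small DataFrame just repeats the labels and scores shown above.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert PyCaret's label/score pair into the probability of churn (label 1)."""
    label = predictions["prediction_label"]
    score = predictions["prediction_score"]
    # Keep the score where churn was predicted; otherwise take its complement.
    prob = score.where(label == 1, 1 - score)
    return prob.rename("churn_probability")


# Labels and scores copied from the prediction output above.
predictions = pd.DataFrame(
    {
        "prediction_label": [1, 0, 0, 0, 0],
        "prediction_score": [0.6264, 0.8426, 0.92, 0.6583, 0.7059],
    },
    index=pd.Index(
        ["9305-CKSKC", "1452-KNGVK", "6723-OKKJM", "7832-POPKP", "6348-TACGU"],
        name="customerID",
    ),
)

print(churn_probability(predictions).round(4))
```

This gives churn probabilities of roughly 0.6264, 0.1574, 0.08, 0.3417, and 0.2941 for the five customers. PyCaret's `predict_model` also accepts `raw_score=True`, which should return a score column per class; it is worth checking the column names in the installed version before relying on them.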


In [28]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
Data loaded successfully!!


predictions:
           prediction_label
customerID                 
9305-CKSKC            Churn
1452-KNGVK         No churn
6723-OKKJM         No churn
7832-POPKP         No churn
6348-TACGU         No churn


# Summary

Write a short summary of the process and results here.

First, I loaded the prepared data from 'Week_2_result_churn_data.csv' and dropped the 'MonthlyChargesToTotalChargesRatio' column. I then used PyCaret's setup() to build a machine learning pipeline with 'Churn' as the target. compare_models() ranked the Gradient Boosting Classifier first by accuracy (0.7927, with an AUC of 0.8331), so I selected it as the best model. As a quick check, I made a prediction for the second-to-last customer in the dataset. Next, I saved the model to disk and reloaded it to confirm that it worked correctly. I then loaded the new data from 'new_churn_data.csv' and made predictions for it. Finally, I ran predict_churn.py, which prints the predictions for the new data.

The Python script loads the new customer data and makes predictions with the saved model. It first loads the saved model named 'trained_model' using PyCaret's load_model() function. It then reads the customer data from 'new_churn_data.csv' into a pandas DataFrame, with 'customerID' as the index. The script contains a function called make_predictions(), which uses the loaded model to predict whether each customer is likely to churn. The predictions are labeled as either 'Churn' or 'No churn', and unnecessary columns are removed from the results. For the five new customers, the script predicts churn only for the first one. Compared with the true values [1, 0, 0, 1, 0], four of the five predictions are correct; the fourth customer (7832-POPKP) is a churner that the model missed.
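
The contents of predict_churn.py are not shown in this notebook. The sketch below is only one way such a script could be structured to match the description above. The helper `label_predictions` and the `model_name` parameter are assumptions for illustration; only `make_predictions`, 'trained_model', and the 'Churn' / 'No churn' labels come from the summary.

```python
import pandas as pd


def label_predictions(predictions: pd.DataFrame) -> pd.DataFrame:
    """Keep only the predicted label and map it to readable text."""
    labels = predictions["prediction_label"].map({1: "Churn", 0: "No churn"})
    return labels.to_frame()


def make_predictions(df: pd.DataFrame, model_name: str = "trained_model") -> pd.DataFrame:
    """Load the saved PyCaret pipeline and predict churn for each row of df."""
    # Imported lazily so label_predictions() can be used without PyCaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    return label_predictions(predictions)


# Small demonstration of the labeling step, using the labels predicted above.
demo = pd.DataFrame(
    {"prediction_label": [1, 0, 0, 0, 0]},
    index=pd.Index(
        ["9305-CKSKC", "1452-KNGVK", "6723-OKKJM", "7832-POPKP", "6348-TACGU"],
        name="customerID",
    ),
)
print(label_predictions(demo))
```

In a script, the function would be called on the new data, for example `make_predictions(pd.read_csv('new_churn_data.csv', index_col='customerID'))`, with the saved 'trained_model' file in the working directory.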


