# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
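
The percentile challenge above can be sketched in a few lines. This is a minimal illustration, not the course's reference solution: the helper name `percentile_of` and the training probabilities are made up for the example, and in practice the sorted list would come from the model's predicted churn probabilities on the training set.

```python
from bisect import bisect_right


def percentile_of(prob, train_probs_sorted):
    """Return the percentage of training predictions that are <= prob.

    train_probs_sorted must be sorted in ascending order.
    """
    if not train_probs_sorted:
        raise ValueError("need at least one training prediction")
    # bisect_right counts how many sorted values are <= prob.
    rank = bisect_right(train_probs_sorted, prob)
    return 100.0 * rank / len(train_probs_sorted)


# Hypothetical training-set churn probabilities (illustrative values only).
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.70, 0.90])

pct = percentile_of(0.78, train_probs)
print(f"A churn probability of 0.78 sits at the {pct:.0f}th percentile")
```

With these made-up values, nine of the ten training predictions are at or below 0.78, so the new prediction lands at the 90th percentile.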

In [None]:
# My Fifth Assignment on DS Automation

In [89]:
import pandas as pd

df = pd.read_csv('New_Churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Average_Monthly_Charges_ratio,Monthly_Charges_ratio
7590-VHVEG,1,No,Month-to-month,Electronic check,29.85,29.85,No,29.850000,1.000000
5575-GNVDE,34,Yes,One year,Mailed check,56.95,1889.50,No,55.573529,0.030140
3668-QPYBK,2,Yes,Month-to-month,Mailed check,53.85,108.15,Yes,54.075000,0.497920
7795-CFOCW,45,No,One year,Bank transfer (automatic),42.30,1840.75,No,40.905556,0.022980
9237-HQITU,2,Yes,Month-to-month,Electronic check,70.70,151.65,Yes,75.825000,0.466205
...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,Yes,One year,Mailed check,84.80,1990.50,No,82.937500,0.042602
2234-XADUH,72,Yes,One year,Credit card (automatic),103.20,7362.90,No,102.262500,0.014016
4801-JZAZL,11,No,Month-to-month,Electronic check,29.60,346.45,No,31.495455,0.085438
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.40,306.60,Yes,76.650000,0.242661


In [90]:
from pycaret.classification import *


# AutoML setup with Churn as the target

In [91]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,2668
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(7043, 9)"
5,Transformed data shape,"(7043, 14)"
6,Transformed train set shape,"(4930, 14)"
7,Transformed test set shape,"(2113, 14)"
8,Ordinal features,1
9,Numeric features,5


In [92]:
# Finding the best model by comparing different models.

In [93]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7951,0.8388,0.7951,0.7843,0.7858,0.4328,0.4396,0.024
lda,Linear Discriminant Analysis,0.7947,0.8344,0.7947,0.7837,0.7854,0.4319,0.4384,0.016
gbc,Gradient Boosting Classifier,0.7943,0.8362,0.7943,0.7825,0.7837,0.4256,0.4337,0.131
ridge,Ridge Classifier,0.7935,0.0,0.7935,0.78,0.7785,0.4068,0.4209,0.013
ada,Ada Boost Classifier,0.7866,0.8331,0.7866,0.7752,0.7773,0.411,0.4171,0.045
lightgbm,Light Gradient Boosting Machine,0.784,0.8255,0.784,0.7738,0.7765,0.4117,0.4157,0.552
rf,Random Forest Classifier,0.7751,0.7981,0.7751,0.7636,0.7663,0.3837,0.3887,0.087
knn,K Neighbors Classifier,0.7673,0.7378,0.7673,0.7524,0.7548,0.3487,0.3567,0.018
et,Extra Trees Classifier,0.7529,0.7717,0.7529,0.7459,0.7484,0.3452,0.3467,0.067
dummy,Dummy Classifier,0.7347,0.5,0.7347,0.5398,0.6223,0.0,0.0,0.014


In [94]:
# Our best model is Logistic Regression, with an accuracy of 0.7951 (about 80%)
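
To see why accuracy alone can be a weak basis for choosing a churn model, here is a small sketch that computes accuracy, precision, and recall for the positive (churn) class by hand. The helper names and the labels are illustrative assumptions: the 27/73 split only approximates the class balance implied by the Dummy Classifier row above. The Recall and Prec. columns in that table appear to be class-weighted averages (Recall matches Accuracy in every row), so the positive-class values computed here are not directly comparable to them.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus precision and recall for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative labels only: 27 churners and 73 non-churners.
y_true = [1] * 27 + [0] * 73
# A "model" that never predicts churn, like a majority-class dummy.
always_no = [0] * 100

result = binary_metrics(y_true, always_no)
print(result)
```

The never-churn baseline reaches 73% accuracy while finding none of the churners, which is why recall or AUC may be a better sorting metric when catching churners matters most.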

In [95]:
best_model

In [96]:
# Saving the best model to disk

In [106]:
save_model(best_model, 'LogisticRegression')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'MonthlyCharges',
                                              'TotalCharges',
                                              'Average_Monthly_Charges_ratio',
                                              'Monthly_Charges_ratio'],
                                     transformer=SimpleImputer(add_indicator=Fa...
                                                               handle_missing='return_nan',
                                                               handle_unknown='value',
                                                               return_df=True,
                                                    

In [107]:
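import pickle

# Also pickle the object returned by compare_models() directly.
# Unlike save_model() above, this does not store the preprocessing pipeline.
with open('LogisticRegression.pk', 'wb') as f:
    pickle.dump(best_model, f)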

In [108]:
with open('LogisticRegression.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [109]:
loaded_lda = load_model('LogisticRegression')

Transformation Pipeline and Model Successfully Loaded


In [110]:
new_data=df.iloc[-2:-1]

# Predicting with the loaded model: the predicted label is Yes with a prediction score of 0.6649 (about 66%)

In [111]:
predict_model(loaded_lda, new_data)

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Average_Monthly_Charges_ratio,Monthly_Charges_ratio,prediction_label,prediction_score
7832-POPKP,3,62,1,0,0,101.699997,3106.560059,50.105808,0.032737,Yes,0.6649


In [103]:
# Creating a Python script with a function that takes a pandas DataFrame as input and returns the probability of churn for each row
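
One detail such a function has to handle is turning PyCaret's output into a churn probability. The sketch below assumes that `prediction_score` is the probability of the predicted label (so a "No" with score 0.90 means a 0.10 chance of churn); the helper names are hypothetical and this is not the content of `predict_churn.py`. It works on plain sequences so it can run without PyCaret installed.

```python
def churn_probability(label, score, positive_label="Yes"):
    """Convert (predicted label, score of that label) into P(churn).

    Assumes score is the probability of the predicted label.
    """
    return score if label == positive_label else 1.0 - score


def churn_probabilities(labels, scores, positive_label="Yes"):
    """Apply churn_probability to paired sequences of labels and scores."""
    return [
        churn_probability(label, score, positive_label)
        for label, score in zip(labels, scores)
    ]


# First pair is the row shown in the prediction output above;
# the second pair is a made-up example of a "No" prediction.
labels = ["Yes", "No"]
scores = [0.6649, 0.90]

probs = churn_probabilities(labels, scores)
print([round(p, 4) for p in probs])
```

With a DataFrame returned by `predict_model`, the same helper could be called as `churn_probabilities(preds["prediction_label"], preds["prediction_score"])`, where `preds` is a hypothetical variable holding that output.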

In [156]:
from IPython.display import Code

Code('predict_churn.py')

# Running the script on the new data (the true values are [1, 0, 0, 1, 0])

In [157]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
           Churn_prediction
customerID                 
9305-CKSKC                1
1452-KNGVK                0
6723-OKKJM                0
7832-POPKP                0
6348-TACGU                1


# Summary

I used PyCaret to find the best machine learning algorithm for the data by comparing several models. The metric used to choose the best model was accuracy (the default), although AUC, precision, recall, and others are also options. Logistic Regression came out on top with an accuracy of about 80% (0.7951). I then saved the model to disk, both with PyCaret's `save_model` and as a pickle file, loaded it back, and predicted on a row of the data, which gave a prediction score of about 66% (0.6649).
I created a Python script with a function that takes a pandas DataFrame as input and returns the likelihood of churn for each row, and tried different threshold values to match the true values of the new data file. For the unmodified data, I took the week 2 file, replaced the churn data with the new unmodified churn data, and applied the same cleaning so that the new data matched the format of the prepared churn data and the model could predict churn.
I created a GitHub account and uploaded the Python file to a repository.

I reran the model comparison several times and got different best models, since it is AutoML. With the Gradient Boosting Classifier, I was able to reproduce the true values 1 0 0 1 0 when the threshold was 0.8.
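
The effect of the threshold can be illustrated with a short sketch. The probabilities below are hypothetical, chosen only to show how moving the cutoff from 0.5 to 0.8 changes the labels; they are not the notebook's actual predictions, and `apply_threshold` is an illustrative helper name.

```python
def apply_threshold(churn_probs, threshold=0.5):
    """Label a row 1 (churn) when its churn probability is >= threshold."""
    return [1 if p >= threshold else 0 for p in churn_probs]


# Hypothetical churn probabilities for five customers (illustrative only).
example_probs = [0.91, 0.62, 0.30, 0.85, 0.74]
# True values for the new data, as given in the assignment text.
true_values = [1, 0, 0, 1, 0]

for threshold in (0.5, 0.8):
    labels = apply_threshold(example_probs, threshold)
    matches = sum(int(a == b) for a, b in zip(labels, true_values))
    print(f"threshold={threshold}: {labels} ({matches}/5 match)")
```

A threshold picked to fit five known answers may not generalize, so it is worth checking the same cutoff on the held-out test set as well.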