# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
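
The percentile challenge above can be approached in several ways. The following is only a minimal sketch using the standard library: the function name `percentile_of_score` and the `train_probs` values are hypothetical placeholders, and in practice the list would hold the model's predicted churn probabilities for the training set.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    if not train_probs:
        raise ValueError("train_probs must not be empty")
    ordered = sorted(train_probs)
    # bisect_right gives the number of sorted values that are <= prob
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.78, 0.91]

# Nine of the ten values are <= 0.78, so this is the 90th percentile
print(percentile_of_score(train_probs, 0.78))
```

If SciPy is available, `scipy.stats.percentileofscore(train_probs, prob, kind='weak')` should give the same "less than or equal" percentile.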

In [1]:
import pandas as pd


In [2]:
df=pd.read_csv('prepped_Churn_data.csv', index_col='customerID')
df

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totalcharges_monthlycharges_ratio,totalcharges_tenure_ratio
7590-VHVEG,0,1,0,0,1,29.85,29.85,0,1.000000,29.850000
5575-GNVDE,1,34,1,1,2,56.95,1889.50,0,33.178227,55.573529
3668-QPYBK,2,2,1,0,2,53.85,108.15,1,2.008357,54.075000
7795-CFOCW,3,45,0,1,3,42.30,1840.75,0,43.516548,40.905556
9237-HQITU,4,2,1,0,1,70.70,151.65,1,2.144979,75.825000
...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,7038,24,1,1,2,84.80,1990.50,0,23.472877,82.937500
2234-XADUH,7039,72,1,1,4,103.20,7362.90,0,71.345930,102.262500
4801-JZAZL,7040,11,0,0,1,29.60,346.45,0,11.704392,31.495455
8361-LTMKD,7041,4,1,0,2,74.40,306.60,1,4.120968,76.650000


In [3]:
df = df.drop('Unnamed: 0', axis=1)
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totalcharges_monthlycharges_ratio,totalcharges_tenure_ratio
7590-VHVEG,1,0,0,1,29.85,29.85,0,1.000000,29.850000
5575-GNVDE,34,1,1,2,56.95,1889.50,0,33.178227,55.573529
3668-QPYBK,2,1,0,2,53.85,108.15,1,2.008357,54.075000
7795-CFOCW,45,0,1,3,42.30,1840.75,0,43.516548,40.905556
9237-HQITU,2,1,0,1,70.70,151.65,1,2.144979,75.825000
...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,2,84.80,1990.50,0,23.472877,82.937500
2234-XADUH,72,1,1,4,103.20,7362.90,0,71.345930,102.262500
4801-JZAZL,11,0,0,1,29.60,346.45,0,11.704392,31.495455
8361-LTMKD,4,1,0,2,74.40,306.60,1,4.120968,76.650000


In [4]:
del df['totalcharges_monthlycharges_ratio']


In [5]:
del df['totalcharges_tenure_ratio']


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tenure          7032 non-null   int64  
 1   PhoneService    7032 non-null   int64  
 2   Contract        7032 non-null   int64  
 3   PaymentMethod   7032 non-null   int64  
 4   MonthlyCharges  7032 non-null   float64
 5   TotalCharges    7032 non-null   float64
 6   Churn           7032 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 697.5+ KB


## AutoML with pycaret


In [6]:
%conda install -c conda-forge pycaret -y


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [7]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [8]:
automl = setup(data = df, target = 'Churn', fold_shuffle=True, preprocess=False)

,Description,Value
0,session_id,994
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 7)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,3
8,Transformed Train Set,"(4922, 6)"
9,Transformed Test Set,"(2110, 6)"


In [9]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7903,0.832,0.4676,0.6476,0.5414,0.4107,0.4206,0.269
lr,Logistic Regression,0.7879,0.8281,0.4965,0.6308,0.5541,0.4179,0.4239,0.649
ada,Ada Boost Classifier,0.7873,0.83,0.4897,0.6319,0.5503,0.4142,0.4208,0.237
catboost,CatBoost Classifier,0.785,0.8255,0.4729,0.6296,0.539,0.4029,0.4104,1.48
lda,Linear Discriminant Analysis,0.7842,0.8176,0.4851,0.6245,0.5446,0.4065,0.4128,0.025
ridge,Ridge Classifier,0.7838,0.0,0.428,0.6441,0.5127,0.3813,0.3953,0.019
et,Extra Trees Classifier,0.7596,0.7655,0.4874,0.5579,0.5189,0.36,0.3623,0.344
rf,Random Forest Classifier,0.7594,0.7853,0.4638,0.5597,0.5056,0.349,0.3525,0.41
knn,K Neighbors Classifier,0.7592,0.7305,0.4174,0.5681,0.4783,0.3275,0.3354,0.041
qda,Quadratic Discriminant Analysis,0.7414,0.8174,0.7411,0.5114,0.6044,0.422,0.4388,0.022
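
Which model counts as "best" depends on the metric chosen. As a rough illustration, the sketch below copies four rows from the table above into a plain dictionary and picks the top model per metric. The `results` dictionary and the `best_by` helper are hypothetical names used only for this example. In pycaret itself, the same ranking can be requested with the `sort` argument, e.g. `compare_models(sort='AUC')`.

```python
# Selected rows copied from the compare_models() output above
results = {
    "gbc": {"Accuracy": 0.7903, "AUC": 0.8320, "Recall": 0.4676, "Prec.": 0.6476, "F1": 0.5414},
    "lr":  {"Accuracy": 0.7879, "AUC": 0.8281, "Recall": 0.4965, "Prec.": 0.6308, "F1": 0.5541},
    "ada": {"Accuracy": 0.7873, "AUC": 0.8300, "Recall": 0.4897, "Prec.": 0.6319, "F1": 0.5503},
    "qda": {"Accuracy": 0.7414, "AUC": 0.8174, "Recall": 0.7411, "Prec.": 0.5114, "F1": 0.6044},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(f"{metric:8s} -> {best_by(metric)}")
```

Among these rows, gradient boosting leads on accuracy, AUC, and precision, while quadratic discriminant analysis leads on recall and F1. The metric choice therefore matters if catching as many churners as possible is the goal.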


In [10]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, random_state=994,
                           subsample=1.0, tol=0.0001, validation_fraction=0.1,
                           verbose=0, warm_start=False)

In [11]:
df.iloc[-2:-1].shape

(1, 7)

In [12]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4,1,0,2,74.4,306.6,1,1,0.6492
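
The assignment asks for the probability of churn, while the output above gives a `Label` and a `Score`. The sketch below shows one way to convert between them. It assumes that `Score` is the probability of the predicted `Label`, which is how recent pycaret 2.x releases appear to behave; this should be checked against the installed version. The helper name `churn_probability` is hypothetical.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into the probability of churn.

    Assumption: Score is the probability of the predicted Label,
    so a "No Churn" prediction (label 0) needs to be flipped.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return score if label == 1 else 1.0 - score


# Row from the output above: Label=1, Score=0.6492
print(churn_probability(1, 0.6492))

# Hypothetical "No Churn" prediction with Score=0.80
print(round(churn_probability(0, 0.80), 4))
```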


## Saving and loading our model


In [21]:
save_model(best_model, 'gbc')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ['trained_model',
                  GradientBoostingClassifier(ccp_alpha=0.0,
                                             criterion='friedman_mse', init=None,
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
    

In [22]:
import pickle

with open('gbc_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [23]:
with open('gbc_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [24]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1])

In [25]:
loaded_gbc = load_model('gbc')

Transformation Pipeline and Model Successfully Loaded


In [27]:
predict_model(loaded_gbc, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4,1,0,2,74.4,306.6,1,0.6492


# Making a Python module to make predictions


In [66]:
from IPython.display import Code

Code('predict_churn.py')

In [67]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU    No Churn
Name: Churn_prediction, dtype: object
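
The predictions printed above can be compared with the true values given in the assignment, [1, 0, 0, 1, 0]. The following is a small sketch of that check using only built-in Python. The function name `classification_metrics` is a hypothetical helper, and with only five rows the resulting numbers are a sanity check rather than a reliable performance estimate.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Predictions printed by predict_churn.py, in the same row order
labels = ["Churn", "No Churn", "No Churn", "No Churn", "No Churn"]
y_pred = [1 if label == "Churn" else 0 for label in labels]

# True values given in the assignment
y_true = [1, 0, 0, 1, 0]

print(classification_metrics(y_true, y_pred))
```

Four of the five predictions match. The one miss is a churner predicted as "No Churn", which lowers recall but not precision.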


# Summary

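This assignment was an interesting one for me. We started by importing the prepped churn data set from week 2, in which every column has already been converted to numeric values, and dropped the leftover index column and the two ratio features. We then imported `setup`, `compare_models`, `predict_model`, `save_model`, and `load_model` from pycaret and used `compare_models()` to find the best model. With the default metric (accuracy), the Gradient Boosting Classifier came out on top, with an accuracy of 0.7903 and an AUC of 0.832. The best model may change when the notebook is rerun, because no `session_id` was fixed and the top models score very close to each other (logistic regression and AdaBoost are within 0.003 of it in accuracy). Using pycaret's `predict_model` on a single customer gave a predicted label of 1 (churn) with a score of 0.6492, which matches that customer's true label. We saved the model with both `save_model` and `pickle`, loaded it back, and confirmed that it gives the same prediction. Finally, `predict_churn.py` loads the saved pipeline and predicts [1, 0, 0, 0, 0] for the new data, compared with true values of [1, 0, 0, 1, 0], so 4 of the 5 predictions are correct.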