# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
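For the percentile challenge above, one possible approach is to rank a new probability against the sorted probabilities predicted for the training set. The following is a minimal sketch using only the standard library; the function name `percentile_of` and the `train_proba` values are hypothetical and chosen for illustration, not taken from the churn data.

```python
from bisect import bisect_right


def percentile_of(value, reference):
    """Return the percentage of reference values that are <= value."""
    ordered = sorted(reference)
    # bisect_right counts how many sorted values are <= value.
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical training-set churn probabilities (illustration only).
train_proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]

print(percentile_of(0.78, train_proba))  # 90.0
```

With real data, `train_proba` would be the model's predicted churn probabilities for the training rows.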

# Load our data

In [70]:
import pandas as pd

df = pd.read_csv('week_two_data.csv')
df = df.rename(columns = {"Total Charges To Tenure Ratio": "charge_per_tenure"})
df = df.drop(columns="MonthlyCharges/Ratio Variance")
df

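| | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn | charge_per_tenure |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 29.85 | 29.85 | 0 | 29.850000 |
| 1 | 34 | 1 | 1 | 1 | 56.95 | 1889.50 | 0 | 55.573529 |
| 2 | 2 | 1 | 0 | 1 | 53.85 | 108.15 | 1 | 54.075000 |
| 3 | 45 | 0 | 1 | 2 | 42.30 | 1840.75 | 0 | 40.905556 |
| 4 | 2 | 1 | 0 | 0 | 70.70 | 151.65 | 1 | 75.825000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7027 | 24 | 1 | 1 | 1 | 84.80 | 1990.50 | 0 | 82.937500 |
| 7028 | 72 | 1 | 1 | 3 | 103.20 | 7362.90 | 0 | 102.262500 |
| 7029 | 11 | 0 | 0 | 0 | 29.60 | 346.45 | 0 | 31.495455 |
| 7030 | 4 | 1 | 0 | 1 | 74.40 | 306.60 | 1 | 76.650000 |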


In [71]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

# Perform AutoML with PyCaret

Note, I did *not* include the install process for pycaret because it was... tedious. There are a lot of dependency issues and I had to do some playing around with conda and pip to get everything to play nice. I ended up using scikit-learn 0.23.2 and pycaret 2.3.10.

In [72]:
automl = setup(df, target='Churn', fold_shuffle=True, session_id=2)

| | Description | Value |
|---|---|---|
| 0 | session_id | 2 |
| 1 | Target | Churn |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (7032, 8) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 4 |
| 7 | Categorical Features | 3 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


In [73]:
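# Inspect the transformed (one-hot encoded) feature matrix; index 17 is its
# position in the tuple returned by setup() in this pycaret version.
automl[17]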

| | tenure | TotalCharges | charge_per_tenure | PhoneService_0 | Contract_0 | Contract_1 | Contract_2 | PaymentMethod_0 | PaymentMethod_1 | PaymentMethod_2 | PaymentMethod_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 29.850000 | 29.850000 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 34.0 | 1889.500000 | 55.573528 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 2.0 | 108.150002 | 54.075001 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 45.0 | 1840.750000 | 40.905556 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 2.0 | 151.649994 | 75.824997 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7027 | 24.0 | 1990.500000 | 82.937500 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7028 | 72.0 | 7362.899902 | 102.262497 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7029 | 11.0 | 346.450012 | 31.495455 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7030 | 4.0 | 306.600006 | 76.650002 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [74]:
# Compare models
best_model = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| gbc | Gradient Boosting Classifier | 0.7934 | 0.8358 | 0.4781 | 0.65 | 0.55 | 0.4199 | 0.4289 | 0.705 |
| lda | Linear Discriminant Analysis | 0.7903 | 0.828 | 0.5258 | 0.6226 | 0.5694 | 0.4323 | 0.4354 | 0.035 |
| lr | Logistic Regression | 0.7901 | 0.8345 | 0.5058 | 0.6274 | 0.559 | 0.4236 | 0.4285 | 1.714 |
| ada | Ada Boost Classifier | 0.7895 | 0.8343 | 0.4873 | 0.6325 | 0.5497 | 0.4154 | 0.4219 | 0.329 |
| ridge | Ridge Classifier | 0.7881 | 0.0 | 0.4534 | 0.6395 | 0.5299 | 0.3984 | 0.4086 | 0.038 |
| catboost | CatBoost Classifier | 0.7863 | 0.8305 | 0.4727 | 0.6272 | 0.5381 | 0.4029 | 0.4102 | 4.683 |
| lightgbm | Light Gradient Boosting Machine | 0.7859 | 0.8234 | 0.4881 | 0.6205 | 0.5457 | 0.4084 | 0.4138 | 0.414 |
| xgboost | Extreme Gradient Boosting | 0.768 | 0.8095 | 0.4673 | 0.5748 | 0.5148 | 0.3646 | 0.3684 | 1.034 |
| rf | Random Forest Classifier | 0.7639 | 0.7942 | 0.4612 | 0.5652 | 0.5074 | 0.3544 | 0.3578 | 0.868 |
| knn | K Neighbors Classifier | 0.7631 | 0.7404 | 0.4334 | 0.5668 | 0.4905 | 0.34 | 0.3455 | 0.079 |


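Gradient Boosting Classifier has the highest Accuracy (0.7934), AUC (0.8358), and Precision (0.65) of the models compared, although its Recall (0.4781) trails several other models, including Linear Discriminant Analysis (0.5258).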
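Which model counts as "best" depends on the metric chosen. As a small sketch, the first four rows of the comparison table above can be copied into a dictionary and ranked by each metric; the `results` dictionary and the helper `best_by` are illustrative names, not part of pycaret.

```python
# First four rows of the compare_models() table, copied by hand.
results = {
    "gbc": {"Accuracy": 0.7934, "AUC": 0.8358, "Recall": 0.4781, "Prec.": 0.65, "F1": 0.55},
    "lda": {"Accuracy": 0.7903, "AUC": 0.828, "Recall": 0.5258, "Prec.": 0.6226, "F1": 0.5694},
    "lr": {"Accuracy": 0.7901, "AUC": 0.8345, "Recall": 0.5058, "Prec.": 0.6274, "F1": 0.559},
    "ada": {"Accuracy": 0.7895, "AUC": 0.8343, "Recall": 0.4873, "Prec.": 0.6325, "F1": 0.5497},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(metric, "->", best_by(metric))
# Accuracy -> gbc
# AUC -> gbc
# Recall -> lda
# Prec. -> gbc
# F1 -> lda
```

If recall or F1 matters more than accuracy for churn, LDA would be the pick from this table. In pycaret itself, `compare_models` accepts a `sort` argument (for example `sort='Recall'`) to order the table by a different metric.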

In [75]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=2, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

# Save our model

In [76]:
import pickle

with open('GBCmodel.pkl', 'wb') as f:
    pickle.dump(best_model, f)

In [77]:
with open('GBCmodel.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
    print(loaded_model)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=2, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)


In [80]:
from IPython.display import Code

Code('model-script.py')
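
The contents of `model-script.py` are not reproduced in this export. As a rough, hypothetical sketch of what such a module might contain, the functions below take a fitted scikit-learn-style classifier and a dataframe and return per-row churn probabilities and labels. The names `predict_churn_proba` and `label_predictions` and the 0.5 threshold are assumptions for illustration, and the dataframe is assumed to already have the same columns the model was trained on.

```python
import pandas as pd


def predict_churn_proba(model, df: pd.DataFrame) -> pd.Series:
    """Return the predicted probability of churn (class 1) for each row."""
    # predict_proba returns one column per class; with classes [0, 1],
    # column 1 holds the probability of churn.
    proba = model.predict_proba(df)[:, 1]
    return pd.Series(proba, index=df.index, name="Churn_Probability")


def label_predictions(proba: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Turn probabilities into 'Churn' / 'No Churn' labels."""
    labels = proba.ge(threshold).map({True: "Churn", False: "No Churn"})
    return labels.rename("Churn_Prediction")
```

A script built on these could unpickle `GBCmodel.pkl`, read `new_churn_data.csv` with `pd.read_csv`, and print the result of `label_predictions(predict_churn_proba(model, new_df))`.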

In [83]:
%run model-script.py

Predictions: 
0    No Churn
1    No Churn
2    No Churn
3    No Churn
4       Churn
Name: Churn_Predicition, dtype: object


# Summary

In this notebook, we loaded our data from week two and modified the dataframe to match the new churn data set used for testing. We then set up automated machine learning (AutoML) with pycaret and compared the cross-validated performance of a variety of machine learning models. The gradient boosting classifier performed best in terms of accuracy, area under the curve (AUC), and precision. We saved that model as a pickle object using Python's pickle library. Finally, a Python script gave us a repeatable process for loading the model and generating its predictions on a new data set.

Our GBC model predicted 0, 0, 0, 0, 1 for the new churn data set, while the true labels were 1, 0, 0, 1, 0. The model therefore produced two false negatives and one false positive, correctly classifying only two of the five data points. It appears that our model is more inclined to predict the negative class (no churn) than the positive one, which is consistent with its cross-validated recall of about 0.48. There are a few possible reasons for this, and corresponding solutions. First, the model may need to be trained on more data. Second, the dataframe may need to be supplemented with new features. Over the last few weeks, none of our models has done much better than 80% accuracy, which could indicate that the features in the training data do not give a full picture of why customers churn. Finally, the training data and the new data may differ enough that the model generalizes poorly to the new data.
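
The error counts above can be checked with a few lines of plain Python. This sketch uses the predicted and true labels reported in this summary; the variable names are chosen for illustration.

```python
y_true = [1, 0, 0, 1, 0]  # true labels for new_churn_data.csv
y_pred = [0, 0, 0, 0, 1]  # labels predicted by the GBC model

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churn predicted as churn
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churn missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarm
tn = sum(t == 0 and p == 0 for t, p in pairs)  # no churn predicted as no churn

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(tp, fn, fp, tn)               # 0 2 1 2
print(accuracy, recall, precision)  # 0.4 0.0 0.0
```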