# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

### Data Setup
We're starting with our Week 2 `prepared_churn_data.csv` to use with pycaret autoML.

In [2]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

df = pd.read_csv('prepared_churn_data.csv', index_col='customerID')
automl = setup(df, target='Churn')

| | Description | Value |
|---|---|---|
| 0 | session_id | 2953 |
| 1 | Target | Churn |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (7043, 8) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 4 |
| 7 | Categorical Features | 3 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


### Choose a metric you think is best to use for finding the best model
After reading about the various metrics, I chose `F1`, which works well for binary classification when we care most about the positive class, in this case `Churn = 1`. We discussed why this matters more than sheer accuracy: in fact, we will see that the best `F1` model has a lower accuracy than the no-information rate (the accuracy of always predicting the majority class). The trade-off is worth it because `F1` rewards recall as well as precision, so favoring it also keeps our false negative rate low and gives us the widest coverage of churn risk.
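To make the no-information rate concrete, here is a minimal sketch that computes it as the share of the most common class. The helper name `no_information_rate` and the toy labels are illustrative assumptions, not part of the notebook or of pycaret.

```python
from collections import Counter


def no_information_rate(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels for illustration: 7 non-churners and 3 churners.
toy_labels = [0] * 7 + [1] * 3
print(no_information_rate(toy_labels))  # 0.7
```

On the notebook's data, the same helper could be called as `no_information_rate(df['Churn'].tolist())`. A model whose accuracy falls below that value is doing worse than a constant guess on accuracy alone, which is why a second metric such as `F1` is useful here.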

In [4]:
best_model = compare_models(sort='F1')

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| nb | Naive Bayes | 0.6864 | 0.8111 | 0.8429 | 0.4484 | 0.5853 | 0.3692 | 0.417 | 0.004 |
| lda | Linear Discriminant Analysis | 0.7897 | 0.8268 | 0.522 | 0.6186 | 0.5651 | 0.4279 | 0.4312 | 0.005 |
| gbc | Gradient Boosting Classifier | 0.7994 | 0.8371 | 0.4926 | 0.6572 | 0.5624 | 0.4359 | 0.4439 | 0.087 |
| lr | Logistic Regression | 0.7911 | 0.8349 | 0.5088 | 0.6258 | 0.56 | 0.4252 | 0.4299 | 0.201 |
| ada | Ada Boost Classifier | 0.7917 | 0.8354 | 0.5003 | 0.6307 | 0.5566 | 0.4231 | 0.4288 | 0.041 |
| lightgbm | Light Gradient Boosting Machine | 0.7868 | 0.8207 | 0.5049 | 0.614 | 0.5534 | 0.4153 | 0.4192 | 0.083 |
| ridge | Ridge Classifier | 0.7886 | 0.0 | 0.4485 | 0.6402 | 0.5263 | 0.3958 | 0.4069 | 0.004 |
| rf | Random Forest Classifier | 0.7621 | 0.7904 | 0.4687 | 0.5546 | 0.5073 | 0.3522 | 0.3547 | 0.115 |
| svm | SVM - Linear Kernel | 0.7428 | 0.0 | 0.5149 | 0.5528 | 0.5023 | 0.3404 | 0.3592 | 0.009 |
| knn | K Neighbors Classifier | 0.7675 | 0.7439 | 0.4338 | 0.5759 | 0.4941 | 0.3473 | 0.3535 | 0.011 |


We can see that `Naive Bayes` is the best model for `F1`, the metric we sorted on, as well as the best for `Recall`. Unsurprisingly, `Gradient Boosting Classifier` leads on every other score (Accuracy, AUC, Precision, Kappa, and MCC). I've run this comparison numerous times, and the randomness is apparent: the scores, and even the rankings, tend to change from run to run. Passing a fixed `session_id` to `setup` should make the results repeatable.
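As a quick sanity check on the ranking, `F1` is the harmonic mean of precision and recall, so it can be approximated from the other two columns. The sketch below uses the precision and recall reported for Naive Bayes and Gradient Boosting; the helper name `f1_from` is an illustrative assumption.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table.
reported = {
    "nb": (0.4484, 0.8429),
    "gbc": (0.6572, 0.4926),
}

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, score in f1_scores.items():
    print(f"{name}: {score:.4f}")
```

The results land close to, but not exactly on, the table's `F1` column. The likely reason is that pycaret reports each metric averaged over cross-validation folds, and the harmonic mean of averaged precision and recall is not the same as the average of per-fold `F1` scores.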

### Save the Model to Disk
Before saving the model with pycaret's `save_model`, let's confirm that `best_model` holds the model we expect and settle on a readable file name. Once that's done, we simply save the model as a `.pkl` (pickle) file for later use.

In [5]:
best_model

GaussianNB(priors=None, var_smoothing=1e-09)

In [6]:
save_model(best_model, 'GNB')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                 ('cluster_all', 'passthrough'),
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Cl

### Create a Python Script to Return the Probability of Churn for Each Row of the Dataframe
I wrote the script in VS Code since that's what I'm familiar with. All I really did was change the variable names to match Churn and `GNB`, but why fix what isn't broken?

First we can preview the code as demonstrated in the FTE.

In [8]:
from IPython.display import Code

Code('predict_churn.py')

Now that we see the code, it has two functions: one for loading the DataFrame and one for making the prediction with the hardcoded model `GNB`. We also do a little formatting for readability, renaming the default `Label` column to `Churn_prediction` and replacing the 1/0 results with strings.

The script can be run directly or imported piecemeal. If run directly, it is hardcoded to load `new_churn_data.csv` and run the `make_prediction` function. If imported, you can specify the filename via `df = load_data(<filename>)`, then run `make_prediction(df)`.
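For readers without the file at hand, the structure described above might look roughly like the sketch below. It is a reconstruction from the description, not the actual `predict_churn.py`: the `format_predictions` helper and the `model_name` parameter are hypothetical additions, and it assumes a pycaret 2.x-style `predict_model` that returns an integer `Label` column.

```python
import pandas as pd


def load_data(filepath):
    """Read churn data from a CSV file, indexed by customerID."""
    return pd.read_csv(filepath, index_col='customerID')


def format_predictions(predictions):
    """Rename the `Label` column and map 1/0 to readable strings.

    Hypothetical helper; assumes `Label` holds integer 1/0 values.
    """
    labeled = predictions.rename(columns={'Label': 'Churn_prediction'})
    return labeled['Churn_prediction'].map({1: 'Churn', 0: 'No Churn'})


def make_prediction(df, model_name='GNB'):
    """Load the saved pipeline and return a churn label for each row of df."""
    # Imported lazily so the formatting helper can be used without pycaret.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)  # reads GNB.pkl written by save_model
    predictions = predict_model(model, data=df)
    return format_predictions(predictions)
```

Running the script directly would then amount to `print(make_prediction(load_data('new_churn_data.csv')))`. Keeping the formatting step in its own function makes it possible to check the label mapping without loading the model.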

In [9]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU    No Churn
Name: Churn_prediction, dtype: object


Thanks to the forewarning, we know the true values are [1, 0, 0, 1, 0]. It looks like our `GNB` model still gave us a false negative for the second-to-last entry, so it was 80% accurate (4 of 5 correct) on this dataset.
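The comparison can be checked in a few lines. This sketch encodes the printed predictions as 1 for `Churn` and 0 for `No Churn`, then compares them with the true values given in the assignment. The variable names are illustrative.

```python
customer_ids = ['9305-CKSKC', '1452-KNGVK', '6723-OKKJM', '7832-POPKP', '6348-TACGU']
true_values = [1, 0, 0, 1, 0]  # given in the assignment
predicted = [1, 0, 0, 0, 0]    # Churn = 1, No Churn = 0, from the script output

# Fraction of rows where the prediction matches the true value.
matches = [t == p for t, p in zip(true_values, predicted)]
accuracy = sum(matches) / len(matches)

# Customers who churned but were predicted not to churn.
false_negatives = [
    cid for cid, t, p in zip(customer_ids, true_values, predicted)
    if t == 1 and p == 0
]

print(accuracy)         # 0.8
print(false_negatives)  # ['7832-POPKP']
```

With only five rows, each mistake moves the accuracy by 20 points, so this is a smoke test of the script rather than a measure of model quality.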

# Summary
This week was pretty straightforward: pycaret did the work of the previous weeks in less time than it would take to read this paragraph. We started from our Week 2 `prepared_churn_data.csv` and ran it through `pycaret` automated machine learning. Once we confirmed the setup, the data was preprocessed by pycaret's `setup` function. We then passed the preprocessed data to `compare_models` to determine which model best fits our chosen metric. With `F1` as our preferred metric, `Naive Bayes` came out as the best model for our needs. Conveniently, we can export the trained model to a `.pkl` file for later use; having a model as static data is very useful for recalling and reusing it in later functions or through an API. Lastly, we wrote a Python script to read a dataframe from file, read the model from file, then run a prediction on the selected data. The prediction outputs a simple `Churn/No Churn` for each row of data, and in our case it was 80% accurate on the five new rows.

### GitHub
https://github.com/arggonuts/MSDS600_autoML_project