# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
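
The first optional challenge, locating a new prediction within the distribution of training predictions, can be sketched with the standard library alone. The function name `percentile_of` and the training probabilities below are made up for illustration; in practice the list would hold the model's churn probabilities for the training rows.

```python
from bisect import bisect_right


def percentile_of(value, sorted_training_probs):
    """Return the percentage of training probabilities that are <= value."""
    if not sorted_training_probs:
        raise ValueError("training probabilities must not be empty")
    # bisect_right counts how many sorted entries are <= value.
    rank = bisect_right(sorted_training_probs, value)
    return 100.0 * rank / len(sorted_training_probs)


# Hypothetical training-set churn probabilities (illustrative values only).
training_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.47, 0.55, 0.66, 0.90])

print(percentile_of(0.78, training_probs))  # 90.0
```

Sorting the training probabilities once and reusing the sorted list keeps each lookup cheap, which matters if many new rows are scored.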

# 1. Setup

## 1.1 Install Pycaret

In [2]:
!conda install -c conda-forge pycaret -y

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/g7/anaconda3

  added / updated specs:
    - pycaret


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0           8 KB  conda-forge
    catalogue-1.0.0            |   py38h578d9bd_3          13 KB  conda-forge
    catboost-1.0.3             |   py38h578d9bd_1        60.1 MB  conda-forge
    chart-studio-1.1.0         |     pyh9f0ad1d_0          51 KB  conda-forge
    colorlover-0.3.0           |             py_0          12 KB  conda-forge
    configparser-5.1.0         | 

In [5]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

## 1.2 Import Data

In [6]:
import pandas as pd
df = pd.read_csv('churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,No,Month-to-month,Electronic check,29.85,29.85,No
5575-GNVDE,34,Yes,One year,Mailed check,56.95,1889.50,No
3668-QPYBK,2,Yes,Month-to-month,Mailed check,53.85,108.15,Yes
7795-CFOCW,45,No,One year,Bank transfer (automatic),42.30,1840.75,No
9237-HQITU,2,Yes,Month-to-month,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,Yes,One year,Mailed check,84.80,1990.50,No
2234-XADUH,72,Yes,One year,Credit card (automatic),103.20,7362.90,No
4801-JZAZL,11,No,Month-to-month,Electronic check,29.60,346.45,No
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.40,306.60,Yes


# 2. AutoML

The original AutoML work in JupyterLab produced errors; switching to Jupyter Notebook seemed to resolve the problem. Experiments were run with two datasets, but only the last experiment is shown in this notebook.

In [7]:
#automl = setup(df, target='Churn', fold_shuffle=True)
automl = setup(df, target='Churn')


,Description,Value
0,session_id,1257
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"No: 0, Yes: 1"
4,Original Data,"(7043, 7)"
5,Missing Values,True
6,Numeric Features,3
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [8]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7959,0.8377,0.5448,0.6493,0.5919,0.4574,0.4609,0.233
gbc,Gradient Boosting Classifier,0.7935,0.8396,0.5097,0.6568,0.5726,0.4395,0.4465,0.069
lda,Linear Discriminant Analysis,0.7921,0.8282,0.5537,0.6376,0.5915,0.4532,0.456,0.01
ridge,Ridge Classifier,0.7876,0.0,0.4799,0.6488,0.5507,0.4159,0.4246,0.005
catboost,CatBoost Classifier,0.7876,0.836,0.5142,0.6356,0.5671,0.4289,0.4338,1.646
ada,Ada Boost Classifier,0.786,0.8352,0.5157,0.6332,0.5674,0.4273,0.4319,0.033
lightgbm,Light Gradient Boosting Machine,0.7819,0.8275,0.5276,0.6186,0.568,0.4236,0.427,0.804
xgboost,Extreme Gradient Boosting,0.77,0.8178,0.506,0.5913,0.5441,0.3918,0.3947,0.522
rf,Random Forest Classifier,0.7663,0.7934,0.4933,0.5835,0.5336,0.3794,0.3824,0.093
knn,K Neighbors Classifier,0.7568,0.7432,0.444,0.568,0.4971,0.3403,0.3454,0.089
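
The leaderboard above is sorted by accuracy, pycaret's default, and the ranking changes with the metric. The sketch below re-ranks the top three rows by hand using values copied from the table (these three rows also hold the best overall value for each metric shown); `best_by` is a hypothetical helper, not part of pycaret. In pycaret itself, `compare_models` accepts a `sort` argument (for example `sort='AUC'`) for the same purpose.

```python
# Accuracy, AUC and recall for the top three models, copied from the table above.
leaderboard = {
    "lr": {"Accuracy": 0.7959, "AUC": 0.8377, "Recall": 0.5448},
    "gbc": {"Accuracy": 0.7935, "AUC": 0.8396, "Recall": 0.5097},
    "lda": {"Accuracy": 0.7921, "AUC": 0.8282, "Recall": 0.5537},
}


def best_by(metric, scores=leaderboard):
    """Return the model ID with the highest value for the given metric."""
    return max(scores, key=lambda model_id: scores[model_id][metric])


for metric in ("Accuracy", "AUC", "Recall"):
    print(metric, "->", best_by(metric))
# Accuracy -> lr
# AUC -> gbc
# Recall -> lda
```

The differences between these three models are small, so the choice of metric, rather than the table alone, decides which one counts as "best".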


In [9]:
best_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1257, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [10]:
df.iloc[-2:-1].shape

(1, 7)

In [11]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.4,306.6,Yes,No,0.6053


# 3. Save Model

The experiments produced two models. The first, a Gradient Boosting Classifier, was trained on the prepared dataset 'churnNum'. The second, a Logistic Regression, was trained on the original dataset 'churn_data'.

In [12]:
save_model(best_model, 'lr')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  LogisticRegression(C=1.0, class_weight=None, dual=False,
                 

In [13]:
from IPython.display import Code

Code('predict_churn.py')
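
The contents of `predict_churn.py` are not shown in this export. A minimal sketch of such a module might look like the following; it is an illustration under stated assumptions, not the author's script. It assumes pycaret 2.x, where `predict_model` appends `Label` and `Score` columns and `Score` is the probability of the predicted label (consistent with the single-row prediction in section 2, where `Label` is No and `Score` is 0.6053). The names `score_to_churn_probability`, `predict_churn`, and `Churn_probability` are hypothetical.

```python
def score_to_churn_probability(label, score, positive_label="Yes"):
    """Convert a (Label, Score) pair into the probability of churn.

    Assumes Score is the probability of the predicted label, so a "No"
    prediction with Score 0.6053 means P(churn) = 1 - 0.6053.
    """
    return score if label == positive_label else 1.0 - score


def predict_churn(df, model_name="lr"):
    """Return the churn probability for each row of a pandas DataFrame."""
    # Imported here so the helper above can be used without pycaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)  # loads the pipeline written by save_model
    predictions = predict_model(model, data=df)
    probabilities = predictions.apply(
        lambda row: score_to_churn_probability(row["Label"], row["Score"]),
        axis=1,
    )
    return probabilities.rename("Churn_probability")


# Example use (needs pycaret, pandas, the saved model, and the CSV file;
# assumes new_churn_data.csv has the same layout as churn_data.csv):
#   import pandas as pd
#   new_data = pd.read_csv("new_churn_data.csv", index_col="customerID")
#   print(predict_churn(new_data))
```

Returning a probability rather than the Yes/No label matches the assignment's wording and leaves the decision threshold to the caller.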

In [14]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
0     No
1     No
2     No
3     No
4    Yes
Name: Churn_prediction, dtype: object
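
The assignment gives the true labels for the new data as [1, 0, 0, 1, 0], so the predictions printed above can be scored by hand. This is a small sketch in plain Python, assuming `Yes` maps to 1 and `No` to 0 (the label encoding reported by `setup`). With only five rows, the numbers are far too noisy to judge the model.

```python
true_labels = [1, 0, 0, 1, 0]                # from the assignment text
predicted = ["No", "No", "No", "No", "Yes"]  # from the script output above
predicted_labels = [1 if p == "Yes" else 0 for p in predicted]

# Count the four cells of the confusion matrix.
pairs = list(zip(true_labels, predicted_labels))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.40 precision=0.00 recall=0.00
```

Both churners are missed in this sample, which may be a reason to look at the predicted probabilities and the decision threshold instead of only the hard labels.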


# Summary

The original AutoML work in JupyterLab produced errors; switching to Jupyter Notebook seemed to resolve the problem. Experiments with two datasets produced two different models. On the prepared numeric data, the best model was the Gradient Boosting Classifier. On the unprepared data, Logistic Regression performed best by accuracy (0.7959), although the Gradient Boosting Classifier had a slightly higher AUC (0.8396 versus 0.8377). The models were saved as the pickle files 'lr' and 'gbc', and the Python automation script can be adjusted to load either one.