# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
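
For the percentile challenge above, one possible approach is to sort the churn probabilities predicted on the training set and count how many fall at or below a new prediction. The sketch below is only an illustration: `percentile_of_score` is a hypothetical helper name, and `train_probs` is made-up data standing in for the real training-set predictions.

```python
from bisect import bisect_right


def percentile_of_score(training_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    ordered = sorted(training_probs)
    # bisect_right gives the number of values less than or equal to prob
    rank = bisect_right(ordered, prob)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.12, 0.20, 0.33, 0.41, 0.56, 0.64, 0.78, 0.91]

print(percentile_of_score(train_probs, 0.78))
```

With this made-up distribution, a probability of 0.78 lands at the 90th percentile, matching the example given in the challenge text.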

## Load Data

In [1]:
import pandas as pd

df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
# Drop the engineered ratio columns, which do not exist in new_churn_data.csv
df.drop(columns=['tenure_TotalCharges_ratio', 'tenure_MonthlyCharges_ratio'], inplace=True)
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,0
5575-GNVDE,34,1,1,1,56.95,1889.50,0
3668-QPYBK,2,1,0,1,53.85,108.15,1
7795-CFOCW,45,0,1,2,42.30,1840.75,0
9237-HQITU,2,1,0,0,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0
2234-XADUH,72,1,1,3,103.20,7362.90,0
4801-JZAZL,11,0,0,0,29.60,346.45,0
8361-LTMKD,4,1,0,1,74.40,306.60,1


To match new_churn_data.csv, which is used at the end of the assignment, I had to drop the two ratio columns from the prepped data. These columns do not exist in the new data, so leaving them in causes errors at prediction time.
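
A quick way to spot this kind of mismatch is to compare the feature columns of the two files before training. The sketch below works on plain lists of column names. `column_mismatch` is a hypothetical helper. The column lists are assumptions: the prepped columns are taken from the table above plus the two dropped ratio columns, and the new-data columns are assumed to be the six features without `Churn`.

```python
def column_mismatch(train_columns, new_columns, target="Churn"):
    """Return (extra, missing) feature columns relative to the new data."""
    train_features = [c for c in train_columns if c != target]
    # Columns used in training that the new data does not provide
    extra = [c for c in train_features if c not in new_columns]
    # Columns in the new data that the training data lacks
    missing = [c for c in new_columns if c not in train_features]
    return extra, missing


# Assumed column lists (the order of the ratio columns is a guess)
prepped_columns = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges",
    "tenure_TotalCharges_ratio", "tenure_MonthlyCharges_ratio", "Churn",
]
new_columns = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges",
]

extra, missing = column_mismatch(prepped_columns, new_columns)
print("drop before training:", extra)
print("absent from training data:", missing)
```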

## AutoML with pycaret

In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [3]:
automl = setup(df, target='Churn', numeric_features=['PhoneService','Contract','PaymentMethod'])

,Description,Value
0,session_id,7294
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 7)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


PhoneService, Contract, and PaymentMethod had to be declared as numeric features. Otherwise pycaret treats them as categorical and builds additional columns into the data, one for each category. With those extra columns present, the columns the model was trained on no longer line up with the columns of the data loaded later for prediction, and predicting fails.
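
To make the "additional columns" concrete, here is a minimal, library-free sketch of one-hot encoding applied to the PaymentMethod codes visible in the table above. It illustrates the general technique only. It is not pycaret's internal implementation, and the generated column names are assumptions.

```python
def one_hot(values, prefix):
    """Expand one coded column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


# PaymentMethod codes from the nine rows shown in the data preview
payment_method = [0, 1, 1, 2, 0, 1, 3, 0, 1]

encoded = one_hot(payment_method, "PaymentMethod")
print(list(encoded[0].keys()))  # one column became four
print(encoded[0])
```

A model trained on the four expanded columns expects those same four columns at prediction time, which is why raw data with a single PaymentMethod column no longer fits.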

In [4]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7957,0.8444,0.5104,0.6603,0.5749,0.4435,0.4504,0.073
lr,Logistic Regression,0.7949,0.8389,0.5381,0.6477,0.5872,0.4524,0.4562,0.227
ridge,Ridge Classifier,0.7921,0.0,0.4707,0.6659,0.5508,0.4208,0.4319,0.005
catboost,CatBoost Classifier,0.7907,0.8395,0.5232,0.641,0.5754,0.4385,0.4429,0.565
ada,Ada Boost Classifier,0.7897,0.8417,0.5014,0.6441,0.5631,0.4276,0.4338,0.034
lda,Linear Discriminant Analysis,0.789,0.8258,0.5164,0.6377,0.57,0.4324,0.437,0.005
lightgbm,Light Gradient Boosting Machine,0.7856,0.8331,0.5194,0.6272,0.5673,0.4266,0.4305,0.12
xgboost,Extreme Gradient Boosting,0.7803,0.8219,0.521,0.6119,0.5622,0.417,0.4197,0.106
knn,K Neighbors Classifier,0.7716,0.7509,0.4761,0.6016,0.53,0.3821,0.3876,0.011
rf,Random Forest Classifier,0.7675,0.8033,0.5007,0.5836,0.5381,0.3842,0.3867,0.101


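For the churn data, ranked by accuracy (the pycaret default), the Gradient Boosting Classifier (gbc) comes out on top at 0.7957, narrowly ahead of Logistic Regression at 0.7949. It also has the highest AUC in the table (0.8444).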

In [5]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=7294, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

At the time of processing the data, GBC was the best algorithm. I reran the comparison a few times to see whether that changes, and GBC kept coming out ahead of all the other available algorithms. If a different algorithm ranks first during the final rerun before submission, the code that follows will be adjusted accordingly.

In [6]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,1,0.5654


Our Score column shows 0.5654, which is above 0.5, so the Label column is set to 1 (churn). This matches the true Churn value of 1 for this customer.
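
The relationship between a churn probability and the predicted label can be sketched as a simple cutoff. This is an illustration under assumptions, not pycaret's code: `label_from_probability` is a hypothetical helper, the 0.5 cutoff is assumed to be the one in effect, and a probability of exactly 0.5 is assumed to map to 0.

```python
def label_from_probability(churn_probability, threshold=0.5):
    """Return 1 (churn) if the probability exceeds the threshold, else 0."""
    # Assumption: a value exactly at the threshold is labeled 0
    return 1 if churn_probability > threshold else 0


def describe(label):
    """Map the numeric label to the wording used in the printed predictions."""
    return "Churn" if label == 1 else "No churn"


score = 0.5654  # Score shown for customer 8361-LTMKD
label = label_from_probability(score)
print(label, describe(label))
```

Raising the threshold would label fewer customers as churners, which trades recall for precision.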

## Saving and loading the model

In [7]:
save_model(best_model, 'GBC')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['PhoneService',
                                                           'Contract',
                                                           'PaymentMethod'],
                                       target='Churn', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None

In [8]:
import pickle

with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [9]:
with open('GBC_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [10]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1])

In [11]:
loaded_gbc = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [12]:
predict_model(loaded_gbc, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,0.5654


## Python module to make predictions

In [13]:
from IPython.display import Code

Code('predict_churn.py')

In [14]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP    No churn
6348-TACGU    No churn
Name: Churn_prediction, dtype: object
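
The assignment lists the true values for the new data as [1, 0, 0, 1, 0], so the printed predictions can be checked by hand. The sketch below computes accuracy, precision, and recall in plain Python. It assumes the true values are in the same row order as the printed predictions, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# True values from the assignment text (row order assumed to match)
y_true = [1, 0, 0, 1, 0]
# Predictions as printed by predict_churn.py
printed = ["Churn", "No churn", "No churn", "No churn", "No churn"]
y_pred = [1 if label == "Churn" else 0 for label in printed]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Under that ordering assumption, four of the five rows are correct (accuracy 0.8). The one customer flagged as churn did churn (precision 1.0), but one of the two actual churners was missed (recall 0.5).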


# Summary

The top-ranked model is not fixed in this type of analysis. That is fine, but it makes the write-up harder when the text calls out a specific model.

After removing the columns that did not match the new data given for the assignment and treating the encoded categorical columns as numeric, the analysis shows that predictions can be made on new data that does not yet have a Churn value. We train our algorithms on customers whose outcome is known to learn which factors lead to churn. Then we can score new, unlabeled data to estimate which current customers might leave the company.

The Churn/No churn predictions change depending on the model chosen. It could be worth running several models, averaging the predicted churn for each customer, and focusing on the customers that the different models most consistently flag as churners. Reworking the prepped data, for example by revisiting the week 2 feature preparation, might also improve accuracy.
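
The idea of averaging over several models could look roughly like the sketch below. The probabilities are hypothetical numbers invented for illustration, not results from this notebook. The model keys simply reuse IDs from the comparison table, and `average_churn_probability` is a hypothetical helper.

```python
from statistics import mean


def average_churn_probability(per_model_probs):
    """Average each customer's churn probability across all models."""
    customers = next(iter(per_model_probs.values())).keys()
    return {
        customer: mean(probs[customer] for probs in per_model_probs.values())
        for customer in customers
    }


# Hypothetical churn probabilities per model, for illustration only
per_model_probs = {
    "gbc": {"9305-CKSKC": 0.80, "1452-KNGVK": 0.20, "6723-OKKJM": 0.10},
    "lr":  {"9305-CKSKC": 0.70, "1452-KNGVK": 0.40, "6723-OKKJM": 0.10},
    "ada": {"9305-CKSKC": 0.60, "1452-KNGVK": 0.30, "6723-OKKJM": 0.10},
}

averaged = average_churn_probability(per_model_probs)
# Customers most likely to churn according to the averaged score come first
ranked = sorted(averaged, key=averaged.get, reverse=True)
print(ranked)
```

Averaging smooths out disagreements between individual models, so a customer only ranks high when several models assign a high probability.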