# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
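
To make the metric choice in the list above more concrete, here is a minimal sketch in plain Python of how accuracy, precision, and recall can disagree on the same predictions. The helper name `classification_metrics` and the `y_pred` values are made up for illustration; only `y_true` comes from the assignment text.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct "no churn"
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 0, 0, 0]  # hypothetical predictions that miss one churner
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.8, 'precision': 1.0, 'recall': 0.5}
```

In this toy case accuracy looks fine at 0.8 while recall is only 0.5, which is one reason recall or AUC may be a better sorting metric when missing a churner is costly.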

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
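
For the percentile challenge in the list above, one possible approach is to sort the training-set churn probabilities once and look up where a new probability falls. This is only a sketch: `percentile_of` is a hypothetical helper, and the training probabilities are invented numbers, not output from the model.

```python
from bisect import bisect_right


def percentile_of(probability, training_probabilities):
    """Return the percentage of training probabilities <= `probability`."""
    ordered = sorted(training_probabilities)
    rank = bisect_right(ordered, probability)  # count of values <= probability
    return 100.0 * rank / len(ordered)


# Made-up training-set churn probabilities, for illustration only.
training_probs = [0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.70, 0.91]
print(percentile_of(0.78, training_probs))  # 90.0
```

With real data, `training_probabilities` would be the churn probabilities the model assigns to the training rows, computed once and stored alongside the saved model.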

I ended up dropping tc_tenure_ratio because the new churn data does not have that in it, and it causes problems later on.

In [1]:
import pandas as pd

df = pd.read_csv('data/even_better_new_churn_data.csv', index_col='customerID')
#df = pd.read_csv('data/test_data1.csv', index_col='customerID')
# removing this as the new test data does not have it
df = df.drop('tc_tenure_ratio', axis=1)
# I was hoping my later errors were a column-order issue, based on Stack Overflow posts I found, but sorting did not seem to help.
df = df.reindex(sorted(df.columns), axis=1)
df

customerID,Churn,Contract,MonthlyCharges,PaymentMethod,PhoneService,TotalCharges,tenure
7590-VHVEG,0,0,29.85,0,0,29.85,1
5575-GNVDE,0,1,56.95,1,1,1889.50,34
3668-QPYBK,1,0,53.85,1,1,108.15,2
7795-CFOCW,0,1,42.30,2,0,1840.75,45
9237-HQITU,1,0,70.70,0,1,151.65,2
...,...,...,...,...,...,...,...
6840-RESVB,0,1,84.80,1,1,1990.50,24
2234-XADUH,0,1,103.20,3,1,7362.90,72
4801-JZAZL,0,0,29.60,0,0,346.45,11
8361-LTMKD,1,0,74.40,1,1,306.60,4


First, I needed to install pycaret. On my Mac, I also needed to install libomp through Homebrew with `brew install libomp`.

In [2]:
pip install pycaret





You should consider upgrading via the '/usr/local/Cellar/jupyterlab/3.1.9/libexec/bin/python3.9 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [3]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model, get_config
# pycaret never detects all of the numeric features properly, even in the FTE, so they are listed explicitly.
# I also wonder if preprocessing has something to do with my errors.
automl = setup(df, target='Churn', numeric_features=['PhoneService','Contract','PaymentMethod'])
#automl = setup(df, target='Churn')

,Description,Value
0,session_id,6062
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(7032, 7)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


In [8]:
# This thing only works sometimes.
automl[14]

customerID,Contract,MonthlyCharges,PaymentMethod,PhoneService,TotalCharges,tenure
0479-HMSWA,2.0,105.449997,0.0,1.0,2715.300049,26.0
9367-WXLCH,0.0,84.500000,2.0,1.0,662.650024,8.0
1989-PRJHP,0.0,75.500000,0.0,1.0,1893.949951,27.0
3717-OEAUQ,0.0,70.699997,1.0,1.0,129.199997,2.0
2946-KIQSP,0.0,33.450001,1.0,0.0,1175.849976,35.0
...,...,...,...,...,...,...
2265-CYWIV,0.0,99.599998,0.0,1.0,347.649994,4.0
2453-SAFNS,1.0,72.099998,1.0,1.0,3886.050049,54.0
2710-WYVXG,2.0,71.099998,1.0,1.0,213.350006,3.0
9770-KXGQU,1.0,98.599998,1.0,1.0,5311.850098,53.0


In [9]:
#best_model = compare_models(sort="Recall")
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.795,0.8377,0.4996,0.6565,0.5657,0.435,0.4429,0.062
ada,Ada Boost Classifier,0.7936,0.8367,0.5132,0.646,0.5706,0.4373,0.4432,0.032
ridge,Ridge Classifier,0.7918,0.0,0.4488,0.6657,0.5347,0.4075,0.4213,0.005
lr,Logistic Regression,0.7917,0.8362,0.5178,0.6406,0.5712,0.4359,0.4411,0.5
lda,Linear Discriminant Analysis,0.7891,0.8219,0.5079,0.6351,0.5632,0.4266,0.432,0.006
lightgbm,Light Gradient Boosting Machine,0.783,0.8219,0.495,0.6202,0.549,0.4089,0.4143,0.021
rf,Random Forest Classifier,0.7662,0.7931,0.4707,0.5796,0.5181,0.3663,0.3705,0.098
et,Extra Trees Classifier,0.7564,0.7713,0.4738,0.5544,0.5101,0.3495,0.3519,0.088
knn,K Neighbors Classifier,0.7521,0.7316,0.423,0.549,0.4768,0.3183,0.3234,0.142
qda,Quadratic Discriminant Analysis,0.7479,0.8248,0.7445,0.5214,0.6129,0.4346,0.4501,0.005


In [10]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=6062, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [11]:
best_model.n_features_in_

6

This is how I figured out, while troubleshooting, that there were too many features and what they were. I wasn't sure at the time where get_config lives (it is part of pycaret.classification), so I unfortunately just did a wildcard import.

In [12]:
from pycaret.classification import *

print(get_config('X_train').columns)

Index(['Contract', 'MonthlyCharges', 'PaymentMethod', 'PhoneService',
       'TotalCharges', 'tenure'],
      dtype='object')


No longer getting a ton of features that threw off pickle after specifying column types. Details sent in email.
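
A quick way to spot this kind of feature mismatch before predicting is to compare the column names the model was trained on with the columns in the new data. The sketch below uses plain lists; `compare_columns` is a hypothetical helper and `new_columns` is an invented example with one leftover engineered feature.

```python
def compare_columns(expected, actual):
    """Return (missing, extra) column names relative to the expected features."""
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    return missing, extra


# Feature names reported by get_config('X_train') above.
trained_features = ['Contract', 'MonthlyCharges', 'PaymentMethod',
                    'PhoneService', 'TotalCharges', 'tenure']

# Hypothetical new-data columns with one leftover engineered feature.
new_columns = ['tenure', 'PhoneService', 'Contract', 'PaymentMethod',
               'MonthlyCharges', 'TotalCharges', 'tc_tenure_ratio']

missing, extra = compare_columns(trained_features, new_columns)
print('missing:', missing)  # missing: []
print('extra:', extra)      # extra: ['tc_tenure_ratio']
```

With a DataFrame, `list(df.columns)` can be passed as `actual`, and `df[trained_features]` selects the expected columns in the training order.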

In [13]:
df.iloc[-1].shape

(7,)

In [14]:
df.iloc[-2:-1].shape

(1, 7)

In [15]:
predict_model(best_model, df.iloc[-10:-1])

customerID,Churn,Contract,MonthlyCharges,PaymentMethod,PhoneService,TotalCharges,tenure,Label,Score
9767-FFLEM,0,0,69.5,3,1,2625.25,38,0,0.8023
0639-TSIQW,1,0,102.95,3,1,6886.25,67,0,0.6893
8456-QDAVC,0,0,78.7,2,1,1495.1,19,0,0.6758
7750-EYXWZ,0,1,60.65,0,0,743.3,12,0,0.7757
2569-WGERO,0,2,21.15,2,1,1419.4,72,0,0.9897
6840-RESVB,0,1,84.8,1,1,1990.5,24,0,0.8914
2234-XADUH,0,1,103.2,3,1,7362.9,72,0,0.918
4801-JZAZL,0,0,29.6,0,0,346.45,11,0,0.591
8361-LTMKD,1,0,74.4,1,1,306.6,4,1,0.6716


In [16]:
save_model(best_model, 'ChadModel')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['PhoneService',
                                                           'Contract',
                                                           'PaymentMethod'],
                                       target='Churn', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None

In [17]:
loaded_best_model = load_model('ChadModel')

Transformation Pipeline and Model Successfully Loaded


In [18]:
loaded_best_model

Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=True, features_todrop=[],
                                      id_columns=[],
                                      ml_usecase='classification',
                                      numerical_features=['PhoneService',
                                                          'Contract',
                                                          'PaymentMethod'],
                                      target='Churn', time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None...
                                            learning_rate=0.1, loss='deviance',
                                            max_depth=3, max_features=None,
                                            max_leaf_nodes=None,
              

Validating that the model works... (I originally could not get this part to work in the FTE or here because of feature mismatches in pickle.)

In [19]:
import pickle

with open('Chad_model_pickle.pkl', 'wb') as f:
    pickle.dump(best_model, f)

In [20]:
with open('Chad_model_pickle.pkl', 'rb') as f:
    loaded_model_pickle = pickle.load(f)

In [21]:
print(pickle.format_version)

4.0


In [22]:
test_saved_data = df.iloc[-2:-1].copy()
test_saved_data.drop('Churn', axis=1, inplace=True)
loaded_model_pickle.predict(test_saved_data)

array([1])

In [23]:
loaded_best_model.predict(test_saved_data)

array([1])

In [26]:
test_saved_data

customerID,Contract,MonthlyCharges,PaymentMethod,PhoneService,TotalCharges,tenure
8361-LTMKD,0,74.4,1,1,306.6,4


Doing it directly with pycaret on the loaded models:

In [27]:
predict_model(loaded_best_model, test_saved_data)

customerID,Contract,MonthlyCharges,PaymentMethod,PhoneService,TotalCharges,tenure,Label,Score
8361-LTMKD,0,74.4,1,1,306.6,4,1,0.6716


In [28]:
predict_model(loaded_model_pickle, test_saved_data)

customerID,Contract,MonthlyCharges,PaymentMethod,PhoneService,TotalCharges,tenure,Label,Score
8361-LTMKD,0,74.4,1,1,306.6,4,1,0.6716


Hey, they match up!

Trying it out with the new churn information we have.

In [29]:
df2 = pd.read_csv('data/new_churn_data.csv', index_col='customerID')
# Not doing this here, did it up above.
#df2['tc_tenure_ratio'] = df2['tenure'] / df2['TotalCharges']
predict_model(loaded_best_model, df2, probability_threshold=.5)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Label,Score
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455,1,0.5216
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375,0,0.8551
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714,0,0.8942
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806,0,0.6881
6348-TACGU,10,0,0,1,51.15,3440.97,344.097,0,0.8672


Hmmm, I don't get exactly 1,0,0,1,0 as my results, but the one row it gets wrong (7832-POPKP) has a score of only .68, so the model was not very confident about it. I reran this with the original data after cleaning it, and with just the original uncleaned churn data, and the results were worse. The choice of model definitely seems to play a part in this.
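
Since the assignment asks for the probability of churn, it helps to note that in the output above `Score` appears to be the probability of the predicted `Label`, not of churn: every score is at least 0.5, even for confident "no churn" rows. Assuming that reading is right, a small conversion gives the churn probability for each row. `churn_probability` is a hypothetical helper, and the rows are copied from the gbc output above.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into P(churn), assuming Score is the
    probability of the predicted label."""
    return score if label == 1 else 1.0 - score


# (customerID, Label, Score) copied from the gbc output above.
gbc_rows = [
    ('9305-CKSKC', 1, 0.5216),
    ('1452-KNGVK', 0, 0.8551),
    ('6723-OKKJM', 0, 0.8942),
    ('7832-POPKP', 0, 0.6881),
    ('6348-TACGU', 0, 0.8672),
]
true_values = [1, 0, 0, 1, 0]  # from the assignment text

for (customer, label, score), truth in zip(gbc_rows, true_values):
    p_churn = churn_probability(label, score)
    print(f'{customer}  P(churn)={p_churn:.4f}  predicted={label}  actual={truth}')
```

Under this assumption the missed churner, 7832-POPKP, gets a churn probability of about 0.31, which is the highest among the four rows predicted as "no churn".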

In [30]:
lr = create_model('lr')

,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8053,0.8456,0.5,0.6875,0.5789,0.4564,0.4662
1,0.7992,0.8241,0.5682,0.641,0.6024,0.4687,0.4703
2,0.7846,0.8084,0.4885,0.6214,0.547,0.4083,0.4134
3,0.7825,0.8384,0.5303,0.6087,0.5668,0.4225,0.4243
4,0.7683,0.8147,0.4848,0.5818,0.5289,0.377,0.3797
5,0.7825,0.8394,0.5606,0.6016,0.5804,0.4339,0.4344
6,0.7988,0.847,0.5455,0.6486,0.5926,0.4603,0.4634
7,0.7947,0.8379,0.5076,0.6505,0.5702,0.4381,0.4439
8,0.7866,0.842,0.5152,0.6239,0.5643,0.4247,0.4281
9,0.815,0.8645,0.4773,0.7412,0.5806,0.469,0.4877


In [32]:
predict_model(lr)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7896,0.8346,0.5145,0.6152,0.5604,0.4235,0.4264


,Contract,MonthlyCharges,PaymentMethod,PhoneService,TotalCharges,tenure,Churn,Label,Score
0,2.0,108.099998,2.0,1.0,5067.450195,48.0,0,0,0.9004
1,1.0,103.099998,0.0,1.0,4889.299805,47.0,0,0,0.7170
2,0.0,65.000000,1.0,1.0,663.049988,9.0,1,0,0.5387
3,2.0,83.500000,3.0,1.0,5435.000000,63.0,0,0,0.9786
4,2.0,79.199997,0.0,1.0,4016.300049,52.0,0,0,0.9541
...,...,...,...,...,...,...,...,...,...
2105,0.0,84.400002,2.0,1.0,4116.149902,50.0,1,0,0.7767
2106,0.0,49.250000,0.0,1.0,91.099998,2.0,1,0,0.5334
2107,2.0,29.600000,1.0,0.0,299.049988,10.0,0,0,0.9159
2108,2.0,23.299999,2.0,1.0,797.099976,35.0,0,0,0.9922


In [34]:
predict_model(lr, df2)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Label,Score
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455,0,0.5328
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375,1,0.5841
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714,0,0.8891
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806,0,0.8376
6348-TACGU,10,0,0,1,51.15,3440.97,344.097,1,0.7047


Yikes, that one is much worse!  The script sticks with the best model.

In [40]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP    No churn
6348-TACGU    No churn
Name: Churn_prediction, dtype: object
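
The contents of predict_churn.py are not shown in this notebook. As a rough idea of how a module producing output like the above could be laid out, here is a minimal sketch. It is not the actual script: the function names are hypothetical, and it assumes the `Label` column name and the `load_model`/`predict_model` calls used earlier in this notebook.

```python
"""Hypothetical sketch of a predict_churn.py-style module (not the actual script)."""

CHURN_LABELS = {1: 'Churn', 0: 'No churn'}


def label_to_text(label):
    """Map a 0/1 model label to the text shown in the script output."""
    return CHURN_LABELS[int(label)]


def make_predictions(df, model_name='ChadModel'):
    """Load the saved pipeline and return a Series of churn predictions."""
    # Imported lazily so label_to_text can be used without pycaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    # 'Label' is the column name seen in the predict_model output above.
    result = predictions['Label'].map(label_to_text)
    result.name = 'Churn_prediction'
    return result
```

A script built this way would read data/new_churn_data.csv with `index_col='customerID'`, pass the DataFrame to `make_predictions`, and print the returned Series.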


# Summary

Write a short summary of the process and results here.

I had a lot of issues with the pickle save, both here and in the FTE: the pickle-loaded models had feature mismatches. I found articles (e.g. https://stackoverflow.com/questions/67875188/feature-mismatch-prediction-through-scikit-learn-pipeline) that seemed to imply it could be because of how pickle saves as a matrix and recommended using indexes, but that didn't work for me, and we're starting to get deep into areas I am not familiar with. I'm not sure if it has something to do with my Mac or package versions, but everything is on the latest version (except scikit-learn, which I believe is on 0.23.x rather than 0.24.x because of bugs in a previous lesson). Forcing the columns that were detected as categorical to numeric in both the FTE and this assignment worked, however.

Sometimes, when a different best_model is chosen, it does not produce a score (ridge, for example). In my first few runs, Logistic Regression (lr) always seemed to have the best score and also gave predictions close to the actual 1,0,0,1,0 outcome for the new churn data. Later runs did not really show that, and had more accurate results with gbc. Maybe this has to do with the test data, or maybe it has to do with how finicky customers are. When I asked at my own job about our churn data and processing it with similar ML techniques, I learned that our company had similar prediction issues, which were described as "mixed results". I'm guessing that really working on and tuning this data might yield better accuracy.

I also just realized that getting 4 out of 5 right (missing 1) is 80%, which is about the predicted accuracy of the model.
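
Five rows is a very small sample, though. As a rough check, and assuming each row is predicted independently with a 0.8 chance of being correct (a simplification, not something the notebook verifies), the binomial distribution shows how likely each outcome is:

```python
from math import comb


def prob_exactly(k, n, p):
    """Binomial probability of exactly k correct predictions out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


n_rows = 5
accuracy = 0.8  # roughly the cross-validated accuracy from compare_models

for correct in range(n_rows, 2, -1):
    prob = prob_exactly(correct, n_rows, accuracy)
    print(f'{correct} of {n_rows} correct: {prob:.3f}')
```

Under those assumptions, 4 of 5 correct is the single most likely outcome (about 0.41), but 5 of 5 and 3 of 5 are both quite plausible too, so the match with 80% is consistent with the model's accuracy without being strong evidence for it.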