# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
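
The percentile challenge above can be sketched without any ML library. This is a minimal illustration, assuming the training-set churn probabilities have already been collected into a plain list; the function name `percentile_rank` and the example numbers are made up for illustration and do not come from the churn data.

```python
from bisect import bisect_right


def percentile_rank(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right counts how many sorted values are <= new_prob
    return 100.0 * bisect_right(ordered, new_prob) / len(ordered)


# Made-up training probabilities, for illustration only
example_train_probs = [0.05, 0.10, 0.20, 0.35, 0.50, 0.62, 0.70, 0.75, 0.81, 0.93]
print(percentile_rank(example_train_probs, 0.78))
```

In a real script, `train_probs` would hold the churn probabilities predicted for the training rows.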

In [8]:
#!conda create -n msds python=3.10.14 -y
#!conda activate msds
#!pip install --upgrade pycaret

In [106]:
import pandas as pd
df = pd.read_csv('/Users/adamkhay/Desktop/intro data analysis/Churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1,0,0,3,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,2,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,2,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,1,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,3,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,2,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,0,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,3,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,2,74.40,306.60,1,76.650000


In [107]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [108]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,5361
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 8)"
4,Transformed data shape,"(7032, 8)"
5,Transformed train set shape,"(4922, 8)"
6,Transformed test set shape,"(2110, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


In [109]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.795,0.8383,0.4901,0.6553,0.5595,0.4297,0.4381,0.112
lda,Linear Discriminant Analysis,0.7926,0.8208,0.5023,0.6417,0.5624,0.4293,0.4355,0.01
lr,Logistic Regression,0.7924,0.8321,0.5122,0.6374,0.5669,0.4327,0.4378,0.025
ridge,Ridge Classifier,0.7909,0.8208,0.4526,0.6563,0.5347,0.4059,0.4181,0.007
ada,Ada Boost Classifier,0.7907,0.832,0.5123,0.633,0.5653,0.4297,0.4345,0.034
lightgbm,Light Gradient Boosting Machine,0.7796,0.8198,0.487,0.607,0.5392,0.397,0.4018,0.165
rf,Random Forest Classifier,0.7773,0.7984,0.4748,0.6081,0.5316,0.3886,0.3947,0.089
et,Extra Trees Classifier,0.7706,0.7772,0.4855,0.5859,0.5297,0.38,0.3837,0.058
knn,K Neighbors Classifier,0.766,0.7463,0.4281,0.5835,0.4927,0.3454,0.3531,0.021
qda,Quadratic Discriminant Analysis,0.745,0.8139,0.7171,0.5164,0.5994,0.4201,0.4331,0.006


In [110]:
best_model

In [111]:
df.iloc[-2:-1].shape

(1, 8)

In [112]:
predict_model(best_model, df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,2,74.400002,306.600006,76.650002,1,1,0.596


In [113]:
save_model(best_model, 'GBC')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('c...
                                             criterion='f

In [114]:
import pickle

with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [115]:
with open('GBC_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [116]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)


In [117]:
loaded_model.predict(new_data)

array([1], dtype=int8)

In [118]:
loaded_GBC = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [119]:
predict_model(loaded_GBC, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
8361-LTMKD,4,1,0,2,74.400002,306.600006,76.650002,1,0.596


In [120]:
from IPython.display import Code

Code('predict_Churn.py')

In [121]:
%run predict_Churn.py

Transformation Pipeline and Model Successfully Loaded


Index(['tenure', 'PhoneService', 'Contract', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'charge_per_tenure', 'prediction_label',
       'prediction_score'],
      dtype='object')
predictions:
customerID
9305-CKSKC    No Churn
1452-KNGVK    No Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU    No Churn
Name: Churn_prediction, dtype: object


# Summary

First, I loaded the same prepared churn data from week 2, where everything had already been converted to numbers.
After running `compare_models`, the best model by accuracy was the Gradient Boosting Classifier (GBC), closely followed by LDA, logistic regression, and a few others. The accuracy scores are so similar between models that the top model may be different each time this is run. We can now use the model to make predictions. If the data is not being preprocessed, we can simply use the `best_model` object, which is an sklearn model, to make predictions. However, this only works if we set `preprocess=False` in the setup function; otherwise the order of features may be different. A more robust way (in case we are using preprocessing with autoML) is to use pycaret's `predict_model` function.
To test this, I selected a single row with `df.iloc[-2:-1]`. This slice returns the second-to-last row as a 2D DataFrame with shape `(1, 8)`, whereas `df.iloc[-1]` returns a 1D Series (which throws an error when passed to the model). I ran `df.iloc[-1].shape` and `df.iloc[-2:-1].shape` to see how they differ.
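
The difference between the two kinds of indexing can be checked on a tiny stand-in frame. This is a small sketch: the `demo` frame below copies a few values from the first three rows shown earlier and is not the full dataset.

```python
import pandas as pd

# Tiny stand-in frame; values copied from three rows shown in the notebook
demo = pd.DataFrame(
    {"tenure": [1, 34, 2], "MonthlyCharges": [29.85, 56.95, 53.85]},
    index=pd.Index(["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"], name="customerID"),
)

last_as_series = demo.iloc[-1]     # single integer index: 1D Series
second_to_last = demo.iloc[-2:-1]  # slice: 2D DataFrame holding the second-to-last row
last_as_frame = demo.iloc[[-1]]    # list index: 2D DataFrame holding the last row

print(last_as_series.shape)
print(second_to_last.shape)
print(second_to_last.index[0], last_as_frame.index[0])
```

Using `demo.iloc[[-1]]` is one way to keep the last row itself as a one-row DataFrame.
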
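After running `predict_model`, we can see that it adds two new columns, `prediction_label` and `prediction_score`. `prediction_label` holds the predicted class, and `prediction_score` holds the probability of that predicted class (0.596 for the row above, which was labeled 1). For a binary target like this one, the label is 1 when the probability of churn is at least 0.5.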
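
Since the assignment asks for the probability of churn for each row, the two output columns can be combined into a single churn probability. The sketch below assumes that `prediction_score` is the probability of the predicted label, as in the output above; the helper name `churn_probability` and the second demo row are hypothetical.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert prediction_label / prediction_score into P(churn = 1).

    Assumes prediction_score is the probability of the *predicted* label,
    so rows labeled 0 need 1 - score.
    """
    label = predictions["prediction_label"].astype(int)
    score = predictions["prediction_score"].astype(float)
    prob = score.where(label == 1, 1.0 - score)
    return prob.rename("churn_probability")


# First row mirrors the notebook output; the second row is made up
demo_predictions = pd.DataFrame(
    {"prediction_label": [1, 0], "prediction_score": [0.596, 0.8]},
    index=pd.Index(["8361-LTMKD", "hypothetical-id"], name="customerID"),
)
print(churn_probability(demo_predictions))
```

Another option may be to call `predict_model` with `raw_score=True`, which returns a score for every class instead of only the predicted one.
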
Next, I saved the trained model with pycaret's `save_model` function so I can use it in a Python file later.
I also tried pickle, a built-in module in the Python standard library that saves and loads Python objects. It serializes an object into a stream of bytes that can be written to a file, so most Python objects can be stored as-is in a pickle file. Then I can load the object from the file and be right back where I left off.
I used the built-in `open` function to open a file named `GBC_model.pk`, opening it for writing with `'w'` and in binary format with `'b'`, and stored the file object in the variable `f`. The `with` statement automatically closes the file when the block exits; otherwise, I would need to call `close` on the file object `f`. Then I used `pickle.dump` to save the model to the file.
Loading is almost the same, except we use `'rb'` for "read binary" and pickle's `load` function. I then ran `loaded_model.predict` on the test row, which returned 1. Under the hood, pycaret does something similar in `save_model`, which also saves the transformation pipeline along with the model. Once we have the saved pycaret model, we can test loading it with `load_model` and making predictions to make sure it works.
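
The pickle round trip described above can be tried on any small object. This is a minimal sketch using a dictionary as a stand-in for the model; the file name `demo_model.pk` and the dictionary contents are made up.

```python
import os
import pickle
import tempfile

# Stand-in object, not a real model
model_stand_in = {"name": "GBC", "threshold": 0.5}

# Write to a temporary directory so the example leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "demo_model.pk")

with open(path, "wb") as f:      # 'w' = write, 'b' = binary
    pickle.dump(model_stand_in, f)

with open(path, "rb") as f:      # 'r' = read, 'b' = binary
    restored = pickle.load(f)

print(restored == model_stand_in)
```

The restored object is an equal copy, not the same object in memory. Pickle files should only be loaded from sources you trust, since loading can execute arbitrary code.
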
I then moved on to making a Python module for predictions, so the model can be used in a Python file that takes in new data and returns a prediction. I first wrote the Python file, `predict_Churn.py`, and then tested it with the Jupyter "magic" command `%run`.
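
The contents of `predict_Churn.py` are not shown in this notebook, so the following is only a sketch of how such a module could be laid out, not the actual file. It assumes the model was saved as `GBC` with `save_model`; the function names `label_predictions` and `predict_churn` are hypothetical, while the `Churn_prediction` name and the "Churn" / "No Churn" labels follow the script output above.

```python
import pandas as pd


def label_predictions(predictions: pd.DataFrame) -> pd.Series:
    """Map the 0/1 prediction_label column to readable labels."""
    labels = predictions["prediction_label"].map({1: "Churn", 0: "No Churn"})
    return labels.rename("Churn_prediction")


def predict_churn(new_data: pd.DataFrame, model_name: str = "GBC") -> pd.Series:
    """Load the saved pycaret model and return a label for each row."""
    # Imported inside the function so the helper above works without pycaret
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=new_data)
    return label_predictions(predictions)


# Example usage (needs pycaret, the saved GBC model, and the CSV file):
# new_data = pd.read_csv("new_churn_data.csv", index_col="customerID")
# print("predictions:")
# print(predict_churn(new_data))
```

Keeping the label mapping in its own function makes it easy to check without loading the model.
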
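Finally, I ran `predict_Churn.py`. With `%run`, we can run it over and over after making changes to the file while we are writing it. For the five rows in new_churn_data.csv, the script predicted "No Churn" for every customer, while the true values are [1, 0, 0, 1, 0], so it got the three non-churners right and missed both churners.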
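
The script's five predictions can be scored by hand against the true values listed in the assignment. The sketch below uses plain Python; the `predicted` list encodes the five "No Churn" outputs as 0.

```python
true_values = [1, 0, 0, 1, 0]  # from the assignment text
predicted = [0, 0, 0, 0, 0]    # "No Churn" for all five rows in the script output

pairs = list(zip(true_values, predicted))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners correct

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn) if (tp + fn) else None
precision = tp / (tp + fp) if (tp + fp) else None  # undefined with no positive predictions

print(accuracy, recall, precision)
```

Accuracy is 0.6 here even though recall is 0.0, which is one reason a metric such as recall or AUC may be a better choice than accuracy for churn data.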