# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
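
For the percentile challenge above, one possible approach is to sort the probabilities predicted on the training data and count how many fall at or below the new prediction. The sketch below uses only the standard library; `percentile_of` and the `train_probs` values are hypothetical and chosen purely for illustration, not taken from the churn data.

```python
from bisect import bisect_right


def percentile_of(score, training_scores):
    """Return the percentage of training scores that are <= score."""
    ordered = sorted(training_scores)
    # bisect_right gives the number of values at or below `score`.
    rank = bisect_right(ordered, score)
    return 100.0 * rank / len(ordered)


# Hypothetical churn probabilities predicted on the training set.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.90]

pct = percentile_of(0.78, train_probs)
print(f"A churn probability of 0.78 is at the {pct:.0f}th percentile")
```

With real data, `training_scores` would be the churn probabilities the saved model produces for the training rows.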

In [1]:
!pip install pycaret



In [2]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [3]:
df = pd.read_csv('prepped_churn_data.csv')
df = df.drop("customerID", axis = 'columns')
df

Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,TCharges_tenure_ratio,MCharges_tenure_ratio
0,0,1,1,0,0,29.85,29.85,1,29.850000,29.850000
1,1,34,0,1,1,56.95,1889.50,1,55.573529,1.675000
2,2,2,0,0,1,53.85,108.15,0,54.075000,26.925000
3,3,45,1,1,2,42.30,1840.75,1,40.905556,0.940000
4,4,2,0,0,0,70.70,151.65,0,75.825000,35.350000
...,...,...,...,...,...,...,...,...,...,...
7027,7038,24,0,1,1,84.80,1990.50,1,82.937500,3.533333
7028,7039,72,0,1,3,103.20,7362.90,1,102.262500,1.433333
7029,7040,11,1,0,0,29.60,346.45,1,31.495455,2.690909
7030,7041,4,0,0,1,74.40,306.60,0,76.650000,18.600000


In [4]:
automl = setup(data = df, target = 'Churn')

Unnamed: 0,Description,Value
0,session_id,586
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(7032, 10)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [5]:
automl[6]

'box-cox'

In [6]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7972,0.8363,0.912,0.8297,0.8688,0.4265,0.4366,0.013
ridge,Ridge Classifier,0.7966,0.0,0.9302,0.8185,0.8707,0.4028,0.4221,0.01
lr,Logistic Regression,0.7958,0.837,0.9076,0.8309,0.8674,0.4266,0.4354,1.974
gbc,Gradient Boosting Classifier,0.7913,0.836,0.8976,0.8322,0.8636,0.4221,0.4283,0.341
ada,Ada Boost Classifier,0.7899,0.8315,0.8907,0.8351,0.8619,0.4246,0.4293,0.126
lightgbm,Light Gradient Boosting Machine,0.782,0.8231,0.8874,0.8288,0.857,0.4004,0.4056,0.074
rf,Random Forest Classifier,0.7743,0.8081,0.8783,0.8261,0.8513,0.3841,0.3877,0.226
et,Extra Trees Classifier,0.7639,0.7938,0.8601,0.8264,0.8428,0.3692,0.3708,0.21
svm,SVM - Linear Kernel,0.7548,0.0,0.9045,0.7967,0.8417,0.266,0.3016,0.026
dummy,Dummy Classifier,0.7363,0.5,1.0,0.7363,0.8481,0.0,0.0,0.008
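
To make the Accuracy, Recall, Prec. and F1 columns concrete, here is a minimal sketch that computes them from confusion counts. It uses the true values quoted in the assignment (`[1, 0, 0, 1, 0]`); the predictions are hypothetical and only for illustration, and `classification_metrics` is a helper written for this example, not a pycaret function.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 0, 0, 1, 0]   # true values quoted in the assignment
y_pred = [1, 0, 1, 1, 0]   # hypothetical predictions, for illustration only

metrics = classification_metrics(y_true, y_pred)
print(metrics)

# A model that always predicts the positive class gets recall 1.0 and a
# precision equal to the share of positives, the pattern in the Dummy row.
always_positive = classification_metrics(y_true, [1] * len(y_true))
print(always_positive)
```

Which metric to rank by depends on the cost of each kind of error: recall matters more when missing a positive case is expensive, precision when false alarms are.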


In [7]:
best_model
# compare_models sorts by accuracy by default, so the top model here is
# Linear Discriminant Analysis (LDA)

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [9]:
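# Take a single row (the second to last) as stand-in "new" data and drop the target.
# This row was part of the data passed to setup, so it is only a smoke test.
new_df = df.iloc[-2:-1].drop(columns='Churn')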
new_df

Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TCharges_tenure_ratio,MCharges_tenure_ratio
7030,7041,4,0,0,1,74.4,306.6,76.65,18.6


In [10]:
predict_model(best_model, new_df)

Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TCharges_tenure_ratio,MCharges_tenure_ratio,Label,Score
7030,7041,4,0,0,1,74.4,306.6,76.65,18.6,0,0.5319
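
The assignment asks for the probability of churn, while the output above gives a `Label` and a `Score`. The sketch below converts one into the other under two assumptions that should be checked against the installed pycaret version and the week 2 encoding: that `Score` is the probability of the predicted `Label` (not always of class 1), and that `Churn = 1` means the customer churned. `churn_probability` is a hypothetical helper, and the rows are plain dicts such as those returned by `DataFrame.to_dict('records')`.

```python
def churn_probability(label, score, churn_label=1):
    """Convert a (Label, Score) pair into the probability of churn.

    Assumes `score` is the probability of the predicted `label`.
    """
    # int() also handles labels that come back as strings such as '0'.
    return score if int(label) == churn_label else 1.0 - score


# The single prediction shown above: Label 0 with Score 0.5319.
rows = [{"Label": 0, "Score": 0.5319}]

probs = [round(churn_probability(r["Label"], r["Score"]), 4) for r in rows]
print(probs)
```

Under these assumptions the row above has a churn probability of about 0.4681, that is, 1 - 0.5319.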


In [11]:
# Save the model together with its preprocessing pipeline (pycaret writes LDA.pkl)
save_model(best_model, 'LDA')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs

In [12]:
test_model = load_model('LDA')
predict_model(test_model, data=new_df)

Transformation Pipeline and Model Successfully Loaded


Unnamed: 0.1,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TCharges_tenure_ratio,MCharges_tenure_ratio,Label,Score
7030,7041,4,0,0,1,74.4,306.6,76.65,18.6,0,0.5319


In [13]:
import pickle

# Open LDA_model.pk for writing ('w') in binary mode ('b') and bind the file object to f.
# The with statement closes the file automatically when the block ends.
# Note: this pickles only the estimator; save_model above also stores the preprocessing pipeline.
with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [14]:
# Reload the pickled model (not the data) from disk
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
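
A quick way to confirm that a pickle round trip preserves an object is to dump it, load it back, and compare the two. The sketch below does this with a small stand-in dictionary (`model_stub` is hypothetical, not the fitted LDA model) and a temporary directory, so it leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model: any picklable Python object works the same way.
model_stub = {"name": "lda", "coef": [0.1, -0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_stub.pk"

    # Write the object in binary mode, then read it back.
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object in memory.
print(restored == model_stub, restored is model_stub)
```

For the real model, an analogous check would be to compare the predictions of `loaded_model` and `best_model` on the same rows.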

In [19]:
from IPython.display import Code
Code('predict_churn.py')

In [20]:
%run predict_churn.py

# Summary

Write a short summary of the process and results here.

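We installed pycaret and imported the classification functions we needed (`setup`, `compare_models`, `predict_model`, `save_model` and `load_model`). After loading the prepared churn data and dropping the `customerID` column, we called `setup` with `Churn` as the target, which tells pycaret that this column is the label to predict rather than a feature to train on. `compare_models` then ranked the candidate models by accuracy, the default metric. The best model for this data set was linear discriminant analysis (LDA), with an accuracy of 79.72% and an AUC of 0.8363, compared with 73.63% accuracy for the dummy classifier.

To test prediction, we built a one-row dataframe from the existing data with the `Churn` column dropped; the model predicted label 0 with a score of 0.5319. We saved the model with `save_model`, reloaded it with `load_model`, and got the same prediction from the reloaded copy. We also used pickle, which saves and loads Python objects in binary form, to write the model to a file named LDA_model.pk and read it back.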