# Data science automation

This week is all about automation techniques for data science with Python. We can automate many parts of the data science pipeline with Python: collecting data, processing it, cleaning it, and more. Here, we will show how to:

- use the PyCaret AutoML Python package to find an optimized ML model for our churn dataset
- create a Python script to ingest new data and make predictions on it

Often, the next step in fully operationalizing an ML pipeline like this is to use a cloud service to scale and serve the model. Options include AWS (for example, AWS Lambda), GCP, or Azure ML deployments, with tools such as Docker and Kubernetes.

## Load data

Import pandas and load the prepared churn data, using customerID as the index.

In [6]:
import pandas as pd

df = pd.read_csv('prepped_churn_data2.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_monthly_ratio,tenure_monthly_ratio,tenure_Total_ratio
7590-VHVEG,1,0,0,2,29.85,29.85,0,1.000000,0.033501,0.033501
5575-GNVDE,34,1,1,3,56.95,1889.50,0,33.178227,0.597015,0.017994
3668-QPYBK,2,1,0,3,53.85,108.15,1,2.008357,0.037140,0.018493
7795-CFOCW,45,0,1,0,42.30,1840.75,0,43.516548,1.063830,0.024447
9237-HQITU,2,1,0,2,70.70,151.65,1,2.144979,0.028289,0.013188
...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,3,84.80,1990.50,0,23.472877,0.283019,0.012057
2234-XADUH,72,1,1,1,103.20,7362.90,0,71.345930,0.697674,0.009779
4801-JZAZL,11,0,0,2,29.60,346.45,0,11.704392,0.371622,0.031751
8361-LTMKD,4,1,0,3,74.40,306.60,1,4.120968,0.053763,0.013046
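The three ratio columns are not defined in the text, but the values shown above are consistent with simple quotients of the raw columns. The check below is a minimal sketch under that assumption, using the numbers from the 5575-GNVDE row; the `row` dictionary and the variable names are only illustrative.

```python
# Raw values copied from the 5575-GNVDE row of the table above.
row = {"tenure": 34, "MonthlyCharges": 56.95, "TotalCharges": 1889.50}

# Assumed definitions of the engineered features (inferred from the data,
# not stated in the notebook).
total_monthly_ratio = row["TotalCharges"] / row["MonthlyCharges"]
tenure_monthly_ratio = row["tenure"] / row["MonthlyCharges"]
tenure_total_ratio = row["tenure"] / row["TotalCharges"]

# Print to six decimals for comparison with the table.
print(round(total_monthly_ratio, 6))
print(round(tenure_monthly_ratio, 6))
print(round(tenure_total_ratio, 6))
```

If this assumption holds, any new data passed to the model needs the same three columns, computed the same way.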


# AutoML with PyCaret

In [None]:
%pip install pycaret

In [None]:
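# Import only the PyCaret functions used in this notebook
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model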

In [5]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,401
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 10)"
4,Transformed data shape,"(7032, 10)"
5,Transformed train set shape,"(4922, 10)"
6,Transformed test set shape,"(2110, 10)"
7,Numeric features,9
8,Preprocess,True
9,Imputation type,simple


Here I use PyCaret's setup function to prepare the DataFrame for modeling. setup initializes the environment for the machine learning experiment; the key argument is target, which must name the column to predict (Churn here).
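The shapes in the setup summary above are consistent with a 70/30 train/test split, which I believe matches PyCaret's default `train_size` of 0.7. A quick arithmetic check, as a sketch under that assumption:

```python
import math

n_rows = 7032       # rows in "Original data shape" from the setup output
train_size = 0.7    # assumed default split fraction

# The training set takes the floor of 70% of the rows; the rest is the test set.
n_train = math.floor(n_rows * train_size)
n_test = n_rows - n_train

print(n_train, n_test)
```

The two numbers can be compared with the "Transformed train set shape" and "Transformed test set shape" rows of the summary.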

In [6]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7964,0.8329,0.5191,0.6475,0.5751,0.4435,0.4489,0.435
ridge,Ridge Classifier,0.7924,0.8224,0.4626,0.6579,0.5417,0.4129,0.4245,0.015
gbc,Gradient Boosting Classifier,0.7911,0.8358,0.4886,0.6416,0.5537,0.4208,0.4281,1.539
lda,Linear Discriminant Analysis,0.7909,0.8223,0.5107,0.634,0.5645,0.4293,0.4344,0.043
ada,Ada Boost Classifier,0.7901,0.8339,0.51,0.6319,0.5629,0.4272,0.4324,0.484
lightgbm,Light Gradient Boosting Machine,0.7838,0.8254,0.5008,0.6152,0.5516,0.4113,0.4153,0.582
rf,Random Forest Classifier,0.7755,0.8045,0.4679,0.6004,0.5252,0.3813,0.3868,1.228
et,Extra Trees Classifier,0.7662,0.7806,0.4893,0.5712,0.5264,0.3725,0.3749,0.651
knn,K Neighbors Classifier,0.7641,0.741,0.4342,0.5747,0.4934,0.3439,0.3503,0.206
dummy,Dummy Classifier,0.7343,0.5,0.0,0.0,0.0,0.0,0.0,0.037



compare_models() in PyCaret automatically trains several classification models on the dataset and evaluates them with cross-validation on a set of default metrics (accuracy, AUC, and so on). The table is sorted by accuracy, and the top model is returned and stored in the variable best_model.
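Which model counts as "best" depends on the metric used for ranking. The sketch below copies the Accuracy and AUC columns from the table above into plain dictionaries (the dictionary names are illustrative) and shows that the winner changes with the metric.

```python
# Accuracy and AUC per model ID, copied from the compare_models() table.
accuracy = {
    "lr": 0.7964, "ridge": 0.7924, "gbc": 0.7911, "lda": 0.7909,
    "ada": 0.7901, "lightgbm": 0.7838, "rf": 0.7755, "et": 0.7662,
    "knn": 0.7641, "dummy": 0.7343,
}
auc = {
    "lr": 0.8329, "ridge": 0.8224, "gbc": 0.8358, "lda": 0.8223,
    "ada": 0.8339, "lightgbm": 0.8254, "rf": 0.8045, "et": 0.7806,
    "knn": 0.741, "dummy": 0.5,
}

# Pick the model ID with the highest value of each metric.
best_by_accuracy = max(accuracy, key=accuracy.get)
best_by_auc = max(auc, key=auc.get)

print(best_by_accuracy)  # lr
print(best_by_auc)       # gbc
```

In PyCaret the ranking metric can be chosen with the `sort` argument, for example `compare_models(sort='AUC')`.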

In [7]:
best_model

This displays the best-performing model according to the default ranking metric, which is accuracy for classification.

In [9]:
df.iloc[1:3].shape

(2, 10)

This returns the shape of the subset of df at integer positions 1 and 2 (the second and third rows, since the end of an iloc slice is exclusive). The result is a tuple of (rows, columns).



In [10]:
predict_model(best_model, df.iloc[1:3])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.5,1.0,0.0,0.0,0.0,0.0,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,total_monthly_ratio,tenure_monthly_ratio,tenure_Total_ratio,Churn,prediction_label,prediction_score
5575-GNVDE,34,1,1,3,56.950001,1889.5,33.178226,0.597015,0.017994,0,0,0.9268
3668-QPYBK,2,1,0,3,53.849998,108.150002,2.008357,0.03714,0.018493,1,0,0.5557


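Here I use PyCaret's predict_model function to make predictions on the rows of the DataFrame at positions 1 and 2. Because this subset still contains the Churn column, PyCaret also reports evaluation metrics for the two rows. The output adds prediction_label and prediction_score columns; the second row (Churn = 1) is predicted as 0, which is why the accuracy is 0.5.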
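The metrics row above can be reproduced by hand from the two predictions. The sketch below does this in plain Python. It assumes that prediction_score is the probability of the predicted label, so the probability of churn is one minus the score when the predicted label is 0; the variable names are illustrative.

```python
# True and predicted labels for the two rows in the output above.
y_true = [0, 1]
y_pred = [0, 0]
# prediction_score, assumed to be the probability of the predicted label.
score_of_pred = [0.9268, 0.5557]

# Convert each score to the probability of the positive class (Churn = 1).
p_churn = [s if label == 1 else 1 - s for s, label in zip(score_of_pred, y_pred)]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
pred_pos = sum(p == 1 for p in y_pred)
recall = true_pos / actual_pos if actual_pos else 0.0
precision = true_pos / pred_pos if pred_pos else 0.0

# AUC as the fraction of (positive, negative) pairs ranked correctly,
# counting ties as half.
pos = [p for p, t in zip(p_churn, y_true) if t == 1]
neg = [p for p, t in zip(p_churn, y_true) if t == 0]
pairs = [(a, b) for a in pos for b in neg]
auc = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs) / len(pairs)

print(accuracy, recall, precision, auc)
```

The AUC is 1.0 even though one row is misclassified, because the churned customer still receives a higher churn probability than the other one. With only two rows, none of these numbers says much about the model.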

## Saving and loading our model

In [11]:
save_model(best_model, 'lr')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'total_monthly_ratio',
                                              'tenure_monthly_ratio',
                                              'tenure_Total_ratio'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=F...
                                                               fill_value=None,
                             

This saves the trained model best_model, together with the preprocessing pipeline, to a file named lr.pkl (PyCaret appends the .pkl extension). Persisting the model lets me load it later without retraining.

In [13]:
import pickle

with open('lr_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

This saves the model object with Python's pickle module, as an alternative to save_model.

In [14]:
with open('lr_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

This loads the pickled model back from disk.

In [15]:
new_data = df.iloc[1:3].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([0, 0], dtype=int8)

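This prepares new data by dropping the Churn target column and then makes predictions with the unpickled model. Both rows are predicted as class 0.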

In [16]:
loaded_lr = load_model('lr')

Transformation Pipeline and Model Successfully Loaded


In [17]:
predict_model(loaded_lr, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,total_monthly_ratio,tenure_monthly_ratio,tenure_Total_ratio,prediction_label,prediction_score
5575-GNVDE,34,1,1,3,56.950001,1889.5,33.178226,0.597015,0.017994,0,0.9268
3668-QPYBK,2,1,0,3,53.849998,108.150002,2.008357,0.03714,0.018493,0,0.5557


# Making a Python module to make predictions

In [1]:
from IPython.display import Code
Code('predict_churn.py')
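The contents of predict_churn.py are not shown in this export. As a rough sketch of what such a module could look like, the example below reads a CSV, loads the model saved earlier as 'lr', and returns predictions. The function names, the `--model` option, and the label mapping are my assumptions, not the actual file. pandas and PyCaret are imported inside the functions so that the module itself imports quickly.

```python
"""Hypothetical sketch of predict_churn.py (not the original file)."""
import argparse


def parse_args(argv=None):
    """Parse the command line: a CSV path and an optional model name."""
    parser = argparse.ArgumentParser(description="Predict churn for new customers.")
    parser.add_argument("csv_path", help="CSV with the same feature columns as the training data")
    parser.add_argument("--model", default="lr", help="name given to save_model, without .pkl")
    return parser.parse_args(argv)


def load_data(csv_path):
    """Read new customer data, indexed by customerID as in the notebook."""
    import pandas as pd  # imported lazily to keep module import light

    return pd.read_csv(csv_path, index_col="customerID")


def make_predictions(df, model_name="lr"):
    """Load the saved PyCaret pipeline and return a readable label per customer."""
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    # Assumed encoding: 1 means the customer churned, 0 means they did not.
    return predictions["prediction_label"].map({0: "No churn", 1: "Churn"})


def main(argv=None):
    """Run the full flow: parse arguments, load data, print predictions."""
    args = parse_args(argv)
    df = load_data(args.csv_path)
    print(make_predictions(df, args.model))
```

Calling `main()` under an `if __name__ == "__main__":` guard would let the script run as `python predict_churn.py new_churn_data.csv`, where `new_churn_data.csv` is a placeholder file name.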

# Summary

This code uses PyCaret to build a machine learning model for predicting customer churn. It starts by loading a preprocessed dataset and initializing the PyCaret environment with Churn as the target variable. After evaluating multiple models, it selects the best one and makes predictions for specific records.

The trained model is saved using both PyCaret's save_model function and Python's pickle module for later use. The notebook then reloads the model both ways and prepares new data for prediction by removing the target variable. Finally, it displays the contents of an external script (predict_churn.py) intended for making predictions as a standalone module. This workflow automates training and using a churn prediction model.