**Author:** Cainã Max Couto da Silva  
**LinkedIn:** [@cmcouto-silva](https://www.linkedin.com/in/cmcouto-silva/)

---

This notebook continues the series in which we learned how to build efficient pipelines using [scikit-learn](https://drive.google.com/file/d/13q0UmHCZshnyJv0T3fIwvi8qDBwjeT_x/view?usp=sharing), [feature-engine](https://colab.research.google.com/drive/1AKKe5iNerluf4K-gHN-u7itrPHL-z2OL?usp=sharing), and [imbalanced-learn](https://colab.research.google.com/drive/1AKKe5iNerluf4K-gHN-u7itrPHL-z2OL?usp=sharing). Please check them out!

Since you're already familiar with scikit-learn pipelines, you can now understand how pipelines work in many other applications.

This notebook introduces PyCaret, highlighting how it uses scikit-learn pipelines under the hood.

# **PyCaret**

[PyCaret](https://pycaret.org/) is an open-source, low-code machine learning library in Python that automates machine learning workflows.

The workflow starts by setting up an experiment, which lets PyCaret infer your data's structure and preprocessing requirements. From there, single-line commands let you train multiple models, tune the chosen one, and interpret, finalize, and deploy it.

In [1]:
%pip install pycaret[analysis]



In [2]:
import pandas as pd
from pycaret.classification import *

# 1. Loading dataset

In [3]:
data_url = 'https://raw.githubusercontent.com/cmcouto-silva/datasets/main/datasets/telco_churn.csv'
df = pd.read_csv(data_url, index_col='CustomerID')
display(df)

CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,Senior Citizen,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,No,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.307420,Female,No,...,Month-to-month,Yes,Electronic check,70.70,151.65,Yes,1,67,2701,Moved
9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,No,...,Month-to-month,Yes,Electronic check,99.65,820.50,Yes,1,86,5372,Moved
7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,No,...,Month-to-month,Yes,Electronic check,104.80,3046.05,Yes,1,84,5003,Moved
0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,No,...,Month-to-month,Yes,Bank transfer (automatic),103.70,5036.30,Yes,1,89,5340,Competitor had better devices
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2569-WGERO,1,United States,California,Landers,92285,"34.341737, -116.539416",34.341737,-116.539416,Female,No,...,Two year,Yes,Bank transfer (automatic),21.15,1419.40,No,0,45,5306,
6840-RESVB,1,United States,California,Adelanto,92301,"34.667815, -117.536183",34.667815,-117.536183,Male,No,...,One year,Yes,Mailed check,84.80,1990.50,No,0,59,2140,
2234-XADUH,1,United States,California,Amboy,92304,"34.559882, -115.637164",34.559882,-115.637164,Female,No,...,One year,Yes,Credit card (automatic),103.20,7362.90,No,0,71,5560,
4801-JZAZL,1,United States,California,Angelus Oaks,92305,"34.1678, -116.86433",34.167800,-116.864330,Female,No,...,Month-to-month,Yes,Electronic check,29.60,346.45,No,0,59,2793,


In [4]:
NUMERIC_FEATURES = [
    'Tenure Months',
    'Monthly Charges',
    'Total Charges',
    'CLTV'
]

CATEGORICAL_FEATURES = [
    'Senior Citizen',
    'Partner',
    'Dependents',
    'Multiple Lines',
    'Internet Service',
    'Online Security',
    'Online Backup',
    'Device Protection',
    'Tech Support',
    'Streaming TV',
    'Streaming Movies',
    'Contract',
    'Paperless Billing',
    'Payment Method'
]

FEATURES = NUMERIC_FEATURES + CATEGORICAL_FEATURES
TARGET = 'Churn Value'

# 2. Initialize Setup

The `setup` function is the first step of any PyCaret machine learning experiment. It prepares your dataset for modeling, handling stages such as data preprocessing, splitting into train and test sets, and feature engineering.

Only two parameters are mandatory (the dataset and the target variable), but it offers numerous optional parameters that allow a high degree of customization.
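
As a rough mental model of what `setup` prepares, the sketch below reproduces two of those stages with plain scikit-learn: a 70/30 train/test split (the ratio that PyCaret's output shapes further down suggest) and a preprocessing step that imputes numeric columns and one-hot encodes a categorical one. The toy DataFrame, its column names, the stratified split, and the choice of transformers are assumptions made for illustration; PyCaret's actual pipeline contains more steps and may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up dataset (hypothetical values, not the Telco data)
toy = pd.DataFrame({
    "tenure": [1, 24, 36, 5, 60, 12, 48, 3, 18, 72],
    "charges": [29.9, 56.1, 89.5, 70.2, 20.0, 99.9, 45.3, 75.0, 64.8, 25.5],
    "contract": ["Month-to-month", "One year", "Two year", "Month-to-month",
                 "Two year", "Month-to-month", "One year", "Month-to-month",
                 "One year", "Two year"],
    "churn": [1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
})
num_cols, cat_cols, target = ["tenure", "charges"], ["contract"], "churn"

# 1) Hold out a test set (70/30, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    toy[num_cols + cat_cols], toy[target],
    train_size=0.7, stratify=toy[target], random_state=2023,
)

# 2) Preprocess: impute numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Fit on the training data only, then apply the same transformation to the test data
Xt_train = preprocess.fit_transform(X_train)
Xt_test = preprocess.transform(X_test)

print("raw train shape:        ", X_train.shape)
print("transformed train shape:", Xt_train.shape)
print("transformed test shape: ", Xt_test.shape)
```

The key design point, which carries over to PyCaret, is that the transformers are fitted on the training split only and then reused unchanged on the test split.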

In [5]:
# Set up experiment
s = setup(df[FEATURES+[TARGET]], target=TARGET, session_id=2023)

,Description,Value
0,Session id,2023
1,Target,Churn Value
2,Target type,Binary
3,Original data shape,"(7032, 19)"
4,Transformed data shape,"(7032, 40)"
5,Transformed train set shape,"(4922, 40)"
6,Transformed test set shape,"(2110, 40)"
7,Ordinal features,4
8,Numeric features,4
9,Categorical features,14


# 3. Compare Baseline

In [6]:
# Test multiple models
best_model = compare_models(fold=5)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.8033,0.8492,0.5749,0.6466,0.6084,0.4777,0.4794,1.012
lr,Logistic Regression,0.8027,0.848,0.5582,0.6506,0.6005,0.4706,0.4733,2.932
ridge,Ridge Classifier,0.8019,0.0,0.5283,0.6587,0.5863,0.4581,0.463,0.332
lda,Linear Discriminant Analysis,0.7995,0.8434,0.5711,0.6367,0.6021,0.4686,0.4699,0.39
gbc,Gradient Boosting Classifier,0.7993,0.854,0.5329,0.649,0.585,0.4544,0.4583,1.494
rf,Random Forest Classifier,0.793,0.8322,0.5031,0.6403,0.5633,0.4302,0.4357,0.874
lightgbm,Light Gradient Boosting Machine,0.7903,0.8432,0.5275,0.625,0.5721,0.4345,0.4373,1.946
xgboost,Extreme Gradient Boosting,0.7832,0.8315,0.526,0.6062,0.5632,0.42,0.4219,0.476
et,Extra Trees Classifier,0.7747,0.812,0.4862,0.5924,0.534,0.3874,0.3907,1.108
qda,Quadratic Discriminant Analysis,0.7456,0.8459,0.802,0.5146,0.6265,0.4476,0.4732,0.36
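
Conceptually, the leaderboard above comes from cross-validating each candidate model and ranking the averaged scores. A minimal sketch of that idea in plain scikit-learn, on synthetic data, might look as follows; the three candidate models, the synthetic dataset, and the `leaderboard` variable are assumptions for illustration, and this is not PyCaret's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for the churn dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=2023)

# A small, hand-picked set of candidates (PyCaret tries many more)
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=50, random_state=2023),
    "gbc": GradientBoostingClassifier(random_state=2023),
}

# Mean 5-fold cross-validated accuracy per model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank from best to worst, as in the table above
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in leaderboard:
    print(f"{name:>4}: {accuracy:.4f}")

best_name = leaderboard[0][0]
```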



In [7]:
# Show best model
best_model

# 4. Create model

Let's use logistic regression here as well, for consistency with the previous notebooks.

In [8]:
# Create logistic regression model
lr = create_model('lr')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8032,0.8335,0.6031,0.6371,0.6196,0.487,0.4874
1,0.8215,0.876,0.6183,0.6807,0.648,0.5288,0.5299
2,0.7581,0.8296,0.4692,0.5495,0.5062,0.3474,0.3493
3,0.8049,0.8435,0.5846,0.6441,0.6129,0.4829,0.4839
4,0.813,0.8428,0.5649,0.6789,0.6167,0.4944,0.498
5,0.8049,0.8455,0.5802,0.6496,0.6129,0.483,0.4844
6,0.815,0.8375,0.542,0.6961,0.6094,0.4907,0.4973
7,0.8049,0.8655,0.5878,0.6471,0.616,0.4856,0.4866
8,0.8089,0.8653,0.5649,0.6667,0.6116,0.486,0.489
9,0.811,0.8542,0.5573,0.6759,0.6109,0.4876,0.4915
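
For readers less familiar with the columns in this score grid, the sketch below computes most of them from a binary confusion matrix using their standard definitions (AUC is left out because it needs predicted scores rather than counts). The function name `binary_metrics` and the example counts are hypothetical and are not taken from the churn data.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that were right
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: agreement beyond what chance alone would produce
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy, "Recall": recall, "Prec.": precision,
        "F1": f1, "Kappa": kappa, "MCC": mcc,
    }


# Made-up counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives
m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name:>8}: {value:.4f}")
```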



# 5. Tune Hyperparameters

In [9]:
# Tuning the hyperparameters automatically (10 iterations)
tuned_lr = tune_model(lr, n_iter=10, optimize='Recall')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7485,0.8351,0.7863,0.5176,0.6242,0.447,0.4691
1,0.783,0.8756,0.8473,0.5606,0.6748,0.5218,0.5469
2,0.7419,0.8252,0.7769,0.5075,0.614,0.4326,0.4548
3,0.7337,0.8424,0.7692,0.4975,0.6042,0.4172,0.4397
4,0.7642,0.842,0.7939,0.5389,0.642,0.4756,0.4955
5,0.7398,0.8488,0.8168,0.5071,0.6257,0.4426,0.4722
6,0.7256,0.8389,0.771,0.4903,0.5994,0.4061,0.4302
7,0.7622,0.8638,0.8321,0.5343,0.6507,0.4831,0.5104
8,0.7683,0.8659,0.8321,0.5423,0.6566,0.4933,0.519
9,0.7764,0.8568,0.8473,0.5522,0.6687,0.511,0.5377



Fitting 10 folds for each of 10 candidates, totalling 100 fits
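
The log line above ("10 folds for each of 10 candidates") is the message scikit-learn's search classes print, which suggests a search over 10 sampled hyperparameter combinations evaluated with 10-fold cross-validation. A minimal sketch of a comparable search in plain scikit-learn is shown below. The synthetic data and the search space (`C` and `class_weight`) are assumptions for illustration; the grid PyCaret actually uses for logistic regression may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced binary data (roughly 25% positives)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=2023
)

# Hypothetical search space: 6 x 2 = 12 possible combinations
param_distributions = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,          # sample 10 candidates
    scoring="recall",   # optimize recall, as in the cell above
    cv=10,              # 10 folds -> 100 fits in total
    random_state=2023,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean recall:", round(search.best_score_, 4))
```

Optimizing for recall usually costs precision and accuracy, which is the trade-off visible when comparing the score grids before and after tuning.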


# 6. Ensemble / Blend / Stack Models

PyCaret offers three ways to combine models. `ensemble_model` wraps a single trained estimator in a bagging or boosting ensemble; `blend_models` combines the predictions of several different models by voting (majority vote or averaged probabilities); and `stack_models` trains a meta-model that takes the base models' predictions as its inputs. Each aims to improve on the predictive performance of the individual models.

We won't cover them here, but they are worth knowing about.
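
To make the three ideas concrete, here is a small sketch using the corresponding scikit-learn estimators (`BaggingClassifier`, `VotingClassifier`, and `StackingClassifier`) on synthetic data. It illustrates the techniques themselves rather than PyCaret's functions; the base models and the dataset are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=2023)

# Three different base models, shared by the voting and stacking examples
base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=2023)),
    ("nb", GaussianNB()),
]

ensembles = {
    # Bagging: many copies of ONE model, each fitted on a bootstrap sample
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=3), n_estimators=10, random_state=2023
    ),
    # Blending by soft voting: average the predicted probabilities
    "voting": VotingClassifier(estimators=base, voting="soft"),
    # Stacking: a meta-model learns from the base models' predictions
    "stacking": StackingClassifier(
        estimators=base, final_estimator=LogisticRegression(max_iter=1000), cv=5
    ),
}

results = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in ensembles.items()
}
for name, accuracy in results.items():
    print(f"{name:>9}: {accuracy:.4f}")
```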

In [10]:
# ensemble_model | blend_models | stack_models

# 7. Evaluate model

In [11]:
evaluate_model(tuned_lr)


# 8. Interpret model

In [12]:
interpret_model(tuned_lr, plot='msa')

# 9. Finalize model

After creating, tuning, and understanding our model, we can finalize it to make it ready for production:

In [13]:
# Finalize the model
final_model = finalize_model(tuned_lr)
display(final_model)

As you can see, under the hood, the final model is a scikit-learn-like pipeline with transformation and modeling steps.
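
The sketch below mimics that idea with a plain scikit-learn `Pipeline`: evaluate on a hold-out set first, then refit the same pipeline on all available data before using it for new predictions, which is what finalizing a model amounts to. The scaler step, the step names, and the synthetic data are assumptions for illustration; the steps in PyCaret's own pipeline are different.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2023
)

# Transformation step(s) followed by the model, all in one object
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# During experimentation: fit on the training split, score on the hold-out
pipe.fit(X_train, y_train)
holdout_accuracy = pipe.score(X_test, y_test)
print("hold-out accuracy:", round(holdout_accuracy, 4))

# "Finalizing": refit an unfitted copy on ALL the data (train + hold-out)
final_pipe = clone(pipe).fit(X, y)
print("steps:", list(final_pipe.named_steps))

# The finalized pipeline takes raw features and returns predictions
new_predictions = final_pipe.predict(X[:5])
print("predictions:", new_predictions)
```

Because preprocessing and model live in one object, new data only needs to be passed to `predict`; no separate transformation code has to be kept in sync.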

In [14]:
# Let's list all attributes from our experiment setup
[attr for attr in dir(s) if not attr.startswith('_')]

['USI',
 'X',
 'X_test',
 'X_test_transformed',
 'X_train',
 'X_train_transformed',
 'X_transformed',
 'add_metric',
 'all_allowed_engines',
 'automl',
 'blend_models',
 'calibrate_model',
 'check_drift',
 'check_fairness',
 'compare_models',
 'convert_model',
 'create_api',
 'create_app',
 'create_docker',
 'create_model',
 'dashboard',
 'data',
 'data_split_shuffle',
 'data_split_stratify',
 'dataset',
 'dataset_transformed',
 'deploy_model',
 'ensemble_model',
 'evaluate_model',
 'exp_id',
 'exp_model_engines',
 'exp_name_log',
 'finalize_model',
 'fold_generator',
 'fold_groups_param',
 'fold_shuffle_param',
 'get_allowed_engines',
 'get_config',
 'get_engine',
 'get_leaderboard',
 'get_logs',
 'get_metrics',
 'gpu_n_jobs_param',
 'gpu_param',
 'html_param',
 'idx',
 'index',
 'interpret_model',
 'is_multiclass',
 'load_experiment',
 'load_model',
 'log_plots_param',
 'logger',
 'logging_param',
 'memory',
 'models',
 'n_jobs_param',
 'optimize_threshold',
 'pipeline',
 'plot_model

We can access the pipeline, the methods, the raw and transformed data, and more.

In [15]:
# Accessing the train data
s.X_train.head()

CustomerID,Tenure Months,Monthly Charges,Total Charges,CLTV,Senior Citizen,Partner,Dependents,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method
5533-NHFRF,7,44.400002,265.799988,3410,Yes,No,No,No phone service,DSL,No,Yes,No,Yes,No,Yes,Month-to-month,Yes,Electronic check
2155-AMQRX,28,54.900002,1505.150024,2885,No,No,No,Yes,DSL,No,No,No,Yes,No,No,Month-to-month,Yes,Credit card (automatic)
5673-FSSMF,1,60.150002,60.150002,4611,No,No,No,Yes,DSL,No,No,No,No,No,Yes,Month-to-month,Yes,Electronic check
7856-GANIL,45,98.699997,4525.799805,2535,Yes,Yes,No,Yes,Fiber optic,No,Yes,Yes,Yes,Yes,No,One year,Yes,Bank transfer (automatic)
3506-OVLKD,35,26.200001,954.900024,4419,No,No,No,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,No,Bank transfer (automatic)


In [16]:
# Accessing the transformed train data
s.X_train_transformed.head()

CustomerID,Tenure Months,Monthly Charges,Total Charges,CLTV,Senior Citizen,Partner,Dependents,Multiple Lines_No phone service,Multiple Lines_Yes,Multiple Lines_No,...,Streaming Movies_No,Streaming Movies_No internet service,Contract_Month-to-month,Contract_One year,Contract_Two year,Paperless Billing,Payment Method_Electronic check,Payment Method_Credit card (automatic),Payment Method_Bank transfer (automatic),Payment Method_Mailed check
5533-NHFRF,7.0,44.400002,265.799988,3410.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
2155-AMQRX,28.0,54.900002,1505.150024,2885.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
5673-FSSMF,1.0,60.150002,60.150002,4611.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
7856-GANIL,45.0,98.699997,4525.799805,2535.0,1.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
3506-OVLKD,35.0,26.200001,954.900024,4419.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
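
Comparing the raw and transformed tables shows the encoding pattern: two-level columns such as `Partner` become a single 0/1 column, while columns with more levels such as `Contract` are expanded into one column per level. That also accounts for the transformed shape reported by `setup`: 4 numeric + 4 binary + 31 one-hot columns + the target = 40. The sketch below reproduces the pattern with pandas on a made-up three-row frame; it is an illustration of the encoding, not PyCaret's own transformer code.

```python
import pandas as pd

# Made-up rows using two of the dataset's column names
raw = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Split categorical columns by number of distinct levels
binary_cols = [col for col in raw.columns if raw[col].nunique() == 2]
multi_cols = [col for col in raw.columns if col not in binary_cols]

encoded = raw.copy()

# Two-level columns: a single 0/1 column (here "Yes" maps to 1.0)
for col in binary_cols:
    encoded[col] = (encoded[col] == "Yes").astype(float)

# Multi-level columns: one indicator column per level
encoded = pd.get_dummies(encoded, columns=multi_cols, dtype=float)

print(encoded)
```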


In [17]:
# Accessing the preprocessing pipeline =)
s.pipeline