In [5]:
# If you do not have the catboost package installed, please uncomment and run the following line
# !pip install catboost

In [51]:
import catboost as cb
from catboost import CatBoostClassifier, Pool
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# CatBoost: A Complete Guide


## Introduction

CatBoost is a high-performance open-source library for gradient boosting on decision trees developed by Yandex. 
It is particularly well-suited for classification, regression, and ranking tasks, and offers robust performance on categorical features without requiring one-hot encoding. 
In this lesson, we will cover:
- What is CatBoost?
- Key Features
- Installation
- Basic Implementation
- Important Hyperparameters
- Comparison with Other Gradient Boosting Algorithms
- Tips and Tricks



## What is CatBoost?

CatBoost, short for "Categorical Boosting," is a gradient boosting algorithm that handles categorical features natively, without extensive preprocessing such as one-hot encoding or label encoding. 
It uses ordered boosting to reduce target leakage, and it provides class-weighting options for imbalanced datasets. Its main advantages include:
- **Handling Categorical Data:** Built-in support for categorical features, avoiding the curse of dimensionality associated with one-hot encoding.
- **Efficient Training:** Fast training even on large datasets with high dimensions.
- **Robust Performance:** Good out-of-the-box performance with minimal tuning.

CatBoost is often compared with other boosting algorithms such as XGBoost and LightGBM, but its main edge comes from its categorical handling and lower susceptibility to overfitting.
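
To make the one-hot encoding concern concrete, here is a minimal sketch on a made-up two-column frame (the column names and values are illustrative, not taken from the dataset used later). It only shows how the column count grows with the number of distinct category values.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns with three distinct values each.
toy = pd.DataFrame({
    "city": ["New York", "Chicago", "Houston", "Chicago"],
    "occupation": ["Sales", "Tech-support", "Sales", "Exec-managerial"],
})

# One-hot encoding creates one new column per distinct value in each column.
one_hot = pd.get_dummies(toy)

print(f"columns before: {toy.shape[1]}, after one-hot: {one_hot.shape[1]}")
```

With real high-cardinality columns (for example, hundreds of distinct countries or product IDs) the same growth applies, which is the situation native categorical handling is meant to avoid.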



## Key Features of CatBoost

1. **Automatic Handling of Categorical Features:** CatBoost can work directly with categorical variables without the need for manual preprocessing.
2. **Efficient Training:** The training process is fast due to efficient CPU and GPU implementations.
3. **Ordered Boosting:** It uses ordered boosting to avoid target leakage during training.
4. **Robust to Overfitting:** By using techniques such as feature combination and advanced regularization, CatBoost is less prone to overfitting than many other boosting algorithms.
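
The "only look at earlier rows" idea behind avoiding target leakage is easiest to see in ordered target statistics, which CatBoost uses to turn categorical values into numbers. The following is a simplified sketch, not CatBoost's actual implementation: it uses a single fixed row order (CatBoost works with random permutations), and the `prior` and `prior_weight` names are illustrative assumptions.

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row's category using only the targets of rows that came before it."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of earlier targets for this category; the row's own target is not used.
        encoded.append((s + prior * prior_weight) / (n + prior_weight))
        # Only now does the current row's target become visible to later rows.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_stats(cats, ys)
print([round(v, 4) for v in enc])
```

Because a row's own label never enters its encoding, the encoded feature cannot simply memorize the target, which is the leakage a plain per-category target mean would introduce.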



## Important Hyperparameters

Let's go through some of the most important hyperparameters of CatBoost:

1. **iterations:** The maximum number of trees that can be built. A larger number of iterations can lead to better model performance but may increase training time and risk of overfitting.
2. **depth:** The depth of the trees. A deeper tree can learn more complex patterns but may also lead to overfitting.
3. **l2_leaf_reg:** Coefficient of the L2 regularization term. It helps control overfitting by penalizing large leaf values.
4. **border_count:** Number of splits for numerical features. Increasing `border_count` may improve model performance but can also increase training time.
5. **eval_metric:** Metric used for evaluating model performance. Options include "Accuracy," "AUC," "Logloss," etc.
6. **cat_features:** List of the columns to be treated as categorical, given by index or, for a DataFrame, by column name.
7. **random_seed:** Seed for random number generation to ensure reproducibility.
8. **verbose:** Verbosity of the training process (`0` = silent, `1` = print every iteration, a larger integer `N` = print every `N` iterations).
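
To build intuition for `l2_leaf_reg`, the sketch below uses the generic Newton-style leaf formula found in second-order gradient boosting, where the regularization coefficient is added to the denominator. This is an assumption for illustration: CatBoost's exact leaf computation depends on its leaf estimation settings and may differ in detail. The gradient and Hessian numbers are made up.

```python
def leaf_value(gradients, hessians, l2_leaf_reg):
    """Generic regularized Newton step for one leaf: -sum(g) / (sum(h) + lambda)."""
    return -sum(gradients) / (sum(hessians) + l2_leaf_reg)


# Hypothetical per-sample gradients and Hessians for the rows that fall into one leaf.
grads = [-0.5, -0.4, -0.6]
hess = [0.25, 0.24, 0.24]

unregularized = leaf_value(grads, hess, l2_leaf_reg=0.0)
regularized = leaf_value(grads, hess, l2_leaf_reg=3.0)
print(f"lambda=0: {unregularized:.3f}, lambda=3: {regularized:.3f}")
```

A larger coefficient shrinks the leaf value toward zero, so each tree contributes a more cautious correction.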

### Optimizations sidenote

With `astype('category')`, pandas stores each distinct city name once and represents every row by a small integer code instead of a full string, which saves a lot of memory.

In [52]:
data_size = 1_000_000
data = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'San Francisco', 'Chicago', 'Houston'] * (data_size // 5)
})

memory_usage_before = data.memory_usage(deep=True).sum() / 1_000_000

data['city'] = data['city'].astype('category')

memory_usage_after = data.memory_usage(deep=True).sum() / 1_000_000

(memory_usage_before, memory_usage_after)

(66.200132, 1.000635)
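
The saving comes from how a categorical column is laid out internally. As a small illustration on a made-up four-row series, the `.cat` accessor exposes the two parts: the distinct labels and the per-row integer codes.

```python
import pandas as pd

cities = pd.Series(["New York", "Chicago", "New York", "Houston"]).astype("category")

# The distinct labels are stored once (sorted by default).
print(list(cities.cat.categories))

# Each row holds only a small integer that points into the label list.
print(list(cities.cat.codes))
print(cities.cat.codes.dtype)
```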

## Dataset

For demonstration purposes, we will use the Adult (Census Income) dataset from the UCI Machine Learning Repository. This is a classic binary classification dataset where the goal is to predict whether a person's income exceeds $50K per year based on census attributes.


In [72]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", 
    "marital-status", "occupation", "relationship", "race", "sex", 
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"
]

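# The regex separator strips the space after each comma, so missing values arrive as "?" rather than " ?".
# As written, na_values=" ?" therefore matches nothing and "?" is kept as an ordinary category.
data = pd.read_csv(url, names=columns, na_values=" ?", sep=r",\s*", engine="python")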

In [54]:
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [73]:
data['income'].value_counts()

income
<=50K    24720
>50K      7841
Name: count, dtype: int64

In [55]:
data.shape

(32561, 15)

In [56]:
data.dropna(inplace=True)

In [57]:
data['income'] = data['income'].apply(lambda x: 1 if x == ">50K" else 0)

In [58]:
X = data.drop("income", axis=1)
y = data["income"]

In [59]:
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

for col in categorical_features:
    X[col] = X[col].astype('category')

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [61]:
cat_features_indices = [X_train.columns.get_loc(col) for col in categorical_features]

In [62]:
%%time

model = CatBoostClassifier(
    iterations=500,             # Number of boosting iterations (trees)
    learning_rate=0.1,          # Step size applied to each tree's contribution
    depth=10,                   # Depth of the trees
    l2_leaf_reg=3,              # L2 regularization
    eval_metric='Accuracy',     # Evaluation metric
    cat_features=cat_features_indices,  # Indices of categorical features
    verbose=50,                 # Print progress every 50 iterations
    random_seed=42              # Fixed seed for reproducibility
)

model.fit(X_train, y_train)

0:	learn: 0.8418689	total: 23.8ms	remaining: 11.9s
50:	learn: 0.8812193	total: 1.18s	remaining: 10.4s
100:	learn: 0.8931588	total: 2.3s	remaining: 9.07s
150:	learn: 0.9073249	total: 3.58s	remaining: 8.27s
200:	learn: 0.9159244	total: 4.89s	remaining: 7.27s
250:	learn: 0.9237945	total: 6.21s	remaining: 6.16s
300:	learn: 0.9322405	total: 7.56s	remaining: 5s
350:	learn: 0.9390356	total: 8.9s	remaining: 3.78s
400:	learn: 0.9446407	total: 10.2s	remaining: 2.53s
450:	learn: 0.9495931	total: 11.5s	remaining: 1.25s
499:	learn: 0.9532018	total: 12.9s	remaining: 0us
CPU times: user 1min, sys: 9.47 s, total: 1min 9s
Wall time: 13 s


<catboost.core.CatBoostClassifier at 0x174d5e390>

In [69]:
X_test.iloc[0]

age                          27
workclass               Private
fnlwgt                   160178
education          Some-college
education-num                10
marital-status         Divorced
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                      Female
capital-gain                  0
capital-loss                  0
hours-per-week               38
native-country    United-States
Name: 14160, dtype: object

In [63]:
y_pred = model.predict(X_test)

In [64]:
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))

Accuracy: 0.87
              precision    recall  f1-score   support

           0       0.90      0.94      0.92      4942
           1       0.77      0.65      0.71      1571

    accuracy                           0.87      6513
   macro avg       0.83      0.80      0.81      6513
weighted avg       0.87      0.87      0.87      6513
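
To put the 0.87 accuracy in context, it helps to compare it with a trivial baseline that always predicts the majority class. The short calculation below uses only the support counts printed in the report above.

```python
# Support counts copied from the classification report.
support_neg, support_pos = 4942, 1571
total = support_neg + support_pos

# Accuracy of a model that always predicts class 0 (the majority class).
baseline_accuracy = support_neg / total
print(f"test rows: {total}, majority-class baseline: {baseline_accuracy:.2f}")
```

The model improves on this baseline, but the gap is smaller than the headline accuracy suggests, and the recall of 0.65 for class 1 shows where most of the remaining errors are.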



In [65]:
importances = model.get_feature_importance(prettified=True)
print(importances)

        Feature Id  Importances
0              age    14.922751
1       occupation    13.645859
2           fnlwgt    10.743213
3        education     8.548148
4   hours-per-week     8.467470
5     relationship     8.171926
6   marital-status     7.753239
7     capital-gain     6.999732
8        workclass     6.457446
9    education-num     3.559516
10            race     3.156632
11    capital-loss     3.018592
12  native-country     2.342311
13             sex     2.213165



## Tips and Tricks

1. **Hyperparameter Tuning:** Use grid search or random search to find the optimal set of hyperparameters for your specific dataset.
2. **Categorical Features:** Always provide the `cat_features` parameter to leverage the strength of CatBoost's handling of categorical data.
3. **GPU Training:** Use `task_type='GPU'` if you have a compatible GPU to speed up training on large datasets.
4. **Handling Class Imbalance:** Use `class_weights` to handle imbalanced datasets.
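
As a sketch of the class-imbalance tip, the snippet below derives "balanced" weights from the class counts shown earlier by `value_counts()`. The formula (total divided by the number of classes times the class count) is one common heuristic and is an assumption here, not something prescribed by this guide.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 24720, 1: 7841}  # 0 = "<=50K", 1 = ">50K"

total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: rarer classes receive proportionally larger weights.
weights = {label: total / (n_classes * count) for label, count in counts.items()}
print({label: round(w, 3) for label, w in weights.items()})

# The resulting mapping could then be supplied through the class_weights parameter,
# e.g. CatBoostClassifier(class_weights=weights, ...).
```

In practice the counts would be computed on the training split only, so the weights do not depend on the test data.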

## Further Exploration

To dive deeper into CatBoost, consider:
- Exploring other evaluation metrics (`eval_metric`).
- Experimenting with different boosting types (`boosting_type`).


## Conclusion

CatBoost is a powerful and versatile gradient boosting algorithm that performs well on both numerical and categorical data. 
It is efficient, easy to use, and requires minimal preprocessing of categorical features. By leveraging its built-in functionalities and tuning its hyperparameters, you can achieve high performance on a wide range of machine learning tasks.