# CatBoost Methodology

##### Author information
- Name: Joel Shin
- email address: joel@handong.ac.kr
- GitHub: https://github.com/JoellikeCoffee

#### Part 1. Introduction to the CatBoost Model
CatBoost, developed by Yandex, is a gradient boosting library designed around categorical data. The name combines "Categorical" and "Boosting", reflecting the model's focus: robust and efficient handling of categorical features.

Traditionally, gradient boosting models have been built for numerical data, so categorical features must first be converted into numbers. This conversion can lose information: label encoding imposes an arbitrary order on the categories, and one-hot encoding becomes unwieldy for high-cardinality features. Either effect can hurt model performance. CatBoost was designed to process categorical features directly, while preserving the information they carry.

CatBoost runs on both CPU and GPU. Its authors report that the GPU training implementation and the CPU scoring (prediction) implementation are significantly faster than other gradient boosting libraries on ensembles of similar size (Dorogush et al., 2018). Relative training speed against XGBoost and LightGBM still depends on the dataset, hardware, and settings.

Through its dedicated treatment of categorical features, CatBoost has extended gradient boosting to datasets that previously required substantial manual encoding.

#### Part 2. Key Concepts of CatBoost
The CatBoost algorithm can be described in five phases.

1. **Data Preparation**: CatBoost encodes categorical features with a technique called "Ordered Target Statistics". The training examples are placed in a random order, and each example's category is replaced by a target statistic computed only from the examples that precede it in that order. An example's own target value therefore never leaks into its encoding.


2. **Model Initialization**: As in other gradient boosting algorithms, CatBoost starts from a simple initial model, typically a constant prediction such as the mean target value. This model yields the initial predictions.


3. **Residual Computation**: The current predictions are compared with the actual values. The difference between the actual and predicted values, known as the residual, measures how far off the current model is. For a general loss function, the negative gradient of the loss plays this role.


4. **New Model Training**: Once the residuals are calculated, a new tree is trained to predict them, correcting the errors of the preceding model. At this stage CatBoost uses a technique called "Ordered Boosting" to reduce overfitting.


5. **Model Combination**: The new tree is added to the existing ensemble, and phases 3 to 5 repeat for a set number of iterations. The final model is an ensemble whose prediction is the initial prediction plus the scaled contributions of all trees.
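
Phases 2 to 5 can be sketched in plain Python. This is a minimal illustration of generic gradient boosting with squared loss, not CatBoost's implementation: it uses one numeric feature and one-split trees (stumps), and it omits ordered target statistics and ordered boosting. The helper names `fit_stump`, `boost`, and `mse` and the toy data are assumptions made for this sketch.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    # Try every threshold that leaves at least one sample on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are identical: fall back to a constant.
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss with stumps as weak learners."""
    base = sum(ys) / len(ys)                 # phase 2: constant initial model
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]    # phase 3
        tree = fit_stump(xs, residuals)                   # phase 4
        trees.append(tree)
        preds = [p + learning_rate * tree(x)              # phase 5
                 for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: the target jumps between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(xs, ys)
baseline = sum(ys) / len(ys)
print(mse(lambda x: baseline, xs, ys), mse(model, xs, ys))
```

Each round fits a stump to what the current ensemble still gets wrong, so the training error of `model` falls below that of the constant baseline.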

A hallmark of CatBoost is the way it handles categorical features inside gradient boosting, which reduces the need for manual encoding. During training, the algorithm computes target statistics for each category, such as the number of samples in the category and their mean target value, and uses these statistics as numerical features when constructing trees.

The target statistic for each category is computed using the following equation:

$$ S_i = \frac{\sum_{j=1}^{n} y_j \cdot [x_j = i] + \alpha \cdot p}{\sum_{j=1}^{n} [x_j = i] + \alpha} $$

where $S_i$ is the target statistic for category $i$, $y_j$ is the target value for sample $j$, $x_j$ is the categorical feature value for sample $j$, $[x_j = i]$ equals 1 when sample $j$ belongs to category $i$ and 0 otherwise, $n$ is the total number of samples, $\alpha > 0$ is a smoothing parameter, and $p$ is a prior value (commonly the mean target over the dataset). The formula computes a smoothed average target for each category: for categories with few occurrences the estimate is pulled toward $p$, which limits overfitting to rare categories.
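
A smoothed per-category mean of the kind described above can be sketched as follows, together with an ordered variant that encodes each sample from the samples before it only. This is an illustrative sketch, not CatBoost's code: the function names, the `prior` argument (the value that estimates are pulled toward, set here to the overall mean target), and the toy data are assumptions. CatBoost applies the ordered variant over random permutations of the data, whereas this sketch uses the given order.

```python
from collections import defaultdict


def target_statistic(categories, targets, prior, alpha=1.0):
    """Smoothed mean target per category, computed from all samples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    return {c: (sums[c] + alpha * prior) / (counts[c] + alpha) for c in counts}


def ordered_target_statistic(categories, targets, prior, alpha=1.0):
    """Encode each sample using only the samples that precede it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for category, y in zip(categories, targets):
        # The statistic is computed before the sample's own target is added.
        encoded.append((sums[category] + alpha * prior) / (counts[category] + alpha))
        sums[category] += y
        counts[category] += 1
    return encoded


# Toy data: port of embarkation and a binary target.
categories = ["S", "C", "S", "Q", "S", "C"]
targets = [0, 1, 1, 0, 1, 1]
prior = sum(targets) / len(targets)

stats = target_statistic(categories, targets, prior)
ordered = ordered_target_statistic(categories, targets, prior)
print(stats)
print(ordered)
```

The first occurrence of every category is encoded as `prior` in the ordered variant, because no earlier sample of that category exists yet.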

Additionally, CatBoost applies ordered boosting when estimating leaf values during tree structure selection: the residual for each example is computed with a model that was trained without that example. This avoids the bias that arises when the same examples are used both to fit the model and to evaluate its errors, and so curtails the model's tendency to overfit the training data (Prokhorenkova et al., 2018).

The primary strength of CatBoost is its ability to handle categorical features without a separate encoding step. This makes the algorithm easier to use by reducing the need for manual feature encoding. Moreover, its method of calculating leaf values helps avoid overfitting, which can yield more accurate predictions.

#### Part 3. Example
For the experiment, the Titanic dataset, freely available on Kaggle, was used. After only basic string preprocessing, three models (CatBoost, XGBoost, and LightGBM) were compared on accuracy and training time. The preprocessing also derives several new variables, including categorical ones.

In [8]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.preprocessing import LabelEncoder
import time
import warnings
warnings.filterwarnings('ignore')

# Load the Titanic Dataset
titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [10]:
# Preprocessing: separate the features from the target
X = titanic_data.drop(['Survived'], axis=1)
y = titanic_data['Survived']
# Strip punctuation from column names. regex=True is needed in pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default.
X.columns = X.columns.str.replace(r'[^\w\s-]', '', regex=True)

# Create a new feature Title, containing the titles of passenger names
X['Title'] = X['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())
X['Title'] = X['Title'].replace(['Lady', 'Countess', 'Don', 'Sir', 'Jonkheer', 'Dona'], 'Noble')
X['Title'] = X['Title'].replace(['Capt', 'Col', 'Dr', 'Major', 'Rev'], 'Officer')
X['Title'] = X['Title'].replace(['Mlle', 'Ms'], 'Miss')
X['Title'] = X['Title'].replace('Mme', 'Mrs')

# Create a new feature FamilySize (siblings/spouses + parents/children + the passenger)
X['FamilySize'] = X['SibSp'] + X['Parch'] + 1

# Create a new feature IsAlone (1 if the passenger travels without family)
X['IsAlone'] = (X['FamilySize'] == 1).astype(int)

# Create a new feature CabinClass (deck letter, or 'Unknown' if the cabin is missing)
X['CabinClass'] = X['Cabin'].apply(lambda cabin: cabin[0] if isinstance(cabin, str) else 'Unknown')

In [13]:
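# Label Encoding: map each categorical column to integer codes so that all
# three libraries receive the same numeric input. Because cat_features is not
# passed to CatBoost below, it treats these columns as numeric features.
categorical_features = ['Sex', 'Cabin', 'Embarked', 'Ticket', 'Name', 'Title', 'CabinClass']
encoder = LabelEncoder()
for feature in categorical_features:
    X[feature] = encoder.fit_transform(X[feature].astype(str))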

# Set Validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# Training and Evaluation
models = {
    'CatBoost': CatBoostClassifier(iterations=100, learning_rate=0.1, verbose=0),
    'XGBoost': XGBClassifier(n_estimators=100, learning_rate=0.1, verbosity=0),
    'LightGBM': LGBMClassifier(n_estimators=100, learning_rate=0.1, verbosity=-1)
}
times = []
accuracies = []

for name, model in models.items():
    # Time only the call to fit(); perf_counter is a monotonic, high-resolution clock
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed_time = time.perf_counter() - start_time
    times.append(elapsed_time)

    # Evaluate accuracy on the held-out 20% split
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Print results
for i, name in enumerate(models.keys()):
    print(f'{name}: Computing time - {times[i]:.4f} sec, Accuracy - {accuracies[i]:.4f}')

CatBoost: Computing time - 0.1280 sec, Accuracy - 0.8603
XGBoost: Computing time - 0.0504 sec, Accuracy - 0.8268
LightGBM: Computing time - 0.0760 sec, Accuracy - 0.8324


<figure style="text-align: center;">
    <img src="Evaluation_of_Catboost.png" alt="Drawing" style="width: 800px;"/>
    <figcaption>Model training times, from the official CatBoost materials</figcaption>
</figure>

In the CPU experiment, CatBoost achieved the highest accuracy of the three models: 86.0%, compared with 82.7% for XGBoost and 83.2% for LightGBM. It also took the longest to train (0.128 s, versus 0.050 s for XGBoost and 0.076 s for LightGBM), although a single sub-second measurement on such a small dataset is not a reliable indicator of relative speed. According to the existing literature on CatBoost, the model is likely to be more effective on larger datasets or those with a greater number of categorical variables.
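
One way to make such a timing comparison less sensitive to noise is to repeat each fit several times and report the median. The sketch below is only an illustration: `time_fit` is a hypothetical helper, not part of any of the three libraries, and the workload is a stand-in for a call such as `model.fit(X_train, y_train)`.

```python
import statistics
import time


def time_fit(fit, repeats=5):
    """Return the median wall-clock time in seconds of `repeats` calls to fit()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload; in the experiment it would wrap a model's fit() call.
calls = []
elapsed = time_fit(lambda: calls.append(sum(i * i for i in range(10_000))), repeats=5)
print(f'median over {len(calls)} runs: {elapsed:.6f} sec')
```

The median is used rather than the mean so that a single slow run, for example the first call while caches are cold, does not dominate the result.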

#### Part 4. Appendix
- CatBoost official website: https://catboost.ai/


- Dataset : https://www.kaggle.com/c/titanic


- Papers
    - Dorogush, A. V., Ershov, V., & Gulin, A. (2018). CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363.
    - Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems, 31.
    
    
- Concept of Boosting Algorithm
    - https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting
    - https://en.wikipedia.org/wiki/Boosting_(machine_learning)
    - https://towardsdatascience.com/tagged/boosting
    
    
- Accuracy Formula

$$
\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{False Positives} + \text{True Negatives} + \text{False Negatives}}
$$


- System Environment

| Component         | Specification                       |
|-------------------|-------------------------------------|
| OS                | Windows 11 Pro, version 21H2        |
| CPU               | 12th Gen Intel(R) Core(TM) i9-12900 |
| GPU               | 2 × NVIDIA GeForce RTX 3090         |
| RAM               | 4 × Samsung 32 GB DDR4 25600        |
| Python Version    | 3.9.13                              |
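
The accuracy formula listed in this appendix can be checked by hand with a few lines of Python. This is a minimal sketch for binary labels coded as 0 and 1; the function name `accuracy` and the example labels are made up for illustration, and in the experiment the same quantity is computed by `accuracy_score` from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Accuracy for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return (tp + tn) / (tp + fp + tn + fn)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 3 TP + 3 TN out of 8 samples -> 0.75
```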

