# CatBoost and XGBoost regression

XGBoost is a boosting algorithm that builds on gradient boosting. Its full name is eXtreme Gradient Boosting and, as the name suggests, it is a more heavily engineered version of the earlier gradient boosting algorithm. The main difference between standard gradient boosting and XGBoost is that XGBoost adds regularization terms to its training objective, which penalize overly complex trees. In simple words, it is a regularized form of the existing gradient boosting algorithm.
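
To make the effect of regularization concrete, here is a minimal sketch of how an L2 penalty shrinks the value of a single tree leaf. It assumes squared-error loss, where the regularized leaf value reduces to `sum(residuals) / (n + lambda)`; the function name `leaf_weight` and the residual values are illustrative only, and the sketch ignores the learning rate and the L1 penalty.

```python
def leaf_weight(residuals, reg_lambda):
    """Leaf value for squared-error loss with an L2 penalty on the leaf weight.

    With squared error, each sample contributes gradient (prediction - target)
    and hessian 1, so -G / (H + lambda) becomes sum(residuals) / (n + lambda),
    where residual = target - current prediction.
    """
    return sum(residuals) / (len(residuals) + reg_lambda)


# Hypothetical residuals for two samples that fall into the same leaf
residuals = [4.0, 6.0]

unregularized = leaf_weight(residuals, reg_lambda=0.0)  # plain mean of the residuals
regularized = leaf_weight(residuals, reg_lambda=1.0)    # shrunk towards zero

print(unregularized, regularized)
```

With `reg_lambda=0` the leaf simply predicts the mean residual; a positive value pulls the leaf value towards zero, which is the sense in which the boosting step is regularized.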

CatBoost is a boosting algorithm that often performs very well on datasets with categorical features, because it has a built-in method for handling them. In CatBoost, categorical features are encoded using statistics of the output (target) column. Because the target values are taken into account when encoding the categorical features, the model can be more accurate on categorical datasets.
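
The idea of encoding a category from the target column can be sketched as follows. This is a simplified illustration of an ordered target statistic, not CatBoost's actual implementation (which, among other things, works over random permutations of the rows). The function name, the `prior` and `weight` parameters, and the toy data are all assumptions made for the example.

```python
def ordered_target_encode(categories, targets, prior, weight=1.0):
    """Encode each row from the targets of EARLIER rows with the same category.

    Using only earlier rows keeps a row's own target out of its encoding.
    The prior smooths the estimate when a category has been seen rarely.
    """
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded.append((s + weight * prior) / (n + weight))
        # Only now add the current row to the running statistics
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


# Toy data, not taken from the diabetes dataset
colors = ["red", "blue", "red", "red", "blue"]
targets = [10.0, 20.0, 30.0, 50.0, 40.0]
prior = 30.0  # hypothetical prior, here chosen as the overall target mean

encoded = ordered_target_encode(colors, targets, prior)
print(encoded)
```

The first occurrence of each category falls back to the prior, and later occurrences move towards the running mean of the targets seen so far for that category.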

## Importing and loading data

In [1]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.5-cp312-cp312-win_amd64.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Collecting plotly (from catboost)
  Downloading plotly-5.22.0-py3-none-any.whl.metadata (7.1 kB)
Collecting tenacity>=6.2.0 (from plotly->catboost)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Downloading catboost-1.2.5-cp312-cp312-win_amd64.whl (101.1 MB)


In [2]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.1.0-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.0-py3-none-win_amd64.whl (124.9 MB)


In [3]:
# Importing libraries
import pandas as pd 
import numpy as np

# For plotting graphs
import matplotlib.pyplot as plt

# Importing the train_test_split function
from sklearn.model_selection import train_test_split
from sklearn import datasets

from sklearn.metrics import r2_score, mean_squared_error
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

from sklearn import metrics
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings('ignore')

In [4]:
from sklearn.datasets import load_diabetes

# Load the diabetes data
X, y = load_diabetes(return_X_y = True)

print(X[:10], end = '\n')
print('---')
print(y[:10])

[[ 0.03807591  0.05068012  0.06169621  0.02187239 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990749 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06833155 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 -0.00567042 -0.04559945 -0.03419447
  -0.03235593 -0.00259226  0.00286131 -0.02593034]
 [-0.08906294 -0.04464164 -0.01159501 -0.03665608  0.01219057  0.02499059
  -0.03603757  0.03430886  0.02268774 -0.00936191]
 [ 0.00538306 -0.04464164 -0.03638469  0.02187239  0.00393485  0.01559614
   0.00814208 -0.00259226 -0.03198764 -0.04664087]
 [-0.09269548 -0.04464164 -0.04069594 -0.01944183 -0.06899065 -0.07928784
   0.04127682 -0.0763945  -0.04117617 -0.09634616]
 [-0.04547248  0.05068012 -0.04716281 -0.01599898 -0.04009564 -0.02480001
   0.00077881 -0.03949338 -0.06291688 -0.03835666]
 [ 0.06350368  0.05068012 -0.00189471  0.06662945  0.09061988  0.10891438
   0.02286863  0.01770335 -0.03581619  0.00306441]


## Model building

### Creating the training and testing sets

In [5]:
# Divide into train and test sets
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state = 23)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(331, 10) (331,)
(111, 10) (111,)


### Build the CatBoost regressor model

In [7]:
# Creating a CatBoost regressor instance
cbr = CatBoostRegressor(loss_function = 'RMSE')

# Train the model
cbr.fit(train_x, train_y)

Learning rate set to 0.034381
0:	learn: 76.0463837	total: 150ms	remaining: 2m 29s
1:	learn: 75.0816365	total: 152ms	remaining: 1m 15s
2:	learn: 74.0677588	total: 155ms	remaining: 51.5s
3:	learn: 73.2984687	total: 158ms	remaining: 39.3s
4:	learn: 72.2910077	total: 160ms	remaining: 31.9s
5:	learn: 71.4731411	total: 163ms	remaining: 27.1s
6:	learn: 70.7001150	total: 166ms	remaining: 23.6s
7:	learn: 70.0089314	total: 169ms	remaining: 21s
8:	learn: 69.0787413	total: 173ms	remaining: 19.1s
9:	learn: 68.2266996	total: 176ms	remaining: 17.4s
10:	learn: 67.3996335	total: 178ms	remaining: 16s
11:	learn: 66.5254701	total: 182ms	remaining: 15s
12:	learn: 65.9099320	total: 184ms	remaining: 14s
13:	learn: 65.2886330	total: 188ms	remaining: 13.3s
14:	learn: 64.6992752	total: 192ms	remaining: 12.6s
15:	learn: 64.1729245	total: 194ms	remaining: 11.9s
16:	learn: 63.5230544	total: 196ms	remaining: 11.3s
17:	learn: 62.9449694	total: 198ms	remaining: 10.8s
18:	learn: 62.3813037	total: 200ms	remaining: 10.3

<catboost.core.CatBoostRegressor at 0x227c9c2a300>

In [8]:
# Calculating scores
print('Training Score:', cbr.score(train_x, train_y).round(3))
print('Testing Score:', cbr.score(test_x, test_y).round(3))

Training Score: 0.987
Testing Score: 0.424


In [9]:
# Calculate R2 score and RMSE
y_train_pred = cbr.predict(train_x)
y_pred = cbr.predict(test_x)

print('Training R2-Score:', round(r2_score(train_y, y_train_pred), 3))
print('Testing R2-Score:', round(r2_score(test_y, y_pred), 3))
print('Root Mean Square Error:', round(np.sqrt(mean_squared_error(test_y, y_pred)), 3))

Training R2-Score: 0.987
Testing R2-Score: 0.424
Root Mean Square Error: 57.033
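
For reference, the metrics reported above can be computed by hand. The sketch below uses plain Python and made-up values (not the diabetes data) to show what R2 and RMSE measure; MAE is included as well. The function name `regression_metrics` is illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n             # mean absolute error
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    rmse = math.sqrt(ss_res / n)                      # root mean squared error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # share of variance explained
    return mae, rmse, r2


# Toy values chosen so the arithmetic is easy to follow
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, round(r2, 3))
```

RMSE and MAE are in the units of the target, while R2 compares the model's squared error with that of always predicting the mean, so a value near 1 is good and a value near 0 means the model is little better than the mean.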


### Build the XGBoost regressor model

In [10]:
# Creating an XGBoost regressor instance ('reg:squarederror' replaces the deprecated 'reg:linear')
xgb = XGBRegressor(objective = 'reg:squarederror', n_estimators = 10, random_state = 23)

# Train the model
xgb.fit(train_x, train_y)

In [13]:
# Calculating scores
print('Training Score:', round(xgb.score(train_x, train_y), 3))
print('Testing Score:', round(xgb.score(test_x, test_y), 3))

Training Score: 0.942
Testing Score: 0.313


In [14]:
# Calculate R2 score and RMSE
y_train_pred = xgb.predict(train_x)
y_pred = xgb.predict(test_x)

print('Training R2-Score:', round(r2_score(train_y, y_train_pred), 3))
print('Testing R2-Score:', round(r2_score(test_y, y_pred), 3))
print('Root Mean Square Error:', round(np.sqrt(mean_squared_error(test_y, y_pred)), 3))

Training R2-Score: 0.942
Testing R2-Score: 0.313
Root Mean Square Error: 62.296
