# CatBoost and XGBoost classification

XGBoost is a boosting machine learning algorithm that builds on the gradient boosting algorithm. Its full name is eXtreme Gradient Boosting and, as the name suggests, it is a more aggressive, optimized version of gradient boosting. The main difference between plain gradient boosting and XGBoost is that XGBoost adds regularization to the training objective. In simple words, it is a regularized form of the existing gradient boosting algorithm.
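
To make the effect of regularization concrete, here is a minimal sketch of how an L2 penalty shrinks the value assigned to a single tree leaf in a second-order boosting step. It assumes a squared-error loss and a hand-made toy leaf; the function name `leaf_weight` and the sample values are illustrative and not part of XGBoost's API.

```python
def leaf_weight(gradients, hessians, reg_lambda=1.0):
    """Leaf value that minimizes the second-order loss plus an L2 penalty.

    Uses w* = -G / (H + lambda), where G and H are the sums of the
    first and second derivatives of the loss over the samples in the leaf.
    """
    G = sum(gradients)
    H = sum(hessians)
    return -G / (H + reg_lambda)


# Toy leaf (made-up values). For the loss 0.5 * (pred - y)^2 the
# gradient is (pred - y) and the hessian is 1.
targets = [3.0, 5.0, 4.0]
preds = [0.0, 0.0, 0.0]
grads = [p - t for p, t in zip(preds, targets)]
hess = [1.0] * len(targets)

unregularized = leaf_weight(grads, hess, reg_lambda=0.0)  # plain mean residual
regularized = leaf_weight(grads, hess, reg_lambda=1.0)    # pulled toward zero

print(unregularized, regularized)
```

With no penalty the leaf predicts the mean residual; with `reg_lambda=1.0` the same leaf predicts a smaller value, which is the shrinkage that regularization introduces.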

CatBoost is a boosting algorithm that performs especially well on datasets with categorical features, because it has a dedicated method for handling them. In CatBoost, categorical features are encoded on the basis of the output column, so the relationship between each category and the target is taken into account during encoding and training. This often makes it more accurate on categorical datasets.
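
The idea of encoding a category from the output column can be sketched as an ordered target statistic: each row is encoded using only the targets of earlier rows with the same category, smoothed by a prior. This is a simplified illustration, not CatBoost's actual implementation; the function name, the `prior` and `prior_weight` parameters, and the toy data are assumptions.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each category value from the targets of the rows seen before it."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running row count per category
    encoded = []
    for cat, target in zip(categories, targets):
        # Use only earlier rows, so a row's own target never leaks into its encoding.
        value = (sums[cat] + prior * prior_weight) / (counts[cat] + prior_weight)
        encoded.append(value)
        sums[cat] += target
        counts[cat] += 1
    return encoded


# Toy data (made up): embarkation port and survival flag.
ports = ["S", "C", "S", "S", "C"]
survived = [0, 1, 1, 1, 0]
encoded_ports = ordered_target_encode(ports, survived)
print(encoded_ports)
```

Note that the dataset used in this notebook is already one-hot encoded and no categorical columns are passed to the model, so this mechanism is not exercised here.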

## Importing and loading data

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing classifiers
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

from sklearn import metrics 
from sklearn.metrics import *
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Reading the data
data = pd.read_csv('datasets/data_cleaned.csv')

# Check the data
data.head()

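| | Survived | Age | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | SibSp_0 | SibSp_1 | ... | Parch_0 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |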


## Model building

### Separating independent and dependent variables

In [3]:
# Independent variables
x = data.drop(['Survived'], axis=1)

# Dependent variable
y = data['Survived']

print(x.shape, y.shape)

(891, 24) (891,)


### Creating the training and testing sets

In [4]:
# Divide into train and test sets
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state = 101, stratify = y)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(668, 24) (668,)
(223, 24) (223,)
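
The 668/223 split follows from the defaults: no `test_size` is passed, so `train_test_split` falls back to a test share of 0.25. The arithmetic can be reproduced with a small sketch (the variable names are illustrative):

```python
import math

n_samples = 891
test_size = 0.25  # scikit-learn's default when neither test_size nor train_size is given

# The test share is rounded up; the remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Passing `test_size` explicitly (for example `test_size=0.2`) makes the split ratio visible in the code instead of relying on the default.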


### Build the CatBoost classification model

In [5]:
# Create a CatBoost classifier instance
clf = CatBoostClassifier(iterations = 5, learning_rate=0.1, loss_function='CrossEntropy')

# Train the model
clf.fit(train_x,train_y)

0:	learn: 0.6220928	total: 151ms	remaining: 602ms
1:	learn: 0.5700240	total: 156ms	remaining: 234ms
2:	learn: 0.5320119	total: 160ms	remaining: 107ms
3:	learn: 0.4953079	total: 166ms	remaining: 41.5ms
4:	learn: 0.4678107	total: 170ms	remaining: 0us


<catboost.core.CatBoostClassifier at 0x1ecf62cb230>

In [6]:
# Calculating scores
print('Training Score:', clf.score(train_x, train_y).round(3))
print('Testing Score:', clf.score(test_x, test_y).round(3))

Training Score: 0.849
Testing Score: 0.789


### Build the XGBoost classification model 

In [7]:
# Create an XGBoost classifier instance
xgb = XGBClassifier(objective = 'binary:logistic', n_estimators=20, random_state=42, eval_metric=["auc", "error", "error@0.6"])

# Train the model
xgb.fit(train_x, train_y)
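
The `eval_metric` list above includes `error` and `error@0.6`, which are binary classification error rates at a probability threshold of 0.5 and 0.6 respectively. The sketch below illustrates what such a thresholded error measures, using made-up probabilities; `error_at` is a hypothetical helper, not an XGBoost function. As far as I can tell, these metrics are only reported when an `eval_set` is passed to `fit`, which this notebook does not do.

```python
def error_at(probabilities, labels, threshold=0.5):
    """Share of samples misclassified when p > threshold is treated as positive."""
    wrong = sum((p > threshold) != bool(y) for p, y in zip(probabilities, labels))
    return wrong / len(labels)


# Toy predicted probabilities and true labels (made up for illustration).
probs = [0.9, 0.55, 0.4, 0.7, 0.2]
labels = [1, 1, 0, 0, 0]

error_default = error_at(probs, labels)                # threshold 0.5
error_strict = error_at(probs, labels, threshold=0.6)  # threshold 0.6

print(error_default, error_strict)
```

Raising the threshold turns the borderline 0.55 prediction into a negative, so the error rate on this toy data goes up.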

In [10]:
# Calculating scores
print('Training Score:', round(xgb.score(train_x, train_y), 3))
print('Testing Score:', round(xgb.score(test_x, test_y), 3))

Training Score: 0.918
Testing Score: 0.816
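
For both classifiers, `score` reports mean accuracy, which hides how the errors are distributed between the two classes. A minimal sketch of how accuracy, precision and recall relate to the confusion counts is shown below; the helper names and the toy labels are made up and are not results from the models above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def summarize(y_true, y_pred):
    """Compute accuracy, precision and recall from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels (made up), not taken from the Titanic data.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = summarize(toy_true, toy_pred)
print(report)
```

In the notebook itself, the same quantities could be obtained by comparing `test_y` with `clf.predict(test_x)` or `xgb.predict(test_x)`.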
