# CatBoost vs XGBoost Classifier Predictive Power Comparison: 5-Minute Quick and Dirty Style

#### This is a quick and dirty look at how CatBoost compares to XGBoost with minimal tuning.  
#### I am using the network intrusion detection dataset from Kaggle: https://www.kaggle.com/sampadab17/network-intrusion-detection

## Imports and custom cell behavior

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

# see all the data
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# widen jupyter cells to fit page
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Load data

In [2]:
df_train = pd.read_csv('Train_data.csv')
df_test = pd.read_csv('Test_data.csv')

# ===================  Preprocess data  ==================================

## Check for missing values

In [3]:
df_train.isnull().sum() 

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_h

## One line NaN check

In [39]:
df_train.isnull().values.any()

False

## Now that we know no values are NaN, we can prepare the data for model consumption

# Encode data columns with categorical data

In [4]:
df_encoded = pd.get_dummies(df_train, prefix=['protocol_type', 'service', 'flag'], columns=['protocol_type', 'service', 'flag'])
df_test_encoded = pd.get_dummies(df_test, prefix=['protocol_type', 'service', 'flag'], columns=['protocol_type', 'service', 'flag'])
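
One thing to watch with the cell above: calling `pd.get_dummies` separately on the train and test frames can produce different column sets if a category appears in only one of them. A minimal sketch of one way to line the columns up, using hypothetical toy frames (the names `train`, `test`, and the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frames: the test set lacks one 'service' category seen in training.
train = pd.DataFrame({"service": ["http", "ftp", "smtp"], "duration": [0, 2, 5]})
test = pd.DataFrame({"service": ["http", "ftp"], "duration": [1, 3]})

train_enc = pd.get_dummies(train, prefix=["service"], columns=["service"])
test_enc = pd.get_dummies(test, prefix=["service"], columns=["service"])

# Reindex the test frame to the training columns: categories missing from
# the test set become all-zero columns, and any extra ones would be dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

In this notebook the training frame also contains `class`, so that column would need to be dropped before taking the column list.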

## Grab X and y values from dataframe

In [6]:
X = df_encoded.drop('class', axis=1).values
y = df_encoded['class'].values

# Standardize data

In [5]:
std_scaler = StandardScaler()

In [7]:
X_std = std_scaler.fit_transform(X)
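
Note that the scaler here is fit on the full matrix before the train/test split, so the held-out rows contribute to the mean and variance. A common alternative is to compute the statistics on the training rows only and reuse them for the test rows (in scikit-learn terms, `fit_transform` on train and `transform` on test). The sketch below shows what that amounts to in plain NumPy, on a hypothetical toy matrix:

```python
import numpy as np

# Hypothetical toy feature matrix (rows = samples, columns = features).
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_demo_train, X_demo_test = X_demo[:3], X_demo[3:]

# Statistics come from the training rows only.
mean = X_demo_train.mean(axis=0)
std = X_demo_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# The same training statistics are applied to both splits.
X_demo_train_std = (X_demo_train - mean) / std
X_demo_test_std = (X_demo_test - mean) / std

print(X_demo_train_std.mean(axis=0), X_demo_train_std.std(axis=0))
print(X_demo_test_std)
```

For tree-based models such as XGBoost, scaling rarely changes the result much, so this is mostly about keeping the evaluation clean.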

## Encode target variable for model consumption

In [8]:
labelencoder = LabelEncoder()
y_encoded = labelencoder.fit_transform(y)
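
To make the encoding step concrete: `LabelEncoder` maps each distinct label to an integer according to the sorted order of the labels. A rough NumPy equivalent is sketched below; the label strings are hypothetical, since the actual values of the `class` column are not shown in this notebook.

```python
import numpy as np

# Hypothetical labels, for illustration only.
y_demo = np.array(["normal", "anomaly", "normal", "normal", "anomaly"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the index of each original label within that sorted array.
classes, y_demo_encoded = np.unique(y_demo, return_inverse=True)

print(classes)
print(y_demo_encoded)

# Indexing the classes with the codes recovers the original labels.
y_demo_decoded = classes[y_demo_encoded]
```

With the fitted encoder from the cell above, `labelencoder.classes_` and `labelencoder.inverse_transform` give the same mapping back.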

## Train/test split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_std, y_encoded, test_size=0.2, random_state=93)

# =======================  XGBoost Model  ==================================

## Load XGBoost Classifier

In [44]:
xgb_model = xgb.XGBClassifier(n_jobs=-1, random_state=93)

## Fit and Predict on XGBoost

In [45]:
xgb_model.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=-1, nthread=None, objective='binary:logistic',
       random_state=93, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1)

In [11]:
preds = xgb_model.predict(X_test)

## Evaluate predictions

In [12]:
accuracy = accuracy_score(y_test, preds)
accuracy

0.9954356023020441

# =======================  CatBoost Model  =================================

## Reload raw data to see how CatBoost handles categorical data automatically

In [20]:
X_cat = df_train.drop('class', axis=1).values
y = df_train['class'].values

In [22]:
X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(X_cat, y_encoded, test_size=0.2, random_state=93)

## Grab categorical columns to feed to CatBoost model for auto handling of categorical features

In [30]:
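# Treat every non-numeric column as categorical (np.int and np.float were removed in NumPy 1.24)
features = df_train.drop('class', axis=1)
cat_features_indices = np.array([i for i, dtype in enumerate(features.dtypes) if not pd.api.types.is_numeric_dtype(dtype)])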

cat_features_indices

array([1, 2, 3])

## Load CatBoost Classifier

In [46]:
cat_model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=93,
    logging_level='Verbose'
)

## Fit model

In [47]:
cat_model.fit(
    X_cat_train, y_cat_train,
    cat_features=cat_features_indices,
    eval_set=(X_cat_test, y_cat_test)
)

Learning rate set to 0.106956
0:	learn: 0.4624527	test: 0.4630817	best: 0.4630817 (0)	total: 115ms	remaining: 1m 54s
1:	learn: 0.3225984	test: 0.3206360	best: 0.3206360 (1)	total: 177ms	remaining: 1m 28s
2:	learn: 0.2224392	test: 0.2208187	best: 0.2208187 (2)	total: 264ms	remaining: 1m 27s
3:	learn: 0.1676912	test: 0.1650538	best: 0.1650538 (3)	total: 322ms	remaining: 1m 20s
4:	learn: 0.1373049	test: 0.1347718	best: 0.1347718 (4)	total: 376ms	remaining: 1m 14s
5:	learn: 0.1091680	test: 0.1060534	best: 0.1060534 (5)	total: 449ms	remaining: 1m 14s
6:	learn: 0.0917493	test: 0.0877947	best: 0.0877947 (6)	total: 554ms	remaining: 1m 18s
7:	learn: 0.0825103	test: 0.0788006	best: 0.0788006 (7)	total: 642ms	remaining: 1m 19s
8:	learn: 0.0671166	test: 0.0635188	best: 0.0635188 (8)	total: 731ms	remaining: 1m 20s
9:	learn: 0.0591325	test: 0.0559955	best: 0.0559955 (9)	total: 809ms	remaining: 1m 20s
10:	learn: 0.0521221	test: 0.0493616	best: 0.0493616 (10)	total: 872ms	remaining: 1m 18s
11:	learn: 

93:	learn: 0.0085277	test: 0.0125231	best: 0.0125231 (93)	total: 6.34s	remaining: 1m 1s
94:	learn: 0.0083982	test: 0.0123290	best: 0.0123290 (94)	total: 6.4s	remaining: 1m
95:	learn: 0.0083595	test: 0.0122889	best: 0.0122889 (95)	total: 6.46s	remaining: 1m
96:	learn: 0.0083374	test: 0.0122404	best: 0.0122404 (96)	total: 6.51s	remaining: 1m
97:	learn: 0.0082777	test: 0.0121690	best: 0.0121690 (97)	total: 6.59s	remaining: 1m
98:	learn: 0.0082439	test: 0.0121387	best: 0.0121387 (98)	total: 6.64s	remaining: 1m
99:	learn: 0.0082393	test: 0.0121328	best: 0.0121328 (99)	total: 6.67s	remaining: 1m
100:	learn: 0.0080935	test: 0.0120668	best: 0.0120668 (100)	total: 6.73s	remaining: 59.9s
101:	learn: 0.0080931	test: 0.0120743	best: 0.0120668 (100)	total: 6.77s	remaining: 59.6s
102:	learn: 0.0079715	test: 0.0120335	best: 0.0120335 (102)	total: 6.92s	remaining: 1m
103:	learn: 0.0078736	test: 0.0119643	best: 0.0119643 (103)	total: 6.98s	remaining: 1m
104:	learn: 0.0078303	test: 0.0119535	best: 0.011

187:	learn: 0.0049355	test: 0.0099603	best: 0.0099580 (178)	total: 11.9s	remaining: 51.5s
188:	learn: 0.0049271	test: 0.0099604	best: 0.0099580 (178)	total: 12s	remaining: 51.3s
189:	learn: 0.0049201	test: 0.0099673	best: 0.0099580 (178)	total: 12s	remaining: 51.2s
190:	learn: 0.0048621	test: 0.0099339	best: 0.0099339 (190)	total: 12.1s	remaining: 51.1s
191:	learn: 0.0048595	test: 0.0099175	best: 0.0099175 (191)	total: 12.1s	remaining: 50.9s
192:	learn: 0.0048259	test: 0.0099391	best: 0.0099175 (191)	total: 12.2s	remaining: 50.9s
193:	learn: 0.0047862	test: 0.0099211	best: 0.0099175 (191)	total: 12.2s	remaining: 50.8s
194:	learn: 0.0047557	test: 0.0099330	best: 0.0099175 (191)	total: 12.3s	remaining: 50.7s
195:	learn: 0.0047509	test: 0.0099357	best: 0.0099175 (191)	total: 12.3s	remaining: 50.5s
196:	learn: 0.0047456	test: 0.0099363	best: 0.0099175 (191)	total: 12.4s	remaining: 50.4s
197:	learn: 0.0047450	test: 0.0099413	best: 0.0099175 (191)	total: 12.4s	remaining: 50.3s
198:	learn: 0.

281:	learn: 0.0036448	test: 0.0099245	best: 0.0098906 (218)	total: 16.7s	remaining: 42.5s
282:	learn: 0.0036436	test: 0.0099229	best: 0.0098906 (218)	total: 16.8s	remaining: 42.4s
283:	learn: 0.0036200	test: 0.0099072	best: 0.0098906 (218)	total: 16.8s	remaining: 42.4s
284:	learn: 0.0036112	test: 0.0099060	best: 0.0098906 (218)	total: 16.9s	remaining: 42.3s
285:	learn: 0.0035971	test: 0.0099017	best: 0.0098906 (218)	total: 16.9s	remaining: 42.2s
286:	learn: 0.0035965	test: 0.0099024	best: 0.0098906 (218)	total: 17s	remaining: 42.1s
287:	learn: 0.0035773	test: 0.0099154	best: 0.0098906 (218)	total: 17s	remaining: 42.1s
288:	learn: 0.0035766	test: 0.0099087	best: 0.0098906 (218)	total: 17s	remaining: 41.9s
289:	learn: 0.0035765	test: 0.0099068	best: 0.0098906 (218)	total: 17.1s	remaining: 41.8s
290:	learn: 0.0035724	test: 0.0099033	best: 0.0098906 (218)	total: 17.1s	remaining: 41.7s
291:	learn: 0.0035711	test: 0.0099025	best: 0.0098906 (218)	total: 17.2s	remaining: 41.7s
292:	learn: 0.00

373:	learn: 0.0029676	test: 0.0096938	best: 0.0096816 (365)	total: 21.1s	remaining: 35.3s
374:	learn: 0.0029602	test: 0.0097097	best: 0.0096816 (365)	total: 21.1s	remaining: 35.2s
375:	learn: 0.0029545	test: 0.0097026	best: 0.0096816 (365)	total: 21.2s	remaining: 35.1s
376:	learn: 0.0029492	test: 0.0097176	best: 0.0096816 (365)	total: 21.2s	remaining: 35.1s
377:	learn: 0.0029399	test: 0.0097111	best: 0.0096816 (365)	total: 21.3s	remaining: 35s
378:	learn: 0.0029393	test: 0.0097105	best: 0.0096816 (365)	total: 21.3s	remaining: 34.9s
379:	learn: 0.0029384	test: 0.0097038	best: 0.0096816 (365)	total: 21.4s	remaining: 34.9s
380:	learn: 0.0029384	test: 0.0097032	best: 0.0096816 (365)	total: 21.4s	remaining: 34.8s
381:	learn: 0.0029250	test: 0.0097429	best: 0.0096816 (365)	total: 21.5s	remaining: 34.7s
382:	learn: 0.0029148	test: 0.0097612	best: 0.0096816 (365)	total: 21.5s	remaining: 34.6s
383:	learn: 0.0028981	test: 0.0097325	best: 0.0096816 (365)	total: 21.6s	remaining: 34.6s
384:	learn: 

466:	learn: 0.0025677	test: 0.0096386	best: 0.0096259 (426)	total: 25.5s	remaining: 29.1s
467:	learn: 0.0025661	test: 0.0096253	best: 0.0096253 (467)	total: 25.5s	remaining: 29s
468:	learn: 0.0025660	test: 0.0096265	best: 0.0096253 (467)	total: 25.6s	remaining: 28.9s
469:	learn: 0.0025276	test: 0.0096179	best: 0.0096179 (469)	total: 25.6s	remaining: 28.9s
470:	learn: 0.0024904	test: 0.0096440	best: 0.0096179 (469)	total: 25.7s	remaining: 28.8s
471:	learn: 0.0024904	test: 0.0096440	best: 0.0096179 (469)	total: 25.7s	remaining: 28.8s
472:	learn: 0.0024731	test: 0.0096311	best: 0.0096179 (469)	total: 25.8s	remaining: 28.7s
473:	learn: 0.0024711	test: 0.0096487	best: 0.0096179 (469)	total: 25.8s	remaining: 28.6s
474:	learn: 0.0024686	test: 0.0096459	best: 0.0096179 (469)	total: 25.8s	remaining: 28.6s
475:	learn: 0.0024617	test: 0.0096336	best: 0.0096179 (469)	total: 25.9s	remaining: 28.5s
476:	learn: 0.0024615	test: 0.0096315	best: 0.0096179 (469)	total: 25.9s	remaining: 28.4s
477:	learn: 

558:	learn: 0.0023200	test: 0.0097064	best: 0.0096179 (469)	total: 30.6s	remaining: 24.2s
559:	learn: 0.0023196	test: 0.0096975	best: 0.0096179 (469)	total: 30.7s	remaining: 24.1s
560:	learn: 0.0023191	test: 0.0096960	best: 0.0096179 (469)	total: 30.8s	remaining: 24.1s
561:	learn: 0.0023096	test: 0.0096704	best: 0.0096179 (469)	total: 30.8s	remaining: 24s
562:	learn: 0.0023084	test: 0.0096724	best: 0.0096179 (469)	total: 30.9s	remaining: 24s
563:	learn: 0.0023071	test: 0.0096773	best: 0.0096179 (469)	total: 31s	remaining: 23.9s
564:	learn: 0.0023065	test: 0.0096791	best: 0.0096179 (469)	total: 31s	remaining: 23.9s
565:	learn: 0.0023035	test: 0.0096751	best: 0.0096179 (469)	total: 31.1s	remaining: 23.8s
566:	learn: 0.0022998	test: 0.0096653	best: 0.0096179 (469)	total: 31.1s	remaining: 23.8s
567:	learn: 0.0022949	test: 0.0096436	best: 0.0096179 (469)	total: 31.2s	remaining: 23.7s
568:	learn: 0.0022935	test: 0.0096281	best: 0.0096179 (469)	total: 31.3s	remaining: 23.7s
569:	learn: 0.0022

651:	learn: 0.0021376	test: 0.0097555	best: 0.0096179 (469)	total: 36s	remaining: 19.2s
652:	learn: 0.0021359	test: 0.0097655	best: 0.0096179 (469)	total: 36s	remaining: 19.1s
653:	learn: 0.0021164	test: 0.0097216	best: 0.0096179 (469)	total: 36.1s	remaining: 19.1s
654:	learn: 0.0021163	test: 0.0097216	best: 0.0096179 (469)	total: 36.2s	remaining: 19s
655:	learn: 0.0021163	test: 0.0097215	best: 0.0096179 (469)	total: 36.2s	remaining: 19s
656:	learn: 0.0021161	test: 0.0097227	best: 0.0096179 (469)	total: 36.3s	remaining: 18.9s
657:	learn: 0.0021119	test: 0.0097134	best: 0.0096179 (469)	total: 36.3s	remaining: 18.9s
658:	learn: 0.0021115	test: 0.0097099	best: 0.0096179 (469)	total: 36.4s	remaining: 18.8s
659:	learn: 0.0021091	test: 0.0097127	best: 0.0096179 (469)	total: 36.5s	remaining: 18.8s
660:	learn: 0.0020903	test: 0.0097208	best: 0.0096179 (469)	total: 36.5s	remaining: 18.7s
661:	learn: 0.0020902	test: 0.0097212	best: 0.0096179 (469)	total: 36.6s	remaining: 18.7s
662:	learn: 0.0020

745:	learn: 0.0019016	test: 0.0096139	best: 0.0096120 (721)	total: 41.2s	remaining: 14s
746:	learn: 0.0019014	test: 0.0096171	best: 0.0096120 (721)	total: 41.2s	remaining: 14s
747:	learn: 0.0018594	test: 0.0095858	best: 0.0095858 (747)	total: 41.3s	remaining: 13.9s
748:	learn: 0.0018593	test: 0.0095835	best: 0.0095835 (748)	total: 41.3s	remaining: 13.8s
749:	learn: 0.0018589	test: 0.0095834	best: 0.0095834 (749)	total: 41.4s	remaining: 13.8s
750:	learn: 0.0018586	test: 0.0095821	best: 0.0095821 (750)	total: 41.4s	remaining: 13.7s
751:	learn: 0.0018584	test: 0.0095831	best: 0.0095821 (750)	total: 41.5s	remaining: 13.7s
752:	learn: 0.0018581	test: 0.0095683	best: 0.0095683 (752)	total: 41.6s	remaining: 13.6s
753:	learn: 0.0018580	test: 0.0095692	best: 0.0095683 (752)	total: 41.6s	remaining: 13.6s
754:	learn: 0.0018558	test: 0.0095648	best: 0.0095648 (754)	total: 41.7s	remaining: 13.5s
755:	learn: 0.0018557	test: 0.0095663	best: 0.0095648 (754)	total: 41.7s	remaining: 13.5s
756:	learn: 0.

838:	learn: 0.0016169	test: 0.0095002	best: 0.0094723 (813)	total: 46.3s	remaining: 8.88s
839:	learn: 0.0016126	test: 0.0094932	best: 0.0094723 (813)	total: 46.3s	remaining: 8.82s
840:	learn: 0.0016105	test: 0.0095148	best: 0.0094723 (813)	total: 46.4s	remaining: 8.76s
841:	learn: 0.0016101	test: 0.0095191	best: 0.0094723 (813)	total: 46.4s	remaining: 8.71s
842:	learn: 0.0016098	test: 0.0095219	best: 0.0094723 (813)	total: 46.4s	remaining: 8.65s
843:	learn: 0.0016080	test: 0.0095215	best: 0.0094723 (813)	total: 46.5s	remaining: 8.6s
844:	learn: 0.0016076	test: 0.0095132	best: 0.0094723 (813)	total: 46.6s	remaining: 8.54s
845:	learn: 0.0016075	test: 0.0095160	best: 0.0094723 (813)	total: 46.6s	remaining: 8.49s
846:	learn: 0.0016064	test: 0.0095157	best: 0.0094723 (813)	total: 46.7s	remaining: 8.43s
847:	learn: 0.0016004	test: 0.0095105	best: 0.0094723 (813)	total: 46.8s	remaining: 8.38s
848:	learn: 0.0015942	test: 0.0094994	best: 0.0094723 (813)	total: 46.8s	remaining: 8.33s
849:	learn:

932:	learn: 0.0014725	test: 0.0095653	best: 0.0094723 (813)	total: 51.4s	remaining: 3.69s
933:	learn: 0.0014721	test: 0.0095721	best: 0.0094723 (813)	total: 51.5s	remaining: 3.64s
934:	learn: 0.0014713	test: 0.0095698	best: 0.0094723 (813)	total: 51.5s	remaining: 3.58s
935:	learn: 0.0014713	test: 0.0095770	best: 0.0094723 (813)	total: 51.6s	remaining: 3.53s
936:	learn: 0.0014710	test: 0.0095761	best: 0.0094723 (813)	total: 51.7s	remaining: 3.47s
937:	learn: 0.0014707	test: 0.0095799	best: 0.0094723 (813)	total: 51.7s	remaining: 3.42s
938:	learn: 0.0014696	test: 0.0095679	best: 0.0094723 (813)	total: 51.8s	remaining: 3.36s
939:	learn: 0.0014604	test: 0.0096048	best: 0.0094723 (813)	total: 51.9s	remaining: 3.31s
940:	learn: 0.0014580	test: 0.0096070	best: 0.0094723 (813)	total: 51.9s	remaining: 3.25s
941:	learn: 0.0014579	test: 0.0096104	best: 0.0094723 (813)	total: 52s	remaining: 3.2s
942:	learn: 0.0014494	test: 0.0096095	best: 0.0094723 (813)	total: 52s	remaining: 3.14s
943:	learn: 0.0

<catboost.core.CatBoostClassifier at 0x121e53710>
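
In the log above, the learn loss keeps falling while the test loss flattens out at roughly 0.0095 to 0.0097 from around iteration 400 onward, so most of the later iterations buy very little. A patience rule is one simple way to cut training short in that situation. The sketch below is a plain-Python illustration with a hypothetical loss curve, not CatBoost's own implementation; CatBoost offers a comparable option through the `early_stopping_rounds` argument of `fit`.

```python
def best_iteration(test_losses, patience):
    """Return (best_index, stop_index) under a simple patience rule.

    Training is considered stopped once `patience` iterations have passed
    since the validation loss last improved on its best value so far.
    """
    best_idx = 0
    for i, loss in enumerate(test_losses):
        if loss < test_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    # No stop was triggered: training ran to the last iteration.
    return best_idx, len(test_losses) - 1


# Hypothetical loss curve, not taken from the log above.
losses = [0.46, 0.32, 0.22, 0.17, 0.14, 0.15, 0.145, 0.16, 0.17]
best_idx, stop_idx = best_iteration(losses, patience=3)
print(best_idx, stop_idx)
```

Because the `eval_set` in this notebook is the same split used for the final accuracy score, any iteration choice made on it is slightly optimistic; a separate validation split would avoid that.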

## Predict

In [37]:
preds_raw = cat_model.predict(X_cat_test)

## Evaluate predictions

In [38]:
accuracy_score(y_cat_test, preds_raw)

0.9970232188926375

# =======================  Model Results Comparison  =================================

In [49]:
%timeit xgb_model.predict(X_test)

20.7 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [50]:
%timeit cat_model.predict(X_cat_test)

27.2 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


<strong>XGBoost</strong> = 0.9954356023020441 <br />
<strong>CatBoost</strong> = 0.9970232188926375

Out of the box, quick and dirty, CatBoost outperforms XGBoost by ~0.0016 in accuracy (about 0.16 percentage points).
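
The size of the gap between the two accuracy scores reported above can be worked out directly. The snippet expresses it three ways: as a raw difference, in percentage points, and as the share of XGBoost's errors that CatBoost removes. The variable names are only for illustration.

```python
xgb_acc = 0.9954356023020441
cat_acc = 0.9970232188926375

diff = cat_acc - xgb_acc                    # absolute difference in accuracy
diff_pct_points = diff * 100                # same difference in percentage points
rel_error_reduction = diff / (1 - xgb_acc)  # fraction of XGBoost's errors removed

print(f"absolute difference:      {diff:.4f}")
print(f"percentage points:        {diff_pct_points:.2f}")
print(f"relative error reduction: {rel_error_reduction:.1%}")
```

The absolute gap is small, but because both models are already above 99% accuracy, it corresponds to a sizeable fraction of the remaining errors.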