# Gradient Boosting: CPU vs GPU
This is a basic tuturoal which shows how to run gradient boosting on CPU and GPU on Google Colaboratory. It will give you an opportunity to see the speedup that you get from GPU training. The speedup is large even on Tesla K80 that is available in Colaboratory. On newer generations of GPU the speedup will be much bigger.

We will use CatBoost gradient boosting library, which is known for it's good GPU performance.
  
 You could try it out on Colaboratory, just pressing on the following badge:  
 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/tools/google_colaboratory_cpu_vs_gpu_tutorial.ipynb) 

## Set GPU as hardware accelerator
First of all, you need to select GPU as hardware accelerator. There are two simple steps to do so:  
Step 1. Navigate to 'Runtime' menu and select 'Change runtime type'  
Step 2. Choose GPU as hardware accelerator.  
That's all!

## Importing CatBoost

Next big thing is to import CatBoost inside environment. Colaboratory has built in libraries installed and most libraries can be installed quickly with a simple *!pip install* command.  
Please take notice you need to re-import library every time you starts new session of Colab.

In [1]:
!pip install catboost

Collecting catboost
[?25l  Downloading https://files.pythonhosted.org/packages/98/03/777a0e1c12571a7f3320a4fa6d5f123dba2dd7c0bca34f4f698a6396eb48/catboost-0.12.2-cp36-none-manylinux1_x86_64.whl (55.5MB)
[K    100% |████████████████████████████████| 55.5MB 858kB/s 
Collecting enum34 (from catboost)
  Downloading https://files.pythonhosted.org/packages/af/42/cb9355df32c69b553e72a2e28daee25d1611d2c0d9c272aa1d34204205b2/enum34-1.1.6-py3-none-any.whl
Installing collected packages: enum34, catboost
Successfully installed catboost-0.12.2 enum34-1.1.6
  [enum]
You must restart the runtime in order to use newly installed versions.[0m


##Download and prepare dataset
The next step is dataset downloading. GPU training is useful for large datsets. You will get a good speedup starting from 10k objects and the more objects you have, the more will be the speedup.
Because of that reason we have selected a large dataset - [Epsilon](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) (500.000 documents and 2.000 features) for this tutorial.
Firstly, we will get the data through catboost.datasets module. The code below does this. It will run for approximately 10-15 minutes. So please be patient :)

In [0]:
from catboost.datasets import epsilon

train, test = epsilon()

X_train, y_train = train.iloc[:,1:], train[0]
X_test, y_test = train.iloc[:,1:], train[0]

## Training on CPU
Now we will train the model on CPU and measure execution time.
We will use 100 iterations for our CPU training since otherwise it will take a long time.
It will take around 15 minutes.

In [3]:
from catboost import CatBoostClassifier
import timeit

def train_on_cpu():  
  model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.03,
    eval_metric='Accuracy',
  )
  
  model.fit(
      X_train, y_train,
      eval_set=(X_test, y_test),
      verbose=5
  );   
      
cpu_time = timeit.timeit('train_on_cpu()', setup="from __main__ import train_on_cpu", number=1)

print('Time to fit model on CPU: {} sec'.format(int(cpu_time)))

0:	learn: 0.6648425	test: 0.6648425	best: 0.6648425 (0)	total: 7.65s	remaining: 12m 37s
5:	learn: 0.6920775	test: 0.6920775	best: 0.6920775 (5)	total: 44.6s	remaining: 11m 38s
10:	learn: 0.7009275	test: 0.7009275	best: 0.7009275 (10)	total: 1m 20s	remaining: 10m 51s
15:	learn: 0.7111575	test: 0.7111575	best: 0.7111575 (15)	total: 1m 55s	remaining: 10m 7s
20:	learn: 0.7219325	test: 0.7219325	best: 0.7219325 (20)	total: 2m 31s	remaining: 9m 31s
25:	learn: 0.7294100	test: 0.7294100	best: 0.7294100 (25)	total: 3m 7s	remaining: 8m 52s
30:	learn: 0.7375125	test: 0.7375125	best: 0.7375125 (30)	total: 3m 41s	remaining: 8m 12s
35:	learn: 0.7445875	test: 0.7445875	best: 0.7445875 (35)	total: 4m 14s	remaining: 7m 32s
40:	learn: 0.7499600	test: 0.7499600	best: 0.7499600 (40)	total: 4m 48s	remaining: 6m 55s
45:	learn: 0.7549100	test: 0.7549100	best: 0.7549100 (45)	total: 5m 21s	remaining: 6m 17s
50:	learn: 0.7599000	test: 0.7599000	best: 0.7599000 (50)	total: 5m 53s	remaining: 5m 40s
55:	learn: 0.7

Take notice that learning time itself wothout data feeding is around 12 minutes. Whereas all the process consumes 14-15 min.

## Training on GPU
The previous code execution has been done on CPU. It's time to use GPU!  
We need to use '*task_type='GPU'*' parameter value to run GPU training. Now the execution time wouldn't be so big :)  
BTW if Colaboratory shows you a warning 'GPU memory usage is close to the limit', just press 'Ignore'.

In [4]:
def train_on_gpu():  
  model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.03,
    eval_metric='Accuracy',
    task_type='GPU'
  )
  
  model.fit(
      X_train, y_train,
      eval_set=(X_test, y_test),
      verbose=5
  );     
      
gpu_time = timeit.timeit('train_on_gpu()', setup="from __main__ import train_on_gpu", number=1)

print('Time to fit model on GPU: {} sec'.format(int(gpu_time)))
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

0:	learn: 0.6652575	test: 0.6652575	best: 0.6652575 (0)	total: 329ms	remaining: 32.6s
5:	learn: 0.6936800	test: 0.6936800	best: 0.6936800 (5)	total: 1.48s	remaining: 23.2s
10:	learn: 0.7041750	test: 0.7041750	best: 0.7041750 (10)	total: 2.56s	remaining: 20.7s
15:	learn: 0.7138950	test: 0.7138950	best: 0.7138950 (15)	total: 3.61s	remaining: 18.9s
20:	learn: 0.7232200	test: 0.7232200	best: 0.7232200 (20)	total: 4.65s	remaining: 17.5s
25:	learn: 0.7311075	test: 0.7311075	best: 0.7311075 (25)	total: 5.67s	remaining: 16.1s
30:	learn: 0.7377725	test: 0.7377725	best: 0.7377725 (30)	total: 6.67s	remaining: 14.8s
35:	learn: 0.7443750	test: 0.7443750	best: 0.7443750 (35)	total: 7.67s	remaining: 13.6s
40:	learn: 0.7491000	test: 0.7491000	best: 0.7491000 (40)	total: 8.66s	remaining: 12.5s
45:	learn: 0.7551175	test: 0.7551175	best: 0.7551175 (45)	total: 9.67s	remaining: 11.4s
50:	learn: 0.7600325	test: 0.7600325	best: 0.7600325 (50)	total: 10.7s	remaining: 10.3s
55:	learn: 0.7655850	test: 0.7655850

As you can see GPU is much faster than CPU on large datasets. It takes just 3-4 mins vs 14-15 mins to fit the model. Moreover learning process consumes just 30 seconds vs 12 minutes! This is a good reason to use GPU instead of CPU!
  
Thank you for attention! 