<img style="width:100%" src="../images/practical_xgboost_in_python_notebook_header.png" />

# Using Scikit-learn Interface
This notebook presents an alternative approach to using the XGBoost algorithm: its scikit-learn interface.

**What's included**:
- <a href="#libs">load libraries</a> and <a href="#data">prepare data</a>,
- <a href="#params">specify parameters</a>,
- <a href="#train">train classifier</a>,
- <a href="#predict">make predictions</a>

### Loading libraries<a name='libs' />
Begin by loading all required libraries.

In [1]:
import numpy as np

from sklearn.datasets import load_svmlight_files
from sklearn.metrics import accuracy_score

from xgboost.sklearn import XGBClassifier

### Loading data<a name='data' />
We are going to use the same dataset as in the previous lecture. The scikit-learn package provides a convenient function, `load_svmlight_files`, capable of reading many libsvm files at once and storing them as SciPy sparse matrices.

In [2]:
X_train, y_train, X_test, y_test = load_svmlight_files(('../data/agaricus.txt.train', '../data/agaricus.txt.test'))
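
To make the libsvm text format more concrete, here is a minimal sketch of how a single `label index:value ...` line maps to a label and a feature row. The helper `parse_libsvm_line` and the sample line are hypothetical and written only for illustration; they assume zero-based feature indices and are not how `load_svmlight_files` is implemented.

```python
def parse_libsvm_line(line, n_features):
    """Parse one 'label index:value ...' line into (label, dense_row)."""
    parts = line.split()
    label = float(parts[0])
    row = [0.0] * n_features          # features not listed stay at zero
    for item in parts[1:]:
        index, value = item.split(":")
        row[int(index)] = float(value)  # assumption: zero-based indices
    return label, row


# Hypothetical line, not taken from the agaricus files
label, row = parse_libsvm_line("1 0:1 3:1", n_features=5)
print(label, row)
```

Only the non-zero entries are written in the file, which is why the loader returns sparse matrices rather than dense arrays.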

Examine what was loaded:

In [3]:
print("Train dataset contains {0} rows and {1} columns".format(X_train.shape[0], X_train.shape[1]))
print("Test dataset contains {0} rows and {1} columns".format(X_test.shape[0], X_test.shape[1]))

Train dataset contains 6513 rows and 126 columns
Test dataset contains 1611 rows and 126 columns


In [4]:
print("Train possible labels: ")
print(np.unique(y_train))

print("\nTest possible labels: ")
print(np.unique(y_test))

Train possible labels: 
[ 0.  1.]

Test possible labels: 
[ 0.  1.]


### Specify training parameters<a name='params' />
All the parameters are set as in the previous example:
- we are dealing with a binary classification problem (`'objective':'binary:logistic'`),
- we want shallow trees with no more than 2 levels (`'max_depth':2`),
- we don't want any output (`'silent':1`; recent XGBoost releases use `verbosity` instead),
- we want the algorithm to learn fast and aggressively (`'learning_rate':1`; named `eta` in the native interface),
- we want to iterate for only 5 rounds (`'n_estimators':5`).

In [5]:
params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'learning_rate': 1.0,
    'silent': 1.0,
    'n_estimators': 5
}
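
The naming differences between the two interfaces can be sketched as a small translation step. The helper `to_native_params` below is hypothetical and not part of XGBoost. It assumes that `learning_rate` corresponds to `eta` and that the number of rounds is passed to the native training call separately (as `num_boost_round`) rather than inside the parameter dictionary.

```python
# Assumed name mapping from the scikit-learn interface to the native one
SKLEARN_TO_NATIVE = {"learning_rate": "eta"}


def to_native_params(sklearn_params):
    """Split scikit-learn style params into (native_params, num_boost_round)."""
    params = dict(sklearn_params)                 # do not modify the caller's dict
    num_boost_round = params.pop("n_estimators")  # rounds are passed separately
    native = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in params.items()}
    return native, num_boost_round


native_params, num_rounds = to_native_params({
    "objective": "binary:logistic",
    "max_depth": 2,
    "learning_rate": 1.0,
    "n_estimators": 5,
})
print(native_params, num_rounds)
```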

### Training classifier<a name='train' />

In [6]:
bst = XGBClassifier(**params).fit(X_train, y_train)

### Make predictions<a name='predict' />

In [7]:
preds = bst.predict(X_test)
preds

array([ 0.,  1.,  0., ...,  1.,  0.,  1.])

Calculate the resulting error:

In [8]:
# Count the predictions that match the true labels
correct = int(np.sum(y_test == preds))

# Accuracy is the fraction of correct predictions; the error is its complement
acc = accuracy_score(y_test, preds)

print('Predicted correctly: {0}/{1}'.format(correct, len(preds)))
print('Error: {0:.4f}'.format(1-acc))

Predicted correctly: 1601/1611
Error: 0.0062
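
The relationship between the correct count, the accuracy and the error can be checked on a toy example. The arrays below are made up for illustration and are not the agaricus data.

```python
import numpy as np

# Made-up labels and predictions: four of five match
y_true = np.array([0., 1., 1., 0., 1.])
y_pred = np.array([0., 1., 0., 0., 1.])

correct = int(np.sum(y_true == y_pred))  # number of matching positions
accuracy = correct / len(y_true)         # fraction predicted correctly
error = 1 - accuracy                     # classification error

print('Predicted correctly: {0}/{1}'.format(correct, len(y_true)))
print('Error: {0:.4f}'.format(error))
```

Applied to the counts reported above, the same arithmetic gives 1 - 1601/1611 ≈ 0.0062.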
