# How to use BaaL with Scikit-Learn models

In this tutorial, you will learn how to use BaaL on a scikit-learn model.
In this case, we will use `RandomForestClassifier`.

This tutorial is based on the tutorial from [Saimadhu Polamuri](https://dataaspirant.com/2017/06/26/random-forest-classifier-python-scikit-learn/).

First, install BaaL if you have not done so already.

```bash
pip install baal
```

In [12]:
%load_ext autoreload
%autoreload 2
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
           "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "CancerType"]

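import pandas as pd
data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
# The file has no header row: pass the column names to read_csv
# so that the first record is not consumed as a header.
dataset = pd.read_csv(data, names=HEADERS)

# Drop rows whose "BareNuclei" feature is missing (encoded as '?'), then store the column as integers.
dataset = dataset[dataset[HEADERS[6]] != '?'].astype({HEADERS[6]: int})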


# Split
train_x, test_x, train_y, test_y = train_test_split(dataset[HEADERS[1:-1]], dataset[HEADERS[-1]],
                                                        train_size=0.7)


clf = RandomForestClassifier()
clf.fit(train_x, train_y)

# Get metrics
predictions = clf.predict(test_x)
print("Train Accuracy :: ", accuracy_score(train_y, clf.predict(train_x)))
print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
print(" Confusion matrix ", confusion_matrix(test_y, predictions))


Train Accuracy ::  1.0
Test Accuracy  ::  0.9658536585365853
 Confusion matrix  [[119   3]
 [  4  79]]
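
The test accuracy printed above can be recovered from the confusion matrix, which makes the matrix easier to read. The sketch below copies the four counts from the output and relies on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are chosen for this illustration only.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
# (rows: true class, columns: predicted class).
cm = np.array([[119, 3],
               [4, 79]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions divided by the number of true items in each class.
recall = np.diag(cm) / cm.sum(axis=1)

# Per-class precision: correct predictions divided by the number of items predicted as each class.
precision = np.diag(cm) / cm.sum(axis=0)

print("accuracy ", accuracy)
print("recall   ", recall)
print("precision", precision)
```

The accuracy computed this way is 198 / 205, which matches the test accuracy reported by `accuracy_score`.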




Now that you have a trained model, you can use it to perform uncertainty estimation.
scikit-learn directly provides `RandomForestClassifier.predict_proba`, which returns the mean of the class
probabilities predicted by the individual trees of the forest.
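
As a quick check of that statement, the sketch below averages the per-tree probabilities by hand and compares the result with the forest-level output. It uses synthetic data from `make_classification` as a stand-in (an assumption made for this sketch, not the tutorial's dataset), and names such as `per_tree` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for this sketch, not the tutorial's dataset).
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# One probability matrix per tree: [n_estimators, n_samples, n_classes].
per_tree = np.array([tree.predict_proba(X_test) for tree in forest.estimators_])

# Averaging over the tree axis should reproduce the forest-level probabilities.
same = np.allclose(per_tree.mean(axis=0), forest.predict_proba(X_test))
print(per_tree.shape, same)
```

Averaging discards the disagreement between trees, which is the information an uncertainty heuristic needs.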

To try one of our heuristics in `baal.active.heuristics` instead, keep the predictions of each tree separate. Here's how.

In [13]:
import numpy as np
from baal.active.heuristics import BALD
print(f"Using {len(clf.estimators_)} estimators")

# Predict independently with each estimator: [n_estimators, n_samples, n_classes].
x = np.array([e.predict_proba(test_x) for e in clf.estimators_])
# Move the estimator axis last because BaaL expects [n_samples, n_classes, ..., n_estimations]
x = np.rollaxis(x, 0, 3)
print("Uncertainty per sample")
print(BALD().compute_score(x))

print("Ranks")
print(BALD()(x))


Using 10 estimators
Uncertainty per sample
[0.         0.         0.         0.         0.         0.
 0.32508297 0.         0.         0.32508297 0.         0.32508297
 0.         0.         0.         0.         0.32508297 0.
 0.         0.         0.         0.         0.         0.50040242
 0.         0.         0.32508297 0.         0.32508297 0.
 0.         0.         0.32508297 0.         0.         0.32508297
 0.         0.         0.         0.         0.         0.
 0.         0.50040242 0.         0.69314718 0.         0.
 0.         0.32508297 0.         0.6108643  0.         0.32508297
 0.         0.         0.         0.         0.         0.
 0.         0.         0.32508297 0.         0.         0.
 0.         0.32508297 0.         0.         0.         0.50040242
 0.         0.6108643  0.         0.         0.         0.
 0.         0.32508297 0.         0.         0.         0.
 0.         0.50040242 0.6108643  0.         0.         0.50040242
 0.         0.         0
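
The scores above take only a few distinct values (0, 0.325, 0.500, 0.611, 0.693). That is consistent with each of the 10 trees casting an essentially hard vote, so the score depends only on how the votes split. The sketch below is a minimal NumPy version of the usual BALD definition (entropy of the mean prediction minus the mean entropy of the individual predictions). `entropy` and `bald_score` are hypothetical helpers written for this illustration, not BaaL's implementation.

```python
import numpy as np

def entropy(p, axis):
    # Clip before the log so that p == 0 contributes 0 instead of NaN.
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

def bald_score(probs):
    """probs has shape [n_samples, n_classes, n_estimations]."""
    mean_probs = probs.mean(axis=-1)                         # [n_samples, n_classes]
    entropy_of_mean = entropy(mean_probs, axis=1)            # total uncertainty
    mean_of_entropy = entropy(probs, axis=1).mean(axis=-1)   # uncertainty within each estimator
    return entropy_of_mean - mean_of_entropy

# Three samples, two classes, 10 estimators that each cast a hard vote.
n_estimators = 10
votes_for_class_1 = [0, 1, 5]
probs = np.zeros((len(votes_for_class_1), 2, n_estimators))
for i, k in enumerate(votes_for_class_1):
    probs[i, 1, :k] = 1.0   # k estimators vote for class 1
    probs[i, 0, k:] = 1.0   # the others vote for class 0

scores = bald_score(probs)
ranks = np.argsort(-scores)  # most uncertain sample first
print(scores)
print(ranks)
```

A unanimous vote scores 0, a 9-to-1 split scores about 0.325, and an even split scores ln 2 ≈ 0.693, which are values that appear in the output above.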

## Active learning with SkLearn

You can also try active learning by wrapping your arrays in `ActiveNumpyArray`.

**NOTE**: Because we focus on images, we have not run experiments on this setup.

In [14]:
from baal.active.dataset import ActiveNumpyArray
dataset = ActiveNumpyArray((train_x, train_y))

# We start with 10 labelled samples.
dataset.label_randomly(10)

heuristic = BALD()

# We will use a RandomForest in this case.
clf = RandomForestClassifier()
def predict(test, clf):
    # test[0] holds the features; predict with every fitted estimator.
    x = np.array([e.predict_proba(test[0]) for e in clf.estimators_])
    # Move the estimator axis last because BaaL expects [n_samples, n_classes, ..., n_estimations]
    x = np.rollaxis(x, 0, 3)
    return x

# Number of samples to label at each step.
query_size = 10

for _ in range(5):
    print("Dataset size", len(dataset))
    clf.fit(*dataset.dataset)
    predictions = clf.predict(test_x)
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
    probs = predict(dataset.pool, clf)
    # Rank the pool with the heuristic and label the first `query_size` items.
    to_label = heuristic(probs)
    if len(to_label) > 0:
        dataset.label(to_label[:query_size])
    else:
        break

Dataset size 10
Test Accuracy  ::  0.9219512195121952
Dataset size 20
Test Accuracy  ::  0.9658536585365853
Dataset size 30
Test Accuracy  ::  0.9414634146341463
Dataset size 40
Test Accuracy  ::  0.9512195121951219
Dataset size 50
Test Accuracy  ::  0.9609756097560975
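
In the loop above, `heuristic(probs)` ranks the rows of `dataset.pool`, so the indices handed to `dataset.label` are positions within the current pool rather than rows of the original training set. The sketch below illustrates that bookkeeping with a plain boolean mask. `PoolTracker` is a hypothetical class written for this illustration; it is not BaaL's `ActiveNumpyArray`, whose internals may differ.

```python
import numpy as np

class PoolTracker:
    """Hypothetical helper: tracks which items are labelled with a boolean mask."""

    def __init__(self, n_items):
        self.labelled = np.zeros(n_items, dtype=bool)

    def pool_indices(self):
        # Absolute positions of the items that are still unlabelled.
        return np.flatnonzero(~self.labelled)

    def label(self, pool_relative):
        # Translate pool-relative positions into absolute ones before marking them.
        absolute = self.pool_indices()[np.asarray(pool_relative, dtype=int)]
        self.labelled[absolute] = True

tracker = PoolTracker(6)
tracker.label([0, 2])           # nothing is labelled yet, so these are items 0 and 2
print(tracker.pool_indices())   # remaining pool: items 1, 3, 4 and 5
tracker.label([0, 3])           # first and last pool entries, i.e. items 1 and 5
print(tracker.labelled)
```

Because the pool shrinks after every step, an index returned by the heuristic is only meaningful until the next call to `label`.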


