# Scikit-Learn: Active learning with Random Forest

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/baal-org/baal/blob/master/notebooks/compatibility/sklearn_tutorial.ipynb)

In this tutorial, you will learn how to use Baal on a scikit-learn model.
In this case, we will use `RandomForestClassifier`.

This tutorial is based on the one by [Saimadhu Polamuri](https://dataaspirant.com/2017/06/26/random-forest-classifier-python-scikit-learn/).

First, install Baal if you have not done so already.

```bash
pip install baal
```

```python
%load_ext autoreload
%autoreload 2
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
           "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "CancerType"]

data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
# The file has no header row, so pass the column names explicitly;
# otherwise the first record is consumed as a header and lost.
dataset = pd.read_csv(data, header=None, names=HEADERS)

# Drop rows with a missing `BareNuclei` value (encoded as '?') and make the column numeric.
dataset = dataset[dataset["BareNuclei"] != '?'].astype({"BareNuclei": int})

# Split: every column except the ID and the label is a feature.
train_x, test_x, train_y, test_y = train_test_split(dataset[HEADERS[1:-1]], dataset[HEADERS[-1]],
                                                    train_size=0.7)

clf = RandomForestClassifier()
clf.fit(train_x, train_y)

# Get metrics
predictions = clf.predict(test_x)
print("Train Accuracy :: ", accuracy_score(train_y, clf.predict(train_x)))
print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
print("Confusion matrix ::\n", confusion_matrix(test_y, predictions))
```


Now that you have a trained model, you can use it to perform uncertainty estimation.
scikit-learn directly provides `RandomForestClassifier.predict_proba`, which returns the mean of the
class probabilities predicted by the trees in the forest.
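To check the statement about the mean response, the sketch below fits a small forest on synthetic data and compares the forest's `predict_proba` with the average over its individual trees. The synthetic data and the parameter values are illustrative assumptions, not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; any classification dataset would do.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Per-tree probabilities, shape [n_estimators, n_samples, n_classes].
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# The forest's own prediction is the average over the trees.
forest_proba = forest.predict_proba(X)
mean_matches = np.allclose(per_tree.mean(axis=0), forest_proba)
print("Mean of trees equals forest.predict_proba:", mean_matches)
```

Averaging discards the disagreement between trees, which is the signal that ensemble-based heuristics rely on.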

If you wish to try one of our heuristics in `baal.active.heuristics` instead, collect the predictions of each estimator separately:

```python
import numpy as np
from baal.active.heuristics import BALD

print(f"Using {len(clf.estimators_)} estimators")

# Predict independently with every estimator: [n_estimators, n_samples, n_classes].
x = np.array([e.predict_proba(test_x) for e in clf.estimators_])
# Roll the estimator axis to the end because Baal expects [n_samples, n_classes, ..., n_estimations].
x = np.rollaxis(x, 0, 3)
print("Uncertainty per sample")
print(BALD().compute_score(x))

print("Ranks")
print(BALD()(x))
```
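For intuition about what the score measures, here is a minimal NumPy sketch of the usual BALD definition: the entropy of the mean prediction minus the mean entropy of the individual predictions. The function name `bald_score` and the toy array are hypothetical, and the sketch is not guaranteed to match Baal's implementation numerically.

```python
import numpy as np


def bald_score(probs, eps=1e-12):
    """Mutual information for probs shaped [n_samples, n_classes, n_estimations]."""
    mean_p = probs.mean(axis=-1)
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps), axis=1)
    mean_of_entropy = -np.sum(probs * np.log(probs + eps), axis=1).mean(axis=-1)
    return entropy_of_mean - mean_of_entropy


# Two samples, two classes, two estimators.
toy = np.array([
    [[1.0, 1.0], [0.0, 0.0]],  # both estimators predict class 0: no disagreement
    [[1.0, 0.0], [0.0, 1.0]],  # the estimators disagree completely
])
scores = bald_score(toy)
ranks = np.argsort(-scores)  # most uncertain sample first
print(scores, ranks)
```

A sample scores high only when the estimators are individually confident but contradict each other, which is why the per-estimator axis has to be kept rather than averaged away.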


## Active learning with scikit-learn

You can also run an active learning loop by wrapping your arrays in `ActiveNumpyArray`.


**NOTE**: Because Baal focuses on images, we have not run experiments on this setup.

```python
from baal.active.dataset import ActiveNumpyArray

dataset = ActiveNumpyArray((train_x, train_y))

# We start with 10 labelled samples.
dataset.label_randomly(10)

heuristic = BALD()
query_size = 10

# We will use a RandomForest in this case.
clf = RandomForestClassifier()


def predict(pool, clf):
    """Return per-estimator probabilities for `pool`, a (features, labels) tuple."""
    # Predict with all fitted estimators.
    x = np.array([e.predict_proba(pool[0]) for e in clf.estimators_])
    # Roll axis because Baal expects [n_samples, n_classes, ..., n_estimations].
    return np.rollaxis(x, 0, 3)


for _ in range(5):
    print("Dataset size", len(dataset))
    # Train on the labelled samples only.
    clf.fit(*dataset.dataset)
    predictions = clf.predict(test_x)
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
    # Rank the unlabelled pool and label the top `query_size` samples.
    probs = predict(dataset.pool, clf)
    to_label = heuristic(probs)
    if len(to_label) > 0:
        dataset.label(to_label[:query_size])
    else:
        break
```
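The bookkeeping behind such a loop can be sketched without Baal: a boolean mask marks the labelled rows, the heuristic ranks only the unlabelled pool, and pool positions are mapped back to dataset indices before labelling. Everything below is an illustrative assumption (synthetic data, the `labelled` mask, and predictive entropy swapped in for BALD) and not a description of how `ActiveNumpyArray` is implemented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# Boolean mask over the dataset: True means the sample has been labelled.
labelled = np.zeros(len(X), dtype=bool)
labelled[rng.choice(len(X), size=10, replace=False)] = True

query_size = 10
sizes = []
for _ in range(5):
    sizes.append(int(labelled.sum()))
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X[labelled], y[labelled])

    pool_idx = np.flatnonzero(~labelled)  # dataset indices of the pool
    if len(pool_idx) == 0:
        break
    proba = model.predict_proba(X[pool_idx])
    # Stand-in heuristic: predictive entropy of the averaged forest output.
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    ranks = np.argsort(-entropy)  # positions within the pool, most uncertain first

    # Map pool positions back to dataset indices before labelling.
    labelled[pool_idx[ranks[:query_size]]] = True

print("Labelled set size per step:", sizes)
```

The index mapping is the step that is easy to get wrong by hand: the ranks refer to positions in the pool, not in the full dataset.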