In [1]:
# Required to correctly display interactive (dynamic) plots in Jupyter notebooks.
# This code cell must be run before any other code cell.
%matplotlib notebook

# Imports
from skmultiflow.data import FileStream
from skmultiflow.data import SEAGenerator
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.bayes import NaiveBayes
from skmultiflow.trees import HoeffdingTreeClassifier
from skmultiflow.trees import HoeffdingAdaptiveTreeClassifier
from skmultiflow.drift_detection import ADWIN

from sklearn.linear_model import SGDClassifier

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec


# scikit-multiflow demo

In this notebook, we show how to set up and run experiments in `scikit-multiflow`.

The demo is divided into the following parts:

1. Running a classification task 
  1. Prequential evaluation implementation
  2. The `EvaluatePrequential` class

2. Concept drift detection
  1. Drift detection test
  2. Impact on predictive performance
  
---
## 1. Running a classification task

In this example we will use the `SEA` stream generator. A data generator does not store any data, but generates it on demand.

Next we will set up a learning method (model, estimator, algorithm), in this case the Naive Bayes classifier:

In [2]:
stream = SEAGenerator(random_state=1)
classifier = NaiveBayes()
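For intuition about what "generates it on demand" means, here is a minimal sketch of a SEA-like generator in plain Python. It is not the `SEAGenerator` implementation: the thresholds follow the commonly cited description of the SEA concepts, and the label convention (which side of the threshold is class 1), the helper name `sea_sample`, and the `concept` parameter are assumptions made for illustration.

```python
import random

# Thresholds of the four SEA concepts as commonly described in the literature.
# Assumption for illustration; not read from scikit-multiflow.
SEA_THRESHOLDS = (8.0, 9.0, 7.0, 9.5)

def sea_sample(rng, concept=0):
    """Draw one (features, label) pair on demand; nothing is stored."""
    # Three features drawn uniformly from [0, 10].
    features = [rng.uniform(0.0, 10.0) for _ in range(3)]
    # Only the first two features determine the label; the third is irrelevant.
    # Assumed convention: label 1 when the sum is at or below the threshold.
    label = int(features[0] + features[1] <= SEA_THRESHOLDS[concept])
    return features, label

rng = random.Random(1)
samples = [sea_sample(rng) for _ in range(5)]
for features, label in samples:
    print([round(v, 2) for v in features], label)
```

Each call produces a fresh sample from the random number generator, so the stream can be arbitrarily long without any data being held in memory.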

### Prequential evaluation

Prequential (test-then-train) evaluation is easily implemented as a loop in which each sample is first used to test the model and then to train it:

In [3]:
# Variables to control evaluation loop and track performance
n_samples = 0
correct_cnt = 0
max_samples = 2000

# Prequential evaluation loop
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()      # Get one sample from the stream
    y_pred = classifier.predict(X)   # Predict class for new data
    if y[0] == y_pred[0]:
        correct_cnt += 1
    classifier.partial_fit(X, y)     # Incrementally train the model with the new data
    n_samples += 1

print('{} samples analyzed.'.format(n_samples))   
print('NaiveBayes classifier accuracy: {}'.format(correct_cnt / n_samples))

2000 samples analyzed.
NaiveBayes classifier accuracy: 0.9395


### `EvaluatePrequential` class

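The `EvaluatePrequential` class implements the prequential evaluation method and provides extra functionality, such as tracking several metrics, comparing multiple models and plotting performance while the evaluation runs.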

Let's run the same experiment on the SEA data but this time we will compare two classifiers:

1. `NaiveBayes`
2. `SGDClassifier`: Linear SVM with SGD training.

We choose the `SGDClassifier` in order to demonstrate the compatibility with incremental methods from `scikit-learn`.

**Note:** `scikit-learn` focuses on the batch learning setting and only a **limited** number of its methods are capable of learning incrementally.

In [4]:
# Setup stream and estimators
stream = SEAGenerator(random_state=1)
nb = NaiveBayes()
svm = SGDClassifier()

# Setup evaluator
eval = EvaluatePrequential(show_plot=True,
                           max_samples=20000,
                           metrics=['accuracy', 'kappa', 'running_time', 'model_size'])

In [5]:
# Run the evaluation
eval.evaluate(stream=stream, model=[nb, svm], model_names=['NB', 'SVM']);


Prequential Evaluation
Evaluating 1 target(s).
Pre-training on 200 sample(s).
Evaluating...
 #################### [100%] [11.12s]
Processed samples: 20000
Mean performance:
NB - Accuracy     : 0.9430
NB - Kappa        : 0.8621
NB - Training time (s)  : 0.56
NB - Testing time  (s)  : 1.42
NB - Total time    (s)  : 1.98
NB - Size (kB)          : 6.8076
SVM - Accuracy     : 0.9558
SVM - Kappa        : 0.8979
SVM - Training time (s)  : 4.48
SVM - Testing time  (s)  : 1.84
SVM - Total time    (s)  : 6.32
SVM - Size (kB)          : 3.4453
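The report above lists kappa next to accuracy. As a rough illustration of why both are useful, the sketch below computes accuracy and Cohen's kappa from two label lists in plain Python. The helper name `accuracy_and_kappa` and the toy labels are hypothetical, and the evaluator may well compute its metrics incrementally rather than in this batch form.

```python
from collections import Counter

def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equally long label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of predictions that match the true labels.
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: probability that labels match if predictions were
    # drawn independently with the same class frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_chance == 1.0:
        # Degenerate case (a single class everywhere): kappa is undefined,
        # so 0.0 is returned by convention in this sketch.
        return p_observed, 0.0
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return p_observed, kappa

y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
y_good = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1]   # one mistake
y_majority = [1] * len(y_true)            # always predicts the majority class

acc_good, kappa_good = accuracy_and_kappa(y_true, y_good)
acc_major, kappa_major = accuracy_and_kappa(y_true, y_majority)
print('good model:     accuracy={:.2f}, kappa={:.2f}'.format(acc_good, kappa_good))
print('majority model: accuracy={:.2f}, kappa={:.2f}'.format(acc_major, kappa_major))
```

The majority-class predictor still reaches 70% accuracy on this imbalanced toy data, but its kappa is zero because it does no better than chance agreement.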


---
## 2. Concept drift detection

### Simulate a data stream with concept drift

For this example, we will generate a synthetic data stream by concatenating 1000 samples drawn from each of 3 normal distributions:
- $dist_a$: $\mu=0.8$, $\sigma=0.05$
- $dist_b$: $\mu=0.4$, $\sigma=0.02$
- $dist_c$: $\mu=0.6$, $\sigma=0.1$.

In [6]:
random_state = np.random.RandomState(12345)
dist_a = random_state.normal(0.8, 0.05, 1000)
dist_b = random_state.normal(0.4, 0.02, 1000)
dist_c = random_state.normal(0.6, 0.1, 1000)

stream = np.concatenate((dist_a, dist_b, dist_c))

# Plot the data
fig = plt.figure(figsize=(7,3), tight_layout=True)
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1]) 
ax1, ax2 = plt.subplot(gs[0]), plt.subplot(gs[1])
ax1.grid()
ax1.plot(stream, label='Stream')
ax2.grid(axis='y')
ax2.hist(dist_a, label=r'$dist_a$')
ax2.hist(dist_b, label=r'$dist_b$')
ax2.hist(dist_c, label=r'$dist_c$')
plt.legend()
plt.show()


### Drift detection test

In this example we will use the ADaptive WINdowing (`ADWIN`) drift detection method.

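The goal is to detect the drifts that occur after samples **1000** and **2000** in the synthetic data stream. A detector needs to see some post-drift samples before it can react, so detections are expected shortly after these points.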

In [7]:
# Instantiate the ADWIN drift detector
drift_detector = ADWIN()

for i, val in enumerate(stream):
    drift_detector.add_element(val)        # Data is processed one sample at a time
    if drift_detector.detected_change():
        print('Change detected at index {}'.format(i))
        drift_detector.reset()

Change detected at index 1055
Change detected at index 2079
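To make the idea of a drift detection test more concrete, here is a deliberately simple detector written in plain Python. It is not ADWIN: it compares the mean of a fixed reference window against the mean of a sliding recent window and flags a change when the gap exceeds a fixed threshold. The function names, the window size of 50 and the threshold of 0.12 are assumptions chosen for this synthetic stream, and the stream is regenerated with the standard library's `random` module, so its values differ from the NumPy stream above.

```python
import random
from collections import deque

def make_stream(seed=12345):
    """Concatenate 1000 samples from each of three normal distributions."""
    rng = random.Random(seed)
    params = [(0.8, 0.05), (0.4, 0.02), (0.6, 0.1)]
    return [rng.gauss(mu, sigma) for mu, sigma in params for _ in range(1000)]

def detect_mean_shifts(values, window=50, threshold=0.12):
    """Flag indices where the recent mean departs from the reference mean."""
    reference = []                   # first `window` samples after a reset
    recent = deque(maxlen=window)    # sliding window of the latest samples
    detections = []
    for i, val in enumerate(values):
        if len(reference) < window:
            reference.append(val)    # still building the reference window
            continue
        recent.append(val)
        if len(recent) == window:
            gap = abs(sum(recent) / window - sum(reference) / window)
            if gap > threshold:
                detections.append(i)
                # Reset both windows so the new concept becomes the reference.
                reference = []
                recent.clear()
    return detections

detections = detect_mean_shifts(make_stream())
for index in detections:
    print('Change detected at index {}'.format(index))
```

Unlike this sketch, ADWIN adapts its window size automatically and bases its decision on a statistical bound rather than a hand-picked threshold, which is why it needs no tuning for a particular stream.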


### Impact on predictive performance

In this example we will use two popular stream models:

1. The `Hoeffding Tree` is a type of decision tree designed for data streams.
2. The `Hoeffding Adaptive Tree` is an improvement over the original `Hoeffding Tree`.

The `Hoeffding Adaptive Tree` uses `ADWIN` to detect changes. If a change is detected at a given branch, an alternate branch is created, which eventually replaces the original branch if it shows better performance on new data.

For this example we will load the data from a CSV file using the `FileStream` class.
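Both trees take their name from the Hoeffding bound, which a Hoeffding Tree uses to decide whether it has seen enough samples at a leaf to commit to a split. The sketch below only evaluates the bound itself; the function name, parameter names and the confidence value of `1e-7` are illustrative assumptions, not values read from the library.

```python
import math

def hoeffding_bound(value_range, delta, n_samples):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean observed
    over n_samples independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n_samples))

# The bound shrinks as more samples are observed at a leaf.
bounds = {n: hoeffding_bound(1.0, 1e-7, n) for n in (100, 1000, 10000)}
for n, eps in bounds.items():
    print('n={:>5}: epsilon={:.4f}'.format(n, eps))
```

In a Hoeffding Tree, a split is typically made once the difference in merit between the two best candidate attributes exceeds this epsilon, so more data is required when the candidates are close.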

The data corresponds to the output of the `AGRAWALGenerator` with 3 **gradual drifts** at the 5k, 10k and 15k marks.

In [8]:
# Load data
stream = FileStream("agr_a_20k.csv")
# Setup estimators
cfiers = [HoeffdingTreeClassifier(), HoeffdingAdaptiveTreeClassifier()]
# Setup evaluations
eval = EvaluatePrequential(show_plot=True,
                           metrics=['accuracy', 'kappa', 'model_size'],
                           n_wait=100)

Run the evaluation

In [9]:
eval.evaluate(stream=stream, model=cfiers, model_names=['HT', 'HAT']);


Prequential Evaluation
Evaluating 1 target(s).
Pre-training on 200 sample(s).
Evaluating...
 #################### [100%] [21.20s]
Processed samples: 20000
Mean performance:
HT - Accuracy     : 0.7279
HT - Kappa        : 0.4530
HT - Size (kB)          : 175.8711
HAT - Accuracy     : 0.7589
HAT - Kappa        : 0.5159
HAT - Size (kB)          : 110.7939
