# Pescador demo

This notebook illustrates some of the basic functionality of [pescador](https://github.com/bmcfee/pescador), a package that facilitates iterative learning from data streams (implemented as Python generators).

In [1]:
import pescador

import numpy as np
np.set_printoptions(precision=4)
import sklearn
import sklearn.datasets
import sklearn.linear_model
import sklearn.cross_validation
import sklearn.metrics

In [2]:
def data_generator(X, Y, m=20, scale=1e-1):
    '''Generate minibatches of data corrupted by Gaussian noise
    
    Parameters
    ----------
    X : ndarray
        features, n_samples by dimensions
        
    Y : ndarray
        labels, n_samples
        
    m : int
        size of the minibatches to generate
        
    scale : float > 0
        scale of the noise to add
        
    Yields
    ------
    batch : dict
        An infinite stream of batch dictionaries, each holding m samples
        drawn with replacement and perturbed by Gaussian noise:
        batch = dict(X=X[i] + noise, Y=Y[i])
    '''
    
    X = np.atleast_2d(X)
    Y = np.atleast_1d(Y)

    
    n, d = X.shape
    
    while True:
        i = np.random.randint(0, n, size=m)
        
        noise = scale * np.random.randn(m, d)
        
        yield {'X': X[i] + noise, 'Y': Y[i]}

In [3]:
# Load up the iris dataset for the demo
data = sklearn.datasets.load_iris()
X, Y = data.data, data.target
classes = np.unique(Y)

In [4]:
# What does the data stream look like?

# First, we'll wrap the generator function in a Streamer object.
# This is necessary for a few reasons, notably so that we can re-instantiate
# the generator multiple times (e.g., once per epoch).

stream = pescador.Streamer(data_generator, X, Y)

# The buffer_batch() function takes a batch stream as input and
# carves it into batches of up to buffer_size (3, in this case) samples.
# The buffer size can be larger or smaller than the native size of the input batches.
for q in pescador.buffer_batch(stream.generate(max_batches=1), 3):
    print(q)

{'Y': array([0, 2, 0]), 'X': array([[ 5.111 ,  3.4651,  1.5395,  0.1581],
       [ 7.7168,  3.7435,  6.7296,  2.2304],
       [ 4.804 ,  3.6848,  1.384 ,  0.168 ]])}
{'Y': array([0, 0, 1]), 'X': array([[ 5.3667,  3.8719,  1.4993,  0.1568],
       [ 4.5084,  3.2853,  1.3485,  0.1768],
       [ 5.5969,  3.1929,  4.0782,  1.1483]])}
{'Y': array([0, 1, 2]), 'X': array([[ 4.4934,  3.08  ,  1.402 ,  0.3167],
       [ 6.9038,  2.8601,  4.716 ,  1.3107],
       [ 6.8751,  3.119 ,  5.1122,  2.197 ]])}
{'Y': array([2, 2, 1]), 'X': array([[ 6.391 ,  2.9923,  5.3609,  1.7917],
       [ 6.3414,  3.3781,  5.4042,  2.3355],
       [ 5.7415,  2.6693,  4.0282,  1.3673]])}
{'Y': array([2, 0, 1]), 'X': array([[ 7.5801,  2.9956,  6.0487,  2.388 ],
       [ 5.0181,  3.427 ,  1.3916,  0.2538],
       [ 5.4109,  3.0184,  4.5518,  1.4583]])}
{'Y': array([1, 2, 0]), 'X': array([[ 6.672 ,  3.1787,  4.5161,  1.4835],
       [ 7.045 ,  2.9664,  5.8848,  1.5716],
       [ 5.6515,  3.8174,  1.7312,  0.1626]])}
...
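The two ideas used in this cell, a restartable stream and re-buffering, can be sketched in plain Python. This is an illustration only, not pescador's implementation: `SimpleStreamer`, `rebuffer`, and `counting_batches` are hypothetical names introduced here, and the batches hold Python lists rather than arrays so that the example has no dependencies.

```python
import itertools


class SimpleStreamer:
    """Hypothetical stand-in for a Streamer: remembers how to build a generator."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def generate(self, max_batches=None):
        # Calling func again creates a brand-new generator each time,
        # so the stream can be restarted (for example, once per epoch).
        fresh = self.func(*self.args, **self.kwargs)
        return itertools.islice(fresh, max_batches)


def rebuffer(batches, buffer_size):
    """Carve a stream of batch dicts into batches of at most buffer_size samples."""
    pending = {}
    for batch in batches:
        for key, values in batch.items():
            pending.setdefault(key, []).extend(values)
        # Emit full-sized batches for as long as enough samples are queued
        while pending and min(len(v) for v in pending.values()) >= buffer_size:
            yield {k: v[:buffer_size] for k, v in pending.items()}
            pending = {k: v[buffer_size:] for k, v in pending.items()}
    # Flush whatever is left as a final, smaller batch
    if pending and any(len(v) > 0 for v in pending.values()):
        yield dict(pending)


def counting_batches(m=20):
    """Toy generator: endless batches of m consecutive integers and their parities."""
    start = 0
    while True:
        xs = list(range(start, start + m))
        yield {'X': xs, 'Y': [x % 2 for x in xs]}
        start += m


stream = SimpleStreamer(counting_batches, m=20)
sizes = [len(b['X']) for b in rebuffer(stream.generate(max_batches=1), 3)]
print(sizes)  # [3, 3, 3, 3, 3, 3, 2]
```

In this sketch, one input batch of 20 samples comes out as six batches of 3 followed by a final batch of 2. Each call to `generate()` starts from a fresh generator, which is the property that makes a wrapper object more convenient than a bare generator.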

# Benchmarking
We can benchmark our learner's efficiency by running a couple of experiments on the Iris dataset.

Our classifier will be L1-regularized logistic regression.
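For reference, the objective that `SGDClassifier` minimizes with `loss='log'` and `penalty='l1'` can be written as follows for a single binary problem (scikit-learn handles the three Iris classes one-vs-all). The symbols $w$, $b$, and $\alpha$ are notation introduced here, with labels $y_i \in \{-1, +1\}$ and $\alpha$ the regularization strength:

$$
\min_{w, b} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(w^\top x_i + b\right)\right)\right) + \alpha \lVert w \rVert_1
$$

The L1 term encourages sparse weight vectors.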

In [6]:
%%time
for train, test in sklearn.cross_validation.ShuffleSplit(len(X),
                                                         n_iter=2,
                                                         test_size=0.2):
    
    # Make an SGD learner, nothing fancy here
    classifier = sklearn.linear_model.SGDClassifier(verbose=0, 
                                                    loss='log',
                                                    penalty='l1', 
                                                    n_iter=1)
    
    # Make a streamable wrapper
    model = pescador.StreamLearner(classifier)
    
    # Again, build a streamer object
    stream = pescador.Streamer(data_generator, X[train], Y[train])
    
    samples = stream.generate(max_batches=5e3)
    
    # And train the model on the stream.
    # iter_fit() works just like partial_fit(), except that the input is a generator.
    model.iter_fit(samples, classes=classes)
    
    # How's it do on the test set?
    print('Test-set accuracy: {:.3f}'.format(sklearn.metrics.accuracy_score(Y[test], model.predict(X[test]))))
    print('# Steps: ' + str(model.estimator.t_))

Test-set accuracy: 1.000
# Steps: 100001.0
Test-set accuracy: 0.967
# Steps: 100001.0
CPU times: user 3.61 s, sys: 830 µs, total: 3.61 s
Wall time: 3.62 s
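The comment in the cell above says that `iter_fit()` behaves like `partial_fit()` applied to a generator. A minimal sketch of that idea follows. It is not pescador's code: `iter_fit_sketch`, `CountingEstimator`, and `toy_batches` are hypothetical names, and the toy estimator only counts samples so that the example runs without scikit-learn.

```python
import itertools


def iter_fit_sketch(estimator, batches, **kwargs):
    """Feed each batch from a generator to estimator.partial_fit()."""
    for batch in batches:
        # Extra keyword arguments (for example classes=...) are passed through
        estimator.partial_fit(batch['X'], batch['Y'], **kwargs)
    return estimator


class CountingEstimator:
    """Hypothetical toy estimator that only counts the samples it has seen."""

    def __init__(self):
        self.n_seen = 0
        self.classes_ = None

    def partial_fit(self, X, Y, classes=None):
        if classes is not None:
            self.classes_ = list(classes)
        self.n_seen += len(X)
        return self


def toy_batches(m=20):
    """Endless stream of batches holding m dummy samples each."""
    while True:
        yield {'X': [[0.0]] * m, 'Y': [0] * m}


model = iter_fit_sketch(CountingEstimator(),
                        itertools.islice(toy_batches(), 5000),
                        classes=[0, 1, 2])
print(model.n_seen)  # 100000
```

With 5000 batches of 20 samples each, the toy estimator sees 100000 samples. That would account for the 100001 steps reported above if the SGD step counter starts at 1.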


# Parallelism

The learner and the data generator rarely run at the same speed. If the data generator has higher latency than the learner (here, SGDClassifier), the learner sits idle waiting for data, which slows down training.

Pescador uses ZeroMQ to run data stream generation in a separate process, effectively decoupling it from the learner.

In [7]:
%%time
for train, test in sklearn.cross_validation.ShuffleSplit(len(X), n_iter=2, test_size=0.2):
    
    # Make an SGD learner, nothing fancy here
    classifier = sklearn.linear_model.SGDClassifier(verbose=0, 
                                                    loss='log',
                                                    penalty='l1', 
                                                    n_iter=1)
    
    # Make a streamable wrapper
    model = pescador.StreamLearner(classifier)
    
    # First, turn the data_generator function into a Streamer object
    stream = pescador.Streamer(data_generator, X[train], Y[train])
    
    # Then, send this stream to a second process
    zmq_stream = pescador.zmq_stream(5156, stream, max_batches=5e3)
    
    # Optionally, run the output through a second buffer for mini-batch training
    # (disabled here: the ZMQ stream is passed to the learner as-is)
    #samples = pescador.buffer_batch(zmq_stream, 20)
    samples = zmq_stream
    
    # And fit on the stream
    model.iter_fit(samples, classes=classes)
    
    # How's it do on the test set?
    print('Test-set accuracy: {:.3f}'.format(sklearn.metrics.accuracy_score(Y[test], model.predict(X[test]))))
    print('# Steps: ' + str(model.estimator.t_))

Test-set accuracy: 0.867
# Steps: 9081.0
Test-set accuracy: 0.933
# Steps: 8461.0
CPU times: user 526 ms, sys: 56.1 ms, total: 582 ms
Wall time: 520 ms
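The decoupling described in this section can be illustrated with the standard library alone. The sketch below deliberately swaps in a background thread and an in-process `queue.Queue` in place of ZeroMQ and a second process, so it shows only the producer/consumer pattern, not pescador's transport. `background_stream` and `toy_batches` are hypothetical names, and the code is written for Python 3.

```python
import queue
import threading


def background_stream(make_generator, max_batches, maxsize=8):
    """Yield up to max_batches items produced by make_generator() in a thread."""
    q = queue.Queue(maxsize=maxsize)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for n, batch in enumerate(make_generator()):
            if n >= max_batches:
                break
            q.put(batch)  # blocks while the queue is full
        q.put(done)

    thread = threading.Thread(target=producer)
    thread.daemon = True  # do not keep the interpreter alive for the producer
    thread.start()

    # The consumer pulls batches as they become available
    while True:
        item = q.get()
        if item is done:
            break
        yield item


def toy_batches():
    """Endless stream of one-sample batches."""
    i = 0
    while True:
        yield {'X': [i], 'Y': [i % 2]}
        i += 1


received = list(background_stream(toy_batches, max_batches=5))
print(len(received))  # 5
```

The bounded queue lets the producer work ahead of the consumer by at most `maxsize` batches. A thread does not give true CPU parallelism for pure-Python generators, which is one reason to move the generator to a separate process as the cell above does.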
