## Introduction
---

Most machine learning algorithms are batch learners, meaning they build a model by learning from the entire dataset at once. Not surprisingly, these algorithms are the best known and most widely used. However, there is another class of algorithms known as online learning algorithms. Instead of learning from the entire dataset at once, they consume data sequentially as it becomes available. Said another way, online learning is a way to update a model incrementally, in near real time, as new data arrives. For a more detailed discussion, see [Online Machine Learning](https://en.wikipedia.org/wiki/Online_machine_learning).
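
To make the idea concrete, here is a minimal sketch of an online learner written in plain Python rather than Scikit-learn. It fits a one-feature linear model with stochastic gradient descent, updating the weights after every single observation. The stream, the learning rate, and the function names are all made up for illustration.

```python
import random

def online_sgd(stream, learning_rate=0.05):
    """Fit y ~ w * x + b by updating on one observation at a time."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y          # prediction error on the newest observation
        w -= learning_rate * error * x   # gradient step for squared loss
        b -= learning_rate * error
    return w, b

# Hypothetical stream: noisy samples of y = 2x + 1, generated lazily
rng = random.Random(0)

def make_stream(n):
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)

w, b = online_sgd(make_stream(5000))
print("w = %.2f, b = %.2f" % (w, b))
```

Because the stream is a generator, no more than one observation is held in memory at a time, which is the property that distinguishes this from a batch fit.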

Both batch and online learning have advantages and disadvantages. This should come as no surprise to anyone who has dabbled in machine learning. Rarely if ever do you get something for nothing; there are always tradeoffs. For a discussion on the pros and cons, see [this](https://www.quora.com/What-are-the-pros-and-cons-of-offline-vs-online-learning) Quora post.

## Implementation
---
Now that we've got that out of the way, let's walk through how to implement an online learner in Scikit-learn. First, let's introduce the datasets.

1. [Iris dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris)  

2. [Boston dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston)

The Iris dataset is a famous, canonical classification dataset. See the docs for details. The Boston dataset is a canonical regression dataset. 

*Note: The Iris dataset has 3 classes, which is not ideal. A binary problem would make for a cleaner introduction to online learning, but that would mean sourcing a binary dataset and likely preprocessing it. Three classes require a One-Versus-All (OVA) approach, but I wanted to keep the data ingestion process as streamlined and simple as possible, which is why both datasets are loaded from Scikit-learn.*


Without further ado, see the code below for an implementation of online learning in Scikit-learn.

## Python Version
---

In [1]:
!python --version

Python 3.5.3 :: Anaconda custom (x86_64)


## Libraries & Versions
---

In [2]:
import cpuinfo
import sklearn
from sklearn.datasets import load_iris, load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier, SGDRegressor
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.preprocessing import StandardScaler

# Output library versions
items = [("Sklearn", sklearn)]
for item in items:
    print(item[0] + " version: " + str(item[1].__version__))

Sklearn version: 0.18.2


## Hardware Specs
---
The hardware specifications of my 2015 Mac are included for comparison purposes. Performance is a function of these specs, which processes are running in the background, and software implementation details. 

In [3]:
# if not installed, type: python -m pip install -U py-cpuinfo

info = cpuinfo.get_cpu_info()
entries = ('flags', 'count', 'cpuinfo_version', 'family', 'hz_actual', 'hz_actual_raw', 'hz_advertised',
           'hz_advertised_raw', 'model', 'raw_arch_string', 'stepping')
for key in entries:
    if key in info:
        del info[key]
info

{'arch': 'X86_64',
 'bits': 64,
 'brand': 'Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz',
 'l2_cache_size': '256',
 'vendor_id': 'GenuineIntel'}

## Load Data
---
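Here we load the Iris and Boston datasets. Scikit-learn packages each dataset as a dictionary-like object, making it trivial to pull out the features (`data`) and labels (`target`). See the docs for details. Note that `load_boston` was removed in Scikit-learn 1.2, so this cell needs an older release such as the 0.18.2 used here.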

In [4]:
# Classification
iris = load_iris()

# Regression
boston = load_boston()

## Train & Test Split
---
Each dataset is split using a test size of 20%. A random seed is provided for reproducibility.

In [5]:
# iris dataset
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(iris.data, 
                                                                        iris.target, 
                                                                        test_size=0.2, 
                                                                        random_state=42)

# boston dataset
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(boston.data,
                                                                                boston.target,
                                                                                test_size=0.2, 
                                                                                random_state=42)

## Instantiate Models
---

In [6]:
# Classification
svm = SGDClassifier(loss='hinge', 
                    penalty='l2', 
                    alpha=0.1, 
                    fit_intercept=False, 
                    n_iter=5, 
                    shuffle=True, 
                    verbose=1, 
                    n_jobs=1, 
                    random_state=19, 
                    learning_rate='optimal', 
                    class_weight='balanced')

logistic = SGDClassifier(loss='log', 
                         penalty='l2', 
                         alpha=0.1, 
                         fit_intercept=False, 
                         n_iter=5, 
                         shuffle=True, 
                         verbose=1, 
                         n_jobs=1,
                         random_state=19, 
                         learning_rate='optimal', 
                         class_weight='balanced')

# Regression
ols = SGDRegressor(loss='squared_loss', 
                   penalty='l2', 
                   alpha=0.0001, 
                   fit_intercept=False, 
                   n_iter=5, 
                   shuffle=True, 
                   verbose=1, 
                   random_state=42, 
                   learning_rate='invscaling', 
                   eta0=0.01, 
                   power_t=0.5)

robust = SGDRegressor(loss='huber', 
                   penalty='l2', 
                   alpha=0.0001, 
                   fit_intercept=False, 
                   n_iter=5, 
                   shuffle=True, 
                   verbose=1, 
                   epsilon=0.1, 
                   random_state=42, 
                   learning_rate='invscaling', 
                   eta0=0.01, 
                   power_t=0.5)

## Standardize Data
---
Our online learners are trained with stochastic gradient descent, which is sensitive to feature scale, so it's best to standardize the data to mean 0 and standard deviation 1. Make sure to fit the scaler on the training set only, then use that fit to transform the test set. A common mistake is to standardize before splitting into train and test, which leaks information from the test set into training.

In [7]:
# iris
sc_iris = StandardScaler()
X_train_iris = sc_iris.fit_transform(X_train_iris)
X_test_iris = sc_iris.transform(X_test_iris)

# boston
sc_boston = StandardScaler()
X_train_boston = sc_boston.fit_transform(X_train_boston)
X_test_boston = sc_boston.transform(X_test_boston)
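
In a truly streaming setting the training set is never available all at once, so the mean and variance have to be tracked incrementally. The sketch below does this with Welford's algorithm in plain Python. `RunningScaler` is a hypothetical helper written for this example, not part of Scikit-learn (whose `StandardScaler` offers `partial_fit` for the same purpose), and the rows are made-up numbers.

```python
import math

class RunningScaler:
    """Track per-feature mean and variance one row at a time (Welford's algorithm)."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations

    def update(self, row):
        self.n += 1
        for j, value in enumerate(row):
            delta = value - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (value - self.mean[j])

    def transform(self, row):
        scaled = []
        for j, value in enumerate(row):
            # Population standard deviation, matching StandardScaler's convention
            std = math.sqrt(self.m2[j] / self.n) if self.n > 0 else 0.0
            scaled.append((value - self.mean[j]) / std if std > 0 else 0.0)
        return scaled

# Hypothetical training rows; the test row never touches update()
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test_row = [4.0, 40.0]

scaler = RunningScaler(n_features=2)
for row in train_rows:
    scaler.update(row)

scaled_test = scaler.transform(test_row)
print(scaler.mean)
print(scaled_test)
```

As with the batch version, only training rows update the statistics; test rows are transformed but never fed to `update`, which avoids the leakage described above.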

## Train w/Online Learning Model: Classification
---
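The Iris dataset has 3 classes, so Scikit-learn takes a One-Versus-All approach and fits three binary classifiers, one per class. The verbose output therefore shows 3 runs, each with the number of epochs set above (`n_iter=5`). Note that `fit` makes repeated passes over data that is already in memory; for data that arrives in chunks, the same estimators also provide `partial_fit`.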
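
As a rough illustration of what One-Versus-All means, the sketch below relabels a multiclass target into one binary target per class and picks the class whose binary model scores highest. The helper names, labels, and decision scores are hypothetical; Scikit-learn does the equivalent internally.

```python
def one_vs_all_targets(y, classes):
    """Return one binary target vector per class: +1 for that class, -1 for the rest."""
    return {c: [1 if label == c else -1 for label in y] for c in classes}

def predict_one_vs_all(scores_by_class):
    """Pick the class whose binary model produced the largest decision score."""
    return max(scores_by_class, key=scores_by_class.get)

# Hypothetical labels using the same 0/1/2 encoding as iris.target
y = [0, 2, 1, 1, 0, 2]
binary_targets = one_vs_all_targets(y, classes=[0, 1, 2])
for c, targets in binary_targets.items():
    print("class %d vs rest:" % c, targets)

# Hypothetical decision scores from the three binary models for one sample
predicted = predict_one_vs_all({0: -1.3, 1: 0.4, 2: -0.2})
print("predicted class:", predicted)
```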

### [1] Support Vector Machines: Output = Class ID

In [8]:
svm.fit(X_train_iris, y_train_iris)

-- Epoch 1
Norm: 1.23, NNZs: 4, Bias: 0.000000, T: 120, Avg. loss: 0.077963
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 1.32, NNZs: 4, Bias: 0.000000, T: 240, Avg. loss: 0.083950
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 1.28, NNZs: 4, Bias: 0.000000, T: 360, Avg. loss: 0.083947
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 1.31, NNZs: 4, Bias: 0.000000, T: 480, Avg. loss: 0.083557
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 1.30, NNZs: 4, Bias: 0.000000, T: 600, Avg. loss: 0.083154
Total training time: 0.00 seconds.
-- Epoch 1
Norm: 1.02, NNZs: 4, Bias: 0.000000, T: 120, Avg. loss: 0.899049
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 0.75, NNZs: 4, Bias: 0.000000, T: 240, Avg. loss: 0.822984
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 0.89, NNZs: 4, Bias: 0.000000, T: 360, Avg. loss: 0.797535
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 0.92, NNZs: 4, Bias: 0.000000, T: 480, Avg. loss: 0.781262
Total training time: 0.00 seconds.
-

[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished


SGDClassifier(alpha=0.1, average=False, class_weight='balanced', epsilon=0.1,
       eta0=0.0, fit_intercept=False, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=19, shuffle=True, verbose=1,
       warm_start=False)

Let's see how the SVM performs from an accuracy perspective.

In [9]:
svm.score(X_test_iris, y_test_iris)

0.90000000000000002

Not too shabby!

### [2] Logistic Regression: Output = Class Probability

In [10]:
logistic.fit(X_train_iris, y_train_iris)

-- Epoch 1
Norm: 1.38, NNZs: 4, Bias: 0.000000, T: 120, Avg. loss: 0.181071
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 1.40, NNZs: 4, Bias: 0.000000, T: 240, Avg. loss: 0.180590
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 1.40, NNZs: 4, Bias: 0.000000, T: 360, Avg. loss: 0.179628
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 1.40, NNZs: 4, Bias: 0.000000, T: 480, Avg. loss: 0.179206
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 1.40, NNZs: 4, Bias: 0.000000, T: 600, Avg. loss: 0.178853
Total training time: 0.00 seconds.
-- Epoch 1
Norm: 0.71, NNZs: 4, Bias: 0.000000, T: 120, Avg. loss: 0.649057
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 0.61, NNZs: 4, Bias: 0.000000, T: 240, Avg. loss: 0.627146
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 0.70, NNZs: 4, Bias: 0.000000, T: 360, Avg. loss: 0.619320
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 0.67, NNZs: 4, Bias: 0.000000, T: 480, Avg. loss: 0.615505
Total training time: 0.00 seconds.
-

[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished


SGDClassifier(alpha=0.1, average=False, class_weight='balanced', epsilon=0.1,
       eta0=0.0, fit_intercept=False, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=19, shuffle=True, verbose=1,
       warm_start=False)

In [11]:
class_models = (('svm:     ', svm), ('logistic:', logistic))
for name, model in class_models:
    # Accuracy on the held-out test set
    print(name, model.score(X_test_iris, y_test_iris))
    # Only the log-loss model exposes predict_proba; the hinge-loss model does not
    if model is logistic:
        print("\nlogistic log loss: %.2f" % log_loss(y_test_iris, model.predict_proba(X_test_iris)))
print("svm cannot output log loss")

svm:      0.9
logistic: 0.9

logistic log loss: 0.62
svm cannot output log loss


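**Log loss** is often a more informative measure of performance than accuracy because it penalizes confident wrong predictions and rewards well-calibrated probabilities. It requires class probabilities, though, and of the two models here only Logistic Regression provides them: `SGDClassifier` with hinge loss has no `predict_proba`.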
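
To see why log loss can separate models that accuracy cannot, here is a small hand-rolled version of multiclass log loss. The function and the probability tables are illustrative assumptions, not output from the models above. Both sets of predictions are 100% accurate, yet their log loss differs substantially.

```python
import math

def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

# Hypothetical predictions for three samples over classes 0, 1, 2
y_true = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

confident_loss = multiclass_log_loss(y_true, confident)
hesitant_loss = multiclass_log_loss(y_true, hesitant)
print("confident: %.3f" % confident_loss)
print("hesitant:  %.3f" % hesitant_loss)
```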

Keep in mind that we only made 5 passes over the data. We could likely drive the log loss down with more passes, though we need to take care not to overfit.

## Train w/Online Learning Model: Regression
---
We'll train a squared-loss learner, which approximates Ordinary Least Squares (OLS) given the very small L2 penalty, and a Robust learner that uses [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) on the Boston dataset. Once fit, we'll assess each model by its Root Mean Squared Error (RMSE) on the test set.
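
For intuition, the sketch below writes out the two loss functions and RMSE in plain Python. It uses the standard Huber definition with the same `epsilon=0.1` as the model above; the function names and residual values are chosen only for illustration.

```python
def squared_loss(residual):
    return 0.5 * residual ** 2

def huber_loss(residual, epsilon=0.1):
    """Quadratic for small residuals, linear once |residual| exceeds epsilon."""
    r = abs(residual)
    if r <= epsilon:
        return 0.5 * r ** 2
    return epsilon * r - 0.5 * epsilon ** 2

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

# Illustrative residuals: one inside the quadratic zone, two well outside it
for residual in (0.05, 1.0, 10.0):
    print(residual, squared_loss(residual), huber_loss(residual))
```

Past `epsilon`, the Huber loss grows linearly, so a single large residual contributes far less than it would under squared loss. With `epsilon=0.1` and a target measured in the tens, nearly every residual falls in the linear regime.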

### [1] Ordinary Least Squares

In [12]:
ols.fit(X_train_boston, y_train_boston)

-- Epoch 1
Norm: 3.29, NNZs: 13, Bias: 0.000000, T: 404, Avg. loss: 285.922195
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 3.74, NNZs: 13, Bias: 0.000000, T: 808, Avg. loss: 281.652235
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 3.98, NNZs: 13, Bias: 0.000000, T: 1212, Avg. loss: 279.705793
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 4.15, NNZs: 13, Bias: 0.000000, T: 1616, Avg. loss: 278.524070
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 4.32, NNZs: 13, Bias: 0.000000, T: 2020, Avg. loss: 277.698078
Total training time: 0.00 seconds.


SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=False, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=5, penalty='l2', power_t=0.5,
       random_state=42, shuffle=True, verbose=1, warm_start=False)

### [2] Robust Learner

In [13]:
robust.fit(X_train_boston, y_train_boston)

-- Epoch 1
Norm: 0.01, NNZs: 13, Bias: 0.000000, T: 404, Avg. loss: 2.274738
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 0.01, NNZs: 13, Bias: 0.000000, T: 808, Avg. loss: 2.274710
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 0.01, NNZs: 13, Bias: 0.000000, T: 1212, Avg. loss: 2.274698
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 0.01, NNZs: 13, Bias: 0.000000, T: 1616, Avg. loss: 2.274691
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 0.01, NNZs: 13, Bias: 0.000000, T: 2020, Avg. loss: 2.274687
Total training time: 0.00 seconds.


SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=False, l1_ratio=0.15, learning_rate='invscaling',
       loss='huber', n_iter=5, penalty='l2', power_t=0.5, random_state=42,
       shuffle=True, verbose=1, warm_start=False)

## Compare RMSE
---

In [14]:
reg_models = (('ols:   ', ols), ('robust:', robust))
for name, model in reg_models:
    print(name, mean_squared_error(y_test_boston, model.predict(X_test_boston)) ** 0.5)

ols:    23.5464882574
robust: 23.1325721384


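The robust learner's RMSE is slightly lower, but the gap is small and both values are poor: an RMSE of about 23 is roughly the size of the target itself (Boston median home values average in the low 20s, in thousands of dollars). A likely culprit is `fit_intercept=False` combined with an uncentered target, which leaves both models unable to learn the offset; the robust learner's weight norm of 0.01 shows it barely moved from zero. This result should therefore not be read as evidence that Huber loss handled outliers better here.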

## Last Thoughts
---
You likely have several questions at this point. Hopefully I can preemptively address some of them here. 

**Q:** *When should I use batch learning and when should I use online learning?*  
**A:** Use batch learning unless: 
1. Your data does not fit into memory 
2. You expect your data to change significantly over time (though retraining over a sliding window can help batch)  
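
For the out-of-memory case, Scikit-learn's SGD estimators can be trained incrementally with `partial_fit`, one chunk at a time. The following is a sketch under a few assumptions: Iris stands in for a dataset too large to load at once, the chunk size of 30 is arbitrary, and `class_weight='balanced'` is left out because it is not supported by `partial_fit`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
classes = np.unique(iris.target)  # partial_fit must be told every label up front

# Shuffle once so each chunk contains a mix of classes, then hold out 30 rows
order = np.random.RandomState(42).permutation(len(iris.target))
X, y = iris.data[order], iris.target[order]
X_stream, y_stream = X[:120], y[:120]
X_holdout, y_holdout = X[120:], y[120:]

scaler = StandardScaler()
model = SGDClassifier(loss='hinge', penalty='l2', alpha=0.1, random_state=19)

chunk_size = 30  # hypothetical; stands in for "as much as fits in memory"
for start in range(0, len(y_stream), chunk_size):
    X_chunk = X_stream[start:start + chunk_size]
    y_chunk = y_stream[start:start + chunk_size]
    scaler.partial_fit(X_chunk)  # update the running mean and variance
    model.partial_fit(scaler.transform(X_chunk), y_chunk, classes=classes)

holdout_accuracy = model.score(scaler.transform(X_holdout), y_holdout)
print("hold-out accuracy: %.2f" % holdout_accuracy)
```

Each call to `partial_fit` performs a single pass over the chunk it is given, so the model sees every streamed row exactly once. Early chunks are scaled with less accurate statistics than later ones, which is one of the tradeoffs of learning in a single pass.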

**Q:** *How do I tune an online learning algorithm?*  
**A:** The answer depends on your use case. For the sake of simplicity, let's assume you're using online learning because your data doesn't fit into memory. In that case, you can follow the standard practice of splitting your data and tuning your hyperparameters with cross-validation. If instead you expect your data to change significantly over time, a fixed split may not be representative. In that case, you can leverage [progressive cross-validation](http://hunch.net/~jl/projects/prediction_bounds/progressive_validation/coltfinal.pdf), in which each observation is used to score the model before the model trains on it, and run multiple models side by side to see which set of hyperparameters is likely to generalize best. If your model is in production, that's a whole different story, and you really need to rely on someone with expertise in these types of algorithms. As an aside, production systems often use a hybrid of batch and online learning rather than relying solely on online learning.
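
Here is a minimal sketch of the progressive ("test-then-train") idea in plain Python, used to compare two candidate learning rates. The tiny model class, the noiseless stream, and the candidate values are all hypothetical and chosen only to keep the example short.

```python
class OnlineLinearModel:
    """Tiny one-feature linear model trained with per-sample SGD."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate
        self.w = 0.0
        self.b = 0.0

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error

def progressive_validation(model, stream):
    """Score each observation before training on it; return mean squared error."""
    total, count = 0.0, 0
    for x, y in stream:
        total += (model.predict(x) - y) ** 2  # test first...
        model.update(x, y)                    # ...then train
        count += 1
    return total / count

# Hypothetical noiseless stream drawn from y = 3x - 1
stream = [(i % 10 / 10.0, 3.0 * (i % 10 / 10.0) - 1.0) for i in range(2000)]

candidates = {lr: OnlineLinearModel(lr) for lr in (0.001, 0.1)}
scores = {lr: progressive_validation(m, stream) for lr, m in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("best learning rate:", best)
```

Because every observation is scored before the model has seen it, the running error behaves like a held-out estimate without setting any data aside.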

**Q:** *What are some good resources for delving deeper into online learning?*  
**A:** Here are several to get you started:
1. [CILVR Lab @ NYU](http://cilvr.cs.nyu.edu/doku.php?id=courses:bigdata:slides:start)  
2. [Online Learning & Stochastic Approximations](http://leon.bottou.org/publications/pdf/online-1998.pdf)
3. [Fractal Analytics Blog](http://blog.fractalanalytics.com/institutionalizing-analytics/online-machine-learning-2/)

## Summary
---
Batch and online learning are two common approaches to machine learning. With batch, sometimes called offline learning, all data is consumed at once to build a model. In contrast, online learning algorithms consume observations one at a time or in small chunks. Batch is more common. However, there are many use cases for online learning. For example, online learning shines when data is too large to fit into memory or when you expect the distribution of your data to drift over time. Online learning is typically very fast, and once data has been consumed, it's no longer needed. Those can of course be great benefits, but like all things in machine learning, you have to make sure your approach makes sense for your use case. To use a cliche, think of online learning as another tool in your toolbox. Knowing when to use it is as important as having it.