### Objectives
* Limitations of basic scikit-learn models
* Introduction to online-learning models
* Applications & Examples
* Limitations of online-learning models

<hr>

### Limitations of basic scikit-learn models
* Once you have trained a model, it cannot be updated incrementally with new data.
* When new data arrives, the model has to be retrained on the old and new data together, which is a very expensive process.

In [3]:
from sklearn.datasets import fetch_california_housing

In [4]:
from sklearn.linear_model import LinearRegression

In [7]:
house_data = fetch_california_housing()

In [10]:
feature_data = house_data.data

In [11]:
target_data = house_data.target

In [15]:
current_feature_data = feature_data[:-5]

In [21]:
current_target_data = target_data[:-5]

In [17]:
current_feature_data.shape

(20635, 8)

In [19]:
new_feature_data = feature_data[-5:]

In [20]:
new_feature_data.shape

(5, 8)

In [22]:
new_target_data = target_data[-5:]

In [32]:
lr = LinearRegression()

In [33]:
lr.fit(current_feature_data, current_target_data)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [34]:
lr.coef_

array([ 4.36698295e-01,  9.44056975e-03, -1.07268396e-01,  6.44960409e-01,
       -3.92823742e-06, -3.78591131e-03, -4.21718526e-01, -4.34859339e-01])

* Now, suppose you receive 5 additional rows of data.
* LinearRegression doesn't support partial training with new data.
* The model has to be retrained with all the data. Calling fit on only the new rows discards the earlier fit.

In [35]:
lr.fit(new_feature_data, new_target_data)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [36]:
lr.coef_

array([-1.25982639e-01, -2.09113671e-02, -1.62817519e-02, -1.52094168e-02,
        6.80123634e-05,  4.74835543e-02,  1.15604694e-02,  3.51442814e-02])

* fit always starts from scratch, so the previously learned parameters are completely forgotten.
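
To keep what was learned from the earlier rows, a batch estimator has to be refit on the old and new rows together. The following is a minimal sketch of that idea. It uses small synthetic arrays rather than the housing data, and the names `X_old`, `X_new`, and `true_coef` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data: y = 3*x0 - 2*x1 + noise
true_coef = np.array([3.0, -2.0])
X_old = rng.normal(size=(1000, 2))
y_old = X_old @ true_coef + rng.normal(scale=0.1, size=1000)
X_new = rng.normal(size=(5, 2))
y_new = X_new @ true_coef + rng.normal(scale=0.1, size=5)

# Stack old and new rows, then refit; fit() never reuses earlier parameters
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])
lr_all = LinearRegression().fit(X_all, y_all)

print(lr_all.coef_)
```

The cost of this approach grows with the full dataset every time new rows arrive, which is the motivation for online learning.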

### Solution: use models that support online learning

https://scikit-learn.org/0.15/modules/scaling_strategies.html

In [26]:
from sklearn.linear_model import SGDRegressor

In [27]:
sgd_lr = SGDRegressor()

In [45]:
sgd_lr.fit(current_feature_data, current_target_data)

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False)

In [46]:
sgd_lr.coef_

array([-1.28080906e+11,  5.00885792e+10,  9.21480665e+10, -2.95903968e+10,
        3.08170644e+10,  6.52210424e+11,  1.43928023e+11, -1.66821245e+11])

In [47]:
sgd_lr.partial_fit(new_feature_data, new_target_data)

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False)

In [48]:
sgd_lr.coef_

array([-1.28682143e+11,  4.67714586e+10,  9.03930465e+10, -2.99531202e+10,
        6.61995660e+10,  6.51548552e+11,  1.33009170e+11, -1.33257558e+11])

Note: when the training data is too large to process in one go, use online-learning models and feed the data in chunks through partial_fit.

### Limitations of Online Learning Models

* Pipeline doesn't support partial_fit, so each step (for example, the scaler and the model) has to be updated manually.

In [49]:
from sklearn.pipeline import make_pipeline

In [53]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer

In [51]:
model = make_pipeline(StandardScaler(), SGDRegressor())

In [52]:
ss = StandardScaler()

In [54]:
cv = CountVectorizer()
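
Since a pipeline cannot be updated incrementally, one possible workaround is to chain the steps by hand: update the scaler with `partial_fit`, transform the batch, then update the model. The sketch below assumes synthetic data arriving in batches of 200 rows. The data, batch size, and variable names are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data)
X = rng.normal(size=(2000, 3)) * np.array([1.0, 100.0, 10000.0])
y = X @ np.array([2.0, 0.03, 0.0005]) + rng.normal(scale=0.1, size=2000)

scaler = StandardScaler()
sgd = SGDRegressor(random_state=0)

# Feed the data in mini-batches, as if it arrived over time
for start in range(0, len(X), 200):
    X_batch, y_batch = X[start:start + 200], y[start:start + 200]
    scaler.partial_fit(X_batch)            # update running mean and variance
    X_scaled = scaler.transform(X_batch)   # scale with the statistics seen so far
    sgd.partial_fit(X_scaled, y_batch)     # one SGD pass over this batch

print(sgd.coef_)
```

Scaling the inputs also tends to keep SGD coefficients in a reasonable range, whereas unscaled features can push them to very large magnitudes.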

### Large Text Classification using Online Learning Models


In [55]:
import pandas as pd



In [65]:
data = pd.read_csv('Data/18_2157_bundle_archive/Reviews.csv', nrows=10000, usecols=['Score','Text'])

In [77]:
data.Score.unique()

array([5, 1, 4, 2, 3])

In [67]:
from sklearn.feature_extraction.text import HashingVectorizer

In [69]:
hv = HashingVectorizer(n_features=1000)

In [70]:
hv.partial_fit(data.Text)

HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
                  decode_error='strict', dtype=<class 'numpy.float64'>,
                  encoding='utf-8', input='content', lowercase=True,
                  n_features=1000, ngram_range=(1, 1), norm='l2',
                  preprocessor=None, stop_words=None, strip_accents=None,
                  token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None)

In [72]:
hv.transform(data.Text[:5]).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [73]:
from sklearn.naive_bayes import MultinomialNB

In [74]:
mnb = MultinomialNB()

In [85]:
data_itr = pd.read_csv('Data/18_2157_bundle_archive/Reviews.csv', chunksize=10000, usecols=['Score','Text'])

In [79]:
import numpy as np

In [86]:
hv = HashingVectorizer(n_features=1000)
mnb = MultinomialNB()

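for data in data_itr:
    # HashingVectorizer is stateless, so partial_fit is a no-op kept for API symmetry
    hv.partial_fit(data.Text)
    feature = hv.transform(data.Text)
    # MultinomialNB needs non-negative features; hashing can produce negative values
    feature = np.abs(feature)
    mnb.partial_fit(feature, data.Score, classes=[1, 2, 3, 4, 5])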

In [87]:
feature = hv.transform(data.Text[:5])
feature = np.abs(feature)
mnb.predict(feature)

array([5, 5, 5, 5, 5])
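
The chunked training loop above can be tried without the CSV file. The sketch below is a self-contained variant that assumes a few toy reviews in place of the real chunks. It also passes `alternate_sign=False` to `HashingVectorizer` as an alternative to taking `np.abs` of the features. The texts and labels are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews standing in for chunks read from a large CSV
chunks = [
    (["great taste love it", "awful stale product"], [5, 1]),
    (["love this great snack", "stale and awful"], [5, 1]),
]

# alternate_sign=False keeps hashed values non-negative, as MultinomialNB requires
hv = HashingVectorizer(n_features=1000, alternate_sign=False)
mnb = MultinomialNB()
all_classes = np.array([1, 2, 3, 4, 5])

for texts, scores in chunks:
    features = hv.transform(texts)  # stateless: no fitting required
    # classes must list every label that can ever appear, even if absent in this chunk
    mnb.partial_fit(features, scores, classes=all_classes)

pred = mnb.predict(hv.transform(["great snack love it", "awful stale"]))
print(pred)
```

Passing the full list of classes matters because a single chunk may not contain every label, and the model cannot add classes later.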

In [88]:
data = pd.read_csv('Data/18_2157_bundle_archive/Reviews.csv', nrows=10000, usecols=['Score','Text'])

In [90]:
data.Score = data.Score.map(lambda v: 0 if v < 3 else 1)

In [92]:
data.Score.value_counts()

1    8478
0    1522
Name: Score, dtype: int64

* The data is imbalanced (8478 rows labeled 1 vs. 1522 labeled 0) and should be balanced before training.
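
One simple way to balance the classes is to downsample the majority class to the size of the minority class. The sketch below shows this with pandas on a small made-up frame. The column names mirror the ones above, but the rows and the 15/85 split are hypothetical. Other strategies, such as sample weights or oversampling, may suit a given dataset better.

```python
import pandas as pd

# Hypothetical frame with the same 0/1 label layout as above
df = pd.DataFrame({
    "Text": [f"review {i}" for i in range(100)],
    "Score": [0] * 15 + [1] * 85,
})

# Size of the smallest class
minority_size = df.Score.value_counts().min()

# Draw an equally sized random sample from every class and recombine
balanced = pd.concat(
    [group.sample(n=minority_size, random_state=0) for _, group in df.groupby("Score")]
)

print(balanced.Score.value_counts())
```

Downsampling discards data, so it is a reasonable first step mainly when the majority class is large enough to spare rows.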