# Large Scale Machine Learning

So far, we were able to load the entire data in memory and make models. But this might not be possible in many real life situations. 

We will now look at how 
* to handle such large-scale data
* do incremental preprocessing and learning
    * `fit()` vs `partial_fit()`
* Combining preprocessing and incremental learning.



## Incremental Learning

The following estimators implement `partial_fit` method;

* Classification:
    * `MultinomialNB`
    * `BernoulliNB`
    * `SGDClassifier`
    * `Perceptron`

* Regression:
    * `SGDRegressor`
* Clustering:    
    * `MiniBatchKMeans`

## `fit()` vs `partial_fit()`

In [1]:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

### 1. Traditional Approach

In [2]:
x, y = make_classification(n_samples=50000, n_features=10,
                            n_classes=3,
                            n_clusters_per_class=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)

In [3]:
clf1 = SGDClassifier(max_iter=1000, tol=0.01)

In [4]:
clf1.fit(x_train, y_train)

In [5]:
train_score = clf1.score(x_train, y_train)
print("Training score: ", train_score)

Training score:  0.8569882352941176


In [6]:
test_score = clf1.score(x_test, y_test)
print("Training score: ", test_score)

Training score:  0.8558666666666667


In [7]:
y_pred = clf1.predict(x_test)
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.89      0.87      0.88      2450
           1       0.76      0.93      0.84      2503
           2       0.95      0.76      0.85      2547

    accuracy                           0.86      7500
   macro avg       0.87      0.86      0.86      7500
weighted avg       0.87      0.86      0.86      7500



### 2. Incremental Approach

In [8]:
import numpy as np
import pandas as pd

train_data = np.concatenate((x_train, y_train[:, np.newaxis]), axis=1)

In [9]:
train_data[0:5]

array([[ 0.74275776,  0.49457031, -1.29812498, -0.58699559, -1.1399507 ,
        -0.22620716, -0.44226899,  0.40878169,  0.12225937, -1.89362435,
         0.        ],
       [-1.02595548, -0.03550852,  0.13379618,  1.24075471, -0.83882959,
        -0.40484922,  0.80618779, -1.06090471,  0.39657459, -1.27431611,
         2.        ],
       [-1.52458453,  1.39315769,  0.12053616,  2.79144218, -0.11238248,
        -0.96563916,  0.92150961, -2.39903613,  0.93287024,  0.29855201,
         0.        ],
       [-0.85912332,  0.97274051, -0.57611902,  0.43891912,  0.56196722,
         0.30308825,  0.82153158, -0.2758012 , -0.19077736, -0.96869707,
         1.        ],
       [ 0.0626458 ,  0.83597797,  1.7728184 ,  1.3542036 ,  0.41813458,
        -0.27372466, -1.5831475 , -1.12042463,  0.30810859, -1.39087474,
         2.        ]])

In [10]:
a = np.asarray(train_data)
np.savetxt("train_data.csv", a, delimiter=",")

In [11]:
clf2 = SGDClassifier(max_iter=1000, tol=0.01)

In [12]:
chunksize = 1000
iter = 1

for train_df in pd.read_csv("train_data.csv", chunksize=chunksize, iterator=True):

    if iter == 1:
        # In the first iteration, we are specifying all possible class labels
        x_train_partial = train_df.iloc[:, 0:10]
        y_train_partial = train_df.iloc[:, 10]
        clf2.partial_fit(x_train_partial, y_train_partial, classes = np.array([0, 1, 2]))

    else:
        x_train_partial = train_df.iloc[:, 0:10]
        y_train_partial = train_df.iloc[:, 10]
        clf2.partial_fit(x_train_partial, y_train_partial)

    print("After iter #", iter)
    print(clf2.coef_)
    print(clf2.intercept_)
    iter +=1

After iter # 1
[[ -8.5769008    7.93797752   5.35265501 -34.9807029   -0.80847561
   -7.62726534  -3.79382593  25.66527429   2.94401444   0.88333419]
 [ -4.30405777   1.08537182   4.56458589  50.11388856  -6.09829687
   53.60437629   3.8224841  -27.25428201 -35.87554794 -12.43421994]
 [ -3.65565886  -4.9161556   10.77939477  -8.64574141   5.08815768
  -23.45998127 -23.75614054   1.53363298  16.73174614  -4.93573183]]
[ -4.41000429 -87.11168151  -1.74028946]
After iter # 2
[[-4.86704380e+00 -3.13991446e+00  3.66689684e+00 -1.79652051e+01
  -1.55220865e+01  3.88722966e+00 -3.99855187e+00  1.49208896e+01
  -4.27729317e+00  9.56908264e+00]
 [-3.81282101e+00  9.55507099e+00 -5.44200696e+00  3.56356990e+01
   3.50755056e-02  3.35361864e+01 -3.83024698e+00 -2.04017458e+01
  -2.21123050e+01 -3.44839816e+00]
 [-1.37242640e+00 -6.22909636e+00 -2.28882221e+00  2.41549001e+00
   9.76274909e+00 -2.21285418e+01 -1.08001346e+00 -6.82282160e+00
   1.66022370e+01 -2.57739003e+00]]
[ -4.52742566 -98.359

In [13]:
test_score = clf2.score(x_test, y_test)
print("Training score: ", test_score)

Training score:  0.8129333333333333




In [14]:
y_pred = clf2.predict(x_test)
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.82      0.85      0.84      2450
           1       0.78      0.81      0.79      2503
           2       0.84      0.78      0.81      2547

    accuracy                           0.81      7500
   macro avg       0.81      0.81      0.81      7500
weighted avg       0.81      0.81      0.81      7500





## Incremental preprocessing example

### `CountVectorizer` vs `HashingVectorizer`
* `CountVectorizer` and `HashingVectorizer` both perform the task of vectorizing text data
* `HashingVectorizer`does't store the resulting vocabulary, therefore it can be used to learn from data that doesn't fit into main memory. Each mini batch is vectorized using `HashingVectorizer` so as to guarantee that that the input space of the vvectorizer has the same dimensionality.


In [15]:
text = ['Russell was raised by his paternal grandparents after his unconventional parents both died young.', 
        'He was discontented living with his grandparents, but enjoyed four happy years at Winchester College.',
        'His academic education came to a sudden end when he was sent down from Balliol College, Oxford, probably because authorities there had suspicions concerning the nature of his relationship with the future poet Lionel Johnson.',
        'He always bitterly resented his treatment by Oxford.']

#### `CountVectorizer`

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
c_vectorizer = CountVectorizer()

In [17]:
X_c = c_vectorizer.fit_transform(text)

In [18]:
X_c.shape

(4, 56)

In [19]:
c_vectorizer.vocabulary_

{'russell': 41,
 'was': 50,
 'raised': 38,
 'by': 10,
 'his': 27,
 'paternal': 35,
 'grandparents': 23,
 'after': 1,
 'unconventional': 49,
 'parents': 34,
 'both': 8,
 'died': 14,
 'young': 55,
 'he': 26,
 'discontented': 15,
 'living': 30,
 'with': 53,
 'but': 9,
 'enjoyed': 19,
 'four': 20,
 'happy': 25,
 'years': 54,
 'at': 3,
 'winchester': 52,
 'college': 12,
 'academic': 0,
 'education': 17,
 'came': 11,
 'to': 47,
 'sudden': 43,
 'end': 18,
 'when': 51,
 'sent': 42,
 'down': 16,
 'from': 21,
 'balliol': 5,
 'oxford': 33,
 'probably': 37,
 'because': 6,
 'authorities': 4,
 'there': 46,
 'had': 24,
 'suspicions': 44,
 'concerning': 13,
 'the': 45,
 'nature': 31,
 'of': 32,
 'relationship': 39,
 'future': 22,
 'poet': 36,
 'lionel': 29,
 'johnson': 28,
 'always': 2,
 'bitterly': 7,
 'resented': 40,
 'treatment': 48}

In [20]:
print(X_c)

  (0, 41)	1
  (0, 50)	1
  (0, 38)	1
  (0, 10)	1
  (0, 27)	2
  (0, 35)	1
  (0, 23)	1
  (0, 1)	1
  (0, 49)	1
  (0, 34)	1
  (0, 8)	1
  (0, 14)	1
  (0, 55)	1
  (1, 50)	1
  (1, 27)	1
  (1, 23)	1
  (1, 26)	1
  (1, 15)	1
  (1, 30)	1
  (1, 53)	1
  (1, 9)	1
  (1, 19)	1
  (1, 20)	1
  (1, 25)	1
  (1, 54)	1
  :	:
  (2, 5)	1
  (2, 33)	1
  (2, 37)	1
  (2, 6)	1
  (2, 4)	1
  (2, 46)	1
  (2, 24)	1
  (2, 44)	1
  (2, 13)	1
  (2, 45)	2
  (2, 31)	1
  (2, 32)	1
  (2, 39)	1
  (2, 22)	1
  (2, 36)	1
  (2, 29)	1
  (2, 28)	1
  (3, 10)	1
  (3, 27)	1
  (3, 26)	1
  (3, 33)	1
  (3, 2)	1
  (3, 7)	1
  (3, 40)	1
  (3, 48)	1


#### `HashingVectorizer`

In [21]:
from sklearn.feature_extraction.text import HashingVectorizer

In [26]:
h_vectorizer = HashingVectorizer(n_features=30)

In [27]:
X_h = h_vectorizer.fit_transform(text)

In [28]:
X_h.shape

(4, 30)

In [29]:
print(X_h[0])

  (0, 1)	-0.2672612419124244
  (0, 2)	-0.2672612419124244
  (0, 3)	-0.2672612419124244
  (0, 4)	0.5345224838248488
  (0, 10)	-0.2672612419124244
  (0, 11)	0.2672612419124244
  (0, 19)	0.0
  (0, 20)	0.2672612419124244
  (0, 21)	-0.2672612419124244
  (0, 22)	0.2672612419124244
  (0, 26)	0.2672612419124244
  (0, 28)	0.2672612419124244


## Combining preprocessing and fiting in incremental learning

In [30]:
import pandas as pd
from io import StringIO, BytesIO, TextIOWrapper
from zipfile import ZipFile
import urllib.request

response = urllib.request.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip')
zipfile = ZipFile(BytesIO(response.read()))

data = TextIOWrapper(zipfile.open('sentiment labelled sentences/amazon_cells_labelled.txt'), encoding='utf-8')
df = pd.read_csv(data, sep = '\t')
df.columns = ['review', 'sentiment']

In [31]:
df.head()

Unnamed: 0,review,sentiment
0,"Good case, Excellent value.",1
1,Great for the jawbone.,1
2,Tied to charger for conversations lasting more...,0
3,The mic is great.,1
4,I have to jiggle the plug to get it to line up...,0


In [32]:
df.tail()

Unnamed: 0,review,sentiment
994,The screen does get smudged easily because it ...,0
995,What a piece of junk.. I lose more calls on th...,0
996,Item Does Not Match Picture.,0
997,The only thing that disappoint me is the infra...,0
998,"You can not answer calls with the unit, never ...",0


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     999 non-null    object
 1   sentiment  999 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [35]:
df.loc[:, 'sentiment'].unique()

array([1, 0], dtype=int64)

In [36]:
from sklearn.model_selection import train_test_split
X = df.loc[:, 'review']
y = df.loc[:,'sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [37]:
X_train.shape

(799,)

In [38]:
vectorizer = HashingVectorizer()

In [39]:
classifier = SGDClassifier(penalty='l2', loss='hinge')

### Iteration 1 of `partial_fit()`

In [40]:
X_train_part1_hashed = vectorizer.fit_transform(X_train[0:400])
y_train_part1 = y_train[0:400]

In [41]:
all_classes = np.unique(df.loc[:, 'sentiment'])

In [42]:
classifier.partial_fit(X_train_part1_hashed, y_train_part1, classes = all_classes)

In [43]:
X_test_hashed = vectorizer.transform(X_test)

In [44]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

Test score:  0.745


### Iteration 2 of `partial_fit()`

In [47]:
X_train_part2_hashed = vectorizer.fit_transform(X_train[400:])
y_train_part2 = y_train[400:]

In [48]:
classifier.partial_fit(X_train_part2_hashed, y_train_part2)

In [49]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

Test score:  0.705
