# Large Scale Machine Learning

So far, we were able to load the entire dataset into memory and train models on it. In many real-life situations this is not possible because the data is too large to fit in memory.

We will now look at how to
* handle such large-scale data
* do incremental preprocessing and learning
    * `fit()` vs `partial_fit()`
* combine preprocessing and incremental learning.



## Incremental Learning

The following estimators, among others, implement the `partial_fit` method:

* Classification:
    * `MultinomialNB`
    * `BernoulliNB`
    * `SGDClassifier`
    * `Perceptron`

* Regression:
    * `SGDRegressor`
* Clustering:    
    * `MiniBatchKMeans`
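
Whether a given estimator supports incremental learning can be checked by looking for a `partial_fit` attribute. The sketch below does this for the estimators listed above; `LogisticRegression` is added only as an assumed counterexample of an estimator that offers `fit()` but not `partial_fit()`.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import (LogisticRegression, Perceptron,
                                  SGDClassifier, SGDRegressor)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Estimators from the list above, plus LogisticRegression as a counterexample
estimators = [MultinomialNB(), BernoulliNB(), SGDClassifier(), Perceptron(),
              SGDRegressor(), MiniBatchKMeans(), LogisticRegression()]

# Map each estimator's class name to whether it exposes partial_fit
supports_partial_fit = {type(est).__name__: hasattr(est, "partial_fit")
                        for est in estimators}

for name, supported in supports_partial_fit.items():
    print(f"{name:20s} partial_fit: {supported}")
```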

## `fit()` vs `partial_fit()`

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

### 1. Traditional Approach

In [None]:
x, y = make_classification(n_samples=50000, n_features=10,
                            n_classes=3,
                            n_clusters_per_class=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)

In [None]:
clf1 = SGDClassifier(max_iter=1000, tol=0.01)

In [None]:
clf1.fit(x_train, y_train)

In [None]:
train_score = clf1.score(x_train, y_train)
print("Training score: ", train_score)

In [None]:
test_score = clf1.score(x_test, y_test)
print("Test score: ", test_score)

In [None]:
y_pred = clf1.predict(x_test)
cr = classification_report(y_test, y_pred)
print(cr)

### 2. Incremental Approach

In [None]:
import numpy as np
import pandas as pd

train_data = np.concatenate((x_train, y_train[:, np.newaxis]), axis=1)

In [None]:
train_data[0:5]

In [None]:
a = np.asarray(train_data)
np.savetxt("train_data.csv", a, delimiter=",")

In [None]:
clf2 = SGDClassifier(max_iter=1000, tol=0.01)

In [None]:
chunksize = 1000
batch_num = 1

# header=None: the CSV written by np.savetxt has no header row, so the first
# line must be read as data rather than as column names
for train_df in pd.read_csv("train_data.csv", chunksize=chunksize, header=None):
    # The first 10 columns are features; the last column is the label
    x_train_partial = train_df.iloc[:, 0:10]
    y_train_partial = train_df.iloc[:, 10]

    if batch_num == 1:
        # The first call must specify all possible class labels
        clf2.partial_fit(x_train_partial, y_train_partial, classes=np.array([0, 1, 2]))
    else:
        clf2.partial_fit(x_train_partial, y_train_partial)

    print("After batch #", batch_num)
    print(clf2.coef_)
    print(clf2.intercept_)
    batch_num += 1

In [None]:
test_score = clf2.score(x_test, y_test)
print("Test score: ", test_score)

In [None]:
y_pred = clf2.predict(x_test)
cr = classification_report(y_test, y_pred)
print(cr)
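
Each call to `partial_fit()` performs a single pass over the batch it receives, so the loop above sees every training sample exactly once, whereas `fit()` may run up to `max_iter` epochs. The following is a minimal sketch of making several passes over the data with `partial_fit()`. It assumes an in-memory synthetic dataset standing in for the chunked CSV, and `n_epochs` and `batch_size` are illustrative values rather than tuned ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for data that would normally be streamed from disk
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_classes=3, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
n_epochs = 3      # illustrative number of passes over the data
batch_size = 500  # illustrative mini-batch size

for epoch in range(n_epochs):
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        # Each call updates the existing weights with one pass over this batch
        clf.partial_fit(X[start:stop], y[start:stop], classes=classes)
    print(f"epoch {epoch + 1}: accuracy on the streamed data = {clf.score(X, y):.3f}")
```

With a file on disk, the inner loop would re-open the chunked reader at the start of every epoch instead of slicing an array.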

## Incremental preprocessing example

### `CountVectorizer` vs `HashingVectorizer`
* `CountVectorizer` and `HashingVectorizer` both vectorize text data.
* `HashingVectorizer` doesn't store a vocabulary, so it can be used to learn from data that doesn't fit into main memory. Each mini-batch is vectorized with the same `HashingVectorizer`, which guarantees that every batch is mapped into a feature space of the same dimensionality.
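
The dimensionality point can be illustrated with a small sketch. The two batches below are made-up sentences, and `n_features=2**10` is an arbitrary choice. A `CountVectorizer` fitted separately on each batch produces a different number of columns per batch, while one `HashingVectorizer` maps both batches into the same number of columns without being fitted.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Two made-up mini-batches with different vocabularies
batch_1 = ["the cat sat on the mat", "dogs chase cats"]
batch_2 = ["completely different words"]

# HashingVectorizer is stateless: transform() works without a prior fit()
hasher = HashingVectorizer(n_features=2**10)
h1 = hasher.transform(batch_1)
h2 = hasher.transform(batch_2)

# CountVectorizer learns a vocabulary per fit, so the widths depend on the batch
c1 = CountVectorizer().fit_transform(batch_1)
c2 = CountVectorizer().fit_transform(batch_2)

print("HashingVectorizer shapes:", h1.shape, h2.shape)
print("CountVectorizer shapes:  ", c1.shape, c2.shape)
print("Hasher stores a vocabulary:", hasattr(hasher, "vocabulary_"))
```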


In [None]:
text = ['Russell was raised by his paternal grandparents after his unconventional parents both died young.', 
        'He was discontented living with his grandparents, but enjoyed four happy years at Winchester College.',
        'His academic education came to a sudden end when he was sent down from Balliol College, Oxford, probably because authorities there had suspicions concerning the nature of his relationship with the future poet Lionel Johnson.',
        'He always bitterly resented his treatment by Oxford.']

#### `CountVectorizer`

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
c_vectorizer = CountVectorizer()

In [None]:
X_c = c_vectorizer.fit_transform(text)

In [None]:
X_c.shape

In [None]:
c_vectorizer.vocabulary_

In [None]:
print(X_c)

#### `HashingVectorizer`

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

In [None]:
h_vectorizer = HashingVectorizer(n_features=30)

In [None]:
X_h = h_vectorizer.fit_transform(text)

In [None]:
X_h.shape

In [None]:
print(X_h[0])

## Combining preprocessing and fitting in incremental learning

In [None]:
import pandas as pd
from io import StringIO, BytesIO, TextIOWrapper
from zipfile import ZipFile
import urllib.request

response = urllib.request.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip')
zipfile = ZipFile(BytesIO(response.read()))

data = TextIOWrapper(zipfile.open('sentiment labelled sentences/amazon_cells_labelled.txt'), encoding='utf-8')
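# The file has no header row, so name the columns explicitly instead of
# letting pandas consume the first review as the header
df = pd.read_csv(data, sep='\t', header=None, names=['review', 'sentiment'])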

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.loc[:, 'sentiment'].unique()

In [None]:
from sklearn.model_selection import train_test_split
X = df.loc[:, 'review']
y = df.loc[:,'sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
X_train.shape

In [None]:
vectorizer = HashingVectorizer()

In [None]:
classifier = SGDClassifier(penalty='l2', loss='hinge')

### Iteration 1 of `partial_fit()`

In [None]:
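# HashingVectorizer is stateless, so transform() is all that is needed;
# .iloc selects the first 400 rows by position
X_train_part1_hashed = vectorizer.transform(X_train.iloc[0:400])
y_train_part1 = y_train.iloc[0:400]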

In [None]:
all_classes = np.unique(df.loc[:, 'sentiment'])

In [None]:
classifier.partial_fit(X_train_part1_hashed, y_train_part1, classes = all_classes)

In [None]:
X_test_hashed = vectorizer.transform(X_test)

In [None]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)

### Iteration 2 of `partial_fit()`

In [None]:
# Remaining rows, selected by position and hashed with the same vectorizer
X_train_part2_hashed = vectorizer.transform(X_train.iloc[400:])
y_train_part2 = y_train.iloc[400:]

In [None]:
classifier.partial_fit(X_train_part2_hashed, y_train_part2)

In [None]:
test_score = classifier.score(X_test_hashed, y_test)
print("Test score: ", test_score)
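
The two manual iterations above generalize to a loop over any number of mini-batches. The sketch below shows one way to write that loop. It assumes a tiny made-up corpus in place of the downloaded reviews, and `batch_size` and `n_features` are illustrative values.

```python
import random

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up toy corpus standing in for the review data (1 = positive, 0 = negative)
positive = ["great phone", "works well", "love this case", "excellent battery life"]
negative = ["terrible charger", "stopped working", "poor sound quality", "waste of money"]
samples = [(text, 1) for text in positive] + [(text, 0) for text in negative]
samples = samples * 25
random.Random(0).shuffle(samples)

vectorizer = HashingVectorizer(n_features=2**12)
classifier = SGDClassifier(penalty='l2', loss='hinge', random_state=0)
all_classes = np.array([0, 1])

batch_size = 40
for start in range(0, len(samples), batch_size):
    batch = samples[start:start + batch_size]
    batch_texts = [text for text, _ in batch]
    batch_labels = [label for _, label in batch]
    # Preprocess and learn one mini-batch at a time
    X_batch = vectorizer.transform(batch_texts)
    classifier.partial_fit(X_batch, batch_labels, classes=all_classes)

predictions = classifier.predict(vectorizer.transform(["excellent phone", "poor charger"]))
print(predictions)
```

Because the vectorizer is stateless, only the classifier's weights need to persist between batches, which is why this pattern works for data that never fits in memory at once.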