# Large Scale Machine Learning

Large-scale machine learning differs from traditional ML in that it involves processing a **large amount of data**, measured by its **size** or its **number of samples, features, or classes**.

Although scikit-learn is optimized for smaller datasets, it does offer a decent set of feature preprocessing and learning algorithms (classification, regression, and clustering) for large-scale data.

Scikit-learn handles large data using the `partial_fit()` method instead of the `fit()` method.

> The idea is to process data in **batches** and **update** the model parameters for each batch. This way of learning is referred to as **incremental (or out-of-core) learning**.



### Incremental Learning

This may be required in the following scenarios :
- For **out-of-memory (large) datasets**, where it is not possible to **load the entire data into RAM** at once. One can load the data in chunks and fit the model on each chunk of data
- For ML tasks where new batches of data arrive over time, re-training the model on the previous and the new data combined is a computationally expensive process
    - So, instead of re-training the model on the entire set of data, one can employ an incremental learning approach, where the model parameters are updated with only the new batch of data
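
The second scenario can be sketched roughly as follows. This is a minimal illustration on synthetic data, not a recipe: the batch size of 500, the hold-out split, and the names `X_stream`, `X_holdout`, and `scores` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for batches that arrive over time (assumption).
X, y = make_classification(n_samples=6000, n_features=10, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
X_stream, y_stream = X[:5000], y[:5000]      # data that "arrives" in batches
X_holdout, y_holdout = X[5000:], y[5000:]    # fixed set used only for scoring

all_classes = np.unique(y)                   # every label the model may see
model = SGDClassifier(random_state=0)
scores = []

for start in range(0, len(X_stream), 500):
    X_new = X_stream[start:start + 500]
    y_new = y_stream[start:start + 500]
    # Update the existing parameters with the new batch only;
    # earlier batches are never revisited.
    model.partial_fit(X_new, y_new, classes=all_classes)
    scores.append(model.score(X_holdout, y_holdout))

print(len(scores), scores[-1])
```

Each call touches only 500 samples, so the cost of an update stays roughly constant no matter how much data has already been seen.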
    
    
### Incremental Learning in `sklearn`


To perform incremental learning, scikit-learn provides the `partial_fit()` method on the estimators that support it. It takes the following parameters :

- `X` : feature matrix. Shape : (`n_samples`, `n_features`)
- `y` : label matrix / vector. Shape : (`n_samples`,)
- `classes` : array listing every class that can possibly appear in the `y` vector. **Must be provided at the first call to `partial_fit()`.** Can be omitted in subsequent calls. Shape : (`n_classes`,)
- `sample_weight` : (Optional) array of weights applied to individual samples (1 for unweighted). Shape : (`n_samples`,)
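
To see why `classes` matters on the first call, here is a small sketch in which the first batch happens to contain only two of the three labels. The random data and the names `X_first` and `X_second` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# First batch contains only labels 0 and 1; label 2 appears later.
X_first = rng.normal(size=(20, 4))
y_first = np.array([0, 1] * 10)
X_second = rng.normal(size=(20, 4))
y_second = np.array([1, 2] * 10)

clf = SGDClassifier(random_state=0)

# First call: declare all possible classes up front.
# sample_weight of all ones is equivalent to leaving it out.
clf.partial_fit(X_first, y_first,
                classes=np.array([0, 1, 2]),
                sample_weight=np.ones(len(y_first)))

# Later calls: classes can be omitted.
clf.partial_fit(X_second, y_second)

print(clf.classes_)     # [0 1 2]
print(clf.coef_.shape)  # (3, 4)
```

Because the full label set is declared up front, the model allocates one set of coefficients per class even though label 2 is absent from the first batch.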


**To split the dataset into chunks, we can use the `chunksize` parameter of `pd.read_csv()`.**
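
As a quick illustration of how `chunksize` behaves, the sketch below reads a tiny in-memory CSV (10 rows, no header) four rows at a time. The data and the `io.StringIO` stand-in for a file on disk are assumptions for the example.

```python
import io
import pandas as pd

# Ten rows, three columns, no header line.
csv_text = "\n".join(f"{i},{i * 2},{i % 2}" for i in range(10))

chunk_shapes = []
# With chunksize set, read_csv yields DataFrames of at most 4 rows each
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), header=None, chunksize=4):
    chunk_shapes.append(chunk.shape)

print(chunk_shapes)  # [(4, 3), (4, 3), (2, 3)]
```

The last chunk is smaller whenever the row count is not a multiple of `chunksize`.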

In [1]:


import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report

**Using `fit()` method**

In [2]:
X, y = make_classification(n_samples=50000, n_features=10, n_classes=3, n_clusters_per_class=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

clf1 = SGDClassifier(max_iter=1000, tol=0.01)
clf1.fit(X_train, y_train)

train_score = clf1.score(X_train, y_train)
test_score = clf1.score(X_test, y_test)
train_score, test_score

(0.8939294117647059, 0.8929333333333334)

In [3]:
print(classification_report(y_test, clf1.predict(X_test)))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85      2488
           1       0.98      0.93      0.96      2490
           2       0.88      0.87      0.87      2522

    accuracy                           0.89      7500
   macro avg       0.90      0.89      0.89      7500
weighted avg       0.90      0.89      0.89      7500



**Using `partial_fit()` method**

In [7]:
train_data = np.concatenate((X_train, y_train[:, np.newaxis]), axis=1)
train_data[0:5]

array([[-0.69361705, -0.64744075,  0.6422426 , -1.57921405, -0.24399846,
        -0.07378696, -2.32927858, -0.02880938, -1.39603102, -1.28222915,
         2.        ],
       [-0.78335302, -0.81945506, -0.48319819,  0.40735459,  1.03157244,
         0.66277752, -1.40071435,  0.69724878, -1.36271671, -1.27478929,
         0.        ],
       [-0.27796441, -1.24264847, -0.71766529,  0.18127025,  1.16822818,
        -0.50003655, -0.72726385, -0.94261259, -0.98844593, -0.29831685,
         0.        ],
       [-0.05538096, -0.10289486,  0.75766407,  0.38238686, -0.92804648,
        -1.03023004, -0.36739391, -0.83933618,  0.17480582,  0.31312676,
         2.        ],
       [-0.93479936,  0.45279993, -2.16416986,  0.55756503,  2.88121306,
         0.26333814,  0.19281967, -0.57317633, -1.15483786,  0.59513994,
         0.        ]])

In [8]:
a = np.asarray(train_data)
np.savetxt('train_data.csv', a, delimiter=',')

In [9]:
clf2 = SGDClassifier(max_iter=1000, tol=0.01)

Now, read from this file in chunks using the pandas `read_csv()` method and call `partial_fit()` on each chunk.

In [23]:
import pandas as pd

chunksize = 1000
i = 1

# header=None: np.savetxt wrote no header row, so the first line is data.
for train_df in pd.read_csv("train_data.csv", chunksize=chunksize, header=None):
    X_train_partial = train_df.iloc[:, 0:10]  # The dataset has 10 features
    y_train_partial = train_df.iloc[:, 10]  # The last column is the label

    # Pass the full list of classes on the first call, since a single chunk
    # is not guaranteed to contain every class.
    if i == 1:
        clf2.partial_fit(X_train_partial, y_train_partial, classes=np.array([0, 1, 2]))
    else:
        clf2.partial_fit(X_train_partial, y_train_partial)
#     print(f'After iteration #{i}')
#     print(clf2.coef_)
#     print(clf2.intercept_)
    i += 1  # Count every chunk, not only the ones handled in the else branch

In [24]:
test_score = clf2.score(X_test, y_test)
print(test_score)

0.882
In [28]:
print(classification_report(y_test, clf2.predict(X_test)))

              precision    recall  f1-score   support

           0       0.84      0.83      0.83      2488
           1       0.96      0.93      0.95      2490
           2       0.85      0.89      0.87      2522

    accuracy                           0.88      7500
   macro avg       0.88      0.88      0.88      7500
weighted avg       0.88      0.88      0.88      7500



