<a href="https://colab.research.google.com/github/ekaratnida/Data_Streaming_and_Realtime_Analytics/blob/main/Week09/Online_ML_with_River_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Online / Incremental Machine Learning Tools
+ Offline (batch) ML: the whole dataset is available up front, and the model is fit on it before being used to make predictions.
+ Online ML: used for streaming data, where samples are processed one at a time.
    - real-time data arrives one observation at a time
    - estimates are updated as each new data point arrives rather than waiting until "the end" (which may never occur)
+ Incremental learning is a method of machine learning in which incoming data is continuously used to extend the existing model's knowledge, i.e. to train the model further.
+ It is a dynamic technique, applicable to both supervised and unsupervised learning, for cases where training data becomes available gradually over time or is too large to fit in system memory.
+ The aim
    - for the model to adapt to new data without forgetting its existing knowledge.
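
To make the "one observation at a time" idea concrete, here is a minimal, library-free sketch of an online estimate: a running mean that is updated per sample and never stores the stream. The class name `RunningMean` and the sample values are made up for illustration; this is not River's implementation.

```python
class RunningMean:
    """Incrementally updated mean; illustrative only, not River's implementation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Shift the current estimate toward x by 1/n of the gap.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self


stream = [4.0, 8.0, 6.0, 2.0]  # hypothetical readings arriving one by one
running = RunningMean()
estimates = []
for x in stream:
    running.update(x)
    estimates.append(running.mean)

print(estimates)  # [4.0, 6.0, 6.0, 5.0]
```

Each update costs constant time and memory, which is what makes the approach usable on a stream that never ends.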



#### Tools For Incremental or Online ML
+ River


#### Usefulness
+ For online ML
+ For ML on streaming data




#### Installation
+ !pip install river


### Incremental / Online Machine Learning with River

In [1]:
!pip install river
!pip install -U numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Load ML Pkgs
import river

In [3]:
# List the attributes currently exposed at River's top level
dir(river)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__']
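
The listing above contains only dunder attributes because `dir()` reports what is already loaded, and at this point no River submodule has been imported. One hedged alternative is to ask `pkgutil` which submodules a package provides. The sketch below uses the standard-library `json` package as a stand-in so that it runs anywhere; `list_submodules` is a hypothetical helper name, and passing `river` instead should behave the same way, assuming the package is installed.

```python
import json      # stand-in package from the standard library
import pkgutil


def list_submodules(package):
    """Return the names of a package's direct submodules without importing them."""
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


submodules = list_submodules(json)
print(submodules)
```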

In [4]:
# Load Estimators
from river.naive_bayes import MultinomialNB
from river.feature_extraction import BagOfWords
import pandas as pd

More on BagOfWords: https://medium.com/analytics-vidhya/text-classification-from-bag-of-words-to-bert-1e628a2dd4c9

In [5]:
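def get_all_attributes(package):
    """Tabulate the attributes of every loaded submodule of `package`."""
    # Dunder attributes of the package itself are not submodules, so skip them
    skip = {"__all__", "__builtins__", "__cached__", "__doc__", "__file__", "__loader__",
            "__name__", "__package__", "__path__", "__pdoc__", "__spec__", "__version__"}
    subpackages = []
    submodules = []
    for name in dir(package):
        if name not in skip:
            subpackages.append(name)
            # getattr works on the package passed in, unlike eval on a hard-coded "river"
            submodules.append(dir(getattr(package, name)))
    # One row per submodule, transposed so that each submodule becomes a column
    df = pd.DataFrame(submodules).T
    df.columns = subpackages
    # Keep only the rows in which every column has a value
    return df.dropna()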

In [6]:
river_df = get_all_attributes(river)
river_df

Unnamed: 0,base,feature_extraction,naive_bayes,proba,stats,utils
0,Base,Agg,BernoulliNB,Beta,AbsMax,Rolling
1,BinaryDriftAndWarningDetector,BagOfWords,ComplementNB,Gaussian,AutoCorr,SortedWindow
2,BinaryDriftDetector,PolynomialExtender,GaussianNB,Multinomial,BayesianMean,TimeRolling
3,Classifier,RBFSampler,MultinomialNB,__all__,Count,VectorDict
4,Clusterer,TFIDF,__all__,__builtins__,Cov,__all__
5,DriftAndWarningDetector,TargetAgg,__builtins__,__cached__,EWMean,__builtins__
6,DriftDetector,__all__,__cached__,__doc__,EWVar,__cached__
7,Ensemble,__builtins__,__doc__,__file__,Entropy,__doc__
8,Estimator,__cached__,__file__,__loader__,IQR,__file__
9,MiniBatchClassifier,__doc__,__loader__,__name__,Kurtosis,__loader__


In [7]:
### Data: predict whether a text is hardware- or software-related
data = [("my python program is running","software"),
("I tried to run this program, but it has bugs","software"),
("I need a new machine","hardware"),
("the flashdisk is broken","hardware"),
("We need to test our code","software"),
("programming concepts and testing","software"),
("Electrical device","hardware"),
("device drives","hardware"),
("The generator is broken","hardware"),
("im building a REST API","software"),
("design the best API so far","software"),
("they need more electrical wiring","hardware"),
("my code has errors","software"),
("i found some program test faulty","software"),
("i broke the car handle","hardware"),
("i tested the user interface code","software")]

test_data = [('he writes programs daily','software'),
             ('my disk is broken','hardware'),
             ("program maintenance","software"),
             ('The drive is full','hardware')]

### Text classification
+ Vectorize the text
    - with a bag-of-words representation (CountVectorizer in scikit-learn, BagOfWords in River)
+ Then build the Naive Bayes model
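
As a rough picture of the vectorizing step, the sketch below turns a sentence into token counts using only the standard library. The regular expression is copied from the `tokenizer_pattern` that the `BagOfWords` repr prints in this notebook; the function name `bag_of_words` is hypothetical, and River's own transformer does more (accent stripping, n-grams, and so on).

```python
import re
from collections import Counter

# Pattern copied from the BagOfWords repr shown in this notebook.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[\w\-]+\b")


def bag_of_words(text):
    """Lowercase the text, split it into tokens, and count each token."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


counts = bag_of_words("I tested the user interface code")
print(counts)
```

Note that the pattern requires at least two characters per token, so the single-letter word "I" does not appear in the counts.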

In [8]:
#  Make a Pipeline
from river.compose import Pipeline

In [9]:
pipe_nb = Pipeline(('vectorizer',BagOfWords(lowercase=True)),('nb',MultinomialNB()))

In [10]:
### Visualize the Pipeline
pipe_nb

In [11]:
# Get steps
pipe_nb.steps

OrderedDict([('vectorizer',
              BagOfWords (
                on=None
                strip_accents=True
                lowercase=True
                preprocessor=None
                stop_words=None
                tokenizer_pattern="(?u)\b\w[\w\-]+\b"
                tokenizer=None
                ngram_range=(1, 1)
              )),
             ('nb',
              MultinomialNB (
                alpha=1.
              ))])

In [12]:
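# Fit on our data, learning from one sample at a time
for text, label in data:
    # learn_one updates the pipeline in place; reassigning its return value
    # breaks on River releases where learn_one returns None
    pipe_nb.learn_one(text, label)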

In [13]:
pipe_nb

In [14]:
# Make Prediction
pipe_nb.predict_one("I built an API")

'software'

In [15]:
# Predict the class probabilities
pipe_nb.predict_proba_one("I built an API")

{'software': 0.6286935444447462, 'hardware': 0.37130645555525305}

In [16]:
# Another example
pipe_nb.predict_one("the hard drive  in the computer is damaged")

'hardware'

In [17]:
pipe_nb.predict_proba_one("the hard drive  in the computer is damaged")

{'software': 0.05093100668206932, 'hardware': 0.9490689933179305}
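
For intuition about what `learn_one` and `predict_proba_one` do, here is a from-scratch sketch of an incremental multinomial Naive Bayes with Laplace smoothing (`alpha=1`, the value shown in the pipeline repr). The class `TinyMultinomialNB`, the whitespace tokenization, and the four training sentences are illustrative assumptions; River's actual implementation may differ in its details, so the probabilities are not expected to match the outputs above.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Incremental multinomial Naive Bayes over token counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()             # documents seen per class
        self.token_counts = defaultdict(Counter)  # class -> token -> count
        self.vocabulary = set()

    def learn_one(self, tokens, label):
        # One sample only increments counts; nothing is refit from scratch.
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def predict_proba_one(self, tokens):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            class_total = sum(self.token_counts[label].values())
            denom = class_total + self.alpha * len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                # Smoothed log likelihood of the token given the class
                score += math.log((self.token_counts[label][tok] + self.alpha) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(unnormalized.values())
        return {c: v / norm for c, v in unnormalized.items()}


model = TinyMultinomialNB()
samples = [
    ("my code has errors", "software"),
    ("the generator is broken", "hardware"),
    ("we need to test our code", "software"),
    ("i broke the car handle", "hardware"),
]
for text, label in samples:
    model.learn_one(text.split(), label)

proba = model.predict_proba_one("test my code".split())
print(max(proba, key=proba.get), proba)
```

Because the model is just a set of counters, learning from a new sample is cheap, and predictions can be requested at any point in the stream.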

In [18]:
# Update the model on the test data and track accuracy (test-then-train)
from river.metrics import Accuracy
metric = Accuracy()
print("1.", metric)
for text, label in test_data:
    # Predict before learning, so each sample is scored as unseen data
    y_pred_before = pipe_nb.predict_one(text)
    print("y_pred_before", y_pred_before)
    # update and learn_one both modify their object in place
    metric.update(label, y_pred_before)
    print("2.", metric)
    # Only now let the model learn from the labeled sample
    pipe_nb.learn_one(text, label)

1. Accuracy: 0.00%
y_pred_before hardware
2. Accuracy: 0.00%
y_pred_before hardware
2. Accuracy: 50.00%
y_pred_before software
2. Accuracy: 66.67%
y_pred_before hardware
2. Accuracy: 75.00%
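
The printed accuracies follow directly from the order of operations: each prediction is scored before the model sees the label. The sketch below replays the four labels from `test_data` and the four `y_pred_before` values printed above in plain Python (no River), so the running figures can be checked by hand.

```python
labels = ["software", "hardware", "software", "hardware"]       # from test_data
predictions = ["hardware", "hardware", "software", "hardware"]  # y_pred_before values printed above

correct = 0
running_accuracy = []
for seen, (y_true, y_pred) in enumerate(zip(labels, predictions), start=1):
    correct += y_true == y_pred  # True counts as 1
    running_accuracy.append(correct / seen)

print([f"{acc:.2%}" for acc in running_accuracy])
# ['0.00%', '50.00%', '66.67%', '75.00%']
```

The first sample is misclassified, so the metric starts at 0/1 and then climbs to 1/2, 2/3, and 3/4 as the next three predictions turn out correct.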


In [19]:
metric

Accuracy: 75.00%