# Statistical Machine Learning on 
# Evolving Non-stationary Data Streams

# Outline

- Beyond i.i.d
- Stream Processing
- Streaming Machine Learning
- Case study
- Conclusion


![](./big-data.png)


# Beyond i.i.d

# Beyond i.i.d

* __Traditional machine learning and data mining__ assume that the data are **independent and identically distributed (i.i.d)**.
* __Data samples__ from the **past and current time** do **not affect** the probability of **future** ones.

# Beyond i.i.d

* Data samples arrive __continuously__ and __online__ through unbounded **streams**, often at __high speed__ [Sayed-Mouchaweh, 2016].
* The __process__ generating these data streams __may evolve over time__ (i.e. non-stationarity).

![](noniid.png)


# Beyond i.i.d

In order to deal with evolving data streams, the __model learnt from the streaming data__ must capture up-to-date __trends__ and __transient patterns__ in the stream. 

__Types of change in regression__
![](changes-types-labels.png)

# Beyond i.i.d

When __updating the model__ with new examples, we must also **eliminate the effects of outdated examples** that represent outdated concepts, all in __one pass__ over the data.

__Typical changes in classification__

![](changes-sample-labels.png)
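
One common way to drop outdated examples in a single pass is a sliding window with accumulate and retract steps. The sketch below is a minimal illustration under that assumption; the class name `SlidingWindowMean` and the window size are hypothetical and not taken from the referenced works.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the last `window` samples, updated in O(1) per sample."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()
        self.total = 0.0

    def update(self, x):
        # Accumulate: add the new sample to the running sum.
        self.buffer.append(x)
        self.total += x
        # Retract: remove the oldest sample once the window overflows.
        if len(self.buffer) > self.window:
            self.total -= self.buffer.popleft()
        return self.total / len(self.buffer)


# A stream whose level shifts abruptly from 1.0 to 5.0.
stream = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
swm = SlidingWindowMean(window=3)
means = [swm.update(x) for x in stream]
```

After three samples of the new concept, the estimate no longer depends on the old one. A smaller window forgets faster but gives a noisier estimate.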

# Stream Processing


# Stream Processing

Given a sequence of data (__a stream__), a series of operations (functions) is applied to each element in the stream. The processing is expressed declaratively: we specify __what we want to achieve and not how__ [Bifet, 2010].
![](./stream-animation/streamanim.gif)
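
As a rough illustration of this style, Python generators can chain a filter stage and a map stage over an unbounded source, and elements are pulled through one at a time. The source `sensor_stream` is a hypothetical stand-in for a real stream.

```python
import itertools


def sensor_stream():
    """Hypothetical unbounded source: yields 0, 1, 2, ..."""
    yield from itertools.count()


# Declare the pipeline: keep even readings, then square them.
# Nothing is computed until elements are requested.
evens = (x for x in sensor_stream() if x % 2 == 0)
squares = (x * x for x in evens)

# Pull only the first five results from the infinite stream.
first_five = list(itertools.islice(squares, 5))
```

Each stage states what should happen to an element; the order and timing of evaluation are left to the runtime.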


# Stream Processing

Data __Stream Processing__ is more challenging than __batch processing__ [Bifet, 2010] due to:
* the **amount of data**
* the __speed of arrival__
* the __changes in the distribution generating the data__ 

# Streaming Machine Learning

# Streaming Machine Learning

Unlike in __conventional machine learning__, the __data flow__ targeted by __streaming machine learning__ becomes available __continuously__ over time and needs to be __processed incrementally__ and __in a single pass__.

The inherent __challenges__ here are: 
![](./streaming-challenges.png)

# Case Study
## Principal Component Analysis (PCA)

# Case Study
## Principal Component Analysis (PCA)

![](svd-graphic-simple.png)

# Case Study
## Principal Component Analysis (PCA)

Tackle the __inherent problems of traditional PCA__ that prevent it from learning incrementally from streams:
* Calculation of the __mean and other descriptive statistics__ as the data become available, for estimating the **covariance matrix**
* __Sorting the dominant eigenvalues__ in the rank update of the QR decomposition
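
To see where these two problems arise, here is a minimal batch PCA sketch in NumPy. It uses an eigendecomposition of the covariance matrix rather than the QR-based update mentioned above, and the function name `batch_pca` and the synthetic data are assumptions for illustration.

```python
import numpy as np


def batch_pca(X, k):
    """Top-k eigenvalues and eigenvectors of the sample covariance of X."""
    # Problem 1: the mean (and hence the covariance) needs every sample.
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Problem 2: the dominant eigenvalues must be sorted to pick the top k.
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]


rng = np.random.default_rng(0)
# Synthetic data with very different variance along each axis.
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])
vals, vecs = batch_pca(X, k=2)
```

Both the centering step and the sort operate on the full data set, which is what a streaming variant has to replace with incremental updates.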

## Towards incremental PCA

## Towards incremental PCA

Incremental __calculation of the mean and other descriptive statistics__ on the datastream.
![](streaming-mean.png)
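
One well-known way to compute such statistics in a single pass is Welford's online algorithm. The sketch below assumes scalar samples; the class name `RunningStats` is hypothetical, and this is not necessarily the exact update shown in the figure.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from both the old and the updated mean.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Unbiased sample variance; defined as 0.0 for fewer than 2 samples.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
```

Each sample is seen once and then discarded; only `n`, `mean` and `m2` are kept in memory.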


## Towards incremental PCA


Incremental __updates depending on counts (i.e. histogram)__, which contain sorted eigenvalues.
![](streaming-histogram.png)

# Real-world application

# Real-world application

![](coke.png)

- **\$1 billion in sales in the first 68 seconds**, compared with 90 seconds in 2018 and two minutes in 2017
- **544,000 sales per second** compared with 256,000 in 2018

# Real-world application

#### Multi-class classification task (fraud detection in bank transactions) [Axenie et al., 2019]

Goal: **identify fraud** in the incoming bank transactions by **querying the eigenvalues and eigenvectors** to separate **normal and fraudulent transactions** before they reach a **multi-class classifier**.

# Real-world application

#### Multi-class classification task (fraud detection in bank transactions) [Axenie et al., 2019]

The datastream:

- **2M incoming events at 40 kHz**
- dataset has **10 input features** (e.g. transaction id, account id, transaction amount, transaction date, etc.)
- **eigenvalues** of the input $X$ are close to the **class labels** (i.e. "high-risk fraud", "recurrent fraud", "low-risk valid", "recurrent valid").
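
To put these numbers in perspective, assume that 40 kHz means a constant rate of 40,000 events per second and that events are processed one at a time (both are assumptions made here for illustration). The stream duration $T$ and the per-event time budget $\Delta t$ are then:

$$
T = \frac{2 \times 10^{6}\ \text{events}}{4 \times 10^{4}\ \text{events/s}} = 50\ \text{s},
\qquad
\Delta t = \frac{1}{4 \times 10^{4}\ \text{s}^{-1}} = 25\ \mu\text{s per event}
$$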


# Performance of Streaming PCA

## Performance analysis: Throughput
![](throughput-analysis.png)


## Performance analysis
#### Latency & throughput
![](performance-vals.png)
#### Accuracy
![](accuracy-analysis.png)


## Conclusions

- **Streaming machine learning on data streams** is a challenging problem.
- **Data size, data speed and the underlying changes** in the data properties call for **new learning models**.
- Model **accuracy, processing latency and processing throughput** usually trade off against one another.


# References

[Bifet, 2010] Albert Bifet. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. IOS Press, 2010.

[Sugiyama et al., 2012] Masashi Sugiyama, Motoaki Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. The MIT Press, 2012.

[Sayed-Mouchaweh, 2016] Moamar Sayed-Mouchaweh. Learning from Data Streams in Dynamic Environments. Springer International Publishing, 2016.

[Axenie et al., 2019] Axenie, C.; Tudoran, R.; Bortoli, S.; Hassan, M. A. H.; Carlos, S.; and Brasche, G. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), 2019. IEEE.

![](moa.png)

# Lecture notebook download
![](./qrcode.png)


# Beyond i.i.d

When __training and test data__ follow __different probability distributions__, but the __conditional distributions of output__ values given input points (i.e. __target function__) are __unchanged__, we face __covariate shift__ [Sugiyama et al., 2012]. 
![](covariate-shift.png)
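
In symbols, with $x$ the input and $y$ the output (the subscripts "train" and "test" are notation chosen here), covariate shift can be written as:

$$
p_{\text{train}}(x) \neq p_{\text{test}}(x),
\qquad
p_{\text{train}}(y \mid x) = p_{\text{test}}(y \mid x)
$$

A standard remedy is to reweight each training sample by the importance $w(x) = p_{\text{test}}(x) / p_{\text{train}}(x)$, which in practice has to be estimated from data.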


# Stream Processing

Data __Stream Processing__ is more challenging than __batch processing__ [Bifet, 2010] due to:
* The amount of data is __extremely large__, potentially infinite, so it is __impossible to store__ in full.
* Only a __small summary__ can be computed and stored; the rest is discarded, so it is infeasible to revisit it.
* The __speed of arrival is high__, so each datum has to be processed in __real time__ and then discarded.
* The __distribution generating the items__ can __change over time__.
* __Data from the past__ may become __irrelevant (or even harmful)__ for the current summary.

![](./basic-pca.png)

![](streaming-neural-covariance.png)

The incremental __estimation of the covariance matrix__ as a **Linear Least Squares** regression [Axenie et al., 2019].


![](streaming-lls.png)
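
For comparison, the covariance matrix can also be maintained with a standard rank-one (Welford-style) update per sample. This is a generic sketch and not the Linear Least Squares formulation of [Axenie et al., 2019]; the class name `RunningCovariance` is hypothetical.

```python
import numpy as np


class RunningCovariance:
    """Single-pass estimate of the mean vector and covariance matrix."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.scatter = np.zeros((dim, dim))  # sum of outer products

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean          # deviation from the old mean
        self.mean += delta / self.n
        # Rank-one update using old and new deviations.
        self.scatter += np.outer(delta, x - self.mean)

    def covariance(self):
        # Unbiased estimate; requires at least two samples.
        return self.scatter / (self.n - 1)


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
rc = RunningCovariance(dim=4)
for row in X:
    rc.update(row)
```

The result matches the batch estimate `np.cov(X, rowvar=False)` up to floating-point error, while storing only a mean vector and one `dim x dim` matrix.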


## Performance analysis: Latency
![](latency-analysis.png)
