# Statistical Machine Learning on Evolving Non-stationary Data Streams

## Intuition, Formalism and Examples

## Stream Processing

Given a sequence of data (a stream), a series of operations (functions) is applied to each element in the stream. The processing is declarative: we specify what we want to achieve, not how [Bifet, 2010].
![](streaming-intuition.png)
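
As a minimal sketch of this declarative style, the pipeline below chains lazy Python generators. The names `sensor_readings`, `stream_filter` and `stream_map`, the temperature values and the 100 °C cut-off are illustrative assumptions, not part of the source.

```python
from typing import Callable, Iterable, Iterator


def stream_map(fn: Callable[[float], float], stream: Iterable[float]) -> Iterator[float]:
    """Lazily apply fn to each element as it arrives."""
    for item in stream:
        yield fn(item)


def stream_filter(pred: Callable[[float], bool], stream: Iterable[float]) -> Iterator[float]:
    """Lazily keep only the elements for which pred is true."""
    for item in stream:
        if pred(item):
            yield item


def sensor_readings() -> Iterator[float]:
    """Hypothetical finite source; a real stream would be unbounded."""
    yield from [20.5, 21.0, 250.0, 19.8, 22.1]


# Declare WHAT to compute: drop implausible readings, then convert
# Celsius to Fahrenheit. No element is processed at this point.
pipeline = stream_map(
    lambda c: c * 9 / 5 + 32,
    stream_filter(lambda c: c < 100.0, sensor_readings()),
)

# Consuming the pipeline pulls elements through one at a time.
result = list(pipeline)
```

Nothing is computed until the pipeline is consumed, so the same declaration would also work for an unbounded source read element by element.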


## Stream processing

Big Data Stream Learning is more challenging than batch or offline learning [Bifet, 2010]:
* The amount of data that has arrived and will arrive in the future is extremely large; in fact, the sequence is potentially infinite. Thus, it is impossible to store it all. 
* Only a small summary can be computed and stored, and the rest of the information is thrown away. Even if all the information could be stored, it would be infeasible to go over it again for further processing.
* The speed of arrival is high, so each element has to be processed essentially in real time and then discarded.
* The distribution generating the items can change over time. Thus, data from the past may become irrelevant (or even harmful) for the current summary.
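
One way to keep a constant-size summary that also discounts stale data is an exponentially weighted moving average. The sketch below is an illustration under assumptions: the class name and the smoothing factor `alpha` are hypothetical choices, and other summaries (sketches, reservoirs, windows) could serve the same purpose.

```python
class ExponentialMovingAverage:
    """Constant-memory summary of a stream that gradually forgets old values."""

    def __init__(self, alpha: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # assumed smoothing factor; larger values forget faster
        self.value = None   # no summary until the first element arrives

    def update(self, x: float) -> float:
        """Fold one element into the summary; the element can then be discarded."""
        if self.value is None:
            self.value = x
        else:
            self.value += self.alpha * (x - self.value)
        return self.value


ema = ExponentialMovingAverage(alpha=0.1)
# Synthetic stream whose generating distribution changes halfway through.
for x in [0.0] * 100 + [10.0] * 100:
    ema.update(x)
```

After the change, the summary moves towards the new level while only a single number is ever stored.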

![](ml-pipeline.png)

## Beyond i.i.d

In traditional machine learning and data mining approaches, the currently observed data and the future data are assumed to be sampled independently from an identical probability distribution (i.i.d.). The assumption of independence means that the data samples, generated over time by a variable characterizing a phenomenon, are statistically independent.

Therefore, past and current data samples do not affect the probability for future ones. The assumption of identically distributed observations means that generated observations over time may be considered as random draws from the same probability distribution.

However, in many applications such as web mining, social networks, network monitoring, sensor networks, telecommunications and financial forecasting, data samples arrive continuously through unbounded streams, often at high speed. Moreover, the phenomena generating these data streams may evolve over time. In this case, the environment in which the system or phenomenon generates the data is considered to be dynamic, evolving or non-stationary.

## Beyond i.i.d

In order to deal with evolving data streams, the model learnt from the streaming data must capture up-to-date trends and transient patterns in the stream. 

While updating the model with new examples, we must also eliminate, in a single pass, the effects of outdated examples that represent outdated concepts.
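
A minimal sketch of this idea, assuming a fixed-size sliding window as the forgetting mechanism (the class name and window size are hypothetical): each new example is accumulated into the running statistic and the example leaving the window is retracted, so the update costs one pass and constant work per element.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the most recent `size` elements, updated in O(1) per element."""

    def __init__(self, size: int) -> None:
        if size < 1:
            raise ValueError("size must be at least 1")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, x: float) -> float:
        self.window.append(x)
        self.total += x  # accumulate the new example
        if len(self.window) > self.size:
            self.total -= self.window.popleft()  # retract the outdated example
        return self.total / len(self.window)


win = SlidingWindowMean(size=3)
# The concept changes from small values to 10 after the third element.
means = [win.update(x) for x in [1.0, 2.0, 3.0, 10.0, 10.0, 10.0]]
```

Once the window has moved past the old examples, they no longer influence the estimate at all.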

Types of change
![](changes-types.png)

Typical changes in classification
![](changes-sample.png)

## Beyond i.i.d

Covariate shift is a common assumption in supervised learning under changing distributions. It describes the situation where the training input points and test input points follow different probability distributions, but the conditional distribution of output values given input points is unchanged [Sugiyama et al., 2012]. This means that the target function we want to learn is unchanged between the training phase and the test phase, while the distributions of input points differ between training and test data.

![](covariate-shift.png)

The key idea of covariate shift adaptation is to (softly) choose informative training samples in a systematic way, by considering the __importance__ of each training sample in the prediction of test output values, namely the ratio

$$ \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})}$$

In such a formulation, the expectation of a function $f$ (e.g. a regression loss) over the test inputs $x^{te}$ can be computed as the importance-weighted expectation of the function over the training inputs $x^{tr}$. Thus, the difference between the distributions can be systematically adjusted by importance weighting.
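
The adjustment rests on the identity $\mathbb{E}_{p_{te}}[f(x)] = \mathbb{E}_{p_{tr}}\left[f(x)\,\frac{p_{te}(x)}{p_{tr}(x)}\right]$, which holds when $p_{tr}(x) > 0$ wherever $p_{te}(x) > 0$. The sketch below checks it numerically under strong assumptions: both densities are known unit-variance Gaussians (in practice the ratio has to be estimated from data), and the function names are hypothetical.

```python
import math
import random


def normal_pdf(x: float, mean: float, std: float = 1.0) -> float:
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weighted_mean(f, samples, p_te, p_tr) -> float:
    """Estimate the test expectation of f using only training samples."""
    total = 0.0
    for x in samples:
        total += f(x) * p_te(x) / p_tr(x)  # weight = p_te(x) / p_tr(x)
    return total / len(samples)


rng = random.Random(0)
# Assumed setting: training inputs ~ N(0, 1), test inputs ~ N(1, 1).
train_samples = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

estimate = importance_weighted_mean(
    lambda x: x * x,
    train_samples,
    p_te=lambda x: normal_pdf(x, 1.0),
    p_tr=lambda x: normal_pdf(x, 0.0),
)
# Under the test distribution, E[x^2] = variance + mean^2 = 2, so the
# estimate should land near 2 although no test sample was ever drawn.
```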

## Concept drift

Data generated by phenomena in dynamic environments are characterized by: 

* potentially unlimited size;
* sequential access to data samples in the sense that once an observation has been processed, it cannot be retrieved easily unless it is explicitly stored in memory;
* unpredictable, dependent, and non-identically distributed observations.

Learning from streams of evolving and unbounded data requires new algorithms and methods able to learn under the following constraints [Sayed-Mouchaweh, 2016]:
* Random access to observations is not feasible, or it has high costs. This means that the entire original dataset is not a priori available or it is too large to process.
* Memory is small with respect to the size of data.
* Data distribution or phenomena generating the data may evolve over time. This is also known as concept drift.
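
To make concept drift concrete, the sketch below monitors a stream with the Page-Hinkley test, a classical change detector that needs constant memory and one pass. It is an illustration chosen here, not a method taken from the cited sources; the parameter values `delta` and `threshold` and the synthetic stream are assumptions.

```python
import random


class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 50.0) -> None:
        self.delta = delta          # assumed tolerance for small fluctuations
        self.threshold = threshold  # assumed alarm level
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # cumulative deviation from the running mean
        self.minimum = 0.0          # smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        """Process one element; return True when a drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


rng = random.Random(0)
# Synthetic stream: the mean jumps from 0 to 5 at index 500.
stream = [rng.gauss(0.0, 0.1) for _ in range(500)]
stream += [rng.gauss(5.0, 0.1) for _ in range(500)]

detector = PageHinkley()
drift_index = None
for i, x in enumerate(stream):
    if detector.update(x):
        drift_index = i
        break
```

The detector stores only four numbers, which matches the memory and single-pass constraints listed above.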

## Incremental learning

Unlike in conventional machine learning, the data targeted by incremental learning becomes available continuously over time and needs to be processed in a single pass.

The inherent challenges here are: 

* data availability - new data is received and eliminated from the window of interest as the stream evolves in time
* model update - as the data availability is influenced by the dynamics of the window evolution, the updates must be performed in a single pass
* data size - precise models require large windows, yet updating the model over the entire window is costly in terms of latency and resource allocation (e.g. memory or disk).
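
As a minimal sketch of a single-pass model update, the example below fits a linear model with one stochastic gradient step per arriving example, after which the example is discarded. The class name, the learning rate and the synthetic target `y = 2x + 1` are assumptions made for illustration.

```python
import random
from typing import List, Sequence


class OnlineLinearRegressor:
    """Linear model updated with one gradient step per example (squared loss)."""

    def __init__(self, n_features: int, lr: float = 0.05) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.lr = lr  # assumed constant learning rate

    def predict(self, x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def learn_one(self, x: Sequence[float], y: float) -> None:
        """Update the model from a single example, then forget the example."""
        error = self.predict(x) - y
        self.weights = [w - self.lr * error * xi for w, xi in zip(self.weights, x)]
        self.bias -= self.lr * error


rng = random.Random(0)
model = OnlineLinearRegressor(n_features=1)
# Each example is seen exactly once; nothing is stored besides the model.
for _ in range(5000):
    x = [rng.uniform(-1.0, 1.0)]
    model.learn_one(x, 2.0 * x[0] + 1.0)
```

Memory use is independent of the number of examples processed, at the price of choosing a learning rate instead of a window size.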

## Incremental dimensionality reduction (PCA)

In its basic formulation, PCA follows these steps:
![](basic-pca.png)
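
For reference, a minimal batch sketch, assuming the textbook formulation of PCA (centre the data, compute the covariance matrix, eigendecompose it, sort by eigenvalue, project). The function name `batch_pca` and the synthetic data are hypothetical; the code uses NumPy.

```python
import numpy as np


def batch_pca(X: np.ndarray, n_components: int):
    """Textbook PCA on a complete data matrix X of shape (samples, features)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                # 1. centre the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)           # 2. sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # 3. eigendecomposition (ascending)
    order = np.argsort(eigvals)[::-1][:n_components]  # 4. sort, keep the largest
    components = eigvecs[:, order]               # columns are principal directions
    projected = Xc @ components                  # 5. project onto the components
    return eigvals[order], components, projected


rng = np.random.default_rng(0)
# Synthetic data with standard deviations 3, 2 and 1 along the three axes.
X = rng.normal(size=(1000, 3)) * np.array([3.0, 2.0, 1.0])
eigvals, components, projected = batch_pca(X, n_components=2)
```

Every step here needs the whole data matrix at once, which is what makes the batch formulation unsuitable for an unbounded stream.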

## Incremental dimensionality reduction (PCA)
![](basic-pca-problems.png)

The goal is to tackle the inherent problems that prevent traditional PCA from achieving low latency, high throughput and fixed memory/storage:
* Calculation of the __mean and other descriptive statistics__ as the data is available.
* __Sorting the dominant eigenvalues__ in the rank update of the QR decomposition.
* Calculating the __covariance matrix__.

## Towards incremental PCA

Incremental __calculation of the mean and other descriptive statistics__ on the datastream.
![](streaming-mean.png)
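
A minimal sketch of such an incremental computation, using Welford's online algorithm for the running mean and variance. The class name is hypothetical, and whether the figure uses exactly this update is an assumption.

```python
class RunningStats:
    """Running mean and sample variance in one pass and constant memory."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the already updated mean

    @property
    def variance(self) -> float:
        """Unbiased sample variance; 0.0 until two elements have been seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
# For this stream the mean is 5 and the sample variance is 32 / 7.
```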


## Towards incremental PCA

Incremental __updates depending on counts (i.e. histogram)__, which contain sorted eigenvalues.
![](streaming-histogram.png)
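
A minimal sketch of a count-based summary, assuming fixed, equal-width bins over a known value range (`low`, `high` and `n_bins` are hypothetical parameters). Counts are updated per element with accumulate and retract operations, and order statistics are read off the cumulative counts without sorting the raw data. How this maps onto the eigenvalue ordering is not specified here.

```python
class StreamingHistogram:
    """Fixed-bin histogram with accumulate/retract updates."""

    def __init__(self, low: float, high: float, n_bins: int) -> None:
        self.low = low
        self.high = high
        self.n_bins = n_bins
        self.width = (high - low) / n_bins
        self.counts = [0] * n_bins
        self.total = 0

    def _bin(self, x: float) -> int:
        """Index of the bin containing x; out-of-range values are clamped."""
        idx = int((x - self.low) / self.width)
        return min(max(idx, 0), self.n_bins - 1)

    def accumulate(self, x: float) -> None:
        self.counts[self._bin(x)] += 1
        self.total += 1

    def retract(self, x: float) -> None:
        """Remove a previously accumulated value, e.g. when it leaves a window."""
        self.counts[self._bin(x)] -= 1
        self.total -= 1

    def quantile(self, q: float) -> float:
        """Approximate q-quantile: upper edge of the bin where the
        cumulative count first reaches q * total."""
        target = q * self.total
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.low + (i + 1) * self.width
        return self.high


hist = StreamingHistogram(low=0.0, high=10.0, n_bins=10)
values = [k + 0.5 for k in range(10)]  # one value per bin
for v in values:
    hist.accumulate(v)
median_estimate = hist.quantile(0.5)

# Retract the two largest values, as if they had left the window.
hist.retract(9.5)
hist.retract(8.5)
```

The resolution of the answer is bounded by the bin width, which is the price of keeping memory fixed.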

## Towards incremental PCA

Incremental __estimation of the covariance matrix__, as __neural synaptic weights__ converge to the eigenvectors (unique set of __optimal weights__ and __uncorrelated outputs__) [Axenie et al., 2019]
![](streaming-neural-covariance.png)

## Towards incremental PCA

Converge from an initially __random set of synaptic weights__ to the __eigenvectors of the input autocorrelation__, in eigenvalue order, by __minimizing the linear reconstruction error (i.e. using Linear Least Squares)__.
![](streaming-lls.png)
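
One well-known learning rule with this behaviour is Sanger's generalized Hebbian algorithm, sketched below with NumPy. Whether it coincides with the rule used in [Axenie et al., 2019] is an assumption; the function name, learning rate, initial weight scale and synthetic zero-mean data are illustrative choices.

```python
import numpy as np


def gha_update(W: np.ndarray, x: np.ndarray, lr: float) -> np.ndarray:
    """One step of Sanger's rule for a single input x.

    W has shape (n_components, n_features); each row is one weight vector.
    """
    y = W @ x  # outputs of the linear neurons
    # Hebbian term minus a lower-triangular decorrelation term, which makes
    # row i learn what rows 0..i-1 have not already explained.
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W


rng = np.random.default_rng(0)
scales = np.array([3.0, 2.0, 1.0])       # assumed standard deviations per axis
W = rng.normal(scale=0.1, size=(2, 3))   # small random initial weights

# Single pass over a synthetic zero-mean stream, one element at a time.
for _ in range(20000):
    x = rng.normal(size=3) * scales
    gha_update(W, x, lr=0.001)

# The rows of W should now be close to +/- the first two canonical axes,
# ordered by decreasing variance (9, then 4).
```

No covariance matrix is ever formed: the only state is the weight matrix, so memory stays fixed regardless of the stream length.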

## Example implementation

Multi-class classification task (fault identification in predictive maintenance [Axenie et al., 2019])

We used a real-world stream with sensory readings from a coal coke production line (i.e. data from 1 preheater temperature sensor, 2 briquetting temperature sensors, 2 cooker temperature sensors, 2 coke quencher temperature sensors, 2 coke transport system temperature sensors and 2 blast furnace temperature sensors).

We addressed the problem of identifying faults in the production line and queried the eigenvalues and eigenvectors to extract the normal and faulty operation configurations before passing them to a multi-class classifier.

The datastream contained 2M incoming events at 40 kHz. Moreover, the datastream had the property that the eigenvalues of the input $X$ are close to the class labels (i.e. $1, 2, \dots, d$) and the corresponding eigenvectors are close to the canonical basis of $\mathbb{R}^d$, where $d$ is both the number of principal components to extract and the number of classes in the multi-class classification task. The classes are the various types of faults plus normal operation; in our scenario, we consider 10 classes: 9 faults and 1 normal.

![](experimental-setup.png)

## PCA vs. Incremental PCA (Latency Analysis)

![](latency-analysis.png)

## PCA vs. Incremental PCA (Throughput Analysis)

![](throughput-analysis.png)

## PCA vs. Incremental PCA (Accuracy Analysis)

![](accuracy-analysis.png)

## Conclusions

* low-latency (1-ms level), high-throughput (Kevents/s) computation and learning on datastreams,

* streaming PCA by deriving incremental, accumulate/retract update models and leveraging their execution on a distributed system,

* computation of multiple statistical features and neural learning rules while guaranteeing limited or programmable resource allocation (i.e. memory and disk),

* validated by real-world scenarios (i.e. predictive maintenance) proving its low-latency high-throughput capabilities.

# References

[Bifet, 2010] Albert Bifet, Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams, IOS Press, 2010.

[Sugiyama et al., 2012] Masashi Sugiyama, Motoaki Kawanabe, Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation, The MIT Press, 2012.

[Sayed-Mouchaweh, 2016] Moamar Sayed-Mouchaweh, Learning from Data Streams in Dynamic Environments, Springer International Publishing, 2016.

[Axenie et al., 2019] C. Axenie, Radu Tudoran, Stefano Bortoli, Mohamad Al Hajj Hassan, Alexander Wieder, Goetz Brasche, SPICE: Streaming PCA fault Identification and Classification Engine in Predictive Maintenance, IoT Stream Workshop, European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019).

[Weng et al., 2003] J. Weng, Y. Zhang, and W.-S. Hwang, “Candid covariance-free incremental principal component analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 1034–1040, 2003.

[Brand, 2002] M. Brand, “Incremental singular value decomposition of uncertain data with missing values.” in ECCV (1), ser. Lecture Notes in Computer Science, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds., vol. 2350. Springer, 2002, pp. 707–720.