## Real-Time Processing of Air Quality Data for Anomaly Detection


This notebook outlines a real-time data processing solution using the Bytewax framework for anomaly detection in air quality data. The transition from batch processing to stream processing can provide immediate analysis and response capabilities, crucial for dynamic data environments.

### Stream Processing Pipeline Overview
The data processing pipeline can be visualized as follows:

```text
Data Ingestion -> Serialization -> Deserialization and Imputation -> Anomaly Detection -> Anomaly Filtering
```

### Detailed Code Walkthrough
Below is a detailed explanation of each component of the pipeline, showing how each function contributes to the real-time data processing.

### Setup and Dependency Installation
This section includes installation commands for necessary Python packages to handle data flow and machine learning operations.




In [None]:
!pip install bytewax==0.19 python-dotenv scipy==1.13.0 kafka-python==2.0.2 --q
!pip install pandas==2.0.3 river --q
!pip install scikit-learn==1.4.2 --q


In [17]:
import json
from datetime import datetime, timezone

import numpy as np
import pandas as pd
import requests
from river import anomaly, preprocessing, stats
from sklearn.impute import KNNImputer

import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.inputs import DynamicSource, StatelessSourcePartition
from bytewax.testing import TestingSource, run_main

The data is fetched from an external URL and prepared for real-time processing through the defined classes `SerializedData` and `SerializedInput`.

In [18]:
# Fetch the JSON data file over HTTP
url = 'https://raw.githubusercontent.com/bytewax/ml-iot/main/data.json'

resp = requests.get(url)
resp.raise_for_status()  # Fail early if the download did not succeed
data = json.loads(resp.text)



To convert the `serialize` function into a streaming equivalent for Bytewax, we need a data source that emits records one at a time, like a generator. We define two classes to model this behavior: `SerializedData`, which streams the rows of a single partition, and `SerializedInput`, which builds one such partition for each worker.

Step 1: Define `SerializedData` as a `StatelessSourcePartition`
This class will act as a source partition that iterates over a dataset, serializing each entry according to the provided headers and fields.

Step 2: Define `SerializedInput` as a `DynamicSource`
This class encapsulates the partition management for the data source, ensuring that each worker in a distributed environment gets a proper instance of the source partition.
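
The way rows are divided among workers can be sketched in isolation. The helper below is a hypothetical stand-alone function (`worker_slice` is not part of Bytewax); it only illustrates the integer-division split described in Step 2, where the last worker also picks up any remainder.

```python
def worker_slice(total_entries: int, worker_index: int, worker_count: int) -> tuple[int, int]:
    """Return the [start, end) range of rows assigned to one worker."""
    part_size = total_entries // worker_count
    start = part_size * worker_index
    # The last worker also takes the rows left over by integer division.
    end = total_entries if worker_index == worker_count - 1 else start + part_size
    return start, end


# Hypothetical example: 10 rows split across 3 workers.
slices = [worker_slice(10, i, 3) for i in range(3)]
print(slices)
```

With this scheme every row is assigned to exactly one worker, although the last worker may receive a slightly larger share.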

In [19]:
class SerializedData(StatelessSourcePartition):
    """
    Emit serialized data directly for simplicity. This class will serialize
    each entry in the 'data' list by mapping it to the corresponding 'fields'.
    """
    def __init__(self, full_data):
        self.fields = full_data['fields']
        self.data_entries = full_data['data']
        self.metadata = {k: v for k, v in full_data.items() if k not in ['fields', 'data']}
        self._it = iter(self.data_entries)

    def next_batch(self):
        # next() raises StopIteration once the data is exhausted, which
        # signals to Bytewax that this partition is complete.
        data_entry = next(self._it)
        # Map each entry in 'data' with the corresponding field in 'fields'
        data_dict = dict(zip(self.fields, data_entry))
        # Merge metadata with data_dict to form the complete record
        complete_record = {**self.metadata, "data": data_dict}
        # Serialize the complete record
        serialized = json.dumps(complete_record).encode('utf-8')
        return [serialized]


class SerializedInput(DynamicSource):
    """
    Dynamic data source that partitions the input data among workers.
    """
    def __init__(self, data):
        self.data = data
        self.total_entries = len(data['data'])

    def build(self, step_id, worker_index, worker_count):
        # Calculate the slice of data each worker should handle
        part_size = self.total_entries // worker_count
        start = part_size * worker_index
        end = start + part_size if worker_index != worker_count - 1 else self.total_entries

        # Create a partition of the data for the specific worker
        # Note: This partitions only the 'data' array. Metadata and fields are assumed
        # to be common and small enough to be replicated across workers.
        data_partition = {
            "api_version": self.data['api_version'],
            "time_stamp": self.data['time_stamp'],
            "data_time_stamp": self.data['data_time_stamp'],
            "max_age": self.data['max_age'],
            "firmware_default_version": self.data['firmware_default_version'],
            "fields": self.data['fields'],
            "data": self.data['data'][start:end]
        }

        return SerializedData(data_partition)

* Data Initialization: The `SerializedData` class now takes the entire data structure, keeps the metadata, and iterates over the data list. Each entry in data is mapped to the corresponding field specified in fields, combined with the metadata, serialized into a JSON string, and then encoded.

* Integration into Dataflow: The class is used directly within a Bytewax dataflow as an input source, demonstrating how serialized data would be produced from the structured input.
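
To make the record shape concrete, here is a minimal sketch of the same mapping applied to a tiny, made-up payload. The field names mirror those used later in this notebook, but the values and the `serialize_row` helper are assumptions for illustration only, not the real dataset.

```python
import json

# Hypothetical payload in the same shape as the fetched JSON: metadata keys,
# a 'fields' list, and a 'data' list of rows. All values are made up.
payload = {
    "api_version": "V1.0.0",
    "fields": ["sensor_index", "humidity", "temperature", "pm1.0_cf_1"],
    "data": [[101, 40, 71, 3.2], [102, None, 68, 5.9]],
}

metadata = {k: v for k, v in payload.items() if k not in ("fields", "data")}


def serialize_row(row):
    """Pair a row with the field names, attach metadata, and encode as JSON bytes."""
    record = {**metadata, "data": dict(zip(payload["fields"], row))}
    return json.dumps(record).encode("utf-8")


messages = [serialize_row(row) for row in payload["data"]]
decoded = json.loads(messages[1].decode("utf-8"))
print(decoded["data"])
```

Note that a missing reading survives the round trip as `None` (JSON `null`), which is what the imputation step downstream has to deal with.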

We can then deserialize the data with a simple function.

In this function, we also impute missing values for the temperature, humidity, pressure, and PM (`pm1.0_cf_1`) attributes.

Unlike the batch version, we use the `river` library for imputation because it is designed for stream processing.

In [20]:
temp_imputer = preprocessing.StatImputer(("temperature", stats.Mean()))
humidity_imputer = preprocessing.StatImputer(("humidity", stats.Mean()))
pressure_imputer = preprocessing.StatImputer(("pressure", stats.Mean()))
pm1_imputer = preprocessing.StatImputer(("pm1.0_cf_1", stats.Mean()))


def process_and_impute_data(byte_data):
    """Deserialize byte data, impute missing values, and prepare for stateful processing."""
    # Deserialize the byte data
    record = json.loads(byte_data.decode('utf-8'))
    sensor_data = record['data']
    key = str(sensor_data.get("sensor_index", "default"))

    # Impute missing values: update each running mean with the current record,
    # then fill in any missing value for that feature
    for imputer in [temp_imputer, humidity_imputer, pressure_imputer, pm1_imputer]:
        imputer.learn_one(sensor_data)
        sensor_data = imputer.transform_one(sensor_data)

    # Return the processed data with the key
    return (key, sensor_data)
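
For intuition, the learn-then-transform pattern used above can be sketched in plain Python. The `RunningMeanImputer` class below is a toy stand-in written for this explanation; it is not `river`'s implementation and it handles a single feature only.

```python
class RunningMeanImputer:
    """Toy stand-in for a streaming mean imputer on a single feature."""

    def __init__(self, feature):
        self.feature = feature
        self.count = 0
        self.mean = 0.0

    def learn_one(self, x):
        # Update the running mean only when a value is actually present.
        value = x.get(self.feature)
        if value is not None:
            self.count += 1
            self.mean += (value - self.mean) / self.count

    def transform_one(self, x):
        # Replace a missing value with the mean seen so far.
        if x.get(self.feature) is None:
            return {**x, self.feature: self.mean}
        return x


imputer = RunningMeanImputer("temperature")
stream = [{"temperature": 70.0}, {"temperature": 74.0}, {"temperature": None}]
imputed = []
for record in stream:
    imputer.learn_one(record)
    imputed.append(imputer.transform_one(record))
print(imputed[-1])
```

Because the mean is updated incrementally, no history of past records needs to be kept, which is what makes this style of imputation suitable for a stream.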

As in the batch example, we will use the `river` library's `HalfSpaceTrees` for detecting anomalies in streaming data.

This class is integral for real-time anomaly detection. It maintains a model state across the data stream, continuously learning and scoring new data points.

In [21]:
class AnomalyDetector:
    """
    Anomaly detector using HalfSpaceTrees from River library

    This class is used to detect anomalies in the data using online ML models
    with the River library
    """

    def __init__(self, n_trees=10, height=8, window_size=72, seed=11):
        """
        Initialize the anomaly detector
        """
        self.detector = anomaly.HalfSpaceTrees(
            n_trees=n_trees,
            height=height,
            window_size=window_size,
            limits={'pm1.0_cf_1': (0.0, 1200)},  # Ensure these limits make sense for your data
            seed=seed
        )

    def update(self, data):
        """
        Update the anomaly detector with new data
        """
        # Score only when 'pm1.0_cf_1' is present and convertible to float;
        # otherwise the record keeps an anomaly score of None
        data['anomaly_score'] = None
        if data.get('pm1.0_cf_1') is not None:
            try:
                value = float(data['pm1.0_cf_1'])
            except (TypeError, ValueError):
                print(f"Skipping entry, invalid data for pm1.0_cf_1: {data['pm1.0_cf_1']}")
            else:
                # Score before learning so a point does not influence its own score
                data['anomaly_score'] = self.detector.score_one({'pm1.0_cf_1': value})
                self.detector.learn_one({'pm1.0_cf_1': value})
        return data


# Initialize the anomaly detector
anomaly_detector = AnomalyDetector()

def detect_anomalies(data_tuple):
    """Detect anomalies in sensor data."""
    key, sensor_data = data_tuple
    sensor_data = anomaly_detector.update(sensor_data)
    return (key, sensor_data)
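
The score-then-learn loop is the core idea here, independent of the specific model. The sketch below illustrates it with a deliberately simple running z-score detector. `RunningZScoreDetector` is a hypothetical toy class, not `HalfSpaceTrees`, and its scores are not bounded the way the scores in this notebook are assumed to be.

```python
import math


class RunningZScoreDetector:
    """Toy online detector: scores a value by its distance from the running mean."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def score_one(self, value):
        if self.n < 2:
            return 0.0  # not enough history to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(value - self.mean) / std if std > 0 else 0.0

    def learn_one(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


detector = RunningZScoreDetector()
readings = [5.0, 6.0, 5.5, 6.5, 5.0, 60.0]
scores = []
for value in readings:
    scores.append(detector.score_one(value))  # score first, using past data only
    detector.learn_one(value)                 # then update the model
print(scores)
```

Scoring before learning matters: if the model were updated first, an unusual reading would already be part of the state it is compared against, which would dampen its score.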




Only records whose anomaly score exceeds a specified threshold (0.7 here) are passed forward for alerting or further analysis.



In [24]:
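def filter_high_anomaly(data_tuple):
    """Filter entries with high anomaly scores."""
    key, data = data_tuple
    score = data.get('anomaly_score')
    # A record that could not be scored has a score of None; drop it rather
    # than comparing None with a float, which raises a TypeError
    return score is not None and score > 0.7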
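
As a quick check of the filtering behavior, the sketch below applies a threshold predicate to a few made-up `(key, data)` tuples using Python's built-in `filter`. The sample values and the `is_high_anomaly` name are assumptions for illustration; the predicate treats a missing or `None` score as "not an anomaly".

```python
def is_high_anomaly(item, threshold=0.7):
    """Return True only when a numeric score is present and above the threshold."""
    _, data = item
    score = data.get("anomaly_score")
    return score is not None and score > threshold


samples = [
    ("101", {"anomaly_score": 0.92}),
    ("102", {"anomaly_score": 0.41}),
    ("103", {"anomaly_score": None}),  # scoring was skipped for this record
    ("104", {}),                       # score key is missing entirely
]
kept = [key for key, _ in filter(is_high_anomaly, samples)]
print(kept)
```

Making the threshold a parameter keeps the predicate easy to tune, since a suitable cut-off depends on the data and on how many alerts are acceptable.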


![](https://github.com/bytewax/ml-iot/blob/main/flow.png?raw=true)

In [26]:
# Setup the dataflow
flow = Dataflow("air-quality-flow")
inp = op.input("inp", flow, SerializedInput(data))
impute_deserialize = op.map("impute_deserialize", inp, process_and_impute_data)


# Add anomaly detection to the dataflow
detect_anomalies_step = op.map("detect_anomalies", impute_deserialize, detect_anomalies)

# Keep only records whose anomaly score exceeds the threshold
filter_anomalies = op.filter("filter_high_anomalies", detect_anomalies_step, filter_high_anomaly)


# Output or further processing
op.inspect("inspect_filtered_anomalies", filter_anomalies)


run_main(flow)
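
Since each step is an ordinary Python callable, the same map, map, filter ordering can be dry-run on an in-memory list without a running dataflow. The sketch below uses simplified stand-in functions (`decode`, `score`, `is_high`) and made-up readings. The scoring rule is purely illustrative and is not the model used in this notebook.

```python
import json


def decode(raw: bytes):
    """Deserialize one message into a (key, data) tuple."""
    record = json.loads(raw.decode("utf-8"))
    return str(record["data"]["sensor_index"]), record["data"]


def score(item):
    key, data = item
    # Stand-in scoring rule for illustration only: scale the PM reading into [0, 1].
    return key, {**data, "anomaly_score": min(data["pm1.0_cf_1"] / 100.0, 1.0)}


def is_high(item, threshold=0.7):
    _, data = item
    return data["anomaly_score"] > threshold


# Made-up messages in the serialized record shape.
raw_messages = [
    json.dumps({"data": {"sensor_index": i, "pm1.0_cf_1": pm}}).encode("utf-8")
    for i, pm in [(1, 4.0), (2, 95.0), (3, 12.5)]
]

# map -> map -> filter, mirroring the order of the dataflow steps
results = list(filter(is_high, map(score, map(decode, raw_messages))))
print(results)
```

A dry run like this is a cheap way to unit-test step functions before wiring them into the dataflow.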

### Comparison with Batch Approach
1. Real-time Adaptability: Unlike the batch processing approach, this stream processing framework is designed to handle data in real-time, which significantly reduces the latency between data acquisition and processing. This is crucial for applications where timely data analysis can lead to immediate actionable insights.

2. Scalability and Efficiency: For data that updates frequently, the stream processing model handles records incrementally as they arrive, avoiding the overhead of reloading and reprocessing a large dataset on every batch run.

3. Continuous Learning and Adaptation: The real-time model continuously updates its parameters with incoming data, making it more adaptable to new patterns or changes in data trends. This contrasts with batch processing, where models might not adapt quickly to new data between batch runs.

4. Error Handling and Data Drift: Because each record is processed as it arrives, malformed values can be detected and handled immediately (for example, skipped with a logged message) instead of surfacing only after a batch run completes. Continuously updating the model on the incoming stream also helps it adapt to data drift.

## Conclusion
The Bytewax-based real-time data processing approach provides a practical solution for handling air quality data in environments that require immediate analysis and response. It is well suited to scenarios where timeliness and adaptability are crucial, making it a strong alternative to batch processing for dynamic data applications.





