## Real-Time Processing of Air Quality Data for Anomaly Detection


To integrate our Python program with Bytewax, we build a dataflow in which input and deserialization are stateless and anomaly detection is stateful. We'll follow these steps:

1. Set up the Bytewax dataflow: define the dataflow to ingest data, deserialize it, impute missing values, and pass the result to the stateful anomaly detection step.
2. Integrate the stateless steps: reading input, deserializing data, and imputing missing values using KNN.
3. Integrate the stateful step: anomaly detection, which maintains state over the window of data it analyzes.
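
The difference between the two kinds of step can be sketched in plain Python, without Bytewax. The names `parse_reading` and `running_mean` and the sample readings are hypothetical and only illustrate the shape of each step: a stateless step is a function of the current item alone, while a stateful step also receives the state left by earlier items and returns the updated state.

```python
import json


def parse_reading(raw: bytes) -> dict:
    """Stateless: the output depends only on the current item."""
    return json.loads(raw.decode("utf-8"))


def running_mean(state, value):
    """Stateful: takes the previous state, returns (new_state, output)."""
    count, mean = state if state is not None else (0, 0.0)
    count += 1
    mean += (value - mean) / count  # incremental mean update
    return (count, mean), mean


# Made-up readings, for illustration only
raw_items = [b'{"pm2.5": 10.0}', b'{"pm2.5": 20.0}', b'{"pm2.5": 60.0}']

state = None
means = []
for raw in raw_items:
    reading = parse_reading(raw)
    state, mean = running_mean(state, reading["pm2.5"])
    means.append(mean)

print(means)  # [10.0, 15.0, 30.0]
```

In a streaming engine the loop and the `state` variable are managed by the framework, which is what allows state to be kept per key and recovered after a restart.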

In [11]:
!pip install bytewax==0.19 python-dotenv scipy==1.13.0 kafka-python==2.0.2
!pip install pandas==2.0.3 river
!pip install scikit-learn==1.4.2


In [1]:
import json
from datetime import datetime, timezone

import numpy as np
import pandas as pd
import requests
from river import anomaly
from sklearn.impute import KNNImputer

import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.inputs import DynamicSource, StatelessSourcePartition
from bytewax.testing import TestingSource, run_main


In [3]:
# Load the synthetic air quality data from a JSON file into a dictionary.
# The context manager closes the file once loading is done.
with open('synthetic_data.json') as f:
    data = json.load(f)
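
The contents of `synthetic_data.json` are not shown here. As a rough sketch of the shape the rest of this section relies on, the snippet below writes and reloads a small payload. The top-level keys are the ones the code in this section reads; every value, and the field names inside `fields`, are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical payload: keys match those accessed later in this section,
# values are invented for illustration.
sample = {
    "api_version": "V1.0.0",
    "time_stamp": 1700000000,
    "data_time_stamp": 1700000000,
    "max_age": 604800,
    "firmware_default_version": "7.02",
    "fields": ["sensor_index", "pm2.5", "humidity"],
    "data": [[101, 12.4, 40], [102, None, 38]],  # None marks a missing value
}

# Write to a temporary location so the real file is not overwritten
path = os.path.join(tempfile.mkdtemp(), "synthetic_data.json")
with open(path, "w") as f:
    json.dump(sample, f)

with open(path) as f:
    loaded = json.load(f)

print(sorted(loaded.keys()))
print(len(loaded["data"]), "rows with", len(loaded["fields"]), "fields each")
```

Each row in `data` is positional: its values line up with the names in `fields`.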



To turn the `serialize` function into a streaming source for Bytewax, we need a data source that emits records one at a time, like a generator. We define two classes to model this behavior: `SerializedData`, a source partition that streams the serialized records, and `SerializedInput`, which builds one such partition for each worker.

Step 1: Define `SerializedData` as a `StatelessSourcePartition`
This class will act as a source partition that iterates over a dataset, serializing each entry according to the provided headers and fields.

Step 2: Define `SerializedInput` as a `DynamicSource`
This class encapsulates the partition management for the data source, ensuring that each worker in a distributed environment gets a proper instance of the source partition.

In [4]:
class SerializedData(StatelessSourcePartition):
    """
    Emit serialized data directly for simplicity. This class will serialize
    each entry in the 'data' list by mapping it to the corresponding 'fields'.
    """
    def __init__(self, full_data):
        self.fields = full_data['fields']
        self.data_entries = full_data['data']
        self.metadata = {k: v for k, v in full_data.items() if k not in ['fields', 'data']}
        self._it = iter(self.data_entries)

    def next_batch(self):
        # `next` raises StopIteration once the data is exhausted, which
        # signals to Bytewax that this partition is complete.
        data_entry = next(self._it)
        # Map each entry in 'data' with the corresponding field in 'fields'
        data_dict = dict(zip(self.fields, data_entry))
        # Merge metadata with data_dict to form the complete record
        complete_record = {**self.metadata, "data": data_dict}
        # Serialize the complete record
        serialized = json.dumps(complete_record).encode('utf-8')
        return [serialized]


class SerializedInput(DynamicSource):
    """
    Dynamic data source that partitions the input data among workers.
    """
    def __init__(self, data):
        self.data = data
        self.total_entries = len(data['data'])

    def build(self, step_id, worker_index, worker_count):
        # Calculate the slice of data each worker should handle
        part_size = self.total_entries // worker_count
        start = part_size * worker_index
        end = start + part_size if worker_index != worker_count - 1 else self.total_entries

        # Create a partition of the data for the specific worker
        # Note: This partitions only the 'data' array. Metadata and fields are assumed
        # to be common and small enough to be replicated across workers.
        data_partition = {
            "api_version": self.data['api_version'],
            "time_stamp": self.data['time_stamp'],
            "data_time_stamp": self.data['data_time_stamp'],
            "max_age": self.data['max_age'],
            "firmware_default_version": self.data['firmware_default_version'],
            "fields": self.data['fields'],
            "data": self.data['data'][start:end]
        }

        return SerializedData(data_partition)
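
The slicing arithmetic in `build` can be checked in isolation. The sketch below copies the same three lines into a standalone helper, `partition_bounds` (a hypothetical name, not part of Bytewax), and applies it to ten rows split across three workers.

```python
def partition_bounds(total_entries, worker_index, worker_count):
    """Return the [start, end) slice of rows handled by one worker."""
    part_size = total_entries // worker_count
    start = part_size * worker_index
    # The last worker also takes the remainder left by integer division
    end = start + part_size if worker_index != worker_count - 1 else total_entries
    return start, end


rows = list(range(10))
slices = [rows[slice(*partition_bounds(len(rows), i, 3))] for i in range(3)]
print(slices)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Every row is assigned to exactly one worker. One consequence of this scheme is that when there are fewer rows than workers, `part_size` is 0 and the last worker receives all of the data.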

* Data Initialization: The `SerializedData` class takes the entire data structure, keeps the metadata, and iterates over the `data` list. Each entry in `data` is mapped to the corresponding names in `fields`, combined with the metadata, serialized into a JSON string, and then encoded as UTF-8 bytes.

* Integration into Dataflow: The class is used directly within a Bytewax dataflow as an input source, demonstrating how serialized data would be produced from the structured input.
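
For a single row, the serialization described above amounts to the following. This is a minimal sketch with made-up field names, metadata, and values.

```python
import json

# Hypothetical inputs, for illustration only
fields = ["sensor_index", "pm2.5"]
metadata = {"api_version": "V1.0.0", "max_age": 604800}
entry = [101, 12.4]

# Pair each positional value with its field name
data_dict = dict(zip(fields, entry))
# Attach the shared metadata, nesting the row under "data"
complete_record = {**metadata, "data": data_dict}
# Encode as UTF-8 JSON bytes, the form emitted by the source
serialized = json.dumps(complete_record).encode("utf-8")

print(serialized)
```

The metadata is repeated in every emitted record, which keeps each record self-describing at the cost of some redundancy.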

We can then deserialize the data with a simple function.


In [28]:
def process_deserialized_data(byte_data):
    """Deserialize byte data and prepare for stateful processing."""
    sensor_data = json.loads(byte_data.decode('utf-8'))['data']
    key = str(sensor_data.get("sensor_index", "default"))
    return (key, sensor_data)
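
As a quick check of the function above, the sketch below repeats its definition so the snippet is self-contained and applies it to two hypothetical payloads, one with a `sensor_index` and one without.

```python
import json


def process_deserialized_data(byte_data):
    """Deserialize byte data and prepare for stateful processing."""
    sensor_data = json.loads(byte_data.decode("utf-8"))["data"]
    key = str(sensor_data.get("sensor_index", "default"))
    return (key, sensor_data)


# Made-up payloads, for illustration only
with_index = json.dumps({"data": {"sensor_index": 101, "pm2.5": 12.4}}).encode("utf-8")
without_index = json.dumps({"data": {"pm2.5": 9.1}}).encode("utf-8")

first = process_deserialized_data(with_index)
second = process_deserialized_data(without_index)
print(first)   # ('101', {'sensor_index': 101, 'pm2.5': 12.4})
print(second)  # ('default', {'pm2.5': 9.1})
```

The key is converted to a string because Bytewax's keyed, stateful operators route items by string key. Records without a `sensor_index` all share the `"default"` key and would therefore share state downstream.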

Next we can perform imputation of the missing values.

Should KNN Imputation Be Stateful or Stateless?

Stateless processing implies that each data item is processed independently without any need to remember past interactions. This is typically not the case with KNN imputation:

* Stateful: The KNN algorithm typically benefits from "remembering" the dataset it uses to predict missing values because it bases its imputation on the k-nearest neighbors. Therefore, a stateful approach is often necessary if you need to continuously update the training dataset as new data arrives or if the dataset itself is too large to handle in a single batch efficiently.

* Stateless: If the data chunks you process are independent or if the dataset can be managed in small batches without the need for continuity between batches, you could consider a stateless approach.

Because Bytewax is built primarily for stream processing, we may need to adapt our KNN implementation to a stateful paradigm if the dataset grows dynamically or if the model must adapt continuously as new data arrives.
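
To make the stateful option concrete, here is a minimal pure-Python sketch. It is not the approach used in the rest of this section, and `RollingKNNImputer` is a hypothetical helper, not a scikit-learn or Bytewax class. It assumes every row has the same keys and keeps a bounded history of complete rows as its state, imputing each missing value as the mean over the `k` nearest stored rows.

```python
import math
from collections import deque


class RollingKNNImputer:
    """Hypothetical helper: imputes from a bounded history of complete rows."""

    def __init__(self, k=2, max_rows=100):
        self.k = k
        self.history = deque(maxlen=max_rows)  # state carried between rows

    def _distance(self, row, other):
        # Euclidean distance over the features that are observed in `row`
        observed = [name for name, value in row.items() if value is not None]
        return math.sqrt(sum((row[name] - other[name]) ** 2 for name in observed))

    def impute(self, row):
        missing = [name for name, value in row.items() if value is None]
        if not missing:
            # Only rows that arrived complete are remembered as neighbours
            self.history.append(row)
            return row
        filled = dict(row)
        if self.history:
            neighbours = sorted(
                self.history, key=lambda other: self._distance(row, other)
            )[: self.k]
            for name in missing:
                filled[name] = sum(n[name] for n in neighbours) / len(neighbours)
        # With an empty history the missing values stay as None
        return filled


imputer = RollingKNNImputer(k=2)
stream = [
    {"pm2.5": 10.0, "humidity": 40.0},
    {"pm2.5": 12.0, "humidity": 42.0},
    {"pm2.5": 50.0, "humidity": 80.0},
    {"pm2.5": 11.0, "humidity": None},
]
results = [imputer.impute(row) for row in stream]
print(results[-1])  # {'pm2.5': 11.0, 'humidity': 41.0}
```

The last row is closest to the first two rows in `pm2.5`, so its humidity is filled with their mean. A stateless variant would instead fit the imputer on each batch in isolation, which is simpler but cannot borrow neighbours from earlier batches.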



In [41]:
from bytewax.inputs import StatefulSourcePartition, FixedPartitionedSource
from bytewax.inputs import DynamicSource

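def impute_data_with_knn(batch):
    """Impute missing numeric values in a batch of records with KNN."""
    df = pd.DataFrame(batch)

    # Convert string columns that hold numbers; leave other columns as they are
    for column in df.columns:
        if df[column].dtype == 'object':
            try:
                df[column] = pd.to_numeric(df[column])
            except (ValueError, TypeError):
                continue

    numeric_columns = df.select_dtypes(include=[np.number]).columns
    # keep_empty_features=True keeps columns that are entirely missing in this
    # batch (filled with 0), so the result has the same shape as
    # df[numeric_columns] and the assignment below does not fail
    imputer = KNNImputer(n_neighbors=5, weights='uniform', keep_empty_features=True)
    imputed_array = imputer.fit_transform(df[numeric_columns])
    df[numeric_columns] = imputed_array
    return df.to_dict(orient='records')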

def batch_accumulator(key, record):
    """Accumulate records into batches."""
    return [record]

def batch_to_list(window_state):
    """Convert window state (list of lists) to a flat list of records."""
    flat_list = [item for sublist in window_state for item in sublist]
    return flat_list

Step 2: Integrate KNN imputation into the Bytewax dataflow
Next, use `impute_data_with_knn` in the dataflow. Since we're focusing on processing rather than generating new data from a source, we integrate it directly after data deserialization:

In [48]:
from bytewax.operators.window import TumblingWindow, WindowConfig
from datetime import datetime, timedelta, timezone
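
A tumbling window splits time into fixed-length, non-overlapping intervals counted from an alignment instant. The sketch below shows only that arithmetic in plain Python. It is not Bytewax's implementation, and `tumbling_window_bounds` is a hypothetical helper. The `length` and `align_to` values match the ones used in the dataflow in this section.

```python
from datetime import datetime, timedelta, timezone


def tumbling_window_bounds(ts, length, align_to):
    """Return (open_time, close_time) of the tumbling window containing ts."""
    index = (ts - align_to) // length  # whole windows elapsed since align_to
    open_time = align_to + index * length
    return open_time, open_time + length


align_to = datetime(2023, 1, 1, tzinfo=timezone.utc)
length = timedelta(seconds=10)

# Hypothetical timestamps at 1, 9, 10 and 25 seconds after align_to
timestamps = [align_to + timedelta(seconds=s) for s in (1, 9, 10, 25)]

windows = {}
for ts in timestamps:
    open_time, _close_time = tumbling_window_bounds(ts, length, align_to)
    windows.setdefault(open_time, []).append(ts)

# Show each window's contents as second offsets from align_to
offsets = [
    [int((ts - align_to).total_seconds()) for ts in group]
    for group in windows.values()
]
print(offsets)  # [[1, 9], [10], [25]]
```

An item that falls exactly on a boundary (10 seconds here) belongs to the window that opens at that instant, so each item lands in exactly one window.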


In [47]:
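from bytewax.operators.window import SystemClockConfig, collect_window

# Setup the dataflow
flow = Dataflow("air-quality-flow")
inp = op.input("inp", flow, SerializedInput(data))
deserialize = op.map("deserialize", inp, process_deserialized_data)

# Assign records to 10-second tumbling windows based on processing time
clock_config = SystemClockConfig()
window_config = TumblingWindow(
    length=timedelta(seconds=10),
    align_to=datetime(2023, 1, 1, tzinfo=timezone.utc)
)

# Window the data to create batches. `stateful_map` does not accept a window
# configuration, so `collect_window` is used to gather each key's records.
# In Bytewax 0.19 it emits (key, (window_metadata, records)).
window = collect_window("window", deserialize, clock_config, window_config)

def impute_window(key_window):
    """Impute the records collected in one window, keeping the key."""
    key, (_metadata, records) = key_window
    return (key, impute_data_with_knn(records))

# Impute data
impute = op.map("impute", window, impute_window)

# Output or further processing
op.inspect("inspect_imputed", impute)
run_main(flow)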