## Batch Processing of Air Quality Data for Anomaly Detection

This Jupyter notebook demonstrates the application of machine learning techniques for the batch processing and anomaly detection of air quality data. The motivation behind this notebook is to provide an automated solution for cleaning, imputing missing values, and detecting anomalies in air quality data sets. Such processes are critical for environmental monitoring and ensuring the reliability of data used in further analysis or reporting.

### Pipeline Overview
The processing pipeline can be visualized as follows:

```text
Data Retrieval -> Data Serialization -> Data Deserialization -> Data Imputation -> Anomaly Detection
```

### About the Data

We obtained a sample of the data from the [PurpleAir API](https://community.purpleair.com/t/making-api-calls-with-the-purpleair-api/180).

In this notebook you can find code to clean the data, perform missing values imputation, and detect anomalies using River.

In [None]:
!pip install bytewax==0.19 python-dotenv scipy==1.13.0 kafka-python==2.0.2 --q
!pip install pandas==2.0.3 river --q
!pip install scikit-learn==1.4.2 --q


### Detailed Code Walkthrough
Below is a breakdown of each component of the pipeline, accompanied by Python code snippets and their explanations.

Below are the imports we will use.



In [None]:

from datetime import datetime, timezone
from river import anomaly
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

import requests
import json


### Data Retrieval
Here, the air quality data is fetched from a provided URL.

In [1]:

url = 'https://raw.githubusercontent.com/bytewax/ml-iot/main/data.json'

resp = requests.get(url, timeout=30)
resp.raise_for_status()  # Stop early if the download failed
data = json.loads(resp.text)


Let's take a look at the data.

In [2]:
data.keys()

dict_keys(['api_version', 'time_stamp', 'data_time_stamp', 'max_age', 'firmware_default_version', 'fields', 'data'])

In [3]:
data['fields']

['sensor_index',
 'date_created',
 'rssi',
 'uptime',
 'latitude',
 'longitude',
 'humidity',
 'temperature',
 'pressure',
 'pm1.0',
 'pm2.5_alt',
 'pm10.0',
 'pm1.0_cf_1',
 'pm2.5_atm',
 'pm2.5_cf_1',
 'pm10.0_cf_1']

In [4]:
data['data'][0:2]

[[53,
  1454548891,
  -50,
  10183,
  40.246742,
  -111.7048,
  None,
  None,
  None,
  0.0,
  2.1,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 [77,
  1456896339,
  -58,
  2320,
  40.750816,
  -111.82529,
  None,
  None,
  None,
  15.5,
  14.5,
  15.9,
  15.5,
  15.8,
  15.8,
  15.9]]
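
For a quick tabular view of the same payload, the `fields` list can serve as column names for the rows in `data`. The sketch below is illustrative only: it uses a hypothetical `payload` holding four of the sixteen fields and the first two rows shown above, rather than the full download.

```python
import pandas as pd

# Hypothetical excerpt: four of the sixteen fields, first two rows shown above
payload = {
    "fields": ["sensor_index", "date_created", "humidity", "pm2.5_cf_1"],
    "data": [
        [53, 1454548891, None, 0.0],
        [77, 1456896339, None, 15.8],
    ],
}

# Rows are positional, so the field list doubles as the column index
preview = pd.DataFrame(payload["data"], columns=payload["fields"])
print(preview)

# Count missing values per column
print(preview.isna().sum())
```

Viewing the payload this way makes it easy to see which columns are mostly empty before deciding how to impute them.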

Given the structure of the data, we will first serialize it so that each record becomes a dictionary that maps field names to the corresponding values. In this way, we 'flatten' the JSON object.

We define a `serialize` function. This function takes the structured data and converts each record into a serialized byte format, preparing it for further processing steps such as transmission over a network.



In [None]:

def serialize(data):
    """
    This function serializes the data by converting it
    to a JSON string and then encoding it to bytes.

    Args:
    data: A dictionary containing the data to be serialized.

    Returns:
    A list of serialized data in bytes format.
    """
    headers = data['fields']
    serialized_data = []

    for entry in data['data']:
        try:
            # Create a dictionary for each entry, matching fields with values
            entry_data = {headers[i]: entry[i] for i in range(len(headers))}
            # Convert the dictionary to a JSON string and then encode it to bytes
            entry_bytes = json.dumps(entry_data).encode('utf-8')
            serialized_data.append(entry_bytes)
        except IndexError:
            # This block catches cases where the entry might not have all the fields
            print("IndexError with entry:", entry)
            continue

    return serialized_data
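
To make the flattening step concrete, here is a minimal sketch on a hypothetical three-field payload. The `flatten` helper and the `payload` values are assumptions for illustration; the helper pairs names with values using `zip` and an explicit length check instead of index lookups.

```python
import json

# Hypothetical three-field excerpt that mimics the payload structure
payload = {
    "fields": ["sensor_index", "humidity", "pm2.5_cf_1"],
    "data": [
        [53, None, 0.0],
        [77, None],  # malformed row: one value short
    ],
}


def flatten(payload):
    """Pair field names with row values and encode each record as JSON bytes."""
    fields = payload["fields"]
    records = []
    for row in payload["data"]:
        if len(row) != len(fields):
            # Skip rows whose length does not match the field list
            continue
        records.append(json.dumps(dict(zip(fields, row))).encode("utf-8"))
    return records


records = flatten(payload)
print(records)  # [b'{"sensor_index": 53, "humidity": null, "pm2.5_cf_1": 0.0}']
```

Note that Python's `None` becomes JSON `null` in the encoded bytes. The explicit length check also rejects rows with extra values, which an index-based comprehension over the field list would silently truncate.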




Once the data is serialized, we convert the byte data back into a usable dictionary format.

The `deserialize` function decodes the byte data back into structured dictionaries, converts the `date_created` timestamps from epoch to Python datetime objects, and casts `pm2.5_cf_1` to a float for analysis.



In [None]:
def deserialize(byte_objects_list):
    """
    This function deserializes the data by decoding the bytes into
    dictionaries. It converts the epoch time in "date_created" to a
    datetime object and converts "pm2.5_cf_1" to a float.

    Args:
    byte_objects_list: A list of byte objects to be deserialized.

    Returns:
    A list of dictionaries containing the deserialized data.
    """
    results = []  # List to hold the processed sensor data
    for byte_object in byte_objects_list:
        if byte_object:  # Check if byte_object is not empty
            sensor_data = json.loads(byte_object.decode('utf-8'))  # Decode and load JSON from bytes

            # Convert "pm2.5_cf_1" to a float, check if the value exists and is not None
            if 'pm2.5_cf_1' in sensor_data and sensor_data['pm2.5_cf_1'] is not None:
                sensor_data['pm2.5_cf_1'] = float(sensor_data['pm2.5_cf_1'])

            # Convert "date_created" from Unix epoch time to a datetime object, check if the value exists
            if 'date_created' in sensor_data and sensor_data['date_created'] is not None:
                sensor_data['date_created'] = datetime.fromtimestamp(sensor_data['date_created'], tz=timezone.utc)

            results.append(sensor_data)  # Add the processed data to the results list

    return results
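
As a small worked example of the two conversions, the sketch below decodes a single hypothetical record. The `raw` bytes are an assumption for illustration; the epoch value is the `date_created` of the first row shown earlier.

```python
import json
from datetime import datetime, timezone

# One record in the byte format produced by the serialization step
raw = b'{"sensor_index": 53, "date_created": 1454548891, "pm2.5_cf_1": 0}'

record = json.loads(raw.decode("utf-8"))

# JSON integers stay integers, so cast the reading explicitly
record["pm2.5_cf_1"] = float(record["pm2.5_cf_1"])

# Passing tz=timezone.utc keeps the result independent of the local timezone
record["date_created"] = datetime.fromtimestamp(record["date_created"], tz=timezone.utc)

print(record["date_created"].isoformat())  # 2016-02-04T01:21:31+00:00
print(type(record["pm2.5_cf_1"]).__name__)  # float
```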



With the data deserialized, the next step is to impute missing values. The function below uses K-nearest-neighbors (KNN) imputation to fill in missing or null values, which is essential for maintaining the integrity of subsequent analyses.

In [None]:


def impute_data_with_knn(deserialized_data):
    """
    Takes a list of dictionaries from deserialized data, converts it into a DataFrame,
    performs KNN imputation, and converts it back into a list of dictionaries.

    Args:
    deserialized_data: A list of dictionaries containing sensor data.

    Returns:
    A list of dictionaries with imputed data.
    """
    # Convert list of dictionaries to DataFrame
    df = pd.DataFrame(deserialized_data)

    # Ensure all numeric columns are in appropriate data types
    for column in df.columns:
        if df[column].dtype == 'object':
            try:
                df[column] = pd.to_numeric(df[column])
            except ValueError:
                continue  # Keep non-convertible columns as object if needed

    # Apply KNN imputer to numeric columns
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    imputer = KNNImputer(n_neighbors=5, weights='uniform')
    imputed_array = imputer.fit_transform(df[numeric_columns])

    # Update numeric columns in DataFrame with imputed values
    df[numeric_columns] = imputed_array

    # Convert DataFrame back to a list of dictionaries
    imputed_data = df.to_dict(orient='records')

    return imputed_data
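
To see what KNN imputation does on a small scale, consider the toy matrix below. The values are illustrative and unrelated to the sensor data, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow. Each missing cell is replaced by the mean of that column over the two nearest rows, with distances computed only over the features both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing cells (illustrative values, not sensor data)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
print(X_imputed)

# Row 0, column 2: nearest rows are 1 and 2, so the fill is mean(3, 5) = 4.0
# Row 2, column 0: nearest rows are 1 and 3, so the fill is mean(3, 8) = 5.5
```

Observed values are left unchanged; only the `NaN` cells are filled.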

Next, we will define an anomaly detector class.
This class uses the HalfSpaceTrees algorithm from the River package to score data points one at a time. For each record it first computes an anomaly score and then learns from the record, so data points that deviate significantly from expected patterns receive higher scores.



In [None]:
class AnomalyDetector:
    """
    Anomaly detector using HalfSpaceTrees from River library

    This class is used to detect anomalies in the data using online ML models
    with the River library
    """

    def __init__(self, n_trees=10, height=8, window_size=72, seed=11):
        """
        Initialize the anomaly detector
        """
        self.detector = anomaly.HalfSpaceTrees(
            n_trees=n_trees,
            height=height,
            window_size=window_size,
            limits={'x': (0.0, 1200)},  # ensure these limits make sense for your data
            seed=seed
        )

    def update(self, data):
        """
        Update the anomaly detector with new data
        """
        # Check if 'pm1.0_cf_1' is not None and is a floatable type
        if data['pm1.0_cf_1'] is not None:
            try:
                value = float(data['pm1.0_cf_1'])
                score = self.detector.score_one({'x': value})
                self.detector.learn_one({'x': value})
                data['score'] = score
            except (TypeError, ValueError):
                print(f"Skipping entry, invalid data for pm1.0_cf_1: {data['pm1.0_cf_1']}")
        else:
            print(f"Skipping entry, missing data for pm1.0_cf_1: {data}")
        return data

Let's bring the pieces together.

In [None]:
# Begin data processing
# Serialize the data to bytes
serialized_entries = serialize(data)
# Deserialize the data and transform epoch
deserialized_data = deserialize(serialized_entries)


In [None]:
pd.DataFrame(deserialized_data).isna().sum()


sensor_index       0
date_created       0
rssi               1
uptime             1
latitude          54
longitude         54
humidity         965
temperature      965
pressure        1023
pm1.0             31
pm2.5_alt         31
pm10.0            31
pm1.0_cf_1        31
pm2.5_atm         31
pm2.5_cf_1        31
pm10.0_cf_1       31
dtype: int64

In [None]:
# Perform KNN imputation on deserialized data
imputed_data = impute_data_with_knn(deserialized_data)


In [None]:
pd.DataFrame(imputed_data).isna().sum()

sensor_index    0
date_created    0
rssi            0
uptime          0
latitude        0
longitude       0
humidity        0
temperature     0
pressure        0
pm1.0           0
pm2.5_alt       0
pm10.0          0
pm1.0_cf_1      0
pm2.5_atm       0
pm2.5_cf_1      0
pm10.0_cf_1     0
dtype: int64

After applying the model, we can identify outliers by printing the entries whose anomaly score exceeds 0.7, as seen below.

In [None]:
anomaly_detector = AnomalyDetector(n_trees=4, height=3, window_size=50, seed=11)

# Iterate over each imputed data entry
for entry in imputed_data:
    updated_entry = anomaly_detector.update(entry)
    # Entries skipped by the detector carry no score
    score = updated_entry.get('score')
    if score is not None and score > 0.7:
        print(updated_entry)


{'sensor_index': 1970.0, 'date_created': Timestamp('2017-07-11 18:58:30+0000', tz='UTC'), 'rssi': -76.0, 'uptime': 17419.0, 'latitude': 33.99827, 'longitude': -118.437546, 'humidity': 34.0, 'temperature': 86.0, 'pressure': 1013.58, 'pm1.0': 2759.8, 'pm2.5_alt': 0.0, 'pm10.0': 2759.8, 'pm1.0_cf_1': 4139.3, 'pm2.5_atm': 2759.8, 'pm2.5_cf_1': 4139.3, 'pm10.0_cf_1': 4139.3, 'score': 0.9333333333333333}
{'sensor_index': 13907.0, 'date_created': Timestamp('2018-08-01 20:40:19+0000', tz='UTC'), 'rssi': -79.0, 'uptime': 26054.0, 'latitude': 37.85105, 'longitude': -122.27175, 'humidity': 27.0, 'temperature': 92.0, 'pressure': 1013.48, 'pm1.0': 3330.9, 'pm2.5_alt': 0.0, 'pm10.0': 3330.9, 'pm1.0_cf_1': 4997.0, 'pm2.5_atm': 3330.9, 'pm2.5_cf_1': 4997.0, 'pm10.0_cf_1': 4997.0, 'score': 0.9333333333333333}
{'sensor_index': 20423.0, 'date_created': Timestamp('2018-11-30 14:17:23+0000', tz='UTC'), 'rssi': -48.0, 'uptime': 3630.0, 'latitude': 37.826344, 'longitude': -120.732544, 'humidity': 31.0, 'temp

## Weaknesses of the Batch Processing Approach for Real-Time Data Changes
The batch processing approach described in the Jupyter notebook is well-suited for handling large datasets in a systematic manner, allowing for thorough cleaning, imputation, and anomaly detection. However, there are significant limitations when it comes to the adaptability and efficiency of this method, especially when dealing with data that changes in real time. Below are some critical weaknesses:

1. Lag in Response Time
Batch processing inherently involves processing data in large blocks at scheduled intervals. This results in a lag between data collection and data processing, making the approach less effective for applications that require real-time analysis or immediate action based on the latest data inputs. In environmental monitoring, for example, real-time data analysis can be crucial for issuing health advisories due to poor air quality.

2. Scalability Issues with Frequent Updates
As the data updates increase in frequency, the batch processing system may struggle to keep up without significant resources dedicated to handling these updates. If the data changes significantly between batches, the system might not capture transient anomalies or shifts in data trends effectively, potentially leading to missed detections or delayed responses.

3. Inefficiency in Resource Usage
Batch processes often require more computational resources because they handle large volumes of data at once. This can be inefficient, especially if only small parts of the dataset require updates or if the data contains a lot of redundancies. Continuous processing, on the other hand, can be more resource-efficient as it processes data incrementally.

4. Difficulty Adapting to New Patterns
The models used in batch processing are typically trained on historical data and might not adapt quickly to new or emerging patterns. This is particularly problematic for anomaly detection in environmental data, which can be influenced by sudden and unpredictable changes in environmental conditions. If the model cannot update its parameters in real-time or near-real-time, it may not perform well against newly evolving data trends.

5. Potential for Data Drift
Data drift refers to the change in the input data's distribution over time. In batch processing, there can be a significant delay between model updates, during which the data might drift, leading to model degradation. This can cause the model to make inaccurate predictions or fail to detect anomalies, as it no longer represents the current state of the environment accurately.

6. Error Accumulation
Errors in earlier stages of the batch processing pipeline can propagate and amplify by the time the data reaches the anomaly detection stage. Since data is processed in large chunks, identifying the source of errors or inconsistencies can be challenging, complicating the troubleshooting and adjustment processes.

### Conclusion
While the batch processing approach offers a structured and comprehensive method for handling complex datasets, its application to real-time or near-real-time scenarios is limited. For environments where data is rapidly changing or where immediate data processing is crucial, a more dynamic approach such as stream processing might be necessary. Stream processing allows for continuous data ingestion and immediate analysis, which is more suitable for applications demanding quick responses and high adaptability.
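
To illustrate the contrast, the following is a minimal sketch of incremental processing in plain Python. It is not the notebook's method: it swaps HalfSpaceTrees for a simple running z-score (Welford's algorithm) so that it runs with the standard library alone. The `RunningZScore` class, the `process_stream` generator, the threshold of 3.0, and the sample readings are all assumptions for illustration.

```python
import math


class RunningZScore:
    """Running mean and variance (Welford's algorithm) as a stand-in detector."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # Sum of squared deviations from the running mean

    def score_one(self, value):
        # Not enough history to estimate a spread yet
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(value - self.mean) / std if std > 0 else 0.0

    def learn_one(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


def process_stream(readings, field="pm2.5_cf_1", threshold=3.0):
    """Score and learn from each reading as it arrives, one record at a time."""
    detector = RunningZScore()
    for reading in readings:
        value = reading.get(field)
        if value is None:
            continue  # Skip records without a usable reading
        score = detector.score_one(value)
        detector.learn_one(value)
        yield {**reading, "score": score, "is_anomaly": score > threshold}


# Hypothetical readings: stable values with one spike at index 5
values = [10.0, 12.0, 11.0, 9.0, 10.5, 400.0, 11.5]
readings = [{"sensor_index": i, "pm2.5_cf_1": v} for i, v in enumerate(values)]

flagged = [r["sensor_index"] for r in process_stream(readings) if r["is_anomaly"]]
print(flagged)  # [5]
```

Because `process_stream` is a generator, each record is scored as soon as it is read, and the detector's state is updated without revisiting earlier data.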