## Batch Processing of Air Quality Data for Anomaly Detection

This Jupyter notebook demonstrates the application of machine learning techniques for the batch processing and anomaly detection of air quality data. The motivation behind this notebook is to provide an automated solution for cleaning, imputing missing values, and detecting anomalies in air quality data sets. Such processes are critical for environmental monitoring and ensuring the reliability of data used in further analysis or reporting.

### Pipeline Overview
The processing pipeline can be visualized as follows:

```text
Data Retrieval -> Data Serialization -> Data Deserialization -> Data Imputation -> Anomaly Detection
```

### About the Data

We obtained a sample of the data from the [PurpleAir API](https://community.purpleair.com/t/making-api-calls-with-the-purpleair-api/180).

In this notebook you can find code to clean the data, impute missing values, and detect anomalies using scikit-learn's Isolation Forest.

In [2]:
!pip install bytewax==0.19 python-dotenv scipy==1.13.0 kafka-python==2.0.2 --q
!pip install pandas==2.0.3
!pip install scikit-learn==1.4.2 --q


### Detailed Code Walkthrough
Below is a breakdown of each component of the pipeline, accompanied by Python code snippets and their explanations.

Below are the imports we will use.



In [3]:
from datetime import datetime, timezone
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

import requests
import json

from sklearn.ensemble import IsolationForest


### Data Retrieval
Here, the air quality data is fetched from a provided URL.

In [4]:

url = 'https://raw.githubusercontent.com/bytewax/ml-iot/main/data.json'

resp = requests.get(url, timeout=30)
resp.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
data = resp.json()


Let's take a look at the data

In [2]:
data.keys()

dict_keys(['api_version', 'time_stamp', 'data_time_stamp', 'max_age', 'firmware_default_version', 'fields', 'data'])

In [3]:
data['fields']

['sensor_index',
 'date_created',
 'rssi',
 'uptime',
 'latitude',
 'longitude',
 'humidity',
 'temperature',
 'pressure',
 'pm1.0',
 'pm2.5_alt',
 'pm10.0',
 'pm1.0_cf_1',
 'pm2.5_atm',
 'pm2.5_cf_1',
 'pm10.0_cf_1']

In [4]:
data['data'][0:2]

[[53,
  1454548891,
  -50,
  10183,
  40.246742,
  -111.7048,
  None,
  None,
  None,
  0.0,
  2.1,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 [77,
  1456896339,
  -58,
  2320,
  40.750816,
  -111.82529,
  None,
  None,
  None,
  15.5,
  14.5,
  15.9,
  15.5,
  15.8,
  15.8,
  15.9]]

Given the structure of the data, we will first serialize it so that each record becomes a dictionary mapping field names to their values. In this way, we 'flatten' the JSON object.

We define a `serialize` function. It takes the structured data and converts each record into a serialized byte format, preparing it for further processing steps such as transmission over a network.



In [5]:

def serialize(data):
    """
    This function serializes the data by converting it
    to a JSON string and then encoding it to bytes.

    Args:
    data: A dictionary containing the data to be serialized.

    Returns:
    A list of serialized data in bytes format.
    """
    headers = data['fields']
    serialized_data = []

    for entry in data['data']:
        try:
            # Create a dictionary for each entry, matching fields with values
            entry_data = {headers[i]: entry[i] for i in range(len(headers))}
            # Convert the dictionary to a JSON string and then encode it to bytes
            entry_bytes = json.dumps(entry_data).encode('utf-8')
            serialized_data.append(entry_bytes)
        except IndexError:
            # This block catches cases where the entry might not have all the fields
            print("IndexError with entry:", entry)
            continue

    return serialized_data
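
As a quick check of the flattening step, the sketch below applies the same idea to a hand-made payload. The `sample` dictionary and the `flatten_and_encode` helper are illustrative stand-ins (they are not part of the notebook), with values borrowed from the two records shown earlier.

```python
import json

# Hypothetical miniature payload shaped like the API response (illustration only)
sample = {
    "fields": ["sensor_index", "date_created", "humidity", "pm2.5_cf_1"],
    "data": [
        [53, 1454548891, None, 0.0],
        [77, 1456896339, None, 15.8],
    ],
}


def flatten_and_encode(payload):
    """Pair each row with the field names and encode the result as JSON bytes."""
    return [
        json.dumps(dict(zip(payload["fields"], row))).encode("utf-8")
        for row in payload["data"]
    ]


encoded = flatten_and_encode(sample)
first = json.loads(encoded[0].decode("utf-8"))
print(type(encoded[0]).__name__, first)
```

Note that `zip` silently truncates a row that has fewer values than there are fields, whereas the index-based loop above raises `IndexError` and skips the entry. Which behavior is preferable depends on how strictly malformed rows should be handled.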




Once the data is serialized, we convert the bytes back into a usable dictionary format.

The `deserialize` function decodes each byte object into a dictionary, converts the `date_created` timestamp from Unix epoch time to a timezone-aware Python `datetime`, and casts `pm2.5_cf_1` to a float for analysis.



In [6]:
def deserialize(byte_objects_list):
    """
    This function deserializes the data by decoding the bytes.
    It also converts the epoch time in "date_created" to a
    datetime object and converts "pm2.5_cf_1" to a float.

    Args:
    byte_objects_list: A list of byte objects to be deserialized.

    Returns:
    A list of dictionaries containing the deserialized data.
    """
    results = []  # List to hold the processed sensor data
    for byte_object in byte_objects_list:
        if byte_object:  # Check if byte_object is not empty
            sensor_data = json.loads(byte_object.decode('utf-8'))  # Decode and load JSON from bytes

            # Convert "pm2.5_cf_1" to a float, check if the value exists and is not None
            if 'pm2.5_cf_1' in sensor_data and sensor_data['pm2.5_cf_1'] is not None:
                sensor_data['pm2.5_cf_1'] = float(sensor_data['pm2.5_cf_1'])

            # Convert "date_created" from Unix epoch time to a datetime object, check if the value exists
            if 'date_created' in sensor_data and sensor_data['date_created'] is not None:
                sensor_data['date_created'] = datetime.fromtimestamp(sensor_data['date_created'], tz=timezone.utc)

            results.append(sensor_data)  # Add the processed data to the results list

    return results
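
The two conversions can be checked on a single record. The byte string below is hand-written for illustration (an assumption, not taken from the API response), reusing the `date_created` value of the first record shown earlier.

```python
import json
from datetime import datetime, timezone

# Hand-written example record (illustration only)
raw = b'{"sensor_index": 53, "date_created": 1454548891, "pm2.5_cf_1": 0}'

record = json.loads(raw.decode("utf-8"))

# JSON integers arrive as int; cast so the column has a consistent float type
record["pm2.5_cf_1"] = float(record["pm2.5_cf_1"])

# Passing tz makes the result timezone-aware and independent of the local machine
record["date_created"] = datetime.fromtimestamp(record["date_created"], tz=timezone.utc)

print(record["date_created"].isoformat())
print(type(record["pm2.5_cf_1"]).__name__)
```

Passing `tz=timezone.utc` matters here: without it, `datetime.fromtimestamp` returns a naive datetime in the local timezone, so the same epoch value would print differently on different machines.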



With the data deserialized, the next step is to impute missing values. The function below uses K-nearest-neighbors (KNN) imputation to fill in missing or null values in the numeric columns, so that later steps such as anomaly detection receive complete records.

In [7]:


def impute_data_with_knn(deserialized_data):
    """
    Takes a list of dictionaries from deserialized data, converts it into a DataFrame,
    performs KNN imputation, and converts it back into a list of dictionaries.

    Args:
    deserialized_data: A list of dictionaries containing sensor data.

    Returns:
    A list of dictionaries with imputed data.
    """
    # Convert list of dictionaries to DataFrame
    df = pd.DataFrame(deserialized_data)

    # Ensure all numeric columns are in appropriate data types
    for column in df.columns:
        if df[column].dtype == 'object':
            try:
                df[column] = pd.to_numeric(df[column])
            except (ValueError, TypeError):
                continue  # Keep non-convertible columns as object if needed

    # Apply KNN imputer to numeric columns
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    imputer = KNNImputer(n_neighbors=5, weights='uniform')
    imputed_array = imputer.fit_transform(df[numeric_columns])

    # Update numeric columns in DataFrame with imputed values
    df[numeric_columns] = imputed_array

    # Convert DataFrame back to a list of dictionaries
    imputed_data = df.to_dict(orient='records')

    return imputed_data
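
To make the idea behind KNN imputation concrete, here is a minimal pure-Python sketch. It is a simplified illustration and not scikit-learn's implementation: `nan_euclidean`, `knn_impute`, and the `toy` matrix are names introduced for this example. The distance follows the NaN-aware Euclidean metric that `KNNImputer` uses by default, which only compares coordinates present in both rows and rescales by the share of usable coordinates.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows, rescaled
    by total/shared coordinate count; inf if the rows share nothing."""
    shared = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(len(a) / len(shared) * squared)


def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that column over the k
    nearest rows in which the column is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: every other row with column j observed
            donors = [
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and not math.isnan(other[j])
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


nan = math.nan
toy = [
    [1.0, 2.0, nan],
    [3.0, 4.0, 3.0],
    [nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
]
result = knn_impute(toy, k=2)
print(result)
```

In the toy matrix, the gap in the first row is filled from the second and third rows (its two nearest donors), and the gap in the third row is filled from the second and fourth. The notebook's `impute_data_with_knn` applies the same principle with `n_neighbors=5` across all numeric sensor columns.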

Next, we will define an anomaly detector class.


The Isolation Forest is a good choice for batch anomaly detection due to its effectiveness with multidimensional data and its capability of identifying anomalies without needing a target label. Here's how:

1. Initialize the Isolation Forest: The detector is initialized with parameters such as the number of trees (n_estimators), the sample size (max_samples), and the contamination factor which represents the proportion of outliers expected in the data.
2. Fit the Model: Since we are assuming batch processing, the model can be fitted on a predefined dataset. This dataset should ideally represent typical "normal" data to help the model learn the structure of non-anomalous data.
3. Predict and Score Anomalies: After the model is fitted, it can label new data points as normal (1) or anomalous (-1) based on the isolation properties learned during training. The `score_samples` function returns a raw score for each point; lower (more negative) scores indicate more anomalous points.
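
The score in step 3 reflects how quickly a point gets isolated by random splits. As a rough sketch of the scoring formula from the original Isolation Forest paper, the score is 2 raised to the power of -E[h(x)] / c(n), where E[h(x)] is the point's average path length across trees and c(n) is the expected path length for a sample of size n. The function names below are hypothetical and scikit-learn's internals differ in detail; its `score_samples` reports the negative of this paper-style score.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length c(n) of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """Paper-style score in (0, 1]: values near 1 mean the point was isolated quickly."""
    return 2.0 ** (-mean_depth / average_path_length(n))


n = 256  # max_samples='auto' caps each tree's sample at min(256, n_samples)
typical_depth = average_path_length(n)

score_typical = anomaly_score(typical_depth, n)   # depth equal to c(n)
score_shallow = anomaly_score(2.0, n)             # isolated after very few splits
score_deep = anomaly_score(2 * typical_depth, n)  # buried deep inside a dense region

print(score_typical, score_shallow, score_deep)
```

A point whose average depth equals c(n) lands exactly in the middle of the scale, while points isolated after only a couple of splits approach 1. This is why sparse, extreme readings stand out without any labels.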


In [8]:
class AnomalyDetector:
    """
    Anomaly detector using Isolation Forest from scikit-learn
    """

    def __init__(self, n_estimators=100, max_samples='auto', contamination=0.01, random_state=42):
        """
        Initialize the anomaly detector with parameters suitable for Isolation Forest
        """
        self.detector = IsolationForest(
            n_estimators=n_estimators,
            max_samples=max_samples,
            contamination=contamination,
            random_state=random_state
        )

    def fit(self, data):
        """
        Fit the Isolation Forest model with data
        """
        self.detector.fit(data)

    def predict(self, data):
        """
        Predict data using the fitted model and tag entries as anomalies
        """
        # -1 for anomalies, 1 for normal
        predictions = self.detector.predict(data)
        scores = self.detector.score_samples(data)
        return predictions, scores

Let's bring the pieces together.

In [9]:
# Begin data processing
# Serialize the data to bytes
serialized_entries = serialize(data)
# Deserialize the data and transform epoch
deserialized_data = deserialize(serialized_entries)


In [None]:
pd.DataFrame(deserialized_data).isna().sum()


sensor_index       0
date_created       0
rssi               1
uptime             1
latitude          54
longitude         54
humidity         965
temperature      965
pressure        1023
pm1.0             31
pm2.5_alt         31
pm10.0            31
pm1.0_cf_1        31
pm2.5_atm         31
pm2.5_cf_1        31
pm10.0_cf_1       31
dtype: int64

In [10]:
# Perform KNN imputation on deserialized data
imputed_data = impute_data_with_knn(deserialized_data)


In [None]:
pd.DataFrame(imputed_data).isna().sum()

sensor_index    0
date_created    0
rssi            0
uptime          0
latitude        0
longitude       0
humidity        0
temperature     0
pressure        0
pm1.0           0
pm2.5_alt       0
pm10.0          0
pm1.0_cf_1      0
pm2.5_atm       0
pm2.5_cf_1      0
pm10.0_cf_1     0
dtype: int64

We can identify outliers after applying the model as seen below.

In [11]:
# Create the detector with its default parameters
detector = AnomalyDetector()

# Convert the list of dictionaries returned by the imputation step to a DataFrame
df = pd.DataFrame(imputed_data)

# Select only the numeric columns for anomaly detection
numeric_columns = df.select_dtypes(include=[np.number])

# Fit the model with numeric data
detector.fit(numeric_columns)

# Predict anomalies on the same or new data
predictions, scores = detector.predict(numeric_columns)

In [13]:
# Add predictions and scores back to the DataFrame for review or further processing
df['anomaly'] = predictions
df['score'] = scores

# Print or process anomalies
anomalies = df[df['anomaly'] == -1]
anomalies

| | sensor_index | date_created | rssi | uptime | latitude | longitude | humidity | temperature | pressure | pm1.0 | pm2.5_alt | pm10.0 | pm1.0_cf_1 | pm2.5_atm | pm2.5_cf_1 | pm10.0_cf_1 | anomaly | score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 1970.0 | 2017-07-11 18:58:30+00:00 | -76.0 | 17419.0 | 33.998270 | -118.437546 | 34.0 | 86.0 | 1013.58 | 2759.8 | 0.0 | 2759.8 | 4139.3 | 2759.8 | 4139.3 | 4139.3 | -1 | -0.743272 |
| 127 | 2334.0 | 2017-07-31 18:03:42+00:00 | -55.0 | 231.0 | 41.045155 | -111.985910 | 70.0 | 39.0 | 862.87 | 108.9 | 53.6 | 133.7 | 159.0 | 132.4 | 185.8 | 186.5 | -1 | -0.737338 |
| 1116 | 13247.0 | 2018-07-15 18:44:55+00:00 | -76.0 | 74.0 | 12.448359 | 75.694170 | 70.0 | 86.0 | 887.86 | 123.2 | 153.2 | 185.6 | 185.8 | 176.0 | 265.1 | 279.4 | -1 | -0.760949 |
| 1161 | 13907.0 | 2018-08-01 20:40:19+00:00 | -79.0 | 26054.0 | 37.851050 | -122.271750 | 27.0 | 92.0 | 1013.48 | 3330.9 | 0.0 | 3330.9 | 4997.0 | 3330.9 | 4997.0 | 4997.0 | -1 | -0.757399 |
| 1748 | 19383.0 | 2018-11-16 22:24:14+00:00 | -72.0 | 62256.0 | 37.328760 | -121.897870 | 36.0 | 86.0 | 1012.87 | 155.4 | 149.4 | 310.6 | 155.4 | 178.0 | 268.2 | 310.6 | -1 | -0.730170 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25038 | 206631.0 | 2023-12-18 19:42:28+00:00 | -73.0 | 7628.0 | 21.075000 | 105.809070 | 60.0 | 85.0 | 1013.73 | 34.3 | 55.0 | 95.3 | 52.2 | 60.8 | 92.5 | 144.3 | -1 | -0.698228 |
| 25076 | 207267.0 | 2023-12-20 20:14:46+00:00 | -73.0 | 7628.0 | 21.075190 | 105.808510 | 59.0 | 86.0 | 1013.89 | 34.1 | 49.9 | 86.4 | 50.4 | 57.8 | 87.8 | 128.8 | -1 | -0.689231 |
| 25190 | 209433.0 | 2024-01-04 16:05:37+00:00 | -59.0 | 41566.0 | 13.705800 | 100.539000 | 50.0 | 93.0 | 1001.68 | 44.2 | 54.9 | 72.7 | 67.7 | 61.3 | 92.9 | 101.1 | -1 | -0.707000 |
| 25322 | 211941.0 | 2024-01-26 19:39:40+00:00 | -63.0 | 336.0 | 59.360455 | 17.990343 | 30.0 | 78.0 | 1015.15 | 112.0 | 160.5 | 192.2 | 169.3 | 181.5 | 273.5 | 289.7 | -1 | -0.755836 |
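
The `anomaly` labels follow from thresholding the `score` column. As a hedged sketch of that relationship: our understanding is that, when `contamination` is a float, scikit-learn places the cut-off at the corresponding percentile of the training scores and flags points whose score falls below it. The `percentile` and `label_scores` helpers and the scores below are made up for illustration and are not scikit-learn code.

```python
def percentile(values, q):
    """Percentile with linear interpolation between neighbors, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def label_scores(scores, contamination):
    """Flag scores below the contamination percentile as anomalies (-1)."""
    threshold = percentile(scores, 100.0 * contamination)
    labels = [-1 if s < threshold else 1 for s in scores]
    return threshold, labels


# Made-up scores in the style of score_samples (lower = more anomalous)
scores = [-0.42, -0.45, -0.76, -0.41, -0.44, -0.43, -0.40, -0.46, -0.47, -0.39]
threshold, labels = label_scores(scores, contamination=0.1)

print(threshold)
print(labels)
```

One consequence of this design is that the share of flagged points on the training data is set by `contamination` rather than learned from the data. With the notebook's default of `contamination=0.01`, roughly 1% of the fitted records are flagged regardless of how many are truly unusual.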


## Weaknesses of the Batch Processing Approach for Real-Time Data Changes
The batch processing approach used in this notebook is well-suited to handling large datasets in a systematic manner, allowing for thorough cleaning, imputation, and anomaly detection. However, it has significant limitations in adaptability and efficiency, especially when dealing with data that changes in real time. Below are some critical weaknesses:

1. Lag in Response Time
Batch processing inherently involves processing data in large blocks at scheduled intervals. This results in a lag between data collection and data processing, making the approach less effective for applications that require real-time analysis or immediate action based on the latest data inputs. In environmental monitoring, for example, real-time data analysis can be crucial for issuing health advisories due to poor air quality.

2. Scalability Issues with Frequent Updates
As the data updates increase in frequency, the batch processing system may struggle to keep up without significant resources dedicated to handling these updates. If the data changes significantly between batches, the system might not capture transient anomalies or shifts in data trends effectively, potentially leading to missed detections or delayed responses.

3. Inefficiency in Resource Usage
Batch processes often require more computational resources because they handle large volumes of data at once. This can be inefficient, especially if only small parts of the dataset require updates or if the data contains a lot of redundancies. Continuous processing, on the other hand, can be more resource-efficient as it processes data incrementally.

4. Difficulty Adapting to New Patterns
The models used in batch processing are typically trained on historical data and might not adapt quickly to new or emerging patterns. This is particularly problematic for anomaly detection in environmental data, which can be influenced by sudden and unpredictable changes in environmental conditions. If the model cannot update its parameters in real-time or near-real-time, it may not perform well against newly evolving data trends.

5. Potential for Data Drift
Data drift refers to the change in the input data's distribution over time. In batch processing, there can be a significant delay between model updates, during which the data might drift, leading to model degradation. This can cause the model to make inaccurate predictions or fail to detect anomalies, as it no longer represents the current state of the environment accurately.

6. Error Accumulation
Errors in earlier stages of the batch processing pipeline can propagate and amplify by the time the data reaches the anomaly detection stage. Since data is processed in large chunks, identifying the source of errors or inconsistencies can be challenging, complicating the troubleshooting and adjustment processes.
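
Several of the points above (resource usage, adapting to new patterns, drift) come down to whether the model can be updated one observation at a time. As a contrast to the batch workflow, here is a minimal sketch of an incremental detector built on a running mean and variance (Welford's algorithm) with a z-score rule. It is an illustrative stand-in, not the method used in this notebook; the class name, threshold, warm-up length, and readings are all assumptions.

```python
import math


class RunningZScore:
    """Incremental z-score detector with O(1) memory (hypothetical example)."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold
        self.warmup = warmup
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def score(self, x):
        """z-score against the statistics seen so far (0.0 during warm-up)."""
        if self.count < self.warmup:
            return 0.0
        std = math.sqrt(self.m2 / (self.count - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        """Welford update: fold one observation into the mean and variance."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)


detector = RunningZScore()
readings = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 95.0, 12.3]  # made-up PM2.5 values

flags = []
for value in readings:
    # Score first, then learn, so each point is judged only by earlier data
    flags.append(detector.score(value) > detector.threshold)
    detector.update(value)

print(flags)
```

Each reading is scored the moment it arrives and the statistics are refreshed immediately, so there is no refit over the full dataset. The sketch also updates on flagged points, which inflates the variance after a spike; skipping the update for anomalies is one possible alternative.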

### Conclusion
While the batch processing approach offers a structured and comprehensive method for handling complex datasets, its application to real-time or near-real-time scenarios is limited. For environments where data is rapidly changing or where immediate data processing is crucial, a more dynamic approach such as stream processing might be necessary. Stream processing allows for continuous data ingestion and immediate analysis, which is more suitable for applications demanding quick responses and high adaptability.

To recap the anomaly detection setup used in this notebook:

1. The `fit` method trains the model on the data you expect to be normal.
2. The `predict` method then labels new data points as normal or anomalous based on their isolation depth in the forest.

This approach is flexible: it supports both initial training and subsequent anomaly detection on new batches of data.