In [1]:
import pandas as pd
# import numpy as np
import json
import os
import re


In [5]:
aranetExp_path = "./datasets/aranetExp.csv"
aranet4_path = "./datasets/aranet4.csv"

aranetExp = pd.read_csv(aranetExp_path)
aranet4 = pd.read_csv(aranet4_path)


Here's a plan for the function:

1. Define a function `calculate_rolling` that takes four parameters: `dataframe`, `column_names`, `window_sizes`, and `operations`.
2. Inside the function, iterate over each `column_name` in `column_names`.
3. For each `column_name`, iterate over each `window_size` in `window_sizes`.
4. For each `window_size`, iterate over each `operation` in `operations`.
5. For each `operation`, build a rolling window over the `column_name` column with the given `window_size` using the `rolling` method.
6. Look up the method named by `operation` on the rolling object with `getattr` and call it; catch and report any operation that fails.
7. Store the result in a new column in the dataframe with a name that combines the `column_name`, `window_size`, and `operation`.
8. After all iterations, return the modified dataframe.

Here's the Python code for the function:



In [6]:
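def calculate_rolling(dataframe, column_names, window_sizes, operations):
    """Add rolling-statistic columns to `dataframe` in place and return it."""
    for column_name in column_names:
        for window_size in window_sizes:
            # min_periods=1 emits a value as soon as one observation is in the window
            rolling = dataframe[column_name].rolling(window=window_size, min_periods=1)
            for operation in operations:
                new_column = f'{column_name}_rolling_{operation}_{window_size}'
                try:
                    # Look up the rolling method by name (e.g. 'mean') and call it
                    dataframe[new_column] = getattr(rolling, operation)()
                except Exception as e:
                    print(f"Operation '{operation}' failed on '{column_name}': {e}")
    return dataframe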
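One detail of the function above worth seeing in isolation is `min_periods=1`. The following is a minimal sketch on a made-up four-value series (the values and the names `strict` and `lenient` are illustrative only) comparing it with the pandas default, where `min_periods` equals the window size for an integer window.

```python
import pandas as pd

# Made-up values, used only to illustrate the behavior
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Default: a result is produced only once the window holds 3 observations,
# so the first two rows are NaN
strict = s.rolling(window=3).mean()

# min_periods=1: a result is produced as soon as one observation is available,
# so the early rows are averages over shorter windows
lenient = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"strict": strict, "lenient": lenient}))
```

With `min_periods=1` the first rows are computed from fewer than `window_size` values, so the rolling columns have no leading gaps, at the cost of those early values being less comparable to the rest.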



You can use this function with your `main_features` list, a list of window sizes, and a list of operations. For example:



In [7]:
main_features = ['Carbon dioxide(ppm)', 'Temperature(°F)', 'Relative humidity(%)', 'Atmospheric pressure(hPa)']
window_sizes = [5, 15, 30, 60]
operations = ['mean', 'max', 'sum', 'std', 'count']
aranet4_mod = aranet4.copy()
aranet4_mod = calculate_rolling(aranet4_mod, main_features, window_sizes, operations)

aranet4_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97861 entries, 0 to 97860
Data columns (total 87 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Carbon dioxide(ppm)                         97861 non-null  int64  
 1   Temperature(°F)                             97861 non-null  float64
 2   Relative humidity(%)                        97808 non-null  float64
 3   Atmospheric pressure(hPa)                   95339 non-null  float64
 4   Date                                        97861 non-null  object 
 5   Time                                        97861 non-null  object 
 6   Datetime                                    97861 non-null  object 
 7   Carbon dioxide(ppm)_rolling_mean_5          97861 non-null  float64
 8   Carbon dioxide(ppm)_rolling_max_5           97861 non-null  float64
 9   Carbon dioxide(ppm)_rolling_sum_5           97861 non-null  float64
 10  Carbon dio

The object returned by the `rolling` method in pandas provides a variety of operations that can be performed on a rolling window of data. Here are some of them:

1. `mean()`: Compute the rolling mean of the data.
2. `sum()`: Compute the rolling sum of the data.
3. `min()`: Compute the rolling minimum of the data.
4. `max()`: Compute the rolling maximum of the data.
5. `std()`: Compute the rolling standard deviation of the data.
6. `var()`: Compute the rolling variance of the data.
7. `median()`: Compute the rolling median of the data.
8. `quantile()`: Compute the rolling quantile of the data.
9. `apply()`: Apply a function of your choice to the rolling window of data.
10. `corr()`: Compute the rolling correlation between two series.
11. `cov()`: Compute the rolling covariance between two series.
12. `skew()`: Compute the rolling skewness of the data.
13. `kurt()`: Compute the rolling kurtosis of the data.
14. `count()`: Count the number of non-NA values in the rolling window of data.

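You can modify the `calculate_rolling` function to include any of these operations. Operations that need no arguments, such as `min`, `var`, `median`, `skew`, and `kurt`, can be added to the `operations` list directly. `quantile()` and `apply()` require an argument, and `corr()` and `cov()` are intended to take a second series, so those need a change to how the function calls the method.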
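One possible way to cover operations that take arguments is to pass each operation as a (method name, positional arguments) pair. The sketch below is hypothetical: `calculate_rolling_with_args`, the layout of the `ops` dictionary, and the toy column `x` are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def calculate_rolling_with_args(dataframe, column_names, window_sizes, operations):
    """Hypothetical variant: `operations` maps a label to (method_name, args)."""
    for column_name in column_names:
        for window_size in window_sizes:
            roller = dataframe[column_name].rolling(window=window_size, min_periods=1)
            for label, (method_name, args) in operations.items():
                new_column = f"{column_name}_rolling_{label}_{window_size}"
                # Call the named rolling method with its positional arguments
                dataframe[new_column] = getattr(roller, method_name)(*args)
    return dataframe

# Toy data, made up for the example
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

ops = {
    "median": ("median", ()),                              # no arguments
    "q90": ("quantile", (0.9,)),                           # quantile level
    "range": ("apply", (lambda w: w.max() - w.min(),)),    # custom function
}

toy = calculate_rolling_with_args(toy, ["x"], [3], ops)
print(toy)
```

The label is kept separate from the method name so that two calls to the same method (for example two different quantiles) end up in distinct columns.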

We can likewise use the `diff()` method to track the minute-to-minute differences between records. This effectively measures the minute-to-minute rate of change. Because the minute is our smallest unit of time, it is also the smallest step for our difference calculations. The distribution of the minute-to-minute rate-of-change values is approximately normal and centered about zero.

- `diff(1)` is the one-minute difference (the current record minus the 1st previous record).
- `diff(2)` is the two-minute difference (the current record minus the 2nd previous record).
- `diff(n)` is the n-minute difference (the current record minus the nth previous record).
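The definitions above can be checked on a small example. This is a sketch on made-up readings (the series `co2` is illustrative, not data from the sensor) showing that `diff(n)` matches subtracting the series shifted by `n` rows.

```python
import pandas as pd

# Made-up readings, one per row
co2 = pd.Series([400, 405, 403, 410, 420], name="co2_ppm")

diff_1 = co2.diff(1)  # current row minus the previous row
diff_2 = co2.diff(2)  # current row minus the row two steps back

# diff(n) is the same as subtracting the series shifted by n rows
assert diff_1.equals(co2 - co2.shift(1))
assert diff_2.equals(co2 - co2.shift(2))

print(pd.DataFrame({"co2": co2, "diff_1": diff_1, "diff_2": diff_2}))
```

Note that `diff` counts rows, not elapsed time, so reading `diff(n)` as an n-minute difference assumes the records are evenly spaced one minute apart with no gaps.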


I question whether anything beyond `diff(5)` is relevant. What is interesting is the pattern formed by tracking the magnitude and sign of consecutive differences. Much of this can be captured with `diff(1)` alone, though, by watching its rolling statistics as well as its sign-change patterns.
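As a rough sketch of what tracking sign patterns of `diff(1)` could look like, the snippet below flags sign changes and measures how long the current run of same-signed differences has lasted. The readings and the names `sign_change`, `run_id`, and `run_length` are made up for illustration; this is one possible approach, not necessarily the one used later.

```python
import numpy as np
import pandas as pd

# Made-up readings, one per row
co2 = pd.Series([400, 405, 403, 410, 420, 420, 415])
d1 = co2.diff(1)

# Sign of each difference: -1 (falling), 0 (flat), +1 (rising); NaN for the first row
sign = np.sign(d1)
previous_sign = sign.shift(1)

# A sign change needs both the current and the previous sign to be defined
sign_change = sign.ne(previous_sign) & sign.notna() & previous_sign.notna()

# Label each run of identical consecutive signs, then count position within the run
run_id = sign.ne(previous_sign).cumsum()
run_length = sign.groupby(run_id).cumcount() + 1

print(pd.DataFrame({"d1": d1, "sign": sign,
                    "sign_change": sign_change, "run_length": run_length}))
```

Long runs of the same sign indicate sustained rises or falls, while frequent sign changes indicate readings fluctuating around a stable level.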

We can additionally take the difference of the difference, which could reveal patterns of sustained accelerating change in the monitored features.

Given the structure of our air monitor and the fact that it is stable and not moving the vast majority of the time, we can also consider this to be a measure of a volume or cross-section of the airflow through (into) the sensor chamber. The difference is a measure of the finite-difference velocity at which `CO2 (ppm)` flows into or out of the chamber (or does not flow at all) from minute to minute. The second difference is the finite-difference acceleration.
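Written out, with $c_t$ denoting the `CO2 (ppm)` reading at minute $t$ (the symbols $c_t$, $v_t$, and $a_t$ are notation introduced here for illustration) and a time step of one minute, the two finite differences are:

```latex
\begin{aligned}
v_t &= c_t - c_{t-1} \\
a_t &= v_t - v_{t-1} = c_t - 2c_{t-1} + c_{t-2}
\end{aligned}
```

The first line corresponds to `diff(1)` and the second to applying `diff(1)` twice, so the acceleration at minute $t$ depends on three consecutive readings.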

In [50]:
import pandas as pd

# Create a time series
df = pd.DataFrame({'Position': [1, 2, 4, 7, 11, 16]})

# Calculate the first difference (velocity)
df['Velocity'] = df['Position'].diff()

# Calculate the second difference (acceleration)
df['Acceleration'] = df['Velocity'].diff()

print(df)

   Position  Velocity  Acceleration
0         1       NaN           NaN
1         2       1.0           NaN
2         4       2.0           1.0
3         7       3.0           1.0
4        11       4.0           1.0
5        16       5.0           1.0



We want a function, `calculate_diff()`, that does for differences what `calculate_rolling()` does for rolling statistics.

In [8]:
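def calculate_diff(dataframe, column_names, window_sizes):
    """Add `diff` columns to `dataframe` in place and return it.

    Each entry of `window_sizes` is the number of rows to look back (`periods`).
    """
    for column_name in column_names:
        for window_size in window_sizes:
            new_column = f'{column_name}_diff_{window_size}'
            dataframe[new_column] = dataframe[column_name].diff(periods=window_size)
    return dataframe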

In [9]:

main_features = ['Carbon dioxide(ppm)']
                #  , 'Temperature(°F)', 'Relative humidity(%)', 'Atmospheric pressure(hPa)']
window_sizes = [1]
                # , 2, 3, 4, 5]

# operations = ['mean', 'max', 'sum', 'std', 'count']

# Calculate the first difference (velocity)
aranet4_diff = aranet4.copy()
aranet4_diff = calculate_diff(aranet4_diff, main_features, window_sizes)

# Calculate the second difference (acceleration)
main_features = ['Carbon dioxide(ppm)_diff_1']
aranet4_diff = calculate_diff(aranet4_diff, main_features, window_sizes)
aranet4_diff.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97861 entries, 0 to 97860
Data columns (total 9 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Carbon dioxide(ppm)                97861 non-null  int64  
 1   Temperature(°F)                    97861 non-null  float64
 2   Relative humidity(%)               97808 non-null  float64
 3   Atmospheric pressure(hPa)          95339 non-null  float64
 4   Date                               97861 non-null  object 
 5   Time                               97861 non-null  object 
 6   Datetime                           97861 non-null  object 
 7   Carbon dioxide(ppm)_diff_1         97860 non-null  float64
 8   Carbon dioxide(ppm)_diff_1_diff_1  97859 non-null  float64
dtypes: float64(5), int64(1), object(3)
memory usage: 6.7+ MB


In [10]:
main_features = ['Carbon dioxide(ppm)', 'Carbon dioxide(ppm)_diff_1', 'Carbon dioxide(ppm)_diff_1_diff_1']
                #  , 'Temperature(°F)', 'Relative humidity(%)', 'Atmospheric pressure(hPa)']
window_sizes = [5]

operations = ['mean', 'max', 'sum', 'std']

aranet4_diff_rolling = aranet4_diff.copy()

aranet4_diff_rolling = calculate_rolling(aranet4_diff_rolling, main_features, window_sizes, operations)


In [11]:
aranet4_diff_rolling.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97861 entries, 0 to 97860
Data columns (total 21 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Carbon dioxide(ppm)                               97861 non-null  int64  
 1   Temperature(°F)                                   97861 non-null  float64
 2   Relative humidity(%)                              97808 non-null  float64
 3   Atmospheric pressure(hPa)                         95339 non-null  float64
 4   Date                                              97861 non-null  object 
 5   Time                                              97861 non-null  object 
 6   Datetime                                          97861 non-null  object 
 7   Carbon dioxide(ppm)_diff_1                        97860 non-null  float64
 8   Carbon dioxide(ppm)_diff_1_diff_1                 97859 non-null  float64
 9   Carbon dioxide(pp