# Sensor Box Validation Functions

This is a quick example notebook to help streamline data analysis for sensor box experiments. If the format of the Raspberry Pi logfile changes, the parsing code here will need to change as well. Hopefully this collection of small scripts removes some of the overhead associated with the first steps of processing data. For completed notebooks, see the experiment-related directories.

As these functions become more complicated, it might be worth turning them into a library you can quickly pull from. I chose not to do that here because the functions are so brief and the requirements for the sensor box are still somewhat nebulous, but if the analysis becomes more complex, a library might prove both appropriate and productive.

In [3]:
import bisect
from collections import Counter
import datetime as dt
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats

## Parse CPC log data and add time index
This is a useful step to get the CPC log data into a format you can easily work with. Indexing by timestamp is also super helpful, especially when making comparisons with the Raspberry Pi logs. Note that all of the other functions in this notebook assume that you do this step first.

In [4]:
file_path = 'data/cpc-logged-data.txt' # Change for your data
cpc = pd.read_csv(file_path, sep='\t')
index = []
for date, time in zip(cpc['#YY/MM/DD'], cpc['HR:MN:SC']):
    Y = int(date[:2]) + 2000
    M = int(date[3:5])
    D = int(date[6:8])
    H = int(time[:2])
    T = int(time[3:5])
    S = int(time[6:8])
    index.append(dt.datetime(Y, M, D, H, T, S))
cpc.index = index
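
The loop above can also be written as a single vectorized call. This is a minimal sketch, assuming dates look like `19/07/23` and times look like `14:05:09` (consistent with the slice offsets above, though the separators are a guess). The helper name `parse_cpc_index` and the sample rows are hypothetical and only there for illustration.

```python
import pandas as pd

def parse_cpc_index(frame, date_col='#YY/MM/DD', time_col='HR:MN:SC'):
    """Build a DatetimeIndex from two-digit-year date and time columns."""
    stamps = frame[date_col].str.strip() + ' ' + frame[time_col].str.strip()
    # %y reads '19' as 2019, which matches the `+ 2000` in the loop version
    return pd.DatetimeIndex(pd.to_datetime(stamps, format='%y/%m/%d %H:%M:%S'))

# Hypothetical sample rows standing in for the real CPC log
sample = pd.DataFrame({
    '#YY/MM/DD': ['19/07/23', '19/07/23'],
    'HR:MN:SC': ['14:05:09', '14:05:10'],
    'concent': [1200.0, 1185.0],
})
sample.index = parse_cpc_index(sample)
print(sample.index)
```

Passing an explicit `format` makes parsing fail loudly if the logfile layout ever changes, rather than silently guessing.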


## Parse Raspi log data and add time index
This is a useful step to get the raspi log data into a nice format you can easily work with. Indexing by timestamp is also super helpful, especially when trying to make comparisons with the CPC logs. Note that all of the other functions in this notebook assume that you do this step first.

In [None]:
file_path = 'data/raspi-logged-data.csv' # Change for your data
raspi = pd.read_csv(file_path)
# For Raspi Data
index = []
for timestamp in raspi['#YY/MM/DD:HR:MN:SC']:
    Y = int(timestamp[:4])
    M = int(timestamp[5:7])
    D = int(timestamp[8:10])
    H = int(timestamp[11:13])
    T = int(timestamp[14:16])
    S = int(timestamp[17:19]) # Truncates value instead of rounding
    index.append(dt.datetime(Y, M, D, H, T, S))
raspi.index = index
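
The comment above notes that the seconds value is truncated rather than rounded. The sketch below shows the difference on two made-up timestamps. It assumes the raspi log writes timestamps like `2019/07/23:14:05:09.750`, which fits the slice offsets in the loop but is not confirmed by the logfile itself.

```python
import pandas as pd

# Hypothetical raspi timestamps with fractional seconds (format is an assumption)
raw = pd.Series(['2019/07/23:14:05:09.750', '2019/07/23:14:05:10.250'])
parsed = pd.to_datetime(raw, format='%Y/%m/%d:%H:%M:%S.%f')

truncated = parsed.dt.floor('s')  # drops the fraction, like the slicing above
rounded = parsed.dt.round('s')    # snaps to the nearest whole second instead

print(pd.DataFrame({'truncated': truncated, 'rounded': rounded}))
```

With truncation the two samples land on different seconds, while rounding puts both on `14:05:10`, so the choice affects how duplicates and gaps show up later.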

## Find Time Bounds on Dataset
This can be useful for validating that both instruments were logging over the same time span. Also, if there's a fixed delay between the CPC logs and the raspi logs (assuming, of course, that both started at the same time), it should be apparent from here.

In [None]:
print(f'CPC Start time: {cpc.index[0]}')
print(f'CPC End time: {cpc.index[-1]}')
print(f'Raspi Start time: {raspi.index[0]}')
print(f'Raspi End time: {raspi.index[-1]}')
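
If you would rather see the offsets as numbers than compare the printed timestamps by eye, something like the following might help. It is only a sketch: `compare_time_bounds` is a hypothetical helper, and the two indexes are synthetic stand-ins for `cpc.index` and `raspi.index`.

```python
import pandas as pd

def compare_time_bounds(first_index, second_index):
    """Return start offset, end offset, and duration difference as Timedeltas."""
    first_duration = first_index[-1] - first_index[0]
    second_duration = second_index[-1] - second_index[0]
    return {
        'start_offset': second_index[0] - first_index[0],
        'end_offset': second_index[-1] - first_index[-1],
        'duration_difference': second_duration - first_duration,
    }

# Synthetic indexes: the second logger starts and stops 3 s after the first
cpc_index = pd.date_range('2019-07-23 14:00:00', periods=60, freq='s')
raspi_index = pd.date_range('2019-07-23 14:00:03', periods=60, freq='s')

bounds = compare_time_bounds(cpc_index, raspi_index)
for name, value in bounds.items():
    print(f'{name}: {value}')
```

Equal start and end offsets with a zero duration difference would point to a fixed delay; unequal offsets suggest one logger dropped samples or ran longer.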

## Logging Period
This can be useful for examining the consistency (or lack thereof) of the logging period (i.e. the number of seconds between samples) of the CPC and the Raspberry Pi.

In [None]:
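# total_seconds() keeps gaps longer than a day intact; .seconds would drop the days part
cpc_timestamp_deltas = pd.Series((cpc.index[1:] - cpc.index[:-1]).total_seconds()).astype(int)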
cpc_counts = cpc_timestamp_deltas.value_counts()
print("CPC seconds between samples: ")
print(cpc_counts)
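# total_seconds() keeps gaps longer than a day intact; .seconds would drop the days part
raspi_timestamp_deltas = pd.Series((raspi.index[1:] - raspi.index[:-1]).total_seconds()).astype(int)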
raspi_counts = raspi_timestamp_deltas.value_counts()
print("Raspi seconds between samples: ")
print(raspi_counts)
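
The value counts can be boiled down to a few numbers when you just want a quick health check. This is a rough sketch: `summarize_intervals` is a hypothetical helper, the 1-second nominal period is an assumption, and the timestamps are made up.

```python
import pandas as pd

def summarize_intervals(index, nominal_seconds=1):
    """Summarize the spacing between consecutive timestamps in a DatetimeIndex."""
    deltas = pd.Series(index[1:] - index[:-1]).dt.total_seconds()
    return {
        'median_seconds': deltas.median(),
        'max_seconds': deltas.max(),
        # Share of intervals that match the expected logging period
        'fraction_nominal': (deltas == nominal_seconds).mean(),
    }

# Made-up timestamps with one skipped sample (14:00:02 is missing)
stamps = pd.to_datetime([
    '2019-07-23 14:00:00', '2019-07-23 14:00:01', '2019-07-23 14:00:03',
    '2019-07-23 14:00:04', '2019-07-23 14:00:05',
])
summary = summarize_intervals(stamps)
print(summary)
```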


### Plot Raspi Logging Frequency
As the Raspberry Pi logging period has historically been finicky, it can be useful to plot it to find patterns in its inconsistency.

In [None]:
start_bin = 50000
number_of_points = 100
# use_line_collection is omitted: recent matplotlib releases no longer accept it
plt.stem(raspi_timestamp_deltas[start_bin:start_bin + number_of_points])
plt.xlabel("Samples")
plt.ylabel("Time Between Samples (in Seconds)")
plt.title("Time Between Samples in Raspi Logger Dataset")

In [None]:
freq_width = 5 # Number of points per window
skips = [] # Number of skip_time-second gaps in each window
skip_time = 2 # Gap (in seconds) that counts as a skipped sample
for i in range(len(raspi_timestamp_deltas) // freq_width):
    window = raspi_timestamp_deltas[i*freq_width: i*freq_width + freq_width].values
    skips.append(list(window).count(skip_time))
Counter(skips)
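
To make the windowing concrete, here is a small worked example of the same idea using a NumPy reshape instead of a loop. The function name `count_skips_per_window` and the example deltas are hypothetical. As in the loop above, an incomplete trailing window is dropped.

```python
from collections import Counter

import numpy as np

def count_skips_per_window(deltas, window=5, skip_time=2):
    """Histogram of how many `skip_time` gaps fall in each full window."""
    values = np.asarray(deltas)
    usable = (len(values) // window) * window  # drop the incomplete trailing window
    windows = values[:usable].reshape(-1, window)
    skips_per_window = (windows == skip_time).sum(axis=1)
    return Counter(skips_per_window.tolist())

# Two full windows of 5 deltas plus two leftover values that get ignored
example = [1, 2, 1, 1, 1,  2, 1, 2, 1, 1,  1, 1]
histogram = count_skips_per_window(example)
print(histogram)
```

The first window contains one 2-second gap and the second contains two, so the result has one window with a single skip and one window with two skips.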

## Logging Accuracy
Plot the logged concentrations from both the Raspberry Pi and the CPC. Note that they should ideally match up after accounting for a fixed time delay on the Raspberry Pi.

In [None]:
width = 1000 # Number of seconds to look at.
start = 6000 # Starting data point on raspi.
raspi_start_dt = raspi.index[start]
cpc_start_pos = cpc.index.get_loc(raspi_start_dt) # Integer position; raises KeyError if the CPC has no sample at this exact timestamp
raspi_concs = raspi.concent[raspi_start_dt:raspi_start_dt + dt.timedelta(seconds=width)]
cpc_concs = cpc.concent.iloc[cpc_start_pos: cpc_start_pos + width] # Assumes a 1 Hz cadence; be careful using this around data point 50000

plt.plot(raspi_concs, label="Raspberry Pi Logged Concentration")
plt.plot(cpc_concs, label="CPC Logged Concentration")
plt.xlabel("Timestamp (Day Hour:Minute)")
plt.ylabel("Concentration (in #/$cm^3$)")
plt.title("Logged Concentrations Over Time")
plt.legend(loc="lower right")
plt.show()
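
If the two traces look alike but shifted, one way to put a number on the fixed delay is to try a range of whole-second shifts and keep the one with the highest correlation. This is a sketch rather than a validated method. `estimate_delay_seconds` is a hypothetical helper, it assumes unique timestamps and that the Raspberry Pi only ever lags the CPC, and the data below is synthetic.

```python
import numpy as np
import pandas as pd

def estimate_delay_seconds(reference, delayed, max_lag=30):
    """Return the whole-second lag at which `delayed` best correlates with `reference`."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        # Move the delayed series earlier by `lag` seconds
        shifted = delayed.copy()
        shifted.index = shifted.index - pd.Timedelta(seconds=lag)
        # Series.corr aligns both series on matching timestamps first
        corr = reference.corr(shifted)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Synthetic concentrations: the "delayed" logger reports the same values 4 s late
rng = np.random.default_rng(0)
times = pd.date_range('2019-07-23 14:00:00', periods=300, freq='s')
reference = pd.Series(rng.normal(1000, 50, size=300), index=times)
delayed = pd.Series(reference.values, index=times + pd.Timedelta(seconds=4))

lag = estimate_delay_seconds(reference, delayed)
print(f'Estimated delay: {lag} s')
```

With real logs you would pass something like `cpc.concent` and `raspi.concent` over a window where both instruments were recording.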