In [None]:
%load_ext autoreload
%autoreload 1  
# Automatically reload bioscout package
%aimport bioscout_tech_challenge

In [17]:
import pandas as pd
from bioscout_tech_challenge.utils.file_operations import read_csv_file, read_json_file
import plotly.express as px
def plot_sensor_data(df, sensor_type, device_id=None, fig=None, **kwargs):
    """
    Plot the sensor data for a given sensor type and, optionally, a single device id.

    If an existing figure is passed via `fig`, the new traces are added to it.
    """
    df = df[df['sensor_type'] == sensor_type]
    yaxis_str = sensor_type + " (" + df.sensor_units.unique()[0] + ")"
    if device_id is not None:
        df = df[df['device_id'] == device_id]
    # One trace per device/sensor combination
    new_fig = px.scatter(df,
                         x="timestamp",
                         y="sensor_value",
                         color=df['device_id'].astype(str) + ' - ' + df['sensor_device'],
                         **kwargs)
    if fig is None:
        fig = new_fig
    else:
        # Figure.add_scatter takes neither a DataFrame nor a `color` argument,
        # so build the traces with plotly express and append them instead
        fig.add_traces(new_fig.data)
    fig.update_layout(
        legend_title_text='Device',
        xaxis_title="Time",
        yaxis_title=yaxis_str.capitalize(),
    )
    return fig


# 2. Visualise Weather Data & Provide Insights
Please ensure that either exploration.ipynb has been run or the CLI commands given in README.md have been executed before running this notebook.

In [None]:
# Load the merged and flattened weather data from a hardcoded path, unless `df` is already defined
try:
    df
except NameError:
    df = read_csv_file(r"../data/tables/weather_data/output/exploration_data_output.csv")
print(df.head())
print(df.info())
print(df.shape)


## Mapping Device Locations
Let's have a look at the location of the devices to give us a sense of the data.

In [None]:
# plot latitude and longitude on interactive map      

# Get unique device locations
unique_locations = df[['device_id', 'latitude', 'longitude']].drop_duplicates()

# Create an interactive map using plotly express
fig = px.scatter_mapbox(unique_locations, 
                        lat='latitude',
                        lon='longitude',
                        hover_data=['device_id'],
                        color='latitude',
                        zoom=3,
                        title='Device Locations')

# Update the map style and layout
fig.update_layout(
    mapbox_style='open-street-map',
    margin={"r":0,"t":30,"l":0,"b":0}
)

fig.show()



## Visualising Sensor Data
The above gives us a general indication of the device locations. Let's look broadly at the different sensors and their values over time.

In [None]:
# 2.1 Visualise Weather Data
# List the available sensor types and pick one by index to plot over time.
# Each device-sensor combination is drawn as its own trace.
sensors = df['sensor_type'].unique()
print("Sensors: ", sensors)
idx = 3  # index into `sensors`; the discussion below assumes this selects temperature
sensor_df = df[df['sensor_type'] == sensors[idx]]


fig = plot_sensor_data(sensor_df, sensors[idx])
fig.show()



The data appears messy, but we can see that some devices are malfunctioning. This is especially true for device 255.
Its data is sporadic rather than continuous and contains outliers in most of the sensors, which suggests that it was the first prototype. Its reported location is off the coast of Sydney, which suggests that it was tested at the Sydney office and had issues with the GPS. Its readings are also timestamped before the deployment of the rest of the devices. Moving forward, we will therefore remove this device from the analysis.

Temperatures around $100^\circ\text{C}$ are not plausible (unless global warming gets out of control). Let's remove this device and change our y-axis to something more sensible. Since the devices are in NSW and the data below covers October to December, we can assume that the temperature will not exceed $50^\circ\text{C}$.


### Temperature
Looking more closely, we plot the temperature data for the remaining devices.

In [None]:
device_id = 255
temperature_df = df[df['device_id']!= device_id]

fig = plot_sensor_data(temperature_df, 'temperature')
fig.update_layout(yaxis_range=[0, 50])
fig.show()



Whilst this is still quite busy, we can clearly see the daily pattern of temperature fluctuations. This is a good start, as it shows the data is broadly reasonable. Toggling traces off in the legend to focus on one device site at a time, we can see an offset in temperature between the BME680 and the SHT30 sensors. Calibrating the output data would require further engineering analysis combined with a source of truth. It is interesting to note that the offset is more pronounced on the high side, during the daily temperature peak. This could be due to the position of the sensors on the device.
![Device 265 Temperature Plot](./plots/265-temp.png)
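
To put a number on the offset described above, the two sensors can be lined up in common time bins and subtracted. The following is a minimal sketch rather than part of the bioscout package: the readings are made up, `sensor_offset` is a hypothetical helper, and the hourly bin width is an assumption about how closely the two sensors sample in time.

```python
import pandas as pd

# Hypothetical readings from two temperature sensors on the same device
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-11-01 00:00", "2024-11-01 00:00",
        "2024-11-01 12:00", "2024-11-01 12:00",
    ]),
    "sensor_device": ["BME680", "SHT30", "BME680", "SHT30"],
    "sensor_value": [16.0, 15.0, 31.0, 28.5],
})


def sensor_offset(df, a="BME680", b="SHT30", freq="1h"):
    """Mean of sensor `a` minus mean of sensor `b` in each time bin."""
    wide = (
        df.pivot_table(index="timestamp", columns="sensor_device",
                       values="sensor_value", aggfunc="mean")
          .resample(freq).mean()
    )
    # Bins where either sensor has no reading are dropped
    return (wide[a] - wide[b]).dropna()


offset = sensor_offset(readings)
print(offset)
```

Grouping the result by hour of day would show whether the offset really does grow around the daily temperature peak.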




Another interesting pattern is that devices with device_id >= 280 each had one or more calibration events between the 5th and 9th of November. These devices all started to capture continuous data at varying later dates and were deployed to NSW, Victoria and South Australia. We assume that engineers were testing and calibrating the sensors before deployment, so these early data points will be filtered out of the deeper analysis.

![Calibration Event](./plots/calibration.png)



To do this we develop a tool in the bioscout package to remove known periods of testing data.   

In [22]:
from bioscout_tech_challenge.utils.weather import apply_single_filter

# Remove the prototype device identified above
filter_dict = {'device_id': 255}
rows_to_remove = apply_single_filter(df, filter_dict)
df = df.drop(rows_to_remove)

# Remove pre-deployment calibration readings from the second group of devices
filter_calibration = {'device_id': {'min': 280},
                      'timestamp': {'max': pd.to_datetime('2024-11-09')}}

rows_to_remove = apply_single_filter(df, filter_calibration)
filtered_df = df.drop(rows_to_remove)
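
The filter dictionaries above mix exact matches (a scalar value) with ranges (a dictionary with `min` and/or `max`). The package's own implementation is not shown in this notebook. As a rough sketch of that kind of matching logic, the hypothetical `match_rows` below returns the index labels of matching rows so that they can be passed to `DataFrame.drop`. Treating the range bounds as inclusive is an assumption.

```python
import pandas as pd


def match_rows(df, conditions):
    """Return the index labels of rows that satisfy every condition.

    Hypothetical stand-in for a dictionary-based filter: a scalar means an
    exact match, a dict with 'min' and/or 'max' means an inclusive range.
    """
    mask = pd.Series(True, index=df.index)
    for column, rule in conditions.items():
        if isinstance(rule, dict):
            if "min" in rule:
                mask &= df[column] >= rule["min"]
            if "max" in rule:
                mask &= df[column] <= rule["max"]
        else:
            mask &= df[column] == rule
    return df.index[mask.to_numpy()]


# Made-up rows: one prototype reading, one early calibration reading,
# and two ordinary readings
demo = pd.DataFrame({
    "device_id": [255, 281, 281, 262],
    "timestamp": pd.to_datetime(
        ["2024-10-01", "2024-11-06", "2024-11-20", "2024-11-06"]
    ),
})

early = match_rows(demo, {"device_id": {"min": 280},
                          "timestamp": {"max": pd.to_datetime("2024-11-09")}})
print(demo.drop(early))
```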


In [None]:
fig = plot_sensor_data(filtered_df, 'temperature')

fig.show()

This removes the calibration events and the prototype data, so we can move forward with the analysis. One final thing to note about the temperature data concerns the first group of devices rolled out (device_id <= 279): some of the initial SHT30 readings, typically the first data point, show a value of $130^\circ\text{C}$, which is not possible. This looks like an initialisation issue that was ironed out in the second group of devices. The worst case was device_id 278, which read incorrectly for roughly the first day of data gathering.

![SHT30 Temperature](./plots/278-temp.png)


We can choose to filter this data out or keep it in for further analysis. However, I am working on the assumption that this bug was ironed out in the second group of devices.

In [None]:
# Remove the spurious 130 degC start-up readings from the SHT30 temperature sensor
filter_sht30_startup = {'sensor_device': "SHT30",
                        'sensor_type': "temperature",
                        'sensor_value': 130}


rows_to_remove = apply_single_filter(filtered_df, filter_sht30_startup)
filtered_df = filtered_df.drop(rows_to_remove)
fig = plot_sensor_data(filtered_df, 'temperature')

fig.show()

### Humidity
Now that we have removed the calibration events and the prototype data based on the analysis of the temperature data, let's have a look at the humidity data. The first thing we notice is that the two sensors, "BME680" and "SHT30", show very different trends. The BME680 sensor shows a much more stable humidity trend over time.

![Humidity](./plots/289_290-humid.png)

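The SHT30 sensor, however, shows a much more fluctuating humidity trend over time. Relative humidity depends on the temperature of the air, so a daily swing like this is expected from a relative humidity sensor exposed to ambient air. One reading of the flatter BME680 trace is that it behaves like an absolute humidity measurement. However, both parts are specified by their manufacturers as relative humidity sensors, so the difference may instead come from the position or enclosure of the BME680 on the device; this should be confirmed with the engineering team. Let's compare the temperature and humidity data for the same device.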

In [None]:
# Create the initial figure with humidity data
device_ids = [289]
device_df = filtered_df[filtered_df['device_id'].isin(device_ids)]
fig1 = plot_sensor_data(device_df, 'humidity')

# Create temperature scatter plot
fig2 = plot_sensor_data(device_df, 'temperature')
fig2.update_traces(
    marker=dict(
        size=5,          # Size of markers
        symbol='diamond', # Marker symbol: 'circle', 'square', 'diamond', 'cross', 'x', etc.
        opacity=0.2,     # Marker opacity
        line=dict(
            width=0.3,
            color='grey'
        )
    )
)

# Add the temperature traces to the humidity figure
for trace in fig2.data:
    trace.yaxis = 'y2'
    fig1.add_trace(trace)

# Update layout for secondary y-axis
fig1.update_layout(
    yaxis2=dict(
        title='Temperature',
        overlaying='y',
        side='right'
    )
)

fig1.show()

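As we can see, the humidity data from the SHT30 follows the inverse of the semi-sinusoidal daily temperature pattern. This is to be expected for relative humidity: warmer air can hold more water vapour, so the same moisture content gives a lower reading.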
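
The inverse relationship can be illustrated with a small worked example. The sketch below uses the Magnus approximation for saturation vapour pressure and assumes, purely for illustration, a parcel of air that is saturated at 12 °C and then warms with no moisture added. The function names and temperatures are our own choices, not values from the data set.

```python
import math


def saturation_vapour_pressure_hpa(temp_c):
    """Magnus approximation of saturation vapour pressure over water (hPa)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity(vapour_pressure_hpa, temp_c):
    """Relative humidity (%) for a given actual vapour pressure and temperature."""
    return 100.0 * vapour_pressure_hpa / saturation_vapour_pressure_hpa(temp_c)


# Assumed scenario: air saturated at 12 degC, then warmed with constant moisture
vapour_pressure = saturation_vapour_pressure_hpa(12.0)
for temp_c in (12, 18, 24, 30):
    print(temp_c, round(relative_humidity(vapour_pressure, temp_c), 1))
```

The printed relative humidity falls as the temperature rises even though the amount of water vapour is unchanged, which is the same shape as the SHT30 trace.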

### Rainfall
It would be interesting to compare the rainfall data to the more stable humidity data from the BME680 sensor. Let's have a look at one of the longer-running devices to see if any interesting patterns emerge.

In [None]:
device_id = 262
sensor_device = 'BME680'

humidity_filter = apply_single_filter(filtered_df, {'sensor_device': sensor_device,'device_id':device_id})
fig1 = plot_sensor_data(filtered_df[filtered_df['device_id']==device_id], 'rainfall')

fig2 = plot_sensor_data(filtered_df.loc[humidity_filter], 'humidity')

fig2.update_traces(
    marker=dict(
        size=5,          # Size of markers
        symbol='diamond', # Marker symbol: 'circle', 'square', 'diamond', 'cross', 'x', etc.
        opacity=0.2,     # Marker opacity
        line=dict(
            width=0.3,
            color='grey'
        )
    )
)
# Add the humidity traces to the rainfall figure on a secondary y-axis
for trace in fig2.data:
    trace.yaxis = 'y2'
    fig1.add_trace(trace)

# Update layout for secondary y-axis
fig1.update_layout(
    yaxis2=dict(
        title='humidity',
        overlaying='y',
        side='right'
    )
)

fig1.show()



There is no obvious simple relationship between the rainfall and the humidity data. The rainfall sensors simply measure the rain in the last 5 minutes, so post-processing of this data is required to develop deeper insights. There was a large spike in rainfall on the 20th-21st of November on the Optical Rainfall Gauge of device_id 290. According to BOM's historical data, no rain was recorded in Melbourne on this date. This suggests that there was an issue with the device, where it recorded its maximum value of 200 mm. This seems to have been corrected in the following days, as the device correctly detected significant rainfall on the 24th of November, which was also recorded by BOM.

![Rainfall in Melbourne](./plots/290-rainfall.png)
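
As an illustration of the kind of post-processing meant here, the sketch below sums 5-minute readings into hourly and daily totals, optionally dropping readings at or above a suspected saturation value first. It assumes each reading is the rain accumulated over the preceding interval. The data is made up, `rainfall_totals` is a hypothetical helper, and the 200 mm cut-off simply reuses the maximum value noted above.

```python
import pandas as pd

# Hypothetical 5-minute rainfall readings (mm) for one device
rain = pd.DataFrame({
    "timestamp": pd.date_range("2024-11-24 09:00", periods=6, freq="5min"),
    "sensor_value": [0.0, 0.2, 1.0, 0.8, 0.0, 0.0],
})


def rainfall_totals(df, freq="1h", max_valid=None):
    """Sum interval readings into totals per time bin.

    Readings at or above `max_valid` are dropped before summing, as a crude
    way of excluding a saturated or malfunctioning gauge.
    """
    series = df.set_index("timestamp")["sensor_value"]
    if max_valid is not None:
        series = series[series < max_valid]
    return series.resample(freq).sum()


hourly = rainfall_totals(rain, freq="1h")
daily = rainfall_totals(rain, freq="1D", max_valid=200)
print(hourly)
print(daily)
```

Daily totals in this form are the natural quantity to compare against BOM's published daily rainfall.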

### Pressure
Let's have a look at the pressure data for the devices.

In [None]:
fig = plot_sensor_data(filtered_df, 'pressure')
fig.show()




Since the pressure data is consistent and within viable ranges, we can assume that it is correct. To extract more insightful metrics, we would need to process the data with knowledge of each device's location and elevation.
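
One example of such processing is reducing each station pressure to a sea-level equivalent so that devices at different elevations can be compared. The sketch below uses the international barometric formula in the form commonly quoted in pressure-sensor datasheets. The function name and the example values are hypothetical, and elevation is not a column in the flattened data set, so it would have to come from another source.

```python
def sea_level_pressure(station_pressure, elevation_m):
    """Reduce a station pressure to its sea-level equivalent.

    Uses the international barometric formula; the result has the same
    units as `station_pressure`.
    """
    return station_pressure / (1.0 - elevation_m / 44330.0) ** 5.255


# Made-up example: a reading of 955 hPa from a device assumed to sit at 500 m
print(round(sea_level_pressure(955.0, 500.0), 1))
```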

### VOC
Looking at the VOC data, the first check is whether the VOC sensor is working at all. A reading of 0.0 for the entire duration of the data would point to the sensor being incorrectly calibrated.

In [None]:
fig = plot_sensor_data(filtered_df, 'voc')
fig.show()



It is interesting to note that the VOC sensor fluctuates with what appears to be the time of day. Understanding this pattern would require further analysis of how the BME680 VOC sensor measures total volatile organic compounds. Below is a plot of the VOC data with temperature from the same device and sensor. It shows that the VOC reading is inversely affected by temperature, so readings will likely need calibration via sensor fusion with the temperature data.


In [None]:
device_id = 262
sensor_device = 'BME680'
voc_filter = apply_single_filter(filtered_df, {'sensor_device': sensor_device,'device_id':device_id})
fig1 = plot_sensor_data(filtered_df.loc[voc_filter], 'voc')
fig1.update_traces(name='VOC')
fig2 = plot_sensor_data(filtered_df.loc[voc_filter], 'temperature')
fig2.update_traces(
    marker=dict(
        size=5,          # Size of markers
        symbol='diamond', # Marker symbol: 'circle', 'square', 'diamond', 'cross', 'x', etc.
        opacity=0.5,     # Marker opacity

        color='red'
    ),
    name = 'temperature'
)
for trace in fig2.data:
    trace.yaxis = 'y2'
    fig1.add_trace(trace)
fig1.update_layout(
    yaxis2=dict(
        title='temperature',
        overlaying='y',
        side='right'
    )
)
fig1.update_layout(title=f'VOC vs Temperature for device_id {device_id} {sensor_device}')
fig1.show()
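
To back up the visual impression with a number, the two series can be aligned on their timestamps and their correlation computed. The sketch below uses made-up readings in the notebook's long format. In the real data the timestamps of the two sensor types may not coincide exactly, so some resampling would probably be needed first.

```python
import pandas as pd

# Hypothetical readings for one BME680: VOC falls as temperature rises
sample = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-11-01 00:00", "2024-11-01 06:00",
                                 "2024-11-01 12:00", "2024-11-01 18:00"] * 2),
    "sensor_type": ["temperature"] * 4 + ["voc"] * 4,
    "sensor_value": [12.0, 15.0, 30.0, 20.0, 90.0, 80.0, 40.0, 70.0],
})

# One column per sensor type, one row per timestamp
wide = sample.pivot_table(index="timestamp", columns="sensor_type",
                          values="sensor_value", aggfunc="mean")

# Pearson correlation between the two aligned series
correlation = wide["temperature"].corr(wide["voc"])
print(round(correlation, 3))
```

A strongly negative value would support the inverse relationship, although it says nothing about the underlying cause.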


## Conclusions and Recommendations

From the analysis of the sensors in the flattened data set, we were able to identify the following:

- Device 255 appears to have been a prototype that was tested at the Sydney office and had GPS issues. Its data is not reliable and should be removed from the data set.
- Devices with device_id >= 280 had one or more calibration events between the 5th and 9th of November. These devices all started to capture continuous data at varying later dates and were deployed to NSW, Victoria and South Australia. We assume that engineers were testing and calibrating the sensors before deployment, so these early data points are filtered out of the deeper analysis.
- The SHT30 humidity readings follow the inverse of the daily temperature cycle, as expected for relative humidity. The BME680 humidity readings are much more stable. Both parts are specified as relative humidity sensors, so the cause of the difference (for example sensor position or enclosure) should be confirmed with the engineering team.
- The VOC sensor is affected by the temperature of the air. Readings will therefore likely need calibration via sensor fusion with the temperature data.

Recommendations for further analysis:
- Confirm the above findings with the engineering team, in particular the calibration events and the prototype data. Further develop the filtering tools for use by the engineering team and the wider company, so that known issues can be filtered out of the data set.
- Investigate the VOC sensor data and develop a calibration tool for the VOC sensor data.
- Determine the best way to process the windspeed and direction data to extract meaningful metrics. Plotly does not support plotting arrow vectors on scatter maps so we will need to develop a custom solution for this.
- Investigate the rainfall detection sensor data and develop a postprocessing tool to extract meaningful metrics.
- Run a sensor analysis tool with some key metrics identified by the engineering team including operational limits. This could include mean, median, standard deviation, min, max, quantiles, etc over different time periods.
- Calculate hourly and daily averages from the data and create a comparison with known weather data from BOM.
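
As a starting point for the summary metrics and daily averages recommended above, a grouped aggregation in pandas may be enough. The sketch below uses made-up readings with the notebook's column names. The choice of statistics and the daily bin width are assumptions to be agreed with the engineering team.

```python
import pandas as pd

# Hypothetical long-format readings using the notebook's column names
data = pd.DataFrame({
    "device_id": [262, 262, 262, 262],
    "sensor_device": ["BME680"] * 4,
    "sensor_type": ["temperature"] * 4,
    "timestamp": pd.to_datetime([
        "2024-11-01 00:10", "2024-11-01 00:40",
        "2024-11-01 01:10", "2024-11-02 00:10",
    ]),
    "sensor_value": [14.0, 16.0, 18.0, 20.0],
})

keys = ["device_id", "sensor_device", "sensor_type"]

# Overall summary statistics per device, sensor and measurement type
summary = data.groupby(keys)["sensor_value"].agg(
    ["mean", "median", "std", "min", "max"]
)

# Daily means per device, sensor and measurement type
daily_mean = (
    data.groupby(keys + [pd.Grouper(key="timestamp", freq="1D")])["sensor_value"]
        .mean()
)

print(summary)
print(daily_mean)
```

Changing `freq` to `"1h"` gives hourly means in the same shape, ready to join against BOM observations.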



## CLI Development
The new filter function has been added to the CLI. A filter JSON file that applies the removal and tagging of data from the above analysis is shown below and is saved in `src/bioscout_tech_challenge/filter_prototype.json`.


```json
{
  "remove_filters": [
    {
      "name": "remove_prototype_devices",
      "description": "Remove sensor readings from prototype devices",
      "filters": [
        {
          "device_id": 255
        }
      ]
    },
    {
      "name": "remove_calibration_data",
      "description": "Remove calibration data from second group of devices calibrated between 2024-11-05 and 2024-11-08",
      "filters": [
        {
          "device_id": {"min": 280, "max": 291},
          "timestamp": {"min": "2024-11-05", "max": "2024-11-08"}
        }
      ]
    }
  ],
  "tag_filters": [
    {
      "name": "optical_device_malfunction",
      "description": "Tag optical devices that have malfunctioned",
      "tag": "OpticalRainGauge_malfunction",
      "filters": [
        {
          "device_id": 290,
          "timestamp": {"min": "2024-11-20", "max": "2024-11-23"},
          "sensor_type": "rainfall",
          "sensor_device": "OpticalRainGauge"
        }
      ]
    },
    {
      "name": "device_group_a",
      "description": "Tag specific device group",
      "tag": "group_a",
      "filters": [
        {
          "device_id": {"min": 280, "max": 290}
        }
      ]
    }
  ]
}
```

The filter command can be run as follows:

In [None]:
! bioscout-tech-challenge weather filter --help
