## Load data

Data is generated by a Raspberry Pi with a temperature and humidity sensor at home. Readings are taken every 10 minutes using a bash script, then stored in DynamoDB. However, there have been some periods of missing data, e.g. when I got a new WiFi network and forgot to update the password on the Pi. This notebook loads and formats the data, then fills missing periods with readings from the previous day.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
from datetime import timedelta

In [2]:
import os
os.chdir('../')

In [3]:
df = pd.read_csv('../saved_files/ddb_output.csv'
         ).rename(columns={'humidity.S': 'humidity',
                   'temperature.S':'temperature',
                   'timestamp.S':'timestamp'},
         ).drop(columns=['Unnamed: 0']
         ).sort_values(by='timestamp')
# Convert the timestamp column to datetime format
df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed')

# Round the timestamp to the nearest minute
df['timestamp'] = df['timestamp'].dt.round('1min')

print(df.shape)

(17238, 3)


Missing periods are visible below. 

In [4]:
fig = px.scatter(df, x="timestamp", y=["humidity","temperature"], title='All-time humidity and temperature in the grove!')
fig.show()
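
The gaps seen in the plot can also be listed numerically. The sketch below is one way to do that, assuming a `timestamp` column already parsed as datetimes. The helper name `find_gaps` is hypothetical, and the 70-minute default simply mirrors the threshold used in the gap-filling cell of this notebook.

```python
import pandas as pd

def find_gaps(frame, threshold=pd.Timedelta(minutes=70)):
    """List the intervals between consecutive readings that exceed `threshold`."""
    times = frame['timestamp'].sort_values().reset_index(drop=True)
    gaps = pd.DataFrame({
        'gap_start': times.shift(1),   # last reading before the gap
        'gap_end': times,              # first reading after the gap
        'duration': times.diff(),      # time between the two
    })
    # The first row has no predecessor (NaT duration) and is dropped by the comparison
    return gaps[gaps['duration'] > threshold].reset_index(drop=True)
```

Calling `find_gaps(df)` before the fill step would return one row per outage, with its start, end and length.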

In [5]:
# Keep an independent copy so later in-place edits to df cannot alter it
df_original = df.copy()
print(df_original.shape)

(17238, 3)


Identify gaps of more than 70 minutes between consecutive readings, and insert NaN rows at 10-minute steps into them.

In [6]:
time_interval = timedelta(minutes=10)

i = 0 

while i < df.shape[0]-1: 

    current_time = pd.Timestamp(df.iloc[i]['timestamp'])
    next_time = pd.Timestamp(df.iloc[i + 1]['timestamp'])
    
    # Check if the time interval between current and next timestamp is much longer than we expect
    if (next_time - current_time) > time_interval + timedelta(minutes=60):
        
        # Insert a new row with NA values and a timestamp 10 minutes after the current timestamp
        new_row = pd.DataFrame({'timestamp': [current_time + time_interval],
                                'humidity': [np.nan],
                                'temperature':[np.nan]})
        
        # Concatenate the new row to the dataframe
        df = pd.concat([df.iloc[:i+1], new_row, df.iloc[i+1:]], ignore_index=True)

    i += 1

# Updated dataframe with inserted rows for missing timestamps
print(df.shape)

(18724, 3)
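
The loop above rebuilds the dataframe with `pd.concat` for every inserted row, which gets slow as the data grows. As a possible alternative, the sketch below collects all filler timestamps first and concatenates once. It is meant to insert the same rows as the loop (10-minute steps until the remaining gap is 70 minutes or less), but the function name `insert_gap_rows` and its parameters are hypothetical, and it has not been run against the real data.

```python
import numpy as np
import pandas as pd

def insert_gap_rows(frame, step=pd.Timedelta(minutes=10),
                    tolerance=pd.Timedelta(minutes=60)):
    """Return `frame` with NaN rows added every `step` inside long gaps."""
    frame = frame.sort_values('timestamp').reset_index(drop=True)
    times = frame['timestamp']

    new_times = []
    for start, end in zip(times.iloc[:-1], times.iloc[1:]):
        if end - start > step + tolerance:
            # Candidate timestamps every `step` after the last real reading
            candidates = pd.date_range(start + step, end, freq=step)
            # Keep only those more than `tolerance` before the next real reading
            new_times.extend(candidates[candidates < end - tolerance])

    if not new_times:
        return frame

    filler = pd.DataFrame({'timestamp': new_times,
                           'humidity': np.nan,
                           'temperature': np.nan})
    combined = pd.concat([frame, filler], ignore_index=True)
    return combined.sort_values('timestamp', kind='stable').reset_index(drop=True)
```

Usage would be `df = insert_gap_rows(df)` in place of the `while` loop.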


What proportion of the data is missing?

In [7]:
rows_missing = (df.shape[0]-df_original.shape[0])/df.shape[0]

print(f'We identified {np.round(rows_missing*100,2)}% of rows as missing, '
      f'out of an enriched dataframe with {df.shape[0]} rows.')

We identified 7.94% of rows as missing, out of an enriched dataframe with 18724 rows.
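
As a cross-check, the same proportion can be read from the NaN markers themselves. This assumes it is run before the fill step and that the raw readings contain no NaN values of their own, which the notebook does not confirm either way. `missing_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def missing_share(frame, column='temperature'):
    """Fraction of rows whose `column` is NaN."""
    return frame[column].isna().mean()
```

If both assumptions hold, `missing_share(df)` should agree with `rows_missing`.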


Let's fill using the previous day's readings (thanks to my partner for this idea!)


In [8]:
values_changed = 0

# Readings arrive every 10 minutes, so one day is 6 * 24 = 144 rows
one_day = 6 * 24

# Loop through the dataframe
for i in range(one_day, len(df)):
    current_value = df.loc[i, 'temperature']
    
    # Check if the value is NaN
    if pd.isna(current_value):

        # Copy both readings from the row one day (144 rows) earlier
        df.loc[i, 'temperature'] = df.loc[i-one_day, 'temperature']
        df.loc[i, 'humidity'] = df.loc[i-one_day, 'humidity']

        # Count how many rows were filled
        values_changed += 1
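
The same idea can be sketched without a row-by-row loop using `shift` and `fillna`. At one reading every 10 minutes a day spans 6 × 24 = 144 rows, so that is the default here. The function name `fill_from_previous_day` and its parameters are hypothetical. One difference from the loop: this version fills each column wherever that column is NaN, rather than keying both columns off `temperature`.

```python
import numpy as np
import pandas as pd

def fill_from_previous_day(frame, rows_per_day=6 * 24,
                           columns=('temperature', 'humidity')):
    """Fill NaNs in `columns` with the value `rows_per_day` rows earlier."""
    filled = frame.copy()
    for column in columns:
        remaining = None
        # Repeat until nothing changes, so gaps longer than a day
        # are filled from rows that were themselves filled earlier
        while remaining != filled[column].isna().sum():
            remaining = filled[column].isna().sum()
            filled[column] = filled[column].fillna(filled[column].shift(rows_per_day))
    return filled
```

Because it works on row offsets, it shares the loop's assumption that rows are roughly evenly spaced; any NaN in the first `rows_per_day` rows is left as it is.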

In [9]:
df.shape

(18724, 3)

In [10]:
fig = px.scatter(df, x="timestamp", y=["humidity","temperature"], title='All-time humidity and temperature in the grove!')
fig.show()

This does lead to some strange discontinuous patterns, but overall it seems close enough to reality.

In [11]:
df.shape

(18724, 3)

In [12]:
df[:600]

# Took until midday on 30th April to stabilise on one reading every ten mins

Unnamed: 0,humidity,temperature,timestamp
0,56.00,17.83,2023-04-28 08:25:00
1,56.00,17.85,2023-04-28 08:26:00
2,56.00,17.85,2023-04-28 08:28:00
3,56.00,17.88,2023-04-28 08:32:00
4,56.00,17.92,2023-04-28 09:30:00
...,...,...,...
595,49.52,20.12,2023-05-04 16:20:00
596,49.67,20.09,2023-05-04 16:30:00
597,49.75,20.13,2023-05-04 16:40:00
598,50.26,20.17,2023-05-04 17:00:00


Keep only data from 1 May 2023 onwards, after readings had settled at one every ten minutes, so the data is roughly evenly spaced.

In [13]:
df = df[df['timestamp'] > pd.Timestamp(year=2023, month=5, day=1)]

In [14]:
df.to_csv('../saved_files/cleaned_ddb_output.csv')
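
The load cell at the top drops an `Unnamed: 0` column, which is the name pandas gives an unlabelled index column when reading a CSV that was written with its index. Saving with the default settings, as above, produces the same column when the cleaned file is read back. A minimal sketch of avoiding it with `index=False`, using an in-memory buffer and made-up sample values rather than the real file:

```python
import io
import pandas as pd

# Made-up sample rows standing in for the cleaned dataframe
sample = pd.DataFrame({
    'humidity': [49.5, 49.7],
    'temperature': [20.1, 20.2],
    'timestamp': pd.to_datetime(['2023-05-04 16:20', '2023-05-04 16:30']),
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)   # do not write the row index
buffer.seek(0)

# parse_dates restores the datetime dtype that plain CSV text loses
restored = pd.read_csv(buffer, parse_dates=['timestamp'])
```

Whether to switch depends on what downstream notebooks expect; if they already drop `Unnamed: 0`, changing the save call would break that step.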