# Brains4Buildings data extraction and backup

This JupyterLab notebook can be used to download raw data from a Twomes database (see also [more information on how to set up a Twomes server](https://github.com/energietransitie/twomes-backoffice-configuration#jupyterlab)).

In particular, it has been set up to get data from the [Brains4Buildings data collection](https://www.energietransitiewindesheim.nl/brains4buildings2022/privacy/index.html).

Don't forget to install the requirements listed in [requirements.txt](../requirements.txt) first!



## Setting the stage

First, several imports and variables need to be defined.


### Imports and generic settings

In [1]:
from datetime import datetime, timedelta
import pytz
import math
import pylab as plt

import pandas as pd
import numpy as np

# usually, two decimals suffice for displaying DataFrames (NB internally, precision may be higher)
pd.options.display.precision = 2

import sys
sys.path.append('../data/')
sys.path.append('../view/')
sys.path.append('../analysis/')

%load_ext autoreload
import gc

%matplotlib widget
from plotter import Plot

from measurements import Measurements
from preprocessor import Preprocessor

from tqdm.notebook import tqdm


import logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s %(levelname)-8s %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    filename='log_b4b.txt',
                   )

### Defining which account, which location and which period

- which account was used to provision the devices;
- the location and timezone of the building;
- the period, from `first_day` to `last_day`.

In [2]:
#location: T-building, Windesheim, in Zwolle
lat, lon = 52.4350486, 5.4040816

#timezone: 
timezone_database = 'UTC'
timezone_buildings = 'Europe/Amsterdam'

# Below, the maximum period for data collection
first_day = pytz.timezone(timezone_buildings).localize(datetime(2022, 10, 1))
last_day = pytz.timezone(timezone_buildings).localize(datetime(2022, 11, 2))

# all devices were provisioned by a single account
account = [820921]

b4b_db_properties = [
    'roomTemp',
    'CO2concentration',
    'relativeHumidity',
    'countPresence'
]

device_mapping = {
    'TWOMES-979368': 999169,
    'TWOMES-9799B8': 900846,
    'TWOMES-ACDEF0': 948634,
    'TWOMES-ACEB08': 917810,
    'TWOMES-ACEB4C': 925038
}
rooms = [999169, 900846, 948634, 917810, 925038, 924038]

property_rename = {
    'CO2concentration': 'co2_ppm',
    'countPresence': 'occupancy_p',
    'relativeHumidity': 'rel_humidity_0',
    'roomTemp': 'temp_in_degC'
}

property_types = {
    'temp_in_degC' : 'float32',
    'co2_ppm' : 'float32',
    'rel_humidity_0' : 'float32',
    'valve_frac_0' : 'float32',
    'door_open_bool': 'Int8',
    'window_open_bool': 'Int8',
    'occupancy_bool': 'Int8',
    'occupancy_p' : 'Int8'
}
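
The two timezone settings above suggest that timestamps are stored in UTC and presented in the local time of the building. The sketch below illustrates that relation with plain pandas calls; it is not the code inside `Measurements`, and the two sample timestamps are picked purely for illustration.

```python
import pandas as pd

# Same settings as in the cell above
timezone_database = "UTC"
timezone_buildings = "Europe/Amsterdam"

# Illustrative naive timestamps, assumed to be stored as UTC
utc_index = pd.DatetimeIndex(
    ["2022-10-12 14:13:00", "2022-11-02 13:00:00"]
).tz_localize(timezone_database)

# Convert to the local time of the building
local_index = utc_index.tz_convert(timezone_buildings)

# UTC offset in hours: summer time before 30 October 2022, winter time after
offsets_h = [ts.utcoffset().total_seconds() / 3600 for ts in local_index]
print(local_index)
print(offsets_h)
```

Because the collection period spans the switch from summer time to winter time, the offset of the local timestamps changes from +02:00 to +01:00 within the dataset.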


## Getting accounts

In [3]:
%%time 
%autoreload 2
df = Measurements.get_accounts_devices(first_day, last_day,
                                       timezone_database, timezone_buildings)

CPU times: user 73 ms, sys: 12.4 ms, total: 85.4 ms
Wall time: 1.09 s


In [4]:
df

,account_id,device_id,device_name,latest_timestamp_UTC,property,value,unit
0,820921,119,TWOMES-979368,2022-11-02 13:10:00,batteryVoltage,4.11,V
1,820921,118,TWOMES-9799B8,2022-11-02 13:00:00,batteryVoltage,4.08,V
2,820921,121,TWOMES-ACDEF0,2022-11-02 13:20:00,batteryVoltage,4.26,V
3,820921,114,TWOMES-ACDF70,2036-01-06 06:30:00,batteryVoltage,4.26,V
4,820921,114,TWOMES-ACDF70,2036-01-06 06:30:00,batteryVoltage,4.26,V
5,820921,120,TWOMES-ACEB08,2022-11-02 13:10:00,batteryVoltage,4.15,V
6,820921,117,TWOMES-ACEB4C,2022-11-02 13:00:00,batteryVoltage,4.22,V
7,820921,119,TWOMES-979368,2022-11-02 13:10:00,CO2concentration,412.0,ppm
8,820921,118,TWOMES-9799B8,2022-11-02 13:00:00,CO2concentration,403.0,ppm
9,820921,121,TWOMES-ACDEF0,2022-11-02 13:20:00,CO2concentration,352.0,ppm


## Getting measurements from sources

### Getting measurements from the database

In [5]:
%%time 
%autoreload 2
df_db_meas = (Measurements.get_raw_measurements(
    account,
    first_day, last_day,
    b4b_db_properties,
    timezone_database, timezone_buildings)
           .loc[account[0]]
           .rename(index=device_mapping)
           .rename(index=property_rename)
           .sort_index()
          )

df_db_meas.index.names = ['id', 'source', 'timestamp', 'property']
df_db_meas = df_db_meas.loc[list(device_mapping.values())]
# del df_db_meas['unit']
df_db_meas.value = df_db_meas.value.astype('float')

CPU times: user 3.25 s, sys: 116 ms, total: 3.37 s
Wall time: 3.8 s
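
The two `.rename(index=...)` calls in the cell above rely on the fact that, without a `level` argument, `DataFrame.rename` applies a dict mapper to the labels of every level of a MultiIndex. A minimal sketch on a toy frame follows; the toy data and the two-level index are assumptions for illustration, since the real frame has more levels.

```python
import pandas as pd

# Toy frame with a two-level index (device name, property)
toy = pd.DataFrame(
    {"value": [543.0, 21.1]},
    index=pd.MultiIndex.from_tuples(
        [("TWOMES-979368", "CO2concentration"), ("TWOMES-979368", "roomTemp")],
        names=["device_name", "property"],
    ),
)

# Subsets of the mappings defined earlier in this notebook
device_mapping = {"TWOMES-979368": 999169}
property_rename = {"CO2concentration": "co2_ppm", "roomTemp": "temp_in_degC"}

# Each dict only touches the labels it contains, whatever level they are in
renamed = toy.rename(index=device_mapping).rename(index=property_rename)
print(renamed)
```

Labels that do not occur in a mapping are left unchanged, which is why the later `.loc[...]` selection on the mapped room ids is still needed to drop unmapped devices.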


In [6]:
df_db_meas.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 60018 entries, (999169, 'CO2-meter-SCD4x', Timestamp('2022-10-12 16:13:00+0200', tz='Europe/Amsterdam'), 'co2_ppm') to (925038, 'CO2-meter-SCD4x', Timestamp('2022-11-02 14:00:00+0100', tz='Europe/Amsterdam'), 'temp_in_degC')
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   value   60018 non-null  float64 
 1   unit    60018 non-null  category
dtypes: category(1), float64(1)
memory usage: 994.8+ KB


In [7]:
df_db_meas

id,source,timestamp,property,value,unit
999169,CO2-meter-SCD4x,2022-10-12 16:13:00+02:00,co2_ppm,543.0,ppm
999169,CO2-meter-SCD4x,2022-10-12 16:13:00+02:00,occupancy_p,1.0,
999169,CO2-meter-SCD4x,2022-10-12 16:13:00+02:00,rel_humidity_0,49.4,%RH
999169,CO2-meter-SCD4x,2022-10-12 16:13:00+02:00,temp_in_degC,21.1,°C
999169,CO2-meter-SCD4x,2022-10-12 16:20:00+02:00,co2_ppm,524.0,ppm
...,...,...,...,...,...
925038,CO2-meter-SCD4x,2022-11-02 13:50:00+01:00,temp_in_degC,19.2,°C
925038,CO2-meter-SCD4x,2022-11-02 14:00:00+01:00,co2_ppm,925.0,ppm
925038,CO2-meter-SCD4x,2022-11-02 14:00:00+01:00,occupancy_p,2.0,
925038,CO2-meter-SCD4x,2022-11-02 14:00:00+01:00,rel_humidity_0,47.8,%RH


### Get other measurements

N.B. You first need to download [b4b-rawdata.zip from the source](https://liveadminwindesheim.sharepoint.com/:u:/r/sites/O365-Brains4Buildings/Gedeelde%20documenten/General/Windesheim%20as%20Living%20Lab/data-raw-anon/b4b-rawdata.zip?csf=1&web=1&e=M0NX1r) and save it in the `../data/` folder:

In [8]:
%%time 
df = pd.read_csv('../data/b4b-rawdata.zip', parse_dates=['timestamp'], index_col=['timezone', 'timestamp']).sort_index(level='timestamp')


df_other_meas = pd.DataFrame()
for tz in df.index.unique(level='timezone'):
    df_other_meas = pd.concat([df_other_meas, df.loc[tz].tz_localize(tz, ambiguous='NaT')])


df_other_meas = df_other_meas.sort_index()

df_other_meas = df_other_meas.loc[df_other_meas.index.dropna()]


CPU times: user 17.7 s, sys: 31 s, total: 48.7 s
Wall time: 48.8 s
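
The cell above localizes naive timestamps with `ambiguous='NaT'`, so timestamps that fall in the repeated hour at the end of summer time become `NaT`, and then drops them. The sketch below shows the same idea on a toy frame, using a boolean mask as an alternative way to drop the `NaT` rows. A mask selects each row at most once, which can matter when timestamps are not unique. The toy data is an assumption for illustration only.

```python
import pandas as pd

# Naive local timestamps around the end of summer time (30 October 2022)
naive = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.DatetimeIndex(
        [
            "2022-10-30 01:30:00",  # unambiguous, still summer time
            "2022-10-30 02:30:00",  # ambiguous: occurs twice that night
            "2022-10-30 02:30:00",  # ambiguous and duplicated
            "2022-10-30 03:30:00",  # unambiguous, winter time
        ],
        name="timestamp",
    ),
)

# Ambiguous timestamps become NaT instead of raising an error
localized = naive.tz_localize("Europe/Amsterdam", ambiguous="NaT")

# Keep only the rows whose timestamp could be localized
cleaned = localized[localized.index.notna()]
print(cleaned)
```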


In [9]:
df_other_meas.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7831891 entries, 2022-10-06 00:00:00+02:00 to 2022-11-02 23:59:01+01:00
Data columns (total 4 columns):
 #   Column    Dtype  
---  ------    -----  
 0   id        int64  
 1   source    object 
 2   property  object 
 3   value     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 298.8+ MB


In [10]:
df_other_meas

timestamp,id,source,property,value
2022-10-06 00:00:00+02:00,900846,bms,co2_ppm,367.00
2022-10-06 00:00:00+02:00,924038,bms,co2_ppm,329.12
2022-10-06 00:00:00+02:00,925038,bms,co2_ppm,424.00
2022-10-06 00:00:00+02:00,900846,bms,temp_in_degC,21.20
2022-10-06 00:00:00+02:00,924038,bms,temp_in_degC,21.29
...,...,...,...,...
2022-11-02 23:59:01+01:00,917810,bms,occupancy_bool,0.00
2022-11-02 23:59:01+01:00,917810,bms,co2_ppm,455.00
2022-11-02 23:59:01+01:00,999169,bms,temp_in_degC,20.40
2022-11-02 23:59:01+01:00,948634,bms,occupancy_bool,0.00


### Merge database and other measurements

In [11]:
df_meas = (pd.concat([
    df_db_meas.reset_index(), 
    df_other_meas.reset_index()[['id', 'source', 'timestamp', 'property', 'value']]])
                   .drop_duplicates()
                   .set_index(['id', 'source', 'timestamp', 'property'])
                   .sort_index()
                  )
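
As a minimal sketch of what this merge does, the toy example below concatenates a frame that has a `unit` column with one that does not, drops exact duplicate rows and rebuilds the four-level index. The toy values are assumptions for illustration only.

```python
import pandas as pd

ts = pd.Timestamp("2022-10-12 16:10", tz="Europe/Amsterdam")
index_names = ["id", "source", "timestamp", "property"]

# Toy stand-in for the database measurements (has a 'unit' column)
db = pd.DataFrame(
    {"id": [1], "source": ["sensor"], "timestamp": [ts],
     "property": ["co2_ppm"], "value": [543.0], "unit": ["ppm"]}
)

# Toy stand-in for the other measurements (no 'unit', one duplicated row)
other = pd.DataFrame(
    {"id": [1, 1], "source": ["bms", "bms"], "timestamp": [ts, ts],
     "property": ["co2_ppm", "co2_ppm"], "value": [500.0, 500.0]}
)

merged = (
    pd.concat([db, other])
    .drop_duplicates()        # removes the repeated bms row
    .set_index(index_names)
    .sort_index()
)
print(merged)
```

Rows that come from the frame without a `unit` column get a missing value in that column, which matches the lower non-null count for `unit` in the `info()` output of the merged frame.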

In [12]:
df_meas.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 590125 entries, (900846, 'CO2-meter-SCD4x', Timestamp('2022-10-06 13:49:00+0200', tz='Europe/Amsterdam'), 'co2_ppm') to (999169, 'xovis', Timestamp('2022-11-02 13:55:00+0100', tz='Europe/Amsterdam'), 'occupancy_p')
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   value   590125 non-null  float64 
 1   unit    60018 non-null   category
dtypes: category(1), float64(1)
memory usage: 10.3+ MB


In [13]:
df_meas

id,source,timestamp,property,value,unit
900846,CO2-meter-SCD4x,2022-10-06 13:49:00+02:00,co2_ppm,527.0,ppm
900846,CO2-meter-SCD4x,2022-10-06 13:49:00+02:00,occupancy_p,2.0,
900846,CO2-meter-SCD4x,2022-10-06 13:49:00+02:00,rel_humidity_0,58.0,%RH
900846,CO2-meter-SCD4x,2022-10-06 13:49:00+02:00,temp_in_degC,22.5,°C
900846,CO2-meter-SCD4x,2022-10-06 13:50:00+02:00,co2_ppm,530.0,ppm
...,...,...,...,...,...
999169,xovis,2022-11-02 13:35:00+01:00,occupancy_p,0.0,
999169,xovis,2022-11-02 13:40:00+01:00,occupancy_p,0.0,
999169,xovis,2022-11-02 13:45:00+01:00,occupancy_p,0.0,
999169,xovis,2022-11-02 13:50:00+01:00,occupancy_p,0.0,


### Writing raw measurements to a parquet file

In [14]:
%%time 
df_meas.to_parquet('b4b_raw_measurements.parquet', index=True, engine='pyarrow')

CPU times: user 206 ms, sys: 35.6 ms, total: 241 ms
Wall time: 232 ms


### Write raw measurements per room to parquet files

In [15]:
%%time 
for room_id in tqdm(list(df_meas.index.unique(level='id'))):
    df_meas.xs(room_id, drop_level=False).to_parquet(f'{room_id}_raw_measurements.parquet', index=True, engine='pyarrow')

CPU times: user 258 ms, sys: 36.6 ms, total: 295 ms
Wall time: 275 ms
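
The loop above uses `xs(..., drop_level=False)` so that each per-room file keeps the `id` level in its index. The sketch below shows the difference with the default behaviour on a toy frame; the data is an assumption for illustration only.

```python
import pandas as pd

toy = pd.DataFrame(
    {"value": [500.0, 21.0, 450.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, "co2_ppm"), (1, "temp_in_degC"), (2, "co2_ppm")],
        names=["id", "property"],
    ),
)

# Keeps the 'id' level, so frames read back later can be concatenated as-is
room_with_id = toy.xs(1, drop_level=False)

# Default behaviour: the selected level is removed from the index
room_without_id = toy.xs(1)

print(room_with_id.index.names)
print(room_without_id.index.name)
```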


## Put properties in separate columns, apply types and write parquet file(s)

In [16]:
# unstacking the entire Twomes dataset uses 32 GB of memory, so we have to do it room by room
del df_meas
gc.collect()

19

### Writing raw properties per room to parquet files

In [17]:
%%time
%autoreload 2

df_prop = pd.DataFrame()

for room_id in tqdm(rooms):
    df_prop_room = Measurements.to_properties(
        pd.read_parquet(f'{room_id}_raw_measurements.parquet', engine='pyarrow'),
        property_types
    )
    df_prop_room.to_parquet(f'{room_id}_raw_properties.parquet', index=True, engine='pyarrow')
    df_prop = pd.concat([df_prop, df_prop_room])   

CPU times: user 1.09 s, sys: 63.5 ms, total: 1.15 s
Wall time: 1.08 s
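
The implementation of `Measurements.to_properties` is not shown in this notebook. As a rough sketch of the kind of transformation the section title describes, the toy example below moves the `property` level into columns and applies the types from a dict like `property_types`. This is an assumption about the general approach, not the actual implementation, and the toy data is illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (1, "sensor", pd.Timestamp("2022-10-12 16:10"), "co2_ppm"),
        (1, "sensor", pd.Timestamp("2022-10-12 16:10"), "occupancy_p"),
        (1, "sensor", pd.Timestamp("2022-10-12 16:20"), "co2_ppm"),
    ],
    names=["id", "source", "timestamp", "property"],
)
long_df = pd.DataFrame({"value": [543.0, 1.0, 524.0]}, index=index)

# Subset of the type mapping; 'valve_frac_0' does not occur in the toy data
types = {"co2_ppm": "float32", "occupancy_p": "Int8", "valve_frac_0": "float32"}

# One column per property; missing combinations become NaN
wide_df = long_df["value"].unstack("property")

# Apply only the types of columns that are actually present
present = {col: dtype for col, dtype in types.items() if col in wide_df.columns}
wide_df = wide_df.astype(present)
print(wide_df.dtypes)
```

The nullable `Int8` type lets integer-valued columns such as `occupancy_p` hold missing values, which a plain NumPy integer column cannot do.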


In [18]:
df_prop.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 130030 entries, (999169, 'CO2-meter-SCD4x', Timestamp('2022-10-12 16:13:00+0200', tz='Europe/Amsterdam')) to (924038, 'xovis', Timestamp('2022-11-02 13:55:00+0100', tz='Europe/Amsterdam'))
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   co2_ppm           120351 non-null  float32
 1   door_open_bool    29 non-null      Int8   
 2   occupancy_bool    103338 non-null  Int8   
 3   occupancy_p       24334 non-null   Int8   
 4   rel_humidity_0    118338 non-null  float32
 5   temp_in_degC      120351 non-null  float32
 6   valve_frac_0      103338 non-null  float32
 7   window_open_bool  29 non-null      Int8   
dtypes: Int8(4), float32(4)
memory usage: 5.0+ MB


In [19]:
df_prop

id,source,timestamp,co2_ppm,door_open_bool,occupancy_bool,occupancy_p,rel_humidity_0,temp_in_degC,valve_frac_0,window_open_bool
999169,CO2-meter-SCD4x,2022-10-12 16:13:00+02:00,543.0,,,1,49.4,21.1,,
999169,CO2-meter-SCD4x,2022-10-12 16:20:00+02:00,524.0,,,1,47.1,21.0,,
999169,CO2-meter-SCD4x,2022-10-12 16:30:00+02:00,534.0,,,1,45.7,21.4,,
999169,CO2-meter-SCD4x,2022-10-12 16:40:00+02:00,491.0,,,1,45.0,21.5,,
999169,CO2-meter-SCD4x,2022-10-12 16:50:00+02:00,472.0,,,1,44.8,21.4,,
...,...,...,...,...,...,...,...,...,...,...
924038,xovis,2022-11-02 13:35:00+01:00,,,,0,,,,
924038,xovis,2022-11-02 13:40:00+01:00,,,,0,,,,
924038,xovis,2022-11-02 13:45:00+01:00,,,,0,,,,
924038,xovis,2022-11-02 13:50:00+01:00,,,,0,,,,


### Writing raw properties to a parquet file

In [20]:
%%time 
df_prop.to_parquet('b4b_raw_properties.parquet', index=True, engine='pyarrow')

CPU times: user 56.6 ms, sys: 7.57 ms, total: 64.2 ms
Wall time: 57.1 ms
