# Building Energy Clustering and Outlier Visualization

```python
import pandas as pd

# load building gnome dataset (BGD)
df_bgd = pd.read_csv('data/temp_open_utc_complete.csv')
print("Building Gnome Dataset: hourly meter data from {} buildings".format(len(df_bgd.columns) - 1))

# load dc building dataset (DC)
# df_dc = pd.read_csv('../data/temp_open_utc_complete.csv')
# print("DC Dataset: hourly meter data from {} buildings".format(len(df_dc.columns)))
```

```
Building Gnome Dataset: hourly meter data from 507 buildings
```


## Sampling Methods

The goal is to sample the original raw data so that we end up with an n x m matrix, where n is the number of buildings and m is the number of meter readings in each building's time series. An additional feature can be prepended to indicate the building ID, although the index of the table can also serve this purpose.

\begin{bmatrix}%
x_1^1 & x_2^1 & \dots & x_m^1 \\
x_1^2 & x_2^2 & \dots & x_m^2 \\
\vdots & \vdots & \ddots & \vdots \\
x_1^n & x_2^n & \dots & x_m^n \\
\end{bmatrix}

### Context Definition

To build the matrix above, a common time window is selected (**from 01/01/15 to 30/11/15**). This window covers **368** buildings (**72.6%** of the dataset). Additionally, the following contexts are defined:

- Week day: Daily cumulative meter readings for each week day of the selected time window **(data/weekdayContext.csv)**.
- Weekend: Daily cumulative meter readings for weekend days of the selected time window **(data/weekendContext.csv)**.
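
The two contexts above can be derived from the hourly data roughly as follows. This is a minimal sketch, not the code that produced the csv files: the function name `build_contexts`, the assumption that the frame has a `DatetimeIndex` with one column per building, and the rule of keeping only buildings without missing values in the window are all assumptions made for illustration.

```python
import pandas as pd


def build_contexts(df, start="2015-01-01", end="2015-11-30"):
    """Return (weekday, weekend) matrices of daily totals, shaped buildings x days.

    Assumes `df` has a DatetimeIndex and one column per building (hypothetical layout).
    """
    # Restrict to the common time window (both end points inclusive).
    window = df.loc[start:end]
    # Assumption: keep only buildings with no missing readings in the window.
    window = window.dropna(axis=1, how="any")
    # Daily cumulative reading = sum of the hourly readings of each day.
    daily = window.resample("D").sum()
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are weekend days.
    is_weekend = daily.index.dayofweek >= 5
    weekday = daily.loc[~is_weekend].T  # rows: buildings, columns: days
    weekend = daily.loc[is_weekend].T
    return weekday, weekend


# Tiny synthetic example: two buildings, two weeks of hourly readings.
idx = pd.date_range("2015-01-01", periods=14 * 24, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"bldg_a": 1.0, "bldg_b": 2.0}, index=idx)
weekday, weekend = build_contexts(demo)
print(weekday.shape, weekend.shape)
```

Each returned frame has one row per building, so it already matches the n x m layout described earlier and can be written out with `to_csv`.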


## Features

### Raw Values

Raw values of the time series, from 01/01/15 to 30/11/15:
- Week day context: 368 buildings x 238 days
- Weekend context: 368 buildings x 97 days

### Features learned using TSFRESH and DTW

We will use the Time Series Feature extraction based on scalable hypothesis tests (TSFRESH) library (https://github.com/blue-yonder/tsfresh). Additionally, a Dynamic Time Warping (DTW) distance will be appended as another feature.

It is important to highlight that TSFRESH will require the raw time series values from above.
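
To make the DTW feature concrete, here is a minimal sketch of the classic dynamic-programming DTW distance in plain NumPy. It is an exact O(n·m) version written for illustration, whereas the experiments refer to the approximate `fastdtw` package. The notes do not say which series each building is compared against, so the helper `dtw_to_reference`, which measures each row against the mean profile, is purely a hypothetical choice.

```python
import numpy as np


def dtw_distance(a, b):
    """Exact DTW distance between two 1-D series, using absolute difference as local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


def dtw_to_reference(matrix, reference=None):
    """One DTW value per row (building). The reference series is an assumption:
    by default the column-wise mean profile of the matrix."""
    matrix = np.asarray(matrix, dtype=float)
    if reference is None:
        reference = matrix.mean(axis=0)
    return np.array([dtw_distance(row, reference) for row in matrix])


profiles = np.array([[1.0, 2.0, 3.0, 2.0],
                     [1.0, 2.0, 3.0, 2.0],
                     [3.0, 2.0, 1.0, 2.0]])
dtw_feature = dtw_to_reference(profiles)
print(dtw_feature)
```

The resulting vector has one entry per building and can be appended as an extra column to the TSFRESH feature table.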

### Temporal Features from existing work on BGD

Approximately 215 features have already been extracted in previous work (https://github.com/buds-lab/temporal-features-for-nonres-buildings-library)

## Experiments

First, generate the csv files for each context. Currently, the **week day** and **weekend** context csv files can be found in **data/**

### Experiment 1: k-Shape on Raw Time Series

1. Select context csv to work with (see above)
2. Download k-Shape library (https://github.com/Mic92/kshape)
3. Run k-Shape algorithm
4. Evaluation:
    1. Evaluate resulting clusters with silhouette coefficient plot
    2. Evaluate resulting clusters with elbow method
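
For background on step 3, k-Shape compares series with a shape-based distance (SBD) built on normalized cross-correlation. The sketch below computes that distance with NumPy only; it is not the API of the linked library, it assumes equal-length series, and the function names and the handling of constant series are choices made for this illustration.

```python
import numpy as np


def znorm(x):
    """Z-normalize a series; a constant series is only mean-centered."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()


def sbd(x, y):
    """Shape-based distance: 1 - max over shifts of the normalized cross-correlation."""
    x, y = znorm(x), znorm(y)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        # Convention for this sketch only: a constant series has no shape to compare.
        return 1.0
    # Cross-correlation at every possible shift, scaled into [-1, 1].
    ncc = np.correlate(x, y, mode="full") / denom
    return 1.0 - ncc.max()


def pairwise_sbd(matrix):
    """Symmetric SBD matrix over the rows (buildings) of `matrix`."""
    matrix = np.asarray(matrix, dtype=float)
    n = len(matrix)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = sbd(matrix[i], matrix[j])
    return dist  # diagonal stays exactly zero


series = np.array([[0, 1, 2, 1, 0, 0, 0],
                   [0, 0, 0, 1, 2, 1, 0],
                   [2, 0, 2, 0, 2, 0, 2]], dtype=float)
distances = pairwise_sbd(series)
print(np.round(distances, 3))
```

A matrix like `distances` can be passed to scikit-learn's `silhouette_score` with `metric="precomputed"`, so the silhouette evaluation uses the same notion of distance as the clustering.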

### Experiment 2: Feature Extraction and Clustering

1. Select context csv to work with (see above)
2. Download TSFRESH library (https://github.com/blue-yonder/tsfresh)
3. Run TSFRESH on dataset
4. Calculate Dynamic Time Warping (DTW) (https://pypi.org/project/fastdtw/) as an extra feature
5. Run clustering algorithms
    1. Run K-means on resulting features (TSFRESH + DTW)
        1. Run with K = 5
        2. Run with K $\in$ [2, 10]
    2. Run Hierarchical clustering on resulting features (TSFRESH + DTW)
        1. Run with K = 5
        2. Run with K $\in$ [2, 10]
6. Evaluation:
    1. Evaluate resulting clusters with silhouette coefficient plot
    2. Evaluate resulting clusters with elbow method
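
Steps 5 and 6 could look roughly like the sketch below, which uses scikit-learn. The feature matrix here is synthetic (`make_blobs`) and only stands in for the real TSFRESH + DTW table. Standardizing the features and using Ward linkage (scikit-learn's default) are assumptions, and the sketch reports the mean silhouette score per K rather than drawing the full silhouette plot.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (buildings x features) TSFRESH + DTW matrix.
X, _ = make_blobs(n_samples=120, centers=5, n_features=8, random_state=0)
# Assumption: features are standardized so no single feature dominates distances.
X = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 11):  # K in [2, 10]
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    hc = AgglomerativeClustering(n_clusters=k).fit(X)  # Ward linkage by default
    results[k] = {
        # Within-cluster sum of squares: the quantity plotted for the elbow method.
        "inertia": km.inertia_,
        "kmeans_silhouette": silhouette_score(X, km.labels_),
        "hier_silhouette": silhouette_score(X, hc.labels_),
    }

for k, r in results.items():
    print(k, round(r["inertia"], 1),
          round(r["kmeans_silhouette"], 3), round(r["hier_silhouette"], 3))
```

Plotting `inertia` against K gives the elbow curve, while the K with the highest mean silhouette is a second, independent indication of a reasonable cluster count.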

### Experiment 3: Feature Extraction and Classification

1. Select context csv to work with (see above)
2. Download TSFRESH library (https://github.com/blue-yonder/tsfresh)
3. Run TSFRESH on dataset
4. Calculate Dynamic Time Warping (DTW) (https://pypi.org/project/fastdtw/) as an extra feature
5. Run classification algorithms:
    1. Append primary use type as ground truth labels from meta data **(data/meta_open.csv)**
    2. Run Random-Forest on resulting features (TSFRESH + DTW)
    3. Run SVM on resulting features (TSFRESH + DTW)
6. Evaluation:
    1. Micro-averaged F1 score using ground truth labels from metadata
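
The classification and evaluation steps might be sketched as follows with scikit-learn. The data is synthetic (`make_classification`) and stands in for the TSFRESH + DTW features and the primary use type labels. The train/test split, the hyperparameters, and scaling the features before the SVM are assumptions, not choices stated in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 300 "buildings", 20 features, 4 primary use types.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
# Assumption: a stratified hold-out split; cross-validation would also work.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # SVMs are sensitive to feature scale, so scale inside a pipeline.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = f1_score(y_test, predictions, average="micro")

print(scores)
```

For single-label multiclass problems such as this one, the micro-averaged F1 score equals plain accuracy, so a macro-averaged score may be worth reporting alongside it if the use types are imbalanced.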