# Building Energy Clustering and Outlier Visualization

The file `functions.py` contains all the functions that are explained and used throughout this notebook.

# Data Collection

Currently the following datasets are in the repo:
- Building Genome Dataset
- Washington D.C. dataset

In [9]:
# functions were already run; csv files can be found in data/

import functions as func

# load Building Genome dataset (BDG)
# df_bdg = func.loadDataset('BDG')
# print("Building Genome Dataset: hourly meter data from {} buildings".format(len(df_bdg.columns)))

# load dc building dataset (DC)
# df_dc = func.loadDataset('DC')
# print("DC Dataset: 15min interval meter data (resampled to hourly) from {} buildings".format(len(df_dc.columns)))


# Extract Context

The goal of this step is to make the data homogeneous by grouping the hourly readings. Currently, the following contexts are considered:
- Weekday `weekday`
- Weekend `weekend`
- Entire Week `entireweek`

The function `extractContext(context, dataframe, datasetName)` from `functions.py` takes a time series dataframe and returns the context-related dataframe of the specified dataset; that is, it keeps only the rows whose timestamps match the context. The dataset name is needed because only some specific time periods are evaluated, depending on the dataset. For more details, see the file `RawFeatures_BDG.ipynb`.
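
As a rough illustration of what keeping only the timestamps that match a context can look like, here is a minimal pandas sketch. The helper name `filter_by_context` and the toy data are assumptions made for this example; it is not the code in `functions.py`, and it ignores the dataset-specific time periods mentioned above.

```python
import numpy as np
import pandas as pd

def filter_by_context(df, context):
    """Keep only the rows whose timestamp matches the requested context (hypothetical helper)."""
    if context == 'weekday':
        mask = df.index.dayofweek < 5    # Monday=0 ... Friday=4
    elif context == 'weekend':
        mask = df.index.dayofweek >= 5   # Saturday=5, Sunday=6
    elif context == 'entireweek':
        mask = np.ones(len(df), dtype=bool)
    else:
        raise ValueError("unknown context: {}".format(context))
    return df[mask]

# two weeks of hourly readings (starting on a Monday) for two made-up buildings
idx = pd.date_range('2015-01-05', periods=14 * 24, freq=pd.Timedelta(hours=1))
df_toy = pd.DataFrame({'building_a': np.arange(len(idx), dtype=float),
                       'building_b': np.ones(len(idx))}, index=idx)

df_toy_weekday = filter_by_context(df_toy, 'weekday')
df_toy_weekend = filter_by_context(df_toy, 'weekend')
df_toy_week = filter_by_context(df_toy, 'entireweek')
```

The sketch assumes the dataframe is indexed by a `DatetimeIndex`, with one column per building.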

In [10]:
# functions were already run; csv files can be found in data/

# import functions as func

# df_weekday_BDG = func.getContext('weekday', df_bdg, 'BDG')
# df_weekend_BDG = func.getContext('weekend', df_bdg, 'BDG')
# df_weekday_DC = func.getContext('weekday', df_dc, 'DC')
# df_weekend_DC = func.getContext('weekend', df_dc, 'DC')

# df_weekday_BDG.head(3)

In [11]:
# df_weekend_BDG.head(3)

In [12]:
# df_weekday_DC.head(3)

In [13]:
# df_weekend_DC.head(3)

# Load Curves Aggregation

Perform an aggregation of the energy consumption based on a specific context and aggregation function. Currently, the following functions are implemented:
- Average
- Median
- Linear Regression

Up to this point, the dataframes will have the following shape:

\begin{bmatrix}%
a_1^1 & a_2^1 & \dots & a_m^1 \\
a_1^2 & a_2^2 & \dots & a_m^2 \\
\vdots & \vdots & \ddots & \vdots \\
a_1^n & a_2^n & \dots & a_m^n \\
\end{bmatrix}

Where `n` is the number of timestamps (hourly), `m` is the number of buildings, and the values `a` are the meter readings. The goal is to group all existing timestamp-value pairs of each building based on the defined granularity, and then perform an aggregation on them.

For example, by calling the function `doAggregation(df_weekday_BDG, 'average', 'day', 'BDG')`, we calculate one load curve per building for a dataframe from the Building Data Genome Dataset (`BDG`), where the hourly readings of each building have been grouped by `day`. At this stage, each building has a matrix like the following:

\begin{bmatrix}%
b_1^1 & b_2^1 & \dots & b_j^1 \\
b_1^2 & b_2^2 & \dots & b_j^2 \\
\vdots & \vdots & \ddots & \vdots \\
b_1^i & b_2^i & \dots & b_j^i \\
\end{bmatrix}

Where `i` indexes the different days obtained for the specific building, `j` indexes the hourly timestamps within a day (0 to 23 in this case), and the values `b` are the hourly readings. Finally, the aggregation function `average` calculates the average of each hour over all days. This is a column-wise average, resulting in one vector of shape `(1, 24)`: one day's worth of readings per building.
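
The reshaping and column-wise average described above can be sketched with pandas as follows. This is a minimal example on made-up readings for a single building; the variable names are assumptions and the code is not taken from `doAggregation`.

```python
import numpy as np
import pandas as pd

# three days of hourly timestamps for one made-up building
idx = pd.date_range('2015-01-05', periods=3 * 24, freq=pd.Timedelta(hours=1))
# toy reading: hour of day plus day of month, so the expected average is easy to check
values = np.asarray(idx.hour + idx.day, dtype=float)

# one row per day (i), one column per hour of the day (j = 0..23)
daily = pd.DataFrame({'date': idx.date, 'hour': idx.hour, 'value': values}).pivot(
    index='date', columns='hour', values='value')

# column-wise average over all days gives one representative day of shape (1, 24)
load_curve = daily.mean(axis=0).to_numpy().reshape(1, -1)
```

Here `daily` plays the role of the day-by-hour matrix of `b` values, and `load_curve` is the single averaged day for that building.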

The specific code snippet for each aggregation function is as follows:


```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# calculate load curve based on function
# (`function` and `df_sampledReadings` come from the enclosing code;
#  rows of `df_sampledReadings` are days, columns are hours)
if function == 'average':
    load_curve = np.mean(df_sampledReadings, axis=0)

elif function == 'median':
    load_curve = np.median(df_sampledReadings, axis=0)

elif function == 'regression':
    # 1. Generate one single time series for the entire building by
    #    concatenating each day's worth of readings end to end.
    #    (DataFrame.append was removed in pandas 2.0, so flatten row-wise instead.)
    y_values = df_sampledReadings.to_numpy().reshape(-1, 1)
    x_values = np.arange(len(y_values)).reshape(-1, 1)

    # 2. Perform polynomial regressions on the single time series;
    #    the curve with the lowest root-mean-square error (RMSE) is kept
    degrees = range(1, 21)
    base_model = linear_model.LinearRegression().fit(x_values, y_values)
    base_curve = base_model.predict(x_values)
    rmse = np.sqrt(mean_squared_error(y_values, base_curve))
    load_curve = base_curve

    for d in degrees:  # fit a curve for each degree
        polynomial_features = PolynomialFeatures(degree=d)
        x_poly = polynomial_features.fit_transform(x_values)
        poly_model = linear_model.LinearRegression()
        poly_model.fit(x_poly, y_values)
        poly_curve = poly_model.predict(x_poly)
        rmse_d = np.sqrt(mean_squared_error(y_values, poly_curve))

        # keep the polynomial with the lowest RMSE
        if rmse_d < rmse:
            rmse = rmse_d
            load_curve = poly_curve
```


After repeating the process for all buildings, a csv file is generated (this is the main reason for the `name` parameter: to distinguish the saved csv files) and the returned dataframe looks like the following:

\begin{bmatrix}%
c_1^1 & c_2^1 & \dots & c_l^1 \\
c_1^2 & c_2^2 & \dots & c_l^2 \\
\vdots & \vdots & \ddots & \vdots \\
c_1^k & c_2^k & \dots & c_l^k \\
\end{bmatrix}

Where `k` is the number of buildings, `l` is the number of hours in a `day`, and the `c` values form the representative curve calculated with the chosen aggregation function (`average` in the example).



In [14]:
# calculate load curves based on aggregation functions

# df_average_weekday_BDG = func.doAggregation(df_weekday_BDG, contexts[0], aggregation_functions[0], 'day', datasets[0])
# df_median_weekday_BDG = func.doAggregation(df_weekday_BDG, contexts[0], aggregation_functions[1], 'day', datasets[0])
# df_regression_weekday_BDG = func.doAggregation(df_weekday_BDG, contexts[0], aggregation_functions[2], 'day', datasets[0])
# df_average_weekend_BDG = func.doAggregation(df_weekend_BDG, contexts[1], aggregation_functions[0], 'day', datasets[0])
# df_median_weekend_BDG = func.doAggregation(df_weekend_BDG, contexts[1], aggregation_functions[1], 'day', datasets[0])
# df_regression_weekend_BDG = func.doAggregation(df_weekend_BDG, contexts[1], aggregation_functions[2], 'day', datasets[0])

# df_average_weekday_DC = func.doAggregation(df_weekday_DC, contexts[0], aggregation_functions[0], 'day', datasets[1])
# df_median_weekday_DC = func.doAggregation(df_weekday_DC, contexts[0], aggregation_functions[1], 'day', datasets[1])
# df_regression_weekday_DC = func.doAggregation(df_weekday_DC, contexts[0], aggregation_functions[2], 'day', datasets[1])
# df_average_weekend_DC = func.doAggregation(df_weekend_DC, contexts[1], aggregation_functions[0], 'day', datasets[1])
# df_median_weekend_DC = func.doAggregation(df_weekend_DC, contexts[1], aggregation_functions[1], 'day', datasets[1])
# df_regression_weekend_DC = func.doAggregation(df_weekend_DC, contexts[1], aggregation_functions[2], 'day', datasets[1])

# functions were already run; csv files can be found in data/

# Feature Extraction

## Features learned using TSFRESH and DTW

We will use the Time Series Feature extraction based on scalable hypothesis tests (TSFRESH) library (https://github.com/blue-yonder/tsfresh). Additionally, Dynamic Time Warping (DTW) will be appended as another feature.

It is important to highlight that TSFRESH will require the raw time series values from above.
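
To make the DTW feature more concrete, the following is a minimal NumPy sketch of the classic dynamic-programming DTW distance between two load curves. It is an illustrative, quadratic-time version (the function name `dtw_distance` is an assumption), not the `fastdtw` approximation referenced in the experiments, and it does not specify which reference curve each building is compared against.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # a[i-1] matched again
                                     cost[i, j - 1],      # b[j-1] matched again
                                     cost[i - 1, j - 1])  # both advance
    return cost[n, m]

d_identical = dtw_distance([1, 2, 3], [1, 2, 3])
d_stretched = dtw_distance([0, 1, 2], [0, 0, 1, 2])   # same shape, one value repeated
d_offset = dtw_distance([0, 0, 0], [1, 1, 1])
```

Because DTW allows one point to be matched to several points of the other series, the stretched curve has the same distance to the original as an identical copy, while a constant offset is still penalized.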

## Temporal Features from existing work on BDG

Approximately 215 features have already been extracted in previous work (https://github.com/buds-lab/temporal-features-for-nonres-buildings-library).

# Experiments

### Experiment 1: k-Shape on Raw Time Series

1. Select context csv to work with (see above)
2. Download k-Shape library (https://github.com/Mic92/kshape and https://tslearn.readthedocs.io/en/latest/gen_modules/clustering/tslearn.clustering.KShape.html#tslearn.clustering.KShape)
3. Run k-Shape algorithm
4. Evaluation:
    1. Evaluate resulting clusters with silhouette coefficient plot
    2. Evaluate resulting clusters with elbow method
    
See the notebook `Experiment1_kshape.ipynb` for the actual code
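
For intuition about what k-Shape measures, the sketch below computes a shape-based distance from the normalized cross-correlation of two series, which is the kind of distance k-Shape is built on. It is a simplified NumPy illustration with made-up curves and an assumed function name; the experiment itself relies on the library implementations linked above, which also z-normalize the series first.

```python
import numpy as np

def shape_based_distance(x, y):
    """1 minus the maximum normalized cross-correlation over all shifts."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # cross-correlation at every possible shift, scaled by the two norms
    ncc = np.correlate(x, y, mode='full') / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - ncc.max()

peak = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
shifted_peak = np.roll(peak, 1)                        # same shape, one step later
late_ramp = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 3.0])   # a different shape

d_same_shape = shape_based_distance(peak, shifted_peak)
d_other_shape = shape_based_distance(peak, late_ramp)
```

Two curves with the same shape but a time shift end up at (numerically) zero distance, which is why this kind of distance suits load profiles whose peaks occur at slightly different hours.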

### Experiment 2: Feature Extraction and Clustering

1. Select context csv to work with (see above)
2. Download TSFRESH library (https://github.com/blue-yonder/tsfresh)
3. Run TSFRESH on dataset
4. Calculate Dynamic Time Warping (DTW) (https://pypi.org/project/fastdtw/) as an extra feature
5. Run clustering algorithms
    1. Run K-means on resulting features (TSFRESH + DTW)
        1. Run with K = 5
        2. Run with K $\in$ [2,10]
    2. Run Hierarchical clustering on resulting features (TSFRESH + DTW)
        1. Run with K = 5
        2. Run with K $\in$ [2,10]
6. Evaluation:
    1. Evaluate resulting clusters with silhouette coefficient plot
    2. Evaluate resulting clusters with elbow method
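
The clustering and evaluation loop in steps 5 and 6 can be sketched with scikit-learn as follows. The feature matrix here is synthetic (three well-separated groups standing in for the TSFRESH + DTW features), and the parameter choices such as `n_init=10` and the default Ward linkage are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# made-up feature matrix: 3 groups of 30 "buildings" with 4 features each
centers = np.array([[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 4)) for c in centers])

results = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    hier_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    results[k] = {
        'inertia': km.inertia_,  # within-cluster sum of squares, used for the elbow plot
        'silhouette_kmeans': silhouette_score(X, km.labels_),
        'silhouette_hier': silhouette_score(X, hier_labels),
    }

# K with the highest K-means silhouette coefficient
best_k = max(results, key=lambda k: results[k]['silhouette_kmeans'])
```

Plotting `inertia` against K gives the elbow curve, while the silhouette values can be compared directly across K.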

### Experiment 3: Feature Extraction and Classification

1. Select context csv to work with (see above)
2. Download TSFRESH library (https://github.com/blue-yonder/tsfresh)
3. Run TSFRESH on dataset
4. Calculate Dynamic Time Warping (DTW) (https://pypi.org/project/fastdtw/) as an extra feature
5. Run classification algorithms:
    1. Append primary use type as ground truth labels from metadata **(data/meta_open.csv)**
    2. Run Random-Forest on resulting features (TSFRESH + DTW)
    3. Run SVM on resulting features (TSFRESH + DTW)
6. Evaluation:
    1. Micro-averaged F1 score using ground truth labels from metadata
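
A minimal scikit-learn sketch of steps 5 and 6 is shown below. The features and the two primary-use labels are synthetic, and the train/test split is an assumption for the example, since the evaluation protocol is not specified here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# made-up features for 80 buildings with two hypothetical primary use types
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(40, 4)),
               rng.normal(loc=5.0, scale=0.5, size=(40, 4))])
y = np.array(['Office'] * 40 + ['Dormitory'] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'svm': SVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # micro-averaging pools all predictions before computing F1
    scores[name] = f1_score(y_test, model.predict(X_test), average='micro')
```

With single-label data like this, the micro-averaged F1 score equals plain accuracy.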

# Validation

To determine the quality of the different clusterings, the following metrics were calculated, and the right K was chosen based on them.

## Calculations

- **Cohesion** calculates the sum of squared distances from each data point to its respective centroid. **LOWER THE BETTER**

- **Separation** calculation leverages the fact that the following equation always holds true: TSS = WSS + BSS

where TSS is the total sum of squared distances from each data point to the overall centroid, WSS is cohesion, and BSS is separation. **HIGHER THE BETTER**

- The **Calinski-Harabasz index (CH)** evaluates the cluster validity based on the average between- and within-cluster sum of squares. It is the ratio of separation to cohesion, each adjusted by its respective degrees of freedom. **HIGHER THE BETTER**

- **Davies-Bouldin index** iterates over every cluster and calculates a statistic using `DB_find_max_j`. Then, it averages the statistic over all clusters.

The statistic is calculated as follows:

- For every other cluster, calculate the average distance of every point in that cluster to its centroid
- Also calculate the average distance of every point in the current cluster to the centroid
- Add the two values together
- Divide the sum by the Euclidean distance between the two cluster centroids and obtain a candidate value
- Find the maximum value among all the candidate values for every other cluster to be the statistic

**The smaller the index, the better the clustering result.** Minimizing this index yields the clusters that are most distinct from each other, and therefore the best partition.
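
The Davies-Bouldin steps listed above can be sketched in NumPy as follows. The function name `davies_bouldin` and the toy data are assumptions for this example (it is not the `DB_find_max_j` helper); the result is compared against scikit-learn's `davies_bouldin_score` as a sanity check.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def davies_bouldin(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # average distance of every point in a cluster to its own centroid
    scatter = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                        for i, c in enumerate(ids)])
    stats = []
    for i in range(len(ids)):
        # candidate value against every other cluster: summed scatter / centroid distance
        candidates = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                      for j in range(len(ids)) if j != i]
        stats.append(max(candidates))
    # average the per-cluster statistic over all clusters
    return float(np.mean(stats))

X_toy = np.array([[0, 0], [0, 2], [10, 0], [10, 2], [0, 10], [0, 12]], dtype=float)
labels_toy = np.array([0, 0, 1, 1, 2, 2])

db_manual = davies_bouldin(X_toy, labels_toy)
db_sklearn = davies_bouldin_score(X_toy, labels_toy)
```

In this toy case every cluster has an average scatter of 1 and the closest centroids are 10 apart, so each per-cluster statistic is (1 + 1) / 10.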

- **R-squared** can be expressed in terms of separation and cohesion as follows: R-squared = separation / (cohesion + separation) **HIGHER THE BETTER**

- **RMSSTD** first computes the sum of squared distances from each data point to its respective centroid, which is SSE or cohesion. Then, it divides the value by the product of the number of attributes and the degrees of freedom, which are calculated as the number of data points minus the number of clusters. Lastly, we take the square root of the value to obtain RMSSTD. **LOWER THE BETTER**

- **Xie-Beni** index first calculates SSE or cohesion by taking the sum of the squared distances from each data point to its respective centroid. We denote this by A. Then, it finds the minimum pairwise squared distance between cluster centroids. We denote this by B. We denote the number of data points as n. The Xie-Beni index is calculated as A / (n*B).

The Xie-Beni index defines the inter-cluster separation as the minimum square distance between cluster centers, and the intra-cluster compactness as the mean square distance between each data object and its cluster center. **The optimal cluster number is reached when the minimum of Xie-Beni index is found.**
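
The metrics based on sums of squares (cohesion, separation, Calinski-Harabasz, R-squared, RMSSTD and Xie-Beni) can all be derived from the same few quantities. The following NumPy sketch follows the definitions given above on a tiny made-up dataset; the function name and the data are assumptions for illustration, not the project's implementation.

```python
import numpy as np

def sum_of_squares_metrics(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape                      # data points, attributes
    ids = np.unique(labels)
    k = len(ids)                        # number of clusters
    overall = X.mean(axis=0)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    sizes = np.array([(labels == c).sum() for c in ids])

    # cohesion (WSS), separation (BSS) and total sum of squares (TSS)
    wss = sum(((X[labels == c] - centroids[i]) ** 2).sum() for i, c in enumerate(ids))
    bss = (sizes * ((centroids - overall) ** 2).sum(axis=1)).sum()
    tss = ((X - overall) ** 2).sum()

    # minimum squared distance between two distinct centroids (B in Xie-Beni)
    sq_dists = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    min_sq = sq_dists[~np.eye(k, dtype=bool)].min()

    return {
        'cohesion': wss,
        'separation': bss,
        'tss': tss,
        'calinski_harabasz': (bss / (k - 1)) / (wss / (n - k)),
        'r_squared': bss / (wss + bss),
        'rmsstd': np.sqrt(wss / (d * (n - k))),
        'xie_beni': wss / (n * min_sq),
    }

X_toy = np.array([[0, 0], [0, 2], [10, 0], [10, 2]], dtype=float)
labels_toy = np.array([0, 0, 1, 1])
metrics = sum_of_squares_metrics(X_toy, labels_toy)
```

For this toy data the centroids are (0, 1) and (10, 1), so cohesion is 4, separation is 100, and their sum matches the total sum of squares of 104, as the identity TSS = WSS + BSS requires.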

Result pictures can be seen in `img/`

## Choosing of K

In order to choose the right K, we list the top 3 K values for each metric. Then, we count how many times each different K value appears in those top 3 for the 7 different metrics. Finally, we choose the K that has the highest repetition among them.

If a tie occurs, all tied K values are kept for a further evaluation step, in which k-Shape is run for those K candidates.
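
The voting procedure can be sketched as below. The scores are made up and only three metrics are shown for brevity; the function name `choose_k` and the dictionary layout are assumptions for this example.

```python
from collections import Counter

def choose_k(scores, higher_is_better, top_n=3):
    """Count how often each K appears in the top_n of every metric."""
    votes = Counter()
    for metric, by_k in scores.items():
        # best K first: descending if higher is better, ascending otherwise
        ranked = sorted(by_k, key=by_k.get, reverse=higher_is_better[metric])
        votes.update(ranked[:top_n])
    most_votes = max(votes.values())
    # all K values sharing the highest count (more than one means a tie)
    candidates = sorted(k for k, v in votes.items() if v == most_votes)
    return candidates, votes

# made-up metric values per K
scores = {
    'cohesion': {2: 90.0, 3: 40.0, 4: 35.0, 5: 33.0},
    'calinski_harabasz': {2: 50.0, 3: 120.0, 4: 110.0, 5: 80.0},
    'davies_bouldin': {2: 0.4, 3: 0.3, 4: 0.6, 5: 0.7},
}
higher_is_better = {'cohesion': False, 'calinski_harabasz': True, 'davies_bouldin': False}

candidates, votes = choose_k(scores, higher_is_better)
```

In this example K = 3 and K = 4 each appear in the top 3 of all three metrics, so both are returned as candidates for the tie-breaking step.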