## This notebook contains routines to analyze electric load profiles from PG&E office buildings
#### Brief description of this workflow:
1. Pre-processing (skip this step if you already have cleaned data)
    - Extract data for a building type (e.g. office building in our case)
    - Remove empty and problematic data
    - Convert data into dataframes ([example](../data_all/1004541105.csv))
    - Get typical building load profiles
        - Create annual heatmaps
        - Use a pre-trained CNN model and k-means clustering to separate typical building loads (high load during the daytime) from non-typical loads (e.g., high load during the nighttime)
2. Conduct Frequency-Domain analysis
    - Annual analysis
        - Create bins to group high, medium, and low frequency features
    - Daily analysis
        - How

### Step 1. Pre-processing
Previously, we found that some load profiles peak during the nighttime. We want to explore what those load profiles look like and whether we should separate them from the typical load profiles.

In [1]:
%pwd
%cd ..

/mnt/c/Users/hlee9/Documents/GitHub/DOE_EULP/EULP


#### Import libraries and set up paths

In [None]:
# Import utility functions
# Change directory to the EULP root path, use %cd path_to_EULP
import os
from lib import data_exploration_utils as ex
dir_root = %pwd
dir_data = os.path.join(dir_root, "data_all")
dir_fig = os.path.join(dir_root, "fig")

#### Generate heatmaps and time-series line plots for visualization

In [3]:
v_ts_CSVs = ex.get_all_file_paths(dir_data, 'csv')
# Create these directories if they do not exist
dir_heatmaps = os.path.join(dir_fig, "ts_heatmaps")
dir_lines = os.path.join(dir_fig, "ts_lines")
os.makedirs(dir_heatmaps, exist_ok=True)
os.makedirs(dir_lines, exist_ok=True)

# for i, ts_csv in enumerate(v_ts_CSVs):
#     sp_id = os.path.basename(ts_csv).split('.')[0]
#     df_t = ex.clean_pge_df_ts(ts_csv, 2015).dropna()
#     df_t = df_t['Value'].squeeze()
#     dir_heatmap = os.path.join(dir_heatmaps, f"{sp_id}.png")
#     try:
#         ex.generate_heatmap(df_t, dir_heatmap)
#         ex.generate_ts_html(df_t, sp_id, dir_lines)
#     except Exception:
#         pass  # Skip buildings whose plots cannot be generated
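
The internals of `ex.generate_heatmap` are not shown in this notebook. As a rough illustration of what an annual heatmap encodes, the sketch below pivots a datetime-indexed load series into a day-of-year by hour-of-day table. The function name `to_day_hour_matrix` is hypothetical and is not part of `data_exploration_utils`; the sketch also assumes the series has a `DatetimeIndex`.

```python
import numpy as np
import pandas as pd


def to_day_hour_matrix(series: pd.Series) -> pd.DataFrame:
    """Pivot a datetime-indexed load series into a (day-of-year x hour-of-day) table.

    Hypothetical helper for illustration only; sub-hourly readings that fall
    in the same hour are averaged.
    """
    frame = pd.DataFrame(
        {
            "day": series.index.dayofyear,
            "hour": series.index.hour,
            "load": series.to_numpy(),
        }
    )
    # Rows are days of the year, columns are hours of the day
    return frame.pivot_table(index="day", columns="hour", values="load", aggfunc="mean")


# Synthetic example: one year of hourly readings
index = pd.date_range("2015-01-01", periods=365 * 24, freq=pd.Timedelta(hours=1))
demo_series = pd.Series(np.arange(365 * 24, dtype=float), index=index)
demo_matrix = to_day_hour_matrix(demo_series)
```

The resulting table can be passed to any image-style plotting function, so that daytime-heavy and nighttime-heavy buildings show up as visibly different bands.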

#### Apply k-means clustering with features generated by a pre-trained CNN model

In [12]:
# Get conv base features
v_heatmaps = ex.get_all_file_paths(dir_heatmaps, 'png')
model_conv_base = ex.model_vgg16_conv_base()
v_conv_base_features = [ex.get_conv_base_features(img, model_conv_base).flatten() for img in v_heatmaps]

In [13]:
# Clustering (k=2 because we already know there are two distinct patterns) 
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(v_conv_base_features)

In [15]:
# Copy heatmaps to corresponding cluster folder for comparison
import shutil
kmeans_labels = kmeans.labels_
step_1_out_dir = os.path.join(dir_fig, 'step_1')

for heatmap, label in zip(v_heatmaps, kmeans_labels):
    cluster_dir = os.path.join(step_1_out_dir, f"cluster_{label}")
    # makedirs also creates step_1_out_dir if it is missing
    os.makedirs(cluster_dir, exist_ok=True)
    shutil.copy(heatmap, cluster_dir)

In [36]:
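# Get the typical load profiles
import numpy as np
typical_label = 1  # Caution: the typical cluster might be labeled 0 instead; check the cluster folders
# Match on sp_id because the labels follow v_heatmaps, which can be shorter than v_ts_CSVs
typical_ids = {os.path.basename(p).split('.')[0] for p, label in zip(v_heatmaps, kmeans_labels) if label == typical_label}
v_typical_ts_CSVs = [p for p in v_ts_CSVs if os.path.basename(p).split('.')[0] in typical_ids]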
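
Because k-means assigns cluster numbers arbitrarily, the caution above can be handled programmatically rather than by visual inspection. The sketch below is one possible approach, not the method used in this notebook: it computes a daytime-to-nighttime load ratio per building and picks the cluster whose members have the highest median ratio. The helper names and the 08:00 to 18:00 daytime window are assumptions, and the input is assumed to be an hourly series that starts at midnight.

```python
import numpy as np


def day_night_ratio(hourly_load, day_start=8, day_end=18):
    """Mean daytime load divided by mean nighttime load.

    Assumes hourly values starting at midnight; the daytime window
    [day_start, day_end) is an illustrative choice.
    """
    load = np.asarray(hourly_load, dtype=float)
    n_days = load.size // 24
    by_day = load[: n_days * 24].reshape(n_days, 24)  # drop any partial last day
    day_mask = np.zeros(24, dtype=bool)
    day_mask[day_start:day_end] = True
    return by_day[:, day_mask].mean() / by_day[:, ~day_mask].mean()


def pick_typical_label(labels, ratios):
    """Return the cluster label whose members have the highest median day/night ratio."""
    labels = np.asarray(labels)
    ratios = np.asarray(ratios, dtype=float)
    best = max(np.unique(labels), key=lambda lab: np.median(ratios[labels == lab]))
    return int(best)
```

With one ratio per clustered building, in the same order as the cluster labels, `pick_typical_label(kmeans_labels, ratios)` would return the label to use for the typical profiles.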

### Step 2. Frequency-domain analysis

#### Get frequency-domain features at annual window level
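
As a minimal sketch of the binning idea from the workflow description (grouping high-, medium-, and low-frequency features over an annual window), the example below takes the FFT of an hourly series and reports the share of spectral energy in three bands. The function name `band_energy_shares` and the band edges (0.1 and 0.9 cycles per day) are illustrative assumptions, not the values used in this analysis.

```python
import numpy as np


def band_energy_shares(hourly_load, band_edges_cpd=(0.1, 0.9)):
    """Share of spectral energy in low, medium, and high frequency bands.

    Frequencies are in cycles per day. The default edges are assumptions:
    below 0.1 is 'low' (seasonal), 0.1 to 0.9 is 'medium' (e.g. weekly),
    and 0.9 or above is 'high' (daily and faster).
    """
    load = np.asarray(hourly_load, dtype=float)
    # Remove the mean so the zero-frequency term does not dominate
    spectrum = np.fft.rfft(load - load.mean())
    # Sample spacing is 1/24 day, so frequencies come out in cycles per day
    freqs_cpd = np.fft.rfftfreq(load.size, d=1 / 24)
    power = np.abs(spectrum) ** 2
    low_edge, high_edge = band_edges_cpd
    total = power.sum()  # zero for a constant series; not handled in this sketch
    return {
        "low": power[freqs_cpd < low_edge].sum() / total,
        "medium": power[(freqs_cpd >= low_edge) & (freqs_cpd < high_edge)].sum() / total,
        "high": power[freqs_cpd >= high_edge].sum() / total,
    }


# Synthetic example: a pure daily cycle over one year of hourly samples
t_days = np.arange(365 * 24) / 24.0
demo_shares = band_energy_shares(np.sin(2 * np.pi * t_days))
```

For a building with a strong daily schedule, most of the energy should land in the high band, while seasonal drift shows up in the low band.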