**Table of contents**<a id='toc0_'></a>    
- [Import packages](#toc1_)    
- [Define read and save path](#toc2_)    
- [Load data](#toc3_)    
- [Training](#toc4_)    
- [Cluster Analysis](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [None]:
# !pip install -q folium sktime tslearn validclust tsfresh tsfel autoelbow deeptime
# !pip install -q smac==0.8.0 autocluster

This is the main notebook for modelling

TODO:

1. Cluster analysis: what characteristics do the members of each cluster share?
2. Ensemble clustering
3. Automatic clustering
4. Better imputation methods
5. Other feature extraction methods
6. Other dimensionality reduction methods

# <a id='toc1_'></a>[Import packages](#toc0_)

In [None]:
import os
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from matplotlib.colors import to_hex
from importlib import reload

import sys
if Path('/content/drive/MyDrive').exists():
  sys.path.append('/content/drive/MyDrive/Colab Notebooks/custom_modules')
else:
  sys.path.append('./custom_modules')
import TSClustering
reload(TSClustering)
from TSClustering import TSClustering

# <a id='toc2_'></a>[Define read and save path](#toc0_)

In [None]:
local_path = Path('../data_preprocessed')
drive_path = Path('/content/drive/MyDrive/ProcessedData_Melbourne_Footfalls')

base_path = local_path if local_path.exists() else drive_path

save_dir = Path('../Results') if local_path.exists() else Path('/content/drive/MyDrive/Results_Melbourne_Footfalls')
# Create the results directory (and any missing parents) on first run
if not save_dir.exists():
  save_dir.mkdir(parents=True, exist_ok=True)

read_processed_dir = base_path / '1. merged_peds_data_hist_curr'
read_raw_dir = Path('./Data (20230918)') if local_path.exists() else Path('/content/drive/MyDrive/Data/Melbourne_Footfalls')

# <a id='toc3_'></a>[Load data](#toc0_)

In [None]:
data = pd.read_csv(read_processed_dir / 'footfall_merged.csv') # expects unpivoted (long-format) data
data.rename(columns={'New_Sensor_Name': 'Sensor_Name'}, inplace=True)
data.head()

The original data is unpivoted (long format): one row per sensor per timestamp.
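
To make the distinction between the unpivoted and pivoted layouts concrete, the sketch below builds a tiny long-format frame and pivots it to one row per sensor. The column names match the ones used in this notebook (`Sensor_Name`, `Date_Time`, `Hourly_Counts`); the sensor names and counts are made up for illustration.

```python
import pandas as pd

# Toy unpivoted (long-format) data: one row per sensor per timestamp.
# Sensor names and counts are invented for this example.
long_df = pd.DataFrame({
    "Sensor_Name": ["A", "A", "B", "B"],
    "Date_Time": pd.to_datetime([
        "2019-01-01 00:00", "2019-01-01 01:00",
        "2019-01-01 00:00", "2019-01-01 01:00",
    ]),
    "Hourly_Counts": [10, 12, 3, 5],
})

# Pivoted form: one row per sensor, one column per timestamp.
wide_df = long_df.pivot_table(
    index="Sensor_Name",
    columns="Date_Time",
    values="Hourly_Counts",
    aggfunc="sum",
)

print(wide_df)
```

Each row of `wide_df` is then one time series, which is the shape most time-series clustering routines expect.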

In [None]:
data.shape

In [None]:
# sensor_locations = pd.read_excel(read_raw_dir / 'pedestrian-counting-system-sensor-locations.xlsx')
sensor_locations = pd.read_excel(read_processed_dir / 'sensor_locations_processed.xlsx')
sensor_locations.drop(columns='Sensor_Name', inplace=True)
sensor_locations.rename(columns={'New_Sensor_Name': 'Sensor_Name'}, inplace=True)
sensor_locations.head()

# <a id='toc4_'></a>[Training](#toc0_)

    """
    Parameters:
    - data: hourly footfall data, unpivoted (long format) by default
    - metric (passed via model_configs):
      "euclidean", "dtw", "softdtw" or None
    - scale: None or
      'day', 'week', 'month', 'year', 'hour',
      'early_morning', 'morning', 'midday', 'afternoon', 'evening',
      'workday', 'weekend'
    - algorithm:
      "kmeans", "kshape", "kernelkmeans", "birch", "ensemble"
    - time_span: str, int, float, list or None
      "normal" (before 2020),
      2019 (or other single year),
      [start_date, end_date] or None
    - normalise:
      "meanvariance", "minmax" or None
    - feature_extraction:
      True, False or None
    - dim_reduction:
      'PCA', 'IPCA' or None
    - order_of_impute_agg_norm:
      "impute_agg_norm", "impute_norm_agg", "agg_impute_norm", or "agg_norm_impute"
    """

In [None]:
model_configs = {
  "metric": 'dtw',
  "random_state": 42
}

configs = {
  "data": data.copy(),
  "target_column": 'Sensor_Name', # target (sensor name)
  "time_column": 'Date_Time', # feature names (timestamp)
  "value_column": 'Hourly_Counts', # value
  # "sensor_locations": sensor_locations.copy(), # sensor location meta data
  "sensor_locations": data[['Sensor_Name', 'Latitude', 'Longitude', 'Location']],
  "save_dir": save_dir,
  "algorithm": 'kmeans',
  "scale": 'week', 
  "order_of_impute_agg_norm": "impute_agg_norm", 
  "time_span": 2019, 
  "feature_extraction": None, 
  "dim_reduction": "PCA", 
  "normalise": "meanvariance", 
  "model_configs": model_configs, 
  "seed": 42,
  "verbose": False
}
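
The `"metric": 'dtw'` entry in `model_configs` selects dynamic time warping (DTW), which compares two series after allowing them to stretch or shift in time. The sketch below is a plain NumPy version for illustration only; `dtw_distance` is a hypothetical helper and not the routine that `TSClustering` calls internally, which this notebook does not show.

```python
import numpy as np

def dtw_distance(a, b) -> float:
    """Classic O(n*m) dynamic time warping with squared pointwise cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

# Two footfall-like profiles with the same peak, shifted by one step.
x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 1, 2, 1, 0]

euclidean = float(np.linalg.norm(np.subtract(x, y)))
print("euclidean:", euclidean)
print("dtw:", dtw_distance(x, y))
```

The Euclidean distance penalises the shift, while DTW aligns the two peaks and reports no difference. This is the usual reason for preferring DTW when sensors show the same daily pattern at slightly different hours.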

In [None]:
if configs['algorithm'] == 'birch':
  # the data has been split into chunks, sensors in each chunk should have same time span 
  # and have less than 50% missing values

  read_path = base_path / '4. final_group'

  # List all files in the directory and sort them based on the start year for processing in order
  # all_files = [f for f in os.listdir(read_path) if f.startswith('grouped_data_')] # pivoted format
  # all_files = sorted(all_files, key=lambda x: int(x.split('_')[2]))

  all_files = [f for f in os.listdir(read_path) if f.startswith('data_')] # unpivoted (long) format
  all_files = sorted(all_files, key=lambda x: int(x.split('_')[1]))

  TSClustering(**configs).online_training(all_files)
else:
  scales = ["day", "week", "month", "hour",
          "early_morning", "morning", "midday", "afternoon", "evening",
          "workday", "weekend"]
  for scale in scales:
    configs['scale'] = scale
    TSClustering(**configs).offline_training()

# <a id='toc5_'></a>[Cluster Analysis](#toc0_)
Are there certain years or sensors that tend to cluster together more often?
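
One way to approach this question is a co-association matrix: for every pair of sensors, the fraction of clustering runs in which both received the same label. The sketch below assumes the labels from several runs (for example one per `scale`) have been collected into a DataFrame with one row per sensor and one column per run. The `co_association` helper, the sensor names and the labels are all hypothetical.

```python
import numpy as np
import pandas as pd

def co_association(labels: pd.DataFrame) -> pd.DataFrame:
    """Fraction of runs in which each pair of sensors shares a cluster.

    labels: rows are sensors, columns are runs, values are cluster labels.
    """
    arr = labels.to_numpy()
    # same[i, j, r] is True when sensors i and j share a cluster in run r
    same = arr[:, None, :] == arr[None, :, :]
    return pd.DataFrame(same.mean(axis=2), index=labels.index, columns=labels.index)

# Made-up labels from three runs at different scales.
labels = pd.DataFrame(
    {"day": [0, 0, 1, 1], "week": [1, 1, 0, 1], "month": [0, 0, 0, 1]},
    index=["Sensor A", "Sensor B", "Sensor C", "Sensor D"],
)

co = co_association(labels)
print(co.round(2))
```

Pairs with values close to 1 cluster together regardless of the configuration. The same matrix could also serve as input to the ensemble clustering listed in the TODO, although that step is not sketched here.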