
# PORT-CITY SIMULATION DATA - DATA PREPROCESSING

## Introduction

**Segments of Interest**

The segments selected are the locations where **car and truck flows interact most strongly**, and are therefore expected to be the most affected by policy changes. They are:

#### 1. Lungomare Canepa (Eastbound)
- **Description**: Gathers traffic from trucks and cars moving west towards the Genova Ovest highway booth, and the east part of the city.  
- **Additional Traffic**: Includes trucks coming from the Etiopia Gate.

#### 2. Via di Francia (Eastbound)
- **Description**: Often congested, especially during heavy ferry traffic.

#### 3. Elicoidale "Downstream"
- **Location**: Part of the Elicoidale roundabout near the Genova Ovest highway booth.  
- **Description**: Collects trucks from the highway and mobility traffic.

#### 4. Elicoidale "Upstream"
- **Location**: Another portion of the Elicoidale roundabout, handling trucks coming from the San Benigno Gate.  
- **Description**: Routes traffic towards the Genova Ovest highway booth and mobility traffic.

---

### Corresponding Data Indices

In the data provided, these points correspond to the following indices along the segment axis of the matrices (zero-based, as in Python; valid range 0–84):

| **Point**               | **Index** |
|-------------------------|-----------|
| Lungomare Canepa        | 43        |
| Via di Francia          | 70        |
| Elicoidale "Downstream" | 6         |
| Elicoidale "Upstream"   | 62        |
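
As a small illustration of how these indices are used, the sketch below slices a synthetic array laid out like the simulation matrices, i.e. (runs, segments, time chunks). The array contents, its reduced number of runs, and the name `segment_index` are assumptions made for the example; only the index values come from the table.

```python
import numpy as np

# Index of each segment of interest along the segment axis (values from the table).
segment_index = {
    "Lungomare Canepa": 43,
    "Via di Francia": 70,
    "Elicoidale Downstream": 6,
    "Elicoidale Upstream": 62,
}

# Synthetic stand-in for one variable: 5 runs, 85 segments, 12 time chunks.
rng = np.random.default_rng(0)
flow_cars = rng.random((5, 85, 12))

# Select one segment: the result has shape (runs, time_chunks).
canepa = flow_cars[:, segment_index["Lungomare Canepa"], :]

# Select all four segments at once: the result has shape (runs, 4, time_chunks).
selected = flow_cars[:, sorted(segment_index.values()), :]
print(canepa.shape, selected.shape)
```

Sorting the indices gives the order 6, 43, 62, 70, which is the order the notebook uses later in `interesting_k`.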

---

### Additional Notes

- **Initial Selection**: These points were chosen **a priori**, based on expected behavior.  
- **Further Adjustments**: Modifications might be needed as insights are derived from the analysis.  
- **Starting Point**: These locations provide a solid foundation for the analysis.

## Import Dependencies & Pickle Files





In [1]:
import pickle
import os
import pandas as pd
import numpy as np

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
path_directory = '/content/drive/MyDrive/TESE/data/raw/'

path_scenario_0 = path_directory + 'data_scenario0.pkl'
path_scenario_1 = path_directory + 'data_scenario1.pkl'
path_scenario_0_new = path_directory + 'data_scenario0_new.pkl'
path_scenario_1_new = path_directory + 'data_scenario1_new.pkl'

path_scenario_2_normal = path_directory + 'data_scenario2_unif.pkl'
path_scenario_2_alt = path_directory + 'data_scenario2_alt_unif.pkl'

path_scenario_2_weekend_normal = path_directory + 'data_scenario2_unif_weekend.pkl'
path_scenario_2_weekend_alt = path_directory + 'data_scenario2_alt_unif_weekend.pkl'

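def import_pickle(path):
  """Load and return a pickled object, or return None if the file is missing."""
  if os.path.exists(path):
    with open(path, 'rb') as f:
      data = pickle.load(f)
    # Report success only after the file has actually been loaded
    print("Pickle file imported from: ", path)
    return data
  else:
    print("File not found: ", path)
    return None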

data_scenario_0_week = import_pickle(path_scenario_2_normal)
data_scenario_1_week = import_pickle(path_scenario_2_alt)
data_scenario_0_weekend = import_pickle(path_scenario_2_weekend_normal)
data_scenario_1_weekend = import_pickle(path_scenario_2_weekend_alt)

data_scenario_0_gaussian = import_pickle(path_scenario_0_new)
data_scenario_1_gaussian = import_pickle(path_scenario_1_new)

Pickle file imported from:  /content/drive/MyDrive/TESE/data/raw/data_scenario2_unif.pkl
Pickle file imported from:  /content/drive/MyDrive/TESE/data/raw/data_scenario2_alt_unif.pkl
Pickle file imported from:  /content/drive/MyDrive/TESE/data/raw/data_scenario2_unif_weekend.pkl
Pickle file imported from:  /content/drive/MyDrive/TESE/data/raw/data_scenario2_alt_unif_weekend.pkl
Pickle file imported from:  /content/drive/MyDrive/TESE/data/raw/data_scenario0_new.pkl
Pickle file imported from:  /content/drive/MyDrive/TESE/data/raw/data_scenario1_new.pkl


### Explore raw data variable shapes

In [4]:
def show_shapes(data):
  """Print the shape and type of every variable in a scenario dictionary."""
  for key, value in data.items():
    print(f"Variable: {key}")
    print(value.shape)
    print(type(value))
    print("-" * 20)

print("SCENARIO 0 \n")
show_shapes(data_scenario_0_week)
print("\n\n")
print("SCENARIO 1 \n")
show_shapes(data_scenario_1_week)

SCENARIO 0 

Variable: speed_all
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: speed_cars
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: speed_trucks
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: num_cars
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: num_trucks
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: flow_cars
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: flow_trucks
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: inflows
(1200, 25)
<class 'numpy.ndarray'>
--------------------



SCENARIO 1 

Variable: speed_all
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: speed_cars
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: speed_trucks
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
Variable: num_cars
(1200, 85, 12)
<class 'numpy.ndarray'>
--------------------
[output truncated]

## Define variables

In [5]:
# Indices of the road segments of interest, in the same order as index_segments_k below
interesting_k = [6, 43, 62, 70]


# Define the variable names
variable_names = ["Speed Cars", "Speed Trucks", "Number of Cars", "Number of Trucks", "Flow Cars", "Flow Trucks"]

# Define the list of variables and their names
variables_s0_uniform0 = [data_scenario_0_week['speed_cars'],data_scenario_0_week['speed_trucks'],
                data_scenario_0_week['num_cars'], data_scenario_0_week['num_trucks'],
                data_scenario_0_week['flow_cars'], data_scenario_0_week['flow_trucks']]

variables_s1_uniform0  = [data_scenario_1_week['speed_cars'],data_scenario_1_week['speed_trucks'],
                data_scenario_1_week['num_cars'], data_scenario_1_week['num_trucks'],
                data_scenario_1_week['flow_cars'], data_scenario_1_week['flow_trucks']]

variables_s0_uniform1 = [data_scenario_0_weekend['speed_cars'],data_scenario_0_weekend['speed_trucks'],
                data_scenario_0_weekend['num_cars'], data_scenario_0_weekend['num_trucks'],
                data_scenario_0_weekend['flow_cars'], data_scenario_0_weekend['flow_trucks']]

variables_s1_uniform1 = [data_scenario_1_weekend['speed_cars'],data_scenario_1_weekend['speed_trucks'],
                data_scenario_1_weekend['num_cars'], data_scenario_1_weekend['num_trucks'],
                data_scenario_1_weekend['flow_cars'], data_scenario_1_weekend['flow_trucks']]


# Define the mapping from variable names to their indices
variable_name_to_index = {
    "Speed Cars": 0,
    "Speed Trucks": 1,
    "Number of Cars": 2,
    "Number of Trucks": 3,
    "Flow Cars": 4,
    "Flow Trucks": 5
}

aggregation_types = {
    "Speed Cars": "mean",
    "Speed Trucks": "mean",
    "Number of Cars": "mean",
    "Number of Trucks": "mean",
    "Flow Cars": "sum",
    "Flow Trucks": "sum"
}


index_segments_k = {
    "Elicoidale Downstream": 0,
    "Lungomare Canepa": 1,
    "Elicoidale Upstream": 2,
    "Via di Francia": 3
}

segment_variable_count = {
    "Elicoidale Downstream": 6,
    "Lungomare Canepa": 6,
    "Elicoidale Upstream": 6,
    "Via di Francia": 3
}

# W is the number of consecutive time chunks grouped into one window (used in Cases 2 and 3)
W = 2

## Reaggregate Variables into Multivariate Time Series format

### Functions

In [6]:
def get_key_by_value(dictionary, value):
    for key, val in dictionary.items():
        if val == value:
            return key
    return None

In [7]:
# Helper function: draw two disjoint random subsets of B indices each from range(N) (requires 2*B <= N)
def get_shuffled_indices(N, B):
    indices = np.random.choice(np.arange(N), size=2*B, replace=False)
    return indices[:B], indices[B:2*B]  # Return two non-overlapping subsets
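
A quick check of the helper above, redefined here so that the snippet runs on its own. The values of `N` and `B` and the seed are arbitrary choices for the illustration.

```python
import numpy as np

def get_shuffled_indices(N, B):
    # Draw 2*B distinct indices from range(N) and split them into two halves.
    indices = np.random.choice(np.arange(N), size=2 * B, replace=False)
    return indices[:B], indices[B:2 * B]

np.random.seed(0)  # seed chosen only to make the example repeatable
first, second = get_shuffled_indices(N=100, B=10)

# The two batches never share an index because sampling is without replacement.
print(len(first), len(second), set(first) & set(second))
```

Because sampling is without replacement, `np.random.choice` raises a `ValueError` when `2 * B` exceeds `N`, so callers need at least `2 * B` available indices.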

In [8]:
def reshape_raw_data(data, interesting_k, variable_list):
    """
    Stacks the selected segments and variables into one array of shape
    (num_interesting_k * num_variables, num_runs, num_time_chunks).

    Note: the values are read from `variable_list` only; `data` is not used.
    """
    num_runs, num_segments, num_time_chunks = variable_list[0].shape  # Assumes all variable shapes are equal
    num_interesting_k = len(interesting_k)
    num_variables = len(variable_list)

    reshaped_data = np.zeros((num_interesting_k * num_variables, num_runs, num_time_chunks))

    for i, k in enumerate(interesting_k):
        for j, variable in enumerate(variable_list):
            # Row i * num_variables + j holds variable j of segment k, for all runs and time chunks
            reshaped_data[i * num_variables + j, :, :] = variable[:, k, :]
    return reshaped_data


In [9]:
def print_first_dimension_organization(reaggregated_data):
    """
    Prints the organization of the first dimension of the reaggregated data.

    Args:
    - reaggregated_data (numpy.ndarray): Reaggregated data array.
    """
    segment_index = list(index_segments_k.values())
    num_variables = reaggregated_data.shape[0] // len(segment_index) if segment_index else 1
    num_interesting_k = len(segment_index) if segment_index else reaggregated_data.shape[0] // num_variables

    for i in range(num_interesting_k):
        start_index = i * num_variables
        end_index = (i + 1) * num_variables - 1
        segment = segment_index[i]
        variables = ", ".join(variable_names)
        print(f"{start_index} to {end_index}: Segment [{get_key_by_value(index_segments_k, segment)}] (Variables {variables})")


In [10]:
def remove_variables_for_segment(data, segment_key, variables_to_remove):
    """
    Removes specified variables for a given segment from the reaggregated data.

    Args:
    - data (numpy.ndarray): The reaggregated data array (shape: (num_interesting_k * num_variables, num_runs, num_time_chunks)).
    - segment_key (str): The key representing the segment in the `index_segments_k` dictionary.
    - variables_to_remove (list): The list of variable names to be removed for the given segment.

    Returns:
    - numpy.ndarray: The modified reaggregated data with the specified variables removed.
    """
    # Get the segment index
    segment_index = index_segments_k[segment_key]

    # Standard number of variables per segment (before removing)
    num_variables = len(variable_names)

    # Calculate the start and end indices for the segment
    start_index = segment_index * num_variables
    end_index = start_index + num_variables

    # Convert variable names to their indices relative to the segment
    variable_indices_to_remove = [variable_name_to_index[var] for var in variables_to_remove]

    # Generate the indices for the variables to keep within this segment
    segment_variables_to_keep = [
        start_index + i for i in range(num_variables)
        if i not in variable_indices_to_remove
    ]

    # Generate the global indices to keep for all other segments
    indices_to_keep = list(range(0, start_index)) + segment_variables_to_keep + list(range(end_index, data.shape[0]))

    # Filter the first dimension (retain only the indices_to_keep)
    modified_data = data[indices_to_keep, :, :]

    return modified_data


In [11]:
def print_segment_index_and_variables_in_data(data):
    """
    Prints the start and end indexes for each segment and the corresponding variables,
    directly based on the reaggregated data.

    Args:
    - data (numpy.ndarray): The reaggregated data array.
    """
    print(f"Data shape: {data.shape}")
    print("-" * 50)

    # Initialize the current index
    current_index = 0

    # Iterate over the segments
    for segment_name, segment_index in index_segments_k.items():
        # Get the number of variables for the current segment
        num_variables = segment_variable_count[segment_name]

        # Calculate start and end indices for the segment
        start_index = current_index
        end_index = start_index + num_variables - 1

        # Slice the data for this segment
        segment_data = data[start_index:end_index + 1]

        # Get the variable names for this segment
        #variable_names = list(variable_name_to_index.keys())[:num_variables] #

        # Print the segment information
        print(f"Segment: {segment_name}")
        print(f"Indexes: {start_index} to {end_index}")
        #print(f"Variables: {', '.join(variable_names)}")
        print(f"Data slice shape: {segment_data.shape}")
        print("-" * 50)

        # Update the current index
        current_index = end_index + 1
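
The running-index bookkeeping in the function above can also be expressed with a cumulative sum. The sketch below is an alternative formulation, not the notebook's own code; it copies `segment_variable_count` from the definitions cell so that it runs on its own, and `segment_rows` is a name introduced only for this example.

```python
import numpy as np

# Variables kept per segment after removing the truck variables for Via di Francia.
segment_variable_count = {
    "Elicoidale Downstream": 6,
    "Lungomare Canepa": 6,
    "Elicoidale Upstream": 6,
    "Via di Francia": 3,
}

counts = np.array(list(segment_variable_count.values()))
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))  # first row of each segment
stops = starts + counts                                  # one past the last row

# Half-open row ranges, usable directly as data[start:stop].
segment_rows = {
    name: (int(start), int(stop))
    for name, start, stop in zip(segment_variable_count, starts, stops)
}
print(segment_rows)
```

The ranges are half-open, so Via di Francia maps to rows 18 up to (but not including) 21, which corresponds to the inclusive "18 to 20" that the function prints.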


### Data pipeline

The goal is to hold the time-series data in a single `np.ndarray` of shape `(number_of_interesting_k * num_variables, n, t)`, where `n` is the number of simulation runs and `t` is the number of time chunks.
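
To make the row layout concrete, the sketch below builds the same arrangement on synthetic data in two ways: with the nested loop used by `reshape_raw_data`, and with a vectorized stack/transpose that should give an identical result. The array sizes, the two stand-in variables, and the names `looped` and `vectorized` are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_runs, num_segments, num_chunks = 4, 85, 12
interesting_k = [6, 43, 62, 70]

# Two synthetic variables standing in for e.g. speed_cars and flow_cars.
variable_list = [rng.random((num_runs, num_segments, num_chunks)) for _ in range(2)]
num_variables = len(variable_list)

# Loop version: row i * num_variables + j holds variable j of segment interesting_k[i].
looped = np.zeros((len(interesting_k) * num_variables, num_runs, num_chunks))
for i, k in enumerate(interesting_k):
    for j, variable in enumerate(variable_list):
        looped[i * num_variables + j] = variable[:, k, :]

# Vectorized version: stack to (variables, runs, segments, chunks), pick the segments,
# reorder to (segments, variables, runs, chunks), then merge the first two axes.
stacked = np.stack(variable_list)[:, :, interesting_k, :]
vectorized = stacked.transpose(2, 0, 1, 3).reshape(-1, num_runs, num_chunks)

print(looped.shape, np.array_equal(looped, vectorized))
```

Rows are grouped by segment first and by variable second, which is why each segment occupies a contiguous block of rows in the first dimension.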

In [12]:
# Scenario 0:
reshaped_s0 = reshape_raw_data(data_scenario_0_week, interesting_k, variables_s0_uniform0)
print(reshaped_s0.shape)

# Scenario 1:
reshaped_s1 = reshape_raw_data(data_scenario_1_week, interesting_k, variables_s1_uniform0)
print(reshaped_s1.shape)

(24, 1200, 12)
(24, 1200, 12)


In [13]:
# Scenario 0:
reshaped_s0_weekend = reshape_raw_data(data_scenario_0_weekend, interesting_k, variables_s0_uniform1)
print(reshaped_s0_weekend.shape)

# Scenario 1:
reshaped_s1_weekend = reshape_raw_data(data_scenario_1_weekend, interesting_k, variables_s1_uniform1)
print(reshaped_s1_weekend.shape)

(24, 1000, 12)
(24, 1000, 12)


In [14]:
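# Build the variable lists from the Gaussian data (assumes the same keys as the uniform pickles);
# reshape_raw_data reads only `variable_list`, so variables_s0_uniform0 would reuse the weekday data.
variable_keys = ['speed_cars', 'speed_trucks', 'num_cars', 'num_trucks', 'flow_cars', 'flow_trucks']

reshaped_s0_gaussian = reshape_raw_data(data_scenario_0_gaussian, interesting_k, [data_scenario_0_gaussian[key] for key in variable_keys])
print(reshaped_s0_gaussian.shape)

reshaped_s1_gaussian = reshape_raw_data(data_scenario_1_gaussian, interesting_k, [data_scenario_1_gaussian[key] for key in variable_keys])
print(reshaped_s1_gaussian.shape)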

(24, 1200, 12)
(24, 1200, 12)


In [15]:
print_first_dimension_organization(reshaped_s0)

0 to 5: Segment [Elicoidale Downstream] (Variables Speed Cars, Speed Trucks, Number of Cars, Number of Trucks, Flow Cars, Flow Trucks)
6 to 11: Segment [Lungomare Canepa] (Variables Speed Cars, Speed Trucks, Number of Cars, Number of Trucks, Flow Cars, Flow Trucks)
12 to 17: Segment [Elicoidale Upstream] (Variables Speed Cars, Speed Trucks, Number of Cars, Number of Trucks, Flow Cars, Flow Trucks)
18 to 23: Segment [Via di Francia] (Variables Speed Cars, Speed Trucks, Number of Cars, Number of Trucks, Flow Cars, Flow Trucks)


In [16]:
data_s0_uniform1 = remove_variables_for_segment(reshaped_s0, "Via di Francia", ["Speed Trucks", "Number of Trucks", "Flow Trucks"])
print_segment_index_and_variables_in_data(data_s0_uniform1)

Data shape: (21, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Downstream
Indexes: 0 to 5
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Lungomare Canepa
Indexes: 6 to 11
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Upstream
Indexes: 12 to 17
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Via di Francia
Indexes: 18 to 20
Data slice shape: (3, 1200, 12)
--------------------------------------------------


In [17]:
data_s0_uniform2 = remove_variables_for_segment(reshaped_s0_weekend, "Via di Francia", ["Speed Trucks", "Number of Trucks", "Flow Trucks"])
print_segment_index_and_variables_in_data(data_s0_uniform2)

Data shape: (21, 1000, 12)
--------------------------------------------------
Segment: Elicoidale Downstream
Indexes: 0 to 5
Data slice shape: (6, 1000, 12)
--------------------------------------------------
Segment: Lungomare Canepa
Indexes: 6 to 11
Data slice shape: (6, 1000, 12)
--------------------------------------------------
Segment: Elicoidale Upstream
Indexes: 12 to 17
Data slice shape: (6, 1000, 12)
--------------------------------------------------
Segment: Via di Francia
Indexes: 18 to 20
Data slice shape: (3, 1000, 12)
--------------------------------------------------


In [18]:
data_s1_uniform1 = remove_variables_for_segment(reshaped_s1, "Via di Francia", ["Speed Trucks", "Number of Trucks", "Flow Trucks"])
print_segment_index_and_variables_in_data(data_s1_uniform1)

Data shape: (21, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Downstream
Indexes: 0 to 5
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Lungomare Canepa
Indexes: 6 to 11
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Upstream
Indexes: 12 to 17
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Via di Francia
Indexes: 18 to 20
Data slice shape: (3, 1200, 12)
--------------------------------------------------


In [19]:
data_s1_uniform2 = remove_variables_for_segment(reshaped_s1_weekend, "Via di Francia", ["Speed Trucks", "Number of Trucks", "Flow Trucks"])
print_segment_index_and_variables_in_data(data_s1_uniform2)

Data shape: (21, 1000, 12)
--------------------------------------------------
Segment: Elicoidale Downstream
Indexes: 0 to 5
Data slice shape: (6, 1000, 12)
--------------------------------------------------
Segment: Lungomare Canepa
Indexes: 6 to 11
Data slice shape: (6, 1000, 12)
--------------------------------------------------
Segment: Elicoidale Upstream
Indexes: 12 to 17
Data slice shape: (6, 1000, 12)
--------------------------------------------------
Segment: Via di Francia
Indexes: 18 to 20
Data slice shape: (3, 1000, 12)
--------------------------------------------------


In [20]:
data_s0_gaussian = remove_variables_for_segment(reshaped_s0_gaussian, "Via di Francia", ["Speed Trucks", "Number of Trucks", "Flow Trucks"])
print_segment_index_and_variables_in_data(data_s0_gaussian)

Data shape: (21, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Downstream
Indexes: 0 to 5
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Lungomare Canepa
Indexes: 6 to 11
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Upstream
Indexes: 12 to 17
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Via di Francia
Indexes: 18 to 20
Data slice shape: (3, 1200, 12)
--------------------------------------------------


In [21]:
data_s1_gaussian = remove_variables_for_segment(reshaped_s1_gaussian, "Via di Francia", ["Speed Trucks", "Number of Trucks", "Flow Trucks"])
print_segment_index_and_variables_in_data(data_s1_gaussian)

Data shape: (21, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Downstream
Indexes: 0 to 5
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Lungomare Canepa
Indexes: 6 to 11
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Elicoidale Upstream
Indexes: 12 to 17
Data slice shape: (6, 1200, 12)
--------------------------------------------------
Segment: Via di Francia
Indexes: 18 to 20
Data slice shape: (3, 1200, 12)
--------------------------------------------------


## Prepare Data for Batching

In [22]:
def reshape_and_shuffle(arr):
    """
    Reshapes the input array from shape (v, n, t) to (v, n*t) and shuffles the columns.

    The same permutation is applied to every row, so each column remains one
    (run, time chunk) observation. The input array is left unmodified.

    Parameters:
    arr (numpy.ndarray): Input array with shape (v, n, t).

    Returns:
    numpy.ndarray: Reshaped and shuffled array with shape (v, n*t).
    """
    v, n, t = arr.shape  # Unpack the shape into variables
    reshaped_arr = arr.reshape(v, n * t)  # Reshape the array to (v, n*t); may be a view of arr

    # Index with a random permutation instead of shuffling in place: an in-place
    # shuffle of the reshaped view would also reorder `arr`, which Cases 2 and 3 reuse.
    permutation = np.random.permutation(n * t)
    return reshaped_arr[:, permutation]

### CASE 1 (W=1)

The window of consecutive time steps is W = 1, so the shape in this case should be (k_segments x num_var_per_segment, simulation_runs x timechunks).
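
A minimal sketch of the Case 1 flattening on a toy array. It reorders columns through a permutation index rather than an in-place shuffle, so the input array is left unchanged; the toy sizes, the seed, and the variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
v, n, t = 3, 4, 5
arr = np.arange(v * n * t, dtype=float).reshape(v, n, t)
original = arr.copy()

# Flatten runs and time chunks: column r * t + c holds run r, time chunk c.
flat = arr.reshape(v, n * t)

# Reorder columns with one shared permutation so every variable (row) is
# shuffled the same way and each column still describes a single observation.
perm = rng.permutation(n * t)
shuffled = flat[:, perm]

print(shuffled.shape)
```

Indexing with `perm` returns a new array, so `arr` can still be reused afterwards with its runs and time chunks in their original order.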

In [23]:
data_case1_s0_uniform1 = reshape_and_shuffle(data_s0_uniform1)
print(data_case1_s0_uniform1.shape)

data_case1_s1_uniform1 = reshape_and_shuffle(data_s1_uniform1)
print(data_case1_s1_uniform1.shape)

data_case1_s0_uniform2 = reshape_and_shuffle(data_s0_uniform2)
print(data_case1_s0_uniform2.shape)

data_case1_s1_uniform2 = reshape_and_shuffle(data_s1_uniform2)
print(data_case1_s1_uniform2.shape)

(21, 14400)
(21, 14400)
(21, 12000)
(21, 12000)


In [24]:
data_case1_s0_gaussian = reshape_and_shuffle(data_s0_gaussian)
print(data_case1_s0_gaussian.shape)

data_case1_s1_gaussian = reshape_and_shuffle(data_s1_gaussian)
print(data_case1_s1_gaussian.shape)

(21, 14400)
(21, 14400)


### CASE 2



Case 2: W > 1, Sampling Consecutive Measurements per Day

Aggregate `W` consecutive time chunks for each simulation run. The number of variables is unchanged, so the final array has shape (k_segments x num_var_per_segment, simulation_runs x timechunks/W).
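
The windowed aggregation can be sketched with a reshape instead of explicit loops. This is an illustration on synthetic data, not the notebook's implementation; the helper `window` and the arrays `speed` and `flow` are hypothetical names, chosen to show the two aggregation rules (mean for speeds and counts, sum for flows).

```python
import numpy as np

W = 2
rng = np.random.default_rng(0)
num_runs, num_chunks = 3, 12
speed = rng.random((num_runs, num_chunks))  # aggregated with the mean
flow = rng.random((num_runs, num_chunks))   # aggregated with the sum

def window(series, step):
    """Group consecutive time chunks: (runs, chunks) -> (runs, chunks // step, step)."""
    usable = (series.shape[1] // step) * step  # drop a trailing incomplete window
    return series[:, :usable].reshape(series.shape[0], -1, step)

speed_agg = window(speed, W).mean(axis=-1)
flow_agg = window(flow, W).sum(axis=-1)
print(speed_agg.shape, flow_agg.shape)
```

With 12 time chunks and `W = 2`, each run is reduced to 6 aggregated chunks, matching the halving of the last dimension described above.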

In [25]:
# Variables stored for each segment, in row order. Via di Francia keeps only the car
# variables (see the remove_variables_for_segment calls above), so its rows cannot be
# named by position in variable_name_to_index: "Flow Cars" would be read as
# "Number of Cars" and averaged instead of summed.
segment_variable_names = {
    "Elicoidale Downstream": variable_names,
    "Lungomare Canepa": variable_names,
    "Elicoidale Upstream": variable_names,
    "Via di Francia": ["Speed Cars", "Number of Cars", "Flow Cars"]
}

def aggregate_timechunks(data, timechunk_step):
    """
    Aggregates time chunks in a NumPy array based on specified steps and aggregation types,
    handling variable numbers per segment.

    Args:
        data (np.ndarray): The input data with shape (num_variables, num_runs, num_time_chunks).
        timechunk_step (int): The number of time chunks to aggregate into one.

    Returns:
        np.ndarray: The aggregated data with shape (num_variables, num_runs, num_time_chunks // timechunk_step).
    """

    num_runs, num_time_chunks = data.shape[1:]  # Extract num_runs and num_time_chunks
    new_num_time_chunks = num_time_chunks // timechunk_step
    used_time_chunks = new_num_time_chunks * timechunk_step  # Ignore a trailing incomplete window

    # Initialize an empty list to store aggregated data for each segment
    aggregated_segments = []

    # Keep track of the cumulative variable count
    cumulative_variable_count = 0
    for segment_name, num_variables in segment_variable_count.items():
        segment_data = data[cumulative_variable_count : cumulative_variable_count + num_variables]
        aggregated_segment = np.zeros((num_variables, num_runs, new_num_time_chunks))

        for var_idx, var_name in enumerate(segment_variable_names[segment_name]):
            # Group consecutive time chunks: (num_runs, new_num_time_chunks, timechunk_step)
            windows = segment_data[var_idx, :, :used_time_chunks].reshape(
                num_runs, new_num_time_chunks, timechunk_step)

            if aggregation_types[var_name].lower() == 'mean':
                aggregated_segment[var_idx] = windows.mean(axis=-1)
            elif aggregation_types[var_name].lower() == 'sum':
                aggregated_segment[var_idx] = windows.sum(axis=-1)
            else:
                raise ValueError(f"Invalid aggregation type: {aggregation_types[var_name]}")

        aggregated_segments.append(aggregated_segment)
        cumulative_variable_count += num_variables

    # Concatenate aggregated segments into a single array
    aggregated_data = np.concatenate(aggregated_segments, axis=0)
    return aggregated_data

In [26]:
aggregated_data_s0_uniform1 = aggregate_timechunks(data_s0_uniform1, W)
print(aggregated_data_s0_uniform1.shape)

aggregated_data_s1_uniform1 = aggregate_timechunks(data_s1_uniform1, W)
print(aggregated_data_s1_uniform1.shape)

aggregated_data_s0_uniform2 = aggregate_timechunks(data_s0_uniform2, W)
print(aggregated_data_s0_uniform2.shape)

aggregated_data_s1_uniform2 = aggregate_timechunks(data_s1_uniform2, W)
print(aggregated_data_s1_uniform2.shape)

(21, 1200, 6)
(21, 1200, 6)
(21, 1000, 6)
(21, 1000, 6)


In [27]:
aggregated_data_s0_gaussian = aggregate_timechunks(data_s0_gaussian, W)
print(aggregated_data_s0_gaussian.shape)

aggregated_data_s1_gaussian = aggregate_timechunks(data_s1_gaussian, W)
print(aggregated_data_s1_gaussian.shape)

(21, 1200, 6)
(21, 1200, 6)


In [28]:
data_case2_s0_uniform1 = reshape_and_shuffle(aggregated_data_s0_uniform1)
print(data_case2_s0_uniform1.shape)

data_case2_s1_uniform1 = reshape_and_shuffle(aggregated_data_s1_uniform1)
print(data_case2_s1_uniform1.shape)

data_case2_s0_uniform2 = reshape_and_shuffle(aggregated_data_s0_uniform2)
print(data_case2_s0_uniform2.shape)

data_case2_s1_uniform2 = reshape_and_shuffle(aggregated_data_s1_uniform2)
print(data_case2_s1_uniform2.shape)

(21, 7200)
(21, 7200)
(21, 6000)
(21, 6000)


In [29]:
data_case2_s0_gaussian = reshape_and_shuffle(aggregated_data_s0_gaussian)
print(data_case2_s0_gaussian.shape)

data_case2_s1_gaussian = reshape_and_shuffle(aggregated_data_s1_gaussian)
print(data_case2_s1_gaussian.shape)

(21, 7200)
(21, 7200)


### CASE 3

Case 3: W > 1, Stacking Measurements as New Dimensions. Stack `W` consecutive time steps into a single vector for each day. Because the measurements are stacked vertically, the number of variables is multiplied by `W` and the number of time chunks is divided by `W`.
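
The stacking can be sketched with a reshape and a transpose. The example below is an illustration on a toy array under assumed sizes, not the notebook's implementation; it produces the layout in which row `v * W + w` holds, for variable `v`, the `w`-th time chunk of each window.

```python
import numpy as np

W = 2
num_variables, num_runs, num_chunks = 3, 4, 12
data = np.arange(num_variables * num_runs * num_chunks, dtype=float).reshape(
    num_variables, num_runs, num_chunks
)
new_chunks = num_chunks // W

# Split the time axis into (window, position-in-window), then move the
# position axis next to the variable axis and merge the two.
stacked = (
    data[:, :, : new_chunks * W]
    .reshape(num_variables, num_runs, new_chunks, W)
    .transpose(0, 3, 1, 2)
    .reshape(num_variables * W, num_runs, new_chunks)
)

# Row v * W + w holds, for variable v, the w-th time chunk of each window.
print(stacked.shape)
```

The first dimension grows from 3 to 6 while the last shrinks from 12 to 6, the same trade-off as the 21 to 42 variables and 12 to 6 time chunks seen in this notebook.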

In [30]:
def stack_timechunks(data, timechunk_step):
    """
    Stacks time chunks in a NumPy array based on specified steps.

    Args:
        data (np.ndarray): The input data with shape (num_variables, num_runs, num_time_chunks).
        timechunk_step (int): The number of time chunks to stack.

    Returns:
        np.ndarray: The stacked data.
    """
    num_variables, num_runs, num_time_chunks = data.shape
    new_num_variables = num_variables * timechunk_step
    new_num_time_chunks = num_time_chunks // timechunk_step

    stacked_data = np.zeros((new_num_variables, num_runs, new_num_time_chunks))

    for v in range(num_variables):
        for n in range(num_runs):
            for t in range(new_num_time_chunks):
                start_idx = t * timechunk_step
                end_idx = (t + 1) * timechunk_step
                stacked_data[v * timechunk_step:(v + 1) * timechunk_step, n, t] = data[v, n, start_idx:end_idx]
    return stacked_data

In [31]:
stacked_data_s0_uniform1 = stack_timechunks(data_s0_uniform1, W)
print(stacked_data_s0_uniform1.shape)

stacked_data_s1_uniform1 = stack_timechunks(data_s1_uniform1, W)
print(stacked_data_s1_uniform1.shape)

stacked_data_s0_uniform2 = stack_timechunks(data_s0_uniform2, W)
print(stacked_data_s0_uniform2.shape)

stacked_data_s1_uniform2 = stack_timechunks(data_s1_uniform2, W)
print(stacked_data_s1_uniform2.shape)

(42, 1200, 6)
(42, 1200, 6)
(42, 1000, 6)
(42, 1000, 6)


In [32]:
stacked_data_s0_gaussian = stack_timechunks(data_s0_gaussian, W)
print(stacked_data_s0_gaussian.shape)

stacked_data_s1_gaussian = stack_timechunks(data_s1_gaussian, W)
print(stacked_data_s1_gaussian.shape)

(42, 1200, 6)
(42, 1200, 6)


In [33]:
data_case3_s0_uniform1 = reshape_and_shuffle(stacked_data_s0_uniform1)
print(data_case3_s0_uniform1.shape)

data_case3_s1_uniform1 = reshape_and_shuffle(stacked_data_s1_uniform1)
print(data_case3_s1_uniform1.shape)

data_case3_s0_uniform2 = reshape_and_shuffle(stacked_data_s0_uniform2)
print(data_case3_s0_uniform2.shape)

data_case3_s1_uniform2 = reshape_and_shuffle(stacked_data_s1_uniform2)
print(data_case3_s1_uniform2.shape)

(42, 7200)
(42, 7200)
(42, 6000)
(42, 6000)


In [34]:
data_case3_s0_gaussian = reshape_and_shuffle(stacked_data_s0_gaussian)
print(data_case3_s0_gaussian.shape)

data_case3_s1_gaussian = reshape_and_shuffle(stacked_data_s1_gaussian)
print(data_case3_s1_gaussian.shape)

(42, 7200)
(42, 7200)


## Save Datasets

In [36]:
def save_dataset(dataset, filename, directory="/content/drive/MyDrive/TESE/data/preprocessed"):
  """Saves a dataset to a specified directory on Google Drive.

  Args:
      dataset: The dataset to save (e.g., a NumPy array).
      filename: The name of the file to save the dataset to.
      directory: The directory to save the dataset to. Defaults to /content/drive/MyDrive/TESE/data/preprocessed.
  """
  os.makedirs(directory, exist_ok=True)  # Create the directory if it doesn't exist
  filepath = os.path.join(directory, filename)
  with open(filepath, 'wb') as f:
    pickle.dump(dataset, f)

save_dataset(data_case1_s0_uniform1, "data_case1_s0_uniform1.pkl")
save_dataset(data_case1_s1_uniform1, "data_case1_s1_uniform1.pkl")
save_dataset(data_case1_s0_uniform2, "data_case1_s0_uniform2.pkl")
save_dataset(data_case1_s1_uniform2, "data_case1_s1_uniform2.pkl")

save_dataset(data_case2_s0_uniform1, "data_case2_s0_uniform1.pkl")
save_dataset(data_case2_s1_uniform1, "data_case2_s1_uniform1.pkl")
save_dataset(data_case2_s0_uniform2, "data_case2_s0_uniform2.pkl")
save_dataset(data_case2_s1_uniform2, "data_case2_s1_uniform2.pkl")

save_dataset(data_case3_s0_uniform1, "data_case3_s0_uniform1.pkl")
save_dataset(data_case3_s1_uniform1, "data_case3_s1_uniform1.pkl")
save_dataset(data_case3_s0_uniform2, "data_case3_s0_uniform2.pkl")
save_dataset(data_case3_s1_uniform2, "data_case3_s1_uniform2.pkl")

save_dataset(data_case1_s0_gaussian, "data_case1_s0_gaussian.pkl")
save_dataset(data_case1_s1_gaussian, "data_case1_s1_gaussian.pkl")
save_dataset(data_case2_s0_gaussian, "data_case2_s0_gaussian.pkl")
save_dataset(data_case2_s1_gaussian, "data_case2_s1_gaussian.pkl")
save_dataset(data_case3_s0_gaussian, "data_case3_s0_gaussian.pkl")
save_dataset(data_case3_s1_gaussian, "data_case3_s1_gaussian.pkl")
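
A saved dataset can be read back with `pickle.load` to confirm that it survives the round trip. The sketch below uses a temporary directory and a small synthetic array rather than the Google Drive paths above; the file name `example_dataset.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

# Temporary directory used only for this example; the notebook writes to Google Drive.
with tempfile.TemporaryDirectory() as directory:
    dataset = np.arange(12, dtype=float).reshape(3, 4)
    filepath = os.path.join(directory, "example_dataset.pkl")

    # Write the array in the same way as save_dataset.
    with open(filepath, "wb") as f:
        pickle.dump(dataset, f)

    # Read it back and compare with the original.
    with open(filepath, "rb") as f:
        restored = pickle.load(f)

print(restored.shape, np.array_equal(dataset, restored))
```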