# Simulating N samples simultaneously

Since we are only interested in ground level, we only need to store ground-level data, which saves memory.

This means that our simulation can be reduced to a 2D computation.

In [1]:
def run_simulation_2D(x, y, nx, ny, Lx, Ly, cx, cy, sx, sy):
    dx, dy = Lx/(nx-1), Ly/(ny-1)
    dt = 1
    tend = 1200
    t = 0

    cfl_x, cfl_y = cx * dt/dx, cy * dt/dy
    diff_x, diff_y = sx * dt/dx**2, sy * dt/dy**2

    u = np.zeros((nx+2, ny+2))
    sol = []
    source_x, source_y = nx // 2, ny // 2
    Q = 1e-6
    
    while t < tend:
        unew = u.copy()
        sol.append(u[1:-1, 1:-1])

        # Advection (Upwind Scheme)
        unew[1:-1, 1:-1] -= cfl_x * (u[1:-1, 1:-1] - u[1:-1, :-2])
        unew[1:-1, 1:-1] -= cfl_y * (u[1:-1, 1:-1] - u[:-2, 1:-1])
    
        # Diffusion (Central Differencing)
        unew[1:-1, 1:-1] += diff_x * (u[1:-1, 2:] - 2*u[1:-1, 1:-1] + u[1:-1, :-2])
        unew[1:-1, 1:-1] += diff_y * (u[2:, 1:-1] - 2*u[1:-1, 1:-1] + u[:-2, 1:-1])

        # Source Term
        unew[source_x, source_y] += Q * dt

        # Additional Source Points (forming a small area)
        offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1), (-1, 0), (1, 0), (0, -1), (0, 1)]
        for off_x, off_y in offsets:  # separate names so the grid spacings dx, dy are not shadowed
            unew[source_x + off_x, source_y + off_y] += Q * dt

        u = unew
        t += dt
        
    return np.array(sol)
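
Because the update is explicit, it can help to check that the chosen `dt` keeps the scheme stable before launching many runs. The following is a minimal sketch and not part of the original notebook: the helper name `is_stable_2D` is hypothetical, and it uses the sufficient condition that the weight of the centre cell in the update, `1 - cfl_x - cfl_y - 2*diff_x - 2*diff_y`, stays non-negative (assuming non-negative velocities, as the upwind stencil above does).

```python
def is_stable_2D(nx, ny, Lx, Ly, cx, cy, sx, sy, dt=1.0):
    """Hypothetical helper: sufficient stability check for the explicit
    upwind-advection / central-diffusion update (assumes cx, cy >= 0)."""
    dx, dy = Lx / (nx - 1), Ly / (ny - 1)
    cfl_x, cfl_y = cx * dt / dx, cy * dt / dy
    diff_x, diff_y = sx * dt / dx**2, sy * dt / dy**2
    # Weight of u[i, j] in the update; if it goes negative the scheme can oscillate
    centre_weight = 1.0 - cfl_x - cfl_y - 2.0 * diff_x - 2.0 * diff_y
    return centre_weight >= 0.0

# Largest velocities and diffusivities sampled in this notebook
print(is_stable_2D(51, 51, 5000, 5000, cx=10, cy=10, sx=1, sy=1))
```

With the 51x51 grid on a 5000 m domain and `dt = 1`, the sampled parameter ranges sit well inside this bound.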

In [None]:
from joblib import Parallel, delayed
import numpy as np
import time

nx, ny = 51, 51  # Grid points
Lx, Ly = 5000, 5000  # Domain size in meters
x = np.linspace(-2500, 2500, nx)  # Centered at (0,0)
y = np.linspace(-2500, 2500, ny)
n = 1000
cx, cy = np.random.RandomState().uniform(0, 10, n), np.random.RandomState().uniform(0, 10, n)
sx, sy = np.random.RandomState().uniform(0, 1, n), np.random.RandomState().uniform(0, 1, n)
num_cores = -1

start_time = time.time()
results = Parallel(n_jobs=num_cores)(
    delayed(run_simulation_2D)(x, y, nx, ny, Lx, Ly, cx[i], cy[i], sx[i], sy[i])
    for i in range(n)
)
end_time = time.time()

print(f"Simulation took: {end_time-start_time}")

In [None]:
observed = np.load("test.npy")
observed.shape, results[0].shape

We need to fix the shapes so that they correspond to each other.

Currently each simulated result is a 3-D array with time on the first axis: one concentration grid in x-y per time step. The observed data is also a 3-D array, but with shape (51, 51, 1200). This should mean that each (x, y) grid point holds its values across the 1200 time steps.

Using `np.reshape` works for matching the dimensions. However, it only reinterprets the same flattened data in a new shape and does not move the time axis, so the values end up at different (x, y, t) positions and the result may not carry the same meaning. Specifically, each slice of the observed data is a separate calculation per time step (an instantaneous snapshot at t_i), whereas the simulation tracks the concentration as it evolves over time. Time is the first axis for the simulated data, whereas it is the last axis for the observed data.

In [4]:
results[0].reshape((51, 51, 1200)).shape, observed.shape

((51, 51, 1200), (51, 51, 1200))

An alternative here is to use `np.transpose()`. This gives us more clarity into how we transform the simulated data.

The simulated solution has the structure (axis 0: Time, axis 1: Nx, axis 2: Ny), and the analytical solution has (axis 0: Nx, axis 1: Ny, axis 2: Time).

By using `np.transpose(results[0], (1, 2, 0))`, the axes are rearranged to be:

- Simulated solution (axis 0: Nx, axis 1: Ny, axis 2: Time)

Which is what we wanted.

In [5]:
np.transpose(results[0], (1, 2, 0)).shape

(51, 51, 1200)
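
To make the difference between the two approaches concrete, here is a small sketch on a toy array (the array `toy` and its 2x2x3 size are made up for illustration). Both calls produce the same shape, but only the transpose keeps each grid point's time series intact.

```python
import numpy as np

# Toy "simulation": 2 time steps on a 2x3 grid, time on axis 0
toy = np.arange(12).reshape(2, 2, 3)

reshaped = toy.reshape(2, 3, 2)            # same shape as the target, axes not moved
transposed = np.transpose(toy, (1, 2, 0))  # time moved to the last axis

# Time series at grid point (0, 1): values toy[0, 0, 1] and toy[1, 0, 1]
print(toy[:, 0, 1])       # [1 7]
print(transposed[0, 1])   # [1 7]
print(reshaped[0, 1])     # [2 3]
```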

The updated function now looks like this (the transpose is left commented out, so it still returns time as the first axis):

In [85]:
def run_simulation_2D(x, y, nx, ny, Lx, Ly, cx, cy, sx, sy):
    dx, dy = Lx/(nx-1), Ly/(ny-1)
    dt = 1
    tend = 1200
    t = 0

    cfl_x, cfl_y = cx * dt/dx, cy * dt/dy
    diff_x, diff_y = sx * dt/dx**2, sy * dt/dy**2

    u = np.zeros((nx+2, ny+2))
    sol = []
    source_x, source_y = nx // 2, ny // 2
    Q = 1e-6
    
    while t < tend:
        unew = u.copy()
        sol.append(u[1:-1, 1:-1])

        # Advection (Upwind Scheme)
        unew[1:-1, 1:-1] -= cfl_x * (u[1:-1, 1:-1] - u[1:-1, :-2])
        unew[1:-1, 1:-1] -= cfl_y * (u[1:-1, 1:-1] - u[:-2, 1:-1])
    
        # Diffusion (Central Differencing)
        unew[1:-1, 1:-1] += diff_x * (u[1:-1, 2:] - 2*u[1:-1, 1:-1] + u[1:-1, :-2])
        unew[1:-1, 1:-1] += diff_y * (u[2:, 1:-1] - 2*u[1:-1, 1:-1] + u[:-2, 1:-1])

        # Source Term
        unew[source_x, source_y] += Q * dt

        # Additional Source Points (forming a small area)
        offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1), (-1, 0), (1, 0), (0, -1), (0, 1)]
        for off_x, off_y in offsets:  # separate names so the grid spacings dx, dy are not shadowed
            unew[source_x + off_x, source_y + off_y] += Q * dt

        u = unew
        t += dt

    # sol = np.transpose(sol, (1, 2, 0))
    return np.array(sol)

In [86]:
from joblib import Parallel, delayed
import numpy as np
import time

nx, ny = 51, 51  # Grid points
Lx, Ly = 5000, 5000  # Domain size in meters
x = np.linspace(-2500, 2500, nx)  # Centered at (0,0)
y = np.linspace(-2500, 2500, ny)
n = 50
cx, cy = np.random.RandomState().uniform(0, 10, n), np.random.RandomState().uniform(0, 10, n)
sx, sy = np.random.RandomState().uniform(0, 1, n), np.random.RandomState().uniform(0, 1, n)
num_cores = -1

start_time = time.time()
results = Parallel(n_jobs=num_cores)(
    delayed(run_simulation_2D)(x, y, nx, ny, Lx, Ly, cx[i], cy[i], sx[i], sy[i])
    for i in range(n)
)
end_time = time.time()

print(f"Simulation took: {end_time-start_time}")

Simulation took: 1.8181843757629395


In [87]:
results[0].shape

(1200, 51, 51)

## Applying Distance Metrics

We try to implement the metrics in the same way as for the 1D problem, and adjust where issues arise.

Because the full set of results would be 4D (n, Nx, Ny, time), it would be infeasible to compute everything at once.

However, parallelisation can still be utilised.
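
As a rough illustration of the size involved, the sketch below estimates the memory needed to hold all results in one array. It assumes `float64` values and n = 1000 samples, as in the first run above; the variable names are only for this example.

```python
n, nx, ny, nt = 1000, 51, 51, 1200
bytes_per_value = 8  # float64

bytes_per_sample = nx * ny * nt * bytes_per_value
total_bytes = n * bytes_per_sample

print(f"per sample: {bytes_per_sample / 1e6:.1f} MB")
print(f"all {n} samples: {total_bytes / 1e9:.1f} GB")
```

At roughly 25 MB per sample, processing one sample at a time (or in small batches) keeps the working set manageable.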

In [4]:
import numpy as np 
from scipy.spatial.distance import pdist, squareform, cdist

In [88]:
observed = np.load("test.npy")

In [89]:
observed = np.transpose(observed, (2, 1, 0))

The distances should be modified so that they compute the distance **for each spatial location** across time first (outputting a 51x51 matrix), with each (i, j) entry representing the distance at that point, and then output an average (?).

The original distance metrics were designed to compute the distance between corresponding columns.
- This is because each column represented one solution.

### Wasserstein

#### Original

In [6]:
def wasserstein_distance(simulated_sample: np.ndarray, observed_sample: np.ndarray) -> np.ndarray:
    # Mean absolute difference between the sorted simulated and observed values, per column
    simulated_sorted = np.sort(simulated_sample, axis=0)
    observed_sorted = np.sort(observed_sample, axis=0)
    distance = np.mean(np.abs(simulated_sorted - observed_sorted), axis=0)

    return distance

#### Modified

In [7]:
def wasserstein_distance_3D(simulated_sample: np.ndarray, observed_sample: np.ndarray) -> np.ndarray:
    """
    Compute the Wasserstein distance between two (51, 51, 1200) shaped arrays 
    along the time dimension.

    Returns a (51, 51) array of distances for each spatial location.
    """
    # Sort along the time axis (axis=2)
    simulated_sorted = np.sort(simulated_sample, axis=2)
    observed_sorted = np.sort(observed_sample, axis=2)

    # Compute the mean absolute difference along the time axis
    distance = np.mean(np.abs(simulated_sorted - observed_sorted), axis=2)

    return distance 

In [8]:
wass = Parallel(n_jobs=num_cores)(
    delayed(wasserstein_distance_3D)(results[i], observed)
    for i in range(n)
)
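
`wasserstein_distance_3D` sorts along axis 2, so it assumes that time is the last axis of both inputs. Since the arrays in this notebook are sometimes stored time-first, a variant with an explicit time axis may be safer. The sketch below is one possible way to write this, not the original method; `wasserstein_over_time` and its `time_axis` parameter are hypothetical names, and the final `.mean()` is just one candidate for the averaging step discussed above.

```python
import numpy as np

def wasserstein_over_time(simulated, observed, time_axis=0):
    """Hypothetical variant: 1D Wasserstein distance along `time_axis`
    for every spatial location; returns one distance per grid point."""
    simulated_sorted = np.sort(simulated, axis=time_axis)
    observed_sorted = np.sort(observed, axis=time_axis)
    return np.mean(np.abs(simulated_sorted - observed_sorted), axis=time_axis)

# Toy data, time-first: 5 time steps on a 3x4 grid
rng = np.random.default_rng(0)
sim = rng.random((5, 3, 4))
obs = sim + 0.5  # every curve shifted by a constant

dist_map = wasserstein_over_time(sim, obs, time_axis=0)
print(dist_map.shape)    # (3, 4)
print(dist_map.mean())   # single summary value, approximately 0.5
```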

### Energy

I expect that energy, MMD and KLD will take a lot longer because these metrics are $O(n^2)$: they need pairwise distances, which are expensive to compute between 2D matrices.

#### Original

In [22]:
def energy_dist(simulated_sample: np.ndarray, observed_sample: np.ndarray) -> np.ndarray:
    ncol = simulated_sample.shape[1]

    mean_dist_XY = np.empty(ncol)  # Mean distances between columns
    mean_dist_XX = np.empty(ncol)  # Mean distances within array1
    mean_dist_YY = np.empty(ncol)  # Mean distances within array2

    for i in range(ncol):
        mean_dist_XX[i] = np.mean(squareform(pdist(simulated_sample[:, i, np.newaxis], metric='euclidean')))
        mean_dist_YY[i] = np.mean(squareform(pdist(observed_sample[:, i, np.newaxis], metric='euclidean')))
        mean_dist_XY[i] = np.mean(cdist(simulated_sample[:, i, np.newaxis], observed_sample[:, i, np.newaxis], metric='euclidean'))

    # Calculate the energy distances for each column in a vectorized way
    energy_distances = 2 * mean_dist_XY - mean_dist_XX - mean_dist_YY

    return energy_distances

#### Modified

To apply energy distance to our problem, we have to calculate the energy distance at each [i, j] component of the matrix.

Each [i, j] component has t=1200 values, i.e. each grid point is a curve of its own. So in total we have 51x51=2601 curves.

So we need to calculate the energy distance 2601 times, and then return an average of the energy distances?
- There may be concerns because of the domain size: maybe we only care about the middle of the domain, or the maximum energy distance, etc.

Given a [2601, 1200]-shaped matrix, each row represents a curve.

If we transpose it, we then have something similar to the previous problem, where each column represents one curve, and each row is the concentration at time t. I.e. at index 0, it is the change in concentration over time at ...
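
The layout described here can be sketched on a toy array. This assumes a time-first array such as the (1200, 51, 51) output of `run_simulation_2D`; the toy sizes and the name `curves_as_columns` are made up for illustration.

```python
import numpy as np

# Toy stand-in for one time-first result: 4 time steps on a 2x3 grid
nt, nx, ny = 4, 2, 3
toy = np.arange(nt * nx * ny, dtype=float).reshape(nt, nx, ny)

# One row per time step, one column per grid point (curve)
curves_as_columns = toy.reshape(nt, nx * ny)
print(curves_as_columns.shape)   # (4, 6)

# Column k corresponds to grid point (k // ny, k % ny)
k = 4
i, j = divmod(k, ny)
print(np.array_equal(curves_as_columns[:, k], toy[:, i, j]))   # True
```

With this layout the column-wise metrics from the 1D problem could be applied directly, one column per grid point.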

In [83]:
results[0].shape

(51, 51, 1200)

We calculate the energy distance at each time point.

At each time point, we have a [51 x 51] matrix. To find the pairwise distance of that, we can flatten it out and do the calculation. This can be done for both the simulated and observed. We can save computation time by calculating the pairwise distance for the observed sample first, instead of calculating it at every iteration.

In [7]:
# Average pairwise distance at each time 
obs_reshaped = observed.reshape((1200, 51*51))
obs_pairwise = np.zeros(1200)

# At each time point, we have an array of size (2601, ), and we find the pairwise distance of that
for i in range(1200):
    obs_pairwise[i] = np.mean(squareform(pdist(obs_reshaped[i].reshape(-1, 1))))    

We have the general workflow, but we want a way to compute all of it at once to optimise.

In [9]:
import matplotlib.pyplot as plt

In [13]:
sim_reshaped = results[0].reshape((1200, 51*51))
sim_pairwise_3 = np.zeros(1200)

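# Same thing for the simulated sample: average pairwise distance at each time.
# Note: np.mean(pdist(...)) averages over the distinct pairs only, whereas the observed
# cell averages the full squareform matrix (zero diagonal included), so the two
# normalisations differ by a factor of (n - 1) / n.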

start = time.time()
for i in range(1200):
    sim_pairwise_3[i] = np.mean(pdist(sim_reshaped[i].reshape(-1, 1)))
end = time.time()
end - start

16.936415672302246

In [14]:
start = time.time()
sim_pairwise = np.apply_along_axis(lambda x: np.mean(pdist(x.reshape(-1, 1))), axis=1, arr=sim_reshaped)
end = time.time()
end - start

16.762940406799316

In [16]:
sim_pairwise == sim_pairwise_3

array([ True,  True,  True, ...,  True,  True,  True])

These two approaches give exactly the same result, and take about the same time.

In [41]:
sim_obs_pairwise = np.zeros(1200)
for i in range(1200):
    # Cross term: compare the raw samples at each time, not the already-averaged scalars
    sim_obs_pairwise[i] = np.mean(cdist(sim_reshaped[i].reshape(-1, 1), obs_reshaped[i].reshape(-1, 1)))

In [39]:
sim_reshaped = results[0].reshape((1200, 51*51))
sim_pairwise_3 = np.zeros(1200)
sim_obs_pairwise = np.zeros(1200)

# Same thing for average pairwise distance at each time for the simulated 

start = time.time()
for i in range(1200):
    sim_pairwise_3[i] = np.mean(pdist(sim_reshaped[i].reshape(-1, 1)))
    sim_obs_pairwise[i] = np.mean(cdist(sim_reshaped[i].reshape(-1, 1), obs_reshaped[i].reshape(-1, 1)))
    
end = time.time()
end - start

51.4509060382843

In [42]:
energy_distance = 2 * sim_obs_pairwise - sim_pairwise - obs_pairwise

In [43]:
energy_distance

array([8.32395814e-08, 8.16471759e-08, 8.05084494e-08, ...,
       1.94131207e-08, 1.94131207e-08, 1.94131207e-08])

Using `energy_distance` from scipy.stats is much faster, but looping over it is still infeasible once we scale to a large number of samples n. We need a way to at least batch compute it.

In [46]:
import numpy as np
from scipy.stats import energy_distance

start = time.time()
energy_distances = np.array([energy_distance(sim_reshaped[i], obs_reshaped[i]) for i in range(1200)])
end = time.time()
end - start

0.5429883003234863
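
One detail to keep in mind when comparing numbers: `scipy.stats.energy_distance` returns the square root of the pairwise expression `2 * XY - XX - YY`, so its values are not on the same scale as the manual calculation. The sketch below checks this on made-up random samples (the sample sizes and distributions are arbitrary), with the means taken over the full distance matrices.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, cdist
from scipy.stats import energy_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(0.5, 1.0, size=300)

# Pairwise form, with means taken over the full distance matrices (diagonal included)
xy = np.mean(cdist(x.reshape(-1, 1), y.reshape(-1, 1)))
xx = np.mean(squareform(pdist(x.reshape(-1, 1))))
yy = np.mean(squareform(pdist(y.reshape(-1, 1))))
manual = 2 * xy - xx - yy

# SciPy returns the square root of that quantity
print(np.isclose(energy_distance(x, y) ** 2, manual))
```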

In [69]:
energy_dist_arr = []

for j in range(n):
    sim_j = results[j].reshape((1200, 51*51))  # flatten the j-th simulation per time step
    energy_dist_arr.append(np.array([energy_distance(sim_j[i], obs_reshaped[i]) for i in range(1200)]))

In [76]:
energy_dist_arr[47].shape

(1200,)

#### Optimising the distance

Key points to optimise:
- **Avoid the double loop**: Because the output is the average (?), it will work as long as the indices of the two input arrays match.
- **Pairwise distances**: Is it possible to avoid recalculating the pairwise distance between the observed sample and itself (YY) every time?
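
One possible direction for the pairwise-distance point, offered as a sketch rather than a settled solution: for 1D values, the mean absolute pairwise distance can be obtained from the sorted values without building the full distance matrix, and it can be evaluated for all time steps at once. The helper name `mean_abs_pairwise` is hypothetical. The same function could be applied once to the observed array and the result reused for every simulated sample.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mean_abs_pairwise(samples):
    """Hypothetical helper: mean |x_i - x_j| over all (i, j) pairs for each row
    of `samples` (shape (T, M)), using sorting instead of an M x M matrix."""
    m = samples.shape[1]
    sorted_rows = np.sort(samples, axis=1)
    # For sorted values, sum_{i<j} (x_j - x_i) = sum_k (2k - m + 1) * x_k
    weights = 2 * np.arange(m) - m + 1
    pair_sum = sorted_rows @ weights
    return 2.0 * pair_sum / m**2  # full-matrix mean, zero diagonal included

rng = np.random.default_rng(0)
toy = rng.random((6, 50))  # 6 time steps, 50 grid points

fast = mean_abs_pairwise(toy)
slow = np.array([np.mean(squareform(pdist(row.reshape(-1, 1)))) for row in toy])
print(np.allclose(fast, slow))
```

This covers only the XX and YY terms; the cross term XY would still need its own treatment.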