_preprocess stage_

The Encoder fuses heterogeneous data (satellite, models, sensors) to produce a coherent
initial atmospheric state on a uniform grid.

NOAA APT Satellite Images: These are low-Earth-orbit weather satellite images (e.g. from
NOAA-15/18) with ~4 km/pixel resolution, typically two channels (infrared and visible), broadcast
via Automatic Picture Transmission.

Open Oceanographic/Meteorological API Data: This includes buoy measurements (e.g. wave
height, sea temperature), weather radar snapshots (e.g. precipitation radar images), and ship
observations.


---------------------------------------------------------------------------------------------------------
All data must be mapped to a common geospatial grid. Define a latitude-longitude grid covering
the Western Mediterranean at the desired resolution (about 1 km × 1 km, i.e. a spacing of
roughly 0.01°).

Interpolate point data (buoys, ship observations) onto this grid. A simple approach is
nearest-neighbor or inverse-distance weighting for each variable. Satellite and radar data, which
may come as images or on their own grids, should be reprojected or resampled onto the common grid.
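
The inverse-distance weighting mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the function name `idw_to_grid`, the `power` and `eps` parameters, and the use of plain degree differences as distances (instead of great-circle distances) are all choices made here for brevity.

```python
import numpy as np

def idw_to_grid(obs_lat, obs_lon, obs_val, grid_lat, grid_lon, power=2.0, eps=1e-12):
    """Interpolate scattered observations onto a regular lat/lon grid with IDW.

    Distances are measured in degrees (a flat-Earth simplification, assumed here).
    """
    obs_lat = np.asarray(obs_lat, dtype=float)
    obs_lon = np.asarray(obs_lon, dtype=float)
    obs_val = np.asarray(obs_val, dtype=float)
    # 2D coordinate arrays of shape (NY, NX)
    lon2d, lat2d = np.meshgrid(grid_lon, grid_lat)
    # Distance from every grid node to every observation: (NY, NX, n_obs)
    dist = np.hypot(lat2d[..., None] - obs_lat, lon2d[..., None] - obs_lon)
    # Closer observations get larger weights; eps avoids division by zero
    weights = 1.0 / (dist ** power + eps)
    return (weights * obs_val).sum(axis=-1) / weights.sum(axis=-1)

# Small illustrative grid (much coarser than 0.01 degrees, to keep memory low)
grid_lat = np.linspace(40.0, 41.0, 5)
grid_lon = np.linspace(3.0, 4.0, 5)
field = idw_to_grid([40.0, 41.0], [3.0, 4.0], [18.0, 20.0], grid_lat, grid_lon)
```

This dense formulation holds NY × NX × n_obs distances in memory, so on the full grid it would likely need to be restricted to the k nearest observations per node (for example via a KD-tree) or processed in tiles.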

Normalization: Once all data layers are on the same grid, stack them into a multi-channel array.
Normalize each channel (e.g. min-max scaling or z-score) so that different data ranges are comparable for the model.
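
As an alternative to min-max scaling, per-channel z-score normalization could look like the sketch below. The helper name `zscore_channels` and the synthetic two-channel stack are assumptions for illustration; the statistics are returned so that the same values can be reused later on new data.

```python
import numpy as np

def zscore_channels(stack, eps=1e-6):
    """Standardize each channel of a (C, NY, NX) array, ignoring NaN cells."""
    mean = np.nanmean(stack, axis=(1, 2), keepdims=True)
    std = np.nanstd(stack, axis=(1, 2), keepdims=True)
    return (stack - mean) / (std + eps), mean, std

# Two synthetic channels with very different ranges (e.g. kelvin vs. m/s)
rng = np.random.default_rng(0)
stack = np.stack([
    280.0 + 5.0 * rng.standard_normal((4, 6)),
    10.0 * rng.random((4, 6)),
])
normed, mean, std = zscore_channels(stack)
```

After this step both channels have approximately zero mean and unit variance, regardless of their original units.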

In [None]:
from io import BytesIO

import numpy as np
import requests
from PIL import Image
from scipy import spatial

# Define grid parameters (example bounds and resolution)
lat_min, lat_max = 30.0, 45.0
lon_min, lon_max = -5.0, 15.0
res_deg = 0.01  # ~1 km
# Derive the sizes from rounded counts to avoid float off-by-one issues in np.arange
NY = int(round((lat_max - lat_min) / res_deg))
NX = int(round((lon_max - lon_min) / res_deg))
lat_vals = lat_min + res_deg * np.arange(NY)
lon_vals = lon_min + res_deg * np.arange(NX)

# Initialize multichannel grid (e.g. 6 channels); NaN marks "no data"
data_grid = np.full((6, NY, NX), np.nan, dtype=np.float32)

def to_index(lat, lon):
    """Return the (row, col) of the grid node nearest to (lat, lon)."""
    return int(round((lat - lat_min) / res_deg)), int(round((lon - lon_min) / res_deg))

# 1. Fetch latest NOAA APT satellite image (simulated by loading from file)
sat_img = Image.open("noaa_apt_latest.jpg").convert("L")  # placeholder for actual fetch; grayscale

# Resample satellite image to grid resolution.
# (In practice, reproject first, e.g. with cv2.remap. Here we assume the image is already
# aligned with the grid; note that image row 0 is often the northern edge, whereas grid
# row 0 is lat_min, so a vertical flip may be needed.)
sat_arr_resized = np.asarray(sat_img.resize((NX, NY)), dtype=np.float32)  # PIL size is (width, height)
data_grid[0] = sat_arr_resized  # e.g. IR channel

# If APT has a second channel (e.g. visible), place it in data_grid[1]

# 2. Fetch buoy data from an open API (e.g. NOAA NDBC or Copernicus Marine)
buoy_url = "https://api.example.com/latest_buoys?region=med"  # placeholder endpoint
resp = requests.get(buoy_url, timeout=30)
resp.raise_for_status()
# Assumed JSON layout: {"stations": [{"latitude", "longitude", "measurement": {"value"}}, ...]}
buoy_data = resp.json()
for buoy in buoy_data["stations"]:
    iy, ix = to_index(buoy["latitude"], buoy["longitude"])
    if 0 <= iy < NY and 0 <= ix < NX:
        data_grid[2, iy, ix] = buoy["measurement"]["value"]  # e.g. sea surface temp at buoy location

# 3. Fetch radar data (e.g. rainfall radar composite)
radar_resp = requests.get("https://openapi.example.com/radar/med_latest.png", timeout=30)
radar_resp.raise_for_status()
radar_img = Image.open(BytesIO(radar_resp.content)).convert("L")
# Assume radar image already in lat/lon projection for region; resample to grid
radar_arr_resized = np.asarray(radar_img.resize((NX, NY)), dtype=np.float32)
data_grid[3] = radar_arr_resized  # e.g. rainfall intensity

# 4. Fetch ship observations (if available), handled like the buoys
ship_url = "https://api.example.com/ships/meteo"  # placeholder endpoint
ship_resp = requests.get(ship_url, timeout=30)
ship_resp.raise_for_status()
for obs in ship_resp.json()["observations"]:
    iy, ix = to_index(obs["lat"], obs["lon"])
    if 0 <= iy < NY and 0 <= ix < NX:
        data_grid[4, iy, ix] = obs["wind_speed"]  # place ship-reported wind speed

# 5. Fill missing values on the grid with the nearest valid neighbor
for ch in range(data_grid.shape[0]):
    layer = data_grid[ch]  # view into data_grid, so changes apply in place
    mask = np.isnan(layer)
    if mask.all():
        layer[:] = 0.0  # channel has no data at all (e.g. the unused channels 1 and 5)
    elif mask.any():
        # np.argwhere and boolean indexing both iterate in row-major order
        tree = spatial.KDTree(np.argwhere(~mask))
        nearest = tree.query(np.argwhere(mask))[1]
        layer[mask] = layer[~mask][nearest]

# 6. Normalization (e.g. scale each channel to 0-1)
for ch in range(data_grid.shape[0]):
    arr = data_grid[ch]
    lo, hi = np.nanmin(arr), np.nanmax(arr)
    data_grid[ch] = (arr - lo) / (hi - lo + 1e-6)

_encoder_

The Encoder's job is to ingest the multi-modal, multi-channel grid (and implicitly the point data, since we
have interpolated them onto the grid) and produce a learned representation (embedding) that captures
the current state of the atmosphere/ocean. We choose a Vision Transformer (ViT) for this task because
ViTs use self-attention to capture global spatial relationships, which can be advantageous for
integrating data over a large region.

ViT treats the image as a sequence of patches and models global interactions between all patches. Each input image
(our multi-channel grid) is split into patches, flattened, and linearly projected into an embedding space.

Positional embeddings are added to preserve spatial context, then a Transformer encoder
(multi-head self-attention layers) processes the sequence of patch embeddings. We take the
Transformer's output as the encoded state.
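
The patch-splitting, flattening, projection and positional-embedding steps described above can be traced with plain NumPy. This is only a shape-level sketch: the helper `patchify`, the small grid size, and the random matrices `W_proj` and `pos_embed` (stand-ins for parameters that would be learned) are assumptions for illustration.

```python
import numpy as np

def patchify(x, patch):
    """Split a (C, H, W) array into a (N, C*patch*patch) sequence of flattened patches."""
    C, H, W = x.shape
    assert H % patch == 0 and W % patch == 0, "grid must be divisible by patch size"
    hp, wp = H // patch, W // patch
    # (C, hp, patch, wp, patch) -> (hp, wp, C, patch, patch) -> (N, C*patch*patch)
    x = x.reshape(C, hp, patch, wp, patch).transpose(1, 3, 0, 2, 4)
    return x.reshape(hp * wp, C * patch * patch)

rng = np.random.default_rng(0)
C, H, W, patch, embed_dim = 6, 32, 48, 16, 64
x = rng.standard_normal((C, H, W))

tokens = patchify(x, patch)  # (N, C*patch*patch), patches in row-major order
W_proj = 0.02 * rng.standard_normal((tokens.shape[1], embed_dim))     # stand-in for learned projection
pos_embed = 0.02 * rng.standard_normal((tokens.shape[0], embed_dim))  # stand-in for learned positions
embedded = tokens @ W_proj + pos_embed  # (N, embed_dim), the Transformer's input sequence
```

With a 32 × 48 grid and 16 × 16 patches this yields N = 2 × 3 = 6 tokens, each summarizing all 6 input channels of one patch.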

--------------------------------------------------------------------------------------------------------------
Implementation details: We need to adjust a standard ViT to handle our input. Typically, ViT expects a
3-channel RGB image; here we have N-channel data (say 6 channels as in our example). We implement a
patch embedding layer as a Conv2d with kernel size = patch size and output channels = `embed_dim`
(the transformer dimension). This will reduce each patch (of size e.g. 16×16) across all input channels
into a vector.


In our design, we will output a spatial feature map that can interface
with the Processor. One approach is to reshape the Transformer output (excluding the class token) back
to a 2D grid of patches.
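
The reshaping step can be checked in isolation with NumPy before wiring it into a model. In this sketch the array `enc_out` is a random stand-in for the Transformer output, and the sizes `hp` and `wp` (patches per column and per row) are assumed values.

```python
import numpy as np

B, embed_dim, hp, wp = 2, 8, 3, 5
N = hp * wp
rng = np.random.default_rng(0)
enc_out = rng.standard_normal((B, N + 1, embed_dim))  # stand-in for Transformer output

cls_out = enc_out[:, 0, :]     # (B, embed_dim) global summary token
patch_out = enc_out[:, 1:, :]  # (B, N, embed_dim), patches in row-major order
# Move the embedding axis forward, then unfold N into the 2D patch grid
feature_map = patch_out.transpose(0, 2, 1).reshape(B, embed_dim, hp, wp)
```

The token for patch row i and column j sits at sequence position 1 + i * wp + j, so `feature_map[b, :, i, j]` recovers exactly that token's embedding.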

In [None]:
import torch
import torch.nn as nn


class ViTEncoder(nn.Module):
    def __init__(self, img_size, patch_size, in_channels, embed_dim, num_layers=6, num_heads=8):
        super().__init__()
        # Accept a single int (square grid) or an (H, W) pair (non-square grid)
        img_h, img_w = (img_size, img_size) if isinstance(img_size, int) else img_size
        assert img_h % patch_size == 0 and img_w % patch_size == 0, \
            "Image size must be divisible by patch size"
        self.patch_size = patch_size
        self.grid_size = (img_h // patch_size, img_w // patch_size)
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.embed_dim = embed_dim

        # Patch embedding: conv layer that maps each patch (all input channels) to an embed_dim vector
        self.patch_embed = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

        # Class token and positional embedding (learned parameters, initialized to zero)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

        # Transformer Encoder
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, x):
        # x shape: (B, in_channels, H, W)
        B = x.size(0)
        # Create patch embeddings
        patches = self.patch_embed(x)  # (B, embed_dim, H/patch, W/patch)
        patches = patches.flatten(2).transpose(1, 2)  # (B, N, embed_dim), N = num_patches

        # Prepend class token
        cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
        tokens = torch.cat([cls_tokens, patches], dim=1)  # (B, N+1, embed_dim)
        tokens = tokens + self.pos_embed[:, :tokens.size(1), :]

        # Transformer encoding (the default layout is sequence-first)
        tokens = tokens.transpose(0, 1)  # (N+1, B, embed_dim) for transformer
        enc_outputs = self.transformer(tokens)  # (N+1, B, embed_dim)
        enc_outputs = enc_outputs.transpose(0, 1)  # (B, N+1, embed_dim)

        # Separate class token and patch embeddings
        cls_out = enc_outputs[:, 0, :]  # (B, embed_dim)
        patch_out = enc_outputs[:, 1:, :].transpose(1, 2)  # (B, embed_dim, N)

        # Reshape patch_out back to spatial grid
        feature_map = patch_out.reshape(B, self.embed_dim, *self.grid_size)  # (B, embed_dim, H/patch, W/patch)
        return cls_out, feature_map

This encoder produces two outputs: `cls_out` (a global feature vector summarizing the state) and
`feature_map` (a smaller spatial feature map of size H/patch × W/patch capturing spatial details).
We will use `feature_map` as input to the next stage, and potentially use `cls_out` to initialize
hidden states.