# Generate fake data with the same chunk size

In this notebook, we generate multiple data stores of increasingly fine resolution, all with the same chunk size, so that the total spatial size of the dataset doubles on each iteration.

This lets us examine the relationship between the number of chunks and tiling performance.

## Step 1: Load the necessary libraries

In [3]:
import xarray as xr
import numpy as np
import os
import s3fs
import sys; sys.path.append('..')
import eodc_hub_role
import zarr_helpers

## Step 2: Set up data storage

Store the data in the "with_chunks" subdirectory of the fake data directory.

In [4]:
credentials = eodc_hub_role.fetch_and_set_credentials()
bucket = 'nasa-eodc-data-store'
fake_data_dir = 'fake_data/with_chunks'
s3_fs = s3fs.S3FileSystem(
    key=credentials['AccessKeyId'],
    secret=credentials['SecretAccessKey'],
    token=credentials['SessionToken'], 
    anon=False
)

## Step 3: Define starting conditions

The following are set as variables so tests can be modified easily for different starting conditions. For example, we might want to test a different target size.

In [18]:
# Define starting conditions
# variable: target size of chunks in MiB
target_size = 32
# not variable: 1024 bytes per KiB, so onemb * onemb is the number of bytes per MiB
onemb = 1024
# number of data values per chunk
data_values_per_chunk = (target_size * onemb * onemb)/8 # 8 bytes for each float64 data value
# since there are half as many latitudes as longitudes, calculate the y dimension to be half the x dimension
y = round(np.sqrt(data_values_per_chunk/2))
x = 2*y
target_chunks = {'time': 1, 'lat': y, 'lon': x}
print(f"Each dataset will have chunks of the following dimensions {target_chunks}.")

# timesteps are 1 for now
time_steps = 1

# how much do you want the dataset to grow by each iteration
multiple = 2 
# how many datasets we want to test
n_multiples = 5
print(f"We will generate {n_multiples} datasets, each being {multiple} times larger.")

Each dataset will have chunks of the following dimensions {'time': 1, 'lat': 1448, 'lon': 2896}.
We will generate 5 datasets, each being 2 times larger.
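
The chunk dimensions printed above can be reproduced by hand. The following is a minimal sketch using only the standard library, with illustrative variable names that are not part of the notebook. It assumes 8-byte `float64` values, which is what `np.random.random` produces, and shows why rounding the latitude dimension makes the actual chunk slightly smaller than the 32 MiB target.

```python
import math

target_size_mib = 32           # same target as the notebook
bytes_per_value = 8            # assumption: float64 data values
bytes_per_mib = 1024 * 1024

# Number of values that fit in one chunk of the target size
values_per_chunk = target_size_mib * bytes_per_mib // bytes_per_value

# lon is twice lat, so values_per_chunk == lat * (2 * lat)
lat_chunk = round(math.sqrt(values_per_chunk / 2))
lon_chunk = 2 * lat_chunk

# Rounding lat to a whole number means the chunk is not exactly 32 MiB
actual_mib = lat_chunk * lon_chunk * bytes_per_value / bytes_per_mib

print(lat_chunk, lon_chunk, actual_mib)  # 1448 2896 31.9931640625
```
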


## Step 4: Generate Datastores

In [24]:
# If necessary, remove anything that is there
# !aws s3 rm --recursive s3://{bucket}/{fake_data_dir}/

In [None]:
for n_multiple in range(n_multiples):
    # if this isn't the first iteration, grow the total size of the dataset by `multiple`
    if n_multiple != 0:
        # total number of data values in the expanded grid
        total_data_values = y * x * multiple
        # to maintain the aspect ratio, where we know size == y * x and x = 2y
        y = round(np.sqrt(total_data_values/2))
        x = 2*y
        print(f"x is {x}, y is {y}")
        
    data = np.random.random(size=(time_steps, y, x))

    # Create Xarray datasets with dimensions and coordinates
    ds = xr.Dataset({
        'data': (['time', 'lat', 'lon'], data),
    }, coords={
        'time': np.arange(time_steps),
        'lat': np.linspace(-90, 90, y),
        'lon': np.linspace(-180, 180, x)
    })

    try:
        ds = ds.chunk(target_chunks)
        path = f'{fake_data_dir}/store_lat_{y}x_lon_{x}.zarr'
        store = s3fs.S3Map(root=f'{bucket}/{path}', s3=s3_fs, check=False)
        ds.to_zarr(store, mode='w')
    except Exception as e:
        print(e)
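
Because the purpose of these stores is to relate chunk count to tiling performance, it can help to preview the grid shapes and chunk counts before writing anything to S3. The sketch below repeats the loop's arithmetic without creating any data. The function name `grid_shapes` is hypothetical, and the chunk count assumes the grid is split by ceiling division, so partial chunks at the edges count as full chunks.

```python
import math

chunk_lat, chunk_lon = 1448, 2896   # target chunk shape from Step 3
multiple, n_multiples = 2, 5

def grid_shapes(y, x, multiple, n):
    """Return the (lat, lon) shape of each dataset, mirroring the loop above."""
    shapes = [(y, x)]
    for _ in range(n - 1):
        total_values = y * x * multiple
        y = round(math.sqrt(total_values / 2))
        x = 2 * y
        shapes.append((y, x))
    return shapes

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

shapes = grid_shapes(chunk_lat, chunk_lon, multiple, n_multiples)
chunk_counts = [ceil_div(y, chunk_lat) * ceil_div(x, chunk_lon) for y, x in shapes]

for (y, x), count in zip(shapes, chunk_counts):
    print(f"lat={y}, lon={x}: {count} chunk(s)")
```

Under these assumptions the chunk counts are 1, 4, 4, 9 and 25. The count does not double along with the dataset size, because the rounded grid dimensions are not always whole multiples of the chunk dimensions and any remainder adds an extra partial chunk along that axis.
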

## Step 5 (Optional): Check that it worked

In [20]:
# List all items in the directory
directories = s3_fs.ls(f'{bucket}/{fake_data_dir}')
directories

['nasa-eodc-data-store/fake_data/with_chunks/store_lat_1448x_lon_2896.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat_2048x_lon_4096.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat_2896x_lon_5792.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat_4096x_lon_8192.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat_5793x_lon_11586.zarr']

In [22]:
# Loop through each item and try to open it with xarray as a Zarr store
for path in directories:
    try:
        # Attempt to open the Zarr store using xarray
        store = s3fs.S3Map(root=path, s3=s3_fs, check=False)
        ds = xr.open_zarr(store)
        print(f"Chunk size for {path}:")
        print(zarr_helpers.get_chunk_size(ds['data'])[2])
        print('-' * 80)  # Print a separator line
    except Exception as e:
        # Print an error message if unable to open the Zarr store
        print(f"Could not open {path} as a Zarr store. Error: {e}")

Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat_1448x_lon_2896.zarr:
31.9931640625
--------------------------------------------------------------------------------
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat_2048x_lon_4096.zarr:
31.9931640625
--------------------------------------------------------------------------------
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat_2896x_lon_5792.zarr:
31.9931640625
--------------------------------------------------------------------------------
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat_4096x_lon_8192.zarr:
31.9931640625
--------------------------------------------------------------------------------
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat_5793x_lon_11586.zarr:
31.9931640625
--------------------------------------------------------------------------------
