# Generate fake data with the same chunk size

In this notebook, we generate multiple data stores of increasingly fine resolution so that the total spatial size of the dataset doubles on each iteration.

This is so we can understand the relationship between the number of chunks and tiling performance.

## Setup 1: Load the necessary libraries

In [33]:
%load_ext autoreload
%autoreload
import xarray as xr
import numpy as np
import os
import s3fs
import sys; sys.path.append('..')
import eodc_hub_role
import zarr_helpers

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Setup 2: Set up data storage

Store data in the `fake_data` directory of the S3 bucket. Each part below writes to its own subdirectory (`single_chunk` and `with_chunks`).

In [34]:
credentials = eodc_hub_role.fetch_and_set_credentials()
bucket = 'nasa-eodc-data-store'
fake_data_dir = 'fake_data'
s3_fs = s3fs.S3FileSystem(
    key=credentials['AccessKeyId'],
    secret=credentials['SecretAccessKey'],
    token=credentials['SessionToken'], 
    anon=False
)

## Fake Data Generation Part 1: Generate data stores with a single chunk

These datastores have no spatial chunking: each store holds its data in a single chunk. Because the stores are generated at varying resolutions, the chunk size varies from store to store.

In [40]:
# Define starting conditions
time_steps = 1
ydim = 512
xdim = 1024
multiple = 2 # how much do you want the dataset to grow by each iteration
n_multiples = 7
data_path = 'fake_data/single_chunk'

In [37]:
# If necessary, remove anything that is there
#!aws s3 rm --recursive s3://{bucket}/{data_path}/

In [41]:
# generate and store data
zarr_helpers.generate_multiple_datastores(
    n_multiples,
    xdim,
    ydim,
    f'{bucket}/{data_path}',
    s3_fs
)

Writing to nasa-eodc-data-store/fake_data/single_chunk/store_lat512_lon1024.zarr
Writing to nasa-eodc-data-store/fake_data/single_chunk/store_lat724_lon1448.zarr
Writing to nasa-eodc-data-store/fake_data/single_chunk/store_lat1024_lon2048.zarr
Writing to nasa-eodc-data-store/fake_data/single_chunk/store_lat1448_lon2896.zarr
Writing to nasa-eodc-data-store/fake_data/single_chunk/store_lat2048_lon4096.zarr
Writing to nasa-eodc-data-store/fake_data/single_chunk/store_lat2896_lon5792.zarr
Writing to nasa-eodc-data-store/fake_data/single_chunk/store_lat4096_lon8192.zarr
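
The store names above follow a regular pattern. The source of `zarr_helpers.generate_multiple_datastores` is not shown here, so the following is only a sketch of one rule that reproduces the printed dimensions: multiply the latitude count by the square root of 2 at each step, round, and set the longitude count to twice the latitude count. The function name `store_dimensions` is hypothetical.

```python
import math

def store_dimensions(ydim, n_multiples, multiple=2):
    """Return (lat, lon) pairs whose total cell count grows by about `multiple` per step.

    Assumption: this mirrors the naming seen in the output above; it is not
    the actual implementation of zarr_helpers.generate_multiple_datastores.
    """
    dims = []
    lat = ydim
    for _ in range(n_multiples):
        dims.append((lat, 2 * lat))  # assumption: lon is always twice lat
        # Scaling each axis by sqrt(multiple) scales the area by `multiple`
        lat = round(lat * math.sqrt(multiple))
    return dims

names = [f"store_lat{lat}_lon{lon}.zarr" for lat, lon in store_dimensions(512, 7)]
print(names)
```

Because each axis is rounded to a whole number of cells, the total size only approximately doubles at the odd-numbered steps (for example, 724 x 1448 is slightly less than twice 512 x 1024).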


### Check that it worked

In [42]:
directories = s3_fs.ls(f'{bucket}/{data_path}')
for path in directories:
    try:
        # Attempt to open the Zarr store using xarray
        store = s3fs.S3Map(root=path, s3=s3_fs, check=False)
        ds = xr.open_zarr(store)
        chunk_size = round(zarr_helpers.get_chunk_size(ds['data'])[2], 2)
        print(f"Chunk size for {path}: {chunk_size} MB")
    except Exception as e:
        # Print an error message if unable to open the Zarr store
        print(f"Could not open {path} as a Zarr store. Error: {e}")

Chunk size for nasa-eodc-data-store/fake_data/single_chunk/store_lat1024_lon2048.zarr: 16.0 MB
Chunk size for nasa-eodc-data-store/fake_data/single_chunk/store_lat1448_lon2896.zarr: 31.99 MB
Chunk size for nasa-eodc-data-store/fake_data/single_chunk/store_lat2048_lon4096.zarr: 64.0 MB
Chunk size for nasa-eodc-data-store/fake_data/single_chunk/store_lat2896_lon5792.zarr: 127.97 MB
Chunk size for nasa-eodc-data-store/fake_data/single_chunk/store_lat4096_lon8192.zarr: 256.0 MB
Chunk size for nasa-eodc-data-store/fake_data/single_chunk/store_lat512_lon1024.zarr: 4.0 MB
Chunk size for nasa-eodc-data-store/fake_data/single_chunk/store_lat724_lon1448.zarr: 8.0 MB
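
The reported sizes can be checked by hand. The sketch below assumes each data value takes 8 bytes (as stated in the Part 2 setup comments) and that 1 MB is 1024 x 1024 bytes. `chunk_size_mb` is a hypothetical helper written for illustration; it is not `zarr_helpers.get_chunk_size`, whose implementation is not shown here.

```python
def chunk_size_mb(chunk_shape, bytes_per_value=8):
    """Uncompressed size of one chunk in MB (1 MB = 1024 * 1024 bytes).

    Assumption: 8-byte values; adjust bytes_per_value for other dtypes.
    """
    n_values = 1
    for length in chunk_shape:
        n_values *= length
    return n_values * bytes_per_value / (1024 * 1024)

# (time, lat, lon) shapes taken from the store names above
for shape in [(1, 512, 1024), (1, 724, 1448), (1, 1448, 2896), (1, 2896, 5792)]:
    print(shape, round(chunk_size_mb(shape), 2), "MB")
```

The sizes that are not exact powers of two (such as 31.99 MB and 127.97 MB) come from rounding the dimensions to whole numbers of cells.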


## Fake Data Generation Part 2

### Part 2 Step 1: Define starting conditions for generating data of the same chunk size, but varied chunk shape

The following are set as variables so tests can be modified easily for different starting conditions. For example, we might want to test a different target chunk size.

In [43]:
# Define starting conditions
# variable: target size of chunks in MB
target_size = 32
# not variable: 1024 bytes per KB and 1024 KB per MB, so onemb * onemb is the number of bytes per MB
onemb = 1024
# number of data values per chunk
data_values_per_chunk = (target_size * onemb * onemb)/8 # 8 bytes for each data value
# since there are half as many latitudes as longitudes, calculate the y dimension to be half the x dimension
ydim = round(np.sqrt(data_values_per_chunk/2))
xdim = 2*ydim
target_chunks = {'time': 1, 'lat': ydim, 'lon': xdim}
print(f"Each dataset will have chunks of the following dimensions {target_chunks}.")

# timesteps are 1 for now
time_steps = 1
# how much do you want the dataset to grow by each iteration
multiple = 2 
# how many datasets we want to test
n_multiples = 5
print(f"We will generate {n_multiples} datasets, each being {multiple} times larger.")

data_path = 'fake_data/with_chunks'

Each dataset will have chunks of the following dimensions {'time': 1, 'lat': 1448, 'lon': 2896}.
We will generate 5 datasets, each being 2 times larger.
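
To try a different target chunk size, the arithmetic in the cell above can be wrapped in a small function. This is a minimal sketch under the same assumptions as that cell (8 bytes per value, twice as many longitudes as latitudes, one time step per chunk); the name `chunk_dims_for_target` is hypothetical and not part of `zarr_helpers`.

```python
import math

def chunk_dims_for_target(target_size_mb, bytes_per_value=8):
    """Chunk dimensions giving roughly `target_size_mb` MB per chunk.

    Assumptions: lon = 2 * lat, one time step per chunk, 1 MB = 1024 * 1024 bytes.
    """
    values_per_chunk = target_size_mb * 1024 * 1024 / bytes_per_value
    # lat * lon = lat * (2 * lat) = values_per_chunk, so lat = sqrt(values_per_chunk / 2)
    ydim = round(math.sqrt(values_per_chunk / 2))
    return {'time': 1, 'lat': ydim, 'lon': 2 * ydim}

for target in (8, 16, 32, 64):
    print(target, "MB ->", chunk_dims_for_target(target))
```

For a 32 MB target this gives the same chunk dimensions as the cell above.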


### Part 2 Step 2: Generate Datastores

In [45]:
# If necessary, remove anything that is there
#!aws s3 rm --recursive s3://{bucket}/{data_path}/

In [47]:
zarr_helpers.generate_multiple_datastores(
    n_multiples,
    xdim,
    ydim,
    f'{bucket}/{data_path}',
    s3_fs,
    target_chunks
)

Writing to nasa-eodc-data-store/fake_data/with_chunks/store_lat1448_lon2896.zarr
Writing to nasa-eodc-data-store/fake_data/with_chunks/store_lat2048_lon4096.zarr
Writing to nasa-eodc-data-store/fake_data/with_chunks/store_lat2896_lon5792.zarr
Writing to nasa-eodc-data-store/fake_data/with_chunks/store_lat4096_lon8192.zarr
Writing to nasa-eodc-data-store/fake_data/with_chunks/store_lat5793_lon11586.zarr


### Part 2 Step 3 (Optional): Check that it worked

In [48]:
# List all items in the directory
directories = s3_fs.ls(f'{bucket}/{data_path}')
directories

['nasa-eodc-data-store/fake_data/with_chunks/store_lat1448_lon2896.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat2048_lon4096.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat2896_lon5792.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat4096_lon8192.zarr',
 'nasa-eodc-data-store/fake_data/with_chunks/store_lat5793_lon11586.zarr']

In [49]:
for path in directories:
    try:
        # Attempt to open the Zarr store using xarray
        store = s3fs.S3Map(root=path, s3=s3_fs, check=False)
        ds = xr.open_zarr(store)
        chunk_size = round(zarr_helpers.get_chunk_size(ds['data'])[2], 2)
        print(f"Chunk size for {path}: {chunk_size} MB")
    except Exception as e:
        # Print an error message if unable to open the Zarr store
        print(f"Could not open {path} as a Zarr store. Error: {e}")

Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat1448_lon2896.zarr: 31.99 MB
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat2048_lon4096.zarr: 31.99 MB
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat2896_lon5792.zarr: 31.99 MB
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat4096_lon8192.zarr: 31.99 MB
Chunk size for nasa-eodc-data-store/fake_data/with_chunks/store_lat5793_lon11586.zarr: 31.99 MB
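
Since the chunk size is held constant, what changes between these stores is the number of chunks. The sketch below estimates that count from the array and chunk shapes, assuming each store uses a regular chunk grid with the `target_chunks` dimensions printed earlier and a single time step. `n_chunks` is a hypothetical helper, and chunks on the trailing edges of the grid may be only partially filled.

```python
import math

def n_chunks(shape, chunks):
    """Number of chunks in a regular grid: the product of ceil(dim / chunk) per axis."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

chunk_shape = (1, 1448, 2896)  # (time, lat, lon), from target_chunks above
store_shapes = [
    (1, 1448, 2896),
    (1, 2048, 4096),
    (1, 2896, 5792),
    (1, 4096, 8192),
    (1, 5793, 11586),
]
for shape in store_shapes:
    print(shape, "->", n_chunks(shape, chunk_shape), "chunks")
```

Under these assumptions the chunk count does not double with each store, because the array dimensions are not always whole multiples of the chunk dimensions.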
