# irp-dbk24 - "Optimising Demand Response Strategies for Carbon-Intelligent Electricity Use"

# Investigating ERA5 World Data

**NOTEBOOK PURPOSE(S):**
* Investigate structure and contents of data retrieved from the ERA5 World dataset.

**NOTEBOOK OUTPUTS:**
* N/A


### Importing Libraries

In [1]:
%matplotlib inline

# ────────────────────────────────────────────────────────────────────────────
# Data Manipulation & Analysis
# ─────────────────────────────────────────────────────────────────────────────
import pandas as pd
import numpy as np

# ─────────────────────────────────────────────────────────────────────────────
# Geospatial Data Handling
# ─────────────────────────────────────────────────────────────────────────────
import xarray as xr
import pygrib

# ─────────────────────────────────────────────────────────────────────────────
# Notebook/Display Tools
# ─────────────────────────────────────────────────────────────────────────────
from IPython.display import display

# ─────────────────────────────────────────────────────────────────────────────
# System / Miscellaneous
# ─────────────────────────────────────────────────────────────────────────────
import os

### Loading Data from Local Storage

In [2]:
cwd = os.getcwd()
print("-"*120)
print("Current Working Directory and contents:\n"+"-"*120)
for root, dirs, files in os.walk(cwd):
    print(f"\nDirectory: {root}")
    print(f"Subdirectories: {dirs}\n"+ "-"*40)
    for file in sorted(files):
        print(f"-> File: {file}")


------------------------------------------------------------------------------------------------------------------------
Current Working Directory and contents:
------------------------------------------------------------------------------------------------------------------------

Directory: /Users/Daniel/Desktop/IRP_WORK_UPDATED.nosync/new_repo/MSc-Thesis-OptimisingDemandResponseStrategiesForCarbonIntelligentElectricityUse/code_and_analysis/jupyter_notebooks
Subdirectories: []
----------------------------------------
-> File: .DS_Store
-> File: marginal_emissions_data_prep.ipynb
-> File: marginal_emissions_model_development.ipynb
-> File: step1_hitachi_data_retrieval.ipynb
-> File: step2_initial_data_analysis.ipynb
-> File: step2_investigating_data_size_and_types.ipynb
-> File: step2_investigating_era5_world_data.ipynb
-> File: step3_combining_era5_datasets.ipynb
-> File: step3_combining_grid_and_weather_data.ipynb
-> File: step3_processing_era5_land_data.ipynb
-> File: step3_process

### Defining File Paths

In [3]:
# Repository root, two levels above the notebooks directory
root_directory = os.path.join('..', '..')

# Base data directory
base_data_directory = os.path.join(root_directory, "data")

# Directory where the dataframes will be saved
hitachi_data_directory = os.path.join(base_data_directory, 'hitachi')

# ERA5 data directory
era5_data_directory = os.path.join(base_data_directory, "era5")  # Directory where the ERA5 data is stored
grib_era5_data_directory = os.path.join(era5_data_directory, "grib_downloads")  # Directory where the ERA5 grib files will be saved
parquets_era5_data_directory = os.path.join(era5_data_directory, "parquets")  # Directory where the ERA5 parquet files will be saved

# Directory where the outputs are saved
outputs_directory = os.path.join(root_directory, 'outputs')
outputs_metrics_directory = os.path.join(outputs_directory, 'metrics')
outputs_images_directory = os.path.join(outputs_directory, 'images')


In [4]:
print("-"*120)
print("[grib_era5_data_directory] and contents:\n"+"-"*120)
for root, dirs, files in os.walk(grib_era5_data_directory):
    print(f"\nDirectory: {root}")
    print(f"Subdirectories: {dirs}\n"+ "-"*40)
    for file in sorted(files):
        print(f"-> File: {file}")


------------------------------------------------------------------------------------------------------------------------
[grib_era5_data_directory] and contents:
------------------------------------------------------------------------------------------------------------------------

Directory: ../../data/era5/grib_downloads
Subdirectories: []
----------------------------------------
-> File: 125ae282169904325e8bc153160be150.grib
-> File: 289f2aac241f8a158ff074a66682452e.grib
-> File: 554832a6209258041784298e5401a7ab.grib
-> File: 5aee58993569287064988fbc8ad385dd.grib
-> File: 5bcc58c42bdde8ce6b147b00099404bc.grib
-> File: ad36c26a5d6daae43c9aeab1747e078c.grib
-> File: b4eac1bff8a020500806be638e9d4ab9.grib
-> File: bc20f736fa82ab5167820d9116ab4859.grib
-> File: c8a985ffc4908e6597c4498ff596cbad.grib
-> File: d1313a3f750d6e7bd89dff34b112d8a8.grib
-> File: de87f0d77e8aeed868c68ac0daae3dc9.grib
-> File: e23fa435dfdf294eba51378e96410b31.grib


In [5]:
file1 = "125ae282169904325e8bc153160be150.grib"
file2 = "289f2aac241f8a158ff074a66682452e.grib"
file3 = "554832a6209258041784298e5401a7ab.grib"
file4 = "5aee58993569287064988fbc8ad385dd.grib"
file5 = "5bcc58c42bdde8ce6b147b00099404bc.grib"
file6 = "ad36c26a5d6daae43c9aeab1747e078c.grib"
file7 = "b4eac1bff8a020500806be638e9d4ab9.grib"
file8 = "bc20f736fa82ab5167820d9116ab4859.grib"
file9 = "c8a985ffc4908e6597c4498ff596cbad.grib"
file10 = "d1313a3f750d6e7bd89dff34b112d8a8.grib"
file11 = "de87f0d77e8aeed868c68ac0daae3dc9.grib"
file12 = "e23fa435dfdf294eba51378e96410b31.grib"

In [6]:
# create list of the files
files = [
    file1, file2, file3, file4, file5,
    file6, file7, file8, file9, file10,
    file11, file12
]

# create full paths for each of the files
grib_files = [os.path.join(grib_era5_data_directory, file) for file in files]
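
The twelve file names above are typed out by hand. As a minimal sketch of an alternative, the same list could be built by scanning the directory. The helper name `list_grib_files` and the demo file names are hypothetical, not part of the original notebook.

```python
import os
import tempfile


def list_grib_files(directory):
    """Return sorted full paths of every *.grib file directly inside `directory`."""
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if name.endswith(".grib")
    ]


# Demonstration on a throwaway directory with placeholder file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.grib", "a.grib", ".DS_Store"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_paths = list_grib_files(demo_dir)
    demo_names = [os.path.basename(p) for p in demo_paths]
```

In the notebook this would be called as `list_grib_files(grib_era5_data_directory)`, which also skips non-GRIB entries such as `.DS_Store`.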


### Inspecting files

In [7]:
def inspect_with_pygrib(path):
    with pygrib.open(path) as grbs:
        msgs = list(grbs)  # load all messages into memory
    # 1) collect unique variable shortNames
    vars_ = sorted({m.shortName for m in msgs})
    # 2) collect all validDates and then get min/max
    dates = sorted({m.validDate for m in msgs})
    t0, t1 = dates[0], dates[-1]
    # 3) get lat/lon bounds from the first message
    lat, lon = msgs[0].latlons()
    lat0, lat1 = float(lat.min()), float(lat.max())
    lon0, lon1 = float(lon.min()), float(lon.max())
    # print summary
    print(f"\nFile: \'{path}\'")
    print(f"\t- Variables: [{', '.join(vars_)}]")
    print(f"\t- Time range: [{t0.isoformat()}]  to  [{t1.isoformat()}]")
    print(f"\t- Latitude:  [{lat0:.3f}]  to  [{lat1:.3f}]")
    print(f"\t- Longitude: [{lon0:.3f}]  to  [{lon1:.3f}]")

In [8]:
# Example usage
for f in grib_files:
    inspect_with_pygrib(f)


File: '../../data/era5/grib_downloads/125ae282169904325e8bc153160be150.grib'
	- Variables: [10u, 10v, 2t, hcc, lcc, mcc, ssr, ssrd, tcc, tp]
	- Time range: [2024-12-31T18:00:00]  to  [2025-07-11T05:00:00]
	- Latitude:  [26.000]  to  [30.000]
	- Longitude: [75.000]  to  [79.000]

File: '../../data/era5/grib_downloads/289f2aac241f8a158ff074a66682452e.grib'
	- Variables: [10u, 10v, 2t, hcc, lcc, mcc, ssr, ssrd, tcc, tp]
	- Time range: [2020-12-31T18:00:00]  to  [2021-12-31T23:00:00]
	- Latitude:  [17.000]  to  [21.000]
	- Longitude: [70.000]  to  [74.000]

File: '../../data/era5/grib_downloads/554832a6209258041784298e5401a7ab.grib'
	- Variables: [10u, 10v, 2t, hcc, lcc, mcc, ssr, ssrd, tcc, tp]
	- Time range: [2023-12-31T18:00:00]  to  [2024-12-31T23:00:00]
	- Latitude:  [26.000]  to  [30.000]
	- Longitude: [75.000]  to  [79.000]

File: '../../data/era5/grib_downloads/5aee58993569287064988fbc8ad385dd.grib'
	- Variables: [10u, 10v, 2t, hcc, lcc, mcc, ssr, ssrd, tcc, tp]
	- Time range: [20

Note:

The Delhi files cover the bounding box
 - Latitude:  [26.000]  to  [30.000]
 - Longitude: [75.000]  to  [79.000]

The Mumbai files cover the bounding box
 - Latitude:  [17.000]  to  [21.000]
 - Longitude: [70.000]  to  [74.000]
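
The two bounding boxes in this note are enough to label a file by city without reading the printed summaries by eye. The following is a sketch only: `CITY_BOUNDS` and `city_from_bounds` are hypothetical names, and the tolerance value is an assumption.

```python
# Bounding boxes copied from the note above: (lat_min, lat_max, lon_min, lon_max)
CITY_BOUNDS = {
    "delhi": (26.0, 30.0, 75.0, 79.0),
    "mumbai": (17.0, 21.0, 70.0, 74.0),
}


def city_from_bounds(lat0, lat1, lon0, lon1, tol=1e-6):
    """Return the city whose bounding box matches the given limits, else None."""
    observed = (lat0, lat1, lon0, lon1)
    for city, bounds in CITY_BOUNDS.items():
        if all(abs(a - b) <= tol for a, b in zip(observed, bounds)):
            return city
    return None


delhi_label = city_from_bounds(26.0, 30.0, 75.0, 79.0)
mumbai_label = city_from_bounds(17.0, 21.0, 70.0, 74.0)
unknown_label = city_from_bounds(0.0, 4.0, 0.0, 4.0)
```

The four arguments correspond to the `lat0`, `lat1`, `lon0` and `lon1` values computed inside `inspect_with_pygrib`.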


In [9]:
# Copying the values from the above into corresponding file names
file_2025_delhi_raw_filename = "125ae282169904325e8bc153160be150.grib"
file_2021_mumbai_raw_filename = "289f2aac241f8a158ff074a66682452e.grib"
file_2024_delhi_raw_filename = "554832a6209258041784298e5401a7ab.grib"
file_2024_mumbai_raw_filename = "5aee58993569287064988fbc8ad385dd.grib"
file_2023_delhi_raw_filename = "5bcc58c42bdde8ce6b147b00099404bc.grib"
file_2022_mumbai_raw_filename = "ad36c26a5d6daae43c9aeab1747e078c.grib"
file_2020_delhi_raw_filename = "b4eac1bff8a020500806be638e9d4ab9.grib"
file_2020_mumbai_raw_filename = "bc20f736fa82ab5167820d9116ab4859.grib"
file_2023_mumbai_raw_filename = "c8a985ffc4908e6597c4498ff596cbad.grib"
file_2021_delhi_raw_filename = "d1313a3f750d6e7bd89dff34b112d8a8.grib"
file_2025_mumbai_raw_filename = "de87f0d77e8aeed868c68ac0daae3dc9.grib"
file_2022_delhi_raw_filename = "e23fa435dfdf294eba51378e96410b31.grib"

In [10]:
# creating a list of files
named_files = [file_2025_delhi_raw_filename, file_2024_delhi_raw_filename, file_2023_delhi_raw_filename, file_2022_delhi_raw_filename,
               file_2021_delhi_raw_filename, file_2020_delhi_raw_filename, file_2025_mumbai_raw_filename, file_2024_mumbai_raw_filename,
               file_2023_mumbai_raw_filename, file_2022_mumbai_raw_filename, file_2021_mumbai_raw_filename, file_2020_mumbai_raw_filename]

# creating a list containing full paths for each of the named files
named_filepaths = [os.path.join(grib_era5_data_directory, file) for file in named_files]

# creating a list of names that will be used as keys for a dictionary of the files
dictionary_keys = [
    "2025_delhi_era5_reanalysis_single_levels_data",
    "2024_delhi_era5_reanalysis_single_levels_data",
    "2023_delhi_era5_reanalysis_single_levels_data",
    "2022_delhi_era5_reanalysis_single_levels_data",
    "2021_delhi_era5_reanalysis_single_levels_data",
    "2020_delhi_era5_reanalysis_single_levels_data",
    "2025_mumbai_era5_reanalysis_single_levels_data",
    "2024_mumbai_era5_reanalysis_single_levels_data",
    "2023_mumbai_era5_reanalysis_single_levels_data",
    "2022_mumbai_era5_reanalysis_single_levels_data",
    "2021_mumbai_era5_reanalysis_single_levels_data",
    "2020_mumbai_era5_reanalysis_single_levels_data",
]

# creating a dictionary that maps the keys to the full file paths
named_filepaths_dictionary = dict(zip(dictionary_keys, named_filepaths))


In [11]:
print("-"*120)
print("Testing named_filepaths_dictionary:\n"+"-"*120)

print("2025 Delhi ERA5 Reanalysis Single Levels Data File Path:")
print("\texpected: [data/era5/grib_downloads/125ae282169904325e8bc153160be150.grib]")
print(f"\tactual: [{named_filepaths_dictionary['2025_delhi_era5_reanalysis_single_levels_data']}]")

print("\n" + "-"*80 + "\n")
print("2023 Mumbai ERA5 Reanalysis Single Levels Data File Path:")
print("\texpected: [data/era5/grib_downloads/c8a985ffc4908e6597c4498ff596cbad.grib]")
print(f"\tactual: [{named_filepaths_dictionary['2023_mumbai_era5_reanalysis_single_levels_data']}]")


------------------------------------------------------------------------------------------------------------------------
Testing named_filepaths_dictionary:
------------------------------------------------------------------------------------------------------------------------
2025 Delhi ERA5 Reanalysis Single Levels Data File Path:
	expected: [data/era5/grib_downloads/125ae282169904325e8bc153160be150.grib]
	actual: [../../data/era5/grib_downloads/125ae282169904325e8bc153160be150.grib]

--------------------------------------------------------------------------------

2023 Mumbai ERA5 Reanalysis Single Levels Data File Path:
	expected: [data/era5/grib_downloads/c8a985ffc4908e6597c4498ff596cbad.grib]
	actual: [../../data/era5/grib_downloads/c8a985ffc4908e6597c4498ff596cbad.grib]
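
The printed check above has to be compared by eye, and the "expected" strings omit the `../../` prefix that the actual paths carry. One possible way to automate the check is to compare file names only. This is a sketch: `sample_mapping`, `expected_filenames` and `mismatched_keys` are hypothetical names, and the two entries simply repeat the pairs printed above.

```python
import os

grib_dir = os.path.join("..", "..", "data", "era5", "grib_downloads")

# Two entries mirroring the mapping tested above
sample_mapping = {
    "2025_delhi_era5_reanalysis_single_levels_data":
        os.path.join(grib_dir, "125ae282169904325e8bc153160be150.grib"),
    "2023_mumbai_era5_reanalysis_single_levels_data":
        os.path.join(grib_dir, "c8a985ffc4908e6597c4498ff596cbad.grib"),
}

expected_filenames = {
    "2025_delhi_era5_reanalysis_single_levels_data": "125ae282169904325e8bc153160be150.grib",
    "2023_mumbai_era5_reanalysis_single_levels_data": "c8a985ffc4908e6597c4498ff596cbad.grib",
}


def mismatched_keys(mapping, expected):
    """Return the keys whose mapped path does not end in the expected file name."""
    return [
        key for key, filename in expected.items()
        if os.path.basename(mapping.get(key, "")) != filename
    ]


problems = mismatched_keys(sample_mapping, expected_filenames)
```

An empty `problems` list means every checked key points at the intended file, regardless of the relative prefix.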


### Investigating Resolution and Granularity

In [12]:
vars_list = ['10u','10v','2t','hcc','lcc','mcc','ssr','ssrd','tcc','tp']

In [13]:
# Reminder of the names of the structures:
# list: named_filepaths
# dictionary: named_filepaths_dictionary


# we'll just use the first file as an example
fpath = named_filepaths[0]

# temporary storage for results of this investigation
records = []
# loop through each variable in the list
for var in vars_list:
    try:
        ds = xr.open_dataset(
            fpath,
            engine="cfgrib",
            backend_kwargs={"filter_by_keys": {"shortName": var}}
        )
    except Exception as e:
        print(f"Skipping {var!r}: {e}")
        continue

    # retrieving spatial coordinates
    lat = ds.latitude.values
    lon = ds.longitude.values

    # flatten valid_time if there's a step dimension
    if "step" in ds.dims and "valid_time" in ds.coords:
        times = np.unique(ds.valid_time.values)
    else:
        times = ds.time.values

    # compute resolutions
    lat_res = np.unique(np.diff(lat)).round(6).tolist()
    lon_res = np.unique(np.diff(lon)).round(6).tolist()

    # temporal resolution in hours
    dt_hrs = (np.diff(times).astype("timedelta64[h]") / np.timedelta64(1, "h")).astype(int)
    time_res = np.unique(dt_hrs).tolist()

    # appending the results to records
    records.append({
        "variable":   var,
        "n_lat":      lat.size,
        "lat_res°":   lat_res,
        "n_lon":      lon.size,
        "lon_res°":   lon_res,
        "n_time":     times.size,
        "time_res_h": time_res,
        "t0":         str(times.min()),
        "t1":         str(times.max()),
    })
    # close the dataset to free resources
    ds.close()

  vars, attrs, coord_names = xr.conventions.decode_cf_variables(


In [14]:
# build and display the summary table
df = pd.DataFrame(records, columns=[
    "variable","n_lat","lat_res°","n_lon","lon_res°",
    "n_time","time_res_h","t0","t1"
])
display(df)

| | variable | n_lat | lat_res° | n_lon | lon_res° | n_time | time_res_h | t0 | t1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10u | 17 | [-0.25] | 17 | [0.25] | 4590 | [1] | 2025-01-01T00:00:00.000000000 | 2025-07-11T05:00:00.000000000 |
| 1 | 10v | 17 | [-0.25] | 17 | [0.25] | 4590 | [1] | 2025-01-01T00:00:00.000000000 | 2025-07-11T05:00:00.000000000 |
| 2 | 2t | 17 | [-0.25] | 17 | [0.25] | 4590 | [1] | 2025-01-01T00:00:00.000000000 | 2025-07-11T05:00:00.000000000 |
| 3 | hcc | 17 | [-0.25] | 17 | [0.25] | 4590 | [1] | 2025-01-01T00:00:00.000000000 | 2025-07-11T05:00:00.000000000 |
| 4 | lcc | 17 | [-0.25] | 17 | [0.25] | 4590 | [1] | 2025-01-01T00:00:00.000000000 | 2025-07-11T05:00:00.000000000 |
| 5 | mcc | 17 | [-0.25] | 17 | [0.25] | 4590 | [1] | 2025-01-01T00:00:00.000000000 | 2025-07-11T05:00:00.000000000 |
| 6 | ssr | 17 | [-0.25] | 17 | [0.25] | 4596 | [1] | 2024-12-31T19:00:00.000000000 | 2025-07-11T06:00:00.000000000 |
| 7 | ssrd | 17 | [-0.25] | 17 | [0.25] | 4596 | [1] | 2024-12-31T19:00:00.000000000 | 2025-07-11T06:00:00.000000000 |
| 8 | tcc | 17 | [-0.25] | 17 | [0.25] | 4590 | [1] | 2025-01-01T00:00:00.000000000 | 2025-07-11T05:00:00.000000000 |
| 9 | tp | 17 | [-0.25] | 17 | [0.25] | 4596 | [1] | 2024-12-31T19:00:00.000000000 | 2025-07-11T06:00:00.000000000 |
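
Every variable reports a single time step of 1 hour, which means no gaps were found in this file. To illustrate how a gap would surface in the `time_res_h` column, the sketch below applies the same `np.diff` idea to a synthetic axis with one hour removed. The helper name `time_steps_in_hours` and the synthetic timestamps are assumptions for illustration only.

```python
import numpy as np

# Synthetic hourly axis from 00:00 to 05:00, with 03:00 removed to create a gap
times = np.arange("2025-01-01T00", "2025-01-01T06", dtype="datetime64[h]")
times = np.delete(times, 3)


def time_steps_in_hours(time_values):
    """Return the sorted unique spacings, in hours, between consecutive timestamps."""
    deltas = np.diff(np.sort(time_values)) / np.timedelta64(1, "h")
    return np.unique(deltas).tolist()


steps = time_steps_in_hours(times)
```

Here `steps` contains both a 1-hour and a 2-hour spacing, so any list longer than `[1]` in the table would flag missing or irregular timestamps.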


### Summary

**Data Structure and Resolution**
- The ERA5-World reanalysis data covers the Delhi and Mumbai regions at a spatial resolution of 0.25 degrees (~30 km)
- This resolution is adequate for regional weather patterns but may miss urban microclimate effects
- The data is hourly and spans 2020 to mid-2025 (the 2025 Delhi file ends at 2025-07-11T05:00), offering high temporal granularity
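
As a rough check on the quoted grid spacing, the sketch below converts 0.25 degrees to kilometres on a spherical Earth. The mean radius of 6371 km and the representative latitudes (28° for Delhi and 19° for Mumbai, the centres of the two bounding boxes) are assumptions; `grid_spacing_km` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def grid_spacing_km(res_deg, lat_deg):
    """Return (north-south, east-west) spacing in km for a grid step of res_deg."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = res_deg * km_per_degree
    # East-west spacing shrinks with the cosine of latitude
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


delhi_ns, delhi_ew = grid_spacing_km(0.25, 28.0)
mumbai_ns, mumbai_ew = grid_spacing_km(0.25, 19.0)
```

Under these assumptions the north-south spacing is a little under 28 km for both cities, while the east-west spacing is somewhat smaller and differs between Delhi and Mumbai because of latitude.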

**Weather Variables**
- The dataset includes key meteorological variables:
  - 2 m air temperature (2t)
  - 10 m wind components (10u eastward, 10v northward) - note potential component swapping issue
  - Cloud cover at different levels: high, low, medium and total (hcc, lcc, mcc, tcc)
  - Solar radiation: surface net (ssr) and surface downwards (ssrd)
  - Total precipitation (tp)
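
The wind components are usually combined into a speed and a direction before modelling. The sketch below assumes the standard convention that `10u` is the eastward and `10v` the northward component, and reports the meteorological direction (degrees clockwise from north that the wind blows from). The function name is hypothetical, and the sketch does not diagnose the swapping issue noted above; it only shows that swapping the inputs leaves the speed unchanged while altering the direction.

```python
import numpy as np


def wind_speed_and_direction(u, v):
    """Return wind speed and meteorological direction (degrees the wind blows FROM)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    speed = np.hypot(u, v)
    direction = (180.0 + np.degrees(np.arctan2(u, v))) % 360.0
    return speed, direction


# A westerly wind (blowing towards the east) and a mixed case
speed, direction = wind_speed_and_direction([1.0, 3.0], [0.0, 4.0])

# Swapping the components keeps the speed but changes the direction
swapped_speed, swapped_direction = wind_speed_and_direction([0.0, 4.0], [1.0, 3.0])
```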

**Temporal Considerations**
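- Timestamps are in UTC and need converting to IST (UTC+5:30) for accurate daily pattern analysis
- Some variables are instantaneous (temperature, wind, cloud cover) while others are accumulated (precipitation, radiation); in the summary table the accumulated variables (ssr, ssrd, tp) have 4596 time steps starting at 2024-12-31T19:00, against 4590 steps starting at 2025-01-01T00:00 for the rest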
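
The UTC to IST conversion can be sketched with pandas as follows. The timestamps are invented for illustration, and the sketch assumes the decoded times are timezone-naive values that represent UTC.

```python
import pandas as pd

# Timezone-naive timestamps, assumed to represent UTC
naive_times = pd.to_datetime(["2025-01-01 00:00", "2025-01-01 18:30"])

# Attach UTC, then convert to Indian Standard Time
utc_times = naive_times.tz_localize("UTC")
ist_times = utc_times.tz_convert("Asia/Kolkata")
```

The second timestamp shows why this matters for daily aggregation: 18:30 UTC on 1 January falls on 2 January in IST, so grouping by UTC date would assign it to the wrong local day.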

**Spatial Alignment**
- Customer locations don't directly align with ERA5 grid points
- ERA5-Land uses a different grid structure from ERA5
- A spatial joining methodology is needed to match each location to its nearest weather data point
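
One simple form of spatial join is a nearest-neighbour lookup on the regular grid. The sketch below rebuilds the 17 x 17 Delhi grid from the bounds and 0.25 degree step reported earlier and snaps a point to it. The sample coordinates are a hypothetical location, `nearest_grid_point` is an illustrative helper, and nearness is measured in degrees along each axis rather than as a true ground distance.

```python
import numpy as np

# Delhi grid as reported above: 17 latitudes (descending) and 17 longitudes
grid_lats = np.linspace(30.0, 26.0, 17)
grid_lons = np.linspace(75.0, 79.0, 17)


def nearest_grid_point(lat, lon, lats=grid_lats, lons=grid_lons):
    """Return the (lat, lon) of the grid node closest to the given point."""
    i = int(np.abs(lats - lat).argmin())
    j = int(np.abs(lons - lon).argmin())
    return float(lats[i]), float(lons[j])


# Hypothetical location inside the Delhi bounding box
nearest = nearest_grid_point(28.61, 77.21)
```

For many locations the same idea can be vectorised, or a tree-based neighbour search could be used instead; which approach suits the ERA5-Land grid is left open here.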

**Next Steps**
- Convert data to parquet format for more efficient storage and retrieval
- Implement spatial joining methodology for ERA5-Land
- Use dataset for gap filling