# Weather Data Ingestion (ERA5)

**Project Phase:** 2. Data Integration and Preprocessing
**Goal:** Download historical weather data for Norway, Denmark, and Spain.

### Data Source
We use **ERA5 climate reanalysis** data from the **Copernicus Climate Change Service (C3S)**, accessed through its Climate Data Store (CDS).
ERA5 provides hourly estimates of atmospheric variables and represents the current state-of-the-art in climate reanalysis.

### Requirements
* **Timeframe:** 10 years (2015 - 2025)
* **Resolution:** Hourly
* **Format:** NetCDF (standard for multidimensional climate data)
* **License:** Copernicus License

In [None]:
import cdsapi
import os
import netCDF4
import time 
from pathlib import Path
from dotenv import load_dotenv

# 1. Setup Paths
current_dir = Path(os.getcwd())
project_dir = current_dir.parent.parent
secrets_path = project_dir / 'config' / 'secrets.env'

# Define output directory for raw weather data
output_dir = project_dir / 'data' / 'raw' / 'weather'
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir}")

# 2. Load API Credentials; load the CDS_URL and CDS_KEY from the secrets.env file
if secrets_path.exists():
    load_dotenv(secrets_path)
    print("Credentials loaded from secrets.env")
else:
    print(f"Error: Secrets file not found at {secrets_path}")

# 3. Initialize CDS Client
try:
    cds_url = os.getenv('CDS_URL')
    cds_key = os.getenv('CDS_KEY')
    
    if not cds_url or not cds_key:
        raise ValueError("Missing CDS_URL or CDS_KEY in secrets file.")
        
    # Initialize the client with the specific key
    client = cdsapi.Client(url=cds_url, key=cds_key)
    print("CDS API Client initialized successfully.")
    
except Exception as e:
    print(f"Initialization failed: {e}")

Output directory: c:\Users\maxim\Projects\DsLab25W_marbl.energy\data\raw\weather
Credentials loaded from secrets.env
CDS API Client initialized successfully.


## Configuration

We define the parameters strictly according to the project proposal.

### Variables
1.  **2m temperature**: Air temperature at 2 meters height.
2.  **Total precipitation**: Rain and snow accumulation.
3.  **10m u-component of wind**: Eastward wind component.
4.  **10m v-component of wind**: Northward wind component.
5.  **Surface solar radiation downwards**: Solar irradiance (for solar power).

*Note: We download U and V wind components to calculate exact wind speed and direction later.*
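As a minimal sketch of that later step, the conversion from U and V to speed and direction could look like the following. The function name `wind_speed_and_direction` is hypothetical and not part of this notebook. The sketch assumes the meteorological convention, where direction is the compass bearing the wind blows *from*.

```python
import math

def wind_speed_and_direction(u, v):
    """Return (speed, direction) for eastward (u) and northward (v) components.

    Speed has the same unit as the inputs (m/s for ERA5 10m wind).
    Direction is the compass bearing the wind blows FROM, in degrees,
    measured clockwise from north (0 = north, 90 = east).
    """
    speed = math.hypot(u, v)
    # atan2(u, v) gives the bearing the wind blows TOWARDS; adding 180
    # degrees flips it to the bearing it comes from.
    direction = (180.0 + math.degrees(math.atan2(u, v))) % 360.0
    return speed, direction

# Example: 3 m/s eastward and 4 m/s northward
speed, direction = wind_speed_and_direction(3.0, 4.0)
print(f"{speed:.1f} m/s from {direction:.1f} deg")
```

For gridded data the same formulas apply element-wise, for example with NumPy's `hypot` and `arctan2`.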

### Locations
We define bounding boxes (North, West, South, East) for:
* **DK1** (Denmark West)
* **ES** (Spain)
* **NO2** (Norway, covering the southern/southwestern region)
* **NO4** (Norway North, around Tromsø)
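Because the (North, West, South, East) order is easy to mix up, a small sanity check can catch swapped coordinates before any request is sent. This is a sketch only: `validate_area` is a hypothetical helper, and it assumes boxes that do not cross the antimeridian.

```python
def validate_area(name, area):
    """Check that a [North, West, South, East] box is well-formed."""
    if len(area) != 4:
        raise ValueError(f"{name}: expected 4 values, got {len(area)}")
    north, west, south, east = area
    if not (-90 <= south < north <= 90):
        raise ValueError(f"{name}: need -90 <= South < North <= 90")
    # Assumption: the box does not wrap around the +/-180 degree meridian
    if not (-180 <= west < east <= 180):
        raise ValueError(f"{name}: need -180 <= West < East <= 180")
    return True

# Example boxes in the same order as the configuration cell
example_areas = {"DK1": [58, 7, 54, 12], "ES": [44, -10, 35, 5]}
for zone, box in example_areas.items():
    validate_area(zone, box)
    print(f"{zone}: OK")
```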

In [None]:
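# Timeframe Configuration
# Note: this range covers 2023-2025 only; extend the start year
# (e.g. range(2015, 2026)) to cover the full project timeframe.
YEARS = [str(year) for year in range(2023, 2026)]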
MONTHS = [f"{month:02d}" for month in range(1, 13)]

# Variables required for forecasting models
VARIABLES = [
    '2m_temperature',
    'total_precipitation',
    '10m_u_component_of_wind',
    '10m_v_component_of_wind',
    'surface_solar_radiation_downwards'
]

# Geographical Bounding Boxes [North, West, South, East]
AREAS = {
    'ES':  [44, -10, 35, 5],     # Spain (Iberian Peninsula)
    'NO2': [62, 4, 57, 12],      # Norway South
    'NO4': [72, 16, 68, 32],     # Norway North (Tromsø)
    'DK1': [58, 7, 54, 12]       # Denmark West
}

print("Job Configuration:")
print(f"  Years: {len(YEARS)} ({YEARS[0]}-{YEARS[-1]})")
print(f"  Months: {len(MONTHS)}")
print(f"  Variables: {len(VARIABLES)}")
print(f"  Zones: {list(AREAS.keys())}")
print(f"  Total files: {len(AREAS) * len(YEARS) * len(MONTHS)}")

Job Configuration:
  Years: 3 (2015-2024)
  Variables: 5
  Zones: ['DK1', 'ES', 'NO2']


## Data Download

The following function downloads data in **monthly chunks**.
This is necessary because the CDS API limits the size of a single request; a request covering all 10 years at once failed.

**Logic:**
1.  Check if the file already exists (to avoid re-downloading).
2.  If not, send a request to the Copernicus API.
3.  Save the result as a `.nc` (NetCDF) file.
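The skip-if-exists step can also be run up front to see how much work remains before touching the API. The sketch below is illustrative: `pending_downloads` is a hypothetical helper, and it only assumes the `era5_{zone}_{year}_{month}.nc` naming used by the download function.

```python
import tempfile
from itertools import product
from pathlib import Path

def pending_downloads(output_dir, zones, years, months):
    """List (zone, year, month, path) tuples whose target file is missing."""
    pending = []
    for zone, year, month in product(zones, years, months):
        path = Path(output_dir) / f"era5_{zone}_{year}_{month}.nc"
        if not path.exists():
            pending.append((zone, year, month, path))
    return pending

# Demo in a throwaway directory so no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "era5_DK1_2023_01.nc").touch()  # pretend January is done
    todo = pending_downloads(tmp, ["DK1"], ["2023"], ["01", "02", "03"])
    print([path.name for *_, path in todo])
```

Listing the pending chunks first makes it easy to report progress or to resume an interrupted run.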

In [None]:
def download_era5_month(year, month, zone_name, zone_coords):
    """
    Downloads one month of ERA5 data for a specific zone.
    Includes validation to ensure the file is not corrupt.
    """
    # Define filename
    file_name = f"era5_{zone_name}_{year}_{month}.nc"
    file_path = output_dir / file_name
    
    # Skip if file already exists
    if file_path.exists():
        print(f"Skipping {file_name} (Already exists)")
        return

    print(f"Requesting {file_name}...")
    
    try:
        # 1. Send API Request
        client.retrieve(
            'reanalysis-era5-single-levels',
            {
                'product_type': 'reanalysis',
                'format': 'netcdf',
                'variable': VARIABLES,
                'year': year,
                'month': month,
                'day': [str(d).zfill(2) for d in range(1, 32)],
                'time': [f"{h:02d}:00" for h in range(24)],
                'area': zone_coords,
            },
            str(file_path)
        )
        
        # 2. Validation Step
        # Try to open the file to check if it is a valid NetCDF
        # If the API sent something else (e.g. an error message as text), this will fail
        try:
            time.sleep(2)  # Let file finish writing
            with netCDF4.Dataset(file_path, 'r') as ds:
                # ds.variables also lists coordinate variables (such as
                # latitude and longitude), so this is only a coarse check
                var_names = list(ds.variables)
                if len(var_names) < len(VARIABLES):
                    raise ValueError(f"Only {len(var_names)} variables found")
            print(f"Success: {file_name}")
            
        except Exception as e:
            # Report the reason so failed validations can be diagnosed
            print(f"Error: File {file_name} is corrupt. Deleting it.")
            print(f"  Reason: {e}")
            if file_path.exists():
                file_path.unlink()
        
    except Exception as e:
        print(f"Failed to download {file_name}: {e}")

# --- Execution Loop ---
print("Starting download queue...")
print("Note: This process may take time depending on the CDS queue.")

for zone, coords in AREAS.items():
    print(f"\nProcessing Zone: {zone}")
    print("-" * 30)
    
    for year in YEARS:
        for month in MONTHS:
            download_era5_month(year, month, zone, coords)

print("\nAll downloads completed.")

Starting download queue...
Note: This process may take time depending on the CDS queue.

Processing Zone: DK1
------------------------------
Requesting era5_DK1_2023_01.nc...


2025-11-25 09:55:12,097 INFO Request ID is c9738c30-92d2-4e8a-8ade-a192d8d6b6cd
2025-11-25 09:55:12,948 INFO status has been updated to accepted
2025-11-25 09:55:22,967 INFO status has been updated to running
2025-11-25 09:59:37,188 INFO status has been updated to successful
                                                                                          

Error: File era5_DK1_2023_01.nc is corrupt. Deleting it.
Requesting era5_DK1_2023_02.nc...


2025-11-25 09:59:40,109 INFO Request ID is 1a955bb9-04e9-4bcd-918f-7ae8dadd2ac6
2025-11-25 09:59:40,282 INFO status has been updated to accepted
2025-11-25 09:59:54,546 INFO status has been updated to running
2025-11-25 10:04:03,774 INFO status has been updated to successful
                                                                                          

Error: File era5_DK1_2023_02.nc is corrupt. Deleting it.
Requesting era5_DK1_2023_03.nc...


2025-11-25 10:04:05,858 INFO Request ID is 10373a51-63a7-4a69-a5bb-8b130adc1cb8
2025-11-25 10:04:05,924 INFO status has been updated to accepted
2025-11-25 10:04:15,289 INFO status has been updated to running
2025-11-25 10:07:01,713 INFO status has been updated to successful
                                                                                          

Error: File era5_DK1_2023_03.nc is corrupt. Deleting it.
Requesting era5_DK1_2023_04.nc...


2025-11-25 10:07:03,496 INFO Request ID is f3594ef0-79aa-4010-8018-7b0d9020d5c0
2025-11-25 10:07:03,582 INFO status has been updated to accepted
2025-11-25 10:07:54,080 INFO status has been updated to running
2025-11-25 10:11:24,123 INFO status has been updated to successful
                                                                                          

Error: File era5_DK1_2023_04.nc is corrupt. Deleting it.
Requesting era5_DK1_2023_05.nc...


2025-11-25 10:11:25,854 INFO Request ID is 5c297cb7-f5dc-4e22-b9d5-ec8b1f9d75ed
2025-11-25 10:11:25,943 INFO status has been updated to accepted
2025-11-25 10:11:34,481 INFO status has been updated to running
2025-11-25 10:15:46,939 INFO status has been updated to successful
                                                                                          

Error: File era5_DK1_2023_05.nc is corrupt. Deleting it.
Requesting era5_DK1_2023_06.nc...


2025-11-25 10:15:48,914 INFO Request ID is 8d9005e2-0348-4687-9fcc-a1b7142b829d
2025-11-25 10:15:48,999 INFO status has been updated to accepted
2025-11-25 10:16:04,492 INFO status has been updated to running


KeyboardInterrupt: 
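
In the log above, every request ends with status `successful`, yet each file is then rejected as corrupt, and the output does not show why. One way to narrow this down is to look at the first bytes of a downloaded file before deleting it. The sketch below uses a hypothetical helper, `sniff_format`. It relies on well-known file signatures: classic NetCDF files start with `CDF`, NetCDF-4 files are HDF5 containers, and ZIP archives start with `PK`. Whether the server delivered an archive or some other format here is an assumption that the log does not confirm.

```python
# Leading bytes ("magic numbers") of the container formats of interest
SIGNATURES = {
    b"CDF": "netcdf-classic",
    b"\x89HDF\r\n\x1a\n": "netcdf4/hdf5",
    b"PK\x03\x04": "zip",
}

def sniff_format(path):
    """Guess a file's container format from its first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return "unknown"
```

To use it, call `sniff_format(file_path)` in the validation step before `file_path.unlink()` and print the result. A value other than the two NetCDF labels would indicate that the download itself worked but the payload is not a bare NetCDF file.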