# NOAA Global Forecast System (GFS) quickstart notebook on AWS
This quickstart notebook demonstrates how to work with data from the NOAA Global Forecast System (GFS) hosted on the [Registry of Open Data on AWS](https://registry.opendata.aws/noaa-gfs-bdp-pds/). The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP).  
This quickstart notebook covers:
1. Data Access - how to list and download data from the Registry of Open Data on AWS
2. Data Inspecting - how to inspect the data by printing the variables, etc.
3. Data Preprocessing - convert longitudes to the -180 to 180 degree range and temperature from K to °C, keep only data for a specific area, and export the result as Parquet.

This notebook is designed to run in [Amazon SageMaker Studio](https://aws.amazon.com/sagemaker-ai/studio/). You can also run it in [SageMaker Studio Lab](https://studiolab.sagemaker.aws/) if you don't have an AWS account.

## Dependencies installation

In [None]:
%pip install --quiet cfgrib xarray boto3 pandas pyarrow plotly

## 1. Data Access

NOAA GFS data is made available by the [Registry of Open Data on AWS](https://registry.opendata.aws/noaa-gfs-bdp-pds/) in a public S3 bucket.  
Let's list all objects for a given date and forecast cycle.

In [None]:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# --- Configuration ---
GFS_BUCKET_NAME = "noaa-gfs-bdp-pds"
FORECAST_DATE = "20250708"
# Forecasts are made available four times per day, every 6 hours starting at midnight UTC.
FORECAST_CYCLE = "00"
# GFS couples four separate models (atmosphere, ocean, land/soil, and sea ice)
# that work together to depict weather conditions
FORECAST_MODEL = "atmos"

folder = f"gfs.{FORECAST_DATE}/{FORECAST_CYCLE}/{FORECAST_MODEL}/"
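The date and cycle above are hard-coded. As a minimal sketch, the most recent cycle start for a given UTC time could be derived as shown below. The helper name `latest_cycle` is hypothetical, and the sketch assumes the four cycles start at 00, 06, 12 and 18 UTC; it does not account for the delay between a cycle start and the moment its files appear in the bucket.

```python
from datetime import datetime, timezone

# Assumption: GFS cycles start at 00, 06, 12 and 18 UTC
CYCLE_HOURS = (0, 6, 12, 18)


def latest_cycle(now_utc: datetime) -> tuple[str, str]:
    """Return (date, cycle) strings for the latest cycle start at or before now_utc."""
    cycle_hour = max(h for h in CYCLE_HOURS if h <= now_utc.hour)
    return now_utc.strftime("%Y%m%d"), f"{cycle_hour:02d}"


example_date, example_cycle = latest_cycle(
    datetime(2025, 7, 8, 13, 30, tzinfo=timezone.utc)
)
print(f"gfs.{example_date}/{example_cycle}/atmos/")  # prints gfs.20250708/12/atmos/
```

If the folder for the returned cycle is still empty, falling back to the previous cycle is a reasonable option.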

In [None]:
try:
    response = s3.list_objects_v2(Bucket=GFS_BUCKET_NAME, Prefix=folder, Delimiter='/')

    # Print common prefixes (subfolders)
    if 'CommonPrefixes' in response:
        for prefix in response['CommonPrefixes']:
            print(prefix['Prefix'])

    # Print object keys
    if 'Contents' in response and response['Contents']:
        for obj in response['Contents']:
            print(obj['Key'])
    else:
        print("No objects found in the folder.")
except Exception as e:
    print(f"Error accessing S3: {e}")
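A single `list_objects_v2` call returns at most 1,000 keys, so a large folder may be listed only partially. The sketch below shows one way to collect every key with the boto3 paginator. The helper name `list_all_keys` is hypothetical, and the function expects an S3 client such as the `s3` object created earlier.

```python
def list_all_keys(client, bucket, prefix):
    """Collect every object key directly under prefix, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # "Contents" is absent from a page that holds no objects
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage in this notebook (requires network access):
# all_keys = list_all_keys(s3, GFS_BUCKET_NAME, folder)
# print(len(all_keys))
```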

After identifying the required files, the next step is to download them to your local environment. Let's download a single forecast file from the dataset.

In [None]:
# Standard GFS model output in GRIB2 format
# containing a wide range of atmospheric forecast variables at multiple pressure levels and resolutions.
FILE_TYPE = "pgrb2"
# Current operational GFS uses a base horizontal resolution of 0.25 degrees (about 28 km between grid points) 
# for the first 10 days of the forecast
GRID_RESOLUTION = "0p25" 
# Number of hours into the future from the model’s initialization time for which a forecast is provided
FORECAST_HOUR = "017"

object_key = f"gfs.t{FORECAST_CYCLE}z.{FILE_TYPE}.{GRID_RESOLUTION}.f{FORECAST_HOUR}"
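To make the naming pattern explicit, the sketch below builds the full S3 key from its parts and computes the time a forecast is valid for (initialization time plus forecast hour). The helper names `gfs_object_key` and `valid_time` are hypothetical; the pattern is simply the one used in this notebook and is not checked against the bucket.

```python
from datetime import datetime, timedelta, timezone


def gfs_object_key(date, cycle, forecast_hour,
                   model="atmos", file_type="pgrb2", resolution="0p25"):
    """Build the full S3 key following the naming pattern used in this notebook."""
    return (
        f"gfs.{date}/{cycle}/{model}/"
        f"gfs.t{cycle}z.{file_type}.{resolution}.f{int(forecast_hour):03d}"
    )


def valid_time(date, cycle, forecast_hour):
    """Return the UTC datetime the forecast applies to."""
    init = datetime.strptime(date + cycle, "%Y%m%d%H").replace(tzinfo=timezone.utc)
    return init + timedelta(hours=int(forecast_hour))


print(gfs_object_key("20250708", "00", 17))
# prints gfs.20250708/00/atmos/gfs.t00z.pgrb2.0p25.f017
print(valid_time("20250708", "00", 17).isoformat())
# prints 2025-07-08T17:00:00+00:00
```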

In [None]:
remote_object_key = folder+object_key
local_file_name = object_key
s3.download_file(GFS_BUCKET_NAME, remote_object_key, local_file_name)
print(f"File {object_key} downloaded successfully.")
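GRIB2 files at this resolution are large, so re-running the cell downloads the same data again. A minimal sketch of a guard that skips the download when a non-empty local copy exists is shown below. The helper name `download_if_missing` is hypothetical, and the check looks only at file size; it does not verify that the file is complete.

```python
from pathlib import Path


def download_if_missing(client, bucket, key, destination):
    """Download key to destination unless a non-empty file is already there.

    Returns True if a download happened, False if the local copy was reused.
    """
    path = Path(destination)
    if path.exists() and path.stat().st_size > 0:
        return False
    client.download_file(bucket, key, str(path))
    return True


# Usage in this notebook (requires network access):
# download_if_missing(s3, GFS_BUCKET_NAME, remote_object_key, local_file_name)
```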

## 2. Data Inspecting
Open the dataset with xarray, using the `engine='cfgrib'` option to read GRIB files.

In [None]:
import xarray as xr

# typeOfLevel is the vertical coordinate or reference surface of each variable in the dataset.
# "surface" selects fields defined at the ground or water surface, such as surface temperature, precipitation, etc.
TYPE_OF_LEVEL = "surface"
# Type of time step or time range associated with a forecast variable
STEP_TYPE = "instant"

ds = xr.open_dataset(object_key, engine='cfgrib',
    filter_by_keys={'typeOfLevel': TYPE_OF_LEVEL, 'stepType': STEP_TYPE})

We can subset our dataset to only include data over Canada.

In [None]:
# Obtained from https://nominatim.openstreetmap.org/search?q=CA&format=json
lat_min, lat_max = 41.6765597, 83.3362128
lon_min, lon_max = -141.0027500, -52.3237664

# Adjust longitude range as the dataset uses 0–360 instead of -180–180
if ds.longitude.max() > 180:
    lon_min = (lon_min + 360) % 360
    lon_max = (lon_max + 360) % 360

canada_ds = ds.sel(latitude=slice(lat_max, lat_min), longitude=slice(lon_min, lon_max))
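The single longitude slice works for Canada because the box lies entirely west of the prime meridian. A box that spans 0° longitude wraps around the 0/360 seam and needs two slices. The sketch below illustrates the conversion with a hypothetical helper, `lon_ranges_0_360`, and assumes the box is narrower than 360 degrees and is given from west to east.

```python
def lon_ranges_0_360(lon_min, lon_max):
    """Convert a west-to-east interval in -180..180 to one or two 0..360 intervals."""
    lo, hi = lon_min % 360, lon_max % 360
    if lo <= hi:
        return [(lo, hi)]
    # The box crosses the prime meridian: split it at the 0/360 seam
    return [(lo, 360.0), (0.0, hi)]


print(lon_ranges_0_360(-141.0, -52.5))  # prints [(219.0, 307.5)]
print(lon_ranges_0_360(-10.0, 30.0))    # prints [(350.0, 360.0), (0.0, 30.0)]
```

Each returned interval could then be passed to `ds.sel(longitude=slice(lo, hi))`, and the pieces joined with `xr.concat(..., dim="longitude")`.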

Let's explore the variables of this dataset.

In [None]:
import pandas as pd
from IPython.display import display

variable_rows = []
for variable in canada_ds:
    variable_rows.append({
        "Variable": variable,
        "Long Name": canada_ds[variable].attrs["long_name"],
        "Units": canada_ds[variable].attrs["units"]
    })

variable_df = pd.DataFrame(variable_rows)
display(variable_df)

Let's explore the attributes of the `t` (Temperature) variable.  

In [None]:
display(canada_ds['t'])

Let's plot the wind gust speed (the `gust` variable).

In [None]:
import plotly.express as px

wind_df = canada_ds.get(["gust"]).to_dataframe()
# Moving latitude and longitude from index to columns
wind_df = wind_df.reset_index()

fig = px.scatter_geo(
    wind_df,
    lat="latitude",
    lon="longitude",
    color="gust",
    labels={"gust": "Wind gust (m/s)"},
    color_continuous_scale=px.colors.sequential.Viridis
)
fig.update_geos(fitbounds="locations")
fig.update_layout(title="Wind gust speed in meters per second (m/s)", title_x=0.5)
fig.show()

## 3. Data Preprocessing

When we explored the attributes of the `t` variable, we noticed that:
- Longitudes are expressed in a 0 to 360 degree range rather than the more common -180 to 180 degree range
- The unit is Kelvin (K) rather than degrees Celsius (°C)

Let's transform these two attributes.

In [None]:
temperature_df = canada_ds.get(["t"]).to_dataframe()
temperature_df = temperature_df.rename(columns={"t": "temperature_kelvin"})

# Convert temperature from Kelvin to Celsius
temperature_df["temperature_celsius"] = temperature_df["temperature_kelvin"] - 273.15

# Move latitude and longitude from index of the dataframe to column
# Convert longitude expressed in a 0 to 360 degrees range to the standard -180 to 180 degrees
temperature_df["latitude"] = temperature_df.index.get_level_values("latitude")
longitude = temperature_df.index.get_level_values("longitude")
standard_longitude = longitude.map(lambda lon: lon - 360 if lon > 180 else lon)
temperature_df["longitude"] = standard_longitude

# Reset index and select columns
temperature_df = temperature_df.reset_index(drop=True)
temperature_df = temperature_df[["time", "latitude", "longitude", "temperature_celsius"]]
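As an alternative to the per-element `lambda` above, the longitude conversion can be written as a single modulo expression. The sketch below uses a hypothetical helper, `to_minus180_180`, on plain floats; the same arithmetic should apply elementwise to NumPy arrays and pandas indexes. One difference to be aware of: this formula maps exactly 180 to -180, whereas the `lambda` above leaves 180 unchanged.

```python
def to_minus180_180(lon):
    """Map a longitude from the 0..360 range to the -180..180 range."""
    return (lon + 180.0) % 360.0 - 180.0


print([to_minus180_180(x) for x in (0.0, 179.75, 180.0, 219.0, 359.75)])
# prints [0.0, 179.75, -180.0, -141.0, -0.25]
```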

We can now export the data in Parquet format.  
Storing the preprocessed NOAA GFS data in Parquet format provides:
- efficient storage: in the following example we compress the data using the `snappy` algorithm
- fast access to specific data slices: Parquet is a columnar format, so we can read only the required columns, minimizing I/O and memory usage
- scalability for future analyses

You can then reload and analyze only the data you need for downstream analytics or machine learning.

In [None]:
temperature_df.to_parquet("preprocessed_gfs.parquet", compression='snappy')

We can load only the specific columns we need from the Parquet file we just wrote.

In [None]:
# Load only the time and temperature columns (the data covers the Canada subset)
canada_temperature_df = pd.read_parquet('preprocessed_gfs.parquet', columns=['time', 'temperature_celsius'])
# Compute the mean temperature for each model initialization time
mean_temp = canada_temperature_df.groupby('time')['temperature_celsius'].mean()
print(mean_temp)