# Example illustrating different forecast evaluation statistics for precipitation

## This notebook is part 1: preprocessing the data. If this has already been done, you can skip to part 2.

## Note, this won't work in Colab because it requires xesmf, which is only installable via conda

## We'll use the heavy rainfall in eastern Colorado from 21-22 June 2023 for the example

In [37]:
import xarray as xr
import pandas as pd
import numpy as np

from herbie import Herbie
import cdsapi

import xesmf as xe
import metpy.calc as mpcalc


In [2]:
vtime = pd.Timestamp(2023,6,22,12) ### valid time

fxx = 24 ### what lead time do we want to evaluate? (in hours)

init = vtime - pd.Timedelta(hours=fxx)   ### this will give us the forecasts from 12 UTC 21 June (a 24-h forecast)
init


Timestamp('2023-06-21 12:00:00')
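
The same subtraction extends to several lead times that share one valid time. The following is a minimal sketch; the list `lead_hours` is a hypothetical choice for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

vtime = pd.Timestamp(2023, 6, 22, 12)   # valid time, as above

# Hypothetical set of lead times (hours) to compare at the same valid time
lead_hours = [12, 24, 36, 48]

# Each initialization time is the valid time minus the lead time
inits = {fxx: vtime - pd.Timedelta(hours=fxx) for fxx in lead_hours}

for fxx, init in inits.items():
    print(f"F{fxx:02d}: initialized {init:%Y-%m-%d %H} UTC")
```

Each entry could then be passed to Herbie in the same way as the single `init` used in the rest of the notebook.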

## get Stage-IV precipitation estimate for 24-h period ending 12 UTC 22 June

#### This is the one dataset that isn't quite as easy to pull from the cloud. I obtained it from https://data.eol.ucar.edu/dataset/21.093 and provide the file.

In [3]:
stage4 = xr.open_dataset("stage4/st4_conus."+vtime.strftime("%Y%m%d%H")+".24h.grb2", engine='cfgrib')
stage4

## let's also use Herbie to get a few different forecast grib files

### reference: https://herbie.readthedocs.io/en/stable/gallery/index.html

### GFS

In [4]:
H = Herbie(init.strftime("%Y-%m-%d %H:%M"), model="gfs", fxx=fxx, product="pgrb2.0p25")


✅ Found ┊ model=gfs ┊ product=pgrb2.0p25 ┊ 2023-Jun-21 12:00 UTC F24 ┊ GRIB2 @ aws ┊ IDX @ aws


#### to see what's in the file:

In [5]:
#H.inventory()

### or, more specifically:
H.inventory(":APCP:surface:")

Unnamed: 0,grib_message,start_byte,end_byte,range,reference_time,valid_time,variable,level,forecast_time,search_this
595,596,427852044,428229938.0,427852044-428229938,2023-06-21 12:00:00,2023-06-22 12:00:00,APCP,surface,18-24 hour acc fcst,:APCP:surface:18-24 hour acc fcst
596,597,428229939,428755555.0,428229939-428755555,2023-06-21 12:00:00,2023-06-22 12:00:00,APCP,surface,0-1 day acc fcst,:APCP:surface:0-1 day acc fcst


#### In the GFS, there are two sets of precipitation variables: the accumulation over the most recent 1/3/6 hours (which can be annoying to work with), and the total precipitation accumulated since the start of the forecast (easier, so we'll use that)

In [6]:
gfs = H.xarray(":APCP:surface:0-1 day acc fcst")

In [7]:
gfs
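
To illustrate how the two kinds of accumulation relate, here is a small sketch with made-up numbers at a single grid point. The values in `run_total` are hypothetical and do not come from the GFS file; the point is only that differencing the run-total accumulation recovers the shorter buckets.

```python
import numpy as np

# Hypothetical run-total precipitation (mm) at one grid point,
# valid at forecast hours 6, 12, 18 and 24
lead_hours = np.array([6, 12, 18, 24])
run_total = np.array([0.0, 1.5, 9.0, 21.5])

# Differencing consecutive totals gives the 6-hour buckets;
# prepend=0 treats the accumulation at forecast hour 0 as zero
six_hourly = np.diff(run_total, prepend=0.0)

for end, amount in zip(lead_hours, six_hourly):
    print(f"{end - 6:02d}-{end:02d} h: {amount:.1f} mm")

# The buckets add back up to the 0-24 h total
print("sum of buckets:", six_hourly.sum())
```

Working with the run total directly, as done above with `0-1 day acc fcst`, avoids this bookkeeping when the evaluation period starts at the initialization time.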

### HRRR

In [8]:
H2 = Herbie(init.strftime("%Y-%m-%d %H:%M"), model="hrrr", fxx=fxx, product="sfc")

✅ Found ┊ model=hrrr ┊ product=sfc ┊ 2023-Jun-21 12:00 UTC F24 ┊ GRIB2 @ aws ┊ IDX @ aws


In [9]:
H2.inventory(":APCP:surface:")

Unnamed: 0,grib_message,start_byte,end_byte,range,reference_time,valid_time,variable,level,forecast_time,search_this
83,84,65036978,66079076.0,65036978-66079076,2023-06-21 12:00:00,2023-06-22 12:00:00,APCP,surface,0-1 day acc fcst,:APCP:surface:0-1 day acc fcst
89,90,66196579,66539127.0,66196579-66539127,2023-06-21 12:00:00,2023-06-22 12:00:00,APCP,surface,23-24 hour acc fcst,:APCP:surface:23-24 hour acc fcst


In [10]:
hrrr = H2.xarray(":APCP:surface:0-1 day acc fcst")

In [11]:
hrrr

### Let's also make a smoothed version of the HRRR for comparison. We'll use metpy's gaussian smoother

In [12]:
hrrr_smooth = hrrr.copy()

hrrr_smooth['tp'] = mpcalc.smooth_gaussian(hrrr.tp, 15).metpy.dequantify()

hrrr_smooth
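
To see what Gaussian smoothing does to a precipitation field, here is a NumPy-only sketch on a synthetic field. It is a stand-in for illustration, not MetPy's implementation: the helper names `gaussian_kernel` and `smooth2d` and the `sigma` value are assumptions, and `sigma` is not equivalent to the `15` passed to `smooth_gaussian` above.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1-D Gaussian weights, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth2d(field, sigma=2.0):
    """Separable Gaussian smoothing: filter along rows, then along columns."""
    k = gaussian_kernel(sigma, radius=int(3 * sigma))
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, field)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

# Synthetic field: a single 50-mm grid point in the middle of a dry domain
field = np.zeros((41, 41))
field[20, 20] = 50.0

smoothed = smooth2d(field)

print("max before / after:", field.max(), smoothed.max())
print("domain total before / after:", field.sum(), smoothed.sum())
```

The peak drops sharply while the domain total is preserved, because the isolated maximum sits far enough from the edges that none of it is lost. That trade-off is why a smoothed forecast can behave quite differently from the raw one in verification statistics.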

### ERA5 (here, the ERA5-Land product)

In [13]:
dataset = "reanalysis-era5-land"
request = {
    'variable': ['total_precipitation'],
    'year': vtime.strftime("%Y"),
    'month': vtime.strftime("%m"),
    'day': [init.strftime("%d"),vtime.strftime("%d")],  ### need more flexible code if crossing month boundary or longer lead times
    'time': ['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', 
             '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', 
             '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', 
             '21:00', '22:00', '23:00'],
    'data_format': 'grib',
    'download_format': 'unarchived',
    'area': [60, -130, 20, -60]
}

target="era5_land_"+vtime.strftime("%Y%m%d%H.grib")

client = cdsapi.Client()
client.retrieve(dataset, request, target)


2024-09-09 16:48:49,375 INFO Request ID is ad86cf98-68fb-4f7d-89cd-b7dbae5c9447
2024-09-09 16:48:49,596 INFO status has been updated to accepted

KeyboardInterrupt
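
The comment in the request above notes that more flexible code is needed when the period crosses a month boundary. One possible approach is sketched below: build one set of `year`/`month`/`day` fields per calendar month. This assumes the request combines its `year`, `month`, and `day` lists independently, so a period spanning two months is safer as two requests. The valid time used here is hypothetical, chosen only to cross a month boundary.

```python
import pandas as pd

# Hypothetical valid time just after a month boundary
vtime = pd.Timestamp(2023, 7, 1, 12)
fxx = 24
init = vtime - pd.Timedelta(hours=fxx)

# Every calendar day touched by the forecast period
days = pd.date_range(init.normalize(), vtime.normalize(), freq="D")

# Group the days by (year, month) so each group can be requested separately
by_month = {}
for d in days:
    by_month.setdefault((d.year, d.month), []).append(d.strftime("%d"))

date_fields = [
    {"year": str(year), "month": f"{month:02d}", "day": day_list}
    for (year, month), day_list in by_month.items()
]

for fields in date_fields:
    print(fields)
```

Each dictionary in `date_fields` could be merged into a copy of `request` and retrieved to its own target file.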



In [14]:
era5 = xr.open_dataset(target, engine='cfgrib')

era5

Ignoring index file 'era5_land_2023062212.grib.923a8.idx' older than GRIB file


#### ERA5-Land precipitation is a little annoying to work with, because it accumulates over each UTC day and then resets at 00 UTC. So for a 12 UTC to 12 UTC period, we need the second half of the first day (the 24-h accumulation minus the 12-h accumulation) and the first half of the second day (the 12-h accumulation), and then we add those together. Here's how we can do that.

In [15]:
vtime_m1 = vtime - pd.Timedelta("36 hours")  ### this will give 00 UTC the day before our valid day

era5_day1 = era5.sel(time=vtime_m1).tp.sel(step='24:00:00') - era5.sel(time=vtime_m1).tp.sel(step='12:00:00')
era5_day2 = era5.sel(time=vtime.strftime("%Y-%m-%d")).tp.sel(step='12:00:00')
era5_sum = era5_day1 + era5_day2

### and set the 'step' back to 24 hours because that's what it is
era5_sum = era5_sum.assign_coords({'step':pd.Timedelta("24 hours")})
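
The arithmetic above can be checked on synthetic data. The sketch below uses made-up hourly values (not ERA5 data) to build accumulations that reset each day at 00 UTC, then confirms that the "24 h minus 12 h, plus next day's 12 h" recipe matches a direct sum over the 12 UTC to 12 UTC window. The helper `acc` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up hourly precipitation (mm) for the hours ending 01..24 UTC on day 1,
# followed by the hours ending 01..24 UTC on day 2
hourly = rng.gamma(0.5, 2.0, size=48)

# Accumulations that reset at 00 UTC: element k-1 is the total from 00 UTC to hour k
day1_cum = np.cumsum(hourly[:24])
day2_cum = np.cumsum(hourly[24:])

def acc(cum, step):
    """Accumulation from 00 UTC through the given step (in hours)."""
    return cum[step - 1]

day1_part = acc(day1_cum, 24) - acc(day1_cum, 12)   # 12-24 UTC on day 1
day2_part = acc(day2_cum, 12)                       # 00-12 UTC on day 2
total = day1_part + day2_part

# Direct sum of the 24 hourly values in the 12 UTC to 12 UTC window
direct = hourly[12:36].sum()

print(f"recipe: {total:.3f} mm, direct sum: {direct:.3f} mm")
```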


## Now, let's regrid all of these datasets to a common grid so it's possible to calculate verification statistics. For this purpose, we're going to use a roughly 4-km (1/24-degree) lat/lon grid, which allows for easy subsetting as well. We'll use xesmf's bilinear interpolation method to do the regridding.

#### here's the grid we want to put all the datasets on. (This is the grid used by the PRISM precipitation dataset.)

In [16]:
### build the target grid that the regridders will interpolate to
ds_out = xr.Dataset(
    {
        "lat": (["lat"], np.arange(24.08333, 49.91667, 0.04166667)),
        "lon": (["lon"], np.arange(-125, -66.499, 0.04166667)),
    }
)
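
As a quick sanity check on the "4-km" label, the sketch below converts the grid spacing from degrees to kilometers. It assumes a spherical Earth with about 111.2 km per degree of latitude, so the numbers are approximate.

```python
import numpy as np

step = 0.04166667                         # grid spacing in degrees, as in ds_out
lat = np.arange(24.08333, 49.91667, step)
lon = np.arange(-125, -66.499, step)

km_per_deg = 111.2                        # approximate length of 1 degree of latitude (assumption)

ns_km = step * km_per_deg                 # north-south spacing, the same everywhere
# East-west spacing shrinks with the cosine of latitude; evaluate at the
# southern and northern edges of the grid
ew_km = step * km_per_deg * np.cos(np.deg2rad(lat[[0, -1]]))

print(f"grid shape (lat, lon): {lat.size} x {lon.size}")
print(f"north-south spacing: {ns_km:.2f} km")
print(f"east-west spacing: {ew_km[0]:.2f} km (south edge) to {ew_km[1]:.2f} km (north edge)")
```

So "4 km" is a nominal value: the spacing is fixed in degrees, and the physical distance between grid points varies with latitude.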

#### create the xesmf regridders
#### (this will be slow the first time; if using these more than once, see here for how to reuse them and save time: https://xesmf.readthedocs.io/en/latest/notebooks/Reuse_regridder.html)

In [17]:
regridder_gfs = xe.Regridder(gfs, ds_out, "bilinear")
fn = regridder_gfs.to_netcdf()

regridder_hrrr = xe.Regridder(hrrr, ds_out, "bilinear")
fn2 = regridder_hrrr.to_netcdf()

regridder_era5 = xe.Regridder(era5, ds_out, "bilinear")
fn3 = regridder_era5.to_netcdf()

regridder_st4 = xe.Regridder(stage4, ds_out, "bilinear")
fn4 = regridder_st4.to_netcdf()


#### do the regridding

In [18]:
gfs_regrid = regridder_gfs(gfs)
hrrr_regrid = regridder_hrrr(hrrr)
hrrr_smooth_regrid = regridder_hrrr(hrrr_smooth)
era5_regrid = regridder_era5(era5_sum)
stage4_regrid = regridder_st4(stage4)

#### and one last thing -- ERA5 precipitation comes in meters, so let's convert that to mm
era5_regrid = era5_regrid*1000.
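
For intuition about what bilinear interpolation does, here is a NumPy-only sketch for a single target point on a regular lat/lon source grid. This is not how xesmf works internally (it computes and applies interpolation weights, and also handles curvilinear grids like the HRRR's); the function `bilinear` and the synthetic field are assumptions for illustration.

```python
import numpy as np

def bilinear(src_lat, src_lon, src, lat, lon):
    """Bilinear interpolation of src (lat x lon, ascending coordinates) to one point."""
    # Index of the source grid cell's lower-left corner
    i = np.clip(np.searchsorted(src_lat, lat) - 1, 0, src_lat.size - 2)
    j = np.clip(np.searchsorted(src_lon, lon) - 1, 0, src_lon.size - 2)
    # Fractional position of the target point inside that cell (0 to 1)
    ty = (lat - src_lat[i]) / (src_lat[i + 1] - src_lat[i])
    tx = (lon - src_lon[j]) / (src_lon[j + 1] - src_lon[j])
    # Weighted average of the four surrounding source values
    return ((1 - ty) * (1 - tx) * src[i, j] + (1 - ty) * tx * src[i, j + 1]
            + ty * (1 - tx) * src[i + 1, j] + ty * tx * src[i + 1, j + 1])

# Synthetic 0.25-degree source grid with a field that varies linearly in space
src_lat = np.arange(38.0, 42.01, 0.25)
src_lon = np.arange(-106.0, -101.99, 0.25)
lat2d, lon2d = np.meshgrid(src_lat, src_lon, indexing="ij")
src = 0.5 * (lat2d - 38.0) + 1.0 * (lon2d + 106.0)

value = bilinear(src_lat, src_lon, src, lat=39.9, lon=-103.6)
exact = 0.5 * (39.9 - 38.0) + 1.0 * (-103.6 + 106.0)
print(value, exact)
```

Bilinear interpolation reproduces a linearly varying field exactly, but it smooths sharp local maxima, which matters for heavy-rainfall cases when a coarse field is moved to a finer grid or vice versa.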

### now, let's combine these all into one xarray dataset, which can then be subset down to a region of eastern Colorado that we can focus on

In [19]:
data_all = xr.Dataset(data_vars = dict(
    gfs = (gfs_regrid.tp),
    hrrr = (hrrr_regrid.tp),    
    hrrr_smooth = (hrrr_smooth_regrid.tp),
    era5 = (era5_regrid),
    stage4 = (stage4_regrid.tp),
))
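
Because the common grid has one-dimensional `lat` and `lon` coordinates, subsetting to a region is a matter of slicing. The sketch below uses a placeholder dataset on the same grid; the bounding box is a hypothetical choice for eastern Colorado and does not come from this notebook.

```python
import numpy as np
import xarray as xr

# Placeholder dataset on the same grid as ds_out (all zeros, for illustration only)
lat = np.arange(24.08333, 49.91667, 0.04166667)
lon = np.arange(-125, -66.499, 0.04166667)
demo = xr.Dataset(
    {"stage4": (("lat", "lon"), np.zeros((lat.size, lon.size)))},
    coords={"lat": lat, "lon": lon},
)

# Hypothetical bounding box for eastern Colorado (degrees north / degrees east)
lat_bounds = slice(37.0, 41.0)
lon_bounds = slice(-105.5, -102.0)

subset = demo.sel(lat=lat_bounds, lon=lon_bounds)
print(subset.sizes)
```

The same `.sel(...)` call would apply to `data_all`, since all of its variables share these coordinates.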


## Let's write this dataset out to netCDF now, so we can use it for other purposes later

In [20]:
data_all.to_netcdf("precip_data_preproc_"+vtime.strftime("%Y%m%d%H")+".nc")

## now, go on to the next notebook, 'precip_verif_example.ipynb' !