In [None]:
%load_ext autoreload
%autoreload 2

# Fix ripple in error against lead time plot

When plotting error against lead time for GDPS, there is a ripple in the error that is not fully explained.
On average, we expect the error against lead time to grow monotonically.
Thus, we would like to explain why this ripple is visible.

In [None]:
import dask
import dask.array as da
import dask.dataframe as dd
import dask.distributed
import dask_jobqueue
import numpy as np
import os
import pandas as pd
import pathlib
import plotly.express as px

## Load dataset in memory

In [None]:
DATA_DIR = pathlib.Path(os.getenv('DATA_DIR'))
INPUT_DATASET = DATA_DIR / 'interpolated/2021-12-20-test/'

In [None]:
input_path = pathlib.Path(INPUT_DATASET)

In [None]:
sample_path = next(iter(input_path.glob('*.parquet')))

In [None]:
sample = pd.read_parquet(sample_path)

In [None]:
columns = set(sample.columns)
columns -= set(['gdps_hpbl'])

In [None]:
cluster = dask_jobqueue.SLURMCluster(
    env_extra=['source ~/.bash_profile','conda activate smc01'],
    name='smc01-dask',
)

In [None]:
cluster.scale(jobs=8)

In [None]:
client = dask.distributed.Client(cluster)

In [None]:
client

In [None]:
# Pass the columns as a list: `columns` is a set, which has no defined order.
df = dd.read_parquet(list(input_path.glob('*.parquet')), columns=sorted(columns))


In [None]:
df.head()

In [None]:
df = df.reset_index()
df['step_hour'] = df['step'] / 3600
df['error_2t'] = df['obs_2t'] - df['gdps_2t']
df['squared_error_2t'] = (df['gdps_2t'] - df['obs_2t']) ** 2
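# Note: per row, the square root of the squared error is the absolute error, not an aggregated RMSE.
df['rmse_2t'] = da.sqrt(df['squared_error_2t'])
df['mabs_2t'] = np.abs(df['error_2t'])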
df['forecast_month'] = df['date'].dt.month
df['forecast_hour'] = df['date'].dt.hour

In [None]:
df = df.set_index('date')
df['step_td'] = dd.to_timedelta(df['step'], unit='s')  # lowercase 's': the uppercase alias is deprecated in recent pandas
df['valid_hour'] = (df.index + df['step_td']).dt.hour

## Visualize the ripple

In [None]:
df.columns

In [None]:
df['step_hour'].head()

In [None]:
error_by_step = df.groupby(['forecast_hour', 'step_hour']).agg({'squared_error_2t': 'mean', 'index': 'count', 'valid_hour': 'mean'}).compute()
error_by_step = error_by_step.reset_index().rename(columns={'index': 'obs_count'})

In [None]:
error_by_step['rmse_2t'] = np.sqrt(error_by_step['squared_error_2t'])

In [None]:
px.line(data_frame=error_by_step, x='step_hour', y='rmse_2t', color='forecast_hour')

* The ripple is still visible.
* At first, the ripple differs between the 0h forecast and the 12h forecast: the 0h forecast has more error in its early hours.
* As the lead time increases, the phase difference between the two ripples disappears and both error curves seem to synchronize.
* The ripple has an amplitude of about 0.5 degree at worst.
    * This is 5% to 25% of the total error value.
* The ripple looks like a sinusoidal signal with a 24 h period.
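
One way to put numbers on these observations is to fit a linear trend plus a 24 h sinusoid to the RMSE curve by least squares. The sketch below is only an illustration on synthetic data: the helper name `fit_daily_ripple` and the synthetic values are assumptions, not results from the dataset.

```python
import numpy as np


def fit_daily_ripple(step_hour, rmse, period=24.0):
    """Least-squares fit of rmse ~ a + b*t + c*cos(w*t) + d*sin(w*t)."""
    t = np.asarray(step_hour, dtype=float)
    y = np.asarray(rmse, dtype=float)
    w = 2 * np.pi / period
    design = np.column_stack([np.ones_like(t), t, np.cos(w * t), np.sin(w * t)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Amplitude of the sinusoidal component, independent of its phase.
    amplitude = float(np.hypot(coef[2], coef[3]))
    return float(coef[0]), float(coef[1]), amplitude


# Synthetic curve (made-up values): linear error growth plus a daily ripple.
steps = np.arange(0, 241, 3)
synthetic_rmse = 1.5 + 0.01 * steps + 0.25 * np.sin(2 * np.pi * steps / 24)
intercept, slope, amplitude = fit_daily_ripple(steps, synthetic_rmse)
```

On the real data, the same function could be applied to `error_by_step` separately for each `forecast_hour`, passing the `step_hour` and `rmse_2t` columns.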

## Hypotheses

We have a few ideas on how to explain the phenomenon. This notebook will try and validate them one by one.

### Hypothesis 1: an underlying wave in the observation database causes the ripple

Here we suppose that there is an underlying wave in the observation signal from our database.
The 24 h period would indicate that there is a daily cycle in the quantity and quality of observations we get.
This makes sense since some stations could stop making observations at night.

To fully explain the ripple, we need to show that:

* there is a ripple in the observation signal which matches the error ripple
* the stations that turn on and off daily have an average error that is different from the others.

In other words, our average error changes through the day because the stations we make predictions for are more difficult to predict depending on the time of day.
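
The first condition could be checked by correlating the observation count with the RMSE once the slow trend along the lead time is removed from both. The sketch below uses synthetic data; the helper `detrended_correlation` is hypothetical, and the column names simply mirror those of `error_by_step`.

```python
import numpy as np
import pandas as pd


def detrended_correlation(frame, x='obs_count', y='rmse_2t', t='step_hour'):
    """Pearson correlation of two columns after removing a linear trend in t."""
    time = frame[t].to_numpy(dtype=float)

    def residual(column):
        values = frame[column].to_numpy(dtype=float)
        slope, intercept = np.polyfit(time, values, deg=1)
        return values - (slope * time + intercept)

    return float(np.corrcoef(residual(x), residual(y))[0, 1])


# Synthetic example (made-up values): counts and error oscillate in opposition.
steps = np.arange(0, 240, 3)
daily = np.cos(2 * np.pi * steps / 24)
toy = pd.DataFrame({
    'step_hour': steps,
    'obs_count': 1_100_000 - 40 * steps + 15_000 * daily,
    'rmse_2t': 1.5 + 0.01 * steps - 0.25 * daily,
})
correlation = detrended_correlation(toy)
```

A strong correlation would only establish that the two ripples match. It would say nothing about the second condition, which needs the per-station error.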

In [None]:
error_by_step_melt = error_by_step.melt(id_vars=['forecast_hour', 'step_hour'], value_vars=['rmse_2t', 'obs_count'])

In [None]:
error_by_step_melt.head()

In [None]:
fig = px.line(data_frame=error_by_step_melt, x='step_hour', y='value', facet_row='variable', color='forecast_hour', height=600)
fig.update_yaxes(matches=None)

* There is a daily cycle of observation count.
* The amplitude of the cycle is about 30k observations, out of a total of 1.1M. That is an amplitude of ~3% of the total signal.
    * Can 3% of the observations cause a variation in the error signal of up to 25%?
* There is an overall decline in the number of observations as we move through the step hours.
    * The decline is about 10k observations.
    * This could simply be because we don't have corresponding observations for the latest forecasts of the year.
* There seems to be a synchronization between the number of observations and the average error. 
    * The error for the 0h forecast is high when we have few observations, and low when we have more observations.

Are those two different phenomena that are both connected to the daily cycle? It seems unlikely that such a small share of the observations is responsible for such an oscillation in the error.
I will try to measure the impact of the oscillating stations on the error.
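
A back-of-the-envelope calculation gives a sense of how implausible this is. Assume, purely for illustration, that the permanent stations have a constant RMSE and that the intermittent stations contribute a fraction `f` of the observations at the peak and none at the trough. The pooled mean squared error at the peak is then `(1 - f) * base**2 + f * x**2`, which can be solved for the RMSE `x` the intermittent stations would need. The function name is hypothetical.

```python
import math


def required_rmse_ratio(fraction, pooled_increase):
    """Ratio x / base needed for the pooled RMSE to rise by `pooled_increase`.

    Solves (1 - f) * base**2 + f * x**2 = ((1 + increase) * base)**2 for x / base.
    """
    target = (1.0 + pooled_increase) ** 2
    return math.sqrt((target - (1.0 - fraction)) / fraction)


# 3% of the observations, pooled RMSE moving by 25% and by 5%.
ratio_25 = required_rmse_ratio(0.03, 0.25)
ratio_5 = required_rmse_ratio(0.03, 0.05)
```

Under these assumptions, the intermittent stations would need an RMSE roughly 4.4 times larger than the others to move the pooled RMSE by 25%, and roughly 2.1 times larger to move it by 5%.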

In [None]:
midnight_forecast = df[df['forecast_hour'] == 0]

In [None]:
# Step hour 6 is the trough in observation count; step hour 18 is close to the peak.
hour_6_forecasts = midnight_forecast[midnight_forecast['step_hour'] == 6].groupby('station').agg({'index': 'count'}).rename(columns={'index': 'obs_count'}).compute()

In [None]:
midnight_forecast['step_hour'].value_counts().compute()

In [None]:
hour_18_mask = (midnight_forecast['step_hour'] > 17.0) & (midnight_forecast['step_hour'] < 19.0)

In [None]:
hour_18_forecasts = midnight_forecast[hour_18_mask].groupby('station').agg({'index': 'count'}).rename(columns={'index': 'obs_count'}).compute()

In [None]:
hour_6_forecasts.head()

In [None]:
hour_18_forecasts.head()

In [None]:
merged_6_18_forecasts = hour_6_forecasts.merge(hour_18_forecasts, how='outer', on='station', suffixes=('_6h', '_18h'))

In [None]:
merged_6_18_forecasts['delta'] = np.abs(merged_6_18_forecasts['obs_count_18h'] - merged_6_18_forecasts['obs_count_6h'])

In [None]:
merged_6_18_forecasts.sort_values('delta', ascending=False)

In [None]:
merged_6_18_forecasts.isna().sum()

In [None]:
merged_6_18_forecasts.describe()

In [None]:
moving_stations = set(merged_6_18_forecasts[merged_6_18_forecasts['delta'] > 30].index)

The `moving_stations` set contains the stations whose number of observations differs by more than 30 between the peak and the trough of the daily observation cycle.
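
One caveat with the outer merge: a station that reports at only one of the two hours gets a `NaN` count on the other side, so its `delta` is `NaN` and the `> 30` test silently excludes it. Those may be exactly the stations that switch off at night. The sketch below illustrates this on made-up station codes and counts. It uses `join` on the index rather than `merge`, and treating a missing count as zero is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical per-station observation counts at the trough and at the peak.
hour_6 = pd.DataFrame(
    {'obs_count': [300, 10]},
    index=pd.Index(['AAAA', 'BBBB'], name='station'),
)
hour_18 = pd.DataFrame(
    {'obs_count': [305, 200, 150]},
    index=pd.Index(['AAAA', 'BBBB', 'CCCC'], name='station'),
)

merged = hour_6.join(hour_18, how='outer', lsuffix='_6h', rsuffix='_18h')

# Without filling, station CCCC has a NaN delta and never passes the threshold.
raw_delta = (merged['obs_count_18h'] - merged['obs_count_6h']).abs()
moving_raw = set(merged[raw_delta > 30].index)

# Treating a missing count as zero observations keeps CCCC in the set.
filled = merged.fillna(0)
filled_delta = (filled['obs_count_18h'] - filled['obs_count_6h']).abs()
moving_filled = set(filled[filled_delta > 30].index)
```

The output of `merged_6_18_forecasts.isna().sum()` above shows how many stations would be affected by this choice.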

In [None]:
midnight_forecast['daily_cycle'] = midnight_forecast['station'].isin(moving_stations)

In [None]:
midnight_forecast_download = midnight_forecast[['index', 'squared_error_2t', 'daily_cycle', 'step_hour', 'forecast_hour']].compute()

In [None]:
error_by_step_cycle = midnight_forecast_download.groupby(['forecast_hour', 'step_hour', 'daily_cycle']).agg({'squared_error_2t': 'mean', 'index': 'count'})
error_by_step_cycle['rmse_2t'] = np.sqrt(error_by_step_cycle['squared_error_2t'])

In [None]:
error_by_step_cycle.head()

In [None]:
error_by_step_cycle = error_by_step_cycle.reset_index()


In [None]:
error_by_step_cycle.head()

In [None]:
px.line(data_frame=error_by_step_cycle, x='step_hour', y='rmse_2t', color='daily_cycle')

The daily ripple is still distinctly visible even when we separate the stations that have a large daily cycle in observation count from those that don't.
This suggests that the variation in the underlying observations does not fully explain the ripple.

### Hypothesis 2: the average error is larger at night

In [None]:
cyul = df[df['station'] == 'CYUL']

In [None]:
cyul_error = cyul.groupby(['forecast_hour', 'step_hour']).agg({'squared_error_2t': 'mean', 'index': 'count', 'valid_hour': 'mean', 'error_2t': 'mean'}).compute()

In [None]:
cyul_error = cyul_error.reset_index()

In [None]:
cyul_error['rmse_2t'] = np.sqrt(cyul_error['squared_error_2t'])

In [None]:
cyul.head()

In [None]:
px.line(data_frame=cyul_error, x='step_hour', y='rmse_2t', color='forecast_hour')

The ripple seems to exist inside the station data itself.
It seems to be harder to forecast at night than it is during the day.
So the solution would simply be to:
- compare only values from the same step for a given station
- aggregate the error data in bins of 24 hours
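
The effect of the second option can be sketched on synthetic data: when the ripple has a 24 h period, averaging the squared error over whole days cancels it and leaves only the underlying growth. All values below are made up, and each step is weighted equally, whereas a groupby on the real data weights by observation count.

```python
import numpy as np
import pandas as pd

# Synthetic mean squared error: steady growth plus a 24 h ripple.
step_hour = np.arange(0, 240, 3)
squared_error = (1.5 + 0.01 * step_hour) ** 2 + 0.5 * np.sin(2 * np.pi * step_hour / 24)
toy = pd.DataFrame({'step_hour': step_hour, 'squared_error_2t': squared_error})

# Per-step RMSE still shows the ripple.
step_rmse = np.sqrt(toy['squared_error_2t'])

# Average the squared error over 24 h bins first, then take the square root.
toy['step_days'] = toy['step_hour'] // 24
daily_rmse = np.sqrt(toy.groupby('step_days')['squared_error_2t'].mean())
```

Here `step_rmse` is not monotonic while `daily_rmse` is, at the cost of reducing the lead-time resolution to one point per day.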

If we do decide to aggregate the error in bins of 24 hours, it gives something like this:

In [None]:
df['step_td'].dt.days.head()

In [None]:
df['step_days'] = df['step_td'].dt.days

In [None]:
error_by_step = df.groupby(['forecast_hour', 'step_days']).agg({'squared_error_2t': 'mean', 'index': 'count', 'valid_hour': 'mean'}).compute()
error_by_step['rmse_2t'] = np.sqrt(error_by_step['squared_error_2t'])

In [None]:
error_by_step = error_by_step.reset_index()

In [None]:
px.line(data_frame=error_by_step, x='step_days', y='rmse_2t', color='forecast_hour')

OK, to me this problem is solved.
Now, this means that if the systematic biases for other variables (wind, for instance) differ from those of temperature, we will need to adjust our validation graphs for those variables.
But compensating for the diurnal cycle everywhere seems to be a good idea.
There is also the idea of using the RPSS or a similar skill score, where we measure the average improvement over the forecast instead of the error metric itself.
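
The RPSS itself is defined for probabilistic forecasts over categories, so for a deterministic temperature forecast a "similar" score could be an MSE-based skill score, `1 - mse_forecast / mse_reference`. The sketch below is one possible reading of that idea, not an established part of this analysis. The function name and the toy values are hypothetical, with the raw GDPS forecast playing the role of the reference.

```python
import numpy as np


def mse_skill_score(forecast, reference, observed):
    """MSE skill score: 1 is perfect, 0 matches the reference, negative is worse."""
    forecast, reference, observed = (
        np.asarray(values, dtype=float) for values in (forecast, reference, observed)
    )
    mse_forecast = np.mean((forecast - observed) ** 2)
    mse_reference = np.mean((reference - observed) ** 2)
    return float(1.0 - mse_forecast / mse_reference)


# Toy values: the reference is off by 1 degree, the corrected forecast by 0.5.
observed = [0.0, 1.0, 2.0, 3.0]
reference = [1.0, 2.0, 3.0, 4.0]
corrected = [0.5, 1.5, 2.5, 3.5]
skill = mse_skill_score(corrected, reference, observed)
```

Because both error terms share the same diurnal cycle, a ratio of this kind would be expected to show much less ripple than the raw error, although that remains to be verified on the data.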