# Understanding Historical Climate Data on the Analytics Engine
This notebook walks through how to use the different kinds of historical climate data available on the Analytics Engine: weather observations, reanalysis products, and climate model output.

* Weather observations are point measurements tied to a single station location, and represent the actual recorded values of weather variables. They provide highly localized weather information and are limited by instrumentation constraints.
* Reanalysis products are reconstructions of the weather over the historical observation period. Like a climate model, a reanalysis product provides a complete set of atmospheric and surface weather variables on a full spatial grid. Because reanalysis synthesizes many sources of observations and uses simulations to produce a continuous output, it should closely resemble observations from weather stations but should not be expected to match them exactly.

* Climate models run over the historical period represent the *general conditions* during a time period, but *do not reproduce specific events from the historical record*. Each simulation from a climate model is a free-running simulation that produces its own weather events and internal variability. The variability across individual climate model realizations is what allows us to estimate the range of potential future conditions. In the historical period, the variability across realizations represents the range of conditions that could have occurred.

**Intended Application**: As a user, I want to be able to understand the <span style="color:red">**strengths and weaknesses of comparing historical observations, reanalysis, and model output**</span> by:
1. Visualizing observations compared to reanalysis
2. Visualizing observations compared to climate model output

**Runtime**: With the default settings, this notebook takes **less than 1 minute** to run from start to finish. Modifications to selections may increase the runtime.

### Step 0: Set-up

In [None]:
import climakitae as ck
from climakitae.core.data_interface import DataParameters
import xarray as xr
import pandas as pd
import numpy as np

import hvplot.pandas  # noqa
import hvplot.xarray
pd.options.plotting.backend = 'holoviews'

### Step 1: Select data
#### 1a) Climate model data
First, we retrieve precipitation data from LOCA2-Hybrid models for a single location, using multiple ensemble members from both CESM2-LENS and EC-Earth3.

In [None]:
selections = DataParameters()

selections.downscaling_method = 'Statistical'
selections.variable = 'Precipitation (total)'
selections.timescale = 'monthly'
selections.units = 'inches'
selections.resolution = '3 km'
selections.time_slice = (1980, 2001) # trim to the time period that overlaps between models, reanalysis, and station observations
selections.area_subset = 'lat/lon'
selections.cached_area = ['coordinate selection']
selections.area_average = 'No'
selections.latitude = (34.067 - 0.02, 34.067 + 0.02) # specifically at station coordinates, with small buffer
selections.longitude = (-117.65 - 0.02, -117.65 + 0.02) # specifically at station coordinates, with small buffer
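
The latitude and longitude ranges above are simply the station coordinates plus and minus a small buffer. As a minimal sketch, a hypothetical helper (`bounding_box` is an illustrative name, not part of climakitae) can make that intent explicit and keep the two ranges consistent:

```python
def bounding_box(lat, lon, buffer_deg):
    """Return (lat_range, lon_range) tuples centered on a point.

    Hypothetical helper for illustration only; not part of climakitae.
    """
    lat_range = (lat - buffer_deg, lat + buffer_deg)
    lon_range = (lon - buffer_deg, lon + buffer_deg)
    return lat_range, lon_range


# Station coordinates and buffer taken from the selections above
lat_range, lon_range = bounding_box(34.067, -117.65, 0.02)
print(lat_range, lon_range)
```

The resulting tuples could then be assigned to `selections.latitude` and `selections.longitude`.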

In [None]:
# retrieving data
ds = selections.retrieve()

# subset for models of interest
mdls = [
    'LOCA2_EC-Earth3_r1i1p1f1',
    'LOCA2_EC-Earth3_r2i1p1f1',
    'LOCA2_CESM2-LENS_r1i1p1f1',
    'LOCA2_CESM2-LENS_r2i1p1f1',
    'LOCA2_CESM2-LENS_r3i1p1f1',
]
ds = ds.sel(simulation=mdls)

# loading into memory -- this can take longer if you change the selections
ds = ck.load(ds)
model_ds = ds.squeeze()
model_ds

#### 1b) Reanalysis data
Next, we retrieve the dynamically downscaled reanalysis (WRF-ERA5) for the same location.

In [None]:
selections = DataParameters()

selections.variable = 'Precipitation (total)'
selections.scenario_historical = ["Historical Reconstruction"]
selections.timescale = 'monthly'
selections.units = 'inches'
selections.time_slice = (1980, 2001)
selections.area_subset = 'lat/lon'
selections.cached_area = ['coordinate selection']
selections.area_average = 'No'
selections.latitude = (34.067 - 0.03, 34.067 + 0.03) # specifically at station coordinates, with small buffer
selections.longitude = (-117.65 - 0.03, -117.65 + 0.03) # specifically at station coordinates, with small buffer

In [None]:
# retrieving data
ds = selections.retrieve()

# loading into memory -- this can take longer if you change the selections
ds = ck.load(ds)
reanalysis_ds = ds.squeeze()
reanalysis_ds

#### 1c) Observational data
Lastly, we'll read in weather observations for comparison. In this example, we are looking at precipitation observations from a weather station near Ontario, in San Bernardino County.

In [None]:
wx_obs = pd.read_csv('1026_data_cleaned.csv') # we'll use the "total_precipitation_in" column for comparison
# trim to the time period that overlaps between models, reanalysis, and station observations
# (.copy() avoids a pandas SettingWithCopyWarning when columns are added below)
wx_obs = wx_obs[(wx_obs['year'] >= 1980) & (wx_obs['year'] <= 2001)].copy()

# adding an easy to interpret time (month-year) column so we can compare side by side
wx_obs['day'] = 1 # using first of the month for ease
wx_obs['time'] = pd.to_datetime(wx_obs[['year', 'month', 'day']])
wx_obs = wx_obs.drop(columns=['year', 'month', 'day']) # minor cleanup

wx_obs
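
Station records can have gaps, so it can be worth checking that every month in the comparison period is present before plotting. The following is a minimal sketch on synthetic timestamps; `missing_months` is a hypothetical helper, and it assumes one row per month stamped on the first of the month, as constructed above.

```python
import pandas as pd


def missing_months(times, start_year, end_year):
    """Return month-start timestamps in [start_year, end_year] absent from `times`.

    Hypothetical helper for illustration only.
    """
    expected = pd.date_range(
        start=f"{start_year}-01-01", end=f"{end_year}-12-01", freq="MS"
    )
    return expected.difference(pd.DatetimeIndex(times))


# Synthetic example: a two-year monthly record with March 1980 removed
example_times = pd.date_range("1980-01-01", "1981-12-01", freq="MS").delete(2)
gaps = missing_months(example_times, 1980, 1981)
print(gaps)
```

With the notebook's data, this could be called as `missing_months(wx_obs['time'], 1980, 2001)`.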

### Step 2: Visualize timeseries of observations, reanalysis, and model output

First, we visualize the variability across realizations of climate models.

In [None]:
rolling_window = 12 # use a smoothing window to better visualize trends
models_to_plot = model_ds.rolling(time=rolling_window).mean().hvplot.line(
    x='time', by='simulation', title='Inter- and Intra-model comparison', width=1000, height=400);

models_to_plot

Key takeaways here:
* Each run of a climate model produces a unique timeseries, with the timing of wet and dry years varying from run to run. This is okay -- it's by design!
* Even within a single model, different realizations (or runs, e.g. r1i1p1f1, r2i1p1f1) produce different outputs.
* Although the timing is different in each run, the overall range of values and magnitude of interannual variability is similar across models.

Useful note: You can also "turn off" individual lines in the plot above to focus on a particular model. Just click on a name in the legend to hide or unhide that model.
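
To make the takeaways above concrete, here is a minimal sketch using purely synthetic data (not real LOCA2 output). Each hypothetical "run" samples the same gamma distribution with a different random seed, which loosely mimics free-running realizations: the long-term statistics are similar, but the month-to-month timing is essentially uncorrelated. The gamma parameters are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

n_months = 264  # 22 years of monthly values, matching 1980-2001
time = pd.date_range("1980-01-01", periods=n_months, freq="MS")

# Synthetic stand-ins for model realizations (NOT real model output):
# every "run" samples the same distribution with a different seed.
runs = {
    f"run_{seed}": np.random.default_rng(seed).gamma(shape=0.8, scale=1.5, size=n_months)
    for seed in range(3)
}
ensemble = pd.DataFrame(runs, index=time)

# Similar long-term statistics across runs...
run_means = ensemble.mean()
# ...but essentially uncorrelated month-to-month timing.
run_corr = ensemble.corr()

print(run_means.round(2))
print(run_corr.round(2))
```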

Next, we'll plot the station observations alongside the WRF-ERA5 reanalysis product to see how these datasets compare.

In [None]:
wx_obs['total_precipitation_in_smooth'] = wx_obs['total_precipitation_in'].rolling(rolling_window).mean()
obs = wx_obs.hvplot(x='time', y='total_precipitation_in_smooth', color='black', label='Observations', 
                    line_width=3, width=1000, height=400,title='Station to Reanalysis Comparison');

models_to_plot = model_ds.sel(simulation='LOCA2_CESM2-LENS_r1i1p1f1').rolling(time=rolling_window).mean().hvplot.line(
    x='time', by='simulation', color = 'orange');
reanalysis_to_plot = reanalysis_ds.rolling(time=rolling_window).mean().hvplot.line(
    x='time', color='blue', label='WRF-ERA5 reanalysis',  width=1000, height=400);

obs * models_to_plot * reanalysis_to_plot

Key takeaways here:
* Unlike the climate model data (single orange line), the reanalysis (blue line) generally follows the same sequence of events as the station data (e.g. wet years in 1983 and 1993, dry period 1984-1991).
* There are still some discrepancies between the reanalysis and station observations. These could be due to limitations of the instruments at the station, calibration issues, or the reanalysis producing weather events in slightly different locations than the observations.
* An ordered timeseries from a climate model will never match the observational timeseries, but it should generally reproduce climate characteristics and variability (more on this below).
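
One way to put a number on "follows the same sequence of events" is the correlation between the smoothed series. The sketch below uses synthetic stand-ins rather than the notebook's data: the "reanalysis" is constructed as the observations plus noise, and the "model" is an independent draw from the same distribution. These constructions and the helper `smoothed_corr` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_months = 264
time = pd.date_range("1980-01-01", periods=n_months, freq="MS")

# Synthetic stand-ins (NOT the notebook's data)
obs = pd.Series(rng.gamma(shape=0.8, scale=1.5, size=n_months), index=time)
# "Reanalysis": observations plus noise, clipped so totals stay non-negative
reanalysis = (obs + rng.normal(0.0, 0.3, size=n_months)).clip(lower=0)
# "Model": an independent draw from the same distribution
model = pd.Series(rng.gamma(shape=0.8, scale=1.5, size=n_months), index=time)

window = 12


def smoothed_corr(a, b, window):
    """Pearson correlation between two series after a rolling mean."""
    a_smooth = a.rolling(window).mean()
    b_smooth = b.rolling(window).mean()
    # Series.corr skips the NaN values at the start of the rolling window
    return a_smooth.corr(b_smooth)


corr_reanalysis = smoothed_corr(obs, reanalysis, window)
corr_model = smoothed_corr(obs, model, window)
print(f"obs vs reanalysis: {corr_reanalysis:.2f}")
print(f"obs vs model:      {corr_model:.2f}")
```

A low correlation between a model run and the observations is expected and is not, by itself, a sign of a poor model.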

### Step 3: Visualize distributions and variability over a climatology period
Next, we'll visualize the overall distribution of precipitation values to assess whether the model reproduces the *overall general conditions* at the station. We also mask out monthly totals of 0.1 inch or less in all three datasets. Removing *trace precipitation* is a common practice in climate analyses, because these very small values are prone to instrumentation inaccuracies and can dramatically change the shape of the distribution.

In [None]:
# Mask out dry months (<= 0.1 inch) to compare distributions of precipitation
model_ds = model_ds.where(model_ds > .1)
reanalysis_ds = reanalysis_ds.where(reanalysis_ds > .1)

# apply the same 0.1 inch threshold to the weather observations
valid_obs = wx_obs.loc[wx_obs['total_precipitation_in'] > .1]
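
The cell above masks the gridded data with `.where` and the station table with `.loc`. As a small sketch on synthetic values, the pandas equivalents below show the difference: `.where` keeps the original length and fills failing entries with NaN, while boolean indexing with `.loc` drops those rows. Note that the strict `>` comparison also excludes values exactly equal to the threshold.

```python
import pandas as pd

# Small synthetic monthly totals in inches (NOT the station data)
precip = pd.Series([0.0, 0.05, 0.1, 0.4, 2.3, 0.0, 1.1])
threshold = 0.1

# .where keeps the original length and replaces failing values with NaN
masked = precip.where(precip > threshold)

# Boolean indexing with .loc drops the failing rows entirely
filtered = precip.loc[precip > threshold]

print(masked.tolist())
print(len(filtered))
```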

In [None]:
def hist_overlay(obs, model, reanalysis):
    bins = np.arange(0,10,0.4)
    
    # relabel for legend
    obs.name = 'Observations'
    model.name = 'LOCA2_CESM2-LENS_r1i1p1f1'
    reanalysis.name = 'WRF-ERA5 reanalysis'
    
    obs_plot = obs.to_xarray().hvplot.hist(bins=bins, color='black', alpha=0.75)
    mdl_plot = model.hvplot.hist(bins=bins, color='orange')
    re_plot = reanalysis.hvplot.hist(bins=bins, color='blue', title='Observations to Model Comparison')
    return (re_plot * mdl_plot * obs_plot).options(xlabel='Precipitation (total) [inches]', 
                                                   ylabel='Number of months', width=800, height=400)

# now plot all three datasets together!
hist_overlay(obs=valid_obs['total_precipitation_in'],
             model = model_ds.sel(simulation='LOCA2_CESM2-LENS_r1i1p1f1'),
             reanalysis = reanalysis_ds)
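
One detail worth checking with fixed bin edges is whether any values fall outside them. `np.arange(0, 10, 0.4)` stops at about 9.6, and `np.histogram` silently ignores values beyond the last edge. The sketch below uses synthetic values; it assumes the plotted histogram bins data in a similar way, which you may want to verify for your own selections.

```python
import numpy as np

bins = np.arange(0, 10, 0.4)  # same bin edges as hist_overlay; last edge is about 9.6

# Synthetic monthly totals in inches (NOT the notebook's data)
values = np.array([0.5, 0.5, 3.0, 9.7, 12.0])

counts, edges = np.histogram(values, bins=bins)
n_outside = values.size - counts.sum()

print(f"last bin edge: {edges[-1]:.1f}")
print(f"values outside the bins: {n_outside}")
```

If very wet months matter for your location, extending the upper bin edge would keep them in the comparison.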

Now looking at the comparison of distributions, we see that the spread of monthly precipitation values lines up reasonably well between observations and model output. 
* An ordered timeseries from a climate model will never match the observational timeseries, but it should generally reproduce climate characteristics and variability, which we see above in the spread of values.
* This is especially important for **precipitation**, which does not follow a normal distribution, is highly variable in space and time, and is challenging to model accurately.
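
A visual comparison can be complemented with a simple summary number. The notebook itself does not compute one; as a sketch, the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical cumulative distributions) is one common choice. The function `ks_statistic` and the synthetic samples below are assumptions for illustration, written with NumPy only.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    a = a[~np.isnan(a)]  # drop values that were masked out as NaN
    b = b[~np.isnan(b)]
    grid = np.concatenate([a, b])  # evaluate both CDFs at every data point
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Synthetic samples (NOT the notebook's data)
rng = np.random.default_rng(0)
sample_1 = rng.gamma(shape=0.8, scale=1.5, size=250)
sample_2 = rng.gamma(shape=0.8, scale=1.5, size=250)  # same distribution
shifted = sample_1 + 2.0                              # clearly different

ks_same = ks_statistic(sample_1, sample_2)
ks_shifted = ks_statistic(sample_1, shifted)
print(f"same distribution:    {ks_same:.2f}")
print(f"shifted distribution: {ks_shifted:.2f}")
```

Values near 0 indicate similar distributions and values near 1 indicate little overlap.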

If you want to analyze real-world events from the historical record, we recommend using reanalysis data products like WRF-ERA5 (or ERA5 itself!). They are much more flexible than individual weather stations, which may have gaps in their reporting record or other instrumentation inconsistencies. Climate model data will not match the year-to-year historical record, but comparing model output to observations or reanalysis products is very useful for evaluating how well each model represents particular features of the climate. When making these comparisons, it's important to only compare measurements aggregated over a sufficiently long time period (typically at least 30 years) that samples a range of the climate's natural variability.

For more information on climate variability, check out the `internal_variability.ipynb` and `model_variability.ipynb` notebooks!