# CAMS50 VRA2015 Collocated obs/runs
CAMS50 runs a reanalysis with validated observations 2 years after the fact.
On the 2015 Validated ReAnalysis, or VRA2015 for short, the EMEP analysis results
failed to capture O3 and PM10 exceedances. PM10 was not assimilated, so no surprise there.
The O3 exceedances were missed as the assimilation system was set to reject observations
larger than 150 ug/m3.
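
The effect of the rejection maximum can be illustrated with a minimal sketch. The observation values below are made up, and the 180 ug/m3 level is the EU 1-hour O3 information threshold, used here only as an example of an exceedance level; neither is taken from the assimilation system itself.

```python
import numpy as np

# Hypothetical hourly O3 observations in ug/m3 (illustration only).
obs = np.array([42.0, 95.5, 149.9, 151.0, 183.2, 241.7])

REJECT_MAX = 150.0      # rejection maximum quoted in the text
INFO_THRESHOLD = 180.0  # example exceedance level (assumption, see lead-in)

# Observations above the rejection maximum never reach the analysis.
accepted = obs[obs <= REJECT_MAX]

n_exceed_obs = int((obs > INFO_THRESHOLD).sum())
n_exceed_accepted = int((accepted > INFO_THRESHOLD).sum())
print(n_exceed_obs, n_exceed_accepted)
```

With such a filter, none of the observations that would pull the analysis towards an exceedance can be assimilated, whatever the model background looks like.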

In [1]:
from glob import glob
from os.path import basename, isfile
from os import remove

import numpy as np
import pandas as pd
import xarray as xr
import xarray.ufuncs as xu
from dask.diagnostics import ProgressBar

import warnings
warnings.filterwarnings('ignore')

# only 3 decimal points on df.head() and df.describe()
pd.options.display.float_format = '{:,.3f}'.format

## Datasets
- eeaVRA: validated surface obs for data assimilation
- eeaVAL: validated surface obs for model evaluation
- cifsBC: CIFS boundary conditions
- emepHC: hindcast run (no DA)
- emepAN: (re)analysis run (DA: NO2,O3,SO2), too low rejection max
- emepRE: (re)analysis re-run (DA: NO2,O3,SO2), higher rejection max
- emepPM: (re)analysis re-run (DA: NO2,O3,SO2,PM25,PM10), higher rejection max & new DA modules (w/PM DA feedback)

In [2]:
lustre = "/lustre/storeA/users/alvarov/CAMS50/%s"
ncfile = lustre%'vra2015colloc.nc' # save collocated datasets
files = dict(
    eeaVRA=glob(lustre%'obs/VRA_2015/assimilation_*.nc'),
    eeaVAL=glob(lustre%'obs/VRA_2015/validation_*.nc'),
    cifsBC=glob(lustre%'2015_VRA/VRA_*_EU_EVA.nc'),
    emepHC=glob(lustre%'VRA-2015/BM_CAMS50.201706/VRA00-2015.nc'),
    emepAN=glob(lustre%'VRA-2015/BM_CAMS50.201706/VRA00AN-2015Q?.nc'),
    emepRE=glob(lustre%'VRA-2015/BM_CAMS50.201706/VRA00RE-2015Q?.nc'),
    emepPM=glob(lustre%'VRA-2015/BM_CAMS50.201801/VRA00AN-2015Q?.nc'),
)
for k,v in files.items():
    print("%s: %3d files"%(k,len(v)))

eeaVRA:   5 files
eeaVAL:   5 files
cifsBC: 335 files
emepHC:   1 files
emepAN:   4 files
emepRE:   4 files
emepPM:   4 files


In [3]:
def save2nc(ds,f):
    if isfile(f): remove(f)
    ds.to_netcdf(f)

# Validated Observations
Observations for *O3*, *NO2*, *SO2*, *PM25* and *PM10*, in *ug/m3*, are divided into 2 datasets,
assimilation and validation.
- The dataset split is not consistent across species.
- The classification is not consistent across species.
- The datasets contain negative concentrations.
- Station `FR23003` is outside the European domain.
- Station `MT00008` is defined with slightly different longitude on assimilation NO2/O3 sets.

The observations were stored in NetCDF files as part of the pre-processing for data assimilation. Negative concentrations are discarded by the data assimilation system, and will be thrown away as the files are read in.
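
A minimal sketch of the masking step, using plain NumPy and made-up values rather than the actual observation files. It follows the same idea as a `where(x > 0)` mask: positive values are kept and everything else becomes NaN.

```python
import numpy as np

# Hypothetical concentrations in ug/m3, including invalid negative values.
conc = np.array([17.0, -999.0, 51.25, -0.5, 24.0])

# Keep positive values, mask the rest as NaN.
masked = np.where(conc > 0, conc, np.nan)

print(masked)              # negative entries become nan
print(np.nanmean(masked))  # NaN-aware statistics skip the masked entries
```

Note that a strict `> 0` test also masks exact zeros, which may or may not be intended.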

In [4]:
def surfObs(ds):
    # load the dataset in order to use .loc
    ds.load()

    # byte to string
    ds['station'] = ds['iso'].astype(str) # station names
    ds['class'] = ds['class'].astype(str)

    # fix MT00008 coordinates
    if 'MT00008' in ds.station.values:
        ds.lat.loc['MT00008'] = 35.89002
        ds.lon.loc['MT00008'] = 14.434464
    
    # mask negative concentrations
    for param in ds.data_vars: 
        if ds[param].attrs.get('units',None) == 'ug/m3':
            ds[param] = ds[param].where(ds[param]>0)
            ds['class'] = ds['class'].assign_coords(poll=param).expand_dims('poll')

    return ds.drop(['iso'])

%time ds = surfObs(xr.open_dataset(files['eeaVRA'][0]))
ds

CPU times: user 2.39 s, sys: 316 ms, total: 2.7 s
Wall time: 7.21 s


<xarray.Dataset>
Dimensions:  (poll: 1, station: 1214, time: 8761)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * station  (station) <U8 'IT1961A' 'FR04014' 'FR02031' 'CZ0ALIB' 'PL0129A' ...
  * poll     (poll) <U3 'NO2'
Data variables:
    lon      (station) float64 8.256 2.394 5.213 14.45 20.96 4.021 -0.5147 ...
    lat      (station) float64 46.31 48.84 43.42 50.01 52.41 49.22 44.9 ...
    alt      (station) float64 1.639e+03 40.0 10.0 301.0 91.0 93.0 36.0 ...
    class    (poll, station) <U26 'background/rural' 'background/urban' ...
    NO2      (time, station) float32 17.0 112.25 51.25 nan 16.71 37.63 24.0 ...

## Read all observations

In [5]:
# multi-file reader, with all the options
readObs = lambda files, dataset: xr.open_mfdataset(
        files, concat_dim=None, preprocess=surfObs
    ).assign_coords(dataset=dataset).expand_dims('dataset')

In [6]:
data = xr.Dataset()
with ProgressBar():
    for k,v in files.items():
        if k.startswith('eea'):
            # read the eeaVRA/VAL dataset separately
            %time obs = readObs(v,k)

            # concatenate eeaVRA/VAL datasets
            %time data = data.combine_first(obs)

data

[########################################] | 100% Completed |  5.2s
[########################################] | 100% Completed |  5.5s
[########################################] | 100% Completed |  1.1s
[########################################] | 100% Completed |  2.5s
[########################################] | 100% Completed |  4.5s
CPU times: user 10.4 s, sys: 1.34 s, total: 11.7 s
Wall time: 21.2 s
CPU times: user 100 ms, sys: 80 ms, total: 180 ms
Wall time: 178 ms
[########################################] | 100% Completed |  1.7s
[########################################] | 100% Completed |  7.7s
[########################################] | 100% Completed |  1.3s
[########################################] | 100% Completed |  2.5s
[########################################] | 100% Completed |  0.9s
CPU times: user 6.54 s, sys: 1.03 s, total: 7.58 s
Wall time: 16 s
CPU times: user 5.03 s, sys: 1.15 s, total: 6.18 s
Wall time: 6.16 s


<xarray.Dataset>
Dimensions:  (dataset: 2, poll: 5, station: 2237, time: 8761)
Coordinates:
  * dataset  (dataset) object 'eeaVAL' 'eeaVRA'
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
  * poll     (poll) object 'NO2' 'O3' 'PM10' 'PM25' 'SO2'
  * time     (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
Data variables:
    lon      (dataset, station) float64 nan 1.717 nan 19.49 nan nan nan nan ...
    lat      (dataset, station) float64 nan 42.53 nan 40.4 nan nan nan nan ...
    alt      (dataset, station) float64 nan 2.515e+03 nan 25.0 nan nan nan ...
    class    (dataset, poll, station) object nan nan nan 'background/urban' ...
    NO2      (dataset, time, station) float32 nan nan nan 13.5451 nan nan ...
    PM10     (dataset, time, station) float32 nan nan nan nan nan nan nan ...
    PM25     (dataset, time, station) float32 nan nan nan nan nan nan nan ...
    SO2      (dataset, time, station) float32 nan nan nan 7.299 nan nan nan ...
    O3    

## Observations per dataset

In [7]:
data.sel(dataset='eeaVRA').drop(['lon','lat','alt']).to_dataframe().describe()

statistic,NO2,PM10,PM25,SO2,O3
count,49155520.0,30317365.0,13764850.0,23499805.0,48637320.0
mean,14.128,18.302,12.711,3.603,44.816
std,15.582,16.594,12.893,10.963,30.678
min,0.0,0.0,0.0,0.0,0.0
25%,6.0,9.57,5.242,1.0,31.0
50%,12.148,15.75,9.225,2.1,54.72
75%,23.849,24.65,16.112,4.25,75.38
max,416.3,958.03,945.6,997.0,282.0


In [8]:
data.sel(dataset='eeaVAL').drop(['lon','lat','alt']).to_dataframe().describe()

statistic,NO2,PM10,PM25,SO2,O3
count,21189195.0,13074295.0,5973645.0,9980630.0,20449935.0
mean,17.915,19.775,12.924,4.497,52.664
std,16.615,16.227,12.556,8.233,31.927
min,0.001,0.002,0.001,0.0,0.001
25%,6.79,10.0,5.25,1.21,28.461
50%,13.4,16.0,9.4,2.9,52.5
75%,25.75,24.91,16.299,5.0,74.0
max,356.0,943.0,738.58,983.25,296.0


## Unique stations

In [9]:
%time stat = data[['lon','lat','alt','class']]
%time stat = stat.sel(dataset='eeaVRA').combine_first(stat.sel(dataset='eeaVAL'))
stat

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 300 µs
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 7.13 ms


<xarray.Dataset>
Dimensions:  (poll: 5, station: 2237)
Coordinates:
  * poll     (poll) object 'NO2' 'O3' 'PM10' 'PM25' 'SO2'
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
Data variables:
    lon      (station) float64 1.565 1.717 20.78 19.49 19.52 13.67 16.77 ...
    lat      (station) float64 42.52 42.53 40.63 40.4 42.31 48.39 47.77 ...
    alt      (station) float64 1.637e+03 2.515e+03 848.0 25.0 13.0 525.0 ...
    class    (poll, station) object nan nan 'background/suburban' ...

In [10]:
stat.to_dataframe().head()

poll,station,lon,lat,alt,class
NO2,AD0944A,1.565,42.517,1637.0,
NO2,AD0945A,1.717,42.535,2515.0,
NO2,AL0203A,20.78,40.626,848.0,background/suburban
NO2,AL0204A,19.486,40.403,25.0,background/urban
NO2,AL0206A,19.523,42.314,13.0,background/urban


## Station classification
Make it a coordinate, as it should not change as we add more datasets

In [11]:
data['class'] = stat['class']
data = data.set_coords('class')
data

<xarray.Dataset>
Dimensions:  (dataset: 2, poll: 5, station: 2237, time: 8761)
Coordinates:
    class    (poll, station) object nan nan 'background/suburban' ...
  * dataset  (dataset) object 'eeaVAL' 'eeaVRA'
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
  * poll     (poll) object 'NO2' 'O3' 'PM10' 'PM25' 'SO2'
  * time     (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
Data variables:
    lon      (dataset, station) float64 nan 1.717 nan 19.49 nan nan nan nan ...
    lat      (dataset, station) float64 nan 42.53 nan 40.4 nan nan nan nan ...
    alt      (dataset, station) float64 nan 2.515e+03 nan 25.0 nan nan nan ...
    NO2      (dataset, time, station) float32 nan nan nan 13.5451 nan nan ...
    PM10     (dataset, time, station) float32 nan nan nan nan nan nan nan ...
    PM25     (dataset, time, station) float32 nan nan nan nan nan nan nan ...
    SO2      (dataset, time, station) float32 nan nan nan 7.299 nan nan nan ...
    O3       (datase

In [12]:
save2nc(data,ncfile)

# Collocation
For point-wise collocation, the lon/lat indexers need to be xarray.DataArrays.

In [13]:
def collocate(ds, lon=stat.lon, lat=stat.lat, dlon=1.25, dlat=1.25):
    """
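    collocate dataset to station coordinates
      for point-wise selection lon/lat need to be DataArrays (and ds.load())
      .sel(.., tolerance=max(dlat,dlon)) raises a KeyError for points outside the domain,
      so points further than half of dlon/dlat from the nearest grid point are masked instead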
    """
    col = ds.load().sel(lon=lon, lat=lat, method='nearest')
    return col.where(abs(col.lon-lon)<dlon*0.5)\
              .where(abs(col.lat-lat)<dlat*0.5)\
              .reset_coords()
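
The nearest-neighbour logic can be sketched without xarray. The NumPy version below is an illustration only, not the notebook's implementation: the grid mimics the EMEP output (0.25 x 0.125 degrees, with the upper bounds inferred from the dimension sizes), the field is random, and the third station is a made-up point outside the domain. The helper `nearest_index` is hypothetical.

```python
import numpy as np

def nearest_index(grid, points):
    """Index of the nearest grid coordinate for each point (1-D arrays)."""
    return np.abs(grid[None, :] - points[:, None]).argmin(axis=1)

# Assumed regular grid and a random field (illustration only).
lon = np.arange(-30.0, 45.25, 0.25)    # 301 points, 0.25 degree spacing
lat = np.arange(30.0, 76.125, 0.125)   # 369 points, 0.125 degree spacing
field = np.random.default_rng(0).random((lat.size, lon.size))

# Two station positions from the table above, plus one outside the grid.
st_lon = np.array([1.565, 19.486, 60.0])
st_lat = np.array([42.517, 40.403, 50.0])

i = nearest_index(lon, st_lon)
j = nearest_index(lat, st_lat)
values = field[j, i]                   # paired (point-wise) selection

# Mask stations further than half a grid cell from the matched grid point.
dlon, dlat = 0.25, 0.125
inside = (np.abs(lon[i] - st_lon) < dlon * 0.5) & \
         (np.abs(lat[j] - st_lat) < dlat * 0.5)
values = np.where(inside, values, np.nan)
print(lon[i], lat[j], values)
```

Indexing with two integer arrays of equal length picks one value per station, which is the NumPy counterpart of passing DataArray indexers that share the `station` dimension. Indexing with plain lists of coordinates would instead select the full lon/lat cross product.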

# Boundary conditions
From CIFS reanalysis. Daily files with 3-hourly records. 
- November BC files are missing, not sure if they failed to be created or were
wrongly cleaned up.
- 335 files of ~256 MB each, ~84 GB in total.
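
The file count and total size can be checked with a few lines of arithmetic. This sketch assumes that exactly the 30 daily files for November are missing and that each file holds 8 three-hourly records; the per-file size is the approximate figure quoted above.

```python
from datetime import date, timedelta

# Every day of 2015 except November, where the BC files are reported missing.
start = date(2015, 1, 1)
days = [start + timedelta(days=n) for n in range(365)]
available = [d for d in days if d.month != 11]

records_per_day = 24 // 3   # 3-hourly records
size_mb = 256               # approximate size per file

n_files = len(available)
n_records = n_files * records_per_day
total_gb = n_files * size_mb / 1024

print(n_files)     # number of daily files
print(n_records)   # number of 3-hourly records after concatenation
print(total_gb)    # total size in GB (binary units)
```

The resulting file count agrees with the 335 `cifsBC` files found by `glob`, and the record count is what the time axis of the concatenated dataset should have.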

In [14]:
%time ds = xr.open_dataset(files['cifsBC'][0])
ds

CPU times: user 44 ms, sys: 24 ms, total: 68 ms
Wall time: 422 ms


<xarray.Dataset>
Dimensions:    (latitude: 65, level: 60, longitude: 207, time: 8, x: 61)
Coordinates:
  * longitude  (longitude) float32 -115.875 -114.75 -113.625 -112.5 -111.375 ...
  * latitude   (latitude) float32 81.0 79.875 78.75 77.625 76.5 75.375 74.25 ...
  * level      (level) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
  * time       (time) datetime64[ns] 2015-02-22 2015-02-22T03:00:00 ...
Dimensions without coordinates: x
Data variables:
    t          (time, level, latitude, longitude) float32 ...
    aermr01    (time, level, latitude, longitude) float32 ...
    aermr02    (time, level, latitude, longitude) float32 ...
    aermr03    (time, level, latitude, longitude) float32 ...
    aermr04    (time, level, latitude, longitude) float32 ...
    aermr05    (time, level, latitude, longitude) float32 ...
    aermr06    (time, level, latitude, longitude) float32 ...
    aermr07    (time, level, latitude, longitude) float32 ...
    aermr08    (time, level, latitud

## Collocate

In [15]:
surfBCs = lambda ds: ds.rename(dict(
    longitude='lon',
    latitude='lat',
    no2='NO2',
    so2='SO2',
    go3='O3',
)).sel(level=60).drop('level')
""" PM*
    aermr01='SEASALT_F',
    aermr02='SEASALT_C',
   #aermr03='SEASALT_C',    # not used
    aermr04='DUST_SAH_F',
    aermr05='DUST_SAH_F',
    aermr06*.15='DUST_SAH_F',
    aermr06*.35='DUST_SAH_C',
   #aermr07*1.7='FFIRE_OM', # not used
   #aermr08*1.7='FFIRE_OM', # not used
    aermr09='FFIRE_BC',     # not used
    aermr10='FFIRE_BC',     # not used
    aermr11='SO4',
   #aermr12='SO2',          # not used
"""

dropBCs = "aermr01 aermr02 aermr03 aermr04 aermr05 aermr06 aermr07 aermr08 aermr09 aermr10 aermr11 aermr12 co hno3 pan no hcho ch4 c5h8 oh n2o5 c2h6 c3h8 z hyai hybi".split()

%time collocate(surfBCs(ds.drop(dropBCs)))

CPU times: user 96 ms, sys: 0 ns, total: 96 ms
Wall time: 533 ms


<xarray.Dataset>
Dimensions:  (station: 2237, time: 8)
Coordinates:
  * time     (time) datetime64[ns] 2015-02-22 2015-02-22T03:00:00 ...
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
Data variables:
    t        (time, station) float32 272.0734 273.6152 278.31186 282.89673 ...
    NO2      (time, station) float32 3.4185346e-09 2.854822e-09 7.858242e-09 ...
    SO2      (time, station) float32 9.1194113e-10 8.4438995e-10 ...
    O3       (time, station) float32 5.7367092e-08 5.7853256e-08 ...
    lnsp     (time, station) float32 11.431013 11.45578 11.455184 11.50319 ...
    lon      (station) float32 1.125 2.25 20.25 19.125 19.125 13.5 16.875 ...
    lat      (station) float32 42.75 42.75 40.5 40.5 42.75 48.375 47.25 ...
Attributes:
    CDI:          Climate Data Interface version 1.6.9 (http://mpimet.mpg.de/...
    history:      Thu Aug 03 10:09:36 2017: cdo -b 32 -f nc4 -z zip -s merge ...
    Conventions:  CF-1.6
    CDO:          Climate Data Operators v

In [16]:
%%time
ds = xr.open_mfdataset(   
    files['cifsBC'], chunks={'time':10}, concat_dim='time', autoclose=True,
    preprocess=surfBCs, drop_variables=dropBCs,
).assign_coords(dataset='cifsBC').expand_dims('dataset')

CPU times: user 17.3 s, sys: 7.28 s, total: 24.6 s
Wall time: 1min 44s


In [17]:
%%time
with ProgressBar():
    bcs = collocate(ds)

[########################################] | 100% Completed | 30min  2.7s
CPU times: user 17min 38s, sys: 2min, total: 19min 39s
Wall time: 30min 7s


In [18]:
bcs

<xarray.Dataset>
Dimensions:  (dataset: 1, station: 2237, time: 2680)
Coordinates:
  * time     (time) datetime64[ns] 2015-02-22 2015-02-22T03:00:00 ...
  * dataset  (dataset) <U6 'cifsBC'
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
Data variables:
    t        (dataset, time, station) float32 272.0734 273.6152 278.31186 ...
    NO2      (dataset, time, station) float32 3.4185346e-09 2.854822e-09 ...
    SO2      (dataset, time, station) float32 9.1194113e-10 8.4438995e-10 ...
    O3       (dataset, time, station) float32 5.7367092e-08 5.7853256e-08 ...
    lnsp     (dataset, time, station) float32 11.431013 11.45578 11.455184 ...
    lon      (station) float32 1.125 2.25 20.25 19.125 19.125 13.5 16.875 ...
    lat      (station) float32 42.75 42.75 40.5 40.5 42.75 48.375 47.25 ...
Attributes:
    CDI:          Climate Data Interface version 1.6.9 (http://mpimet.mpg.de/...
    history:      Thu Aug 03 10:09:36 2017: cdo -b 32 -f nc4 -z zip -s merge ...
   

## Unit conversion
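CIFS concentrations come as mass mixing ratios in `kg/kg`, while observations are in `ug/m3`.
The conversion multiplies by the air density, estimated from the surface pressure (`exp(lnsp)`) and the temperature `t`.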

In [19]:
def unitConv(ds):
    rho = xu.exp(ds.lnsp)/(287.05 * ds.t)    
    for param in list(ds.data_vars):
        if ds[param].attrs.get('units',None) == 'kg kg**-1':
            ds[param] *= 1e9*rho
            ds[param].attrs['units'] = 'ug/m3'
    return ds.drop(['t','lnsp'])
    
%time bcs = unitConv(bcs)

bcs.drop(['lon','lat']).to_dataframe().describe()

CPU times: user 172 ms, sys: 0 ns, total: 172 ms
Wall time: 169 ms


statistic,NO2,SO2,O3
count,5949600.0,5949600.0,5949600.0
mean,10.838,3.391,54.968
std,10.432,5.342,26.932
min,0.0,0.0,0.0
25%,2.966,0.849,35.206
50%,7.339,1.732,57.579
75%,15.664,3.736,76.19
max,126.738,353.801,200.72
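
As a worked example of the conversion, the sketch below applies the ideal gas law to a single value. The inputs are illustrative numbers of the same magnitude as the collocated CIFS output, not actual records, and the helper name `mmr_to_ugm3` is hypothetical.

```python
import math

R_DRY = 287.05  # specific gas constant for dry air, J/(kg K), as in unitConv

def mmr_to_ugm3(mmr, lnsp, temp):
    """Convert a mass mixing ratio (kg/kg) to a concentration in ug/m3."""
    pressure = math.exp(lnsp)        # surface pressure, Pa
    rho = pressure / (R_DRY * temp)  # air density, kg/m3 (ideal gas law)
    return mmr * rho * 1e9           # kg/m3 -> ug/m3

# Illustrative values: O3 mixing ratio, log surface pressure, temperature in K.
o3 = mmr_to_ugm3(mmr=5.7e-8, lnsp=11.5, temp=280.0)
print(round(o3, 1))
```

For these inputs the density is about 1.23 kg/m3 and the O3 concentration comes out at roughly 70 ug/m3, within the range of the statistics above.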


## Add to observation dataset

In [20]:
%time data = data.combine_first(bcs)
data.sel(dataset='cifsBC').drop(['lon','lat','alt']).to_dataframe().describe()

CPU times: user 4.89 s, sys: 15.6 s, total: 20.5 s
Wall time: 20.4 s


statistic,NO2,PM10,PM25,SO2,O3
count,29748000.0,0.0,0.0,29748000.0,29748000.0
mean,9.992,,,3.032,53.542
std,9.906,,,5.135,26.079
min,0.0,,,0.0,0.0
25%,2.966,,,0.849,35.206
50%,7.339,,,1.732,57.579
75%,15.664,,,3.736,76.19
max,126.738,,,353.801,200.72


In [21]:
save2nc(data,ncfile)
del(bcs)

# Model runs
The EMEP domain has 3 times the records and ~8 times more grid points than the CIFS domain.
- `emepHC`: Single run, producing one **29 GB** hourly output file.
- `emepAN`: 4 overlapping runs, each producing a **~6 GB** hourly output file.
- `emepRE`: 4 overlapping runs, each producing a **~6 GB** hourly output file.
- `emepPM`: 4 overlapping runs, each producing a **~6 GB** hourly output file.
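
The two ratios quoted above follow from the dimensions printed in the dataset summaries. A quick check, using only those numbers:

```python
# Grid sizes taken from the dataset summaries in this notebook.
cifs_points = 65 * 207     # latitude x longitude
emep_points = 369 * 301    # lat x lon

cifs_records_per_day = 8   # 3-hourly
emep_records_per_day = 24  # hourly

point_ratio = emep_points / cifs_points
record_ratio = emep_records_per_day / cifs_records_per_day
print(point_ratio, record_ratio)
```

The grid-point ratio is a little above 8 and the record ratio is exactly 3, so one hour of EMEP surface output holds roughly 8 times the data of one CIFS record on a single level.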

In [22]:
def readRun(run):   
    ds = xr.Dataset()
    for fname in files[run]:
        ds = ds.combine_first(xr.open_dataset(fname, chunks={'time':6}))
    return ds.assign_coords(dataset=run).expand_dims('dataset')
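
How overlapping runs are merged by `combine_first` can be sketched with a NumPy analogue. The two arrays below are hypothetical runs already aligned on a common time axis (xarray performs that alignment itself); the point is only which run wins where both have data.

```python
import numpy as np

# Two hypothetical overlapping runs on a common hourly axis (nan = no output).
run_q1 = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
run_q2 = np.array([np.nan, np.nan, 30.0, 40.0, 50.0, 60.0])

# Keep the first run where it has data, fill the gaps from the second.
merged = np.where(np.isnan(run_q1), run_q2, run_q1)
print(merged)
```

In the overlap the first run takes precedence. Since `glob` does not guarantee any particular file order, sorting the file list before the loop would make the result reproducible.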

# Hindcast run

In [23]:
%time ds = readRun('emepHC')
ds

CPU times: user 3.45 s, sys: 2.14 s, total: 5.59 s
Wall time: 3min 54s


<xarray.Dataset>
Dimensions:            (dataset: 1, ilev: 9, lat: 369, lev: 8, lon: 301, time: 8761)
Coordinates:
  * lon                (lon) float64 -30.0 -29.75 -29.5 -29.25 -29.0 -28.75 ...
  * lat                (lat) float64 30.0 30.12 30.25 30.38 30.5 30.62 30.75 ...
  * lev                (lev) float64 0.9946 0.9838 0.9703 0.9509 0.8932 ...
  * ilev               (ilev) float64 0.9892 0.9784 0.9621 0.9396 0.8756 ...
  * time               (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * dataset            (dataset) <U6 'emepHC'
Data variables:
    P0                 (dataset) float64 1.013e+03
    hyam               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hybm               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hyai               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    hybi               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    SURF_ug_O3      

## Collocate

In [24]:
surfEMEP = lambda ds: ds.rename(dict(
    SURF_ug_O3='O3',
    SURF_ug_NO2='NO2',
    SURF_ug_SO2='SO2',
    SURF_ug_PM25_rh50='PM25',
    SURF_ug_PM10_rh50='PM10',
)).isel(lev=0).drop('lev')

dropEMEP = 'P0 ilev hyam hybm hyai hybi SURF_ug_CO COLUMN_NO2_k20 COLUMN_O3_k20 AOD_550nm'.split()

with ProgressBar():
    %time emep = collocate(surfEMEP(ds.drop(dropEMEP)), dlon=1/4, dlat=1/8)

[########################################] | 100% Completed | 23min 38.2s
CPU times: user 9min 7s, sys: 1min, total: 10min 8s
Wall time: 23min 55s


In [25]:
emep

<xarray.Dataset>
Dimensions:  (dataset: 1, station: 2237, time: 8761)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * dataset  (dataset) <U6 'emepHC'
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
Data variables:
    O3       (dataset, time, station) float32 72.93429 69.54431 49.42756 ...
    NO2      (dataset, time, station) float32 0.16582742 0.5343682 5.835265 ...
    PM25     (dataset, time, station) float32 0.9968202 1.126349 8.84856 ...
    PM10     (dataset, time, station) float32 1.5719622 1.7240334 10.386877 ...
    SO2      (dataset, time, station) float32 0.07718428 0.097183295 ...
    lon      (station) float64 1.5 1.75 20.75 19.5 19.5 13.75 16.75 16.0 ...
    lat      (station) float64 42.5 42.5 40.62 40.38 42.38 48.38 47.75 46.75 ...

## Add to observation dataset

In [26]:
%time data = data.combine_first(emep)
data.sel(dataset='emepHC').drop(['lon','lat','alt']).to_dataframe().describe()

CPU times: user 4.74 s, sys: 1.4 s, total: 6.14 s
Wall time: 6.12 s


statistic,NO2,PM10,PM25,SO2,O3
count,96195780.0,96195780.0,96195780.0,96195780.0,96195780.0
mean,6.785,8.9,6.7,2.079,22.833
std,11.391,13.855,10.646,5.426,38.035
min,0.001,0.402,0.402,0.0,0.0
25%,3.025,6.972,4.34,0.389,47.514
50%,6.572,12.253,8.185,1.109,63.518
75%,13.367,20.021,14.756,2.955,78.874
max,167.586,869.055,434.214,270.812,260.836


In [27]:
save2nc(data,ncfile)
del(emep)

## (Re)Analysis runs
4 overlapping runs, each producing a **~6 GB** hourly output file.

In [28]:
%time ds = readRun('emepAN')
ds

CPU times: user 8.61 s, sys: 3.11 s, total: 11.7 s
Wall time: 3min 22s


<xarray.Dataset>
Dimensions:            (dataset: 1, ilev: 9, lat: 369, lev: 8, lon: 301, time: 8761)
Coordinates:
  * time               (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * lon                (lon) float64 -30.0 -29.75 -29.5 -29.25 -29.0 -28.75 ...
  * lat                (lat) float64 30.0 30.12 30.25 30.38 30.5 30.62 30.75 ...
  * lev                (lev) float64 0.9946 0.9838 0.9703 0.9509 0.8932 ...
  * ilev               (ilev) float64 0.9892 0.9784 0.9621 0.9396 0.8756 ...
  * dataset            (dataset) <U6 'emepAN'
Data variables:
    P0                 (dataset) float64 1.013e+03
    hyam               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hybm               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hyai               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    hybi               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    SURF_ug_O3      

### Collocate

In [29]:
surfEMEP = lambda ds: ds.rename(dict(
    SURF_ug_O3='O3',
    SURF_ug_NO2='NO2',
#   SURF_ug_SO2='SO2',
    SURF_ug_PM25_rh50='PM25',
    SURF_ug_PM10_rh50='PM10',
)).isel(lev=0).drop('lev')

dropEMEP = 'P0 ilev hyam hybm hyai hybi COLUMN_NO2_k20 COLUMN_O3_k20 AOD_550nm'.split()

with ProgressBar():
    %time emep = collocate(surfEMEP(ds.drop(dropEMEP)), dlon=1/4, dlat=1/8)

[########################################] | 100% Completed | 12min  6.3s
CPU times: user 9min 2s, sys: 2min 59s, total: 12min 1s
Wall time: 12min 26s


### Add to observation dataset

In [30]:
%time data = data.combine_first(emep)
data.sel(dataset='emepAN').drop(['lon','lat','alt']).to_dataframe().describe()

CPU times: user 5.01 s, sys: 560 ms, total: 5.57 s
Wall time: 5.58 s


statistic,NO2,PM10,PM25,SO2,O3
count,96195780.0,96195780.0,96195780.0,0.0,96195780.0
mean,8.579,11.329,8.1,,23.851
std,12.333,15.777,12.414,,34.364
min,0.0,0.402,0.402,,0.0
25%,5.487,8.376,5.211,,27.607
50%,10.449,14.518,9.711,,49.529
75%,18.697,23.648,17.37,,69.582
max,235.185,880.153,434.334,,218.01


In [31]:
save2nc(data,ncfile)
del(emep)

# (Re)Analysis Re-runs
4 overlapping runs, each producing a **~6 GB** hourly output file.

In [32]:
%time ds = readRun('emepRE')
ds

CPU times: user 9.03 s, sys: 2.38 s, total: 11.4 s
Wall time: 2min 7s


<xarray.Dataset>
Dimensions:            (dataset: 1, ilev: 9, lat: 369, lev: 8, lon: 301, time: 8761)
Coordinates:
  * time               (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * lon                (lon) float64 -30.0 -29.75 -29.5 -29.25 -29.0 -28.75 ...
  * lat                (lat) float64 30.0 30.12 30.25 30.38 30.5 30.62 30.75 ...
  * lev                (lev) float64 0.9946 0.9838 0.9703 0.9509 0.8932 ...
  * ilev               (ilev) float64 0.9892 0.9784 0.9621 0.9396 0.8756 ...
  * dataset            (dataset) <U6 'emepRE'
Data variables:
    P0                 (dataset) float64 1.013e+03
    hyam               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hybm               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hyai               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    hybi               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    SURF_ug_O3      

In [33]:
surfEMEP = lambda ds: ds.rename(dict(
    SURF_ug_O3='O3',
    SURF_ug_NO2='NO2',
    SURF_ug_SO2='SO2',
    SURF_ug_PM25_rh50='PM25',
    SURF_ug_PM10_rh50='PM10',
)).isel(lev=0).drop('lev')

dropEMEP = 'P0 ilev hyam hybm hyai hybi COLUMN_NO2_k20 COLUMN_O3_k20 AOD_550nm'.split()

with ProgressBar():
    %time emep = collocate(surfEMEP(ds.drop(dropEMEP)), dlon=1/4, dlat=1/8)

[########################################] | 100% Completed | 17min  9.7s
CPU times: user 11min 22s, sys: 5min 41s, total: 17min 4s
Wall time: 17min 40s


In [34]:
emep

<xarray.Dataset>
Dimensions:  (dataset: 1, station: 2237, time: 8761)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * dataset  (dataset) <U6 'emepRE'
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
Data variables:
    O3       (dataset, time, station) float32 68.562416 64.24701 54.29532 ...
    NO2      (dataset, time, station) float32 0.0 0.0 5.753371 2.8036213 ...
    PM25     (dataset, time, station) float32 0.9968202 1.126349 8.84856 ...
    PM10     (dataset, time, station) float32 1.5719622 1.7240334 10.386877 ...
    SO2      (dataset, time, station) float32 0.0 0.0 5.1408052 7.74594 ...
    lon      (station) float64 1.5 1.75 20.75 19.5 19.5 13.75 16.75 16.0 ...
    lat      (station) float64 42.5 42.5 40.62 40.38 42.38 48.38 47.75 46.75 ...

## Add to observation dataset

In [35]:
%time data = data.combine_first(emep)
data.sel(dataset='emepRE').drop(['lon','lat','alt']).to_dataframe().describe()

CPU times: user 6.97 s, sys: 1.19 s, total: 8.16 s
Wall time: 8.15 s


statistic,NO2,PM10,PM25,SO2,O3
count,96195780.0,96195780.0,96195780.0,96195780.0,96195780.0
mean,8.582,11.388,8.161,1.728,24.063
std,12.351,15.949,12.497,4.03,34.545
min,0.0,0.402,0.402,0.0,0.0
25%,5.487,8.379,5.21,0.64,27.61
50%,10.449,14.535,9.716,1.472,49.533
75%,18.697,23.716,17.419,3.088,69.601
max,235.1,880.371,434.334,423.458,261.428


In [36]:
save2nc(data,ncfile)
del(emep)

# (Re)Analysis Re-runs with PM assimilation
4 overlapping runs, each producing a **~6 GB** hourly output file.

In [37]:
%time ds = readRun('emepPM')
ds

CPU times: user 10 s, sys: 2.71 s, total: 12.7 s
Wall time: 2min 27s


<xarray.Dataset>
Dimensions:            (dataset: 1, ilev: 9, lat: 369, lev: 8, lon: 301, time: 8737)
Coordinates:
  * time               (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * lon                (lon) float64 -30.0 -29.75 -29.5 -29.25 -29.0 -28.75 ...
  * lat                (lat) float64 30.0 30.12 30.25 30.38 30.5 30.62 30.75 ...
  * lev                (lev) float64 0.9946 0.9838 0.9703 0.9509 0.8932 ...
  * ilev               (ilev) float64 0.9892 0.9784 0.9621 0.9396 0.8756 ...
  * dataset            (dataset) <U6 'emepPM'
Data variables:
    P0                 (dataset) float64 1.013e+03
    hyam               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hybm               (dataset, lev) float64 dask.array<shape=(1, 8), chunksize=(1, 8)>
    hyai               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    hybi               (dataset, ilev) float64 dask.array<shape=(1, 9), chunksize=(1, 9)>
    SURF_ug_O3      

In [38]:
surfEMEP = lambda ds: ds.rename(dict(
    SURF_ug_O3='O3',
    SURF_ug_NO2='NO2',
    SURF_ug_SO2='SO2',
    SURF_ug_PM25_rh50='PM25',
    SURF_ug_PM10_rh50='PM10',
)).isel(lev=0).drop('lev')

dropEMEP = 'P0 ilev hyam hybm hyai hybi SURF_ug_CO COLUMN_NO2_k20 COLUMN_O3_k20 AOD_550nm'.split()

with ProgressBar():
    %time emep = collocate(surfEMEP(ds.drop(dropEMEP)), dlon=1/4, dlat=1/8)

[########################################] | 100% Completed | 17min 26.7s
CPU times: user 12min 31s, sys: 4min 54s, total: 17min 26s
Wall time: 18min 17s


In [39]:
emep

<xarray.Dataset>
Dimensions:  (dataset: 1, station: 2237, time: 8737)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2015-01-01T01:00:00 ...
  * dataset  (dataset) <U6 'emepPM'
  * station  (station) object 'AD0944A' 'AD0945A' 'AL0203A' 'AL0204A' ...
Data variables:
    O3       (dataset, time, station) float32 89.58768 81.01154 57.31129 ...
    NO2      (dataset, time, station) float32 14.632897 22.119638 6.565931 ...
    PM25     (dataset, time, station) float32 0.9968202 1.126349 8.84856 ...
    PM10     (dataset, time, station) float32 16.215338 15.908787 17.878729 ...
    SO2      (dataset, time, station) float32 0.059296325 0.09271451 ...
    lon      (station) float64 1.5 1.75 20.75 19.5 19.5 13.75 16.75 16.0 ...
    lat      (station) float64 42.5 42.5 40.62 40.38 42.38 48.38 47.75 46.75 ...

## Add to observation dataset

In [40]:
%time data = data.combine_first(emep)
data.sel(dataset='emepPM').drop(['lon','lat','alt']).to_dataframe().describe()

CPU times: user 17.3 s, sys: 1.66 s, total: 19 s
Wall time: 18.9 s


statistic,NO2,PM10,PM25,SO2,O3
count,95932260.0,95932260.0,95932260.0,95932260.0,95932260.0
mean,10.369,13.155,9.357,2.148,24.846
std,14.165,22.998,15.863,5.51,36.878
min,0.0,0.004,0.004,0.0,0.0
25%,6.27,10.498,6.143,0.794,32.412
50%,11.727,16.938,11.412,1.797,53.475
75%,21.363,27.05,19.446,3.677,73.39
max,238.011,739.413,521.227,884.986,291.141


In [41]:
save2nc(data,ncfile)
del(emep)