# 1. Data Preparation

## Dataset files
The dataset files contain 256x128 Gaussian grid points covering the whole Earth for 40542 days (1/1/1900 to 31/12/2010):
- Mean sea level pressure: https://climexp.knmi.nl/ERA-20C/era20c_msl_daily.nc
- 500 hPa geopotential height: https://climexp.knmi.nl/ERA-20C/era20c_z500_daily.nc
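
As a quick sanity check on the record length, the number of daily time steps can be recomputed from the two end dates given above. This is a small sketch using only the standard library; the variable names are illustrative and not part of the original notebook.

```python
import datetime

# First and last day covered by the daily files, as stated above
first_day = datetime.date(1900, 1, 1)
last_day = datetime.date(2010, 12, 31)

# Both end dates are included, hence the +1
n_days = (last_day - first_day).days + 1
print(n_days)  # 40542
```

The result matches the first dimension of the arrays produced in the cells below.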

These files are stored in a "data" subfolder and read with the netCDF4 library, which can be installed with 'pip install netCDF4'.

In [1]:
import netCDF4 as nc
for dataset in ["msl","z500"]:
    df = nc.Dataset("data/era20c_" + dataset + "_daily.nc",'r')
    #print(df.dimensions)
    print(df.variables.keys())
    #print(df)

dict_keys(['longitude', 'latitude', 'time', 'msl'])
dict_keys(['longitude', 'latitude', 'time', 'z500'])


## Cut the region of interest

The region of interest is Europe and the North Atlantic (58W-33E, 25N-70N). It is chosen so that the longitude and latitude dimensions are powers of 2 (64x32). The zone metadata is saved in a NumPy array.

In [2]:
import numpy as np
zone=np.array([-58.,33.,25.,70.])  # lon west, lon east, lat south, lat north (degrees)
for dataset in ["msl","z500"]:
    df = nc.Dataset("data/era20c_" + dataset + "_daily.nc",'r')
    lat = slice(int((90-zone[3])*128/180), int((90-zone[2])*128/180))  # 70N -> 25N
    east = np.array(df[dataset][:,0:int(zone[1]*256/360),lat])  # 0 -> 33E
    west = np.array(df[dataset][:,int(zone[0]*256/360):,lat])   # 58W -> 0 (negative start index counts from the end)
    data = np.concatenate((west,east),axis=1)  # join the two blocks so longitude runs 58W -> 33E
    print(data.shape)
    np.save("data/" + dataset + ".npy", data)
    df.close()
np.save("data/zone.npy", zone)

(40542, 64, 32)
(40542, 64, 32)
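
The index arithmetic used in the cell above can be checked in isolation, without opening the netCDF files. The sketch below repeats the same degree-to-index conversion and confirms the 64x32 region size. It assumes, as the slicing above implies, that longitude index 0 lies at 0E and latitude index 0 at 90N; the constant and variable names are illustrative. The linear mapping also treats the Gaussian latitudes as evenly spaced, which is only an approximation.

```python
# Global grid size and region bounds as described above (names are illustrative)
N_LON, N_LAT = 256, 128
lon_w, lon_e, lat_s, lat_n = -58.0, 33.0, 25.0, 70.0

# Longitude: western longitudes give a negative index, counted from the end of the axis
i_west = int(lon_w * N_LON / 360)   # start of the 58W -> 0 block
i_east = int(lon_e * N_LON / 360)   # end of the 0 -> 33E block

# Latitude: index grows southwards from the assumed 90N origin
j_north = int((90 - lat_n) * N_LAT / 180)
j_south = int((90 - lat_s) * N_LAT / 180)

n_lon = -i_west + i_east            # columns west of 0 plus columns east of 0
n_lat = j_south - j_north
print(i_west, i_east, j_north, j_south)  # -41 23 14 46
print(n_lon, n_lat)                      # 64 32
```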


## Create the ML dataset

The msl and z500 data are scaled and saved in one NumPy array with shape (40542,32,64,2). The longitude and latitude axes are swapped so that each sample has shape (32,64). The last dimension holds the 2 channels, msl and z500, which will be used by the convolutional layer of the neural network. The scaling factors are saved in a .npy file.

In [3]:
msl = np.load("data/msl.npy")
z500 = np.load("data/z500.npy")
# scale data so that 99.99% of the values fall in [0,1]
scale = np.array([[94000,107000],[43000,58000]])  # [min,max] for msl (row 0) and z500 (row 1)
msl = (msl - scale[0,0]) / (scale[0,1] - scale[0,0])
z500 = (z500 - scale[1,0]) / (scale[1,1] - scale[1,0])
data = np.zeros((msl.shape[0],msl.shape[1],msl.shape[2],2))
data[:,:,:,0] = msl
data[:,:,:,1] = z500
data = np.swapaxes(data,1,2)
print(data.shape)
np.save("data/data.npy", data)
np.save("data/scale.npy", scale)

(40542, 32, 64, 2)
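
Because the scaling bounds are stored next to the data, scaled values can later be mapped back to their original range. The following is a minimal sketch of that inverse step, using a small random array in place of the real data.npy; the names `scaled`, `physical`, `lo` and `hi` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Same bounds as in the cell above: rows are msl and z500, columns are (min, max)
scale = np.array([[94000, 107000], [43000, 58000]])

# Tiny synthetic stand-in for the real array, shape (days, lat, lon, channels)
rng = np.random.default_rng(0)
scaled = rng.random((3, 32, 64, 2))

# Undo the min-max scaling; broadcasting applies each channel's bounds along the last axis
lo, hi = scale[:, 0], scale[:, 1]
physical = scaled * (hi - lo) + lo

# Re-applying the forward scaling should give back the scaled values
roundtrip = (physical - lo) / (hi - lo)
print(np.allclose(roundtrip, scaled))
```

Keeping the bounds per channel in one array means the same two lines work for any number of channels, as long as the channel axis is last.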


## Create date list

In [4]:
import numpy as np
import datetime
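base = datetime.datetime(1900, 1, 1)
# unpadded "year month day" strings, built by hand because "%-m"/"%-d" are not supported by strftime on all platforms
dates = ["{0.year} {0.month} {0.day}".format(base + datetime.timedelta(days=x)) for x in range(msl.shape[0])]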
np.save("data/dates.npy", dates)
dates[-1]

'2010 12 31'
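
The date list shares its index with the first axis of the data array, so it can be used to find the row for a given day. The sketch below rebuilds the same unpadded "year month day" strings with the standard library and looks one up through a dictionary. The chosen date and the name `index_of` are purely illustrative.

```python
import datetime

base = datetime.date(1900, 1, 1)
n_days = 40542  # record length stated at the top of this section

# Rebuild the same unpadded "year month day" strings as in the cell above
dates = []
for x in range(n_days):
    d = base + datetime.timedelta(days=x)
    dates.append("{} {} {}".format(d.year, d.month, d.day))

# Hypothetical lookup table: date string -> row index along the time axis
index_of = {s: i for i, s in enumerate(dates)}
i = index_of["1953 2 1"]
print(i, dates[i])
```

A dictionary avoids scanning the full list on every lookup, which matters if many dates are queried.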