## NetCDF Weather File Converter
Description:  ETL workflow for weather files from the EU Copernicus project.  The input is a NetCDF file with 2-m temperature, 10-m wind speed, total cloud cover, and total precipitation data, provided as monthly means for a one-year period.  The notebook produces one reduced CSV file per month for display in web browsers on the "2018 Weather" web site.
Rationale:  This notebook automates and allows reproduction of the data table generation process.  As the originator of the data files and builder of the end-use web site, I am the logical choice for authoring the notebook.  The process is needed to finish the web site.  Doing it in the notebook will enable expeditious future website updates.
by:  Andrew Guenthner
v0.5:  05-07-2019

### Requirements:
The notebook expects a set of files named 'means_30yr\[month#\].csv', where month# runs from 0 to 11 (the zero-based month index used when the files are read below), in the same directory as this notebook.  It also expects a file named 'city_by_latlongroup.csv', which is used to label lat/lons with city and country names for readability.  The 'means_30yr' files contain the 30-year averages for 2-m air temperature, total precipitation, 10-m wind speed, and total cloud cover by month of year.  They are generated by running the notebook named 'generate_30yr_mean.ipynb', which uses a 30-year data file.  The 1-year input file for this notebook and the input path can be set below.  The SciPy (0.16), Pandas (0.23), and NumPy (1.15) modules are also needed.

### Note:  The files involved can be >100 MB in size
Make sure your system resources are adequate before running this notebook.
The operations involved attempt to keep data on disk as much as possible, but can still consume a lot of memory. 
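
As a rough sizing sketch, the in-memory footprint of one variable can be estimated from its shape and data type.  The figures below assume a 12 x 721 x 1440 grid of 16-bit integers (the shape and type reported later in this notebook); the helper name `array_megabytes` is hypothetical and not part of the notebook.

```python
def array_megabytes(shape, bytes_per_value):
    """Return the size in MB (10**6 bytes) of a dense array with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value / 1e6

# Assumed grid: 12 months x 721 latitudes x 1440 longitudes
raw_mb = array_megabytes((12, 721, 1440), 2)    # packed 16-bit integers
float_mb = array_megabytes((12, 721, 1440), 8)  # same grid held as 64-bit floats
print(round(raw_mb, 1), round(float_mb, 1))
```

Under these assumptions each variable takes about 25 MB in packed form, and roughly four times that if it is ever expanded to 64-bit floats.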

In [1]:
# Import dependencies
import numpy as np
from scipy.io import netcdf
import datetime as dt
import pandas as pd
filepath = '../Large_File/'

In [2]:
# Input the filename you want to process here...
file_to_open = 'wx_monthly_data_2018.nc'

In [3]:
# Do a quick file test before starting ...
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    print(f.history)

b'2019-05-07 21:07:41 GMT by grib_to_netcdf-2.10.0: /opt/ecmwf/eccodes/bin/grib_to_netcdf -o /cache/data6/adaptor.mars.internal-1557263252.1882875-9670-11-dcde4521-4ef2-4a84-b494-d4054b6b9cb7.nc /cache/tmp/dcde4521-4ef2-4a84-b494-d4054b6b9cb7-adaptor.mars.internal-1557263252.189089-9670-4-tmp.grib'


The code above should print the file's history attribute, a byte string recording when and how the file was generated.  

### File content check

The next steps are meant for basic file exploration.  These can be skipped if you know the file contains the needed info.

In [4]:
# Print the available variables
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    print(f.variables)

OrderedDict([('longitude', <scipy.io.netcdf.netcdf_variable object at 0x000001A73B4A2B38>), ('latitude', <scipy.io.netcdf.netcdf_variable object at 0x000001A73B4A2C50>), ('time', <scipy.io.netcdf.netcdf_variable object at 0x000001A73B4A2BA8>), ('si10', <scipy.io.netcdf.netcdf_variable object at 0x000001A73B4A2B00>), ('t2m', <scipy.io.netcdf.netcdf_variable object at 0x000001A73B4A2A20>), ('tcc', <scipy.io.netcdf.netcdf_variable object at 0x000001A73B4A2F28>), ('tp', <scipy.io.netcdf.netcdf_variable object at 0x000001A73B4BB080>)])


What you should see is:
*  longitude
*  latitude
*  time
*  si10 -- this is the wind speed at 10 m height
*  t2m -- this is the 2-meter air temperature
*  tcc -- total cloud cover
*  tp -- total precipitation
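
The visual check above could also be automated.  The sketch below is one possible way to do it, not something the notebook itself does; `REQUIRED_VARIABLES` and `missing_variables` are hypothetical names, and a plain list stands in for the keys of `f.variables`.

```python
# Variable names this notebook reads from the NetCDF file
REQUIRED_VARIABLES = ('longitude', 'latitude', 'time', 'si10', 't2m', 'tcc', 'tp')

def missing_variables(available, required=REQUIRED_VARIABLES):
    """Return the required variable names that are absent from `available`."""
    return [name for name in required if name not in available]

# Example: a file that lacks the wind and cloud cover variables
print(missing_variables(['longitude', 'latitude', 'time', 't2m', 'tp']))
```

With an open file, the same function could be called as `missing_variables(f.variables)`, since membership tests work on the dictionary keys.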

In [5]:
# Print the units of these variables
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    print('longitude:  ',f.variables['longitude'].units)
    print('latitude:   ',f.variables['latitude'].units)
    print('time:       ',f.variables['time'].units)
    print('temperature:',f.variables['t2m'].units)
    print('wind speed: ',f.variables['si10'].units)
    print('cloud cover ',f.variables['tcc'].units)
    print('total precip',f.variables['tp'].units)

longitude:   b'degrees_east'
latitude:    b'degrees_north'
time:        b'hours since 1900-01-01 00:00:00.0'
temperature: b'K'
wind speed:  b'm s**-1'
cloud cover  b'(0 - 1)'
total precip b'm'


In [6]:
# Print the spatio-temporal characteristics of the data
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    print('Longitude:')
    print('# of Points: ',f.variables['longitude'].shape)
    print('From: ',f.variables['longitude'][0],' to ',f.variables['longitude'][-1])
    print('Latitude:')
    print('# of Points: ',f.variables['latitude'].shape)
    print('From: ',f.variables['latitude'][0],' to ',f.variables['latitude'][-1])
    print('Time:')
    print('# of Points: ',f.variables['time'].shape)
    print('From: ',f.variables['time'][0],' to ',f.variables['time'][-1])

Longitude:
# of Points:  (1440,)
From:  0.0  to  359.75
Latitude:
# of Points:  (721,)
From:  90.0  to  -90.0
Time:
# of Points:  (12,)
From:  1034376  to  1042392
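
The time values are easier to read once converted to dates.  The sketch below assumes the units reported above ('hours since 1900-01-01 00:00:00.0'); the helper name `hours_to_datetime` is hypothetical.

```python
import datetime as dt

# Reference time taken from the 'time' variable's units string
EPOCH = dt.datetime(1900, 1, 1)

def hours_to_datetime(hours):
    """Convert an 'hours since 1900-01-01' value to a datetime."""
    return EPOCH + dt.timedelta(hours=int(hours))

print(hours_to_datetime(1034376))  # first time value in the file
print(hours_to_datetime(1042392))  # last time value in the file
```

The first and last values correspond to 2018-01-01 and 2018-12-01, which is consistent with twelve monthly means for 2018.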


### Process the 2018 Data

### Temperature

In [7]:
# Copy the contents into memory, then close the file
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    fieldarray = f.variables['t2m'][:].copy()

In [8]:
# Check that we have a 12 x 721 x 1440 array
fieldarray.shape

(12, 721, 1440)

In [9]:
# Compute the monthly means as a 12-item list (one item per month) of 7200 downsampled data points
# Each list item is a data point for one lat-lon pair
# Only the final, smaller grid will be stored 
# Input data are raw 16-bit integers, the list is a floating point average of these raw values
tdatalist = []
for monthnum in range(12):
    #Downsample using the means of a 12 x 12 lat-lon grid to make a 7200 item list
    #Ignore the final latitude point of -90 
    fieldlist = [fieldarray[monthnum,i:i+12,j:j+12].mean() for i in range(0,720,12) for j in range(0, 1440, 12)]
    tdatalist.append(fieldlist)
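
The block averaging used here can also be written as a reshape followed by a mean, which avoids the Python-level loop.  This is an alternative sketch rather than the notebook's method; the function names are hypothetical, and a small synthetic int16 array stands in for one month of data.

```python
import numpy as np

def block_means_loop(field, block=12):
    """Average non-overlapping block x block tiles, as in the list comprehension above."""
    rows = field.shape[0] - field.shape[0] % block  # drop leftover rows (e.g. the -90 latitude)
    cols = field.shape[1] - field.shape[1] % block
    return [field[i:i + block, j:j + block].mean()
            for i in range(0, rows, block) for j in range(0, cols, block)]

def block_means_reshape(field, block=12):
    """Same result via a 4-D reshape: (tile row, row in tile, tile col, col in tile)."""
    rows = field.shape[0] - field.shape[0] % block
    cols = field.shape[1] - field.shape[1] % block
    tiles = field[:rows, :cols].reshape(rows // block, block, cols // block, block)
    return tiles.mean(axis=(1, 3)).ravel()

# Synthetic test grid (not real weather values): 25 rows so that one row is left over
rng = np.random.default_rng(0)
demo = rng.integers(-1000, 1000, size=(25, 36), dtype=np.int16)
print(np.allclose(block_means_loop(demo), block_means_reshape(demo)))
```

Both versions return values in the same order (latitude blocks outermost, longitude blocks innermost), so either could feed the later steps unchanged.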

### Precipitation

In [10]:
# Copy the contents into memory, then close the file
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    fieldarray = f.variables['tp'][:].copy()

In [11]:
# Check that we have a 12 x 721 x 1440 array
fieldarray.shape

(12, 721, 1440)

In [12]:
# Compute the monthly means as a 12-item list (one item per month) of 7200 downsampled data points
# Each list item is a data point for one lat-lon pair
# Only the final, smaller grid will be stored 
# Input data are raw 16-bit integers, the list is a floating point average of these raw values
pdatalist = []
for monthnum in range(12):
    #Downsample using the means of a 12 x 12 lat-lon grid to make a 7200 item list
    #Ignore the final latitude point of -90 
    fieldlist = [fieldarray[monthnum,i:i+12,j:j+12].mean() for i in range(0,720,12) for j in range(0, 1440, 12)]
    pdatalist.append(fieldlist)

### Wind

In [13]:
# Copy the contents into memory, then close the file
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    fieldarray = f.variables['si10'][:].copy()

In [14]:
# Check that we have a 12 x 721 x 1440 array
fieldarray.shape

(12, 721, 1440)

In [15]:
# Compute the monthly means as a 12-item list (one item per month) of 7200 downsampled data points
# Each list item is a data point for one lat-lon pair
# Only the final, smaller grid will be stored 
# Input data are raw 16-bit integers, the list is a floating point average of these raw values
wdatalist = []
for monthnum in range(12):
    #Downsample using the means of a 12 x 12 lat-lon grid to make a 7200 item list
    #Ignore the final latitude point of -90 
    fieldlist = [fieldarray[monthnum,i:i+12,j:j+12].mean() for i in range(0,720,12) for j in range(0, 1440, 12)]
    wdatalist.append(fieldlist)

### Cloud Cover

In [16]:
# Copy the contents into memory, then close the file
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    fieldarray = f.variables['tcc'][:].copy()

In [17]:
# Check that we have a 12 x 721 x 1440 array
fieldarray.shape

(12, 721, 1440)

In [18]:
# Compute the monthly means as a 12-item list (one item per month) of 7200 downsampled data points
# Each list item is a data point for one lat-lon pair
# Only the final, smaller grid will be stored 
# Input data are raw 16-bit integers, the list is a floating point average of these raw values
cdatalist = []
for monthnum in range(12):
    #Downsample using the means of a 12 x 12 lat-lon grid to make a 7200 item list
    #Ignore the final latitude point of -90 
    fieldlist = [fieldarray[monthnum,i:i+12,j:j+12].mean() for i in range(0,720,12) for j in range(0, 1440, 12)]
    cdatalist.append(fieldlist)

### Compose the Data Frames

In [19]:
# Build latitude and longitude lists (does NOT automatically use input data)
# Look at the values printed just before the "Process the 2018 Data" section if you need to adjust
# The list data grid is now 60 latitudes x 120 longitudes -- adjust if needed below
lat_len = 60
long_len = 120
# Calculate lat and long intervals (should be 3 deg x 3 deg if using 0.25 deg gridded input data)
lat_interval = 180 / lat_len
long_interval = 360 / long_len
# Use center-points as list coordinates
# The lat list should repeat each lat value long_len times in a row: 1,1,1,2,2,2,3,3,3,etc.
# The lat list starts at 90 - lat_interval / 2
# It strides by -1 * lat_interval (it starts near the north pole and goes down), so it ends at
#     90 - lat_interval / 2 - lat_interval * (lat_len - 1)
# The last value in the lat list should therefore be -90 + lat_interval / 2
latlist = [90 - lat_interval * (0.5 + n) for n in range(lat_len) for _ in range(long_len)]
# The long list should go from long_interval / 2 to long_interval / 2 + long_interval * (long_len - 1)
# The last value should equal 360 - long_interval / 2
# The long list should repeat lat_len times as a periodic sequence: 1,2,3,1,2,3,1,2,3,etc.
longlist = [long_interval * (0.5 + n) for _ in range(lat_len) for n in range(long_len)]
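
A quick way to confirm the ordering of the two lists is to print their end points and the values on either side of the first row boundary.  The sketch below repeats the list construction with the same settings (60 x 120) so that it runs on its own.

```python
lat_len, long_len = 60, 120
lat_interval = 180 / lat_len
long_interval = 360 / long_len

# Latitude changes slowly (once per row), longitude cycles within each row
latlist = [90 - lat_interval * (0.5 + n) for n in range(lat_len) for _ in range(long_len)]
longlist = [long_interval * (0.5 + n) for _ in range(lat_len) for n in range(long_len)]

print(len(latlist), latlist[0], latlist[-1])
print(latlist[long_len - 1], latlist[long_len])     # last cell of row 0, first cell of row 1
print(longlist[0], longlist[long_len - 1], longlist[long_len])
```

The latitude stays fixed across the first 120 entries while the longitude runs from 1.5 to 358.5 and then wraps, which matches the order in which the downsampling step emits its 7200 values (latitude blocks outermost, longitude blocks innermost).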

In [20]:
# Initialize a DataFrame with the lat, long lists
wx_starter_df = pd.DataFrame({'lat':latlist,'long':longlist})
# Merge in the citynames from the file 'city_by_latlongroup.csv'
city_df = pd.read_csv('city_by_latlongroup.csv')
wx_starter_df = wx_starter_df.merge(city_df,left_index=True,right_index=True)
# Do some housekeeping to retain only what we want in a nicely labeled way
wx_starter_df = wx_starter_df.drop(columns = ['Unnamed: 0','lat_label','long_label','lat_y','long_y','pop'])
wx_starter_df.columns = ['lat','long','city','country']
wx_starter_df.head()

| | lat | long | city | country |
|---|---|---|---|---|
| 0 | 88.5 | 1.5 | -- | -- |
| 1 | 88.5 | 4.5 | -- | -- |
| 2 | 88.5 | 7.5 | -- | -- |
| 3 | 88.5 | 10.5 | -- | -- |
| 4 | 88.5 | 13.5 | -- | -- |


In [21]:
#Before adding the data, we need to convert the numbers from the raw formats using the 
#scale and offset factors found in the file.
infile = filepath + file_to_open
with netcdf.netcdf_file(infile, 'r') as f:
    t_scale = f.variables['t2m'].scale_factor
    p_scale = f.variables['tp'].scale_factor
    w_scale = f.variables['si10'].scale_factor
    c_scale = f.variables['tcc'].scale_factor
    t_offset = f.variables['t2m'].add_offset
    p_offset = f.variables['tp'].add_offset
    w_offset = f.variables['si10'].add_offset
    c_offset = f.variables['tcc'].add_offset

In [22]:
#Now convert the data and add it into a list of 12 DataFrames
climdflist = []
for monthnum in range(12):
    build_df = wx_starter_df.copy()
    # For temperature, we subtract 273.15 to convert from K to deg C
    build_df['temp'] = [val * t_scale + t_offset - 273.15 for val in tdatalist[monthnum]]
    build_df['prcp'] = [val * p_scale + p_offset for val in pdatalist[monthnum]]
    build_df['wind'] = [val * w_scale + w_offset for val in wdatalist[monthnum]]
    build_df['cloud'] = [val * c_scale + c_offset for val in cdatalist[monthnum]]
    climdflist.append(build_df)
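
Averaging the raw packed values first and converting afterwards gives the same result as converting every value and then averaging, because the conversion `raw * scale_factor + add_offset` is linear.  The sketch below illustrates this with made-up numbers; the scale, offset, and raw values are hypothetical and are not taken from the data file.

```python
def unpack(raw, scale_factor, add_offset):
    """Convert a packed raw value to physical units."""
    return raw * scale_factor + add_offset

# Hypothetical packing parameters and raw values, for illustration only
scale, offset = 0.002, 250.0
raw_values = [1200, -800, 4000, 15000]

# Route 1: average the raw values, then unpack once (what the notebook does)
mean_then_unpack = unpack(sum(raw_values) / len(raw_values), scale, offset)

# Route 2: unpack every value, then average
unpack_then_mean = sum(unpack(v, scale, offset) for v in raw_values) / len(raw_values)

print(abs(mean_then_unpack - unpack_then_mean) < 1e-9)

# Temperature additionally needs the Kelvin to Celsius shift
temp_celsius = mean_then_unpack - 273.15
```

This is why the downsampling steps can work on the raw integers and defer the conversion to this point.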

In [23]:
#Next we will compute the departure from normal
for monthnum in range(12):
    # Read in the mean values from the file
    infile = 'means_30yr' + str(monthnum) + '.csv'
    means_df = pd.read_csv(infile)
    # Subtract, or divide in the case of precip, to get the anomaly
    climdflist[monthnum]['temp_anom'] = climdflist[monthnum]['temp'] - means_df['temp']
    # For precip, we need to take special care not to divide by 0, so we add a tiny number 
    # to the denominator
    climdflist[monthnum]['prcp_pct_of_norm'] = climdflist[monthnum]['prcp'] * 100 \
                        / means_df['prcp'].apply(lambda x: x + .000000001 if x == 0 else x)
    climdflist[monthnum]['wind_anom'] = climdflist[monthnum]['wind'] - means_df['wind']
    climdflist[monthnum]['cloud_anom'] = climdflist[monthnum]['cloud'] - means_df['cloud']
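
The zero-division guard used for precipitation can be illustrated with a scalar version of the same logic.  The function name `percent_of_normal` is hypothetical, and the input values are made up for illustration.

```python
def percent_of_normal(value, normal, epsilon=1e-9):
    """Return value as a percentage of normal, nudging a zero normal to epsilon."""
    denominator = normal + epsilon if normal == 0 else normal
    return value * 100 / denominator

print(percent_of_normal(0.0003, 0.0005))  # ordinary case, close to 60 percent
print(percent_of_normal(0.0, 0.0))        # dry month, dry normal
print(percent_of_normal(0.0003, 0.0))     # any rain over a zero normal
```

Note that the guard prevents a division error but not extreme results: a grid cell with a zero normal and any measurable precipitation yields a very large percentage, so such cells may deserve separate handling when the table is displayed.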

In [24]:
#Check if everything worked as it should
climdflist[0].head()

| | lat | long | city | country | temp | prcp | wind | cloud | temp_anom | prcp_pct_of_norm | wind_anom | cloud_anom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88.5 | 1.5 | -- | -- | -23.587214 | 0.000311 | 5.030107 | 0.926669 | 1.29499 | 62.518619 | -1.580189 | 0.002492 |
| 1 | 88.5 | 4.5 | -- | -- | -23.608494 | 0.000308 | 5.031904 | 0.926261 | 1.257945 | 61.533532 | -1.586646 | 0.002499 |
| 2 | 88.5 | 7.5 | -- | -- | -23.625597 | 0.000304 | 5.034157 | 0.927412 | 1.223374 | 60.443025 | -1.594632 | 0.004221 |
| 3 | 88.5 | 10.5 | -- | -- | -23.641381 | 0.000299 | 5.035086 | 0.928508 | 1.187258 | 59.168419 | -1.605817 | 0.005836 |
| 4 | 88.5 | 13.5 | -- | -- | -23.65358 | 0.000294 | 5.031785 | 0.929146 | 1.155526 | 57.926958 | -1.617965 | 0.006373 |


### Output the data files

In [25]:
for monthnum in range(12):
    outfile = 'month' + str(monthnum+1) + '_wx_table.csv'
    climdflist[monthnum].to_csv(outfile, index=False)

This notebook last executed on:

In [26]:
print(dt.datetime.now())

2019-05-08 01:38:34.165963
