# A2E-MMC Data cleaning notebook: PS-06 met station (z21.b0)

This notebook is one in a series of data cleaning notebooks for the Atmosphere to Electron Mesoscale Microscale Coupling project. Information about the project can be found here: https://a2e.energy.gov/. The data cleaning notebooks standardize the format of the data used in the project. The observation data collected for comparison with model results come from various sensors in the northwest United States, as well as Texas, coastal New Jersey and Virginia. Each notebook is specific to a sensor, ingests that sensor's particular data set, and outputs a curated set of variables with consistent naming conventions, units, and fill values. The output data set can then be used in notebooks designed for model analysis. (a link to those notebooks goes here)

All input and output files are in netCDF format.

Specifically, this notebook is for the data collected at the meteorological station at the PS-06 tower location (45.63798, -120.65082). The PS-06 tower stands 21 meters above ground and takes measurements every 15 minutes at 3, 10, and 21 meters using sonic anemometers, LI-COR sensors, and temperature/relative humidity probes. Note that this notebook converts netCDF data, and the measurements at 10 and 21 meters are available only in csv format, so this data set contains data only at the 3 meter level. The tower sits at an elevation of 474 meters above sea level. The data can be found here: https://a2e.energy.gov/data/wfip2/met.z21.b0.

Start by importing the libraries you will need for running this notebook:

In [1]:
import os 
import sys
import numpy as np
import matplotlib.pyplot as plt
from netCDF4 import Dataset as ncdf
from netCDF4 import stringtochar, num2date, date2num
from matplotlib import cm
import pandas as pd
import datetime
from datetime import date, time,timedelta
import netCDF4
sys.path.append('../../mmctools/')
from datawriters import write_to_netCDF

After downloading the data for the date you are interested in, change inputPathBase to reflect where you are keeping that input file. Note that this should be just the folder path, ending with a slash, and should not contain the actual file name. You may also need to change the year, month, and day if you are using a date other than November 1, 2016. Change outputPathBase to reflect where you want to store the output file (again, this is a folder path). Also, change my name to your name in who_created_me so that you will be associated with the data set you create.

In [2]:
#Your working directory (where the data lives...)
inputPathBase = "/Users/decastro/Downloads/"

#Some instrument specifics
station_name = 'Boardman'
sensor_name  = 'met.z21'

instrument_filePrefix = "{sensorName:s}.b0".format(sensorName=sensor_name)
instrument_fileSuffix = ".son03m.biomet.full_output.csv.a2e.nc"

#The date of interest...
year = 2016
month = 11
day = 1
dateString = "{yyyy:4d}{mm:02d}{dd:02d}".format(yyyy=year,mm=month,dd=day)

#The start time of interest...
starthour = 0
startmin = 0
startsec = 0
timeString = "{hour:02d}{minute:02d}{second:02d}".format(hour=starthour,minute=startmin,second=startsec)

#output file specifics
outputPathBase = "/Users/decastro/Downloads/"
output_filePrefix = instrument_filePrefix
output_fileSuffix = ".mmc.a2e.nc"

#Set a value for the output file author attribute
who_created_me = 'Amy DeCastro decastro@ucar.edu'


Run the cell below to assign your input file name and output file name. 

In [3]:
#Setup the inputFile and outputFile names from the information specified above
inputFile = "{pb:s}{fP:s}.{ds:s}.{ts:s}{fS:s}".format(pb=inputPathBase,
                                                      fP=instrument_filePrefix,
                                                      ds=dateString,
                                                      ts=timeString,
                                                      fS=instrument_fileSuffix)
print(inputFile)

outputFile = "{pb:s}{fP:s}.{ds:s}.{ts:s}{fS:s}".format(pb=outputPathBase,
                                                      fP=output_filePrefix,
                                                      ds=dateString,
                                                      ts=timeString,
                                                      fS=output_fileSuffix)
print(outputFile)

/Users/decastro/Downloads/met.z21.b0.20161101.000000.son03m.biomet.full_output.csv.a2e.nc
/Users/decastro/Downloads/met.z21.b0.20161101.000000.mmc.a2e.nc
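
Because the format strings above concatenate the folder path directly with the file prefix, the folder path has to end with a path separator. As a minimal sketch of an alternative, assuming the same naming pattern, the name can be assembled with os.path.join, which adds the separator only when it is missing. The helper name build_filename is hypothetical and is not part of mmctools.

```python
import os

def build_filename(path_base, prefix, date_string, time_string, suffix):
    """Join a folder path and a dotted file name (hypothetical helper)."""
    file_name = "{fP:s}.{ds:s}.{ts:s}{fS:s}".format(
        fP=prefix, ds=date_string, ts=time_string, fS=suffix)
    # os.path.join inserts the separator only when path_base lacks one
    return os.path.join(path_base, file_name)

# The folder path is given here without a trailing slash on purpose
example = build_filename("/Users/decastro/Downloads", "met.z21.b0",
                         "20161101", "000000", ".mmc.a2e.nc")
print(example)
```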


The output file is written by the write_to_netCDF function imported above. Notice that there is an option in the first line of the function (all_variables) to either keep all of the original variables or pare them down to the standardized format. Leaving this flag set to False tells the function to write the curated version. If you wish to output all variables from the input file, switch it to True.

The curated variables are listed below as the core_variables, and include east-west wind speed (u), north-south wind speed (v), vertical wind speed (w), wind speed (wspd), wind direction (wdir), temperature (T), pressure (p), potential temperature (theta), and relative humidity (RH). Later in the notebook, long names and units will be assigned to each variable.
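
Potential temperature is listed among the curated variables, but the cells in this notebook do not show how it is obtained. As a rough sketch only, assuming the usual Poisson relation for dry air with a 1000 hPa reference pressure and an approximate exponent of 0.2857, it could be computed from temperature and pressure as follows. The function name and the default constants are assumptions, not project standards.

```python
def potential_temperature(T, p, p0=1000.0, kappa=0.2857):
    """Poisson's equation: theta = T * (p0 / p) ** kappa.

    T is in kelvin; p and p0 share the same units (hPa or mbar).
    kappa approximates R_d / c_p for dry air (assumed value).
    """
    return T * (p0 / p) ** kappa

# Sample values of the same magnitude as the 3 m temperature and pressure
theta = potential_temperature(284.14337158, 956.10388184)
print(round(theta, 1))
```

At the reference pressure the function returns the input temperature unchanged, which is a quick sanity check on the formula.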

Summarizing the input file, we can see what variables and dimensions are included. And by printing each variable of interest, we can see its long name, units, and other attributes. By doing so, we can see that we'll want to change the units of time, temperature, and pressure to meet project standards, and we'll want to derive u and v from wind speed and wind direction. 

In [4]:
f = netCDF4.Dataset(inputFile)
print(f)
print(f.variables['time'])
print(f.variables['temperature'])
print(f.variables['relative_humidity'])
print(f.variables['pressure'])
print(f.variables['wind_speed'])
print(f.variables['wind_direction'])
print(f.variables['wind_u'])
print(f.variables['wind_v'])
print(f.variables['height'])

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
    Conventions: CF-1.6
    history: 2018-05-18 15:15:39 created by libingest-1.2 using wfip2_met-1.5 (build version: v1.12.0)
    dimensions(sizes): time(96), bounds(2)
    variables(dimensions): float64 time(time), float64 time_bounds(time,bounds), float32 height(), float32 sensible_heat_flux(time), int32 sensible_heat_flux_qc(time), float32 wT_sensible_heat_flux(time), float32 sonic_temperature(time), float32 temperature(time), float32 pressure(time), float32 relative_humidity(time), float32 wind_u(time), float32 wind_v(time), float32 wind_w(time), float32 wind_speed(time), float32 wind_direction(time), float32 wind_u_variance(time), float32 wind_v_variance(time), float32 wind_w_variance(time), float32 wind_shear_stress(time), int32 w
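
For reference, here is a minimal sketch of deriving u and v from wind speed and wind direction. It assumes the meteorological convention, in which wind direction is the direction the wind blows from, measured in degrees clockwise from north. The function name wind_components is hypothetical, and the wind_u and wind_v stored in the input file may be defined in a different reference frame, so they need not match this calculation.

```python
import math

def wind_components(wspd, wdir_deg):
    """Return (u, v) for a wind blowing FROM wdir_deg, clockwise from north."""
    wdir_rad = math.radians(wdir_deg)
    u = -wspd * math.sin(wdir_rad)  # eastward (east-west) component
    v = -wspd * math.cos(wdir_rad)  # northward (north-south) component
    return u, v

# A westerly wind (from 270 degrees) blows toward the east, so u is positive
u_west, v_west = wind_components(5.0, 270.0)
print(u_west, v_west)
```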

Assign the latitude, longitude, and altitude of the station. 

In [5]:
lat, lon, alt = f.variables['latitude'][:], f.variables['longitude'][:], f.variables['altitude'][:]
print(lat, lon, alt)

45.63798 -120.65082 474.0


The time from the input data is formatted in seconds from midnight on November 1, 2016. Run the cell below to change the formatting to epoch time. 

In [6]:
time = f.variables['time'][:]
mytime=np.array(time,dtype='float64')
#print(mytime)
nt = time.size
#print(nt)
#print(type(mytime[0]))
# Use the date chosen in cell 2 so the conversion still works for other days
dtTimes = [datetime.datetime(year,month,day) + timedelta(seconds=i) for i in mytime]
#Times = date2num(dtTimes,units='hours since 0001-01-01 00:00:00.0',calendar='gregorian')
Times = date2num(dtTimes,units='seconds since 1970-01-01 00:00:00.0',calendar='gregorian')
print(Times)

[1.4779593e+09 1.4779602e+09 1.4779611e+09 1.4779620e+09 1.4779629e+09
 1.4779638e+09 1.4779647e+09 1.4779656e+09 1.4779665e+09 1.4779674e+09
 1.4779683e+09 1.4779692e+09 1.4779701e+09 1.4779710e+09 1.4779719e+09
 1.4779728e+09 1.4779737e+09 1.4779746e+09 1.4779755e+09 1.4779764e+09
 1.4779773e+09 1.4779782e+09 1.4779791e+09 1.4779800e+09 1.4779809e+09
 1.4779818e+09 1.4779827e+09 1.4779836e+09 1.4779845e+09 1.4779854e+09
 1.4779863e+09 1.4779872e+09 1.4779881e+09 1.4779890e+09 1.4779899e+09
 1.4779908e+09 1.4779917e+09 1.4779926e+09 1.4779935e+09 1.4779944e+09
 1.4779953e+09 1.4779962e+09 1.4779971e+09 1.4779980e+09 1.4779989e+09
 1.4779998e+09 1.4780007e+09 1.4780016e+09 1.4780025e+09 1.4780034e+09
 1.4780043e+09 1.4780052e+09 1.4780061e+09 1.4780070e+09 1.4780079e+09
 1.4780088e+09 1.4780097e+09 1.4780106e+09 1.4780115e+09 1.4780124e+09
 1.4780133e+09 1.4780142e+09 1.4780151e+09 1.4780160e+09 1.4780169e+09
 1.4780178e+09 1.4780187e+09 1.4780196e+09 1.4780205e+09 1.4780214e+09
 1.478
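
The same conversion can be sketched with the standard library alone. This assumes the offsets are seconds after midnight UTC on the chosen date, and the sample offsets (900, 1800, and 2700 seconds) are illustrative 15-minute steps rather than values read from the file. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def seconds_since_midnight_to_epoch(offsets, year, month, day):
    """Convert seconds after midnight UTC on a date to Unix epoch seconds."""
    midnight = datetime(year, month, day, tzinfo=timezone.utc)
    return [(midnight + timedelta(seconds=s)).timestamp() for s in offsets]

# Three 15-minute samples on November 1, 2016
epoch = seconds_since_midnight_to_epoch([900.0, 1800.0, 2700.0], 2016, 11, 1)
print(epoch[0])
```

Attaching the UTC time zone explicitly keeps the result independent of the time zone of the machine running the notebook.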

Next, assign standardized variable names.

In [7]:
new_fill = 9999  # fill value for the output file

# Loop over the input variables, handle their fill values, and keep the ones we need
for varn in ['temperature','relative_humidity','pressure', 'wind_speed', 'wind_direction', 'wind_u_rotated', 'wind_v_rotated', 'wind_w_rotated']:
    var = f.variables[varn]
    old_fill = var._FillValue
    var[var==old_fill] = np.nan
    if varn == 'temperature': T = var[:]
    if varn == 'relative_humidity': RH = var[:]
    if varn == 'wind_w_rotated': w = var[:]
    if varn == 'pressure': p = var[:]

# Wind speed, wind direction, and the horizontal components are read directly
wspd = f.variables['wind_speed'][:]
wdir = f.variables['wind_direction'][:]
u = f.variables['wind_u'][:]
v = f.variables['wind_v'][:]

Although there are sensors at 3, 10, and 21 meters above ground, this data set provides only measurements taken at 3 meters, so the output data set will have only one height dimension.  

In [8]:
height = f.variables['height'][:]
nz = height.size
print(height, nz)

3.0 1


Assign the dimension names and sizes.

In [9]:
dim_names = ['time', 'height']
dims      = [    nt,    nz]

Assign names, units, types, and dimensions to the variables. Notice that we've changed the units for pressure from hPa to mbar. Because 1 hPa equals 1 mbar, no mathematical conversion is necessary, so we just reassign the units to meet the project standards.

In [10]:
# Assign all of the data you want to arrays as follows:
var_data  = [Times, T, RH, p, u, v, wspd, wdir] # the actual data var[time,height]
var_names = ['Times','T', 'RH', 'p', 'u', 'v', 'wspd', 'wdir'] # a string for the name of the data
var_units = ['seconds since 1970-01-01 00:00:00.0','K', '%',  'mbar', 'm/s', 'm/s', 'm/s', 'degree'] # units of the data 
# The data type is needed to add the variable to a netCDF file:
var_dtype = [np.float64] * len(var_names)  # one entry per variable
# The dimensions of EACH variable must be specified as follows:
var_dims  = [('time'),('time','height'),('time','height'),('time','height'),('time','height'),('time','height'),('time','height'),('time','height')]
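
The lists above appear to be matched by position, so an extra or missing entry in one of them is easy to overlook. As a small sketch, a length check like the one below could be run before writing the file. The helper check_parallel_lists is hypothetical and not part of mmctools, and the short lists are stand-ins for the full ones.

```python
def check_parallel_lists(**lists):
    """Raise ValueError if the named lists differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Mismatched list lengths: {}".format(lengths))
    return lengths

# Shortened stand-ins for var_names and var_units
names = ['Times', 'T', 'RH']
units = ['seconds since 1970-01-01 00:00:00.0', 'K', '%']
lengths = check_parallel_lists(var_names=names, var_units=units)
print(lengths)
```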

To keep track of changes between the input and output data sets, we'll note the changes we made in the description. The output data set is also assigned an author through who_created_me, which was set in cell 2. If you've changed anything about the way the output data is generated, make sure your name, not mine, is assigned there.

In [11]:
description = 'Converted pressure units from hPa to mbar, and assigned heights as dimensions'

Now use a dictionary to associate all of the data you've created with the function for writing the output file. 

In [12]:
# Assign all of the information to a dictionary so that we can pass it to the
# ... write_to_netCDF function.
vardict = {
   'dimname'    : dim_names,     # the names of the dimensions
   'dims'       : dims,          # the size of the dimensions
   'varn'       : var_names,     # the names of the variables
   'data'       : var_data,      # the data, itself
   'units'      : var_units,     # the units for each variable
   'vardims'    : var_dims,      # the dimensions of each variable
   'vardtype'   : var_dtype,     # the data types
   'time'       : time,          # time
   'station'    : station_name,  # Name of the station
   'sensor'     : sensor_name,   # Name of the sensor
   'latitude'   : lat,           # station latitude
   'longitude'  : lon,           # station longitude
   'altitude'   : alt,           # station altitude
   'description': description,   # description of what the data is
   'author'     : who_created_me,# who created this file
   'fillValue'  : new_fill       # fill value
}

Write your output file.

In [13]:
write_to_netCDF(outputFile,vardict)

Times
1477959300.0
T
[284.14337158]
RH
[76.25296783]
p
[956.10388184]
u
[-1.61671495]
v
[0.86782181]
wspd
[1.83651686]
wdir
[313.2260437]


You can execute the cell below to see the output file in standardized format and double-check that it has all of the variables, dimensions, and attributes that you need to begin analysis.

In [14]:
out = netCDF4.Dataset(outputFile)
print(out)

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    description: Converted pressure units from hPa to mbar, and assigned heights as dimensions
    station: Boardman
    sensor: met.z21
    latitude: 45.63798
    longitude: -120.65082
    altitude: 474.0
    createdon: 2019-05-21 11:42:26
    createdby: Amy DeCastro decastro@ucar.edu
    dimensions(sizes): time(96), height(1), nchars(19)
    variables(dimensions): float64 Times(time), float64 T(time,height), float64 RH(time,height), float64 p(time,height), float64 u(time,height), float64 v(time,height), float64 wspd(time,height), float64 wdir(time,height)
    groups: 



In [15]:
print(out.variables['u'])

<class 'netCDF4._netCDF4.Variable'>
float64 u(time, height)
    _FillValue: 9999.0
    units: m/s
unlimited dimensions: 
current shape = (96, 1)
filling on
