# A2E-MMC Data cleaning notebook: Boardman met station (z12.b0)

This notebook is one in a series of data cleaning notebooks for the Atmosphere to Electron Mesoscale Microscale Coupling project. Information about the project can be found here: https://a2e.energy.gov/. The data cleaning notebooks standardize the format of the data used in the project. The observation data collected for comparison with model results come from various sensors in the northwest United States, as well as Texas, coastal New Jersey and Virginia. Each notebook is specific to a sensor, ingests that particular data set, and outputs a curated set of variables with consistent naming conventions, units, and fill values. The output data set can then be used in notebooks designed for model analysis. (a link to those notebooks goes here)

All input and output files are in netCDF format.

Specifically, this notebook is for the data collected at the meteorological station at Boardman, OR. The data can be found here: https://a2e.energy.gov/data/wfip2/met.z12.b0. 

Start by importing the libraries you will need for running this notebook:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from netCDF4 import Dataset as ncdf
from netCDF4 import stringtochar, num2date, date2num
from matplotlib import cm
import pandas as pd
import datetime
from datetime import date, time,timedelta
import netCDF4

After downloading the data for the date you are interested in, change the inputPathBase to reflect where you are keeping that input file. -Note that this should just be the folder path, and should not contain the actual file name. You may also need to change the year, month, and day if you are using a date other than November 1, 2016. Change the outputPathBase to reflect where you want to store the output file (again, this is a folder path). Also, change my name to your name in who_created_me so that you will be associated with the data set you create. 

In [2]:
#Your working directory (where the data lives...)
inputPathBase = "C:/Users/decastro/Downloads/wfip2.met.z12.b0.6665a02aa5781150f554f5d96/"

#Some intrument specifics
station_name = 'Boardman'
sensor_name  = 'met.z12'

instrument_filePrefix = "{sensorName:s}.b0".format(sensorName=sensor_name)
instrument_fileSuffix = ".txt.a2e.nc"

#The date of interest...
year = 2016
month = 11
day = 1
dateString = "{yyyy:4d}{mm:02d}{dd:02d}".format(yyyy=year,mm=month,dd=day)

#The start time of interest...
starthour = 0
startmin = 0
startsec = 0
timeString = "{hour:02d}{minute:02d}{second:02d}".format(hour=starthour,minute=startmin,second=startsec)

#output file specifics
outputPathBase = "C:/Users/decastro/Downloads/wfip2.met.z12.b0.6665a02aa5781150f554f5d96/"
output_filePrefix = instrument_filePrefix
output_fileSuffix = ".mmc.a2e.nc"

#Set a value for the output file author attribute
who_created_me = 'Amy DeCastro decastro@ucar.edu'


Run the cell below to assign your input file name and output file name. 

In [3]:
#Setup the inputFile and outputFile names from the information specified above
inputFile = "{pb:s}{fP:s}.{ds:s}.{ts:s}{fS:s}".format(pb=inputPathBase,
                                                      fP=instrument_filePrefix,
                                                      ds=dateString,
                                                      ts=timeString,
                                                      fS=instrument_fileSuffix)
print(inputFile)

outputFile = "{pb:s}{fP:s}.{ds:s}.{ts:s}{fS:s}".format(pb=outputPathBase,
                                                      fP=output_filePrefix,
                                                      ds=dateString,
                                                      ts=timeString,
                                                      fS=output_fileSuffix)
print(outputFile)

C:/Users/decastro/Downloads/wfip2.met.z12.b0.6665a02aa5781150f554f5d96/met.z12.b0.20161101.000000.txt.a2e.nc
C:/Users/decastro/Downloads/wfip2.met.z12.b0.6665a02aa5781150f554f5d96/met.z12.b0.20161101.000000.mmc.a2e.nc


Next define a function to write the output file. Notice that there is an option in the first line (all_variables) to either keep all of the original variables or pare them down to the standardized format. Keeping the binary as False tells the function to write the curated version. If you wish to output all variables from the input file, switch it to True. 

The curated variables are listed below as the core_variables, and include north-south wind speed (u), east-west wind speed (v), vertical wind speed (w), wind speed (wspd), wind direction (wdir), temperature (T), pressure (p), potential temperature (theta), and relative humidity (RH). Later in the notebook, long names and units will be assigned to each variable. 

In [4]:
def write_to_netCDF(nc_filename=None, data=None, ncformat='NETCDF4_CLASSIC', all_variables = False):
    '''
    This will write a new netCDF file from the data provided in a dictionary.
    Several global variables will be set and care must be taken that the 
    dictionary variables are named the same as what this function expects.
    '''
    core_variables = ['Times','u','v','w','wspd','wdir','T','p','theta','RH']
    ncfile = ncdf(nc_filename,'w',format=ncformat,clobber=True)
    for dd,dim in enumerate(data['dims']):
        ncfile.createDimension(data['dimname'][dd],dim)
    for vv,varname in enumerate(data['varn']):
        if all_variables:
            newvar = ncfile.createVariable(varname,data['vardtype'][vv],data['vardims'][vv])
            newvar[:] = data['data'][vv]
            newvar.units = data['units'][vv]
        else:
            if varname in core_variables:
                newvar = ncfile.createVariable(varname,data['vardtype'][vv],data['vardims'][vv], fill_value=data['fillValue'])
                newvar[:] = data['data'][vv]
                print(varname)
                print(newvar[newvar == np.nan])
                newvar[newvar == np.nan] = data['fillValue']
                newvar.units = data['units'][vv]
            
    ncfile.createDimension('nchars',19)
    #newvar    = ncfile.createVariable('Times','S1',('time','nchars'))
    newvar[:] = data['time']
    ncfile.description = data['description']
    ncfile.station     = data['station']
    ncfile.sensor      = data['sensor']
    ncfile.latitude    = data['latitude']
    ncfile.longitude   = data['longitude']
    ncfile.altitude    = data['altitude']
    ncfile.createdon   = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    ncfile.createdby   = data['author']

Summarizing the input file, we can see what variables and dimensions are included. And by printing each variable of interest, we can see its long name, units, and other attributes. By doing so, we can see that we'll want to change the units of time, temperature, and pressure to meet project standards, and we'll want to derive u and v from wind speed and wind direction. 

In [5]:
f = netCDF4.Dataset(inputFile)
print(f)
print(f.variables['time'])
print(f.variables['temperature'])
print(f.variables['relative_humidity'])
print(f.variables['pressure'])
print(f.variables['wind_speed'])
print(f.variables['wind_direction'])

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
    Conventions: CF-1.6
    history: 2018-01-24 20:13:38 created by libingest-1.2 using wfip2_met-1.4 (build version: v1.10.0)
    dimensions(sizes): time(1440), bounds(2)
    variables(dimensions): float64 [4mtime[0m(time), float64 [4mtime_bounds[0m(time,bounds), float32 [4minstrument_height[0m(), float32 [4mwind_height[0m(), float32 [4mpressure_height[0m(), float32 [4mtemperature[0m(time), float32 [4mrelative_humidity[0m(time), int32 [4mrelative_humidity_qc[0m(time), float32 [4mpressure[0m(time), float32 [4mwind_speed[0m(time), int32 [4mwind_speed_qc[0m(time), float32 [4mwind_direction[0m(time), float32 [4mwind_direction_std[0m(time), float32 [4mprecipitation[0m(time), float32 [4msolar_irradiance[0m(time), float64 [4mlatitude[0m(), float64 [4mlongitude[0m(), float64 [4maltitude[0m()
    groups: 

<class 'netCDF4._netCDF4.Variable'>
float64 time(time)
  

In [6]:
new_fill = 9999
for varn in ['temperature','relative_humidity','pressure', 'wind_speed', 'wind_direction']:
   var = f.variables[varn]
   old_fill = var._FillValue
   var[var==old_fill] = np.nan
   if varn == 'temperature': temp = var[:]
   if varn == 'relative_humidity': RH = var[:]
    
#p = f.variables['pressure'][:]
   if varn == 'pressure': p = var[:]
    
#RH = f.variables['relative_humidity'][:]
wspd = f.variables['wind_speed'][:]
wdir = f.variables['wind_direction'][:]

In [7]:
print(temp)

[13.88 13.85 13.82 ... 16.23 16.21 16.17]


Assign the latitude, longitude, and altitude of the station. 

In [8]:
lat, lon, alt = f.variables['latitude'][:], f.variables['longitude'][:], f.variables['altitude'][:]
print(lat, lon, alt)

45.8167 -119.8121 110.3376


The time from the input data is formatted in seconds from midnight on November 1, 2016. Run the cell below to change the formatting to epoch time. 

In [9]:
time = f.variables['time'][:]
mytime=np.array(time,dtype='float64')
#print(mytime)
nt = time.size
#print(nt)
#print(type(mytime[0]))
dtTimes = [datetime.datetime(2016,11,1) + timedelta(seconds=i) for i in mytime]
#Times = date2num(dtTimes,units='hours since 0001-01-01 00:00:00.0',calendar='gregorian')
Times = date2num(dtTimes,units='seconds since 1970-01-01 00:00:00.0',calendar='gregorian')
print(Times)

[1.47795840e+09 1.47795846e+09 1.47795852e+09 ... 1.47804462e+09
 1.47804468e+09 1.47804474e+09]


Next convert temperature from Celsius to Kelvin, derive u and v from wind speed and direction, and assign wind speed, wind direction, pressure, and relative humidity. 

In [10]:
#temp = f.variables['temperature'][:]
T = temp + 273.15

u = wspd*np.cos(wdir)
v = wspd*np.sin(wdir)

When we printed the input file name above, we saw that the data includes three different height dimensions. Heights are not associated with variables in the input data, but we want them to be variable dimensions in the output. So, we'll assign them now, and in a few cells, we'll associate them with specific variables. 

In [11]:
inst_z = f.variables['instrument_height'][:]
niz = inst_z.size
print(inst_z, niz)
wind_z = f.variables['wind_height'][:]
nwz = wind_z.size
print(wind_z, nwz)
p_z = f.variables['pressure_height'][:]
npz = p_z.size
print(p_z, npz)

2.0 1
10.0 1
1.524 1


Assign the dimension names and sizes.

In [12]:
dim_names = ['time', 'inst_z', 'wind_z', 'p_z']
dims      = [    nt,    niz,    nwz,    npz]

Assign names, units, types, and dimensions to the variables. Notice that we've changed the units for pressure from hPa to mbar, there's no mathematical conversion necessary for that change, so we just reassign the units that meet the project standards.

In [13]:
# Assign all of the data you want to arrays as follows:
var_data  = [Times, T, RH, p, u, v, wspd, wdir] # the actual data var[time,height]
var_names = ['Times','T', 'RH', 'p', 'u', 'v', 'wspd', 'wdir'] # a string for the name of the data
var_units = ['seconds since 1970-01-01 00:00:00.0','K', '%',  'mbar', 'm/s', 'm/s', 'm/s', 'degree'] # units of the data 
# The data type is needed to add the variable to a netCDF file:
var_dtype = [np.float64, np.float64, np.float64, np.float64, np.float64, np.float64, np.float64, np.float64, np.float64, np.float64, np.float64]
# The dimensions of EACH variable must be specified as follows:
var_dims  = [('time'),('time','inst_z'),('time','inst_z'),('time','p_z'),('time','wind_z'),('time','wind_z'),('time','wind_z'),('time','wind_z')]

To keep track of changes between the input and output data sets, we'll note the changes we made in the description. We'll also assign an author to the output data set. So if you've changed anything about the way the output data is generated, change my name to your name here. 

In [14]:
description = 'Convert temperature units from Celsius to Kelvin, derived u and v components from wind speed and direction, changed the units of pressure to mbar, and assigned heights as dimensions'

Now use a dictionary to associate all of the data you've created with the function for writing the output file. 

In [15]:
# Assign all of the information to a dictionary so that we can call it into the 
# ... write_to_netcdf function.
vardict = {
   'dimname'    : dim_names,     # the names of the dimensions
   'dims'       : dims,          # the size of the dimensions
   'varn'       : var_names,     # the names of the variables
   'data'       : var_data,      # the data, itself
   'units'      : var_units,     # the units for each variable
   'vardims'    : var_dims,      # the dimensions of each variable
   'vardtype'   : var_dtype,     # the data types
   'time'       : time,          # time
   'station'    : station_name,  # Name of the station
   'sensor'     : sensor_name,   # Name of the sensor
   'latitude'   : lat,           # station latitude
   'longitude'  : lon,           # station longitude
   'altitude'   : alt,           # station altitude
   'description': description,   # description of what the data is
   'author'     : who_created_me,# who created this file
   'fillValue'  : new_fill       # fill value
}

Write your output file.

In [16]:
write_to_netCDF(outputFile,vardict)

Times
--
T
[287.02999878]
RH
[64.47000122]
p
[996.21124268]
u
[0.2034831]
v
[-2.63716125]
wspd
[2.64499998]
wdir
[262.3999939]


You can execute the cell below to see it in standardized format and double-check that it as all of the variables, dimensions, and attributes that you need to begin analysis. 

In [17]:
out = netCDF4.Dataset(outputFile)
print(out)

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    description: Convert temperature units from Celsius to Kelvin, derived u and v components from wind speed and direction, changed the units of pressure to mbar, and assigned heights as dimensions
    station: Boardman
    sensor: met.z12
    latitude: 45.8167
    longitude: -119.8121
    altitude: 110.3376
    createdon: 2019-04-23 14:46:53
    createdby: Amy DeCastro decastro@ucar.edu
    dimensions(sizes): time(1440), inst_z(1), wind_z(1), p_z(1), nchars(19)
    variables(dimensions): float64 [4mTimes[0m(time), float64 [4mT[0m(time,inst_z), float64 [4mRH[0m(time,inst_z), float64 [4mp[0m(time,p_z), float64 [4mu[0m(time,wind_z), float64 [4mv[0m(time,wind_z), float64 [4mwspd[0m(time,wind_z), float64 [4mwdir[0m(time,wind_z)
    groups: 



In [18]:
print(out.variables['u'])

<class 'netCDF4._netCDF4.Variable'>
float64 u(time, wind_z)
    _FillValue: 9999.0
    units: m/s
unlimited dimensions: 
current shape = (1440, 1)
filling on


In [25]:
T[np.isnan(T)]
T[T == np.nan])

[]
