<h1><center> NASA Airathon - NO2 Track </center></h1>

### <center> Step 1: Extract OMI data for each grid ID (training data) </center>

<div style="text-align: center"> 
    Dr. Sukanta Basu <br/> Associate Professor <br/> Delft University of Technology, The Netherlands <br/> Email: s.basu@tudelft.nl<br/> https://sites.google.com/view/sukantabasu/
</div>

#### Log

Last updated: 4th April, 2022

#### User instructions

1. Before running this notebook, convert the OMI data from HDF5 to netCDF format with `ncks` (part of the NCO toolkit), using the following command:

```
for FILE in *.he5; do ncks "$FILE" "${FILE%.he5}.nc"; done
```
The converted OMI files are stored in **OMI_DIR**. They are not included in repo_sukantabasu.

2. Next, run this notebook. For each 5 km x 5 km grid cell, it locates the nearest OMI grid point, extracts the OMI time series (both tropospheric and total column values) at that point, and saves it to a CSV file. Missing OMI values are represented as NaNs. The files are named after the grid ID: 1X116_trainOMI.csv, 1Z2W7_trainOMI.csv, etc.
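
As a rough illustration of the output format described above, the sketch below reads a file with the same column layout and counts the missing days. The rows are made-up placeholder values, not actual OMI retrievals, and the in-memory `csv_text` is a hypothetical stand-in for a real file such as 1X116_trainOMI.csv.

```python
import io
import pandas as pd

# Hypothetical file contents that mimic the columns written by this notebook.
# Year/Month/Day appear as floats because they are stored in a float array
# together with the NO2 values before being written out.
csv_text = """Year,Month,Day,NO2_OMI,NO2Tr_OMI,datetime
2019.0,1.0,1.0,7.19,5.01,2019-01-01
2019.0,1.0,2.0,,,2019-01-02
2019.0,1.0,3.0,6.51,4.18,2019-01-03
"""

# Empty fields are parsed as NaN, which is how missing OMI values are stored
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

# Count the days without a valid tropospheric column
n_missing = int(df["NO2Tr_OMI"].isna().sum())
frac_missing = n_missing / len(df)
print(n_missing, round(frac_missing, 3))
```

For a real file, `io.StringIO(csv_text)` would be replaced by the path to the CSV.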

#### Load packages

In [1]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from pathlib import Path
import netCDF4
from glob import glob
import time

#### Directories

In [2]:
ROOT_DIR    = '../../'

#The OMI data are not included in the repo_sukantabasu
OMI_DIR     = ROOT_DIR + 'bucket/no2/train/omi/'

#Location of processed datasets
EXTDATA_DIR = ROOT_DIR + 'data/airathon/processed/'

#### Load grid data

In [3]:
df_grd = pd.read_csv(EXTDATA_DIR + 'grid_latlon.csv') #Contains: ID, latitude, longitude

ID     = df_grd['ID']
LAT    = df_grd['latitude']
LON    = df_grd['longitude']

nID    = np.size(ID)

#### User input

In [4]:
trnYR  = np.array([2019,2020]) #training years

#### Compute nearest point

In [5]:
def nearestGridFAST(LAT,LON,LATx,LONx):
    
    dLAT = np.abs(LAT - LATx)
    dLON = np.abs(LON - LONx)
    
    dTOT = dLAT + dLON #taxi-cab distance
    
    #https://stackoverflow.com/questions/3230067/numpy-minimum-in-row-column-format
    r_min, c_min = np.unravel_index(dTOT.argmin(), dTOT.shape)
    
    return r_min, c_min
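
To make the behaviour of `nearestGridFAST` concrete, here is a minimal sketch that applies the same taxi-cab logic to a toy grid. The function name `nearest_grid_index`, the 3 x 4 grid, and the query point are all hypothetical and chosen only for illustration.

```python
import numpy as np

def nearest_grid_index(lat2d, lon2d, lat_pt, lon_pt):
    """Same logic as nearestGridFAST: taxi-cab distance, then argmin."""
    d_tot = np.abs(lat2d - lat_pt) + np.abs(lon2d - lon_pt)
    # argmin works on the flattened array; unravel_index maps it back to (row, col)
    return np.unravel_index(d_tot.argmin(), d_tot.shape)

# Toy grid (hypothetical): 3 latitudes x 4 longitudes
lat = np.array([10.0, 20.0, 30.0])
lon = np.array([100.0, 110.0, 120.0, 130.0])
lon2d, lat2d = np.meshgrid(lon, lat)

# Query point lying closest to the node at (20, 130)
r, c = nearest_grid_index(lat2d, lon2d, 21.0, 127.0)
print(r, c, lat2d[r, c], lon2d[r, c])
```

Because the distance is the sum of a latitude term and a longitude term, the selected row depends only on latitude and the selected column only on longitude.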

#### Grid structure of OMI data

In [6]:
LATomi = np.arange(-90, 90, 0.25)
LONomi = np.arange(-180, 180, 0.25)
LONomi2d, LATomi2d = np.meshgrid(LONomi, LATomi)
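
Since the grid defined above is regular, the nearest node can also be found by rounding instead of searching the full 720 x 1440 array. The sketch below is one possible alternative, not the method used in this notebook. The helper `regular_grid_index` and the query point (roughly Taipei) are hypothetical, and the grid nodes are assumed to be exactly those defined above.

```python
import numpy as np

RES = 0.25  # grid spacing in degrees, as in the cell above
lat_omi = np.arange(-90, 90, RES)
lon_omi = np.arange(-180, 180, RES)

def regular_grid_index(lat_pt, lon_pt):
    """Nearest node on a regular grid: round the fractional index, then clip."""
    r = int(np.clip(np.rint((lat_pt - lat_omi[0]) / RES), 0, lat_omi.size - 1))
    c = int(np.clip(np.rint((lon_pt - lon_omi[0]) / RES), 0, lon_omi.size - 1))
    return r, c

# Hypothetical query point, for illustration only
lat_pt, lon_pt = 25.04, 121.53
r, c = regular_grid_index(lat_pt, lon_pt)

# Cross-check against the brute-force taxi-cab search over the 2-D grid
lon2d, lat2d = np.meshgrid(lon_omi, lat_omi)
d_tot = np.abs(lat2d - lat_pt) + np.abs(lon2d - lon_pt)
r_bf, c_bf = np.unravel_index(d_tot.argmin(), d_tot.shape)
print((r, c), (int(r_bf), int(c_bf)))
```

The two approaches can disagree for points lying exactly halfway between two nodes, because `np.rint` rounds ties to the even index whereas `argmin` returns the first minimum.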

#### Extract OMI data for each grid ID

In [7]:
for n in range(nID):
    tini = time.time()
    
    #Identify the city from the longitude of the grid cell
    if 120 < LON[n] < 122:
        strCITY = 'tpe'
    elif 76 < LON[n] < 78:
        strCITY = 'dl'
    elif -119 < LON[n] < -116:
        strCITY = 'la'
    else:
        strCITY = 'XXX'
    
    print(strCITY)

    cnt = 0
    for y in trnYR:
        
        OMI_SUBDIR = OMI_DIR + str(y) + "/"
        #Use glob to find all the netCDF files for this year (sorted by file name)
        for f in sorted(glob(OMI_SUBDIR + '*' + '.nc')):

            #Read netcdf files
            dataOMI   = netCDF4.Dataset(f,'r')
            fieldsOMI = dataOMI.groups['HDFEOS'].groups['GRIDS'].groups['ColumnAmountNO2'].groups['Data Fields']
            
            #Total and tropospheric NO2 columns (cloud-screened), read into memory as masked arrays
            NO2       = fieldsOMI.variables['ColumnAmountNO2CloudScreened'][:]
            NO2Tr     = fieldsOMI.variables['ColumnAmountNO2TropCloudScreened'][:]
            dataOMI.close()

            #Rescale by 1e15 so that the values are of order 1-10
            NO2       = NO2/1e15
            NO2Tr     = NO2Tr/1e15
            
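            r_min, c_min = nearestGridFAST(LATomi2d,LONomi2d,LAT[n],LON[n])

            #Note: maskedArray.data returns the raw values and ignores the mask;
            #masked cells hold the fill value, which the negative-value check below converts to NaN
            #All the arrays are of size 720x1440
            NO2_i     = NO2.data[r_min, c_min]
            NO2Tr_i   = NO2Tr.data[r_min, c_min]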
            
            if NO2_i < 0:
                NO2_i = np.nan
            if NO2Tr_i < 0:
                NO2Tr_i = np.nan
            
            print((r_min,c_min,NO2_i,NO2Tr_i))
            
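            #Year, month, day (the file name is assumed to start with YYYYMMDD)
            #Parsing the file name rather than the full path keeps this independent of OMI_DIR
            fname = Path(f).name
            yr = int(fname[0:4])
            mo = int(fname[4:6])
            dy = int(fname[6:8])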
            
            #Combine variables in an array
            D = np.array((yr,mo,dy,NO2_i,NO2Tr_i))
            if cnt == 0:
                comboD = D
            else:
                comboD = np.vstack((comboD,D))
            
            cnt = cnt + 1
            print((n,y,cnt))
            
    #Create a pandas dataframe and save it as a csv file
    df_new = pd.DataFrame(data=comboD)
    df_new.columns = ['Year','Month','Day','NO2_OMI','NO2Tr_OMI'] 
    df_new['datetime'] = pd.to_datetime(df_new[['Year', 'Month', 'Day']])
    
    df_new.to_csv(EXTDATA_DIR + 'train/OMI/' + str(ID[n]) + '_trainOMI.csv', index=False)
    
    et = time.time() - tini
    print(et)

tpe
(460, 1206, 7.1906514, 5.009883)
(0, 2019, 1)
(460, 1206, nan, nan)
(0, 2019, 2)
(460, 1206, nan, nan)
(0, 2019, 3)
(460, 1206, 6.512077, 4.1828103)
(0, 2019, 4)
(460, 1206, 7.786171, 5.3272443)
(0, 2019, 5)
(460, 1206, nan, nan)
(0, 2019, 6)
(460, 1206, 5.77722, 3.6124668)
(0, 2019, 7)
(460, 1206, nan, nan)
(0, 2019, 8)
(460, 1206, nan, nan)
(0, 2019, 9)
(460, 1206, 4.4518385, 1.9058449)
(0, 2019, 10)
(460, 1206, nan, nan)
(0, 2019, 11)
(460, 1206, nan, nan)
(0, 2019, 12)
(460, 1206, nan, nan)
(0, 2019, 13)
(460, 1206, 7.864144, 4.9547825)
(0, 2019, 14)
(460, 1206, nan, nan)
(0, 2019, 15)
(460, 1206, 6.346594, 3.4172168)
(0, 2019, 16)
(460, 1206, nan, nan)
(0, 2019, 17)
(460, 1206, 4.3397627, 1.7177197)
(0, 2019, 18)
(460, 1206, nan, nan)
(0, 2019, 19)
(460, 1206, nan, nan)
(0, 2019, 20)
(460, 1206, nan, nan)
(0, 2019, 21)
(460, 1206, nan, nan)
(0, 2019, 22)
(460, 1206, nan, nan)
(0, 2019, 23)
(460, 1206, nan, nan)
(0, 2019, 24)
(460, 1206, 5.815924, 3.3985002)
(0, 2019, 25)
(460,