In [1]:
import xarray as xr
import numpy as np
import os

# IMERG (Precipitation) Data

In [2]:
os.chdir('/path/to/directory/raw/')

In [3]:
imerg = xr.open_dataset('IMERG.nc')
imerg_cleaned = imerg.copy()
imerg_cleaned

In [4]:
# Indices of any NaN precipitation values (an empty result means none are present)
np.where(np.isnan(imerg.precipitation))

(array([], dtype=int64), array([], dtype=int64), array([], dtype=int64))

There are currently no NaNs present in the IMERG data, so no steps need to be taken to clean the data. Precipitation values are in mm/hr, and the data is heavily skewed towards 0 because, more often than not, a grid cell records no precipitation. Despite this, values can realistically exceed 100 mm/hr during flash flooding. Determining outliers from the mean and standard deviation therefore does not make sense, as the data is not normally distributed. An interquartile range does not work either: both the first and third quartiles are 0 because of the overwhelming number of observations at this value, so every nonzero value would be flagged. As such, the data will be treated as clean, as the reported values do not fall outside a reasonable range.
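
The quartile argument above can be checked on a small synthetic sample. This is only a sketch: the `rates` array is made up for illustration (90 dry cells and 10 wet cells) and is not drawn from the IMERG file.

```python
import numpy as np

# Hypothetical zero-inflated rain rates in mm/hr (not IMERG data):
# 90 dry grid cells followed by 10 wet ones.
wet = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0, 110.0])
rates = np.concatenate([np.zeros(90), wet])

# Standard 1.5 * IQR fences
q1, q3 = np.percentile(rates, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# With more than 75% zeros, both quartiles are 0 and the fences collapse to 0
flagged = (rates < lower_bound) | (rates > upper_bound)
print(q1, q3, upper_bound)
print(int(flagged.sum()), "of", int((rates > 0).sum()), "wet cells flagged")
```

In this sample every wet cell falls outside the fences, including the light 0.2 mm/hr value, which is why the IQR rule is not applied to the real data.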

In [5]:
chunks = {"time": 1000, "lon": 195, "lat": 85}
encoding = {var: {"chunksizes": [chunks[dim] for dim in imerg_cleaned[var].dims]} for var in imerg_cleaned.data_vars}
imerg_cleaned.to_netcdf("/path/to/directory/clean/IMERG_clean.nc", encoding=encoding)

# MERRA2 (Aerosols) Data

In [6]:
os.chdir('/path/to/directory/raw/')

In [7]:
merra = xr.open_dataset('MERRA.nc')
merra

In [8]:
has_nans = {var: merra[var].isnull().any() for var in merra.data_vars}

for var, contains_nan in has_nans.items():
    print(f"{var} contains NaNs: {contains_nan.values}")

BCCMASS contains NaNs: False
BCSMASS contains NaNs: False
DUCMASS contains NaNs: False
DUCMASS25 contains NaNs: False
DUSMASS contains NaNs: False
DUSMASS25 contains NaNs: False
OCCMASS contains NaNs: False
OCSMASS contains NaNs: False
SO2CMASS contains NaNs: False
SO2SMASS contains NaNs: False
SO4CMASS contains NaNs: False
SO4SMASS contains NaNs: False
SSCMASS contains NaNs: False
SSCMASS25 contains NaNs: False
SSSMASS contains NaNs: False
SSSMASS25 contains NaNs: False


There are currently no NaNs present in the MERRA data, so no steps need to be taken to clean the data. Similar to the IMERG dataset, most values of aerosol concentrations are relatively low, but periodic high concentration events will occur. Figuring out the effect of these changes in aerosol concentrations is the point of the study, so it does not make sense to remove values that may seem like outliers.

In [9]:
merra_cleaned = merra.copy()

In [10]:
chunks = {"time": 100, "lon": 94, "lat": 52}
encoding = {var: {"chunksizes": [chunks[dim] for dim in merra[var].dims]} for var in merra.data_vars}
merra_cleaned.to_netcdf("/path/to/directory/clean/MERRA_clean.nc", encoding=encoding)

# ECMWF (CAPE) Data

Data for CAPE is also highly skewed towards 0, as large values of CAPE are only seen in regions that have the potential for large amounts of convection. The highest value of CAPE recorded is around 8000 J/kg, but using the interquartile range returns an upper bound of 73, which would sharply reduce the number of convectively active samples we have. Since this is a project about lightning, these convectively active regions are crucial. As such, all values above 8000 will be set to 8000, but the data will not be manipulated in other ways.

In [11]:
cape = xr.open_dataset('/path/to/directory/raw/CAPE.nc')
cape

In [12]:
# Indices of any NaN CAPE values (an empty result means none are present)
np.where(np.isnan(cape.cape))

(array([], dtype=int64), array([], dtype=int64), array([], dtype=int64))

In [13]:
# Demonstration only: the IQR upper bound is too low, so this approach would discard most of the convectively active data
cape_cleaned = cape.copy()
for var in cape.data_vars:
    Q1 = cape[var].quantile(0.25)
    Q3 = cape[var].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    cape_cleaned[var] = cape[var].where((cape[var] >= lower_bound) & (cape[var] <= upper_bound), drop=True)
upper_bound

In [14]:
cape_cleaned = cape.copy() 
cape_cleaned = cape_cleaned.where(cape <= 8000, other=8000)
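
As a side note on the cap applied above, `where(..., other=8000)` and xarray's `clip(max=8000)` give the same result when no NaNs are present, as is the case for this CAPE file. They differ on missing values, which matters if the same cell is reused on other data. The sketch below uses a made-up `sample` array, not values from `CAPE.nc`.

```python
import numpy as np
import xarray as xr

# Hypothetical CAPE values in J/kg, with one NaN added to expose the difference
sample = xr.DataArray(np.array([0.0, 73.0, 8000.0, 9500.0, np.nan]), dims="time")

# NaN <= 8000 evaluates to False, so where() replaces the NaN with 8000 as well
capped_where = sample.where(sample <= 8000, other=8000)

# clip() caps large values but leaves the NaN in place
capped_clip = sample.clip(max=8000)

print(capped_where.values)
print(capped_clip.values)
```

Either form works for the data in this notebook. `clip` is the safer choice if missing values should stay missing.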

In [15]:
chunks = {"time": 876, "longitude": 235, "latitude": 103}
encoding = {var: {"chunksizes": [chunks[dim] for dim in cape_cleaned[var].dims]} for var in cape_cleaned.data_vars}
cape_cleaned.to_netcdf("/path/to/directory/clean/CAPE_clean.nc", encoding=encoding)

# WWLLN (Lightning) Data

In [16]:
wwlln = xr.open_dataset('/path/to/directory/raw/WWLLN.nc')
wwlln_cleaned = wwlln.copy()
wwlln_cleaned

In [17]:
# Indices of any NaN lightning values (an empty result means none are present)
np.where(np.isnan(wwlln_cleaned.ltg))

(array([], dtype=int64), array([], dtype=int64), array([], dtype=int64))

There are currently no NaNs present in the WWLLN data, so no steps need to be taken to clean the data. Similar to the precipitation data, lightning data is also heavily skewed towards 0. However, large values may only seem to be outliers when judged by standard deviations or the interquartile range, since both measures are dominated by the zeros. As such, "outliers" will not be removed.

In [None]:
chunks = {"time": 292, "lon": 360, "lat": 135}
encoding = {var: {"chunksizes": [chunks[dim] for dim in wwlln[var].dims]} for var in wwlln.data_vars}
wwlln_cleaned.to_netcdf("/path/to/directory/clean/WWLLN_clean.nc", encoding=encoding)