In [None]:
import xarray as xr
import os

# IMERG (Precipitation) Data

To keep things organized, make a directory specifically for IMERG output, since the links in the text file download a large number of netCDF files. Navigate to that directory, then proceed with the following commands.

In [None]:
os.chdir('/path/to/directory/')
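
The directory setup described above can be sketched in Python rather than done by hand. This is only a minimal sketch: the function name `make_output_dirs` and the subdirectory name "IMERG" are assumptions for illustration ("MERRA2" and "raw" mirror the paths used later in this notebook).

```python
from pathlib import Path

def make_output_dirs(base, names=("IMERG", "MERRA2", "raw")):
    """Create one subdirectory per name under `base` and return their paths."""
    base = Path(base)
    made = {}
    for name in names:
        target = base / name
        # parents=True builds missing parent folders; exist_ok=True makes reruns safe.
        target.mkdir(parents=True, exist_ok=True)
        made[name] = target
    return made
```

A call such as `dirs = make_output_dirs('/path/to/directory')` followed by `os.chdir(dirs["IMERG"])` would then place the notebook in the IMERG folder before downloading.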

The .txt file is located in the data/text_files directory. The wget command below expects the .txt file in the current working directory, so move it to the directory where you want the files downloaded.

The links in the .txt file may stop working after a few days, in which case a new list must be generated. To do so, download a new list from GES DISC (the proper dataset is located at this link: https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGHH_07/summary?keywords=imerg). An account on GES DISC is necessary to download data.

To recreate the domain used in this project, click on "Subset/Get Data" under Data Access, followed by "Get File Subsets using the GES DISC Subsetter" in the "Download Method" section. Under "Refine Date Range", set the date range to 2023-01-01 through 2023-12-31, and under "Refine Region", enter -125,24.5,-66.5,50 (west, south, east, north). Under "Select Variables", select precipitation. After this, click "Get Data" at the bottom. Click "Download Links List" in the pop-up that follows, and follow the Download Instructions.

A .txt file containing the list of links will be created. Place the .txt file in the desired directory before running the command (I am using wget for Linux/macOS). Replace the .txt filename at the end of the wget command with the name of your own list file.

In [None]:
#Download the netCDF files listed in the .txt file
!wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --keep-session-cookies --user=<username> --ask-password --content-disposition -i subset_GPM_3IMERGHH_07_20241026_141236_.txt
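
After the download finishes, it can be useful to check that the number of files on disk roughly matches the number of links in the list. The following is a minimal sketch; the helper names `read_links` and `download_summary` are hypothetical, and the list may also contain links that are not data files (documentation, for example), so the counts need not match exactly.

```python
from pathlib import Path

def read_links(list_file):
    """Return the lines of a links list that look like URLs."""
    urls = []
    for line in Path(list_file).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not an http(s) link.
        if line.startswith(("http://", "https://")):
            urls.append(line)
    return urls

def download_summary(list_file, directory=".", pattern="*.nc4"):
    """Return (number of links in the list, number of matching files on disk)."""
    n_links = len(read_links(list_file))
    n_files = len(list(Path(directory).glob(pattern)))
    return n_links, n_files
```

For example, `download_summary('subset_GPM_3IMERGHH_07_20241026_141236_.txt')` returns the two counts for the current directory; a large gap between them suggests the download was interrupted.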

In [None]:
imerg = xr.open_mfdataset('*.nc4')
imerg

Save the netCDF in a directory called "raw". Do not worry if this step takes a long time; the netCDF file is relatively large (~10 GB).

In [None]:
#Save the netCDF. Chunking allows for the large files to be saved more efficiently.
chunks = {"time": 1000, "lon": 195, "lat": 85}
encoding = {var: {"chunksizes": [chunks[dim] for dim in imerg[var].dims]} for var in imerg.data_vars}
imerg.to_netcdf("/path/to/directory/raw/IMERG.nc", encoding=encoding)
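
The chunk sizes above are tailored to this project's domain and date range. If a smaller domain or shorter period is used, a requested chunk size can exceed the length of its dimension, which the netCDF library may reject. The sketch below caps each chunk at the dimension length; `clipped_encoding` is a hypothetical helper, not part of xarray, and it works on plain dictionaries so it can be checked without opening any data.

```python
def clipped_encoding(var_dims, dim_sizes, chunks):
    """Build a to_netcdf-style encoding dict with chunks capped at each dimension length."""
    encoding = {}
    for var, dims in var_dims.items():
        # Follow each variable's own dimension order, as the cell above does.
        encoding[var] = {
            "chunksizes": [min(chunks[dim], dim_sizes[dim]) for dim in dims]
        }
    return encoding

# Assumed usage with the dataset opened above (left commented so this runs standalone):
# var_dims = {var: imerg[var].dims for var in imerg.data_vars}
# encoding = clipped_encoding(var_dims, dict(imerg.sizes), chunks)
```

With a full year of data the result is identical to the encoding built above; the cap only changes anything when a dimension is shorter than its requested chunk.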

# MERRA2 (Aerosols) Data

The process for MERRA2 data is very similar to that for IMERG. Data can be found here: https://disc.gsfc.nasa.gov/datasets/M2T1NXAER_5.12.4/summary. Use the same parameters as for the precipitation data, but under "Select Variables", choose the desired variables. For this study, BCCMASS, BCSMASS, DUCMASS, DUCMASS25, DUSMASS, DUSMASS25, OCCMASS, OCSMASS, SO2CMASS, SO2SMASS, SO4CMASS, SO4SMASS, SSCMASS, SSCMASS25, SSSMASS, and SSSMASS25 are used. "CMASS" variables correspond to aerosol column mass densities, while "SMASS" variables correspond to aerosol surface mass concentrations.

In [None]:
os.chdir('/path/to/directory/MERRA2/')

In [None]:
#Download the netCDF files listed in the .txt file
!wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --keep-session-cookies --user=<username> --ask-password --content-disposition -i subset_M2T1NXAER_5.12.4_20241030_194637_.txt

In [None]:
aer = xr.open_mfdataset('/path/to/directory/MERRA2/*.nc')
aer

In [None]:
#Save the netCDF. Chunking allows for the large files to be saved more efficiently.
chunks = {"time": 100, "lon": 94, "lat": 52}
encoding = {var: {"chunksizes": [chunks[dim] for dim in aer[var].dims]} for var in aer.data_vars}
aer.to_netcdf("/path/to/directory/raw/MERRA2.nc", encoding=encoding)

# ERA5 (CAPE) Data

CAPE data is taken from the ECMWF's ERA5 reanalysis, linked here: https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=download. In order to use the following code, an account must be made on Copernicus, and the CDSAPI client must be set up. Instructions on how to configure the CDSAPI are here: https://cds.climate.copernicus.eu/how-to-api
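
Before sending a request, it may help to confirm that the CDSAPI configuration file is in place. The client reads its settings from a `.cdsapirc` file in the home directory, written as `url: ...` and `key: ...` lines. The check below is a minimal sketch; the function names `read_cdsapirc` and `cds_config_ready` are hypothetical and are not part of the `cdsapi` package.

```python
from pathlib import Path

def read_cdsapirc(path=None):
    """Return the settings found in a CDS API config file, or {} if it is missing."""
    rc = Path(path) if path is not None else Path.home() / ".cdsapirc"
    if not rc.is_file():
        return {}
    settings = {}
    for line in rc.read_text().splitlines():
        # Split on the first colon only, because the URL itself contains colons.
        name, sep, value = line.partition(":")
        if sep:
            settings[name.strip()] = value.strip()
    return settings

def cds_config_ready(path=None):
    """True when both a url and a key are present in the config file."""
    settings = read_cdsapirc(path)
    return bool(settings.get("url")) and bool(settings.get("key"))
```

If `cds_config_ready()` returns False, revisit the configuration instructions linked above before running the download cell.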

In [None]:
os.chdir('/path/to/directory/raw/')

In [None]:
import cdsapi

dataset = "reanalysis-era5-single-levels"
request = {
    "product_type": ["reanalysis"],
    "variable": ["convective_available_potential_energy"],
    "year": ["2023"],
    "month": [
        "01", "02", "03",
        "04", "05", "06",
        "07", "08", "09",
        "10", "11", "12"
    ],
    "day": [
        "01", "02", "03",
        "04", "05", "06",
        "07", "08", "09",
        "10", "11", "12",
        "13", "14", "15",
        "16", "17", "18",
        "19", "20", "21",
        "22", "23", "24",
        "25", "26", "27",
        "28", "29", "30",
        "31"
    ],
    "time": [
        "00:00", "01:00", "02:00",
        "03:00", "04:00", "05:00",
        "06:00", "07:00", "08:00",
        "09:00", "10:00", "11:00",
        "12:00", "13:00", "14:00",
        "15:00", "16:00", "17:00",
        "18:00", "19:00", "20:00",
        "21:00", "22:00", "23:00"
    ],
    "data_format": "grib",
    "download_format": "unarchived",
    "area": [50, -125, 24.5, -66.5]
}

client = cdsapi.Client()
client.retrieve(dataset, request, "CAPE.grib")
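
The month, day, and time lists in the request can also be generated rather than typed out. Note that the "area" entry is ordered north, west, south, east, which differs from the west, south, east, north order entered in the GES DISC subsetter. The sketch below builds an equivalent request; `build_cape_request` is a hypothetical helper introduced here for illustration.

```python
def build_cape_request(year, west, south, east, north):
    """Build a CDS request dict for hourly CAPE over one year and one bounding box."""
    return {
        "product_type": ["reanalysis"],
        "variable": ["convective_available_potential_energy"],
        "year": [str(year)],
        # Zero-padded strings: "01".."12", "01".."31", "00:00".."23:00".
        "month": [f"{m:02d}" for m in range(1, 13)],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "data_format": "grib",
        "download_format": "unarchived",
        # CDS expects the bounding box as [north, west, south, east].
        "area": [north, west, south, east],
    }

request = build_cape_request(2023, west=-125, south=24.5, east=-66.5, north=50)
```

The resulting `request` can be passed to `client.retrieve` exactly as in the cell above.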

The download is a .grib file. To convert it to netCDF, use the grib_to_netcdf command, which is part of ECMWF's ecCodes package, as shown below.

In [None]:
!grib_to_netcdf CAPE.grib -o CAPE.nc
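
If the conversion fails with a "command not found" error, ecCodes is probably not installed or not on the PATH. The following sketch checks for the tool before calling it from Python; the helper names are hypothetical, and the options are placed before the input file, which should be equivalent to the command above.

```python
import shutil
import subprocess

def grib_to_netcdf_command(grib_file, nc_file):
    """Return the argument list for converting one GRIB file to netCDF."""
    return ["grib_to_netcdf", "-o", nc_file, grib_file]

def convert_grib(grib_file, nc_file):
    """Run grib_to_netcdf, raising a clear error if the tool is not installed."""
    if shutil.which("grib_to_netcdf") is None:
        raise FileNotFoundError(
            "grib_to_netcdf was not found on PATH; it is installed with ecCodes"
        )
    # check=True raises CalledProcessError if the conversion itself fails.
    subprocess.run(grib_to_netcdf_command(grib_file, nc_file), check=True)
```

Calling `convert_grib("CAPE.grib", "CAPE.nc")` from the "raw" directory would then produce the same CAPE.nc file.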

# WWLLN (Lightning) Data

Lightning data is taken from the World Wide Lightning Location Network (WWLLN), developed at the University of Washington. The data is not publicly available, but can be obtained upon request; instructions are located here: https://wwlln.net/. Because of this, I will be subsetting lightning data I have already obtained.

In [None]:
os.chdir('/home/giantstep4/data/WWLLN_gridded/3hourly/0.1degree')

In [None]:
wwlln = xr.open_mfdataset('*2023*')
wwlln

In [None]:
#Save the netCDF. Chunking allows for the large files to be saved more efficiently.
chunks = {"time": 292, "lon": 360, "lat": 135}
encoding = {var: {"chunksizes": [chunks[dim] for dim in wwlln[var].dims]} for var in wwlln.data_vars}
wwlln.to_netcdf("/path/to/directory/raw/WWLLN.nc", encoding=encoding)