# Processing HYSPLIT output

The HYSPLIT processing function takes the 'raw' HYSPLIT data, i.e. the trajectory files exactly as HYSPLIT writes them. It processes the files one by one and saves each of them in a form that also contains the calculated distance and the rotated latitudes and longitudes as grid cells.
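
How `HYSPLIT_processing` computes the distance is not shown in this notebook. A minimal sketch, assuming a great-circle (haversine) distance between a trajectory point and the receptor site, could look like the following. The name `haversine_km` and the Earth radius of 6371 km are assumptions for illustration; the coordinates are the `ZEP_lat` and `ZEP_lon` values passed to the processing call further down.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Receptor coordinates as used in this notebook (ZEP_lat, ZEP_lon)
ZEP_LAT, ZEP_LON = 78.906, 11.888

# Example: distance from the receptor to the North Pole (assumed test point)
distance_to_pole_km = haversine_km(ZEP_LAT, ZEP_LON, 90.0, 0.0)
print(round(distance_to_pole_km, 1))
```

Applied to every point of a trajectory, a function like this would give the distance column mentioned above.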

If the processing has to be done in parts, the function also picks up the files that have not been read yet.
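
Picking up only the files that have not been processed yet can be sketched as a set difference on the time stamps. This is not the module's implementation: `pending_raw_files` is a hypothetical helper, and it assumes that raw names end in `YYYYMMDDHH` and that processed names contain `YYYYMMDD_HH`, as the paths printed later in this notebook suggest.

```python
import re

def pending_raw_files(raw_files, processed_files):
    """Return the raw files whose time stamp has no processed counterpart."""
    done = set()
    for name in processed_files:
        match = re.search(r'(\d{8})_(\d{2})', name)  # e.g. 20231111_06.pickle
        if match:
            done.add(match.group(1) + match.group(2))
    pending = []
    for name in raw_files:
        match = re.search(r'(\d{10})$', name)  # trailing YYYYMMDDHH
        if match and match.group(1) not in done:
            pending.append(name)
    # order chronologically by the trailing time stamp
    return sorted(pending, key=lambda name: name[-10:])

raw = ['GDAS1nov0250autumn2023111107',
       'GDAS1nov0250autumn2023111106',
       'GDAS1nov0250autumn2023111108']
processed = ['20231111_06.pickle']
todo = pending_raw_files(raw, processed)
print(todo)
```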

The following inputs are needed:

- ***inpath*** where the 'raw' data is located 
- ***outpath_processed_trajectories*** where you want to save the processed HYSPLIT files 
- ***dict_year_to_HYSPLIT_folder*** a dictionary mapping each year to the folder that holds that year's data

Files are labelled like this: ***t20210201_00_GDAS***
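
To make the naming convention concrete, here is a small sketch that extracts the start time from a file name. It is an illustration only: `parse_start_time` and `processed_file_name` are hypothetical helpers, and the two patterns are assumptions based on the label above and on the ensemble file names that appear in the output further down.

```python
import re
from datetime import datetime

# Assumed naming patterns, tried in order
_PATTERNS = (
    re.compile(r'(\d{8})_(\d{2})'),  # e.g. t20210201_00_GDAS
    re.compile(r'(\d{8})(\d{2})$'),  # e.g. GDAS1nov0250autumn2023111106
)

def parse_start_time(file_name):
    """Return the trajectory start time encoded in a HYSPLIT file name."""
    for pattern in _PATTERNS:
        match = pattern.search(file_name)
        if match:
            return datetime.strptime(match.group(1) + match.group(2), '%Y%m%d%H')
    raise ValueError("no date found in file name: " + file_name)

def processed_file_name(file_name):
    """Build a name of the form YYYYMMDD_HH.pickle from a raw file name."""
    return parse_start_time(file_name).strftime('%Y%m%d_%H') + '.pickle'

print(parse_start_time('t20210201_00_GDAS'))
print(processed_file_name('GDAS1nov0250autumn2023111106'))
```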

In [1]:
import sys
import numpy as np
import glob 
import pandas as pd
import re

sys.path.append(r'C:\Users\DominicHeslinRees\Documents\ACP_2023\scripts\HYSPLIT-processing')
import HYSPLIT_processing as HYprocess

%load_ext autoreload
%autoreload 2

In [2]:
inpath = "C:\\Users\\DominicHeslinRees\\Documents\\Data\\HYSPLIT\\" 
outpath_processed_trajectories = inpath+"processed\\"

In [3]:
## new trajs
inpath =  "D:\\HYSPLIT_runs_ensemble\\"
outpath_processed_trajectories = "F:\\HYSPLIT\\processed\\"

dict_year_to_HYSPLIT_folder = {}
for year in np.arange(2002, 2024, 1):
    dict_year_to_HYSPLIT_folder[year] = str(year) 
print(dict_year_to_HYSPLIT_folder)

{2002: '2002', 2003: '2003', 2004: '2004', 2005: '2005', 2006: '2006', 2007: '2007', 2008: '2008', 2009: '2009', 2010: '2010', 2011: '2011', 2012: '2012', 2013: '2013', 2014: '2014', 2015: '2015', 2016: '2016', 2017: '2017', 2018: '2018', 2019: '2019', 2020: '2020', 2021: '2021', 2022: '2022', 2023: '2023'}


In [4]:
def find_HYSPLIT_files_per_year(year, inpath, dict_year_to_HYSPLIT_folder, prefix, inputyr_2digits=False):
    """List the raw HYSPLIT files for one year, with a fallback search if the first pattern finds nothing."""
    print("Year: "+str(year))
    inpath = inpath+str(dict_year_to_HYSPLIT_folder[year])+"\\" 
    print("Path: "+str(inpath))    
    list_of_files = glob.glob(inpath+str(prefix)+str(year)+'*') #4-digit year after the prefix
    if len(list_of_files) == 0:
        print(year)        
        if not inputyr_2digits:
            list_of_files = glob.glob(inpath+str(prefix)+str(year)[2:]+'*') #last 2 digits of the year
        else:
            list_of_files = glob.glob(inpath+str(prefix)+str(year)+'*') #4-digit year
            #keep names ending in the year plus six digits (MMDDHH); anchoring at the end
            #avoids dropping names where the year digits recur, e.g. ...2022102022
            list_of_files = [x for x in list_of_files if re.search(str(year)+r'\d{6}$', x)]
    print("Number of HYSPLIT files for "+str(year)+": "+str(len(list_of_files)))
    return list_of_files

In [5]:
list_of_files = find_HYSPLIT_files_per_year(2023, inpath, dict_year_to_HYSPLIT_folder, prefix='*', 
                                            inputyr_2digits=True)

Year: 2023
Path: D:\HYSPLIT_runs_ensemble\2023\
Number of HYSPLIT files for 2023: 7631


In [30]:
HYprocess.create_processed_data(outpath_processed_trajectories=r"F:\\HYSPLIT\\processed\\", 
                                inpath="D:\\HYSPLIT_runs_ensemble\\", dict_year_to_HYSPLIT_folder=dict_year_to_HYSPLIT_folder, 
                                years=np.arange(2023, 2024,1), process_data=True, ZEP_lat=78.906,ZEP_lon=11.888,
                                prefix='*', use_last_file_processed=True, cut_traj=False, save=True)

save processed file to: F:\\HYSPLIT\\processed\\
Years to process: [2023]
year: 2023
Year: 2023
Path: D:\HYSPLIT_runs_ensemble\2023\
2023
number of files: 7631
inputyr_2digits=False i.e. str(year)
added day and month now*
D:\HYSPLIT_runs_ensemble\2023\**2023*
Number of HYSPLIT files for 2023: 7621
F:\\HYSPLIT\\processed\\2023\
last file processed: 20230721_05.pi
index of last processed file: 7360
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111106
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231111_06.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111107
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231111_07.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111108
starting index: 34
number of trajs per unit time: 27
...

starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231112_15.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111216
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231112_16.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111217
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231112_17.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111218
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231112_18.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111219
starting index: 34
...

number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231114_02.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111403
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231114_03.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111404
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231114_04.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111405
starting index: 34
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20231114_05.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1nov0250autumn2023111406
starting index: 34
number of trajs per unit time: 27
...

number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230714_15.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071416
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230714_16.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071417
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230714_17.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071418
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230714_18.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071419
starting index: 38
number of trajs per unit time: 27
...

number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230716_02.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071603
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230716_03.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071604
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230716_04.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071605
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230716_05.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071606
starting index: 38
number of trajs per unit time: 27
...

number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230717_13.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071714
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230717_14.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071715
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230717_15.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071716
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230717_16.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071717
starting index: 38
number of trajs per unit time: 27
...

starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230719_00.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071901
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230719_01.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071902
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230719_02.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071903
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230719_03.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023071904
starting index: 38
...

number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230720_11.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023072012
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230720_12.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023072013
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230720_13.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023072014
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\\HYSPLIT\\processed\\\2023
F:\\HYSPLIT\\processed\\2023\20230720_14.pickle
read file: D:\HYSPLIT_runs_ensemble\2023\GDAS1jul0250summer2023072015
starting index: 38
number of trajs per unit time: 27
...

# Work out what is left to process: 

In [6]:
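def get_digits(string_with_digits):
    """Return every run of consecutive digits in the string, in order."""
    return re.findall(r'\d+', string_with_digits)

def sample_filter_name_to_data(sample_name):
    """Convert a processed file name such as '20231111_06.pickle' to a timestamp."""
    sample_name_unaltered = sample_name
    sample_name = sample_name[-18:] #keep only 'YYYYMMDD_HH.pickle' (18 characters)
    digits = get_digits(sample_name)
    try:
        date, hour = digits[0], digits[1]
        return pd.to_datetime(date+hour, format='%Y%m%d%H')
    except (IndexError, ValueError):
        #name does not follow the expected pattern: report it and return a missing timestamp
        print("error:")
        print(sample_name)
        print(sample_name_unaltered)
        return pd.NaT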

In [7]:
import calendar

def fraction_done(list_files):
    """Per month, compare the number of hourly files found with the hours in that month.

    Relies on the global variable `year` being set before the call.
    """
    datetimes = [sample_filter_name_to_data(x) for x in list_files]
    df_datetimes = pd.DataFrame(index=datetimes)
    df_datetimes.index = pd.to_datetime(df_datetimes.index)
    df_datetimes['month'] = df_datetimes.index.month
    #name the counts 'month' explicitly; recent pandas versions name them 'count'
    df_months = df_datetimes['month'].value_counts().rename('month').to_frame()
    rest = list(set(np.arange(1, 13, 1))-set(df_months.index)) #rest of the months not run yet

    for i in rest:
        df_months.loc[str(i)] = np.nan    
    df_months.index = df_months.index.astype(int)    

    dict_months_to_days = {1:31,2:28,3:31,4:30,5:31,6:30,7:31,8:31,9:30,10:31,11:30,12:31}
    if calendar.isleap(year): #leap year: February has 29 days
        print("leap year:")
        dict_months_to_days[2] = 29
    else:
        print("normal year:")

    df_months['potential_days'] = df_months.index.map(dict_months_to_days)
    df_months['potential_data_points'] = df_months['potential_days']*24
    df_months['fraction'] = df_months['month']/df_months['potential_data_points']
    df_months = df_months.sort_index()
    fraction = df_months['month'].sum() / df_months['potential_data_points'].sum()
    df_months = df_months.rename(columns={'month':'hours_in_month_generated',
                                          'potential_data_points':'potential_hourly_data_points'})
    print("fraction produced: "+str(fraction))
    return df_months

In [8]:
year = 2023

path_HYSPLIT_files = r'F:\\HYSPLIT\\processed\\'
files = glob.glob(path_HYSPLIT_files+str(year)+"\\*")

fraction_done(files)

normal year:
fraction produced: 0.0


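| month | hours_in_month_generated | potential_days | potential_hourly_data_points | fraction |
|---|---|---|---|---|
| 1 | NaN | 31 | 744 | NaN |
| 2 | NaN | 28 | 672 | NaN |
| 3 | NaN | 31 | 744 | NaN |
| 4 | NaN | 30 | 720 | NaN |
| 5 | NaN | 31 | 744 | NaN |
| 6 | NaN | 30 | 720 | NaN |
| 7 | NaN | 31 | 744 | NaN |
| 8 | NaN | 31 | 744 | NaN |
| 9 | NaN | 30 | 720 | NaN |
| 10 | NaN | 31 | 744 | NaN |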
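
The arithmetic behind this table (files found per month divided by days in the month times 24) can also be checked without pandas. The following is a minimal standard-library sketch, not the notebook's own function; `monthly_coverage` is a hypothetical name, and the three example time stamps are made up for illustration.

```python
import calendar
from collections import Counter
from datetime import datetime

def monthly_coverage(timestamps, year):
    """Map month -> (hours present, hours possible, fraction present)."""
    # a set removes duplicate time stamps before counting
    present = Counter(ts.month for ts in set(timestamps) if ts.year == year)
    coverage = {}
    for month in range(1, 13):
        days = calendar.monthrange(year, month)[1]  # handles leap years
        possible = days * 24
        coverage[month] = (present[month], possible, present[month] / possible)
    return coverage

# Illustrative input: three consecutive hours on 11 November 2023
example = [datetime(2023, 11, 11, hour) for hour in range(6, 9)]
cov = monthly_coverage(example, 2023)
print(cov[11])
```

Using `calendar.monthrange` avoids a hard-coded list of leap years, so the same function works for any year.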


# Remaining ones: 

In [40]:
def get_datetimes_not_processed(outpath_processed_trajectories = "F:\\HYSPLIT\\processed\\", year=2023):
    """Return the hours of the year (as 'YYYYMMDDHH' strings) that have no processed file."""
    list_of_files = glob.glob(outpath_processed_trajectories+str(year)+'\\*')
    datetimes = [sample_filter_name_to_data(x) for x in list_of_files]
    datetime_range = pd.date_range(start=str(year)+'-01-01 00:00:00', end=str(year)+'-12-31 23:59:59', freq='H')
    datetimes_not_processed = list(set(datetime_range) - set(datetimes)) #datetimes not processed
    datetimes_not_processed_digits = [x.strftime('%Y%m%d%H') for x in datetimes_not_processed]
    return datetimes_not_processed_digits

def find_unprocessed_files(datetimes_not_processed_digits, list_of_HYSLPLIT_files):
    """Return the first raw file whose name contains each unprocessed 'YYYYMMDDHH' string."""
    found_HYSLIT_files = [] 
    for unprocessed in datetimes_not_processed_digits:
        matching = [s for s in list_of_HYSLPLIT_files if unprocessed in s]
        if len(matching) > 0:
            found_HYSLIT_files.append(matching[0])
    return found_HYSLIT_files

In [41]:
datetimes_not_processed_digits = get_datetimes_not_processed(outpath_processed_trajectories = "F:\\HYSPLIT\\processed\\", 
                                                             year=2022)
list_of_HYSLPLIT_files = find_HYSPLIT_files_per_year(2022, inpath, dict_year_to_HYSPLIT_folder, prefix='*', 
                                            inputyr_2digits=True)
found_HYSLIT_files = find_unprocessed_files(datetimes_not_processed_digits, list_of_HYSLPLIT_files)

Year: 2022
Path: D:\HYSPLIT_runs_ensemble\2022\
Number of HYSPLIT files for 2022: 8736


In [42]:
found_HYSLIT_files

['D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1oct0250autumn2022102022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1aug0250summer2022082022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1dec0250winter2022120221',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1feb0250winter2022020220',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1jul0250summer2022072022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1feb0250winter2022022022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1nov0250autumn2022112022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1mar0250spring2022032022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1dec0250winter2022122022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1apr0250spring2022042022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1feb0250winter2022020223',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1dec0250winter2022120223',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1dec0250winter2022120220',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1jan0250winter2022012022',
 'D:\\HYSPLIT_runs_ensemble\\2022\\GDAS1dec0250winter2022120222',
 'D:\\HYSP
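
Many of the names in this list contain the year digits a second time in the day and hour part, for example `2022102022` (20 October, 22:00). That pattern would be consistent with a filter of the kind `len(x.split(str(year))[-1]) == 6`, which splits each name on the year string and checks the length of the last piece, since such a check rejects these names. Whether this is the actual cause is an assumption, not something the notebook confirms. The sketch below compares that check with an end-anchored regular expression; `keep_by_split` and `keep_by_regex` are hypothetical names.

```python
import re

def keep_by_split(name, year):
    """Length-of-last-piece check; fails when the year digits repeat in the name."""
    return len(name.split(str(year))[-1]) == 6

def keep_by_regex(name, year):
    """Require the year followed by six digits (MMDDHH) at the end of the name."""
    return re.search(rf'{year}\d{{6}}$', name) is not None

# File names taken from the outputs in this notebook
repeated_year = 'GDAS1oct0250autumn2022102022'
overlapping_year = 'GDAS1dec0250winter2022120221'
ordinary = 'GDAS1nov0250autumn2023111106'

for name, year in [(repeated_year, 2022), (overlapping_year, 2022), (ordinary, 2023)]:
    print(name, keep_by_split(name, year), keep_by_regex(name, year))
```

Anchoring the pattern at the end of the name keeps all three files, while the split-based check keeps only the last one.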

In [43]:
ZEP_lat=78.906; ZEP_lon=11.888
South_grid = HYprocess.get_South_grid(ZEP_lat, ZEP_lon)

HYprocess.process_HYSPLIT_and_save(2022, found_HYSLIT_files, South_grid, outpath_processed_trajectories, 
                                   cut_traj=False, save=True,
                                   csv=False, pickle=True, parquet=False)

read file: D:\HYSPLIT_runs_ensemble\2022\GDAS1oct0250autumn2022102022
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\HYSPLIT\processed\\2022
F:\HYSPLIT\processed\2022\20221020_22.pickle
read file: D:\HYSPLIT_runs_ensemble\2022\GDAS1aug0250summer2022082022
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\HYSPLIT\processed\\2022
F:\HYSPLIT\processed\2022\20220820_22.pickle
read file: D:\HYSPLIT_runs_ensemble\2022\GDAS1dec0250winter2022120221
starting index: 38
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\HYSPLIT\processed\\2022
F:\HYSPLIT\processed\2022\20221202_21.pickle
read file: D:\HYSPLIT_runs_ensemble\2022\GDAS1feb0250winter2022020220
starting index: 37
number of trajs per unit time: 27
length_num_trajs: 241.0
make folder: F:\HYSPLIT\processed\\2022
F:\HYSPLIT\processed\2022\20220202_20.pickle
read file: D:\HYSPLIT_runs_ensemble\2022\GDAS1jul0250summer2022072022
st