In [None]:
import numpy as np
from datetime import datetime, date, time, timedelta
from dateutil import parser
import csv

import h5py
import os, fnmatch

In [None]:
def pattern_finder(home_dir, pattern):
    """Return the names of all files under home_dir that match an fnmatch pattern."""
    names = []
    for root, dirs, files in os.walk(home_dir):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                names.append(name)

    return names

This project investigates a deep learning system that forecasts the north-south component of the interplanetary magnetic field (IMF), Bz, measured in the GSM coordinate system at Lagrange point 1 (L1). The forecast covers a two-day window, 3 to 5 days ahead of the corresponding Solar and Heliospheric Observatory (SOHO) image products.
The task is cast as a binary classification problem and compared against a Gaussian Naive Bayes classifier baseline.
The label is not the raw Bz from OMNI. Instead, we define the new Bz as the minimum of the raw Bz minus the mean of the raw Bz over the two-day window mentioned above.
The ground truth labels are computed from the low-resolution, hourly-averaged values in the **OMNI** database: [OMNI2](https://omniweb.gsfc.nasa.gov/html/ow_data.html). The portion of this OMNI set made available here runs from Jan. 1st 1999 to Dec. 31st 2010.
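
Written out, the label defined above can be sketched as follows. The notation is introduced here for illustration only: $t$ is the SOHO image time rounded to the hour, $W(t)$ is assumed to be the set of hourly OMNI samples with an available Bz value from $t + 72\,\mathrm{h}$ up to and including $t + 119\,\mathrm{h}$ (48 hourly slots), and $\tilde{B}_z$ is the derived label.

$$
\tilde{B}_z(t) \;=\; \min_{\tau \in W(t)} B_z(\tau) \;-\; \frac{1}{|W(t)|} \sum_{\tau \in W(t)} B_z(\tau)
$$

Because a minimum can never exceed the mean of the same values, $\tilde{B}_z(t) \le 0$ for every window.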

Provide the path to the downloaded OMNI data set to the ```np.genfromtxt()``` function in the cell below.
The definition of the 'nulls' variable below comes from the 'Fill values' column in Section 3, '**_Description of records and words_**', of the OMNI documentation. A total of 57 columns are listed there, but only 55 are present in the data set, because 'F9.6' and 'F7.4' (columns 55 and 56, respectively) are omitted. Descriptions of the data are provided at the NASA URL above. This step homogenizes the representation of all missing values in the data set. Although it is not strictly needed for our use case, it is useful if the user wishes to use a different variable as the ground truth label.
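
The masking idea can be tried on a small table first. The sketch below is only an illustration: the three-column array and its fill values are invented, and the names `toy_raw`, `toy_fill` and `toy_clean` are not part of the notebook.

```python
import numpy as np

# Toy table with three columns; all values are invented for illustration.
toy_raw = np.array([
    [1999.0, 999.9,  5.0],
    [1999.0,   3.2, 99.0],
    [1999.0,  -1.5,  7.0],
])

# One fill value per column. np.nan marks a column without a fill value:
# NaN never compares equal to anything, so nothing is masked in that column.
toy_fill = np.array([np.nan, 999.9, 99.0])

# Broadcast the per-column fill values over all rows and replace matches by NaN.
toy_clean = np.where(toy_raw == toy_fill[None, :], np.nan, toy_raw)
print(toy_clean)
```

Each column is compared only against its own fill value, so a legitimate 99.0 in a column whose fill value is 999.9 would be left alone.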

In [None]:
PATH_TO_OMNI_DATA = '/home/carl/Documents/OMNI_DATA/'
omni2_data_raw = np.genfromtxt(f'{PATH_TO_OMNI_DATA}omni2_1999-01-01_to_2010-12-31.txt')
nulls = np.array([np.nan, np.nan, np.nan, 9999, 99, 99, 999, 999, 999.9, 999.9, 999.9, 999.9,
    999.9, 999.9, 999.9, 999.9, 999.9, 999.9, 999.9, 999.9, 999.9, 999.9,
    9999999., 999.9, 9999., 999.9, 999.9, 9.999, 99.99, 9999999., 999.9, 9999.,
    999.9, 999.9, 9.999, 999.99, 999.99, 999.9, 99, 999, 99999, 9999,
    999999.99, 99999.99, 99999.99, 99999.99, 99999.99, 99999.99, 0, 999, 999.9,
    999.9, 99999, 99999, 99.9
])
print('len(nulls):', len(nulls))
omni2_data_nan_processed = np.where(omni2_data_raw == nulls[None, :], np.nan, omni2_data_raw)
print('shape(omni2_data_nan_processed):', np.shape(omni2_data_nan_processed))

Convert the OMNI date-time format of 'Year-Day_of_Year-Time_of_Day' to a date-time object.

Select the Bz GSM column, which is at column index 16 (zero-based).

Identify dates for which the ground truth label is missing.

**Note**: If missing ground truth labels affected our approach, we would need to account for them by removing the corresponding SOHO images from the data cubes. With the procedure used in this project to arrive at the computed Bz value, the missing raw Bz values do not require any modification to the image cubes produced by the pipeline. With a different ground truth label or forecasting window, they might.
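
As a small worked illustration of the date conversion and the missing-value check, consider the sketch below. The helper `omni_to_datetime` and the toy arrays are hypothetical and only mirror the idea on three made-up records.

```python
import numpy as np
from datetime import datetime, timedelta

def omni_to_datetime(year, day_of_year, hour):
    """Hypothetical helper: (year, day of year, hour of day) -> datetime."""
    # Day 1 is Jan. 1st, hence the "- 1".
    return datetime(int(year), 1, 1) + timedelta(days=int(day_of_year) - 1, hours=int(hour))

# Made-up records: (year, day of year, hour).
toy_rows = [(1999, 1, 0), (1999, 32, 5), (2000, 366, 23)]
toy_dates = [omni_to_datetime(*row) for row in toy_rows]

# Made-up Bz values; NaN marks a missing measurement.
toy_bz = np.array([1.2, np.nan, -3.4])
missing_idx = np.where(np.isnan(toy_bz))[0]
missing_dates = [str(toy_dates[i]) for i in missing_idx]

print(toy_dates)
print(missing_dates)
```

Day 32 of 1999 maps to Feb. 1st, and day 366 of the leap year 2000 maps to Dec. 31st.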

In [None]:
# Columns 0, 1 and 2 hold the year, day of year and hour of each record.
omni2_date = [datetime(int(yr), 1, 1) + timedelta(days=int(day) - 1) + timedelta(hours=hr)
              for yr, day, hr in omni2_data_nan_processed[:, :3]]

omni2_Bz = omni2_data_nan_processed.T[16]

# NaN entries mark hours without a Bz measurement.
missing_Bz_vals_ind = np.where(np.isnan(omni2_Bz))[0]
print('Missing ground truth labels from Jan. 1st 1999 to Dec. 31st 2010 in data:\n', [str(omni2_date[ind]) for ind in missing_Bz_vals_ind])
print('Number of missing ground truth labels:', len(missing_Bz_vals_ind))

Convert the OMNI date-time objects from the previous step to strings, so that they can be matched directly against the synced date-time strings.

In [None]:
omni2_date_copy_str = [str(elem) for elem in omni2_date]

Provide the path to the **CSV** file containing the synced times for which to obtain the ground truth labels, and specify ```dtype = 'str'```.
As a check, confirm that the printed length matches the number of rows you expect in the CSV file.

In [None]:
PATH_TO_SYNCED_TIMES = '/home/carl/Documents/synced_csvs_1_3_7fams/'
synced_times = np.genfromtxt(f'{PATH_TO_SYNCED_TIMES}1999-01-01_to_2010-12-31_MDI_96m_6_6_128_times_sync_onlyMDI.csv',dtype = 'str')
print('len(synced_times):', len(synced_times))

Convert each date-time string of the form '19990202160302' to a date-time object, round it to the nearest hour, add 3 days, and convert the result back to a string. Rounding enables direct comparison with the hourly OMNI date-time objects.
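
The rounding rule can be checked on single time stamps. This is a minimal sketch: `round_to_hour` is a hypothetical helper, and `datetime.strptime` with an explicit format is used here instead of `dateutil` only to keep the example free of third-party imports.

```python
from datetime import datetime, timedelta

def round_to_hour(ts):
    """Hypothetical helper: round to the nearest hour (minute 30 and above rounds up)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    # ts.minute // 30 is 0 for minutes 0-29 and 1 for minutes 30-59.
    return floored + timedelta(hours=ts.minute // 30)

stamp = datetime.strptime('19990202160302', '%Y%m%d%H%M%S')
rounded = round_to_hour(stamp)
label_start = rounded + timedelta(days=3)

# Adding a timedelta (rather than replacing the hour) handles day and year rollover.
late = round_to_hour(datetime(1999, 12, 31, 23, 45))

print(rounded, label_start, late)
```

Seconds are ignored by this rule, so 16:29:59 rounds down to 16:00.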

In [None]:
synced_times_datetimes = [parser.parse(elem) for elem in synced_times]
synced_times_datetimes_rounded = [elem.replace(second = 0, microsecond = 0, minute = 0, hour = elem.hour) + timedelta(hours = elem.minute//30) for elem in synced_times_datetimes]
synced_times_datetimes_rounded_3days_ahead = np.array(synced_times_datetimes_rounded) + timedelta(days = 3)
synced_times_datetimes_rounded_3days_ahead_copy_str = [str(elem) for elem in synced_times_datetimes_rounded_3days_ahead]

The synced times can now be compared directly with the OMNI times to compute Bmin - Bmean, 3-5 days ahead of the SOHO image product times. Adding 48 (hours) to the start index reaches the 5-day mark that closes the two-day window. np.nanmin and np.nanmean are needed because of the 24 missing raw Bz values. This is the most time-consuming step so far.
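
To make the window arithmetic concrete, the sketch below computes one label on an invented hourly series. Everything in it is an assumption made for illustration: the toy series, the variable names, and the dictionary lookup, which is one possible alternative to a repeated `np.where` search.

```python
import numpy as np
from datetime import datetime, timedelta

# Toy hourly series: 100 hours from 1999-01-01 00:00 with invented values 0, 1, ..., 99.
toy_times = [datetime(1999, 1, 1) + timedelta(hours=h) for h in range(100)]
toy_bz = np.arange(100, dtype=float)
toy_bz[12] = np.nan  # one missing measurement inside the window

# Map each time string to its row index once, so every lookup is a dictionary access.
index_of = {str(t): i for i, t in enumerate(toy_times)}

# Start of the two-day window (this plays the role of "image time + 3 days").
start = index_of[str(datetime(1999, 1, 1, 10))]
window = toy_bz[start:start + 48]  # 48 hourly slots

# The nan-aware reductions skip the missing hour instead of returning NaN.
label = np.round(np.nanmin(window) - np.nanmean(window), 2)
print(start, len(window), label)
```

With plain `np.min` and `np.mean`, the single NaN in the window would turn the whole label into NaN.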

In [None]:
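# Build the array of OMNI time strings once instead of once per synced time.
omni2_date_copy_str_arr = np.array(omni2_date_copy_str)
omni2_3days_ahead_ind = [np.where(omni2_date_copy_str_arr == elem)[0][0] for elem in synced_times_datetimes_rounded_3days_ahead_copy_str]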
omni2_Bzmin_minus_mean_3to5days_ahead = [ np.round(np.nanmin(omni2_Bz[elem: elem + 48]) - np.nanmean(omni2_Bz[elem: elem + 48]),2) for elem in omni2_3days_ahead_ind]

Optionally, write the computed Bz ground truth labels corresponding to the synced times to a new CSV file in the same directory as the synced times. The name of the output CSV file can be changed.

In [None]:
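# Mode 'w' starts a fresh file on every run; newline='' lets the csv module control line endings.
with open(f'{PATH_TO_SYNCED_TIMES}3products_Bz_min_minus_mean_3to5days_ahead.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write one label per row.
    writer.writerows([val] for val in omni2_Bzmin_minus_mean_3to5days_ahead)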

Reading in the HDF5 data cubes

In [None]:
home_dir = '/home/carl/Documents/juniper_datacubes_3fams/MDI_19990101_20101231_6_6_128/'
name_list = pattern_finder(home_dir, pattern = '*sync.h5')
print('cube name list:', name_list)

cu_list = []
for cu in name_list:
    # Open read-only and close the file as soon as the cube is in memory.
    with h5py.File(f'{home_dir}{cu}', 'r') as data_set:
        first_key = list(data_set.keys())[0]
        print(first_key)
        data = data_set[first_key][:]
    print(np.shape(data))
    cu_list.append(data)

print(np.shape(cu_list))
print(len(cu_list))

Constructing the 'rolling data split'. In the first year of data, training covers Feb. - Aug., validation is Oct., and testing is Nov. In each subsequent year, these periods are shifted forward cyclically by one month.
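
The month schedule behind this split can be previewed without any data. The sketch below is an illustration under the stated rule only; `split_months` is a hypothetical helper that returns month numbers, not sample indices.

```python
import numpy as np

MONTHS = np.arange(1, 13)  # 1 = Jan., ..., 12 = Dec.

def split_months(years_since_start):
    """Hypothetical helper: months used for each subset in a given year.

    years_since_start = 0 is the first year of data.
    """
    # Roll by one extra position per year; the first year starts at Feb.
    order = np.roll(MONTHS, -(years_since_start + 1))
    return {
        'train': order[0:7].tolist(),  # seven consecutive months
        'valid': [int(order[8])],      # position 7 is skipped
        'test': [int(order[9])],       # positions 10 and 11 stay unused
    }

for offset in (0, 1, 5):
    print(offset, split_months(offset))
```

For an offset of 5, the training months come out as Jul. - Dec. followed by Jan., which is the wrap-around case in which the training range has to be split into two index ranges within the same calendar year.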

In [None]:
synced_times_datetimes_rounded_years = np.array([elem.year for elem in synced_times_datetimes_rounded])
synced_times_datetimes_rounded_months = np.array([elem.month for elem in synced_times_datetimes_rounded])

start_ind_train = []
end_ind_train = []

start_ind_valid = []
end_ind_valid = []

start_ind_test = []
end_ind_test = []


# Hardcoded: Feb [0] - Aug [6], Oct [8], Nov [9] in Python index for start of year in new_month_order sequence
# In our case will have 18 indices because of splitting of indices to obtain proper start and end indices

month_order = np.arange(1,12+1)

counter = 1
for yr in range(synced_times_datetimes_rounded_years[0],synced_times_datetimes_rounded_years[-1] + 1):
    new_month_order = np.roll(month_order,-counter) #since we start from Feb.1999 so roll back year by 1 (i.e., -1) 
    
    if new_month_order[6] < new_month_order[0]: # Aug. is [6] in the first new_month_order for 1999.
        pos_one = np.where(new_month_order == 1)[0] # 1 is the location of Jan.
        ind_start_train_part1 = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[pos_one]))[0][0]
        ind_end_train_part1 = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[pos_one + (6-pos_one)]))[0][-1] #6 is for Aug. position
        
        ind_start_train_part2 = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[0]))[0][0]
        ind_end_train_part2 = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[pos_one-1]))[0][-1]

        start_ind_train.append(ind_start_train_part1) 
        start_ind_train.append(ind_start_train_part2)
        end_ind_train.append(ind_end_train_part1)
        end_ind_train.append(ind_end_train_part2)
        
        
        ind_start_valid = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[8]))[0][0] 
        start_ind_valid.append(ind_start_valid)

        ind_end_valid = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[8]))[0][-1]
        end_ind_valid.append(ind_end_valid)


        ind_start_test = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[9]))[0][0]
        start_ind_test.append(ind_start_test)

        ind_end_test = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[9]))[0][-1]
        end_ind_test.append(ind_end_test) 

        
    else:
    
        ind_start_train = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[0]))[0][0]
        ind_end_train = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[6]))[0][-1]
        
        start_ind_train.append(ind_start_train)
        end_ind_train.append(ind_end_train)
        
    
        ind_start_valid = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[8]))[0][0] 
        start_ind_valid.append(ind_start_valid)

        ind_end_valid = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[8]))[0][-1]
        end_ind_valid.append(ind_end_valid)


        ind_start_test = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[9]))[0][0]
        start_ind_test.append(ind_start_test)

        ind_end_test = np.where((synced_times_datetimes_rounded_years == yr) & (synced_times_datetimes_rounded_months == new_month_order[9]))[0][-1]
        end_ind_test.append(ind_end_test)       
        
    print('yr:', yr)
    counter += 1


diff_train = np.array(end_ind_train) - np.array(start_ind_train)

diff_valid = np.array(end_ind_valid) - np.array(start_ind_valid)

diff_test = np.array(end_ind_test) - np.array(start_ind_test)

diff_train_sum = np.sum(diff_train)
print('training data size:', diff_train_sum)

diff_valid_sum = np.sum(diff_valid)
print('validation data size:', diff_valid_sum)

diff_test_sum = np.sum(diff_test)
print('test data size:', diff_test_sum)

Data statistics

In [None]:
print('total data available:', len(synced_times))
print('expected amount of available data used in 7 months:', int(np.round((7./12)*len(synced_times),0))) #7 months of the year for training
data_total = diff_train_sum + diff_valid_sum + diff_test_sum
print('actual total data used:', data_total)
print('perc. data used from total data available:', data_total/len(synced_times))
print('perc. train:', diff_train_sum/data_total)
print('perc. valid:', diff_valid_sum/data_total)
print('perc. test:', diff_test_sum/data_total)

First, combine the data using the rolling indices calculated in the preceding step.

In [None]:
train_indices = list(zip(start_ind_train, end_ind_train))
valid_indices = list(zip(start_ind_valid, end_ind_valid))
test_indices = list(zip(start_ind_test, end_ind_test))

SOHO_grand_train_set = []
SOHO_grand_valid_set = []
SOHO_grand_test_set = []

Bz_train_set = []
Bz_valid_set = []
Bz_test_set = []

for l in range(len(cu_list)):

    for i,j in train_indices:
        SOHO_grand_train_set += list(cu_list[l][i:j]) #read in MDI data here too and index it accordingly
        Bz_train_set += list(omni2_Bzmin_minus_mean_3to5days_ahead[i:j])

    for i,j in valid_indices:
        SOHO_grand_valid_set += list(cu_list[l][i:j])
        Bz_valid_set += list(omni2_Bzmin_minus_mean_3to5days_ahead[i:j])

    for i,j in test_indices:
        SOHO_grand_test_set += list(cu_list[l][i:j])
        Bz_test_set += list(omni2_Bzmin_minus_mean_3to5days_ahead[i:j])
    
# Double checking that same length train, validation, and test sets as previously found.   

print('len(SOHO_grand_train_set):', len(SOHO_grand_train_set))
print('len(SOHO_grand_valid_set):', len(SOHO_grand_valid_set))
print('len(SOHO_grand_test_set):', len(SOHO_grand_test_set))


Bz_train_set_len = len(Bz_train_set)
Bz_valid_set_len = len(Bz_valid_set)
Bz_test_set_len = len(Bz_test_set)

print('len(Bz_train_set):', Bz_train_set_len)
print('len(Bz_valid_set):', Bz_valid_set_len)
print('len(Bz_test_set):', Bz_test_set_len)
print('total data size:', Bz_train_set_len + Bz_valid_set_len + Bz_test_set_len)

Calculate the threshold Bz values from the training set of the MDI-only family. For consistency, these Bz values are also used to binarize the validation and test sets, both within the MDI-only experiment and in the other two experiments (MDI + LASCO C2 + EIT195, and all 7 SOHO products together).
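
The threshold selection amounts to reading off the sorted training labels at fixed fractional ranks. A minimal sketch on invented data is given below; the evenly spaced toy labels and all variable names are assumptions made for illustration, and `np.argmin` is used here to pick the rank nearest to each fraction.

```python
import numpy as np

# Invented labels: 1001 evenly spaced values from 0 down to -20.
toy_labels = -np.linspace(0.0, 20.0, 1001)

fractions = [0.05, 0.10, 0.15, 0.20, 0.25]

# Sort ascending, so the most negative values come first.
sorted_labels = np.sort(toy_labels)

# Fractional rank of every sorted sample, running from 0 to 1.
ranks = np.arange(len(sorted_labels)) / (len(sorted_labels) - 1.0)

# For each target fraction, take the sample whose rank is closest to it.
nearest = [int(np.argmin(np.abs(ranks - f))) for f in fractions]
toy_thresholds = [float(sorted_labels[i]) for i in nearest]

print(nearest)
print(toy_thresholds)
```

Roughly 5% of the toy labels lie at or below the first threshold, 10% at or below the second, and so on.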

In [None]:
thresholds = list(np.arange(5,30,5)/100.)
print('thresholds:', thresholds)

Bz_train_set_sorted = np.sort(Bz_train_set) #most negative to least negative values

p_val_thres = np.round(np.arange(len(Bz_train_set_sorted)) / (len(Bz_train_set_sorted) - 1.),4)

# Index of the sorted sample whose fractional rank is closest to each threshold.
ind_p = [np.argmin(np.abs(p_val_thres - elem)) for elem in thresholds]

Bz_train_set_sorted_thresholds = [Bz_train_set_sorted[elem] for elem in ind_p]
print('Bz_train_set_sorted_thresholds:', Bz_train_set_sorted_thresholds)

The Bz thresholds obtained from the MDI-only experiment are -11.6, -8.91, -7.76, -6.94, and -6.27 nT at the (5, 10, 15, 20, 25)% levels.
These thresholds are fixed for all three experiments.
Binarize the ground truth labels accordingly.
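
For a single threshold, the binarization rule can be sketched as follows. The toy labels are invented; the threshold is the 10% value quoted above, and a label at or below it is assigned class 1.

```python
import numpy as np

# Invented labels in nT.
toy_labels = np.array([-12.0, -3.1, -8.91, -7.0, -9.5])
threshold = -8.91  # 10% threshold quoted above

# Class 1: label at or below the threshold; class 0: label above it.
toy_classes = np.where(toy_labels <= threshold, 1, 0)
print(toy_classes)
```

A label exactly equal to the threshold falls into class 1 because the comparison is inclusive.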

In [None]:
ind_class_1_label_list = []
ind_class_0_label_list = []

Bz_total_set = list(Bz_train_set) + list(Bz_valid_set) + list(Bz_test_set)
print('len(Bz_total_set):', len(Bz_total_set))

Bz_total_set_copy_list = []

# 5 lists of total sets (train + valid + test) corresponding to the 5 thresholds
for val in Bz_train_set_sorted_thresholds: 
    # A fresh NumPy array per threshold, so that index arrays can be used for assignment below.
    Bz_total_set_copy = np.array(Bz_total_set)
    
    print('val:', val)
    ind1 = np.where(np.array(Bz_total_set) <= val)[0]
    ind0 = np.where(np.array(Bz_total_set) > val)[0]
    
    ind_class_1_label_list.append(list(ind1)) #this is for the entire data set
    ind_class_0_label_list.append(list(ind0)) #this is for the entire data set
    
    Bz_total_set_copy[ind1] = 1
    Bz_total_set_copy[ind0] = 0
    
    Bz_total_set_copy_list.append(list(Bz_total_set_copy))

With the data ready and loaded, we are ready to conduct the three ML experiments with a deep CNN. We need to seed the random values first to ensure reproducibility of the run.
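
A minimal seeding sketch is shown below. It covers only Python's `random` module and NumPy; `SEED` and `seed_everything` are hypothetical names, and the deep learning framework used for the CNN (not specified in this section) would need its own seed call in addition.

```python
import random
import numpy as np

SEED = 42  # arbitrary value chosen for illustration

def seed_everything(seed):
    """Hypothetical helper: seed the Python and NumPy global random generators."""
    random.seed(seed)
    np.random.seed(seed)

# Seeding twice with the same value reproduces the same random draws.
seed_everything(SEED)
first_draw = (random.random(), np.random.rand(3))
seed_everything(SEED)
second_draw = (random.random(), np.random.rand(3))

print(first_draw[0] == second_draw[0], np.array_equal(first_draw[1], second_draw[1]))
```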