In [5]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


# Configure Matplotlib

Choose a Matplotlib style sheet to match your browser theme (dark or light).

Increase the figure size to match the notebook width.

Set the graph colour variables.

In [6]:
#style_sheet = 'dark_background' # dark theme
style_sheet = 'default'         # light theme
plt.style.use(style_sheet)

plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 200

dsc_colour = 'green'
wnd_colour = 'magenta'
anl_colour = 'red'

# Time Conversion functions

Let's define some helper functions that convert a number of seconds, or a fractional day of year, into a human-readable time of day.

In [7]:
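def time_from_seconds(seconds):
  """Format a number of seconds as H:MM:SS, wrapping at 24 hours."""
  seconds = seconds % (24 * 3600)
  hour = seconds // 3600
  seconds %= 3600
  minutes = seconds // 60
  seconds %= 60

  return "%d:%02d:%02d" % (hour, minutes, seconds)

def time_from_doy(doy):
  """Format the fractional part of a day-of-year value as H:MM:SS."""
  part, _ = np.modf(doy)  # fractional part = fraction of the day elapsed
  seconds = part * 86400

  return time_from_seconds(seconds)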

In [8]:
print("3600 seconds: %s" %(time_from_seconds(3600)))
print("0.04167 days: %s" %(time_from_doy(0.04167)))

3600 seconds: 1:00:00
0.04167 days: 1:00:00
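
For comparison, the same formatting can be obtained from the standard library. This is only a sketch: `time_from_seconds_td` is a hypothetical name, not part of the notebook, and it drops fractional seconds before formatting.

```python
from datetime import timedelta

def time_from_seconds_td(seconds):
    # Drop fractional seconds and wrap at 24 hours, then let timedelta format it.
    return str(timedelta(seconds=int(seconds) % 86400))

print(time_from_seconds_td(3600))
print(time_from_seconds_td(0.04167 * 86400))
```
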


# DSCOVR data

The DSCOVR data is provided in the challenge resources. We will use the 2022 data for our training and validation sets and the 2023 data for our test set.
In machine learning we call the data we have *X*, and we teach (train) a model to learn a mapping from *X* to *Y*, where *Y* is what we want the model to output.
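
As a toy illustration of learning a mapping from *X* to *Y*, the sketch below fits a straight line to synthetic data with NumPy least squares. The data, the variable names and the choice of a linear model are assumptions made purely for illustration; this is not the model used for the challenge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs X (100 samples, 1 feature) and targets Y = 2*X + 1 plus noise.
X = rng.uniform(-1.0, 1.0, size=(100, 1))
Y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.01, size=100)

# "Training": find the weights that best map X to Y in the least-squares sense.
A = np.hstack([X, np.ones((100, 1))])  # append a bias column
weights, *_ = np.linalg.lstsq(A, Y, rcond=None)

# "Prediction": apply the learned mapping to a new input (x = 0.5, bias = 1).
x_new = np.array([[0.5, 1.0]])
y_pred = x_new @ weights
print(weights, y_pred)
```
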

Before we move on to *Y* let's have a look at *X* ...

First we need to download the data. It is available from the UH/IfA data transfer node [spaceapps repo](https://dtn-itc.ifa.hawaii.edu/spaceapps/DSCOVR.tgz); the cell below fetches the 2022 and 2023 zip files directly from the NASA GSFC server.

In [9]:
!wget https://opensource.gsfc.nasa.gov/spaceappschallenge/dsc_fc_summed_spectra_2022_v01.zip
!wget https://opensource.gsfc.nasa.gov/spaceappschallenge/dsc_fc_summed_spectra_2023_v01.zip
!mkdir -p ./data/dscovr
!(cd ./data/dscovr;unzip ../../dsc_fc_summed_spectra_2022_v01.zip; unzip ../../dsc_fc_summed_spectra_2023_v01.zip)


--2023-10-08 20:21:39--  https://opensource.gsfc.nasa.gov/spaceappschallenge/dsc_fc_summed_spectra_2022_v01.zip
Resolving opensource.gsfc.nasa.gov (opensource.gsfc.nasa.gov)... 129.164.181.182, 2001:4d0:2310:153::6
Connecting to opensource.gsfc.nasa.gov (opensource.gsfc.nasa.gov)|129.164.181.182|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57868570 (55M) [application/zip]
Saving to: ‘dsc_fc_summed_spectra_2022_v01.zip’


2023-10-08 20:22:06 (2.04 MB/s) - ‘dsc_fc_summed_spectra_2022_v01.zip’ saved [57868570/57868570]

--2023-10-08 20:22:06--  https://opensource.gsfc.nasa.gov/spaceappschallenge/dsc_fc_summed_spectra_2023_v01.zip
Resolving opensource.gsfc.nasa.gov (opensource.gsfc.nasa.gov)... 129.164.181.182, 2001:4d0:2310:153::6
Connecting to opensource.gsfc.nasa.gov (opensource.gsfc.nasa.gov)|129.164.181.182|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18068625 (17M) [application/zip]
Saving to: ‘dsc_fc_summed_spectra_2023_v01.z

In [10]:
!ls ./data/dscovr

dsc_fc_summed_spectra_2022_v01.csv  dsc_fc_summed_spectra_2023_v01.csv


# DSCOVR data
The DSCOVR data we need for training our model is now stored locally in ./data/dscovr.
Let's take a look at the data we just downloaded.

In [None]:
dscovr_df = pd.read_csv("./data/dscovr/dsc_fc_summed_spectra_2022_v01.csv", delimiter = ',', parse_dates=[0], infer_datetime_format=True, na_values='0', header = None)

In [None]:
dscovr_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,2022-01-01 00:00:00,-6.1717,1.12483,-4.90228,38.0314,0.231726,46.0427,44.9743,55.9143,43.7069,...,,,,,,,,,,
1,2022-01-01 00:01:00,-6.28883,1.23313,-4.79001,38.3868,0.231726,45.5257,46.2587,55.1428,43.2768,...,,,,,,,,,,
2,2022-01-01 00:02:00,-6.11811,0.871923,-5.1283,37.5636,0.231726,45.1955,46.8222,55.7484,42.7894,...,,,,,,,,,,
3,2022-01-01 00:03:00,-6.28704,1.24987,-4.7664,38.1094,0.242084,46.7083,47.1713,53.538,42.1558,...,,,,,,,,,,
4,2022-01-01 00:04:00,-6.42125,1.17156,-4.5323,37.5893,0.231726,47.4888,45.3234,54.5404,44.2773,...,,,,,,,,,,


In [None]:
dscovr_df.describe()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,44,45,46,47,48,49,50,51,52,53
count,524450.0,524450.0,524450.0,511816.0,511811.0,511809.0,511809.0,511815.0,511815.0,511814.0,...,20115.0,11191.0,8416.0,3971.0,3859.0,213.0,139.0,44.0,40.0,35.0
mean,0.098939,-0.219225,0.060599,58.218162,9.514369,63.246343,56.563212,74.062121,65.428809,94.261894,...,384.711976,332.966637,388.589832,326.915073,294.150454,390.130451,384.123885,376.326977,403.068593,367.789749
std,3.943064,4.535931,3.667523,38.506281,17.031107,44.696797,45.047853,60.528255,82.509457,114.361904,...,40.774888,48.215582,38.280578,57.801942,85.771695,52.754546,51.58345,69.89348,124.156716,143.007012
min,-16.7123,-26.3765,-23.06,0.231726,0.231726,0.231726,0.231726,0.231726,0.231726,0.231726,...,214.319,188.544,211.826,205.87,108.95,202.247,112.29,187.778,0.231726,89.9962
25%,-3.00014,-3.305092,-1.967988,28.843,0.231726,32.3525,27.916,37.30155,25.6404,36.973425,...,365.23,302.5325,367.20375,274.956,211.853,359.223,352.9465,354.073,366.69175,241.622
50%,0.212469,-0.288369,0.034746,51.9893,0.386813,55.5706,46.9776,61.3553,45.4702,68.83805,...,383.495,326.713,393.8075,323.581,309.98,389.128,395.347,383.1295,427.067,403.663
75%,3.164427,3.017987,2.039108,84.7379,15.8128,90.0339,80.5927,100.403,84.1425,114.472,...,404.069,363.7825,409.08075,377.611,367.0345,416.967,410.484,411.97125,471.866,449.1185
max,19.7253,22.8347,27.9059,415.389,385.676,493.748,746.807,1136.67,1562.55,1804.56,...,662.013,735.132,772.122,521.745,541.136,637.731,568.053,528.47,655.892,646.111


Take a look at the data statistics. The non-null count differs from column to column, so we need to watch for missing data. Column 0 holds the timestamp for each row. The values in columns 1-3 are the magnetic field vector. The remaining columns contain the solar wind values.
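
One way to quantify the missing data is to compute the fraction of NaN values per column. The sketch below uses a tiny made-up frame (`demo_df`, a hypothetical stand-in for `dscovr_df`) with the same assumed layout: column 0 is the timestamp, columns 1-3 are the magnetic field, and the rest are solar wind values.

```python
import numpy as np
import pandas as pd

# Made-up values in the same layout as dscovr_df.
demo_df = pd.DataFrame({
    0: pd.date_range("2022-01-01", periods=4, freq="min"),
    1: [-6.2, -6.3, -6.1, -6.3],
    2: [1.1, 1.2, 0.9, 1.2],
    3: [-4.9, -4.8, -5.1, -4.8],
    4: [38.0, 38.4, np.nan, 38.1],
    5: [np.nan, np.nan, np.nan, 0.24],
})

# Fraction of missing values per column, ignoring the timestamp column.
missing_fraction = demo_df.drop(columns=0).isna().mean()

# Split the measurements by column position.
mag_field = demo_df.iloc[:, 1:4]
solar_wind = demo_df.iloc[:, 4:]

print(missing_fraction)
```

The same two lines should apply unchanged to `dscovr_df`, where the sparsely populated high-numbered columns would show up with a missing fraction close to 1.
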

# Daily Space Weather Kp indices

Next let's download the rest of our training data. This will be the *Y* that we want to predict: the planetary K-index (Kp), a value on a scale from 0 to 9 that describes the level of geomagnetic activity and relates to the likelihood of a geomagnetic storm.

We have staged the data for 2022 and 2023 that we obtained from the [NOAA web site](ftp://ftp.ngdc.noaa.gov/STP/swpc_products/daily_reports/space_weather_indices) on our UH/IfA data transfer node [spaceapps data repo](http://dtn-itc.ifa.hawaii.edu/spaceapps/SWID.tgz).

In [11]:
!wget http://dtn-itc.ifa.hawaii.edu/spaceapps/SWID.tgz
!tar xzvf SWID.tgz


--2023-10-08 20:22:28--  http://dtn-itc.ifa.hawaii.edu/spaceapps/SWID.tgz
Resolving dtn-itc.ifa.hawaii.edu (dtn-itc.ifa.hawaii.edu)... 128.171.123.136
Connecting to dtn-itc.ifa.hawaii.edu (dtn-itc.ifa.hawaii.edu)|128.171.123.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127530 (125K) [application/x-gzip]
Saving to: ‘SWID.tgz’


2023-10-08 20:22:30 (104 KB/s) - ‘SWID.tgz’ saved [127530/127530]

./data/swpc/
./data/swpc/2022/
./data/swpc/2022/04/
./data/swpc/2022/04/20220401dayind.txt
./data/swpc/2022/04/20220402dayind.txt
./data/swpc/2022/04/20220403dayind.txt
./data/swpc/2022/04/20220404dayind.txt
./data/swpc/2022/04/20220405dayind.txt
./data/swpc/2022/04/20220406dayind.txt
./data/swpc/2022/04/20220407dayind.txt
./data/swpc/2022/04/20220408dayind.txt
./data/swpc/2022/04/20220409dayind.txt
./data/swpc/2022/04/20220410dayind.txt
./data/swpc/2022/04/20220411dayind.txt
./data/swpc/2022/04/20220412dayind.txt
./data/swpc/2022/04/20220413dayind.txt
./data/swpc/2

In [12]:
!tail -9 ./data/swpc/2023/10/20231007dayind.txt

# ----- Fredericksburg -----           --------- Boulder ---------
# A        K-indices                   A        K-indices
#     03-06-09-12-15-18-21-24              03-06-09-12-15-18-21-24
 -1    1  2 -1 -1 -1 -1 -1 -1          4    1  2  1  0  1  1  1  2 
#        High Latitude                                  Estimated
# --------- College ---------      -------------------- Planetary --------------------
# A       K-indices                A                    K-indices
#     03-06-09-12-15-18-21-24            03  - 06  - 09  - 12  - 15  - 18  - 21  - 24
  2    1  1  1  0  0  0  1  0      5    1.33  2.00  1.33  0.67  0.33  0.33  1.67  1.67 


We will need to parse each of these files, taking the last 8 floating point values on the last line. These are the estimated planetary K-indices for the day, one per 3-hour interval, and they are what we want to predict.

In [None]:
!tail -1 ./data/swpc/2023/10/20231007dayind.txt | awk '{print $11, $12, $13, $14, $15, $16, $17, $18}'

1.33 2.00 1.33 0.67 0.33 0.33 1.67 1.67
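
The same extraction can be sketched in Python. `parse_planetary_kp` is a hypothetical helper, not part of the provided script; it assumes, as the shell command does, that the Kp values are the last 8 whitespace-separated fields on the last non-empty line.

```python
def parse_planetary_kp(text):
    """Return the last 8 values on the last non-empty line as floats."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [float(value) for value in lines[-1].split()[-8:]]

# Last two lines of 20231007dayind.txt, copied from the output above.
sample = """#     03-06-09-12-15-18-21-24            03  - 06  - 09  - 12  - 15  - 18  - 21  - 24
  2    1  1  1  0  0  0  1  0      5    1.33  2.00  1.33  0.67  0.33  0.33  1.67  1.67
"""

kp_values = parse_planetary_kp(sample)
print(kp_values)
```

To use it on a real file, pass the file's contents, for example `parse_planetary_kp(open(path).read())`.
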


We provide a shell script that does this for every daily file and saves the output as one CSV file per year: one file for 2022 and one file for 2023.

In [None]:
!wget http://dtn-itc.ifa.hawaii.edu/spaceapps/parse_kp_indices.sh
!bash ./parse_kp_indices.sh

--2023-10-08 20:11:28--  http://dtn-itc.ifa.hawaii.edu/spaceapps/parse_kp_indices.sh
Resolving dtn-itc.ifa.hawaii.edu (dtn-itc.ifa.hawaii.edu)... 128.171.123.136
Connecting to dtn-itc.ifa.hawaii.edu (dtn-itc.ifa.hawaii.edu)|128.171.123.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 462 [text/x-sh]
Saving to: ‘parse_kp_indices.sh.1’


2023-10-08 20:11:28 (56.9 MB/s) - ‘parse_kp_indices.sh.1’ saved [462/462]



In [None]:
kp_df = pd.read_csv("./data/swpc/kpindices-2022.csv", delimiter = ',', parse_dates=[0], infer_datetime_format=True, na_values='0', header = None)

In [None]:
kp_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,date,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0
1,20221122,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
2,20221129,4.0,3.0,4.0,4.0,4.0,5.0,4.0,3.0
3,20221107,1.0,1.0,2.0,3.0,4.0,5.0,4.0,3.0
4,20221105,3.0,3.0,2.0,3.0,2.0,2.0,1.0,1.0


In [None]:
kp_df.describe()

Unnamed: 0,1,2,3,4,5,6,7,8
count,348.0,336.0,343.0,355.0,352.0,342.0,341.0,348.0
mean,2.468391,2.389881,2.215743,2.219718,2.196023,2.25731,2.255132,2.390805
std,1.172023,1.201896,1.12914,1.140995,1.28931,1.41769,1.532782,1.65187
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
50%,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
75%,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
max,6.0,7.0,9.0,12.0,15.0,18.0,21.0,24.0


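As you can see, the counts don't agree for every column, so we have some data cleanup yet to do. Three things stand out in the output above. The file's header row was read as data (row 0, because of `header = None`), which inflates the `max` values and leaves the date column unparsed. The `min` of -1 appears to be the missing-value marker seen in the raw files. Finally, `na_values='0'` turns every Kp value of 0 into NaN, which is likely why the counts differ.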
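
A possible cleanup is sketched below. It assumes, based on the `head()` output, that the CSV starts with a header row (`date,3.0,...`) and that `-1` marks a missing value. The sample rows are made up and `clean_kp` is a hypothetical helper, so treat this as a starting point rather than the notebook's method.

```python
import io
import numpy as np
import pandas as pd

# Made-up sample in the assumed layout of kpindices-2022.csv.
sample_csv = """date,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0
20221122,1,1,1,1,1,1,1,0
20221129,4,3,4,4,4,5,4,3
20221107,-1,1,2,3,4,5,4,3
"""

def clean_kp(source):
    # header=0 keeps the header row out of the data; 0 stays a valid Kp value.
    df = pd.read_csv(source, header=0, dtype={"date": str})
    df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
    df = df.set_index("date").sort_index()
    # Assumption: -1 is a missing-value sentinel, so turn it into NaN.
    return df.replace(-1, np.nan)

kp_clean = clean_kp(io.StringIO(sample_csv))
print(kp_clean)
```

Sorting by date also puts the rows in chronological order, which the `head()` output suggests is not the case in the raw CSV.
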

However, we now have the training data in place. We are going to train a model to predict the daily Kp indices from the DSCOVR data of the previous few days. Take a minute to think about which model architectures might be useful for this, then open the dscovrmatic-dataprep notebook to continue.