# Predicting the DST current index from other variables

## Downloading data

We will first download low-resolution data from [OMNI](https://omniweb.gsfc.nasa.gov/), and split it into a training and a test set.


In [1]:
from datetime import datetime
from aidapy import load_data
from sklearn.model_selection import train_test_split

# Set the start and end date as year, month, day
t0 = datetime(2000, 1, 1)
t1 = datetime(2015, 12, 31)

# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)

# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()

# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)

pd_data.describe()

Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2000.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2000.dat
100.0% 2883584 / 2881152

Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2001.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2001.dat
100.0% 2875392 / 2873280

Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2002.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2002.dat
100.0% 2875392 / 2873280

Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2003.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2003.dat
100.0% 2875392 / 2873280

Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2004.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2004.dat
100.0% 2883584 / 2881152


| products | Bartels Rotation Number | ID IMF Spacecraft | ID SW Plasma Spacecraft | points(IMF Average) | points(Plasma Average) | \|B\| | Magnitude of Avg Field Vector | Lat. Angle of Aver. Field Vector | Long. Angle of Aver. Field Vector | Bx GSE, GSM | ... | Proton Flux > 10MeV | Proton Flux > 30MeV | Proton Flux > 60MeV | flag | ap index | f10.7 index | PC(N) index | AL index (Kyoto) | AU index (Kyoto) | Magnetosonic Mach No. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 140233.0 | 140233.0 | 140105.0 | 140233.0 | 140105.0 | 140233.0 | 140233.0 | 140233.0 | 140233.0 | 140233.0 | ... | 113159.0 | 113134.0 | 113120.0 | 140233.0 | 140233.0 | 140113.0 | 140035.0 | 140233.0 | 140233.0 | 135984.0 |
| mean | 2379.926836 | 55.852353 | 56.569801 | 48.675333 | 30.04967 | 5.827601 | 5.174818 | 0.130479 | 201.635718 | -0.004155 | ... | 7.994024 | 2.036895 | 0.759653 | -0.807028 | 10.408599 | 116.238255 | 0.960367 | -110.092596 | 68.02637 | 5.641059 |
| std | 62.47306 | 8.573346 | 8.16648 | 19.356001 | 9.397953 | 3.219993 | 3.069832 | 29.213007 | 100.877033 | 3.473102 | ... | 152.075039 | 48.944587 | 18.04282 | 0.394632 | 16.728629 | 43.382109 | 1.345067 | 143.618415 | 70.739359 | 1.158862 |
| min | 2272.0 | 51.0 | 45.0 | 1.0 | 1.0 | 0.4 | 0.1 | -89.2 | 0.0 | -40.8 | ... | 0.01 | 0.01 | 0.01 | -1.0 | 0.0 | 65.1 | -21.5 | -2452.0 | -260.0 | 0.6 |
| 25% | 2326.0 | 51.0 | 52.0 | 51.0 | 17.0 | 3.8 | 3.2 | -18.5 | 122.3 | -2.5 | ... | 0.15 | 0.08 | 0.06 | -1.0 | 3.0 | 78.5 | 0.1 | -148.0 | 20.0 | 4.9 |
| 50% | 2380.0 | 51.0 | 52.0 | 59.0 | 35.0 | 5.1 | 4.5 | -0.1 | 181.1 | 0.0 | ... | 0.2 | 0.11 | 0.08 | -1.0 | 6.0 | 107.1 | 0.6 | -49.0 | 42.0 | 5.7 |
| 75% | 2434.0 | 51.0 | 52.0 | 60.0 | 37.0 | 7.0 | 6.2 | 18.6 | 302.0 | 2.4 | ... | 0.25 | 0.15 | 0.11 | -1.0 | 12.0 | 141.2 | 1.5 | -20.0 | 91.0 | 6.4 |
| max | 2488.0 | 71.0 | 71.0 | 94.0 | 74.0 | 62.0 | 60.7 | 89.8 | 360.0 | 34.8 | ... | 9650.0 | 3220.0 | 1220.0 | 0.0 | 400.0 | 325.1 | 28.0 | 22.0 | 1226.0 | 12.7 |
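
Because `shuffle=False` is passed to `train_test_split`, the rows keep their chronological order, and with scikit-learn's default `test_size` of 0.25 the last quarter of the period becomes the test set. The following is a minimal sketch on a small synthetic hourly series; the `demo` frame and its `value` column are made up for illustration and are not part of the OMNI data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic hourly series standing in for the OMNI data (illustration only)
index = pd.date_range("2000-01-01", periods=100, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"value": np.arange(100)}, index=index)

# Without shuffling, the split is a simple cut in time
train, test = train_test_split(demo, shuffle=False)

print(len(train), len(test))                  # 75 25
print(train.index.max() < test.index.min())   # True
```

Keeping the test period strictly after the training period avoids evaluating the model on hours that are interleaved with the hours it was trained on.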


## Let's have a look at the data

The plot below shows the DST index for the training and test periods. Note that quiet periods alternate with periods of more frequent magnetic storms, which show up as large negative values.

In [2]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()

plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();

## The correlation between the DST Index and other variables

Let's have a look at the linear correlation between the DST index and other variables. Correlation coefficients range from -1 (perfect anti-correlation) to 1 (perfect correlation), and a value near zero indicates that there is no linear relationship between the variables. (The variables may still depend on each other in a non-linear way.)

In [3]:
# Generate a correlation matrix
corr_matrix = dtrain.corr()

# Get the correlation with the DST index
dst_corr = corr_matrix['DST Index']

# Display correlations sorted by absolute value
ix = dst_corr.abs().sort_values(ascending=False).index
sorted_corr = dst_corr.reindex(ix)

# Print the variables with the strongest correlations
print(sorted_corr[:10])

products
DST Index                        1.000000
ap index                        -0.629716
Kp                              -0.573182
AL index (Kyoto)                 0.548138
AE Index                        -0.542807
PC(N) index                     -0.514771
Plasma Flow Speed               -0.456372
AU index (Kyoto)                -0.420654
|B|                             -0.396887
Magnitude of Avg Field Vector   -0.386568
Name: DST Index, dtype: float64
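
The caveat about non-linear relationships can be illustrated with a small synthetic example (the arrays below are made up and unrelated to the OMNI data): a variable that is fully determined by another can still have a linear correlation of essentially zero.

```python
import numpy as np

# Symmetric range around zero
x = np.linspace(-1.0, 1.0, 201)

y_linear = -2.0 * x + 1.0   # perfectly anti-correlated with x
y_quadratic = x ** 2        # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear relation:    r = {r_linear:.3f}")
print(f"quadratic relation: |r| = {abs(r_quadratic):.3f}")
```

The quadratic case gives a coefficient of essentially zero even though `y_quadratic` is a deterministic function of `x`, which is why a low value in the list above does not rule a variable out as a useful input for a non-linear model.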


## Selecting features

We will now select the *features* from which we will predict the DST Index (our target). The DST index correlates with a number of other geomagnetic indices, but to make this example somewhat challenging we will exclude them!

In [4]:
# Get names of the variables
all_names = list(sorted_corr.index)

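# Remove index variables: any name containing 'index' or 'Index'.
# This also drops the target 'DST Index' from the candidate list.
my_names = [name for name in all_names if 'ndex' not in name]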

# Remove other variables
my_names.remove('Kp')

# Select variables with the strongest (absolute) correlations as features
n_features = 5
features = my_names[:n_features]
targets = ['DST Index']
all_vars = features + targets
print("The features are: ", features)
print("The targets are: ", targets)

# Remove rows with missing values. We do this after selecting features,
# otherwise we remove too many rows.
dtrain_dropna = dtrain[all_vars].dropna()
dtest_dropna = dtest[all_vars].dropna()

# Get the features and targets
X_train = dtrain_dropna[features].values
y_train = dtrain_dropna[['DST Index']].values

X_test = dtest_dropna[features].values
y_test = dtest_dropna[['DST Index']].values

t_test = dtest_dropna.index
t_train = dtrain_dropna.index

The features are:  ['Plasma Flow Speed', '|B|', 'Magnitude of Avg Field Vector', 'Na/Np', 'Bz GSM']
The targets are:  ['DST Index']
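
The comment in the cell above about dropping missing values only after selecting the features can be illustrated with a small, made-up frame (the column names are hypothetical): a column that is never used but is often missing would otherwise remove most rows.

```python
import numpy as np
import pandas as pd

# Small made-up frame; "unused" is mostly missing but is not a feature
demo = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [0.5, np.nan, 1.5, 2.0],
    "unused":    [np.nan, np.nan, 7.0, np.nan],
    "target":    [10.0, 20.0, 30.0, 40.0],
})
selected = ["feature_a", "feature_b", "target"]

rows_all = len(demo.dropna())                 # drop rows with a NaN in any column
rows_selected = len(demo[selected].dropna())  # drop rows with a NaN in selected columns

print(rows_all, rows_selected)  # 1 3
```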


## Regression models

In this example, we consider a linear regression model and an artificial neural network (ANN) regression model. These models are used in a pipeline, which also performs scaling of the input data.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch

# A list of models to use
models = []

# Append a dictionary with the model name and pipeline
models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network, 
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})
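
As a rough illustration of why the scaler is placed inside the pipeline, the sketch below fits the same kind of pipeline on synthetic data (all `*_demo_*` arrays are made up). The scaler's statistics are learned from the training data during `fit` and are then reused unchanged when predicting on new data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a large offset and scale (illustration only)
rng = np.random.default_rng(0)
X_demo_train = rng.normal(loc=400.0, scale=100.0, size=(200, 2))
X_demo_test = rng.normal(loc=400.0, scale=100.0, size=(50, 2))
y_demo_train = X_demo_train @ np.array([0.1, -0.2]) + 5.0

pipe = Pipeline(steps=[('preprocess', StandardScaler()),
                       ('model', LinearRegression())])
pipe.fit(X_demo_train, y_demo_train)

# The scaler stores the mean of the training features
scaler = pipe.named_steps['preprocess']
print(np.allclose(scaler.mean_, X_demo_train.mean(axis=0)))  # True

# predict() applies the same scaling before calling the model
y_demo_pred = pipe.predict(X_demo_test)
print(y_demo_pred.shape)  # (50,)
```

Scaling matters little for linear regression itself, but gradient-based training of a neural network usually behaves better when all inputs have comparable magnitudes.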

## Training the models

In [6]:
from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)
    
for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))

  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1      265.4195      510.7300  4.9675
      2      199.7389      514.0299  6.8559
      3      196.6792      513.1639  5.4448
      4      194.9946      511.3364  8.0728
      5      193.8425      508.8832  3.3963
      6      192.9440      506.7368  6.7031
      7      192.1152      507.0628  4.3043
      8      191.5201      504.7900  4.5337
      9      190.9533      504.0500  5.6420
     10      190.5970      503.2600  3.0030
     11      190.2063      502.6676  3.0651
     12      189.9117      501.8307  7.2454
     13      189.5939      501.4661  7.0313
     14      189.3153      501.0051  4.6045
     15      189.0875      500.4397  8.1879
     16
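
The `r2_score` metric used in the cell above is the coefficient of determination, R² = 1 - SS_res / SS_tot. A value of 1 means perfect predictions, 0 corresponds to always predicting the mean of the data, and negative values are possible for models that do worse than that. A minimal sketch with made-up numbers (`r2_manual` is a hypothetical helper written for this illustration, not a library function):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true_demo = [1.0, 2.0, 3.0, 4.0]

good = r2_manual(y_true_demo, [1.1, 1.9, 3.2, 3.8])    # close to the data
baseline = r2_manual(y_true_demo, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
worse = r2_manual(y_true_demo, [4.0, 3.0, 2.0, 1.0])   # reversed trend

print(f"{good:.2f} {baseline:.2f} {worse:.2f}")  # 0.98 0.00 -3.00
```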

## Visualizing the model predictions

In [7]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)


## Working with a history to improve predictions

So far, we have used only the 'instantaneous' values of the other variables to predict the DST Index. We can improve the predictions by also feeding the models a history of past values, which we construct with the AIDApy preprocessing module.

In [8]:
from aidapy.ml import preprocess

histsize = 24                   # Number of past hours
forecast_time = 0               # Hours into the future

Xtr = dtrain[features].values
ytr = dtrain[targets].values
Xte = dtest[features].values
yte = dtest[targets].values

# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
    Xtr, ytr, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
    Xte, yte, histsize, forecast_time)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

models_v2 = []

models_v2.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=20,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models_v2.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})
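
The idea of feeding a history to the model can be sketched in plain NumPy. This is not the implementation of `preprocess.time_series`, whose internals are not shown here; `make_history_windows` is a hypothetical helper, and its output layout (flattened windows, a boolean mask over the target times, windows containing NaN skipped) is an assumption made for illustration.

```python
import numpy as np

def make_history_windows(X, y, histsize, forecast_time=0):
    """Hypothetical helper: stack the last `histsize` rows of X into one sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    n_samples, n_features = X.shape
    samples, targets = [], []
    mask = np.zeros(n_samples, dtype=bool)  # True where a target was kept
    for t in range(histsize - 1, n_samples - forecast_time):
        window = X[t - histsize + 1:t + 1]  # rows t-histsize+1 .. t
        target = y[t + forecast_time]
        # Skip samples with missing values in the window or the target
        if np.isnan(window).any() or np.isnan(target).any():
            continue
        samples.append(window.ravel())      # flatten to one feature vector
        targets.append(target)
        mask[t + forecast_time] = True
    X_out = np.array(samples).reshape(-1, histsize * n_features)
    y_out = np.array(targets).reshape(-1, y.shape[1])
    return X_out, y_out, mask

# Six time steps, two features, one missing value at time step 2
X_demo = np.arange(12, dtype=float).reshape(6, 2)
y_demo = np.arange(6, dtype=float)
X_demo[2, 0] = np.nan

X_hist, y_hist, mask_demo = make_history_windows(X_demo, y_demo, histsize=2)
print(X_hist.shape)    # (3, 4)
print(y_hist.ravel())  # [1. 4. 5.]
```

With `histsize = 24` and five features, a layout like this would give each sample 120 inputs, which is why the first layer size is taken from `X_train.shape[1]` rather than hard-coded.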

## Training and evaluating the new models

In [9]:
for model in models_v2:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)
    
for model in models_v2:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))

  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1      211.1822      320.4362  3.7218
      2      120.9668      272.9826  4.0578
      3      116.8801      265.6732  5.9941
      4      114.9084      249.5811  3.8256
      5      112.3243      244.7855  3.6040
      6      110.7551      240.7139  3.4175
      7      109.4082      237.4817  3.4066
      8      108.2381      234.2301  3.3303
      9      107.1030      231.2991  2.4646
     10      105.9993      228.7583  5.1124
     11      105.0800      226.6691  3.4773
     12      104.1000      224.7690  3.3862
     13      103.2819      222.9194  3.0130
     14      102.5243      221.2111  5.1869
     15      101.7451      219

## Looking at the results

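Using a history of the input variables appears to improve the performance of both models significantly. For the neural network, for example, the validation loss after 15 epochs falls from about 500 to about 220.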

In [10]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models_v2:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)


## Possible extensions

Above, features were selected based on their linear correlation with the DST index. Such features work well for a linear model, but a neural network could potentially also exploit features that have a more complex, non-linear relationship with the DST index. We could simply use all variables as input to the neural network, but we would then have to make sure that the inputs contain no missing values (NaNs).
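
One possible way to prepare all variables, sketched here on a small made-up frame, is to first discard columns that are missing too often and only then drop the remaining incomplete rows. The column names and the `max_nan_fraction` threshold are hypothetical choices for illustration, and imputing the missing values would be an alternative.

```python
import numpy as np
import pandas as pd

# Made-up frame: one column is missing most of the time
demo = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "feature_a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "feature_b": [0.1, 0.2, 0.3, 0.4, 0.5],
})

max_nan_fraction = 0.5  # hypothetical threshold

# Fraction of missing values per column
nan_fraction = demo.isna().mean()

# Keep only columns that are mostly complete, then drop incomplete rows
kept_columns = nan_fraction[nan_fraction <= max_nan_fraction].index
clean = demo[kept_columns].dropna()

print(list(kept_columns), len(clean))  # ['feature_a', 'feature_b'] 4
```

Dropping rows directly on the full frame would leave a single row in this example, so the order of the two steps matters.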