# DST forecasting from OMNI data

This example shows how to use AIDApy to perform the following tasks:

* Download time series data from OMNI
* Preprocess the data so that it can be used for machine learning
* Train several models and evaluate their performance


## Downloading data

We will first download low-resolution data from [OMNI](https://omniweb.gsfc.nasa.gov/).


In [1]:
from datetime import datetime
from aidapy import load_data

# Set the start and end date as year, month, day
t0 = datetime(2005, 1, 1)
t1 = datetime(2015, 12, 31)

# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)

# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()

pd_data.describe()

products,Bartels Rotation Number,ID IMF Spacecraft,ID SW Plasma Spacecraft,points(IMF Average),points(Plasma Average),|B|,Magnitude of Avg Field Vector,Lat. Angle of Aver. Field Vector,Long. Angle of Aver. Field Vector,"Bx GSE, GSM",...,Proton Flux > 10MeV,Proton Flux > 30MeV,Proton Flux > 60MeV,flag,ap index,f10.7 index,PC(N) index,AL index (Kyoto),AU index (Kyoto),Magnetosonic Mach No.
count,96385.0,96385.0,96354.0,96385.0,96354.0,96385.0,96385.0,96385.0,96385.0,96385.0,...,91174.0,91171.0,91169.0,96385.0,96385.0,96265.0,96212.0,96385.0,96385.0,95501.0
mean,2413.760481,51.46916,52.41442,58.407595,34.32062,5.216666,4.620777,0.442976,200.817234,-0.003756,...,4.467667,0.895241,0.283256,-0.945936,8.207802,98.542824,0.876475,-93.396431,56.163034,5.769463
std,42.939382,3.027076,2.95985,7.630976,5.37908,2.785629,2.659993,29.518247,101.021925,3.079144,...,74.289989,15.688849,5.969648,0.226146,12.072234,29.679255,1.171071,128.300576,61.312826,1.119411
min,2339.0,51.0,51.0,1.0,1.0,0.4,0.1,-89.2,0.0,-40.8,...,0.05,0.04,0.03,-1.0,0.0,65.1,-6.9,-2452.0,-225.0,0.6
25%,2377.0,51.0,52.0,58.0,34.0,3.4,2.9,-18.6,121.1,-2.2,...,0.14,0.08,0.06,-1.0,3.0,73.5,0.1,-120.0,17.0,5.1
50%,2414.0,51.0,52.0,60.0,36.0,4.6,4.0,0.1,182.6,0.0,...,0.18,0.1,0.07,-1.0,5.0,88.4,0.6,-38.0,34.0,5.8
75%,2451.0,51.0,52.0,61.0,37.0,6.2,5.6,19.2,300.6,2.2,...,0.22,0.12,0.09,-1.0,9.0,117.8,1.3,-17.0,72.0,6.5
max,2488.0,71.0,71.0,91.0,52.0,55.4,54.5,89.8,360.0,26.4,...,4559.95,1210.0,956.0,0.0,300.0,255.0,18.4,15.0,873.0,10.9


## Splitting the data into different sets

To evaluate the performance of a machine learning approach, the input data has to be split into different sets:

* A training set, on which the model is trained
* Optionally, a validation set, to tune so-called hyperparameters of the model
* A test set, which is only used at the end to evaluate the performance of the model

Such a separation is important to detect and prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting). Since consecutive observations in a time series are usually correlated, it is important not to shuffle the data before splitting it into different sets. Otherwise, the test set would not be truly independent of the training set.


In [2]:
from sklearn.model_selection import train_test_split

# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)
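
The cell above creates only a training and a test set; with `shuffle=False` and no explicit `test_size`, `train_test_split` holds out the last 25% of the samples. The optional validation set mentioned earlier is not used in this example. A minimal sketch of a three-way chronological split is given below, assuming plain integer positions; the helper `chronological_split` and its default fractions are hypothetical and not part of AIDApy or scikit-learn.

```python
import numpy as np


def chronological_split(n_samples, val_fraction=0.15, test_fraction=0.25):
    """Return (train, val, test) index arrays that preserve time order.

    Hypothetical helper for illustration; the fractions are arbitrary.
    """
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    n_train = n_samples - n_val - n_test
    idx = np.arange(n_samples)
    # Oldest samples for training, most recent samples for testing
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = chronological_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting arrays could be used with `pd_data.iloc[...]`, so that every validation sample lies after the training period and every test sample after the validation period.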

## Let's have a look at the data

The plot below shows the DST index. Quiet periods alternate with periods of stronger magnetic storm activity, which appear as large negative values.

In [3]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()

plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();


## Selecting features

Here, the features are selected by hand; they are not necessarily the best ones.

In [4]:
features = ['|B|', 
            'Magnitude of Avg Field Vector',
            'Proton Density',
            'Proton Temperature',
            'Alfven Mach Number',
            'Bz GSM',
            'Na/Np',
            'Plasma Flow Speed',
            'Plasma Beta',
            'Electric Field',
            'DST Index']
targets = ['DST Index']

## Preprocessing data

The training and test data are preprocessed so that we can make predictions for the targets based on a history of the features. If the original features and targets are $X_i$ and $y_i$, then we define new features $X_i' = [X_i, X_{i-1}, \dots, X_{i-(n-1)}]$, where $n$ is the history size. We use these features to make predictions at some future time, so that the new targets are $y_i' = y_{i+k}$, where $k$ is the forecast time.
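
To make the windowing concrete, here is a minimal NumPy sketch of the idea. It is not the AIDApy implementation: the helper `make_windows` is hypothetical, it assumes that each window holds exactly `histsize` rows (the current one plus `histsize - 1` past ones), and it ignores missing values.

```python
import numpy as np


def make_windows(X, y, histsize, forecast_time):
    """Stack `histsize` consecutive feature rows and pair them with a future target.

    Hypothetical helper for illustration; not the AIDApy implementation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    first = histsize - 1               # first index with a full history
    last = n_samples - forecast_time   # one past the last index with a known future target
    rows = []
    for i in range(first, last):
        # Newest row first: X_i, X_{i-1}, ..., X_{i-histsize+1}
        window = X[i - histsize + 1:i + 1][::-1]
        rows.append(window.ravel())
    X_new = np.array(rows)
    # Target for window i is y_{i + forecast_time}
    y_new = y[first + forecast_time:last + forecast_time]
    return X_new, y_new


X_demo = np.arange(10).reshape(5, 2)   # 5 time steps, 2 features
y_demo = np.array([10, 11, 12, 13, 14])
X_win, y_win = make_windows(X_demo, y_demo, histsize=3, forecast_time=1)
print(X_win.shape, y_win)
```

With 5 time steps, a history of 3 and a forecast time of 1, only two samples remain: the first two steps lack a full history, and the last step has no known future target.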

Another question is what to do with missing values. This is a common issue when working with spacecraft data, in particular if data from multiple instruments are combined. We have opted for a simple approach: remove all samples for which some data is missing, either in the features or in the targets.
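
The "drop incomplete samples" strategy can be sketched as a boolean mask over the rows. The helper `drop_incomplete` below is hypothetical and assumes that missing values are encoded as NaN; how AIDApy builds its mask internally is not shown here.

```python
import numpy as np


def drop_incomplete(X, y):
    """Keep only samples whose features and targets are all finite.

    Hypothetical helper for illustration; assumes missing data are NaN.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    # A sample is kept only if no feature and no target is NaN or infinite
    mask = np.isfinite(X).all(axis=1) & np.isfinite(y).all(axis=1)
    return X[mask], y[mask], mask


X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y_demo = np.array([[0.1], [0.2], [np.nan], [0.4]])
X_clean, y_clean, mask = drop_incomplete(X_demo, y_demo)
print(mask)
```

Returning the mask alongside the cleaned arrays makes it possible to select the matching timestamps afterwards, in the same way the notebook indexes `dtrain.index` with `mask_train`.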

In [5]:
from aidapy.ml import preprocess

histsize = 8                    # Number of past hours
forecast_time = 1               # Hours into the future

# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
    dtrain[features].values, dtrain[targets].values, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
    dtest[features].values, dtest[targets].values, histsize, forecast_time)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

## Regression models

For the first use case, the following regression models are used:
* A fully-connected neural network implemented in PyTorch. Users can customize the number of layers, the number of neurons per layer, and the activation functions of the layers.
* A linear regression model imported from scikit-learn

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch

models = []

models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network, 
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})

In [7]:
from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)
    
for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))

  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1       60.7240       32.4020  2.9466
      2       17.2051       27.2859  1.8680
      3       13.9108       27.3186  4.9162
      4       11.7659       25.7481  3.9948
      5       11.0440       29.0479  2.6648
      6       10.3904       26.3400  2.7699
      7       10.1087       27.2759  1.9154
      8        9.7082       24.8305  2.6972
      9        9.3186       24.3196  2.6852
     10        9.0785       23.8498  3.3641
     11        8.9514       23.7893  2.6362
     12        8.8422       23.3065  4.0155
     13        8.6888       22.9633  2.9111
     14        8.4773       21.6068  2.7943
     15        8.4333       22.4387  2.9364
     16        8.2843

Let's compare the predictions with the actual data.

In [8]:
plt.figure()
plt.title('Predicting the DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)


## This looks pretty good, but...

Most variables we consider in a time series are strongly autocorrelated: future values closely resemble recent past values. Examples are the temperature at a location on Earth or the price of a stock. When making predictions for such a variable, one has to be careful in assessing the predictive power of a model. Simply predicting the most recent state as the future state can look convincing, both visually and in most error metrics, but it provides little information.

Below, we therefore take a different approach: we predict the change in the variable rather than the variable itself.
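
The effect of autocorrelation on the score can be illustrated with a persistence forecast, that is, predicting $y_{i+k}$ as $y_i$. The sketch below uses a synthetic AR(1) series (not OMNI data) and a hand-written $R^2$, assuming that the "change" target is $y_{i+k} - y_i$; the parameter values are arbitrary.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)

# Synthetic AR(1) series with strong autocorrelation (illustrative only)
n, phi = 5000, 0.98
noise = rng.normal(size=n)
series = np.zeros(n)
for i in range(1, n):
    series[i] = phi * series[i - 1] + noise[i]

k = 1  # forecast time in steps
current, future = series[:-k], series[k:]

# Persistence forecast of the value: predict y_{i+k} as y_i
r2_value = r2(future, current)

# Persistence forecast of the change: always predict zero change
change = future - current
r2_change = r2(change, np.zeros_like(change))

print(f"R2 persistence, value target:  {r2_value:.3f}")
print(f"R2 persistence, change target: {r2_change:.3f}")
```

On the value target, the persistence forecast scores close to 1 although it contains no information beyond the last observation. On the change target, the same forecast scores at most 0, so any positive score there reflects skill beyond persistence.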

In [9]:
from aidapy.ml import preprocess

histsize = 24                   # Number of past hours
forecast_time = 1               # Hours into the future

# Use the AIDApy preprocessing method for time series
# With the predict_change=True flag, the change in the DST index 
# is the target variable
X_train, y_train, mask_train = preprocess.time_series(
    dtrain[features].values, dtrain[targets].values,
    histsize, forecast_time, predict_change=True)
X_test, y_test, mask_test = preprocess.time_series(
    dtest[features].values, dtest[targets].values, 
    histsize, forecast_time, predict_change=True)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

models = []

models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network, 
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'ANN',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})

In [10]:
from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)
    
for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))

  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1        9.3040       13.4525  2.5473
      2        8.0748       13.2288  2.4043
      3        7.6921       12.3340  2.4899
      4        7.5276       12.2525  3.9910
      5        7.3974       12.1197  3.8056
      6        7.2771       11.9705  3.0214
      7        7.1804       11.9094  2.0155
      8        7.0822       11.9217  2.8191
      9        7.0014       11.8192  2.6564
     10        6.8758       11.8942  2.6622
     11        6.7928       11.8929  1.9705
     12        6.6663       11.8819  1.9809
     13        6.6267       12.3010  2.6679
     14        6.5041       12.2031  1.8921
     15        6.4059       12.2843  2.4354
     16        6.3420       12.9039

In [11]:
plt.figure()
plt.title('Predicting the change in DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
