## Predicting El Niño - LSTM Model

After analyzing our dataset, we now build a regression model to predict the Oceanic Niño Index (ONI). We follow three steps:

1. Read and clean the data
2. Train the model using a Long Short-Term Memory (LSTM) neural network
3. Predict and evaluate the model

### 1. Read and clean the data

First, we import the libraries required to process the dataset.

In [2]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.models import load_model
from math import sqrt

warnings.filterwarnings('ignore')

df = pd.read_csv('Data.csv')

# Parse the date variable (year followed by month) into a datetime.
# pd.datetime is no longer available in recent pandas, so use the standard library.
from datetime import datetime

def date_parser(x):
    if str(x).endswith(('10', '11', '12')):
        return datetime.strptime(str(x), '%Y%m')
    else:
        return datetime.strptime(str(x), '%Y0%m')

df['date'] = df['date'].apply(date_parser)

# And now convert date to index
df = df.set_index('date')

# Keep the data from 1979 onward (row 336) and standardize each column
start = 336
df = df.iloc[start:]
df = (df - df.mean())/df.std()
df.head()
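
The last lines of the cell above apply a z-score standardization: each column is shifted by its mean and divided by its standard deviation. As a minimal sketch on a made-up frame (the names `toy`, `mean`, `std`, `scaled`, and `restored` are illustrative and not part of the notebook), the following shows the same formula and how it can be inverted, which may be useful for reporting predictions in the original units.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the dataset (values are made up).
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Keep the statistics so the transform can be undone later.
mean, std = toy.mean(), toy.std()  # pandas computes the sample std (ddof=1)

scaled = (toy - mean) / std        # same formula as in the cell above
restored = scaled * std + mean     # inverse transform

# Each scaled column has mean ~0 and sample standard deviation ~1.
print(scaled.mean().round(6).tolist())
print(scaled.std().round(6).tolist())
```

Storing `mean` and `std` before overwriting the frame is the design choice here: once `df` has been replaced by its standardized version, the original statistics can no longer be recovered from it.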

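We are going to use a supervised learning algorithm to predict ONI. A time series has no explicit input and output columns, so we first reframe it with a technique called a sliding (or lag) window: the observations from the previous months become the inputs, and the values for the following months become the targets. The reframed data is then used to train an LSTM neural network.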
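
To make the sliding window concrete, here is a minimal sketch on a made-up univariate series. The lag of 3, the horizon of 2, and the names `s`, `cols`, and `windowed` are chosen only for illustration and do not come from the notebook.

```python
import pandas as pd

# Made-up series 0, 1, ..., 7 (not taken from the ENSO data).
s = pd.Series(range(8), dtype="float32")

lag, ahead = 3, 2  # illustrative window sizes

# Inputs: the `lag` previous values, oldest first.
cols = {f"x(t-{i})": s.shift(i) for i in range(lag, 0, -1)}
# Targets: the current value and the following ones.
cols.update({("y(t)" if i == 0 else f"y(t+{i})"): s.shift(-i) for i in range(ahead)})

# Rows at either end lack a full window and are dropped.
windowed = pd.concat(cols, axis=1).dropna()
print(windowed)
```

Each remaining row pairs three consecutive past values with the two values that follow them, which is the layout a supervised model can be trained on.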

In [6]:
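def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a time series as a supervised learning dataset.

    Inputs are all variables at lags t-n_in, ..., t-1; targets are the
    last variable at t, ..., t+n_out-1.
    """
    n_vars = 1 if isinstance(data, list) else data.shape[1]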
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n-1), last column only
    for i in range(0, n_out):
        cols.append(df.shift(-i).iloc[:,-1])
        if i == 0:
            names += ['VAR(t)']
        else:
            names += ['VAR(t+%d)' % i]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

In [7]:
# specify the size of our sliding window and number of features
enso = df.values.astype('float32')
lag = 12
ahead = 3
n_features = 1
reframed = series_to_supervised(enso, lag, ahead)
reframed.head()

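|  | var1(t-12) | var2(t-12) | var3(t-12) | var4(t-12) | var5(t-12) | var1(t-11) | var2(t-11) | var3(t-11) | var4(t-11) | var5(t-11) | ... | var4(t-2) | var5(t-2) | var1(t-1) | var2(t-1) | var3(t-1) | var4(t-1) | var5(t-1) | VAR(t) | VAR(t+1) | VAR(t+2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 348 | -0.4 | 0.0 | -0.09 | -1.54 | 5.15 | 1.0 | 0.1 | -0.2 | -2.73 | 0.73 | ... | 0.51 | 0.23 | -0.7 | 0.6 | 0.58 | -0.68 | -0.11 | -0.27 | -2.9 | 0.98 |
| 349 | 1.0 | 0.1 | -0.2 | -2.73 | 0.73 | 0.1 | 0.2 | -0.08 | 0.08 | -1.91 | ... | -0.68 | -0.11 | 0.4 | 0.6 | 0.42 | -1.01 | -0.27 | -2.9 | 0.98 | -1.29 |
| 350 | 0.1 | 0.2 | -0.08 | 0.08 | -1.91 | -0.1 | 0.3 | 0.24 | -0.15 | -2.72 | ... | -1.01 | -0.27 | 0.3 | 0.5 | 0.02 | 2.03 | -2.9 | 0.98 | -1.29 | 0.45 |
| 351 | -0.1 | 0.3 | 0.24 | -0.15 | -2.72 | 0.5 | 0.2 | 0.08 | 1.27 | -0.97 | ... | 2.03 | -2.9 | -0.4 | 0.3 | -0.13 | -0.64 | 0.98 | -1.29 | 0.45 | 1.61 |
| 352 | 0.5 | 0.2 | 0.08 | 1.27 | -0.97 | 0.6 | 0.0 | 0.3 | -1.64 | -0.24 | ... | -0.64 | 0.98 | -0.6 | 0.4 | 0.04 | 1.66 | -1.29 | 0.45 | 1.61 | -0.44 |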
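
With `lag = 12` and the five variables visible in the output above, each row of `reframed` holds 12 × 5 = 60 input columns followed by `ahead = 3` target columns. Keras LSTM layers expect input shaped `(samples, timesteps, features)`, so one plausible next step is to split the frame into inputs and targets and reshape the inputs. The sketch below is an assumption about that step, not the notebook's own code: it uses random numbers in place of `reframed.values`, and the names `n_vars`, `n_rows`, `values`, `X`, and `y` are hypothetical.

```python
import numpy as np

lag, ahead = 12, 3   # same window sizes as in the cell above
n_vars = 5           # assumed from the var1..var5 columns in the output
n_rows = 10          # arbitrary number of samples for this demo

# Stand-in for reframed.values: 60 lagged inputs followed by 3 targets per row.
rng = np.random.default_rng(0)
values = rng.normal(size=(n_rows, lag * n_vars + ahead)).astype("float32")

# Split inputs from targets.
n_obs = lag * n_vars
X, y = values[:, :n_obs], values[:, n_obs:]

# Columns run from the oldest lag to the newest, with the variables grouped
# inside each lag, so a row-major reshape gives (samples, timesteps, features).
X = X.reshape((X.shape[0], lag, n_vars))

print(X.shape, y.shape)
```

The reshape relies on the column order produced by `series_to_supervised` (all variables at `t-12`, then all at `t-11`, and so on); a different ordering would need a different reshape.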
