<a href="https://colab.research.google.com/github/danielegrattarola/ml-18-19/blob/master/06_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Machine Learning

Prof. Cesare Alippi

Daniele Grattarola (daniele.grattarola@usi.ch)

Daniele Zambon (daniele.zambon@usi.ch)

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
import pandas as pd
import pandas_datareader.data as web
from datetime import datetime

# Toy problem

Let's try to predict $$y(t) = \sin(\omega t) + \epsilon$$

In [0]:
omega = 10  # Angular frequency of signal
x = np.linspace(0, 2 * np.pi, 2000)
# The noise standard deviation is 0.0, so the signal is noiseless; increase it to add noise
data = np.sin(omega * x) + np.random.normal(0, 0.0, size=x.shape)
plt.plot(x, data)

We create a helper function that, given a time series 
$$[x_{t_0}, x_{t_1}, x_{t_2}, x_{t_3} \dots]$$
builds a dataset of observations that we can use to train our AR model of order $p$ (the function below calls the order `k`): 

$$
X = \left[ 
\begin{array}{c}
x_{t_0}, \dots, x_{t_0 + p - 1} \\
x_{t_1}, \dots, x_{t_1 + p - 1} \\
x_{t_2}, \dots, x_{t_2 + p - 1} \\
\vdots \\
\end{array}
\right]\textrm{ }
Y = \left[ 
\begin{array}{c}
x_{t_0 + p} \\
x_{t_1 + p} \\
x_{t_2 + p} \\
\vdots \\
\end{array}
\right] 
$$




In [0]:
def get_sequences_and_targets(data, k):
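    '''
    Build (sequence, target) pairs from a 1-D time series.
    Each input is a window of k consecutive values, data[t:t+k],
    and its target is the value that follows it, data[t+k].
    '''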
    time = np.arange(data.shape[0])
    inputs = []
    targets = []
    for t in time[:-k]:
        inputs.append(data[t:t+k])
        targets.append(data[t+k])
    inputs = np.array(inputs)
    targets = np.array(targets)

    return inputs, targets
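
To make the layout of $X$ and $Y$ concrete, here is a minimal sketch that applies the same windowing to a tiny series. The helper is repeated in compact form so that the cell runs on its own, and `toy_series` is a hypothetical example that does not appear elsewhere in the notebook.

```python
import numpy as np

def get_sequences_and_targets(data, k):
    # Same windowing logic as the helper above, written compactly
    n = len(data) - k
    inputs = np.array([data[t:t + k] for t in range(n)])
    targets = np.array([data[t + k] for t in range(n)])
    return inputs, targets

toy_series = np.arange(6)  # Hypothetical toy series: 0, 1, 2, 3, 4, 5
toy_inputs, toy_targets = get_sequences_and_targets(toy_series, k=2)

print(toy_inputs)   # Each row holds k consecutive values
print(toy_targets)  # The value that follows each row
```

A series of length $n$ yields $n - k$ pairs, because the last $k$ values have no complete window followed by a target.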

The data pre-processing is as usual: we create a training and a test set. Note that in this case the test set cannot be sampled randomly, but must be taken "at the end" of our data: the model has to be evaluated on values that come after everything it was trained on, otherwise we cannot tell whether it generalizes to the future. 

In [0]:
# Prep the data for training model
k = 2
inputs, targets = get_sequences_and_targets(data, k=k)

# Split training and test sets
test_split = int(len(targets) * 0.9)
x_train, x_test = inputs[:test_split], inputs[test_split:]
y_train, y_test = targets[:test_split], targets[test_split:]

Let's fit a linear autoregressive model: 
$$
y(t) = \sum\limits_{i=1}^{p} x_{t - i} \phi_i = x_{t-1}\phi_1 + \dots + x_{t-p}\phi_p
$$

using the standard linear regression techniques that we have seen.



In [0]:
# Linear autoregression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
lr_pred = lr.predict(x_test)

plt.title('Autoregressive')
plt.plot(y_test, label='True')
plt.plot(lr_pred, label='Pred')
plt.legend()
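
Fitting an AR model is ordinary least squares on the lagged values. As a rough sketch of that idea, the cell below simulates a hypothetical AR(2) process with known coefficients (`phi_1`, `phi_2` and the noise level are illustrative assumptions, not values from this notebook) and recovers them with `np.linalg.lstsq`. Unlike `LinearRegression` with its default settings, this sketch fits no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(2) process with known coefficients (illustrative values)
phi_1, phi_2 = 0.6, -0.3
series = np.zeros(5000)
for t in range(2, len(series)):
    series[t] = phi_1 * series[t - 1] + phi_2 * series[t - 2] + rng.normal(0, 0.1)

# Each row is [x_{t-2}, x_{t-1}] and the target is x_t
X = np.column_stack([series[:-2], series[1:-1]])
y = series[2:]

# Ordinary least squares without an intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print('Estimated [phi_2, phi_1]:', coef)
```

The columns follow the same order as the windows built by `get_sequences_and_targets`, with the oldest value first, so the first estimated coefficient multiplies $x_{t-2}$.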

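In this simple case, we know that a sinusoid sampled at a constant step $\Delta$ satisfies an exact recurrence of order 2. From $\sin(a + b) + \sin(a - b) = 2 \sin(a) \cos(b)$, with $a = \omega (t - \Delta)$ and $b = \omega \Delta$, we get

$$
y(t) = \sin(\omega t) = 2 \cos(\omega \Delta) x_{t - 1} - x_{t - 2}
$$

where $x_{t-1} = \sin(\omega (t - \Delta))$ and $x_{t-2} = \sin(\omega (t - 2\Delta))$ are the two previous samples.

Was the model able to find these exact values?

In [0]:
dt = x[1] - x[0]  # Sampling step of the time grid
# Each input row is [x_{t-2}, x_{t-1}], so the coefficient of x_{t-2} comes first
print('Analytical coefficients: {:.2f}, {:.2f}'.format(-1, 2 * np.cos(omega * dt)))
print('Learned coefficients: {:.2f}, {:.2f}'.format(*lr.coef_))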
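
For a sinusoid sampled with a constant step, the two-term recurrence $y_t = 2\cos(\omega \Delta) y_{t-1} - y_{t-2}$ can also be checked numerically, independently of any fitted model. The sketch below rebuilds the same time grid as the toy problem; the names `t_grid` and `reconstructed` are introduced only for this check.

```python
import numpy as np

omega = 10
t_grid = np.linspace(0, 2 * np.pi, 2000)
dt = t_grid[1] - t_grid[0]  # Sampling step (Delta)
y = np.sin(omega * t_grid)

# Rebuild every sample from the two samples that precede it
reconstructed = 2 * np.cos(omega * dt) * y[1:-1] - y[:-2]
print(np.allclose(reconstructed, y[2:]))
```

Because the recurrence is exact, a linear model of order 2 can represent the noiseless sinusoid without error.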

# Predicting the stock market

First, we create a couple of helper functions to download the historical price data from the internet. 

In [0]:
COLUMNS = ['open','high','low','close','volume']

def get_crypto(name, start, end):
    '''
    Get price history of given cryptocurrency from coinmarketcap.com
    '''
    start_str = start.strftime('%Y%m%d') 
    stop_str = end.strftime('%Y%m%d') 
    url = 'https://coinmarketcap.com/currencies/{}/historical-data/?start={}&end={}'.format(name, start_str, stop_str)

    # Get data
    content = requests.get(url).content
    soup = BeautifulSoup(content,'html.parser')
    table = soup.find('table', {'class': 'table'})

    # Process raw html
    data = [[td.text.strip() for td in tr.findChildren('td')] 
            for tr in table.findChildren('tr')]

    # Convert to a DataFrame
    df = pd.DataFrame(data)
    df.drop(df.index[0], inplace=True) # first row is empty
    df[0] =  pd.to_datetime(df[0]) # date
    for i in range(1,7):
        df[i] = pd.to_numeric(df[i].str.replace(",","").str.replace("-","")) # some vol is missing and has -
    df.columns = ['date','open','high','low','close','volume','market_cap']
    df.set_index('date',inplace=True)
    df.sort_index(inplace=True)

    return df[COLUMNS]


def get_stock(name, start, end):
    '''
    Get price history of stock from iextrading.com
    '''
    df = web.DataReader(name, 'iex', start, end)
    df.sort_index(inplace=True)

    return df[COLUMNS]

The workflow is the same as above, except that this time we take the logarithm of the data in order to rescale it.

In [0]:
# Get the data
start = datetime(2014, 5, 26)
end   = datetime(2019, 5, 26)
aapl_stock = get_stock('AAPL', start, end)
bitcoin = get_crypto('bitcoin', start, end)

data = aapl_stock['close'].values
data = np.log(data)  # Log-transform the prices to compress their range

plt.plot(data)
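
Since the models are trained on log-prices, their predictions are log-prices too. A minimal sketch of mapping values back to the original price scale with `np.exp`, using hypothetical closing prices rather than the downloaded data:

```python
import numpy as np

prices = np.array([100.0, 105.0, 98.0, 110.0])  # Hypothetical closing prices
log_prices = np.log(prices)

# np.exp inverts the log transform, so predictions can be read as prices again
recovered = np.exp(log_prices)
print(np.allclose(recovered, prices))
```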

Let's try the same linear autoregressive model that we used for the sine. 

In [0]:
# Create sequences and targets
k = 5
inputs, targets = get_sequences_and_targets(data, k=k)
test_split = int(len(targets) * 0.9)
x_train, x_test = inputs[:test_split], inputs[test_split:]
y_train, y_test = targets[:test_split], targets[test_split:]

# Linear autoregressive model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
lr_pred = lr.predict(x_test)

plt.figure()
plt.title('Autoregressive')
plt.plot(y_test, label='True')
plt.plot(lr_pred, label='Pred')
plt.legend()

Now let's try something more complicated...

In [0]:
# Nonlinear autoregressive model
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

nn = Sequential([
        Dense(32, activation='relu', input_shape=(k, )),
        Dense(32, activation='relu'),
        Dense(1)
    ])
nn.compile('adam', 'mse')
callbacks = [
    EarlyStopping(monitor='val_loss', patience=100),
    ModelCheckpoint('best_model.h5', save_weights_only=True, save_best_only=True)
]
nn.fit(x_train, y_train, 
       epochs=10000, validation_split=0.1,
       callbacks=callbacks)
nn.load_weights('best_model.h5')
nn_pred = nn.predict(x_test)

plt.figure()
plt.title('Non-linear Autoregressive')
plt.plot(y_test, label='True')
plt.plot(nn_pred, label='Pred')
plt.legend()