# Investment and Trading Capstone Project

## Build a Stock Price Indicator

### Project Overview

Investment firms, hedge funds and even individuals have been using financial models to better understand market behavior and make profitable investments and trades. A wealth of information is available in the form of historical stock prices and company performance data, suitable for machine learning algorithms to process. 

This project uses historical stock prices from finnhub.io to predict how these stocks will develop. The result will be implemented in a website that lets the user choose a stock and a timeframe to analyze.

### Problem Statement

The problem tackled in this project is predicting future adjusted closing prices for selected stocks. To do so, we will use several regression and deep learning models and compare them to make our predictions as accurate as possible.
The user interaction will be implemented in a website/dashboard, where the user can choose the stock of interest and a timeframe for which to predict future prices.

## Exploratory Data Analysis

In this part we take a closer look at the underlying data to decide how to handle it in order to achieve the results described above. Let's import some libraries and load the data first.

In [308]:
import pandas as pd
import numpy as np
import requests
from datetime import datetime
import plotly.graph_objects as go
from statsmodels.tsa.stattools import adfuller, acf, pacf


### Data

We obtain our data from the API at www.finnhub.io. We're especially interested in the candlestick data for US stocks, which goes back 25 years. As an example in this notebook we work with the Google stock (GOOG).

In [309]:
def convert_timestamp_to_unix(timestamp):
    """Converts a pandas timestamp to a unix integer."""
    return (timestamp - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

def convert_unix_to_timestamp(unix):
    """Converts a unix integer into a pandas datetime object."""
    return pd.to_datetime(unix, unit='s')
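
As a quick sanity check, the two helpers above should be inverses of each other. The sketch below repeats their definitions so that it runs on its own; the sample date is an arbitrary choice made only for illustration.

```python
import pandas as pd

def convert_timestamp_to_unix(timestamp):
    """Converts a pandas timestamp to a unix integer."""
    return (timestamp - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

def convert_unix_to_timestamp(unix):
    """Converts a unix integer into a pandas datetime object."""
    return pd.to_datetime(unix, unit='s')

# arbitrary sample date, chosen only for illustration
sample = pd.Timestamp("2020-08-18")

# seconds since 1970-01-01, then back to a timestamp
unix_seconds = convert_timestamp_to_unix(sample)
round_trip = convert_unix_to_timestamp(unix_seconds)

print(unix_seconds)
print(round_trip == sample)
```

If the round trip did not return the original timestamp, the `from` and `to` parameters of the API request would not describe the window we intended.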

In [310]:

# timedelta='5y'

# # get start & endtime for ohlc data
# end_time = datetime.now()
# start_time = end_time - pd.Timedelta(timedelta)
# # convert times in unix integers for API request
# to_time = convert_timestamp_to_unix(end_time)
# from_time = convert_timestamp_to_unix(start_time)

# # get OHLC data for each symbol

# r = requests.get(
#     'https://finnhub.io/api/v1/stock/candle?symbol=GOOG&resolution=D&from={}&to={}&adjusted=true'.format(
#     symbol, from_time, to_time)).json()


# r
#pd.DatetimeIndex([convert_unix_to_timestamp(x) for x in r.json()['t']]).to_period('D'),

In [313]:
def get_ohlc_data(symbols, timedelta='1826D'):
    """
    Queries list of stock symbols for their OHLC data.
    Arguments:
    symbols - list of strings containing the stock symbols of the desired stocks
    timedelta - string parseable by pd.Timedelta giving the length of the lookback
                window; the default of 1826 days is roughly 5 years (recent pandas
                versions no longer accept year units such as '5y')
    Returns:
    ohlc_data - dict of dataframes containing ohlc data for each symbol over certain timeframe
    """
    ohlc_data = dict()
    # get start & endtime for ohlc data
    end_time = datetime.now()
    start_time = end_time - pd.Timedelta(timedelta)
    # convert times in unix integers for API request
    to_time = convert_timestamp_to_unix(end_time)
    from_time = convert_timestamp_to_unix(start_time)

    # get OHLC data for each symbol
    for symbol in symbols:
        r = requests.get(
            'https://finnhub.io/api/v1/stock/candle?symbol={}&resolution=D&from={}&to={}&adjusted=true'.format(
            symbol, from_time, to_time))
        # fail early with a clear HTTP error if the request was rejected
        r.raise_for_status()
        # parse the response body once and reuse it
        candles = r.json()

        data = pd.DataFrame(
            index=[convert_unix_to_timestamp(x).date() for x in candles['t']],
            data={'open': candles['o'],
                  'high': candles['h'],
                  'low': candles['l'],
                  'adj_close': candles['c'],
                  'volume': candles['v']})
        ohlc_data.update({symbol: data})

    return ohlc_data

In [314]:
symbols = ['GOOG', 'AAPL']
ohlc_data = get_ohlc_data(symbols)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [317]:
ohlc_data['GOOG'].info()

| date | open | high | low | adj_close | volume |
|---|---|---|---|---|---|
| 2015-08-20 | 655.4600 | 662.990 | 642.900 | 646.83 | 2855300 |
| 2015-08-21 | 639.7800 | 640.050 | 612.330 | 612.48 | 4265200 |
| 2015-08-24 | 573.0000 | 599.330 | 565.050 | 589.61 | 5770300 |
| 2015-08-25 | 614.9100 | 617.450 | 581.110 | 582.06 | 3538000 |
| 2015-08-26 | 610.3500 | 631.710 | 599.050 | 628.62 | 4235900 |
| ... | ... | ... | ... | ... | ... |
| 2020-08-12 | 1485.5800 | 1512.386 | 1485.250 | 1506.62 | 1437700 |
| 2020-08-13 | 1510.3400 | 1537.250 | 1508.005 | 1518.45 | 1454700 |
| 2020-08-14 | 1515.6600 | 1521.900 | 1502.880 | 1507.73 | 1354800 |
| 2020-08-17 | 1514.6700 | 1525.610 | 1507.970 | 1517.98 | 1378300 |


In [316]:
ohlc_data['AAPL'].info()

<class 'pandas.core.frame.DataFrame'>
Index: 1258 entries, 2015-08-20 to 2020-08-18
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       1258 non-null   float64
 1   high       1258 non-null   float64
 2   low        1258 non-null   float64
 3   adj_close  1258 non-null   float64
 4   volume     1258 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 59.0+ KB


We can see that for both symbols we got back a complete dataset without any null values, so no further cleaning is required.
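
A more direct way to confirm this is to count the missing values per column. The sketch below uses a small hand-built frame with the first three GOOG rows shown above as a stand-in for `ohlc_data['GOOG']`; the name `sample` is only illustrative, and the same two lines should apply to the real data.

```python
import pandas as pd

# stand-in for ohlc_data['GOOG'], built from the first rows shown above
sample = pd.DataFrame(
    index=['2015-08-20', '2015-08-21', '2015-08-24'],
    data={'open': [655.46, 639.78, 573.00],
          'high': [662.99, 640.05, 599.33],
          'low': [642.90, 612.33, 565.05],
          'adj_close': [646.83, 612.48, 589.61],
          'volume': [2855300, 4265200, 5770300]})

# number of missing values in each column, and in the whole frame
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```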

In [28]:
def plot_ohlc(symbols):
    """
    Plots the ohlc data for a given symbol and timeframe.
    Arguments:
    symbols - list of strings representing the name of a stock symbol that is contained in the ohlc dictionary.
    Returns:
    None
    """
    for symbol in symbols:
        
        df = ohlc_data[symbol]

        fig = go.Figure(data=[
            go.Candlestick(
                name=symbol,
                x=df.index,
                open=df.open,
                high=df.high,
                low=df.low,
                close=df.adj_close)])

        fig.update_layout(
            xaxis_rangeslider_visible=False,
            title='OHLC Stock Chart for: {}'.format(symbol))
        fig.show()

In [29]:
plot_ohlc(symbols)

Above we mentioned that there are no missing values in our datasets. However, zooming in we can observe gaps in both of the graphs above. These appear because the stock markets are closed on weekends and certain holidays. This shouldn't be a problem for our prediction, as we want to predict for predefined steps into the future, like 7 or 30 days. The exact calendar day then doesn't really matter.
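
To make these gaps visible in numbers, we can look at the calendar days that pass between consecutive rows. The sketch below uses a synthetic business-day calendar as a stand-in for the index of our dataframes; it ignores holidays, so real data can show a few longer gaps.

```python
import pandas as pd

# synthetic stand-in for a trading calendar: business days only, holidays ignored
trading_days = pd.bdate_range("2020-08-10", "2020-08-21")

# calendar days elapsed between consecutive trading days
gaps = pd.Series(trading_days).diff().dt.days.dropna()

# most steps are 1 day long, the step across a weekend is 3 days long
print(gaps.value_counts())
```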

## Algorithm Evaluation

In this chapter we look at different models for predicting stock prices and evaluate their performance to decide which algorithm to implement in the final stock price predictor.

### ARIMA

### TODO: Explain model and prep steps

Most models for time series data assume that the data is stationary. A series is stationary if its statistical properties remain constant over time, i.e. it has:
- constant mean
- constant variance
- an autocovariance that doesn't depend on time
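
To get a feeling for these properties, it can help to compare a series that is stationary by construction with one that is not. The sketch below uses synthetic data (white noise and its cumulative sum, a random walk); the seed, the window length and the helper name `rolling_mean_spread` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed, arbitrary choice
noise = pd.Series(rng.normal(size=1000))  # stationary: constant mean and variance
walk = noise.cumsum()                     # non-stationary: its level wanders over time

def rolling_mean_spread(series, window=100):
    """Difference between the largest and smallest rolling mean of a series."""
    rolling_mean = series.rolling(window).mean()
    return rolling_mean.max() - rolling_mean.min()

noise_spread = rolling_mean_spread(noise)
walk_spread = rolling_mean_spread(walk)

# the rolling mean of the noise barely moves, the one of the walk drifts a lot
print(noise_spread, walk_spread)
```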

In the graphs above we could clearly see an overall increasing trend. We would not expect any seasonal patterns in this data, but we will check for them later. To confirm our visual inspection we check for stationarity using:
- **Plots of the rolling statistics**: We plot the moving average and moving standard deviation of the adjusted closing price and see whether they change over time. Here we use a window of 7 days to catch possible weekly changes.
- **Dickey-Fuller test**: This is a statistical test for stationarity. The *Null Hypothesis* is that the time series is not stationary. The test returns a *Test Statistic* and *Critical Values* for different significance levels. If the Test Statistic is smaller (more negative) than the Critical Value, we can reject the Null Hypothesis and say the series is stationary.
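
The decision rule of the Dickey-Fuller test boils down to a single comparison. The sketch below applies it to the statistic and critical values that this notebook prints further down for the log price minus its 30-day moving average; the helper `rejects_unit_root` and the second statistic of -1.5 are hypothetical and only serve as an illustration.

```python
def rejects_unit_root(adf_statistic, critical_values, level='5%'):
    """Returns True if the ADF statistic lies below the critical value at the given level."""
    return adf_statistic < critical_values[level]

# values copied from the test output for the log price minus its 30-day moving average
adf_statistic = -7.209160174244871
critical_values = {'1%': -3.4357658900670085,
                   '5%': -2.8639315921664568,
                   '10%': -2.5680433235434736}

for level in critical_values:
    print(level, rejects_unit_root(adf_statistic, critical_values, level))

# hypothetical statistic that is not negative enough to reject the Null Hypothesis
print(rejects_unit_root(-1.5, critical_values, '10%'))
```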

In [30]:
def check_stationarity(timeseries, interval=7):
    """
    Calculates rolling average and std. deviation for a series of adjusted closing prices.
    The function plots the results and prints the results of the Dickey-Fuller test.

    Arguments:
    timeseries - pandas Series to check, e.g. (transformed) adjusted closing prices
    interval - window length used to calculate the rolling statistics
    Returns:
    None
    """

    # calculate rolling statistics
    roll_avg = timeseries.rolling(interval).mean()
    roll_std = timeseries.rolling(interval).std()

    # plot the series vs. its rolling statistics; use the series' own index
    # so the x-axis stays aligned after rows have been dropped
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=timeseries.index, y=timeseries, name='Original'))
    fig.add_trace(go.Scatter(x=timeseries.index, y=roll_avg, name='Rolling Average'))
    fig.add_trace(go.Scatter(x=timeseries.index, y=roll_std, name='Rolling Std Deviation'))
    fig.update_layout(title='Rolling Mean & Standard Deviation')
    fig.show()

    # Dickey-Fuller test:
    result = adfuller(timeseries)
    print('ADF Statistic: {}'.format(result[0]))
    print('p-value: {}'.format(result[1]))
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t{}: {}'.format(key, value))

From the graph we can tell that the rolling statistics change over time, which means the series is not stationary.
The results of the Dickey-Fuller test confirm this observation: the ADF statistic is well above all of the critical values. It's safe to say our series is not stationary, and there's some work to do to make it stationary.

In [255]:
df = ohlc_data['GOOG']
# log-transform the adjusted closing price to dampen the growth of the trend
timeseries = np.log(df.adj_close)

In [258]:
# 30-day moving average as an estimate of the trend
moving_avg = timeseries.rolling(30).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df.index, y=timeseries,
                    name='Log(adjusted closing price)'))
fig.add_trace(go.Scatter(x=df.index, y=moving_avg,
                    name='Moving Avg'))

# subtract the trend estimate; the first 29 values are NaN and get dropped
ts_log_moving_avg_diff = timeseries - moving_avg
ts_log_moving_avg_diff.dropna(inplace=True)

In [259]:
check_stationarity(ts_log_moving_avg_diff)

ADF Statistic: -7.209160174244871
p-value: 2.2585219238226482e-10
Critical Values:
	1%: -3.4357658900670085
	5%: -2.8639315921664568
	10%: -2.5680433235434736


In [260]:
# exponentially weighted average: recent observations get more weight
exp_weighted_avg = timeseries.ewm(halflife=100).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df.index, y=timeseries,
                    name='Log(adjusted closing price)'))
fig.add_trace(go.Scatter(x=df.index, y=exp_weighted_avg,
                    name='Exp Moving Avg'))

ts_log_exp_avg_diff = timeseries - exp_weighted_avg
ts_log_exp_avg_diff.dropna(inplace=True)

In [261]:
check_stationarity(ts_log_exp_avg_diff)

ADF Statistic: -4.472458448116631
p-value: 0.0002204433946711244
Critical Values:
	1%: -3.435638861796935
	5%: -2.863875547501718
	10%: -2.5680134763122906


In [264]:
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

decomposition = seasonal_decompose(x=timeseries, period=10)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

fig = go.Figure()
fig.add_trace(go.Scatter(x=df.index, y=timeseries, name='Original'))
fig.add_trace(go.Scatter(x=df.index, y=trend, name='Trend'))
fig.add_trace(go.Scatter(x=df.index, y=seasonal, name='Seasonality'))
fig.add_trace(go.Scatter(x=df.index, y=residual, name='Residuals'))

In [265]:
check_stationarity(residual.dropna())

ADF Statistic: -12.009956181755612
p-value: 3.1980269676059216e-22
Critical Values:
	1%: -3.4356950607889254
	5%: -2.863900342696613
	10%: -2.568026681232353


### ARIMA model

#### Autocorrelation function (ACF):
This metric expresses the correlation between the series and a copy of itself shifted by a given number of time steps (the lag), computed for each lag. We use the ACF to determine the **optimal number of Moving Average (MA) terms**.

#### Partial Autocorrelation function (PACF):
The PACF expresses the correlation between observations made at two points in time after removing the influence of the observations in between. We use the PACF to determine the **optimal number of Autoregressive (AR) terms** in our model.
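
To illustrate what the ACF measures, the sketch below computes the sample autocorrelation by hand with NumPy for a synthetic AR(1) series. The helper `sample_acf`, the coefficient `phi = 0.7` and the seed are assumptions made for this illustration only; in the notebook itself we rely on `acf` and `pacf` from statsmodels.

```python
import numpy as np

def sample_acf(x, n_lags):
    """Sample autocorrelation of x for lags 0..n_lags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # correlate the series with a copy of itself shifted by `lag` steps
    return np.array([np.sum(centered[lag:] * centered[:len(x) - lag]) / denom
                     for lag in range(n_lags + 1)])

# synthetic AR(1) series: each value is phi times the previous value plus noise
rng = np.random.default_rng(0)
phi = 0.7
n = 2000
series = np.zeros(n)
for t in range(1, n):
    series[t] = phi * series[t - 1] + rng.normal()

acf_values = sample_acf(series, 3)
print(acf_values)
```

For an AR(1) process the autocorrelation at lag k is expected to be close to `phi ** k`, so the values should decay gradually rather than cut off after a fixed lag.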

In [266]:
def check_auto_correlation(timeseries, n_lags):
    """
    Calculates and plots Autocorrelation function (ACF) & Partial Autocorrelation function (PACF).
    Plots the results and the related confidence interval.

    Arguments:
    timeseries - pandas Series to analyze
    n_lags - number of lags to compute
    """

    # acf returns n_lags + 1 values (lags 0..n_lags) plus their confidence intervals
    lag_acf, conf_array = acf(timeseries, nlags=n_lags, fft=True, alpha=0.05)
    # center the confidence band around zero
    upper = conf_array[:, 1] - lag_acf
    lower = conf_array[:, 0] - lag_acf
    lag_pacf = pacf(timeseries, nlags=n_lags, method='ols')

    # Plot ACF & PACF against the lag, starting at lag 0
    x = list(range(n_lags + 1))

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x, y=lag_acf, name='ACF'))
    fig.add_trace(go.Scatter(x=x, y=lag_pacf, name='PACF'))
    fig.add_trace(go.Scatter(x=x, y=upper, name='upper conf-int',
                            mode='lines', fill='tozeroy'))
    fig.add_trace(go.Scatter(x=x, y=lower, name='lower conf-int',
                            mode='lines', fill='tozeroy'))
    fig.show()

In [267]:
check_auto_correlation(residual.dropna(), 30)

### Choosing the inputs for the ARIMA model

Using the chart above, we look for the values p, d and q as inputs for our model, which are defined as follows:
- p: number of autoregressive terms (AR order)
- d: number of nonseasonal differences (differencing order)
- q: number of moving-average terms (MA order)

Therefore we choose the values as follows:

In [269]:
p = 2 #3
d = 0
q = 2 #17

In [320]:
residuals = residual.dropna()
residuals.index = pd.DatetimeIndex(residual.dropna().index).to_period('D')

residuals

2015-08-26    0.019698
2015-08-27    0.038274
2015-08-28    0.030691
2015-08-31    0.010953
2015-09-01   -0.026729
                ...   
2020-08-04   -0.018785
2020-08-05   -0.011106
2020-08-06    0.007424
2020-08-07    0.001468
2020-08-10    0.001208
Freq: D, Name: resid, Length: 1248, dtype: float64

In [324]:
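# newer statsmodels releases removed statsmodels.tsa.arima_model; ARIMA now lives here
from statsmodels.tsa.arima.model import ARIMA

# ARMA(p, q) model on the residuals (d = 0, so no differencing)
model = ARIMA(residuals, order=(p, d, q))
results_AR = model.fit()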

x = residual.dropna().index
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=residuals, name='Residuals'))
fig.add_trace(go.Scatter(x=x, y=results_AR.fittedvalues, name='fitted results'))
fig.update_layout(title='SSE: %.4f'% sum((results_AR.fittedvalues-residuals)**2))
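
The SSE shown in the plot title is the sum of squared differences between the fitted values and the residuals we modelled. As a minimal sketch of that calculation, the example below uses made-up numbers that are not taken from the model above.

```python
import numpy as np

# made-up numbers for illustration, not taken from the model above
actual = np.array([0.02, 0.04, 0.03])
fitted = np.array([0.01, 0.03, 0.05])

# sum of squared errors: square each difference, then add them up
sse = np.sum((fitted - actual) ** 2)
print('SSE: %.4f' % sse)
```

A smaller SSE means the fitted values follow the series more closely, which makes it a simple way to compare different choices of p and q on the same data.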