# Investment and Trading Capstone Project

## Build a Stock Price Indicator

### project overview

Investment firms, hedge funds and even individuals have been using financial models to better understand market behavior and make profitable investments and trades. A wealth of information is available in the form of historical stock prices and company performance data, suitable for machine learning algorithms to process. 

This project uses historical stock prices from finnhub.io to predict how these stocks will develop. The results will be implemented in a website that lets the user choose a stock and a timeframe to analyze.

### problem statement

The problem tackled in this project is predicting future adjusted closing prices for selected stocks. To do so, we will use several regression and deep learning models and aim for the highest possible prediction accuracy.
The user interaction will be implemented in a website/dashboard, where the user can choose the stock of interest and a timeframe for which to predict future prices.

## Exploratory Data Analysis

In this part we take a closer look at the underlying data to decide how to handle it in order to achieve the results described above. Let's import some libraries and load the data first.

In [180]:
import pandas as pd
import numpy as np
import requests
from datetime import datetime
import plotly.graph_objects as go
from statsmodels.tsa.stattools import adfuller

### data

We obtain our data from the API at www.finnhub.io. We're especially interested in the candlestick data, which goes back 25 years for US stocks. As an example in this notebook we work with the Google stock (GOOG).

In [156]:
def convert_timestamp_to_unix(timestamp):
    """Converts a pandas timestamp to a unix integer."""
    return (timestamp - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

def convert_unix_to_timestamp(unix):
    """Converts a unix integer into a pandas datetime object."""
    return pd.to_datetime(unix, unit='s')
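As a quick sanity check, the two helpers can be run as a round trip. The sketch below repeats the definitions so it is self-contained; the example date is chosen only for illustration. Note that the helpers work on timezone-naive values, so a local `datetime.now()` is treated as if it were UTC.

```python
import pandas as pd

def convert_timestamp_to_unix(timestamp):
    """Converts a pandas timestamp to a unix integer."""
    return (timestamp - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

def convert_unix_to_timestamp(unix):
    """Converts a unix integer into a pandas datetime object."""
    return pd.to_datetime(unix, unit='s')

# Illustrative date only; any timezone-naive timestamp works the same way.
ts = pd.Timestamp("2020-08-14")
unix = convert_timestamp_to_unix(ts)         # whole seconds since 1970-01-01
roundtrip = convert_unix_to_timestamp(unix)  # back to a pandas Timestamp

print(unix, roundtrip)
```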

In [157]:
def get_ohlc_data(symbols, timedelta='5y'):
    """
    Queries a list of stock symbols for their OHLC data.
    Arguments:
    symbols - list of strings containing the stock symbols of the desired stocks
    timedelta - string passed to pd.Timedelta that sets how far back from now to query (default '5y')
    Returns:
    ohlc_data - dict of dataframes containing OHLC data for each symbol over the requested timeframe
    """
    ohlc_data = dict()
    # get start & endtime for ohlc data
    end_time = datetime.now()
    start_time = end_time - pd.Timedelta(timedelta)
    # convert times in unix integers for API request
    to_time = convert_timestamp_to_unix(end_time)
    from_time = convert_timestamp_to_unix(start_time)

    # get OHLC data for each symbol
    for symbol in symbols:
        r = requests.get(
            'https://finnhub.io/api/v1/stock/candle?symbol={}&resolution=D&from={}&to={}&adjusted=true'.format(
            symbol, from_time, to_time))
        r.raise_for_status()  # fail early on HTTP errors
        candles = r.json()    # parse the response body only once

        data = pd.DataFrame(
            index=[convert_unix_to_timestamp(x).date() for x in candles['t']],
            data={'open': candles['o'],
                  'high': candles['h'],
                  'low': candles['l'],
                  'adj_close': candles['c'],
                  'volume': candles['v']})
        ohlc_data.update({symbol: data})
    
    return ohlc_data
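The request above can be made a little more defensive. The following is a minimal sketch, not the code used in this notebook: `build_candle_url` and `payload_is_usable` are hypothetical helpers, the `token` query parameter is an assumption about how an API key would be passed (the URL above contains none), and the status field `s` with the value `'ok'` is likewise an assumption about the response format. The sample payloads contain made-up values.

```python
from urllib.parse import urlencode

BASE_URL = 'https://finnhub.io/api/v1/stock/candle'

def build_candle_url(symbol, from_time, to_time, token=None):
    """Builds the candle request URL (hypothetical helper).

    `token` is an assumed name for the API key parameter.
    """
    params = {'symbol': symbol, 'resolution': 'D',
              'from': from_time, 'to': to_time, 'adjusted': 'true'}
    if token is not None:
        params['token'] = token
    return '{}?{}'.format(BASE_URL, urlencode(params))

def payload_is_usable(payload):
    """Returns True if the payload reports success and all arrays align."""
    keys = ['t', 'o', 'h', 'l', 'c', 'v']
    # Assumption: a status field 's' equal to 'ok' marks a successful query.
    if payload.get('s') != 'ok':
        return False
    if any(key not in payload for key in keys):
        return False
    # Every column must have the same number of entries as the timestamps.
    return len({len(payload[key]) for key in keys}) == 1

url = build_candle_url('GOOG', 1439769600, 1597363200, token='YOUR_API_KEY')

# Made-up payloads for illustration only, not real market data.
sample_ok = {'s': 'ok', 't': [1, 2], 'o': [1.0, 2.0], 'h': [1.5, 2.5],
             'l': [0.5, 1.5], 'c': [1.2, 2.2], 'v': [10, 20]}
sample_empty = {'s': 'no_data'}

print(url)
print(payload_is_usable(sample_ok), payload_is_usable(sample_empty))
```

Checking the payload before building the dataframe avoids a `KeyError` when a symbol returns no candles.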

In [158]:
symbols = ['GOOG', 'AAPL']
ohlc_data = get_ohlc_data(symbols)

In [159]:
ohlc_data['GOOG'].info()

<class 'pandas.core.frame.DataFrame'>
Index: 1259 entries, 2015-08-17 to 2020-08-14
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       1259 non-null   float64
 1   high       1259 non-null   float64
 2   low        1259 non-null   float64
 3   adj_close  1259 non-null   float64
 4   volume     1259 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 59.0+ KB


In [160]:
ohlc_data['AAPL'].info()

<class 'pandas.core.frame.DataFrame'>
Index: 1259 entries, 2015-08-17 to 2020-08-14
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       1259 non-null   float64
 1   high       1259 non-null   float64
 2   low        1259 non-null   float64
 3   adj_close  1259 non-null   float64
 4   volume     1259 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 59.0+ KB


We can see that for both symbols we got back a complete dataset without any null values, so no further cleaning is required.

In [161]:
def plot_ohlc(symbols):
    """
    Plots the ohlc data for a given symbol and timeframe.
    Arguments:
    symbols - list of strings representing the name of a stock symbol that is contained in the ohlc dictionary.
    Returns:
    None
    """
    for symbol in symbols:
        
        df = ohlc_data[symbol]

        fig = go.Figure(data=[
            go.Candlestick(
                name=symbol,
                x=df.index,
                open=df.open,
                high=df.high,
                low=df.low,
                close=df.adj_close)])

        fig.update_layout(
            xaxis_rangeslider_visible=False,
            title='OHLC Stock Chart for: {}'.format(symbol))
        fig.show()

In [162]:
plot_ohlc(symbols)

Above we noted that there are no missing values in our datasets. However, both charts show gaps along the time axis. These occur because the stock markets are closed on weekends and certain holidays. This shouldn't be a problem for our predictions, as we want to predict a predefined number of steps into the future, such as 7 or 30 days; the exact calendar day then doesn't really matter.
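One consequence is worth keeping in mind: each row in the data is a trading day, so a horizon counted in rows covers more calendar days than the same number suggests. The sketch below illustrates this with a hypothetical helper, `trading_weekdays`, that only skips weekends; exchange holidays are ignored and would require a holiday calendar.

```python
from datetime import date, timedelta

def trading_weekdays(start, end):
    """Returns all Monday-to-Friday dates from start to end (inclusive).

    Hypothetical helper: exchange holidays are NOT removed.
    """
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0 = Monday, ..., 4 = Friday
            days.append(current)
        current += timedelta(days=1)
    return days

# Two full weeks, Monday to the Friday of the following week.
start, end = date(2020, 8, 10), date(2020, 8, 21)
calendar_days = (end - start).days + 1
weekdays = trading_weekdays(start, end)

print(calendar_days, len(weekdays))
```

Here 12 calendar days contain only 10 weekdays, so a forecast of 10 rows ahead spans roughly two calendar weeks.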

## Algorithm Evaluation

In this chapter we look at different models for predicting stock prices and evaluate their performance to decide which algorithm to implement in the final stock price predictor.

### ARIMA

### TODO: Explain model and prep steps

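As we're dealing with time-series data, we want to check that the data is stationary, i.e. that its statistical properties such as mean and variance do not change over time. In our case we're especially interested in the adjusted closing price, as this is the variable we are trying to predict.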
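To make the idea of stationarity concrete, here is a toy illustration on a synthetic series (made-up numbers, not stock data). It compares the mean of the first and second half of a trending series with the same comparison on its first differences, a common transformation for removing a trend. The helper `half_means` is hypothetical and only serves this example.

```python
from statistics import mean

def half_means(values):
    """Returns the mean of the first and the second half of a sequence."""
    middle = len(values) // 2
    return mean(values[:middle]), mean(values[middle:])

# Synthetic series with a steady upward trend (not real prices).
trending = [100 + 2 * i for i in range(10)]

# First differences: the change from one step to the next.
differences = [b - a for a, b in zip(trending, trending[1:])]

print(half_means(trending))
print(half_means(differences))
```

The two halves of the trending series have different means (104 versus 114), whereas the differenced series has the same mean in both halves. Real price data is noisier, which is why a formal test is used below.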

In [163]:
df = ohlc_data['GOOG'].adj_close
df.head()

2015-08-17    660.87
2015-08-18    656.13
2015-08-19    660.90
2015-08-20    646.83
2015-08-21    612.48
Name: adj_close, dtype: float64

In [185]:
def check_stationarity(symbol, interval=7):
    """
    Calculates rolling average and std. deviation for a series of adjusted closing prices.
    The function plots the results and prints the results of the Dickey-Fuller test.
    
    Arguments:
    symbol - string containing the stock symbol of the desired stock
    interval - timeframe to calculate rolling statistics
    Returns:
    df - dataframe with added rolling statistics
    """
    df = ohlc_data[symbol]
    # add rolling statistics to dataframe  
    df['roll_avg'] = df.adj_close.rolling(interval).mean()
    df['roll_std'] = df.adj_close.rolling(interval).std()
    
    # plot adjusted closing price vs. rolling statistics
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df.adj_close,
                        name='adjusted closing price'))
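    # label the traces with the actual interval instead of a hard-coded 7
    fig.add_trace(go.Scatter(x=df.index, y=df.roll_avg,
                        name='rolling {} day average'.format(interval)))
    fig.add_trace(go.Scatter(x=df.index, y=df.roll_std,
                        name='rolling {} day std deviation'.format(interval)))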
    fig.update_layout(title='{}: Rolling Mean & Standard Deviation'.format(symbol))
    fig.show()
    
    # Dickey-Fuller test:
    result = adfuller(df.adj_close)
    print('ADF Statistic: {}'.format(result[0]))
    print('p-value: {}'.format(result[1]))
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t{}: {}'.format(key, value))
    
    return df

In [186]:
df = check_stationarity('GOOG')

ADF Statistic: -0.7635289002526755
p-value: 0.8296011915134187
Critical Values:
	1%: -3.435651725648415
	5%: -2.863881223119536
	10%: -2.568016498910778


From the graph we can tell that the rolling statistics change over time, which indicates that the series is not stationary. The Dickey-Fuller test confirms this.
The ADF statistic (about -0.76) lies above all three critical values, and the p-value (about 0.83) is far higher than the threshold of 0.05, so we cannot reject the null hypothesis that the series has a unit root.
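The decision rule can be written out explicitly. The sketch below reuses the numbers printed above; `rejects_unit_root` is a hypothetical helper, and the 0.05 threshold is the one mentioned in the text. The null hypothesis of a unit root is rejected only when the statistic is more negative than the critical value.

```python
# Values copied from the Dickey-Fuller output for GOOG above.
adf_stat = -0.7635289002526755
p_value = 0.8296011915134187
critical_values = {'1%': -3.435651725648415,
                   '5%': -2.863881223119536,
                   '10%': -2.568016498910778}

def rejects_unit_root(stat, crit):
    """Maps each significance level to whether the unit-root null is rejected."""
    return {level: stat < value for level, value in crit.items()}

decisions = rejects_unit_root(adf_stat, critical_values)
# Treat the series as stationary only if both criteria agree at the 5% level.
is_stationary = p_value < 0.05 and decisions['5%']

print(decisions)
print('stationary:', is_stationary)
```

For GOOG the null hypothesis is not rejected at any of the three levels, which matches the reading of the plot.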