FN
# Stock movement prediction with machine learning

<br>
<br>
Description... Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

### Important
GitHub does not render plotly plots. Please open this notebook in nbviewer: https://nbviewer.jupyter.org/github/afizing/matplot2plotly/blob/master/matplot2plotly.ipynb

All required dependencies are listed in requirements.txt
<br>
<br>
<br>

------------------------------------------

## Contents
 1. Requirements
 2. Dataset
 3. Stock universe creation
 * Feature engineering
 * Target labels
 * Data exploration
 * Feature selection
 * ARIMA
 * Deep learning 
   * LSTM
   * GRU
 * Reinforced learning
 * Model comparison, forcasting
 * Conclusions and further research
 * Credits
 
 ------------------------------------------

## 1. Requirements

pip install -r requirements.txt

In [7]:
import numpy as np
import pandas as pd
import os
import plotly.graph_objects as go

In [8]:
# TODO: plotly configuration parameters to external file
plotly_config = {'showLink': False, 'displayModeBar': False, 'showAxisRangeEntryBoxes': True}
color_scheme = {
    'index': '#B6B2CF',
    'etf': '#2D3ECF',
    'tracking_error': '#6F91DE',
    'df_header': 'silver',
    'df_value': 'white',
    'df_line': 'silver',
    'heatmap_colorscale': [(0, '#6F91DE'), (0.5, 'grey'), (1, 'red')],
    'background_label': '#9dbdd5',
    'low_value': '#B6B2CF',
    'high_value': '#2D3ECF',
    'y_axis_2_text_color': 'grey',
    'shadow': 'rgba(0, 0, 0, 0.75)',
    'major_line': '#2D3ECF',
    'minor_line': '#B6B2CF',
    'main_line': 'black'}

## 2. Dataset
 
This data includes 'end of day' stock prices in OHLCV records (open-high-low-close-volume) from xxxx to 24-02-2020. <br>
All prices are adjusted for splits and dividends to reflect accurate stock value. 

In this notebook I will analize and predict **Intel Corporation**. To test other assets I would recommend these financial data providers:
- [Quandl](https://www.quandl.com/) 
- [Quantopian](https://www.quantopian.com/) 
- [Exemplary Kaggle dataset](https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs)
- [Bossa.pl: polish stocks](https://info.bossa.pl/notowania/pliki/eod/bossafx/)

For additional feature engineering I used [S&P500 companies sector data from datahub.io](https://datahub.io/core/s-and-p-500-companies/)

------------------------------------------
<br>
<br>


In [9]:
# Exemplary csv
data = pd.read_csv('data\INTC.csv', parse_dates=['date'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4566 entries, 0 to 4565
Data columns (total 6 columns):
date      4566 non-null datetime64[ns]
open      4566 non-null float64
high      4566 non-null float64
low       4566 non-null float64
close     4566 non-null float64
volume    4566 non-null float64
dtypes: datetime64[ns](1), float64(5)
memory usage: 214.2 KB


In [10]:
monthly_close = data[["close"]].set_index(data.date).resample('M').last()

In [11]:
traces = [go.Scatter(name='Daily Close', 
                     x=data['date'], 
                     y=data['close'], 
                     line={'color': color_scheme['minor_line']}),
          go.Scatter(name='Monthly Close', 
                     x=monthly_close.index, 
                     y=monthly_close['close'], 
                     line={'color': color_scheme['major_line']})]

fig = go.Figure(data=traces)

fig.update_layout(
    title_text='Intel Corporation (INTC). Daily and monthly close',
    xaxis_title='Date',
    yaxis_title='Adjusted price USD'
    )

fig.show(config=plotly_config)

fig.update_layout(
    xaxis_range=['2017-02-23','2020-02-23'],
    title_text='Enclosed period: last 3 years',
    xaxis_title='Date',
    yaxis_title='Adjusted price USD'
    )

fig.show(config=plotly_config)

## 3. Stock universe
<br>
First, I need to combine and preprocess multiple csv files to create the stock universe dataframe.

<br>
<br>

---------------------------------------------------------------------------------------
Directories and time range:

In [12]:
db_dir = r'F:\coding\trading\datasets\daily'
sector_dir = r'F:\coding\trading\datasets\sector'

start_date = '2010-01-04'
end_date = '2020-02-21'

### Sector
<br>
Information about the sector of stocks will help us analyze them as a group and create new features based on their relations and similarities.

Source: 

I included missing tickers and updated few symbols manually. 

In [13]:
sector_df = pd.read_csv(os.path.join(sector_dir, 'sector.csv'), index_col="Symbol")
sector_df.head()

Unnamed: 0_level_0,Name,Sector
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
MMM,3M Company,Industrials
AOS,A.O. Smith Corp,Industrials
ABT,Abbott Laboratories,Health Care
ABBV,AbbVie Inc.,Health Care
ACN,Accenture plc,Information Technology


In [14]:
sector_codes = pd.DataFrame(columns=['name'])
sector_codes.name = sector_df.Sector.unique()
sector_codes = sector_codes.reset_index().set_index('name').to_dict()['index']
sector_codes

{'Industrials': 0,
 'Health Care': 1,
 'Information Technology': 2,
 'Consumer Discretionary': 3,
 'Utilities': 4,
 'Financials': 5,
 'Materials': 6,
 'Real Estate': 7,
 'Consumer Staples': 8,
 'Energy': 9,
 'Telecommunication Services': 10}

In [15]:
sector_df.replace({'Sector': sector_codes}, inplace=True)

In [16]:
sector_df.head()

Unnamed: 0_level_0,Name,Sector
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
MMM,3M Company,0
AOS,A.O. Smith Corp,0
ABT,Abbott Laboratories,1
ABBV,AbbVie Inc.,1
ACN,Accenture plc,2


### Combining csv ticker data

In [17]:
%%time

main_df = pd.DataFrame(columns=['open', 'close', 'volume'])

for file in os.listdir(db_dir):
    
    part_df = pd.read_csv(os.path.join(db_dir, file),
                          index_col = 'date',
                          parse_dates=['date'],
                          usecols = ['date', 'open', 'close', 'volume'],
                          memory_map=True,
                          )
    
    part_df = part_df.truncate(before=pd.Timestamp(start_date), after=pd.Timestamp(end_date))
    
    # Filter only continuous stocks from given period
    if part_df[part_df.index.isin([start_date, end_date])].isna().values.any():
        part_df = []
        continue
    
    # Ticker name and sector
    ticker_name = file[:-4]
    part_df['name'] = ticker_name
    part_df['sector'] = sector_df.loc[ticker_name].Sector
    
    main_df = pd.concat([main_df, part_df], axis=0, sort=False)

main_df.reset_index(inplace=True)
main_df.rename(columns={'index':'date'}, inplace=True)

Wall time: 23.6 s


### Missing data
We have few NaN values. In this case we will fill them by linear interpolation.

In [18]:
main_df[main_df.isna().any(axis=1)]

Unnamed: 0,date,open,close,volume,name,sector
142805,2019-12-09,,,0.0,BBT,5.0
182640,2016-01-15,,,0.0,CB,5.0
185737,2018-03-19,,,0.0,CBRE,7.0
188721,2019-12-05,,,0.0,CBS,3.0
385098,2019-09-25,,,0.0,EXC,4.0
409312,2014-08-01,,,0.0,FIS,2.0
414141,2013-07-02,,,0.0,FITB,5.0
459925,2012-12-18,,,0.0,GT,3.0
476963,2019-11-05,,,0.0,HCP,7.0
504935,2019-07-01,,,0.0,HRS,2.0


In [19]:
# Linear interpolation for missing NaNs
main_fillna = main_df.pivot_table(index=main_df['date'], values=['open', 'close', 'volume'], columns=["name"]).interpolate()
main_df = main_df.merge(main_fillna.stack().reset_index())
print(f'NaN ammount after preprocessing: {main_df[main_df.isna().any(axis=1)].size}')

NaN ammount after preprocessing: 0


In [20]:
main_df.head()

Unnamed: 0,date,open,close,volume,name,sector
0,2010-01-04,20.719,20.713,2546797.0,A,1.0
1,2010-01-05,20.646,20.481,2875349.0,A,1.0
2,2010-01-06,20.408,20.402,2067453.0,A,1.0
3,2010-01-07,20.362,20.375,2127216.0,A,1.0
4,2010-01-08,20.269,20.395,2464546.0,A,1.0


<br>
<br>

## 4. Feature engineering
TODO: Opis wskaźników analizy fundamentalnej 
<br>

#### 4.1. Market movement features

In [21]:
def rank(universe, factor):
    rank = universe[['date', 'name', factor]]\
        .groupby(['date'])[factor]\
        .rank(method='first', ascending=False)
    
    return rank


def zscore(universe, factor):
    score = universe[['date', 'name', factor]]\
        .groupby(['date'])[factor]\
        .transform(lambda x: (x - x.mean()) / x.std())
    
    return -score


def smoothing(universe, factor, window):
    moving_av = universe[['date', 'name', factor]]\
                        .pivot_table(index='date' ,columns='name', dropna=False)\
                        .rolling(window)\
                        .mean()\
                        .rename(columns={factor: f"{factor}_smooth_{window}_days"})\
                        .stack(dropna=False)\
                        .reset_index()
    return moving_av[['date', 'name', f"{factor}_smooth_{window}_days"]]


def momentum(universe, window):
    """
    Parameters
    ----------
    window: int, lookback window length (numer of rows)
    universe: pd.DataFrame, stock dataframe with daily data
    
    Returns
    -------
    momentum: pd.DataFrame, transformed universe with new feature column
                            of sector-neutral momentum
    
    """

    # Returns
    main = universe.copy()
    temp = universe.copy().pivot_table(index='date', columns=['name'])
    returns_windowed = (temp.close - temp.open.shift(window))/temp.open.shift(window)
    temp = main.merge(returns_windowed.stack(dropna=False)\
                                                    .rename(f'returns_{window}days')\
                                                    .reset_index())
   
    # Demean
    demean = temp.groupby(["date", "sector"])\
                   .mean()[f"returns_{window}days"]\
                   .reset_index()\
                   .rename(columns={f"returns_{window}days": "mean_returns"})
    temp = temp.merge(demean)
    temp['demean_returns'] = temp[f"returns_{window}days"] - temp["mean_returns"]
    
    # Rank
    temp['demean_rank'] = temp[['date', 'name', 'demean_returns']].groupby(['date'])['demean_returns']\
                                                              .rank(method='first', ascending=False)
    # Zscore
    temp['momentum_annual'] = zscore(temp, 'demean_rank')
    return temp[['date','name','momentum_annual']]


def smooth_momentum_reversion(universe, window):
    
    temp = universe.copy()
    temp["momentum_reversion"] = -momentum(temp, window)['momentum_annual']

    # Moving average
    moving_avg = temp[['date', 'name', 'momentum_reversion']]\
                        .pivot_table(index='date' ,columns='name', dropna=False)\
                        .rolling(window)\
                        .mean()\
                        .rename(columns={"momentum_reversion": f"mma_{window}days"})\
                        .stack(dropna=False)\
                        .reset_index()
    
    universe = universe.merge(moving_avg)
    
    # Rank
    universe['mma_rank'] = universe[['date', 'name', f"mma_{window}days"]].groupby(['date'])[f"mma_{window}days"]\
                                                              .rank(method='first', ascending=False)
    # Zscore    
    temp[f'momentum_reversion_{window}_days'] = zscore(universe, 'mma_rank')
    return temp[['date','name',f'momentum_reversion_{window}_days']]


def smooth_overnight_sentiment(universe, window):
    
    # Overnight returns
    temp = universe.copy()
    op = temp[["date", "open", "name"]].pivot_table(index='date', columns='name', dropna=False)\
        .rename(columns={"open": "overnight_returns"})
    cl = temp[["date", "close", "name"]].pivot_table(index='date', columns='name', dropna=False).shift(1)\
        .rename(columns={"close": "overnight_returns"})
    
    temp = universe.merge(((op - cl)/cl).stack(dropna=False).reset_index())
    
    # Rolling mean * window
    # The overnight return during week is the average daily return for that week, multiplied by 5
    temp["overnight_returns"] = (temp[["date", "name", "overnight_returns"]]
                                        .pivot_table(index='date', columns='name', dropna=False)\
                                        .rolling(window)\
                                        .mean()*window)\
                                        .stack(dropna=False)\
                                        .reset_index()\
                                        .overnight_returns
    
    # Overnight rank
    temp['overnight_rank'] = rank(temp, 'overnight_returns')
    
    # Overnight zscore
    temp['overnight_rank'] = zscore(temp, 'overnight_rank')

    # Smoothing: Moving average    
    temp = temp.merge(smoothing(temp, 'overnight_rank', window))
    
    # Rank
    temp['overnight_rank'] = rank(temp, f'overnight_rank_smooth_{window}_days')
    
    # Zscore
    temp['overnight_rank'] = zscore(temp, 'overnight_rank')
    
    return temp[['date','name','overnight_rank']]


def annualized_volatility(universe, window):
    
    temp = universe.copy()
    temp['returns'] = universe.close - universe.open
    temp['av'] = (temp[["date", "name", "returns"]]
                    .pivot_table(index='date', columns='name', dropna=False)\
                    .rolling(window)\
                    .std()* (252 ** .5))\
                    .stack(dropna=False)\
                    .reset_index()\
                    .returns
    
    temp['av'] = rank(temp, 'av')
    temp[f'annualized_volatility_{window}_days'] = zscore(temp, 'av')
    
    return temp[['date','name',f'annualized_volatility_{window}_days']]


def average_dollar_volume(universe, window):
    
#     avg_$_vol = sum(close*volume)/window
    
    temp = universe.copy()
    temp['usd_volume'] = universe.close * universe.volume
    temp['avg_usd_volume'] = temp[["date", "name", "usd_volume"]]\
                                .pivot_table(index='date', columns=['name'], dropna=False)\
                                .rolling(window)\
                                .mean()\
                                .stack(dropna=False)\
                                .reset_index()\
                                .usd_volume
    
    temp['avg_usd_volume'] = rank(temp, 'avg_usd_volume')
    temp['avg_usd_volume'] = zscore(temp, 'avg_usd_volume')
    
    return temp[['date','name','avg_usd_volume']]


def market_dispersion(universe):

    temp = universe.copy()
    temp['daily_returns'] = (temp.close - temp.open)/(temp.open)
    
    temp['market_mean']=temp[["date", "name", "daily_returns"]]\
                            .groupby('date')\
                            .transform('mean')\
                            .daily_returns
    temp['market_dispersion'] = np.sqrt((temp['daily_returns'] - temp['market_mean'])**2)
    
    return temp[['date','name','market_dispersion']]  


def market_volatility(universe, window=1):

    temp = universe.copy()
    temp['daily_returns'] = (universe.close - universe.open)/(universe.open)
    
    temp['market_mean'] = temp[["date", "daily_returns"]].groupby('date')\
                                                         .transform('mean')\
                                                         .rolling(window).mean()

    temp['daily_demean_sq'] = (temp['daily_returns'] - temp['market_mean'])**2   
    temp[f'market_volatility_{window}_days'] = np.sqrt((temp[["date", "daily_demean_sq"]]\
                                                        .groupby('date')\
                                                        .transform('mean'))*252)
    
    return temp[['date','name',f'market_volatility_{window}_days']]

In [22]:
features = main_df[['date', 'name', 'sector']]
features = features.astype({'sector':'int32'})

In [23]:
%%time
features = features.merge(momentum(main_df, 252))
features = features.merge(smooth_momentum_reversion(main_df, 20))
features = features.merge(smooth_overnight_sentiment(main_df, 5))
features = features.merge(annualized_volatility(main_df, 20))
features = features.merge(annualized_volatility(main_df, 120))
features = features.merge(average_dollar_volume(main_df, 20))
features = features.merge(market_dispersion(main_df))
features = features.merge(smoothing(features, "market_dispersion", 20))
features = features.merge(smoothing(features, "market_dispersion", 120))
features = features.merge(market_volatility(main_df, 20))
features = features.merge(market_volatility(main_df, 120))

Wall time: 38.4 s


#### 4.2. Stock exchange calendar features

In [24]:
features['january'] = (main_df.date.dt.month == 1).astype(int)
features['december'] = (main_df.date.dt.month == 12).astype(int)
features['weekday'] = main_df.date.dt.weekday
features['quarter'] = main_df.date.dt.quarter
features['year'] = main_df.date.dt.year

last_day_m = pd.date_range(start=start_date, end=end_date, freq='BM')
first_day_m = pd.date_range(start=start_date, end=end_date, freq='BMS')
last_day_q = pd.date_range(start=start_date, end=end_date, freq='BQ')
first_day_q = pd.date_range(start=start_date, end=end_date, freq='BQS')

features['month_end'] = (main_df.date.isin(last_day_m)).astype(int)
features['month_start'] = (main_df.date.isin(first_day_m)).astype(int)
features['quarter_end'] = (main_df.date.isin(last_day_q)).astype(int)
features['quarter_start'] = (main_df.date.isin(first_day_q)).astype(int)

#### 4.3. Processing. Sector one-hot encoding

In [25]:
from sklearn.preprocessing import OneHotEncoder

features.astype({'sector':'category'}, inplace=True)
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore',)
ohe.fit(features[['sector']])

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='ignore',
              n_values=None, sparse=False)

In [26]:
sector_one_hot = pd.DataFrame(ohe.transform(features[['sector']]), 
                              columns=ohe.get_feature_names(input_features=['sector'*len(ohe.categories_)]))
features.drop('sector', axis=1, inplace=True)
features = pd.concat([features, sector_one_hot], axis=1)

In [27]:
features.tail()

Unnamed: 0,date,name,momentum_annual,momentum_reversion_20_days,overnight_rank,annualized_volatility_20_days,annualized_volatility_120_days,avg_usd_volume,market_dispersion,market_dispersion_smooth_20_days,...,sector_1,sector_2,sector_3,sector_4,sector_5,sector_6,sector_7,sector_8,sector_9,sector_10
1094349,2020-02-14,ZION,-1.242077,0.687095,-0.951721,1.514784,1.377141,-0.207115,0.002664,0.007499,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1094350,2020-02-18,ZION,-1.266273,1.117718,-0.975918,0.832522,0.203024,-1.059943,0.018892,0.008249,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1094351,2020-02-19,ZION,-1.274339,1.213882,-1.419516,1.336096,1.395793,-0.702567,0.010927,0.008255,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1094352,2020-02-20,ZION,-1.274339,0.594554,-1.516302,1.555395,1.56498,-0.962477,0.010951,0.008401,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1094353,2020-02-21,ZION,-1.274339,1.015697,-1.032376,1.668957,1.563747,0.976951,0.004655,0.008389,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


<br>
<br>

## 5. Target labels
We are predicting movement for the next week. Podział na kwantyle - neutralne w stosunku do rynku

In [28]:
def targets(universe, quantiles):
    # labels from 0 to quantiles-1
    # NaNs labeled with -1
    
    # Weekly returns
    temp = universe.copy()
    temp['open_shift'] = temp[["date", "name", "open"]]\
                            .pivot_table(index='date', columns=['name'], dropna=False)\
                            .shift(5)\
                            .stack(dropna=False)\
                            .reset_index()\
                            .open
    
    temp['returns_weekly'] = (temp['close'] - temp['open_shift'])/(temp['open_shift'])
    
    # Quantiles
    
    mask = temp['returns_weekly'].isna()
    temp['label'] = temp[["date", "returns_weekly"]].groupby('date')['returns_weekly']\
                        .transform(lambda x: pd.qcut(x.array, quantiles, labels=False, duplicates='drop'))
    
    temp['label'] = temp['label'].where(~mask, -1)
    
    return temp[['date', 'name', 'label']]

In [29]:
features = features.merge(targets(main_df, 5))

# TODO___