# Stock Value Prediction

In this notebook, we build the actual prediction system by testing several approaches and comparing their accuracy across multiple time horizons (the `target_days` variable).

First, we load all required libraries:

In [2]:
import pandas as pd
import numpy as np
import sys, os
from datetime import datetime
sys.path.insert(1, '..')
import recommender as rcmd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

Next, we create the input data pipelines for stock and statement data. To do that, we have to split the data into training and test sets. There are two options:

* Splitting the list of symbols
* Splitting the resulting list of generated stock data points

We will use the first option to ensure a clean split. Because the generated data has overlapping time frames, the second option could place data points in the test set that the system has effectively already seen during training.
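The key property of a symbol-level split is that the training and test symbol sets are disjoint. The following is a minimal sketch of that idea using only the standard library. The helper `split_symbols` and the made-up symbol names are assumptions for illustration and are not part of the `recommender` package.

```python
import random

def split_symbols(symbols, n_sample, n_train, seed=42):
    """Draw n_sample distinct symbols and split them into train/test lists."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no symbol can be picked twice
    picked = rng.sample(sorted(symbols), n_sample)
    return picked[:n_train], picked[n_train:]

# hypothetical symbol list, for illustration only
symbols = [f"SYM{i}" for i in range(100)]
train, test = split_symbols(symbols, n_sample=40, n_train=30)

# no symbol appears on both sides of the split
assert set(train).isdisjoint(test)
```

Sampling without replacement matters here: if a symbol could be drawn twice, it might land in both sets and leak its price history from training into testing.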

In [3]:
# create cache object
cache = rcmd.stocks.Cache()

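# load list of all available stocks and sample a sub-list of distinct symbols
stocks = cache.list_data('stock')
# replace=False keeps symbols unique, so train and test cannot share a symbol
sample = np.random.choice(list(stocks.keys()), 2000, replace=False)
# split the sampled symbols into training and test sets
sample_train = sample[:1500]
sample_test = sample[1500:]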

# generate sample data
df_train = rcmd.learning.preprocess.create_dataset(sample_train, stocks, cache, 14, 66, (-.5, .5))
df_test = rcmd.learning.preprocess.create_dataset(sample_test, stocks, cache, 14, 66, (-.5, .5))
df_train.head()

[-0.5  -0.25  0.    0.25  0.5 ]


  df['expenses_research_netcash'] = np.divide(df['expenses_research'], df['cash_net'])
  df['expenses_research_netcash'] = np.divide(df['expenses_research'], df['cash_net'])
  df['pe_ratio'] = np.divide(df[col_price], df['eps_diluted'])
  df['cash_share'] = np.divide(df['cash_net'], np.divide(df['shareholder_equity'], df[col_price]))


[-0.5  -0.25  0.    0.25  0.5 ]


  df['expenses_research_netcash'] = np.divide(df['expenses_research'], df['cash_net'])
  df['expenses_research_netcash'] = np.divide(df['expenses_research'], df['cash_net'])
  df['pe_ratio'] = np.divide(df[col_price], df['eps_diluted'])


| date | day_1 | day_2 | day_3 | day_4 | day_5 | day_6 | day_7 | day_8 | day_9 | day_10 | ... | dividend_share_growth_5y | dividend_share_growth_10y | revenue_share_growth_3y | revenue_share_growth_5y | revenue_share_growth_10y | pe_ratio | cash_share | target | target_cat | symbol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009-05-29 | -0.06503 | -0.06503 | -0.038056 | -0.038056 | -0.026796 | -0.026796 | -0.033958 | -0.033958 | -0.007198 | -0.007198 | ... | -0.0282 | 0.0 | -0.0107 | -0.0415 | -0.0125 | 55.027452 | -8.359794 | 0.013145 | 3 | AYI |
| 2009-05-29 | 0.07536 | 0.141144 | 0.162709 | 0.08151 | 0.132347 | 0.141845 | 0.120981 | 0.150019 | 0.128377 | 0.14239 | ... | 0.4137 | 0.2213 | -0.0182 | 0.069 | 0.1124 | -107.041669 | -6.389111 | 0.246215 | 3 | CMC |
| 2009-05-29 | -0.047218 | 0.0 | -0.033727 | -0.038786 | -0.045531 | -0.048904 | -0.045531 | -0.047218 | -0.038786 | 0.011804 | ... | 0.0 | 0.0 | -0.1277 | -0.0803 | -0.1035 | -15.605263 | -2.448446 | -0.015177 | 2 | FC |
| 2009-05-29 | -0.111023 | -0.011102 | 0.019826 | 0.004758 | -0.012688 | -0.011895 | 0.086439 | 0.079302 | 0.053925 | 0.077716 | ... | 0.0 | 0.0 | -0.0107 | 0.0612 | 0.0983 | 97.000004 | -0.168783 | 0.568236 | 5 | KMX |
| 2009-05-29 | 0.182652 | 0.220847 | 0.210011 | 0.133774 | 0.160467 | 0.148683 | 0.118967 | 0.161543 | 0.195102 | 0.217953 | ... | 0.3797 | 0.0 | 0.1863 | 0.196 | -0.0186 | 118.290904 | 6.649403 | 0.118513 | 3 | MOS |
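The printed array `[-0.5 -0.25 0. 0.25 0.5]` looks like the bin edges that turn `target` into `target_cat`. The sketch below is an assumption inferred from the five rows shown above, not the actual code inside `create_dataset`. The helper `to_category` is hypothetical, and how the real implementation treats values that fall exactly on an edge is unknown.

```python
from bisect import bisect_right

# bin edges as printed in the cell output above
BINS = [-0.5, -0.25, 0.0, 0.25, 0.5]

def to_category(target, bins=BINS):
    """Return the index of the bin that target falls into (0 .. len(bins))."""
    # counts how many edges are <= target
    return bisect_right(bins, target)

# (target, target_cat) pairs copied from the rows shown above
rows = [(0.013145, 3), (0.246215, 3), (-0.015177, 2), (0.568236, 5), (0.118513, 3)]
for target, expected in rows:
    assert to_category(target) == expected
```

Under this reading, five edges give six classes (0 to 5), with class 0 for returns below -0.5 and class 5 for returns of 0.5 or more.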


Before we create the actual prediction systems, we have to define the metrics by which we measure their success.
As we have two approaches (classification and regression), we will use two types of metrics:

* Precision, recall, and accuracy (classification)
* RMSE (regression)

In [None]:
# TODO: metrics
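Since the metrics cell is still a placeholder, here is one way the four metrics named above could be computed. It is a plain-Python sketch under stated assumptions, not the notebook's eventual implementation: the function names are hypothetical, precision and recall are computed per class label, and a metric with an empty denominator is reported as 0.0.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # assumption: report 0.0 when the denominator is empty
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def rmse(y_true, y_pred):
    """Root mean squared error for regression targets."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# made-up labels and predictions, for illustration only
y_true = [3, 3, 2, 5, 3]
y_pred = [3, 2, 2, 5, 4]
acc = accuracy(y_true, y_pred)
prec, rec = precision_recall(y_true, y_pred, label=3)
err = rmse([0.0, 1.0], [0.0, 3.0])
```

In practice, scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `mean_squared_error` cover the same ground and handle multi-class averaging.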

## Baseline Classification

The first step is to create a baseline for both approaches (classification and regression). For regression, our target value will be `target`; for classification, it will be `target_cat` (which we might convert into a one-hot vector along the way).

Let's start with the simpler form of classification:
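To make the one-hot conversion concrete, here is a small sketch. It assumes six classes (labels 0 to 5), which is what the five printed bin edges suggest. The helper `one_hot` is hypothetical and written in plain Python for clarity.

```python
def one_hot(categories, n_classes):
    """Encode integer class labels as one-hot lists of length n_classes."""
    vectors = []
    for cat in categories:
        row = [0] * n_classes
        row[cat] = 1  # mark the position of the class label
        vectors.append(row)
    return vectors

# target_cat values from the rows shown earlier; n_classes=6 is an assumption
encoded = one_hot([3, 3, 2, 5, 3], n_classes=6)
```

With a DataFrame at hand, `pd.get_dummies(df_train['target_cat'])` produces a similar encoding, although it only creates columns for classes that actually occur in the data.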