# Project: Financial time series forecasting

The project can be done individually or in pairs, but you are not allowed to share your solution with anyone else. Read the instructions below carefully!

- The aim of the project is that you learn how to set up an analytics project end-to-end. A secondary aim is that you understand how to work with a time series data set and forecast based on such data. A third aim is that you gain insight into how to interpret data and results.

- The solution must address each grade in the written order. Therefore, to complete grade five you must have completed the other grades first.

- Information may unintentionally be missing from the description. Please go through it well in advance so that you have time to ask for assistance.


<a id='toc'></a>
# TOC

[Grade 1](#g1)

[Grade 2](#g2)

[Grade 3](#g3)

[Grade 4](#g4)

[Grade 5](#g5)


In [2]:
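# First upgrade the environment.
# https://pypi.org/project/yfinance
import sys
from subprocess import run
# add what you will need
modules = [
#     'pandas_datareader',
    'yfinance',
    'pandas_market_calendars',
    'plotly',
    'numpy',
    'scikit-learn',  # the PyPI package is named scikit-learn; it is imported as sklearn
    'pandas'
]
# Call pip through the running interpreter so packages land in the notebook's environment
proc = run([sys.executable, '-m', 'pip', 'install', *modules, '--upgrade', '--no-input'],
           text=True,
           capture_output=True,
           timeout=600)  # installs can easily take longer than 40 seconds
print(proc.stderr)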




In [3]:
# Run this if you need to check your modules
# (get_installed_distributions is a pip internal that newer pip releases no longer provide,
#  so the standard library's importlib.metadata is used instead)
from importlib.metadata import distributions

with open("modules.txt", "a") as file_object:
    for dist in distributions():
        line = f"{dist.metadata['Name']} {dist.version}"
        file_object.write(line + '\n')
        print(line)

zope.interface 5.1.2
zope.event 4.5.0
zipp 3.4.0
zict 2.0.0
yfinance 0.1.59
yapf 0.30.0
xmltodict 0.12.0
xlwt 1.3.0
xlwings 0.20.8
XlsxWriter 1.3.7
xlrd 1.2.0
wrapt 1.11.2
wincertstore 0.2
win-unicode-console 0.5
win-inet-pton 1.1.0
widgetsnbextension 3.5.1
wheel 0.35.1
Werkzeug 1.0.1
webencodings 0.5.1
wcwidth 0.2.5
watchdog 0.10.3
urllib3 1.25.11
unicodecsv 0.14.1
ujson 4.0.1
typing-extensions 3.7.4.3
traitlets 5.0.5
trading-calendars 2.1.1
tqdm 4.50.2
tornado 6.0.4
toolz 0.11.1
toml 0.10.1
tifffile 2020.10.1
threadpoolctl 2.1.0
testpath 0.4.4
terminado 0.9.1
tblib 1.7.0
tables 3.6.1
sympy 1.6.2
statsmodels 0.12.0
SQLAlchemy 1.3.20
spyder 4.1.5
spyder-kernels 1.9.4
sphinxcontrib-websupport 1.2.4
sphinxcontrib-serializinghtml 1.1.4
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-htmlhelp 1.0.3
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-applehelp 1.0.2
Sphinx 3.2.1
soupsieve 2.0.1
sortedcontainers 2.2.2
sortedcollections 1.2.1
snowballstemmer 2.0.0
sklearn 0.0
six 1.

In [9]:
import pandas as pd
from pathlib import Path
import numpy as np
import datetime as dt
import yfinance as yf
from random import randrange

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, FuncFormatter, StrMethodFormatter
%matplotlib inline

import plotly as ply
import plotly.graph_objects as go

import sklearn
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

from functools import reduce
from operator import mul
from pprint import PrettyPrinter
pprint = PrettyPrinter().pprint

<a id='g1'></a>
# Grade 1
## Implement a complete process for forecasting a single stock.

You should do the following steps:
- Use the [EURUSD data set](https://people.arcada.fi/~parland/hjd5_8amp_Gt3/EURUSD1m.zip) (52Mb)
- Calculate the feature [Larry Williams' %R](https://www.investopedia.com/terms/w/williamsr.asp) from the paper [Predicting the Direction of Stock Market Index Movement Using an Optimized Artificial Neural Network Model](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4873195) (implement it in code and insert the values in complementary columns). Note that you need to implement your own calculation of each feature and be able to explain the code.
- [Normalize or standardize](https://scikit-learn.org/stable/modules/preprocessing.html)
- Create a label for your forecast by shifting the Close value 1 step. You will predict one day ahead.
- Drop all data other than the Close and the features used for inference
- Split the data 80/20 (train/test). Be careful, you are splitting a time series, so the split must be chronological
- Set up a [linear model](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
- Fit/train the linear model to the training data
- Forecast 1 day ahead based on test data
- Calculate the [R² score](https://en.wikipedia.org/wiki/Coefficient_of_determination) on both the training data set and the test data set. Please format numbers to four [significant digits](https://en.wikipedia.org/wiki/Significant_figures).
- Compare the errors and explain the outcome

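The steps listed above can be sketched roughly as follows. This is only an illustration on synthetic random-walk prices, not the expected solution: the column names `close`, `feature` and `label`, the rolling-mean stand-in feature and the choice of `StandardScaler` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic random-walk prices stand in for the real data set (assumption).
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 500).cumsum(), name="close")
frame = pd.DataFrame({"close": close, "feature": close.rolling(5).mean()})

# Label: the next step's close. shift(-1) leaves NaN in the last row, so drop it.
frame["label"] = frame["close"].shift(-1)
frame = frame.dropna()

# Chronological 80/20 split: no shuffling, the test set is strictly later in time.
split = int(len(frame) * 0.8)
train, test = frame.iloc[:split], frame.iloc[split:]

# Fit the scaler on the training part only, so no test statistics leak into training.
features = ["close", "feature"]
scaler = StandardScaler().fit(train[features])
X_train = scaler.transform(train[features])
X_test = scaler.transform(test[features])

model = LinearRegression().fit(X_train, train["label"])
r2_train = model.score(X_train, train["label"])
r2_test = model.score(X_test, test["label"])

# The .4g format prints four significant digits.
print(f"R2 train: {r2_train:.4g}, R2 test: {r2_test:.4g}")
```

Fitting the scaler on the training rows alone is a deliberate choice here: statistics computed over the full series would let information from the test period influence the training inputs.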

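### **Larry Williams' %R**

$ \frac{H_n - C_t}{H_n - L_n}\times100 $

where $H_n$ and $L_n$ are the highest high and the lowest low over the last $n$ periods, and $C_t$ is the closing price at time $t$.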

### Coefficient of determination ($R^2$)

$$R^2 = 1 - \frac {SSResid}{SSTot}$$

#### Residual Sum of Squares: $SSResid = \sum_{i} (y_i - \hat{y_i})^2$

#### Total Sum of Squares: $SSTot = \sum_{i} (y_i - \bar{y})^2$

#### A baseline model, which always predicts $\bar {y}$, will have $R^2 = 0$
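
As a small worked example of the definitions above, the sketch below computes $R^2$ by hand and compares it with `sklearn.metrics.r2_score`. The helper name `r_squared` and the toy numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred):
    """R^2 = 1 - SSResid / SSTot, following the formulas above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_resid = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_resid / ss_tot


y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# SSResid = 0.75 and SSTot = 20, so R^2 = 1 - 0.75 / 20 = 0.9625
print(f"{r_squared(y, y_hat):.4g}")

# A baseline that always predicts the mean of y gives R^2 = 0
baseline = np.full_like(y, y.mean())
print(r_squared(y, baseline))
```

Note that $R^2$ can become negative on a test set when the model predicts worse than the mean of the test labels.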

In [7]:
# Import data and create dataframe
DATAFILE_NAME = "EurUsd1m.pickle"
if Path(DATAFILE_NAME).is_file():  # Check whether the data file is already cached
    df = pd.read_pickle(DATAFILE_NAME)  # Read the cached copy
else:
    df = pd.read_csv('https://people.arcada.fi/~parland/hjd5_9amp_Gt3/EURUSD1m.zip',
                     compression='zip')
    df.to_pickle(DATAFILE_NAME)  # Save locally

symbol = 'MU'
end = dt.datetime.now() 
start = end - dt.timedelta(days=10*365)
data = yf.Ticker(symbol).history(start=start, end=end)
#data.resample('1D').sum()
data

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 2011-04-18 | 10.550000 | 10.610000 | 10.260000 | 10.420000 | 23112400 | 0 | 0 |
| 2011-04-19 | 10.440000 | 10.580000 | 10.320000 | 10.520000 | 20529600 | 0 | 0 |
| 2011-04-20 | 10.900000 | 11.410000 | 10.890000 | 11.390000 | 52613700 | 0 | 0 |
| 2011-04-21 | 11.470000 | 11.720000 | 11.240000 | 11.520000 | 43506100 | 0 | 0 |
| 2011-04-25 | 11.460000 | 11.490000 | 11.250000 | 11.330000 | 20447100 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-04-08 | 95.000000 | 96.389999 | 93.779999 | 95.290001 | 17815600 | 0 | 0 |
| 2021-04-09 | 94.339996 | 95.379997 | 93.309998 | 95.300003 | 14066700 | 0 | 0 |
| 2021-04-12 | 95.190002 | 96.959999 | 94.750000 | 95.589996 | 18818300 | 0 | 0 |
| 2021-04-13 | 96.290001 | 96.820000 | 91.400002 | 92.150002 | 27102100 | 0 | 0 |


In [11]:
# Grade 1: toy example of building a label and a chronological split
columns = ['close', 'label']
close = np.array([randrange(10, 20, 1) for _ in range(5)])
# np.roll wraps around, so the last label equals the first close and is not a valid target
label, close = np.roll(close, -1).reshape(-1, 1), close.reshape(-1, 1)
d = pd.DataFrame(np.concatenate([close, label], axis=1), columns=columns)

# Note: Normalizer rescales each row to unit norm; it does not scale per column
train = pd.DataFrame(Normalizer().fit_transform(d[:3]), columns=columns)
train.index.name = 'train'
test = d[3:]; test.index.name = 'test'
display(d, train, test)


# Williams %R over n periods = (Hn - Ct) / (Hn - Ln) × 100
# (assumes a frame with 'High', 'Low' and 'Close' columns, such as `data` above)
def williams_r(df, n=15):
    highest_high = df['High'].rolling(window=n).max()
    lowest_low = df['Low'].rolling(window=n).min()
    return 100 * (highest_high - df['Close']) / (highest_high - lowest_low)


# here is an example of Disparity in 5 days = Ct/MA5 × 100
def Disparity_5(df):
    return 100 * df['close'] / df['close'].rolling(window=5).mean()


|   | close | label |
|:---:|:---:|:---:|
| 0 | 19 | 18 |
| 1 | 18 | 14 |
| 2 | 14 | 18 |
| 3 | 18 | 19 |
| 4 | 19 | 19 |


| train | close | label |
|:---:|:---:|:---:|
| 0 | 0.725953 | 0.687745 |
| 1 | 0.789352 | 0.613941 |
| 2 | 0.613941 | 0.789352 |


| test | close | label |
|:---:|:---:|:---:|
| 3 | 18 | 19 |
| 4 | 19 | 19 |


<a id='g2'></a>
# Grade 2
## Illustrate data using plotly (or other) library

- Calculate additional feature [Stochastic slow %D](https://tradingsim.com/blog/slow-stochastics)
- Create a figure based on OHLC candles covering the test period
- Then add line chart(s) that illustrate the *label* (actual data) and the *forecast* in the same figure, on top of the OHLC candles. The lines should have different colors and named series.
- Add subplot(s) with the features so we can see them time-aligned
- What patterns can you observe from the line figure?

### Stochastic %K
<br>
<span style='font-size:20px'>
$\frac{(C_t - L_n)}{(H_n - L_n)}\times100$
</span>
    
### Stochastic %D
<br>
<span style='font-size:20px'>
$\sum\nolimits_{i=0}^{n-1}\frac{\%K_{t-i}}n$
</span>
    
### Stochastic slow %D
<br>
<span style='font-size:25px'>
$\frac{\sum_{i=0}^{n-1}\%D_{t-i}}n$
</span>

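The three stochastic formulas above chain naturally with pandas rolling windows. The following is a minimal sketch, assuming a frame with `High`, `Low` and `Close` columns and using the same window length `n` for all three steps; the function name `stochastic` and the toy data are hypothetical.

```python
import pandas as pd


def stochastic(df, n=3):
    """Return %K, %D and slow %D, each using an n-period window."""
    lowest_low = df["Low"].rolling(n).min()
    highest_high = df["High"].rolling(n).max()
    # %K: position of the close within the n-period high/low range
    k = 100 * (df["Close"] - lowest_low) / (highest_high - lowest_low)
    # %D: n-period moving average of %K
    d = k.rolling(n).mean()
    # slow %D: n-period moving average of %D
    slow_d = d.rolling(n).mean()
    return pd.DataFrame({"%K": k, "%D": d, "slow %D": slow_d})


# Toy data: a steadily rising price, so %K is constant once the window is full
prices = pd.DataFrame({
    "High":  [10, 11, 12, 13, 14, 15, 16],
    "Low":   [8, 9, 10, 11, 12, 13, 14],
    "Close": [9, 10, 11, 12, 13, 14, 15],
})
out = stochastic(prices, n=3)
print(out)
```

Each smoothing step consumes `n - 1` more rows before it produces a value, so the first rows of slow %D are NaN and need to be dropped before training. A flat window where the highest high equals the lowest low makes the denominator zero and needs separate handling.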
<a id='g3'></a>
# Grade 3

- Calculate additional feature [RSI (relative strength index)](https://www.investopedia.com/terms/r/rsi.asp)
- Add the feature as a subplot to the illustration from the previous step
- Set up an [ElasticNet](https://scikit-learn.org/stable/modules/linear_model.html#elastic-net) model
- Fit/train the ElasticNet to the training data
- Forecast and calculate the R² score on both the training data set and the test data set
- Combine line chart(s) that illustrate the *label* (actual data) and the *forecast* from both models in the previous figure.
- Compare the errors and explain the outcome


### RSI
<br>
<span style='font-size:25px'>
$100-\frac{100}{\left(1+\frac{\frac{\sum_{i=0}^{n-1}Up_{t-i}}{\text{n}}}{\frac{\sum_{i=0}^{n-1}Dw_{t-i}}{\text{n}}}\right)} $
    </span>

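The RSI formula above uses simple n-period averages of the upward and downward moves. A minimal sketch of that variant follows; the function name `rsi`, the default `n=14` and the toy prices are assumptions, and other RSI variants (for example with exponential smoothing) would give different numbers.

```python
import pandas as pd


def rsi(close, n=14):
    """RSI with simple n-period averages of up and down moves."""
    delta = close.diff()
    up = delta.clip(lower=0)         # Up_t: positive price changes, else 0
    down = (-delta).clip(lower=0)    # Dw_t: size of negative changes, else 0
    avg_up = up.rolling(n).mean()
    avg_down = down.rolling(n).mean()
    rs = avg_up / avg_down           # relative strength
    return 100 - 100 / (1 + rs)


close = pd.Series([1.0, 2.0, 3.0, 2.0, 3.0])
values = rsi(close, n=2)
print(values)
```

With `n=2`, the window at index 2 contains only upward moves, so the RSI there reaches 100, while the windows at indices 3 and 4 contain equal up and down moves and give 50. A window with no movement at all yields NaN (0 divided by 0).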
<a id='g4'></a>
# Grade 4

- Calculate additional feature [On Balance Volume](https://www.investopedia.com/terms/o/onbalancevolume.asp)
- Create sliding windows for the input data, e.g. with window lengths of 10, 5, and 2 samples. For a window of length n, extract the n rows and flatten them from matrix (2D) form into a vector of size $R\times C$ (i.e. number of rows * number of columns in the window), see [NumPy Array Reshaping](https://www.w3schools.com/python/numpy_array_reshape.asp). You will probably need to create a function that returns a vector (array, tuple, list, Series). Other solutions are also possible. Here is an example of a two-day window for two features, attached to the original data:

|     | a | b | a-1 | b-1 | a-2 | b-2 |
|:---:|:-:|:-:|:---:|:---:|:---:|:---:|
| t₁  | 0 | 1 | nan | nan | nan | nan |
| t₂  | 2 | 3 | 0   | 1   | nan | nan |
| t₃  | 4 | 5 | 2   | 3   | 0   | 1   |
| t₄  | 6 | 7 | 4   | 5   | 2   | 3   |
| t₅  | 8 | 9 | 6   | 7   | 4   | 5   |

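The example table above can be reproduced with shifted copies of the frame, which is one of several possible approaches. The helper name `add_window` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_window(df, n):
    """Attach the n previous rows of every column as extra lagged columns."""
    lagged = [df] + [df.shift(k).add_suffix(f"-{k}") for k in range(1, n + 1)]
    return pd.concat(lagged, axis=1)


df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["a", "b"],
                  index=["t1", "t2", "t3", "t4", "t5"])
windowed = add_window(df, 2)
print(windowed)

# Rows without a complete history contain NaN and must be dropped before fitting
complete = windowed.dropna()
```

Each row of `windowed` is the flattened window for that time step, so it can be passed to a scikit-learn model directly once the incomplete rows are removed.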
- Set up a [Polynomial regression](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions)
- Fit and run all three different models (on all features) for all three different window lengths
- Summarize and compare their R² scores. Is any of them better than the [LinearRegression model](https://en.wikipedia.org/wiki/Linear_regression) without window information attached?

| M\W  | no |  2 |  5 | 10 |
|:----:|:--:|:--:|:--:|:--:|
|  M₁  | R² | R² | R² | R² |
|  M₂  | R² | R² | R² | R² |
|  M₃  | R² | R² | R² | R² |

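Polynomial regression in scikit-learn is usually built as a pipeline of `PolynomialFeatures` followed by a linear model. The sketch below shows this on synthetic quadratic data, which is an assumption chosen so that the difference to a plain linear fit is visible; the degree of 2 is also just an example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship (assumption for illustration).
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] ** 2 + rng.normal(0, 0.1, 200)

split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)

print(f"linear R2 (test):     {linear.score(X_test, y_test):.4g}")
print(f"polynomial R2 (test): {poly.score(X_test, y_test):.4g}")
```

With windowed inputs the number of polynomial terms grows quickly with the number of columns, so a low degree is a sensible starting point.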

### **On Balance Volume**

$ OBV_t = OBV_{t-1} + \theta \times V_t $

where $V_t$ is the volume of trade at time $t$, and 

(*classic definition*) $ \theta = 
\begin{cases}
  +1, & \textit{if} \ C_t > C_{t-1} \\
  0, & \textit{if} \ C_t = C_{t-1} \\
  -1, & \textit{otherwise}
\end{cases}
$  <br /><br /> or  (*definition from the paper*) $ \theta = 
\begin{cases}
  +1, & \textit{if} \ C_t \geq C_{t-1} \\
  -1, & \textit{otherwise}
\end{cases}
$

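Both definitions of $\theta$ can be expressed as a cumulative sum of signed volume. The sketch below assumes that the series starts at $OBV = 0$ for the first row, since there is no previous close to compare with; the function name `on_balance_volume` and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def on_balance_volume(close, volume, classic=True):
    """OBV_t = OBV_{t-1} + theta * V_t, starting from 0 (assumption)."""
    change = close.diff()
    if classic:
        # +1 if the close rose, 0 if unchanged, -1 if it fell
        theta = np.sign(change)
    else:
        # definition from the paper: +1 if C_t >= C_{t-1}, else -1
        theta = pd.Series(np.where(change >= 0, 1.0, -1.0), index=close.index)
    theta.iloc[0] = 0.0  # the first row has no previous close
    return (theta * volume).cumsum()


close = pd.Series([10, 11, 11, 10, 12])
volume = pd.Series([100, 200, 300, 400, 500])
obv_classic = on_balance_volume(close, volume, classic=True)
obv_paper = on_balance_volume(close, volume, classic=False)
print(obv_classic.tolist())
print(obv_paper.tolist())
```

The two definitions differ only on rows where the close is unchanged (index 2 in this toy data), where the paper's version adds the volume and the classic one leaves OBV as it was.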

<a id='g5'></a>
# Grade 5
Implement an investment decision to either buy or sell based on some signals which you choose to detect. 
See the paper for how this can be done, the easiest solution is to hand-craft this decision to either buy or sell.

- Compare the regression forecast with the known Close price.
- Once the forecast goes above the Close price, you can define a buy opportunity
- You can decide to hold if the forecasted difference is small, or based on other signals.
- Calculate the hit ratio of your investment decision for each of the windows
- Forecast one week ahead and compare the hit ratio with that of the one-day-ahead forecast
- Which setup was the best, and why was that?

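One simple hand-crafted decision rule and its hit ratio could be sketched as follows. This is an assumption-laden illustration: a hit is defined here as the sign of the forecasted move matching the sign of the realized move, `forecast` at time t is taken to be the predicted close at t + `horizon`, and the names `hit_ratio` and `threshold` are hypothetical.

```python
import numpy as np
import pandas as pd


def hit_ratio(close, forecast, horizon=1, threshold=0.0):
    """Share of buy/sell signals whose direction matched the realized move."""
    future = close.shift(-horizon)       # realized close `horizon` steps ahead
    expected_move = forecast - close
    # +1 = buy, -1 = sell, 0 = hold when the forecasted move is too small
    signal = np.sign(expected_move).where(expected_move.abs() > threshold, 0)
    actual = np.sign(future - close)
    traded = (signal != 0) & future.notna()
    if traded.sum() == 0:
        return float("nan")
    return float((signal[traded] == actual[traded]).mean())


close = pd.Series([10.0, 11.0, 10.0, 12.0, 13.0])
forecast = pd.Series([10.5, 11.5, 11.5, 12.05, 14.0])

ratio_all = hit_ratio(close, forecast, horizon=1, threshold=0.0)
ratio_hold = hit_ratio(close, forecast, horizon=1, threshold=0.1)
print(ratio_all, ratio_hold)
```

For a one-week forecast the same function could be called with a larger `horizon` (for example 5 trading days on daily data), provided the forecast series was produced for that horizon.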

# Happy coding! 👩‍💻

