
# Strategy Learner

## Goal

The main purpose of this project was to produce a learner that can:
* Take raw numerical data for a set of stock tickers
* Output a strategy of buy, short, or hold for each day
* Maximize the 21-day portfolio return over these tickers

## Input

The input is one CSV file per ticker, each containing the following columns:
* Date
* Open
* High
* Low
* Close
* Volume
* Adj Close

Below is an example of the input for AAPL (Apple):

In [11]:
import pandas as pd

AAPL = pd.read_csv('data/AAPL.csv')
print(AAPL)

            Date    Open    High     Low   Close    Volume  Adj Close
0     2012-09-12  666.85  669.90  656.00  669.79  25410600     669.79
1     2012-09-11  665.11  670.10  656.50  660.59  17987400     660.59
2     2012-09-10  680.45  683.29  662.10  662.74  17428500     662.74
3     2012-09-07  678.05  682.48  675.77  680.44  11773800     680.44
4     2012-09-06  673.17  678.29  670.80  676.27  13971300     676.27
5     2012-09-05  675.57  676.35  669.60  670.23  12013400     670.23
6     2012-09-04  665.76  675.14  664.50  674.97  13139000     674.97
7     2012-08-31  667.25  668.60  657.25  665.24  12082900     665.24
8     2012-08-30  670.64  671.55  662.85  663.87  10810700     663.87
9     2012-08-29  675.25  677.67  672.60  673.47   7243100     673.47
10    2012-08-28  674.98  676.10  670.67  674.80   9550600     674.80
11    2012-08-27  679.99  680.87  673.54  675.68  15250300     675.68
12    2012-08-24  659.51  669.48  655.55  663.22  15619300     663.22
13    2012-08-23  66

## Design Choices

I chose to use the Q-Learner that I wrote previously to be the learner for the problem. In doing this I needed to choose:

* States
* Actions
* Rewards

The design choices for each of these are detailed below:

### Technical Indicators

#### Bollinger Bands

In [8]:
%%latex
\begin{align}
BB_{n}(t) = \frac{price(t) - SMA_{n}(t)}{stdev_{n}(t)}
\end{align}


#### Momentum

In [3]:
%%latex
\begin{align}
momentum_{n}(t) = \frac{price(t)}{price(t-n)} - 1
\end{align}


#### Volatility

In [4]:
%%latex
\begin{align}
volatility_{n}(t) = stdev_{n}(dr(t))
\end{align}


#### Daily Return

In [6]:
%%latex
\begin{align}
dr(t) = \frac{price(t)}{price(t-1)} - 1
\end{align}


For each indicator (BB, momentum, and the rolling stdev), the time
interval n was set to 3. This was a somewhat arbitrary choice, settled on
after trying various values. A value of 3 seemed to work well with the
data: it gives the technical indicators a short window, which seemed to
pair better with the reward of 21-day returns. Had we been looking at
returns over longer periods of time, it might have been beneficial to
increase the time interval accordingly.
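As a rough illustration, the four indicators above can be computed with pandas rolling windows. This is a minimal sketch rather than the project's actual code: the function name `compute_indicators`, the column names, and the use of the sample standard deviation (pandas' default) are assumptions, and `price` is assumed to be a Series sorted oldest-first.

```python
import pandas as pd


def compute_indicators(price, n=3):
    """Return BB, momentum, volatility and daily return for a price Series.

    Hypothetical helper: assumes `price` is sorted in ascending date order.
    """
    sma = price.rolling(window=n).mean()    # SMA_n(t)
    stdev = price.rolling(window=n).std()   # rolling stdev of price (sample stdev, an assumption)
    dr = price / price.shift(1) - 1         # dr(t)
    return pd.DataFrame({
        "bb": (price - sma) / stdev,
        "momentum": price / price.shift(n) - 1,
        "volatility": dr.rolling(window=n).std(),
        "daily_return": dr,
    })


# Made-up prices, purely for illustration.
prices = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])
indicators = compute_indicators(prices, n=3)
print(indicators)
```

The first few rows of each column come out as NaN, because a rolling window of size n needs n observations before it can produce a value.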

Here is a sample of what these indicators look like on the AAPL ticker over a year’s worth of data:

![title](figure_1.png)

After the indicators were collected, they were normalized using standard (z-score) normalization, where the mean and standard deviation are taken over the whole period, i.e. for each indicator:

In [10]:
%%latex
\begin{align}
indicator_{norm}(t) = \frac{indicator(t) - indicator_{\mu}}{indicator_{\sigma}}
\end{align}

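The normalization step can be sketched in one line of pandas. The helper name `normalize` and the toy data are assumptions for illustration. Note that pandas skips NaN values and uses the sample standard deviation by default, which may differ from what the project used.

```python
import pandas as pd


def normalize(indicators):
    """Z-score each column using its mean and stdev over the whole period."""
    return (indicators - indicators.mean()) / indicators.std()


# Made-up indicator values, purely for illustration.
df = pd.DataFrame({"bb": [1.0, 2.0, 3.0], "momentum": [0.0, 0.1, 0.2]})
normed = normalize(df)
print(normed)
```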

Lastly, wherever the reward was NaN it was filled with zero. This was done mostly to indicate to the learner that, no matter what trading is done at those points, we won't really be able to change the return, since the metric is not yet available.


## Discretization, States and Actions

From observation, the normalized indicators appeared to roughly follow a standard normal distribution. To discretize them, a bucket size of 0.5 was selected and the lower and upper cutoffs were designated at 2.6.

The explicit formula for converting a normalized indicator into a state value was:

$$state(t) = int\left(\frac{indicator(t) + 2.4}{0.5}\right)$$

The result was truncated to an int, which fell between 0 and 9 for most of the indicator values I saw. Just in case, values above 9 were capped at 9 and values below 0 were set to 0. I felt comfortable doing this because, under the assumption of a standard normal, the probability of getting a value above 2.4 or below -2.4 is low. For the upper tail:

In [12]:
%%latex
\begin{align}
\int_{2.4}^{\infty} \mathcal{N}(x;\,0,\,1)\,dx = 0.00819754
\end{align}

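The tail probability can be checked with the complementary error function from the standard library. This is only a sanity-check sketch; the helper name `upper_tail` is an assumption. The integral above covers one tail, so the probability of landing outside [-2.4, 2.4] on either side is twice that value.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


one_sided = upper_tail(2.4)
two_sided = 2 * one_sided  # P(|Z| > 2.4), by symmetry
print(one_sided, two_sided)
```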


Because we were using three indicators, each taking values from 0 to 9, the states were computed by concatenating the digits together. This creates an integer between 0 and 999, giving a state space of 1000 states.
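The discretization and the digit concatenation can be sketched as follows. The function names and the order of the digits (BB, then momentum, then volatility) are assumptions; only the bucket size of 0.5, the offset of 2.4, and the clamping to 0..9 come from the description above.

```python
BUCKET = 0.5  # bucket width from the text
OFFSET = 2.4  # shift so that an indicator value of -2.4 maps to bucket 0


def discretize(value):
    """Map a normalized indicator value to an integer bucket in 0..9."""
    bucket = int((value + OFFSET) / BUCKET)
    return max(0, min(9, bucket))  # clamp out-of-range values


def to_state(bb, momentum, volatility):
    """Concatenate three bucket digits into a single state in 0..999."""
    # Digit order is an assumption made for this sketch.
    return 100 * discretize(bb) + 10 * discretize(momentum) + discretize(volatility)


state = to_state(0.0, 0.0, 0.0)
print(state)
```

Writing the concatenation as a base-10 sum keeps the mapping from indicator triples to states one-to-one, which is what lets a single integer index the Q-table.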

For the points at which the indicators were NaN, I replaced them with the average value of the indicators, which is 0 under the normal assumption. This led to the state value:

In [13]:
%%latex
\begin{align}
int\left(\frac{0 + 2.4}{0.5}\right) = int(4.8) = 4
\end{align}


This was done primarily to indicate to the algorithm that we didn't really have much information here.

Lastly, the actions were encoded as three numbers: 0 for sell/short, 1 for hold, and 2 for buy. The holdings were assigned appropriately based on these actions.

## Learning

Learning was done by feeding each new state to the Qlearner and computing the reward based on the action the learner returned, the daily return and the current holding. More specifically the reward was computed as:

In [18]:
%%latex
\begin{align}
r = holding \cdot dr(t), \quad holding \in \{-200, 0, 200\}
\end{align}


This was done instead of the more complex daily portfolio computation because the only thing we really cared about was the trades; the cash we had depended on the trades we made. The multiplication takes care of the various cases: if you are short and the price goes down, you get a positive reward, and likewise if you are long and the price goes up. On the other hand, you get a negative reward when the price moves against your holding, and no reward if you hold no position.
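A small worked example makes the sign cases concrete. The function name `reward` and the 1% daily returns are made up for illustration; the formula itself is the one given above.

```python
def reward(holding, daily_return):
    """r = holding * dr(t), with holding in {-200, 0, 200}."""
    return holding * daily_return


# Illustrative cases with a made-up 1% daily move.
cases = {
    "short, price falls": reward(-200, -0.01),
    "long, price rises": reward(200, 0.01),
    "long, price falls": reward(200, -0.01),
    "no position": reward(0, 0.01),
}
for name, value in cases.items():
    print(name, value)
```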

There is one more distinction between my calculation and the portfolio computation: the cost of making a trade wasn't taken into account in computing the reward. Taking this into account would likely have refined the learner a little, but since the holdings changed in increments of 200 shares, a $9.95 transaction fee would likely not have affected the reward much. This was borne out empirically, as the learner was still able to hit all of the benchmarks.

Lastly, the learner ensured that the trading rules were met: when the Qlearner attempted to put itself in an illegal position, the holdings did not change and the reward was simply the unchanged holding times the daily return. In other words, it got the same reward as if it had done nothing.
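One way the legality check could look is sketched below. The write-up does not spell out the trade sizes, so this assumes that a buy or sell moves the position by 200 shares and that holdings must stay within -200..200. Both the `TRADE` mapping and the `apply_action` helper are hypothetical.

```python
MAX_SHARES = 200                 # position limit implied by the reward formula
TRADE = {0: -200, 1: 0, 2: 200}  # assumed shares traded per action (sell, hold, buy)


def apply_action(holding, action):
    """Return the new holding, ignoring trades that would break the limit."""
    proposed = holding + TRADE[action]
    if abs(proposed) > MAX_SHARES:
        return holding  # illegal move: holdings stay as they were
    return proposed


holding = apply_action(200, 2)  # buying while already long is rejected
print(holding)
```

Because a rejected trade leaves the holding untouched, the reward for that step is the same as it would have been for a hold.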

For testing, the same indicators and states were computed as during learning, and all of these states were looped through. The actions returned by the Qlearner were entered into the trades data-frame as long as they did not put us into an illegal position.

## Hyper-Parameter Experiments

### Normalization

The first experiment I conducted was with normalizing the data. Initially I did not normalize the data, and only after attempting to change almost every hyper-parameter in the Qlearner did I finally try normalizing the indicators. Below is a table that describes how many of the test cases passed with respect to the normalization:

![title](figure_2.png)

Normalization mattered so much because without it the states for training and testing were wildly different. In addition, normalizing with time-series statistics, using a window equal to the window size of the indicators, did not perform well, since the SMA or rolling stdev couldn't properly capture these values for the whole period. By using the global statistics instead, the information encoded in the training state space could be transferred to the testing state space.

### Epochs, rar, alpha

These were some of the hyper-parameters that I expected to affect the learner far more than they did. Varying the number of epochs, rar, and α made almost no difference to the number of tests passed or to the % over benchmark score. For each table, the other two parameters were held constant at their default values (1500 for epochs). The results are summarized below:

![title](figure_3.png)
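For readers unfamiliar with these hyper-parameters, the sketch below shows where α (the learning rate) and rar (the random action rate) enter a generic tabular Q-learner. This is the textbook update, not the Qlearner used in this project, whose internals are not shown here. The function names, the discount `gamma`, and the numeric values are assumptions chosen for illustration.

```python
import random


def q_update(q, s, a, s_next, r, alpha=0.2, gamma=0.9):
    """One tabular Q-learning update; q is a list of per-state action lists."""
    target = r + gamma * max(q[s_next])
    q[s][a] = (1 - alpha) * q[s][a] + alpha * target


def choose_action(q, s, rar, rng=random):
    """Pick a random action with probability rar, otherwise the greedy one."""
    if rng.random() < rar:
        return rng.randrange(len(q[s]))
    return max(range(len(q[s])), key=lambda a: q[s][a])


# 1000 states x 3 actions, matching the state and action encoding above.
q_table = [[0.0, 0.0, 0.0] for _ in range(1000)]
q_update(q_table, s=444, a=2, s_next=445, r=2.0, alpha=0.2, gamma=0.9)
greedy = choose_action(q_table, 444, rar=0.0)
print(q_table[444], greedy)
```

An epoch in this setting would be one full pass over the training period, applying `choose_action` and `q_update` at each day.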