# Capstone Project Proposal

## Domain Background
This project aims to construct a predictive financial model, capable of estimating the future price movement of traded financial instruments based primarily on historical price data.

This is a challenging project, since there is a range of views on whether it is feasible. On one hand, the Efficient Market Hypothesis (EMH) [1] argues that in efficient markets it should not be possible to predict future prices from historic price information alone. On the other hand, investing strategies which rely on exactly this approach, such as momentum and trend-following approaches [2], have had empirical success under certain conditions, suggesting that there may indeed be structure in historic price information which can be used to predict future price movements.

There are many factors that influence the price of a financial product traded on any particular market at a given time. These range from the sentiment and demand for the product by the market participants/investors to the dynamics of the systems used to operate the marketplace itself.

The hypothesis proposed here is that some of these factors are not completely independent of recent changes in the price of the product, and that by analysing enough data, some structure may be found in historic price movements that would be a predictor of future price movements.

It is appreciated that any such discoverable structure is likely to be a very weak underlying signal amid a lot of noise. It is therefore entirely feasible, indeed likely, that any predictive signal obtained would not be strong enough to trade profitably on its own. Trading any systematic strategy incurs execution costs, typically due to price spreads (the difference between the price at which a given product can be bought and that at which it can be sold) and fees or commissions charged by the various brokers or marketplaces involved.

Nevertheless, it is proposed that even the identification of a weak but measurable signal would be a valuable contribution, since it could be incorporated as an additional input into existing systematic investment strategies to increase their profitability or diversity. 

## Problem Statement

The task proposed is to construct a predictive financial model, capable of predicting the direction of short-term future price movements, based on historical price data.

It will take as its main input a dataset consisting of recent historical prices of a financial product and output a signal indicating the direction of any expected future price movement within a short time period.

The model will be implemented as a classifier, returning a categorical output, predicting whether the price will rise or fall.
Part of the project will be to determine how the number of output categories affects the model's effectiveness. For example, the model could output just two categories, one for a rising and one for a falling price prediction. Alternatively, more categories could be used to give a coarse indication of both the magnitude and the direction of the price movement.
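As an illustration of the two labelling schemes described above, here is a minimal sketch. The four-class layout, the `threshold` parameter and the treatment of a zero move as a fall are assumptions made for the example, not decisions taken in this proposal.

```python
def label_binary(future_return):
    """Return 1 for a rise and 0 for a fall (a zero move is counted as a fall here)."""
    return 1 if future_return > 0 else 0


def label_multiclass(future_return, threshold):
    """Map a return to one of four coarse classes.

    0 = large fall, 1 = small fall, 2 = small rise, 3 = large rise.
    `threshold` is a hypothetical cut-off separating small from large moves.
    """
    if future_return <= -threshold:
        return 0
    if future_return <= 0:
        return 1
    if future_return < threshold:
        return 2
    return 3


# Illustrative returns only, not real market data.
returns = [-0.0012, -0.0001, 0.0003, 0.0020]
print([label_binary(r) for r in returns])             # [0, 0, 1, 1]
print([label_multiclass(r, 0.001) for r in returns])  # [0, 1, 2, 3]
```

In practice the threshold would probably be chosen from the distribution of returns in the training set, so that the classes are of roughly comparable size.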

Another trade-off to be investigated is that between very short-term prediction, where higher confidence may be achievable, and longer-term prediction, which may be of lower confidence but capture a greater price movement.
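One way to study this trade-off is to compute the prediction target over a configurable look-ahead horizon. The sketch below assumes a hypothetical `horizon` parameter measured in samples; the prices are illustrative.

```python
def forward_returns(prices, horizon):
    """Relative price change from each sample to the sample `horizon` steps later.

    The last `horizon` samples have no future price available and are omitted.
    """
    return [
        (prices[i + horizon] - prices[i]) / prices[i]
        for i in range(len(prices) - horizon)
    ]


# Illustrative prices only.
prices = [100.0, 101.0, 99.0, 102.0, 104.0]
short_term = forward_returns(prices, 1)  # 4 targets, one step ahead
long_term = forward_returns(prices, 3)   # 2 targets, three steps ahead
print(len(short_term), len(long_term))   # 4 2
```

Longer horizons yield fewer usable samples and, typically, larger absolute moves, which is the trade-off to be measured.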

Some investigation will focus on how a model trained on a single instrument's historic data compares with one trained on data from a range of instruments.

Finally, the output of the project will be a trained model, along with test results, analysis and measurements of its performance on a hold-out test data set unseen during its training.

Care will be taken to avoid "data leakage" [3] (whereby implicit information from a test or evaluation scenario is inadvertently used in the training process), by ensuring that the test data sets used cover time ranges that are later in time than, and distinct from, any data used for training.
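A chronological split of this kind might look like the following minimal sketch. The function name `split_by_time`, the `(timestamp, value)` sample layout and the cut-off dates are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def split_by_time(samples, train_end, val_end):
    """Split (timestamp, value) samples into train/validation/test by cut-off time.

    No shuffling is performed, so every test sample is later than every
    validation sample, which in turn is later than every training sample.
    """
    train = [s for s in samples if s[0] < train_end]
    val = [s for s in samples if train_end <= s[0] < val_end]
    test = [s for s in samples if s[0] >= val_end]
    return train, val, test


# Ten illustrative one-minute samples.
start = datetime(2016, 1, 1)
samples = [(start + timedelta(minutes=i), 1.10 + 0.0001 * i) for i in range(10)]
train, val, test = split_by_time(
    samples, start + timedelta(minutes=6), start + timedelta(minutes=8)
)
print(len(train), len(val), len(test))  # 6 2 2
```

When features look back over a window and targets look ahead, it may also be worth discarding a few samples on each side of a boundary so that no window straddles two sets.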


## Dataset and Inputs

To have confidence in any possible discovery of a weak signal, a lot of data is required. Hence the direction of the project is driven to some extent by the availability of such data. Whilst there is a wealth of financial data potentially available from market data vendors such as Bloomberg or Reuters, most of this comes at significant cost.

However, a source of freely available Foreign Exchange (FX) and Precious Metals (PM) price data was found at [4], which provides historical open/high/low/close prices sampled at 1 minute intervals across many currency pairs. In most cases, this data extends back to 2009, giving around 2.5 million samples per currency pair.

This led to the proposal to attempt the prediction of short term price movements of FX currency pairs, based on recent price history.

The primary input to the model for any given time point would be a snapshot of recent price information up to that time, looking back over a given time window _W_. The impact of different values of _W_ will be assessed.

Rather than use the raw prices themselves, it is envisaged that features based on a series of "returns", meaning the relative price change from the previous sample, would be used instead. As a relative measure, returns tend to be more comparable between different financial products, whose actual prices may be very different.
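A minimal sketch of the return calculation follows. The two price series are invented for illustration and are labelled with currency-pair names only to show that instruments quoted at very different price levels produce returns on the same scale.

```python
def simple_returns(prices):
    """Relative change of each price from the previous sample."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


# Invented prices: the two series differ in level by a factor of 100.
eurusd = [1.1000, 1.1011, 1.1000]
usdjpy = [110.00, 110.11, 110.00]

# Both series rise by about 0.1% and then fall back by about 0.1%.
print(simple_returns(eurusd))
print(simple_returns(usdjpy))
```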

The effects of including additional derived features based on price history may also be considered, for example signals derived from moving averages of different speeds, which are often used in trend-following investment strategies [5].
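As one possible example of such a derived feature, the sketch below computes the difference between a fast and a slow simple moving average. The function names and the window lengths are assumptions; the proposal does not fix a particular moving-average signal.

```python
def sma(values, window):
    """Simple moving average, with one output per complete window."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def crossover_signal(prices, fast, slow):
    """Fast SMA minus slow SMA, aligned on the most recent samples.

    Positive values suggest an up-trend, negative values a down-trend.
    """
    fast_ma = sma(prices, fast)
    slow_ma = sma(prices, slow)
    # The slow average starts later, so drop the earliest fast values.
    fast_ma = fast_ma[len(fast_ma) - len(slow_ma):]
    return [f - s for f, s in zip(fast_ma, slow_ma)]


# A steadily rising illustrative series gives a constant positive signal.
print(crossover_signal([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], fast=2, slow=4))  # [1.0, 1.0, 1.0]
```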

The effect of other inputs, such as temporal indicators like hour of day and day of week, may also be explored. FX markets are known to exhibit different volatility at different times of day, as market participants in different time zones become active. Whilst such indicators may not contribute directly to predicting the _direction_ of future price moves, it is considered likely that they will help in predicting the distribution of _sizes_ of any price moves.
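One possible way to present such temporal indicators to a network is a cyclical sine/cosine encoding, so that 23:59 and 00:00 end up close together. This encoding is an assumption for illustration and not something the proposal commits to; one-hot encoding of the hour would be an alternative.

```python
import math
from datetime import datetime


def time_features(ts):
    """Encode hour of day and day of week as points on the unit circle."""
    hour_angle = 2 * math.pi * (ts.hour + ts.minute / 60) / 24
    day_angle = 2 * math.pi * ts.weekday() / 7  # Monday is 0
    return [
        math.sin(hour_angle), math.cos(hour_angle),
        math.sin(day_angle), math.cos(day_angle),
    ]


# Midnight on a Monday maps to angle zero for both indicators.
print(time_features(datetime(2016, 1, 4, 0, 0)))  # [0.0, 1.0, 0.0, 1.0]
```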

Some pre-processing of the training data sets will be performed, for example to scale and normalise them to appropriate ranges, and to de-mean the input data to remove the effect of long-term trends. A model such as this, which is trying to predict the short-term direction of future returns, is likely to capture significant bias if trained on data where there was a long-term price trend, resulting in it seeing different numbers of positive versus negative examples. Subtracting the mean return over the training set from each of the training samples is one potential way to work around this issue, although other approaches (to be documented in the final project report) may be considered.
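The de-meaning step could be sketched as follows. Scaling by the standard deviation is an added assumption, and the function names and sample returns are invented. The statistics are estimated on the training set only and then reused for later data, in keeping with the aim of avoiding leakage.

```python
import statistics


def fit_scaler(train_returns):
    """Estimate the mean and standard deviation from the training set only."""
    return statistics.mean(train_returns), statistics.pstdev(train_returns)


def transform(returns, mean, std):
    """De-mean and scale returns using previously fitted statistics."""
    return [(r - mean) / std for r in returns]


# Invented training returns with an upward drift: every raw return is positive.
train = [0.0010, 0.0030, 0.0015, 0.0035]
mean, std = fit_scaler(train)
scaled = transform(train, mean, std)
print(sum(1 for r in train if r > 0))   # 4
print(sum(1 for z in scaled if z > 0))  # 2

# Later data is transformed with the training statistics, never its own.
test_scaled = transform([0.0020, 0.0040], mean, std)
```

After de-meaning, the training set contains equal numbers of positive and negative examples in this toy case, which is the effect the paragraph above is aiming for.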


## Proposed Solution

It is proposed to use a deep neural network trained on historical price information to produce a model providing predictions on future pricing movements.

Deep neural networks were chosen as they are known to be capable of capturing arbitrarily complex patterns in high-dimensional data, and they may be trained using Stochastic Gradient Descent (and variants thereof), which is a scalable approach to dealing with large volumes of data.

Note that although the latest price sample at any point in time has only a few dimensions (e.g.  open/high/low/close), the recent history up until that point is a sequence of, say, _W_ elements, where _W_ is a parameter describing the length of the price history to be considered.

There are several approaches to processing sequential information for machine learning models. One approach is to use a Recurrent Neural Network (RNN), which explicitly models the input as a sequence of low-dimensional samples. Alternatively, the "sliding window" approach [6] uses one or more input features for each element of the recent history, i.e. having at least _W_ features or dimensions. This approach allows the use of regular fully-connected feed-forward networks (Multi-Layer Perceptrons).
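The sliding window construction can be sketched as below, under the assumption of a single return feature per time step. The `window` argument plays the role of _W_; `horizon` is a hypothetical look-ahead parameter.

```python
def sliding_windows(returns, window, horizon=1):
    """Build (features, target) pairs from a return series.

    Each feature vector holds `window` consecutive returns; the target is the
    return `horizon` steps after the last element of the window.
    """
    pairs = []
    for end in range(window, len(returns) - horizon + 1):
        features = returns[end - window:end]
        target = returns[end + horizon - 1]
        pairs.append((features, target))
    return pairs


# Integers stand in for returns so the alignment is easy to read.
for features, target in sliding_windows([1, 2, 3, 4, 5], window=3):
    print(features, "->", target)
# [1, 2, 3] -> 4
# [2, 3, 4] -> 5
```

Each feature vector has a fixed length, which is what allows a feed-forward network to consume it directly.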

Although both approaches are potentially interesting and worthy of comparison, in order to limit the scope, this project will primarily consider the use of Multi-Layer Perceptron networks trained using the sliding window method. Some focus will be given to assessing the impact of different aspects of this model architecture, such as the number of network layers and the layer sizes.
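To make the architectural hyperparameters concrete, the following sketch builds an untrained Multi-Layer Perceptron from a list of layer sizes and runs a forward pass. The ReLU activations, softmax output, He-style initialisation and the example sizes (a 60-sample window, two hidden layers, two output classes) are all assumptions; training is not shown, and a real implementation would most likely use a deep learning library.

```python
import math
import random


def init_mlp(layer_sizes, seed=0):
    """Create random weights and zero biases for each pair of adjacent layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = math.sqrt(2.0 / n_in)  # He-style initialisation (assumed)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Apply ReLU hidden layers followed by a softmax output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


model = init_mlp([60, 32, 16, 2])
print(count_parameters(model))  # 2514
probabilities = forward(model, [0.0] * 60)
```

Varying the list passed to `init_mlp` is then enough to compare deeper or wider networks.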


## Benchmark Models

A number of baseline models are proposed for comparison and benchmarking as predictors of future price movements.

1. last price change predictor - a model which simply predicts the future price change to be the same as the most recent price change. Such a model may be expected to have some success at modelling short term trends.

2. moving average price change predictor - a model which predicts the future price change based on the gradient of the moving average price over a recent window. This model can itself be fine-tuned on the training dataset by selecting the optimal window size, and moving average type (e.g. simple moving average versus exponential moving average).
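The two baselines above could be sketched as simple direction predictors. This assumes a binary rise/fall output (1 or 0) and a simple moving average; the function names and prices are illustrative.

```python
def last_change_predictor(returns):
    """Predict 1 (rise) if the most recent return was positive, else 0 (fall)."""
    return 1 if returns[-1] > 0 else 0


def moving_average_predictor(prices, window):
    """Predict 1 if the simple moving average rose over the last step, else 0.

    Requires at least `window + 1` prices.
    """
    current = sum(prices[-window:]) / window
    previous = sum(prices[-window - 1:-1]) / window
    return 1 if current > previous else 0


# Illustrative prices: the last move is down, but the 3-sample average is rising.
prices = [1.00, 1.20, 1.10, 1.30, 1.25]
returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
print(last_change_predictor(returns))       # 0
print(moving_average_predictor(prices, 3))  # 1
```

The example shows that the two baselines can disagree, which is why both are worth keeping as separate points of comparison.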


## Evaluation Metrics

For a binary categorical model, which attempts to predict a price rise versus a price fall, it is proposed to use the F1-score metric to evaluate the models.

Although accuracy would seem intuitively to be a reasonable metric, it is best suited to assessing performance on perfectly balanced classes. In this case, our test data will be real price series, which are expected to have a slightly imbalanced number of positive and negative return examples. Although the imbalance is only slight, it may be significant given that we expect to be measuring the performance of a relatively weak signal.

Hence the F1-score, which balances precision and recall (it is their harmonic mean) and is commonly used when classes are imbalanced, is proposed as a more robust metric.

For models with more than two output categories, the mean F1-score across categories will be used.
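For reference, the per-class F1-score and its unweighted mean across classes (often called the macro average, which is assumed here to be the averaging intended) can be computed as in this sketch. The label sequences are invented.

```python
def f1_for_class(y_true, y_pred, label):
    """F1-score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # convention when precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Invented labels: 1 = rise, 0 = fall.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(f1_for_class(y_true, y_pred, 1))  # 0.75
print(f1_for_class(y_true, y_pred, 0))  # 0.5
print(macro_f1(y_true, y_pred))         # 0.625
```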

The evaluation of a binary classification model versus a multi-class model will depend somewhat on the performance characteristics discovered for the multi-class model. For example, it may turn out that performance at predicting large price moves is significantly better than at predicting small ones, and better than simple binary up/down prediction. These performance characteristics will be considered when making a final selection between model types.

Depending on the nature of any performance characteristics found, further metrics more directly related to use as a trading signal may be developed.


## Project Design

The project is expected to follow approximately this workflow:

1. Download datasets for a range of currency pairs and dates.
1. Prepare training, validation and test data sets, for a single FX pair.
1. Compute descriptive statistics and visualisations of the dataset.
1. Build data pre-processing pipeline with an initial feature set (based on return series)
1. Visually investigate the existence of structure in patterns of future price movement in the data, for example using t-SNE visualisations.
1. Construct baseline models, and fine tune baseline model 2 for optimal performance on training dataset.
1. Evaluate baseline models on validation data set.
1. Construct multi-layer-perceptron model
1. Train the model on the training set and evaluate the performance on the validation set.
1. Repeat with different model architectural hyperparameters (number of layers, layer width).
1. Assess performance relative to baseline models. 
1. Investigate impact of additional features and length of lookback window
1. Investigate performance at longer term versus shorter term prediction, and select a target timescale.
1. Investigate performance when training model on dataset based on prices from multiple FX pairs.
1. Further refine hyperparameters of best performing model/dataset combination, and select this as the final model.
1. Report performance of final model on test datasets for different FX pairs, and compare with baseline model performance on the same.
1. Summarize areas for further research.




## References:
[1] https://en.wikipedia.org/wiki/Efficient-market_hypothesis  
[2] https://en.wikipedia.org/wiki/Momentum_(finance)  
[3] https://www.kaggle.com/wiki/Leakage  
[4] http://www.histdata.com/  
[5] https://en.wikipedia.org/wiki/Trend_following  
[6] Machine Learning for Sequential Data: A Review - Thomas G. Dietterich - http://web.engr.oregonstate.edu/~tgd/publications/mlsd-ssspr.pdf  
