# Machine Learning Engineer Nanodegree
## Capstone Project
Andy Wilson  
March 25, 2017

## I. Definition

### Project Overview
This project aims to construct a predictive financial model, capable of estimating the future price movement of traded financial instruments based primarily on historical price data.

This is a challenging project, since there is a range of views as to whether it is feasible. On one hand, the Efficient Market Hypothesis (EMH) [1] argues that in efficient markets it should not be possible to predict future prices from historic price information alone. On the other hand, investing strategies which rely on exactly this approach, such as momentum and trend-following approaches [2], have had empirical success under certain conditions, suggesting that there may indeed be structure in historic price information which can be used to predict future price movements.

There are many factors that influence the price of a financial product traded on any particular market at a given time. These range from the sentiment and demand for the product by the market participants/investors to the dynamics of the systems used to operate the marketplace itself.

The hypothesis proposed here is that some of these factors are not completely independent of recent changes in the price of the product, and that by analysing enough data, some structure may be found in historic price movements that would be a predictor of future price movements.

It is appreciated that any such discoverable structure is likely to be a very weak underlying signal amid a lot of noise. It is therefore entirely feasible, indeed likely, that any predictive signal obtained would not be strong enough to trade profitably on its own, since any systematic trading strategy incurs execution costs, typically due to price spreads (the difference between the price to buy and the price to sell a given product) and fees or commissions charged by the various brokers or marketplaces involved.

Nevertheless, it is proposed that even the identification of a weak but measurable signal would be a valuable contribution, since it could be incorporated as an additional input into existing systematic investment strategies to increase their profitability or diversity. 

#### Dataset and inputs

To have confidence in any possible discovery of a weak signal, a lot of data is required. Hence the direction of the project was driven to some extent by the availability of such data. Whilst there is a wealth of financial data potentially available from market data vendors such as Bloomberg or Reuters, most of this comes at significant cost.

However, a source of freely available Foreign Exchange (FX) and Precious Metals (PM) price data was found at [4], which provides historical open/high/low/close prices sampled at 1 minute intervals across many currency pairs. In most cases, this data extends back to 2009, giving around 2.5 million samples per currency pair.

This led to the proposal to attempt the prediction of short term price movements of FX currency pairs, based on recent price history.

An example extract illustrating the nature of the data available is shown below, in this case for the USDJPY FX currency pair. It shows 1 minute "Open/High/Low/Close" ("OHLC") price samples from one day in 2009.

<p style="text-align: center;">USDJPY</p>

|     timestamp       |   open |   high |   low |   close |
|:--------------------|-------:|-------:|------:|--------:|
| 2009-03-16 00:00:00 |  98.25 |  98.25 | 98.23 |   98.24 |
| 2009-03-16 00:01:00 |  98.23 |  98.24 | 98.23 |   98.24 |
| 2009-03-16 00:03:00 |  98.25 |  98.25 | 98.24 |   98.24 |
| 2009-03-16 00:04:00 |  98.25 |  98.25 | 98.23 |   98.24 |
| 2009-03-16 00:05:00 |  98.23 |  98.23 | 98.22 |   98.23 |

![](reportresources/usdjpy2009-closingprice-20090316.png)

The primary input to the model for any given time point will be a snapshot of recent price information up to that time, looking back over a given time window of _W_ samples.

Rather than use the raw prices themselves, which may be arbitrarily large and differ greatly between FX currency pairs, the data is transformed and normalised. One important transformation is to base the features on the "returns" of the price series, meaning the relative price change from the previous time interval. As a relative measure, returns tend to be more comparable between different FX pairs. The "Preprocessing" section below gives full details of the transformations applied to the data to make it more suitable for input to the models developed.
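As a minimal sketch of the return transformation described above, the snippet below computes simple one-period returns from the close prices in the USDJPY extract. The function name `simple_returns` is hypothetical, and the sketch assumes consecutive rows are treated as adjacent samples even where a minute is missing from the data; the preprocessing actually used may differ.

```python
def simple_returns(prices):
    """Relative price change from the previous sample: p[t] / p[t-1] - 1."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


# Close prices taken from the USDJPY extract above.
closes = [98.24, 98.24, 98.24, 98.24, 98.23]
returns = simple_returns(closes)

# One return per pair of consecutive prices, so len(returns) == len(closes) - 1.
print(returns)
```

Because each return is a ratio of prices, the result does not depend on the absolute price level, which is what makes series from different currency pairs more comparable.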



### Problem Statement

The task proposed is to construct a predictive financial model, capable of predicting the direction of short-term future price movements, based on historical price data.

It takes as its main input a dataset consisting of recent historical prices of a financial product and outputs a signal indicating the direction of any expected future price movement within a short time period.

The model is implemented as a classifier, returning a categorical output predicting whether the price will rise or fall. One trade-off which is investigated is that between the confidence achievable in very short term prediction and that achievable in longer term prediction, which may be of lower confidence but of greater price movement.

Some focus is also given to how a model trained on a single instrument's historic data compares with one trained on data from a range of instruments.

Finally, the output of the project is a trained model along with test results, analysis and measurements of its performance on a hold-out test data set unseen during its training.

Care is taken to avoid "data leakage" [3] (whereby implicit information from a test or evaluation scenario is inadvertently used in the training process) by ensuring that the test data sets cover time ranges that are later than, and distinct from, any data used for training.
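One way to enforce this ordering is to split the time-ordered samples by position rather than at random. The sketch below is illustrative only: the function name `chronological_split` and the 70/15/15 proportions are assumptions, not the split sizes used in the project.

```python
def chronological_split(samples, train_frac=0.7, val_frac=0.15):
    """Split time-ordered samples into train/validation/test without shuffling.

    The fractions are illustrative assumptions. Every validation sample is
    later than every training sample, and every test sample is later still.
    """
    n = len(samples)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]


# Integers stand in for time-ordered samples in this example.
timestamps = list(range(100))
train, val, test = chronological_split(timestamps)
```

Shuffling before splitting would mix later samples into the training set, which is exactly the leakage the paragraph above is guarding against.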

#### Proposed Solution

The approach taken is to use a deep neural network trained on historical price information to produce a model providing predictions on future pricing movements.

Deep neural networks were chosen as they are known to be capable of capturing complex patterns in high dimensional data, and they may be trained using Stochastic Gradient Descent (and variants thereof), which is a scalable approach to dealing with large volumes of data.

Note that although the latest price sample at any point in time has only a few dimensions (e.g.  open/high/low/close), the recent _history_ up until that point is a sequence of, say, _W_ elements, where _W_ is a parameter describing the length of the price history to be considered.

There are several approaches to processing sequential information for machine learning models. One approach is to use a Recurrent Neural Network (RNN), which explicitly models the input as a sequence of low-dimensional samples. Alternatively, the "sliding window" approach [6] uses one or more input features for each element of the recent history, i.e. having at least _W_ features or dimensions. This approach allows the use of regular fully-connected feed forward networks (Multi-Layer Perceptron networks).

Although both approaches are potentially interesting and worthy of comparison, in order to limit the scope, this project focusses on the use of Multi-Layer-Perceptron networks trained using the sliding window method. Some focus is given to assessing the impact of different aspects of this model architecture, such as numbers of network layers and layer sizes.
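The sliding window method can be sketched as follows, assuming the input is a list of returns and the label is the direction of the return over the next `horizon` samples. The names `sliding_windows`, `window` and `horizon` are hypothetical. Treating a zero future return as a "fall" and summing simple returns to approximate the multi-period return are simplifying assumptions of this sketch.

```python
def sliding_windows(returns, window, horizon=1):
    """Build (features, label) pairs from a return series.

    features: the `window` most recent returns before time t
    label:    1 if the summed return over the next `horizon` samples
              is positive, else 0 (a zero move counts as 0 here)
    """
    features, labels = [], []
    for t in range(window, len(returns) - horizon + 1):
        features.append(returns[t - window:t])
        future = sum(returns[t:t + horizon])
        labels.append(1 if future > 0 else 0)
    return features, labels


rets = [0.01, -0.02, 0.03, 0.01, -0.01, 0.02]
X, y = sliding_windows(rets, window=3)
# Each row of X holds W = 3 features, so it can be fed to a fully-connected network.
```

Each feature row only contains samples strictly before the period being predicted, so no future information enters the inputs.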


#### Benchmark Models

A number of baseline models are proposed for comparison and benchmarking as predictors of future price movements.

1. last price change predictor - a model which simply predicts the future price change to be the same as the most recent price change. Such a model may be expected to have some success at modelling short term trends.

2. moving average price change predictor - a model which predicts the future price change based on the gradient of the moving average price over a recent window. This model can itself be fine-tuned on the training dataset by selecting the optimal window size, and moving average type (e.g. simple moving average versus exponential moving average).
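The two baselines above might be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, only a simple moving average is shown, and the "gradient" is taken as the difference between the current and previous moving average values.

```python
def sign(x):
    """Return +1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def last_change_predictor(returns):
    """Baseline 1: predict the next direction to match the latest return."""
    return sign(returns[-1])


def moving_average_predictor(prices, window):
    """Baseline 2: predict direction from the slope of a simple moving average."""
    if len(prices) < window + 1:
        raise ValueError("need at least window + 1 prices")
    sma_now = sum(prices[-window:]) / window
    sma_prev = sum(prices[-window - 1:-1]) / window
    return sign(sma_now - sma_prev)


rising = [1.0, 2.0, 3.0, 4.0]
up = moving_average_predictor(rising, window=2)
down = last_change_predictor([0.01, -0.02])
```

Tuning baseline 2 then amounts to evaluating `moving_average_predictor` over a range of `window` values on the training data and keeping the best one.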


#### Project Design

The project followed this workflow:

1. Download datasets for a range of currency pairs and dates.
1. Prepare training, validation and test data sets for a single FX pair.
1. Compute descriptive statistics and visualisations of the dataset.
1. Build a data pre-processing pipeline with an initial feature set (based on return series).
1. Visually investigate the existence of structure in patterns of future price movement in the data, for example using t-SNE visualisations.
1. Construct baseline models, and fine-tune baseline model 2 for optimal performance on the training dataset.
1. Evaluate baseline models on the validation data set.
1. Construct a multi-layer-perceptron model.
1. Train the model on the training set and evaluate its performance on the validation set.
1. Repeat with different model architectural hyperparameters (number of layers, layer width).
1. Assess performance relative to baseline models.
1. Investigate the impact of additional features and the length of the lookback window.
1. Investigate performance at longer term versus shorter term prediction, and select a target timescale.
1. Investigate performance when training the model on a dataset based on prices from multiple FX pairs.
1. Further refine hyperparameters of the best performing model/dataset combination, and select this as the final model.
1. Report performance of the final model on test datasets for different FX pairs, and compare with baseline model performance on the same.
1. Summarize areas for further research.



### Metrics

#### F1-Score
For the binary categorical model, which attempts to predict price rise versus price fall, the F1-score metric is used for evaluation.

Although accuracy would seem intuitively to be a reasonable metric, it is best suited for assessing performance on perfectly balanced classes. In this case, our test data are real price series which have a slightly imbalanced number of positive and negative return examples. Although the imbalance is slight, it can be significant given that we are measuring the performance of a relatively weak signal.

Hence the F1-score, which is the harmonic mean of precision and recall and is commonly used for measuring performance on imbalanced classes, is proposed as a more robust metric.
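For reference, the F1-score for a binary classifier can be computed from the counts of true positives, false positives and false negatives. The sketch below is a plain-Python illustration, not the evaluation code used in the project; treating "price rise" as the positive class (label 1) is an assumption.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)  # tp=2, fp=1, fn=1
```

Note that F1 computed this way depends on which class is labelled positive, so the choice should be fixed before comparing models.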

#### Mean Future Return

The "Mean Future Return" is also used as an additional metric, which we devised to provide an indicator of the model's performance from a financial perspective. It is conceived as the mean return which would be achieved over each future time period, if a "cost-free" investment could be made to take advantage of the price change predicted by the model. By this we mean an investment which provides exposure to the price changes of the underlying asset, but which avoids the costs of entering and exiting positions and all other trading costs. It is therefore an unrealistic measure. However, it introduces a financial perspective to the evaluation of the model and places a theoretical upper bound on the model's financial performance.

This is computed as the mean, over all evaluated samples, of the price return over the next period multiplied by the model's predicted direction signal (+1/-1), i.e.

```
MeanFutureReturn = mean(future_return * predicted_price_movement_direction)
```
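A small worked sketch of this metric is given below, assuming predictions are encoded as +1 (rise) and -1 (fall) and that the result is averaged over all evaluated samples. The function name `mean_future_return` and the example values are illustrative only.

```python
def mean_future_return(future_returns, predicted_directions):
    """Average of future_return * predicted_direction over all samples."""
    if len(future_returns) != len(predicted_directions):
        raise ValueError("inputs must have the same length")
    products = [r * d for r, d in zip(future_returns, predicted_directions)]
    return sum(products) / len(products)


future_returns = [0.001, -0.002, 0.0005, -0.001]
predicted = [1, -1, -1, 1]  # correct, correct, wrong, wrong
mfr = mean_future_return(future_returns, predicted)
```

Correct predictions contribute the absolute size of the move and wrong ones subtract it, so a model that is right mainly on the larger moves can score well here even with modest classification accuracy.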




## II. Analysis
_(approx. 2-4 pages)_

### Data Exploration
In this section, you will be expected to analyze the data you are using for the problem. This data can either be in the form of a dataset (or datasets), input data (or input files), or even an environment. The type of data should be thoroughly described and, if possible, have basic statistics and information presented (such as discussion of input features or defining characteristics about the input or environment). Any abnormalities or interesting qualities about the data that may need to be addressed have been identified (such as features that need to be transformed or the possibility of outliers). Questions to ask yourself when writing this section:
- _If a dataset is present for this problem, have you thoroughly discussed certain features about the dataset? Has a data sample been provided to the reader?_
- _If a dataset is present for this problem, are statistics about the dataset calculated and reported? Have any relevant results from this calculation been discussed?_
- _If a dataset is **not** present for this problem, has discussion been made about the input space or input data for your problem?_
- _Are there any abnormalities or characteristics about the input space or dataset that need to be addressed? (categorical variables, missing values, outliers, etc.)_

### Exploratory Visualization
In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:
- _Have you visualized a relevant characteristic or feature about the dataset or input data?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_

### Algorithms and Techniques
In this section, you will need to discuss the algorithms and techniques you intend to use for solving the problem. You should justify the use of each one based on the characteristics of the problem and the problem domain. Questions to ask yourself when writing this section:
- _Are the algorithms you will use, including any default variables/parameters in the project clearly defined?_
- _Are the techniques to be used thoroughly discussed and justified?_
- _Is it made clear how the input data or datasets will be handled by the algorithms and techniques chosen?_

### Benchmark
In this section, you will need to provide a clearly defined benchmark result or threshold for comparing across performances obtained by your solution. The reasoning behind the benchmark (in the case where it is not an established result) should be discussed. Questions to ask yourself when writing this section:
- _Has some result or value been provided that acts as a benchmark for measuring performance?_
- _Is it clear how this result or value was obtained (whether by data or by hypothesis)?_


## III. Methodology
_(approx. 3-5 pages)_

### Data Preprocessing
In this section, all of your preprocessing steps will need to be clearly documented, if any were necessary. From the previous section, any of the abnormalities or characteristics that you identified about the dataset will be addressed and corrected here. Questions to ask yourself when writing this section:
- _If the algorithms chosen require preprocessing steps like feature selection or feature transformations, have they been properly documented?_
- _Based on the **Data Exploration** section, if there were abnormalities or characteristics that needed to be addressed, have they been properly corrected?_
- _If no preprocessing is needed, has it been made clear why?_

### Implementation
In this section, the process for which metrics, algorithms, and techniques that you implemented for the given data will need to be clearly documented. It should be abundantly clear how the implementation was carried out, and discussion should be made regarding any complications that occurred during this process. Questions to ask yourself when writing this section:
- _Is it made clear how the algorithms and techniques were implemented with the given datasets or input data?_
- _Were there any complications with the original metrics or techniques that required changing prior to acquiring a solution?_
- _Was there any part of the coding process (e.g., writing complicated functions) that should be documented?_

### Refinement
In this section, you will need to discuss the process of improvement you made upon the algorithms and techniques you used in your implementation. For example, adjusting parameters for certain models to acquire improved solutions would fall under the refinement category. Your initial and final solutions should be reported, as well as any significant intermediate results as necessary. Questions to ask yourself when writing this section:
- _Has an initial solution been found and clearly reported?_
- _Is the process of improvement clearly documented, such as what techniques were used?_
- _Are intermediate and final solutions clearly reported as the process is improved?_


## IV. Results
_(approx. 2-3 pages)_

### Model Evaluation and Validation
In this section, the final model and any supporting qualities should be evaluated in detail. It should be clear how the final model was derived and why this model was chosen. In addition, some type of analysis should be used to validate the robustness of this model and its solution, such as manipulating the input data or environment to see how the model’s solution is affected (this is called sensitivity analysis). Questions to ask yourself when writing this section:
- _Is the final model reasonable and aligning with solution expectations? Are the final parameters of the model appropriate?_
- _Has the final model been tested with various inputs to evaluate whether the model generalizes well to unseen data?_
- _Is the model robust enough for the problem? Do small perturbations (changes) in training data or the input space greatly affect the results?_
- _Can results found from the model be trusted?_

### Justification
In this section, your model’s final solution and its results should be compared to the benchmark you established earlier in the project using some type of statistical analysis. You should also justify whether these results and the solution are significant enough to have solved the problem posed in the project. Questions to ask yourself when writing this section:
- _Are the final results found stronger than the benchmark result reported earlier?_
- _Have you thoroughly analyzed and discussed the final solution?_
- _Is the final solution significant enough to have solved the problem?_


## V. Conclusion
_(approx. 1-2 pages)_

### Free-Form Visualization
In this section, you will need to provide some form of visualization that emphasizes an important quality about the project. It is much more free-form, but should reasonably support a significant result or characteristic about the problem that you want to discuss. Questions to ask yourself when writing this section:
- _Have you visualized a relevant or important quality about the problem, dataset, input data, or results?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_

### Reflection
In this section, you will summarize the entire end-to-end problem solution and discuss one or two particular aspects of the project you found interesting or difficult. You are expected to reflect on the project as a whole to show that you have a firm understanding of the entire process employed in your work. Questions to ask yourself when writing this section:
- _Have you thoroughly summarized the entire process you used for this project?_
- _Were there any interesting aspects of the project?_
- _Were there any difficult aspects of the project?_
- _Does the final model and solution fit your expectations for the problem, and should it be used in a general setting to solve these types of problems?_

### Improvement
In this section, you will need to provide discussion as to how one aspect of the implementation you designed could be improved. As an example, consider ways your implementation can be made more general, and what would need to be modified. You do not need to make this improvement, but the potential solutions resulting from these changes are considered and compared/contrasted to your current solution. Questions to ask yourself when writing this section:
- _Are there further improvements that could be made on the algorithms or techniques you used in this project?_
- _Were there algorithms or techniques you researched that you did not know how to implement, but would consider using if you knew how?_
- _If you used your final solution as the new benchmark, do you think an even better solution exists?_

---------------------------------------------------

refinements:

The effects of including additional derived features based on price history may also be considered, for example signals derived from moving averages of different speeds, which are often used in trend following investment strategies [5].

The effect of other inputs, such as temporal indicators (hour of day, day of week), may also be explored. FX markets are known to exhibit different volatility at different times of day, as market participants in different time zones become active. Whilst such indicators may not contribute directly to predicting the _direction_ of future price moves, it is considered likely that they affect the distribution of the _sizes_ of any price moves.

Some pre-processing of the training data sets will be performed, for example to scale and normalise them to appropriate ranges, and to de-mean the input data to remove the effect of long-term trends. A model such as this, which tries to predict the short term direction of future returns, is likely to capture significant bias if trained on data with a long term price trend, since it would see different numbers of positive and negative examples. Subtracting the mean return over the training set from each of the training samples is one potential way to work around this issue, although other approaches (to be documented in the final project report) may be considered.
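The de-meaning step mentioned above might look like the sketch below. It assumes the mean is estimated on the training returns only and then reused for any later data, so that no information from validation or test periods enters the preprocessing; the function name `demean` is hypothetical.

```python
def demean(train_returns, other_returns=()):
    """Subtract the training-set mean return from training and later data.

    The mean is computed from `train_returns` only and then applied
    unchanged to `other_returns` (e.g. validation or test data).
    """
    mean = sum(train_returns) / len(train_returns)
    train_out = [r - mean for r in train_returns]
    other_out = [r - mean for r in other_returns]
    return train_out, other_out, mean


train_rets = [0.02, 0.01, 0.03]  # illustrative upward-trending sample
test_rets = [0.02, -0.01]
train_dm, test_dm, train_mean = demean(train_rets, test_rets)
```

After this step the training returns average to zero, which removes the bias toward the long-term trend direction in the training labels when they are derived from the sign of the de-meaned return.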

## References:
[1] https://en.wikipedia.org/wiki/Efficient-market_hypothesis  
[2] https://en.wikipedia.org/wiki/Momentum_(finance)  
[3] https://www.kaggle.com/wiki/Leakage  
[4] http://www.histdata.com/  
[5] https://en.wikipedia.org/wiki/Trend_following  
[6] Thomas G. Dietterich, "Machine Learning for Sequential Data: A Review", http://web.engr.oregonstate.edu/~tgd/publications/mlsd-ssspr.pdf  
