

# Capstone Proposal
```
Dang Le Dang Khoa
April 17th, 2018
```

# Domain Background

- Investment firms, hedge funds, and even individuals have been using financial models to better understand market behavior and make profitable investments and trades. A wealth of information is available in the form of historical stock prices and company performance data, suitable for machine learning algorithms to process.

- For this project, the task is to build a stock price predictor that takes daily trading data over a certain date range as input and outputs projected estimates for given query dates. The inputs contain multiple metrics, such as the opening price (Open), the highest price the stock traded at (High), the number of shares traded (Volume), and the closing price adjusted for stock splits and dividends (Adjusted Close); the system only needs to predict the Adjusted Close price.


# Problem Statement
- Predict (forecast) the Adjusted Close price day by day
- Propose 2 different approaches:
    + Traditional supervised machine learning methods (regression)
    + A machine learning method specific to time series forecasting (ARIMA model)
- Challenges of time series data:
    + Contains a lot of noise
    + Differs from tabular data in that the data points are related to each other
    + Contains time-dependent structures:
        - Level: the average value of the series
        - Trend: a global increase or decrease
        - Seasonality: a repeating pattern in the series

# Datasets and Inputs
- Dataset downloaded from: [http://finance.yahoo.com](http://finance.yahoo.com)
- Example: GOOG dataset

| __Dates__ | __Open__ | __High__  | __Low__  | __Close__ | __Adj Close__  | __Volume__ |
|:---------:|:--------:|:---------:|:--------:|:---------:|:--------------:|:----------:|
| 2009-02-17|172.135422|172.423553 |168.747467|170.222870 | 170.222870     |   11434600 |
| 2009-02-18|172.498062|175.548233 |169.159775|175.414093 | 175.414093     |   12127300 |
| 2009-02-19|177.580017|178.737488 |169.601898|170.212936 | 170.212936     |   10042200 |
| 2009-02-20|167.932755|173.332642 |166.417618|172.105621 | 172.105621     |   12515000 |
| 2009-02-21|172.378845|173.769791 |163.710220|163.963577 | 163.963577     |   10510000 |


- Inputs: Date and Adjusted Close
- Example: Time series input

    
| __Dates__ | __Adj Close__  |
|:---------:|:--------------:|
| 2009-02-17| 170.222870     |
| 2009-02-18| 175.414093     |
| 2009-02-19| 170.212936     |
| 2009-02-20| 172.105621     |
| 2009-02-21| 163.963577     |

![image alt <>](./figures/1.jpg "Time-Series")
    
- Output: Predicted price for the current day
- Optional output: Suggest Buy/Sell/Hold

# Solution Statement
## Regression Approach
- Use Linear Regression as the traditional machine learning model
- For time series regression, create lagged values as new features
    $$lag(n) = f(t-n)$$
    _Example_: lagged values with n = 7

| __Dates__ | __t__ |__t-1__    |__t-2__    |__t-3__    |__t-4__    |__t-5__    |__t-6__   |__t-7__|
|:---------:|:-----:|:---------:|:---------:|:---------:|:---------:|:---------:|:--------:|:-----:|
|2011-01-08 |1.020005|  1.020005|   1.015140|   1.007810|   0.996310|   1.000000|   1.00000| 1.00000|
|2011-01-09 |1.020005|  1.020005|   1.020005|   1.015140|   1.007810|   0.996310|   1.00000| 1.00000|
|2011-01-10 |1.016315|  1.020005|   1.020005|   1.020005|   1.015140|   1.007810|   0.99631| 1.00000|
|2011-01-11 |1.019293|  1.016315|   1.020005|   1.020005|   1.020005|   1.015140|   1.00781| 0.99631|
|2011-01-12 |1.020716|  1.019293|   1.016315|   1.020005|   1.020005|   1.020005|   1.01514| 1.00781|

- Perform grid search to find the optimal lag order based on Root Mean Squared Error (RMSE)
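
The lagged features described above can be built in a few lines. The following is a minimal sketch using NumPy; the helper name `make_lagged` is hypothetical, and the sample values are taken from the normalized prices in the table.

```python
import numpy as np

def make_lagged(series, n_lags):
    """Build a supervised dataset from a 1-D series.

    Row i of X holds [f(t-1), ..., f(t-n_lags)] for the target y[i] = f(t).
    """
    series = np.asarray(series, dtype=float)
    # Column k (k = 1..n_lags) is the series shifted back by k steps.
    X = np.column_stack(
        [series[n_lags - k : len(series) - k] for k in range(1, n_lags + 1)]
    )
    y = series[n_lags:]
    return X, y

prices = [1.0, 1.0, 0.99631, 1.00781, 1.01514, 1.020005]
X, y = make_lagged(prices, n_lags=2)
print(X)  # each row: [value at t-1, value at t-2]
print(y)  # value at t
```

The first `n_lags` observations have no complete history, so they are dropped rather than padded.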

## ARIMA Approach
### Definition
- ARIMA stands for Autoregressive Integrated Moving Average (alternative name: Box-Jenkins model)
- ARIMA is a forecasting technique that projects the future values of a series based on its own inertia
- Its main application is short-term forecasting, and it requires at least 40 historical data points
- ARIMA works best when the data:
    + Exhibits a stable or consistent pattern
    + Has a minimal number of outliers
### Model parameters
- ARIMA attempts to describe the movement in a stationary time series as a function of autoregressive and moving average terms
    + AR: autoregressive
    + MA: moving average
- Autoregressive models:
    $$X(t) = A(1)*X(t-1) + A(2)*X(t-2) + \dots + A(n)*X(t-n) + E(t)$$
    + X(t): value of the time series at time t
    + X(t-n): the time series lagged by n periods
    + A(1), ..., A(n): autoregressive parameters
    + E(t): the error term of the model
- Moving average models:
    $$X(t) = -B(1) * E(t-1) + E(t)$$
    + B(1): moving average parameter of order 1
    + E(t): current error term
    + E(t-1): error in the previous period
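
To make the two equations concrete, the sketch below simulates an AR(2) and an MA(1) series from the same white-noise errors. The coefficient values, the series length, and the random seed are illustrative assumptions, not values from this project.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 200
E = rng.normal(size=n)           # E(t): white-noise error terms

# AR(2): X(t) = A(1)*X(t-1) + A(2)*X(t-2) + E(t)
A = [0.6, -0.2]                  # illustrative coefficients (assumed)
X_ar = np.zeros(n)
for t in range(2, n):
    X_ar[t] = A[0] * X_ar[t - 1] + A[1] * X_ar[t - 2] + E[t]

# MA(1): X(t) = -B(1)*E(t-1) + E(t)
B1 = 0.5                         # illustrative coefficient (assumed)
X_ma = np.empty(n)
X_ma[0] = E[0]                   # no previous error exists for the first point
X_ma[1:] = -B1 * E[:-1] + E[1:]
```

In the AR series each value depends on earlier values of the series itself, while in the MA series it depends only on the current and previous error terms.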

### Approach
- The mixed ARIMA model is defined by 3 parameters (p, d, q)
    + p: lag order (number of autoregressive terms)
    + d: degree of differencing (number of times the series is differenced)
    + q: size of the moving average window (order of the moving average)
- Perform grid search to find the optimal order (p, d, q) based on Root Mean Squared Error (RMSE)
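
As a small illustration of the d parameter and of the search space, the sketch below differences a trending series and enumerates candidate orders. The helper name `difference` and the candidate ranges are assumptions chosen for the example.

```python
import numpy as np
from itertools import product

def difference(series, d):
    """Apply first-order differencing d times."""
    return np.diff(np.asarray(series, dtype=float), n=d)

trend = np.array([1.0, 3.0, 5.0, 7.0, 9.0])  # linear trend with a constant step of 2
print(difference(trend, 1))  # the trend is gone, only the constant step remains
print(difference(trend, 2))  # differencing again leaves zeros

# Candidate (p, d, q) orders for a grid search (ranges are illustrative).
order_values = list(product(range(0, 3), range(0, 2), range(0, 3)))
print(len(order_values))     # 3 * 2 * 3 = 18 combinations
```

Each round of differencing shortens the series by one point, which is one reason to keep d small.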

# Evaluation Metrics
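- RMSE (Root Mean Squared Error)
    $$RMSE = \sqrt{\frac{1}{N}\sum_{t=1}^{N}(y_t - \hat{y}_t)^2}$$
    where N is the number of predictions, $y_t$ is the actual value and $\hat{y}_t$ is the predicted value
- Reasons to choose RMSE:
    + Squaring the errors makes every term positive, so errors of opposite sign do not cancel out
    + Squaring puts more weight on large errors
- Drawback of RMSE:
    + The data may contain many outliers, which can distort the performance evaluation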
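
The effect of squaring can be checked with a small worked example. The sketch below uses a hypothetical `rmse` helper and made-up numbers: both prediction sets have the same total absolute error of 4, but the one with a single large miss has twice the RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [10.0, 10.0, 10.0, 10.0]
small_errors = [11.0, 9.0, 11.0, 9.0]    # four errors of size 1
one_outlier = [10.0, 10.0, 10.0, 14.0]   # a single error of size 4

print(rmse(y_true, small_errors))  # 1.0
print(rmse(y_true, one_outlier))   # 2.0
```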

# Project Design
## Data Exploration
- Perform statistical analysis
    + Mean
    + Standard Deviation
    + Median
    + Sum
- Visualize time series data
    + Plot of the series over time
    ![image alt <>](./figures/2.jpg "Time-Series")
    
    + Bollinger Bands - Rolling stats
    ![image alt <>](./figures/3.jpg "Bollinger Bands")
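
The summary statistics and the rolling statistics behind Bollinger Bands can be sketched as follows. The helper names are hypothetical; the 20-day window and the 2-standard-deviation band width are the conventional defaults and are assumed here, as is the use of the population standard deviation. The prices are the rounded Adj Close values from the GOOG sample.

```python
import numpy as np

def rolling_stats(series, window):
    """Rolling mean and standard deviation over a trailing window."""
    series = np.asarray(series, dtype=float)
    windows = np.array(
        [series[i - window + 1 : i + 1] for i in range(window - 1, len(series))]
    )
    return windows.mean(axis=1), windows.std(axis=1)

def bollinger_bands(series, window=20, k=2.0):
    """Lower band, rolling mean, and upper band (mean -/+ k standard deviations)."""
    mean, std = rolling_stats(series, window)
    return mean - k * std, mean, mean + k * std

prices = np.array([170.22, 175.41, 170.21, 172.11, 163.96])

# Global statistics listed in the analysis step.
print(prices.mean(), prices.std(), np.median(prices), prices.sum())

# A short window is used only because the sample has five points.
lower, middle, upper = bollinger_bands(prices, window=3)
```

The bands are defined only from the first point that has a full window of history, so the output is `window - 1` points shorter than the input.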
       
## Data Preprocessing
- Normalize data
$$f_{norm}(t) = \frac{f(t)}{f(0)}$$
![image alt <>](./figures/4.jpg "Normalizations")

- Remove the trend if necessary
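
Both preprocessing steps can be sketched with NumPy. The normalization divides every value by the first one, as in the formula above. The proposal does not fix a detrending method, so subtracting a fitted straight line is only one possible choice, assumed here for illustration; the function names and prices are also hypothetical.

```python
import numpy as np

def normalize(series):
    """Scale a price series so that it starts at 1.0: f(t) / f(0)."""
    series = np.asarray(series, dtype=float)
    return series / series[0]

def remove_linear_trend(series):
    """Subtract a least-squares straight line fitted against the time index."""
    series = np.asarray(series, dtype=float)
    t = np.arange(len(series))
    slope, intercept = np.polyfit(t, series, deg=1)
    return series - (slope * t + intercept)

prices = np.array([200.0, 210.0, 220.0, 230.0])
normalized = normalize(prices)           # 1.0, 1.05, 1.1, 1.15
detrended = remove_linear_trend(prices)  # close to zero: this series is a pure trend
```

Normalizing makes stocks with very different price levels comparable on one chart.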

## Model Prediction
### Linear Regression
- Feature engineering
    - Create lagged value
    - Examine correlation between lagged datapoints
    ![image alt <](./figures/5.jpg "Correlation")
    ![image alt >](./figures/6.jpg "Correlation")
    
- Split data into training/validation/test sets
![image alt <>](./figures/7.jpg "Splitting")

- Perform grid search to find optimal parameters
```python
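def grid_search_lag(lag_values, X, y, y_val):
    # fit, training_set, validation_set and rmse stand for the project's own helpers;
    # y_val holds the validation targets.
    best_lag, best_error = None, float("inf")
    for lag in lag_values:
        model = fit(training_set(X, y, lag))
        y_hat = model.predict(validation_set(X, lag))
        error = rmse(y_val, y_hat)   # validation error for this lag order
        if error < best_error:       # keep the lag with the lowest error so far
            best_lag, best_error = lag, error
    return best_lag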
```
- Evaluate based on RMSE and visualization
![image alt <>](./figures/8.jpg "Prediction")
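
A self-contained version of the split and the lag grid search might look like the sketch below. It is written under several assumptions: linear regression is fitted by ordinary least squares with NumPy, the data is a synthetic series rather than real prices, the split is a chronological 60/20/20, and all function names are hypothetical.

```python
import numpy as np

def make_lagged(series, n_lags):
    """Rows of [f(t-1), ..., f(t-n_lags)] paired with the target f(t)."""
    X = np.column_stack(
        [series[n_lags - k : len(series) - k] for k in range(1, n_lags + 1)]
    )
    return X, series[n_lags:]

def fit_linear(X, y):
    """Least-squares linear regression with an intercept; returns the weights."""
    X1 = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return weights

def predict_linear(weights, X):
    return weights[0] + X @ weights[1:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def grid_search_lag(train, valid, lag_values):
    """Return the lag order with the lowest validation RMSE, and that RMSE."""
    best_lag, best_error = None, float("inf")
    for lag in lag_values:
        X_train, y_train = make_lagged(train, lag)
        # Prepend the last `lag` training points so that every
        # validation day has a full history of lagged values.
        X_valid, y_valid = make_lagged(np.concatenate([train[-lag:], valid]), lag)
        weights = fit_linear(X_train, y_train)
        error = rmse(y_valid, predict_linear(weights, X_valid))
        if error < best_error:
            best_lag, best_error = lag, error
    return best_lag, best_error

# Synthetic stand-in for a normalized price series (not real market data).
rng = np.random.default_rng(42)
series = np.zeros(300)
for t in range(2, 300):
    series[t] = 0.5 * series[t - 1] + 0.3 * series[t - 2] + rng.normal(scale=0.1)

# Chronological split: no shuffling, so the model never sees future data.
n = len(series)
train = series[: n * 6 // 10]
valid = series[n * 6 // 10 : n * 8 // 10]
test = series[n * 8 // 10 :]

best_lag, best_error = grid_search_lag(train, valid, lag_values=range(1, 8))
print(best_lag, best_error)
```

The test set is left untouched during the search, so it can give an unbiased estimate once the lag order is fixed.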

### ARIMA
- Split data into training/validation/test sets
![image alt <>](./figures/7.jpg "Splitting")

- Perform grid search to find optimal parameters
```python
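def grid_search_order(order_values, X, y, y_val):
    # fit, training_set, validation_set and rmse stand for the project's own helpers;
    # y_val holds the validation targets.
    best_order, best_error = None, float("inf")
    for (p, d, q) in order_values:
        model = fit(training_set(X, y, p, d, q))
        y_hat = model.predict(validation_set(X, p, d, q))
        error = rmse(y_val, y_hat)   # validation error for this order
        if error < best_error:       # keep the order with the lowest error so far
            best_order, best_error = (p, d, q), error
    return best_order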
```
- Evaluate based on RMSE and visualization
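
The sketch below shows the shape of such a search in runnable form, with one important simplification: it is not a full ARIMA. It fits ARIMA(p, d, 0), that is, an AR(p) model on the differenced series estimated by least squares, with q fixed to 0 and d limited to 0 or 1. Estimating moving average terms needs an iterative fit, for which a library implementation (for example the ARIMA class in statsmodels) would be the natural choice. The function names and the synthetic random-walk data are assumptions.

```python
import numpy as np
from itertools import product

def lagged(series, p):
    """Rows of [x(t-1), ..., x(t-p)] paired with the target x(t)."""
    X = np.column_stack([series[p - k : len(series) - k] for k in range(1, p + 1)])
    return X, series[p:]

def fit_ar(series, p):
    """Least-squares AR(p) with an intercept; returns [intercept, a1, ..., ap]."""
    X, y = lagged(series, p)
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def one_step_forecasts(series, n_train, p, d):
    """One-step-ahead forecasts for series[n_train:] (d must be 0 or 1)."""
    work = np.diff(series, n=d) if d else series
    split = n_train - d                  # first validation index inside `work`
    coef = fit_ar(work[:split], p)       # fitted on training data only
    X_all, _ = lagged(work, p)           # row i predicts work[i + p]
    pred = coef[0] + X_all[split - p :] @ coef[1:]
    if d == 0:
        return pred
    # Undo the differencing: previous actual value plus the predicted change.
    return series[n_train - 1 : -1] + pred

def grid_search_order(series, n_train, p_values, d_values):
    """Return the (p, d, 0) order with the lowest validation RMSE, and that RMSE."""
    y_valid = series[n_train:]
    best_order, best_error = None, float("inf")
    for p, d in product(p_values, d_values):
        y_hat = one_step_forecasts(series, n_train, p, d)
        error = float(np.sqrt(np.mean((y_valid - y_hat) ** 2)))
        if error < best_error:
            best_order, best_error = (p, d, 0), error
    return best_order, best_error

# Synthetic random walk with drift as a stand-in for prices (not real data).
rng = np.random.default_rng(7)
series = 100.0 + np.cumsum(rng.normal(loc=0.05, scale=1.0, size=250))
train_valid = series[:200]               # the last 50 points stay aside as the test set

best_order, best_error = grid_search_order(
    train_valid, n_train=150, p_values=[1, 2, 3], d_values=[0, 1]
)
print(best_order, best_error)
```

Each forecast uses the actual values up to the previous day, which matches the day-by-day prediction task described in the problem statement.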

## Model Evaluation
- Compare the RMSE of the 2 approaches
- Explain the results based on visualizations
![image alt <>](./figures/9.jpg "Prediction")

# Reference
[https://www.quora.com/What-is-ARIMA](https://www.quora.com/What-is-ARIMA)

