# Guided Project 17. Predicting Stock Market

In this project, we'll work with data from the S&P500 Index. The S&P500 is a stock market index. We'll be using historical data on the price of the S&P500 Index to make predictions about future prices. 

We'll be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in `sphist.csv`.

The columns of the dataset are:

- Date -- The date of the record.
- Open -- The opening price of the day (when trading starts).
- High -- The highest trade price during the day.
- Low -- The lowest trade price during the day.
- Close -- The closing price for the day (when trading is finished).
- Volume -- The number of shares traded.
- Adj Close -- The daily closing price, adjusted retroactively to include any corporate actions. Read more [here](http://www.investopedia.com/terms/a/adjusted_closing_price.asp).

We'll be using this dataset to develop a predictive model. We'll train the model with data from 1950-2012 and try to make predictions from 2013-2015.

### Reading data

Let's read in the dataframe and sort it on the Date column. It's currently in descending order, but we'll want it to be in ascending order for some of the next steps.

In [11]:
import pandas as pd
from datetime import datetime
sp = pd.read_csv('sphist.csv', parse_dates=['Date'])
sp = sp.sort_values('Date').reset_index(drop=True)

sp.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
1,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
2,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
3,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
4,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


### Generating indicators 

Here are some indicators that are interesting to generate for each row (a "day" here means a trading day, that is, one row of the dataset):

- The average price from the past 5 days.
- The average price for the past 30 days.
- The average price for the past 365 days.
- The ratio between the average price for the past 5 days, and the average price for the past 365 days.

- The standard deviation of the price over the past 5 days.
- The standard deviation of the price over the past 365 days.
- The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.

When calculating these, we have to be careful not to use the current row in the values we average. We want to teach the model how to predict the current price from historical prices. If we include the current price in the prices we average, it will be equivalent to handing the answers to the model upfront, and will make it impossible to use in the "real world", where we don't know the price upfront.

We will use the pandas `rolling` method, which does the kind of computation we need here. We'll set the window equal to the number of past trading days we want to use to compute each indicator. The result contains NaN values for any row where there aren't enough historical trading days to do the computation.

Note: There is one major caveat here: by default, the rolling window ends at the current row, so the rolling mean includes the current day's price. To avoid this, we'll shift the resulting series "forward" by one row with the `shift` method, so that each row's indicator is computed only from earlier days.
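To see what the shift does, here is a minimal sketch on a toy series. The prices and the 3-row window are made up for illustration and are not taken from `sphist.csv`.

```python
import pandas as pd

# Toy closing prices, purely illustrative (not real S&P500 data).
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Rolling mean over a 3-row window: the value at row i averages rows i-2..i,
# so it includes the current row's price.
with_current = close.rolling(3).mean()

# Shifting by one row moves each average down one position, so the value
# at row i now averages rows i-3..i-1 and excludes the current price.
past_only = close.rolling(3).mean().shift(1)

demo = pd.DataFrame({
    'Close': close,
    'With current': with_current,
    'Past only': past_only,
})
print(demo)
```

At row 3 (close of 13), the unshifted mean already contains 13, while the shifted mean is the average of 10, 11, and 12 only. The shift also costs one extra NaN row at the start of the series.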

In [22]:
# Rolling means of the closing price, shifted one row so each value uses only earlier days
sp['Avg 5'] = sp['Close'].rolling(5).mean().shift(1)
sp['Avg 30'] = sp['Close'].rolling(30).mean().shift(1)
sp['Avg 365'] = sp['Close'].rolling(365).mean().shift(1)
sp['Avg Ratio'] = sp['Avg 5'] / sp['Avg 365']

# Rolling standard deviations of the closing price, shifted the same way
sp['Std 5'] = sp['Close'].rolling(5).std().shift(1)
sp['Std 30'] = sp['Close'].rolling(30).std().shift(1)
sp['Std 365'] = sp['Close'].rolling(365).std().shift(1)
sp['Std Ratio'] = sp['Std 5'] / sp['Std 365']

sp['Std Ratio'].loc[363:367]

363         NaN
364         NaN
365    0.143121
366    0.119409
367    0.051758
Name: Std Ratio, dtype: float64

### Splitting the data
Let's remove the rows where the indicators could not be calculated, and split the data into training and testing datasets.

In [23]:
print(sp.shape)
sp.head()

(16590, 15)


Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,Avg 5,Avg 30,Avg 365,Avg Ratio,Std 5,Std 30,Std 365,Std Ratio
0,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,,,,,,
1,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,,,,,,
2,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,,,,,,
3,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,,,,,,
4,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,,,,,,,,


In [32]:
sp.dropna(axis=0, inplace=True)
print(sp.shape)
sp.head().iloc[:,0:8]

(16225, 15)


Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,Avg 5
365,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8
366,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9
367,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972
368,1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96
369,1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862


In [25]:
train = sp[sp['Date'] < datetime(year=2013, month=1, day=1)]
test = sp[sp['Date'] >= datetime(year=2013, month=1, day=1)]

In [28]:
print(train.shape)
print(test.shape)
test.head()[['Date']]

(15486, 15)
(739, 15)


Unnamed: 0,Date
15851,2013-01-02
15852,2013-01-03
15853,2013-01-04
15854,2013-01-07
15855,2013-01-08


### Making predictions

In [42]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from math import sqrt

features = sp.columns[7:]
target = 'Close'
lr = LinearRegression()
lr.fit(train[features], train[target])
predictions = lr.predict(test[features])

# scikit-learn metrics take (y_true, y_pred) in that order
mae = mean_absolute_error(test[target], predictions)
mse = mean_squared_error(test[target], predictions)
rmse = sqrt(mse)

print('Features: {}'.format(list(features)))
print('MAE: {}, RMSE: {}'.format(mae,rmse))

Features: ['Avg 5', 'Avg 30', 'Avg 365', 'Avg Ratio', 'Std 5', 'Std 30', 'Std 365', 'Std Ratio']
MAE: 16.216208263589817, RMSE: 22.20742791793927
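
To make the two metrics concrete, here is a small sketch that computes them by hand. The `actual` and `predicted` values are hypothetical numbers chosen for easy arithmetic, not output from the model above.

```python
from math import sqrt

# Hypothetical actual and predicted closing prices, for illustration only.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 100.0, 101.0, 109.0]

# Signed prediction errors: 1, -2, 0, 4
errors = [p - a for p, a in zip(predicted, actual)]

# MAE: the average magnitude of the errors.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: the square root of the average squared error, so large misses weigh more.
rmse = sqrt(sum(e ** 2 for e in errors) / len(errors))

print('MAE: {}, RMSE: {}'.format(mae, rmse))
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss by much more than the rest. In the toy data, the single error of 4 is what pulls the RMSE above the MAE.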


### Improving error

Congratulations! We can now predict the S&P500 (with some error). We can reduce the error of this model significantly, though. Let's think about some indicators that might be helpful to compute.

Here are some ideas that might be helpful:

- The average volume over the past five days.
- The average volume over the past year.
- The ratio between the average volume for the past five days, and the average volume for the past year.
- The standard deviation of the average volume over the past five days.
- The standard deviation of the average volume over the past year.
- The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
- The year component of the date.
- The ratio between the lowest price in the past year and the current price.
- The ratio between the highest price in the past year and the current price.
- The month component of the date.
- The day of week.
- The day component of the date.
- The number of holidays in the prior month.

Let's add two additional indicators to our dataframe and see if the error is reduced. We'll need to insert these indicators at the same point where we insert the others, before we clean out rows with NaN values and split the dataframe into `train` and `test`.
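
As one possible choice, the sketch below adds the first two ideas from the list: the average volume over the past five days and over the past year. The helper name `add_volume_indicators` and the column names `Avg Vol 5` and `Avg Vol 365` are assumptions made for this example, and the 365-row window follows the convention used for the price indicators. Whether these columns actually reduce the error has to be checked by rerunning the model.

```python
import pandas as pd

def add_volume_indicators(df):
    """Return a copy of df with two shifted rolling-volume averages added.

    Hypothetical helper: the column names are illustrative choices.
    """
    out = df.copy()
    # shift(1) keeps the current day's volume out of its own indicator,
    # mirroring how the price indicators are built.
    out['Avg Vol 5'] = out['Volume'].rolling(5).mean().shift(1)
    out['Avg Vol 365'] = out['Volume'].rolling(365).mean().shift(1)
    return out

# Tiny synthetic frame to show the shape of the result (not real S&P500 data).
toy = pd.DataFrame({'Volume': [float(v) for v in range(1, 8)]})
toy = add_volume_indicators(toy)
print(toy)
```

In the toy frame, `Avg Vol 365` is NaN everywhere because there are far fewer than 365 rows, which is the same reason the first rows of the real data are dropped. In the notebook, the call would be `sp = add_volume_indicators(sp)` before `dropna`. Because the features are selected with `sp.columns[7:]`, the appended columns would then be picked up without further changes.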

### Next steps

There's a lot of improvement to be made on the indicator side and we urge you to think of better indicators that you could use for prediction. We can also make significant structural improvements to the algorithm and pull in data from other sources.

- Accuracy would improve greatly by making predictions only one day ahead. For example, train a model using data from 1951-01-03 to 2013-01-02, make predictions for 2013-01-03, and then train another model using data from 1951-01-03 to 2013-01-03, make predictions for 2013-01-04, and so on. This more closely simulates what you'd do if you were trading using the algorithm.

- You can also improve the algorithm used significantly. Try other techniques, like a random forest, and see if they perform better.

- You can also incorporate outside data, such as the weather in New York City (where most trading happens) the day before and the amount of Twitter activity around certain stocks.

- You can also make the system real-time by writing an automated script to download the latest data when the market closes and make predictions for the next day.

- Finally, you can make the system "higher-resolution". You're currently making daily predictions, but you could make hourly, minute-by-minute, or second-by-second predictions. This requires obtaining more data, though. You could also make predictions for individual stocks instead of the S&P500.
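
The one-day-ahead idea from the first bullet above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the helper `walk_forward_predict` and its parameters are hypothetical, and a synthetic random walk stands in for the real prices so that the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def walk_forward_predict(df, features, target, first_test_idx):
    """Refit on all rows before position i, then predict row i, for each test row.

    Hypothetical helper; assumes df is sorted by date and has no NaN values.
    """
    preds = []
    for i in range(first_test_idx, len(df)):
        history = df.iloc[:i]   # every row strictly before row i
        today = df.iloc[[i]]    # the single row to predict, kept as a DataFrame
        model = LinearRegression()
        model.fit(history[features], history[target])
        preds.append(model.predict(today[features])[0])
    return pd.Series(preds, index=df.index[first_test_idx:])

# Synthetic random-walk prices (illustrative only, not real S&P500 data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({'Close': 100 + rng.normal(0, 1, 60).cumsum()})
toy['Avg 5'] = toy['Close'].rolling(5).mean().shift(1)
toy = toy.dropna().reset_index(drop=True)

preds = walk_forward_predict(toy, ['Avg 5'], 'Close', first_test_idx=40)
print(len(preds))
print(preds.head())
```

Applied to the notebook's data, `first_test_idx` would be the number of training rows, assuming `sp` is still sorted by date after `dropna`. Refitting once per test day is much slower than fitting a single model, which is the price of simulating daily retraining.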