# Predicting the Stock Market

This project will utilise historic data from the [S&P500 Index](https://en.wikipedia.org/wiki/S%26P_500_Index). Indexes aggregate the stock prices of multiple companies, often grouped by geographic location or sector. 

The dataset has the following headers:
- `Date` - Date of the record
- `Open` - Opening price for the day
- `High` - Highest trade price during the day
- `Low` - Lowest trade price during the day
- `Close` - Closing price for the day
- `Volume` - Number of shares traded
- `Adj Close` - Daily closing price, adjusted retroactively to include any corporate actions

We will use this to train a model on data from 1950 to 2012, and make predections on prices between 2013 and 2015.

## Reading in the Data

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv('https://raw.githubusercontent.com/nasimjafari7/PythonProjects/master/Guided%20Project_%20Predicting%20the%20stock%20market/sphist.csv')
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [2]:
# Converting the 'Date' column to datetime and sorting from oldest to newest
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')

## Generating Indicators

Stock market data is sequential and not all independent. Therefore an appropriate way to build a model is by generating indicators of a defined window of historic data and use this to predict future price. These indicators can use a variety of calculations such as averages, ratios and/or standard deviations. 

The model in this instance will be built using the following indicators:
- Average price from the past 10 days
- Average price from the past 30 days
- Average price from the past 365 days

In [3]:
def add_indicator_col(df, no_days, indicator, col, function):
    # Create a series with dates as indexes
    s = pd.Series(np.array(df[col]), index=np.array(df['Date']))
    
    # Calculate mean price over previous range, shift indices to exclude current day
    means = s.rolling(window=no_days).apply(function)
    means = means.shift()
    
    means = means.reset_index()
    means = means.rename(columns={'index':'Date', 0:indicator})
    df_new = df.merge(means, left_on='Date', right_on='Date')
    return df_new

df = add_indicator_col(df, 10, 'means_past_10', 'Close', np.mean)
df = add_indicator_col(df, 30, 'means_past_30', 'Close', np.mean)
df = add_indicator_col(df, 365, 'means_past_365', 'Close', np.mean)

print(df.head(15), '\n', df.tail(10))

         Date       Open       High        Low      Close     Volume  \
0  1950-01-03  16.660000  16.660000  16.660000  16.660000  1260000.0   
1  1950-01-04  16.850000  16.850000  16.850000  16.850000  1890000.0   
2  1950-01-05  16.930000  16.930000  16.930000  16.930000  2550000.0   
3  1950-01-06  16.980000  16.980000  16.980000  16.980000  2010000.0   
4  1950-01-09  17.080000  17.080000  17.080000  17.080000  2520000.0   
5  1950-01-10  17.030001  17.030001  17.030001  17.030001  2160000.0   
6  1950-01-11  17.090000  17.090000  17.090000  17.090000  2630000.0   
7  1950-01-12  16.760000  16.760000  16.760000  16.760000  2970000.0   
8  1950-01-13  16.670000  16.670000  16.670000  16.670000  3330000.0   
9  1950-01-16  16.719999  16.719999  16.719999  16.719999  1460000.0   
10 1950-01-17  16.860001  16.860001  16.860001  16.860001  1790000.0   
11 1950-01-18  16.850000  16.850000  16.850000  16.850000  1570000.0   
12 1950-01-19  16.870001  16.870001  16.870001  16.870001  11700

## Splitting up the Data

In [4]:
# Removing rows with data from 3rd Jan 1951 or earlier, and dropping rows will null values
df_updated = df[df['Date'] >= datetime(1951,1,3)]
df_clean = df_updated.dropna(axis=0)

# Splitting the data into train and test by date
train = df_clean[df_clean['Date'] < datetime(2013,1,1)]
test = df_clean[df_clean['Date'] >= datetime(2013,1,1)]
print(df_clean.isnull().sum(), df_clean.shape)

Date              0
Open              0
High              0
Low               0
Close             0
Volume            0
Adj Close         0
means_past_10     0
means_past_30     0
means_past_365    0
dtype: int64 (16225, 10)


## Making Predictions

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
features = ['means_past_10', 'means_past_30', 'means_past_365']
lr.fit(train[features], train['Close'])
train_predictions = lr.predict(train[features])
test_predictions = lr.predict(test[features])

train_rmse = np.sqrt(mean_squared_error(train['Close'], train_predictions))
test_rmse = np.sqrt(mean_squared_error(test['Close'], test_predictions))

print(train_rmse, test_rmse)

12.866827318853254 27.538908144131845


## Improving Error

In an attempt to minimise error, I will add two more indicators to use as features:
- Average `volume` over the past five days
- Average `volume` over the past year

In [6]:
df = add_indicator_col(df, 10, 'volume_past_5', 'Volume', np.mean)
df = add_indicator_col(df, 365, 'volume_past_365', 'Volume', np.mean)

In [7]:
df_updated = df[df['Date'] >= datetime(1951,1,3)]
df_clean = df_updated.dropna(axis=0)

train = df_clean[df_clean['Date'] < datetime(2013,1,1)]
test = df_clean[df_clean['Date'] >= datetime(2013,1,1)]
print(df_clean.isnull().sum(), df_clean.shape)

Date               0
Open               0
High               0
Low                0
Close              0
Volume             0
Adj Close          0
means_past_10      0
means_past_30      0
means_past_365     0
volume_past_5      0
volume_past_365    0
dtype: int64 (16225, 12)


In [8]:
lr = LinearRegression()
features = ['means_past_10', 'means_past_30', 'means_past_365', 'volume_past_5', 'volume_past_365']
lr.fit(train[features], train['Close'])
train_predictions_v2 = lr.predict(train[features])
test_predictions_v2 = lr.predict(test[features])

train_rmse_v2 = np.sqrt(mean_squared_error(train['Close'], train_predictions))
test_rmse_v2 = np.sqrt(mean_squared_error(test['Close'], test_predictions))

print(train_rmse_v2,'\n',test_rmse_v2)

12.866827318853254 
 27.538908144131845


Adding these two indicators did not improve the error rate, so let's add a couple more:
- Ratio between the 10 day average price and the 365 day average price
- Ratio between the 10 day standard deviation and the 365 day standard deviation

In [9]:
df['mean_price_ratio'] = df['means_past_10'] / df['means_past_365']

df = add_indicator_col(df, 10, 'std_past_10', 'Close', np.std)
df = add_indicator_col(df, 365, 'std_past_365', 'Close', np.std)

df['std_price_ratio'] = df['std_past_10'] / df['std_past_365']

In [10]:
df_updated = df[df['Date'] >= datetime(1951,1,3)]
df_clean = df_updated.dropna(axis=0)

train = df_clean[df_clean['Date'] < datetime(2013,1,1)]
test = df_clean[df_clean['Date'] >= datetime(2013,1,1)]
print(df_clean.isnull().sum(), df_clean.shape)

Date                0
Open                0
High                0
Low                 0
Close               0
Volume              0
Adj Close           0
means_past_10       0
means_past_30       0
means_past_365      0
volume_past_5       0
volume_past_365     0
mean_price_ratio    0
std_past_10         0
std_past_365        0
std_price_ratio     0
dtype: int64 (16225, 16)


In [11]:
lr = LinearRegression()
features = ['means_past_10', 'means_past_30', 'means_past_365', 'mean_price_ratio', 'std_price_ratio']
lr.fit(train[features], train['Close'])
train_predictions_v3 = lr.predict(train[features])
test_predictions_v3 = lr.predict(test[features])

train_rmse_v3 = np.sqrt(mean_squared_error(train['Close'], train_predictions))
test_rmse_v3 = np.sqrt(mean_squared_error(test['Close'], test_predictions))

print(train_rmse_v3,'\n',test_rmse_v3)

12.866827318853254 
 27.538908144131845


Still no change! Let's split out the date into 3 columns - day, month and week. From there I can look for correlation between columns and the close price, and use this to inform what columns to use to add more features.

In [15]:
df['year'] = df['Date'].dt.strftime('%Y').astype(float)
df['month'] = df['Date'].dt.strftime('%m').astype(float)
df['day'] = df['Date'].dt.strftime('%d').astype(float)
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,means_past_10,means_past_30,means_past_365,volume_past_5,volume_past_365,mean_price_ratio,std_past_10,std_past_365,std_price_ratio,year,month,day
0,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,,,,,,,,1950.0,1.0,3.0
1,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,,,,,,,,1950.0,1.0,4.0
2,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,,,,,,,,1950.0,1.0,5.0
3,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,,,,,,,,1950.0,1.0,6.0
4,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,,,,,,,,,,1950.0,1.0,9.0


In [18]:
df_updated = df[df['Date'] >= datetime(1951,1,3)]
df_clean = df_updated.dropna(axis=0)
train = df_clean[df_clean['Date'] < datetime(2013,1,1)]
test = df_clean[df_clean['Date'] >= datetime(2013,1,1)]
df_clean.corr()['Close']

Open                0.999900
High                0.999953
Low                 0.999956
Close               1.000000
Volume              0.772817
Adj Close           1.000000
means_past_10       0.999673
means_past_30       0.999189
means_past_365      0.988870
volume_past_5       0.783639
volume_past_365     0.784878
mean_price_ratio    0.047883
std_past_10         0.755110
std_past_365        0.816103
std_price_ratio     0.047118
year                0.872100
month               0.005684
day                -0.001525
Name: Close, dtype: float64

In [19]:
lr = LinearRegression()
features = ['means_past_10', 'means_past_30', 'means_past_365', 'year', 'std_past_365']
lr.fit(train[features], train['Close'])
train_predictions_v4 = lr.predict(train[features])
test_predictions_v4 = lr.predict(test[features])

train_rmse_v4 = np.sqrt(mean_squared_error(train['Close'], train_predictions))
test_rmse_v4 = np.sqrt(mean_squared_error(test['Close'], test_predictions))

print(train_rmse_v3,'\n',test_rmse_v3)

12.866827318853254 
 27.538908144131845
