## Dataset and Problem Introduction

In this analysis, we use historical daily prices of the [S&P 500 Index](https://en.wikipedia.org/wiki/S%26P_500_Index) from 1950 to 2015 to predict future prices. We train a linear regression model on data from 1950-2012 and evaluate its predictions on 2013-2015.

Data Source: https://finance.yahoo.com/quote/%5EGSPC/history/
<br>Reference: https://dataquest.io/

## Data
The columns of the dataset are:
- **Date** - The date of the record.
- **Open** - The opening price of the day (when trading starts).
- **High** - The highest trade price during the day.
- **Low** - The lowest trade price during the day.
- **Close** - The closing price for the day (when trading is finished).
- **Volume** - The number of shares traded.
- **Adj Close** - The daily closing price, adjusted retroactively to include any corporate actions.

The target column for the prediction is the **Close** column.

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv('datasets/sphist.csv')
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values('Date', ascending = True)
df.head(5)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


## Data Manipulation
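We add three indicators to the dataset. Each window counts rows, so "days" here means trading days; 365 trading days span roughly a year and a half of calendar time.
- The average closing price over the previous **5** trading days.
- The average closing price over the previous **30** trading days.
- The average closing price over the previous **365** trading days.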
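As a minimal sketch of how such an indicator is built, the snippet below uses the built-in rolling mean on the first six closing prices from the table above. The variable names are illustrative only; the notebook's own helper follows in the next cell.

```python
import pandas as pd

# First six closing prices from the table above
close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.030001])

# Mean over a 5-row window; shift(1) moves each value down one row
# so that a day's own price is excluded from its indicator.
mean_close_5 = close.rolling(window=5).mean().shift(1)

print(mean_close_5.round(3).tolist())
```

The first five rows have no value because fewer than five earlier prices exist; the sixth row holds the mean of the first five prices.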

In [2]:
def add_mean_indicator_col(df, num_days, indicator_name, col, function):
    #Make a series of Close price with the dates as indexes
    s = pd.Series(np.array(df[col]), index=np.array(df["Date"]))
    #calculate the mean price of past days
    means = s.rolling(window = num_days).apply(function)
    
    #Shift indices to exclude the price of each day from the mean value
    means = means.shift()
    
    #convert indices to Date Column
    means = means.reset_index()
    means = means.rename(columns={'index':'Date', 0:indicator_name})
    
    df_new = df.merge(means, left_on='Date', right_on='Date')
    return df_new

df = add_mean_indicator_col(df, 5, 'mean_close_5', 'Close', np.mean)
df = add_mean_indicator_col(df, 30, 'mean_close_30', 'Close', np.mean)
df = add_mean_indicator_col(df, 365, 'mean_close_365', 'Close', np.mean)
print('head:\n', df.head(7))
print('tail:\n', df.tail(5))

head:
         Date       Open       High        Low      Close     Volume  \
0 1950-01-03  16.660000  16.660000  16.660000  16.660000  1260000.0   
1 1950-01-04  16.850000  16.850000  16.850000  16.850000  1890000.0   
2 1950-01-05  16.930000  16.930000  16.930000  16.930000  2550000.0   
3 1950-01-06  16.980000  16.980000  16.980000  16.980000  2010000.0   
4 1950-01-09  17.080000  17.080000  17.080000  17.080000  2520000.0   
5 1950-01-10  17.030001  17.030001  17.030001  17.030001  2160000.0   
6 1950-01-11  17.090000  17.090000  17.090000  17.090000  2630000.0   

   Adj Close  mean_close_5  mean_close_30  mean_close_365  
0  16.660000           NaN            NaN             NaN  
1  16.850000           NaN            NaN             NaN  
2  16.930000           NaN            NaN             NaN  
3  16.980000           NaN            NaN             NaN  
4  17.080000           NaN            NaN             NaN  
5  17.030001        16.900            NaN             NaN  
6  1

Because each indicator is computed from earlier rows, the first rows of the dataset do not have enough history and contain `NaN`. The longest window is 365 rows and the dataset starts on 1950-01-03, so we first remove everything before 1951-01-03. Since the window counts trading days rather than calendar days, some rows after that date still lack a full 365-row history; the `dropna` call removes those as well.
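The effect of the window length on the first usable row can be checked on stand-in data. The sketch below assumes 400 made-up business days of prices (not the real dataset) and the same shift-by-one convention as above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 400 business days of made-up prices
dates = pd.bdate_range("1950-01-03", periods=400)
toy = pd.DataFrame({"Date": dates, "Close": np.linspace(16.0, 20.0, 400)})

# 365-row trailing mean, shifted so the current row is excluded
toy["mean_close_365"] = toy["Close"].rolling(window=365).mean().shift(1)

# Position of the first row that has a full 365-row history
first_valid = toy["mean_close_365"].first_valid_index()
print(first_valid, toy.loc[first_valid, "Date"].date())

# dropna removes exactly the rows that lack enough history
clean = toy.dropna(axis=0)
print(len(toy) - len(clean))
```

On this toy frame, the first usable row falls later than one calendar year after the start date, which is why the date filter alone is not sufficient.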

In [3]:
#Remove the data before 1951-01-03
df_updated = df[df["Date"] > datetime(year=1951, month=1, day=2)]

# Drop all rows containing null values
df_clean = df_updated.dropna(axis = 0)
df_clean.isnull().sum()

Date              0
Open              0
High              0
Low               0
Close             0
Volume            0
Adj Close         0
mean_close_5      0
mean_close_30     0
mean_close_365    0
dtype: int64

## Train and Test 
We generate two new data frames to use in the algorithm.
- **train** contains any rows in the data with a date less than 2013-01-01. 
- **test** contains any rows with a date greater than or equal to 2013-01-01

A **linear regression model** is fit on the train set and then used to predict the test set. The **root mean squared error (RMSE)**, which is expressed in the same units as the index price, measures the forecast error.
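For reference, RMSE is the square root of the average squared difference between actual and predicted values. A small worked sketch with made-up numbers (the `rmse` helper is illustrative and not part of the notebook):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values for illustration only
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [98.0, 103.0, 104.0, 105.0]

# Errors: 2, -1, -3, 0 -> squared: 4, 1, 9, 0 -> mean 3.5 -> square root
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.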

In [4]:
def train_test(df, features):
    train  = df[df["Date"] < datetime(year=2013, month=1, day=1)]
    test = df[df["Date"] >= datetime(year=2013, month=1, day=1)]
    #initialize model
    lr = LinearRegression()
    target = 'Close'

    #Train
    lr.fit(train[features], train[target])

    #Test
    predictions = lr.predict(test[features])

    #Calculate error
    mse = mean_squared_error(test[target], predictions)
    rmse = np.sqrt(mse)
    return rmse

In [5]:
features = ['mean_close_5', 'mean_close_30', 'mean_close_365']
rmse = train_test(df_clean, features)
rmse

22.220065324219927

We now add two more indicators to see whether they improve the predictions and reduce the error.
- The average **volume** over the **past five days**.
- The average **volume** over the **past year**.

In [6]:
df = add_mean_indicator_col(df, 5, 'mean_volume_5', 'Volume', np.mean)
df = add_mean_indicator_col(df, 365, 'mean_volume_365', 'Volume', np.mean)
df_clean = df.dropna(axis = 0)
features = ['mean_volume_5', 'mean_volume_365']
rmse = train_test(df_clean, features)
print('RMSE for features = [mean_volume_5, mean_volume_365]:', rmse)

features = ['mean_close_5', 'mean_close_30', 'mean_close_365', 'mean_volume_5', 'mean_volume_365']
rmse = train_test(df_clean, features)
print('RMSE for features = [mean_close_5, mean_close_30, mean_close_365, mean_volume_5, mean_volume_365]:', rmse)

RMSE for features = [mean_volume_5, mean_volume_365]: 732.2270657604481
RMSE for features = [mean_close_5, mean_close_30, mean_close_365, mean_volume_5, mean_volume_365]: 22.234505187470507


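Adding the rolling averages of the Volume column does not improve the predictions: the RMSE with the volume features (22.23) is slightly higher than without them (22.22), and the volume features on their own perform far worse (732.23).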

Now we add the following indicators and see the results:
- The **ratio** between the average price for the past 5 days, and the average price for the past 365 days.
- The **ratio** between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.
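One detail worth noting about the standard deviation: `np.std` defaults to the population formula (`ddof=0`), while the built-in pandas rolling `std` defaults to the sample formula (`ddof=1`). The sketch below, on the first six closing prices from the table above, shows that the two agree only when `ddof` is set explicitly; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.030001])

# np.std defaults to ddof=0 (population standard deviation)
std_np = close.rolling(window=5).apply(np.std, raw=True).shift(1)

# The built-in rolling std defaults to ddof=1 (sample standard deviation)
std_pd = close.rolling(window=5).std().shift(1)

# Passing ddof=0 makes the built-in version match the np.std version
std_pd0 = close.rolling(window=5).std(ddof=0).shift(1)

print(std_np.iloc[5], std_pd.iloc[5], std_pd0.iloc[5])
```

For a 365-row window the difference between the two conventions is tiny, but for a 5-row window the sample version is larger by a factor of sqrt(5/4), which would scale any ratio between the two windows.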

In [7]:
df['ratio_mean_close'] = df['mean_close_5']/df['mean_close_365']

df = add_mean_indicator_col(df, 5, 'std_close_5', 'Close', np.std)
df = add_mean_indicator_col(df, 365, 'std_close_365', 'Close', np.std)
df['ratio_std_close'] = df['std_close_5']/df['std_close_365']

#Remove nulls and model
df_clean = df.dropna(axis = 0)

df_clean.corr()['Close']

Open                0.999900
High                0.999953
Low                 0.999956
Close               1.000000
Volume              0.772817
Adj Close           1.000000
mean_close_5        0.999793
mean_close_30       0.999189
mean_close_365      0.988870
mean_volume_5       0.780896
mean_volume_365     0.784878
ratio_mean_close    0.047782
std_close_5         0.722414
std_close_365       0.816103
ratio_std_close     0.087018
Name: Close, dtype: float64

The correlation coefficients above show that the two ratios are only very weakly correlated with the price (0.048 and 0.087).

In [8]:
#features = ['ratio_mean_close', 'ratio_std_close', 'mean_close_5', 'mean_close_365', 'mean_volume_5', 'mean_volume_365']
features = ['mean_close_5', 'mean_close_365']
rmse = train_test(df_clean, features)
print('RMSE for mean of Close without ratio: ', rmse)

features = ['mean_close_5', 'mean_close_365', 'ratio_mean_close']
rmse = train_test(df_clean, features)
print('RMSE for mean of Close with ratio: ', rmse)

features = ['std_close_5', 'std_close_365']
rmse = train_test(df_clean, features)
print('RMSE for std of Close without ratio: ', rmse)

features = ['std_close_5', 'std_close_365', 'ratio_std_close']
rmse = train_test(df_clean, features)
print('RMSE for std of Close with ratio: ', rmse)

features = ['ratio_mean_close', 'ratio_std_close', 'mean_close_5', 'mean_close_365', 'std_close_5', 'std_close_365']
rmse = train_test(df_clean, features)
print('All: ', rmse)

RMSE for mean of Close without ratio:  22.178420498912217
RMSE for mean of Close with ratio:  22.178149148967584
RMSE for std of Close without ratio:  802.681605426773
RMSE for std of Close with ratio:  802.6811942101356
All:  22.15180399006527


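The ratios have a negligible effect on the error: adding a ratio to either feature pair changes the RMSE only in the fourth decimal place, and the model with all six features improves on the two mean features by less than 0.03.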

We also derive the following indicators from the **Date** column:
- The **year** component of the date.
- The **month** component of the date.
- The **day** component of the date.

In [9]:
df['year'] = df['Date'].dt.year.astype(float)
df['month'] = df['Date'].dt.month.astype(float)
df['day'] = df['Date'].dt.day.astype(float)
df_clean = df.dropna(axis = 0)

df_clean.corr()['Close']

Open                0.999900
High                0.999953
Low                 0.999956
Close               1.000000
Volume              0.772817
Adj Close           1.000000
mean_close_5        0.999793
mean_close_30       0.999189
mean_close_365      0.988870
mean_volume_5       0.780896
mean_volume_365     0.784878
ratio_mean_close    0.047782
std_close_5         0.722414
std_close_365       0.816103
ratio_std_close     0.087018
year                0.872100
month               0.005684
day                -0.001525
Name: Close, dtype: float64

In [10]:
features = ['year', 'month', 'day']
rmse = train_test(df_clean, features)
print("RMSE for features = ['year', 'month', 'day']: ", rmse)

features = ['mean_close_5', 'mean_close_365','year']
rmse = train_test(df_clean, features)
print("RMSE for features = ['mean_close_5', 'mean_close_365','year']: ", rmse)

features = ['mean_close_5', 'mean_close_365','year', 'month', 'day']
rmse = train_test(df_clean, features)
print("RMSE for features = ['mean_close_5', 'mean_close_365','year', 'month', 'day']: ", rmse)

RMSE for features = ['year', 'month', 'day']:  719.8237863775871
RMSE for features = ['mean_close_5', 'mean_close_365','year']:  22.193454834355535
RMSE for features = ['mean_close_5', 'mean_close_365','year', 'month', 'day']:  22.18586410431193


These date-based indicators do not have a significant effect either.

## Making predictions one day ahead
We now check whether accuracy improves when we predict only one day ahead. For example, we train a model on data from 1951-01-03 to 2013-01-02 and predict 2013-01-03, then train another model on data from 1951-01-03 to 2013-01-03 and predict 2013-01-04, and so on. This more closely simulates how the algorithm would be used for trading.
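A minimal sketch of this walk-forward scheme on made-up data follows. It rests on several assumptions: `walk_forward_rmse` and its `start` parameter are hypothetical names, `np.linalg.lstsq` stands in for `LinearRegression` to keep the example dependency-light, and the errors are pooled into a single RMSE over the evaluated rows.

```python
import numpy as np

def walk_forward_rmse(X, y, start):
    """Refit on rows [0, i) and predict row i, for every i >= start.

    Returns one RMSE over all one-step-ahead predictions.
    """
    # Prepend a column of ones so the fit includes an intercept
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    errors = []
    for i in range(start, len(y)):
        # Least-squares fit using only rows strictly before row i
        coef, *_ = np.linalg.lstsq(X[:i], y[:i], rcond=None)
        errors.append(y[i] - X[i] @ coef)
    return float(np.sqrt(np.mean(np.square(errors))))

# Made-up data: the target is an exact linear function of one feature
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * x[:, 0] + 2.0
print(walk_forward_rmse(x, y, start=15))
```

Choosing `start` as the position of the first test-period row would keep the evaluation on the same dates as the earlier fixed-split experiments.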

In [11]:
def train_test(df, features, row):
    train  = df[df["Date"] < row['Date']]
    test = df[df["Date"] == row['Date']]
    if len(train) == 0:
        return np.nan
    else:        
        #initialize model
        lr = LinearRegression()
        target = 'Close'

        #Train
        lr.fit(train[features], train[target])

        #Test
        predictions = lr.predict(test[features])

        #Calculate error
        mse = mean_squared_error(test[target], predictions)
        rmse = np.sqrt(mse)
        return rmse

In [12]:
features = ['ratio_mean_close', 'ratio_std_close', 'mean_close_5','mean_close_30', 'mean_close_365', 'mean_volume_5', 'mean_volume_365','std_close_5', 'std_close_365','year', 'month', 'day']
rmses = df_clean.apply(lambda row: train_test(df_clean, features, row), axis = 1 )

rmse = np.mean(rmses)
rmse

5.482714047042083

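This average error (5.48) is much lower than the earlier RMSE values of about 22. The two numbers are not directly comparable, however: the cell above applies `train_test` to every row of `df_clean`, not only to 2013-2015, so the average includes decades in which the index traded at far lower prices, and each per-row RMSE is simply the absolute error of a single prediction. The result suggests that predicting one day ahead reduces the error, but a like-for-like comparison would restrict the evaluation to the 2013-2015 rows.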

## Summary

In this project, we used S&P 500 Index data from 1950 to 2015 to predict the daily closing price of the index. We fit linear regression models with different sets of indicators and compared their errors. We also tried predicting only one day ahead, which produced a much lower average error, although that figure was computed over the whole history rather than over 2013-2015 alone.