# Predicting The Stock Market

In this project, we work with historical data from the S&P 500 Index, a stock market index. By analyzing the time series, we will try to predict the index's daily closing price.

The plan is to explore the data, then prepare a dataset with useful columns and values. Part of that preparation is computing 'indicators' from past observations, such as the average or the standard deviation of a column over the last five rows. Many such indicators are possible; the goal is to give the model more information to learn from when predicting the closing price.

Let's first read the data and set up the dataset.

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime

In [2]:
stock = pd.read_csv('sphist.csv')
stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


Because stock price data is a time series, the date column matters, and it needs to be in the right format. Let's check whether it is.

In [3]:
stock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       16590 non-null  object 
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB


Here we can see that the Date column has dtype 'object', meaning the dates are stored as strings. We need to convert this column to a proper 'datetime' type.

In [4]:
stock['Date'] = pd.to_datetime(stock['Date'])


In [5]:
# check again

stock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB


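Now we sort the dataset by date in ascending order, so that the oldest observations come first. The rolling calculations that follow depend on this order.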


In [6]:
sorted_stock = stock.sort_values(by=['Date'])
sorted_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


## Generating Indicators 

As mentioned earlier, the model has more to learn from if we add indicators computed at the row level. For each row, we can include values such as the average or the standard deviation of the last five observations.

pandas provides rolling-window methods that handle these row-by-row calculations, so we will use them.

It is also very important not to include information about the future in the training data. In other words, we don't want to include columns that 'leak' information about the target for the row being predicted.

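We will add an indicator, day_5, built from a 5-row rolling window over Close. Note that, as written, the window uses triangular weights (win_type='triang') and includes the current row's Close, so day_5 is a weighted average of the current close and the four before it rather than a plain average of the previous five days.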

In [7]:
sorted_stock['day_5'] = sorted_stock.Close.rolling(5, win_type='triang').mean()
sorted_stock.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,16.91
16584,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.982222
16583,1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,17.031111
16582,1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,17.018889
16581,1950-01-13,16.67,16.67,16.67,16.67,3330000.0,16.67,16.955556
16580,1950-01-16,16.719999,16.719999,16.719999,16.719999,1460000.0,16.719999,16.838889
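
To keep the current day's close out of the indicator, one option is an unweighted rolling mean shifted down by one row. The following is a minimal sketch of that idea, not the calculation used in this notebook; it runs on the first six Close values from the table above, and the name prev_5_mean is hypothetical.

```python
import pandas as pd

# First six Close values from the sorted table above.
close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.030001])

# Plain (unweighted) 5-row rolling mean, then shifted down one row so that
# each row only sees the five *previous* closes, never its own.
prev_5_mean = close.rolling(5).mean().shift(1)

print(prev_5_mean)
```

With this variant the first five rows are empty, and the sixth row holds the average of the first five closes.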



Because indicators are computed from historical data, the earliest rows don't have enough history to generate them. The day_5 indicator above only needs five rows, but an indicator that uses 365 days of history cannot be computed for any row before 1951-01-03, since the dataset starts on 1950-01-03. To leave room for such indicators, we remove all rows before 1951-01-03 before splitting the data.

In [8]:
clean_stock = sorted_stock[sorted_stock["Date"] > datetime(year=1951, month=1, day=2)]
clean_stock.head()


Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5
16339,1951-01-03,20.690001,20.690001,20.690001,20.690001,3370000.0,20.690001,20.508889
16338,1951-01-04,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.644445
16337,1951-01-05,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.73889
16336,1951-01-08,21.0,21.0,21.0,21.0,2780000.0,21.0,20.833334
16335,1951-01-09,21.120001,21.120001,21.120001,21.120001,3800000.0,21.120001,20.906667
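
An alternative to a fixed date cutoff is to drop whichever rows ended up with missing indicator values. This is only a sketch on a toy frame, and it assumes that missing indicators are the only source of NaN values; the names toy, toy_clean and prev_5_mean are hypothetical.

```python
import pandas as pd

# Toy frame: seven closes plus a shifted 5-row rolling mean.
toy = pd.DataFrame({
    "Close": [16.66, 16.85, 16.93, 16.98, 17.08, 17.030001, 17.09],
})
toy["prev_5_mean"] = toy["Close"].rolling(5).mean().shift(1)

# Drop every row in which any indicator could not be computed.
toy_clean = toy.dropna(axis=0)

print(len(toy), len(toy_clean))
```

The trade-off is that the number of dropped rows then depends on the longest window in use, whereas a date cutoff keeps the starting point fixed as indicators are added.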


In [10]:
# Check for null values

clean_stock.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Volume       0
Adj Close    0
day_5        0
dtype: int64

## Splitting Up The Data 

Let's split the data at the start of 2013: rows dated before 2013-01-01 form the training set, and rows from 2013-01-01 onward form the test set.

In [11]:
train_stock = clean_stock[clean_stock['Date'] < datetime(year=2013, month=1, day=1)]
train_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5
16339,1951-01-03,20.690001,20.690001,20.690001,20.690001,3370000.0,20.690001,20.508889
16338,1951-01-04,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.644445
16337,1951-01-05,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.73889
16336,1951-01-08,21.0,21.0,21.0,21.0,2780000.0,21.0,20.833334
16335,1951-01-09,21.120001,21.120001,21.120001,21.120001,3800000.0,21.120001,20.906667


In [12]:
test_stock = clean_stock[clean_stock['Date'] >= datetime(year=2013, month=1, day=1)]
test_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5
738,2013-01-02,1426.189941,1462.430054,1426.189941,1462.420044,4202600000.0,1462.420044,1419.791111
737,2013-01-03,1462.420044,1465.469971,1455.530029,1459.369995,3829730000.0,1459.369995,1431.748888
736,2013-01-04,1459.369995,1467.939941,1458.98999,1466.469971,3424290000.0,1466.469971,1447.475559
735,2013-01-07,1466.469971,1466.469971,1456.619995,1461.890015,3304970000.0,1461.890015,1458.218886
734,2013-01-08,1461.890015,1461.890015,1451.640015,1457.150024,3601600000.0,1457.150024,1462.388889


## Making Predictions 

We will use scikit-learn's LinearRegression class to fit the model on the training data.

In [13]:
from sklearn.linear_model import LinearRegression
features = ['Open', 'High', 'Low', 'Volume', 'Adj Close', 'day_5']
target = ['Close']

model = LinearRegression()
model.fit(train_stock[features], train_stock[target])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [14]:
## PREDICT 

predicted_target = model.predict(test_stock[features])


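We will use mean squared error (MSE) as the error metric, and also report its square root (RMSE), which is in the same units as the closing price.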

In [15]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test_stock[target], predicted_target)
mse

3.030323845679848e-20

In [17]:
rmse = np.sqrt(mse)
rmse

1.7407825383085182e-10
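
One quick sanity check on an error this small is to compare the Adj Close feature with the Close target. The following is a minimal sketch that uses five rows copied from the first table in this notebook rather than the full file; the names sample and adj_equals_close are hypothetical.

```python
import numpy as np
import pandas as pd

# Five rows copied from the first table in this notebook.
sample = pd.DataFrame({
    "Close":     [2077.070068, 2091.689941, 2049.620117, 2079.51001, 2102.629883],
    "Adj Close": [2077.070068, 2091.689941, 2049.620117, 2079.51001, 2102.629883],
})

# True when the feature column reproduces the target (within float tolerance).
adj_equals_close = bool(np.allclose(sample["Close"], sample["Adj Close"]))

print(adj_equals_close)
```

Running the same comparison on the full train and test frames would show whether the match holds for every row.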

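An RMSE this close to zero is not a sign of real predictive power. The features include Adj Close, which is identical to Close in every row shown above, along with the same day's Open, High and Low, so the model is effectively handed the answer. A more meaningful evaluation would drop those same-day columns and rely on indicators computed only from earlier rows. For example, we could introduce indicators such as: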

- The average volume over the past five days.
- The average volume over the past year.
- The ratio between the average volume for the past five days, and the average volume for the past year.
- The standard deviation of the average volume over the past five days.
- The standard deviation of the average volume over the past year.
- The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
- The ratio between the lowest price in the past year and the current price.
- The ratio between the highest price in the past year and the current price.
- The year component of the date.
- The month component of the date.
- The day of week.
- The day component of the date.
- The number of holidays in the prior month.
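
A few of the indicators listed above could be computed along these lines. This is a sketch on synthetic data, not on sphist.csv: the column names (vol_avg_5, vol_avg_year and so on) are hypothetical, and treating 252 rows as one trading year is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the date-sorted frame (business days only).
dates = pd.bdate_range("2012-01-02", periods=300)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Date": dates,
    "Close": 1300 + rng.normal(0, 5, len(dates)).cumsum(),
    "Volume": rng.integers(3_000_000_000, 5_000_000_000, len(dates)).astype(float),
})

# Rolling windows are counted in rows (trading days); 252 approximates a year.
# shift(1) keeps the current row's own value out of its indicator.
vol = df["Volume"]
df["vol_avg_5"] = vol.rolling(5).mean().shift(1)
df["vol_avg_year"] = vol.rolling(252).mean().shift(1)
df["vol_ratio_5_year"] = df["vol_avg_5"] / df["vol_avg_year"]

# Calendar components are known ahead of time, so no shift is needed.
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
df["day_of_week"] = df["Date"].dt.dayofweek  # Monday is 0
df["day"] = df["Date"].dt.day

print(df.tail())
```

Rows without a full year of history end up with NaN in the yearly columns, which is why the earliest part of the dataset has to be dropped before fitting.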


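**Moreover, we can improve the model by predicting only one day ahead and refitting as we go.** For example, we train on all data up to a given day, predict the next day, then add that day to the training data and fit a new model for the day after. Each prediction then uses the most recent information available.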
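
A day-by-day scheme of this kind could look roughly like the following. It is a sketch on synthetic data with a single shifted feature; the names data, prev_5_mean and first_test_row are hypothetical, and row 100 simply stands in for the start of the test period.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the prepared frame: one shifted feature and the target.
rng = np.random.default_rng(42)
close = pd.Series(100 + rng.normal(0, 1, 120).cumsum())
data = pd.DataFrame({
    "prev_5_mean": close.rolling(5).mean().shift(1),
    "Close": close,
}).dropna().reset_index(drop=True)

features, target = ["prev_5_mean"], "Close"
first_test_row = 100  # rows before this play the role of the training period

predictions = []
for i in range(first_test_row, len(data)):
    # Fit on every row strictly before row i, then predict row i only.
    model = LinearRegression().fit(data.loc[:i - 1, features], data.loc[:i - 1, target])
    predictions.append(model.predict(data.loc[[i], features])[0])

actual = data.loc[first_test_row:, target].to_numpy()
rmse = float(np.sqrt(np.mean((actual - np.array(predictions)) ** 2)))

print(rmse)
```

Refitting for every test row is slow on a long test period, so refitting at a coarser interval is a possible compromise.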

Other models, such as random forests, can be tried as well, and the models can be tuned further for more accurate predictions.


