USEFUL LINKS FOR DATA CLEANING:

https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learningnd-deep-learning-techniques-python/

https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/?utm_source=blog&utm_medium=stockmarketpredictionarticle

https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/

https://www.relataly.com/stock-market-prediction-with-multivariate-time-series-in-python/1815/

In [92]:
import pandas as pd
import calendar as cd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
import os

DIRTY_FILE_NAME = 'ES=F.csv'
CLEAN_FILE_NAME = '(Clean)dowjones_stocks.csv'

## DATA EXTRACTION AND CLEANING AND FORMATTING
### Data Extraction
Load the Dow Jones CSV data into a pandas DataFrame.

In [93]:
if os.path.exists(DIRTY_FILE_NAME):
    dowjones_stocks = pd.read_csv(DIRTY_FILE_NAME)
    print(dowjones_stocks.head())
else:
    print("Error: Input file not found")

         Date     Open     High      Low   Close  Adj Close    Volume
0  2000-09-18  1485.25  1489.75  1462.25  1467.5     1467.5  104794.0
1  2000-09-19  1467.00  1482.75  1466.75  1478.5     1478.5  103371.0
2  2000-09-20  1478.75  1480.50  1450.25  1469.5     1469.5  109667.0
3  2000-09-21  1470.25  1474.00  1455.50  1469.5     1469.5   98528.0
4  2000-09-22  1454.75  1471.00  1436.75  1468.5     1468.5   97416.0


### Data Cleaning

The Dow Jones stock price dataset contains a number of null records, typically on holidays or weekends, when the stock market is closed. To clean the dataset, we simply remove all null records. Because of the resulting gaps in the dates, the forecasting model does not predict a price for every subsequent calendar day. Instead, it predicts the price for the next day on which a stock price would normally be observed and recorded.
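As a small illustration of the cleaning step described above, the sketch below builds a hypothetical four-row frame (the rows are made up for illustration, not read from the input file) in which two rows stand in for a weekend with no quotes. It uses `dropna()`, which drops a row if any column is null; that is an assumption about how "remove all null records" could be written, and the notebook cell itself filters on the `Open` column instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the two middle rows stand in for days with no quotes.
toy = pd.DataFrame({
    "Date": ["2000-09-22", "2000-09-23", "2000-09-24", "2000-09-25"],
    "Open": [1454.75, np.nan, np.nan, 1469.50],
    "Close": [1468.50, np.nan, np.nan, 1461.00],
})

# Drop every row that contains at least one null value, then renumber the rows.
toy_cleaned = toy.dropna().reset_index(drop=True)
print(toy_cleaned)
```

After cleaning, the row that follows 2000-09-22 is 2000-09-25, so "the next row" means the next trading day rather than the next calendar day.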

In [94]:
# Keep only the rows where "Open" is present; null records have no price data.
dowjones_stocks_cleaned = dowjones_stocks.loc[dowjones_stocks["Open"].notnull()]
dowjones_stocks_cleaned

            Date     Open     High      Low    Close  Adj Close     Volume
0     2000-09-18  1485.25  1489.75  1462.25  1467.50    1467.50   104794.0
1     2000-09-19  1467.00  1482.75  1466.75  1478.50    1478.50   103371.0
2     2000-09-20  1478.75  1480.50  1450.25  1469.50    1469.50   109667.0
3     2000-09-21  1470.25  1474.00  1455.50  1469.50    1469.50    98528.0
4     2000-09-22  1454.75  1471.00  1436.75  1468.50    1468.50    97416.0
...          ...      ...      ...      ...      ...        ...        ...
6189  2020-11-16  3587.00  3637.00  3586.50  3623.00    3623.00  1303941.0
6190  2020-11-17  3625.50  3630.00  3584.25  3606.75    3606.75  1268206.0
6191  2020-11-18  3604.50  3623.25  3556.50  3565.00    3565.00  1325309.0
6192  2020-11-19  3562.00  3582.75  3542.25  3580.00    3580.00  1291117.0


Check if there are any null values left in the dataset. There are none left.

In [95]:
dowjones_stocks_cleaned.isnull().sum()

Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

In [96]:
# Save the cleaned data without the index column.
dowjones_stocks_cleaned.to_csv(CLEAN_FILE_NAME, index=False)

## BREAK POINT 1: Data cleaned and saved up to here. You can start from this point if the saved file is available

In [97]:
if os.path.exists(CLEAN_FILE_NAME):
    dowjones_stocks_cleaned = pd.read_csv(CLEAN_FILE_NAME,index_col=['Date'])
    print(dowjones_stocks_cleaned.head())
else:
    print("Error: Clean File not found. Restart from the beginning")

               Open     High      Low   Close  Adj Close    Volume
Date                                                              
2000-09-18  1485.25  1489.75  1462.25  1467.5     1467.5  104794.0
2000-09-19  1467.00  1482.75  1466.75  1478.5     1478.5  103371.0
2000-09-20  1478.75  1480.50  1450.25  1469.5     1469.5  109667.0
2000-09-21  1470.25  1474.00  1455.50  1469.5     1469.5   98528.0
2000-09-22  1454.75  1471.00  1436.75  1468.5     1468.5   97416.0
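The index loaded above holds the dates as plain strings. If a later step needs date arithmetic (for example, selecting a calendar year), one option is to parse the dates while reading. The sketch below is only an assumption about how that could look; it reads a two-row CSV from an in-memory buffer rather than the real file.

```python
import io
import pandas as pd

# Two rows copied from the head of the dataset, held in memory for the example.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2000-09-18,1485.25,1489.75,1462.25,1467.5,1467.5,104794.0
2000-09-19,1467.00,1482.75,1466.75,1478.5,1478.5,103371.0
"""

# index_col selects the Date column; parse_dates=True converts the index to datetimes.
frame = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
print(type(frame.index).__name__)
print(frame.loc[pd.Timestamp("2000-09-19"), "Close"])
```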


# Splitting the Data

We have about 20 years' worth of data (2000-09-18 to 2020-11-19). We will hold out the last year (5 × 52 = 260 trading days) as the test set and train on everything before it.
We will treat the data as five-day trading weeks and use it to predict the daily closing price (the "Close" column, index 3) for each day of a week. We will also remove the "Adj Close" column, as it is the same as the "Close" column.

In [98]:
dowjones_stocks_cleaned_copy = dowjones_stocks_cleaned.copy()
del dowjones_stocks_cleaned_copy["Adj Close"]
dowjones_stocks_cleaned_copy

               Open     High      Low    Close     Volume
Date
2000-09-18  1485.25  1489.75  1462.25  1467.50   104794.0
2000-09-19  1467.00  1482.75  1466.75  1478.50   103371.0
2000-09-20  1478.75  1480.50  1450.25  1469.50   109667.0
2000-09-21  1470.25  1474.00  1455.50  1469.50    98528.0
2000-09-22  1454.75  1471.00  1436.75  1468.50    97416.0
...             ...      ...      ...      ...        ...
2020-11-16  3587.00  3637.00  3586.50  3623.00  1303941.0
2020-11-17  3625.50  3630.00  3584.25  3606.75  1268206.0
2020-11-18  3604.50  3623.25  3556.50  3565.00  1325309.0
2020-11-19  3562.00  3582.75  3542.25  3580.00  1291117.0


In [102]:
def split(dataset):
    """Split the DataFrame into train/test arrays, holding out the last year for testing."""
    values = dataset.values  # already a NumPy array, so no further conversion is needed
    length = len(values)
    print("Length=", length)
    total_weeks = length // 5  # five trading days per week
    print("Total Weeks=", total_weeks)

    # Use the last year as the test set and everything before it as the train set.
    WEEKS_IN_ONE_YEAR = 52
    ONE_YEAR_WORK_DAYS = 5 * WEEKS_IN_ONE_YEAR

    train = values[:-ONE_YEAR_WORK_DAYS]
    test = values[-ONE_YEAR_WORK_DAYS:]

    print("-----TRAIN DATA-----")
    print("Length=", len(train))
    print("-----TEST DATA-----")
    print("Length=", len(test))
    print(train[0])
    print(test[0])
    return train, test

train_data, test_data = split(dowjones_stocks_cleaned_copy)

Length= 5131
Total Weeks= 1026
-----TRAIN DATA-----
Length= 4871
-----TEST DATA-----
Length= 260
[  1485.25   1489.75   1462.25   1467.5  104794.  ]
[   3237.      3261.75    3234.25    3259.   1416241.  ]
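The hold-out above is purely positional: the final 5 × 52 = 260 rows become the test set and everything earlier is training data, with no shuffling. A minimal sketch on a stand-in array makes the arithmetic explicit. The helper name `tail_split` and the stand-in values are hypothetical and not part of the notebook.

```python
import numpy as np

def tail_split(values, holdout_rows):
    """Split a time-ordered array into (train, test), keeping the last rows for test."""
    if not 0 < holdout_rows < len(values):
        raise ValueError("holdout_rows must be between 1 and len(values) - 1")
    return values[:-holdout_rows], values[-holdout_rows:]

# Stand-in for the 5131 x 5 price array: every entry of row i holds the value i.
stand_in = np.arange(5131 * 5).reshape(5131, 5) // 5
train_part, test_part = tail_split(stand_in, 5 * 52)

print(train_part.shape, test_part.shape)
# Every training row precedes every test row, so no future data leaks into training.
print(train_part[-1, 0] < test_part[0, 0])
```

With 5131 rows this gives 4871 training rows and 260 test rows, the same lengths the cell above reports.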


### What to predict?
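Use the available data (Open, High, Low, Close, Volume) to predict the Close for each day of the next week. The "Adj Close" column was removed above, and the code below selects column index 3, which is Close.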

In [88]:
COLUMN_TO_PREDICT = 3  # index of the "Close" column (closing price for the day)
NUMBER_OF_DAYS_DATA_TO_USE = 15
NUMBER_OF_COLUMNS = train_data.shape[1]
NUMBER_OF_DAYS_DATA_TO_PREDICT = 5

# Split the given data into inputs and outputs using a sliding window.
# Each input holds `steps` consecutive days (all columns); each output holds
# `column_with_result` for the following `num_of_days_to_predict` days.
# The window sizes depend on how we want to model the data, and we will
# experiment with various settings.
# If column_with_result is None, the intent is to build data for validation.
def convert_data_into_io(input_data, steps, column_with_result, num_of_days_to_predict):
    print(column_with_result)
    x, y = [], []

    for i in range(len(input_data)):
        end = i + steps
        # Stop once the output window would run past the end of the data.
        if end + num_of_days_to_predict > len(input_data):
            break

        x.append(input_data[i:end])
        y.append(input_data[end:end + num_of_days_to_predict, column_with_result])

    return np.array(x), np.array(y)
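To make the windowing concrete, here is a self-contained sketch of the same idea on a tiny made-up array, using 3 input days and 2 predicted days instead of 15 and 5. The function name `make_windows` and the toy data are assumptions for illustration only. With n rows, `steps` input days and `horizon` output days, the number of samples is n - steps - horizon + 1.

```python
import numpy as np

def make_windows(data, steps, target_col, horizon):
    """Return (x, y): x[i] is `steps` rows of all columns, y[i] the next `horizon` targets."""
    xs, ys = [], []
    for start in range(len(data) - steps - horizon + 1):
        end = start + steps
        xs.append(data[start:end])                      # input window, all columns
        ys.append(data[end:end + horizon, target_col])  # following days, target column only
    return np.array(xs), np.array(ys)

# Toy data: 10 "days", column 0 = day number, column 1 = day number * 10.
toy = np.column_stack([np.arange(10), np.arange(10) * 10.0])
x, y = make_windows(toy, steps=3, target_col=1, horizon=2)

print(x.shape, y.shape)
print(y[0])
```

The same formula applied to the training set gives 4871 - 15 - 5 + 1 = 4852 samples, which matches the "Total Train Data" figure printed by the notebook.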

In [89]:
train_x,train_y = convert_data_into_io(train_data,NUMBER_OF_DAYS_DATA_TO_USE,COLUMN_TO_PREDICT,NUMBER_OF_DAYS_DATA_TO_PREDICT)
print("------------------------")
print("Total Train Data:", len(train_x))
print("Sample Train Input Data:")
print(train_x[0])
print("Sample Train Output Data:")
print(train_y[0])
print("------------------------")

test_x,test_y = convert_data_into_io(test_data,NUMBER_OF_DAYS_DATA_TO_USE,COLUMN_TO_PREDICT,NUMBER_OF_DAYS_DATA_TO_PREDICT)
print("------------------------")
print("Total Test Data:", len(test_x))
print("Sample Test Input Data:")
print(test_x[0])
print("Sample Test Output Data:")
print(test_y[0])
print("-------------------------")

3
------------------------
Total Train Data: 4852
Sample Train Input Data:
[[  1485.25   1489.75   1462.25   1467.5  104794.  ]
 [  1467.     1482.75   1466.75   1478.5  103371.  ]
 [  1478.75   1480.5    1450.25   1469.5  109667.  ]
 [  1470.25   1474.     1455.5    1469.5   98528.  ]
 [  1454.75   1471.     1436.75   1468.5   97416.  ]
 [  1469.5    1477.75   1455.5    1461.    85491.  ]
 [  1461.     1467.     1442.5    1443.    99803.  ]
 [  1444.     1456.     1438.25   1446.75 101996.  ]
 [  1447.75   1481.     1445.     1476.    84280.  ]
 [  1473.     1473.25   1454.     1454.    78277.  ]
 [  1453.75   1464.25   1447.5    1456.25  84100.  ]
 [  1457.25   1474.     1438.75   1441.5   89440.  ]
 [  1442.     1457.25   1432.5    1450.25 101607.  ]
 [  1449.5    1462.     1447.25   1456.    92232.  ]
 [  1456.     1460.5    1411.5    1426.25  95257.  ]]
Sample Train Output Data:
[1416.5  1391.   1378.5  1344.   1386.25]
------------------------
3
------------------------
Total Tes

In [91]:
test_x

array([[[   3237.  ,    3261.75,    3234.25,    3259.  , 1416241.  ],
        [   3261.  ,    3263.5 ,    3206.75,    3235.5 , 1755057.  ],
        [   3220.25,    3223.75,    3216.25,    3222.  , 1002909.  ],
        ...,
        [   3294.25,    3318.  ,    3294.  ,    3316.5 , 1335246.  ],
        [   3316.75,    3330.25,    3316.  ,    3325.  , 1312253.  ],
        [   3325.  ,    3327.75,    3323.25,    3327.75,  340135.  ]],

       [[   3261.  ,    3263.5 ,    3206.75,    3235.5 , 1755057.  ],
        [   3220.25,    3223.75,    3216.25,    3222.  , 1002909.  ],
        [   3220.25,    3249.5 ,    3208.75,    3243.5 , 1502748.  ],
        ...,
        [   3316.75,    3330.25,    3316.  ,    3325.  , 1312253.  ],
        [   3325.  ,    3327.75,    3323.25,    3327.75,  340135.  ],
        [   3325.  ,    3329.75,    3307.25,    3319.5 , 1573187.  ]],

       [[   3220.25,    3223.75,    3216.25,    3222.  , 1002909.  ],
        [   3220.25,    3249.5 ,    3208.75,    3243.5 , 150

In [90]:
test_y

array([[3319.5 , 3319.75, 3326.  , 3293.5 , 3258.5 ],
       [3319.75, 3326.  , 3293.5 , 3258.5 , 3239.5 ],
       [3326.  , 3293.5 , 3258.5 , 3239.5 , 3278.25],
       ...,
       [3532.5 , 3582.  , 3623.  , 3606.75, 3565.  ],
       [3582.  , 3623.  , 3606.75, 3565.  , 3580.  ],
       [3623.  , 3606.75, 3565.  , 3580.  , 3554.25]])
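One quick sanity check on windows like these: because consecutive samples are shifted by one day, the first target of sample i should equal the target-column value in the last input row of sample i + 1. The printed arrays above are consistent with this (3319.5 is both `test_y[0][0]` and the Close in the last row of `test_x[1]`). The sketch below runs the check on random stand-in data; the variable names and the 60-row array are hypothetical.

```python
import numpy as np

STEPS, HORIZON, TARGET_COL = 15, 5, 3

rng = np.random.default_rng(0)
stand_in = rng.random((60, 5))  # stand-in for test_data: rows are days, 5 columns

n_samples = len(stand_in) - STEPS - HORIZON + 1
x = np.array([stand_in[i:i + STEPS] for i in range(n_samples)])
y = np.array([stand_in[i + STEPS:i + STEPS + HORIZON, TARGET_COL]
              for i in range(n_samples)])

# First target of sample i == target column of the last input row of sample i + 1.
aligned = np.array_equal(y[:-1, 0], x[1:, -1, TARGET_COL])
print(x.shape, y.shape, aligned)
```

If `aligned` ever comes out false on real data, the input and output windows are offset from each other and the targets do not directly follow the inputs.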