# Jupyter Notebook for the Stock Prediction Data Analysis (SPDA)

## Data Source: https://www.kaggle.com/qks1lver/amex-nyse-nasdaq-stock-histories

# Development stages

### 1. [Dataset formatting and feature extraction](#dataset_formatting_and_extraction)
### 2. [AutoML development](#automl_development)
### 3. [Manual model development](#manual_model_development)
### 4. [Model Training and Testing](#model_training_and_testing)

---
---

# Dataset formatting and feature extraction<a id='dataset_formatting_and_extraction'></a>

## a. Set up the path to the data file used for model training and testing

In [3]:
# System library for path management
from os import path

# Set the paths for the training and test data files
user_home_dir = str(path.expanduser('~'))
print('Home directory for the current user : ', user_home_dir)

Home directory for the current user :  C:\Users\Alpha


In [6]:
# Add path to the sample data file for training and testing models
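# Pass each path component separately so that no backslash is read as an
# escape sequence and the path stays portable across operating systems
sample_file_path = path.join(user_home_dir, 'Desktop', 'MLH-2018',
                             'amex-nyse-nasdaq-stock-histories',
                             'subset_data', 'AAL.csv')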
print('Path to the data file currently being used : ', sample_file_path)

Path to the data file currently being used :  C:\Users\Alpha\Desktop\MLH-2018\amex-nyse-nasdaq-stock-histories\subset_data\AAL.csv


## b. Read the data file into a dataframe to be pre-processed

In [9]:
# Import the system built-in modules needed for feature extraction
import os
import time
from datetime import datetime

In [7]:
# Import the essential data processing libraries
import pandas as pd
import numpy as np

In [10]:
# Import visualization libraries for plotting and visualizing the dataset 
# vectors
import matplotlib.pyplot as plt

In [11]:
%matplotlib inline

In [12]:
# Read in the dataset from the csv file, and convert to a Pandas dataframe
sample_dataframe = pd.read_csv(sample_file_path, engine='python', encoding='utf-8-sig')
display(sample_dataframe)

Unnamed: 0,date,volume,open,close,high,low,adjclose
0,2018-11-16,9832851,37.400002,36.750000,37.529999,36.500000,36.750000
1,2018-11-15,8296700,37.869999,37.820000,38.160000,36.310001,37.820000
2,2018-11-14,7288400,38.000000,38.110001,38.590000,37.450001,38.110001
3,2018-11-13,9694100,37.150002,37.779999,38.419998,37.099998,37.779999
4,2018-11-12,9360800,36.310001,36.860001,37.299999,35.779999,36.860001
5,2018-11-09,6794400,36.700001,36.220001,37.259998,36.029999,36.220001
6,2018-11-08,6884800,36.770000,36.860001,37.049999,35.970001,36.860001
7,2018-11-07,10904000,35.570000,36.970001,37.389999,35.480000,36.970001
8,2018-11-06,11376700,35.599998,35.169998,35.959999,34.840000,35.169998
9,2018-11-05,11305300,36.349998,35.720001,36.520000,35.130001,35.720001


In [16]:
# Convert the date format in the dataframe into POSIX Timestamps

default_timestamps = sample_dataframe['date'].values
show_values = 5

# Initialize the list for storing POSIX timestamps
posix_timestamps = []

# Convert each date string into a POSIX timestamp (seconds since the epoch).
# Note: time.mktime interprets the date as midnight in the local time zone.
for timestamp_logged in default_timestamps:

    # Parse the logged date string into a datetime object
    parsed_datetime = datetime.strptime(timestamp_logged, '%Y-%m-%d')

    # Convert the parsed datetime to POSIX and add it to the list
    posix_timestamps.append(time.mktime(parsed_datetime.timetuple()))

# Add the list to the dataframe
sample_dataframe['Timestamp'] = posix_timestamps

# Set the POSIX timestamp column to be the index of the dataframe
sample_dataframe.set_index('Timestamp', inplace=True)

# Sort the dataframe by its POSIX timestamp index, oldest first
sample_dataframe.sort_index(inplace=True)

# Give a preview of the re-indexed dataframe
print('Showing the first %d values from the dataframe.' % show_values)
sample_dataframe.head(show_values)

Showing the first 5 values from the dataframe.


Timestamp,date,volume,open,close,high,low,adjclose
1127797000.0,2005-09-27,961200,21.049999,19.299999,21.4,19.1,18.489122
1127884000.0,2005-09-28,5747900,19.299999,20.5,20.530001,19.200001,19.638702
1127970000.0,2005-09-29,1078200,20.4,20.209999,20.58,20.1,19.360884
1128056000.0,2005-09-30,3123300,20.26,21.01,21.049999,20.18,20.127277
1128316000.0,2005-10-03,1057900,20.9,21.5,21.75,20.9,20.596684
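
The same conversion can also be done without an explicit loop. The sketch below is one possible vectorized variant, not the notebook's own method: `demo_dataframe` is a hypothetical three-row stand-in for the real data, and the dates are assumed to mean midnight UTC. Because `time.mktime` uses the local time zone, the values produced here differ from the loop's output by the local UTC offset.

```python
import pandas as pd

# Hypothetical three-row stand-in for the 'date' and 'close' columns
demo_dataframe = pd.DataFrame({
    'date': ['2005-09-29', '2005-09-27', '2005-09-28'],
    'close': [20.209999, 19.299999, 20.5],
})

# Parse every date string in a single vectorized call
parsed_dates = pd.to_datetime(demo_dataframe['date'], format='%Y-%m-%d')

# Whole seconds since 1970-01-01 00:00:00
# (assumption: each date is treated as midnight UTC, unlike time.mktime)
epoch_start = pd.Timestamp('1970-01-01')
demo_dataframe['Timestamp'] = (parsed_dates - epoch_start) // pd.Timedelta(seconds=1)

# Index by timestamp and sort oldest first
demo_dataframe = demo_dataframe.set_index('Timestamp').sort_index()
print(demo_dataframe)
```

Integer timestamps also avoid the rounded float display seen in the preview above.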


## Now that the dataframe is constructed, we can start the feature extraction and normalization
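
The notebook does not say which normalization it intends, so the following is only a sketch under the assumption of min-max scaling. The helper `min_max_scale` and the price values are hypothetical. The scaling range is taken from the training portion alone, so that no information from the test portion leaks into the features.

```python
import numpy as np

def min_max_scale(train_values, other_values):
    """Scale both arrays with the minimum and maximum of the training data."""
    train_values = np.asarray(train_values, dtype=float)
    other_values = np.asarray(other_values, dtype=float)
    low, high = train_values.min(), train_values.max()
    # Guard against a constant training series to avoid division by zero
    span = high - low if high > low else 1.0
    return (train_values - low) / span, (other_values - low) / span

# Hypothetical prices: three training values and one held-out value
train_scaled, test_scaled = min_max_scale([20.0, 21.0, 22.0], [23.0])
print(train_scaled)
print(test_scaled)
```

Values outside the training range, such as the held-out price here, map outside the interval from 0 to 1.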

In [None]:
# Approach 1 : Using the data from previous 7 days to predict the next day
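
The cell above is still empty, so here is a minimal sketch of how such a 7-day window could be built. It assumes the closing price is both the input feature and the prediction target, which the notebook does not confirm, and `make_windows` is a hypothetical helper. The sample values are the ten closing prices from the first preview table, reordered oldest first.

```python
import numpy as np

def make_windows(values, window_size=7):
    """Pair each run of `window_size` consecutive values with the value that follows it."""
    values = np.asarray(values, dtype=float)
    n_samples = len(values) - window_size
    if n_samples <= 0:
        raise ValueError('Need more than window_size values to build a sample.')
    # Row i holds days i .. i + window_size - 1
    features = np.stack([values[i:i + window_size] for i in range(n_samples)])
    # The target for row i is the value on day i + window_size
    targets = values[window_size:]
    return features, targets

# Closing prices from 2018-11-05 to 2018-11-16, oldest first
close_prices = [35.720001, 35.169998, 36.970001, 36.860001, 36.220001,
                36.860001, 37.779999, 38.110001, 37.820000, 36.750000]

features, targets = make_windows(close_prices, window_size=7)
print(features.shape)
print(targets)
```

Ten prices yield three samples, since each sample needs seven input days plus one target day.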


# AutoML model development<a id='automl_development'></a>

## Model Development using AutoKeras

In [None]:
# Import the AutoML libraries
import autokeras as ak