
# Data Preparation

This notebook covers the initial stages of the data pipeline for our machine learning project on stock price prediction using LSTM and Transformer models. It includes the setup for data extraction, transformation and loading.


In [12]:
import pandas as pd
import numpy as np
import yfinance as yf
import pickle as pkl


## ETL Class - Extract, Transform, and Load


The `ETL` class manages the data lifecycle for our stock price prediction models. It automates every step from data extraction through preparation for modeling, so its output can be fed directly to a model. Its key functionalities are outlined below:

### Key Functionalities

- **Data Extraction**: Retrieves historical stock price data from the Yahoo Finance API, focusing on the closing prices for the specified stock ticker.

- **Data Transformation**: Reshapes the price series into non-overlapping windows of `timestep` consecutive observations (5 by default, roughly one trading week), giving an array of shape `[# windows, timestep, 1]`. Observations that do not fill a complete window are dropped from the start of each set.

- **Data Loading**: Splits the data chronologically into training and testing datasets, with the most recent `test_size` fraction held out for testing. The training set is used to fit our models, while the test set is used to evaluate the accuracy of their predictions.

- **Supervised Learning Conversion**: Transforms the time series into input/output pairs suitable for supervised learning: each run of `n_input` consecutive prices becomes an input, and the `n_out` prices that follow it become the target.
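
To make the supervised learning conversion concrete, here is a minimal sketch on a toy series. The helper name `sliding_windows` and the toy data are illustrative assumptions, not part of the `ETL` class; the logic mirrors the sliding-window idea described above and needs no network access.

```python
import numpy as np

def sliding_windows(series, n_input=5, n_out=5):
    """Pair each run of n_input values with the n_out values that follow it.

    Hypothetical helper for illustration only.
    """
    series = np.asarray(series, dtype=float)
    X, y = [], []
    # Slide forward one step at a time while a full input/output pair fits.
    for start in range(len(series) - n_input - n_out + 1):
        in_end = start + n_input
        X.append(series[start:in_end].reshape(n_input, 1))
        y.append(series[in_end:in_end + n_out])
    return np.array(X), np.array(y)

# Stand-in for closing prices: 0.0, 1.0, ..., 11.0
toy_prices = np.arange(12, dtype=float)
X, y = sliding_windows(toy_prices, n_input=5, n_out=5)
print(X.shape, y.shape)  # (3, 5, 1) (3, 5)
```

With 12 observations and 5 + 5 values needed per pair, only three start positions fit, so the first input is the values 0 to 4 and its target is the values 5 to 9.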

### Implementation

The class is implemented with methods that handle each stage of the data handling process:

- **Initialization**: Sets up the class with the stock ticker, the download `period`, the size of the test dataset (`test_size`), and the segmentation parameters `n_input` (input sequence length) and `timestep` (window length).

- **Data Splitting**: Divides the data into training and testing sets based on a specified proportion.

- **Reshaping and Windowing**: Adjusts the structure of the data arrays to meet the input requirements of predictive models, organizing the data into meaningful segments.

- **Complete ETL Execution**: Runs all steps from data extraction to loading in sequence, delivering data that's ready for modeling.
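
The splitting and windowing steps can be sketched on a toy series as follows. This is a simplified stand-in under stated assumptions: `split_and_window` is a hypothetical helper, and the 23-point series replaces real price data.

```python
import numpy as np

def split_and_window(series, test_size=0.2, timestep=5):
    """Chronological split followed by non-overlapping windowing.

    Hypothetical helper for illustration only.
    """
    series = np.asarray(series, dtype=float)
    # The earliest (1 - test_size) share of the data becomes the training set.
    train_idx = round(len(series) * (1 - test_size))
    train, test = series[:train_idx], series[train_idx:]

    def window(part):
        # Drop the oldest observations so the length divides evenly.
        remainder = len(part) % timestep
        part = part[remainder:]
        return part.reshape(-1, timestep, 1)

    return window(train), window(test)

toy = np.arange(23, dtype=float)
train, test = split_and_window(toy)
print(train.shape, test.shape)  # (3, 5, 1) (1, 5, 1)
```

Here 18 of the 23 points go to training; the three oldest are dropped so that the remaining 15 form three full windows, and the last 5 points form a single test window.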

### Benefits

This class wraps download, splitting, windowing, and supervised conversion behind a single constructor call, so downstream model code can start from ready-made `X_train`, `y_train`, `X_test`, and `y_test` arrays.


In [16]:
class ETL:
    """
    ticker: str
    period: str
    test_size: float between 0 and 1
    n_input: int
    timestep: int
    Extracts data for stock with ticker `ticker` from the yf API,
    splits the data chronologically into train and test sets,
    reshapes the data into np.array of shape [# windows, timestep, 1]
    (with the default timestep=5, one window per trading week),
    converts our problem into a supervised learning problem.
    """
    def __init__(self, ticker, test_size=0.2, period='max', n_input=5, timestep=5) -> None:
        self.ticker = ticker
        self.period = period
        self.test_size = test_size
        self.n_input = n_input
        self.df = self.extract_historic_data()
        self.timestep = timestep
        self.train, self.test = self.etl()
        self.X_train, self.y_train = self.to_supervised(self.train)
        self.X_test, self.y_test = self.to_supervised(self.test)

    def extract_historic_data(self) -> pd.Series:
        """
        Gets historical data from yf api.
        """
        t = yf.Ticker(self.ticker)
        history = t.history(period=self.period)
        return history.Close

    def split_data(self) -> tuple:
        """
        Splits our pd.Series into train and test series with
        test series representing test_size * 100 % of data.
        """
        # Reuse the series downloaded in __init__ instead of calling the API again.
        data = self.df
        if len(data) != 0:
            train_idx = round(len(data) * (1-self.test_size))
            train = data[:train_idx]
            test = data[train_idx:]
            train = np.array(train)
            test = np.array(test)
            # Add a trailing feature axis: shape [# observations, 1]
            return train[:, np.newaxis], test[:, np.newaxis]
        else:
            raise ValueError('Data set is empty, cannot split.')

    def window_and_reshape(self, data) -> np.ndarray:
        """
        Reformats data into the shape our model needs,
        namely [# samples, timestep, # features].
        Expects len(data) to be a multiple of timestep.
        """
        NUM_FEATURES = 1
        samples = data.shape[0] // self.timestep
        result = np.array(np.array_split(data, samples))
        return result.reshape((samples, self.timestep, NUM_FEATURES))

    def transform(self, train, test) -> tuple:
        """
        Drops the oldest observations so that each set's length is a
        multiple of timestep, then windows both sets.
        """
        # A remainder of 0 leaves the array unchanged.
        train = train[train.shape[0] % self.timestep:]
        test = test[test.shape[0] % self.timestep:]
        return self.window_and_reshape(train), self.window_and_reshape(test)

    def etl(self) -> 'tuple[np.array, np.array]':
        """
        Runs complete ETL
        """
        train, test = self.split_data()
        return self.transform(train, test)
    

    def to_supervised(self, train, n_out=5) -> tuple:
        """
        Converts our time series prediction problem to a
        supervised learning problem.
        """
        # flatten the windows back into one continuous series
        data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
        X, y = [], []
        in_start = 0
        # step over the entire history one time step at a time
        for _ in range(len(data)):
            # define the end of the input sequence
            in_end = in_start + self.n_input
            out_end = in_end + n_out
            # ensure we have enough data for this instance
            if out_end <= len(data):
                x_input = data[in_start:in_end, 0]
                x_input = x_input.reshape((len(x_input), 1))
                X.append(x_input)
                y.append(data[in_end:out_end, 0])
                # move along one time step
                in_start += 1
        return np.array(X), np.array(y)
    
        
    def save_data(self, folder="datasets"):
        """
        Saves the raw series and the prepared arrays to pickle files,
        one file per attribute (e.g. X_train.pkl).
        """
        import os  # local import keeps this method self-contained
        os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
        names = ["df", "train", "test", "X_train", "y_train", "X_test", "y_test"]
        for name in names:
            with open(os.path.join(folder, name + ".pkl"), "wb") as f:
                pkl.dump(getattr(self, name), f)


## Data Loading and Preprocessing

Here we load and preprocess our data using the `ETL` class defined above.


In [14]:
data = ETL('AAPL')

# store data
data.save_data('datasets')
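
A later notebook can read the saved arrays back with `pickle`. The sketch below is an assumption about how that might look: `load_arrays` is a hypothetical helper, and the demo writes stand-in arrays to a temporary folder so it runs without the real `datasets` directory. The file names follow the `<name>.pkl` pattern used by `save_data`.

```python
import pickle as pkl
import tempfile
from pathlib import Path

import numpy as np

def load_arrays(folder, names=("X_train", "y_train", "X_test", "y_test")):
    """Load the named pickle files from folder into a dict (hypothetical helper)."""
    loaded = {}
    for name in names:
        with open(Path(folder) / f"{name}.pkl", "rb") as f:
            loaded[name] = pkl.load(f)
    return loaded

# Demo with stand-in arrays written to a temporary folder.
stand_in = {
    "X_train": np.zeros((3, 5, 1)),
    "y_train": np.zeros((3, 5)),
    "X_test": np.zeros((1, 5, 1)),
    "y_test": np.zeros((1, 5)),
}
with tempfile.TemporaryDirectory() as tmp:
    for name, arr in stand_in.items():
        with open(Path(tmp) / f"{name}.pkl", "wb") as f:
            pkl.dump(arr, f)
    arrays = load_arrays(tmp)

print(arrays["X_train"].shape)  # (3, 5, 1)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.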