
In [None]:
!pip install fast_ml
!pip install talos
!pip install kats
!pip install scipy

In [2]:
# Seed value used for PYTHONHASHSEED
# Note: the NumPy, Python, and TensorFlow generators below are seeded with their own values
seed_value = 0

# 1. Set the `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED'] = str(seed_value)

import numpy as np
import tensorflow as tf
import random as python_random

# The below is necessary for starting Numpy generated random numbers
# in a well-defined initial state.
np.random.seed(123)

# The below is necessary for starting core Python generated random numbers
# in a well-defined state.
python_random.seed(123)

# The below set_seed() will make random number generation
# in the TensorFlow backend have a well-defined initial state.
# For further details, see:
# https://www.tensorflow.org/api_docs/python/tf/random/set_seed
tf.random.set_seed(1234)

import seaborn as sns
import pandas as pd
import talos as ta
from matplotlib import pyplot as plt
from fast_ml.model_development import train_valid_test_split
from tensorflow import keras
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, max_error, mean_absolute_error
from tensorflow.keras import Sequential, layers, callbacks
from tensorflow.keras.layers import Input, Dense, LSTM, Dropout, GRU, Bidirectional, SimpleRNN, Conv1D, MaxPooling1D, Flatten


## 1. Data Acquisition

In [3]:
# Read Csv
file = r"/content/PG.csv"
df = pd.read_csv(file, parse_dates=['Date'], index_col='Date')
# Note: matplotlib >= 3.6 renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')

## 2. Data Visualization...

## 3. Data Preprocessing

#### 3.1 Data Cleaning

In [4]:
# Check missing values
df.isnull().sum()

Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64

There are no missing values.

In [5]:
# Replace missing values by interpolation
def replace_missing(attribute):
    # interpolate() returns a new Series; with inplace=True it would return None
    return attribute.interpolate()

# df['Open'] = replace_missing(df['Open'])
# ....
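
As a minimal sketch of what the interpolation helper would do if gaps were present, the toy series below (hypothetical values, not taken from PG.csv) has two missing entries that are filled by linear interpolation, which is the default method of `Series.interpolate`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices with two gaps (not taken from PG.csv)
toy = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])

# interpolate() returns a new Series; the default method is linear
filled = toy.interpolate()

print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

Each gap is replaced by the midpoint of its two neighbours, so no missing values remain.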

In [6]:
# Detect outliers with the IQR method (removal is left commented out)
def detect_remove_outliers(df, column):
    # IQR (the `method` keyword requires NumPy >= 1.22; older versions call it `interpolation`)
    Q1 = np.percentile(df[column], 25, method='midpoint')
    Q3 = np.percentile(df[column], 75, method='midpoint')
    IQR = Q3 - Q1

    # Boolean mask of values at or above the upper bound
    upper = df[column] >= (Q3 + 1.5 * IQR)
    print("Upper bound outliers:", column, np.where(upper))

    # Boolean mask of values at or below the lower bound
    lower = df[column] <= (Q1 - 1.5 * IQR)
    print("Lower bound:", column, np.where(lower))

    # Removing the outliers (disabled): filter with the masks rather than
    # df.drop, which expects index labels
    # df = df[~(upper | lower)]
    # print("New Shape: ", df.shape)

# There may be potential outliers in the Volume column, but they are not treated as outliers because
# a large volume of transactions is related to a change in the closing price.
# For the other columns, the boxplots checked earlier showed no outliers.
# The IQR method is applied below as a numerical check.

titles = ["Open", "High", "Low", "Close", "Adj Close", "Volume"]

for title in titles:
    detect_remove_outliers(df, title)

Upper bound outliers: Open (array([13100, 13101, 13102, 13103, 13104, 13105, 13106, 13107, 13108,
       13109, 13110, 13111, 13112, 13113, 13114, 13115, 13116, 13117,
       13118, 13119, 13120, 13121, 13122, 13123, 13124, 13125, 13126,
       13127, 13128, 13129, 13130, 13131, 13132, 13133, 13134, 13135,
       13136, 13137, 13138, 13139, 13140, 13141, 13142, 13143, 13144]),)
Lower bound: Open (array([], dtype=int64),)
Upper bound outliers: High (array([13101, 13102, 13103, 13104, 13105, 13106, 13107, 13108, 13109,
       13110, 13111, 13112, 13113, 13114, 13115, 13116, 13117, 13118,
       13119, 13120, 13121, 13122, 13123, 13124, 13125, 13126, 13127,
       13128, 13129, 13130, 13131, 13132, 13133, 13134, 13135, 13136,
       13137, 13138, 13139, 13140, 13141, 13142, 13143, 13144]),)
Lower bound: High (array([], dtype=int64),)
Upper bound outliers: Low (array([13100, 13102, 13103, 13104, 13105, 13106, 13107, 13108, 13109,
       13110, 13111, 13112, 13113, 13114, 13115, 13116, 1311

*   These do not appear to be true outliers: the flagged indices cluster at the very end of the series, so they are part of the trend of the time series.
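
To illustrate why the IQR rule flags the most recent rows of a trending series, here is a small sketch on synthetic data. The exponential growth curve and the helper name `iqr_bounds` are assumptions made for illustration only; they are not part of the notebook or of PG.csv.

```python
import numpy as np

def iqr_bounds(values):
    """Return the (lower, upper) Tukey fences for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical price path that grows exponentially over time
series = np.exp(np.linspace(0.0, 5.0, 1000))

lower, upper = iqr_bounds(series)
flagged = np.where((series <= lower) | (series >= upper))[0]

# The flagged positions form one block at the end of the series
print(flagged.min(), flagged.max(), len(series) - 1)
```

Because the series keeps rising, the latest values exceed the upper fence simply by following the trend, which matches the pattern seen in the output above.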



**Remove Volume feature**


The "Volume" column is removed from the dataframe. This feature was observed to be only weakly related to the target. Furthermore, the purpose of the algorithm is to predict the future (out-of-sample) trend of the share price, and because the analysis is multivariate, the values of the other features must be predicted as well. Trading volume is a measure of how much of a given financial asset has been traded in a period of time:

*   Volume measures the number of shares traded in a stock or contracts traded in futures or options.
*   Volume can indicate market strength, as rising markets on increasing volume are typically viewed as strong and healthy.
*   When prices fall on increasing volume, the trend is gathering strength to the downside.

Volume is therefore more difficult to predict and could harm the model.
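
One way such a relationship could be checked is with a correlation table. The sketch below uses synthetic data (a random-walk price and an independent volume column, both assumptions made purely for illustration) and does not reproduce the values from PG.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical trending close price (random walk with drift)
close = np.cumsum(rng.normal(0.1, 1.0, n)) + 100.0

toy = pd.DataFrame({
    "Open": close + rng.normal(0.0, 0.5, n),   # tracks Close closely
    "Close": close,
    "Volume": rng.lognormal(10.0, 0.5, n),     # generated independently of Close
})

# Pearson correlation of every other column with the target
corr_with_close = toy.corr()["Close"].drop("Close")
print(corr_with_close)
```

On the real dataframe the same call, `df.corr()["Close"]`, would show how strongly each column moves with the closing price.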

In [7]:
df = df.drop(['Volume'], axis=1)

#### 3.2 Data Splitting


A manual split is used. The goal of the algorithm is to predict the future trend of the share price and, because the analysis is multivariate, the model predicts not only the closing price but also the other features. Consequently, all features except Volume are used in both the training phase and the test phase, so that the algorithm learns to predict every feature in addition to the closing price. The closing price is also used as an input, since the previous day's value can help predict the next one, as is usual in a multivariate analysis.

In [13]:
# Split the data 80:10:10 into train:valid:test sets using a manual, chronological split

train_size = 0.8
valid_size = 0.1

train_index = int(len(df) * train_size)

# Sort chronologically before slicing ('Date' is the index)
df.sort_index(ascending=True, inplace=True)

df_train = df[0:train_index]
df_rem = df[train_index:]

valid_index = int(len(df) * valid_size)

df_valid = df[train_index:train_index + valid_index]
df_test = df[train_index + valid_index:]
test_index = df_test.shape[0]

X_train, y_train = df_train[['Open', 'High', 'Low', 'Close', 'Adj Close']], \
                   df_train[['Open', 'High', 'Low', 'Close', 'Adj Close']]

X_valid, y_valid = df_valid[['Open', 'High', 'Low', 'Close', 'Adj Close']], \
                   df_valid[['Open', 'High', 'Low', 'Close', 'Adj Close']]

X_test, y_test = df_test[['Open', 'High', 'Low', 'Close', 'Adj Close']], \
                 df_test[['Open', 'High', 'Low', 'Close', 'Adj Close']]

print('X_train.shape:', X_train.shape, 'y_train.shape:', y_train.shape)
print('X_valid.shape:', X_valid.shape, 'y_valid.shape:', y_valid.shape)
print('X_test.shape:', X_test.shape, 'y_test.shape:', y_test.shape)

X_train.shape: (10516, 5) y_train.shape: (10516, 5)
X_valid.shape: (1314, 5) y_valid.shape: (1314, 5)
X_test.shape: (1315, 5) y_test.shape: (1315, 5)
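
The split arithmetic can be checked on a stand-in frame. The sketch below wraps the same slicing logic in a helper called `chronological_split` (a hypothetical name, not part of the notebook) and applies it to a dummy frame with the same number of rows as the dataset, 13145.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_size=0.8, valid_size=0.1):
    """Split a time-indexed frame into train/valid/test without shuffling."""
    frame = frame.sort_index()
    train_end = int(len(frame) * train_size)
    valid_end = train_end + int(len(frame) * valid_size)
    return (frame.iloc[:train_end],
            frame.iloc[train_end:valid_end],
            frame.iloc[valid_end:])

# Hypothetical frame with the same row count as the dataset above
dates = pd.date_range("2000-01-01", periods=13145, freq="D")
toy = pd.DataFrame({"Close": np.arange(13145, dtype=float)}, index=dates)

train, valid, test = chronological_split(toy)
print(len(train), len(valid), len(test))  # 10516 1314 1315
```

The test set receives the remainder after truncation, which is why it has one row more than the validation set. Keeping the rows in time order ensures that validation and test data always come after the training data.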


#### 3.3 Data Transformation

In [15]:
# Normalization
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

'''
# Other transformers
# StandardScaler
st_scaler = StandardScaler()
X_train = st_scaler.fit_transform(X_train)
X_valid = st_scaler.transform(X_valid)
X_test = st_scaler.transform(X_test)

# PowerTransformer
pt = PowerTransformer()
X_train = pt.fit_transform(X_train)
X_valid = pt.transform(X_valid)
X_test = pt.transform(X_test)
'''
# Convert y sets to numpy array
y_train = y_train.to_numpy()
y_valid = y_valid.to_numpy()
y_test = y_test.to_numpy()
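
The scaler is fitted on the training set only and then reused for the other sets. As a rough sketch of what that implies for a rising price series, the toy arrays below (hypothetical values) show that later data can map outside the [0, 1] range learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rising prices: the later segment exceeds the training range
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[30.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel().tolist())  # approximately [0.0, 0.5, 1.0]
print(test_scaled.ravel().tolist())   # approximately [1.0, 1.5]

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(test_scaled)
```

Fitting on the training data alone avoids leaking information from the validation and test periods into the model inputs.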


#### 3.4 Set "***Window size***"

In [16]:
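# Create a 3D input
def create_dataset(X, y, time_steps=1):
    """Build sliding windows: each sample holds `time_steps` consecutive rows
    of X, and its target is the row of y that immediately follows the window."""
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X[i:i + time_steps, :]
        Xs.append(v)
        ys.append(y[i + time_steps])
    return np.array(Xs), np.array(ys)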


TIME_STEPS = 20
X_train, y_train = create_dataset(X_train, y_train, TIME_STEPS)
X_test, y_test = create_dataset(X_test, y_test, TIME_STEPS)
X_valid, y_valid = create_dataset(X_valid, y_valid, TIME_STEPS)

print('All shapes are: (batch, time, features)')
print('X_train.shape:', X_train.shape, 'y_train.shape:', y_train.shape)
print('X_valid.shape:', X_valid.shape, 'y_valid.shape:', y_valid.shape)
print('X_test.shape:', X_test.shape, 'y_test.shape:', y_test.shape)

All shapes are: (batch, time, features)
X_train.shape: (10496, 20, 5) y_train.shape: (10496, 5)
X_valid.shape: (1294, 20, 5) y_valid.shape: (1294, 5)
X_test.shape: (1295, 20, 5) y_test.shape: (1295, 5)
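
To make the windowing concrete, here is the same function applied to a tiny hypothetical array of 6 time steps and 2 features with a window of 3. Each set loses `time_steps` samples, which is why 10516 training rows become 10496 windows above.

```python
import numpy as np

def create_dataset(X, y, time_steps=1):
    """Same windowing logic as the cell above."""
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X[i:i + time_steps, :])
        ys.append(y[i + time_steps])
    return np.array(Xs), np.array(ys)

# Hypothetical data: 6 time steps, 2 features
X = np.arange(12, dtype=float).reshape(6, 2)
y = X.copy()

X_win, y_win = create_dataset(X, y, time_steps=3)

print(X_win.shape, y_win.shape)  # (3, 3, 2) (3, 2)
print(X_win[0].tolist())         # rows 0 to 2: [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
print(y_win[0].tolist())         # row 3: [6.0, 7.0]
```

The first window covers rows 0 to 2 and its target is row 3, so the model always predicts the step that follows the window.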


##### 3.4.1 Save preprocessed dataset

In [17]:
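# The six arrays are written one after another into a single file,
# so they must be read back in the same order
with open('Preprocessed_data.npy', 'wb') as f:
    np.save(f, X_test)
    np.save(f, y_test)
    np.save(f, X_train)
    np.save(f, y_train)
    np.save(f, X_valid)
    np.save(f, y_valid)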

You can find the NumPy file of the preprocessed data in the Data folder of this repository.

Here is the link:
https://github.com/FrancLis/Multivariate-Time-Series-Forecasting/tree/main/Data
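
Since the arrays are saved sequentially into one file, they have to be loaded with repeated `np.load` calls in the same order. The sketch below demonstrates the round trip with small placeholder arrays and a temporary file; the shapes and the file name `demo.npy` are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical small arrays, listed in the order used when saving above
arrays = {
    "X_test": np.zeros((2, 20, 5)),
    "y_test": np.ones((2, 5)),
    "X_train": np.zeros((4, 20, 5)),
    "y_train": np.ones((4, 5)),
    "X_valid": np.zeros((3, 20, 5)),
    "y_valid": np.ones((3, 5)),
}

path = os.path.join(tempfile.mkdtemp(), "demo.npy")
with open(path, "wb") as f:
    for arr in arrays.values():
        np.save(f, arr)

# Each np.load call reads the next array from the open file handle
with open(path, "rb") as f:
    loaded = {name: np.load(f) for name in arrays}

print({name: arr.shape for name, arr in loaded.items()})
```

For the real file, the same pattern applies with `Preprocessed_data.npy` and six `np.load(f)` calls in the order X_test, y_test, X_train, y_train, X_valid, y_valid.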