# Auto ARIMA
Auto ARIMA is an algorithm in Python's pmdarima library for automatically finding the best parameters for an ARIMA (Auto Regressive Integrated Moving Average) model. ARIMA models are used to analyze and forecast time series data by capturing patterns in the data such as trends, seasonality, and autocorrelation.

The Auto ARIMA algorithm uses a stepwise approach to search for the best parameters for the ARIMA model. It starts by trying a set of parameters that are commonly used for time series analysis and then iteratively refines the parameters based on a statistical criterion (such as AIC or BIC) until the best combination of parameters is found.

Here is an example of how to use the pmdarima library to fit an ARIMA model to a time series data and make a forecast.



In [2]:
# To install uncomment the below line:
# !pip install pmdarima

import pmdarima as pm

# load sample dataset
data = pm.datasets.load_wineind()

# find best parameters using auto arima
model = pm.auto_arima(data, seasonal=True, m=12)

# print model summary
print(model.summary())

# make forecast for next 12 months
forecast = model.predict(n_periods=12)

# print forecast values
print(forecast)

                                      SARIMAX Results                                       
Dep. Variable:                                    y   No. Observations:                  176
Model:             SARIMAX(0, 1, 2)x(0, 1, [1], 12)   Log Likelihood               -1528.766
Date:                              Thu, 02 May 2024   AIC                           3065.533
Time:                                      16:20:25   BIC                           3077.908
Sample:                                           0   HQIC                          3070.557
                                              - 176                                         
Covariance Type:                                opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.5756      0.041    -13.952      0.000      -0.656      -0.495
ma.L2         -0.10

In this example, we first load the wineind dataset from the pmdarima library using the load_wineind() function. This dataset contains the monthly wine sales in Australia from January 1980 to October 1995.

We then use the auto_arima function to automatically find the best parameters for the ARIMA model. We set seasonal=True and m=12 to indicate that the data has a yearly seasonal pattern with a period of 12 months.

We then print the summary of the ARIMA model using the summary() method of the ARIMA object.

Finally, we make a forecast for the next 12 months using the predict() method of the ARIMA object and print the forecast values.

Note that since the wineind dataset is a seasonal time series with a yearly pattern, we set seasonal=True and m=12 to indicate this. If the dataset has a different seasonal pattern, the value of m should be set accordingly.

In [3]:
#import packages
import pandas as pd
import numpy as np

#to plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline

#setting figure size
from matplotlib.pylab import rcParams

#for normalizing data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

#read the file
df = pd.read_csv('NSE-TATAGLOBAL11.csv')

#print the head
df.head()

Unnamed: 0,Date,Open,High,Low,Last,Close,Total Trade Quantity,Turnover (Lacs)
0,2018-10-08,208.0,222.25,206.85,216.0,215.15,4642146.0,10062.83
1,2018-10-05,217.0,218.6,205.9,210.25,209.2,3519515.0,7407.06
2,2018-10-04,223.5,227.8,216.15,217.25,218.2,1728786.0,3815.79
3,2018-10-03,230.0,237.5,225.75,226.45,227.6,1708590.0,3960.27
4,2018-10-01,234.55,234.6,221.05,230.3,230.9,1534749.0,3486.05


In [4]:
# To install plotly
# !pip install plotly

import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# load data
df = pd.read_csv('NSE-TATAGLOBAL11.csv')

# convert Date column to datetime format and set as index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# create dataframe with Date and Close price columns
data = df[['Close']].copy()

# split data into train and validation sets
train = data.iloc[:int(0.8*len(data)), :]
valid = data.iloc[int(0.8*len(data)):, :]

# extract features from Date column
data['day'] = data.index.day
data['month'] = data.index.month
data['year'] = data.index.year
data['weekday'] = data.index.weekday

# create separate dataset for linear regression
lr_data = data.copy()

# sort dataset by date
lr_data.sort_index(inplace=True)

# create linear regression model
lr_model = LinearRegression()

# fit model on train data
lr_model.fit(lr_data.iloc[:int(0.8*len(lr_data)), 1:], lr_data.iloc[:int(0.8*len(lr_data)), 0])

# make predictions on validation data
lr_pred = lr_model.predict(lr_data.iloc[int(0.8*len(lr_data)):, 1:])

# calculate RMSE
rmse = mean_squared_error(valid['Close'], lr_pred, squared=False)
print(f"Root Mean Squared Error: {rmse}")

# create subplots for actual and predicted data
fig1 = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.03)

# add actual and predicted data to first subplot
fig1.add_trace(go.Scatter(x=valid.index, y=valid['Close'], name='Actual'), row=1, col=1)
fig1.add_trace(go.Scatter(x=valid.index, y=lr_pred, name='Predicted'), row=1, col=1)

# set figure layout
fig1.update_layout(title='Tata Global Beverages Stock Price - Linear Regression',
                   xaxis_title='Date', height=400, width = 600)

# create subplot for Close price and date
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=data.index, y=data['Close'], name='Close Price'))

# set figure layout
fig2.update_layout(title='Tata Global Beverages Stock Price',
                   xaxis_title='Date', yaxis_title='Close Price',
                   height=300, width=500)

# show figures
fig2.show()
fig1.show()



Root Mean Squared Error: 13.889281057443311


In [9]:
import pandas as pd
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from pmdarima.arima import auto_arima

# Load the data
df = pd.read_csv('NSE-TATAGLOBAL11.csv')

# Sort the data by date
data = df.sort_index(ascending=True, axis=0)

# Split the data into training and validation sets
# split data into train and validation sets
train = data.iloc[:int(0.8*len(data)), :]
valid = data.iloc[int(0.8*len(data)):, :]

# Extract the 'Close' column from the training and validation sets
training = train['Close']
validation = valid['Close']

# Fit the ARIMA model using the training data
model = auto_arima(training, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                   start_P=0, seasonal=True, d=1, D=1, trace=True,
                   error_action='ignore', suppress_warnings=True)
model.fit(training)

# Make predictions on the validation data using the ARIMA model
forecast = model.predict(n_periods=248)
forecast = pd.DataFrame(forecast, index=valid.index, columns=['Prediction'])

# Create subplots for the actual and predicted data
fig = make_subplots(rows=1, cols=1, shared_xaxes=True, vertical_spacing=0.03)

# Add actual and predicted data to the first subplot
fig.add_trace(go.Scatter(x=train.index, y=train['Close'], name='Training Data'), row=1, col=1)
fig.add_trace(go.Scatter(x=valid.index, y=valid['Close'], name='Validation Data'), row=1, col=1)
fig.add_trace(go.Scatter(x=valid.index, y=forecast['Prediction'], name='Predicted'), row=1, col=1)

# Add the predicted data to the second subplot
# fig.add_trace(go.Scatter(x=valid.index, y=forecast['Prediction'], name='Predicted'), row=2, col=1)

# Set the figure layout
fig.update_layout(title='Tata Global Beverages Stock Price - ARIMA',
                  xaxis_title='Date', height=500, width = 800)

# Show the figure
fig.show()

Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,1,1)[12]             : AIC=inf, Time=2.22 sec
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=6102.428, Time=0.05 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=5777.957, Time=0.23 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=inf, Time=1.48 sec
 ARIMA(1,1,0)(0,1,0)[12]             : AIC=6103.655, Time=0.08 sec
 ARIMA(1,1,0)(2,1,0)[12]             : AIC=5688.377, Time=0.66 sec
 ARIMA(1,1,0)(2,1,1)[12]             : AIC=inf, Time=5.96 sec
 ARIMA(1,1,0)(1,1,1)[12]             : AIC=inf, Time=2.02 sec
 ARIMA(0,1,0)(2,1,0)[12]             : AIC=5686.531, Time=0.47 sec
 ARIMA(0,1,0)(1,1,0)[12]             : AIC=5776.162, Time=0.19 sec
 ARIMA(0,1,0)(2,1,1)[12]             : AIC=inf, Time=4.19 sec
 ARIMA(0,1,0)(1,1,1)[12]             : AIC=inf, Time=1.34 sec
 ARIMA(0,1,1)(2,1,0)[12]             : AIC=5688.375, Time=0.52 sec
 ARIMA(1,1,1)(2,1,0)[12]             : AIC=5690.368, Time=1.38 sec
 ARIMA(0,1,0)(2,1,0)[12] intercept   : AIC=5688.5

This code loads stock price data from a CSV file, splits it into training and validation sets, and uses the pmdarima.auto_arima function to automatically fit an ARIMA model to the training data. The model is then used to make predictions on the validation data, and the actual and predicted data are plotted using Plotly. The figure is split into two subplots, with the first showing the actual and predicted data for the validation period, and the second showing only the predicted data.

# LSTM

LSTM stands for **Long Short-Term Memory**, which is a type of recurrent neural network (RNN) that is commonly used in time series forecasting. The main advantage of LSTM over traditional RNNs is that it can learn long-term dependencies in the input sequence, which is crucial for time series forecasting.

In a time series forecasting problem, the goal is to predict future values of a variable based on its past values. The input to an LSTM model is a sequence of past values of the variable, and the output is a sequence of predicted future values. The model learns to make predictions by processing the input sequence one element at a time and updating its internal state based on the current input and the previous state.

The key component of an LSTM model is the cell state, which is used to keep track of long-term dependencies in the input sequence. The cell state is updated at each time step using three gates: the input gate, the forget gate, and the output gate. The input gate determines which values from the current input to update the cell state with. The forget gate determines which values from the previous cell state to discard. The output gate determines which values from the updated cell state to output as the prediction.

The LSTM model also includes a hidden state, which is updated at each time step and is used to carry information between time steps. The hidden state is used to compute the output at each time step.

To train an LSTM model for time series forecasting, the input sequence is divided into training and testing sets, and the model is trained on the training set to minimize the difference between the predicted and actual values. The trained model is then used to make predictions on the testing set.

In summary, LSTM is a powerful tool for time series forecasting because it can learn long-term dependencies in the input sequence. The key components of an LSTM model are the cell state, the hidden state, and the gates, which are used to update the cell state and compute the output at each time step. The LSTM model is trained on a training set of past values and used to make predictions on a testing set of future values.

In [12]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.graph_objects as go
from sklearn.preprocessing import MinMaxScaler


# Load the Walmart quarterly revenue dataset
df = pd.read_csv('WMT_Earnings.csv', index_col='Date')

# Select only the first 50 rows and format the dates
df = df.iloc[:50]
df.index = [df.index[i].split()[0] + " " + df.index[i].split()[2] for i in range(len(df.index))]
df.index = pd.to_datetime(df.index)
df = df.iloc[::-1]

# Select data up to the end of 2019 and convert values to floats
df = df[:"2019"]
df.Value = [float(df.Value[i][:-1]) for i in range(len(df.Value))]


def train_test(df, test_periods):
    """
    Split the dataset into training and testing sets
    """
    train = df[:-test_periods].values
    test = df[-test_periods:].values
    return train, test


test_periods = 8
train, test = train_test(df, test_periods)

# Scale the training data
scaler = MinMaxScaler()
scaler.fit(train)
train_scaled = scaler.transform(train)
train_scaled = torch.FloatTensor(train_scaled)

# Reshape the training data to the correct dimensions
train_scaled = train_scaled.view(-1)

# Define a function to create the input/output pairs for the LSTM model
def get_x_y_pairs(train_scaled, train_periods, prediction_periods):
    """
    train_scaled - training sequence
    train_periods - how many data points to use as inputs
    prediction_periods - how many periods to output as predictions
    """
    x_train = [train_scaled[i:i+train_periods] for i in range(len(train_scaled)-train_periods-prediction_periods)]
    y_train = [train_scaled[i+train_periods:i+train_periods+prediction_periods] for i in range(len(train_scaled)-train_periods-prediction_periods)]
    
    # Use the stack function to convert the list of 1D tensors
    # into a 2D tensor where each element of the list is now a row
    x_train = torch.stack(x_train)
    y_train = torch.stack(y_train)
    
    return x_train, y_train


train_periods = 16  # Number of quarters for input
prediction_periods = test_periods
x_train, y_train = get_x_y_pairs(train_scaled, train_periods, prediction_periods)

class LSTM(nn.Module):
    """
    input_size - will be 1 in this example since we have only 1 predictor (a sequence of previous values)
    hidden_size - can be chosen to dictate how much hidden "long term memory" the network will have
    output_size - this will be equal to the prediction_periods input to get_x_y_pairs
    """
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.linear = nn.Linear(hidden_size, output_size)
        
    def forward(self, x, hidden=None):
        if hidden == None:
            self.hidden = (torch.zeros(1, 1, self.hidden_size),
                           torch.zeros(1, 1, self.hidden_size))
        else:
            self.hidden = hidden
            
        lstm_out, self.hidden = self.lstm(x.view(len(x), 1, -1), self.hidden)
        predictions = self.linear(lstm_out.view(len(x), -1))
        return predictions[-1], self.hidden


# Define the loss function and optimizer
model = LSTM(input_size=1, hidden_size=50, output_size=test_periods)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the LSTM model
epochs = 600
model.train()

for epoch in range(epochs + 1):
    for x, y in zip(x_train, y_train):
        y_hat, _ = model(x, None)
        optimizer.zero_grad()
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()
        
    if epoch % 100 == 0:
        print(f'epoch: {epoch:4} loss:{loss.item():10.8f}')

# Use the trained model to make predictions on the training set
model.eval()
with torch.no_grad():
    predictions, _ = model(train_scaled[-train_periods:], None)

# Apply inverse transform to undo scaling
predictions = scaler.inverse_transform(np.array(predictions.reshape(-1, 1)))

# Create a Plotly graph of the predicted vs actual Walmart quarterly revenue
x = [dt.datetime.date(d) for d in df.index]
fig = go.Figure()

fig.add_trace(go.Scatter(x=x[:-len(predictions)], y=df.Value[:-len(predictions)], mode='lines', name='True Values'))
fig.add_trace(go.Scatter(x=x[-len(predictions):], y=df.Value[-len(predictions):], mode='lines', name='True Values', line=dict(dash='dash')))
fig.add_trace(go.Scatter(x=x[-len(predictions):], y=predictions.ravel(), mode='lines', name='Predicted Values'))

fig.update_layout(title='Walmart Quarterly Revenue', xaxis_title='Date', yaxis_title='Revenue (Billions)', width = 800)
fig.show()



Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.



# Multivariate time series forecasting
Multivariate time series forecasting involves predicting the future values of a dependent variable (or multiple dependent variables) based on the values of one or more independent variables at previous time steps. In other words, we are not only trying to predict the future values of a single time series but also using additional time series as input features to improve our forecasts.

For example, consider a company that wants to forecast its sales volume for the next quarter. The sales volume depends not only on past sales data but also on other factors such as marketing spending, seasonality, and economic indicators. In this case, we have a multivariate time series forecasting problem where we need to use the past values of multiple time series (e.g., sales, marketing spending, seasonality, economic indicators) to predict the future values of the dependent variable (sales volume).

One popular approach to multivariate time series forecasting is to use vector autoregression (VAR) models. VAR models assume that each variable in the system is a linear function of its own past values as well as the past values of the other variables in the system. VAR models can be extended to handle non-linear dependencies between variables, incorporate external variables, and account for seasonality and trends.


# TASK 
1. Try changing the parameters of above algorithm and compare the results
2. Apply the ARIMA on WMT_Earning and compare with LSTM
3. Apply the LSTM on NSE-TATAGLOBAL11 and compare with ARIMA
4. Apply VAR on above dataset