<a href="https://colab.research.google.com/github/alerodriguessf/predicting-apple-stock-price/blob/main/Portfolio_Predicting_Apple_Stock_Price_SARIMAX_20250114.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Predicting Apple Stock Price using SARIMAX

### Step 1: Importing Necessary Libraries
The following libraries are required for data manipulation, visualization, and modeling.

In [None]:
!pip install scipy
!pip install pmdarima

In [None]:
from pmdarima.arima import auto_arima
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tools.eval_measures import mse


### Step 2: Data Acquisition

In [None]:
# Uploading the dataset through the Google Colab file upload functionality.

from google.colab import files

uploaded = files.upload()


### Step 3: Loading the Dataset


In [None]:
# We load the dataset into a Pandas DataFrame for analysis. This dataset contains stock prices for Apple.

df = pd.read_excel('price_apple.xlsx')

In [None]:
# Displaying the first 10 rows of the dataset to understand its structure.
df.head(10)

### Step 4: Feature Engineering


In [None]:
# We create a new feature 'mean' that calculates the average of the 'Low' and 'High' prices for each day.

df['mean'] = (df['Low'] + df['High'])/2

In [None]:
# Displaying the updated DataFrame with the new 'mean' column.

df.head()

### Step 5: Shifting the Target Variable for Prediction


In [None]:
# We shift the 'mean' column by -1 to create a column 'Actual' representing the target variable for prediction.
# We do this to model the future price (next day's average price).

steps = -1
df_pred = df.copy()
df_pred['Actual'] = df_pred['mean'].shift(steps)
df_pred.head()
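
To see what the negative shift does, here is a minimal sketch on a made-up five-row frame (the values are illustrative, not taken from the Apple dataset). Each row's `Actual` holds the next row's `mean`, and the final row has no next day, so its target is NaN:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate shift(-1)
toy = pd.DataFrame({"mean": [10.0, 11.0, 12.0, 13.0, 14.0]})
toy["Actual"] = toy["mean"].shift(-1)  # next row's value moves up one row

# The last row has no "next day", so its target is missing
n_missing = int(toy["Actual"].isna().sum())
toy_clean = toy.dropna()  # same cleanup that Step 6 applies

print(toy)
print(n_missing, len(toy_clean))
```

This is why exactly one row is lost when the missing values are dropped in the next step.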

### Step 6: Cleaning the Data


In [None]:
# Dropping any rows with missing values due to the shift operation, ensuring data consistency for model training.

df_pred = df_pred.dropna()

### Step 7: Converting Date Column to Datetime and Setting as Index


In [None]:
# Converting 'Date' to a datetime object and setting it as the index of the DataFrame for easier time-series manipulation.

df_pred['Date'] = pd.to_datetime(df_pred['Date'])
df_pred.index = df_pred['Date']

### Step 8: Visualizing the 'mean' Column


In [None]:
# Plotting the 'mean' column to get an initial understanding of the data's trend and seasonality.

df_pred['mean'].plot(figsize=(15, 2))

### Step 9: Seasonal Decomposition of the Time Series


In [None]:
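# Decomposing the 'mean' series using an additive model to understand its seasonal, trend, and residual components.
# Note: period=365 treats 365 consecutive rows as one yearly cycle. Daily stock data has roughly
# 252 trading days per year, so a period near 252 may match a calendar year more closely.

sd = sm.tsa.seasonal_decompose(df_pred['mean'], model = 'additive', period = 365)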
sd.plot()
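
As a rough illustration of what an additive decomposition does, the sketch below splits a synthetic series into trend, seasonal, and residual parts with plain NumPy. It is a simplified stand-in, not the statsmodels implementation: the series, the period of 7, and the seasonal pattern are all hypothetical, and an odd period is chosen so that a simple centered moving average suffices.

```python
import numpy as np

period = 7
n = 70
t = np.arange(n)

# Hypothetical seasonal pattern that sums to zero, plus a linear trend
pattern = np.array([3.0, 1.0, -2.0, -1.0, 0.0, -3.0, 2.0])
series = 0.5 * t + pattern[t % period]

# Trend: centered moving average over one full period
half = period // 2
kernel = np.ones(period) / period
trend = np.convolve(series, kernel, mode="valid")  # aligned with t[half : n - half]

# Seasonal: average of the detrended values at each position in the cycle
detrended = series[half : n - half] - trend
positions = t[half : n - half] % period
seasonal = np.array([detrended[positions == k].mean() for k in range(period)])
seasonal -= seasonal.mean()  # center the seasonal component on zero

# Residual: whatever trend and seasonality do not explain
residual = detrended - seasonal[positions]
```

Because the synthetic series has no noise, the recovered seasonal component equals the pattern and the residual is essentially zero; on real prices the residual carries everything the trend and seasonal parts miss.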

### Step 10: Feature Scaling


In [None]:
# We use MinMaxScaler to normalize the features to a range between 0 and 1 for both input features (X) and target variable (Y).

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaler_input = scaler.fit_transform(df_pred[['Low', 'High', 'Close', 'Adj Close', 'Volume','mean']])
scaler_input = pd.DataFrame(scaler_input)
x = scaler_input # Assigning the scaled values to the input features (X)


In [None]:
# Scaling the target variable (Actual) with its own scaler; fit_transform returns a NumPy array.
scaler = MinMaxScaler(feature_range=(0, 1))
scaler_output = scaler.fit_transform(df_pred[['Actual']])
y = scaler_output # Assigning the scaled values to the target variable (Y)

In [None]:
# Renaming columns for clarity
x.rename(columns = {0: 'Low', 1: 'High', 2: 'Close', 3: 'Adj Close', 4: 'Volume', 5: 'mean'}, inplace = True)
x.index = df_pred.index
x.head()

In [None]:
y = pd.DataFrame(scaler_output)
y.rename(columns = {0:'stock_price'}, inplace = True) # Renaming the target column to 'stock_price'
y.index = df_pred.index
y.head()
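
Min-max scaling maps each value `v` to `(v - min) / (max - min)`. The sketch below reproduces that formula with NumPy on hypothetical prices and shows the inverse mapping needed to read a scaled value back in price units. It takes the minimum and maximum from the training portion only, which is one common way to keep test-set information out of the scaler; that is a variation for illustration, not what the cells above do (they fit the scaler on the full dataset).

```python
import numpy as np

prices = np.array([100.0, 110.0, 120.0, 150.0, 140.0])  # hypothetical values
train = prices[:3]                                       # hypothetical training slice

lo, hi = train.min(), train.max()  # statistics from the training slice only


def scale(v):
    """Map prices to the scaled space defined by the training range."""
    return (v - lo) / (hi - lo)


def unscale(s):
    """Map scaled values back to price units."""
    return s * (hi - lo) + lo


scaled = scale(prices)
restored = unscale(scaled)
```

With a train-only fit, test values outside the training range fall outside 0 to 1 (here 150 maps to 2.5), which is expected and not an error.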

### Step 11: Splitting the Data into Training and Test Sets


In [None]:
# We split the data into training and test sets (70% for training, 30% for testing) to evaluate the model's performance.

train_size = int(len(x) * 0.70)
test_size = int(len(df_pred)) - train_size
train_size, test_size

In [None]:
# Creating the training and testing sets for both input features (X) and target variable (Y)

train_x, train_y = x[:train_size], y[:train_size]
test_x, test_y = x[train_size:].dropna(), y[train_size:]
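
For time series the split is positional rather than shuffled, so the model never sees observations that come after the period it is tested on. A minimal sketch of that idea, with a hypothetical helper name and stand-in data:

```python
import numpy as np


def chronological_split(values, train_fraction=0.70):
    """Split a sequence in time order: the first part trains, the rest tests."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


series = np.arange(10)  # hypothetical stand-in for 10 consecutive daily observations
train, test = chronological_split(series)
```

Every training observation precedes every test observation, which is the property a random split would break.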

### Step 12: Automatic ARIMA Model Selection


In [None]:
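# We use the auto_arima function from pmdarima to find the best parameters for the ARIMA model.
# Note: recent pmdarima releases take exogenous regressors through the `X=` argument;
# `exogenous=` is the older keyword name, so check which one your installed version accepts.

step_wise = auto_arima(train_y, exogenous = train_x,
                       trace = True,
                       start_p=1,
                       start_q=1,
                       max_p=7,
                       max_q=7,
                       d=1,
                       max_d=7,
                       stepwise=True
                       )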

In [None]:
# Displaying the summary of the best model found by auto_arima

step_wise.summary()


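**Log-Likelihood:** 7569.047. On its own this number does not measure fit quality; it is mainly useful for comparing candidate models fitted to the same data.

**Information Criteria:** AIC: -15132.095, BIC: -15115.391, and HQIC: -15125.952. Lower values are better, and each criterion penalizes additional parameters to discourage overfitting.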

**Coefficient Estimates:**

- Intercept: 0.0002, not statistically significant (p-value 0.157).
- MA Term (ma.L1): 0.1723, highly significant (p-value < 0.0001), indicating that the previous period's forecast error carries over into the current price change.
- Variance of Residuals (sigma2): 2.339e-05, with a z-score of 75.974. The residual variance is small, although this partly reflects the target being scaled to the 0 to 1 range.

**Model Diagnostics:**

- Ljung-Box Test: p-value 0.97 shows no significant residual autocorrelation, consistent with white-noise residuals.
- Jarque-Bera Test: p-value 0.00 rejects normality, so the residuals are not normally distributed.
- Heteroskedasticity: p-value 0.00 rejects constant variance, so the residual variance changes over time.
- Skew: -0.37 suggests slight left-skewness in the residuals.
- Kurtosis: 13.66, far above the value of 3 for a normal distribution, indicates heavy-tailed residuals.

**Conclusion:** The selected model has a highly significant MA term and residuals with no detectable autocorrelation, while the intercept is not significant. However, the residuals display non-normality and heteroskedasticity, which may require further adjustments to improve predictive accuracy and model robustness.
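
As a quick sanity check on the reported criteria, AIC can be recomputed from the log-likelihood as `2k - 2 * log-likelihood`, where `k` is the number of estimated parameters. The sketch assumes `k = 3`, counting the three estimates listed in the summary (intercept, ma.L1, sigma2):

```python
log_likelihood = 7569.047  # value reported in the summary
k = 3                      # assumption: intercept, ma.L1, and sigma2 are the estimated parameters

aic = 2 * k - 2 * log_likelihood
print(round(aic, 3))
```

The result is about -15132.094, which agrees with the reported AIC of -15132.095 up to the rounding of the displayed log-likelihood.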

In [None]:
# Converting input features (X) and target variable (Y) to numpy arrays for SARIMAX compatibility

train_x = np.array(train_x)
train_y = np.array(train_y)

In [None]:
# Building the SARIMAX model with the best found parameters

model = SARIMAX(
    train_y,
    exog = train_x,
    order = (0, 1, 1),
    enforce_invertibility= False,
    enforce_stationarity= False
)

In [None]:
# Fitting the SARIMAX model to the training data
results = model.fit()
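
For intuition about what `order = (0, 1, 1)` means, the sketch below runs the one-step-ahead recursion of an ARIMA(0, 1, 1) model with a constant by hand: the next value is forecast as the current value, plus the constant, plus theta times the most recent forecast error. The coefficients are the ones reported in the auto_arima summary; the function name and input series are hypothetical, exogenous regressors are left out, and the initial error is assumed to be zero, so this is a simplification rather than what statsmodels computes internally.

```python
def one_step_forecasts(series, const=0.0002, theta=0.1723):
    """One-step-ahead forecasts for y_t = y_{t-1} + const + e_t + theta * e_{t-1}."""
    forecasts = []
    prev_error = 0.0  # assumption: no error information before the first observation
    for t in range(1, len(series)):
        forecast = series[t - 1] + const + theta * prev_error
        forecasts.append(forecast)
        prev_error = series[t] - forecast  # this error feeds the next forecast
    return forecasts


scaled_prices = [0.50, 0.51, 0.50, 0.52]  # hypothetical scaled values
forecasts = one_step_forecasts(scaled_prices)
```

With a small theta, each forecast stays close to the previous observed value and is nudged slightly in the direction of the last error.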

### Step 13: Making Predictions


In [None]:
# We predict values for the test period using the fitted model.
# `end` is inclusive, so adding `steps` (-1) yields exactly test_size forecasts.

pred = results.predict(start=train_size, end=train_size + test_size + (steps), exog=test_x)
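
The index arithmetic in the `predict` call can be checked in isolation. The sizes below are hypothetical; `steps` is -1 as defined in Step 5:

```python
train_size, test_size, steps = 70, 30, -1  # hypothetical sizes for illustration

start = train_size                     # first out-of-sample position
end = train_size + test_size + steps   # last position to forecast (inclusive)
n_forecasts = end - start + 1
```

Because both endpoints are included, the number of forecasts equals `test_size`, which is also the number of exogenous rows the call needs.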

### Step 14: Comparing Predictions with Actual Values


In [None]:
# We create a DataFrame for the actual stock prices and predictions to compare and evaluate the model.

act = pd.DataFrame(scaler_output[train_size:, 0])
act.index = test_x.index
act.rename(columns={0: 'Preço_açao'}, inplace=True)  # 'Preço_açao' is Portuguese for 'stock price'



In [None]:
# Organizing predictions for easy comparison

pred = pd.DataFrame(pred)
pred.reset_index(drop=True, inplace=True)
pred.index = test_x.index
pred['Actual'] = act['Preço_açao']
pred.rename(columns={0: 'Predicted'}, inplace=True)
pred.head()

### Step 15: Visualizing Actual vs. Predicted Values


In [None]:
# We plot both the actual and predicted stock prices for visual comparison.

pred['Actual'].plot(figsize=(20,8), legend = True, color = 'blue')
pred['Predicted'].plot(legend= True, color='red', figsize=(20,8))

### Step 16: Model Evaluation


In [None]:
# Displaying the MSE to assess the model's performance

error = mse(pred['Actual'], pred['Predicted'])
error

The MSE of 0.000150831116141754 is measured on the min-max scaled target (range 0 to 1), so it indicates a small average squared prediction error on that scale, not in price units.

In [None]:
# Displaying the RMSE (root mean squared error) to assess the model's performance

rmse = np.sqrt(error)
rmse

The RMSE of 0.01228 means the typical prediction error is roughly 1.2% of the scaled target range, which confirms that the SARIMAX model has relatively small prediction errors.
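
Since both metrics are computed on scaled values, it can help to express the RMSE in price units. Min-max scaling divides by `(max - min)`, so multiplying a scaled error by that range converts it back. The sketch below uses hypothetical scaled values and a hypothetical price range; in the notebook the range would come from the scaler fitted on `Actual`.

```python
import numpy as np

actual = np.array([0.50, 0.52, 0.51, 0.55])     # hypothetical scaled actuals
predicted = np.array([0.49, 0.53, 0.51, 0.53])  # hypothetical scaled predictions

mse_value = np.mean((actual - predicted) ** 2)
rmse_value = np.sqrt(mse_value)

price_min, price_max = 100.0, 200.0  # hypothetical range of the unscaled target
rmse_in_price_units = rmse_value * (price_max - price_min)
```

The offset `min` cancels when two scaled values are subtracted, which is why only the range is needed to rescale an error.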