# Oblig2:

- Student Number: S374918
- Student Name: Alex McCorkle

(Lecturer said it was OK to do it alone)

### Use Case: Tesla Stock

### Machine Learning Algorithm: Random Forest Regression

For this assignment I decided to implement a Random Forest Regression model to predict the stock price on a specific date as accurately as possible. After reading around and watching a few videos on the differences between several algorithms, I found that Random Forest is a popular ML algorithm for modelling complex data such as the stock market. It can also be quite accurate, and there are findings to back this up: Khan et al. (2023) compared 9 different machine learning models and found that Random Forest outperformed the others when forecasting the stock market.



Khan, A. H., Shah, A., Ali, A., Shahid, R., Zahid, Z. U., Sharif, M. U., Jan, T., & Zafar, M. H. (2023). A performance comparison of machine learning models for stock market prediction with novel investment strategy. PLoS ONE, 18(9), e0286362. https://doi.org/10.1371/journal.pone.0286362

In [22]:
# Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score 
# Note: mean_squared_error itself is not deprecated. From scikit-learn 1.4 onwards only its `squared` parameter is
# deprecated (root_mean_squared_error replaces squared=False). I'm using 1.3.2 and can't seem to update to a more
# recent version, but the call used in this notebook is unaffected.

# Load data:
df = pd.read_csv("TSLA.csv")

df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2010-06-29,3.8,5.0,3.508,4.778,4.778,93831500
1,2010-06-30,5.158,6.084,4.66,4.766,4.766,85935500
2,2010-07-01,5.0,5.184,4.054,4.392,4.392,41094000
3,2010-07-02,4.6,4.62,3.742,3.84,3.84,25699000
4,2010-07-06,4.0,4.0,3.166,3.222,3.222,34334500


In [23]:
df.tail()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
2840,2021-10-08,796.210022,796.380005,780.909973,785.48999,785.48999,16711100
2841,2021-10-11,787.650024,801.23999,785.5,791.940002,791.940002,14200300
2842,2021-10-12,800.929993,812.320007,796.570007,805.719971,805.719971,22020000
2843,2021-10-13,810.469971,815.409973,805.780029,811.080017,811.080017,14120100
2844,2021-10-14,815.48999,820.25,813.349976,818.320007,818.320007,12203200


In [24]:
df.info() # No null values, nice!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2845 entries, 0 to 2844
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       2845 non-null   object 
 1   Open       2845 non-null   float64
 2   High       2845 non-null   float64
 3   Low        2845 non-null   float64
 4   Close      2845 non-null   float64
 5   Adj Close  2845 non-null   float64
 6   Volume     2845 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 155.7+ KB


In [25]:
# Convert Date to datetime so the dates can be compared, sorted, etc. They are currently stored as strings (dtype object).
df['Date'] = pd.to_datetime(df['Date'])

df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2010-06-29,3.8,5.0,3.508,4.778,4.778,93831500
1,2010-06-30,5.158,6.084,4.66,4.766,4.766,85935500
2,2010-07-01,5.0,5.184,4.054,4.392,4.392,41094000
3,2010-07-02,4.6,4.62,3.742,3.84,3.84,25699000
4,2010-07-06,4.0,4.0,3.166,3.222,3.222,34334500


In [26]:
# Creating and selecting features:

# Since we want to be able to predict on a given date, it might be worth adding features such as Year, Month and Day as well

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Use date as index instead
df.set_index('Date', inplace=True)

# I think it was already sorted by date but just to be sure
df.sort_index(inplace=True)

# Select features and target:
features = ['Year', 'Month', 'Day', 'Open', 'High', 'Low', 'Volume']
target = 'Close' # Here we can test it out with Adj Close as well to see if this is more accurate for prediction.
# Close = Stock price when market closes

X = df[features]
y = df[target]
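
As a quick sanity check on the date handling in this cell, the sketch below rebuilds the first five rows printed by `df.head()` as a small stand-alone frame, converts the dates, extracts the same date parts, and confirms the index is in ascending order. The name `sample` is only illustrative and is not part of the notebook.

```python
import pandas as pd

# Hypothetical mini-frame using the first five rows printed by df.head()
sample = pd.DataFrame({
    "Date": ["2010-06-29", "2010-06-30", "2010-07-01", "2010-07-02", "2010-07-06"],
    "Close": [4.778, 4.766, 4.392, 3.84, 3.222],
})

# Same steps as the cell above, applied to the small frame
sample["Date"] = pd.to_datetime(sample["Date"])
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month
sample["Day"] = sample["Date"].dt.day
sample = sample.set_index("Date").sort_index()

# True when the dates are in ascending order
is_sorted = sample.index.is_monotonic_increasing
print(is_sorted)
print(sample[["Year", "Month", "Day"]])
```

Running the same `is_monotonic_increasing` check on `df.index` would settle whether the full dataset was already sorted.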

In [27]:
# Now time to split the data into test data and training data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 20% test, 80% training
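
One thing to be aware of: `train_test_split` shuffles rows by default, so the test dates end up mixed in among the training dates. For dated data, a common alternative is a chronological hold-out, where the model is tested only on dates later than anything it was trained on. A minimal sketch on synthetic data follows; `chronological_split` and `demo` are hypothetical names, and the prices are made up.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, test_size=0.2):
    """Split a date-sorted frame so the last `test_size` fraction is the test set."""
    n_test = int(round(len(frame) * test_size))
    split_at = len(frame) - n_test
    return frame.iloc[:split_at], frame.iloc[split_at:]

# Synthetic stand-in for the stock data: 10 business days of made-up prices
dates = pd.date_range("2021-01-04", periods=10, freq="B")
demo = pd.DataFrame({"Close": np.arange(10, dtype=float)}, index=dates)

train, test = chronological_split(demo, test_size=0.2)
print(len(train), len(test))
# Every training date is earlier than every test date
print(train.index.max() < test.index.min())
```

With scikit-learn, passing `shuffle=False` to `train_test_split` should give the same kind of split without a helper function, provided the frame is already sorted by date.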


In [28]:
# Creating the model:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
# Can adjust n_estimators: more trees usually help, but at a higher computational cost
# Training the model:
rf_model.fit(X_train, y_train)
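
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs drive the predictions. The sketch below uses synthetic data (not the Tesla file) where the target is built almost entirely from one column, so the importance should concentrate there. The column names `Signal` and `Noise` and the variables `X_demo`, `y_demo` and `demo_model` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300

# Synthetic data: 'Signal' determines the target, 'Noise' is unrelated to it
X_demo = pd.DataFrame({
    "Signal": rng.uniform(0, 100, n),
    "Noise": rng.uniform(0, 100, n),
})
y_demo = 2.0 * X_demo["Signal"] + rng.normal(0, 0.1, n)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

# Importances are normalised to sum to 1 across the features
importances = pd.Series(demo_model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Building the same series from `rf_model.feature_importances_` with `features` as the index would show how much the notebook's model leans on Open, High and Low compared with the date columns.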

In [29]:
y_pred = rf_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}") # Average squared difference between actual and predicted, lower = better.
print(f"R-Squared Score: {r2}") # 1 = perfect fit, 0 = no better than always predicting the mean

Mean Squared Error: 13.43565980714506
R-Squared Score: 0.9996369063639715
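
MSE is expressed in squared dollars, so taking its square root gives an error in the same unit as the price, which is easier to interpret. A small worked example using the value printed above:

```python
import math

mse = 13.43565980714506  # value printed by the evaluation cell
rmse = math.sqrt(mse)    # back to the unit of the Close price
print(f"Root Mean Squared Error: {rmse:.2f}")
```

This comes to roughly 3.67 dollars, on prices that range from about 3 to over 800 in this dataset. Note that a day's Close always lies between that day's Low and High, both of which are among the features, so a very high R-squared is to be expected with this setup.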
