# Project-3: Stock Prediction
This project predicts a stock's future daily percentage change from its past daily percentage changes.  
There are two types of prediction problems: regression and classification.  
- Regression: The output is a continuous variable (e.g., percentage change, price, cost, salary).
- Classification: The output is a categorical variable (e.g., gender, dress type, species).

## Single Stock
- We will use the Close (closing price) data of a single stock, AAPL.

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np

### Data Preparation

In [2]:
df = yf.Ticker('AAPL').history(start='2015-1-1', end='2018-12-31')
df.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2015-01-02 00:00:00-05:00,24.895674,24.90685,23.992734,24.435265,212818400,0.0,0.0
2015-01-05 00:00:00-05:00,24.202833,24.283294,23.559154,23.746893,257142000,0.0,0.0
2015-01-06 00:00:00-05:00,23.811706,24.010621,23.38482,23.749126,263188400,0.0,0.0
2015-01-07 00:00:00-05:00,23.959216,24.182716,23.847466,24.082142,160423600,0.0,0.0
2015-01-08 00:00:00-05:00,24.41292,25.06554,24.294463,25.007429,237458000,0.0,0.0


In [3]:
# Reset the index so that the previous index becomes a regular 'Date' column.
df.reset_index(inplace=True)
df.head()

,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2015-01-02 00:00:00-05:00,24.895674,24.90685,23.992734,24.435265,212818400,0.0,0.0
1,2015-01-05 00:00:00-05:00,24.202833,24.283294,23.559154,23.746893,257142000,0.0,0.0
2,2015-01-06 00:00:00-05:00,23.811706,24.010621,23.38482,23.749126,263188400,0.0,0.0
3,2015-01-07 00:00:00-05:00,23.959216,24.182716,23.847466,24.082142,160423600,0.0,0.0
4,2015-01-08 00:00:00-05:00,24.41292,25.06554,24.294463,25.007429,237458000,0.0,0.0


In [4]:
# Use dt to access only the date part and assign it as the new values for the 'Date' column.
df['Date'] = df.Date.dt.date
df.head()

,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2015-01-02,24.895674,24.90685,23.992734,24.435265,212818400,0.0,0.0
1,2015-01-05,24.202833,24.283294,23.559154,23.746893,257142000,0.0,0.0
2,2015-01-06,23.811706,24.010621,23.38482,23.749126,263188400,0.0,0.0
3,2015-01-07,23.959216,24.182716,23.847466,24.082142,160423600,0.0,0.0
4,2015-01-08,24.41292,25.06554,24.294463,25.007429,237458000,0.0,0.0


In [5]:
# Set the 'Date' column as the index again, retaining only the date part.
df.set_index('Date', inplace=True)
df.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2015-01-02,24.895674,24.90685,23.992734,24.435265,212818400,0.0,0.0
2015-01-05,24.202833,24.283294,23.559154,23.746893,257142000,0.0,0.0
2015-01-06,23.811706,24.010621,23.38482,23.749126,263188400,0.0,0.0
2015-01-07,23.959216,24.182716,23.847466,24.082142,160423600,0.0,0.0
2015-01-08,24.41292,25.06554,24.294463,25.007429,237458000,0.0,0.0


In [6]:
# keep only the Close column
dfC = pd.DataFrame(df.Close)
dfC.head()

Date,Close
2015-01-02,24.435265
2015-01-05,23.746893
2015-01-06,23.749126
2015-01-07,24.082142
2015-01-08,25.007429


### Percentage Change

In [7]:
# daily percentage change: (today's close - previous close) / previous close
dfCP = dfC.pct_change()
dfCP.head()

Date,Close
2015-01-02,
2015-01-05,-0.028171
2015-01-06,9.4e-05
2015-01-07,0.014022
2015-01-08,0.038422
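
As a quick sanity check on what `pct_change` computes, the sketch below rebuilds the first two values of the table by hand. It assumes only the first three AAPL closes shown earlier in the notebook; the names `close`, `manual`, and `auto` are illustrative and not part of the original code.

```python
import pandas as pd

# First three AAPL closes copied from the notebook output above.
close = pd.Series([24.435265, 23.746893, 23.749126])

# Manual definition: (today - previous day) / previous day.
manual = (close - close.shift(1)) / close.shift(1)

# pandas shortcut used in the notebook.
auto = close.pct_change()

# The first entry is NaN in both cases because there is no previous day.
print(manual.round(6).tolist())
print(auto.round(6).tolist())
```

Both approaches should agree up to floating-point rounding, which is why the first row is missing and gets dropped in the next cell.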


In [8]:
# drop the first row, which has no previous day to compare against
dfCP.dropna(inplace=True)
dfCP.head()

Date,Close
2015-01-05,-0.028171
2015-01-06,9.4e-05
2015-01-07,0.014022
2015-01-08,0.038422
2015-01-09,0.001073


### Lag Data
The lagged columns hold past percentage changes (the inputs), while the Close column holds the percentage change that we aim to predict (the target).
- lag_1 is the percentage change one trading day before the Close day.
- lag_2 is the percentage change two trading days before the Close day.
- One important question is how many lags to use as input; this notebook uses five.
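
Since the number of lags is a choice, it can help to make it a parameter. The following is a minimal sketch of that idea, not the notebook's own code: `add_lags` and the toy values are hypothetical, and the only pandas calls used are `shift` and `dropna`.

```python
import pandas as pd

def add_lags(frame, column="Close", n_lags=5):
    """Return a copy of frame with lag_1..lag_n columns, dropping incomplete rows."""
    out = frame.copy()
    for k in range(1, n_lags + 1):
        # shift(k) moves values down k rows, so each row sees the value from k rows earlier
        out[f"lag_{k}"] = out[column].shift(k)
    # the first n_lags rows lack a full history and contain NaN
    return out.dropna()

# Toy percentage changes (made-up values, not market data).
toy = pd.DataFrame({"Close": [0.01, -0.02, 0.03, 0.00, 0.015, -0.005]})
lagged = add_lags(toy, n_lags=2)
print(lagged)
```

With `n_lags=5` on `dfCP`, this should produce the same columns as the explicit `shift(1)` to `shift(5)` assignments below.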

In [9]:
# lag data
dfCPL = dfCP.copy()
dfCPL['lag_1'] = dfCPL.Close.shift(1)
dfCPL['lag_2'] = dfCPL.Close.shift(2)
dfCPL['lag_3'] = dfCPL.Close.shift(3)
dfCPL['lag_4'] = dfCPL.Close.shift(4)
dfCPL['lag_5'] = dfCPL.Close.shift(5)
dfCPL.head()

Date,Close,lag_1,lag_2,lag_3,lag_4,lag_5
2015-01-05,-0.028171,,,,,
2015-01-06,9.4e-05,-0.028171,,,,
2015-01-07,0.014022,9.4e-05,-0.028171,,,
2015-01-08,0.038422,0.014022,9.4e-05,-0.028171,,
2015-01-09,0.001073,0.038422,0.014022,9.4e-05,-0.028171,


In [10]:
# drop the first five rows, which do not have all five lags
dfCPL.dropna(inplace=True)
dfCPL.head()

Date,Close,lag_1,lag_2,lag_3,lag_4,lag_5
2015-01-12,-0.024641,0.001073,0.038422,0.014022,9.4e-05,-0.028171
2015-01-13,0.008879,-0.024641,0.001073,0.038422,0.014022,9.4e-05
2015-01-14,-0.00381,0.008879,-0.024641,0.001073,0.038422,0.014022
2015-01-15,-0.02714,-0.00381,0.008879,-0.024641,0.001073,0.038422
2015-01-16,-0.00777,-0.02714,-0.00381,0.008879,-0.024641,0.001073


### Split Data
The entire dataset will be divided, in chronological order and without shuffling, into three sets:
1. Training Set (80%): Used to build the model.
2. Validation Set (10%): Used to select the model with the best parameters.
3. Test Set (10%): Used to evaluate the performance of the best model.

- $X$ denotes the input data: the lagged (past) percentage changes.
- $y$ denotes the regression output: the Close (future) percentage change.
- $yC$ denotes the classification output derived from $y$: 1 if the Close (future) percentage change is positive (increasing) and 0 otherwise (decreasing or flat).
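
The 80/10/10 split described above can be sketched as a small helper. This is an illustrative version under stated assumptions: `chronological_split`, `train_end`, and `valid_end` are hypothetical names, and the 999-element list stands in for the lagged dataset, whose length is implied by the shapes reported in the notebook.

```python
def chronological_split(data, train_end=0.8, valid_end=0.9):
    """Split a sliceable sequence into train/validation/test without shuffling."""
    n = int(len(data) * train_end)   # end of the training block
    m = int(len(data) * valid_end)   # end of the validation block
    return data[:n], data[n:m], data[m:]

# Stand-in for the 999 rows of the lagged dataset.
rows = list(range(999))
train, valid, test = chronological_split(rows)
print(len(train), len(valid), len(test))  # 799 100 100
```

Keeping the rows in time order means the validation and test sets always come after the training data, so the model is never evaluated on days that precede the ones it was fitted on.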

In [11]:
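# inputs: the five lag columns (every column except Close)
X = dfCPL.iloc[:,1:]
# regression output: the percentage change itself
y = dfCPL['Close'].values

# classification output: 1 if the change is positive, 0 otherwise
yC = np.where(y > 0, 1, 0)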

In [12]:
# first 5 values of y
y[:5]

array([-0.02464069,  0.00887863, -0.00381041, -0.02714035, -0.00777039])

In [13]:
# first 5 values of yC
yC[:5]

array([0, 1, 0, 0, 0])

In [14]:
# row index at 80% of the data (end of the training set)
N = int(len(dfCPL)*0.8)
N

799

In [15]:
# row index at 90% of the data (end of the validation set)
M = int(len(dfCPL)*0.9)
M

899

In [16]:
# chronological split: train = first 80%, validation = next 10%, test = last 10%
X_train, y_train, yC_train = X[:N] , y[:N] , yC[:N]
X_valid, y_valid, yC_valid = X[N:M], y[N:M], yC[N:M]
X_test , y_test , yC_test  = X[M:] , y[M:] , yC[M:]

In [17]:
# training set shape
X_train.shape, y_train.shape, yC_train.shape

((799, 5), (799,), (799,))

In [18]:
# validation set shape
X_valid.shape, y_valid.shape, yC_valid.shape

((100, 5), (100,), (100,))

In [19]:
# test set shape
X_test.shape, y_test.shape, yC_test.shape

((100, 5), (100,), (100,))

### Decision Tree 

In [20]:
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

In [21]:
# decision tree classifiers with maximum depths of 2, 4, and 6
dtC_2 = DecisionTreeClassifier(max_depth=2)
dtC_4 = DecisionTreeClassifier(max_depth=4)
dtC_6 = DecisionTreeClassifier(max_depth=6)

In [22]:
dtC_2.fit(X_train, yC_train)
dtC_2.score(X_valid, yC_valid)

0.54

In [23]:
dtC_4.fit(X_train, yC_train)
dtC_4.score(X_valid, yC_valid)

0.54

In [24]:
dtC_6.fit(X_train, yC_train)
dtC_6.score(X_valid, yC_valid)

0.47

In [25]:
# depth 2 ties with depth 4 on validation (0.54); evaluate the depth-2 tree on the test set
dtC_2.score(X_test, yC_test)

0.49
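
The fit-and-score cells above repeat the same steps for each depth, so the selection can also be written as a loop. The sketch below is one possible way to do this, not the notebook's method: it runs on synthetic random data (not stock data), uses the hypothetical names `X_tr`, `y_tr`, `X_va`, `y_va`, and fixes `random_state` so repeated runs give the same result.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (random noise, NOT stock data) with five input columns.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)
X_va, y_va = rng.normal(size=(50, 5)), rng.integers(0, 2, size=50)

scores = {}
for depth in (2, 4, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    scores[depth] = model.score(X_va, y_va)  # validation accuracy

# max() returns the first key with the highest score, so ties go to the shallower tree.
best_depth = max(scores, key=scores.get)
print(scores, best_depth)
```

Breaking ties in favour of the shallower tree is a design choice made here for simplicity; it happens to match the choice of the depth-2 tree above, but the notebook does not state its tie-breaking rule.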

### Random Forest

In [26]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [27]:
# random forest classifiers with maximum tree depths of 2, 4, and 6
rfC_2 = RandomForestClassifier(max_depth=2)
rfC_4 = RandomForestClassifier(max_depth=4)
rfC_6 = RandomForestClassifier(max_depth=6)

In [28]:
rfC_2.fit(X_train, yC_train)
rfC_2.score(X_valid, yC_valid)

0.53

In [29]:
rfC_4.fit(X_train, yC_train)
rfC_4.score(X_valid, yC_valid)

0.52

In [30]:
rfC_6.fit(X_train, yC_train)
rfC_6.score(X_valid, yC_valid)

0.48

In [31]:
# depth 2 scored highest on validation (0.53); evaluate it on the test set
rfC_2.score(X_test, yC_test)

0.54
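
The accuracies above all sit between 0.47 and 0.54, so it can be useful to compare them with a naive baseline that always predicts the most frequent training label. The sketch below shows one way to compute such a baseline; `majority_baseline_accuracy` and the demo labels are hypothetical and are not results from this notebook.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_eval):
    """Accuracy of always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_eval) == majority))

# Hypothetical labels for illustration only.
y_train_demo = np.array([1, 1, 0, 1, 0, 1])
y_eval_demo = np.array([1, 0, 0, 1])
print(majority_baseline_accuracy(y_train_demo, y_eval_demo))  # 0.5
```

Applied to the notebook's data, this would be called as `majority_baseline_accuracy(yC_train, yC_test)`; a classifier is only informative to the extent that it beats this number.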

## Multiple Stocks