# Predicting direction of stock price from interest rate and inflation rate


_We utilized logistic regression to analyze the stock price data and provided a predictive model._

by Allan Lee, Jianhao Zhang, Yi Yan and Chengyu Tao (DSCI 522 Group 3 Milestone 3)

2023/12/02

In [1]:
import yfinance as yf
import pandas as pd
import altair as alt
from myst_nb import glue
#import pickle
#from sklearn import set_config
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

In [3]:
result_df=pd.read_csv("../results/tables/mdl_result.csv")

In [12]:
dummy_score = result_df["dummy_model_score"][0] * 100
glue("dummy_mean_test_score",dummy_score, display=False)
regression_score = result_df["logistic_regression_score"][0] * 100
glue("regression_mean_test_score",regression_score, display=False)


In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import data_read
from train_test_split_class import train_test_split_class

ModuleNotFoundError: No module named 'data_read'

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

## Summary

## Introduction

During the COVID-19 pandemic, central banks around the world lowered interest rates to ease the economic challenges posed by the pandemic. As the pandemic eased, the lower interest rates led to excess consumer spending, which pushed the inflation rate to unacceptable levels. To bring inflation back to pre-pandemic levels, the central bank raised the interest rate sharply to its highest level in 15 years. Inflation and interest rates now often make the headlines of financial news, and with more than 50% of American households owning stocks {cite}`Caporal2023StockOwnership`, our team is curious to find out how inflation and interest rates affect stock returns. We ask the question: given inflation rate and interest rate data, can we predict whether we will profit if we invest in a stock market index and hold it for 1 year?

## Methods & Results
### Data
#### Raw
1. We use the Standard & Poor's 500 Index (S&P 500) as a proxy for the stock market. The index tracks the stocks of the 500 largest companies in the USA. The price of the S&P 500 is obtained from Yahoo Finance {cite}`YahooFinanceSP500`.
2. Inflation data is obtained by calculating the change in the consumer price index (CPI) {cite}`YahooFinanceSP500`. We obtained the United States CPI from the Federal Reserve Economic Data website and computed the yearly inflation rate.
3. We use the Federal funds rate as a proxy for the interest rate. It is the target interest rate set by the Federal Reserve for commercial banks to lend and borrow overnight. We obtained the Federal funds rate from the Federal Reserve Economic Data website.
#### Derived
We derived the change in inflation rate and the change in interest rate from the raw data as additional features. The news often reports that inflation and interest rates are increasing or decreasing, so we thought these 2 features might give our model additional predictive power.
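As a minimal sketch of how such change features could be derived with pandas, the snippet below subtracts the value from 12 months earlier from the current value. The helper name `add_yearly_change`, the choice of a simple difference in percentage points, and the sample values are assumptions for illustration; they are not taken from the project's `data_read` module.

```python
import pandas as pd


def add_yearly_change(df, columns, periods=12):
    """Add a `<column>_chg` column holding the difference to `periods` rows earlier.

    Hypothetical helper: assumes one row per month, sorted by date.
    """
    out = df.copy()
    for col in columns:
        out[f"{col}_chg"] = out[col] - out[col].shift(periods)
    return out


# Made-up month-end data, used only to show the mechanics.
idx = pd.date_range("2020-01-01", periods=14, freq="MS") + pd.offsets.MonthEnd(0)
rates = pd.DataFrame(
    {
        "inflation_rate_pct": [2.0 + 0.1 * i for i in range(14)],
        "interest_rate_pct": [1.5] * 12 + [1.75, 2.0],
    },
    index=idx,
)

features = add_yearly_change(rates, ["inflation_rate_pct", "interest_rate_pct"])
print(features.tail(3))
```

The first 12 rows of each `_chg` column are missing because there is no observation 12 months earlier to compare against.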

#### Analysis

The Python programming language {cite}`Python` and the following Python packages were used to perform the analysis: NumPy {cite}`Harris2020NumPy`, Pandas {cite}`mckinney-proc-scipy-2010`, Altair {cite}`altair`, and Scikit-learn {cite}`scikit-learn`. The code used to perform the EDA and create the report can be found here: https://github.com/UBC-MDS/dsci_522_group_3/tree/main.

#### Remarks
##### Resampling
The S&P 500 index, CPI, and interest rate data we obtained have different sampling frequencies. CPI data has the lowest frequency: it is sampled on the first day of every month. We decided to resample all data to the last day of every month so that year-over-year and month-over-month changes are easy to calculate and interpret. Interest rate data was sampled daily and is noisy, so we filtered it by taking the monthly median during resampling. The following table summarizes how the data preprocessing was done.


| Data | Original sampling period | Preprocess procedure |
| -------- | -------- | -------- |
| S&P 500 Index | daily | Take the value from last day of month. If we do not have data for last day of month, use the data from the closest previous date |
| CPI | first day of every month | Offset the date by 1 day to the last day of the previous month. We consider the difference in value over 1 day negligible |
| Interest Rate | daily | Resample to the last day of month by taking the median of all daily values in the month to filter out noise |

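The three procedures in the table could be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the project's actual preprocessing code: the series names, the helper `to_month_end`, and all values are made up.

```python
import pandas as pd


def to_month_end(period_index):
    """Convert a monthly PeriodIndex to timestamps on the last calendar day."""
    return period_index.to_timestamp() + pd.offsets.MonthEnd(0)


# Made-up daily data on business days (2023-02-28 is deliberately missing).
days = pd.date_range("2023-01-02", "2023-02-27", freq="B")
sp500_daily = pd.Series(range(len(days)), index=days, dtype=float)
rate_daily = pd.Series(4.25, index=days)
rate_daily.iloc[3] = 9.0  # a single noisy spike in January

# Made-up CPI values dated on the first day of the month.
cpi = pd.Series([300.0, 301.0], index=pd.to_datetime(["2023-02-01", "2023-03-01"]))

# S&P 500: last available value in each month, labelled with the month end.
sp500_monthly = sp500_daily.groupby(sp500_daily.index.to_period("M")).last()
sp500_monthly.index = to_month_end(sp500_monthly.index)

# Interest rate: monthly median, which is robust to the spike.
rate_monthly = rate_daily.groupby(rate_daily.index.to_period("M")).median()
rate_monthly.index = to_month_end(rate_monthly.index)

# CPI: shift the date back by one day to the last day of the previous month.
cpi_month_end = cpi.copy()
cpi_month_end.index = cpi.index - pd.Timedelta(days=1)

print(pd.DataFrame({"gspc": sp500_monthly, "rate": rate_monthly, "cpi": cpi_month_end}))
```

Grouping by calendar month and taking the last value automatically falls back to the closest previous date when the last day of the month has no observation.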

 


### Read Data From Web

Here we read all data.

#### Columns
| column name            | description                                                                                |
|------------------------|--------------------------------------------------------------------------------------------|
| gspc                   | price of S&P 500 stock index (will be ignored for model)                                   |
| inflation_rate_pct     | 1 year inflation rate (12 months ago to now) (will be a feature for model)                 |
| interest_rate_pct      | interest rate (will be a feature for model)                                                |
| inflation_rate_pct_chg | change of inflation between now and 12 months ago (will be a feature for model)            |
| interest_rate_pct_chg  | change of interest rate between now and 12 months ago (will be a feature for model)        |
| gspc_prev_year_pct_chg | change of gspc between now and 12 months ago (will be a feature for model)                 |
| gspc_next_year_pct_chg | change of gspc between now and 12 months later (will be used to get target)                |
| target                 | whether gspc increased 12 months later compared to now (will be target for classification) |
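To make the relationship between `gspc`, `gspc_next_year_pct_chg`, and `target` concrete, here is a small sketch of one way these columns could be computed. The function name `add_forward_return_and_target` and the sample prices are hypothetical, and the actual logic in `data_read.get_all_data()` may differ.

```python
import pandas as pd


def add_forward_return_and_target(df, price_col="gspc", periods=12):
    """Add the 12-month-ahead percentage change and a boolean target.

    Hypothetical helper: assumes one row per month, sorted by date.
    """
    out = df.copy()
    future_price = out[price_col].shift(-periods)
    out["gspc_next_year_pct_chg"] = (future_price / out[price_col] - 1) * 100
    out["target"] = out["gspc_next_year_pct_chg"] > 0
    # The last `periods` rows have no future price, so their target is unknown.
    return out.dropna(subset=["gspc_next_year_pct_chg"])


# Made-up prices: flat for 12 months, then one rise and one fall.
prices = pd.DataFrame({"gspc": [100.0] * 12 + [110.0, 90.0]})
labelled = add_forward_return_and_target(prices)
print(labelled)
```

Rows without a value 12 months ahead are dropped rather than labelled, because a missing future price would otherwise silently compare as `False`.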
 

In [None]:
data_df: pd.DataFrame = data_read.get_all_data()

In [None]:
data_df.head()

In [None]:
data_df.tail()

### EDA
#### Time series plot of all variables

In [None]:
(alt
 .Chart(data_df)
 .mark_line()
 .encode(x=alt.X('date', type='temporal'),
         y=alt.Y(alt.repeat('row'), type='quantitative'))
 .properties(width=500, height=200)
 .repeat(row=['gspc', 'inflation_rate_pct', 'interest_rate_pct', 'inflation_rate_pct_chg',
              'interest_rate_pct_chg', 'gspc_prev_year_pct_chg', 'gspc_next_year_pct_chg',
              'target']))

```{figure} ../results/figures/time.png
---
width: 600px
name: time series
---
Time series plot of all variables
```

In [None]:
df = data_df

In [None]:
# Column data types 
df.info()

- There are no NA or missing values in df.
- Each row records the observation for one month.
- The time series data runs from 1955-07 to 2022-10 and contains 808 observations.
- The target is True when the stock price went up, and False when the stock price went down.
- We will use five columns, 'inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg', and
       'gspc_prev_year_pct_chg', as features, and the target as the response in this binary classification problem.

## Split data and EDA

In [None]:
# split data into training and test
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df

In [None]:
# statistical summary for dataframe
train_df.describe()

In [None]:
train_df.columns

In [None]:
features = ['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']

In [None]:
import eda
plot_1=eda.plot_histograms(train_df, "target", features)

```{figure} ../results/figures/hist.png
---
width: 600px
name: histogram
---
Histograms of each variable
```

In [3]:
# finding potential correlation between numeric columns
num_col = train_df.select_dtypes(include=['float64']).columns.tolist()

train_df[num_col].corr('spearman').style.background_gradient()

NameError: name 'train_df' is not defined

- Spearman's rank correlation revealed some potential correlation between the following pairs of columns:

**interest_rate_pct vs inflation_rate_pct** 

**interest_rate_pct_chg vs inflation_rate_pct_chg**

**inflation_rate_pct_chg vs inflation_rate_pct**

**interest_rate_pct_chg vs interest_rate_pct**
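For readers unfamiliar with Spearman's rank correlation, the sketch below shows on made-up data that it equals the Pearson correlation computed on the ranks of the values, which is why it captures monotonic relationships that are not linear. The column names `x` and `y` are placeholders, not columns of the report's dataset.

```python
import pandas as pd

# Made-up data: y increases with x, but not linearly.
toy = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [1.0, 4.0, 9.0, 16.0, 100.0]}
)

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman's coefficient is Pearson's coefficient applied to the ranks.
by_ranks = toy.rank().corr(method="pearson").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}, by_ranks={by_ranks:.3f}")
```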

In [None]:
eda.scatter_plot(train_df, 'interest_rate_pct', 'inflation_rate_pct', color='blue')
eda.scatter_plot(train_df, 'interest_rate_pct_chg', 'inflation_rate_pct_chg', color='green')
eda.scatter_plot(train_df, 'inflation_rate_pct_chg', 'inflation_rate_pct', color='red')
eda.scatter_plot(train_df, 'interest_rate_pct_chg', 'interest_rate_pct', color='blue')

```{figure} ../results/figures/scat.png
---
width: 600px
name: scatter plot
---
Scatter plots of a few pairs of variables
```

- Examine the data type for every column.
- Illustrate the distribution of all numeric columns and investigate possible correlations between them.
- Divide the dataframe into training and testing datasets with an 80:20 ratio.
- Based on the histograms of all columns, the five numerical columns 'inflation_rate_pct', 'interest_rate_pct', 'inflation_rate_pct_chg', 'interest_rate_pct_chg', and 'gspc_prev_year_pct_chg' are helpful in separating the target.

### Model

In [None]:
# Separate the target from the features in the train and test sets

X_train, y_train, X_test, y_test = train_test_split_class(df, "target", 0.2, random_state = 123)

### Data
The dataset comprises 808 monthly records. Each row holds the predictor values for the corresponding month together with a target, True or False, that indicates whether the S&P 500 index was higher 12 months later.


### Preprocessing Data

#### Numeric features:
- 'inflation_rate_pct'
- 'interest_rate_pct'
- 'inflation_rate_pct_chg'
- 'interest_rate_pct_chg'
- 'gspc_prev_year_pct_chg'

Since there are no missing values, imputation is not necessary. We apply a StandardScaler to all numeric features.

In [None]:
numerical_features = ['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']

#Create Column Transformer 
preprocessor = make_column_transformer(    
    (StandardScaler(), numerical_features),  
)

### Model Selection
#### Logistic Regression
Our focus is on identifying whether there is an increase in the S&P 500 index, making it a classification problem. To tackle this, we utilize Logistic Regression.

In [None]:
pipe = make_pipeline(preprocessor, LogisticRegression())

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)

On the test data, logistic regression yields an accuracy of {glue:text}`regression_mean_test_score`%. Accuracy is the ratio of correct predictions to all predictions. This metric must be interpreted with caution, particularly when the classes are imbalanced.
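The caution about class imbalance can be illustrated with a small sketch. The labels below are made up (70 "up" months and 30 "down" months) and are not the report's data; the point is only that a classifier which always predicts the majority class already reaches an accuracy equal to the majority share while never identifying the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up imbalanced labels: 1 = index went up, 0 = index went down.
y = np.array([1] * 70 + [0] * 30)
X = np.zeros((100, 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
cm = confusion_matrix(y, y_pred, labels=[0, 1])
recall_down = recall_score(y, y_pred, pos_label=0)

print(f"accuracy={acc:.2f}, recall for 'down'={recall_down:.2f}")
print(cm)
```

Looking at the confusion matrix or per-class recall alongside accuracy shows whether a model has learned anything beyond the class frequencies.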

#### Dummy Classifier

In [None]:
from sklearn.dummy import DummyClassifier

dc = DummyClassifier()
dc.fit(X_train, y_train)

In [None]:
dc.score(X_test, y_test)

### Discussion
The model training is designed to follow the Golden Rule: the test data is not used in any way during training. On the test data, the dummy classifier yields an accuracy of {glue:text}`dummy_mean_test_score`%, which is better than the accuracy of logistic regression. The comparison therefore shows no advantage of `Logistic Regression` over the `DummyClassifier` baseline, and further data preprocessing is needed to improve the model. This is in line with our expectation, because we do not have enough meaningful features to extract the information needed for a more accurate classification. We also have imbalanced classes, which can make it harder for the model to determine whether the index will grow.

The model's performance may be sensitive to hyperparameter settings, so experimenting with different configurations through hyperparameter tuning may improve it. In the future, we may also consult specialists in the finance field to gather more key information about the topic. This would allow more features to be added to the data and could positively impact model training.
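A minimal sketch of what such tuning could look like with scikit-learn's `GridSearchCV` is given below. It uses synthetic data from `make_classification` rather than the report's dataset, and the grid of `C` values is an arbitrary assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (5 numeric features).
X, y = make_classification(n_samples=200, n_features=5, random_state=123)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation on the training data only, so the test set stays untouched.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which keeps the cross-validation scores consistent with the Golden Rule.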



### References

```{bibliography}
```

In [None]:
#pip install docutils==0.17.1