# Predicting direction of stock price from interest rate and inflation rate


_We utilized logistic regression to analyze the stock price data and provided a predictive model._

Data sources: Yahoo Finance and Federal Reserve Economic Data (FRED)

by Allan Lee, Jianhao Zhang, Yi Yan and Chengyu Tao (DSCI 522 Group 3 Milestone 1)

2023/11/17

In [None]:
import yfinance as yf
import pandas as pd
import altair as alt
alt.data_transformers.enable("vegafusion")

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

## Summary

## Introduction

During the COVID-19 pandemic, central banks around the world lowered interest rates to ease the economic challenges posed by the pandemic. As the pandemic eased, the low interest rates led to excess consumer spending, which pushed the inflation rate to unacceptable levels. To bring inflation back to pre-pandemic levels, the central bank raised the interest rate sharply to its highest level in 15 years. Inflation and interest rates now often make the headlines of financial news, and with more than 50% of American households owning stocks, our team is curious to find out how inflation and interest rates affect stock returns. We ask the question: given inflation rate and interest rate data, can we predict whether we will profit if we invest in a stock market index and hold it for 1 year?

## Methods
### Data
#### Raw
1. We use the Standard & Poor's 500 Index (S&P 500) as a proxy for the stock market. The index tracks the stocks of 500 of the largest companies in the USA. The price of the S&P 500 is obtained from Yahoo Finance.
2. Inflation data is obtained by calculating the change in the consumer price index (CPI). We obtained the United States CPI from the Federal Reserve Economic Data website and computed the yearly inflation rate.
3. We use the federal funds rate as a proxy for the interest rate. It is the target interest rate set by the Federal Reserve for commercial banks to lend and borrow overnight. We obtained the federal funds rate from the Federal Reserve Economic Data website.
#### Derived
We derived the change in inflation rate and the change in interest rate from the data as additional features. News reports often describe inflation and interest rates as increasing or decreasing, so we thought these 2 features might provide additional predictive power for our model.
#### Remarks
##### Resampling
The S&P 500 index, CPI, and interest rate data we obtained have different sampling frequencies. CPI data has the lowest frequency: it is sampled on the first day of every month. We decided to resample all data to the last day of every month so that year-over-year and month-over-month changes are easy to calculate and interpret. Interest rate data is sampled daily and is noisy, so we filter it by taking the monthly median during resampling. The following table summarizes how the data was preprocessed.

| Data          | Original sampling period | Preprocessing procedure                                                                                                          |
|---------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------|
| S&P 500 Index | daily                    | Take the value from the last day of the month. If there is no data for that day, use the value from the closest previous date.   |
| CPI           | first day of every month | Shift the date back by 1 day to the last day of the previous month. We consider the difference of 1 day negligible.              |
| Interest Rate | daily                    | Resample to the last day of the month by taking the median of the daily values in that month to filter out noise.                |
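
The three procedures in the table can be illustrated on small synthetic series. This is a minimal sketch, not the notebook's own code: all series names and values are made up, and months are grouped with `pd.offsets.MonthEnd(0)` instead of a resampling alias so that the example does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical daily series covering January to March 2023 (values 0..89).
daily_index = pd.date_range("2023-01-01", "2023-03-31", freq="D")
daily_s = pd.Series(range(len(daily_index)), index=daily_index, dtype=float)

# Map every date to the last calendar day of its month.
month_end = daily_s.index + pd.offsets.MonthEnd(0)

# S&P 500 style: keep the last available value of each month.
last_of_month_s = daily_s.groupby(month_end).last()

# Interest rate style: keep the median of the daily values of each month.
median_of_month_s = daily_s.groupby(month_end).median()

# CPI style: observations dated on the first day of a month move back one day.
first_of_month_s = pd.Series(
    [100.0, 101.0, 102.0],
    index=pd.to_datetime(["2023-02-01", "2023-03-01", "2023-04-01"]),
)
shifted_s = first_of_month_s.copy()
shifted_s.index = shifted_s.index - pd.Timedelta(days=1)

print(last_of_month_s)
print(median_of_month_s)
print(shifted_s)
```

After these steps all three toy series are indexed by month-end dates, which is what makes a later join on the date index possible.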
 


## Results & Discussion

### Read Data From Web

#### S&P 500
##### Read Raw

In [None]:
gspc_raw_s: pd.Series = (yf
                         .Ticker('^GSPC')
                         .history(start='1950-01-01', end='2023-11-01')
                         .loc[:, 'Close'])
gspc_raw_s.name = 'gspc'
gspc_raw_s.index = pd.DatetimeIndex(gspc_raw_s.index.date)
gspc_raw_s.index.name = 'date'
gspc_raw_s.head()

In [None]:
gspc_raw_s.tail()

##### Resample to last date of month.

In [None]:
gspc_m_s: pd.Series = gspc_raw_s.resample('M').last()
gspc_m_s.head()

In [None]:
gspc_m_s.tail()

##### check no missing dates

In [None]:
assert ((gspc_m_s.index 
        == pd.date_range(start=gspc_m_s.index[0],
                         end=gspc_m_s.index[-1],
                         freq='M')).all())

##### check no missing value

In [None]:
assert not gspc_m_s.isna().any()

##### next year change percentage

In [None]:
gspc_next_year_pct_chg: pd.Series = (gspc_m_s.shift(-12) - gspc_m_s) / gspc_m_s * 100
gspc_next_year_pct_chg.name = 'gspc_next_year_pct_chg'
gspc_next_year_pct_chg.head()

In [None]:
gspc_next_year_pct_chg.tail()

In [None]:
gspc_prev_year_pct_chg: pd.Series = (gspc_m_s - gspc_m_s.shift(12)) / gspc_m_s.shift(12) * 100
gspc_prev_year_pct_chg.name = 'gspc_prev_year_pct_chg'
gspc_prev_year_pct_chg.head()

In [None]:
gspc_prev_year_pct_chg.tail()
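
The relation between the forward-looking and backward-looking 12-month changes can be checked on a toy series. The sketch below uses made-up prices (the name `price_s` and its values are assumptions, not S&P 500 data) and the same `shift` arithmetic as the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end prices for 36 consecutive months.
months = np.arange(36)
price_s = pd.Series(100.0 + 2.0 * months + 5.0 * np.sin(months), name="price")

# Forward-looking change: price 12 months later relative to the current price.
next_year_pct_chg = (price_s.shift(-12) - price_s) / price_s * 100

# Backward-looking change: current price relative to the price 12 months ago.
prev_year_pct_chg = (price_s - price_s.shift(12)) / price_s.shift(12) * 100

# The forward change is undefined for the last 12 months,
# the backward change for the first 12 months.
n_missing_next = int(next_year_pct_chg.isna().sum())
n_missing_prev = int(prev_year_pct_chg.isna().sum())
print(n_missing_next, n_missing_prev)

# The forward change at month t equals the backward change at month t + 12.
same_values = np.allclose(next_year_pct_chg.iloc[:24].to_numpy(),
                          prev_year_pct_chg.iloc[12:].to_numpy())
print(same_values)
```

Because both series contain missing values at opposite ends, any table that requires both columns loses 12 months at the start and 12 months at the end.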

#### CPI
##### read raw

In [None]:
cpi_raw_s: pd.Series = (pd.read_csv('https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPIAUCNS&scale=left&cosd=1913-01-01&coed=2023-09-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2023-11-11&revision_date=2023-11-11&nd=1913-01-01', parse_dates=['DATE']).set_index('DATE').squeeze())
cpi_raw_s.index.name = 'date'
cpi_raw_s.name = 'cpi'
cpi_raw_s.head(10)

##### subtract 1 day to get last day of month

In [None]:
cpi_m_s: pd.Series = cpi_raw_s.copy()
cpi_m_s.index = cpi_m_s.index - pd.Timedelta(days=1)
cpi_m_s.head()

##### check no missing dates

In [None]:
assert ((cpi_m_s.index
         == pd.date_range(start=cpi_m_s.index[0],
                          end=cpi_m_s.index[-1],
                          freq='M')).all())

##### check no missing value

In [None]:
assert not cpi_m_s.isna().any()

##### calculate yearly Inflation Rate

In [None]:
inflation_rate_m_s: pd.Series = (cpi_m_s - cpi_m_s.shift(12)) / cpi_m_s.shift(12) * 100
inflation_rate_m_s.name = 'inflation_rate_pct'
inflation_rate_m_s.head()

In [None]:
inflation_rate_m_s.tail()

##### previous year change for inflation

In [None]:
inflation_rate_chg_m_s: pd.Series = (inflation_rate_m_s 
                                     - inflation_rate_m_s.shift(12))
inflation_rate_chg_m_s.name = 'inflation_rate_pct_chg'
inflation_rate_chg_m_s.head()

In [None]:
inflation_rate_chg_m_s.tail()
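
Note that the inflation rate is a relative change of the CPI, while its 12-month change is a plain difference of two rates, that is, a change in percentage points. A minimal sketch with a made-up CPI series (the values are assumptions chosen for easy arithmetic):

```python
import pandas as pd

# Hypothetical CPI: flat at 100 in year 1, 102 in year 2, 105.06 in year 3.
cpi_s = pd.Series([100.0] * 12 + [102.0] * 12 + [105.06] * 12)

# Year-over-year inflation rate in percent (relative change of the CPI).
inflation_s = (cpi_s - cpi_s.shift(12)) / cpi_s.shift(12) * 100

# Change of the inflation rate over 12 months, in percentage points.
inflation_chg_s = inflation_s - inflation_s.shift(12)

# Roughly 2 % in year 2, 3 % in year 3, so the change is about 1 percentage point.
print(inflation_s.iloc[12], inflation_s.iloc[24], inflation_chg_s.iloc[24])
```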

#### Interest Rate
##### read raw

In [None]:
interest_rate_raw_s: pd.Series = (pd.read_csv('https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=DFF&scale=left&cosd=1954-07-01&coed=2023-11-08&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Daily%2C%207-Day&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2023-11-11&revision_date=2023-11-11&nd=1954-07-01', parse_dates=['DATE'])
                                  .set_index('DATE')
                                  .squeeze())
interest_rate_raw_s.index.name = 'date'
interest_rate_raw_s.name = 'interest_rate_pct'
interest_rate_raw_s.head()

##### resample to last day of month and take median of month

In [None]:
interest_rate_m_s: pd.Series = interest_rate_raw_s.resample('M').median()
interest_rate_m_s.head()

##### check no missing dates

In [None]:
assert ((interest_rate_m_s.index
         == pd.date_range(start=interest_rate_m_s.index[0],
                          end=interest_rate_m_s.index[-1],
                          freq='M')).all())

##### check no missing value

In [None]:
assert not interest_rate_m_s.isna().any()

##### change in interest for the past 12 months

In [None]:
interest_rate_chg_m_s: pd.Series = interest_rate_m_s - interest_rate_m_s.shift(12)
interest_rate_chg_m_s.name = 'interest_rate_pct_chg'
interest_rate_chg_m_s.head()

In [None]:
interest_rate_chg_m_s.tail()

### Data merging

Here we combine all data sourced from web into a single data frame.

#### columns
| column name            | description                                                                                |
|------------------------|--------------------------------------------------------------------------------------------|
| gspc                   | price of S&P 500 stock index (will be ignored for model)                                   |
| inflation_rate_pct     | 1 year inflation rate (12 months ago to now) (will be a feature for model)                 |
| interest_rate_pct      | interest rate (will be a feature for model)                                                |                                           
| inflation_rate_pct_chg | change of inflation between now and 12 months ago (will be a feature for model)            |       
| interest_rate_pct_chg  | change of interest rate between now and 12 months ago (will be a feature for model)        |   
| gspc_prev_year_pct_chg | change of gspc between now and 12 months ago (will be a feature for model)                 |
| gspc_next_year_pct_chg | change of gspc between now and 12 months later (will be used to get target)                | 
| target                 | whether gspc increased 12 months later compared to now (will be target for classification) |                      
 

In [None]:
data_df: pd.DataFrame = pd.concat([gspc_m_s,
                                   inflation_rate_m_s,
                                   interest_rate_m_s,
                                   inflation_rate_chg_m_s,
                                   interest_rate_chg_m_s,
                                   gspc_prev_year_pct_chg,
                                   gspc_next_year_pct_chg],
                                  axis=1,
                                  join='inner')
data_df.dropna(axis=0, inplace=True)
data_df['target'] = data_df['gspc_next_year_pct_chg'] > 0
data_df.index.name = 'date'
data_df.head()
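
The effect of `join='inner'` followed by `dropna` can be seen on two small series. This is a sketch with hypothetical names and values (`a_s`, `b_s`, `toy_df`), not the project data.

```python
import pandas as pd

# Hypothetical monthly series that cover different date ranges.
a_s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30"]),
    name="feature",
)
b_s = pd.Series(
    [float("nan"), 5.0, -2.0],
    index=pd.to_datetime(["2020-02-29", "2020-03-31", "2020-04-30"]),
    name="next_year_pct_chg",
)

# join='inner' keeps only the dates that are present in every series.
toy_df = pd.concat([a_s, b_s], axis=1, join="inner")

# Rows with a missing value (for example produced by shift) are dropped.
toy_df = toy_df.dropna(axis=0)

# The boolean target is True when the forward change is strictly positive.
toy_df["target"] = toy_df["next_year_pct_chg"] > 0
print(toy_df)
```

Creating the target only after `dropna` matters: a comparison such as `NaN > 0` evaluates to `False`, so computing it earlier would label months with an unknown outcome as losses.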

In [None]:
(alt
 .Chart(data_df)
 .mark_line()
 .encode(x=alt.X('date', type='temporal'),
         y=alt.Y(alt.repeat('row'), type='quantitative'))
 .properties(width=1000, height=250)
 .repeat(row=['gspc', 'inflation_rate_pct', 'interest_rate_pct', 'inflation_rate_pct_chg',
              'interest_rate_pct_chg', 'gspc_prev_year_pct_chg', 'gspc_next_year_pct_chg',
              'target']))

# EDA

In [None]:
df = pd.read_csv("../Data/Processed/data.csv", index_col=0)

In [None]:
# Column data types 
df.info()

- There is no NA or missing data in df.
- Each row records the observation for one month.
- The time series runs from 1955-07 to 2022-10 and contains 808 observations.
- The target is True when the S&P 500 index is higher 12 months later, and False otherwise.
- We will use five columns 'inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg' as features, and the target as the response in this binary classification problem.

# Split data and EDA

In [None]:
# split data into training and test
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df

In [None]:
# statistical summary for dataframe
train_df.describe()

In [None]:
train_df.columns

In [None]:

features = ['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']

for feat in features:
    train_df.groupby("target")[feat].plot.hist(bins=50, alpha=0.5, legend=True, density=True, title="Histogram of " + feat)
    plt.xlabel(feat)
    plt.show()

In [None]:
# finding potential correlation between numeric columns
num_col = train_df.select_dtypes(include=['float64']).columns.tolist()

train_df[num_col].corr('spearman').style.background_gradient()

- Spearman's rank correlation coefficients revealed some potential correlation between the following pairs of columns:

**interest_rate_pct vs inflation_rate_pct** 

**interest_rate_pct_chg vs inflation_rate_pct_chg**

**inflation_rate_pct_chg vs inflation_rate_pct**

**interest_rate_pct_chg vs interest_rate_pct**
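
Spearman's coefficient measures monotonic rather than linear association, which is why it can flag pairs that a Pearson coefficient understates. A small sketch on made-up data (the columns `x` and `y` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x.
x = np.arange(1, 11, dtype=float)
toy_df = pd.DataFrame({"x": x, "y": x ** 3})

# Rank-based correlation is 1 for any strictly increasing relationship.
spearman = toy_df.corr(method="spearman").loc["x", "y"]

# Linear correlation is below 1 because the relationship is curved.
pearson = toy_df.corr(method="pearson").loc["x", "y"]

print(f"Spearman: {spearman:.3f}, Pearson: {pearson:.3f}")
```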

In [None]:
plt.figure(figsize=(8, 8))

plt.scatter(train_df['interest_rate_pct'], train_df['inflation_rate_pct'], s=20, c='blue', alpha=0.7)

plt.title('Scatter Plot of Interest Rate vs Inflation Rate')
plt.xlabel('Interest Rate (%)')
plt.ylabel('Inflation Rate (%)')

plt.show()

In [None]:
plt.figure(figsize=(8, 8))

plt.scatter(train_df['interest_rate_pct_chg'], train_df['inflation_rate_pct_chg'], s=20, c='green', alpha=0.7)

plt.title('Scatter Plot of Interest Rate Change vs Inflation Rate Change')
plt.xlabel('Interest Rate Change (%)')
plt.ylabel('Inflation Rate Change (%)')

plt.show()

In [None]:
plt.figure(figsize=(8, 8))

plt.scatter(train_df['inflation_rate_pct_chg'], train_df['inflation_rate_pct'], s=20, c='red', alpha=0.7)

plt.title('Scatter Plot of Inflation Rate Change vs Inflation Rate')
plt.xlabel('Inflation Rate Change (%)')
plt.ylabel('Inflation Rate (%)')

plt.show()

In [None]:
plt.figure(figsize=(8, 8))

plt.scatter(train_df['interest_rate_pct_chg'], train_df['interest_rate_pct'], s=20, c='blue', alpha=0.7)

plt.title('Scatter Plot of Interest Rate Change vs Interest Rate')
plt.xlabel('Interest Rate Change (%)')
plt.ylabel('Interest Rate (%)')

plt.show()

- We examined the data type of every column.
- We plotted the distribution of all numeric columns and investigated possible correlations between them.
- We divided the dataframe into training and testing datasets with an 80:20 ratio.
- Based on the histograms, the five numerical columns 'inflation_rate_pct', 'interest_rate_pct', 'inflation_rate_pct_chg', 'interest_rate_pct_chg', and 'gspc_prev_year_pct_chg' appear helpful in separating the target.

# Model

In [None]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df

In [None]:
# Separate the target from the features in the train and test sets
X_train = train_df[['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']]
y_train = train_df["target"]

X_test = test_df[['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']]
y_test = test_df["target"]

#### Data
The dataset comprises records for 808 months. Each row holds the predictors for the corresponding month and indicates, with the values True or False, whether the S&P 500 index increased or decreased over the following 12 months.


#### Preprocessing Data

##### Numeric features:
- 'inflation_rate_pct'
- 'interest_rate_pct'
- 'inflation_rate_pct_chg'
- 'interest_rate_pct_chg'
- 'gspc_prev_year_pct_chg'

Since there are no missing values, imputation is not necessary. We apply a StandardScaler to all numeric features.
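
As a reminder of what the scaler does, the sketch below standardizes a tiny made-up array (the names `X_train_toy` and `X_test_toy` are hypothetical). The mean and standard deviation are learned from the training rows only and then reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[2.0], [5.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train_toy)

# transform reuses the training statistics, so no test information leaks in.
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

Placing the scaler inside a pipeline gives the same behavior automatically whenever the pipeline is fitted.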

In [None]:
numerical_features = ['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']

# Create the column transformer
preprocessor = make_column_transformer(
    (StandardScaler(), numerical_features),
)

#### Model Selection
##### Logistic Regression
Our focus is on identifying whether there is an increase in the S&P 500 index, making it a classification problem. To tackle this, we utilize Logistic Regression.
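
A scaler plus logistic regression pipeline can also be evaluated with cross-validation on the training data before touching the test set. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in (not the stock data), and the names `X_toy`, `y_toy` and `toy_pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 5 numeric features and a binary target.
X_toy, y_toy = make_classification(
    n_samples=400,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=123,
)

# Scaling happens inside the pipeline, so it is refitted on each training fold.
toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation with accuracy as the score.
cv_result = cross_validate(
    toy_pipe, X_toy, y_toy, cv=5, scoring="accuracy", return_train_score=True
)

mean_train = cv_result["train_score"].mean()
mean_valid = cv_result["test_score"].mean()
print(f"mean train accuracy: {mean_train:.3f}")
print(f"mean validation accuracy: {mean_valid:.3f}")
```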

In [None]:
pipe = make_pipeline(preprocessor, LogisticRegression())

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)

On the test data, logistic regression yields an accuracy of 75.3%. Accuracy is the ratio of correct predictions to all predictions. This metric must be interpreted with caution, particularly under class imbalance, where always predicting the majority class already scores well.
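
Why accuracy alone can mislead under class imbalance is easy to see on a toy case. In the sketch below the labels are made up: 3 of 4 months are "up" and a hypothetical model always predicts "up".

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: True means the index went up.
y_true = [True, True, True, False]
y_pred = [True, True, True, True]

# Accuracy looks decent because most months are "up".
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, ordered [False, True].
cm = confusion_matrix(y_true, y_pred, labels=[False, True])

# Recall for the "down" class: share of down months that were detected.
recall_down = recall_score(y_true, y_pred, pos_label=False)

print(acc)
print(cm)
print(recall_down)
```

The model reaches 75% accuracy while never identifying a down month, which the confusion matrix and the per-class recall make visible.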

##### Dummy Classifier

In [None]:
from sklearn.dummy import DummyClassifier

dc = DummyClassifier()
dc.fit(X_train, y_train)

In [None]:
dc.score(X_test, y_test)

On the test data, the dummy classifier yields an accuracy of 75.9%, which is better than the accuracy of logistic regression.
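
The baseline score has a simple interpretation. In recent scikit-learn releases the default `DummyClassifier` strategy is `"prior"`, which always predicts the most frequent class seen during training, so its accuracy equals the share of that class in the evaluated data. A minimal sketch with made-up labels (all names and values are hypothetical; the strategy is passed explicitly so the behavior does not depend on the installed version):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: the features are ignored by the dummy classifier.
X_train_toy = np.zeros((8, 1))
y_train_toy = np.array([True] * 6 + [False] * 2)
X_test_toy = np.zeros((4, 1))
y_test_toy = np.array([True, True, True, False])

baseline = DummyClassifier(strategy="prior")
baseline.fit(X_train_toy, y_train_toy)

# Every prediction is the majority class of the training labels.
baseline_pred = baseline.predict(X_test_toy)

# The accuracy therefore equals the share of that class in the test labels.
baseline_acc = baseline.score(X_test_toy, y_test_toy)
majority_share = y_test_toy.mean()

print(baseline_pred, baseline_acc, majority_share)
```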

#### Conclusion
In the preceding analysis, we used `LogisticRegression` and a `DummyClassifier` baseline. Logistic regression does not outperform the baseline, so the result is not promising. Further data preprocessing is needed to enhance the overall effectiveness of the model.