## Predicting direction of stock price from interest rate and inflation rate


_We utilized logistic regression to analyze the stock price data and provided a predictive model._

by Allan Lee, Andy Zhang, Yi Yan and Chengyu Tao (DSCI 522 Group 3 Milestone 4)

2023/12/09

In [1]:
import pandas as pd
from myst_nb import glue

In [2]:
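# Load the precision scores of both models and expose them to the report text
result_df = pd.read_csv("../results/tables/mdl_result.csv")[['dummy_model_precision_score (%)', 'logistic_regression_precision_score (%)']]
glue("result_df", result_df, display=False)

# First five rows of the processed data, shown in the Methods section
data_head = pd.read_csv("../Data/Processed/processed_data.csv").head(5)
glue("data_head", data_head, display=False)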

In [3]:
# Expose each model's test precision separately so it can be quoted inline
dummy_mean_test_score = result_df[['dummy_model_precision_score (%)']]
glue("dummy_mean_test_score", dummy_mean_test_score, display=False)
regression_mean_test_score = result_df[['logistic_regression_precision_score (%)']]
glue("regression_mean_test_score", regression_mean_test_score, display=False)

## Summary


In our exploration of the impact of inflation and interest rates on stock returns, we used logistic regression to build a predictive model. The model was trained on S&P 500, CPI, and federal funds rate data. Because we want every positive prediction to be as reliable as possible, we evaluate the model with precision, the share of true positives among all instances predicted as positive. On the test data, the model did not outperform a dummy baseline on this metric (see the Discussion).

Using Python and packages such as NumPy, pandas, Altair, and scikit-learn, we preprocessed the data to reconcile the different sampling frequencies of the S&P 500 index, CPI, and interest rate data. Resampling to the last day of every month and filtering noise with monthly medians were crucial steps in preparing the data for analysis.

## Introduction

During the COVID-19 pandemic, central banks around the world lowered interest rates to mitigate economic challenges. However, this led to heightened inflation due to increased consumer spending. In an effort to regain control and return to pre-pandemic levels, central banks then sharply raised interest rates to their highest point in 15 years. With over 50% of American households owning stocks {cite}`Caporal2023StockOwnership`, we are keen to explore the impact of inflation and interest rates on stock returns. The pivotal question is: can we predict whether an investment in a stock market index will be profitable over a one-year period, based on inflation and interest rate data? Because losing investments can have both a financial and an emotional impact, including related mental health issues, we prioritize precision to minimize setbacks.



## Methods 
### Data
1. We use the Standard & Poor's 500 Index (S&P 500) as a proxy for the stock market. The index tracks the stocks of 500 of the largest companies in the USA. The price of the S&P 500 is obtained from Yahoo Finance {cite}`YahooFinanceSP500`.
2. Inflation data is obtained by calculating the change in the consumer price index (CPI) {cite}`YahooFinanceSP500`. We obtained the United States CPI from the Federal Reserve Economic Data website and computed the yearly inflation rate.
3. We use the federal funds rate as a proxy for the interest rate. It is the target interest rate set by the Federal Reserve for commercial banks to lend and borrow overnight. We obtained the federal funds rate from the Federal Reserve Economic Data website.

We derived the change in inflation rate and the change in interest rate from this data as additional features. We often hear in the news that inflation and interest rates are increasing or decreasing, so we thought these two features might provide additional predictive power for our model.

### Analysis

The Python programming language {cite}`van1995python` and the following Python packages were used to perform the analysis: NumPy {cite}`harris2020array`, pandas {cite}`mckinney-proc-scipy-2010`, Altair {cite}`vanderplas2018altair`, and scikit-learn {cite}`pedregosa2011scikit`. The code used to perform the EDA and create the report can be found here: https://github.com/UBC-MDS/dsci_522_group_3/tree/main.

### Remarks
The S&P 500 index, CPI, and interest rate data we obtained have different sampling frequencies. The CPI data has the lowest frequency: it is sampled on the first day of every month. We decided to resample all data to the last day of every month so that year-over-year and month-over-month changes are easy to calculate and interpret. The interest rate data was sampled daily and is noisy, so we filtered it by taking the monthly median during resampling. The following table summarizes how the data was preprocessed.


| Data | Original sampling period | Preprocessing procedure |
| -------- | -------- | -------- |
| S&P 500 Index | daily | Take the value from the last day of the month. If there is no data for the last day of the month, use the data from the closest previous date |
| CPI | first day of every month | Shift the date back by one day, to the last day of the previous month. We consider the difference in value over one day negligible |
| Interest Rate | daily | Resample to the last day of the month by taking the median of the daily values in that month, to filter out noise |
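
The preprocessing steps in the table can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the function names are hypothetical, each input is assumed to be a `pandas.Series` with a `DatetimeIndex`, months are represented as monthly periods rather than month-end dates, and the toy values are made up.

```python
import pandas as pd


def month_end_value(daily: pd.Series) -> pd.Series:
    """Last available observation in each calendar month (S&P 500 rule)."""
    return daily.groupby(daily.index.to_period("M")).last()


def monthly_median(daily: pd.Series) -> pd.Series:
    """Median of the daily values in each calendar month (interest rate rule)."""
    return daily.groupby(daily.index.to_period("M")).median()


def shift_cpi_to_previous_month(cpi: pd.Series) -> pd.Series:
    """Move first-of-month CPI readings back one day, into the previous month."""
    shifted = cpi.copy()
    shifted.index = (cpi.index - pd.Timedelta(days=1)).to_period("M")
    return shifted


# Toy data (made-up values, for illustration only)
days = pd.to_datetime(["2020-01-29", "2020-01-30", "2020-01-31", "2020-02-28"])
prices = pd.Series([100.0, 101.0, 102.0, 103.0], index=days)
rates = pd.Series([1.0, 3.0, 2.0, 4.0], index=days)
cpi = pd.Series([258.0], index=pd.to_datetime(["2020-02-01"]))

monthly_prices = month_end_value(prices)
monthly_rates = monthly_median(rates)
monthly_cpi = shift_cpi_to_previous_month(cpi)
```

Because `last()` picks the final observation that exists within each month, a month whose last calendar day has no quote falls back to the closest previous trading day, which matches the S&P 500 rule in the table.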


### Read Data From Web

The description of our data is shown below.

#### Columns
| column name            | description                                                                                |
|------------------------|--------------------------------------------------------------------------------------------|
| gspc                   | price of S&P 500 stock index (will be ignored for model)                                   |
| inflation_rate_pct     | 1 year inflation rate (12 months ago to now) (will be a feature for model)                 |
| interest_rate_pct      | interest rate (will be a feature for model)                                                |
| inflation_rate_pct_chg | change of inflation between now and 12 months ago (will be a feature for model)            |
| interest_rate_pct_chg  | change of interest rate between now and 12 months ago (will be a feature for model)        |
| gspc_prev_year_pct_chg | change of gspc between now and 12 months ago (will be a feature for model)                 |
| gspc_next_year_pct_chg | change of gspc between now and 12 months later (will be used to get target)                |
| target                 | whether gspc increased 12 months later compared to now (will be target for classification) |
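
As a rough sketch of how these columns could be derived from monthly data, consider the following. Several details are assumptions rather than facts from the project code: the function name `build_features` and the input column name `cpi` are hypothetical, "change" is taken to mean the simple difference from the value 12 months earlier, and the toy series are made up.

```python
import pandas as pd


def build_features(monthly: pd.DataFrame) -> pd.DataFrame:
    """Derive the model columns from monthly `gspc`, `cpi`, `interest_rate_pct`."""
    out = pd.DataFrame(index=monthly.index)
    out["gspc"] = monthly["gspc"]
    # Yearly inflation: 12-month percentage change of CPI
    out["inflation_rate_pct"] = (monthly["cpi"] / monthly["cpi"].shift(12) - 1) * 100
    out["interest_rate_pct"] = monthly["interest_rate_pct"]
    # Assumption: "change" is the difference from the value 12 months ago
    out["inflation_rate_pct_chg"] = out["inflation_rate_pct"].diff(12)
    out["interest_rate_pct_chg"] = out["interest_rate_pct"].diff(12)
    out["gspc_prev_year_pct_chg"] = (monthly["gspc"] / monthly["gspc"].shift(12) - 1) * 100
    # The change over the next 12 months is the trailing change seen 12 months later
    out["gspc_next_year_pct_chg"] = out["gspc_prev_year_pct_chg"].shift(-12)
    out["target"] = out["gspc_next_year_pct_chg"] > 0
    # Drop months that lack a full look-back or look-ahead window
    return out.dropna()


# Toy monthly data (made-up values): a steadily rising index and CPI
months = pd.period_range("2000-01", periods=40, freq="M")
toy = pd.DataFrame(
    {
        "gspc": [100 * 1.01**i for i in range(40)],
        "cpi": [100.0 + i for i in range(40)],
        "interest_rate_pct": 2.0,
    },
    index=months,
)
features = build_features(toy)
```

Note that the derived features need up to 24 months of history and the target needs 12 months of future data, so the first and last months of the raw series do not appear in the result.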
 

```{glue:figure} data_head
---
width: 200px
name: "data_head"
---
Present the head of the processed data.
```

## Exploratory Data Analysis

Our preliminary analysis shows that:
- There are no NA or missing values in the dataset.
- Each row records the observation for one month.
- The time series runs from 1955-07 to 2022-10 and contains 808 observations.
- The target is `True` when the stock price went up and `False` when it went down.
- The five columns `inflation_rate_pct`, `interest_rate_pct`, `inflation_rate_pct_chg`, `interest_rate_pct_chg`, and `gspc_prev_year_pct_chg` are treated as features, and the target is treated as the response in this binary classification problem.

### Time Series Plots of Features

Visualizing the time series may reveal patterns in the data, so we plot the features over time.

```{figure} ../results/figures/time.png
---
width: 600px
name: time series
---
Time Series Plots of Features
```

### Histograms of Features
We plot histograms to see the range and mode of the features of interest.

```{figure} ../results/figures/hist.png
---
width: 600px
name: histogram
---
Histograms of Features
```

### Scatterplots of Feature Pairs

Spearman's rank correlation may reveal potential relationships between features, so we draw scatterplots of the following feature pairs:

`interest_rate_pct vs inflation_rate_pct`

`interest_rate_pct_chg vs inflation_rate_pct_chg`

`inflation_rate_pct_chg vs inflation_rate_pct`

`interest_rate_pct_chg vs interest_rate_pct`
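
For reference, Spearman's rank correlation for one such pair can be computed as in the sketch below. It uses `scipy.stats.spearmanr` on synthetic data generated only for illustration, so the resulting coefficient says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Synthetic stand-in for two of the feature columns (not the real data)
rng = np.random.default_rng(0)
n = 200
inflation = rng.normal(3.0, 1.0, n)
toy = pd.DataFrame(
    {
        "inflation_rate_pct": inflation,
        "interest_rate_pct": 1.5 * inflation + rng.normal(0.0, 0.5, n),
    }
)

# rho is the rank correlation coefficient in [-1, 1]; p_value tests rho == 0
rho, p_value = spearmanr(toy["interest_rate_pct"], toy["inflation_rate_pct"])
```

Spearman's coefficient only depends on the ranks of the values, so it captures monotonic relationships even when they are not linear.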

```{figure} ../results/figures/scat.png
---
width: 600px
name: scatter plot
---
Scatter plots of a few pairs of variables
```

- Examine the data type for every column.
- Illustrate the distribution of all numeric columns and investigate possible correlations between them.
- Divide the dataframe into training and testing datasets with an 80:20 ratio.
- Based on the histograms of all columns, the five numerical columns 'inflation_rate_pct', 'interest_rate_pct', 'inflation_rate_pct_chg', 'interest_rate_pct_chg', and 'gspc_prev_year_pct_chg' are helpful in separating the target.
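
The 80:20 split mentioned above could look like the following sketch. The data here is synthetic, and both the random shuffling and the `random_state` value are assumptions; for time-ordered data, passing `shuffle=False` to keep the most recent months as the test set is a common alternative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = [
    "inflation_rate_pct",
    "interest_rate_pct",
    "inflation_rate_pct_chg",
    "interest_rate_pct_chg",
    "gspc_prev_year_pct_chg",
]

# Synthetic stand-in for the processed data (not the real values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 5)), columns=FEATURES)
toy["target"] = rng.random(100) > 0.3

# Hold out 20% of the rows for testing; the seed is arbitrary
X_train, X_test, y_train, y_test = train_test_split(
    toy[FEATURES], toy["target"], test_size=0.2, random_state=123
)
```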

## Model

### Preprocessing Data
The dataset comprises records for 808 months. Each row contains the predictors for the corresponding month and indicates whether the S&P 500 index increased or decreased over the following 12 months, denoted by `True` or `False`.

The numeric features are `inflation_rate_pct`, `interest_rate_pct`, `inflation_rate_pct_chg`, `interest_rate_pct_chg`, and `gspc_prev_year_pct_chg`. Since there are no missing values, imputation is not necessary. As the features are exclusively numeric, we use a `StandardScaler` to standardize the feature matrix.
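
To illustrate what the scaler does, the sketch below applies scikit-learn's `StandardScaler` to a small made-up matrix. Each column is shifted to zero mean and rescaled to unit variance, so features measured on very different scales become comparable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then rescales
X_scaled = scaler.fit_transform(X)
```

In a train/test workflow, the scaler would be fitted on the training data only and then applied to the test data with `scaler.transform`, so that no information from the test set leaks into training.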


### Logistic Regression

Predicting exact stock prices is challenging, so we simplify the task into a binary classification problem: predicting whether the S&P 500 index will increase or decrease. To tackle this, we use logistic regression.

```{glue:figure} regression_mean_test_score
---
width: 400px
name: "regression_mean_test_score"
---
Present results of logistic regression model.
```

On the test data, logistic regression yields a precision of {glue:text}`regression_mean_test_score`%. Precision is calculated as the number of true positives divided by the sum of true positives and false positives. Nevertheless, caution is necessary when interpreting this metric: it is only meaningful relative to a baseline such as the dummy model.
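
The precision formula can be checked on a small example. The labels below are made up for illustration (1 means the index went up); the sketch computes precision by hand and compares it with scikit-learn's `precision_score`.

```python
from sklearn.metrics import precision_score

# Made-up labels: 1 = index went up, 0 = index went down
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Count true positives and false positives among the positive predictions
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_positives = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

manual_precision = true_positives / (true_positives + false_positives)
sklearn_precision = precision_score(y_true, y_pred)
```

Here two of the four positive predictions are correct, so both calculations give a precision of 0.5. The missed positive (the fourth observation) does not affect precision; it would only lower recall.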

### Dummy Model

```{glue:figure} dummy_mean_test_score
---
width: 400px
name: "dummy_mean_test_score"
---
Present results of dummy model.
```

## Discussion

To conclude this report, we thank our four reviewers for their valuable input and address the most significant questions they raised in this section.

### Possible Reasons for Worse Performance of the Logistic Regression Model over Dummy Model
The quality of the data, including missing values, outliers, or noise, can significantly impact the performance of any predictive model. If the data is noisy or contains irrelevant information, the model's accuracy may suffer. Stock price movements are influenced by a wide range of factors, and if the model does not have access to relevant information, it may struggle to make accurate predictions.

### Motivation of Picking the Logistic Regression Model
Logistic regression is designed for binary classification problems, where the outcome variable is dichotomous, meaning it has two possible classes (e.g., 0 or 1, true or false, positive or negative). Each coefficient in logistic regression represents the change in the log-odds of the positive class for a one-unit increase in the corresponding feature, which makes it easy to interpret the impact of each feature on the likelihood of a particular outcome. This interpretability is valuable for understanding the relationships between variables.
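
For concreteness, the standard logistic regression model can be written as follows. The notation is ours and does not come from the project code: $p$ is the predicted probability that the target is `True`, $x_1, \dots, x_5$ are the five standardized features, and $\beta_0, \dots, \beta_5$ are the fitted intercept and coefficients.

```{math}
\log\frac{p}{1 - p} = \beta_0 + \sum_{j=1}^{5} \beta_j x_j
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + \exp\left(-\beta_0 - \sum_{j=1}^{5} \beta_j x_j\right)}
```

Increasing $x_j$ by one unit (one standard deviation, since the features are standardized) adds $\beta_j$ to the log-odds, which multiplies the odds $p / (1 - p)$ by $e^{\beta_j}$.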


### Overall
The model training follows the Golden Rule: "despite our utmost concern for test error, it is crucial to emphasize that the training phase must remain completely unaffected by the test data." In the preceding analysis, we fit both a logistic regression model and a dummy model. On the test data, the dummy model yields a precision of {glue:text}`dummy_mean_test_score`%, which is better than the precision of logistic regression, so the results show no advantage of logistic regression over the dummy model (see the comparison table at the end of this section). This is in line with expectations, because we do not have enough meaningful features to extract the information needed for a more accurate classification. We also have imbalanced classes, which can make it harder for the model to determine whether the index will grow. Further data preprocessing is needed to improve the overall effectiveness of the model.

The model's performance may be sensitive to hyperparameter settings, so experimenting with different configurations through hyperparameter tuning may improve it. In the future, we may also consult specialists in finance to gather more key information about the topic. This would allow more features to be added to the data and could improve model training.
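
The sketch below shows how such a comparison could be set up, and why a dummy baseline is hard to beat on precision when classes are imbalanced. Everything in it is an assumption for illustration: the data is synthetic with a target unrelated to the features, and `strategy="most_frequent"` is our choice, since the report does not state which dummy strategy was used.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data: about 75% positives, unrelated to the features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (rng.random(300) < 0.75).astype(int)

# Fit on the first 240 rows only; the last 60 rows are held out for testing
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
}

precision_pct = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    precision_pct[name] = 100 * precision_score(y_test, predictions, zero_division=0)
```

A dummy model that always predicts the majority class has a precision equal to the share of positives in the test set, so with few informative features a fitted model has little room to do better.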
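
One possible way to run such a tuning experiment is a cross-validated grid search over the regularization strength `C`, scored by precision. This is only a sketch on synthetic data: the grid values and the number of folds are arbitrary choices, not settings taken from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (not the real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = ((X[:, 0] + rng.normal(size=200)) > 0).astype(int)

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training data, selecting by precision
search = GridSearchCV(pipeline, param_grid, scoring="precision", cv=5)
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
best_cv_precision = search.best_score_
```

Running the search on the training data only keeps the test set untouched, in line with the Golden Rule.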


```{glue:figure} result_df
---
width: 400px
name: "result_df"
---
Comparison of Dummy model and Logistic Regression Score on Test Data
```

## References

```{bibliography}
```