# An Application of Logistic Regression in Financial Data
_We use logistic regression to analyze stock price data and build a predictive model._

Data source: Yahoo Finance

by Allan Lee, Jianhao Zhang, Yi Yan and Chengyu Tao (DSCI 522 Group 3 Milestone 1)

2023/11/17

In [1]:
#Imports
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

#### Load and split the data

In [2]:
#import data and split into train and test
df = pd.read_csv("Data/Processed/data.csv", index_col=0)

train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df

| date | gspc | inflation_rate_pct | interest_rate_pct | inflation_rate_pct_chg | interest_rate_pct_chg | gspc_prev_year_pct_chg | gspc_next_year_pct_chg | target |
|---|---|---|---|---|---|---|---|---|
| 2013-04-30 | 1597.569946 | 1.361965 | 0.150 | -0.342289 | 0.000 | 14.282744 | 17.925976 | True |
| 1982-12-31 | 140.639999 | 3.711559 | 8.830 | -4.679246 | -3.470 | 14.761319 | 17.271042 | True |
| 1968-01-31 | 92.239998 | 3.951368 | 4.630 | 1.138868 | -0.370 | 6.500401 | 11.676067 | True |
| 2005-04-30 | 1156.849976 | 2.802750 | 2.765 | -0.249021 | 1.765 | 4.474842 | 13.291266 | True |
| 1986-02-28 | 226.919998 | 2.255639 | 7.830 | -1.448065 | -0.695 | 25.245616 | 25.242383 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1963-09-30 | 71.699997 | 1.315789 | 3.500 | -0.017544 | 0.500 | 27.421355 | 17.405863 | True |
| 1982-05-31 | 111.879997 | 7.064018 | 14.590 | -2.488582 | -3.900 | -15.619579 | 45.146589 | True |
| 1987-05-31 | 290.100006 | 3.652968 | 6.750 | 1.887169 | -0.090 | 17.283202 | -9.631162 | False |
| 1985-12-31 | 211.279999 | 3.886256 | 7.990 | 0.353381 | -0.520 | 26.333408 | 14.620409 | True |


In [3]:
#Separate the target from the features in the train and test sets
X_train = train_df[['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']]
y_train = train_df["target"]

X_test = test_df[['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']]
y_test = test_df["target"]

In [4]:
X_train.head(5)

| date | inflation_rate_pct | interest_rate_pct | inflation_rate_pct_chg | interest_rate_pct_chg | gspc_prev_year_pct_chg |
|---|---|---|---|---|---|
| 2013-04-30 | 1.361965 | 0.15 | -0.342289 | 0.0 | 14.282744 |
| 1982-12-31 | 3.711559 | 8.83 | -4.679246 | -3.47 | 14.761319 |
| 1968-01-31 | 3.951368 | 4.63 | 1.138868 | -0.37 | 6.500401 |
| 2005-04-30 | 2.80275 | 2.765 | -0.249021 | 1.765 | 4.474842 |
| 1986-02-28 | 2.255639 | 7.83 | -1.448065 | -0.695 | 25.245616 |


In [5]:
y_train.head(5)

date
2013-04-30    True
1982-12-31    True
1968-01-31    True
2005-04-30    True
1986-02-28    True
Name: target, dtype: bool

#### Data
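The dataset contains 808 monthly records. Each row holds the predictor values for that month (inflation rate, interest rate, their changes, and the S&P 500's percentage change over the previous year) together with a Boolean `target` that indicates whether the S&P 500 index increased (True) or decreased (False). In the rows shown above, `target` matches the sign of `gspc_next_year_pct_chg`.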
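Because the target is Boolean, it helps to check how the two classes are balanced before modelling. The following is a minimal sketch on a small made-up series; `toy_target` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `target` column (illustrative values only)
toy_target = pd.Series(
    [True, True, True, False, True, False, True, True], name="target"
)

# Count and share of each class; shares far from 0.5 signal class imbalance
print(toy_target.value_counts())
print(toy_target.value_counts(normalize=True))

# The mean of a Boolean series is the share of True values
share_up = toy_target.mean()
print(f"Share of True: {share_up:.2f}")
```

In the notebook itself, the same calls could be applied to `y_train`.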


#### Preprocessing Data

##### Numeric features:
- 'inflation_rate_pct'
- 'interest_rate_pct'
- 'inflation_rate_pct_chg'
- 'interest_rate_pct_chg'
- 'gspc_prev_year_pct_chg'

Since there are no missing values, imputation is not necessary. All features are numeric, so we standardize them with a `StandardScaler`.
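The absence of missing values can be confirmed with a per-column count. The sketch below uses a hypothetical two-column frame (`toy_df`) that deliberately contains one missing value, only to show what the check reports; it does not reproduce the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame reusing two of the notebook's column names (values are illustrative)
toy_df = pd.DataFrame({
    "inflation_rate_pct": [1.4, 3.7, np.nan],
    "interest_rate_pct": [0.15, 8.83, 4.63],
})

# Number of missing values per column; all zeros would mean no imputation is required
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# True if at least one column contains a missing value
needs_imputation = bool(missing_per_column.any())
print(needs_imputation)
```

Running `train_df.isna().sum()` in the notebook should return zeros for every column if the statement above holds.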

In [6]:
numerical_features = ['inflation_rate_pct', 'interest_rate_pct',
       'inflation_rate_pct_chg', 'interest_rate_pct_chg',
       'gspc_prev_year_pct_chg']

#Create Column Transformer 
preprocessor = make_column_transformer(    
    (StandardScaler(), numerical_features),  
)
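To make the scaling step concrete, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std computed by hand. The input values are illustrative assumptions, not the full feature column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only, loosely shaped like an interest-rate column
x = np.array([[0.15], [8.83], [4.63], [2.765]])

# Fit the scaler and transform the column in one step
scaled = StandardScaler().fit_transform(x)

# Manual version: subtract the mean, divide by the population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each feature has mean 0 and standard deviation 1, so features on different scales contribute comparably to the logistic regression penalty.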

In [7]:
preprocessor

#### Model Selection
##### Logistic Regression
Our goal is to predict whether the S&P 500 index increases, which is a binary classification problem, so we use logistic regression.

In [8]:
pipe = make_pipeline(preprocessor, LogisticRegression())

In [9]:
pipe
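Before touching the test set, a pipeline like this can be assessed with cross-validation on the training data. The sketch below is one possible way to do that, not a step performed in this notebook. It uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and all names ending in `_toy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train: 5 numeric features, about 75% positive class
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.25, 0.75], random_state=123,
)

# Same structure as the notebook's pipeline: scaling followed by logistic regression
pipe_toy = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation; the scaler is refit inside each fold, which avoids leakage
cv_results = cross_validate(pipe_toy, X_toy, y_toy, cv=5, return_train_score=True)
mean_cv_accuracy = cv_results["test_score"].mean()
print(f"Mean validation accuracy: {mean_cv_accuracy:.3f}")
```

In the notebook, `cross_validate(pipe, X_train, y_train, cv=5)` would play the same role.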

##### Fitting the model and scoring on test data

In [10]:
pipe.fit(X_train, y_train)

In [11]:
pipe.score(X_test, y_test)

0.7530864197530864

On the test data, logistic regression achieves an accuracy of 75.3%. Accuracy is the ratio of correct predictions to all predictions. This metric needs to be interpreted with caution under class imbalance, because a model that always predicts the majority class can also score highly.
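The caveat about class imbalance can be illustrated with a confusion matrix and a per-class report. The labels and predictions below are made up for illustration and are not the notebook's results.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions (illustrative, not the notebook's output)
y_true = [True, True, True, True, True, True, False, False]
y_pred = [True, True, True, True, True, True, True, False]

# Accuracy = correct predictions / all predictions
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Rows are true classes (False, True); columns are predicted classes (False, True)
cm = confusion_matrix(y_true, y_pred, labels=[False, True])
print(cm)

# Per-class precision and recall expose what a single accuracy figure hides
print(classification_report(y_true, y_pred, labels=[False, True], zero_division=0))
```

In this toy case the accuracy looks high even though only half of the minority (`False`) cases are recovered. In the notebook, `classification_report(y_test, pipe.predict(X_test))` would give the corresponding breakdown.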

##### Dummy Classifier

In [12]:
from sklearn.dummy import DummyClassifier

dc = DummyClassifier()
dc.fit(X_train, y_train)

In [13]:
dc.score(X_test, y_test)

0.7592592592592593

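On the test data, the dummy classifier achieves an accuracy of 75.9%, which is slightly better than the 75.3% of logistic regression. With its default settings in recent scikit-learn versions, `DummyClassifier` predicts the most frequent training class for every row, so this figure mainly reflects how often that class occurs in the test set. In other words, logistic regression does not outperform this baseline.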
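The link between a majority-class baseline and the class distribution can be sketched as follows. The labels are made up for illustration, and `strategy="most_frequent"` is set explicitly here as an assumption; the notebook relies on the default strategy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 of 8 are True (illustrative, not the notebook's data)
y_toy = np.array([True, True, True, False, True, False, True, True])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the feature values

# Always predict the most frequent class seen during fitting
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# Its accuracy equals the share of the majority class in the data being scored
baseline_accuracy = baseline.score(X_toy, y_toy)
majority_share = y_toy.mean()
print(baseline_accuracy, majority_share)
```

Any model worth keeping should beat this figure, which is why the comparison above is informative.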

#### Conclusion
In the analysis above, we fit a `LogisticRegression` model and compared it with a `DummyClassifier` baseline. The result is not promising: logistic regression (75.3% test accuracy) does not beat the baseline (75.9%). Further data preprocessing is needed to improve the model's effectiveness.