# __[Logistic Regression](https://www.youtube.com/watch?v=X9jjyh0p7x8)__

Using S&P 500 data

Original converted from R in [An Introduction to Statistical Learning](https://www.statlearning.com/) by Algovibes

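$$
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \sum_{j=1}^{p} X_j\beta_j
$$

where $p(X) = \Pr(Y = 1 \mid X)$ is the probability of the positive class (for example, the market going up).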

The model predicts *binary* output:

- Mail spam possibility: true or false
- Stock market: up or down
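
To make the step from a linear predictor to a binary outcome concrete, here is a minimal sketch in plain Python. The coefficients and lag values are made-up numbers for illustration, not fitted results, and `predict_direction` is a hypothetical helper name. The 0.5 cutoff mirrors the threshold used in the code later in this notebook.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_direction(beta, x, threshold=0.5):
    """Return (probability, label) for one observation.

    beta and x are equal-length lists; x[0] is the constant 1.0.
    """
    z = sum(b * v for b, v in zip(beta, x))  # linear predictor (log-odds)
    p = sigmoid(z)
    return p, ("Up" if p > threshold else "Down")

# Hypothetical coefficients for [const, Lag 1, Lag 2]; illustrative only
beta = [0.1, -0.05, -0.04]
x = [1.0, 0.8, -1.2]  # constant, yesterday's return, return two days ago
p, label = predict_direction(beta, x)
print(round(p, 4), label)
```

A log-odds of 0 corresponds to a probability of exactly 0.5, so the label flips where the linear predictor changes sign.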

The data are daily returns of the S&P 500 from 2001 to 2005:

- `Today`: the relative price change for the day
- `Direction`: whether that change is up or down
- Lags: returns of prior days
  - `Lag1`: one day prior
  - `Lag2`: two days prior
  - `Lag3`: three days prior
  - and so on
- `Volume`: trading volume of the prior day, in billions
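
As a small illustration of how these columns relate to each other, the sketch below builds `Today`, `Direction`, and lag values from a short list of made-up prices using only the standard library. The helper names `pct_returns` and `lag_rows` are hypothetical; the notebook itself does the same job with pandas further down.

```python
def pct_returns(prices):
    """Percentage change between consecutive prices."""
    return [(b - a) / a * 100 for a, b in zip(prices, prices[1:])]

def lag_rows(returns, n_lags):
    """Build one row per day that has a full history of n_lags prior returns."""
    rows = []
    for t in range(n_lags, len(returns)):
        row = {"Today": returns[t], "Direction": "Up" if returns[t] > 0 else "Down"}
        for k in range(1, n_lags + 1):
            row[f"Lag{k}"] = returns[t - k]  # return from k days earlier
        rows.append(row)
    return rows

prices = [100.0, 101.0, 99.99, 102.0, 102.0, 103.02]  # made-up closing prices
returns = pct_returns(prices)
rows = lag_rows(returns, n_lags=2)
for row in rows:
    print(row)
```

Days without a complete lag history are dropped, and a flat day (return of exactly 0) is labelled "Down", matching the `> 0` rule used in the notebook code.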

## Confusion Matrix

| | Predicted No | Predicted Yes |
|---|---|---|
| **Actual No** | True Negative (TN) | False Positive (FP) |
| **Actual Yes** | False Negative (FN) | True Positive (TP) |

Actual No = TN + FP

Actual Yes = FN + TP

Predicted No = TN + FN

Predicted Yes = FP + TP

total = Predicted No + Predicted Yes

total = Actual No    + Actual Yes

### Accuracy

Accuracy = (TP + TN) / total

### Misclassification Rate

Misclassification Rate = (FP + FN) / total

### True Positive Rate

True Positive Rate = TP / (Actual Yes)

### False Positive Rate

False Positive Rate = FP / (Actual No)

### True Negative Rate

True Negative Rate = TN / (Actual No)

### Precision

Precision = TP / (Predicted Yes)

### Prevalence

Prevalence = (Actual Yes) / total
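
The definitions above translate directly into code. The following is a minimal sketch with made-up counts; `classification_metrics` is a hypothetical helper name, not something from the original notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the metrics defined above from the four confusion-matrix cells."""
    actual_no = tn + fp
    actual_yes = fn + tp
    predicted_yes = fp + tp
    total = actual_no + actual_yes
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / actual_yes,
        "false_positive_rate": fp / actual_no,
        "true_negative_rate": tn / actual_no,
        "precision": tp / predicted_yes,
        "prevalence": actual_yes / total,
    }

# Made-up counts for illustration only
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Two quick sanity checks follow from the definitions: accuracy and misclassification rate sum to 1, and so do the false positive rate and the true negative rate.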


# Parametric Findings

## Original model

Test accuracy for each set of predictors, using data from 2001-01-01 to 2005-12-31 (the code below has since been changed to a different date range). The best result is in bold.

| Accuracy | Predictors |
|---|---|
| 0.5714285714285714 | Lag 1, Lag 2 |
| 0.4603174603174603 | Lag 1, Lag 2, Volume |
| **0.5833333333333334** | **Lag 1, Lag 2, Lag 3** |
| 0.4801587301587302 | Lag 1, Lag 2, Lag 3, Volume |
| 0.5793650793650794 | Lag 1, Lag 2, Lag 3, Lag 4 |
| 0.4801587301587302 | Lag 1, Lag 2, Lag 3, Lag 4, Volume |
| 0.5753968253968254 | Lag 1, Lag 2, Lag 3, Lag 4, Lag 5 |
| 0.4722222222222222 | Lag 1, Lag 2, Lag 3, Lag 4, Lag 5, Volume |

# References

__[statsmodels](https://www.statsmodels.org/stable/index.html)__

In [91]:
import pandas as pd
import statsmodels.api as sm
import yfinance as yf

In [92]:
ticker = "^GSPC"
year_start = "2001-01-01"
year_end   = "2020-12-31"
year_split = 2018
data = yf.download(ticker, start=year_start, end=year_end)

[*********************100%***********************]  1 of 1 completed


In [93]:
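# Daily percentage return of the adjusted close
# (assumes the downloaded frame has a single-level "Adj Close" column)
df = data["Adj Close"].pct_change() * 100
df = df.rename("Today")
# Turn the Date index into a regular column so rows can be filtered by year later
df = df.reset_index()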

In [94]:
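# "Lag i" is the return from i trading days earlier
for i in range(1, 6):
    df["Lag " + str(i)] = df["Today"].shift(i)
# Prior day's volume, scaled down by 1e8
df["Volume"] = data.Volume.shift(1).values / 100000000
# Drop the leading rows that lack a complete lag history
df = df.dropna()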

In [95]:
# 1 = up day, 0 = down or flat day
df["Direction"] = [1 if i > 0 else 0 for i in df["Today"]]
# Add the intercept column "const" used by the models below
df = sm.add_constant(df)

In [96]:
X = df[["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4", "Lag 5", "Volume"]]
y = df.Direction

In [97]:
# Fit on the full sample, then get in-sample predicted probabilities of an up day
model = sm.Logit(y, X)
result = model.fit()
result.summary()
prediction = result.predict(X)

Optimization terminated successfully.
         Current function value: 0.687507
         Iterations 4
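
As a rough reading aid for the fit output: the "Current function value" that statsmodels prints for `Logit` is, as far as I can tell, the average negative log-likelihood per observation. Under that assumption, the sketch below computes the same quantity for made-up labels and shows that a model which always outputs a probability of 0.5 scores ln 2 ≈ 0.6931. The reported 0.687507 sits only slightly below that. `mean_log_loss` is a hypothetical helper, not a statsmodels function.

```python
import math

def mean_log_loss(y_true, p_up):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_up):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_toy = [1, 0, 1, 1, 0, 1]        # made-up directions (1 = up)
coin_flip = [0.5] * len(y_toy)    # a model that carries no information
baseline = mean_log_loss(y_toy, coin_flip)

print(baseline, math.log(2))      # the coin-flip loss equals ln 2
print(math.log(2) - 0.687507)     # gap between ln 2 and the value reported above
```

The small gap suggests the lags and volume add very little information beyond a coin flip, which fits the confusion matrix that follows.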


In [98]:
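def confusion_matrix(act, pred):
    """Cross-tabulate actual vs. predicted direction, thresholding probabilities at 0.5."""
    labels = ["Down", "Up"]
    predtrans = ["Up" if i > 0.5 else "Down" for i in pred]
    actuals = ["Up" if i > 0 else "Down" for i in act]
    cm = pd.crosstab(pd.Series(actuals), pd.Series(predtrans), rownames=["Actual"], colnames=["Predicted"])
    # Keep both labels even if one never occurs, so cm["Down"]["Down"] always exists
    return cm.reindex(index=labels, columns=labels, fill_value=0)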

In [99]:
confusion_matrix(y, prediction)

| Actual \ Predicted | Down | Up |
|---|---|---|
| Down | 189 | 2121 |
| Up | 160 | 2555 |
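
To put these counts in context, the following sketch works through the arithmetic with the four numbers copied from the matrix above. It compares the model's in-sample accuracy with a naive rule that always predicts "Up"; that baseline is my addition for comparison and is not part of the original notebook.

```python
# Counts copied from the confusion matrix above: cm[actual][predicted]
cm = {
    "Down": {"Down": 189, "Up": 2121},
    "Up": {"Down": 160, "Up": 2555},
}

total = sum(sum(row.values()) for row in cm.values())
accuracy = (cm["Down"]["Down"] + cm["Up"]["Up"]) / total

# Share of days that actually went up = accuracy of always predicting "Up"
always_up_accuracy = sum(cm["Up"].values()) / total

# Share of days on which the model predicted "Down" at all
predicted_down_share = (cm["Down"]["Down"] + cm["Up"]["Down"]) / total

print(f"accuracy:           {accuracy:.4f}")
print(f"always-Up baseline: {always_up_accuracy:.4f}")
print(f"predicted Down on:  {predicted_down_share:.4f} of days")
```

The model predicts "Up" on the vast majority of days, so its accuracy ends up within about one percentage point of the always-Up rule.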


In [100]:
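def lr_model(y_train, x_train, x_test, y_test):
    """Fit a logit model on the training set and return its accuracy on the test set."""
    model = sm.Logit(y_train, x_train)
    result = model.fit()
    prediction = result.predict(x_test)
    cm = confusion_matrix(y_test, prediction)
    # cm[predicted][actual]: the diagonal holds the correctly classified days
    accuracy = (cm["Down"]["Down"] + cm["Up"]["Up"]) / len(x_test)
    return accuracy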

In [101]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))


Optimization terminated successfully.
         Current function value: 0.688911
         Iterations 4
0.5418326693227091


In [102]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2", "Volume"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2", "Volume"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))

Optimization terminated successfully.
         Current function value: 0.688767
         Iterations 4
0.5418326693227091


In [103]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2", "Lag 3"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2", "Lag 3"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))

Optimization terminated successfully.
         Current function value: 0.688855
         Iterations 4
0.549800796812749


In [104]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Volume"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Volume"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))

Optimization terminated successfully.
         Current function value: 0.688704
         Iterations 4
0.5418326693227091


In [105]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))

Optimization terminated successfully.
         Current function value: 0.688832
         Iterations 4
0.5418326693227091


In [106]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4", "Volume"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4", "Volume"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))

Optimization terminated successfully.
         Current function value: 0.688685
         Iterations 4
0.5378486055776892


In [107]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4", "Lag 5"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4", "Lag 5"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))

Optimization terminated successfully.
         Current function value: 0.688684
         Iterations 4
0.5378486055776892


In [108]:
x_train = df[df.Date.dt.year < year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4", "Lag 5", "Volume"]]
y_train = df[df.Date.dt.year < year_split]["Direction"]
x_test = df[df.Date.dt.year == year_split][["const", "Lag 1", "Lag 2", "Lag 3", "Lag 4", "Lag 5", "Volume"]]
y_test = df[df.Date.dt.year == year_split]["Direction"]
print(lr_model(y_train, x_train, x_test, y_test))

Optimization terminated successfully.
         Current function value: 0.688548
         Iterations 4
0.5298804780876494
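
The eight cells above differ only in the list of predictor columns, so they could be expressed as one loop. The sketch below is one possible way to do that, assuming the notebook's `df`, `year_split`, and `lr_model` are in scope when it is used. `build_feature_sets` and `evaluate_feature_sets` are hypothetical helper names, not part of the original notebook.

```python
def build_feature_sets(max_lag=5, min_lag=2):
    """List the predictor sets tried above: lags 1..n, each with and without Volume."""
    feature_sets = []
    for n in range(min_lag, max_lag + 1):
        base = ["const"] + [f"Lag {i}" for i in range(1, n + 1)]
        feature_sets.append(base)
        feature_sets.append(base + ["Volume"])
    return feature_sets

def evaluate_feature_sets(df, feature_sets, year_split, fit_and_score):
    """Score each predictor set with a train (< year_split) / test (== year_split) split.

    df is expected to look like the notebook's DataFrame (Date, Direction,
    const, lag columns, Volume); fit_and_score is a callable taking arguments
    in the same order as lr_model.
    """
    train = df[df.Date.dt.year < year_split]
    test = df[df.Date.dt.year == year_split]
    scores = {}
    for cols in feature_sets:
        # Key each score by its predictors, leaving out the constant
        scores[", ".join(cols[1:])] = fit_and_score(
            train["Direction"], train[cols], test[cols], test["Direction"]
        )
    return scores

for cols in build_feature_sets():
    print(cols)

# In the notebook this would be called as (not run here):
# scores = evaluate_feature_sets(df, build_feature_sets(), year_split, lr_model)
```

Passing the scoring function in as an argument keeps the loop independent of statsmodels, so the same helper could be reused with a different classifier.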
