# Introduction to Classification

In [None]:
import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

import statsmodels.api as sm

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    roc_auc_score,
    roc_curve,
    confusion_matrix,
)

def add_lags(df, columns, n_lags=1):
    """
    Add lags to specific columns in a DataFrame.

    Parameters:
    - df (DataFrame): Original DataFrame.
    - columns (list): List of column names for which to create lags.
    - n_lags (int): Number of lags to create for each column.

    Returns:
    - DataFrame: Updated DataFrame with lag columns.
    """
    for column in columns:
        for lag in range(1, n_lags + 1):
            df[f"{column}_lag{lag}"] = df[column].shift(lag)
    return df

In [None]:
df = pd.read_csv("data/Rv_daily_lec4.csv", index_col=0)

In [None]:
col_to_transform = ["TBill3M", "TBill1Y", "Oil", "Gold", "SP_volume"]
for c in col_to_transform:
    df["{}_ret".format(c)] = df[c].pct_change(1) * 100
df = df.dropna()

df = add_lags(df, ["Return_close"], n_lags=3)
df = df.dropna()
df = df.replace([np.inf, -np.inf], 0)

In [None]:
df.head()

<div style="text-align:center; font-size:24px">
    <span style="color:red">How do we set up a classification problem?</span>
</div>


<details>
    <summary>Click to expand!</summary>
    
    Define the Target Variable: Decide what you want to classify. For example, you might want to predict whether the "Return_close" is positive or negative. This would turn the problem into a binary classification task.

    Convert the Target Variable: Based on the definition, you'll need to transform the "Return_close" into a binary variable. You can set a threshold (e.g., 0) and classify the returns as:

    - 1 if "Return_close" > 0 (Positive Return)
    - 0 if "Return_close" <= 0 (Negative Return or No Gain)
</details>


In [None]:
df["Ret_binary"] = (df["Return_close"] > 0).astype(int)

In [None]:
df["Ret_binary"].unique()

In [None]:
df["Ret_binary"].value_counts(normalize=True)

## Logistic Regression for Binary Classification

Logistic regression is a type of regression analysis used for binary classification problems, where the outcome variable (dependent variable) is binary, meaning it has only two possible values, often coded as 0 and 1. The output of logistic regression is quite different from linear regression, and its interpretation is distinct as well.

### Output of Logistic Regression

The output of logistic regression is a probability estimate that an observation belongs to the positive class (class 1). This probability is bounded between 0 and 1. Mathematically, logistic regression models the log-odds (logit) of the probability of the positive class. The logistic regression equation is typically expressed as:

$$ \text{Logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p $$

- $p$ represents the probability of the positive class.
- $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the coefficients of the model.
- $x_1, x_2, \ldots, x_p$ are the input features.

To obtain the probability $p$ from the log-odds, we apply the inverse of the logit function, which is the logistic (sigmoid) function:

$$
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p)}}
$$

This function ensures that the output is a probability value between 0 and 1. The logistic function takes the linear combination of the input features and their coefficients (the log-odds) and maps it to a probability, allowing us to interpret the model's predictions in terms of the likelihood that an observation belongs to the positive class.
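
The mapping between log-odds and probability can be checked numerically. The following is a minimal sketch; the coefficient values and the observation `x` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logit(p):
    """Map a probability p in (0, 1) back to log-odds."""
    return np.log(p / (1.0 - p))


# Hypothetical coefficients and a single observation with two features
beta0 = -0.5
beta = np.array([0.8, -1.2])
x = np.array([1.0, 0.5])

log_odds = beta0 + beta @ x  # linear combination of the features
p = sigmoid(log_odds)        # probability of the positive class

print(f"log-odds = {log_odds:.3f}, probability = {p:.3f}")
print(f"logit(p) recovers the log-odds: {logit(p):.3f}")
```

Applying `logit` to the output of `sigmoid` returns the original linear combination, which is what it means for the two functions to be inverses.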

### Interpreting Logistic Regression

1. **Coefficient Interpretation:** The coefficients ($\beta$) in logistic regression represent the change in the log-odds of the probability of the positive class for a one-unit change in the corresponding predictor variable, while holding all other predictors constant. Unlike linear regression, where coefficients represent changes in the dependent variable, logistic regression coefficients are related to the log-odds of the outcome.

2. **Probability Interpretation:** The probability estimate $p$ can be converted into class predictions. Commonly, a threshold of 0.5 is used; if $p > 0.5$, the observation is classified as the positive class, and if $p \leq 0.5$, it's classified as the negative class.
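
Both interpretations can be illustrated with a small sketch. The intercept and slope below are hypothetical values, not estimates from the dataset. Because a coefficient is a change in log-odds, exponentiating it gives the factor by which the odds are multiplied for a one-unit increase in the predictor.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: log-odds to probability."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical intercept and slope for a single predictor
beta0, beta1 = -0.2, 0.4


def odds(x):
    """Odds p / (1 - p) of the positive class at predictor value x."""
    p = sigmoid(beta0 + beta1 * x)
    return p / (1.0 - p)


x = 1.5
odds_ratio = odds(x + 1.0) / odds(x)  # effect of a one-unit increase in x
print(f"odds ratio = {odds_ratio:.4f}, exp(beta1) = {np.exp(beta1):.4f}")

# Class prediction with the usual 0.5 threshold
p = sigmoid(beta0 + beta1 * x)
predicted_class = int(p > 0.5)
print(f"p = {p:.3f}, predicted class = {predicted_class}")
```

The odds ratio does not depend on the starting value of `x`, whereas the change in probability does.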

### Differences from Linear Regression

1. **Output Type:** The most significant difference is the output type. Linear regression predicts a continuous outcome (e.g., house prices), while logistic regression predicts a probability for a binary outcome (e.g., whether a customer will buy a product or not).

2. **Equation Form:** The equations differ fundamentally. Linear regression models the continuous outcome directly as a linear function of the features, while logistic regression models the log-odds of the positive class as a linear function, so the predicted probability follows a logistic (S-shaped) curve.

3. **Residuals:** In linear regression, the residuals (the differences between predicted and actual values) are assumed to follow a normal distribution. In logistic regression, the outcome given the features follows a Bernoulli distribution, so the residuals are not normally distributed.

4. **Assumptions:** Linear regression assumes linearity, constant variance, and normality of residuals. Logistic regression does not require constant variance or normal residuals, but it does assume that the log-odds are linear in the features.

Logistic regression is specifically designed for binary classification tasks, and its output is a probability estimate. The interpretation of coefficients in logistic regression is based on log-odds, making it well-suited for problems where you want to predict the probability of an event occurring or not occurring.

## How Logistic Regression Works

**Likelihood Function**

- Logistic regression models the probability of the positive class ($y = 1$) as:
  $$
  P(y = 1 | X) = \frac{1}{1 + e^{-X\beta}}, \quad P(y = 0 | X) = 1 - P(y = 1 | X).
  $$
- The likelihood of the observed data is:
  $$
  \mathcal{L}(\beta) = \prod_{i=1}^n P(y_i | X_i) = \prod_{i=1}^n \left[ p_i^{y_i} (1 - p_i)^{1 - y_i} \right],
  $$
  where $p_i = P(y_i = 1 | X_i)$.
- The log-likelihood simplifies optimization:
  $$
  \log \mathcal{L}(\beta) = \sum_{i=1}^n \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right].
  $$

**Optimization Problem**

- The goal is to find the coefficients $\beta$ that maximize the log-likelihood:
  $$
  \hat{\beta} = \arg \max_\beta \log \mathcal{L}(\beta).
  $$
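- The log-likelihood is concave, ensuring a global maximum, but the first-order conditions are nonlinear in $\beta$, so no closed-form solution exists and the maximum is found iteratively (for example, by gradient ascent or Newton's method).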

- **Gradient of the log-likelihood**:
  $$
  \frac{\partial \log \mathcal{L}(\beta)}{\partial \beta} = \sum_{i=1}^n (y_i - p_i) X_i.
  $$
- The gradient points in the direction to adjust $\beta$ to improve the fit.
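
The optimization described above can be sketched with plain gradient ascent. This is an illustration on synthetic data, not the solver that scikit-learn or statsmodels actually use. The sample size, `true_beta`, the learning rate, and the number of iterations are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): an intercept column plus one feature
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))


def log_likelihood(beta):
    """Sum of y*log(p) + (1-y)*log(1-p), written as y*z - log(1 + e^z)."""
    z = X @ beta
    return np.sum(y * z - np.logaddexp(0.0, z))


def gradient(beta):
    """Gradient of the log-likelihood: sum_i (y_i - p_i) X_i."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p)


beta = np.zeros(2)
learning_rate = 0.5
ll_start = log_likelihood(beta)

for _ in range(2000):
    # Step in the direction of the (averaged) gradient
    beta = beta + learning_rate * gradient(beta) / n

ll_end = log_likelihood(beta)
grad_norm = np.linalg.norm(gradient(beta))

print(f"estimated beta = {beta}")
print(f"log-likelihood: {ll_start:.2f} -> {ll_end:.2f}")
print(f"gradient norm at the end = {grad_norm:.2e}")
```

At the maximum the gradient is (numerically) zero, which is the first-order condition $\sum_i (y_i - p_i) X_i = 0$.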

In [None]:
# Select features
features = ["RV"]

X = df[features]
y = df["Ret_binary"]

# Initialize Logistic Regression Model
logistic_model = LogisticRegression()

# Fit the model
logistic_model.fit(X, y)

# Predict on the same data used for fitting (no train/test split here)
y_pred = logistic_model.predict(X)

# Evaluate the model
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")

In [None]:
logistic_model.predict_proba(X)

In [None]:
logistic_model.predict(X)

In [None]:
y_pred

**Inspecting the Parameters**

One can access the estimated coefficients and the intercept of the logistic regression model using the `coef_` and `intercept_` attributes.

In [None]:
# Coefficients
coef = logistic_model.coef_[0, 0]  # scalar coefficient of the single feature
print(f"Coefficient for RV: {coef}")

# Intercept
intercept = logistic_model.intercept_[0]
print(f"Intercept: {intercept}")

**Checking for Statistical Significance**

To check the statistical significance of the coefficients, you can perform a hypothesis test. This usually involves a **Wald test** in the context of logistic regression. 

**Null Hypothesis (H0):** $\beta_1 = 0$

**Alternative Hypothesis (H1):** $\beta_1 \neq 0$


$$ \text{Wald Statistic} = \frac{(\hat{\beta}_1 - 0)^2}{\text{Var}(\hat{\beta}_1)} $$

The Wald test is already used with linear regression estimations, as we have seen in the previous lecture. However, in logistic regression, the estimated coefficient $\hat{\beta}_1$ is related to the log-odds of the dependent variable, and its variance is estimated from the curvature of the log-likelihood (the inverse of the information matrix) rather than from the residual variance. Under the null hypothesis, the test statistic asymptotically follows a chi-squared distribution with 1 degree of freedom.
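
As a small numerical sketch of the statistic above, suppose a fitted model reported a coefficient and a standard error. The two numbers below are hypothetical and are not taken from the dataset. The example also shows that the chi-squared p-value coincides with the two-sided p-value of the corresponding z statistic $\hat{\beta}_1 / \text{SE}(\hat{\beta}_1)$.

```python
from scipy.stats import chi2, norm

# Hypothetical estimate and standard error (not from the dataset)
beta_hat = 0.35
std_err = 0.14

# Wald statistic for H0: beta_1 = 0
wald_stat = (beta_hat - 0.0) ** 2 / std_err**2
p_value = chi2.sf(wald_stat, df=1)  # upper tail of chi-squared(1)

# Equivalent z-test: the Wald statistic is the square of z
z = beta_hat / std_err
p_value_z = 2 * norm.sf(abs(z))

print(f"Wald statistic = {wald_stat:.3f}, p-value = {p_value:.4f}")
print(f"z = {z:.3f}, two-sided p-value = {p_value_z:.4f}")
```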


You can use the statsmodels library to fit the logistic regression model and obtain the p-values.

In [None]:
# Add a constant to the features (for the intercept)
X_sm = sm.add_constant(X)

# Fit logistic regression model
logit_model = sm.Logit(y, X_sm)
result = logit_model.fit()

# Summary of the regression, including p-values
result.summary()

<div style="text-align:center; font-size:24px">
    <span style="color:red">How should you choose between statsmodels and scikit-learn?</span>
</div>


This section evaluates a model's performance using the Accuracy, Precision, Recall, and F1 score metrics and briefly explains the confusion matrix.
Once you have built a model, the most important question is: how good is it? Model evaluation answers that question by quantifying how good your predictions are.

Recall the following concepts:

- True Positives (TP) - Correctly predicted positive values: the actual class is 1 and the predicted class is also 1.

- True Negatives (TN) - Correctly predicted negative values: the actual class is not 1 and the predicted class is also not 1.

False positives and false negatives occur when the predicted class contradicts the actual class.

- False Positives (FP) - The actual class is not 1 but the predicted class is 1.

- False Negatives (FN) - The actual class is 1 but the predicted class is not 1.

Once you understand these four concepts, you can calculate Accuracy, Precision, Recall, and F1 score.

- Accuracy - Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to total observations. It is tempting to conclude that a model with high accuracy is the best one, but accuracy is only a reliable measure when the classes are roughly balanced and false positives and false negatives occur at similar rates. Otherwise, you have to look at other metrics to evaluate the performance of your model.

$$\text{Accuracy} = \frac{TP+TN}{TP+FP+FN+TN}$$

- Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the observations labelled as 1, how many actually are 1? High precision corresponds to a low number of false positives.

$$\text{Precision} = \frac{TP}{TP+FP}$$

- Recall (Sensitivity) - Recall is the ratio of correctly predicted positive observations to all observations that actually belong to class 1. The question recall answers is: of all the observations that belong to class 1, how many did we predict correctly?

$$\text{Recall} = \frac{TP}{TP+FN}$$

- F1 score - The F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution as in our case. Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it’s better to look at both Precision and Recall.

$$\text{F1 Score} = \frac{2 \cdot (\text{Recall} \cdot \text{Precision})}{\text{Recall} + \text{Precision}}$$

So, whenever you build a model, these concepts should help you to figure out how good your model has performed.
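
The four formulas can be verified on a small example. The label vectors below are toy data made up for illustration. The sketch computes each metric from the TP/TN/FP/FN counts and compares the result with the corresponding scikit-learn function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count the four outcomes
tp = int(np.sum((y_true == 1) & (y_hat == 1)))
tn = int(np.sum((y_true == 0) & (y_hat == 0)))
fp = int(np.sum((y_true == 0) & (y_hat == 1)))
fn = int(np.sum((y_true == 1) & (y_hat == 0)))

# Apply the formulas
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"manual : acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
print(
    "sklearn: "
    f"acc={accuracy_score(y_true, y_hat):.3f} "
    f"prec={precision_score(y_true, y_hat):.3f} "
    f"rec={recall_score(y_true, y_hat):.3f} "
    f"f1={f1_score(y_true, y_hat):.3f}"
)
```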

Look [here](https://developers.google.com/machine-learning/crash-course/classification/thresholding) and [here](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) for a better understanding of the next metrics we are going to use.
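
To make thresholding and the ROC curve concrete, here is a minimal sketch on four toy observations (the scores are made up). Each threshold yields one (FPR, TPR) point of the ROC curve. The AUC is then computed as the fraction of (positive, negative) pairs in which the positive observation receives the higher score, which is one standard interpretation of the area under the curve.

```python
import numpy as np

# Toy labels and predicted probabilities (made up for illustration)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])


def roc_point(threshold):
    """Return (FPR, TPR) when predicting class 1 for scores >= threshold."""
    y_hat = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_hat == 1))
    fn = np.sum((y_true == 1) & (y_hat == 0))
    fp = np.sum((y_true == 0) & (y_hat == 1))
    tn = np.sum((y_true == 0) & (y_hat == 0))
    return float(fp / (fp + tn)), float(tp / (tp + fn))


# Lowering the threshold moves along the ROC curve from (0, 0) to (1, 1)
for threshold in [0.9, 0.5, 0.3, 0.0]:
    fpr_t, tpr_t = roc_point(threshold)
    print(f"threshold={threshold:.1f}: FPR={fpr_t:.2f}, TPR={tpr_t:.2f}")

# AUC as a ranking probability: compare every positive with every negative
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc = float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))  # ties count as half
print(f"AUC = {auc:.2f}")
```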


In [None]:
# Select features
features = ["RV", "TBill1Y_ret", "TBill3M_ret"]

X = df[features]
y = df["Ret_binary"]

# Initialize Logistic Regression Model
logistic_model = LogisticRegression()

# Fit the model
logistic_model.fit(X, y)

# Predict on the same data used for fitting (no train/test split here)
y_pred = logistic_model.predict(X)

# Evaluate the model
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")

In [None]:
# Predict probabilities and select those for the positive class (y = 1) for each observation (here, the same data used for fitting). It is used later for the ROC-AUC calculation.
y_probs = logistic_model.predict_proba(X)[:, 1]

# Print a report summarizing the precision, recall, and F1-score for each class. It provides a quick overview of how well the model is performing for each class.
print("Classification Report:")
print(classification_report(y, y_pred))

# This computes the Area Under the Receiver Operating Characteristic Curve (ROC-AUC). A value closer to 1 indicates better classification performance.
roc_auc = roc_auc_score(y, y_probs)
print(f"ROC-AUC Score: {roc_auc}")

# This calculates the False Positive Rate (FPR) and True Positive Rate (TPR) for various threshold values, which are used to plot the ROC curve.
fpr, tpr, _ = roc_curve(y, y_probs)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], "k--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.show()

In [None]:
# Function to fit a logistic regression model and plot the ROC
def plot_roc(model, X, y, label, ax):
    model.fit(X, y)
    y_probs = model.predict_proba(X)[:, 1]
    fpr, tpr, _ = roc_curve(y, y_probs)
    roc_auc = roc_auc_score(y, y_probs)
    ax.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc:.2f})")

In [None]:
df.columns

In [None]:
fig, ax = plt.subplots()

features = ["RV"]
X = df[features]
y = df["Ret_binary"]
model1 = LogisticRegression()
plot_roc(model1, X,y, "LR1", ax)

features = ["RV", "TBill1Y_ret"]
X = df[features]
y = df["Ret_binary"]
model2 = LogisticRegression(penalty="l2", solver="liblinear")
plot_roc(model2,  X,y, "LR2", ax)

features = ["RV", "TBill1Y_ret", "TBill3M_ret"]
X = df[features]
y = df["Ret_binary"]
model3 = LogisticRegression()
plot_roc(model3, X,y, "LR3", ax)

features = [
    "TBill3M",
    "TBill1Y",
    "Oil",
    "RV",
    "Gold",
    "SP_close",
    "SP_volume",
    "Holiday",
    "weekday",
    "TBill3M_ret",
    "TBill1Y_ret",
    "Oil_ret",
    "Gold_ret",
    "SP_volume_ret",
    "Return_close_lag1",
    "Return_close_lag2",
    "Return_close_lag3",
]
X = df[features]
y = df["Ret_binary"]
model4 = LogisticRegression()
plot_roc(model4,  X,y, "LR4", ax)


# Add random classifier line and labels
ax.plot([0, 1], [0, 1], "k--")
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("Receiver Operating Characteristic")
ax.legend(loc="lower right")

## Evaluation of Classification Models

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()

In [None]:
confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0])

In [None]:
tn

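**CONFUSION MATRIX**: in scikit-learn's `confusion_matrix`, true values are displayed by row, while predicted values are displayed by column. With labels 0 and 1, the entry [0,0] is the number of class-0 observations that the model predicts correctly (true negatives), and the entry [1,1] is the number of class-1 observations that the model predicts correctly (true positives). The entry [0,1] holds the false positives and [1,0] the false negatives, which is why `.ravel()` returns the counts in the order `tn, fp, fn, tp`.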
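
The layout can be checked on the toy example used in the cells above. The sketch builds the matrix by hand, indexing rows by the true label and columns by the predicted label, and compares it with scikit-learn's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_hat = [1, 1, 1, 0]

# Build the matrix by hand: row = true label, column = predicted label
manual = np.zeros((2, 2), dtype=int)
for true_label, predicted_label in zip(y_true, y_hat):
    manual[true_label, predicted_label] += 1

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print("manual:\n", manual)
print("sklearn:\n", cm)
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```

Both true-0 observations are predicted as 1, so they land in row 0, column 1 (false positives), and the true negative count is zero.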

In [None]:
# Select features
features = ["RV", "TBill1Y_ret", "TBill3M_ret"]
X = df[features]
y = df["Ret_binary"]
logistic_model = LogisticRegression()
logistic_model.fit(X, y)
y_pred = logistic_model.predict(X)

In [None]:
print(confusion_matrix(y, y_pred))

In [None]:
confusion_matrix(y, y_pred).sum()

In [None]:
len(y)

In [None]:
print(classification_report(y, y_pred))

# Logistic Regression for Multiclass Classification

Multiclass classification and binary classification are two distinct tasks in supervised learning, and logistic regression can be adapted to handle both. Here's a comparison and description of performing multiclass classification vs. binary classification using logistic regression:

**Binary Classification**
In binary classification, there are only two possible outcomes or classes. Logistic regression models the probability that the target variable belongs to a particular category. It uses the logistic function to squeeze the output of a linear equation between 0 and 1. The class label prediction is then made based on whether the probability is above or below a certain threshold (usually 0.5).

**Multiclass Classification**
Multiclass classification extends the concept of binary classification to more than two classes. In the context of logistic regression, this can be done in several ways:

- *One-vs-All (OvA) or One-vs-Rest (OvR)*: In this approach, separate binary logistic regression models are trained for each class against all other classes. If there are $k$ classes, $k$ different binary logistic regression models are trained. Predictions are made by evaluating all $k$ models and choosing the class for which the corresponding model gives the highest probability.

- *Softmax Regression (Multinomial Logistic Regression)*: Softmax regression generalizes logistic regression to multiple classes. Instead of modeling just one binary response, the softmax function is used to model the multinomial probability distribution over the classes. The output is a probability for each class, and the predicted class label is the one with the highest probability.
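
The two strategies can be compared in a short sketch on synthetic three-class data. The data and variable names are made up. The sketch assumes a recent scikit-learn in which `LogisticRegression` with its default solver fits a multinomial (softmax) model when there are more than two classes, and it uses `OneVsRestClassifier` to force the one-vs-rest scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Synthetic 3-class data (assumption): Gaussian clouds around three centers
centers = np.array([[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 2))

softmax_model = LogisticRegression(max_iter=1000).fit(X, y)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Softmax: exponentiate the k linear scores and normalize across classes
scores = softmax_model.decision_function(X)
scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
softmax_probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# OvR: k separate binary models, one sigmoid probability each,
# rescaled afterwards so that each row sums to one
ovr_raw = np.column_stack(
    [est.predict_proba(X)[:, 1] for est in ovr_model.estimators_]
)
ovr_probs = ovr_raw / ovr_raw.sum(axis=1, keepdims=True)

print("number of binary models in OvR:", len(ovr_model.estimators_))
print("softmax matches predict_proba:",
      np.allclose(softmax_probs, softmax_model.predict_proba(X)))
print("normalized OvR matches predict_proba:",
      np.allclose(ovr_probs, ovr_model.predict_proba(X)))
```

Both approaches return one probability per class, but only the softmax model estimates all classes jointly. The OvR probabilities come from independent binary fits and are normalized after the fact.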


**Binary Classification**: Simpler and suitable when there are only two classes to distinguish between.

**Multiclass Classification**: More complex, as it deals with multiple classes. Requires special handling through techniques like OvA/OvR or softmax regression.

*Interpretability*: While binary logistic regression provides direct insights into the effect of each feature on the odds of belonging to one class, multiclass logistic regression, especially when using the softmax function, may be less straightforward to interpret. In multiclass classification, evaluation metrics such as accuracy, confusion matrix, and multiclass ROC-AUC become more complex, as they must account for multiple classes.


Here is what we are going to do now:
- Analyze the Distribution: First, you'll want to understand the distribution of the Return_close variable. You can use descriptive statistics or visualization methods to get an overview of the data.

- Choose the Bin Edges: Based on your analysis, you can decide how to segment the data. For example, you might want to have three classes: "Negative Return", "No Gain", and "Positive Return". You'll need to define the edges of the bins that separate these classes.

- Apply `pd.cut` with Custom Bin Edges to transform the continuous `Return_close` variable into categorical classes.

In [None]:
df["Return_close"].describe(percentiles=[0.1, 0.25, 0.75, 0.9])

In [None]:
# Transform 'Return_close' into a multiclass target variable
# Example: Negative Return, No Gain, Positive Return
df["Multiclass_Target"] = pd.cut(
    df["Return_close"],
    bins=[-float("inf"), -0.5, 0.5, float("inf")],
    labels=["Negative Return", "No Gain", "Positive Return"],
)

In [None]:
# Inspect the resulting categories (the column must be created first)
df["Multiclass_Target"].unique()

In [None]:
# Fit a multiclass logistic regression on the new target


features = ["RV", "TBill1Y_ret"]
X = df[features]
y = df["Multiclass_Target"]


logistic_model = LogisticRegression(max_iter=1000, multi_class="auto")
logistic_model.fit(X, y)
y_pred = logistic_model.predict(X)

accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")

print("Classification Report:")
print(classification_report(y, y_pred))

In [None]:
logistic_model.predict_proba(X)

In [None]:
df["Multiclass_Target"].value_counts()

<div style="text-align:center; font-size:24px">
    <span style="color:red">How do we know if we are using One-vs-Rest (OvR) or Softmax Regression?</span>
</div>
