# Lecture Notes on Logistic Regression
---

## 1. Introduction to Logistic Regression

**Logistic Regression** is a statistical method for modeling the relationship between a categorical dependent variable and one or more independent variables. Unlike linear regression, which predicts continuous outcomes, logistic regression predicts categorical outcomes, typically binary ones (e.g., yes/no, success/failure).

### What is Logistic Regression?

Logistic regression estimates the probability that a given input belongs to a particular category. It applies the logistic function to the output of a linear equation, mapping that output to a value between 0 and 1, which makes it suitable for binary classification tasks.

### Why is Logistic Regression Important?

1. **Binary Classification**:
   - **Versatile Tool**: Ideal for problems where the outcome is binary, such as spam detection, disease diagnosis, and customer churn prediction.
   - **Probabilistic Interpretation**: Provides probabilities for outcomes, allowing for nuanced decision-making.

2. **Ease of Implementation**:
   - **Simple Concept**: Builds on the familiar linear regression framework, making it easy to understand and implement.
   - **Widely Available**: Supported by numerous statistical and machine learning libraries.

3. **Interpretability**:
   - **Clear Insights**: Coefficients indicate the direction and strength of the relationship between predictors and the outcome.
   - **Feature Importance**: Helps in identifying significant predictors influencing the outcome.

4. **Foundation for Advanced Models**:
   - **Multiclass Extensions**: Forms the basis for more complex models like multinomial and ordinal logistic regression.
   - **Integration with Other Techniques**: Can be combined with regularization methods and used in ensemble techniques.

### Real-World Applications of Logistic Regression

1. **Healthcare**:
   - **Disease Prediction**: Estimating the probability of a patient having a particular disease based on symptoms and test results.
   - **Patient Readmission**: Predicting the likelihood of a patient being readmitted to a hospital within a certain period.

2. **Finance**:
   - **Credit Scoring**: Assessing the probability of a borrower defaulting on a loan.
   - **Fraud Detection**: Identifying potentially fraudulent transactions.

3. **Marketing**:
   - **Customer Churn Prediction**: Estimating the likelihood of customers leaving a service.
   - **Response Modeling**: Predicting the probability of customers responding to a marketing campaign.

4. **Social Sciences**:
   - **Voting Behavior**: Predicting the likelihood of individuals voting for a particular candidate based on demographics.
   - **Educational Outcomes**: Estimating the probability of students graduating based on various factors.

5. **Technology**:
   - **Spam Detection**: Classifying emails as spam or not spam.
   - **Recommendation Systems**: Predicting user preferences and behaviors.

### Simple Example to Illustrate Logistic Regression

**Scenario**:
Imagine you work for a bank and want to predict whether a customer will default on a loan based on their income and credit score. This is a binary classification problem where the outcome is either "default" or "no default."

**Data Collected**:

| Customer | Income (X₁) | Credit Score (X₂) | Default (Y) |
|----------|-------------|-------------------|-------------|
| 1        | 50,000      | 700               | No          |
| 2        | 60,000      | 650               | Yes         |
| 3        | 55,000      | 720               | No          |
| 4        | 45,000      | 600               | Yes         |
| 5        | 70,000      | 750               | No          |
| 6        | 65,000      | 680               | Yes         |
| 7        | 80,000      | 800               | No          |
| 8        | 40,000      | 580               | Yes         |
| 9        | 90,000      | 820               | No          |
| 10       | 35,000      | 550               | Yes         |

**Objective**:
Use logistic regression to predict whether a customer will default on a loan based on their income and credit score.

**Steps**:

1. **Plot the Data**:
   - Create a scatter plot with Income on the X-axis and Credit Score on the Y-axis.
   - Differentiate between default and non-default customers using colors or markers.

2. **Fit the Logistic Regression Model**:
   - Use logistic regression to find the best-fitting model that separates the default and non-default customers.
   - The model will estimate the probability of default based on income and credit score.

3. **Interpret the Results**:
   - **Coefficients**: Indicate the impact of income and credit score on the probability of default.
   - **Odds Ratios**: Show how the odds of default change with a one-unit increase in the predictor variables.

4. **Make Predictions**:
   - Use the model to predict the probability of default for new customers.
   - Classify customers as "default" or "no default" based on a chosen probability threshold (e.g., 0.5).

**Visualization**:
The scatter plot with the logistic regression boundary will show how the model separates the two classes based on income and credit score.

### Why is this Example Useful?

- **Binary Outcome**: Clearly demonstrates a binary classification problem.
- **Real-World Relevance**: Shows how logistic regression is applied in financial decision-making.
- **Interpretability**: Allows for easy interpretation of how predictors influence the probability of default.

### Summary

Logistic regression is a powerful tool for predicting categorical outcomes. Its ability to estimate probabilities makes it invaluable for decision-making in various fields. By understanding and applying logistic regression, you can build models that not only classify outcomes but also provide insights into the factors driving those outcomes.

---

## 2. Key Concepts

Before diving into logistic regression, it's important to understand some basic terms and ideas that form the foundation of this method.

### Dependent and Independent Variables

- **Dependent Variable (Y)**: The outcome we want to predict or explain.
  - *Example*: Default (Yes/No), Disease Status (Positive/Negative), Purchase Decision (Buy/Not Buy).
  
- **Independent Variable (X)**: The input or predictor that influences the dependent variable.
  - *Example*: Income, Credit Score, Age, Number of Previous Purchases.

**Example Scenario**:
If you want to predict whether a customer will default on a loan based on their income and credit score, the default status is the dependent variable, and income and credit score are the independent variables.

### The Logistic Function

The logistic function, also known as the sigmoid function, transforms the output of a linear equation into a probability between 0 and 1. This makes it suitable for binary classification.

**Logistic Function Formula**:

$$
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
$$

- **e**: The base of the natural logarithm (approximately equal to 2.71828).
- $\beta_0$: Intercept term.
- $\beta_1, \beta_2, \dots, \beta_n$: Coefficients for the independent variables.
- $X_1, X_2, \dots, X_n$: Independent variables.


**Graph of Logistic Function**:
The logistic function produces an S-shaped curve that approaches 0 and 1 asymptotically. This characteristic makes it ideal for modeling probabilities.
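
To make the S-shape concrete, the following minimal sketch evaluates the logistic function at a few values of the linear predictor. The function name `sigmoid` and the sample inputs are illustrative choices, not part of the original notes.

```python
import math

def sigmoid(z: float) -> float:
    """Map a linear predictor z = b0 + b1*x1 + ... to a probability in (0, 1).

    This plain form is fine for moderate z; a very large negative z would
    overflow inside math.exp.
    """
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate the curve at a few illustrative points
for z in (-6, -2, 0, 2, 6):
    print(f"z = {z:>2}: probability = {sigmoid(z):.4f}")
```

A linear predictor of 0 gives a probability of exactly 0.5, and the outputs approach 0 and 1 as the predictor moves far below or above zero.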

### Odds and Log-Odds

Understanding odds and log-odds is crucial for interpreting logistic regression coefficients.

- **Odds**:
  - **Definition**: The ratio of the probability of an event occurring to the probability of it not occurring.
  - **Formula**: $\text{Odds} = \frac{P(Y=1)}{P(Y=0)}$
  - **Example**: If $P(Y=1) = 0.2$, then $P(Y=0) = 0.8$ and the odds are $\frac{0.2}{0.8} = 0.25$.

- **Log-Odds (Logit)**:
  - **Definition**: The natural logarithm of the odds.
  - **Formula**: $\text{Log-Odds} = \ln\left(\frac{P(Y=1)}{P(Y=0)}\right)$
  - **Example**: For odds of 0.25, log-odds are $\ln(0.25) \approx -1.386$.

**Relation to Logistic Function**:
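Logistic regression models the log-odds as a linear combination of the independent variables: $\ln\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$. Applying the logistic function to this linear combination recovers the probability $P(Y=1)$.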
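
The conversions between probability, odds, and log-odds can be sketched in a few lines. The helper names below are illustrative, and the probability of 0.2 is chosen only because it gives the odds of 0.25 used in the log-odds example above.

```python
import math

def odds(p: float) -> float:
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)

def log_odds(p: float) -> float:
    """Natural log of the odds (the logit)."""
    return math.log(odds(p))

def probability_from_log_odds(z: float) -> float:
    """Invert the logit; this is the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.2  # illustrative probability
print(f"odds     = {odds(p):.3f}")
print(f"log-odds = {log_odds(p):.3f}")
print(f"back to probability = {probability_from_log_odds(log_odds(p)):.3f}")
```

The round trip from probability to log-odds and back shows that the logit and the logistic function are inverses of each other.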

### Coefficients (Intercept and Slope)

- **Intercept** ($\beta_0$):
  - **What It Represents**: The log-odds of the dependent variable when all independent variables are zero.
  - **Interpretation**: In some cases, it may not have a meaningful real-world interpretation, especially if the independent variables cannot be zero.

- **Slope** ($\beta_1, \beta_2, \dots, \beta_n$):
  - **What They Represent**: The change in the log-odds of the dependent variable for a one-unit increase in the corresponding independent variable.
  - **Interpretation**: Holding the other variables constant, a positive coefficient increases the probability of the event occurring as the variable increases, while a negative coefficient decreases it.

- **Example**:
    If $\beta_1 = 0.5$, then a one-unit increase in $X_1$ increases the log-odds of the event by 0.5. This translates to multiplying the odds by $e^{0.5} \approx 1.65$, meaning the odds increase by 65%.
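
The arithmetic in this example can be checked with a short sketch. The coefficient 0.5 comes from the example above; the intercept and the predictor values are hypothetical and serve only to show that the odds ratio does not depend on where the one-unit increase starts.

```python
import math

beta_1 = 0.5  # coefficient from the example above

# A one-unit increase multiplies the odds by exp(beta_1)
odds_ratio = math.exp(beta_1)
percent_change = (odds_ratio - 1.0) * 100.0
print(f"odds ratio = {odds_ratio:.2f}, change in odds = {percent_change:.0f}%")

# Check on a hypothetical one-predictor model: logit(p) = beta_0 + beta_1 * x
beta_0 = -1.0  # illustrative intercept, not from the notes

def odds_at(x: float) -> float:
    """Odds implied by the model at predictor value x."""
    return math.exp(beta_0 + beta_1 * x)

# The ratio of odds at x = 3 and x = 2 equals exp(beta_1)
ratio = odds_at(3.0) / odds_at(2.0)
print(f"odds(3) / odds(2) = {ratio:.2f}")
```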

---

## 3. Assumptions of Logistic Regression

For logistic regression to provide accurate and reliable results, certain assumptions about the data and the relationship between variables must be met. If these assumptions are violated, the results may be misleading.

### 1. Binary Outcome

- **Definition**: The dependent variable should be binary, meaning it has only two possible outcomes (e.g., Yes/No, 1/0).
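- **Why It Matters**: Standard (binary) logistic regression models the probability of one of exactly two outcomes. Outcomes with more than two categories call for the multinomial or ordinal variants described in Section 4.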
- **How to Check**: Ensure that the dependent variable has exactly two distinct categories.

### 2. Independence of Observations

- **Definition**: Each observation should be independent of the others.
- **Why It Matters**: Violations can lead to biased estimates and incorrect standard errors.
- **How to Check**: Use study design methods to ensure independence. For time series data, consider using models that account for autocorrelation.

### 3. No Multicollinearity

- **Definition**: Independent variables should not be highly correlated with each other.
- **Why It Matters**: High correlation between predictors can make it difficult to assess the individual effect of each variable.
- **How to Check**: Calculate the Variance Inflation Factor (VIF) for each predictor. A VIF value greater than 5 or 10 indicates high multicollinearity.
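
As a rough illustration of the VIF check, the sketch below computes $\text{VIF}_j = 1 / (1 - R_j^2)$ with NumPy, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. The function name is hypothetical, and the data are the income and credit score columns from the loan example in Section 1.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all other columns plus an intercept."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Income and credit score from the loan example in Section 1
income = np.array([50000, 60000, 55000, 45000, 70000,
                   65000, 80000, 40000, 90000, 35000], dtype=float)
credit_score = np.array([700, 650, 720, 600, 750,
                         680, 800, 580, 820, 550], dtype=float)

vifs = variance_inflation_factors(np.column_stack([income, credit_score]))
print(dict(zip(["Income", "Credit_Score"], vifs.round(2))))
```

With only two predictors, both VIF values are identical because each depends solely on the correlation between the pair.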

### 4. Linearity of the Logit

- **Definition**: The logit (log-odds) should have a linear relationship with the independent variables.
- **Why It Matters**: Logistic regression models the logit as a linear combination of the predictors.
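- **How to Check**: Use the Box-Tidwell test, which adds an interaction term between each continuous predictor and its natural log to the model. A significant interaction term suggests that the relationship with the logit is not linear.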

### 5. Large Sample Size

- **Definition**: Logistic regression requires a sufficiently large sample size to provide reliable estimates.
- **Why It Matters**: Small sample sizes can lead to overfitting and unreliable coefficient estimates.
- **How to Check**: Ensure that the sample size is adequate. A common rule of thumb is at least 10 events (cases in the less frequent outcome class) per predictor variable.
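
The events-per-variable rule of thumb is easy to check in code. The sketch below assumes that "events" means cases in the less frequent outcome class, which is a common convention but an assumption here; the function name is hypothetical. The labels are the Default column from the loan example in Section 1.

```python
def events_per_variable(y, n_predictors: int) -> float:
    """Cases in the less frequent outcome class, divided by the number of predictors."""
    n_events = sum(1 for label in y if label == 1)
    n_non_events = len(y) - n_events
    return min(n_events, n_non_events) / n_predictors

# Default column from Section 1 (1 = Yes, 0 = No)
default = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

epv = events_per_variable(default, n_predictors=2)
print(f"events per variable: {epv}")  # 2.5
```

With 5 defaults and 2 predictors, the toy dataset falls well short of 10 events per variable, so it is suitable for illustration only.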

### 6. No Outliers or Influential Points

- **Definition**: Data points should not unduly influence the model.
- **Why It Matters**: Outliers can skew the results and lead to inaccurate predictions.
- **How to Check**: Use diagnostic measures like Cook’s distance, leverage values, and standardized residuals to identify and address influential points.

**Simple Tip**: Always visualize your data and perform diagnostic checks to ensure these assumptions hold before trusting your logistic regression results. If any assumptions are violated, consider data transformation, removing outliers, or using alternative modeling techniques.

---

## 4. Types of Logistic Regression

Logistic regression comes in several variants, distinguished by the nature of the dependent variable: binary, nominal with more than two categories, or ordinal.

### Binary Logistic Regression

- **Definition**: Models the probability of a binary outcome based on one or more independent variables.
- **Equation**: $\text{Logit}(P) = \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n$
- **Use Case**: When the dependent variable has two categories, such as default/no default, success/failure.

**Example**:
Predicting whether a customer will default on a loan based on income and credit score.

### Multinomial Logistic Regression

- **Definition**: Extends binary logistic regression to handle dependent variables with more than two categories that are nominal (no intrinsic order).
- **Equation**: $\text{Logit}(P_i) = \ln\left(\frac{P_i}{P_K}\right) = \beta_{0i} + \beta_{1i}X_1 + \beta_{2i}X_2 + \dots + \beta_{ni}X_n$
  - $P_i$: Probability of outcome $i$
  - $P_K$: Probability of the reference outcome $K$
- **Use Case**: When the dependent variable has multiple categories, such as types of housing (apartment, house, condo).

**Example**:
Classifying vehicles into categories like sedan, SUV, and truck based on features like engine size and fuel efficiency.
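
The multinomial equation can be turned into class probabilities by treating the reference outcome as having a linear predictor of zero. The sketch below does this for the vehicle example; all coefficients and feature values are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

def multinomial_probabilities(x: np.ndarray,
                              intercepts: np.ndarray,
                              coefs: np.ndarray) -> np.ndarray:
    """Class probabilities for a multinomial logit with the last class as reference.

    intercepts has shape (K-1,), coefs has shape (K-1, n_features).
    The reference class K has linear predictor 0.
    """
    eta = intercepts + coefs @ x               # ln(P_i / P_K) for i = 1..K-1
    eta_all = np.append(eta, 0.0)              # append the reference class
    exp_eta = np.exp(eta_all - eta_all.max())  # shift for numerical stability
    return exp_eta / exp_eta.sum()

# Hypothetical inputs: engine size (litres) and fuel efficiency (mpg)
x = np.array([2.0, 30.0])
# Hypothetical coefficients for sedan and SUV, with truck as the reference
intercepts = np.array([1.0, 0.5])
coefs = np.array([[-0.8, 0.05],
                  [0.3, -0.02]])

probs = multinomial_probabilities(x, intercepts, coefs)
print(dict(zip(["sedan", "SUV", "truck"], probs.round(3))))
```

The log-ratio of each non-reference probability to the reference probability reproduces the corresponding linear predictor, which matches the equation above.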

### Ordinal Logistic Regression

- **Definition**: Used when the dependent variable has more than two categories with a natural order but unknown spacing between categories.
- **Equation**: $\text{Logit}(P(Y \leq j)) = \alpha_j - (\beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n)$
  - $\alpha_j$: Threshold parameters for each category
- **Use Case**: When the dependent variable is ordinal, such as customer satisfaction ratings (poor, fair, good, very good, excellent).


**Example**:
Predicting the likelihood of students achieving grades A, B, C, D, or F based on study hours and attendance.
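
To show how the ordinal equation yields a probability for each grade, the sketch below computes the cumulative probabilities $P(Y \leq j)$ and takes successive differences. This form, with one shared set of coefficients across thresholds, is commonly called the proportional-odds model. The thresholds, coefficients, and student values are hypothetical.

```python
import numpy as np

def ordinal_probabilities(x: np.ndarray,
                          thresholds: np.ndarray,
                          beta: np.ndarray) -> np.ndarray:
    """Category probabilities from Logit(P(Y <= j)) = alpha_j - beta . x.

    thresholds must be increasing: alpha_1 < ... < alpha_{J-1} for J categories.
    """
    eta = beta @ x
    cumulative = 1.0 / (1.0 + np.exp(-(thresholds - eta)))   # P(Y <= j), j = 1..J-1
    cumulative = np.concatenate([[0.0], cumulative, [1.0]])  # pad with 0 and 1
    return np.diff(cumulative)                               # P(Y = j)

# Grades ordered from lowest to highest
grades = ["F", "D", "C", "B", "A"]
thresholds = np.array([-1.0, 0.5, 2.0, 3.5])  # hypothetical alpha_j values
beta = np.array([0.15, 0.02])                 # hypothetical coefficients
x = np.array([10.0, 90.0])                    # study hours per week, attendance (%)

probs = ordinal_probabilities(x, thresholds, beta)
print(dict(zip(grades, probs.round(3))))
```

Because the linear predictor is subtracted from each threshold, larger predictor values shift probability toward the higher categories when the coefficients are positive.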

### Summary

Understanding the different types of logistic regression allows you to choose the appropriate model based on the nature of your dependent variable and the specific requirements of your analysis. Binary logistic regression is the most common, but multinomial and ordinal logistic regression extend the applicability of logistic models to more complex classification tasks.

---

## 5. Evaluating Logistic Regression Models

After building a logistic regression model, it's important to evaluate how well it performs. This helps in understanding the model's accuracy and reliability.

### Accuracy

- **Definition**: The proportion of correctly predicted instances out of all instances.
- **Formula**: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
- **Interpretation**:
  - **High Accuracy**: Indicates that the model correctly predicts a large proportion of instances.
  - **Limitations**: Can be misleading in imbalanced datasets where one class dominates.

**Simple Example**:
If a model correctly predicts 90 out of 100 instances, its accuracy is 90%.

### Confusion Matrix

- **Definition**: A table that summarizes the performance of a classification model by showing the true versus predicted classifications.
- **Components**:
  - **True Positives (TP)**: Correctly predicted positive instances.
  - **True Negatives (TN)**: Correctly predicted negative instances.
  - **False Positives (FP)**: Incorrectly predicted positive instances.
  - **False Negatives (FN)**: Incorrectly predicted negative instances.

|                     | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| **Actual Positive** | TP                 | FN                 |
| **Actual Negative** | FP                 | TN                 |

**Use Case**:
Helps in calculating other evaluation metrics and understanding the types of errors the model makes.
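
The four cells of the table can be counted directly from true and predicted labels. The following sketch uses made-up labels and a hypothetical helper name, and treats 1 as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn

# Made-up labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Same layout as the table above: rows are actual, columns are predicted
print(f"Actual Positive: TP={tp}  FN={fn}")
print(f"Actual Negative: FP={fp}  TN={tn}")
print(f"Accuracy: {accuracy:.2f}")
```

Note that scikit-learn's `confusion_matrix` orders rows and columns by sorted label, so for 0/1 labels its array reads `[[TN, FP], [FN, TP]]`, which differs from the layout of the table above.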

### Precision, Recall, F1-Score

- **Precision**:
  - **Definition**: The proportion of true positives out of all predicted positives.
  - **Formula**: $\text{Precision} = \frac{TP}{TP + FP}$
  - **Interpretation**: High precision indicates that few of the instances predicted as positive are actually negative.
    
- **Recall (Sensitivity)**:
  - **Definition**: The proportion of true positives out of all actual positives.
  - **Formula**: $\text{Recall} = \frac{TP}{TP + FN}$
  - **Interpretation**: High recall indicates that the model captures most of the actual positives.
    
- **F1-Score**:
  - **Definition**: The harmonic mean of precision and recall.
  - **Formula**: $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  - **Interpretation**: Balances precision and recall, providing a single metric for model performance.

**Simple Example**:
If a model has Precision = 0.8 and Recall = 0.6, the F1-Score is:
$F1 = 2 \times \frac{0.8 \times 0.6}{0.8 + 0.6} = \frac{0.96}{1.4} \approx 0.69$
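
The same calculation can be reproduced from confusion-matrix counts. The counts below are hypothetical, chosen so that precision is 0.8 and recall is 0.6 as in the example.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 12 true positives, 3 false positives, 8 false negatives
precision, recall, f1 = precision_recall_f1(tp=12, fp=3, fn=8)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than the arithmetic mean (0.70) would.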


### ROC Curve and AUC

- **ROC Curve (Receiver Operating Characteristic Curve)**:
  - **Definition**: A graphical representation of a model's diagnostic ability by plotting the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at various threshold settings.
  - **Interpretation**: The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the model.

- **AUC (Area Under the ROC Curve)**:
  - **Definition**: Measures the entire two-dimensional area underneath the ROC curve.
  - **Range**: 0 to 1
    - **AUC = 1**: Perfect model.
    - **AUC = 0.5**: Model performs no better than random guessing. Values below 0.5 indicate a ranking that is worse than random.
  - **Interpretation**: Higher AUC indicates better model performance.

**Simple Example**:
An AUC of 0.85 suggests that the model has a good ability to distinguish between the two classes.
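
One way to build intuition for AUC is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below computes AUC this way on a tiny made-up example; the function name is hypothetical.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = auc_by_pairs(y_true, scores)
print(f"AUC: {auc_value:.2f}")  # 0.75: three of the four pairs are ranked correctly
```

This pairwise loop is quadratic in the number of instances, so it is meant for understanding rather than for large datasets.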

### Constructive Evaluation Strategy

- **Use Multiple Metrics**: Relying solely on accuracy can be misleading, especially with imbalanced datasets. Combine accuracy with precision, recall, F1-Score, and AUC for a comprehensive evaluation.
- **Confusion Matrix Analysis**: Examine the confusion matrix to understand the types of errors the model is making.
- **ROC Curve Analysis**: Use ROC curves and AUC to assess the trade-off between true positive rates and false positive rates.
- **Cross-Validation**: Implement cross-validation techniques to ensure that the model's performance is consistent across different subsets of the data.
- **Threshold Optimization**: Adjust the decision threshold based on the specific requirements of your application (e.g., minimizing false negatives in disease diagnosis).

**Simple Tip**: Always consider the context and the costs associated with different types of errors when choosing evaluation metrics and thresholds.
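
The effect of the decision threshold on the two error types can be sketched with a simple sweep. The labels, probabilities, and thresholds below are made up for illustration, and the helper names are hypothetical.

```python
def classify(probabilities, threshold):
    """Label an instance positive when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def error_counts(y_true, y_pred):
    """Return (false positives, false negatives)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn

# Made-up true labels and predicted probabilities
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.92, 0.71, 0.45, 0.55, 0.30, 0.35, 0.20, 0.08]

results = {}
for threshold in (0.3, 0.5, 0.7):
    fp, fn = error_counts(y_true, classify(probs, threshold))
    results[threshold] = (fp, fn)
    print(f"threshold = {threshold}: false positives = {fp}, false negatives = {fn}")
```

On this toy data, lowering the threshold to 0.3 removes all false negatives at the cost of two false positives, which is the kind of trade-off a disease-screening application might accept.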

---

## 6. Sample Code: Implementing Logistic Regression in Python

Let's see how to perform logistic regression using Python. We'll use two popular libraries: **Scikit-learn** and **Statsmodels**.

### Using Scikit-learn

**Scikit-learn** is a powerful library for machine learning tasks. It provides simple and efficient tools for data analysis and modeling.

```python
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc

# Sample Dataset: Income and Credit Score vs. Default
data = {
    'Income': [50000, 60000, 55000, 45000, 70000, 65000, 80000, 40000, 90000, 35000],
    'Credit_Score': [700, 650, 720, 600, 750, 680, 800, 580, 820, 550],
    'Default': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # 0: No Default, 1: Default
}
df = pd.DataFrame(data)

# Define Independent Variables (X) and Dependent Variable (Y)
X = df[['Income', 'Credit_Score']]
Y = df['Default']

# Split the dataset into Training and Testing sets (80% train, 20% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model using the training data
model.fit(X_train, Y_train)

# Make predictions on the test set
Y_pred = model.predict(X_test)
Y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability estimates for the positive class

# Evaluate the model
accuracy = accuracy_score(Y_test, Y_pred)
conf_matrix = confusion_matrix(Y_test, Y_pred)
class_report = classification_report(Y_test, Y_pred)

# Calculate ROC Curve and AUC
fpr, tpr, thresholds = roc_curve(Y_test, Y_pred_proba)
roc_auc = auc(fpr, tpr)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
print(f"AUC: {roc_auc:.2f}")

# Visualize the ROC Curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

# Visualize the Decision Boundary (if applicable)
# Note: With two features, we can plot the decision boundary
from matplotlib.colors import ListedColormap

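# Define the mesh grid with a fixed number of points per axis
# (a 0.02 step over the income range would create billions of grid points)
x_min, x_max = X['Income'].min() - 1000, X['Income'].max() + 1000
y_min, y_max = X['Credit_Score'].min() - 20, X['Credit_Score'].max() + 20
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))

# Predict classifications for each point in the mesh
# (wrap the grid in a DataFrame so the column names match the training data)
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=X.columns)
Z = model.predict(grid).reshape(xx.shape)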

# Define custom colors
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])

# Plot the predicted class regions
plt.figure()
plt.contourf(xx, yy, Z, cmap=cmap_light)

# Overlay the test points, colored by their true class
plt.scatter(X_test['Income'], X_test['Credit_Score'], c=Y_test, cmap=cmap_bold, edgecolor='k', s=100)
plt.xlabel('Income')
plt.ylabel('Credit Score')
plt.title('Logistic Regression Decision Boundary')
plt.show()
