# Logistic Regression

## 1. Introduction to Classification
In Linear Regression, we predict a continuous quantitative value (e.g., house price). However, in many real-world problems, we want to predict a **category** or **class**.

Examples:
*   **Email**: Is this email Spam or Not Spam?
*   **Medical**: Does this patient have Heart Disease or Not?
*   **Finance**: Will this customer Default on their loan or Not?

These are **Classification** problems. The response variable $Y$ is qualitative (e.g., $Y \in \{0, 1\}$).

## 2. Why not Linear Regression?
You might be tempted to use Linear Regression for a binary outcome (0 or 1). However, this has major issues:
1.  **Unbounded Output**: Linear regression can predict values like -0.5 or 1.2, which don't make sense as probabilities.
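2.  **Violates Assumptions**: With a binary response, the error for a given $X$ can take only two values, so the errors are neither normally distributed nor of constant variance.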

Instead, we model the **Probability** that $Y$ belongs to a particular category.
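
To make the unbounded-output issue concrete, here is a minimal sketch that fits a straight line to a binary response. The toy data (one feature, with the label switching from 0 to 1 at zero) is made up for illustration and is not part of the example used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the label is 0 for x <= 0 and 1 for x > 0
x = np.linspace(-3, 3, 61).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

# Fit an ordinary least squares line to the 0/1 labels
lin_reg = LinearRegression().fit(x, y)

# Predict at the two ends and the middle of the observed feature range
x_new = np.array([[-3.0], [0.0], [3.0]])
y_hat = lin_reg.predict(x_new)
print(y_hat)
```

On this data the fitted line gives a value below 0 at the left end and above 1 at the right end, so its output cannot be read as a probability.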

## 3. The Logistic Function
To ensure our prediction falls between 0 and 1, we use the **Logistic Function** (also called the sigmoid function).

$$ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} $$

*   If $\beta_0 + \beta_1 X$ is large and positive, $p(X) \approx 1$.
*   If $\beta_0 + \beta_1 X$ is large and negative, $p(X) \approx 0$.

This creates an **S-shaped curve** rather than a straight line.
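
The limiting behaviour can be checked numerically. The sketch below implements the formula with NumPy. The helper name `logistic` and the coefficient values are assumptions chosen only for illustration.

```python
import numpy as np

def logistic(x, beta0, beta1):
    """Return p(X) = e^(beta0 + beta1*x) / (1 + e^(beta0 + beta1*x))."""
    z = beta0 + beta1 * np.asarray(x, dtype=float)
    # Algebraically equivalent form: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = 0.0, 1.0
x_values = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = logistic(x_values, beta0, beta1)
print(np.round(p, 4))
```

With these coefficients, the probability is exactly 0.5 at $X = 0$, and it gets close to 0 and 1 at the extremes without reaching either.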

### Log-Odds (Logit)
By rearranging the equation, we get a linear relationship between $X$ and the **log-odds**:

$$ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X $$

*   The quantity $\frac{p(X)}{1-p(X)}$ is called the **Odds** (e.g., 4:1 odds means 80% probability).
*   Increasing $X$ by one unit changes the **log-odds** by $\beta_1$, or equivalently multiplies the odds by $e^{\beta_1}$.
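
Both statements can be verified with a few lines of arithmetic. In this sketch the helpers `odds` and `logistic` and the values of `beta0`, `beta1`, and `x` are hypothetical, picked only to illustrate the relationships.

```python
import numpy as np

def odds(p):
    """Convert a probability to odds p / (1 - p)."""
    return p / (1.0 - p)

def logistic(z):
    """Logistic function applied to the linear predictor z."""
    return 1.0 / (1.0 + np.exp(-z))

# An 80% probability corresponds to odds of 4 (that is, 4:1)
odds_80 = odds(0.8)

# Hypothetical coefficients and starting point, for illustration only
beta0, beta1 = -1.0, 0.5
x = 2.0

log_odds_x = np.log(odds(logistic(beta0 + beta1 * x)))
log_odds_x_plus_1 = np.log(odds(logistic(beta0 + beta1 * (x + 1.0))))

# A one-unit increase in x shifts the log-odds by beta1 ...
delta_log_odds = log_odds_x_plus_1 - log_odds_x
# ... which multiplies the odds by exp(beta1)
odds_ratio = np.exp(delta_log_odds)

print(odds_80, delta_log_odds, odds_ratio)
```

The change in probability itself is not constant. It depends on the starting value of $X$, which is why coefficients are usually interpreted on the log-odds or odds scale.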

## 4. Estimating Coefficients (Maximum Likelihood)
Unlike Linear Regression, which uses Least Squares (minimizing the sum of squared residuals), Logistic Regression uses **Maximum Likelihood Estimation (MLE)**.

**Intuition**: We search for $\beta_0$ and $\beta_1$ such that the predicted probabilities correspond as closely as possible to the observed outcomes.
*   If a person has the disease ($Y=1$), we want their $p(X)$ to be close to 1.
*   If a person does not have the disease ($Y=0$), we want their $p(X)$ to be close to 0.
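
Formally, MLE maximizes the log-likelihood $\ell(\beta_0, \beta_1) = \sum_i \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right]$. The sketch below maximizes it with plain gradient ascent on simulated data. The "true" coefficients, the learning rate, and the iteration count are assumptions for illustration. Library implementations such as scikit-learn use more sophisticated solvers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a known logistic model (hypothetical "true" coefficients)
true_beta0, true_beta1 = -0.5, 2.0
x = rng.normal(size=2000)
p_true = 1.0 / (1.0 + np.exp(-(true_beta0 + true_beta1 * x)))
y = rng.binomial(1, p_true)

def log_likelihood(beta0, beta1):
    """Log-likelihood of the observed labels under the logistic model."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient ascent on the average log-likelihood, starting from zero
beta0, beta1 = 0.0, 0.0
learning_rate = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    # Partial derivatives of the average log-likelihood
    beta0 += learning_rate * np.mean(y - p)
    beta1 += learning_rate * np.mean((y - p) * x)

print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")
print(f"log-likelihood at start: {log_likelihood(0.0, 0.0):.1f}")
print(f"log-likelihood at fit:   {log_likelihood(beta0, beta1):.1f}")
```

The estimates should land near the values used to simulate the data, up to sampling noise, and the log-likelihood at the fitted coefficients is higher than at the starting point.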

## 5. Implementation Example
We will use `sklearn` to implement Logistic Regression on a synthetic dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# 1. Generate synthetic classification data
# 1000 samples, 1 feature for easy visualization
X, y = make_classification(n_samples=1000, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# 2. Split into Training and Testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Fit the Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# 4. Making Predictions
# predict_proba gives the probability (e.g., 0.85)
y_prob = log_reg.predict_proba(X_test)[:, 1]
# predict gives the class label (e.g., 1)
y_pred = log_reg.predict(X_test)

# 5. Evaluation
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
```
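
To connect the fitted model back to the formula in Section 3, the following sketch reads the estimated coefficients from the model and recomputes the probabilities by hand. It repeats the data generation and fit from above so that it runs on its own. The names `beta0`, `beta1`, `p_manual`, and `p_sklearn` are introduced here for illustration. Note that `LogisticRegression` applies L2 regularization by default, so its estimates are slightly shrunk relative to a pure maximum likelihood fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split, and model as in the example above
X, y = make_classification(n_samples=1000, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
log_reg = LogisticRegression().fit(X_train, y_train)

# Estimated intercept (beta_0) and slope (beta_1)
beta0 = log_reg.intercept_[0]
beta1 = log_reg.coef_[0, 0]
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")

# Plug the coefficients into the logistic function by hand
z = beta0 + beta1 * X_test[:, 0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with the probabilities reported by scikit-learn
p_sklearn = log_reg.predict_proba(X_test)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```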

### Visualizing the S-Curve
Below, we visualize how the logistic regression fits the data points (red) with a smooth sigmoid curve (blue). The decision boundary is at Probability = 0.5.

```python
plt.figure(figsize=(10, 6))

# Scatter plot of actual test data (0 or 1)
plt.scatter(X_test, y_test, color='red', alpha=0.3, label='Test Data Points')

# Generate a range of X values to plot the smooth curve
X_range = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
y_range_prob = log_reg.predict_proba(X_range)[:, 1]

# Plot the Logistic Function
plt.plot(X_range, y_range_prob, color='blue', linewidth=3, label='Logistic Sigmoid Curve')

# Decision Boundary (0.5 probability)
plt.axhline(0.5, color='gray', linestyle='--', label='Decision Boundary (P=0.5)')

plt.xlabel('Feature Value')
plt.ylabel('Probability of Class 1')
plt.legend()
plt.title('Logistic Regression Fit')
plt.show()
```
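
The horizontal line at 0.5 corresponds to a single feature value: setting $p(X) = 0.5$ makes the log-odds zero, so the boundary lies at $X = -\beta_0 / \beta_1$. The sketch below computes that point and checks that `predict` agrees with thresholding `predict_proba` at 0.5. It refits the same model so that it is self-contained. The names `x_boundary` and `y_pred_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split, and model as in the implementation example
X, y = make_classification(n_samples=1000, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
log_reg = LogisticRegression().fit(X_train, y_train)

beta0 = log_reg.intercept_[0]
beta1 = log_reg.coef_[0, 0]

# Feature value where the predicted probability equals 0.5
x_boundary = -beta0 / beta1
p_at_boundary = log_reg.predict_proba([[x_boundary]])[0, 1]
print(f"boundary at x = {x_boundary:.3f}, p = {p_at_boundary:.3f}")

# Class labels obtained by thresholding the probabilities at 0.5
y_prob = log_reg.predict_proba(X_test)[:, 1]
y_pred_manual = (y_prob > 0.5).astype(int)
print(np.array_equal(y_pred_manual, log_reg.predict(X_test)))
```

A threshold other than 0.5 can be applied to `y_prob` in the same way when the two kinds of misclassification have different costs.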

## 6. Quiz

Test your understanding of Logistic Regression.

**Q1. What is the range of output for the Logistic Function?**
A) $(-\infty, +\infty)$
B) $[0, 1]$
C) $[-1, 1]$

**Q2. If the coefficient $\beta_1$ is positive, what does it imply?**
A) Increasing X increases the probability of $Y=1$.
B) Increasing X decreases the probability of $Y=1$.
C) X has no effect on Y.

**Q3. Which method is used to estimate parameters in Logistic Regression?**
A) Least Squares
B) Maximum Likelihood Estimation (MLE)
C) Gini Index

---
### Sample Answers
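**Q1:** B) $[0, 1]$ is the intended answer, since probabilities must lie between 0 and 1. Strictly speaking, the logistic function only approaches 0 and 1 without reaching them, so its range is the open interval $(0, 1)$.
**Q2:** A). A positive coefficient means the log-odds (and thus the probability) increase as X increases.
**Q3:** B). MLE finds the parameters that maximize the likelihood of the observed data.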