# Logistic Regression (Theoretical and Practical Implementation)

## 🎓 Part 1: Theoretical Explanation

### 1. Introduction to Logistic Regression
- **Definition**: Logistic Regression is a statistical model used for binary classification problems. Despite its name, it is used for classification rather than regression tasks.
- **Goal**: Predict the probability that a given input belongs to a particular category (class 0 or class 1).

### 2. Why Not Linear Regression for Classification?
- Linear regression outputs continuous values, which can exceed the [0,1] range.
- Classification needs an output that can be interpreted as the probability of a class, which an unbounded value cannot provide.
- Logistic regression addresses this by using the **sigmoid (logistic) function** to constrain output between 0 and 1.
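
The range problem described above can be seen on a tiny example. The sketch below fits an ordinary least-squares line to made-up binary labels (the data and variable names are illustrative assumptions, not part of the lesson's dataset) and then evaluates it outside the training range. Passing the same score through the sigmoid only illustrates the range constraint; it is not a fitted logistic regression.

```python
import numpy as np

# Toy 1-D data (hypothetical): the label is 1 when x is large.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Ordinary least-squares fit of y on x (a degree-1 polynomial).
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line at inputs beyond the training range.
x_new = np.array([-3.0, 10.0])
linear_pred = slope * x_new + intercept

# Squashing the same linear score with the sigmoid keeps it inside (0, 1).
squashed = 1.0 / (1.0 + np.exp(-linear_pred))

print("linear predictions:", linear_pred)  # one value below 0, one above 1
print("after sigmoid:     ", squashed)
```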

### 3. The Logistic (Sigmoid) Function
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
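- S-shaped curve that increases monotonically, with $\sigma(0) = 0.5$.
- Converts the linear combination of inputs into a probability.
- Output lies strictly between 0 and 1, approaching 0 as $z \to -\infty$ and 1 as $z \to +\infty$.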
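
A minimal NumPy sketch of the function, assuming element-wise evaluation over an array of scores (the helper name `sigmoid` is chosen here for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)
print(p)  # small for negative z, exactly 0.5 at z = 0, close to 1 for positive z
```

The symmetry $\sigma(-z) = 1 - \sigma(z)$ means the outputs for $z = -5$ and $z = 5$ sum to 1.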

### 4. Model Representation
- Input features: $x = (x_1, x_2, ..., x_n)$
- Parameters: $\beta = (\beta_0, \beta_1, ..., \beta_n)$
- Linear combination: $z = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n$
- Prediction: $P(y=1|x) = \sigma(z) = \frac{1}{1 + e^{-z}}$
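
As a worked example of the representation above, the sketch below computes $z$ and $P(y=1|x)$ for one input with two features. The parameter and feature values are made up purely for illustration.

```python
import numpy as np

# Hypothetical parameters: intercept beta_0 and weights (beta_1, beta_2).
beta0 = -1.0
beta = np.array([0.8, -0.5])

# One input with two features (x_1, x_2).
x = np.array([2.0, 1.0])

# Linear combination: z = beta_0 + beta_1 * x_1 + beta_2 * x_2
z = beta0 + beta @ x          # -1 + 1.6 - 0.5 = 0.1

# Prediction: P(y = 1 | x) = sigma(z)
p = 1.0 / (1.0 + np.exp(-z))

print("z =", z)
print("P(y=1|x) =", p)
```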

### 5. Cost Function: Binary Cross-Entropy (Log-Loss)
$$
L(\beta) = - \sum_{i=1}^m \left[y_i \log(p_i) + (1 - y_i) \log(1 - p_i)\right]
$$
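- Here $p_i = P(y_i = 1 | x_i) = \sigma(z_i)$ is the predicted probability for example $i$, and $m$ is the number of training examples.
- Measures how well the predicted probabilities match the actual labels; confident wrong predictions are penalized most heavily.
- Convex in $\beta$, so there are no local minima to trap Gradient Descent.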
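
The loss can be computed directly from labels and predicted probabilities. The sketch below is one way to do it; the function name and the `eps` clipping constant are assumptions added here to avoid evaluating $\log(0)$, and the labels and probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Summed log-loss: -sum(y * log(p) + (1 - y) * log(1 - p))."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.sum(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y = [1, 0, 1]
mostly_right = [0.9, 0.1, 0.8]   # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2]   # high probability on the wrong class

loss_good = binary_cross_entropy(y, mostly_right)
loss_bad = binary_cross_entropy(y, mostly_wrong)
print("loss (good predictions):", loss_good)
print("loss (bad predictions): ", loss_bad)
```

The second set of predictions is confidently wrong on every example, so its loss is much larger than the first.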

### 6. Model Optimization
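- **Gradient Descent**: Iteratively updates the parameters $\beta$ in the direction that reduces the loss.
- **Regularization**: Adds a penalty on coefficient size to the loss to reduce overfitting.
  - L1 (Lasso): Penalizes the sum of absolute coefficient values; promotes sparsity by driving some coefficients to exactly zero.
  - L2 (Ridge): Penalizes the sum of squared coefficients, shrinking large coefficients toward zero.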
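
A from-scratch sketch of batch gradient descent with an optional L2 penalty follows. It is an illustration under stated assumptions, not how scikit-learn fits the model: the function name, learning rate, iteration count, and synthetic data are all chosen here, and the gradient is taken of the mean log-loss (dividing by $m$ only rescales the step size).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000, l2=0.0):
    """Batch gradient descent on the mean log-loss with an optional L2 penalty."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a column of ones for beta_0
    beta = np.zeros(n + 1)
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        grad = Xb.T @ (p - y) / m          # gradient of the mean log-loss
        grad[1:] += l2 * beta[1:]          # L2 term; the intercept is not penalized
        beta -= lr * grad
    return beta

# Synthetic, linearly separable data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

beta_plain = fit_logistic_gd(X, y)
beta_ridge = fit_logistic_gd(X, y, l2=0.1)

# Training accuracy of the unregularized fit.
Xb = np.hstack([np.ones((len(X), 1)), X])
accuracy = np.mean((sigmoid(Xb @ beta_plain) >= 0.5) == y)

print("unregularized weights:", beta_plain)
print("L2-regularized weights:", beta_ridge)
print("training accuracy:", accuracy)
```

On separable data the unregularized weights keep growing with more iterations, while the L2 penalty holds them at a smaller norm.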

### 7. Decision Boundary
- Class prediction:
$$
y = \begin{cases} 1 & \text{if } P(y=1|x) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}
$$
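- Because $P(y=1|x) \geq 0.5$ exactly when $z \geq 0$, the boundary is the set of points where $z = 0$: a straight line separating the classes in a 2D feature space, and a hyperplane in higher dimensions.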
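
For two features, setting $z = 0$ and solving for $x_2$ gives the boundary line $x_2 = -(\beta_0 + \beta_1 x_1) / \beta_2$. The sketch below checks this with made-up coefficients; the helper names are hypothetical.

```python
import numpy as np

# Hypothetical coefficients for a model with two features.
beta0, beta1, beta2 = -1.0, 2.0, 1.0

def predict_class(x1, x2, threshold=0.5):
    """Return 1 if P(y=1|x) >= threshold, else 0."""
    z = beta0 + beta1 * x1 + beta2 * x2
    p = 1.0 / (1.0 + np.exp(-z))
    return int(p >= threshold)

def boundary_x2(x1):
    """x2 value on the decision boundary (z = 0) for a given x1."""
    return -(beta0 + beta1 * x1) / beta2

x1 = 0.25
x2_on_line = boundary_x2(x1)                         # -(-1 + 0.5) / 1 = 0.5

print("boundary x2 at x1=0.25:", x2_on_line)
print("just above the line:", predict_class(x1, x2_on_line + 0.1))
print("just below the line:", predict_class(x1, x2_on_line - 0.1))
```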

### 8. Model Assumptions
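- Linearity in the log-odds: $\log\frac{p}{1-p}$ is a linear function of the features.
- Little or no multicollinearity: features should not be strongly correlated with one another, otherwise the coefficient estimates become unstable.
- Independence of observations: each example is sampled independently of the others.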
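
One common way to check for multicollinearity is the variance inflation factor (VIF), which for feature $j$ equals $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on the others. The sketch below uses the fact that the VIFs are the diagonal of the inverse correlation matrix. The synthetic data is an assumption made for illustration, and the often-quoted cutoff of about 10 is a rule of thumb rather than a fixed standard.

```python
import numpy as np

# Synthetic features (hypothetical): c is almost a copy of a, b is unrelated.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)
X = np.column_stack([a, b, c])

# Correlation matrix of the columns, then VIFs from its inverse.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["a", "b", "c"], vif):
    print(f"VIF({name}) = {value:.1f}")
```

Features `a` and `c` receive very large VIFs because each is almost fully explained by the other, while `b` stays close to 1.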

### 9. ROC Curve (Receiver Operating Characteristic)
- **Definition**: A plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
- **True Positive Rate (Recall)**: $TPR = \frac{TP}{TP + FN}$
- **False Positive Rate**: $FPR = \frac{FP}{FP + TN}$
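- Helps visualize the trade-off between sensitivity (TPR) and specificity ($1 - FPR$) as the threshold changes.
- **AUC (Area Under Curve)**: Measures the overall ability of the model to distinguish between classes. An AUC of 0.5 corresponds to random guessing; an AUC close to 1.0 indicates that the model almost always ranks positive examples above negative ones.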
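
The definitions above can be checked by hand on a few made-up scores. The sketch computes TPR and FPR at several thresholds, then computes AUC through its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The labels, scores, and helper name are illustrative assumptions, and the pairwise formula as written assumes there are no tied scores.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for six examples.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates when predicting 1 for scores >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

for t in (0.3, 0.5, 0.75):
    tpr, fpr = tpr_fpr(y_true, scores, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")

# AUC as the share of positive/negative pairs that are ranked correctly.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
auc = np.mean(pos[:, None] > neg[None, :])
print("AUC:", auc)
```

Lowering the threshold raises TPR but also raises FPR, which is the trade-off the ROC curve traces out.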

### 10. Model Interpretation
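- The coefficients $\beta_i$ give the change in the log-odds of the outcome for a one-unit increase in feature $x_i$, holding the other features fixed.
- To interpret:
  - Convert to **odds ratio**: $OR = e^{\beta_i}$
  - $OR > 1$: Increasing the feature increases the odds of the positive class.
  - $OR < 1$: Increasing the feature decreases the odds.
- Useful for understanding the direction of each feature's influence. Coefficient magnitudes are only comparable as feature importance when the features are on the same scale.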
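
A small numeric check of the odds-ratio reading, using a made-up one-feature model: raising $x$ by one unit multiplies the odds by $e^{\beta_1}$, regardless of the starting value of $x$.

```python
import math

# Hypothetical one-feature model: log-odds = beta0 + beta1 * x
beta0, beta1 = -2.0, 0.7

def odds(x):
    """Odds p / (1 - p) of the positive class at feature value x."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return p / (1.0 - p)

odds_ratio = math.exp(beta1)

# Ratio of odds for a one-unit increase, starting from two different values of x.
ratio_low = odds(1.0) / odds(0.0)
ratio_high = odds(5.0) / odds(4.0)

print("exp(beta1):          ", odds_ratio)
print("odds(1) / odds(0):   ", ratio_low)
print("odds(5) / odds(4):   ", ratio_high)
```

All three printed values agree up to floating-point error, even though the change in probability itself differs between the two starting points.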

## 💪 Part 2: Practical Implementation (Python)

```python
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
```

```python
# Load dataset
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()
```

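```python
# Data preprocessing
# Count missing values per column. print() is needed because only the last
# expression in a notebook cell is displayed automatically.
print(df.isnull().sum())

# Separate the features from the label.
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```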

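```python
# Model training and evaluation
# A high max_iter gives the solver room to converge on the unscaled features.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predict class labels (threshold 0.5) and compare them with the true labels.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```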
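
As an optional variation, the features can be standardized before fitting, which usually lets the solver converge in far fewer iterations. The sketch below is self-contained and makes two choices that are assumptions of this example rather than part of the lesson's code: a `StandardScaler` inside a pipeline, and a stratified split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline learns the scaling on the training data only,
# then applies the same scaling when predicting on the test data.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

accuracy = scaled_model.score(X_test, y_test)
print("Test accuracy with scaling:", accuracy)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.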

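```python
# ROC curve
# Probability of class 1. In this dataset, 1 means benign and 0 means malignant.
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

plt.plot(fpr, tpr, label="Logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing")  # AUC = 0.5 baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

print("AUC Score:", roc_auc_score(y_test, y_proba))
```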

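```python
# Model interpretation
# model.coef_ has shape (1, n_features) for a binary problem, so take row 0.
coeff_df = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]})
# Odds ratio per one-unit increase in the raw feature. The features are on
# different scales, so the ranking is not a direct measure of importance.
coeff_df['Odds Ratio'] = np.exp(coeff_df['Coefficient'])
coeff_df.sort_values(by='Odds Ratio', ascending=False)
```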

## 📆 Assignment for Students
**Objective**: Apply logistic regression to a new dataset.
1. Choose a binary classification dataset from Kaggle or UCI (e.g., Titanic, Pima Indians Diabetes).
2. Perform EDA (Exploratory Data Analysis).
3. Preprocess the data (handle missing values, encode categories, normalize if needed).
4. Implement logistic regression.
5. Evaluate the model using confusion matrix, precision, recall, F1-score, and ROC curve.
6. Interpret the coefficients.
7. Submit a Jupyter Notebook with your code, plots, and written observations.