# 📘 Lecture: Logistic Regression


## 1. Introduction  

In the previous session, we studied **Linear Regression**, which helps us predict continuous values.  
But what if our output variable is **categorical**?  
For example:  
- Will a student **pass or fail** an exam?  
- Is an email **spam or not spam**?  
- Will a customer **buy or not buy** a product?  

Here, **Logistic Regression** comes into play.  

👉 Logistic Regression is a **classification algorithm** used when the dependent variable is **binary (0/1, Yes/No, True/False)**.  



## 2. Theoretical Explanation  

### 2.1 Why Not Linear Regression for Classification?  
If we apply linear regression to a classification problem, the predictions could be:  
- Less than 0 (e.g., -0.5)  
- Greater than 1 (e.g., 1.3)  

But probabilities must always lie between **0 and 1**.  
So, we need a model that maps any real value into a range between **0 and 1**.  
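
To make the problem concrete, here is a minimal sketch that fits a straight line to binary labels. The data (hours studied vs. pass/fail) and the helper names `fit_line` and `predict_linear` are invented for illustration only.

```python
# Toy data (hypothetical): hours studied and pass (1) / fail (0) labels.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(hours, passed)

def predict_linear(x):
    """Raw linear prediction; nothing keeps it inside [0, 1]."""
    return intercept + slope * x

for x in (0, 4.5, 12):
    print(f"hours={x}: prediction={predict_linear(x):.2f}")
```

For 0 hours the line predicts a value below 0, and for 12 hours a value above 1, so neither can be read as a probability.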



### 2.2 The Sigmoid Function (Logistic Function)  

\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]

- Converts any real number into the range **(0, 1)**, so its output can be read as a probability.  
- If probability ≥ 0.5 → predict class **1**  
- If probability < 0.5 → predict class **0**  
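
The sigmoid can be written in a few lines of plain Python. This is only a sketch; the two-branch form is one common way to avoid overflow in `math.exp` for large negative inputs, not the only one.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real z into the interval between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as e^z / (1 + e^z) so exp never overflows.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

for z in (-10, -1, 0, 1, 10):
    print(f"sigmoid({z}) = {sigmoid(z):.4f}")
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and `sigmoid(0)` is exactly 0.5.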



### 2.3 Hypothesis of Logistic Regression  

\[
h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-(\theta^T x)}}
\]

- Input: linear combination of features (\( \theta^T x \))  
- Output: probability (between 0 and 1)  
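
As a small worked example, the hypothesis can be evaluated by hand for one input. The parameter values in `theta` and the feature vector `x` below are hypothetical, and the first entry of `x` is assumed to be 1 so that `theta[0]` acts as the intercept.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x), with x[0] assumed to be 1 (intercept term)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-4.0, 1.0]  # hypothetical parameters: intercept -4, weight 1
x = [1.0, 5.0]       # bias term 1, one feature with value 5

# z = -4 + 1 * 5 = 1, so the output is sigmoid(1), roughly 0.731
print(hypothesis(theta, x))
```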



### 2.4 Decision Boundary  

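The decision boundary is where the model switches from one class to the other. It is set by a threshold on the predicted probability (commonly **0.5**):  

- If \( h_\theta(x) \geq 0.5 \) → Predict **1**  
- If \( h_\theta(x) < 0.5 \) → Predict **0**  

Because \( \sigma(z) \geq 0.5 \) exactly when \( z \geq 0 \), a threshold of 0.5 places the boundary at \( \theta^T x = 0 \), which is linear in the features.  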
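
The thresholding rule can be sketched as follows. The parameters in `theta` are hypothetical; with intercept -4 and weight 1, the linear score is zero at a feature value of 4, so that is where the predicted class flips.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(theta, x, threshold=0.5):
    """Return 1 if h_theta(x) >= threshold, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= threshold else 0

theta = [-4.0, 1.0]  # hypothetical parameters: intercept -4, weight 1

for feature in (3.0, 4.0, 5.0):
    x = [1.0, feature]  # leading 1 is the intercept term
    print(f"feature={feature}: class={predict_class(theta, x)}")
```

Raising `threshold` above 0.5 makes the model more reluctant to predict class 1, which moves the boundary without retraining.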



### 2.5 Cost Function  

Mean Squared Error combined with the sigmoid gives a non-convex cost that is hard to minimize, so we use **Log Loss (Binary Cross-Entropy)** instead:

\[
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \Big]
\]
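
The formula above translates almost directly into code. The sketch below is a plain-Python version for illustration; the clipping constant `eps` is an added assumption to keep `log(0)` from occurring, and the label and probability lists are made up.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log() never receives 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

print("loss when mostly right:", log_loss(y_true, mostly_right))
print("loss when mostly wrong:", log_loss(y_true, mostly_wrong))
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is the behavior we want from a classification loss.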



## 3. Practical Implementation (Python)  

We’ll use the **Scikit-learn** library for implementation.  

### Example: Predicting if a Flower is Iris-Setosa 🌸  


In [None]:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Convert to binary classification: "Setosa (0) vs Not Setosa (1)"
df['binary_target'] = df['target'].apply(lambda x: 0 if x == 0 else 1)

# Features and target
X = df[iris.feature_names]
y = df['binary_target']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Example probability predictions
print("\nPredicted Probabilities:\n", model.predict_proba(X_test[:5]))


In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [None]:
# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

In [None]:
df

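|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|-----|-------------------|------------------|-------------------|------------------|--------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |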


In [None]:
# Convert to binary classification: "Setosa (0) vs Not Setosa (1)"
df['binary_target'] = df['target'].apply(lambda x: 0 if x == 0 else 1)

In [None]:
df

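|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | binary_target |
|-----|-------------------|------------------|-------------------|------------------|--------|---------------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0 |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0 |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 0 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 0 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | 1 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | 1 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | 1 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | 1 |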


In [None]:
X = df[iris.feature_names]
y = df['binary_target']

In [None]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
model = LogisticRegression()


In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0])

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 1.0
Confusion Matrix:
 [[15  0]
 [ 0 23]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       1.00      1.00      1.00        23

    accuracy                           1.00        38
   macro avg       1.00      1.00      1.00        38
weighted avg       1.00      1.00      1.00        38



In [None]:
# Example probability predictions
print("\nPredicted Probabilities:\n", model.predict_proba(X_test[:5]))


Predicted Probabilities:
 [[4.74010419e-03 9.95259896e-01]
 [9.55066467e-01 4.49335332e-02]
 [6.22053105e-06 9.99993779e-01]
 [6.34734513e-03 9.93652655e-01]
 [2.38285934e-03 9.97617141e-01]]
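
Each row of `predict_proba` holds one probability per class, in the order given by `model.classes_`, so the second column is the probability of class 1 (Not Setosa). The sketch below retrains the same model on NumPy arrays and checks that this column matches the sigmoid applied to the linear score from `decision_function`. It assumes scikit-learn's binary logistic regression behaves this way, which holds for the default settings as far as we are aware.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = (iris.target != 0).astype(int)  # Setosa (0) vs Not Setosa (1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)

# Linear score (theta^T x plus intercept) for each test row
z = model.decision_function(X_test)

# Apply the sigmoid by hand and compare with scikit-learn's class-1 column
manual_prob = 1.0 / (1.0 + np.exp(-z))
sklearn_prob = model.predict_proba(X_test)[:, 1]

print("Class order:", model.classes_)
print("Probabilities match:", np.allclose(manual_prob, sklearn_prob))
```

This ties the practical output back to Section 2.3: the library computes the same \( \sigma(\theta^T x) \) that the hypothesis defines.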



## 🎯 Student Task 1  
- Try predicting **Versicolor vs Non-Versicolor** instead of Setosa.  
- Use only **two features** (`petal length` and `petal width`) and plot a **decision boundary** (bonus).  
- Compare model performance with different test sizes (20%, 30%, 40%).  



## 4. Real-World Applications of Logistic Regression  

- Email **spam detection**  
- Medical diagnosis (**disease present / not present**)  
- Credit scoring (**default / no default**)  
- Marketing (**will customer buy or not?**)  



## 5. Assignment  

📌 **Problem Statement:**  
A bank wants to predict whether a customer will **subscribe to a term deposit** based on customer attributes.  

📂 Dataset: [Bank Marketing Dataset (UCI)](https://archive.ics.uci.edu/ml/datasets/bank+marketing)  

### Tasks:  
1. Load the dataset using **Pandas**.  
2. Preprocess categorical variables (use `pd.get_dummies` for categorical features; `LabelEncoder` is best kept for the yes/no target).  
3. Train a **Logistic Regression model**.  
4. Evaluate using:  
   - Accuracy  
   - Confusion Matrix  
   - Classification Report  
5. Write a conclusion:  
   - Which features are most important?  
   - How well does logistic regression perform?  
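
As a starting hint for task 2, the sketch below shows one-hot encoding on a tiny hand-made frame. The rows and column names (`age`, `job`, `marital`, `y`) are toy values made up for illustration and are not taken from the real dataset file, so check the actual columns after loading it.

```python
import pandas as pd

# Toy frame (hypothetical rows) standing in for the real dataset
toy = pd.DataFrame({
    "age": [30, 45, 52, 23],
    "job": ["admin", "technician", "admin", "student"],
    "marital": ["married", "single", "married", "single"],
    "y": ["no", "yes", "no", "yes"],
})

# Encode the yes/no target as 0/1
y = toy["y"].map({"no": 0, "yes": 1})

# One-hot encode the categorical features; drop_first avoids a redundant column
X = pd.get_dummies(
    toy.drop(columns="y"),
    columns=["job", "marital"],
    drop_first=True,
    dtype=int,
)

print(X.columns.tolist())
print(y.tolist())
```

After encoding, `X` and `y` can be passed to `train_test_split` and `LogisticRegression` exactly as in the Iris example.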

---

✅ By the end of this lecture, you should:  
- Understand the **theory of Logistic Regression**.  
- Implement it in **Python**.  
- Work with probabilities and decision boundaries.  
- Apply Logistic Regression to a **real-world dataset**.  
