
# Week 3: Logistic Regression – Predicting Categories from Text

# **SECTION 1: Welcome & Objectives**

In [None]:
print("Welcome to Week 3!")
print("This week, you'll:")
print("- Learn what Logistic Regression is")
print("- Understand how it's different from Linear Regression")
print("- Use it to classify SMS messages as spam or ham (not spam)")

Welcome to Week 3!
This week, you'll:
- Learn what Logistic Regression is
- Understand how it's different from Linear Regression
- Use it to classify SMS messages as spam or ham (not spam)


# **SECTION 2: Why Move to Logistic Regression?**

### Why Linear Regression Isn't Enough
Linear Regression is great for predicting **numbers** — like temperature or prices.

But what if we want to answer **yes/no questions**, like:
- Is this email spam?
- Is this review positive?
- Is this job posting fake?

We need a model that predicts **categories**, not continuous values.
➡️ That's where **Logistic Regression** comes in.

# **SECTION 3: Logistic Regression Intuition**

### Logistic Regression Intuition
Instead of fitting a straight line, we want to estimate the **probability** that something belongs to a category (like spam).

We still use a linear function underneath:
$$
z = w_1x_1 + w_2x_2 + \dots + b
$$

But then we pass it through a special function called the **sigmoid function**:
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

This squashes the output between 0 and 1 — which we interpret as probability.
- If $\sigma(z) > 0.5$: predict 1 (e.g. spam)
- If $\sigma(z) < 0.5$: predict 0 (e.g. not spam)
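
To see the squashing in action, here is a minimal sketch of the sigmoid applied to a few hand-picked values of $z$. The helper name `sigmoid` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked scores, from strongly negative to strongly positive.
z_values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(z_values)

# Apply the 0.5 cutoff described above.
labels = (probs > 0.5).astype(int)

for z, p, label in zip(z_values, probs, labels):
    print(f"z = {z:+.1f} -> sigma(z) = {p:.3f} -> predicted class {label}")
```

Large negative scores land near 0, large positive scores land near 1, and $z = 0$ maps to exactly 0.5.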

## 🧠 Logistic Regression: Theory & Math (Made Simple)

### 🔍 What’s the goal?
Linear Regression predicted **numbers**, but now we want to **predict categories** — like:

- Is this message spam or not?
- Is this review positive or negative?

So we need a model that can output **probabilities** and make **yes/no** decisions.

---

### 📉 Why Linear Regression fails for classification

Imagine trying to use a straight line to decide if a message is spam.  
You might end up predicting values like **-3.5 or 7.2** — which don't make sense as probabilities!

We want outputs between **0 and 1**, so we can say:

- $P(\text{spam}) = 0.92$ → very likely spam  
- $P(\text{spam}) = 0.08$ → probably not spam  
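
A quick way to see the problem is to fit a straight line to 0/1 labels and look at its predictions. The sketch below uses a made-up single feature (a count of "spammy" words) that is assumed purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy feature: number of "spammy" words in each message.
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = ham, 1 = spam

line = LinearRegression().fit(X, y)

# Predict for a message with no spammy words and one with many.
X_new = np.array([[0], [10]])
preds = line.predict(X_new)
print(preds)  # one value falls below 0, the other above 1
```

Neither prediction can be read as a probability, which is why we need a function that keeps the output between 0 and 1.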

---

### 🔁 Logistic Regression to the rescue!

Logistic Regression still uses a **linear function** under the hood:

$$
z = w_1x_1 + w_2x_2 + \dots + b
$$

But instead of returning $z$ directly, it passes it through a **sigmoid function**:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

This "squashes" the result between 0 and 1.

---

### 🔢 What does the output mean?

- If $\sigma(z) > 0.5$: predict class 1 (e.g. spam)  
- If $\sigma(z) < 0.5$: predict class 0 (e.g. not spam)

That’s how we convert a **linear model** into a **classifier**.
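
As a rough sketch of the whole chain (weighted sum, sigmoid, threshold), the example below uses hand-picked weights and made-up feature values; none of them come from a trained model. It also shows that thresholding the probability at 0.5 is the same as checking whether $z > 0$.

```python
import numpy as np

# Hypothetical weights and bias, chosen by hand for illustration only.
w = np.array([1.5, -2.0])
b = -0.5

# Each row is one example with two made-up features.
X = np.array([
    [2.0, 0.5],   # z = 3.0 - 1.0 - 0.5 = 1.5
    [0.0, 1.0],   # z = 0.0 - 2.0 - 0.5 = -2.5
    [1.0, 0.25],  # z = 1.5 - 0.5 - 0.5 = 0.5
])

z = X @ w + b                      # linear part
p = 1.0 / (1.0 + np.exp(-z))       # sigmoid turns scores into probabilities

by_probability = (p > 0.5).astype(int)  # threshold the probability
by_score = (z > 0).astype(int)          # equivalent: threshold the raw score

print("z:", z)
print("p:", p.round(3))
print("classes:", by_probability)
```

Because the decision depends only on the sign of $z$, the boundary between the two classes is a straight line (or a flat plane in higher dimensions).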

---

### 🧮 How does it learn?

We need to find the weights $w_1, w_2, \dots$ and the bias $b$ that minimize the **log loss**:

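$$
\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

Where:
- $n$ is the number of training examples
- $y_i$ is the true label (0 or 1) of example $i$
- $\hat{y}_i = \sigma(z_i)$ is the predicted probability that example $i$ belongs to class 1

Log loss punishes confident wrong predictions most heavily: predicting $\hat{y} = 0.01$ when the true label is 1 costs far more than predicting $\hat{y} = 0.4$.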
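
To make the penalty concrete, here is a small sketch that evaluates the formula by hand for a single example whose true label is 1. The function name `log_loss_manual` is an illustrative assumption.

```python
import numpy as np

def log_loss_manual(y_true, y_prob):
    """Average log loss for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    per_example = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_example.mean()

# The true label is 1 (spam) in all three cases; only the prediction changes.
confident_right = log_loss_manual([1], [0.9])   # about 0.105
unsure = log_loss_manual([1], [0.5])            # about 0.693
confident_wrong = log_loss_manual([1], [0.1])   # about 2.303

print(confident_right, unsure, confident_wrong)
```

The loss grows slowly while the model is roughly right and very quickly once it is confidently wrong.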

Like before, we use **gradient descent** to find the best weights that minimize this loss.
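
The training loop can be sketched in a few lines of NumPy. This is a bare-bones illustration on made-up one-feature data, assuming a fixed learning rate and step count; scikit-learn uses more sophisticated solvers internally. For log loss, the gradient with respect to the weights is the average of $(\hat{y} - y)\,x$, and with respect to the bias it is the average of $(\hat{y} - y)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: one feature, small values are class 0, large values are class 1.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def log_loss(w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(X.shape[1])  # start with all weights at zero
b = 0.0
learning_rate = 0.1       # assumed value, not tuned

initial_loss = log_loss(w, b)

for step in range(2000):
    p = sigmoid(X @ w + b)            # current predicted probabilities
    error = p - y                     # how far each prediction is from its label
    grad_w = X.T @ error / len(y)     # gradient of log loss w.r.t. the weights
    grad_b = error.mean()             # gradient of log loss w.r.t. the bias
    w -= learning_rate * grad_w       # step downhill
    b -= learning_rate * grad_b

final_loss = log_loss(w, b)
print("loss before:", initial_loss, "loss after:", final_loss)
print("learned w:", w, "learned b:", b)
```

Starting from zero weights, every prediction is 0.5, so the initial loss is $\log 2 \approx 0.693$; each step then nudges the weights in the direction that lowers the loss.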

---

### ✅ Why use Logistic Regression?

- Easy to understand and fast to train  
- Outputs probabilities, not just labels  
- Works well when the classes are (at least approximately) **linearly separable** in the feature space



# **SECTION 4: NLP Use Case – Spam Detection**

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Load the SMS Spam Collection (originally from the UCI repository).
# The file is tab-separated and has no header row, so we name the columns ourselves.
data = pd.read_csv("https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv", sep='\t', names=["label", "message"])
data['label_num'] = data['label'].map({'ham': 0, 'spam': 1})

In [None]:
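# Feature extraction: turn each message into a sparse vector of TF-IDF weights.
# Note: the vectorizer is fitted on all messages before the train/test split,
# so vocabulary and IDF statistics from the test messages leak into training.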
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['message'])
y = data['label_num']

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
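
One common alternative, sketched here on made-up messages rather than the SMS dataset, is to split the raw text first and wrap the vectorizer and the model in a scikit-learn pipeline. That way the vocabulary and IDF weights are learned from the training messages only. All variable names below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, standing in for the SMS dataset.
messages = [
    "win a free prize now", "free cash claim your prize",
    "urgent claim your free reward", "you won free tickets call now",
    "are we still meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the help today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text, keeping the class balance the same in both parts.
msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training messages only,
# then reuses that fitted vocabulary when predicting on new text.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(msg_train, lab_train)

test_predictions = pipeline.predict(msg_test)
print(test_predictions)
```

A pipeline also accepts raw strings directly at prediction time, so there is no separate `transform` step to remember.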

In [None]:
# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9641255605381166
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       966
           1       1.00      0.73      0.84       149

    accuracy                           0.96      1115
   macro avg       0.98      0.87      0.91      1115
weighted avg       0.97      0.96      0.96      1115
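
The report shows perfect precision but a recall of only 0.73 for spam: the model never flags ham as spam, but it misses roughly a quarter of the spam messages. One possible response is to lower the decision threshold below 0.5. The sketch below uses made-up labels and probabilities to show the trade-off; with the notebook's model, the probabilities would come from `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted spam probabilities for ten messages.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
spam_prob = np.array([0.95, 0.80, 0.45, 0.35, 0.40, 0.20, 0.10, 0.05, 0.30, 0.02])

results = {}
for threshold in (0.5, 0.3):
    # Flag a message as spam when its probability reaches the threshold.
    y_hat = (spam_prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_hat),
        recall_score(y_true, y_hat),
    )
    print(f"threshold {threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold catches more spam (higher recall) at the cost of flagging some ham by mistake (lower precision); which trade-off is right depends on the application.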



In [None]:
# Try it on a custom input
sample = ["Congratulations! You won a free lottery ticket! Claim now."]
vectorized_sample = vectorizer.transform(sample)
predicted_class = model.predict(vectorized_sample)[0]
probability = model.predict_proba(vectorized_sample)[0][1]

In [None]:
print("Message:", sample[0])
print("Spam probability:", round(probability, 2))
print("Prediction:", "Spam" if predicted_class == 1 else "Ham")

Message: Congratulations! You won a free lottery ticket! Claim now.
Spam probability: 0.7
Prediction: Spam


### 🚧 Where it struggles

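- Can't capture **non-linear** decision boundaries unless you hand-craft non-linear features
- Strongly **correlated** features make the learned weights unstable and hard to interpret
- Large-vocabulary NLP tasks produce thousands of sparse features, so the model overfits without **regularization** (scikit-learn's `LogisticRegression` applies L2 regularization by default)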
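
In scikit-learn, regularization strength is controlled by the `C` parameter of `LogisticRegression`, where smaller values mean stronger regularization. The sketch below, on made-up messages, compares the overall size of the learned weights for three values of `C`; the specific values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy messages, standing in for a real dataset.
messages = [
    "win a free prize now", "free cash claim your prize",
    "urgent claim your free reward", "you won free tickets call now",
    "are we still meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the help today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

X_toy = TfidfVectorizer().fit_transform(messages)

weight_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C = stronger L2 penalty = weights pulled closer to zero.
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_toy, labels)
    weight_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C = {C}: size of weight vector = {weight_norms[C]:.3f}")
```

Stronger regularization keeps any single word from dominating the prediction, which usually helps when there are far more words than messages.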

---

### 📦 Summary

- Logistic Regression predicts **probabilities** using the **sigmoid function**  
- It helps us classify text into categories like spam/not spam  
- It’s your first real **NLP classifier**!

# **SECTION 5: What's Next?**

### What Logistic Regression Can't Do
Logistic Regression is great for simple linear decision boundaries.
But what if your data is:
- **Not linearly separable**?
- **Separated by a curved or otherwise complex boundary**?

➡️ Time to meet **Naive Bayes** and **Decision Trees** in Week 4!

# **SECTION 6: Exercises**

### Exercises:
1. Try training the model on only the top 1000 words in the dataset (hint: `TfidfVectorizer(max_features=1000)`).
2. Test the classifier with your own text messages.
3. Explore using `CountVectorizer` instead of `TfidfVectorizer`.

You're now making real NLP classifications! 🚀