### 1. Install dependencies

- pandas → used for data handling and analysis
- scikit-learn → provides machine learning tools such as `LogisticRegression` and `CountVectorizer`

%pip install pandas scikit-learn

### 2. Import required libraries

* `pandas` → for loading and managing the dataset
* `train_test_split` → splits the data into training and testing sets
* `CountVectorizer` → converts text to numerical features
* `LogisticRegression` → our ML model
* `accuracy_score`, `classification_report` → to evaluate the model

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


### 3. Load the dataset

Download the dataset first from:
🔗 [https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

### 4. Read CSV

* `pd.read_csv` reads the CSV file into a pandas DataFrame.
* `encoding="latin-1"` avoids the decoding errors that special characters in this file can trigger under the default UTF-8 encoding.
* `df.head()` shows the first 5 rows of the dataset.


In [4]:
df = pd.read_csv("spam.csv", encoding="latin-1")
df.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
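To see why the `encoding` argument matters, here is a minimal sketch using a tiny in-memory CSV rather than the real `spam.csv`. The sample rows and the name `df_demo` are made up for illustration. The bytes are written as Latin-1, so they are not valid UTF-8, but they load cleanly once the encoding is stated.

```python
import io

import pandas as pd

# Hypothetical two-row CSV, encoded as Latin-1 (not the real dataset).
raw = "v1,v2\nham,caf\u00e9 at 5\nspam,win \u00a3100 now\n".encode("latin-1")

# The same bytes are not valid UTF-8, which is pandas' default encoding.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Stating the encoding explicitly lets pandas decode the special characters.
df_demo = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print("valid UTF-8:", utf8_ok)
print(df_demo)
```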


### 5. Clean the dataset
* We only need two columns, which we rename:

  * `label` → whether the message is "ham" or "spam"
  * `message` → the actual SMS text

In [5]:
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
df.head()

|   | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### 6. Convert text labels to numbers
* Models work with numbers internally, so we encode the text labels as integers.
* We convert:

  * `ham` → 0
  * `spam` → 1
* The new column `label_num` holds the numeric labels.

In [6]:
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

|   | label | message | label_num |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
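One thing to watch with `.map()`: any value missing from the dictionary becomes `NaN` instead of raising an error. The sketch below uses a made-up Series (the misspelled `"Spam"` entry is deliberate and is not from the dataset) to show a quick sanity check you could run after the mapping step.

```python
import pandas as pd

# Hypothetical labels; the capitalised "Spam" is not a key in the mapping.
labels = pd.Series(["ham", "spam", "ham", "Spam"])

encoded = labels.map({"ham": 0, "spam": 1})

# Unmapped values silently turn into NaN, so count them explicitly.
n_unmapped = int(encoded.isna().sum())

print(encoded.tolist())
print("unmapped labels:", n_unmapped)
```

If the count is above zero, the label column needs cleaning (for example, lowercasing) before training.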


### ✂️ 7. Split data into training and testing sets

* 80% of data → training
* 20% → testing
* `random_state=42` makes sure you get the same split every time.

#### 1️⃣ **80% of data → training**
The model needs to **learn patterns**, such as which words appear more often in spam than in normal messages.
That learning happens on the **training data**.

Example:
Let's say you have 1000 SMS messages in your dataset.

* 800 messages (80%) are used to **train** the model: the model *studies* these.
* 200 messages (20%) are kept aside for **testing**: the model *never sees* these during learning.

Why?
Because if we test on the same data we trained on, the model might just memorize answers.
Testing on unseen data shows how well it *generalizes* to new examples.

🧠 Think of it like:

* Training = studying for the exam
* Testing = writing the actual exam

#### 2️⃣ **20% of data → testing**

The **testing set** checks how well the model performs on *unseen messages*.

* These messages were *not shown* to the model during training.
* We use them to measure **accuracy**, **precision**, **recall**, etc.

This gives us a realistic idea of how our model would perform on *real-world data* (like new messages coming tomorrow).


#### 3️⃣ **`random_state=42`**

Every time you split the data, `train_test_split` shuffles the rows randomly before dividing them.

That means:

```python
train_test_split(..., test_size=0.2)
```

will give you a *different* 80-20 split each time you run it.

To make results **consistent** and **reproducible**, we fix a "seed" using `random_state`.

💡 You can use *any integer*; 42 is just a popular choice, a nod to *The Hitchhiker's Guide to the Galaxy* 😄.
When `random_state=42`, the shuffling follows the *same pattern* every time,
so your results (accuracy, precision, etc.) stay the same whenever you re-run the notebook.

----

If you write:

```python
train_test_split(..., random_state=2)
```

you'll get a particular random shuffle and split. Let's call it **Split A**.

If you change it to:

```python
train_test_split(..., random_state=3)
```

you'll get a *different* shuffle and split. Call it **Split B**.

But here's the key point:

> Whenever you use the **same random_state value again**, you'll get **the exact same split** every single time.
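The reproducibility claim is easy to check on a toy list. This is a minimal sketch: the ten placeholder messages, the alternating labels, and the helper name `split` are invented for illustration and are not part of the notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 placeholder messages with alternating labels.
messages = [f"message {i}" for i in range(10)]
labels = [0, 1] * 5


def split(seed):
    """Return (X_train, X_test, y_train, y_test) for a given seed."""
    return train_test_split(messages, labels, test_size=0.2, random_state=seed)


first_run = split(42)
second_run = split(42)   # same seed, so the same shuffle
other_seed = split(7)    # a different seed usually gives a different shuffle

print("train/test sizes:", len(first_run[0]), len(first_run[1]))
print("same seed, same test set:", first_run[1] == second_run[1])
print("test set with seed 42:", first_run[1])
print("test set with seed 7: ", other_seed[1])
```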


In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label_num'], test_size=0.2, random_state=42    
)

### 8. Convert text to numerical vectors


* `CountVectorizer` converts each text into a "bag of words" numeric form.
* `fit_transform()` → learns the vocabulary and transforms the training data
* `transform()` → converts the test data using the same learned vocabulary


In [17]:
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

### 9. Train the Logistic Regression model

* `LogisticRegression` is a classification algorithm.
* The model learns from training data how words relate to being spam or not spam.

In [19]:
model = LogisticRegression()
model.fit(X_train_vec, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


### 10. Make predictions and evaluate

* `predict()` → predicts labels for the test messages.
* `accuracy_score` → measures the fraction of predictions that are correct.
* `classification_report` → shows precision, recall, and F1-score for each class.

In [20]:
y_pred = model.predict(X_test_vec)
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\n📊 Classification Report:\n", classification_report(y_test, y_pred))

✅ Accuracy: 0.9775784753363229

📊 Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.84      0.91       150

    accuracy                           0.98      1115
   macro avg       0.98      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



#### 🔹 1. **Precision**

> "Out of all predicted *spam*, how many were *actually spam*?"

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

Here TP (true positives) counts spam messages correctly flagged as spam, and FP (false positives) counts ham messages wrongly flagged as spam.

✅ High precision = your model **rarely flags legitimate messages as spam** (few false alarms).

**Example:**
- The model flagged 30 messages as spam.  
- Of those, 27 were actually spam.  
  → Precision = 27 / 30 = **0.9 (90%)**

---

#### 🔹 2. **Recall**

> "Out of all *actual spam* messages, how many did the model catch?"

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

Here FN (false negatives) counts spam messages that the model missed and labeled as ham.

✅ High recall = your model **catches most real spam**.

**Example:**
- There were 40 actual spam messages.  
- The model caught 27 of them.  
  → Recall = 27 / 40 = **0.675 (67.5%)**

---

#### 🔹 3. **F1-score**

> "A balance between precision and recall."

$$
\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

✅ F1 is high **only if both precision and recall are high**, because it is their *harmonic mean*.

**Example:**
$$
F1 = 2 \times \frac{0.9 \times 0.675}{0.9 + 0.675} \approx 0.77
$$
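The three worked examples can be reproduced in a few lines. This sketch takes the counts from the text (27 true positives, 3 false positives, 13 false negatives). The number of true negatives is not given above, so `tn = 57` is an arbitrary assumption; it does not affect precision, recall, or F1. The label lists are rebuilt from those counts only to cross-check against scikit-learn.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Counts from the worked examples: 30 flagged as spam, 40 actual spam.
tp, fp, fn = 27, 3, 13
tn = 57  # assumed value; true negatives do not enter these three metrics

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Rebuild label lists that produce exactly these counts.
y_true = [1] * tp + [0] * fp + [1] * fn + [0] * tn
y_pred = [1] * tp + [1] * fp + [0] * fn + [0] * tn

print(f"manual : P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
print(
    f"sklearn: P={precision_score(y_true, y_pred):.3f} "
    f"R={recall_score(y_true, y_pred):.3f} "
    f"F1={f1_score(y_true, y_pred):.3f}"
)
```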


#### 🧩 What is **Support**?

**Support** = **number of actual samples** (from your test set) that belong to a particular class.

It tells you **how many true examples** of each label were present in `y_test`.


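For example, on a toy test set with 2 "not spam" and 3 "spam" messages, where one spam message is misclassified as not spam:

```
              precision    recall  f1-score   support
not spam          0.67      1.00      0.80        2
spam              1.00      0.67      0.80        3
accuracy                              0.80        5
macro avg         0.83      0.83      0.80        5
weighted avg      0.87      0.80      0.80        5
```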

👉 The `support` column shows:

* `not spam` → **2 samples**
* `spam` → **3 samples**
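Support can also be read programmatically. The sketch below uses five made-up labels (2 not spam, 3 spam, with one spam message misclassified) and asks `classification_report` for a dictionary instead of a string. The variable names are illustrative only.

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels: 0 = not spam, 1 = spam.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1]  # the first spam message is missed

report = classification_report(
    y_true_toy,
    y_pred_toy,
    target_names=["not spam", "spam"],
    output_dict=True,
)

# Support depends only on the true labels, not on the predictions.
support_by_class = {
    name: int(report[name]["support"]) for name in ("not spam", "spam")
}

print(support_by_class)
print("accuracy:", report["accuracy"])
```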


### 11. Try with your own messages

* Converts your custom messages into vectors.
* Predicts whether each one is **spam or ham**.
* Prints human-readable results.

In [35]:
sample_msgs = [
    "Congratulations! You’ve won ₹10,000 cash. Claim your prize now!",
    "Get a free Netflix subscription - limited offer!",
    "Exclusive deal! 50% off on all electronics today only!",
    "WINNER!! As a valued network customer you have",
    "Had your mobile 11 months or more? U R entitle",
    "Your free ringtone is waiting to be collecte"
]

sample_vec = vectorizer.transform(sample_msgs)
predictions = model.predict(sample_vec)

for msg, pred in zip(sample_msgs, predictions):
    label = "SPAM 🚨" if pred == 1 else "HAM ✅"
    print(f"{msg} --> {label}")

Congratulations! You’ve won ₹10,000 cash. Claim your prize now! --> SPAM 🚨
Get a free Netflix subscription - limited offer! --> HAM ✅
Exclusive deal! 50% off on all electronics today only! --> HAM ✅
WINNER!! As a valued network customer you have --> HAM ✅
Had your mobile 11 months or more? U R entitle --> HAM ✅
Your free ringtone is waiting to be collecte --> HAM ✅
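When a prediction looks surprising, the predicted probability shows how confident the model was rather than just the final label. The sketch below is self-contained, so it trains on a four-message toy corpus invented for illustration. It also swaps in scikit-learn's `make_pipeline` to bundle the vectorizer and the classifier, which the notebook itself does not do; treat it as one possible convenience, not the notebook's method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not from the SMS dataset): 1 = spam, 0 = ham.
toy_texts = [
    "win free prize now",
    "free cash win",
    "lunch at noon",
    "meeting at noon tomorrow",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw strings and then classifies them in one call.
spam_filter = make_pipeline(CountVectorizer(), LogisticRegression())
spam_filter.fit(toy_texts, toy_labels)

new_msgs = ["free prize win", "lunch at noon"]
preds = spam_filter.predict(new_msgs)
spam_probs = spam_filter.predict_proba(new_msgs)[:, 1]  # column 1 = class 1 (spam)

for msg, pred, prob in zip(new_msgs, preds, spam_probs):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"{msg} --> {label} (P(spam) = {prob:.2f})")
```

With the notebook's own objects, the equivalent call would be `model.predict_proba(sample_vec)[:, 1]`.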
