
### Classification using Logistic Regression and K-NN

**Classification** is a type of supervised machine learning task where the goal is to predict the categorical label of a given input.
In simpler terms, classification involves predicting which category or class an instance belongs to based on its features.

For example, given an email, the task might be to classify it as either spam or not spam, or given an image of an animal, the task might be to classify the image as a cat, dog, or horse.

**Types of Classification**

**Binary classification**: there are only two possible classes or categories.

*  Predicting whether an email is spam or not spam.
*  Predicting whether to buy or sell.

**Multi-class classification**: there are more than two possible classes or categories, and each input is assigned to exactly one of them.


*  Predicting the type of animal in an image, where the classes could be cat, dog, horse, etc.
*   Predicting which category a news article belongs to: Politics, Sports, Entertainment, etc.






## Spam Detection with Full Metrics & Visualization

**Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.decomposition import PCA

**Load Mock Data**

In [None]:
data = {
    "text": [
        "Congratulations! You've won a free ticket",
        "Can we reschedule our meeting?",
        "Exclusive deal just for you, click now!",
        "Don't forget to submit your report",
        "Limited time offer, earn money fast",
        "Dinner at my place tonight?",
        "Claim your free gift card today",
        "Team call postponed to 3 PM",
        "Win a brand new car instantly",
        "Project update attached",
        "Get your free trial now",
        "Lunch tomorrow at 12?",
        "Urgent! Verify your account immediately",
        "Reminder: doctor appointment at 5 PM",
        "Earn cash online easily",
        "Weekly newsletter from HR",
        "Special promotion ends today",
        "Can you review the presentation slides?",
        "Free subscription for limited users",
        "Happy birthday! Let's celebrate!"
    ],
    "label": [
        1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0
    ]
}

**Convert to a pandas DataFrame**

In [None]:
df = pd.DataFrame(data)

**Convert text data to numerical features**

Messages:
1. "Free money!!!"
2. "Seminar at 9AM"

Vocabulary: ["9am", "at", "free", "money", "seminar"] -> all unique words in the dataset, lowercased and sorted alphabetically (the `CountVectorizer` defaults)

Counts:

Message 1 → [0, 0, 1, 1, 0]  # "free" and "money" appear once

Message 2 → [1, 1, 0, 0, 1]  # "9am", "at", and "seminar" appear once
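
To make the counting step concrete, here is a minimal hand-rolled sketch of a bag-of-words count. It is not how `CountVectorizer` is implemented; it only assumes the same default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The helper names `build_vocabulary` and `count_vector` are made up for this illustration.

```python
import re

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def build_vocabulary(messages):
    """Collect every distinct lowercase token, sorted alphabetically."""
    tokens = set()
    for message in messages:
        tokens.update(TOKEN.findall(message.lower()))
    return sorted(tokens)

def count_vector(message, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    words = TOKEN.findall(message.lower())
    return [words.count(term) for term in vocabulary]

messages = ["Free money!!!", "Seminar at 9AM"]
vocabulary = build_vocabulary(messages)
vectors = [count_vector(m, vocabulary) for m in messages]

print(vocabulary)
for message, vector in zip(messages, vectors):
    print(message, "->", vector)
```

Note that the short word "at" is also counted, because it has two characters; single-character tokens would be dropped under this rule.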





In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text']).toarray()  # learn the vocabulary, then count the words in each message
y = df['label']

**Visualize Word Count**

Here we turn each message into numbers so the computer can work with it. Each row is a message, each column is a word, and each cell shows how many times that word appears in the message. A larger number means the word shows up more often.

In [None]:
df_table = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
df_table.index = [f"Msg {i+1}" for i in range(X.shape[0])]
print("Word Count Table:")
display(df_table)

**Split Dataset for Training and Test**

We want to train the model on some data and test it on unseen data to see how well it works.

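`test_size=0.3` holds out 30% of the data for testing; the remaining 70% is used for training.

`random_state=42` is just a seed so that the training and test sets stay the same every time we run the code.

Note: `train_test_split` returns four values here (train and test features, then train and test labels), which we unpack in a single assignment. This is a common Python idiom.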
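
The idea behind a seeded split can be sketched in plain Python. This is only an illustration: the helper `split_indices` is hypothetical, and its shuffling differs from scikit-learn's, so the rows it picks will not match those chosen by `train_test_split`. It does show why a fixed seed makes the split repeatable.

```python
import math
import random

def split_indices(n_samples, test_size=0.3, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed -> same order every run
    n_test = math.ceil(n_samples * test_size)  # 30% of 20 rows -> 6 test rows
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

train_idx, test_idx = split_indices(20)
print(len(train_idx), "training rows,", len(test_idx), "test rows")

# Calling it again with the same seed reproduces exactly the same split.
print(split_indices(20) == (train_idx, test_idx))
```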

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

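`fit_transform(X_train)`
Calculates the mean and standard deviation of each feature on the training data, then scales the training data with them.

`transform(X_test)`
Scales the test data using the statistics learned from the training data, so nothing from the test set leaks into training.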
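
The fit-then-transform pattern can be sketched by hand as follows. This is an assumption-laden illustration rather than `StandardScaler` itself: the helpers `fit_scaler` and `apply_scaler` are invented names, and the toy numbers are made up.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Learn each column's mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 instead to avoid 0/0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_scaler(rows, means, stds):
    """Scale rows using statistics that were learned elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

train = [[0, 2], [1, 2], [2, 2]]
test = [[3, 2]]

means, stds = fit_scaler(train)                 # statistics come from training data only
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler(test, means, stds)   # test data reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The test row is never used to compute the mean or standard deviation, which mirrors calling `transform` rather than `fit_transform` on the test set.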


In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

**Train Logistic Regression**

In [None]:
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)  # learn the weights from the training data
y_pred_lr = lr.predict(X_test_scaled)  # predict labels for the unseen test data
y_prob_lr = lr.predict_proba(X_test_scaled)[:, 1]  # probability that each message is spam (e.g. 0.85 or 0.10)
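
As a rough sketch of what happens at prediction time, a fitted logistic regression combines the features into a weighted score and squashes it into a probability with the sigmoid function. The weights, bias, and two-column feature layout below are hypothetical values chosen for illustration, not values learned by the model above.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam_probability(features, weights, bias):
    """Weighted sum of the features plus a bias, passed through the sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights for two word-count columns: ["free", "meeting"]
weights = [2.0, -1.5]
bias = -0.5

probabilities = []
for features in ([1, 0], [0, 1]):
    p = predict_spam_probability(features, weights, bias)
    probabilities.append(p)
    label = 1 if p >= 0.5 else 0  # 0.5 is the usual decision threshold
    print(features, f"P(spam) = {p:.2f}", "->", "Spam" if label == 1 else "Not Spam")
```

A message containing "free" gets a positive score and a probability above 0.5, while one containing "meeting" gets a negative score and a probability below 0.5.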

**Train K-NN**

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)  # the 3 nearest neighbors vote; the majority label wins
knn.fit(X_train_scaled, y_train)  # store (memorize) the training data
y_pred_knn = knn.predict(X_test_scaled)  # for each test message, find its nearest training messages by scaled word counts
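
The voting idea behind k-NN can be sketched in a few lines of plain Python. This is a simplified illustration with made-up points, assuming Euclidean distance and a plain majority vote; the function `knn_predict` is a hypothetical helper, not part of scikit-learn.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Label a query point by majority vote among its k closest training points."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up 2D training points: label 1 = Spam, label 0 = Not Spam
train_rows = [[1, 0], [1, 1], [0, 1], [0, 0], [2, 1]]
train_labels = [1, 1, 0, 0, 1]

near_not_spam = knn_predict(train_rows, train_labels, [0.4, 0.2])
near_spam = knn_predict(train_rows, train_labels, [2, 0])

print("Query [0.4, 0.2] ->", "Spam" if near_not_spam == 1 else "Not Spam")
print("Query [2, 0]     ->", "Spam" if near_spam == 1 else "Not Spam")
```

For the first query, two of the three nearest points are labeled Not Spam, so that label wins the vote even though one Spam point is also nearby.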

**Print Metrics Function**

**Accuracy**
`(TP + TN) / Total Samples`

Measures the overall correctness of the model: the fraction of all predictions (both positive and negative) that the model got right.

**Precision** `TP / (TP + FP)`

Measures how many of the model's positive predictions were actually correct.

In other words, of all the times the model said "positive," how often was it right?

*A high value means the model is good at avoiding false positives.*

**Recall** `TP / (TP + FN)`

Measures how many of the actual positive cases the model correctly identified.

In other words, of all the actual positive instances (e.g., spam), how many did the model correctly label as positive?

*A high value means the model is good at finding the positives.*

**F1-Score** `(2 × Precision × Recall) / (Precision + Recall)`

The harmonic mean of precision and recall.

It combines both precision and recall into one metric to provide a balance between them.

Useful when you need a single metric to evaluate performance.
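
A small worked example may help here. The confusion counts below are hypothetical numbers picked for easy arithmetic, not results from the models in this notebook.

```python
# Hypothetical confusion counts for 10 test messages
tp, tn, fp, fn = 4, 3, 1, 2
total = tp + tn + fp + fn

accuracy = (tp + tn) / total                          # 7 / 10
precision = tp / (tp + fp)                            # 4 / 5
recall = tp / (tp + fn)                               # 4 / 6
f1 = 2 * precision * recall / (precision + recall)    # 8 / 11

print(f"Accuracy:  {accuracy:.2f}")   # 0.70
print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.67
print(f"F1 Score:  {f1:.2f}")         # 0.73

# Parentheses matter: without them, division binds tighter than addition.
wrong_accuracy = tp + tn / total
print(f"Without parentheses: {wrong_accuracy:.2f}")   # 4.30, not a valid accuracy
```

Here precision is higher than recall: the model is rarely wrong when it says "spam," but it misses a third of the actual spam messages. F1 lands between the two values.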


In [None]:
def print_metrics(y_true, y_pred, model_name):
    # labels=[0, 1] keeps the matrix 2x2 even if one class is missing from the test set
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()

    print(f"--- {model_name} ---")
    print("Confusion Matrix:\n", cm)
    print("True Positives (TP):", tp)
    print("True Negatives (TN):", tn)
    print("False Positives (FP):", fp)
    print("False Negatives (FN):", fn)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 Score: {f1:.2f}")

**Logistic Regression Metrics**

In [None]:
print_metrics(y_test, y_pred_lr, "Logistic Regression")

**K-NN Metrics**

In [None]:
print_metrics(y_test, y_pred_knn, "k-NN")

**Visualization**



In [None]:
from matplotlib.colors import ListedColormap

# -----------------------------
# 2D Visualization with PCA
# -----------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Colormaps ordered by label so they match the legend: 0 = Not Spam (blue), 1 = Spam (red)
cmap_light = ListedColormap(['#AAAAFF', '#FFAAAA'])
cmap_bold  = ListedColormap(['#0000FF', '#FF0000'])

# Logistic Regression decision boundary
xx, yy = np.meshgrid(np.linspace(X_test_pca[:,0].min()-1, X_test_pca[:,0].max()+1, 200),
                     np.linspace(X_test_pca[:,1].min()-1, X_test_pca[:,1].max()+1, 200))
Z = lr.predict(pca.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, alpha=0.2, cmap=cmap_light)
plt.scatter(X_test_pca[:,0], X_test_pca[:,1], c=y_test, cmap=cmap_bold, edgecolor='k', s=50)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Logistic Regression Decision Boundary (2D PCA)")
plt.legend(handles=[
    plt.Line2D([0], [0], marker='o', color='w', label='Not Spam', markerfacecolor='#0000FF', markersize=8),
    plt.Line2D([0], [0], marker='o', color='w', label='Spam', markerfacecolor='#FF0000', markersize=8)
])
plt.show()

# k-NN decision boundary
Z_knn = knn.predict(pca.inverse_transform(np.c_[xx.ravel(), yy.ravel()]))
Z_knn = Z_knn.reshape(xx.shape)

plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z_knn, alpha=0.2, cmap=cmap_light)
plt.scatter(X_test_pca[:,0], X_test_pca[:,1], c=y_test, cmap=cmap_bold, edgecolor='k', s=50)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("k-NN Decision Boundary (2D PCA)")
plt.legend(handles=[
    plt.Line2D([0], [0], marker='o', color='w', label='Not Spam', markerfacecolor='#0000FF', markersize=8),
    plt.Line2D([0], [0], marker='o', color='w', label='Spam', markerfacecolor='#FF0000', markersize=8)
])
plt.show()


**Custom Message to Predict**

In [None]:
new_message = "Check my balance"

# Perform Transformation
new_X = vectorizer.transform([new_message]).toarray()
new_X_scaled = scaler.transform(new_X)

# Predict
pred_lr = lr.predict(new_X_scaled)[0]
pred_knn = knn.predict(new_X_scaled)[0]

print(f"\nLogistic Regression Prediction: {'Spam' if pred_lr==1 else 'Not Spam'}")
print(f"k-NN Prediction: {'Spam' if pred_knn==1 else 'Not Spam'}")


**View Neighbor**

In [None]:

# Find nearest neighbors
distances, indices = knn.kneighbors(new_X_scaled, n_neighbors=3)

print("\nNearest neighbors for the new message:")
for i, idx in enumerate(indices[0]):
    # kneighbors returns positions within the training set, so map them back to the original rows
    original_idx = y_train.index[idx]
    label = "Spam" if y_train.iloc[idx] == 1 else "Not Spam"
    print(f"{i+1}. '{df['text'].loc[original_idx]}' - Label: {label}, Distance: {distances[0][i]:.2f}")