<a href="https://colab.research.google.com/github/anderseurenius/Utbildning-AI-AIR600/blob/main/Lab4_Modern_Classification_Algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 4 — Modern Classification Algorithms (Student Template)


---

## IMPORTANT
1. **Make a copy** (File → Save a copy in Drive) before starting.
2. Do **not** modify cells marked **PROVIDED — DO NOT MODIFY**.
3. Write your code only in the **Student Task** code cells.


## 1) Imports (PROVIDED — DO NOT MODIFY)

In [8]:
# ======================================================
# PROVIDED CODE — DO NOT MODIFY
# ======================================================

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score


## 2) Dataset Generation (PROVIDED — DO NOT MODIFY)

In [9]:
# ======================================================
# PROVIDED CODE — DO NOT MODIFY
# ======================================================
# Each row represents one network session.
# This synthetic dataset is generated only for learning purposes.

np.random.seed(42)
n_samples = 500

# Number of failed login attempts in the session
failed_logins = np.random.poisson(lam=2, size=n_samples)

# Amount of data transferred during the session (in MB)
data_volume_mb = np.random.normal(loc=300, scale=120, size=n_samples).clip(50, 1000)

# Whether the session occurred at an unusual time (1 = yes, 0 = no)
unusual_time = np.random.binomial(1, 0.25, size=n_samples)

# Days since the system was last patched
patch_age_days = np.random.randint(0, 365, size=n_samples)

# Whether an admin account was used in the session
admin_login = np.random.binomial(1, 0.15, size=n_samples)

# Cyber risk score (0–100). Higher means higher risk.
risk_score = (
    5 * failed_logins +
    0.04 * data_volume_mb +
    15 * unusual_time +
    0.03 * patch_age_days +
    20 * admin_login +
    0.5 * failed_logins**2 +                 # nonlinear effect
    np.random.normal(0, 5, size=n_samples)   # noise
)

risk_score = np.clip(risk_score, 0, 100)

df = pd.DataFrame({
    "failed_logins": failed_logins,
    "data_volume_mb": data_volume_mb,
    "unusual_time": unusual_time,
    "patch_age_days": patch_age_days,
    "admin_login": admin_login,
    "risk_score": risk_score
})

df.head()


| | failed_logins | data_volume_mb | unusual_time | patch_age_days | admin_login | risk_score |
|---|---|---|---|---|---|---|
| 0 | 4 | 491.340608 | 0 | 216 | 0 | 54.978987 |
| 1 | 1 | 198.364638 | 0 | 119 | 0 | 15.001039 |
| 2 | 3 | 181.032918 | 0 | 174 | 0 | 21.995223 |
| 3 | 3 | 50.0 | 0 | 22 | 0 | 11.908258 |
| 4 | 1 | 223.32459 | 0 | 212 | 1 | 43.741031 |


## 3) Binary Labels + Train/Test Split (PROVIDED — DO NOT MODIFY)

In [10]:
# ======================================================
# PROVIDED CODE — DO NOT MODIFY
# ======================================================

# High Risk (1): risk_score >= 70
# Low Risk  (0): risk_score < 70
X = df.drop(columns=["risk_score"])
y = (df["risk_score"] >= 70).astype(int)

# Stratify keeps the class proportions similar in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train classes:")
print(y_train.value_counts())
print("\nTest classes:")
print(y_test.value_counts())


Train classes:
risk_score
0    387
1     13
Name: count, dtype: int64

Test classes:
risk_score
0    97
1     3
Name: count, dtype: int64


## 4) Evaluation Helper (PROVIDED — DO NOT MODIFY)

In [11]:
# ======================================================
# PROVIDED CODE — DO NOT MODIFY
# ======================================================
# Prints confusion matrix + common classification metrics.

def clf_report_simple(y_true, y_pred, model_name):
    print(model_name)
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.3f}")
    print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")
    print(f"Recall   : {recall_score(y_true, y_pred, zero_division=0):.3f}")
    print(f"F1-score : {f1_score(y_true, y_pred, zero_division=0):.3f}")
    print("-" * 40)
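
For binary labels 0 and 1, `confusion_matrix` returns the counts in the layout `[[TN, FP], [FN, TP]]`. The sketch below shows how the four metrics printed by the helper follow from those counts. The function `metrics_from_counts` and the example counts are hypothetical and are not part of the provided code.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Fall back to 0.0 when a denominator is zero, mirroring zero_division=0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, precision, recall, f1


# Hypothetical counts: 97 true negatives, 0 false positives,
# 2 false negatives, 1 true positive.
acc, prec, rec, f1 = metrics_from_counts(tn=97, fp=0, fn=2, tp=1)
print(f"{acc:.3f} {prec:.3f} {rec:.3f} {f1:.3f}")  # 0.980 1.000 0.333 0.500
```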


# ✅ Student Tasks

Write your code in the empty code cells under each task.

**Tip:** Evaluate using `clf_report_simple(y_test, y_pred, "Model Name")`.


## Task 1 — Decision Tree

**Goal:** Train a decision tree classifier.

**What to do:** Train → Predict → Evaluate.

**What to observe:** Does it overfit? How high is its recall on the high-risk class?


In [12]:
# Student code here
# Create model
dt_model = DecisionTreeClassifier(random_state=42)

# Train
dt_model.fit(X_train, y_train)

# Predict
y_pred_dt = dt_model.predict(X_test)

# Evaluate
clf_report_simple(y_test, y_pred_dt, "Decision Tree")


Decision Tree
Confusion Matrix:
[[94  3]
 [ 3  0]]
Accuracy : 0.940
Precision: 0.000
Recall   : 0.000
F1-score : 0.000
----------------------------------------
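
One common way to approach the overfitting question is to compare training accuracy with test accuracy. The sketch below does this for an unrestricted tree and a depth-limited one. It runs on stand-in data from `make_classification`, which is an assumption made to keep the example self-contained. The names `X_demo`, `y_demo`, and `tree_scores` are hypothetical, and the numbers it prints will differ from the lab's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): an imbalanced synthetic problem, not the lab's dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

tree_scores = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    tree_scores[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

In the notebook, the same check would compare `accuracy_score(y_train, dt_model.predict(X_train))` with the test accuracy reported above. A large gap between the two is a typical sign of overfitting.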


## Task 2 — Random Forest

**Goal:** Use many trees to improve stability.

**What to do:** Train → Predict → Evaluate.

**What to observe:** Compare to the single tree.


In [13]:
# Student code here

# Create model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluate
clf_report_simple(y_test, y_pred_rf, "Random Forest")

Random Forest
Confusion Matrix:
[[97  0]
 [ 3  0]]
Accuracy : 0.970
Precision: 0.000
Recall   : 0.000
F1-score : 0.000
----------------------------------------
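
The forest above predicts "low risk" for every test session. `predict` picks the class with the highest averaged probability, which for two classes works like a 0.5 cut-off. One option that is sometimes tried on imbalanced data is to lower that cut-off using `predict_proba`. The sketch below illustrates the idea on stand-in data from `make_classification` (an assumption; it is not the lab's dataset). The threshold of 0.2 is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): an imbalanced synthetic problem, not the lab's dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba_high = forest.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.2):
    y_pred = (proba_high >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_pred, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of more false positives, so precision should be checked alongside it.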


## Task 3 — KNN

**Goal:** Distance-based classification.

**What to do:** Use a Pipeline with StandardScaler + KNN.

**What to observe:** How scaling affects results.


In [14]:
# Student code here

# Create pipeline with scaling + KNN
knn_model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])

# Train
knn_model.fit(X_train, y_train)

# Predict
y_pred_knn = knn_model.predict(X_test)

# Evaluate
clf_report_simple(y_test, y_pred_knn, "KNN")

KNN
Confusion Matrix:
[[97  0]
 [ 2  1]]
Accuracy : 0.980
Precision: 1.000
Recall   : 0.333
F1-score : 0.500
----------------------------------------
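
To see why scaling matters for a distance-based model, consider a toy example with two of the lab's features, `data_volume_mb` and `admin_login`. The five sessions below are hypothetical values chosen for illustration. The helper `nearest_to_first` is also hypothetical and simply finds the Euclidean nearest neighbour of the first row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sessions: [data_volume_mb, admin_login]
X_toy = np.array([
    [300.0, 1.0],  # row 0: query session (admin)
    [310.0, 0.0],  # row 1: similar volume, not admin
    [340.0, 1.0],  # row 2: somewhat higher volume, admin
    [60.0, 0.0],   # row 3: very low volume, not admin
    [560.0, 0.0],  # row 4: very high volume, not admin
])


def nearest_to_first(points):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_neighbor = nearest_to_first(X_toy)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X_toy))

print(f"Nearest to row 0 without scaling: row {raw_neighbor}")     # row 1
print(f"Nearest to row 0 with scaling   : row {scaled_neighbor}")  # row 2
```

Without scaling, the megabyte differences dominate the distance and the 0/1 `admin_login` flag barely counts. After standardization both features contribute on a comparable scale, and the nearest neighbour changes.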


## Task 4 — SVM

**Goal:** Margin-based classification.

**What to do:** Use a Pipeline with StandardScaler + SVC.

**What to observe:** Compare precision/recall to other models.


In [15]:
# Student code here

# Create pipeline with scaling + SVM
svm_model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", random_state=42))
])

# Train
svm_model.fit(X_train, y_train)

# Predict
y_pred_svm = svm_model.predict(X_test)

# Evaluate
clf_report_simple(y_test, y_pred_svm, "SVM")

SVM
Confusion Matrix:
[[97  0]
 [ 2  1]]
Accuracy : 0.980
Precision: 1.000
Recall   : 0.333
F1-score : 0.500
----------------------------------------
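
`SVC` accepts a `class_weight` parameter, and `class_weight="balanced"` weights errors inversely to class frequency. The sketch below compares the default setting with the balanced one. It uses stand-in data from `make_classification` (an assumption; it is not the lab's dataset), and the dictionary `svm_scores` is a hypothetical name. Whether this helps on the lab's data would need to be checked by running it there.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): an imbalanced synthetic problem, not the lab's dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

svm_scores = {}
for weight in (None, "balanced"):
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("svm", SVC(kernel="rbf", class_weight=weight, random_state=0)),
    ])
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    svm_scores[weight] = (
        precision_score(y_te, y_pred, zero_division=0),
        recall_score(y_te, y_pred, zero_division=0),
    )
    precision, recall = svm_scores[weight]
    print(f"class_weight={weight}: precision={precision:.3f}, recall={recall:.3f}")
```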


## Task 5 — Compare Models

Create a small DataFrame with Accuracy, Precision, Recall, F1 for each model.


In [16]:
# Student code here
# Create comparison DataFrame

models = {
    "Decision Tree": y_pred_dt,
    "Random Forest": y_pred_rf,
    "KNN": y_pred_knn,
    "SVM": y_pred_svm
}

results = []

for name, y_pred in models.items():
    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1": f1_score(y_test, y_pred, zero_division=0)
    })

results_df = pd.DataFrame(results)
results_df


| | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | Decision Tree | 0.94 | 0.0 | 0.0 | 0.0 |
| 1 | Random Forest | 0.97 | 0.0 | 0.0 | 0.0 |
| 2 | KNN | 0.98 | 1.0 | 0.333333 | 0.5 |
| 3 | SVM | 0.98 | 1.0 | 0.333333 | 0.5 |
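
To make the comparison easier to read, the table can be ranked by the metric of interest. The sketch below rebuilds the table from the values shown above and sorts by recall, then F1, then accuracy. The name `ranked` and the choice of sort order are assumptions made for illustration.

```python
import pandas as pd

# Values copied from the comparison table above.
results_df = pd.DataFrame({
    "Model": ["Decision Tree", "Random Forest", "KNN", "SVM"],
    "Accuracy": [0.94, 0.97, 0.98, 0.98],
    "Precision": [0.0, 0.0, 1.0, 1.0],
    "Recall": [0.0, 0.0, 1 / 3, 1 / 3],
    "F1": [0.0, 0.0, 0.5, 0.5],
})

# Rank by recall first, then break ties with F1 and accuracy.
ranked = (
    results_df.set_index("Model")
    .sort_values(["Recall", "F1", "Accuracy"], ascending=False)
    .round(3)
)
print(ranked)
```

In the notebook, the rebuild step is unnecessary because `results_df` already exists, so only the `sort_values` call would be needed. With just 3 high-risk sessions in the test set, recall can only take the values 0, 1/3, 2/3, or 1, so small differences between models should be read with care.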


## Reflection Questions

1. Which model achieved the highest **recall**? Why is recall important in cyber risk?
2. Which model seems most likely to overfit? (2–3 lines)
3. Which model would you deploy and why?


## Final Checklist

- [ ] All tasks completed
- [ ] All cells run without errors
- [ ] Outputs are visible
- [ ] Renamed to `Lab4_StudentName.ipynb`
- [ ] Downloaded and submitted the `.ipynb`
