# TP 3: Classification

## Quick Recap: Classification

**Classification** = predicting a **category/label** from data.

Examples:
- Identify malicious traffic vs normal traffic *(binary)*
- Classify network applications (Web, Video, Gaming, VoIP) *(multiclass)*


### How Classification Works
- Classification is **supervised learning**: the model trains on data with known labels.
- The model learns patterns and predicts a **class** for new unseen data, often by estimating **probabilities** first.
- A **decision boundary** separates classes in the feature space (e.g., normal vs malicious traffic).
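
As a minimal sketch of these three points, the snippet below trains a classifier on a tiny labeled dataset, predicts probabilities and classes for unseen samples, and recovers the decision boundary. The feature (packets per second) and all values are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D feature: packets per second; label 1 = malicious, 0 = normal
X_train = np.array([[10], [12], [15], [18], [80], [90], [95], [110]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Supervised learning: the model is fitted on samples with known labels
clf = LogisticRegression().fit(X_train, y_train)

# New, unseen samples
X_new = np.array([[14], [100]])
proba = clf.predict_proba(X_new)  # one row per sample, columns follow clf.classes_
pred = clf.predict(X_new)         # the class with the highest probability

# In 1-D the decision boundary is the point where P(class 1) = 0.5
boundary = -clf.intercept_[0] / clf.coef_[0, 0]

print("classes:", clf.classes_)
print("probabilities:\n", proba.round(3))
print("predicted classes:", pred)
print("decision boundary (packets/s):", round(boundary, 1))
```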

## 📝 Exercise 1: K-Nearest Neighbors (KNN)

### Preparation: Generating "synthetic" data

- We simulate a simple bank-customer dataset with two groups.  
- The two classes have different mean balances but the same mean income. Points are sampled from Gaussian distributions to mimic real financial variation.  
- We also create a binary `student` feature and a `default` label (0 or 1).  
- This dataset will be used to practice KNN classification.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(100)

# Means
mu_class0 = [700, 3000]
mu_class1 = [1350, 3000]

# Covariance matrices
cov_class0 = [[70000, 0], [0, 900000]]
cov_class1 = [[70000, 0], [0, 900000]]

n = 300  # samples per class

# Generate 2D Gaussian data: (balance, income)
balance0, income0 = np.random.multivariate_normal(mu_class0, cov_class0, n).T
balance1, income1 = np.random.multivariate_normal(mu_class1, cov_class1, n).T

# Ensure no negative data values
balance0 = np.maximum(balance0, 0)
income0  = np.maximum(income0, 0)
balance1 = np.maximum(balance1, 0)
income1  = np.maximum(income1, 0)

# Combine classes
balance = np.concatenate([balance0, balance1])
income  = np.concatenate([income0, income1])

# Labels: 0 = class0, 1 = class1
labels = np.array([0]*n + [1]*n)

# Student indicator
max_balance = balance.max()
student = (np.random.rand(len(balance)) < (balance / max_balance) / 2).astype(int)

# Build dataframe
df = pd.DataFrame({
    "balance": balance,
    "income": income,
    "student": student,
    "default": labels
})

df.head()

In [None]:
# Plot the synthetic data
plt.scatter(df.balance[df.default==0], df.income[df.default==0], label="Class 0")
plt.scatter(df.balance[df.default==1], df.income[df.default==1], label="Class 1")

plt.legend()
plt.show()

In [None]:
# Plot histograms for balance distributions of each class
count0, bins0, _ = plt.hist(balance0, bins=15, density=True, alpha=0.6, color='blue', label='Class 0 (Low balance)')
count1, bins1, _ = plt.hist(balance1, bins=15, density=True, alpha=0.6, color='red', label='Class 1 (High balance)')

plt.plot(
    bins0,
    1 / np.sqrt(2 * np.pi * cov_class0[0][0]) * np.exp(- (bins0 - mu_class0[0])**2 / (2 * cov_class0[0][0])),
    linewidth=2,
    color='blue'
)
plt.plot(
    bins1,
    1 / np.sqrt(2 * np.pi * cov_class1[0][0]) * np.exp(- (bins1 - mu_class1[0])**2 / (2 * cov_class1[0][0])),
    linewidth=2,
    color='red'
)

plt.legend()
plt.grid(alpha=0.3)
plt.show()


### Part A: Visualizing the KNN Decision Boundary

Using the dataset you created:

1. Train a **K-Nearest Neighbors (KNN)** classifier using the entire dataset (no train/test split).  
2. Use only the features `balance` and `income`.
3. Try the following values: \( K = 1, 3, 10, 40 \).

For each value of \( K \):
- Fit the KNN model.
- Plot the decision boundary in 2D.
- Overlay the data points for both classes on the same plot.
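
One possible way to draw a decision boundary is to evaluate the fitted model on a dense grid and color the grid by predicted class. The sketch below assumes stand-in data (`X_demo`, `y_demo`) and a hypothetical helper `predict_on_grid`; it is not the notebook's `df`, so adapt the names to your own variables.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): two Gaussian blobs shaped like (balance, income)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal([700, 3000], [260, 950], (100, 2)),
                    rng.normal([1350, 3000], [260, 950], (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

def predict_on_grid(model, X, steps=150):
    """Evaluate a fitted classifier on a regular grid covering the data."""
    xs = np.linspace(X[:, 0].min(), X[:, 0].max(), steps)
    ys = np.linspace(X[:, 1].min(), X[:, 1].max(), steps)
    xx, yy = np.meshgrid(xs, ys)
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, zz

fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for ax, k in zip(axes, [1, 3, 10, 40]):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    xx, yy, zz = predict_on_grid(knn, X_demo)
    # Filled contours show the predicted class in each region of the plane
    ax.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")
    ax.scatter(X_demo[y_demo == 0, 0], X_demo[y_demo == 0, 1], s=8, label="Class 0")
    ax.scatter(X_demo[y_demo == 1, 0], X_demo[y_demo == 1, 1], s=8, label="Class 1")
    ax.set(title=f"K = {k}", xlabel="balance", ylabel="income")
axes[0].legend()
plt.show()
```

Small values of \( K \) typically give a jagged boundary that follows individual points, while large values give a smoother one.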

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Values of k to test
k_values = [1, 3, 10, 40]

In [None]:
## TODO:

### Part B: Choosing the Best K using Cross-Validation

1. Use only the features **`balance`** and **`income`**.
2. For \( K = 1, 2, \dots, 100 \):
   - Perform cross-validation using `sklearn.model_selection.cross_validate`.
   - Use the following scoring metrics:
     - `accuracy`
     - `roc_auc`
     - `neg_mean_absolute_error`
   - Record both **Train** and **Test** scores for each metric (pass `return_train_score=True`, since train scores are not returned by default).

3. For each metric, compute the **mean Train** and **mean Test** scores across folds.

4. Plot **three figures**:
   - Accuracy vs \( K \)
   - ROC-AUC vs \( K \)
   - MAE vs \( K \)  *(remember: MAE = −`neg_mean_absolute_error`)*

   Each figure should contain **two curves**, one for **Train** and one for **Test**, with the **x-axis** in **logarithmic** scale and the **y-axis** in **linear** scale.
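
The loop described above could be sketched as follows. This is only an illustration under assumptions: it uses stand-in data (`X_demo`, `y_demo`), 5 folds, and a shortened grid of \( K \) values to keep it fast. Only the accuracy figure is drawn; the other two follow the same pattern.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): two Gaussian blobs shaped like (balance, income)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal([700, 3000], [260, 950], (150, 2)),
                    rng.normal([1350, 3000], [260, 950], (150, 2))])
y_demo = np.array([0] * 150 + [1] * 150)

scoring = ["accuracy", "roc_auc", "neg_mean_absolute_error"]
k_grid = [1, 2, 5, 10, 20, 50, 100]  # shortened grid for the sketch
mean_scores = {m: {"train": [], "test": []} for m in scoring}

for k in k_grid:
    cv = cross_validate(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
                        cv=5, scoring=scoring, return_train_score=True)
    for m in scoring:
        # Average each metric over the folds
        mean_scores[m]["train"].append(cv[f"train_{m}"].mean())
        mean_scores[m]["test"].append(cv[f"test_{m}"].mean())

# MAE is the negated scorer value
mae_test = -np.array(mean_scores["neg_mean_absolute_error"]["test"])

acc = mean_scores["accuracy"]
plt.plot(k_grid, acc["train"], marker="o", label="Train")
plt.plot(k_grid, acc["test"], marker="o", label="Test")
plt.xscale("log")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```

For 0/1 labels, the MAE equals the classification error, so the MAE curve is the accuracy curve flipped: MAE = 1 − accuracy.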


In [None]:
## TODO:

#### Questions:

- Which value of \( K \) gives the best overall performance?
- Do Train and Test curves agree on the optimal \( K \)?
- How does the test classification error (1 − accuracy) change as \( K \) increases?
- What do these results tell you about **overfitting** and **underfitting** in KNN?

#### Answer:

In [None]:
# TODO:

## 📝 Exercise 2: Binary Classification with Logistic Regression

1. **Split the dataset:**
   - Divide the data into **Train (70%)** and **Test (30%)** sets.

2. **Train the model:**
   - Fit a **Logistic Regression** classifier on the training data.
   - Predict the label for the first test sample using:

     ```python
     logistic.predict([X_test[0]])
     logistic.predict_proba([X_test[0]])
     ```
   - Observe and describe what these two functions return (class label vs. probability).

3. **Compute model accuracy:**
   - Use `logistic.score(X_test, y_test)` to calculate the **mean prediction accuracy**.
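
A minimal sketch of steps 1 to 3, assuming stand-in data (`X_demo`, `y_demo`) in place of the notebook's DataFrame. Note that scikit-learn estimators expect a 2-D input, so a single sample has to be reshaped or wrapped in a list.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): two Gaussian blobs shaped like (balance, income)
rng = np.random.default_rng(2)
X_demo = np.vstack([rng.normal([700, 3000], [260, 950], (150, 2)),
                    rng.normal([1350, 3000], [260, 950], (150, 2))])
y_demo = np.array([0] * 150 + [1] * 150)

# 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

logistic = LogisticRegression(max_iter=1000).fit(X_train, y_train)

first = X_test[0].reshape(1, -1)       # one sample as a 2-D array of shape (1, 2)
label = logistic.predict(first)        # array with one class label
proba = logistic.predict_proba(first)  # array of shape (1, 2): P(class 0), P(class 1)
accuracy = logistic.score(X_test, y_test)

print("predicted label:", label)
print("class probabilities:", proba.round(3))
print("test accuracy:", round(accuracy, 3))
```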

In [None]:
## TODO:

4. **Repeat for multiple splits:**

   - Split the dataset in several different random ways.
   - For each split:
     - Fit the logistic regression model.
     - Calculate the **test error rate**.
   - Compute the **average test error** over all splits.
   - Compare this average error rate to the optimal KNN test error found in Exercise 1.
   - Discuss what you observe.
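
The repeated-split procedure can be sketched as a loop over random seeds. The number of splits (20) and the stand-in data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): two Gaussian blobs shaped like (balance, income)
rng = np.random.default_rng(3)
X_demo = np.vstack([rng.normal([700, 3000], [260, 950], (150, 2)),
                    rng.normal([1350, 3000], [260, 950], (150, 2))])
y_demo = np.array([0] * 150 + [1] * 150)

errors = []
for seed in range(20):  # each seed produces a different random split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))  # error rate = 1 - accuracy

print(f"mean test error: {np.mean(errors):.3f} (std {np.std(errors):.3f})")
```

The standard deviation shows how much a single split can mislead, which is the motivation for averaging.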

In [None]:
## TODO:

5. **Analyze model performance:**

    - For the last train-test split, compute the **confusion matrix**.
    - Calculate and report the following performance metrics:
      - Accuracy (ACC)
      - True Positive Rate (TPR / Recall)
      - False Positive Rate (FPR)
      - Precision (PPV) 
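
All four metrics follow from the counts TN, FP, FN, TP in the confusion matrix. The sketch below uses small hand-made label vectors (hypothetical, not model output) so the arithmetic can be checked by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 negatives and 4 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = (tp + tn) / (tp + tn + fp + fn)  # fraction of correct predictions
tpr = tp / (tp + fn)                   # recall: positives correctly detected
fpr = fp / (fp + tn)                   # negatives wrongly flagged as positive
ppv = tp / (tp + fp)                   # precision: predicted positives that are correct

print(f"ACC={acc:.3f} TPR={tpr:.3f} FPR={fpr:.3f} PPV={ppv:.3f}")
# ACC=0.700 TPR=0.750 FPR=0.333 PPV=0.600
```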

In [None]:
## TODO:

6. **Repeat using Cross-Validation:**

   - Repeat the whole process using **cross-validation**.
   - Use `sklearn.model_selection.cross_validate` to perform cross-validation on the logistic model.
   - Compare the results with the previous test errors.
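
A minimal sketch of the cross-validated version, assuming 10 folds and the same kind of stand-in data as before. With a single scoring string, `cross_validate` stores the per-fold results under the key `test_score`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Stand-in data (assumption): two Gaussian blobs shaped like (balance, income)
rng = np.random.default_rng(4)
X_demo = np.vstack([rng.normal([700, 3000], [260, 950], (150, 2)),
                    rng.normal([1350, 3000], [260, 950], (150, 2))])
y_demo = np.array([0] * 150 + [1] * 150)

cv = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                    cv=10, scoring="accuracy")

cv_error = 1 - cv["test_score"].mean()  # mean error over the 10 folds
print(f"cross-validated test error: {cv_error:.3f}")
```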

In [None]:
## TODO:

## 📝 Exercise 3: Multinomial (Softmax) Logistic Regression

1) **Load the data (same source as the notebook):**

   - Quickly inspect with `df.shape`, `df.head()`.

In [None]:
import pandas as pd

# Replace with the correct filename if needed
filename = "network_dataset.csv"
df = pd.read_csv(filename)

display(df.head())


2) **Create the multiclass target variable:**

    - Create a **multiclass target variable** called `res` to use in Multinomial Logistic Regression.

    - The dataset currently contains a numeric label column called `label_num`, which represents the **resolution** of network traffic (e.g., video streaming quality). Transform this numeric value into three categories:

        - **0 → Low resolution** (below 240)  
        - **1 → Mid resolution** (at least 240 and below 480)  
        - **2 → High resolution** (480 or above)
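
The same mapping can also be written in vectorized form with `pd.cut`. The sketch below uses a small hypothetical DataFrame (`demo`) that includes both threshold values, which makes it easy to confirm that 240 falls in class 1 and 480 in class 2.

```python
import numpy as np
import pandas as pd

# Hypothetical resolution values, including both thresholds
demo = pd.DataFrame({"label_num": [144, 240, 360, 480, 720]})

# right=False makes the bins left-closed: [-inf, 240), [240, 480), [480, inf)
demo["res"] = pd.cut(demo["label_num"],
                     bins=[-np.inf, 240, 480, np.inf],
                     labels=[0, 1, 2],
                     right=False).astype(int)

print(demo["res"].tolist())  # [0, 1, 1, 2, 2]
```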


In [None]:
# Define thresholds
threshold1 = 240
threshold2 = 480

# Create the 3-class target variable
res = [
    0 if d < threshold1
    else 1 if threshold1 <= d < threshold2
    else 2
    for d in df['label_num']
]

# Add to the DataFrame
df['res'] = res

# Check distribution of classes
print(df['res'].value_counts().sort_index())

# Preview the new column
df[['label_num', 'res']].head()

3) **Define features (X) and target (y):**

    - Now that we've created the multiclass target (`res`), we need to separate our data into:

        - **X (features):** all columns we'll use to predict resolution  
        - **y (target):** the new `res` column we just created  

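    - Exclude the target (`res`) and any irrelevant or non-numeric identifiers (e.g., timestamps, IDs) from X. Also exclude `label_num`: since `res` is derived from it, keeping it as a feature would leak the answer to the model. Make sure to check the data types so you know which features are numeric and which are categorical.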
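
A sketch of this separation on a toy DataFrame. All column names except `label_num` and `res` are hypothetical; the real dataset will have different ones.

```python
import pandas as pd

# Toy data (assumption): session_id, throughput_kbps and protocol are made-up columns
demo = pd.DataFrame({
    "session_id": ["a1", "b2", "c3", "d4"],  # identifier, carries no signal
    "throughput_kbps": [350.0, 900.0, 2500.0, 4100.0],
    "protocol": ["tcp", "udp", "tcp", "udp"],
    "label_num": [144, 360, 480, 720],
    "res": [0, 1, 2, 2],
})

y = demo["res"]
# Drop the target, the column it was derived from, and the identifier
X = demo.drop(columns=["res", "label_num", "session_id"])

# Split the remaining features by data type
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print(X.dtypes)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```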

In [None]:
## TODO:

4) **Preprocess the data:**

    - Before training, we need to preprocess the features:

        - Scale numeric columns (so all values are on comparable scales)
        - One-hot encode categorical columns (so text-based categories become numbers)
        - Combine both preprocessing steps in a ColumnTransformer
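
These preprocessing steps might look like the following, again on a toy DataFrame with hypothetical column names. `handle_unknown="ignore"` is an optional choice that avoids errors if the test set contains a category not seen during training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features (assumption): one numeric and one categorical column
X_demo = pd.DataFrame({
    "throughput_kbps": [350.0, 900.0, 2500.0, 4100.0],
    "protocol": ["tcp", "udp", "tcp", "udp"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["throughput_kbps"]),                 # zero mean, unit variance
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol"]),  # one column per category
])

Xt = preprocess.fit_transform(X_demo)
print(Xt.shape)  # (4, 3): one scaled column plus two one-hot columns
```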

In [None]:
## TODO:

5) **Build the multinomial logistic regression pipeline:**

    - Now we'll create a pipeline that connects preprocessing and model training in one go.

    - Steps inside the pipeline:
        1. Apply the preprocessing (`ColumnTransformer`)
        2. Train a Logistic Regression model configured for multiclass (softmax) classification.

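    - Use: `LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=500)`. In recent scikit-learn releases (1.5 onward) the `multi_class` argument is deprecated; with the `lbfgs` solver and more than two classes, the multinomial formulation is already the default, so the argument can be omitted there.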
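
A minimal sketch of such a pipeline on toy data with hypothetical column names. It omits `multi_class` and relies on `lbfgs` fitting a multinomial model when there are three classes, which is an assumption about the installed scikit-learn version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumption): six samples, three classes
X_demo = pd.DataFrame({
    "throughput_kbps": [300.0, 400.0, 1200.0, 1500.0, 4000.0, 4500.0],
    "protocol": ["tcp", "tcp", "udp", "tcp", "udp", "udp"],
})
y_demo = [0, 0, 1, 1, 2, 2]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["throughput_kbps"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol"]),
])

clf = Pipeline([
    ("preprocess", preprocess),                                   # step 1: scale and encode
    ("model", LogisticRegression(solver="lbfgs", max_iter=500)),  # step 2: softmax classifier
])

clf.fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)  # one probability per class, each row sums to 1
print(proba.round(3))
```

Fitting the pipeline as a whole ensures the scaler and encoder are learned from the same rows as the model, which matters once a train/test split is introduced.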


In [None]:
## TODO:

6) **Split the data and train the model:**

    - Split the dataset into **training** and **testing** subsets to evaluate performance properly.

        - Use a 75/25 split (`test_size=0.25`)
        - Set a fixed random seed (`random_state=42`)
        - Use `stratify=y` to preserve class proportions
        - Fit the pipeline on the training data
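
To illustrate what `stratify=y` does, the sketch below splits a hypothetical imbalanced label vector and prints the class counts on each side. The class sizes are chosen so that the 75/25 proportions divide evenly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 / 28 / 12 samples per class
y_demo = np.array([0] * 60 + [1] * 28 + [2] * 12)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# Each class keeps the same 75/25 proportion in both subsets
print(np.bincount(y_train), np.bincount(y_test))  # [45 21  9] [15  7  3]
```

Without stratification, a rare class can end up under-represented or missing in the test set, which makes per-class metrics unreliable.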

In [None]:
## TODO:

7) **Evaluate model performance:**

    - Now evaluate your trained model on the test set and report:
        - Accuracy
        - Macro F1-score
        - Classification report
        - Confusion matrix
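
The four evaluation outputs can be computed with `sklearn.metrics`. The label vectors below are hand-made (hypothetical, not model output) so the numbers can be verified by hand; in the exercise, `y_pred` would come from the fitted pipeline's `predict` on the test features.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

# Hypothetical true and predicted classes (0 = low, 1 = mid, 2 = high)
y_test = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

acc = accuracy_score(y_test, y_pred)
# Macro F1 averages the per-class F1 scores with equal weight per class
macro_f1 = f1_score(y_test, y_pred, average="macro")
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted class

print(f"accuracy: {acc:.3f}")   # accuracy: 0.700
print(f"macro F1: {macro_f1:.3f}")  # macro F1: 0.694
print(classification_report(y_test, y_pred, target_names=["low", "mid", "high"]))
print(cm)
```

Macro F1 is useful alongside accuracy when classes are imbalanced, because a model that ignores a small class is penalized.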

In [None]:
## TODO: