# K-Nearest Neighbors Classifier on Imbalanced Dataset

## 1. Import Necessary Libraries


In [None]:
# Import libraries for data manipulation and machine learning
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


## 2. Load and Balance the Dataset


In [None]:
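# Load the MAGIC gamma telescope data; the file has no header row,
# and the last column holds the class label ('g' or 'h')
data = pd.read_csv('magic04.data', header=None)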

# Separate gamma ('g') and hadron ('h') classes
gamma = data[data.iloc[:, -1] == 'g']
hadron = data[data.iloc[:, -1] == 'h']

# Balance the dataset by downsampling the gamma class
gamma_less = gamma.sample(n=len(hadron), random_state=42)
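
As a quick sanity check of the downsampling step, the sketch below applies the same `sample(n=len(...))` pattern to a small made-up frame and confirms that both classes end up the same size. The toy values and the names `toy`, `majority`, `minority`, and `balanced` are assumptions for illustration, not the MAGIC data.

```python
import pandas as pd

# Toy frame standing in for the real data: 6 'g' rows and 3 'h' rows (made-up values).
toy = pd.DataFrame({0: range(9), 1: ['g'] * 6 + ['h'] * 3})

majority = toy[toy.iloc[:, -1] == 'g']
minority = toy[toy.iloc[:, -1] == 'h']

# Randomly keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and count rows per class.
balanced = pd.concat([majority_down, minority])
counts = balanced.iloc[:, -1].value_counts()
print(counts)
```

Downsampling discards majority-class rows, so it trades some training data for equal class sizes.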

## 3. Split the Dataset into Training, Validation, and Test Sets


In [None]:
# Split each class separately so every subset keeps the 50/50 class balance.
# Gamma: 15% test, then 0.1765 of the remainder for validation (~70/15/15 overall)
temp, g_test = train_test_split(gamma_less, test_size=0.15, random_state=42)
g_train, g_val = train_test_split(temp, test_size=0.1765, random_state=42)

# Hadron: same 70/15/15 split
temp, h_test = train_test_split(hadron, test_size=0.15, random_state=42)
h_train, h_val = train_test_split(temp, test_size=0.1765, random_state=42)

# Combine the per-class subsets
train = pd.concat([g_train, h_train])
val = pd.concat([g_val, h_val])
test = pd.concat([g_test, h_test])

# Separate features and target labels
x_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]

x_val = val.iloc[:, :-1]
y_val = val.iloc[:, -1]

x_test = test.iloc[:, :-1]
y_test = test.iloc[:, -1]
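
The second `test_size` of 0.1765 makes sense once it is expressed as a share of the full data: after 15% is held out for testing, 85% remains, and 0.15 / 0.85 ≈ 0.1765. A minimal sketch of that arithmetic (the variable names are illustrative):

```python
test_frac = 0.15                 # share held out for testing in the first split
remaining = 1.0 - test_frac      # share left after the first split
val_frac_of_remaining = 0.1765   # test_size used in the second split

val_frac = remaining * val_frac_of_remaining   # validation share of the full data
train_frac = remaining - val_frac              # training share of the full data

print(f"train={train_frac:.4f}, val={val_frac:.4f}, test={test_frac:.4f}")
# train=0.7000, val=0.1500, test=0.1500
```

The actual subset sizes can differ by a row or two, because `train_test_split` rounds to whole rows.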

## 4. Apply the K-Nearest Neighbors Classifier with Different K Values


In [None]:
# Candidate K values: 1, 6, 11, ..., 146
k_values = range(1, 151, 5)
results = []

# Loop to evaluate each K
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(x_train, y_train)

    y_pred = knn.predict(x_val)

    # Calculate metrics for each K value
    acc = accuracy_score(y_val, y_pred)
    prec = precision_score(y_val, y_pred, pos_label='g')
    recall = recall_score(y_val, y_pred, pos_label='g')
    f1 = f1_score(y_val, y_pred, pos_label='g')
    conf_m = confusion_matrix(y_val, y_pred)

    results.append({'K': k, 'Accuracy': acc, 'Precision': prec, 'Recall': recall, 'F1-Score': f1, 'Confusion Matrix': conf_m})


## 5. Select the Best K Value Based on F1-Score


In [None]:
# Find the best K based on validation set F1-Score
best_result = max(results, key=lambda x: x['F1-Score'])
best_k = best_result['K']
print(f"Best K based on validation set: {best_k}")
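
One detail of selecting with `max()`: when several K values share the top F1-score, Python returns the first maximal item, which here is the smallest K because the results are stored in ascending order of K. The sketch below shows this on made-up scores (the numbers and the name `toy_results` are assumptions, not values from this notebook), along with one way to prefer the larger K instead.

```python
# Made-up validation scores, in ascending order of K as in the evaluation loop.
toy_results = [
    {'K': 1, 'F1-Score': 0.70},
    {'K': 6, 'F1-Score': 0.78},
    {'K': 11, 'F1-Score': 0.78},   # ties with K=6
    {'K': 16, 'F1-Score': 0.75},
]

# max() returns the first maximal item, so the smaller K wins the tie.
best = max(toy_results, key=lambda r: r['F1-Score'])
print(best['K'])  # 6

# To prefer the larger K on ties, break ties explicitly with a tuple key.
best_large = max(toy_results, key=lambda r: (r['F1-Score'], r['K']))
print(best_large['K'])  # 11
```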


## 6. Train the Model with the Best K Value and Evaluate on the Test Set


In [None]:
# Train and evaluate the final model
knn = KNeighborsClassifier(n_neighbors=best_k, metric='euclidean')
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

# Calculate final evaluation metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, pos_label='g')
recall = recall_score(y_test, y_pred, pos_label='g')
f1 = f1_score(y_test, y_pred, pos_label='g')
conf_m = confusion_matrix(y_test, y_pred)

# Print final test set results
print("\nFinal Test Set Evaluation")
print(f"Accuracy: {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"Confusion Matrix:\n{conf_m}")


## 7. Analysis of Different K Values and Model Performance

The choice of K affects the model in two ways:
- **Higher K values** smooth out decision boundaries by considering a larger neighborhood for classification. This can increase model stability and reduce sensitivity to noise.
- **Lower K values** consider fewer neighbors, which can make the model sensitive to individual points and overfit to the training data.
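
To make this concrete, here is a minimal sketch using a hand-rolled nearest-neighbor vote on made-up one-dimensional data. It is not the scikit-learn implementation and does not use the MAGIC data; the helper `knn_predict` and all values are hypothetical. A single mislabeled point decides the prediction at K=1, while a wider neighborhood outvotes it.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    # Squared distances preserve the neighbor ordering, so no square root is needed.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)), i)
        for i, p in enumerate(train_points)
    )
    votes = Counter(train_labels[i] for _, i in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up data: a 'g' cluster, an 'h' cluster, and one noisy 'h' inside the 'g' cluster.
points = [(0.0,), (1.0,), (2.0,), (2.5,), (3.0,), (8.0,), (9.0,), (10.0,)]
labels = ['g', 'g', 'g', 'h', 'g', 'h', 'h', 'h']

query = (2.6,)  # inside the 'g' cluster, right next to the noisy point
print(knn_predict(points, labels, query, k=1))  # h: follows the single noisy neighbor
print(knn_predict(points, labels, query, k=5))  # g: the wider neighborhood outvotes the noise
```

Pushing K too far has the opposite cost: with K close to the size of the training set, every query receives nearly the same vote and local structure is lost.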

In this task, we selected K by its F1-score on the validation set, since F1 balances precision and recall and is a common choice when the original data has imbalanced classes. The best K among the candidates tried (1, 6, 11, ..., 146) was **6**.

Below, we examine the model's performance with K=6 on the test set.

## 8. Final Test Set Evaluation

### Final Evaluation Metrics
- Accuracy: 0.7550
- Precision: 0.7003
- Recall: 0.8914
- F1-Score: 0.7844
- Confusion Matrix:

      [[895 109]
       [383 621]]

### Interpretation of Results
- **Accuracy (75.50%)**: The model correctly classifies about 75.5% of test samples.
- **Precision (70.03%)**: Precision indicates that when the model predicts 'gamma' (signal), it is correct 70.03% of the time. This lower precision might be due to the influence of noisy or overlapping features.
- **Recall (89.14%)**: A high recall indicates that the model successfully captures most of the 'gamma' signals. The model effectively avoids false negatives, which could be crucial if identifying all 'gamma' samples is essential.
- **F1-Score (78.44%)**: The F1-score is the harmonic mean of precision and recall. Since both were computed with 'gamma' as the positive class, it summarizes in one number how reliably the model detects gamma events.
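
The four metrics can be reproduced by hand from the reported confusion matrix. The sketch below assumes scikit-learn's default layout for `confusion_matrix`: rows are true labels and columns are predicted labels, both in sorted label order ('g', then 'h').

```python
# Test-set confusion matrix as reported in the final evaluation metrics.
# Rows are true labels, columns are predicted labels, in the order ('g', 'h').
conf_m = [[895, 109],
          [383, 621]]

tp = conf_m[0][0]  # true 'g', predicted 'g'
fn = conf_m[0][1]  # true 'g', predicted 'h'
fp = conf_m[1][0]  # true 'h', predicted 'g'
tn = conf_m[1][1]  # true 'h', predicted 'h'

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.7550
print(f"Precision: {precision:.4f}")  # 0.7003
print(f"Recall:    {recall:.4f}")     # 0.8914
print(f"F1-Score:  {f1:.4f}")         # 0.7844
```

These match the reported values, which confirms that 'g' is the positive class and that the first row of the matrix holds the true gamma samples.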

### Confusion Matrix Interpretation
- **True Positives (895)**: Correctly identified gamma ('g') samples.
- **True Negatives (621)**: Correctly identified hadron ('h') samples.
- **False Positives (383)**: Samples predicted as 'gamma' that are actually 'hadron'.
- **False Negatives (109)**: Samples predicted as 'hadron' that are actually 'gamma'.

### Observations on K's Influence

As K was varied, smaller values gave higher precision but lower recall, which suggests overfitting, while larger values improved recall at the cost of precision. This is a trade-off: higher K values produce a smoother, more stable model but blur the boundary between the classes. K=6 gave the highest validation F1-score among the values tried. On the test set it favors recall (89.14%) over precision (70.03%), so most gamma events are captured, but a sizeable share of hadrons is misclassified as gamma.