## <center> Assignment 3 </center>

#### Name: Aditya Chauhan
#### Student ID: 169027493

You are provided with a training dataset and a testing dataset for a binary classification problem with labels {0, 1}. The last column of the training dataset contains the labels, while the testing dataset includes only attributes (descriptive features).

Train an effective classifier using the training dataset. You may choose your data processing approach, classifier type, and parameter tuning methods as needed. The sklearn package in Python is recommended for implementing your model.

Make predictions on the testing dataset and generate a file containing a single column of predicted labels (0 or 1) in the same order as the testing dataset. Ensure that the output file does not include a header and that your prediction.txt file contains exactly one column and 352 rows.

Please submit your implementation code and the predicted output file as two separate files (not compressed into a zip file), named <b>A3.ipynb</b> and <b>prediction.txt</b>, respectively. Your assignment will be evaluated based on your model's performance, particularly its F1-score, as well as other criteria.

In [4]:
import pandas as pd
df_train = pd.read_csv('A3_data/train.csv', sep=',', index_col=0)
df_test_attribute_only = pd.read_csv('A3_data/test_attribute.csv', sep=',', index_col=0)

In [5]:
# Data Exploration
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

print("Train Dataset Info:")
print(df_train.info())
print("\nTest Dataset Info:")
print(df_test_attribute_only.info())

print("\nTrain Dataset Head:")
print(df_train.head())

print("\nTest Dataset Head:")
print(df_test_attribute_only.head())

# Check for missing values
print("\nMissing Values in Train Dataset:")
print(df_train.isnull().sum())

print("\nMissing Values in Test Dataset:")
print(df_test_attribute_only.isnull().sum())

# Correlation Heatmap
# plt.figure(figsize=(10, 8))
# sns.heatmap(df_train.corr(), annot=True, cmap="coolwarm", fmt=".2f")
# plt.title("Correlation Heatmap")
# plt.show()

X_train = df_train.iloc[:, :-1]
y_train = df_train.iloc[:, -1]

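# Handle missing values & scale
# Note: both transformers are fit on all training rows, before the validation split below,
# so the validation F1 score may be slightly optimistic.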
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(df_test_attribute_only)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Train k-NN Model

# Split training data into train + validation
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train k-NN classifier & validate the model
# n_neighbors (k) is the main tuning parameter; changing it changes the validation F1 score
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_split, y_train_split)
y_val_pred = knn.predict(X_val)
print("\nF1 Score on Validation Set:", f1_score(y_val, y_val_pred))

# Predictions on Test Data
y_test_pred = knn.predict(X_test)
output_file = 'prediction.txt'
np.savetxt(output_file, y_test_pred.astype(int), fmt='%d')
print(f"\nPredictions saved to {output_file}")

Train Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 652 entries, 0 to 651
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       652 non-null    float64
 1   1       652 non-null    float64
 2   2       652 non-null    float64
 3   3       652 non-null    float64
 4   4       652 non-null    float64
 5   5       652 non-null    float64
 6   6       652 non-null    float64
 7   7       652 non-null    float64
 8   8       652 non-null    int64  
dtypes: float64(8), int64(1)
memory usage: 50.9 KB
None

Test Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 352 entries, 0 to 351
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       352 non-null    float64
 1   1       352 non-null    float64
 2   2       352 non-null    float64
 3   3       352 non-null    float64
 4   4       352 non-null    float64
 5   5       352 non-null    float64
 6 

### Briefly describe your approach in the following cell.

## Data Exploration

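Before building the model, I explored the data to identify potential issues:

- Loaded both datasets and checked their structure with `info()` and `head()`
- Checked for missing values (every training column shows 652 non-null entries)
- Used a correlation heatmap to visualize relationships between features (the plotting code is commented out in the final notebook)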

## Data Preprocessing

- Imputed missing values with the column mean (`SimpleImputer`), so no rows are discarded
- Applied `StandardScaler` to standardize the features
    - Fit the scaler on the training set and applied the same transformation to the test set
- Separated the features from the target (the last column) in the training dataset
    - The test set contains attributes only and is used solely for predictions
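
As a hedged illustration of these preprocessing steps, the imputer and scaler can be chained so that the statistics learned from the training data are reused unchanged on the test data. The arrays `X_train_demo` and `X_test_demo` are synthetic stand-ins (an assumption), not the assignment data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 numeric features, like the training set.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 8))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 8))
X_train_demo[0, 0] = np.nan  # inject one missing value to exercise the imputer

# Mean imputation followed by standardization, chained so they are applied together.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Column means and standard deviations are learned from the training data only...
X_train_scaled = preprocess.fit_transform(X_train_demo)
# ...and reused unchanged on the test data.
X_test_scaled = preprocess.transform(X_test_demo)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Calling `transform` (not `fit_transform`) on the test data is what keeps both sets on the same scale.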

## Model Selection

k-NN was chosen because it is simple and captures local patterns, which complements this dataset. Since the task is to assign each point a label of 0 or 1, I felt that similarity-based learning was the best approach.

## Model Implementation

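- Used scikit-learn's `KNeighborsClassifier`, which uses Euclidean distance by default
- Training points are ranked by distance to the query point for neighbor selection
- The k closest points are selected and vote on the label
    - If neighbors tie on distance, the result depends on the ordering of the training data
- Predictions are saved to `prediction.txt`, one label per line with no header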
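
The steps above can be sketched by hand in NumPy. This is an illustrative re-implementation, not scikit-learn's internals (the notebook itself relies on `KNeighborsClassifier`); the function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict binary labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=int)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point.
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # A stable sort keeps lower-index points first when distances are equal.
        nearest = np.argsort(distances, kind="stable")[:k]
        # Majority vote; argmax returns the smaller label if the vote is tied.
        votes = np.bincount(y_train[nearest], minlength=2)
        predictions.append(int(votes.argmax()))
    return np.array(predictions)

# Tiny hand-made example (hypothetical data, not from the assignment).
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.2, 0.1], [5.5, 5.5]])
demo_predictions = knn_predict(X_demo, y_demo, queries, k=3)
print(demo_predictions)
```

With an odd k and two classes, the vote itself cannot tie, which is one reason odd values such as 3 are a common choice.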

## Model Training

- The final k-NN model uses k = 3, chosen by trial and error on the validation F1 score
- Used an 80/20 training-validation split to evaluate on held-out validation data
- Performance was measured with the F1 score
- Attempted cross-validation to find the optimal k, but did not get it working
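
For reference, one possible way to select k by cross-validation is a grid search over a pipeline. This is a sketch under assumptions, not the approach used in the notebook: the data comes from `make_classification` as a stand-in for `train.csv`, and the candidate range of k values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) shaped like the training set: 652 rows, 8 features.
X_demo, y_demo = make_classification(
    n_samples=652, n_features=8, n_informative=5, random_state=42
)

# Preprocessing lives inside the pipeline, so it is re-fit on each training fold
# and never sees the fold being validated.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate odd k values (an arbitrary illustrative range).
param_grid = {"knn__n_neighbors": list(range(1, 32, 2))}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["knn__n_neighbors"])
print("Mean cross-validated F1:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the pipeline refit on all the data with the selected k, so it could be used directly for test predictions.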

## Advantages & Disadvantages

### Advantages

- Simple
- Flexible

### Disadvantages

- Highly sensitive to feature scaling and the choice of k
- Performance degrades as the number of dimensions grows (the curse of dimensionality)
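
The scaling sensitivity can be checked with a small experiment. The sketch below uses synthetic data (an assumption) in which one noise feature is inflated by a factor of 1000, then compares k-NN with and without `StandardScaler`. On data like this the unscaled model would typically score lower, since the inflated feature dominates the distance, but the exact numbers depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption). With shuffle=False the columns are ordered as
# informative, then redundant, then pure-noise features, so the last column is noise.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=2,
    shuffle=False, random_state=0,
)
X_demo[:, -1] *= 1000.0  # put one noise feature on a much larger scale

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
raw_knn = KNeighborsClassifier(n_neighbors=3)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

f1_raw = cross_val_score(raw_knn, X_demo, y_demo, scoring="f1", cv=cv).mean()
f1_scaled = cross_val_score(scaled_knn, X_demo, y_demo, scoring="f1", cv=cv).mean()
print(f"F1 without scaling: {f1_raw:.3f}")
print(f"F1 with scaling:    {f1_scaled:.3f}")
```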