## Overview
In this task, you will analyze a dataset of employee information and train models that predict whether an employee has left the company (attrition). You will first explore the dataset and its features, then apply preprocessing steps such as label encoding and feature scaling. Next, you will build and evaluate models using K-Nearest Neighbors (KNN), Random Forest, and Bagging. After training and tuning each model, you will compare them on metrics such as accuracy, and examine feature importance, to identify the most effective approach for predicting attrition.

In [None]:
student_number = None
full_name = None
# Check both fields explicitly; `a and b is not None` only tests b against None
assert student_number is not None and full_name is not None, 'please input your information'

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualizations
sns.set(style="whitegrid")

# Load and Explore Dataset (5 points)

In [None]:
# Load the dataset
# Note: "Attrition" is our target column

df = pd.read_csv('dataset.csv')

In [None]:
df.head(5)

In [None]:
# TODO: Check the basic structure of the dataset using .info() and .describe()
# Use: df.info() to check data types and missing values
# Use: df.describe() to get summary statistics of numeric features

# TODO: Check for any missing values in the dataset
# Use: df.isnull().sum() to find if any column has missing values

# TODO: Explore the target variable (binary classification)
# Use value_counts() to see the distribution of our target (Attrition) column and then visualize it (bar plot).


df.head()
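
The exploration steps listed in the cell above can be tried on a small stand-in table first. The following is a minimal sketch, assuming a hypothetical `toy_df` with made-up values; only the `Attrition` column name comes from the notebook, and the other columns are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for dataset.csv (values are made up)
toy_df = pd.DataFrame({
    "Age": [29, 41, 35, None, 52],
    "Department": ["Sales", "HR", "Sales", "R&D", "R&D"],
    "Attrition": ["Yes", "No", "No", "No", "Yes"],
})

toy_df.info()                # data types and non-null counts per column
print(toy_df.describe())     # summary statistics for numeric columns

# Number of missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Class distribution of the target column
target_counts = toy_df["Attrition"].value_counts()
print(target_counts)
```

For the bar plot, one option is `target_counts.plot(kind="bar")`, which draws through matplotlib.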

Examine the DataFrame to understand employees' monthly income and the factors that typically influence it.

In [None]:
# TODO: Plot (line plot) the average MonthlyIncome against the YearsAtCompany. 
# TODO: Then find which departments have the highest and lowest incomes on average.
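
Both questions above reduce to a grouped mean. Here is a minimal sketch on hypothetical toy data; the column names `MonthlyIncome` and `YearsAtCompany` come from the notebook, while `Department` and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy_df = pd.DataFrame({
    "YearsAtCompany": [1, 1, 2, 2, 3, 3],
    "Department": ["Sales", "HR", "Sales", "R&D", "HR", "R&D"],
    "MonthlyIncome": [3000, 2500, 3500, 5000, 2700, 5200],
})

# Average income for each tenure value; this Series is what a line plot would show
income_by_tenure = toy_df.groupby("YearsAtCompany")["MonthlyIncome"].mean()

# Average income per department, then the extremes
income_by_dept = toy_df.groupby("Department")["MonthlyIncome"].mean()
highest_dept = income_by_dept.idxmax()
lowest_dept = income_by_dept.idxmin()

print(income_by_tenure)
print(highest_dept, lowest_dept)
```

Calling `income_by_tenure.plot()` would produce the line plot, since the Series index is already the x-axis.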

# Data Preprocessing (5 points)

Label-encode the categorical columns and store the result in a new DataFrame. Then split the data into training and test sets.

In [None]:
# TODO: Label encode all categorical columns
encoded_df= None

In [None]:
# Split the label-encoded data into features and target variable
X = encoded_df.drop(columns=['Attrition'])
y = encoded_df['Attrition']

# TODO: Perform a train-test split using train_test_split() from sklearn
# Split the dataset into training and test sets with a test size of 30%

# TODO: Scale the features using StandardScaler
# Fit the scaler on the training data and transform both the training and test sets
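
The preprocessing order matters: encode, split, then fit the scaler on the training portion only so that no test-set statistics leak into training. A minimal sketch on hypothetical toy data follows; the `random_state` value and the toy columns are assumptions, not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data (values are made up for illustration)
toy_df = pd.DataFrame({
    "Age": [29, 41, 35, 47, 52, 23, 38, 44, 31, 56],
    "Department": ["Sales", "HR", "Sales", "R&D", "R&D",
                   "HR", "Sales", "R&D", "HR", "Sales"],
    "Attrition": ["Yes", "No", "No", "No", "Yes",
                  "Yes", "No", "No", "Yes", "No"],
})

# Label-encode every non-numeric column, using a fresh encoder per column
toy_encoded = toy_df.copy()
for col in toy_encoded.select_dtypes(exclude="number").columns:
    toy_encoded[col] = LabelEncoder().fit_transform(toy_encoded[col])

X_toy = toy_encoded.drop(columns=["Attrition"])
y_toy = toy_encoded["Attrition"]

# 70/30 split; random_state is an assumption added for reproducibility
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# Fit on the training set only, then apply the same transform to both sets
toy_scaler = StandardScaler()
X_train_toy_scaled = toy_scaler.fit_transform(X_train_toy)
X_test_toy_scaled = toy_scaler.transform(X_test_toy)
```

`LabelEncoder` assigns integers in sorted order of the labels, so here `"No"` becomes 0 and `"Yes"` becomes 1.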

# K-Nearest Neighbors (KNN) Model (10 points)

In [None]:
from collections import Counter

class CustomKNN:
    def __init__(self, k):
        """
        Initialize the KNN classifier.

        Parameters:
        - k (int): Number of neighbors to consider.
        """
        # Store the number of neighbors (k)
        self.k = k

    def fit(self, X_train, y_train):
        """
        Fit the KNN classifier to the training data.

        Parameters:
        - X_train (numpy array): Training feature vectors.
        - y_train (numpy array): Training labels.
        """
        # Store training data
        self.X_train = np.array(X_train)
        self.y_train = np.array(y_train)

    def euclidean_distance(self, x1, x2):
        """
        Calculate the Euclidean distance between two data points.

        Parameters:
        - x1 (numpy array): First data point.
        - x2 (numpy array): Second data point.

        Returns:
        - float: Euclidean distance between x1 and x2.
        """
        # TODO: Calculate and return the Euclidean distance
        pass

    def predict(self, X_test):
        """
        Predict labels for test data.

        Parameters:
        - X_test (numpy array): Test feature vectors.

        Returns:
        - numpy array: Predicted labels.
        """
        # TODO: Predict label for each test instance and return the array of predictions
        pass

    def _predict(self, x):
        """
        Predict label for a single data point.

        Parameters:
        - x (numpy array): Test data point.

        Returns:
        - int: Predicted label.
        """
        # TODO: Compute distances from x to all training points.
        # Find the indices and labels of k nearest neighbors.
        # Perform majority vote and return the most common label among them.
        pass
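
The three steps described in `_predict` (distances, k nearest, majority vote) can be sketched outside the class with plain functions. This is one possible approach under stated assumptions, not the required solution: the function name `knn_predict_one` and the toy points are hypothetical.

```python
from collections import Counter

import numpy as np


def euclidean_distance(x1, x2):
    """Square root of the summed squared coordinate differences."""
    return float(np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)))


def knn_predict_one(X_train, y_train, x, k):
    """Predict the label of a single point x by majority vote (hypothetical helper)."""
    # Distance from x to every training row, computed in one vectorized step
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest_idx = np.argsort(distances)[:k]
    # Most common label among those neighbors
    nearest_labels = y_train[nearest_idx].tolist()
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated toy clusters (made-up points)
X_pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_pts = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict_one(X_pts, y_pts, np.array([0.2, 0.1]), k=3)
pred_far = knn_predict_one(X_pts, y_pts, np.array([5.5, 5.5]), k=3)
print(pred_near_origin, pred_far)
```

Using an odd `k` avoids tied votes in binary classification, which is presumably why the suggested `k_values` are all odd.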


In [None]:
# You can choose any range of k values that you want.
k_values = [1, 3, 5, 7, 9, 11, 13, 15]
accuracies = []



for k in k_values:
    y_pred_custom= []
    
    # TODO: Fit the model using the scaled training data
    # TODO: Make predictions on the scaled test data
    # TODO: Evaluate the model's accuracy for each value of k and choose the best one
    
    print(f'k: {k} - Accuracy: {accuracy_score(y_test, y_pred_custom)}')
    

best_custom_model = None

# Keep the best_k value (needed later on with bagging)
best_k= None
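
The selection loop can be sketched as follows. Note the assumptions: this uses scikit-learn's `KNeighborsClassifier` as a stand-in for `CustomKNN`, and synthetic data from `make_classification` in place of the attrition dataset; all `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the attrition dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Scale with statistics from the training split only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_te_s = demo_scaler.transform(X_te)

# Record the accuracy obtained for each candidate k
demo_accuracies = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    demo_accuracies[k] = accuracy_score(y_te, model.predict(X_te_s))

# Keep the k with the highest accuracy
demo_best_k = max(demo_accuracies, key=demo_accuracies.get)
print(demo_accuracies, demo_best_k)
```

Choosing `k` on the test set, as the assignment loop does, tends to give an optimistic accuracy estimate; a separate validation split or cross-validation would be a cleaner design if the assignment allowed it.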

In [None]:
# TODO: Print the accuracy and classification report using Scikit-learn metrics for your best model


In [None]:
# TODO: Create a confusion matrix for KNN predictions
# Use confusion_matrix from sklearn.metrics

# TODO: Visualize the confusion matrix using seaborn's heatmap
# Add annotations and a title for better readability
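
To read the heatmap correctly it helps to know the layout that `confusion_matrix` returns: rows are true labels and columns are predicted labels. A minimal sketch with made-up label vectors (not model output):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows: true class, columns: predicted class
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels, ravel() yields the four cells in this order
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)
```

Passing the matrix to `sns.heatmap(cm_demo, annot=True, fmt="d")` draws the annotated heatmap with integer cell counts.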

## Evaluation (5 points)
In this section, you will assess your model's performance on a new set of unseen data. Load the test.csv file (which has already been encoded), use your best_custom_model to make predictions, and save the results to a CSV file named 'result.csv'. The file should contain a single column labeled 'target', which holds your model's predictions.

In [None]:
# Load test.csv
eval_df= pd.read_csv('test.csv')

# TODO: Scale the data with the scaler already fitted on the training set (transform only, do not refit)
# TODO: Predict using your model

y_pred_eval= None

In [None]:
# Save the results as a csv file
result_df= pd.DataFrame()
result_df['target']=pd.Series(y_pred_eval)
result_df.to_csv('result.csv', index= False)
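
The key point in this step is that new data goes through `transform` on the already-fitted scaler, never `fit_transform`. A minimal sketch, assuming made-up toy arrays and scikit-learn's `KNeighborsClassifier` as a stand-in for the custom model; all `demo_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Made-up training data with features on very different scales
X_train_demo = np.array([[20.0, 1000.0], [30.0, 2000.0],
                         [40.0, 3000.0], [50.0, 4000.0]])
y_train_demo = np.array([1, 1, 0, 0])

# Scaler and model are fitted once, on the training data
demo_scaler = StandardScaler().fit(X_train_demo)
demo_model = KNeighborsClassifier(n_neighbors=1).fit(
    demo_scaler.transform(X_train_demo), y_train_demo
)

# Unseen rows: reuse the fitted scaler with transform(), do not refit
X_new_demo = np.array([[22.0, 1100.0], [48.0, 3900.0]])
demo_preds = demo_model.predict(demo_scaler.transform(X_new_demo))

# Same single-column layout as the result file described above
demo_result = pd.DataFrame({"target": demo_preds})
print(demo_result)
```

Refitting the scaler on the evaluation data would shift the feature means and variances, so distances computed by KNN would no longer be comparable to those seen during training.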

# Bagging with KNN (10 points)

In [None]:
# TODO: Implement Bagging with KNN
# Use BaggingClassifier with KNeighborsClassifier as the base estimator
# Here we use the best_k value we found before

bagging_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=best_k), n_estimators=50, random_state=42)

# TODO: Fit the BaggingClassifier on the scaled training data
# Use bagging_knn.fit() with the training data

# TODO: Use the trained Bagging model for predictions on the test data
# Use bagging_knn.predict()

# TODO: Print the Bagging KNN model accuracy and classification report
# Use accuracy_score and classification_report
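
Bagging trains each base estimator on a bootstrap sample of the training set and combines their predictions. A minimal sketch on synthetic data follows; the data, the fixed `n_neighbors=5`, and `n_estimators=10` are assumptions chosen to keep the example small, not the values the assignment expects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the attrition dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Scale with statistics from the training split only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_te_s = demo_scaler.transform(X_te)

# Each of the 10 KNN estimators is fitted on its own bootstrap sample
demo_bagging = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5), n_estimators=10, random_state=42
)
demo_bagging.fit(X_tr_s, y_tr)

demo_bag_pred = demo_bagging.predict(X_te_s)
demo_bag_acc = accuracy_score(y_te, demo_bag_pred)
print(demo_bag_acc)
print(classification_report(y_te, demo_bag_pred))
```

KNN is a fairly stable learner, so the gain from bagging it is often modest; comparing against the single-model accuracy is the point of the next section.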

In [None]:
# TODO: Create a confusion matrix for Bagging KNN predictions
# Use confusion_matrix from sklearn.metrics

# TODO: Visualize the confusion matrix using seaborn's heatmap
# Add annotations and a title for better readability

# Model Comparison (5 points)

In [None]:
# TODO: Compare model accuracies for KNN, Bagging KNN
# Create a DataFrame with model names and their respective accuracies

# TODO: Visualize the model comparison using a line plot
# Use seaborn's line plot to plot model names vs. accuracies