**K-NEAREST NEIGHBOURS (KNN) – ASSIGNMENT**

**Objective:**

To implement and evaluate the K-Nearest Neighbours (KNN) algorithm on a
classification task: predicting the type of an animal from its features.

**Dataset Description:**

The dataset contains various features related to animals (e.g., number
of legs, presence of hair, feathers, etc.), and the target variable is
the **type of animal**.

**Tasks:**

**1. Analyze the Data Using Visualizations**

-   **Histograms**: To check feature distributions.

-   **Box Plots**: To detect outliers.

-   **Pair Plots**: To visualize relationships between features.

-   **Count Plots**: For distribution of the animal types (target
    variable).

-   **Heatmaps**: To observe feature correlations.
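The plots listed above can be sketched with pandas and matplotlib alone. This is a minimal illustration, not the assignment's actual analysis: the small DataFrame and its column names (`hair`, `feathers`, `legs`, `type`) are made-up stand-ins for the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the animal dataset (column names are assumptions).
df = pd.DataFrame({
    "hair":     [1, 1, 0, 0, 0, 1, 0, 0],
    "feathers": [0, 0, 1, 1, 0, 0, 0, 1],
    "legs":     [4, 4, 2, 2, 0, 2, 6, 2],
    "type":     ["mammal", "mammal", "bird", "bird", "fish", "mammal", "insect", "bird"],
})
features = df.drop(columns="type")

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: distribution of a single feature.
features["legs"].plot(kind="hist", ax=axes[0, 0], title="Histogram of legs")

# Box plots: points beyond the whiskers are outlier candidates.
features.boxplot(ax=axes[0, 1])
axes[0, 1].set_title("Box plots")

# Count plot: number of animals in each target class.
class_counts = df["type"].value_counts()
class_counts.plot(kind="bar", ax=axes[1, 0], title="Animals per type")

# Heatmap: pairwise feature correlations.
corr = features.corr()
im = axes[1, 1].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Feature correlations")
fig.colorbar(im, ax=axes[1, 1])
fig.tight_layout()

# Pair plot: scatter plots for every pair of features.
pd.plotting.scatter_matrix(features, figsize=(6, 6))
# plt.show()  # uncomment in an interactive session
```

With the real data, `df` would be loaded from the dataset file, and the same calls apply to its feature columns.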

**2. Preprocess the Data**

-   **Missing Values**: Check using .isnull().sum() and handle via
    imputation or removal.

-   **Outliers**: Detect using box plots or the z-score/IQR methods, then
    remove, cap, or keep them depending on whether they are data errors
    or genuine values.

-   **Normalization/Scaling**: Use MinMaxScaler or StandardScaler to
    bring all features to the same scale, as KNN is distance-based and
    sensitive to magnitude.
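The three preprocessing steps above could look roughly like this. It is a sketch under stated assumptions: the DataFrame, its column names, and its values are invented for illustration, median imputation and the 1.5 × IQR rule are just one reasonable choice each, and MinMaxScaler is picked arbitrarily over StandardScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features (names and values are made up for illustration).
df = pd.DataFrame({
    "legs":   [4, 4, 2, 2, 0, 2, 6, 100],                  # 100 is an obvious outlier
    "weight": [30.0, 45.0, np.nan, 1.5, 2.0, 60.0, 0.1, 3.0],
})

# 1. Missing values: count them per column, then impute with the column median.
missing_per_column = df.isnull().sum()
df = df.fillna(df.median())

# 2. Outliers: keep rows that lie within 1.5 * IQR of the quartiles in every column.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
within_range = ((df >= q1 - 1.5 * iqr) & (df <= q3 + 1.5 * iqr)).all(axis=1)
df_clean = df[within_range]

# 3. Scaling: map every feature to [0, 1] so no single feature dominates the distance.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(df_clean)

print(missing_per_column)
print(df_clean.shape, X_scaled.min(), X_scaled.max())
```

In a full workflow the scaler is usually fitted on the training split only and then applied to the test split with `scaler.transform`, so that no information from the test set leaks into training.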

**3. Split the Dataset**

-   Split the data into **training (80%)** and **testing (20%)** using:

        from sklearn.model_selection import train_test_split

        # X: feature matrix, y: animal type labels
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

**4. Implement K-Nearest Neighbours (KNN)**

Using **scikit-learn**:

    from sklearn.neighbors import KNeighborsClassifier

    # Minkowski distance with p=2 is the Euclidean distance
    knn = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
    knn.fit(X_train, y_train)
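To make clear what the classifier does at prediction time, here is a from-scratch sketch of the same idea in NumPy. It is not scikit-learn's implementation (which can also use tree-based neighbour searches); the function name `knn_predict` and the toy data are hypothetical and exist only for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, p=2):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Minkowski distance of order p from the query to every training point.
    distances = np.sum(np.abs(X_train - x_query) ** p, axis=1) ** (1.0 / p)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbours' labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: [hair, feathers, legs scaled to 0-1]; values are made up.
X_toy = np.array([
    [1.0, 0.0, 0.50],  # mammal
    [1.0, 0.0, 0.50],  # mammal
    [0.0, 1.0, 0.25],  # bird
    [0.0, 1.0, 0.25],  # bird
    [0.0, 0.0, 0.00],  # fish
])
y_toy = np.array(["mammal", "mammal", "bird", "bird", "fish"])

prediction = knn_predict(X_toy, y_toy, np.array([0.0, 1.0, 0.30]), k=3)
print(prediction)  # bird
```

The query's three nearest neighbours are the two birds and the fish, so the vote is two to one for "bird". KNN has no real training phase: `fit` essentially stores the data, and all the work happens when predicting.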

**5. Choose Distance Metric & Optimal K**

-   **Distance Metric**:

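    -   Euclidean (Minkowski with p=2) is the most common choice.

    -   Manhattan (Minkowski with p=1) is worth trying when features are
        sparse or high-dimensional; compare both on validation data.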

-   **Choosing K**:

    -   Try a range of k values (e.g., 1 to 20) and pick the best one
        using **cross-validation** or the **elbow method** (plotting
        accuracy or error rate against k).
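The search over k and the distance metric described above might be sketched as follows. The bundled iris dataset is used purely as a stand-in for the animal data, and the 5-fold setting and the pipeline with StandardScaler are assumptions of this sketch, not requirements of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: the bundled iris dataset replaces the animal dataset here.
X, y = load_iris(return_X_y=True)

results = {}  # (k, p) -> mean cross-validated accuracy
for p in (1, 2):              # 1 = Manhattan, 2 = Euclidean
    for k in range(1, 21):
        # Scaling lives inside the pipeline so it is re-fitted on each training fold.
        model = make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=k, p=p),
        )
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        results[(k, p)] = scores.mean()

best_k, best_p = max(results, key=results.get)
print(f"Best k={best_k}, p={best_p}, CV accuracy={results[(best_k, best_p)]:.3f}")
```

In the assignment this loop would run on `X_train` and `y_train` only, keeping the test set untouched for the final evaluation. Plotting `results` against k for each p gives the elbow-style curve.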

**6. Evaluate the Classifier**

Use:

    from sklearn.metrics import classification_report, accuracy_score

    y_pred = knn.predict(X_test)

    print(classification_report(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))

**Metrics Used**:

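-   **Accuracy**: Fraction of all predictions that are correct.

-   **Precision**: Of the samples predicted as a class, the fraction that
    truly belong to it.

-   **Recall**: Of the samples that truly belong to a class, the fraction
    predicted as that class.

-   **F1 Score**: Harmonic mean of precision and recall.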
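As a worked illustration of these metrics, the sketch below computes them by hand for one class and cross-checks the result against scikit-learn. The label lists are invented for the example and are not results from the assignment.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up true and predicted labels for illustration.
y_true = ["bird", "bird", "bird", "mammal", "mammal", "fish"]
y_hat = ["bird", "bird", "mammal", "mammal", "mammal", "fish"]
pairs = list(zip(y_true, y_hat))

# Manual computation for the "bird" class.
tp = sum(t == "bird" and h == "bird" for t, h in pairs)  # true positives
fp = sum(t != "bird" and h == "bird" for t, h in pairs)  # false positives
fn = sum(t == "bird" and h != "bird" for t, h in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == h for t, h in pairs) / len(pairs)

# Per-class values from scikit-learn; `labels` fixes the class order.
p_all, r_all, f_all, support = precision_recall_fscore_support(
    y_true, y_hat, labels=["bird", "fish", "mammal"]
)

print(f"bird: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"overall accuracy={accuracy:.2f}")
```

Here one bird is misclassified as a mammal, so recall for "bird" drops while its precision stays perfect. This is why the per-class report is more informative than accuracy alone, especially when animal types are unevenly represented.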

**7. Visualize Decision Boundaries**

Decision boundaries can only be plotted directly in a **2D feature
space**, so first reduce the features to two principal components using
PCA:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    X_train_2D = pca.fit_transform(X_train)
    X_test_2D = pca.transform(X_test)

    knn_2D = KNeighborsClassifier(n_neighbors=3)
    knn_2D.fit(X_train_2D, y_train)

    # Then plot the decision boundary using meshgrid
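The meshgrid step could be sketched as below. The iris dataset again stands in for the animal data, and the grid resolution (200 × 200) and margin of 1 are arbitrary choices of this sketch.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: iris replaces the animal dataset; its labels are already integers.
X, y = load_iris(return_X_y=True)
X_2D = PCA(n_components=2).fit_transform(X)
knn_2D = KNeighborsClassifier(n_neighbors=3).fit(X_2D, y)

# Build a grid that covers the projected data with a small margin.
x_min, x_max = X_2D[:, 0].min() - 1, X_2D[:, 0].max() + 1
y_min, y_max = X_2D[:, 1].min() - 1, X_2D[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 200),
    np.linspace(y_min, y_max, 200),
)

# Predict the class of every grid point, then reshape back to the grid.
Z = knn_2D.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(7, 5))
ax.contourf(xx, yy, Z, alpha=0.3, cmap="viridis")  # coloured decision regions
ax.scatter(X_2D[:, 0], X_2D[:, 1], c=y, cmap="viridis", edgecolor="k", s=25)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("KNN decision regions (k=3) in PCA space")
# plt.show()  # uncomment in an interactive session
```

If the animal types are stored as strings, they need to be encoded as integers first (for example with scikit-learn's `LabelEncoder`), because `contourf` expects numeric values. The regions shown are those of a model trained on the two principal components, so they approximate rather than reproduce the boundaries of the full-feature classifier.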