<a href="https://www.kaggle.com/code/amirulmahmud/classification-of-rock-mine-with-knn?scriptVersionId=124936047" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Load The Data

Sonar (sound navigation and ranging) is a technique that uses sound propagation to navigate, communicate with, or detect objects on or under the surface of the water.
The dataset contains the response measurements for 60 separate sonar frequencies bounced off a known minefield and known rocks. Each set of responses is labeled with the object the sonar was aimed at (either a rock or a mine).
The objective is to create a machine learning model that can classify an object as a rock or a mine based on its responses to the 60 sonar frequencies.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/rock-or-mine-classification/ROCK_OR_MINE.csv', header=None)

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df = df.rename(columns={60: 'Label'})

In [None]:
df.columns

# Data Cleaning

Before the data is used, let's check whether it contains any missing values.

In [None]:
df.isna().sum()

In [None]:
df.isna().sum().sum()

Fortunately, the data is complete and does not have any missing values.

# Exploratory Data Analysis

Create a heatmap to visualize the correlation between the different frequency responses.

In [None]:
df['Target'] = df['Label'].map({"R":0, "M":1})

In [None]:
plt.figure(figsize=(8,6),dpi=200)
# numeric_only=True skips the string 'Label' column (needed on pandas >= 2.0)
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')

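Find the five frequencies most strongly correlated with the target. The cell takes `tail(6)` because the last entry is `Target` itself, which correlates perfectly with itself.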

In [None]:
np.abs(df.corr(numeric_only=True)['Target']).sort_values().tail(6)
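
A slightly tidier variant of the ranking above drops the target's own entry before sorting, so the result contains only features. The sketch below uses a small synthetic frame rather than the sonar data; the column names `strong`, `weak`, and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sonar frame: three numeric features plus a binary target.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=200),  # tracks the target closely
    "weak": target + rng.normal(0, 3.0, size=200),    # mostly noise
    "noise": rng.normal(0, 1.0, size=200),            # unrelated to the target
    "Target": target,
})

# Absolute correlation with the target, with the target's own entry removed.
corr_with_target = (
    demo.corr()["Target"]
    .drop("Target")
    .abs()
    .sort_values(ascending=False)
)

top = corr_with_target.head(2)
print(top)
```

With the self-correlation removed, `head(n)` returns exactly `n` features and no off-by-one adjustment is needed.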

In [None]:
sns.boxplot(x=df['Label'],y=df[10])

In [None]:
sns.boxplot(x=df['Label'],y=df[11])

In [None]:
sns.boxplot(x=df['Label'],y=df[48])

Check the summary statistics.

In [None]:
df.describe().transpose()

# Split The Data

The approach here uses cross-validation on 80% of the dataset to select the model, and then evaluates that model on a held-out test set made up of the remaining 20%.

In [None]:
df.columns

In [None]:
X = df.drop(['Label','Target'],axis=1)
y = df['Label']

In [None]:
X.columns

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Check the balance of the labels in the training set.

In [None]:
sns.countplot(x=y_train)

In [None]:
y_train.value_counts()

As shown above, the label classes in the training data are almost balanced, so the data is ready for the next steps.

# Modelling with KNN

Create a Pipeline that contains both a StandardScaler and a KNN model.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [None]:
scaler = StandardScaler()

In [None]:
knn = KNeighborsClassifier()

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([('scaler',scaler), ('knn',knn)])
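
As a rough illustration of why the scaler sits in front of the classifier, here is a minimal pure-Python sketch of standardization followed by a nearest-neighbor vote. The helper names (`standardize`, `knn_predict`) and the toy data are made up for this example and are not part of scikit-learn; the real `StandardScaler` and `KNeighborsClassifier` do considerably more.

```python
import math
from collections import Counter

def standardize(train, rows):
    """Scale each column of `rows` using the mean and std of `train` only."""
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def knn_predict(train_X, train_y, query, k=1):
    """Majority vote among the k training points nearest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: feature 0 separates the classes, feature 1 is large-scale noise.
train_X = [[0.1, 500.0], [0.2, 100.0], [0.9, 480.0], [1.0, 120.0]]
train_y = ["R", "R", "M", "M"]
query = [0.95, 495.0]

# Without scaling, the noisy feature dominates the distance.
raw_pred = knn_predict(train_X, train_y, query, k=1)

# With scaling (statistics taken from the training rows only), feature 0 counts again.
scaled_train = standardize(train_X, train_X)
scaled_query = standardize(train_X, [query])[0]
scaled_pred = knn_predict(scaled_train, train_y, scaled_query, k=1)

print(raw_pred, scaled_pred)
```

Because the scaling statistics come only from the training rows, wrapping both steps in a pipeline keeps each cross-validation fold from seeing statistics of its own validation data.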

Perform a grid-search with the pipeline to test various parameters and report back the best performing parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = ({'knn__n_neighbors': list(range(1,30)),
              'knn__weights': ['uniform', 'distance'],
              'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']})

In [None]:
grid = GridSearchCV(estimator=pipe,param_grid=parameters,scoring='accuracy',cv=5)
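
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below only counts combinations and does not fit anything; `param_grid` and `cv_folds` mirror the values used in the cells above.

```python
from itertools import product

# Same search space as the `parameters` dict in the notebook.
param_grid = {
    "knn__n_neighbors": list(range(1, 30)),  # 1 through 29
    "knn__weights": ["uniform", "distance"],
    "knn__algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

That is 232 candidates and 1160 cross-validation fits, plus one final refit on the whole training split when `refit` is left at its default. The `algorithm` options are different exact neighbor-search strategies, so they would be expected to affect speed far more than accuracy.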

In [None]:
grid.fit(X_train,y_train)

Find the best estimator and best parameters

In [None]:
grid.best_estimator_

In [None]:
grid.best_estimator_.get_params()

Cross Validation Results

In [None]:
cv_results = pd.DataFrame(grid.cv_results_)

In [None]:
cv_results.info()

In [None]:
# Select the score column before averaging so non-numeric columns (e.g. 'params') are never aggregated
cv_score = cv_results.groupby('param_knn__n_neighbors')['mean_test_score'].mean()

In [None]:
cv_score
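
Note that this series averages the score over every `weights` and `algorithm` setting for each K, whereas the grid search picks the single best configuration. The two can disagree, as the sketch below shows on a tiny hand-made table; the scores in `toy_results` are hypothetical and are not results from this notebook.

```python
import pandas as pd

# Hypothetical miniature of grid.cv_results_: two K values, two weightings each.
toy_results = pd.DataFrame({
    "param_knn__n_neighbors": [1, 1, 3, 3],
    "param_knn__weights": ["uniform", "distance", "uniform", "distance"],
    "mean_test_score": [0.80, 0.80, 0.70, 0.86],
})

by_k = toy_results.groupby("param_knn__n_neighbors")["mean_test_score"]

# Average over the other parameters, as done for the K plot.
mean_by_k = by_k.mean()

# Best single configuration per K, which is what the grid search selects from.
best_by_k = by_k.max()

print(mean_by_k.idxmax(), best_by_k.idxmax())
```

In this made-up table the averaged curve peaks at K = 1 while the best single configuration uses K = 3, so the plot and `best_params_` should be read as answering slightly different questions.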

In [None]:
plt.figure(figsize=(5,4),dpi=150)
plt.plot(range(1,30),cv_score,'o-')
plt.xlabel('K Value')
plt.ylabel('Accuracy')

# Final Model Evaluation

In [None]:
y_pred = grid.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
accuracy_score(y_test,y_pred)

# Conclusion

1. The best KNN parameters found by the grid search are n_neighbors = 1, weights = 'uniform', and algorithm = 'auto'.
2. The model performs well on the unseen test data (X_test), with an accuracy of 90.48%.