# Sonar Classification using Neural Network

In this notebook I build a model to classify the classic Sonar dataset, a collection of sonar readings used to predict whether the detected object is a rock or a mine. It has 208 instances and 61 columns: 60 columns of sonar measurements, plus a final column that records whether the object was a rock or a mine.

### Step 01: Importing Libs

I use pandas and NumPy for data structures, and scikit-learn (sklearn) for everything related to learning.

In [35]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

np.set_printoptions(precision=2)

### Step 02: Loading Sonar dataset

I'm using an .xlsx file that my teacher gave me, but it contains the same data as https://www.kaggle.com/datasets/mayurdalvi/sonar-mine-dataset/data

In [36]:
sonar = pd.read_excel('../../datasets/sonar.xlsx', sheet_name=0)  
sonar.head(5)

Unnamed: 0,Atributo_1,Atributo_2,Atributo_3,Atributo_4,Atributo_5,Atributo_6,Atributo_7,Atributo_8,Atributo_9,Atributo_10,...,Atributo_52,Atributo_53,Atributo_54,Atributo_55,Atributo_56,Atributo_57,Atributo_58,Atributo_59,Atributo_60,Classe
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,Rocha
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,Rocha
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,Rocha
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,Rocha
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,Rocha


### Step 03: Preprocessing and splitting the data

Separate the features and labels of the sonar dataset, and transform the categorical labels into numeric values.

By default, train_test_split uses 75% of the data for the training set and 25% for the test set. Since we have little data, we will use 90% for training and 10% for testing.

In [23]:
# Features: every column except the last one
X = sonar.iloc[:, :-1]

# Labels: encode the last column ('Mina' / 'Rocha') as integers
le = LabelEncoder()
y = le.fit_transform(sonar.iloc[:, -1])

class_names = le.classes_

# Hold out 10% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
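
To make the encoding and the 90/10 split concrete, here is a minimal sketch on placeholder data with the same shape as the sonar table. The names `X_demo`, `labels_demo` and `le_demo`, the 104/104 class balance, and the use of `stratify` are assumptions for illustration; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Placeholder data shaped like the sonar table (208 rows, 60 features).
# The 104/104 class balance is made up for illustration only.
X_demo = np.zeros((208, 60))
labels_demo = np.array(["Rocha"] * 104 + ["Mina"] * 104)

# LabelEncoder sorts the class names, so Mina -> 0 and Rocha -> 1.
le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(labels_demo)
print(le_demo.classes_)

# stratify=y_demo keeps the class proportions nearly equal in both subsets,
# which helps when the test set is as small as this one.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))
```

With 208 rows and `test_size=0.1`, the test set holds 21 rows, which matches the support shown in the classification report later on.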

### Step 04: Normalize data

Compute the mean and standard deviation of each attribute on the training set, standardize each attribute with Z = (X - mean) / standard deviation, and apply the same transformation (using the training statistics) to the test data.

This step did not improve the results; they stayed exactly the same =(

I kept it just for example purposes.

In [24]:
mean_on_train = X_train.mean(axis=0)
std_on_train = X_train.std(axis=0)

X_train_scaled = (X_train - mean_on_train) / std_on_train
X_test_scaled = (X_test - mean_on_train) / std_on_train

In [25]:
X_train_scaled.head(5)

Unnamed: 0,Atributo_1,Atributo_2,Atributo_3,Atributo_4,Atributo_5,Atributo_6,Atributo_7,Atributo_8,Atributo_9,Atributo_10,...,Atributo_51,Atributo_52,Atributo_53,Atributo_54,Atributo_55,Atributo_56,Atributo_57,Atributo_58,Atributo_59,Atributo_60
15,-0.007495,0.638671,0.483103,0.762932,1.483681,2.098394,1.5319,0.762256,-0.293677,-0.902187,...,-0.10511,-1.09046,0.618111,-0.528433,1.622807,-0.133417,1.222518,-0.475749,1.851083,0.148767
7,0.919905,0.441774,0.971801,-0.478792,0.689278,-0.200289,-0.299491,-0.840796,-0.288733,0.530981,...,-0.932204,-0.583358,0.15055,-0.879039,0.359989,0.224458,0.083374,-0.506019,-0.519931,-0.236277
55,-0.414544,-0.827774,-0.85827,-0.631429,-0.372824,-1.396543,-0.516262,-0.589049,-0.395842,-0.65178,...,-1.094379,-0.634068,-0.912089,-0.730706,-0.278359,-0.798043,-0.596712,-0.778454,-1.143883,-0.910104
92,-0.166958,-0.604427,-0.524836,-1.01096,-0.711792,-0.570558,0.076669,-0.313595,-0.240946,-0.340755,...,-0.283503,-0.208102,0.15055,-0.79813,-0.347744,-1.172959,-0.766733,-1.096294,-0.582327,0.225776
134,3.286664,1.97581,-0.5172,0.589669,-0.023425,0.139822,3.358509,3.648878,3.057203,2.788257,...,0.040847,0.420705,0.008865,1.669598,2.510942,-0.883251,0.066372,0.629125,0.010427,1.496421
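
The same standardization can also be done with scikit-learn's `StandardScaler`. The sketch below uses small random frames (`train_demo` and `test_demo` are hypothetical names) instead of the sonar data. One detail to be aware of: pandas `.std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two approaches differ by a small constant factor unless `ddof=0` is passed explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small random frames standing in for X_train / X_test (illustration only).
rng = np.random.default_rng(0)
train_demo = pd.DataFrame(rng.random((20, 3)), columns=["a", "b", "c"])
test_demo = pd.DataFrame(rng.random((5, 3)), columns=["a", "b", "c"])

# Fit on the training data only, then reuse the same statistics for the test data.
scaler = StandardScaler().fit(train_demo)
train_sk = scaler.transform(train_demo)
test_sk = scaler.transform(test_demo)

# Manual equivalent: ddof=0 matches what StandardScaler computes.
mean_demo = train_demo.mean(axis=0)
std_demo = train_demo.std(axis=0, ddof=0)
train_manual = (train_demo - mean_demo) / std_demo
test_manual = (test_demo - mean_demo) / std_demo

print(np.allclose(train_sk, train_manual.to_numpy()))
print(np.allclose(test_sk, test_manual.to_numpy()))
```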


### Step 05: Creating model and training

The most reliable way to determine the number of neurons in the hidden layer is a systematic search. However, a widely used starting point is:

(num_inputs + num_outputs) / 2

With 60 inputs and 1 output this gives 30.5, which is rounded up to 31 in the model below.

In [31]:
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=[31], max_iter=1000, random_state=0)
mlp.fit(X_train_scaled, y_train)
y_pred = mlp.predict(X_test_scaled)
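
The systematic search mentioned above could look roughly like the following sketch, which uses `GridSearchCV` with cross-validation. It runs on synthetic data from `make_classification` rather than the sonar file, and the candidate sizes (15, 31, 60) are arbitrary assumptions chosen around the heuristic value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the sonar data (208 x 60).
X_demo, y_demo = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)

# Scaling sits inside the pipeline so each CV fold is scaled
# with the statistics of its own training part.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", max_iter=1000, random_state=0),
)

# Candidate sizes are arbitrary; 31 is the (60 + 1) / 2 heuristic rounded up.
candidates = [(15,), (31,), (60,)]
param_grid = {"mlpclassifier__hidden_layer_sizes": candidates}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_size = search.best_params_["mlpclassifier__hidden_layer_sizes"]
print("Best hidden layer size:", best_size)
print("Mean CV accuracy: {:.2f}".format(search.best_score_))
```

On the real data, `X_demo` and `y_demo` would be replaced by the training features and labels, keeping the test set untouched until the final evaluation.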

In [34]:
print("Network layers: {}".format(mlp.n_layers_))
print("Neurons in the hidden layer: {}".format(mlp.hidden_layer_sizes))
print("Neurons in the output layer: {}".format(mlp.n_outputs_))
print("Weights between input and hidden layer: {}".format(mlp.coefs_[0].shape))
print("Weights between hidden and output layer: {}".format(mlp.coefs_[1].shape))
print("Training set accuracy: {:.2f}".format(mlp.score(X_train_scaled, y_train)))
print("Test set accuracy: {:.2f}".format(mlp.score(X_test_scaled, y_test)))

Network layers: 3
Neurons in the hidden layer: [31]
Neurons in the output layer: 1
Weights between input and hidden layer: (60, 31)
Weights between hidden and output layer: (31, 1)
Training set accuracy: 1.00
Test set accuracy: 0.86


### Step 06: Evaluating model

In [33]:
print(classification_report(y_test, y_pred, target_names=class_names))

cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)

              precision    recall  f1-score   support

        Mina       0.78      0.88      0.82         8
       Rocha       0.92      0.85      0.88        13

    accuracy                           0.86        21
   macro avg       0.85      0.86      0.85        21
weighted avg       0.86      0.86      0.86        21

[[ 7  1]
 [ 2 11]]
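
The numbers in the report can be recomputed by hand from the confusion matrix, where rows are the true classes and columns the predicted classes, in the order Mina, Rocha. A short sketch, with the matrix copied from the output above into a hypothetical variable `cm`:

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class (Mina, Rocha); columns: predicted class (Mina, Rocha).
cm = np.array([[7, 1],
               [2, 11]])

correct = np.diag(cm)                 # correctly classified samples per class
recall = correct / cm.sum(axis=1)     # divide by the true count of each class
precision = correct / cm.sum(axis=0)  # divide by the predicted count of each class
accuracy = correct.sum() / cm.sum()   # 18 correct out of 21

for name, p, r in zip(["Mina", "Rocha"], precision, recall):
    print("{}: precision={:.2f} recall={:.2f}".format(name, p, r))
print("accuracy={:.2f}".format(accuracy))
```

For example, the recall for Mina is 7 / (7 + 1) and its precision is 7 / (7 + 2), since two rocks were predicted as mines.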


### Considerations

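The gap between the training set accuracy (1.00) and the test set accuracy (0.86) suggests that the model is overfitting. Keep in mind that the test set has only 21 samples: 0.86 corresponds to 18 correct predictions, and a single misclassification shifts the score by about 5 percentage points, so this estimate is noisy.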
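
One possible way to get a steadier estimate, and to check whether stronger regularization narrows the gap, is k-fold cross-validation with different values of the L2 penalty `alpha`. This is only a sketch on synthetic data; the two `alpha` values (the scikit-learn default 1e-4 and 1.0) and the 5-fold setup are assumptions, not something tested in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the sonar data (208 x 60).
X_demo, y_demo = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)

# Every sample is used for testing exactly once across the 5 folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for alpha in (1e-4, 1.0):  # default L2 penalty vs. a much stronger one
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(solver="lbfgs", hidden_layer_sizes=(31,),
                      alpha=alpha, max_iter=1000, random_state=0),
    )
    results[alpha] = cross_val_score(model, X_demo, y_demo, cv=cv)
    print("alpha={}: mean accuracy {:.2f} (+/- {:.2f})".format(
        alpha, results[alpha].mean(), results[alpha].std()))
```

The spread across folds gives a rough idea of how much a single 21-sample test score can vary.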