Importing modules

In [872]:
import sklearn.model_selection  # binds sklearn and loads the submodule used for train_test_split
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn import linear_model, preprocessing
import matplotlib.pyplot as plt

Reading the data, which is a comma-separated CSV file, and then showing the first few rows.

The dataset is from Kaggle: https://www.kaggle.com/datasets/aungpyaeap/fish-market
I will be using K-NN to classify fish into species based on their size measurements.

To clarify what the headings actually mean:

* Species: Name of fish species
* Weight: Weight of fish (g)
* Length1: Vertical length (cm)
* Length2: Diagonal length (cm)
* Length3: Horizontal length (cm)
* Height: Fish height (cm)
* Width: Diagonal width (cm)


In [873]:
data = pd.read_csv("Fish.csv", sep=",")
print(data.head())

  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340


Using scikit-learn's built-in LabelEncoder to convert the species names in Species to integer codes that the classifier can work with

In [874]:
le = preprocessing.LabelEncoder()
Species = le.fit_transform(list(data["Species"]))
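
To see what the encoder does, here is a minimal sketch on a short made-up list of species names (the list `names` is illustrative and is not read from the CSV). `LabelEncoder` sorts the distinct labels and numbers them from 0, and `inverse_transform` maps the integers back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only (assumption); the notebook passes data["Species"].
names = ["Bream", "Roach", "Bream", "Pike", "Smelt", "Pike"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the distinct labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform turns integer codes back into species names.
decoded = encoder.inverse_transform(codes).tolist()
print(decoded)
```

The same `inverse_transform` call can be applied to predictions if readable species names are wanted instead of integer codes.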

Setting the features (X) and the target (y)

In [875]:
X = data[["Weight", "Length1", "Length2", "Length3", "Height", "Width"]]
y = Species

Splitting the data into training and testing sets, then printing the accuracy. I have set K to 9 because, after trying several values, it seemed to be the most accurate.

One issue I noticed is that the accuracy was always a clean-looking number with few decimal places. The reason is that the test set is tiny: with test_size = 0.1 it holds only a handful of fish, so the accuracy can only move in steps of 1/n for n test samples. With ~10x more data the steps would be much finer, and the model would likely be more accurate too.

The same issue shows up again in the code I wrote to return the accuracy of predicting each species.
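
The "clean" scores follow from simple counting, which the sketch below illustrates. It assumes a test set of 16 fish (what a 10% split of roughly 160 rows would give; the exact row count is an assumption here). A score such as 0.4375 is then exactly 7 correct out of 16.

```python
from fractions import Fraction


def possible_accuracies(n_test):
    """Return every accuracy value that is reachable with n_test samples."""
    return [Fraction(k, n_test) for k in range(n_test + 1)]


# Assumption: 16 test samples.
n_test = 16
values = possible_accuracies(n_test)

print(len(values))       # number of distinct scores that can occur
print(float(values[7]))  # 7 correct out of 16 -> 0.4375
print(float(values[1]))  # smallest possible change in the score -> 0.0625
```

With only 17 possible values, one extra correct or incorrect fish moves the score by more than six percentage points.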

In [876]:
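# Hold out 10% of the rows for testing; the split is random, so results vary between runs
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

# K-NN classifier that votes among the 9 nearest training samples
model = KNeighborsClassifier(n_neighbors=9)

model.fit(x_train, y_train)
acc = model.score(x_test, y_test)  # fraction of test samples classified correctly
print(acc)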

0.4375
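
Because a single small test split gives a noisy score, comparing K values on one split can be misleading. One possible alternative, sketched here and not the method used in this notebook, is to average the score over several cross-validation folds for each K. The iris data bundled with scikit-learn is used as a stand-in (an assumption, so that the example runs without Fish.csv).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): iris ships with scikit-learn, so no file is needed.
X_demo, y_demo = load_iris(return_X_y=True)

mean_scores = {}
for k in range(1, 16, 2):  # odd values of K from 1 to 15
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: every sample is used for testing exactly once.
    scores = cross_val_score(knn, X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()

# Pick the K with the highest mean accuracy across the folds.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

For the fish data, `X_demo` and `y_demo` would be replaced by `X` and `y` from the cells above.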


For each unique species in the dataset, I calculated the accuracy on the test fish of that species. However, because the dataset is small, some splits left the test set without any fish of certain species, so there was nothing to score for them.

As well as this, the percentages are "clean" numbers, which shows that the dataset is too small for the per-species figures to be reliable, with the exception of Bream, which appears far more often in the dataset than most species. In the case of Bream, the classification was quite successful.
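
One way to make empty species in the test set less likely is a stratified split, which keeps the class proportions roughly equal in the training and testing parts. The sketch below uses made-up, imbalanced labels (the class sizes are an assumption, not the real species counts) and the `stratify` parameter of `train_test_split`. With very rare classes a larger `test_size` may still be needed.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels (assumption): seven classes, 160 samples in total.
class_sizes = [50, 30, 20, 20, 20, 10, 10]
y_demo = np.repeat(np.arange(len(class_sizes)), class_sizes)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# Count how many test samples each class received.
test_counts = Counter(y_te.tolist())
print(sorted(test_counts.items()))

missing = set(range(len(class_sizes))) - set(test_counts)
print("classes missing from the test set:", missing)
```

In the notebook this would correspond to adding `stratify=y` to the existing `train_test_split` call.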

In [877]:
unique_species = data["Species"].unique()

# 0 is only a placeholder; species absent from the test set keep this value
species_accuracies = {species: 0 for species in unique_species}

y_pred = model.predict(x_test)

for species in unique_species:
    # Integer code that the LabelEncoder assigned to this species
    encoded_species = le.transform([species])[0]
    # Positions in the test set whose true label is this species
    species_indices = [i for i, label in enumerate(y_test) if label == encoded_species]

    if len(species_indices) > 0:
        # Share of this species' test fish that were predicted correctly
        correct_predictions = sum(1 for idx in species_indices if y_pred[idx] == encoded_species)
        species_accuracy = correct_predictions / len(species_indices)
        species_accuracies[species] = species_accuracy
        print(f"Accuracy for {species}: {species_accuracy:.2%}")
    else:
        print(f"No instances of {species} in the test set")

Accuracy for Bream: 100.00%
Accuracy for Roach: 50.00%
Accuracy for Whitefish: 0.00%
Accuracy for Parkki: 0.00%
Accuracy for Perch: 33.33%
Accuracy for Pike: 0.00%
Accuracy for Smelt: 100.00%
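
The per-species figure computed above (correct predictions divided by the number of test fish of that species) is what is usually called per-class recall. As a cross-check, scikit-learn's `recall_score` with `average=None` should return the same quantity. The sketch below uses made-up true and predicted labels (an assumption) in place of `y_test` and `y_pred`.

```python
from sklearn.metrics import recall_score

# Made-up labels (assumption), standing in for y_test and y_pred.
y_true = ["Bream", "Bream", "Roach", "Roach", "Perch", "Perch", "Perch"]
y_hat = ["Bream", "Bream", "Roach", "Perch", "Perch", "Roach", "Roach"]

labels = sorted(set(y_true))
# average=None returns one recall value per label, in the order of `labels`.
per_class = recall_score(y_true, y_hat, labels=labels, average=None)

for name, value in zip(labels, per_class):
    print(f"Accuracy for {name}: {value:.2%}")
```

Only three Perch and two Roach appear in this toy test set, so the results can only be thirds and halves, which is the same "clean number" effect seen in the output above.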
