<div style="text-align: center"><img src="https://img.freepik.com/free-vector/cute-underwater-animals-fish-seahorse-jellyfish-octopus-vector-cartoon-set-aquarium-characters-funny-marine-creatures-puffer-fish-isolated-black-background_107791-10364.jpg?t=st=1727454528~exp=1727458128~hmac=897f710448f4a156c45c9ece16702960931ffd6cc25f1d26c413f5767b0cfc50&w=2000" width="100%" height="100%" alt="Cartoon underwater animals: fish, seahorse, jellyfish, and octopus"></div>

In this project, we explore various **clustering techniques** on a fish species dataset. Although the true species labels are known, we omit them during the clustering process and only use them later to assess the results.

We apply KMeans, Agglomerative Clustering, DBSCAN, and a Gaussian Mixture Model, aiming for 9 clusters that correspond to the 9 distinct species, and then compare the resulting clusters to the true labels.

Clustering is a valuable method in data analysis, uncovering hidden patterns and grouping similar objects based on their features, making it crucial for exploratory research.

Ultimately, we will evaluate how closely each technique matches the actual species, providing insights into the best-performing method for this dataset.
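
One common way to score how well a clustering matches known labels is the adjusted Rand index (ARI). The text above does not say which metric is used, so treating ARI as the metric is an assumption; the label arrays below are made up for illustration. The sketch shows that the score depends only on how points are grouped, not on which integer each cluster happens to receive.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth labels for nine samples in three groups.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping, but the cluster ids are permuted (as clustering output often is).
pred_permuted = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# A worse clustering that merges the first two groups into one.
pred_merged = [0, 0, 0, 0, 0, 0, 1, 1, 1]

ari_permuted = adjusted_rand_score(true_labels, pred_permuted)
ari_merged = adjusted_rand_score(true_labels, pred_merged)

print("permuted ids:", ari_permuted)   # identical grouping scores 1.0
print("merged groups:", ari_merged)    # imperfect grouping scores below 1.0
```

Because cluster ids are arbitrary, a permutation-invariant score like this avoids having to map cluster numbers to species names by hand.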

`Clustering` is an unsupervised machine learning method used to group data points into clusters based on similarity. Some popular clustering techniques include:

`1. K-Means Clustering:`
1. Partitions the data into a fixed number of clusters (k).
1. Each data point belongs to the cluster with the nearest mean.
1. Iteratively refines the cluster centroids to minimize the variance within clusters.
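
The three K-Means properties above can be sketched on synthetic data. The blobs, seed, and variable names here are illustrative assumptions, not the fish dataset. The check at the end confirms that every point ends up in the cluster whose centroid is nearest.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# k is fixed up front; n_init restarts guard against a poor initialization.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Recompute the assignment by hand: distance from every point to every centroid.
distances = np.linalg.norm(
    points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)

print("centroids:\n", kmeans.cluster_centers_.round(1))
print("within-cluster sum of squares:", round(kmeans.inertia_, 2))
print("labels match nearest centroid:", bool((nearest == labels).all()))
```

`inertia_` is the quantity the iterations reduce: the summed squared distance from each point to its assigned centroid.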

`2. Agglomerative Clustering (Hierarchical):`
1. Builds a hierarchy of clusters by repeatedly merging the closest pair of clusters.
1. Can be visualized using a dendrogram.
1. Unlike K-Means, the hierarchy is built without specifying the number of clusters; a cluster count or a distance cutoff is only needed when cutting the tree into flat clusters.
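
As a rough sketch of the last point, scikit-learn's `AgglomerativeClustering` can cut the hierarchy either at a fixed cluster count or at a distance cutoff. The synthetic blobs, the `average` linkage, and the threshold of `5.0` are assumptions chosen for this toy data, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Option A: cut the tree so that exactly three clusters remain.
agg_k = AgglomerativeClustering(n_clusters=3, linkage="average")
labels_k = agg_k.fit_predict(points)

# Option B: give no cluster count; stop merging once the closest pair of
# clusters is farther apart than the (assumed) threshold.
agg_t = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="average"
)
labels_t = agg_t.fit_predict(points)

print("clusters found with the distance cutoff:", agg_t.n_clusters_)
```

With a cutoff, the number of clusters becomes an output (`n_clusters_`) rather than an input, at the cost of having to pick a meaningful distance scale.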

`3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):`
1. Groups points that are closely packed (points with many nearby neighbors).
1. Points that are isolated from others are considered noise.
1. Unlike K-Means, it can identify clusters of arbitrary shape and size.
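
The following sketch illustrates these points on a hand-built shape: a ring surrounding a small central group, plus one far-away point. The geometry and the `eps` and `min_samples` values are assumptions picked so the example is fully deterministic; they are not tuned for the fish data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A ring of radius 5 (200 evenly spaced points) around a small central group.
ring_angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
ring = 5.0 * np.column_stack([np.cos(ring_angles), np.sin(ring_angles)])

core_angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
core = 0.3 * np.column_stack([np.cos(core_angles), np.sin(core_angles)])

# One isolated point far from everything else.
outlier = np.array([[20.0, 20.0]])
points = np.vstack([ring, core, outlier])

# eps is the neighborhood radius; min_samples is the density requirement.
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(points)

n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print("clusters:", n_clusters)      # ring and central group
print("noise points:", n_noise)     # the isolated point gets label -1
```

A centroid-based method would struggle here because the ring and the central group share the same center; density-based grouping separates them by following the connected dense regions.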

`4. Gaussian Mixture Model (GMM):`
1. Assumes that the data is generated from a mixture of several Gaussian distributions.
1. Each cluster corresponds to a Gaussian component, and points are assigned probabilistically to these clusters.
1. It's more flexible than K-Means because it can model elliptical clusters, and it works well when each cluster is roughly Gaussian.
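
A small sketch of the probabilistic assignment described above, using two synthetic elongated (correlated) clusters. The means, covariance, and seed are illustrative assumptions rather than values from the fish dataset.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two elliptical clusters sharing the same correlated covariance.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
cluster_a = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100)
cluster_b = rng.multivariate_normal(mean=[5.0, -5.0], cov=cov, size=100)
points = np.vstack([cluster_a, cluster_b])

# "full" lets every component learn its own unrestricted covariance matrix.
gmm = GaussianMixture(
    n_components=2, covariance_type="full", n_init=5, random_state=0
)
gmm.fit(points)

# Soft assignment: one probability per component, each row sums to 1.
probs = gmm.predict_proba(points)
# Hard assignment: the most probable component for each point.
hard_labels = gmm.predict(points)

print("membership probabilities (first 3 rows):\n", probs[:3].round(3))
print("learned covariances:\n", gmm.covariances_.round(2))
```

The learned covariance matrices carry the orientation and stretch of each cluster, which is what allows elliptical shapes.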

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.patches import Patch
import seaborn as sns
%matplotlib inline

palette1 = ["#f72585", "#b5179e", "#7A0CC3", "#560bad", "#330A92", "#3f37c9", "#4361ee", "#4895ef", "#4cc9f0"]
palette2 = ["#2C7B7B", "#004D4A", "#D02748", "#F9E03B",  "#F0C808", "#D5A2D5", "#3E2A63", "#6BC6D3", "#007BA8"]
sns.set_theme(context='notebook', palette=palette2, style='darkgrid')

import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("/kaggle/input/fish-species-sampling-weight-and-height-data/fish_data.csv")

In [3]:
df.head()

| | species | length | weight | w_l_ratio |
|---|---|---|---|---|
| 0 | Anabas testudineus | 10.66 | 3.45 | 0.32 |
| 1 | Anabas testudineus | 6.91 | 3.27 | 0.47 |
| 2 | Anabas testudineus | 8.38 | 3.46 | 0.41 |
| 3 | Anabas testudineus | 7.57 | 3.36 | 0.44 |
| 4 | Anabas testudineus | 10.83 | 3.38 | 0.31 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4080 entries, 0 to 4079
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   species    4080 non-null   object 
 1   length     4080 non-null   float64
 2   weight     4080 non-null   float64
 3   w_l_ratio  4080 non-null   float64
dtypes: float64(3), object(1)
memory usage: 127.6+ KB


In [5]:
df.describe()

| | length | weight | w_l_ratio |
|---|---|---|---|
| count | 4080.0 | 4080.0 | 4080.0 |
| mean | 17.353544 | 3.739875 | 0.252782 |
| std | 7.114684 | 1.040365 | 0.123046 |
| min | 6.36 | 2.05 | 0.08 |
| 25% | 11.3275 | 3.07 | 0.17 |
| 50% | 17.35 | 3.31 | 0.19 |
| 75% | 22.585 | 4.1 | 0.34 |
| max | 33.86 | 6.29 | 0.64 |


In [6]:
df.isnull().sum().sum()

0

In [7]:
df.duplicated().sum()

109

In [8]:
# Remove the duplicated rows from the DataFrame
df.drop_duplicates(inplace=True)

In `df.drop_duplicates(inplace=True)`, the argument `inplace=True` applies the operation directly to the original DataFrame (`df`) instead of returning a modified copy.

Effect:
1. With `inplace=True`: the duplicate rows are removed from `df` itself, and the call returns `None`.
1. Without `inplace=True`: a new DataFrame with duplicates removed is returned, and `df` stays unchanged unless it is explicitly reassigned (e.g., `df = df.drop_duplicates()`).
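
The difference can be seen on a tiny made-up frame (the `toy` DataFrame and its values are hypothetical, chosen only to contain one duplicate row):

```python
import pandas as pd

# Hypothetical three-row frame; rows 0 and 1 are identical.
toy = pd.DataFrame({"species": ["a", "a", "b"], "length": [1.0, 1.0, 2.0]})

# Without inplace: a new frame comes back and `toy` still has all three rows.
deduped = toy.drop_duplicates()
rows_before = len(toy)

# With inplace: `toy` itself is modified and the call returns None.
result = toy.drop_duplicates(inplace=True)

print("rows in copy:", len(deduped))
print("rows in original before inplace call:", rows_before)
print("return value of inplace call:", result)
print("index after dropping:", list(toy.index))
```

Note that the surviving rows keep their original index labels, so the index has gaps after dropping; `reset_index(drop=True)` renumbers it if a contiguous index is needed.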

In [9]:
df.shape

(3971, 4)

In [10]:
X = df.drop('species', axis=1)
y = df.species
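
The imports at the top include `MinMaxScaler` and `LabelEncoder`, but this section does not show how they are applied. The sketch below is one plausible way to use them on a feature/label split like `X` and `y`; the `toy` frame, its species names, and its values are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical stand-in for the fish data, with made-up values.
toy = pd.DataFrame({
    "species": ["species_a", "species_a", "species_b", "species_b"],
    "length": [10.0, 8.0, 20.0, 30.0],
    "weight": [3.0, 3.5, 5.0, 6.0],
})
X_toy = toy.drop("species", axis=1)
y_toy = toy["species"]

# Rescale each feature to [0, 1] so that no single column dominates
# the distance computations that clustering relies on.
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_toy), columns=X_toy.columns, index=X_toy.index
)

# Map species names to integers, only for comparing against cluster ids later.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_toy)

print(X_scaled)
print("classes:", list(encoder.classes_))
print("encoded labels:", list(y_encoded))
```

Scaling matters for distance-based methods because features measured on larger numeric ranges would otherwise carry more weight. The encoded labels are kept aside and never passed to the clustering step.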

# Credit:


https://www.kaggle.com/code/annastasy/fish-clustering-diverse-techniques