# Instructions for lab work #4: Clustering & dimensionality reduction

> Student name    - Volodymyr

> Student surname - Donets

> Group           - KU-31

# Description of the work

## Theory on dimension reduction methods

* [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis)
* [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)
* [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection)

* [YouTube Playlist PCA, t-SNE, UMAP how it works](https://www.youtube.com/watch?v=jc1_yPYmspk&list=PLV8yxwGOxvvoJ87mFL27k7XSDq_lF3pD5)
* [PCA core: SVD](https://www.youtube.com/watch?v=nbBvuuNVfco)

### Estimating the quality of dimensionality reduction

* For PCA: use the explained variance ratio, `np.sum(pca.explained_variance_ratio_)`.
* The other methods have no built-in quality measure, but you can estimate quality indirectly by combining them with other ML methods: train a classifier on the reduced data if it is labeled, or evaluate clustering quality on the reduced data.
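Both ideas above can be sketched in a few lines. This is a minimal illustration rather than part of the lab template: it assumes scikit-learn's built-in `load_wine` dataset as a stand-in for your own data, and the values `n_components=2`, `n_neighbors=5`, and `cv=5` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): replace with the data from your previous lab.
X, y = load_wine(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Quality measure built into PCA: share of variance kept by the components.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_scaled)
explained = float(np.sum(pca.explained_variance_ratio_))
print(f"Explained variance kept by 2 components: {explained:.3f}")

# Indirect measure for labeled data: compare classifier accuracy
# on the full feature set and on the reduced one.
knn = KNeighborsClassifier(n_neighbors=5)
acc_full = cross_val_score(knn, X_scaled, y, cv=5).mean()
acc_pca = cross_val_score(knn, X_pca, y, cv=5).mean()
print(f"k-NN accuracy, all features: {acc_full:.3f}")
print(f"k-NN accuracy, 2 PCA components: {acc_pca:.3f}")
```

The same classifier comparison can be applied to a t-SNE or UMAP embedding by substituting it for `X_pca`; a small drop in accuracy suggests the reduced representation keeps most of the class structure.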

## Theory on clustering methods

* [Sklearn clustering methods comparison](https://scikit-learn.org/stable/modules/clustering.html)
* [k-means](https://en.wikipedia.org/wiki/K-means_clustering)
* [Spectral Clustering overview](https://arxiv.org/html/2501.13597v2)
* [Spectral Clustering Kaggle](https://www.kaggle.com/code/vipulgandhi/spectral-clustering-detailed-explanation)
* [Agglomerative (Hierarchical) Clustering](https://medium.com/@prasanth32888/agglomerative-hierarchical-clustering-ahc-e9e7a48cb042)
* [DBSCAN](https://medium.com/@sachinsoni600517/clustering-like-a-pro-a-beginners-guide-to-dbscan-6c8274c362c4)


### Estimating quality of resulting clusters

* WCSS (Within-Cluster Sum of Squares) -- use it only with k-means.
* [Silhouette Score](https://en.wikipedia.org/wiki/Silhouette_(clustering)) -- The silhouette value ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. __A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable", and over 0.25 "weak".__
* [Calinski-Harabasz Score](https://en.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_index) -- a higher Calinski-Harabasz score indicates a model with better-defined clusters.
* [Davies-Bouldin Score](https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index) -- a lower Davies-Bouldin index indicates a model with better separation between the clusters.
* [Clustering quality estimating metrics available in Sklearn](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)
* [Density-based clustering validation](https://en.wikipedia.org/wiki/Density-based_clustering_validation) -- similar to the Silhouette score, but intended for density-based clusterings.
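The metrics listed above can be computed together while scanning the number of clusters. The following is a minimal sketch under stated assumptions: the data comes from `make_blobs` (synthetic, not one of the lab datasets), k-means is the clustering method, and the range `k = 2..7` is arbitrary. WCSS is read from the `inertia_` attribute of the fitted `KMeans` model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data (assumption): four roughly spherical blobs.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    results[k] = {
        "wcss": km.inertia_,                                    # lower is tighter (k-means only)
        "silhouette": silhouette_score(X, labels),              # higher is better, in [-1, 1]
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),      # lower is better
    }

for k, scores in results.items():
    print(
        f"k={k}: WCSS={scores['wcss']:.1f}, "
        f"silhouette={scores['silhouette']:.3f}, "
        f"CH={scores['calinski_harabasz']:.1f}, "
        f"DB={scores['davies_bouldin']:.3f}"
    )

# One simple selection rule: take the k with the highest silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
print("k with the highest silhouette score:", best_k)
```

WCSS generally keeps decreasing as `k` grows, so it is usually read with the elbow heuristic rather than minimized; the other three scores can be compared directly across values of `k` and across clustering methods.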


## Datasets to use (just use the one from the 2nd lab work)

1. [Liver Cirrhosis Stage Classification](https://www.kaggle.com/datasets/aadarshvelu/liver-cirrhosis-stage-classification)
2. [Star Dataset for Stellar Classification](https://www.kaggle.com/datasets/vinesmsuic/star-categorization-giants-and-dwarfs)
3. [Fitness Classification Dataset](https://www.kaggle.com/datasets/muhammedderric/fitness-classification-dataset-synthetic)
4. [ECG Arrhythmia Classification Dataset](https://www.kaggle.com/datasets/sadmansakib7/ecg-arrhythmia-classification-dataset)
5. [Mushroom Classification Enhanced](https://www.kaggle.com/datasets/sakurapuare/mushroom-classification-enhanced)
6. [Dry Bean Dataset Classification](https://www.kaggle.com/datasets/nimapourmoradi/dry-bean-dataset-classification)
7. [Swarm Behaviour Classification](https://www.kaggle.com/datasets/deepcontractor/swarm-behaviour-classification) -- a challenging one: the dataset is 270 MB
8. [Anemia Types Classification](https://www.kaggle.com/datasets/ehababoelnaga/anemia-types-classification)
9. [NASA: Asteroids Classification](https://www.kaggle.com/datasets/shrutimehta/nasa-asteroids-classification)

## Task description

1. Use the dataset from your previous work to simplify the job.
2. Perform dimensionality reduction with PCA, t-SNE & UMAP, and visualize each result, coloring the points by their expected class.
3. Do clustering with selected method and 
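Step 2 of the task could look roughly like the sketch below. It is an illustration under assumptions, not the required solution: `load_wine` stands in for your own dataset, `perplexity=30` is t-SNE's default value, and UMAP is left out because it needs the third-party `umap-learn` package, which the install cell in this notebook does not cover.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): replace with the data from your previous lab.
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 dimensions with each method.
embeddings = {
    "PCA": PCA(n_components=2, random_state=0).fit_transform(X_scaled),
    "t-SNE": TSNE(
        n_components=2, perplexity=30, init="pca", random_state=0
    ).fit_transform(X_scaled),
}

# One scatter plot per method, colored by the expected class.
fig, axes = plt.subplots(1, len(embeddings), figsize=(10, 4))
for ax, (name, emb) in zip(axes, embeddings.items()):
    scatter = ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="viridis", s=15)
    ax.set_title(name)
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
fig.colorbar(scatter, ax=axes, label="expected class")
plt.show()
```

If `umap-learn` is installed, a UMAP embedding can be added to the `embeddings` dictionary in the same way and will get its own panel.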


# Import dependencies

```python
# run if you don't have libs
!pip install plotly
!pip install tqdm
```

```python
import numpy as np
import pandas as pd

from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from tqdm import tqdm
```

# Some useful code

# Example of experiments

## 1. Toy example
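One possible toy example, offered as a sketch rather than the intended content of this section: cluster the two-moons data from `make_moons` (already imported above) with k-means and with DBSCAN, then compare each result with the known labels using the adjusted Rand index from scikit-learn's clustering evaluation metrics. The values `noise=0.05`, `eps=0.2`, and `min_samples=5` are assumptions chosen for this synthetic data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means assumes compact, roughly spherical clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points by density; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

# Adjusted Rand index: 1.0 means perfect agreement with the true labels.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"k-means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI: {ari_dbscan:.3f} ({n_dbscan_clusters} clusters found)")
```

On data like this, a density-based method tends to follow the curved shapes, while k-means splits the plane with a straight boundary. That contrast is the reason to try more than one clustering method on the real dataset.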

## 2. Real data example

## 2.1. Load & prepare data

## 2.2. Do training of your k-NN model

# Own experiments on the selected data

# Conclusions