# Self-Organizing Map Project Submission

### Identity of 4 Subjects

Hussain sample 1: **Predicted → Patient**  
Hussain sample 2: **Predicted → Patient**  
Hussain sample 3: **Predicted → Patient**  
Hussain sample 4: **Predicted → Patient**

---

# Comparison of Kohonen Network with K-Means and k-Nearest Neighbors (k-NN)

The Kohonen Network, or Self-Organizing Map (SOM), is an unsupervised learning method used mainly for clustering and visualizing high-dimensional data by projecting it onto a lower-dimensional (usually 2D) grid. Below are the main differences between SOM and two other popular techniques:

---

**1. Kohonen Network vs. K-Means Clustering**

- **Topology preservation:** SOM tries to preserve the spatial relationships of the input data by organizing similar inputs close to each other on a 2D map. K-Means does not maintain any topological structure; it just assigns data points to the nearest cluster center.

- **Learning mechanism:** SOM uses a competitive learning approach where not only the winning neuron is updated, but also its neighbors based on a neighborhood function. K-Means updates only the cluster centers based on the average of assigned points.

- **Output structure:** SOM produces a structured grid of neurons which can be visualized and interpreted. K-Means gives flat clusters without any spatial layout.

> I read about the differences between SOM and K-Means in the following resources:
> - [K-Means and SOM](https://dzone.com/articles/k-means-and-som-gentle-introduction-to-worlds-most#:~:text=K%2Dmeans%20and%20Kohonen%20SOM,%2C%20they're%20remarkably%20similar.)
> - A research paper titled [*"Comparative analysis of SOM neural network with K-means clustering algorithm"*](https://ieeexplore.ieee.org/document/5492838), which explains the differences in a very structured and insightful manner.
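
To make the difference in learning mechanism concrete, here is a minimal sketch contrasting one batch K-Means iteration with one online SOM update. The function names `kmeans_step` and `som_step`, the 1D chain layout of neurons, and the Gaussian neighborhood with width `sigma` are assumptions chosen for illustration; they are not taken from the resources above.

```python
import numpy as np

def kmeans_step(data, centers):
    """One batch K-Means iteration: assign points, then move each center to its mean."""
    # Squared distances between every point and every center: shape (n_samples, n_centers)
    d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = data[assign == k]
        if len(members):  # an empty cluster keeps its old center
            new_centers[k] = members.mean(axis=0)
    return new_centers

def som_step(x, weights, learning_rate=0.5, sigma=1.0):
    """One online SOM update on a 1D chain of neurons (hypothetical layout)."""
    winner = ((weights - x) ** 2).sum(axis=1).argmin()
    grid_dist = np.arange(len(weights)) - winner       # distance along the chain
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))   # Gaussian neighborhood, 1 at the winner
    return weights + learning_rate * h[:, None] * (x - weights)

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
centers = np.array([[0.0, 1.0], [5.0, 5.0], [9.0, 9.0]])

new_centers = kmeans_step(data, centers)   # center 1 has no members and stays at [5, 5]
new_weights = som_step(data[2], centers)   # neuron 1 is pulled toward [10, 10] as a neighbor
```

In the K-Means step the middle center does not move because no point is assigned to it, while in the SOM step the middle neuron moves toward the input simply because it sits next to the winner on the chain.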

---

**2. Kohonen Network vs. k-Nearest Neighbors (k-NN)**

- **Supervision:** SOM is an unsupervised method: it organizes the data during training without using labels. In contrast, k-NN is a **supervised** classification algorithm; it requires labeled data and makes predictions by looking at the closest labeled examples.

- **Usage purpose:** SOM is mainly used for discovering patterns, visualizing data, and clustering. k-NN is used for classifying new data points based on their similarity to existing labeled data.

- **Model structure:** SOM builds a model (a map with weights) through training. k-NN does not build a model in advance; it stores the training data and makes decisions only when a new sample is presented (lazy learning).

> I read about this difference in the research paper *"Comparison Between K-Nearest Neighbors, Self-Organizing Maps and Optimum-Path Forest in the Recognition of Packages using Image Analysis by Zernike Moments"* ([IEEE link](https://ieeexplore.ieee.org/document/7059477)).  
> This paper explains in detail how k-NN uses different distance measures and also compares its performance with SOM using image analysis.
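
The "lazy learning" behavior of k-NN can be illustrated with a minimal sketch: there is no training function at all, only a prediction function that scans the stored labeled samples. The helper name `knn_predict`, the choice of `k=3`, the squared Euclidean distance, and the toy data are assumptions for illustration, not the setup used in the paper.

```python
import numpy as np

def knn_predict(x, train_data, train_labels, k=3):
    """Classify x by majority vote among its k nearest labeled samples."""
    distances = np.sum((train_data - x) ** 2, axis=1)  # squared Euclidean distance
    nearest = np.argsort(distances)[:k]                # indices of the k closest samples
    votes = np.bincount(train_labels[nearest])         # count labels among the neighbors
    return int(np.argmax(votes))

# Toy labeled data: class 0 near the origin, class 1 near (5, 5)
train_data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(np.array([0.05, 0.1]), train_data, train_labels)  # near class 0
pred_b = knn_predict(np.array([5.0, 5.1]), train_data, train_labels)   # near class 1
```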

---
In short, SOM is useful when we want to understand the structure of our data, while K-Means is better for quick clustering, and k-NN is more suitable for classifying new data using known examples.

---

## The following steps outline how I implemented, trained, and tested the Kohonen Self-Organizing Map (SOM) network:

**Step 1:** 

First, I loaded the control (healthy) and patient data from files, combined them, and assigned the labels 0 and 1 respectively.

In [1]:

import numpy as np
# --------- Load Data ---------
def load_data(healthy_path, patient_path):
    healthy_data = np.loadtxt(healthy_path)
    patient_data = np.loadtxt(patient_path)

    # Labels: 0 = healthy, 1 = patient
    data = np.vstack((healthy_data, patient_data))
    labels = np.array([0] * len(healthy_data) + [1] * len(patient_data))


    return data, labels
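
As a quick check of what `load_data` expects, the sketch below feeds it two small in-memory "files" instead of real paths. It assumes each file holds one sample per row with whitespace-separated numeric features; the toy values are made up for illustration and are not from `control.txt` or `patient.txt`.

```python
import io
import numpy as np

def load_data(healthy_path, patient_path):
    healthy_data = np.loadtxt(healthy_path)
    patient_data = np.loadtxt(patient_path)
    # Labels: 0 = healthy, 1 = patient
    data = np.vstack((healthy_data, patient_data))
    labels = np.array([0] * len(healthy_data) + [1] * len(patient_data))
    return data, labels

# np.loadtxt also accepts file-like objects, so StringIO stands in for real files here
healthy_file = io.StringIO("0.1 0.2 0.3\n0.2 0.1 0.4\n")
patient_file = io.StringIO("0.9 0.8 0.7\n0.8 0.9 0.6\n")

data, labels = load_data(healthy_file, patient_file)
print(data.shape)   # (4, 3)
print(labels)       # [0 0 1 1]
```

One caveat: `np.loadtxt` returns a 1D array for a file with a single row, in which case `len(...)` would count features rather than samples. Passing `ndmin=2` to `np.loadtxt` is one way to guard against that.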

**Step 2:**  
I implemented the SOM training by initializing random weights for each neuron and iteratively updating them over multiple epochs. For each input, I calculated the squared Euclidean distance to every neuron, computed as the sum of squared differences between the neuron's weight vector $w$ and the input vector $x$ over all $n$ features:  

$$
d = \sum_{i=1}^n (w_i - x_i)^2
$$

The neuron with the smallest distance is the winner. I then adjusted the winner's weights to move slightly closer to the input. This process was repeated for all samples and epochs to allow the neurons to self-organize based on the data.
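
A small worked example may help make the distance and update step concrete. The two-feature input, the two neuron weight vectors, and the learning rate of 0.1 below are toy values chosen for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0])                    # one input sample
weights = np.array([[0.0, 0.0],             # neuron 0
                    [2.0, 2.0]])            # neuron 1

# Squared Euclidean distance to each neuron:
# neuron 0: (0-1)^2 + (0-2)^2 = 5,  neuron 1: (2-1)^2 + (2-2)^2 = 1
distances = np.sum((weights - x) ** 2, axis=1)
winner = int(np.argmin(distances))          # neuron 1 wins

# Move the winner 10% of the way toward the input: [2, 2] + 0.1 * [-1, 0] = [1.9, 2.0]
weights[winner] += 0.1 * (x - weights[winner])

# The winner's distance shrinks by a factor of (1 - 0.1)^2: 1 -> 0.81
new_distance = np.sum((weights[winner] - x) ** 2)
```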


In [2]:
def train_som(data, num_neurons, learning_rate, epochs):
    # Get the number of features in each input vector
    input_size = data.shape[1]

    # Initialize weight vectors for each neuron with random values
    weights = np.random.rand(num_neurons, input_size)

    # Training loop for the specified number of epochs
    for epoch in range(epochs):
        print(f"\n--- Epoch {epoch + 1} ---")

        # Iterate over each input sample
        for idx, x in enumerate(data):
            # Calculate the squared Euclidean distance from the input to each neuron
            distances = np.sum((weights - x) ** 2, axis=1)

            # Print distances to all neurons just for my understanding
            for i, dist in enumerate(distances):
                print(f"Distance to neuron {i}: {dist:.4f}")

            # Find the index of the closest winning neuron
            winner = np.argmin(distances)
            print(f"Winning neuron: {winner}")

            # Update the weight vector of the winning neuron
            # Move it slightly closer to the input vector
            weights[winner] += learning_rate * (x - weights[winner])

            # Print the updated weights after this sample
            print("Updated weights:")
            print(weights)

    # Return the final trained weight vectors for all neurons
    return weights
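
`np.random.rand` draws the initial weights from [0, 1). If the input features live on a much larger scale, a neuron that starts farther from the data may never win and therefore never move. One common alternative, which is not what the function above does, is to initialize each neuron from a randomly chosen training sample. The helper name `init_weights_from_data` and the `seed` parameter are assumptions for this sketch.

```python
import numpy as np

def init_weights_from_data(data, num_neurons, seed=0):
    """Pick num_neurons distinct training samples as the starting weight vectors."""
    rng = np.random.default_rng(seed)   # seeded generator for reproducible runs
    chosen = rng.choice(len(data), size=num_neurons, replace=False)
    return data[chosen].astype(float)   # astype returns a copy, so data is not modified

# Toy data on a large scale (values from 0 to 1400)
toy = np.arange(15, dtype=float).reshape(5, 3) * 100
toy_weights = init_weights_from_data(toy, num_neurons=2)
print(toy_weights.shape)   # (2, 3)
```

Every initial weight vector then lies on the same scale as the data, so each neuron starts as a plausible candidate for at least one sample.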

**Step 3:**  
I assigned labels to each neuron by checking which data points were closest to it using squared Euclidean distance. For each neuron, I collected the true labels of the data points it won and then assigned it the most frequent label using majority voting.


In [3]:
# --------- Assign Labels to Neurons ---------
def label_neurons(weights, data, true_labels):
    assignments = [[] for _ in range(len(weights))]
    for i, x in enumerate(data):
        distances = np.sum((weights - x) ** 2, axis=1)
        winner = np.argmin(distances)
        assignments[winner].append(true_labels[i])

    neuron_labels = []
    for a in assignments:
        if a:
            counts = np.bincount(a)
            neuron_labels.append(np.argmax(counts))
        else:
            neuron_labels.append(-1)
    return neuron_labels
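
The majority vote has two edge cases worth seeing in isolation: a tie, and a neuron that won no samples. The helper `majority_label` below is a hypothetical stand-alone version of the inner loop above, written only to show how `np.bincount` and `np.argmax` behave in those cases.

```python
import numpy as np

def majority_label(assigned_labels):
    """Return the most frequent label, or -1 if the neuron won no samples."""
    if len(assigned_labels) == 0:
        return -1
    counts = np.bincount(assigned_labels)   # counts[k] = number of samples with label k
    return int(np.argmax(counts))           # argmax returns the first maximum on ties

print(majority_label([1, 1, 0]))   # 1
print(majority_label([0, 1]))      # 0  (a tie resolves to the lower label)
print(majority_label([]))          # -1 (neuron never won)
```

Because ties resolve to label 0, a neuron that wins equal numbers of healthy and patient samples is labeled healthy.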


**Step 4:**  
To make predictions, I used the same squared Euclidean distance formula to find the closest (winning) neuron for each input sample, and returned the label previously assigned to that neuron.


In [4]:
# --------- Prediction ---------
def predict(x, weights, neuron_labels):
    distances = np.sum((weights - x) ** 2, axis=1)
    winner = np.argmin(distances)
    return neuron_labels[winner]
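
When there are many samples to classify, the same winner lookup can be done for all of them at once with NumPy broadcasting. This is a sketch of an equivalent batch version; the name `predict_batch` and the toy weights and samples are assumptions for illustration.

```python
import numpy as np

def predict_batch(samples, weights, neuron_labels):
    """Return the label of the winning neuron for every row of samples."""
    samples = np.atleast_2d(samples)
    # Pairwise squared distances: shape (n_samples, n_neurons)
    distances = ((samples[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    winners = distances.argmin(axis=1)
    return np.asarray(neuron_labels)[winners]

weights = np.array([[0.0, 0.0], [1.0, 1.0]])
neuron_labels = [0, 1]
samples = np.array([[0.1, 0.2], [0.9, 0.8], [0.4, 0.4]])

preds = predict_batch(samples, weights, neuron_labels)
print(preds)   # [0 1 0]
```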

**Step 5:**  
In the main section, I first loaded the training data (`control.txt` and `patient.txt`) and trained the SOM using the defined training function. After training, I assigned class labels (Healthy or Patient) to the neurons based on the training data. Then, I loaded my test dataset (`Hussain.txt`) and used the trained weights and labeled neurons to classify each test sample by predicting whether it's Healthy or Patient.

I used a learning rate of **0.1** and **100 epochs** to train the SOM for more stable and refined weight updates. I also tested the network with a **learning rate of 0.5** and **50 epochs**, and it still performed reasonably well: training converged slightly faster, but with less precision than the lower learning rate setting.


In [5]:
# --------- MAIN ---------
healthy_path = "control.txt"   #  Upload Training dataset files
patient_path = "patient.txt"

# Load and label data
data, labels = load_data(healthy_path, patient_path)
print(f"Data loaded. Shape: {data.shape}")

# Train SOM
weights = train_som(data, num_neurons=2, learning_rate=0.1, epochs=100)
print("Training complete.")
print("Final Weights", weights)

# Assign neuron labels
neuron_labels = label_neurons(weights, data, labels)
for i, label in enumerate(neuron_labels):
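    # A label of -1 means the neuron won no training samples, so report it as unassigned
    print(f"Neuron {i} → Class {'Healthy' if label == 0 else 'Patient' if label == 1 else 'Unassigned'}")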

# --------- TEST DATA ---------
test_path = "Hussain.txt"  # Upload my dataset
test_data = np.loadtxt(test_path)

print("\n--- Testing on New Data ---")
for i, x in enumerate(test_data):
    prediction = predict(x, weights, neuron_labels)
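    # Guard against -1, which marks a neuron that won no training samples
    label_str = "Healthy" if prediction == 0 else "Patient" if prediction == 1 else "Unassigned"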
    print(f"Hussain sample {i + 1}: Predicted → {label_str}")

Data loaded. Shape: (20, 650)

--- Epoch 1 ---
Distance to neuron 0: 2749.5930
Distance to neuron 1: 2804.7197
Winning neuron: 0
Updated weights:
[[0.53868197 0.38111622 0.55541007 ... 0.52865301 0.70407351 0.32813916]
 [0.4546257  0.41956452 0.39608519 ... 0.77682283 0.80430997 0.54212875]]
Distance to neuron 0: 2459.7505
Distance to neuron 1: 2944.0030
Winning neuron: 0
Updated weights:
[[0.48481377 0.3430046  0.49986907 ... 0.47578771 0.63366616 0.29532524]
 [0.4546257  0.41956452 0.39608519 ... 0.77682283 0.80430997 0.54212875]]
Distance to neuron 0: 2079.5586
Distance to neuron 1: 2917.2383
Winning neuron: 0
Updated weights:
[[0.43633239 0.30870414 0.44988216 ... 0.42820894 0.57029954 0.26579272]
 [0.4546257  0.41956452 0.39608519 ... 0.77682283 0.80430997 0.54212875]]
Distance to neuron 0: 1793.1850
Distance to neuron 1: 2947.6186
Winning neuron: 0
Updated weights:
[[0.39269915 0.27783372 0.40489394 ... 0.38538804 0.51326959 0.23921345]
 [0.4546257  0.41956452 0.39608519 ... 0.77