In [2]:
"""

Data	X	Y
x1	1	1
x2	6	6
x3	7	7
x4	3	2
x5	6	8
x6	2	1
x7	2	3
Figure 1: Dataset

Review the table labeled Figure 1: Dataset, which has seven samples and
two attributes (X and Y) per sample. You decided to cluster the table
data into two clusters using K-means clustering. Initially, point x1 is
chosen as Centroid1 and point x2 is chosen as Centroid2.
Using the Euclidean distance measure, what will be the X and Y values of the
centroids after the first iteration of the K-means clustering algorithm?

Explanation:
Initial Data Points: The data is stored in a dictionary where each point is labeled (e.g., x1, x2).
Centroids: We initialize the centroids using points x1 and x2.
Euclidean Distance Function: A function that calculates the Euclidean distance between two points.
Clustering: Each data point is compared to the centroids, and it is assigned to the cluster of the closest centroid.
Recompute Centroids: The new centroids are computed by taking the mean of the points assigned to each cluster.
Result: The updated centroids after the first iteration are printed.
You can run this code to find the new centroids after the first iteration of K-means clustering.
"""
import numpy as np

# Initial data points (X, Y)
data = {
    'x1': (1, 1),
    'x2': (6, 6),
    'x3': (7, 7),
    'x4': (3, 2),
    'x5': (6, 8),
    'x6': (2, 1),
    'x7': (2, 3)
}

# Initial centroids
centroid1 = np.array([1, 1])  # Point x1
centroid2 = np.array([6, 6])  # Point x2

# Function to calculate Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Assign each data point to the closest centroid
cluster1 = []
cluster2 = []

for key, value in data.items():
    point = np.array(value)
    dist_to_c1 = euclidean_distance(point, centroid1)
    dist_to_c2 = euclidean_distance(point, centroid2)

    # Assign the point to the closest centroid
    if dist_to_c1 < dist_to_c2:
        cluster1.append(point)
    else:
        cluster2.append(point)

# Recompute the centroids by taking the mean of points in each cluster
new_centroid1 = np.mean(cluster1, axis=0)
new_centroid2 = np.mean(cluster2, axis=0)

# Output the final centroids
print(f"New Centroid1: {new_centroid1}")
print(f"New Centroid2: {new_centroid2}")



New Centroid1: [2.   1.75]
New Centroid2: [6.33333333 7.        ]
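
To see why each point lands in its cluster, the distances behind the assignment step can be tabulated. The sketch below is an illustration, not part of the original cell: the names `coords`, `distances`, and `nearest` are introduced here. Note one assumed difference: `argmin` would send an exact tie to Centroid1, whereas the cell above sends ties to Centroid2 (no ties occur in this dataset).

```python
import numpy as np

# Same seven points as Figure 1
points = {
    'x1': (1, 1), 'x2': (6, 6), 'x3': (7, 7), 'x4': (3, 2),
    'x5': (6, 8), 'x6': (2, 1), 'x7': (2, 3),
}
centroids = np.array([(1, 1), (6, 6)], dtype=float)  # x1 and x2

names = list(points)
coords = np.array(list(points.values()), dtype=float)

# distances[i, j] = Euclidean distance from point i to centroid j (via broadcasting)
distances = np.linalg.norm(coords[:, None, :] - centroids[None, :, :], axis=2)
nearest = distances.argmin(axis=1)  # 0 -> Centroid1, 1 -> Centroid2

for name, (d1, d2), label in zip(names, distances, nearest):
    print(f"{name}: d(C1)={d1:.3f}  d(C2)={d2:.3f}  -> Centroid{label + 1}")
```

For example, x4 = (3, 2) is about 2.236 from Centroid1 and exactly 5 from Centroid2, so it joins cluster 1.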


In [4]:
"""

Data	X	Y
x1	1	1
x2	6	6
x3	7	7
x4	3	2
x5	6	8
x6	2	1
x7	2	3
Figure 1: Dataset

Review the table labeled Figure 1: Dataset, which has seven samples
and two attributes (X and Y) per sample. You decided to cluster the
table data into two clusters using K-means clustering. Initially,
point x1 is chosen as Centroid1 and point x3 is chosen as Centroid2.
Using the Euclidean distance measure, what will be the final clusters
and the X and Y values of the centroids after the K-means clustering algorithm terminates?

Initial Setup: We define the dataset in the data dictionary and the initial centroids.
Euclidean Distance Calculation: A helper function calculates the Euclidean distance between two points.
Clustering and Centroid Updates: The algorithm iterates, assigns points to the nearest centroid, and recalculates the centroids until they no longer change.
Convergence Check: The algorithm checks if the centroids remain unchanged between iterations.
Output: Once the centroids stabilize, it prints the final centroids and the points in each cluster.
"""
import numpy as np

# Initial data points (X, Y)
data = {
    'x1': (1, 1),
    'x2': (6, 6),
    'x3': (7, 7),
    'x4': (3, 2),
    'x5': (6, 8),
    'x6': (2, 1),
    'x7': (2, 3)
}

# Function to calculate Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Function to recompute the centroids by averaging the points in the clusters
def recompute_centroid(cluster):
    return np.mean(cluster, axis=0)

# Initial centroids
centroid1 = np.array([1, 1])  # Point x1
centroid2 = np.array([7, 7])  # Point x3

# Perform iterations until the centroids do not change
converged = False
while not converged:
    # Assign points to clusters
    cluster1 = []
    cluster2 = []
    
    for key, value in data.items():
        point = np.array(value)
        dist_to_c1 = euclidean_distance(point, centroid1)
        dist_to_c2 = euclidean_distance(point, centroid2)
        
        # Assign point to the nearest centroid
        if dist_to_c1 < dist_to_c2:
            cluster1.append(point)
        else:
            cluster2.append(point)
    
    # Convert to numpy arrays for easier calculations
    cluster1 = np.array(cluster1)
    cluster2 = np.array(cluster2)
    
    # Recompute the centroids
    new_centroid1 = recompute_centroid(cluster1)
    new_centroid2 = recompute_centroid(cluster2)
    
    # Check for convergence (if centroids do not change)
    if np.array_equal(new_centroid1, centroid1) and np.array_equal(new_centroid2, centroid2):
        converged = True
    else:
        centroid1, centroid2 = new_centroid1, new_centroid2

# Final clusters and centroids after convergence
print("Final Centroid1 (X, Y):", new_centroid1)
print("Final Centroid2 (X, Y):", new_centroid2)
print("Final Cluster 1:", cluster1)
print("Final Cluster 2:", cluster2)



Final Centroid1 (X, Y): [2.   1.75]
Final Centroid2 (X, Y): [6.33333333 7.        ]
Final Cluster 1: [[1 1]
 [3 2]
 [2 1]
 [2 3]]
Final Cluster 2: [[6 6]
 [7 7]
 [6 8]]
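
The loop above stops when the centroids are exactly equal between passes, which works for this small dataset. A slightly more defensive variant is sketched below, under stated assumptions: `kmeans`, `max_iter`, and `tol` are hypothetical names introduced here, the tolerance-based stop replaces the exact comparison, and an empty cluster simply keeps its previous centroid (one of several possible policies, not something the original cell defines).

```python
import numpy as np

def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Iterate assign/update steps until centroids move less than tol."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for iteration in range(1, max_iter + 1):
        # Distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its old centroid (assumption)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids, rtol=0.0, atol=tol):
            return labels, new_centroids, iteration
        centroids = new_centroids
    return labels, centroids, max_iter

data = [(1, 1), (6, 6), (7, 7), (3, 2), (6, 8), (2, 1), (2, 3)]  # x1..x7
labels, final_centroids, n_passes = kmeans(data, [(1, 1), (7, 7)])  # start at x1, x3
print(labels, final_centroids, n_passes)
```

On this dataset the first pass already produces the final clusters, and the second pass only confirms that the centroids no longer move.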


In [5]:
"""

Data	X	Y
x1	1	1
x2	6	6
x3	7	7
x4	3	2
x5	6	8
x6	2	1
x7	2	3
Figure 1: Dataset

Review the table labeled Figure 1: Dataset, which has seven samples
and two attributes (X and Y) per sample. You decided to: (1)
cluster the table data into two clusters using K-means clustering and
(2) evaluate the clustering results using the sum of squared errors (SSE).
Initially, point x1 is chosen as Centroid1 and point x3 is chosen as Centroid2.

Using the Euclidean distance measure, what is the SSE for the
resulting clustering after the K-means clustering algorithm terminates?

Note: Your answer should be in format X.X or X.XX 
(depending on precision required on decimal places).

Explanation:
Clustering: We initialize two centroids at x1 = (1, 1) and x3 = (7, 7).
We iterate through the dataset, assigning each point to the nearest centroid,
then updating the centroids until convergence.
SSE Calculation: Once the clustering is complete, we calculate the sum of squared
Euclidean distances between each point and its assigned centroid.
Final SSE Output: The final SSE is printed with two decimal places of precision.
"""
import numpy as np

# Initial data points (X, Y)
data = {
    'x1': (1, 1),
    'x2': (6, 6),
    'x3': (7, 7),
    'x4': (3, 2),
    'x5': (6, 8),
    'x6': (2, 1),
    'x7': (2, 3)
}

# Function to calculate Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Function to recompute the centroids by averaging the points in the clusters
def recompute_centroid(cluster):
    return np.mean(cluster, axis=0)

# Initial centroids
centroid1 = np.array([1, 1])  # Point x1
centroid2 = np.array([7, 7])  # Point x3

# Perform iterations until the centroids do not change
converged = False
while not converged:
    # Assign points to clusters
    cluster1 = []
    cluster2 = []
    
    for key, value in data.items():
        point = np.array(value)
        dist_to_c1 = euclidean_distance(point, centroid1)
        dist_to_c2 = euclidean_distance(point, centroid2)
        
        # Assign point to the nearest centroid
        if dist_to_c1 < dist_to_c2:
            cluster1.append(point)
        else:
            cluster2.append(point)
    
    # Convert to numpy arrays for easier calculations
    cluster1 = np.array(cluster1)
    cluster2 = np.array(cluster2)
    
    # Recompute the centroids
    new_centroid1 = recompute_centroid(cluster1)
    new_centroid2 = recompute_centroid(cluster2)
    
    # Check for convergence (if centroids do not change)
    if np.array_equal(new_centroid1, centroid1) and np.array_equal(new_centroid2, centroid2):
        converged = True
    else:
        centroid1, centroid2 = new_centroid1, new_centroid2

# Calculate the SSE (Sum of Squared Error)
sse = 0

# Sum of squared distances for points in cluster1 (with Centroid1)
for point in cluster1:
    sse += euclidean_distance(point, centroid1) ** 2

# Sum of squared distances for points in cluster2 (with Centroid2)
for point in cluster2:
    sse += euclidean_distance(point, centroid2) ** 2

# Output the final SSE
print(f"Sum of Squared Error (SSE): {sse:.2f}")


Sum of Squared Error (SSE): 7.42
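
To show where the 7.42 comes from, the SSE can be broken down per cluster. This is a minimal sketch that hard-codes the final clusters printed by the previous cell; `cluster_sse` is a helper name introduced here for illustration. It sums squared coordinate differences directly, which avoids taking a square root and then squaring it again.

```python
import numpy as np

# Final clusters for initial centroids x1 and x3 (taken from the output above)
cluster1 = np.array([(1, 1), (3, 2), (2, 1), (2, 3)], dtype=float)
cluster2 = np.array([(6, 6), (7, 7), (6, 8)], dtype=float)

def cluster_sse(cluster):
    # Sum of squared Euclidean distances from each point to the cluster mean
    centroid = cluster.mean(axis=0)
    return float(np.sum((cluster - centroid) ** 2))

sse1 = cluster_sse(cluster1)
sse2 = cluster_sse(cluster2)
print(f"SSE cluster 1: {sse1:.4f}")
print(f"SSE cluster 2: {sse2:.4f}")
print(f"Total SSE:     {sse1 + sse2:.2f}")
```

Cluster 1 contributes 4.75 and cluster 2 contributes 8/3 (about 2.67), which together round to 7.42.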


In [7]:
"""

Data	X	Y
x1	1	1
x2	6	6
x3	7	7
x4	3	2
x5	6	8
x6	2	1
x7	2	3
Figure 1: Dataset

Review the table labeled Figure 1: Dataset, which has seven samples
and two attributes (X and Y) per sample. You decided to: (1)
cluster the table data into two clusters using K-means clustering and
(2) evaluate the clustering results using the average silhouette score.
Initially, point x1 is chosen as Centroid1 and point x3 is chosen as Centroid2.

Using the Euclidean distance measure, what is the average silhouette
score for the resulting clustering after the K-means clustering algorithm terminates?

Note: Your answer should be in format X.X or X.XX (depending 
on precision required on decimal places).

Perform K-means clustering starting with centroids x1 = (1, 1) and x3 = (7, 7).
Calculate the silhouette score for each point i, which is given by:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
- a(i) is the average distance between the point and the other points in the same cluster.
- b(i) is the average distance between the point and the points in the nearest cluster (other than the one the point is assigned to).
Compute the average silhouette score across all points.
"""
import numpy as np
from sklearn.metrics import silhouette_score

# Initial data points (X, Y)
data = {
    'x1': (1, 1),
    'x2': (6, 6),
    'x3': (7, 7),
    'x4': (3, 2),
    'x5': (6, 8),
    'x6': (2, 1),
    'x7': (2, 3)
}

# Convert data to NumPy array
X = np.array(list(data.values()))

# Function to calculate Euclidean distance
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Function to recompute the centroids by averaging the points in the clusters
def recompute_centroid(cluster):
    return np.mean(cluster, axis=0)

# Initial centroids
centroid1 = np.array([1, 1])  # Point x1
centroid2 = np.array([7, 7])  # Point x3

# Perform iterations until the centroids do not change
converged = False
while not converged:
    # Assign points to clusters
    cluster1 = []
    cluster2 = []
    labels = []  # To keep track of cluster assignments for silhouette score

    for key, value in data.items():
        point = np.array(value)
        dist_to_c1 = euclidean_distance(point, centroid1)
        dist_to_c2 = euclidean_distance(point, centroid2)

        # Assign point to the nearest centroid
        if dist_to_c1 < dist_to_c2:
            cluster1.append(point)
            labels.append(0)  # Cluster 1 label
        else:
            cluster2.append(point)
            labels.append(1)  # Cluster 2 label

    # Convert to numpy arrays for easier calculations
    cluster1 = np.array(cluster1)
    cluster2 = np.array(cluster2)

    # Recompute the centroids
    new_centroid1 = recompute_centroid(cluster1)
    new_centroid2 = recompute_centroid(cluster2)

    # Check for convergence (if centroids do not change)
    if np.array_equal(new_centroid1, centroid1) and np.array_equal(new_centroid2, centroid2):
        converged = True
    else:
        centroid1, centroid2 = new_centroid1, new_centroid2

# Once clusters are formed, compute the silhouette score
labels = np.array(labels)
silhouette_avg = silhouette_score(X, labels, metric='euclidean')

# Output the final average silhouette score
print(f"Average Silhouette Score: {silhouette_avg:.2f}")



Average Silhouette Score: 0.75
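
The cell above delegates the computation to `silhouette_score`. As a cross-check, the a(i) and b(i) definitions from the problem statement can be applied by hand. The sketch below assumes the final labels from the converged clustering (x1, x4, x6, x7 in cluster 0; x2, x3, x5 in cluster 1); `silhouette_values` is a helper name introduced here, and it scores a single-point cluster as 0 by convention.

```python
import numpy as np

X = np.array([(1, 1), (6, 6), (7, 7), (3, 2), (6, 8), (2, 1), (2, 3)], dtype=float)
labels = np.array([0, 1, 1, 0, 1, 0, 0])  # final assignment for x1..x7

def silhouette_values(X, labels):
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a(i) excludes the point itself
        if not same.any():
            continue  # singleton cluster: s(i) stays 0
        a = dist[i, same].mean()
        # b(i): smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

scores = silhouette_values(X, labels)
print(f"Average Silhouette Score: {scores.mean():.2f}")
```

Every point scores well above 0.5 here, which reflects how far apart the two groups are relative to their internal spread.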


In [9]:
"""
Cluster	Tennis	Golf	Baseball	Basketball	Football	Hockey	Total	Entropy	Purity
#1	1	1	0	11	4	676	693		
#2	27	89	333	827	253	33	1562		
#3	326	465	8	105	16	29	949		
Total	354	555	341	943	273	738	3204		
Figure 4: Cluster Data for Various Sports

Review the table labeled Figure 4: Cluster Data for Various Sports.
You are asked to validate the provided clustering data. 
Assume supervised validation will be used since class labels are provided in the table.
What is the entropy measure for each cluster and the total clustering?

Entropy Calculation Overview:
Entropy is a measure of the uncertainty or impurity within a cluster. 
The formula for calculating the entropy of a cluster is:

E = -sum_i p_i * log2(p_i)

where p_i is the proportion of items in class i within the cluster, and the sum runs over the classes in the cluster (a class with p_i = 0 contributes 0).

Step-by-step process:
For each cluster:
- Calculate the proportion of each class (Tennis, Golf, etc.) in the cluster.
- Apply the entropy formula to compute the entropy for that cluster.
For the total clustering:
- Calculate the weighted sum of the individual cluster entropies,
  where the weight is the size of each cluster relative to the total number of points.
"""

import numpy as np

# Cluster data
clusters = {
    "Cluster 1": [1, 1, 0, 11, 4, 676],
    "Cluster 2": [27, 89, 333, 827, 253, 33],
    "Cluster 3": [326, 465, 8, 105, 16, 29]
}

# Total points in each cluster
totals = {
    "Cluster 1": 693,
    "Cluster 2": 1562,
    "Cluster 3": 949
}

# Total points in the dataset
total_points = 3204

# Function to calculate entropy for a cluster
def calculate_entropy(cluster, total):
    entropy = 0
    for value in cluster:
        if value != 0:
            p_i = value / total
            entropy -= p_i * np.log2(p_i)
    return entropy

# Calculate entropies for each cluster
entropies = {}
for cluster_name, cluster_data in clusters.items():
    entropy = calculate_entropy(cluster_data, totals[cluster_name])
    entropies[cluster_name] = entropy
    print(f"{cluster_name} Entropy: {entropy:.4f}")

# Calculate total clustering entropy (weighted entropy)
weighted_entropy = sum(
    (totals[cluster_name] / total_points) * entropies[cluster_name]
    for cluster_name in clusters
)

print(f"Total Clustering Entropy: {weighted_entropy:.4f}")


Cluster 1 Entropy: 0.2000
Cluster 2 Entropy: 1.8407
Cluster 3 Entropy: 1.6964
Total Clustering Entropy: 1.4431
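
The same calculation can be written on the count table as a whole. In this sketch the cluster sizes are derived from the counts rather than typed in separately, which also confirms the Total column of Figure 4. The variable names (`counts`, `safe_p`, and so on) are introduced here for illustration.

```python
import numpy as np

# Rows: clusters #1..#3; columns: Tennis, Golf, Baseball, Basketball, Football, Hockey
counts = np.array([
    [1, 1, 0, 11, 4, 676],
    [27, 89, 333, 827, 253, 33],
    [326, 465, 8, 105, 16, 29],
], dtype=float)

cluster_sizes = counts.sum(axis=1)       # 693, 1562, 949
p = counts / cluster_sizes[:, None]      # class proportions within each cluster

# Replace zeros by 1 inside the log so that 0 * log2(0) is treated as 0
safe_p = np.where(p > 0, p, 1.0)
cluster_entropy = -(p * np.log2(safe_p)).sum(axis=1)

# Weight each cluster's entropy by its share of all samples
weights = cluster_sizes / cluster_sizes.sum()
total_entropy = float(weights @ cluster_entropy)

print(np.round(cluster_entropy, 4), round(total_entropy, 4))
```

Cluster #1 is almost entirely Hockey, so its entropy is close to 0, while clusters #2 and #3 mix several sports and score much higher.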


In [10]:
"""
Cluster	Tennis	Golf	Baseball	Basketball	Football	Hockey	Total	Entropy	Purity
#1	1	1	0	11	4	676	693		
#2	27	89	333	827	253	33	1562		
#3	326	465	8	105	16	29	949		
Total	354	555	341	943	273	738	3204		
Figure 4: Cluster Data for Various Sports

Review the table labeled Figure 4: Cluster Data for Various Sports. 
You are asked to validate the provided clustering data.
Assume supervised validation will be used since class labels are provided in the table.

The purity measure for a cluster is defined as purity(C_i) = (1/n_i) * max_d(n^d_i),
where C_i is cluster i, n_i is the number of samples in cluster i, and
n^d_i is the number of samples belonging to class d in cluster i.
Similarly, purity for the total clustering is defined as
purity = sum_{i=1..m} (n_i/n) * purity(C_i) = (1/n) * sum_{i=1..m} max_d(n^d_i), where
n is the total number of samples,
m is the total number of clusters, and
n^d_i is the number of samples belonging to class d in cluster i.

Given this information, what is the purity measure for each cluster and the total clustering?

"""

# Cluster data
clusters = {
    "Cluster 1": [1, 1, 0, 11, 4, 676],
    "Cluster 2": [27, 89, 333, 827, 253, 33],
    "Cluster 3": [326, 465, 8, 105, 16, 29]
}

# Total points in each cluster
totals = {
    "Cluster 1": 693,
    "Cluster 2": 1562,
    "Cluster 3": 949
}

# Total points in the dataset
total_points = 3204

# Function to calculate purity for a cluster
def calculate_purity(cluster, total):
    dominant_class_count = max(cluster)
    purity = dominant_class_count / total
    return purity

# Calculate purities for each cluster
purities = {}
weighted_sum = 0
for cluster_name, cluster_data in clusters.items():
    purity = calculate_purity(cluster_data, totals[cluster_name])
    purities[cluster_name] = purity
    weighted_sum += totals[cluster_name] * purity
    print(f"{cluster_name} Purity: {purity:.4f}")

# Calculate total clustering purity (weighted purity)
total_purity = weighted_sum / total_points

print(f"Total Clustering Purity: {total_purity:.4f}")


Cluster 1 Purity: 0.9755
Cluster 2 Purity: 0.5294
Cluster 3 Purity: 0.4900
Total Clustering Purity: 0.6142
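
Because each cluster's purity is weighted by its size, the total purity reduces to the sum of the dominant-class counts divided by the number of samples. A short sketch of that shortcut follows; the `sports` list and variable names are introduced here for illustration, with the column order taken from Figure 4.

```python
import numpy as np

sports = ["Tennis", "Golf", "Baseball", "Basketball", "Football", "Hockey"]
counts = np.array([
    [1, 1, 0, 11, 4, 676],
    [27, 89, 333, 827, 253, 33],
    [326, 465, 8, 105, 16, 29],
])

dominant_counts = counts.max(axis=1)                  # largest class count per cluster
dominant_classes = [sports[j] for j in counts.argmax(axis=1)]
cluster_purity = dominant_counts / counts.sum(axis=1)
total_purity = dominant_counts.sum() / counts.sum()   # (676 + 827 + 465) / 3204

for k, (name, purity) in enumerate(zip(dominant_classes, cluster_purity), start=1):
    print(f"Cluster {k}: dominant class {name}, purity {purity:.4f}")
print(f"Total Clustering Purity: {total_purity:.4f}")
```

The dominant classes are Hockey, Basketball, and Golf for clusters #1 to #3 respectively.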
