2) For this problem you will use an image segmentation data set for clustering. You will experiment with using PCA as an approach to reduce dimensionality and noise in the data. You will compare the results of clustering the data with and without PCA using the provided image class assignments as the ground truth. The data set is divided into three files. The file "segmentation_data.txt" contains data about images, with each line corresponding to one image. Each image is represented by 19 features (these are the columns in the data and correspond to the feature names in the file "segmentation_names.txt"). The file "segmentation_classes.txt" contains the class labels (the type of image) and a numeric class label for each of the corresponding images in the data file. After clustering the image data, you will use the class labels to measure completeness and homogeneity of the generated clusters. The data set used in this problem is based on the Image Segmentation data set at the UCI Machine Learning Repository.

A) Load in the image data matrix (with rows as images and columns as features). Also load in the numeric class labels from the segmentation class file. Using your favorite method (e.g., sklearn's min-max scaler), perform min-max normalization on the data matrix so that each feature is scaled to [0,1] range.

In [25]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, decomposition
from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score, homogeneity_score

In [2]:
# Load in data matrix
seg_data = pd.read_csv("segmentation_data/segmentation_data.txt", header = None)
seg_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,110.0,189.0,9,0.0,0.0,1.0,0.666667,1.222222,1.186342,12.925926,10.888889,9.222222,18.666668,-6.111111,-11.111111,17.222221,18.666668,0.508139,1.910864
1,86.0,187.0,9,0.0,0.0,1.111111,0.720082,1.444444,0.750309,13.740741,11.666667,10.333334,19.222221,-6.222222,-10.222222,16.444445,19.222221,0.463329,1.941465
2,225.0,244.0,9,0.0,0.0,3.388889,2.195113,3.0,1.520234,12.259259,10.333334,9.333334,17.11111,-5.777778,-8.777778,14.555555,17.11111,0.480149,1.987902
3,47.0,232.0,9,0.0,0.0,1.277778,1.254621,1.0,0.894427,12.703704,11.0,9.0,18.11111,-5.111111,-11.111111,16.222221,18.11111,0.500966,1.875362
4,97.0,186.0,9,0.0,0.0,1.166667,0.691215,1.166667,1.00554,15.592592,13.888889,11.777778,21.11111,-5.111111,-11.444445,16.555555,21.11111,0.442661,1.863654


In [6]:
# Load in class labels
seg_class = pd.read_csv("segmentation_data/segmentation_classes.txt", header = None, sep ='\t', names = ['Name', 'Value'])
seg_class.head()

Unnamed: 0,Name,Value
0,GRASS,0
1,GRASS,0
2,GRASS,0
3,GRASS,0
4,GRASS,0


In [9]:
# performing min max normalization on segmentation data
min_max = preprocessing.MinMaxScaler().fit(seg_data)
seg_data_norm = min_max.transform(seg_data)
seg_data_norm

array([[0.43083004, 0.74166667, 0.        , ..., 0.12371135, 0.50813884,
        0.83184923],
       [0.33596838, 0.73333333, 0.        , ..., 0.12739322, 0.46332908,
        0.83698646],
       [0.88537549, 0.97083333, 0.        , ..., 0.11340205, 0.48014903,
        0.84478233],
       ...,
       [0.50197628, 0.625     , 0.        , ..., 0.07216495, 0.5409177 ,
        0.17591546],
       [0.58893281, 0.6125    , 0.        , ..., 0.08100147, 0.50308645,
        0.18478933],
       [0.48616601, 0.62916667, 0.        , ..., 0.09646539, 0.4799313 ,
        0.17037463]])
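
For reference, min-max normalization rescales each feature column as (x - min) / (max - min). The following is a minimal NumPy sketch of that formula, not the scikit-learn implementation; the function name `min_max_normalize` and the small `demo` array (built from the first three columns of the first three rows shown above) are illustrative assumptions. A constant column has zero range, so the sketch divides by 1 instead, which maps it to 0 and appears consistent with the zeros in the third column of the scaler output above.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to [0, 1] (hypothetical helper, not sklearn's API)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # A constant column has zero range; divide by 1 so it maps to 0 instead of NaN.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# First three columns of the first three rows of seg_data, for illustration only.
demo = np.array([[110.0, 189.0, 9.0],
                 [86.0, 187.0, 9.0],
                 [225.0, 244.0, 9.0]])
demo_norm = min_max_normalize(demo)
print(demo_norm)
```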

B) Next, perform K-means clustering (for this problem, use the KMeans implementation in scikit-learn) on the image data (since there are a total of 7 pre-assigned image classes, you should use K = 7 in your clustering). Use Euclidean distance as your distance measure for the clustering. Print the cluster centroids (use some formatting so that they are visually understandable). Compare your 7 clusters to the 7 pre-assigned classes by computing the Completeness and Homogeneity values of the generated clusters.

In [13]:
kMeans = KMeans(n_clusters=7)
kMeans.fit(seg_data_norm)

KMeans(n_clusters=7)

In [14]:
kMeans.cluster_centers_

array([[5.13993692e-01, 8.08936588e-01, 0.00000000e+00, 7.74410751e-02,
        5.05050505e-03, 5.44737633e-02, 1.40719343e-03, 4.63349822e-02,
        1.40097198e-03, 1.08789943e-01, 9.14029557e-02, 9.24140773e-02,
        1.42676436e-01, 6.79161019e-01, 7.90017879e-02, 8.21286885e-01,
        1.34900800e-01, 4.14491323e-01, 8.92332630e-01],
       [2.53035008e-01, 3.98487103e-01, 0.00000000e+00, 7.44047598e-02,
        1.93452381e-02, 7.45801683e-02, 4.38265680e-03, 6.31432965e-02,
        5.22696653e-03, 1.46783905e-01, 1.36740760e-01, 1.83526632e-01,
        1.17324071e-01, 7.17761643e-01, 3.42580536e-01, 3.56900035e-01,
        1.83837839e-01, 4.13719379e-01, 2.02018161e-01],
       [5.35098814e-01, 1.50166667e-01, 0.00000000e+00, 2.77777769e-02,
        1.66666667e-03, 3.02281387e-02, 5.42887957e-04, 2.67660451e-02,
        5.86661900e-04, 8.23246433e-01, 7.79716377e-01, 8.94170356e-01,
        7.88760696e-01, 2.70665440e-01, 6.66372551e-01, 2.89386481e-01,
        8.94170356e-01

In [16]:
# Load the feature names so the cluster centroids are easier to read
seg_names = pd.read_csv("segmentation_data/segmentation_names.txt", header = None, sep ='\t')
seg_names

Unnamed: 0,0
0,REGION-CENTROID-COL
1,REGION-CENTROID-ROW
2,REGION-PIXEL-COUNT
3,SHORT-LINE-DENSITY-5
4,SHORT-LINE-DENSITY-2
5,VEDGE-MEAN
6,VEDGE-SD
7,HEDGE-MEAN
8,HEDGE-SD
9,INTENSITY-MEAN


In [18]:
# Printing cluster centroids with the feature names as column labels
pd.options.display.float_format='{:,.2f}'.format
# Pass the names as a plain list; passing the whole seg_names DataFrame produces tuple-style labels
centroids = pd.DataFrame(kMeans.cluster_centers_, columns=seg_names[0].tolist())
centroids

Unnamed: 0,REGION-CENTROID-COL,REGION-CENTROID-ROW,REGION-PIXEL-COUNT,SHORT-LINE-DENSITY-5,SHORT-LINE-DENSITY-2,VEDGE-MEAN,VEDGE-SD,HEDGE-MEAN,HEDGE-SD,INTENSITY-MEAN,RAWRED-MEAN,RAWBLUE-MEAN,RAWGREEN-MEAN,EXRED-MEAN,EXBLUE-MEAN,EXGREEN-MEAN,VALUE-MEAN,SATURATION-MEAN,HUE-MEAN
0,0.51,0.81,0.0,0.08,0.01,0.05,0.0,0.05,0.0,0.11,0.09,0.09,0.14,0.68,0.08,0.82,0.13,0.41,0.89
1,0.25,0.4,0.0,0.07,0.02,0.07,0.0,0.06,0.01,0.15,0.14,0.18,0.12,0.72,0.34,0.36,0.18,0.41,0.2
2,0.54,0.15,0.0,0.03,0.0,0.03,0.0,0.03,0.0,0.82,0.78,0.89,0.79,0.27,0.67,0.29,0.89,0.21,0.13
3,0.26,0.46,0.0,0.03,0.01,0.04,0.0,0.03,0.0,0.03,0.02,0.04,0.02,0.77,0.22,0.51,0.04,0.8,0.18
4,0.53,0.32,0.0,0.05,0.03,0.13,0.02,0.09,0.02,0.34,0.31,0.4,0.3,0.55,0.49,0.28,0.4,0.3,0.17
5,0.77,0.42,0.0,0.01,0.02,0.05,0.01,0.03,0.0,0.04,0.04,0.06,0.03,0.77,0.23,0.48,0.06,0.53,0.24
6,0.58,0.72,0.0,0.04,0.05,0.09,0.0,0.1,0.01,0.35,0.32,0.41,0.31,0.55,0.51,0.25,0.41,0.3,0.16


In [23]:
# Calculating the completeness and homogeneity score
c_score = completeness_score(seg_class['Value'], kMeans.labels_)
print("Completeness Score: ", c_score)

h_score = homogeneity_score(seg_class['Value'], kMeans.labels_)
print("Homogeneity Score: ", h_score)

Completeness Score:  0.6849941288430821
Homogeneity Score:  0.6842294495482975
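
To make the two scores concrete: homogeneity is 1 - H(C|K)/H(C) and completeness is 1 - H(K|C)/H(K), where C is the true class, K is the cluster assignment, and H is entropy. The sketch below implements these definitions from scratch with the standard library; the helper names (`entropy`, `conditional_entropy`, `homogeneity`, `completeness`) and the toy labels are illustrative assumptions, and the code is meant only to mirror the definitions, not to replace the scikit-learn functions used above.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """Conditional entropy H(a | b) in nats."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * log(c / b_counts[bv]) for (_, bv), c in joint.items())

def homogeneity(classes, clusters):
    """1.0 when every cluster contains members of a single class."""
    h_c = entropy(classes)
    return 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c

def completeness(classes, clusters):
    """1.0 when all members of a class land in the same cluster."""
    # Completeness is homogeneity with the roles of classes and clusters swapped.
    return homogeneity(clusters, classes)

classes = [0, 0, 1, 1]
over_split = [0, 1, 2, 3]   # every point in its own cluster
one_cluster = [0, 0, 0, 0]  # everything lumped together

print(homogeneity(classes, over_split), completeness(classes, over_split))    # 1.0 and 0.5
print(homogeneity(classes, one_cluster), completeness(classes, one_cluster))  # 0.0 and 1.0
```

Over-splitting keeps clusters pure but scatters each class, while a single cluster keeps classes together but mixes them. Scores near 0.68 on both measures, as reported above, sit between these extremes.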


C) Perform PCA on the normalized image data matrix. You may use the linear algebra package in Numpy or the Decomposition module in scikit-learn (the latter is much more efficient). Analyze the principal components to determine the number, r, of PCs needed to capture at least 95% of variance in the data. Then use these r components as features to transform the data into a reduced dimension space. 

In [47]:
# Convert seg_data to an array for PCA
# Note: this is the raw data, not seg_data_norm, so the outputs below reflect un-normalized features
seg_data_array = np.array(seg_data)
pca = decomposition.PCA()
xtrans = pca.fit_transform(seg_data_array)

In [48]:
np.set_printoptions(precision=2, suppress=True)
print(xtrans)

[[-84.65  -4.43  -8.88 ...  -0.     0.     0.  ]
 [-84.35 -28.54  -7.58 ...   0.    -0.    -0.  ]
 [-96.16 115.47 -17.9  ...   0.     0.     0.  ]
 ...
 [-78.74  10.28  -9.03 ...  -0.     0.     0.  ]
 [-72.94  31.66  -9.31 ...   0.     0.    -0.  ]
 [-72.7    6.09  -7.91 ...   0.     0.     0.  ]]


In [46]:
# Get explained variance ratio
ex_var = pca.explained_variance_ratio_
ex_var

array([0.42, 0.24, 0.19, 0.1 , 0.03, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

In [45]:
# Cumulative percentage of variance explained by the first r components
r = 0
var = 0
for i in ex_var:
    var += i * 100
    r += 1
    print("r: ", r, " variance: ", var)

r:  1  variance:  41.71365040397344
r:  2  variance:  65.93035512841202
r:  3  variance:  84.84069640464257
r:  4  variance:  95.24457467175682
r:  5  variance:  98.64810409575803
r:  6  variance:  99.68901304683942
r:  7  variance:  99.94702358045159
r:  8  variance:  99.97900841499286
r:  9  variance:  99.99243036305238
r:  10  variance:  99.99871814836337
r:  11  variance:  99.99989558026796
r:  12  variance:  99.999990276817
r:  13  variance:  99.99999773291776
r:  14  variance:  99.99999999999996
r:  15  variance:  99.99999999999997
r:  16  variance:  99.99999999999999
r:  17  variance:  100.0
r:  18  variance:  100.0
r:  19  variance:  100.0


Four components are needed to capture at least 95% of the variance in the data: the first three explain only 84.84%, while the first four explain 95.24%.
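
The same threshold search can be written without an explicit loop. The sketch below uses `np.cumsum` on the explained variance ratios; the `ratios` array holds the first six ratios rounded from the cumulative output above, and the variable names are illustrative assumptions.

```python
import numpy as np

# First six explained-variance ratios, rounded from the cumulative output above.
ratios = np.array([0.4171, 0.2422, 0.1891, 0.1040, 0.0340, 0.0104])

cumulative = np.cumsum(ratios)
# argmax returns the index of the first True; add 1 to turn it into a component count.
r = int(np.argmax(cumulative >= 0.95)) + 1
print(r)  # 4
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (for example `0.95`) and then keeps as many components as are needed to reach that fraction of variance.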

In [49]:
# using PCA with 4 components
pca = decomposition.PCA(n_components=4)
seg_data_trans = pca.fit(seg_data_array).transform(seg_data_array)
print(seg_data_trans)

[[-84.65  -4.43  -8.88 -39.01]
 [-84.35 -28.54  -7.58 -39.77]
 [-96.16 115.47 -17.9  -78.9 ]
 ...
 [-78.74  10.28  -9.03  -5.84]
 [-72.94  31.66  -9.31  -2.43]
 [-72.7    6.09  -7.91  -9.9 ]]
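
The problem statement also allows doing PCA with NumPy's linear algebra package. Below is a minimal sketch of that route using the SVD of the mean-centered data; the function name `pca_svd`, the `var_threshold` parameter, and the synthetic `X_demo` data are assumptions made for illustration. Component signs may be flipped relative to scikit-learn's output, which does not affect distances between points.

```python
import numpy as np

def pca_svd(X, var_threshold=0.95):
    """Project X onto the fewest principal components reaching var_threshold (sketch)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s**2 / np.sum(s**2)                   # explained variance ratio per component
    r = int(np.argmax(np.cumsum(ratio) >= var_threshold)) + 1
    return Xc @ Vt[:r].T, ratio, r                # scores in the reduced space

# Synthetic data: 5 observed features driven by 2 latent directions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([10.0, 5.0])
X_demo = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

Z, ratio, r = pca_svd(X_demo)
print(r, Z.shape)
```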


D) Perform K-means again, but this time on the lower dimensional transformed data. Then, compute the Completeness and Homogeneity values of the new clusters.

In [50]:
# Re-fit the existing KMeans object (still n_clusters=7) on the PCA-transformed data
kMeans.fit(seg_data_trans)

KMeans(n_clusters=7)

In [52]:
labels = kMeans.labels_
clust_centers = kMeans.cluster_centers_

In [53]:
pd.DataFrame(clust_centers.T)

Unnamed: 0,0,1,2,3,4,5,6
0,-54.12,-31.0,209.95,53.37,-23.93,162.96,-38.27
1,-65.01,62.71,48.13,50.85,75.68,-71.75,-45.79
2,2.65,-4.39,-16.18,731.15,-8.55,-7.0,-3.55
3,30.49,39.72,4.48,-14.24,-50.98,0.95,-63.04


In [56]:
# Computing completeness and homogeneity score on transformed data
trans_c_score = completeness_score(seg_class['Value'], labels)
print("Transformed Data Completeness Score: ", trans_c_score)
trans_h_score = homogeneity_score(seg_class['Value'], labels)
print("Transformed Data Homogeneity Score: ", trans_h_score)

Transformed Data Completeness Score:  0.5407284614059643
Transformed Data Homogeneity Score:  0.4800114829844271


E) Discuss your observations based on the comparison of the two clustering results.

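Comparing the two clustering results, the PCA-transformed data produced noticeably worse clusters. On the full normalized data, the completeness and homogeneity scores were both about 0.68. On the four principal components, the completeness score dropped to 0.54 and the homogeneity score dropped to 0.48. One caveat applies: the PCA above was fit on the raw seg_data rather than on the min-max normalized seg_data_norm, so features with large numeric ranges dominate both the principal components and the Euclidean distances used by K-means. The drop therefore cannot be attributed to dimensionality reduction alone. For the runs shown here, clustering the normalized data without PCA is the better choice, since it gave higher completeness and homogeneity scores.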
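
Because the PCA cell above was fit on seg_data rather than seg_data_norm, a like-for-like comparison would rerun the same pipeline on both inputs. The sketch below shows one way to structure that; the helper `kmeans_scores`, the fixed `seed`, and the synthetic `make_blobs` data are assumptions for illustration and do not reproduce the segmentation results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, homogeneity_score
from sklearn.preprocessing import MinMaxScaler

def kmeans_scores(X, y_true, k=7, n_components=None, seed=0):
    """Cluster X (optionally after PCA) and return (completeness, homogeneity)."""
    if n_components is not None:
        X = PCA(n_components=n_components).fit_transform(X)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return completeness_score(y_true, labels), homogeneity_score(y_true, labels)

# Synthetic stand-in for the image data: 7 classes, 19 features.
X_demo, y_demo = make_blobs(n_samples=210, centers=7, n_features=19, random_state=0)
X_demo[:, 0] *= 100.0  # inflate one feature so its scale dominates the raw data
X_norm = MinMaxScaler().fit_transform(X_demo)

scores = {
    "normalized, no PCA": kmeans_scores(X_norm, y_demo),
    "normalized + PCA(4)": kmeans_scores(X_norm, y_demo, n_components=4),
    "raw + PCA(4)": kmeans_scores(X_demo, y_demo, n_components=4),
}
for name, (c, h) in scores.items():
    print(f"{name}: completeness={c:.2f}, homogeneity={h:.2f}")
```

On the real data, calling `kmeans_scores(seg_data_norm, seg_class['Value'], n_components=4)` would show how much of the observed drop remains once PCA is applied to the normalized matrix.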