## PCA for Reduced Dimensionality

### a. Load in the image data matrix (with rows as images and columns as features). Also load in the numeric class labels from the segmentation class file. Using your favorite method (e.g., sklearn's min-max scaler), perform min-max normalization on the data matrix so that each feature is scaled to [0,1] range.

In [1]:
#read in the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

In [2]:
#import data (use np.genfromtxt)
data = np.genfromtxt("/Users/Alexkilledme/Desktop/jpt_demo/segmentation_data/segmentation_data.txt", delimiter=",", dtype="float")
classes = np.genfromtxt("/Users/Alexkilledme/Desktop/jpt_demo/segmentation_data/segmentation_classes.txt", delimiter="\t", dtype=("string", "int"))

In [3]:
#check out data type
print(type(data))
print(type(classes))

<type 'numpy.ndarray'>
<type 'numpy.ndarray'>


In [4]:
# basic descriptive statistics for each feature
pd.DataFrame(data).describe(include="all").T

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 2100.0 | 124.940476 | 72.858637 | 1.0 | 62.0 | 121.0 | 188.25 | 254.0 |
| 1 | 2100.0 | 123.483333 | 57.431428 | 11.0 | 81.0 | 122.0 | 171.25 | 251.0 |
| 2 | 2100.0 | 9.0 | 0.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |
| 3 | 2100.0 | 0.014921 | 0.041024 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 |
| 4 | 2100.0 | 0.00455 | 0.023573 | 0.0 | 0.0 | 0.0 | 0.0 | 0.222222 |
| 5 | 2100.0 | 1.89082 | 2.649453 | 0.0 | 0.722222 | 1.277776 | 2.222221 | 29.222221 |
| 6 | 2100.0 | 5.708299 | 44.989359 | 0.0 | 0.349603 | 0.833333 | 1.807406 | 991.7184 |
| 7 | 2100.0 | 2.406772 | 3.469954 | 0.0 | 0.833332 | 1.444444 | 2.555556 | 44.722225 |
| 8 | 2100.0 | 7.904224 | 53.471074 | -1.589457e-08 | 0.421638 | 0.989744 | 2.251852 | 1386.3292 |
| 9 | 2100.0 | 37.047654 | 38.135291 | 0.0 | 7.472222 | 21.666666 | 53.277778 | 143.44444 |


In [5]:
### min-max normalization 
min_max_scaler = MinMaxScaler()
data_norm = min_max_scaler.fit_transform(data)
data_norm

array([[0.43083004, 0.74166667, 0.        , ..., 0.12371135, 0.50813884,
        0.83184923],
       [0.33596838, 0.73333333, 0.        , ..., 0.12739322, 0.46332908,
        0.83698646],
       [0.88537549, 0.97083333, 0.        , ..., 0.11340205, 0.48014903,
        0.84478233],
       ...,
       [0.50197628, 0.625     , 0.        , ..., 0.07216495, 0.5409177 ,
        0.17591546],
       [0.58893281, 0.6125    , 0.        , ..., 0.08100147, 0.50308645,
        0.18478933],
       [0.48616601, 0.62916667, 0.        , ..., 0.09646539, 0.4799313 ,
        0.17037463]])
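As a cross-check on what the scaler computes, min-max normalization can be sketched directly in NumPy (Python 3 syntax). The helper name `min_max_scale` and the toy matrix are illustrative assumptions, not part of the assignment data. The sketch maps a constant column to 0, which is consistent with the zeros in the third column of the output above (feature 2 has zero standard deviation).

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Avoid division by zero for constant features.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Toy matrix (hypothetical): the middle column is constant.
toy = np.array([[1.0, 9.0, 10.0],
                [3.0, 9.0, 20.0],
                [5.0, 9.0, 40.0]])
scaled = min_max_scale(toy)
print(scaled)
```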

### b. Next, Perform Kmeans clustering (for this problem, use the Kmeans implementation in scikit-learn) on the image data (since there are a total 7 pre-assigned image classes, you should use K = 7 in your clustering). Use Euclidean distance as your distance measure for the clustering. Print the cluster centroids (use some formatting so that they are visually understandable). Compare your 7 clusters to the 7 pre-assigned classes by computing the Completeness and Homogeneity values of the generated clusters.

In [6]:
from sklearn.cluster import KMeans

In [7]:
### Kmeans clustering
#k=7
kmeans = KMeans(n_clusters = 7, max_iter=500, verbose=1) #initialization
kmeans.fit(data_norm)

Initialization complete
Iteration 0, inertia 382.6844284764028
Iteration 1, inertia 357.07840837227593
Iteration 2, inertia 353.0134035647237
Iteration 3, inertia 351.70913931020357
Iteration 4, inertia 351.4564499400411
Iteration 5, inertia 351.38076265455635
Iteration 6, inertia 351.3260231038808
Iteration 7, inertia 351.3012338144358
Iteration 8, inertia 351.28143330721406
Iteration 9, inertia 351.24574591141214
Iteration 10, inertia 351.19022691648536
...
Iteration 5, inertia 372.13216335436584
Iteration 6, inertia 371.9634663769622
Iteration 7, inertia 371.7800054880201
Iteration 8, inertia 371.598313641504
Iteration 9, inertia 371.4911465576723
Iteration 10, inertia 371.43003054013764
Iteration 11, inertia 371.3230325490308
Iteration 12, inertia 371.22222066048585
Iteration 13, inertia 371.11417172038836
Iteration 14, inertia 371.0417503912173
Iteration 15, inertia 371.0112156128373
Iteration 16, inertia 370.99704405533294
...

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=500,
    n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)
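The inertia values in the verbose log are the sum of squared Euclidean distances from each point to its nearest centroid, and they never increase from one iteration to the next within a run. The following is a minimal NumPy sketch of plain Lloyd iterations that shows this behavior. It is not scikit-learn's implementation: it assumes random initialization and a single run (the estimator above uses `init='k-means++'` and `n_init=10`), and the function name `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns labels, centroids, and inertia per iteration."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    inertias = []
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        inertias.append(d2[np.arange(len(X)), labels].sum())
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids, inertias

# Toy data (hypothetical): three Gaussian blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
               for c in ((0, 0), (1, 1), (0, 1))])
labels, centroids, inertias = lloyd_kmeans(X, k=3)
for i, value in enumerate(inertias[:5]):
    print("Iteration %d, inertia %.4f" % (i, value))
```

Because a single random start can settle in a poor local minimum, repeating the run from several starts and keeping the lowest final inertia is the usual remedy, which is what `n_init` controls in the estimator above.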

In [8]:
# predicted cluster labels (not the pre-assigned classes) and cluster centroids
pred_classes = kmeans.labels_
centroids = kmeans.cluster_centers_

In [9]:
# Print the cluster centroids 
print("                           Cluster Centroids")
pd.DataFrame(centroids, 
             columns=("region-centroid-col","region-centroid-row","region-pixel-count","short-line-density-5",
                      "short-line-density-2","vedge-mean","vegde-sd","hedge-mean","hedge-sd",
                      "intensity-mean","rawred-mean","rawblue-mean","rawgreen-mean","exred-mean",
                      "exblue-mean","exgreen-mean","value-mean", "saturatoin-mean", "hue-mean"),
             index=("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6", "Cluster 7")).T

                           Cluster Centroids


| feature | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 |
|---|---|---|---|---|---|---|---|
| region-centroid-col | 0.769063 | 0.535099 | 0.302506 | 0.513994 | 0.748274 | 0.25415 | 0.251212 |
| region-centroid-row | 0.42593 | 0.150167 | 0.530862 | 0.808937 | 0.532041 | 0.459382 | 0.393366 |
| region-pixel-count | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| short-line-density-5 | 0.014024 | 0.027778 | 0.05226 | 0.077441 | 0.039157 | 0.026437 | 0.075397 |
| short-line-density-2 | 0.022654 | 0.001667 | 0.04661 | 0.005051 | 0.037651 | 0.013793 | 0.019345 |
| vedge-mean | 0.039702 | 0.030228 | 0.100817 | 0.054474 | 0.11353 | 0.03679 | 0.078009 |
| vegde-sd | 0.002983 | 0.000543 | 0.00942 | 0.001407 | 0.018922 | 0.002031 | 0.004436 |
| hedge-mean | 0.023116 | 0.026766 | 0.083972 | 0.046335 | 0.107311 | 0.02661 | 0.062256 |
| hedge-sd | 0.002094 | 0.000587 | 0.011043 | 0.001401 | 0.017627 | 0.001651 | 0.005348 |
| intensity-mean | 0.040385 | 0.823246 | 0.400608 | 0.10879 | 0.298573 | 0.025687 | 0.147286 |
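The centroids above are expressed in the normalized [0, 1] feature space. If they need to be read in the original units, min-max scaling can be inverted per feature as `x = x_norm * (max - min) + min`. The sketch below illustrates this on synthetic data; the array names are hypothetical. With the fitted scaler from this notebook, `min_max_scaler.inverse_transform(centroids)` should perform the same mapping.

```python
import numpy as np

# Synthetic "raw" features (hypothetical stand-in for the image data).
rng = np.random.default_rng(0)
raw = rng.uniform(0, 255, size=(50, 3))

# Forward min-max scaling, column by column.
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)
norm = (raw - col_min) / (col_max - col_min)

# A centroid computed in normalized space (mean of the first 10 rows).
centroid_norm = norm[:10].mean(axis=0)

# Invert the scaling to express the centroid in the original units.
centroid_raw = centroid_norm * (col_max - col_min) + col_min
print(centroid_raw)
```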


In [10]:
### computing the Completeness and Homogeneity values of the generated clusters.
from sklearn.metrics import completeness_score, homogeneity_score

In [11]:
pred_classes

array([3, 3, 3, ..., 0, 0, 6], dtype=int32)

In [12]:
# store the results under new names so the imported scoring functions are not overwritten
orig_completeness = completeness_score(classes.T[1], pred_classes)
orig_homogeneity = homogeneity_score(classes.T[1], pred_classes)

In [13]:
print('completeness_score:' + str(orig_completeness))
print('homogeneity_score:' + str(orig_homogeneity))

completeness_score:0.6116744999910889
homogeneity_score:0.609965639314724


In [14]:
### Compare your 7 clusters to the 7 pre-assigned classes

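Both scores are about 0.61 on a 0-to-1 scale (completeness 0.612, homogeneity 0.610). The agreement with the 7 pre-assigned classes is therefore only moderate: a typical cluster still mixes images from more than one class, and a typical class is still spread over more than one cluster.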
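To make the two scores concrete, they can be written in terms of entropies of the class/cluster contingency table: homogeneity is `1 - H(C|K) / H(C)` and completeness is `1 - H(K|C) / H(K)`, where `C` denotes the true classes and `K` the clusters. The following NumPy sketch follows those definitions; the function names and the four-point example are illustrative assumptions, and the sketch is meant as an explanation rather than a replacement for the scikit-learn functions.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (natural log) of a vector of counts."""
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log(p)).sum()

def conditional_entropy(table):
    """H(rows | columns) for a contingency table of counts."""
    n = table.sum()
    col_totals = table.sum(axis=0)
    h = 0.0
    for j in range(table.shape[1]):
        if col_totals[j] > 0:
            h += col_totals[j] / n * entropy(table[:, j])
    return h

def homogeneity_completeness(labels_true, labels_pred):
    # Build the class-by-cluster contingency table.
    classes, ci = np.unique(labels_true, return_inverse=True)
    clusters, ki = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((len(classes), len(clusters)))
    np.add.at(table, (ci, ki), 1)
    h_c = entropy(table.sum(axis=1))  # H(C)
    h_k = entropy(table.sum(axis=0))  # H(K)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(table) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(table.T) / h_k
    return homogeneity, completeness

# Every cluster is pure, but class 1 is split over two clusters.
h, c = homogeneity_completeness([0, 0, 1, 1], [0, 0, 1, 2])
print(h, c)
```

In this small example every cluster contains a single class, so homogeneity is perfect, while splitting class 1 across two clusters lowers completeness.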

### c. Perform PCA on the normalized image data matrix. You may use the linear algebra package in Numpy or the Decomposition module in scikit-learn (the latter is much more efficient). Analyze the principal components to determine the number, r, of PCs needed to capture at least 95% of variance in the data. Then use these r components as features to transform the data into a reduced dimension space.

In [15]:
from sklearn import decomposition
from numpy import linalg

In [16]:
#derive the principal components manually using linear algebra
meanVals = np.mean(data_norm, axis=0)
meanRemoved = data_norm - meanVals
covMat = np.cov(meanRemoved, rowvar=0)
np.set_printoptions(precision=2, suppress=True, linewidth=100)
print(covMat[0:5])

[[ 0.08  0.    0.   -0.   -0.   -0.    0.   -0.    0.    0.    0.    0.01  0.01 -0.01  0.    0.
   0.01 -0.01  0.  ]
 [ 0.    0.06  0.    0.    0.    0.   -0.    0.   -0.   -0.03 -0.03 -0.03 -0.03  0.02 -0.02  0.02
  -0.03  0.    0.04]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.    0.  ]
 [-0.    0.    0.    0.02 -0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.    0.   -0.    0.
  -0.   -0.    0.  ]
 [-0.    0.    0.   -0.    0.01  0.    0.    0.    0.   -0.   -0.   -0.   -0.   -0.    0.   -0.
  -0.    0.   -0.  ]]
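The covariance matrix used here is the sample covariance, `Xc.T @ Xc / (n - 1)` with `Xc` the column-centered data. Since `np.cov` centers its input internally, removing the mean beforehand does not change the result. A small check on synthetic data (array names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 4))          # synthetic data: 100 samples, 4 features

# Manual sample covariance from the centered data.
Xc = X - X.mean(axis=0)
manual_cov = Xc.T @ Xc / (len(X) - 1)

# np.cov with features in columns, on raw and on centered data.
cov_raw = np.cov(X, rowvar=False)
cov_centered = np.cov(Xc, rowvar=False)

print(np.allclose(manual_cov, cov_raw), np.allclose(cov_raw, cov_centered))
```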


In [17]:
# get eigenvalues; eigh is intended for symmetric matrices such as covMat and returns real values
eigVals, eigVects = linalg.eigh(covMat)
eigValInd = np.argsort(eigVals)[::-1]  # indices ordered by decreasing eigenvalue
sortedEigVals = eigVals[eigValInd]
print(sortedEigVals)

[0.48 0.1  0.08 0.04 0.03 0.02 0.01 0.01 0.01 0.01 0.   0.   0.   0.   0.   0.   0.   0.   0.  ]


In [18]:
#get variance percentage
total = sum(sortedEigVals)
varPercentage = sortedEigVals/total*100
print(varPercentage)

[60.71 13.2  10.12  4.54  3.55  1.99  1.89  1.62  1.07  0.71  0.39  0.16  0.05  0.    0.    0.
  0.    0.    0.  ]


In [19]:
# find the smallest number of PCs whose accumulated variance percentage reaches 95%
accuVar = 0
for j, pct in enumerate(varPercentage, 1):
    accuVar += pct
    print("Accumulated Variance Percentage for PC  %d is: %0.4f" % (j, accuVar))

Accumulated Variance Percentage for PC  1 is: 60.7142
Accumulated Variance Percentage for PC  2 is: 73.9112
Accumulated Variance Percentage for PC  3 is: 84.0350
Accumulated Variance Percentage for PC  4 is: 88.5785
Accumulated Variance Percentage for PC  5 is: 92.1259
Accumulated Variance Percentage for PC  6 is: 94.1139
Accumulated Variance Percentage for PC  7 is: 96.0059
Accumulated Variance Percentage for PC  8 is: 97.6213
Accumulated Variance Percentage for PC  9 is: 98.6869
Accumulated Variance Percentage for PC  10 is: 99.3982
Accumulated Variance Percentage for PC  11 is: 99.7904
Accumulated Variance Percentage for PC  12 is: 99.9479
Accumulated Variance Percentage for PC  13 is: 99.9969
Accumulated Variance Percentage for PC  14 is: 100.0000
Accumulated Variance Percentage for PC  15 is: 100.0000
Accumulated Variance Percentage for PC  16 is: 100.0000
Accumulated Variance Percentage for PC  17 is: 100.0000
Accumulated Variance Percentage for PC  18 is: 100.0000
Accumulated Va

The accumulated variance first reaches 95% at PC 7 (96.01%); six components capture only 94.11%. So r = 7 principal components are needed to capture at least 95% of the variance in the data.
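The same threshold search can be done without a loop by combining `np.cumsum` with `np.searchsorted`. The sketch below uses the rounded variance percentages printed in cell 18 as input, so the cumulative values differ slightly from the four-decimal figures above; the variable names are illustrative.

```python
import numpy as np

# Variance percentages as printed in cell 18 (rounded to two decimals).
var_pct = np.array([60.71, 13.2, 10.12, 4.54, 3.55, 1.99, 1.89, 1.62, 1.07,
                    0.71, 0.39, 0.16, 0.05, 0, 0, 0, 0, 0, 0], dtype=float)

cum_pct = np.cumsum(var_pct)

# Index of the first cumulative value >= 95, converted to a component count.
r = int(np.searchsorted(cum_pct, 95.0) + 1)
print(r, cum_pct[r - 1])
```

As an alternative, scikit-learn's `PCA` accepts a fraction such as `n_components=0.95` together with `svd_solver='full'` and then selects the number of components itself.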

In [20]:
# transform the normalized data into a lower-dimensional space (7 components) using the decomposition module
pca = decomposition.PCA(n_components=7)
data_trans = pca.fit_transform(data_norm)

In [21]:
# first 5 rows of the data projected onto the 7 principal components
pd.DataFrame(data_trans, columns=("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7")).head(5)

| row | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
|---|---|---|---|---|---|---|---|
| 0 | -0.689082 | 0.532951 | 0.246098 | -0.197812 | -0.076433 | 0.047804 | -0.047321 |
| 1 | -0.66692 | 0.510675 | 0.337972 | -0.174381 | -0.041178 | 0.056551 | -0.041707 |
| 2 | -0.712027 | 0.770944 | -0.155822 | -0.009299 | -0.166622 | 0.043814 | -0.060695 |
| 3 | -0.732419 | 0.505378 | 0.496928 | -0.056917 | -0.144469 | 0.026348 | -0.097034 |
| 4 | -0.642317 | 0.531329 | 0.300672 | -0.177615 | -0.01824 | 0.054663 | -0.055545 |
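The projection can also be carried out with the linear-algebra route from cells 16 and 17: center the data, take the eigenvectors of the covariance matrix belonging to the r largest eigenvalues, and multiply. The sketch below does this on synthetic data; the function name `pca_project` and the data are assumptions for illustration. Results from scikit-learn's `PCA` may differ in the sign of individual components, since eigenvectors are only defined up to sign.

```python
import numpy as np

def pca_project(X, r):
    """Project X onto its top-r principal components via eigendecomposition."""
    Xc = X - X.mean(axis=0)                      # center each feature
    cov = np.cov(Xc, rowvar=False)               # sample covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eig_vals)[::-1][:r]       # indices of the r largest
    components = eig_vecs[:, order]              # columns are principal axes
    return Xc @ components, eig_vals[order]

# Synthetic data (hypothetical): 6 features with different spreads.
rng = np.random.default_rng(0)
X = rng.random((200, 6)) * np.array([5, 3, 1, 0.5, 0.2, 0.1])

scores, top_vals = pca_project(X, 3)
print(scores.shape)
# The variance of each projected column equals the corresponding eigenvalue.
print(np.allclose(scores.var(axis=0, ddof=1), top_vals))
```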


### d. Perform Kmeans again, but this time on the lower dimensional transformed data. Then, compute the Completeness and Homogeneity values of the new clusters.

In [22]:
from sklearn.metrics import completeness_score, homogeneity_score
kmeans.fit(data_trans)  # refits the same estimator, replacing the results of the earlier fit
new_pred_classes = kmeans.labels_  # labels_ and cluster_centers_ are attributes of the fitted estimator
new_centroids = kmeans.cluster_centers_

Initialization complete
Iteration 0, inertia 322.3858375482797
Iteration 1, inertia 309.53368253161693
Iteration 2, inertia 299.2143571882077
Iteration 3, inertia 293.5355744707541
Iteration 4, inertia 291.2721015713468
Iteration 5, inertia 290.1154725860033
Iteration 6, inertia 288.73243640321147
Iteration 7, inertia 287.9379056349236
Iteration 8, inertia 287.5923946063969
Iteration 9, inertia 287.2263750071613
Iteration 10, inertia 286.7902912192013
...

In [23]:
print("                     Cluster Centroids")
pd.DataFrame(new_centroids, columns=("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7"),
            index=("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6", "Cluster 7")).T

                     Cluster Centroids


| component | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 |
|---|---|---|---|---|---|---|---|
| PC1 | 0.43689 | -0.509214 | 1.414527 | -0.619268 | -0.206229 | 0.178354 | -0.603705 |
| PC2 | -0.104936 | -0.064458 | 0.087223 | 0.64025 | -0.246205 | 0.043932 | -0.355503 |
| PC3 | 0.164951 | -0.336922 | 0.036765 | 0.195829 | 0.152785 | -0.264189 | 0.109197 |
| PC4 | 0.234379 | -0.064677 | -0.173195 | -0.086856 | 0.056522 | 0.184995 | -0.129799 |
| PC5 | -0.046159 | 0.07909 | -0.029922 | -0.06776 | 0.130585 | 0.026488 | -0.130911 |
| PC6 | -0.007512 | 0.006465 | -0.008973 | 0.008866 | -0.00554 | 0.024236 | -0.021603 |
| PC7 | 0.015253 | -0.026165 | -0.021573 | 0.038372 | 0.032849 | 0.003353 | -0.043882 |


In [24]:
n_completeness_score = completeness_score(classes.T[1], new_pred_classes)
n_homogeneity_score = homogeneity_score(classes.T[1], new_pred_classes)

In [25]:
print('completeness_score:' + str(n_completeness_score))
print('homogeneity_score:' + str(n_homogeneity_score))

completeness_score:0.6118121490278482
homogeneity_score:0.6101643468512763


### e. Discuss your observations based on the comparison of the two clustering results.

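Clustering the 7-component PCA data gives a completeness of 0.6118 and a homogeneity of 0.6102, essentially identical to the 0.6117 and 0.6100 obtained with all 19 normalized features. Because `random_state` is not fixed, differences this small are within the run-to-run variation of Kmeans. Reducing the data from 19 to 7 dimensions therefore costs almost nothing in clustering quality, so the PCA-transformed data can be used in place of the full feature set for further analysis at lower computational cost.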