Attention:
- Do not edit this file in text editors like Word. Use a plain text editor only. In case of doubt, you can use Spyder as a text editor.
- Do not change the structure of this file. Just fill in your answers in the places provided (After the R#: tag).
- You can add lines in the spaces for your answers but your answers should be brief and straight to the point.
- You can include references to images or html files, such as the reports generated with the clusters. To do this, simply keep this document in the folder with the reports or images and refer to them in the text by file name on an isolated line. For example, the line
test.png
refers to a test.png image file in the same folder as this document.
QUESTIONS:
Q1: Explain how you selected the best attributes for the clustering phase. In particular, indicate the visualization methods used to explore the extracted attributes and any statistical tests used.
R1: From the raw image data matrix, with 563 rows and 2500 columns, we extracted 6 features with each of three methods: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE) and Isometric Mapping (Isomap).
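For reference, a minimal sketch of this extraction step (the variable names are ours; the raw data is assumed to be already loaded into a 563 x 2500 NumPy array):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap

# images: 563 x 2500 array with one flattened cell image per row (assumed loaded elsewhere)
pca_feats = PCA(n_components=6).fit_transform(images)
tsne_feats = TSNE(n_components=6, method='exact').fit_transform(images)   # 'exact' is required for more than 3 components
isomap_feats = Isomap(n_components=6).fit_transform(images)

features_18 = np.hstack([pca_feats, tsne_feats, isomap_feats])            # 563 x 18 attribute matrix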
To select the most useful features for clustering from these 18 extracted attributes, we first checked for correlation between them. Attributes that are highly correlated with each other are redundant, as they carry the same information, and should therefore be removed to mitigate the curse of dimensionality. For this we computed the pairwise Pearson correlations and filtered out the redundant attributes. In this step we plotted a heatmap to better visualize the correlation patterns.
Since some of our data is labelled, we decided to make the most of this and performed independence tests (ANOVA F-tests). Here we removed features that appeared independent from the class labels at the 1% significance level, because these features are probably less helpful in distinguishing the classes. At the end of this step we had 4 selected features left.
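A minimal sketch of these two filtering steps, assuming the 18-column matrix features_18 from above and a vector labels in which 0 marks unlabelled images (these names and the label convention are ours):

import numpy as np
import seaborn as sns
from sklearn.feature_selection import f_classif

# pairwise Pearson correlations between the 18 attributes, shown as a heatmap
corr = np.corrcoef(features_18, rowvar=False)      # 18 x 18 correlation matrix
sns.heatmap(corr, cmap='coolwarm', center=0)

# ANOVA F-tests between each attribute and the class labels (labelled images only)
mask = labels > 0                                   # assumption: label 0 = unlabelled
F, p = f_classif(features_18[mask], labels[mask])
keep = p < 0.01                                     # drop attributes independent of the class at the 1% level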
Finally, before setting these 4 features as our final dataset, we constructed 6 plots to check the pairwise relations between variables, to make sure there were no relations that Pearson correlation had missed (as it measures only linear association). The plots can be seen here, in the order (0,1), (0,2), (0,3), (1,2), (1,3) and (2,3):
features_0_1.jpg
features_0_2.jpg
features_0_3.jpg
features_1_2.jpg
features_1_3.jpg
features_2_3.jpg
We can see that the second feature seems to be related, to some extent, to all the other features and was therefore removed. In conclusion, 3 features were selected, corresponding to the second and third components extracted with PCA and the first component extracted with Isomap.
Q2: After selecting the attributes, did you standardize or normalize the values? Justify your decision.
R2: After selecting the attributes we need to bring them to the same scale before the clustering phase. Min-max normalization is sensitive to outliers (because it relies on the minimum and maximum of each feature), and since the clustering methods we use are based on distances or density, such outliers could bias the results. Standardization is less sensitive, as it uses the mean and standard deviation, so this was our chosen method.
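A minimal sketch of this step, assuming the selected features are in an array called selected (our name):

from sklearn.preprocessing import StandardScaler

# zero mean and unit variance per feature; X is used in all the clustering steps below
X = StandardScaler().fit_transform(selected)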
Q3: Explain how you found the neighborhood radius value (epsilon) for the DBSCAN algorithm by following the procedure described in the article "A density-based algorithm for discovering clusters in large spatial databases with noise".
R3: We fit a k-nearest-neighbours model to our standardized data (using a dummy vector of zeros as labels) to find, for every point, the distances to its k nearest neighbours. Taking the distance to the last neighbour, in our case the fifth, for every point, we plotted the sorted distances. Looking at this plot, we need to pick a point to be considered the boundary between noise and core points. We chose the point where the distances stop declining sharply and form an "elbow". Epsilon is the distance value at this chosen point.
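A minimal sketch of the sorted k-distance plot; here we use NearestNeighbors directly instead of a classifier fitted with dummy labels, which yields the same distances:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is returned as its own nearest neighbour
dists, _ = nn.kneighbors(X)
kdist = np.sort(dists[:, -1])[::-1]               # distance to the 5th neighbour of every point, sorted
plt.plot(kdist)
plt.ylabel("distance to 5th nearest neighbour")
plt.show()                                        # epsilon = distance at the "elbow" of this curve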
Q4: Examining the clusters generated by the DBSCAN algorithm with the value optimized by the method described in the article, do you think the result is adequate to cluster these images? Justify your answer.
R4: The DBSCAN algorithm with the value optimized by the Elbow method does not seem adequate to cluster these images as can be seen by visually inspecting the clusters generated in
cluster_dbscan_elbow.html
The algorithm generates only one cluster and labels some images as noise. There are indeed some images that seem to have been badly segmented and actually show the space between cells rather than cells themselves. Others do not show a full cell but one that has been "cut", showing, for example, only half a cell. But although it would make sense to consider some images as noise, the ones that this DBSCAN run labels as noise do not actually seem to be noise, but rather regular cell images in different cell cycle stages.
Q5: Describe your analysis of the k (for K-Means) and epsilon (for DBSCAN) parameters using the internal and external indicators referred in the assignment page. Include the two plots of the indicator values (indicating the image name of each plot in one line in your answer) as a function of the k and epsilon parameters and explain how you chose the ranges for examining these parameters. Indicate, with justification, what conclusions you can draw from this analysis.
R5:
The range chosen for examining the best k value for K-Means was from k=2 to k=15. We excluded k=1, since putting all images in one cluster would mean the same to the biologists as not clustering at all. More than 12 clusters also seemed unreasonable for our goal of making the biologists' task easier, because looking through that many groups of images in order to classify them into 3 classes (or 4, if we consider the "noisy" images, i.e. segmentation errors) would also add work.
Varying the k parameter, we observed the following regarding the indicator values:
- Silhouette score: varies only slightly (between 0.25 and 0.31).
- Adjusted Rand score: declines from k=2 to k=4, then reaches a local maximum at k=5, another at k=9 and a smaller one at k=11 (between 0.15 and 0.51).
- Rand score: varies only slightly (between 0.66 and 0.75).
- Recall: declines steadily (between 0.22 and 0.89).
- Precision: varies only slightly until k=7, then increases at a faster pace, reaching a local maximum at k=9 and another at k=11 (between 0.55 and 0.74).
- F1: declines in a similar way to the recall, starting out lower but then gradually moving further apart from it (between 0.32 and 0.72).
These variations can be seen in the plot:
Kmeans_indicators.png
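A minimal sketch of the sweep behind this plot (X, labels and mask as in the sketches above; precision, recall and F1 come from the pair counting described in the note further down):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, rand_score, adjusted_rand_score

for k in range(2, 16):
    pred = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    sil = silhouette_score(X, pred)                           # internal index, uses all points
    ri = rand_score(labels[mask], pred[mask])                 # external indices, labelled subset only
    ari = adjusted_rand_score(labels[mask], pred[mask])
    # precision, recall and F1 are computed by pair counting on the labelled subset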
The silhouette score, being an internal index, might seem like the most important indicator to consider, as most of our data is unlabelled. However, our features can hardly create distinguishable clusters, as can be seen in
features.png
In an ideal situation we would use K-Means with data that forms globular clusters and the silhouette score would give good results. However, in this case the metric is not very useful, because regardless of the number of clusters chosen, the measure of how similar a point is to its own cluster (cohesion) compared to other clusters (separation) will on average stay the same. If the number of clusters is small, the formed clusters will not have a large distance between their own points and the other clusters, as each takes in a large number of points. If k is large, the distance between points within a cluster decreases, but so does the distance to the nearest cluster.
The Rand index presents only very small variations, so this metric does not seem to help in choosing the best value for k.
Recall, F1 and the adjusted Rand index all behave in a similar way, in the sense that they generally decline as k increases. It is natural for the recall to decrease when k increases: recall is the ratio between true positives and the sum of true positives and false negatives, and as there are only 3 classes in the labelled data, at k=1 all same-class pairs are true positives; as we increase the number of clusters this count decreases and the number of false negatives keeps growing. Precision, on the other hand, increases when the number of clusters is large, because its denominator is the sum of true positives and false positives and this last count keeps decreasing, as fewer and fewer points are allocated to the same cluster.
Note: positives are pairs of points placed in the same cluster and negatives are pairs of points allocated to different clusters. So true positives are pairs of points that are in the same cluster and belong to the same class, while false positives are pairs of points that the algorithm clustered together but that actually have different labels. Naturally, these values can only be computed for the 81 labelled points in our dataset, so it is important to remember that all the external metrics evaluate only a small subset of the data.
F1 score is the harmonic mean of precision and recall, so it will try to balance these effects.
The adjusted Rand index is a version of the Rand index corrected for chance, using the expected similarity of all pairwise comparisons between clusterings, so that it is close to 0 for random labellings, independently of the number of clusters and samples, and exactly 1 when the clusterings are identical. It therefore seems like a better indicator than the plain Rand index.
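A minimal sketch of how these pair-counting metrics can be computed on the labelled subset (our own helper, not a scikit-learn function):

from itertools import combinations

def pair_metrics(true, pred):
    # counts over all pairs of labelled points
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true)), 2):
        same_class = true[i] == true[j]
        same_cluster = pred[i] == pred[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    rand = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, rand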
Summarising, we cannot draw conclusions from the silhouette score or the Rand index, as they barely vary when k changes; the F1 score should be considered, as it balances precision and recall; and the adjusted Rand index also seems like an indicator to take into consideration.
Taking this into account, the best values for k seem to be k=2, 3, 4, 5. k=9 also seems interesting and should be looked at more closely, as it is a local maximum in both indicators and shows a high precision.
For examining the epsilon parameter for DBSCAN, we varied it from 0.1 to 1.1 in steps of 0.05 (a total of 21 different parameter values), because these values lie around the parameter obtained when optimizing with the elbow method, as described in the article.
The silhouette score will not be considered to optimize DBSCAN, as this algorithm focuses on density between points for clustering and not distance, while the silhouette score evaluates only distances.
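A minimal sketch of this sweep (same X, labels, mask and pair_metrics helper as above; min_samples=5 is our assumption, matching the fifth neighbour used in question 3):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import rand_score, adjusted_rand_score

for eps in np.arange(0.1, 1.1 + 1e-9, 0.05):      # 21 values from 0.1 to 1.1
    pred = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    ri = rand_score(labels[mask], pred[mask])
    ari = adjusted_rand_score(labels[mask], pred[mask])
    prec, rec, f1, _ = pair_metrics(labels[mask], pred[mask])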
Observing the plot
DBSCAN_indicators.png
we will again take the F1 score into consideration, as it balances precision and recall, since both of those metrics seem to fail for small and large values of epsilon.
The rand index and adjusted rand index both point in the same direction for different values of epsilon, the only difference being that for the most "central" epsilon values the rand index has a higher score.
Looking at the rand index, adjusted rand index and F1, the best epsilon value seems to be between 0.4 and 0.6.
Q6: Select some of the parameter values tested in question five and examine the corresponding clusters more closely, generating the HTML file with the images. Explain how you selected these parameter values, discuss the different options and propose a recommendation that could help the biologists' task of classifying cells and rejecting segmentation errors.
R6:
Selecting the best parameters for k as discussed in the answer to question 5, we generated the HTML files to examine the corresponding clusters closely:
- k=2: cluster 0 seems to contain almost all the images of cells in phase 1 and a few of the images of cells in phase 2. Cluster 1 contains the rest, which is all images that are not cells (badly segmented images that show the space between cells), most of the images where the cell is not fully shown, some images of cells in phase 2 and all the images of cells in phase 3.
cluster_km_2.html
- k=3: cluster 0 seems to contain almost all images of cells in phase 1 and only very few of phase 2. Cluster 1 contains some images of cells in phase 1 and some in phase 2 (approximately half in each phase). Cluster 2 contains all the images that are not cells, most of the images where the cell is not fully shown, some cells in phase 1, some in phase 2 and all in phase 3.
cluster_km_3.html
- k=4: cluster 0 contains all kinds of images, except for cells in phase 3. Cluster 1 contains all the cells of phase 3, some images that are not cells and a few cells of phase 1. Cluster 2 contains images of cells in phases 1 and 2. Cluster 3 contains only images of cells in phase 1.
cluster_km_4.html
- k=5: cluster 0 contains mostly images that are not cells and some images of cells in all phases. Cluster 1 contains images of cells in phase 1 and some in phase 2. Cluster 2 contains mostly images of cells in phase 3 and some segmentation errors. Clusters 3 and 4 contain only images of cells in phase 1.
cluster_km_5.html
- k=9: cluster 0 contains mostly images of cells in phase 2 and a few in phase 3. Cluster 1 contains mostly images of cells in phase 1 and images that only show cells partially. Cluster 2 contains only images of cells in phase 1. Cluster 3 contains mostly images of cells in phase 3 and a few segmentation errors. Clusters 4, 5 and 8 contain only images of cells in phase 1. Cluster 6 contains images of cells in phases 1 and 2. Cluster 7 contains mostly segmentation errors.
cluster_km_9.html
The best values for epsilon are, as discussed in the answer to question 5, between 0.4 and 0.6. Selecting evenly spaced values in this range, we generated the HTML files to examine the clusters produced by DBSCAN for epsilon values 0.4, 0.45, 0.5, 0.55 and 0.6:
- eps=0.4 (8 clusters): cluster -1 contains some segmentation errors and cells in all phases. Cluster 1 contains only segmentation errors. Clusters 2, 3, 4, 5 and 6 contain images of cells in phase 1. Most of the images are in cluster 0, which contains segmentation errors and cells in all phases.
cluster_eps_0.4.html
- eps=0.45 (8 clusters): cluster -1 contains images of cells in all phases and some segmentation errors. Cluster 1 contains only segmentation errors. Clusters 2, 3 and 4 contain only images of cells in phase 1. Cluster 5 contains one image of a cell in phase 2 and the rest in phase 1. Cluster 6 contains one segmentation error and cells in phase 1. Most of the images are in cluster 0, which contains segmentation errors and cells in all phases.
cluster_eps_0.45.html
- eps=0.5 (6 clusters): cluster -1 contains images of cells in all phases and a few segmentation errors. Cluster 1 contains only segmentation errors. Cluster 2 contains cells in phases 1 and 2. Cluster 3 contains cells in phase 1. Cluster 4 contains cells in phase 2. Most of the images are again in cluster 0, which contains segmentation errors and cells in all phases.
cluster_eps_0.5.html
- eps=0.55 (3 clusters): cluster -1 contains images of cells in all phases, including most of the images of cells in phase 3. Cluster 1 contains images of cells in phase 3. Most of the images are again in cluster 0, which contains segmentation errors and cells in all phases.
cluster_eps_0.55.html
- eps=0.6 (3 clusters): cluster -1 contains images of cells in all phases, including most of the images of cells in phase 3. Cluster 1 contains images of cells in phase 1. Most of the images are again in cluster 0, which contains segmentation errors and cells in all phases.
cluster_eps_0.6.html
Clustering with DBSCAN always seems to result in one cluster containing most of the images and other clusters with very few images, even if those are well segmented. This is not very useful for the biologists' task of classifying cells and rejecting segmentation errors, because they would always have to go through a large set of unclassified images.
K-Means produces more evenly distributed clusters in terms of the number of images in each. Since K-Means clustering with k=2 aggregates in one cluster mostly cells in phase 1 (and a few in phase 2), leaving the rest in the second cluster, it would save the biologists some time when classifying the cells in phase 1. K-Means with k=5 produces 5 clusters with at most 2 types of cells each; we think this can save more time, as the biologists only need to distinguish between 2 specific options per cluster. Thinking in this direction, we decided that our goal in easing the biologists' task is to provide them with clusters where the large majority of the images are cells in the same phase (or segmentation errors), so that the biologists can take a certain cluster, i.e. a certain group of images, knowing that it contains mostly cells in, for instance, phase 1, and only have to look for the "misclassified" ones. In this way, instead of having to classify 500 images, they only have to classify a small percentage in each group we present. The more groups (clusters) we present, the smaller the percentage of misclassified images (up to a certain point, naturally), so presenting too many groups is also not ideal, because even though only a few images in each group are mistaken, it forces the biologists to check many groups.
Precision increases markedly for k=9, and examining the generated clusters we think this is a good choice to pursue our goal, but there are still some clusters where multiple types of cells appear in noticeable numbers. For this reason we decided to examine the clusters generated with k=11 (as this value of k also produced a local maximum in precision), which can be seen here:
cluster_km_11.html
In conclusion, our recommendation to help the biologists' task of classifying cells and rejecting segmentation errors is to follow our feature selection procedure and run the K-Means clustering algorithm with k=11. For these images this generates 11 clusters, each containing mainly images of one known type of cell, so the biologists only have to look for misclassifications. To further ease this task, the clusters could be aggregated as follows:
Cluster 0, 3 - segmentation errors
Cluster 1, 2, 5, 6, 7, 8 - cells in phase 1
Cluster 9 - cells in phase 2
Cluster 4 - cells in phase 3
Cluster 10 - cells in phase 2 (less clear)
Note: out of curiosity, the reason why images of cells in phase 1 are allocated to different clusters seems to be the different lighting conditions in the microscope photographs.
Q7: Discuss advantages or problems with these two algorithms (K-Means and DBSCAN) for the purpose of helping biologists to organize these images, considering your theoretical knowledge of these algorithms as well as the results you obtained in your work.
R7: The two algorithms find clusters in different ways.
When using the K-Means algorithm, we define a target number k of centroids. Every data point is then allocated to one of the clusters by minimizing the within-cluster sum of squares, that is, every point is allocated to the nearest prototype. The algorithm starts with a group of randomly selected centroids (prototypes), assigns each point to the closest prototype, recomputes each prototype as the mean of all the examples assigned to its cluster, and repeats this until the prototypes change very little between iterations. K-Means is an exclusive, partitional and prototype-based clustering method.
DBSCAN solves the problems of prototype-based clustering, particularly of K-Means, that come from attributing points to clusters based on the similarity to the cluster prototypes. This causes all points to belong to the cluster with the closest prototype, and it fails to consider other types of relations between the points.
In the DBSCAN algorithm, for each point, if the number of points in its neighbourhood is less than a certain value (minPts), this point is presumed to be noise. Otherwise, a cluster is created, with this point as a core point, and all its neighbours are added to the cluster. If any neighbour is a core point belonging to another cluster, the clusters are merged. This algorithm, being density-based, solves the problem of the shape of the clusters that arises with prototype clustering, because cluster membership propagates along the paths of nearby core points and not around prototypes.
The points in our selected features do not seem to have a particular shape, neither globular (which would be best for K-Means) nor with high-density areas (which would be better for DBSCAN), as can be seen in the plot "features.png".
features.png
For this reason, DBSCAN could be good for identifying only segmentation errors, but it would not speed up the classification task much. K-Means is not very good at clustering all cell images according to their exact phase, but if the goal is not to use the clusters directly as classifiers, but rather to ease the task by producing smaller subsets whose points are more similar to each other, K-Means can be a good choice.
Q8: Consider other clustering algorithms embedded in the Scikit-Learn library. Choose one and apply it to this problem, optimizing the parameters you deem appropriate in the way that you find adequate. Justify your choices and discuss whether this option would yield more useful results for biologists.
R8:
Gaussian mixture models (GMM) can be used to cluster unlabelled data in a similar way to K-Means, but with some theoretical advantages. Most importantly, K-Means does not account for variance (the width of the Gaussian), while GMM does, which means that instead of always trying to find circular clusters it can handle more shapes. K-Means can be seen as a limiting case of a GMM with equal, spherical covariances whose variance tends to zero, resulting in hard, spherical clusters.
GMM also performs soft classification instead of hard classification (like K-Means), which means that it provides us with the probabilities, given the data, with which a point belongs to each of the possible clusters.
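A minimal sketch of the clustering step with scikit-learn (X as in the sketches above; 6 components shown as an example):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=6, random_state=0).fit(X)
hard = gmm.predict(X)           # hard assignment, comparable to K-Means labels
soft = gmm.predict_proba(X)     # soft assignment: probability of each component for every image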
We decided to use the same range for the number of components in the model as used for varying k in K-Means, since the algorithm works in a similar way and it makes the results more easily comparable. We plotted the same indicators used before; the results can be seen in the following plot:
GMM_indicators.png
Choosing the number of clusters in the same way as described for the K-Means algorithm, we decided to analyse more closely the clusters obtained with 2, 3, 4, 5, 6, 9 and 11 components. Analysing the clusters, this algorithm seems to be more effective than both DBSCAN and K-Means, providing a good separation of the classes with fewer clusters.
Considering a model with 11 clusters, we see that it behaves well, showing a good quality of class division for most clusters.
cluster_gmm_11.html
We also tried fewer clusters (6) and concluded that those results were very promising as well.
cluster_gmm_6.html
In this case some of the generated clusters are dedicated to images of cells in phase 1 under different lighting, one cluster has images of cells in both phases 2 and 3, and another has segmentation errors and badly captured images.
Using a model with 11 clusters produced promising results, but since we found the 6-cluster solution just as good, with the advantage of presenting a smaller number of clusters to the biologists, we consider this last model the most useful.
Q9: (Optional) Implement the Bisecting K-Means hierarchical clustering algorithm as described in the assignment page and Lecture 19. Examine and discuss the results and their application to the problem of helping the biologists select and classify cell images.
R9:
We examined three bisecting K-Means models with different numbers of iterations: one with 3 iterations, another with 6, and a final one with 11. The resulting clusters can be seen in the following files:
cluster_bisectk_3.html
cluster_bisectk_6.html
cluster_bisectk_11.html
Looking at the first one, we can conclude that this approach is not very different from the other algorithms, as having only 4 clusters means little separation between classes.
We then looked at the results from the 6-iteration model and found that the separation of the classes improved significantly.
Finally we tried an 11-iteration model and the quality kept improving. These clusters were well defined, mostly containing only images corresponding to a single phase of the cell cycle, as well as a dedicated cluster for segmentation errors.
With the objective of helping the biologists classify the cells as efficiently as possible, we decided that the 11-iteration model works best, because it is a good compromise between having too many clusters, which would defeat the purpose of clustering, and having only a few, poorly separated ones. We found it more efficient to hand the biologists 12 well-defined clusters (11 iterations), each containing one or two well-identified cell phases, than to provide only a few clusters, which would leave more classification work to be done by hand.