6/13

The cosine similarity metric separates some clusters but not all. For a general data set, it may fail to identify important clusters.

Example: A few of the major groups were found almost immediately, but a specific region is not broken down until over 60 clusters are calculated.

Ideally cluster similarity should be measured by looking at peaks and their positions.

Another consideration, for both runtime and the clusters that are formed, is giving the clustering algorithm an adjacency matrix that represents the connections between adjacent cells. This will change how clusters are formed and may also improve runtime.
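A minimal sketch of the adjacency idea, assuming a rectangular grid of cells and scikit-learn's `AgglomerativeClustering` with its `connectivity` argument. The grid size, the synthetic spectra, and the use of the default Euclidean distance (rather than cosine) are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

n_rows, n_cols, n_channels = 6, 6, 20  # hypothetical grid and spectrum length
rng = np.random.default_rng(0)

# Synthetic spectra: the left and right halves of the grid have different peaks.
spectra = rng.random((n_rows, n_cols, n_channels)) * 0.05
spectra[:, :3, 5] += 1.0
spectra[:, 3:, 12] += 1.0
X = spectra.reshape(n_rows * n_cols, n_channels)

# Sparse adjacency matrix linking each cell to its direct grid neighbors.
connectivity = grid_to_graph(n_rows, n_cols)

# Only clusters that touch on the grid are allowed to merge.
model = AgglomerativeClustering(
    n_clusters=2, connectivity=connectivity, linkage="average"
)
labels = model.fit_predict(X).reshape(n_rows, n_cols)
print(labels)
```

With the connectivity constraint, merges are only evaluated between neighboring clusters, which is where a runtime gain would come from on larger grids.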

![Cosine Clustering](images/clusters-3-10.png)

6/14

Implementing a phase coloring based on relative similarity. The goal is to see major colors like red, green, and blue in the major clusters and then, as clusters split, have the parts retain the colors of the parent cluster. This should result in a plot where the changing colors show the gradients in cosine similarity, and regions that have a single material/spectrum have a single, almost unchanged color.

Initial implementation:
Start with 3 clusters and assign them the colors red, green, and blue. Then proceed to split clusters based on the hierarchical clustering method. Each time a cluster is split, the neighbors in the color wheel (N1, N2) and the two child clusters (C1, C2) are considered.

The children's colors are assigned such that the hue differences along N1 - C1 - C2 - N2 are proportional to the similarity of the pairs. The order of C1 and C2 (N1 - C1 - C2 - N2 vs. N1 - C2 - C1 - N2) is determined by the similarity of C1 and C2 to N1: the child that is more similar to N1 is placed next to it.
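The hue assignment above can be sketched as follows. This is only an illustration: the function names are hypothetical, hues are treated as plain numbers with N1 below N2 (no wrap-around on the color wheel), and the three weights stand for whatever pairwise values the gaps should be proportional to.

```python
def order_children(sim_a_to_n1, sim_b_to_n1):
    """Return the child order; the first entry is placed next to N1."""
    return ("A", "B") if sim_a_to_n1 >= sim_b_to_n1 else ("B", "A")


def assign_child_hues(hue_n1, hue_n2, w_n1_c1, w_c1_c2, w_c2_n2):
    """Place two child hues between the neighbor hues (hue_n1 < hue_n2).

    The gaps N1-C1, C1-C2 and C2-N2 are proportional to the three weights.
    """
    total = w_n1_c1 + w_c1_c2 + w_c2_n2
    span = hue_n2 - hue_n1
    hue_c1 = hue_n1 + span * w_n1_c1 / total
    hue_c2 = hue_c1 + span * w_c1_c2 / total
    return hue_c1, hue_c2


# Example: neighbors at hue 0.0 and 0.6, gap weights 1 : 2 : 3.
first, second = order_children(sim_a_to_n1=0.9, sim_b_to_n1=0.4)
hue_c1, hue_c2 = assign_child_hues(0.0, 0.6, 1.0, 2.0, 3.0)
print(first, hue_c1, second, hue_c2)
```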

6/17

Wafer coloring implemented with the following results. Because the cosine metric is used, the clusters of interest are not segmented. The general region of the cluster is found, but the specific points are not identifiable. Furthermore, there is a clear split in shade through the desired cluster.

![177 Cosine Clusters](images/Cosine_177.png)


The plan is to adjust the coloring to be on a linear scale. Ideally this would mean that red and green are the furthest from each other and the rest are some combination. Also, when a cluster is split, all other colors are shifted based on the similarity of the corresponding clusters. Instead of just considering the two neighbors and scaling the split in the range between them, the entire hue spectrum is considered and all the clusters mapped so far are placed proportionally to the similarity of their neighbors.

6/18

Implementation of colormap with linear color scale resulted in a less informative visual. Because all of the similarity values are rescaled at each split the cluster colors become an even distribution over the color spectrum. Even though the separation is proportional to similarity and similarity is scaled (subtracting min value) the majority of the similarity values are the same so the scale is almost uniform.

![177 Linear Cosine Clustering](images/Cosine_linear_177.png)

Either way, because of the large number of points there is almost no visible gradient anywhere in the image that corresponds to any desired phases. Instead, the visual shows the order in which grid locations are added to clusters in a build-up clustering approach. This does produce a spiral pattern, but it is not really the information we are trying to visualize.

The plan is to transition to using peaks or some form of peak metric to measure similarity.

Peak location will be done via Robert's code. (Open questions: how to merge the code / what files are needed.)

Another task is to write a container for the existing code allowing it to function independently of the system. Plan is to start with Singularity and potentially, if need be, use Docker.

6/19

For completeness L1 and L2 distance visuals have been implemented to see how well these similarity metrics can differentiate between clusters. Results are similar to using the cosine metric with some variations. No clear advantages to either method.

![L1](images/L1_177.png)
![L2](images/L2_177.png)

Code for locating peaks in spectra data needs to be streamlined into the current clustering algorithm and then the effectiveness of the method can be assessed.

A container to run all existing code has been implemented. It automatically installs the latest version of the code (from GitHub) and comes with the TiNiSn_500 and TiNiSn_600 data. Testing the container to make sure it runs on other machines is still necessary.



6/20

Peak-based clustering implemented with significant results. Plotting the number of peaks in each spectrum results in a plot similar to the L1 and L2 similarity plots above. Computing the similarity of two spectra based on the number of peaks not present in the other results in a plot where key regions are identifiable. To determine whether a peak is present in both spectra, a "delta" parameter is used as the maximum allowed peak shift. Within this delta, two peaks are considered to be the same. This value was picked based on the resulting plots, so a method to properly determine the value beforehand is necessary. Furthermore, many clusters that should be whole are split up by this method. Incorporating the peak intensities could potentially improve the algorithm's performance.

![Peak Clustering](images/peak_clust-0.049-31.png)

Note how the Heusler and half-Heusler regions are bounded along one of their sides, while in the other direction the region extends further than necessary.
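The delta-based comparison from this entry can be sketched as below. The function name and the example peak positions are hypothetical. Only `delta = 0.049` is taken from the image file name above, and that reading of the file name is itself an assumption.

```python
import numpy as np


def unmatched_peak_count(peaks_a, peaks_b, delta):
    """Count peaks in either spectrum with no partner within `delta` in the other."""
    a = np.asarray(peaks_a, dtype=float)
    b = np.asarray(peaks_b, dtype=float)
    if a.size == 0 or b.size == 0:
        return int(a.size + b.size)
    gaps = np.abs(a[:, None] - b[None, :])  # all pairwise position differences
    missing_in_b = np.sum(gaps.min(axis=1) > delta)
    missing_in_a = np.sum(gaps.min(axis=0) > delta)
    return int(missing_in_b + missing_in_a)


# Peaks at 1.0 and 1.02 count as the same peak; the others have no partner.
distance = unmatched_peak_count([1.0, 2.0, 3.5], [1.02, 3.0], delta=0.049)
print(distance)
```

A larger count means the two spectra are less alike, so this value can be used directly as a distance in hierarchical clustering.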

6/21

Presentation: https://docs.google.com/presentation/d/1LDtxghLUUXl52NSUOivJ78-jeY5iKSavsgqMwsCn7AY/edit#slide=id.g5c0eef7190_0_88

The new direction is to identify peak clusters (the regions in space that correspond to the peaks) and use these to represent diffraction patterns.

For example, if 7 unique peaks are found across all the diffraction patterns (accounting for peak shifting), the reduced representation may look like (0,0,0.1,3,0,0,5), where 0 means the pattern does not have that peak and a nonzero value gives the presence and intensity of the peak.




6/24

Peak Clustering has been implemented with some results. All the peaks in the data set are mapped to coordinates based on their x,y position in the grid and the position of the peak in the diffraction pattern (p). Each peak is assigned the coordinate (x/100,y/100,p). The x,y coordinates are scaled to prevent clustering in the peak axis instead of the x,y axes.

Peaks are clustered into 50 groups. This is an arbitrary value and needs to be adjusted to match the true number of peaks.

Peak clustering is used to convert each diffraction pattern into a 50-dimensional vector (called a peak vector). These vectors are then used to cluster the diffraction patterns on the wafer. This produces the following clusterings:

![PeakReductionClustering](images/PeakReductionClustering_50.png)

Clustering seems to correspond more accurately to the actual peaks in the data set, although some balancing of peak intensity versus peak presence needs to be made.
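A minimal sketch of the peak-vector construction described in this entry, assuming scikit-learn's `AgglomerativeClustering` for the peak grouping. The peak list is made up, only 2 peak groups are used instead of 50, and storing the largest intensity per group is an assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical peak list, one row per peak: (x, y, peak position p, intensity).
peaks = np.array([
    [0, 0, 2.00, 5.0],
    [0, 0, 3.00, 1.0],
    [1, 0, 2.02, 4.0],
    [1, 0, 3.01, 2.0],
    [0, 1, 2.01, 3.0],
    [1, 1, 3.02, 6.0],
])

# Coordinates (x/100, y/100, p), with x and y scaled down as in the entry above.
coords = np.column_stack([peaks[:, 0] / 100, peaks[:, 1] / 100, peaks[:, 2]])

n_peak_clusters = 2  # 50 in the entry above
peak_labels = AgglomerativeClustering(n_clusters=n_peak_clusters).fit_predict(coords)

# One row per grid location, one column per peak group.
locations = sorted({(int(x), int(y)) for x, y in peaks[:, :2]})
row_of = {loc: i for i, loc in enumerate(locations)}
peak_vectors = np.zeros((len(locations), n_peak_clusters))
for (x, y, _, intensity), label in zip(peaks, peak_labels):
    row = row_of[(int(x), int(y))]
    peak_vectors[row, label] = max(peak_vectors[row, label], intensity)

print(peak_vectors)
```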


![PeakReductionClustering](images/PeakReductionClustering_60.png)

Here, with 60 possible peaks, the clustering is broader (regions are rounder) and seems to align better with the actual clustering. The orange cluster is present here, whereas with 50 peak clusters it was not.

Some analysis of how peaks are clustered suggests that around 58 peak clusters is where the same peak is split into two labels, so 57 peak clusters are used:


![PeakReductionClustering](images/PeakReductionClustering_57.png)

Here we can see the amorphous region, half Heusler, and full Heusler decently separated from the rest of the wafer. However, some individual grid locations are still assigned to these clusters when they shouldn't be. This seems to be a product of the current similarity function.

The similarity between two diffraction patterns is found as follows. The peak vectors are obtained for each. The values in these vectors are log scaled and the L1 distance is taken. Clustering is performed without connectivity.
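The log-scaled L1 distance can be sketched as below. Using `log1p` (that is, log(1 + v)) so that absent peaks with value 0 stay at 0 is an assumption. The entry does not say how zeros are handled.

```python
import numpy as np


def log_l1_distance(vec_a, vec_b):
    """L1 distance between two peak vectors after log scaling (log1p assumed)."""
    a = np.log1p(np.asarray(vec_a, dtype=float))
    b = np.log1p(np.asarray(vec_b, dtype=float))
    return float(np.abs(a - b).sum())


same = log_l1_distance([0, 0, 0.1, 3], [0, 0, 0.1, 3])
one_new_peak = log_l1_distance([0, np.e - 1], [0, 0])
print(same, one_new_peak)
```

Log scaling compresses large intensity differences, so the presence or absence of a peak contributes relatively more than a change in its height.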

A specific example of how clustering goes wrong is visible above, between the blue and purple clusters in the lower region of the wafer. The vertical split should actually be slanted, as seen in the plots below.

![DF_Plot_1](images/DF_Plot_1.png)


Here is an example of where one group of peaks is split into multiple labels. Note the 2 vs 52 labels.

![DF_Plot_2](images/DF_Plot_2.png)

6/25

Peak Clustering analyzed. Seems that the clusters being found are not completely what is desired. 
https://docs.google.com/presentation/d/1LbSoimvhLF_YgP4nXWMb8xxLKqh7yunrSV6v92U5Cg4/edit?usp=sharing

The plan is to normalize peaks before clustering them, then use a single parameter to scale the p-axis to determine the ideal value for labeling peaks. Also, adjust sensitivity in the peak finding code to locate more significant peaks. Finally, plot peak width and peak count gradients along with the clustering to see if there are significant differences that are being missed.

Additional Directions
 - Different peak finding approaches
 - PCA on peak vectors to identify independent clusters
 - Different clustering approaches to count the number of clusters
 - Similarity Metric can be adjusted to incorporate peak width/other parameters
 - Similarity metric can be adjusted to put more weight on new peaks.

6/26

Peak points have been normalized for clustering, and the ideal scale parameter for the p-axis seems to be around 30-100. Larger scales make the different "peak planes" more distinct but risk separating planes that are slanted.

Analysis of clustering methods without n_clusters

DBSCAN (almost perfect)

![DBSCAN_Peak_Clustering](images/DBSCAN_Peak_Clustering.png)


OPTICS

![OPTICS_Peak_Clustering](images/OPTICS_Peak_Clustering.png)

BIRCH

![BIRCH_Peak_Clustering](images/BIRCH_Peak_Clustering.png)

From this analysis it seems that DBSCAN (with good parameters) can achieve similar clustering to Agglomerative (which was used before).
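A minimal sketch of DBSCAN on normalized peak points with a scaled p-axis, assuming scikit-learn's `DBSCAN`. The synthetic points, `eps`, and `min_samples` are illustrative assumptions, and `p_scale = 30` is simply taken from the lower end of the 30-100 range noted above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical normalized peak points on a 5x5 grid: two "peak planes",
# one flat at p = 0.2 and one slightly slanted near p = 0.5.
xs, ys = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
x, y = xs.ravel(), ys.ravel()
plane_a = np.column_stack([x, y, np.full_like(x, 0.2)])
plane_b = np.column_stack([x, y, 0.5 + 0.002 * x])
points = np.vstack([plane_a, plane_b])

p_scale = 30.0  # assumed scale for the p-axis
scaled = points * np.array([1.0, 1.0, p_scale])

# No n_clusters argument: the number of peak groups follows from the density.
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(scaled)
n_peak_clusters = len(set(labels) - {-1})  # -1 marks noise points
print(n_peak_clusters)
```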

Using DBSCAN peak clustering and hierarchical clustering with L1 similarity the following cluster map is produced. Some clusters are reasonable but adjustments to the similarity metric need to be made to place more value on new peaks.

![DBSCAN_Cluster_Plot](images/DBSCAN_Cluster_Plot.png)

6/27


PCA clustering implemented on top of the existing dimensionality reduction with DBSCAN. PCA is used to further reduce the diffraction patterns to 20 components and then agglomerative clustering is used to produce the clusters (below).

![DBSCAN_Peaks_Width](images/DBSCAN_Peaks_Width.png)

PCA does not seem to provide significant improvements to the produced clustering when compared to the same plot without PCA reduction (below). The second plot above shows the number of peaks relative to the maximum in each cluster. This plot gives some indication of the position of the half Heusler, but it also means that the clustering technique is again missing an entire phase. On the right there is a plot of the largest peak width in each cluster. It shows that there is an amorphous region in the top left (white corresponds to NaN values in the data for the peak width).

To understand why these new methods don't seem to be improving the results, we can plot the unique "peak vectors" for each diffraction pattern, i.e., which peaks are present/not present in each diffraction pattern.
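One way to count unique presence/absence patterns is sketched below with NumPy. The example peak vectors are made up, and treating any nonzero value as "present" is an assumption.

```python
import numpy as np

# Hypothetical peak vectors, one row per diffraction pattern.
peak_vectors = np.array([
    [0.0, 0.0, 0.1, 3.0],
    [0.0, 0.0, 0.4, 1.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 5.0],
])

presence = peak_vectors > 0  # which peaks are present, ignoring intensity
unique_patterns, pattern_id = np.unique(presence, axis=0, return_inverse=True)
pattern_id = np.ravel(pattern_id)  # one pattern index per diffraction pattern

n_unique = len(unique_patterns)
print(n_unique, pattern_id)
```

Plotting `pattern_id` over the wafer grid shows which locations share the same set of peaks.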

![unique_peak_vector_plot](images/unique_peak_vector_plot.png)

Here we can see that the majority of the diffraction patterns in the wafer are unique with respect to the peaks being found. This also holds true for different amounts of sensitivity in the peak finding code. This is either because the peak clustering is failing or because the peak finding method is missing/finding extra peaks. The following is an example of how some peaks are not being detected.

![Missing_Peaks](images/Missing_Peaks.png)

Notice that on the red curve, just before 3.0 and 4.25, there are secondary peaks that are merged with larger ones. These peaks are important but are not being detected. Also notice the peak on the green curve near 4.5. This peak is not detected. Adjusting the sensitivity may find this peak, but then noise in the data that is not a peak may also be classified as one.

6/28

Meeting: https://docs.google.com/presentation/d/1W0IAJGbA_Cv5EFLt1ppfoBdD9oUBmTpz7X25ejGXYyI/edit?usp=sharing

Peak finding code needs to be modified to better detect peaks, specifically partially merged peaks as shown above (6/27). Running peak detection on the square root of the diffraction pattern may improve detection.

7/1

Peak count and width penalty scores implemented to show cluster membership. Additionally, the combined penalty over all clusters is plotted as the number of clusters increases.

![PenaltyScorePlot](images/PenaltyScorePlot.png)

The top left plot shows the clustering (every point is its own cluster)
The top middle shows the peak count penalty in the gradient (darker is higher penalty, again all fully bright)
The top right shows the peak width penalty score (solid color as every point is its own cluster)

The bottom plot shows the combined peak count and peak width penalties over time.

The peak width penalty doesn't seem to give much indication of the correct number of clusters. Furthermore, the peak penalty, after 20 clusters, is further improved by each additional cluster. There seems to be an initial bump in the peak penalty, which may be due to "unlucky" cluster division.

Additional plot at 20 clusters:

![PenaltyScorePlot2](images/PenaltyScorePlot2.png)


First approach to improving the peak finding algorithm is to sqrt the peak intensities before running peak detection. This allows smaller peaks to be more visible.

![PenaltyScorePlotSqrt](images/PenaltyScorePlotSqrt.png)

It seems that this change also divides one of the clusters known to be single phase. This means that either this method detects too many peaks (i.e., extra peaks that aren't in the single-phase region) or it has detected only a portion of the peaks that need to be found in the single-phase region, and so splits it.
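The effect of the square root on small peaks can be illustrated with a synthetic pattern. This sketch uses `scipy.signal.find_peaks` as a stand-in, not the peak finding code used in this notebook. The relative prominence threshold and the peak shapes are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

x = np.linspace(0, 6, 601)
sigma = 0.1
# Synthetic pattern: a strong peak at 2.0 and a weak peak at 4.0.
y = (100.0 * np.exp(-((x - 2.0) ** 2) / (2 * sigma**2))
     + 1.0 * np.exp(-((x - 4.0) ** 2) / (2 * sigma**2)))


def detect(signal, rel_prominence=0.05):
    """Keep peaks whose prominence is at least 5% of the tallest point."""
    peaks, _ = find_peaks(signal, prominence=rel_prominence * signal.max())
    return peaks


raw_peaks = detect(y)             # the weak peak falls below the threshold
sqrt_peaks = detect(np.sqrt(y))   # sqrt shrinks the 100:1 ratio to 10:1
print(x[raw_peaks], x[sqrt_peaks])
```

The same compression also raises noise relative to the strongest peak, which is consistent with extra peaks being detected after the change.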


Below is a video version of these plots to demonstrate how dividing clusters affects the penalties.

In [1]:
%%HTML
<video width="320" height="240" controls>
  <source src="images/peak_width_penalty.mp4" type="video/mp4">
</video>

7/2

Evaluation of the video above shows that the single-phase regions persist as more clusters are created. A future direction is to automatically detect these "persistent" regions.

7/3

Peak finding mistakes evaluated and recorded through manual process.

![PeakFindingErrors](images/PeakFindingErrors.png)

Note: some peaks are missed even in manual process.

Plan is to evaluate these errors and see how they can be detected. Peak finding on the log or sqrt of the data may improve visibility of the peaks. Or a different approach can be used to find more peaks.

A criterion can be determined for the peaks that are missed, and this can help identify how the algorithm needs to be adjusted.

Also clustering in the "heat map" of the peaks may help with detection.

![peak_heat_map](images/peak_heat_map.png)

7/5

Peak error plot:

![peak_error_plot](images/peak_error_plot.png)

Black dots are local maxima, green are local minima, and red is the manually labeled mistake.
The plot titles are the vertical distances between the labeled error and the nearest local minimum.

Figure produced with proportion measurements. The black dot is the nearest peak found by the algorithm. The red dot, as before, is a missed peak and the green dot is a local minimum. The title of each plot is the proportion by which the mistake stands out from the local minimum relative to the peak.

Proportion = (Error - min) / (Peak - min)

![peak_error_proportion_plot](images/peak_error_proportion_plot.png)

Notice that the proportions vary significantly and are relatively small. Note that in some shoulder cases the minimum is not between the peak and the error (here the proportion holds less meaning).
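The proportion defined above, written as a small helper. The function name and the example heights are hypothetical.

```python
def peak_proportion(error_height, min_height, peak_height):
    """Proportion = (Error - min) / (Peak - min), using intensity values."""
    if peak_height == min_height:
        raise ValueError("peak and minimum have the same height")
    return (error_height - min_height) / (peak_height - min_height)


# A missed peak at height 3 next to a detected peak at 11, with a minimum at 1.
proportion = peak_proportion(error_height=3.0, min_height=1.0, peak_height=11.0)
print(proportion)
```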

7/9

Work towards density based clustering in heat map. In order for the peak heat map seen below to be "clustered" by existing algorithms it needs to be converted to a collection of points.

To do this, every pixel in the image is filled uniformly with random points. The number of points corresponds to the brightness.

Now a density based clustering method can be used to cluster on the heat map.

![peak_heat_map](images/peak_heat_map.png)
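The conversion from heat map to points can be sketched as follows. The function name, the `points_per_unit` parameter, and the rounding of brightness to an integer count are assumptions for illustration.

```python
import numpy as np


def heat_map_to_points(heat_map, points_per_unit=1.0, seed=0):
    """Fill each pixel with uniformly random points; count follows brightness."""
    rng = np.random.default_rng(seed)
    counts = np.rint(np.asarray(heat_map, dtype=float) * points_per_unit)
    counts = np.clip(counts, 0, None).astype(int)
    rows, cols = np.nonzero(counts)
    reps = counts[rows, cols]
    # Lower-left corner of every point's pixel, as (x, y) = (col, row).
    corners = np.column_stack([np.repeat(cols, reps), np.repeat(rows, reps)])
    # Uniform offset in [0, 1) places each point somewhere inside its pixel.
    return corners.astype(float) + rng.random(corners.shape)


heat = np.array([[0, 2],
                 [3, 0]])
points = heat_map_to_points(heat)
print(points.shape)
```

The resulting array can be passed directly to a density-based method such as DBSCAN.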

The following images are the results of using several clustering methods:

DBSCAN

![heat_map_DBSCAN](images/heat_map_DBSCAN.png)

GMM

![heat_map_GMM](images/heat_map_GMM.png)

Agglomerative (single linkage)

![heat_map_Agg](images/heat_map_Agg.png)

DBSCAN is able to grab major clusters but fails around the edges. A lot of small regions that aren't clusters are found.

GMM is able to make divisions between large clusters, but it dedicates too many points to each cluster. This is an artifact of how the algorithm works.

Agglomerative effectively grabs the clusters but the one on the right is not properly split.



General points:

It may be better to uniformly distribute points in the grid regions. The random nature of the current method may be creating high density spots in low density regions and so creating outlier points seen in DBSCAN. Furthermore, there may be other random events that further make clustering more difficult.

Potentially a custom clustering algorithm could be made to cluster based on the heat map itself. This can avoid the problem of representing the heat map as a collection of points and can provide a more rigorous method of clustering.

7/11

Peak finding code fixed to report curves that are fitted to data. The found peaks are plotted in conjunction with the peaks not found by the earlier method.


![peak_error_proportion_plot_fixed](images/peak_error_proportion_plot_fixed.png)

Black - peaks found by code
Red - peaks missed by earlier algorithm
Green - local min

Notice that many of the peaks that were missed before are now properly found. However, there are now false positive reports of peaks.


Work started on a different peak fitting algorithm. The idea is to use local minima to select the blocks instead of BBA. This is more likely to create regions that don't cut peaks into parts.
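A simplified sketch of the local-minimum blocking idea. It reports the maximum of each block rather than fitting a curve to it, which is an assumption made to keep the example short. The function name is hypothetical.

```python
import numpy as np


def min_block_peaks(y):
    """Split a pattern at its local minima and return the index of each block's maximum."""
    y = np.asarray(y, dtype=float)
    interior = np.arange(1, len(y) - 1)
    is_min = (y[interior] < y[interior - 1]) & (y[interior] <= y[interior + 1])
    edges = np.concatenate([[0], interior[is_min], [len(y) - 1]])
    peaks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        k = int(lo) + int(np.argmax(y[lo:hi + 1]))
        if k not in (lo, hi):  # a maximum on the block edge is not a peak
            peaks.append(k)
    return peaks


two_peaks = min_block_peaks([0, 1, 3, 1, 0.5, 2, 4, 2, 0])
# A shoulder (index 3) with no minimum next to it stays in the same block.
with_shoulder = min_block_peaks([0, 1, 2, 2.2, 5, 2, 0])
print(two_peaks, with_shoulder)
```

Because every block is bounded by minima, a block never cuts through a peak, but a shoulder that is not separated from its neighbor by a minimum shares that neighbor's block.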


7/12

Initial peak fitting implemented. Shoulder peaks are not detected unless they are a local maximum. The advantage of this method is that many more small peaks are detected.

![min_block_peak_fitting](images/min_block_peak_fitting.png)

![min_block_peak_fitting_errors](images/min_block_peak_fitting_errors.png)

Notice that in the image above two separate shoulder peaks are completely missed by this method. This is because they are not bounded by a local minimum. Additionally there is a peak on the right side which is

Additionally, peaks need to be plotted on the peak heat map for visualization, and peak_width_penalty and DBSCAN_PCA_AGG need to be rerun with the fixed peakBBA method and the new method for comparison.


Analysis of min-block peak fitting and BBA curve fitting:

The black dots are the peaks found by the BBA curve fitting algorithm.
The black X's are the peaks found by the min-block curve fitting method.

![minblock_BBA_curves_analysis1](images/minblock_BBA_curves_analysis1.png)

Here we see that the BBA curve fitting algorithm has some false positives and doesn't always hit peaks exactly.
The min-block curve fitting method seems to grab more peaks but also grabs extra peaks which are very small. It is not clear if they are noise. The two methods will be compared in terms of how they produce clusters.


7/15

Peaks on Heat Map Plot 

Peak fitting using local minima as blocks:
![peaks_on_heatmap_minblock](images/peaks_on_heatmap_minblock.png)

Peak fitting using BBA algorithm to find blocks. Local max of each section.
![peaks_on_heatmap_BBApeaks](images/peaks_on_heatmap_BBApeaks.png)

Peak fitting using BBA algorithm to find blocks. Centers of fitted curves to each block.
![peaks_on_heatmap_BBAcurves](images/peaks_on_heatmap_BBAcurves.png)


Clustering (DBSCAN_PCA_AGG) redone with peak BBA, curve BBA, and peak min-block

Min-Block peak fitting: Ignoring the white plot, it is visible that due to the many extraneous peaks the algorithm is unable to properly separate clusters.
![DBSCAN_PCA_AGG_minblock_peaks](images/DBSCAN_PCA_AGG_minblock_peaks.png)

BBA peak fitting (same algorithm as originally implemented): Clustering is decent but not perfect.
![DBSCAN_PCA_AGG_BBA_peaks](images/DBSCAN_PCA_AGG_BBA_peaks.png)


BBA curve fitting: Clustering seems to grab clusters in the wafer much more accurately.
![DBSCAN_PCA_AGG_BBA_curves](images/DBSCAN_PCA_AGG_BBA_curves.png)

In all three cases the MLE solver for PCA determined that only 1 dimension needed to be removed from the initial peak dimension reduction. This may be because of the occasional false positives in the peak finding code.
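For reference, a minimal sketch of letting PCA choose its own dimension, assuming scikit-learn's `PCA(n_components="mle")`, which uses Minka's MLE to estimate the number of components. The synthetic peak vectors (3 latent factors plus noise) and the array sizes are assumptions and are not data from the wafer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_patterns, n_peaks = 200, 10  # hypothetical sizes

# Synthetic peak vectors driven by 3 latent factors plus small noise.
latent = rng.normal(size=(n_patterns, 3))
mixing = rng.normal(size=(3, n_peaks))
peak_vectors = latent @ mixing + 0.05 * rng.normal(size=(n_patterns, n_peaks))

# "mle" requires the full SVD solver and at least as many samples as features.
pca = PCA(n_components="mle", svd_solver="full")
reduced = pca.fit_transform(peak_vectors)
print(pca.n_components_, "of", n_peaks, "dimensions kept")
```

When nearly every column carries its own independent variation, as spurious peaks would cause, the estimate stays close to the full dimension.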


