6/13

The cosine similarity metric separates some clusters but not all. For a general data set, important clusters may not be identified.

Example: A few of the major groups were found almost immediately, but a specific region is not broken down until over 60 clusters are calculated.

Ideally cluster similarity should be measured by looking at peaks and their positions.
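A minimal sketch of why cosine similarity can miss peak-level differences. The two spectra are made-up values, not wafer data: they share one strong peak, and only the second has a weak extra peak.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical spectra: same strong peak at index 2, weak extra peak at index 4.
spectrum_a = [1.0, 1.0, 10.0, 1.0, 1.0, 1.0]
spectrum_b = [1.0, 1.0, 10.0, 1.0, 3.0, 1.0]

# The strong shared peak dominates the dot product, so the value stays close to 1.
similarity = cosine_similarity(spectrum_a, spectrum_b)
```

The extra peak barely moves the similarity, which matches the observation that regions differing by a few peaks are not separated early.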

Another consideration, for both runtime and the clusters that are formed, is giving the clustering algorithm an adjacency matrix that represents the connections between adjacent cells. This will change how clusters are formed and may also improve runtime.
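One way such an adjacency matrix could be built, assuming each measured cell has integer grid coordinates and that "adjacent" means the four direct neighbors. The function name and the example cells are hypothetical.

```python
def grid_adjacency(positions):
    """Return a symmetric 0/1 adjacency matrix for 4-connected grid cells.

    positions: list of (x, y) integer grid coordinates, one per measured cell.
    """
    index = {pos: i for i, pos in enumerate(positions)}
    n = len(positions)
    adjacency = [[0] * n for _ in range(n)]
    for (x, y), i in index.items():
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            j = index.get((x + dx, y + dy))
            if j is not None:
                adjacency[i][j] = 1
    return adjacency

# Hypothetical 2x2 block with one corner missing (a wafer grid is not rectangular).
cells = [(0, 0), (1, 0), (0, 1)]
adjacency = grid_adjacency(cells)
```

A matrix like this can be passed as a connectivity constraint to clustering routines that accept one, for example the `connectivity` parameter of scikit-learn's `AgglomerativeClustering`.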

![Cosine Clustering](images/clusters-3-10.png)

6/14

Implementing a phase coloring based on relative similarity. The goal is to see major colors like red, green, and blue in the major clusters and then, as clusters split, have the parts retain the colors of the parent cluster. This should result in a plot where the changing colors show the gradients in cosine similarity, and regions that have a single material/spectrum will have a single, almost unchanged color.

Initial implementation:
Start with 3 clusters and assign them the colors red, green, and blue. Then proceed to split clusters based on the hierarchical clustering method. Each time a cluster is split, the neighbors in the color wheel N1, N2 and the two children clusters C1, C2 are considered.

The children's colors are assigned such that the hue difference between N1 - C1 - C2 - N2 is proportional to the similarity of the pairs. The order of C1,C2 (N1 - C1 - C2 - N2 vs. N1 - C2 - C1 - N2) is determined by the similarity of C1 and C2 to N1. The more similar pair is used.
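A rough sketch of this hue assignment. It assumes the gaps are driven by a dissimilarity (for example 1 minus cosine similarity), so that more similar clusters get closer hues, and that the neighbor hues satisfy `hue_n1 < hue_n2` with no wrap-around on the color wheel. The function and the cluster names are hypothetical.

```python
def split_hues(hue_n1, hue_n2, dist, n1, n2, child_a, child_b):
    """Assign hues to two child clusters between neighbor hues hue_n1 < hue_n2.

    dist(u, v) returns a non-negative dissimilarity between clusters u and v.
    """
    # The child that is more similar to N1 is placed next to N1.
    if dist(child_a, n1) <= dist(child_b, n1):
        c1, c2 = child_a, child_b
    else:
        c1, c2 = child_b, child_a
    # Gaps along the chain N1 - C1 - C2 - N2, proportional to dissimilarity.
    gaps = [dist(n1, c1), dist(c1, c2), dist(c2, n2)]
    total = sum(gaps)
    span = hue_n2 - hue_n1
    hue_c1 = hue_n1 + span * gaps[0] / total
    hue_c2 = hue_c1 + span * gaps[1] / total
    return {c1: hue_c1, c2: hue_c2}

# Made-up dissimilarities between neighbors "N1", "N2" and children "A", "B".
table = {
    frozenset(("N1", "A")): 0.2, frozenset(("N1", "B")): 0.1,
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "N2")): 0.1, frozenset(("B", "N2")): 0.3,
}
hues = split_hues(0.0, 0.3, lambda u, v: table[frozenset((u, v))],
                  "N1", "N2", "A", "B")
```

In this example B is closer to N1, so the chain becomes N1 - B - A - N2, and the three equal dissimilarities place B and A at one third and two thirds of the hue range.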

6/17

Wafer coloring implemented with the following results. Because of the use of the cosine metric, the clusters of interest are not segmented. The general region of the cluster is found, but the specific points are not identifiable. Furthermore, there is a clear split in shade through the desired cluster.

![177 Cosine Clusters](images/Cosine_177.png)


The plan is to adjust the coloring to be on a linear scale. Ideally this would mean that red and green are the colors furthest from each other and the rest are some combination of them. Also, when a cluster is split, all other colors are shifted based on the similarity of the corresponding clusters. Instead of just considering the two neighbors and scaling the split in the range between them, the entire hue spectrum is considered, and all the clusters mapped so far are placed proportionally to the similarity of their neighbors.
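A possible sketch of the linear placement, assuming the clusters are kept in an ordered chain and that the spacing is driven by the dissimilarity between consecutive clusters. The function name and the default range (hue 0 for red to hue 1/3 for green, as in HSV) are assumptions.

```python
def linear_hues(neighbor_dists, hue_min=0.0, hue_max=1.0 / 3.0):
    """Map an ordered chain of clusters onto [hue_min, hue_max].

    neighbor_dists[i] is the dissimilarity between cluster i and cluster i + 1,
    so a chain of n clusters has n - 1 entries (assumed not all zero).
    """
    total = sum(neighbor_dists)
    hues = [hue_min]
    running = 0.0
    for d in neighbor_dists:
        running += d
        hues.append(hue_min + (hue_max - hue_min) * running / total)
    return hues

# Made-up chain of four clusters; the last pair is twice as dissimilar.
uneven = linear_hues([1.0, 1.0, 2.0], hue_min=0.0, hue_max=1.0)
# Equal dissimilarities between all neighbors.
even = linear_hues([0.5, 0.5, 0.5, 0.5], hue_min=0.0, hue_max=1.0)
```

Every hue is recomputed whenever the chain changes, and equal neighbor dissimilarities produce evenly spaced hues.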

6/18

Implementation of the colormap with a linear color scale resulted in a less informative visual. Because all of the similarity values are rescaled at each split, the cluster colors become an even distribution over the color spectrum. Even though the separation is proportional to similarity and similarity is scaled (by subtracting the minimum value), the majority of the similarity values are the same, so the scale is almost uniform.

![177 Linear Cosine Clustering](images/Cosine_linear_177.png)

Either way, because of the large number of points, there is almost no visible gradient anywhere in the image that corresponds to any desired phases. Instead, the visual shows the order in which grid locations are added to clusters in a build-up clustering approach. This does produce a spiral pattern, but it is not really the information we are trying to visualize.

The plan is to transition to using peaks or some form of peak metric to measure similarity.

Peak location will be done via Roberts code (open questions: how to merge the code and which files are needed).

Another task is to write a container for the existing code allowing it to function independently of the system. Plan is to start with Singularity and potentially, if need be, use Docker.

6/19

For completeness, L1 and L2 distance visuals have been implemented to see how well these similarity metrics can differentiate between clusters. Results are similar to those from the cosine metric, with some variations. Neither method shows a clear advantage.
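For reference, the two distances compared here can be written as follows for equal-length intensity vectors. The example vectors are made up.

```python
import math

def l1_distance(a, b):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    """Square root of the sum of squared differences (Euclidean distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors chosen so both distances are easy to check by hand.
vec_a = [0.0, 3.0, 0.0]
vec_b = [4.0, 0.0, 0.0]
d1 = l1_distance(vec_a, vec_b)
d2 = l2_distance(vec_a, vec_b)
```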

![L1](images/L1_177.png)
![L2](images/L2_177.png)

Code for locating peaks in spectral data needs to be integrated into the current clustering algorithm, and then the effectiveness of the method can be assessed.

A container to run all existing code has been implemented. It automatically installs the latest version of the code (from GitHub) and comes with the TiNiSn_500 and TiNiSn_600 data. Testing the container to make sure it runs on other machines is still necessary.



6/20

Peak-based clustering implemented with significant results. Plotting the number of peaks in each spectrum results in a plot similar to the L1 and L2 similarity plots above. Computing the similarity of two spectra from the number of peaks in each that are not present in the other results in a plot where key regions are identifiable. To determine whether a peak is present in both spectra, a "delta" parameter is used as the maximum allowed peak shift. Within this delta, two peaks are considered to be the same. This value was picked based on the resulting plots, so a method to properly determine the value beforehand is necessary. Furthermore, many clusters that should be whole are split up by this method. Incorporating the peak intensities could potentially improve the algorithm's performance.
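A minimal sketch of this peak-presence comparison, assuming each spectrum has already been reduced to a list of peak positions. The function names, the peak positions, and the delta value below are illustrative only.

```python
def unmatched_peaks(peaks_a, peaks_b, delta):
    """Count peaks in peaks_a that have no peak in peaks_b within delta."""
    return sum(
        1 for p in peaks_a
        if not any(abs(p - q) <= delta for q in peaks_b)
    )

def peak_dissimilarity(peaks_a, peaks_b, delta):
    """Number of peaks present in one spectrum but missing from the other."""
    return (unmatched_peaks(peaks_a, peaks_b, delta)
            + unmatched_peaks(peaks_b, peaks_a, delta))

# Made-up peak positions: 2.10/2.13 and 4.50/4.50 match within delta,
# while 3.00 and 5.20 have no counterpart.
peaks_a = [2.10, 3.00, 4.50]
peaks_b = [2.13, 4.50, 5.20]
score = peak_dissimilarity(peaks_a, peaks_b, delta=0.05)
```

A score of 0 means every peak has a counterpart within delta. Larger delta values merge more peaks, which is why the choice of delta changes the resulting plots.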

![Peak Clustering](images/peak_clust-0.049-31.png)

Note how the Heusler and half-Heusler regions are bounded along one of their sides, while in the other direction the region extends further than necessary.

6/21

Presentation: https://docs.google.com/presentation/d/1LDtxghLUUXl52NSUOivJ78-jeY5iKSavsgqMwsCn7AY/edit#slide=id.g5c0eef7190_0_88

The new direction is to identify peak clusters (the regions in space that correspond to the peaks) and use these to represent diffraction patterns.

For example: if 7 unique peaks are found across all the diffraction patterns (accounting for peak shifting), then the reduced representation of one pattern may look like (0,0,0.1,3,0,0,5), where 0 means the pattern does not have that peak and a nonzero value gives the intensity of a peak that is present.
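A small sketch of how such a reduced representation could be produced, assuming a known list of reference peak positions and a tolerance for peak shifting. All names and numbers are hypothetical and chosen to reproduce the example vector.

```python
def peak_vector(pattern_peaks, reference_positions, delta):
    """Reduce one diffraction pattern to one entry per reference peak.

    pattern_peaks: list of (position, intensity) pairs for this pattern.
    An entry stays 0.0 when the pattern has no peak within delta of it.
    """
    vector = [0.0] * len(reference_positions)
    for position, intensity in pattern_peaks:
        # Find the closest reference peak and accept it only within delta.
        nearest = min(range(len(reference_positions)),
                      key=lambda i: abs(reference_positions[i] - position))
        if abs(reference_positions[nearest] - position) <= delta:
            vector[nearest] = max(vector[nearest], intensity)
    return vector

# Seven made-up reference peaks and one pattern with three slightly shifted peaks.
reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
pattern = [(3.02, 0.1), (3.98, 3.0), (7.01, 5.0)]
reduced = peak_vector(pattern, reference, delta=0.05)
```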




6/24

Peak clustering has been implemented with some results. All the peaks in the data set are mapped to coordinates based on their x,y position in the grid and the position of the peak in the diffraction pattern (p). Each peak is assigned the coordinate (x/100,y/100,p). The x,y coordinates are scaled to prevent clustering in the peak axis instead of the x,y axes.

Peaks are clustered into 50 groups. This is an arbitrary value and needs to be adjusted to match the true number of peaks.

Peak clustering is used to convert each diffraction pattern into a 50-dimensional vector (called a peak vector). These vectors are then used to cluster the diffraction patterns on the wafer. This produces the following clusterings:
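A sketch of the conversion from labeled peaks to peak vectors. It assumes the peak labels have already been produced by whichever clustering method is applied to the (x/100,y/100,p) coordinates, and that each vector entry holds the strongest intensity seen for that label. The function names and the data are hypothetical.

```python
def peak_coordinate(x, y, p, xy_scale=100.0):
    """Coordinate used when clustering peaks: grid position scaled down, peak position as-is."""
    return (x / xy_scale, y / xy_scale, p)

def build_peak_vectors(peaks, labels, n_peak_clusters):
    """Build one peak vector per grid location.

    peaks: list of (x, y, p, intensity) tuples, one per detected peak.
    labels: peak-cluster label (0 .. n_peak_clusters - 1) for each peak.
    """
    vectors = {}
    for (x, y, _p, intensity), label in zip(peaks, labels):
        vec = vectors.setdefault((x, y), [0.0] * n_peak_clusters)
        # Keep the strongest peak if two peaks at one location share a label.
        vec[label] = max(vec[label], intensity)
    return vectors

# Made-up peaks at two grid locations, with labels from a hypothetical peak clustering.
peaks = [(10, 20, 2.10, 5.0), (10, 20, 3.40, 1.5), (11, 20, 2.12, 4.0)]
labels = [0, 2, 0]
vectors = build_peak_vectors(peaks, labels, n_peak_clusters=3)
```

With 50 peak clusters, `n_peak_clusters` would be 50 and each grid location would get a 50-entry vector.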

![PeakReductionClustering](images/PeakReductionClustering_50.png)

Clustering seems to correspond more accurately to the actual peaks in the data set, although some balancing of peak intensity versus peak presence needs to be made.


![PeakReductionClustering](images/PeakReductionClustering_60.png)

Here, with 60 possible peaks, the clustering is broader (regions are rounder) and seems to align better with the actual clustering. The orange cluster is present here, whereas with 50 peak clusters it was not.

Analysis of how peaks are clustered suggests that at around 58 peak clusters the same peak is split into two labels, so 57 peak clusters are used:


![PeakReductionClustering](images/PeakReductionClustering_57.png)

Here we can see the amorphous, half-Heusler, and full-Heusler regions decently separated from the rest of the wafer. However, some individual grid locations are still assigned to these clusters when they shouldn't be. This seems to be a product of the current similarity function.

The similarity between two diffraction patterns is found as follows. The peak vectors are obtained for each pattern. The values in these vectors are log scaled, and the L1 distance between them is taken. Clustering is performed without connectivity.

A specific example of how clustering goes wrong is the boundary between the blue and purple clusters in the lower region of the wafer, shown above. The vertical split should actually be slanted, as seen from the plots below.
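The distance described above could be sketched as follows. How zero entries are handled under the log is not stated here, so this sketch assumes log(1 + v), which keeps absent peaks at zero. The function name is hypothetical.

```python
import math

def log_l1_distance(vec_a, vec_b):
    """L1 distance between two peak vectors after log scaling.

    Assumption: log1p (log(1 + v)) is used so that a missing peak (0) stays 0.
    """
    return sum(abs(math.log1p(a) - math.log1p(b))
               for a, b in zip(vec_a, vec_b))

# Made-up peak vectors: one shared peak, one peak present only in the first.
first = [2.0, math.e - 1.0, 0.0]
second = [2.0, 0.0, 0.0]
distance = log_l1_distance(first, second)
```

Log scaling compresses large intensities, so the presence or absence of a peak counts for more than the difference between a strong and a very strong peak.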

![DF_Plot_1](images/DF_Plot_1.png)


Here is an example of where one group of peaks is split into multiple labels. Note the 2 vs 52 labels.

![DF_Plot_2](images/DF_Plot_2.png)