# The Milky Way spiral arms: testing the clustering algorithm

## Introduction

We want to understand the behaviour of the clustering algorithm (t-SNE + HDBSCAN) we have developed to locate the spiral arms in our Galaxy. To this end, we have created various mock samples.

## Mock data

Mock data based on the Reid et al. 2019 model of spiral arms.
- Model selected over other possibilities (e.g., Hou et al. 2021) because the proximity of the Norma and Sct-Cen arms makes it a challenging test for the algorithm
- Note: Only a few Cepheids are currently known at the expected locations of the 3-kpc and Norma arms

Each spiral arm is populated with stars drawn from N bivariate normal distributions centered on N equidistant points along the arm. Each distribution is defined by its mean and covariance matrix (analogous to the mean and variance of a 1D normal distribution).
In our case:
- mean: the (x,y) positions of each of the N reference points
- covariance matrix: $\left( \begin{array}{cc} \sigma^{2} & 0 \\ 0 & \sigma^{2} \end{array} \right)$, with $\sigma$ values proportional to the widths of the individual spiral arms in Reid et al. (2019).

A few stars (out of several tens of thousands) fall outside the exact angular domain covered by the Reid et al. (2019) model; they were generated by the very first and very last bivariate distributions.
- they are discarded for comparison purposes
- as a result, the extremities of the mock spiral arms get sharpened

Finally, N$_{spiral}$ mock Cepheids are randomly selected as spiral-arm members.
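The procedure above can be sketched as follows. This is a minimal illustration, not the code used for the actual mock catalogs: the arm geometry (a single logarithmic spiral with hypothetical `r_ref`, `pitch_deg` and azimuth limits), the choice of reference points equally spaced in azimuth, and the function names `mock_arm` and `select_members` are all assumptions made for the example. Only `sigma2` and `n_ref` are taken from the Sgr-Car row of the table below.

```python
import numpy as np

def mock_arm(theta_min_deg, theta_max_deg, r_ref, pitch_deg, sigma2,
             n_ref, stars_per_ref, rng):
    """Return (x, y) positions of mock stars scattered around one arm."""
    # n_ref reference points, equally spaced in azimuth (assumption).
    theta = np.radians(np.linspace(theta_min_deg, theta_max_deg, n_ref))
    # Logarithmic spiral: ln(r / r_ref) = -(theta - theta_mid) * tan(pitch).
    theta_mid = 0.5 * (theta[0] + theta[-1])
    r = r_ref * np.exp(-(theta - theta_mid) * np.tan(np.radians(pitch_deg)))
    centres = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

    # One bivariate normal per reference point, covariance sigma^2 * I.
    cov = sigma2 * np.eye(2)
    offsets = rng.multivariate_normal([0.0, 0.0], cov,
                                      size=(n_ref, stars_per_ref))
    stars = (centres[:, None, :] + offsets).reshape(-1, 2)

    # Discard stars scattered beyond the angular domain of the model.
    star_theta = np.arctan2(stars[:, 1], stars[:, 0])
    inside = (star_theta >= theta[0]) & (star_theta <= theta[-1])
    return stars[inside]

def select_members(stars, n_select, rng):
    """Randomly pick n_select stars (without replacement) as arm members."""
    idx = rng.choice(len(stars), size=n_select, replace=False)
    return stars[idx]

rng = np.random.default_rng(42)
# Hypothetical geometry; sigma2 and n_ref follow the Sgr-Car table row.
pool = mock_arm(20.0, 160.0, r_ref=5.0, pitch_deg=10.0, sigma2=0.021,
                n_ref=20, stars_per_ref=500, rng=rng)
members = select_members(pool, 150, rng)
```

Trimming on azimuth removes roughly half of the stars generated by the first and last distributions, which is what sharpens the extremities of the mock arms.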

|     Arm | $\sigma^2$ |   N  | Noise (ideal) | Noise (noisy) | Noise (inter-arm) |
|---------|------------|------|---------------|---------------|-------------------|
|   3 kpc |    0.014 |  20  |  0.0  |  0.03 |    0.01    |
|   Norma |    0.011 |  20  |  0.0  |  0.03 |    0.01    |
| Sct-Cen |    0.018 |  20  |  0.0  |  0.03 |    0.01    |
| Sgr-Car |    0.021 |  20  |  0.0  |  0.03 |    0.01    |
|   Local |    0.024 |  50  |  0.0  |  0.03 |    0.01    |
| Perseus |    0.027 |  50  |  0.0  |  0.03 |    0.01    |
|   Outer |    0.050 |  50  |  0.0  |  0.03 |    0.01    |

<div align="center">
Parameters for the creation of mock spiral arms based on the model by Reid et al. 2019
</div>

Comparison between the original mock data and the retrieved sample: Hotelling's T-squared statistic.
- multivariate generalization of Student's t-statistic
- specifically useful when comparing two samples of unequal sizes, mean values and variances

Null hypothesis: the original and retrieved samples for a given spiral arm are drawn from the same parent distribution.
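As an illustration of such a test, here is a minimal sketch of the classical two-sample Hotelling's T-squared test with a pooled covariance matrix, restricted to 2D positions. Whether the analysis uses this pooled form or a variant for unequal covariances is not stated here, so treat the choice (and the function name `hotelling_t2_2d`) as an assumption. For two dimensions, the survival function of the F distribution has a closed form, which avoids an extra dependency.

```python
import numpy as np

def hotelling_t2_2d(a, b):
    """Two-sample Hotelling T^2 test for 2D samples (pooled covariance).

    Returns (T2, F, p_value). Assumes both samples share one covariance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2, p = len(a), len(b), a.shape[1]
    if p != 2 or b.shape[1] != 2:
        raise ValueError("this sketch only handles 2D samples")

    diff = a.mean(axis=0) - b.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(a, rowvar=False)
                + (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    t2 = n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(s_pooled, diff)

    # Under the null hypothesis, F follows an F(p, n1 + n2 - p - 1) law.
    dfd = n1 + n2 - p - 1
    f_stat = dfd / (p * (n1 + n2 - 2)) * t2
    # Closed-form survival function of F(2, dfd).
    p_value = (1.0 + 2.0 * f_stat / dfd) ** (-dfd / 2.0)
    return t2, f_stat, p_value

rng = np.random.default_rng(1)
original = rng.normal(size=(200, 2))
shifted = original[:150] + np.array([1.0, 0.0])

_, _, p_same = hotelling_t2_2d(original, original.copy())
_, _, p_shifted = hotelling_t2_2d(original, shifted)
```

A p-value close to 1 means the mean positions of the two samples are indistinguishable; a small p-value rejects the null hypothesis. For more than two dimensions, `scipy.stats.f.sf(f_stat, p, dfd)` gives the same quantity.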

## Test case 1: Comparison between the mock and the retrieved sample (ideal case)

The next figure shows:
- the mock data (N_spiral=1500 Cepheids) in the Milky Way plane (top left)
- the groups identified by HDBSCAN in the t-SNE space (top center)
- the same (color-coded) groups in the Milky Way plane (top right)
- the same groups in the ($\Theta$, ln(r)) space (bottom)

Note: The 3 kpc arm contains only a few Cepheids (because of its small angular extent) --> it is ignored in what follows<br>
Note: The hyper-parameters of t-SNE+HDBSCAN are not adjusted to the specifics of the mock data; they keep the same values as for the real data
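The ($\Theta$, ln(r)) representation used in the bottom panel is convenient because a logarithmic spiral becomes a straight line in that plane. The short sketch below checks this numerically. It assumes angles and radii measured about the origin of the (x, y) frame, and the pitch angle and `r0` are hypothetical values chosen only for the demonstration.

```python
import math

def to_theta_lnr(x, y):
    """Map Cartesian (x, y) to (theta, ln r) about the origin."""
    return math.atan2(y, x), math.log(math.hypot(x, y))

pitch = math.radians(12.0)  # hypothetical pitch angle
r0 = 4.0                    # hypothetical radius at theta = 0

# Points along a logarithmic spiral: ln(r / r0) = -theta * tan(pitch).
points = []
for k in range(1, 10):
    theta = k * math.pi / 10.0
    r = r0 * math.exp(-theta * math.tan(pitch))
    points.append((r * math.cos(theta), r * math.sin(theta)))

mapped = [to_theta_lnr(x, y) for x, y in points]
# Slope between consecutive points in the (theta, ln r) plane.
slopes = [(l2 - l1) / (t2 - t1)
          for (t1, l1), (t2, l2) in zip(mapped, mapped[1:])]
```

Every slope equals `-tan(pitch)`, so a spiral arm segment shows up as a linear feature whose slope encodes its pitch angle.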

<p style="text-align:center;"><img src="plots_spiral_test/App1a_simulated_arms_sig_0.1_noise_0.0_None_1500_pts_0_noise_all_plots.jpg" width="1000"></p>

- Very high p-values, very high recovery fractions --> spiral arms almost perfectly recovered by the algorithm (see also next figure)
- The Norma and the Sct-Cen arms are each recovered as two segments, which result from under-densities / gaps in the spatial distribution of Cepheids
- Increasing the number of stars in the mock catalog reduces the number and extension of such gaps --> mock spiral arms recovered within a single structure
- The other arms (e.g., the Perseus arm) also show gaps without being split into segments --> segments are more likely to occur in crowded regions of the t-SNE space

|     Arm | Group | p-value |     N_spiral | Retrieved |  % retrieved |
|--------:|------:|--------:|-------------:|----------:|-------------:|
|   Norma |   7,6 |    0.98 |           93 |        92 |        98.92 |
| Sct-Cen |   4,5 |    1.00 |          169 |       168 |        99.41 |
| Sgr-Car |     0 |    1.00 |          167 |       167 |       100.00 |
|   Local |     2 |    1.00 |          158 |       158 |       100.00 |
| Perseus |     3 |    1.00 |          550 |       550 |       100.00 |
|   Outer |     1 |    1.00 |          362 |       362 |       100.00 |

<div align="center">
Groups retrieved by the algorithm that correspond to a given spiral arm.
</div>

<p style="text-align:center;"><img src="plots_spiral_test/App1b_simulated_arms_sig_0.1_noise_0.0_None_1500_pts_0_noise_segments.jpg" width="400"></p>

## Test case 2: Noisy spiral arms

<p style="text-align:center;"><img src="plots_spiral_test/App2a_simulated_arms_sig_0.1_noise_0.03_None_1500_pts_0_noise_all_plots.jpg" width="1000"></p>

<p style="text-align:center;"><img src="plots_spiral_test/App2b_simulated_arms_sig_0.1_noise_0.03_None_1500_pts_0_noise_segments.jpg" width="400"></p>

New mock samples with increased dispersion of the Cepheids in spiral arms:
- Random noise added to their (x,y) coordinates
- Random values sampled from a univariate standard normal distribution scaled by the "noise" parameter added to (x,y)
- Here, the noise parameter is 0.03.<br>
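The noise step described above can be sketched in a few lines. The function name `add_noise` is illustrative, and drawing independent values for x and y is an assumption, since the text only states that scaled standard-normal values are added to (x, y).

```python
import numpy as np

def add_noise(xy, noise, rng):
    """Add standard-normal draws scaled by `noise` to each coordinate."""
    # Independent draws for x and y (assumption).
    return xy + noise * rng.standard_normal(xy.shape)

rng = np.random.default_rng(7)
clean = np.zeros((10000, 2))          # placeholder positions
noisy = add_noise(clean, 0.03, rng)   # noise parameter of this test case
```

With this convention the noise parameter is the standard deviation of the extra scatter in each coordinate, and a value of 0.0 reproduces the ideal test case.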

Still N_spiral=1500 Cepheids. 
- The algorithm performs similarly well: the mock spiral arms are very well reproduced
- Sct-Cen arm: the p-value drops sharply despite a recovery rate close to 95%. Two stars attributed to group 8 belong rather to group 7 --> group 8 is included in the comparison with the original Sct-Cen mock data --> low p-value.
- A few stars close to the gap in the Sct-Cen arm remain unclassified (probably due to their proximity to the Norma arm).

| Arm     | Group | p-value |     N_spiral | Retrieved |  % retrieved |
|---------|-------|--------:|-------------:|----------:|-------------:|
| Norma   |   4,8 |    0.94 |           92 |        91 |        98.91 |
| Sct-Cen |5,6,7,8|    0.01 |          150 |       142 |        94.67 |
| Sgr-Car |     1 |    1.00 |          159 |       159 |       100.00 |
| Local   |     2 |    1.00 |          161 |       161 |       100.00 |
| Perseus |     3 |    1.00 |          553 |       553 |       100.00 |
| Outer   |     0 |    1.00 |          384 |       384 |       100.00 |

<div align="center">
Groups retrieved by the algorithm that correspond to a given spiral arm.
</div>

- Mock Cepheids stay confined to the spiral arms
- The close proximity of the Norma and Sct-Cen spiral arms makes it impossible to further increase the dispersion of individual spiral arms (see figure below: the mock Norma and Sct-Cen arms already come close to touching each other with a noise parameter equal to 0.02)<br>

--> this test case remains a bit unrealistic

<p style="text-align:center;"><img src="plots_spiral_test/Simulated_arms_clean_sig_0.1_noise_0.02.jpg" width="400"></p>

## Test case 3: Inter-arm Cepheids

More realistic test case: 
- mock sample of Cepheids in spiral arms (N_spiral=900, noise=0.01)
- and a large collection of N_other=1500 Cepheids (coordinates drawn from a bivariate normal distribution centered on the Sun, with a variance of 80)<br>

These N_other Cepheids can be considered as inter-arm Cepheids, although by construction some of them may overlap the spiral arms.
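A minimal sketch of how such a background population could be drawn is given below. The position of the Sun in the (x, y) frame (`SUN_XY`) is a hypothetical placeholder, and the isotropic covariance (the same variance on both axes, no correlation) is an assumption, since only a single variance value is quoted above.

```python
import numpy as np

SUN_XY = (0.0, 8.15)  # hypothetical position of the Sun in the (x, y) frame

def inter_arm_sample(n_other, variance, rng, centre=SUN_XY):
    """Draw n_other positions from an isotropic bivariate normal."""
    cov = variance * np.eye(2)  # same variance on x and y (assumption)
    return rng.multivariate_normal(centre, cov, size=n_other)

rng = np.random.default_rng(3)
others = inter_arm_sample(1500, 80.0, rng)
```

With a variance this large, the background stars cover the whole region spanned by the mock arms, which is why some of them land on the arms themselves.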

<p style="text-align:center;"><img src="plots_spiral_test/App3a_simulated_arms_sig_0.1_noise_0.01_Gaussian_900_pts_1500_noise_all_plots.jpg" width="1000"></p>

<p style="text-align:center;"><img src="plots_spiral_test/App3b_simulated_arms_sig_0.1_noise_0.01_Gaussian_900_pts_1500_noise_segments.jpg" width="400"></p>

- Almost all spiral arm members are retrieved
- The t-SNE groups also wrongly include nonmembers: up to $\approx$55% extra stars relative to N_spiral (Outer arm)<br>

-->
- Lower p-values
- Norma and Sct-Cen arms: partial mismatch between the input and retrieved spiral arms. In the region where they are closest to each other, Norma and Sct-Cen are merged into a single structure.
- The other spiral arms are well retrieved, sometimes by means of several segments
- An artificial structure (group 9) emerges from the noise because it is relatively isolated in the t-SNE space. It would have been falsely identified as a real structure beyond the Outer arm.

| Arm     |       Group       | p-value |   N_spiral   | Retrieved |  % retrieved | Extra |  % extra |
|---------|-------------------|--------:|-------------:|----------:|-------------:|------:|---------:|
| Norma   |               7,4 |    0.00 |           56 |        56 |       100.00 |    11 |    19.64 |
| Sct-Cen |               8,4 |    0.47 |          108 |       108 |       100.00 |    25 |    23.15 |
| Sgr-Car |             2,6,5 |    0.02 |           98 |        98 |       100.00 |    35 |    35.71 |
| Local   |                 0 |    0.27 |           88 |        88 |       100.00 |    24 |    27.27 |
| Perseus | 14,16,15,13,12,17 |    0.43 |          341 |       337 |        98.83 |    49 |    14.37 |
| Outer   |                 1 |    0.00 |          208 |       208 |       100.00 |   113 |    54.33 |
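The columns of this table can be computed from per-star labels as sketched below. The percentages follow the convention visible in the table, where both are expressed relative to N_spiral (e.g. 11/56 = 19.64% extra for Norma). The function name, the toy labels, and the choice to count as "extra" every star in the arm's groups that is not a true member of that arm are assumptions made for the example.

```python
def arm_recovery(true_arm, group, arm, arm_groups):
    """Count retrieved members and extra stars for one spiral arm.

    true_arm: true arm name per star (None for inter-arm stars)
    group: cluster label per star (-1 for unclassified stars)
    arm_groups: cluster labels associated with `arm`
    """
    groups = set(arm_groups)
    n_spiral = sum(1 for t in true_arm if t == arm)
    retrieved = sum(1 for t, g in zip(true_arm, group)
                    if t == arm and g in groups)
    # Extra: stars in the arm's groups that are not true members (assumption).
    extra = sum(1 for t, g in zip(true_arm, group)
                if t != arm and g in groups)
    return {
        "n_spiral": n_spiral,
        "retrieved": retrieved,
        "pct_retrieved": round(100.0 * retrieved / n_spiral, 2),
        "extra": extra,
        "pct_extra": round(100.0 * extra / n_spiral, 2),
    }

# Toy labels, not taken from the mock catalogs.
true_arm = ["Norma"] * 4 + ["Sct-Cen"] * 3 + [None] * 3
group = [7, 7, 4, -1, 8, 8, 8, 7, -1, 9]
stats = arm_recovery(true_arm, group, "Norma", arm_groups=(7, 4))
```

In this toy case three of the four Norma stars are retrieved and one inter-arm star is picked up by group 7, giving 75% retrieved and 25% extra.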