# Identifying the Milky Way spiral arms using clustering algorithms

## Context

Mapping the spiral structure of the Milky Way is difficult:
- the Sun is located in the Galactic plane
- the light emitted by stars is absorbed by interstellar material, in particular dust.

Tracers that are almost immune to dust extinction have therefore been used with great success.

For these reasons, molecular masers associated with young massive stars in high-mass star-forming regions are among the most reliable tracers:
- they are extremely young --> they are still at their birthplace
- their distances (measured via the parallax method through radio-interferometry) are very accurate
- the observations in the radio wavelength domain are unaffected by dust extinction.

However, only a couple of hundred such sources are known so far.

Based on those, Reid et al. (2019) found that the Milky Way spiral structure consists of four arms, plus the Local arm, which they consider to be an isolated segment. However, alternative models exist, for instance, a two-(major)-arm model by Drimmel (2000).

Here, we want to determine the location of the spiral arms using Cepheids: 
- they are bright --> they can be observed at very large distances
- their distances are very accurate
- with near- or mid-infrared photometry, their distances are minimally affected by interstellar extinction

The distances to individual Cepheids have been determined using period-Wesenheit relations, a specific type of period-luminosity relation (see the corresponding notebook). The (negligible) effect of the Galactic warp on the distances has been taken into account (see the corresponding notebook).

The age of a Cepheid is inversely correlated with its period via period-age relations (e.g., Efremov 1978, Bono et al. 2005), but the uncertainty on an individual age is at least 100%.
However, the ranking of Cepheids by age is very reliable: the period, which can be measured with great accuracy, is the driving parameter of period-age relations.
Although they are younger than 300 Myr (many stars are several Gyr old, some even older than 10 Gyr), Cepheids potentially had time to move away from their birthplace. In what follows, we therefore limit our sample of Cepheids using various age limits.

Over-plotting the spiral arms traced by masers (Reid et al. 2019) on top of the entire sample of Cepheids (no age restriction)
--> the Cepheids’ overdensities match the spiral arms well.

<p style="text-align:center;"><img src="plots_spiral/Overplot_spiral_Reid19.jpg" width="300"></p>

Classical Cepheids (black dots) and spiral arms from Reid et al. 2019 (colored lines) in the Galactic plane. 
Concentric circles are shown every 4 kpc to guide the eye. 
The Galactic center (black filled circle) is at (0,0) and the Sun (yellow star) at (8.275,0).

Inter-arm regions have lower densities of Cepheids, as can be seen in the radial distribution of Cepheids located in a Galactocentric angular sector around 160°.

<p style="text-align:center;"><img src="plots_spiral/Cepheid_density_vs_Reid19.jpg" width="300"></p>

Kernel density estimation (with a kernel bandwidth of 0.1) of the radial distribution of Cepheids located in a Galactocentric angular sector around 160°. The spiral arms from Reid et al. (2019) in this sector are shown as vertical dashed lines.
(The angular sector around 160° intercepts the spiral arms in a region where the completeness of the data is not hindered by the two shadow cones visible in the previous figure, which hamper the detection of Cepheids beyond nearby regions with strong extinction.)
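A minimal sketch of such a sector density estimate is given below. It is not the code used for the figure: the function name, the sector half-width of 10°, the radial grid, the azimuth convention (counter-clockwise from the positive x axis) and the reading of the bandwidth of 0.1 as a Gaussian width in kpc are all assumptions made for illustration.

```python
import numpy as np

def sector_radial_kde(x, y, centre_deg=160.0, half_width_deg=10.0,
                      bandwidth=0.1, grid=None):
    """Gaussian KDE of Galactocentric radii for stars inside an angular sector.

    x, y: Galactocentric Cartesian coordinates in kpc (Galactic center at 0, 0).
    half_width_deg and the default grid are hypothetical choices.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.hypot(x, y)
    theta = np.degrees(np.arctan2(y, x)) % 360.0
    # Smallest angular separation from the sector centre, in [0, 180] degrees.
    sep = np.abs((theta - centre_deg + 180.0) % 360.0 - 180.0)
    r_sel = r[sep <= half_width_deg]
    if grid is None:
        grid = np.linspace(0.0, 20.0, 401)
    if r_sel.size == 0:
        return grid, np.zeros_like(grid)
    # Sum of Gaussian kernels of width `bandwidth`, normalised to unit integral.
    z = (grid[:, None] - r_sel[None, :]) / bandwidth
    norm = r_sel.size * bandwidth * np.sqrt(2.0 * np.pi)
    density = np.exp(-0.5 * z**2).sum(axis=1) / norm
    return grid, density
```

Local minima of the returned density between two maxima would then correspond to candidate inter-arm regions in that sector.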

## Clustering algorithm: t-SNE + HDBSCAN

t-SNE: t-distributed Stochastic Neighbor Embedding (van der Maaten & Hinton 2008)

- t-SNE is often used to visualize high-dimensional data in a lower-dimensional space
- Here, the only inputs are the coordinates (θ, ln r) of the Cepheids in our dataset (r: Galactocentric distance, θ: Galactocentric azimuth)
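As an illustration, the (θ, ln r) inputs could be derived from Galactocentric Cartesian positions as sketched below. The function name and the azimuth convention (θ in radians, measured counter-clockwise from the positive x axis with `arctan2`) are assumptions; the convention adopted in the paper may differ.

```python
import numpy as np

def to_theta_lnr(x, y):
    """Convert Galactocentric (x, y) in kpc to (theta, ln r).

    theta is returned in radians in (-pi, pi]; this convention is an assumption.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.hypot(x, y)          # Galactocentric distance
    theta = np.arctan2(y, x)    # Galactocentric azimuth
    return theta, np.log(r)
```

With this convention the azimuth wraps at ±π, so a structure crossing that line would appear split in the (θ, ln r) plane.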

t-SNE uses a Student’s t-distribution to compute the similarity between two data points in the t-SNE output space, 
--> it performs very well in keeping similar input data points close together in the output space, even if they come from crowded regions. 
--> Downside: t-SNE performs poorly when data are sparse. 

t-SNE setup:
- the data are standardized
- t-SNE is initialized using a principal component analysis (PCA)
- t-SNE is run for 6000 iterations in a 2D space.

NB: For our dataset, the topology of the outcome in the t-SNE space is robust to the choice of the perplexity value (the effective number of neighbors considered by t-SNE for any given data point).
NB: For our dataset, the topology of the outcome in the t-SNE space is also robust to the choice of the early exaggeration value (set to 5), which ensures that tight clusters in the data will not overlap in the t-SNE space.
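The settings listed above could be reproduced with scikit-learn roughly as follows. This is a sketch, not the original code: the use of scikit-learn, the function name, the default perplexity of 30 and the fixed random seed are assumptions, since the source does not state them.

```python
import inspect

import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def embed_tsne(features, perplexity=30.0, n_iter=6000, random_state=0):
    """Standardize the input features and embed them in a 2D t-SNE space.

    features: array of shape (n_stars, 2) holding (theta, ln r).
    perplexity and random_state are hypothetical defaults.
    """
    scaled = StandardScaler().fit_transform(np.asarray(features, dtype=float))
    # The iteration keyword is `max_iter` in recent scikit-learn releases
    # and `n_iter` in older ones, so pick whichever this version accepts.
    params = inspect.signature(TSNE).parameters
    iter_kw = "max_iter" if "max_iter" in params else "n_iter"
    tsne = TSNE(n_components=2, init="pca", perplexity=perplexity,
                early_exaggeration=5.0, random_state=random_state,
                **{iter_kw: n_iter})
    return tsne.fit_transform(scaled)
```

Re-running the embedding with a few different perplexity values is one simple way to check the robustness mentioned in the notes above.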

Individual groups are then identified with HDBSCAN, an unsupervised clustering algorithm that identifies clusters in a distribution of data points (Campello et al. 2015; McInnes et al. 2017).

HDBSCAN:
- minimum of 5 groups (well below the number of clusters actually found)
- minimum of 20 members per group (to avoid spurious detections of tiny groups)
- assumption: Euclidean distances between individual points in the t-SNE space.

Top-left panel: distribution of Cepheids (here, age max = 150 Myr) in the t-SNE space. The color-coding indicates groups identified by HDBSCAN.

Bottom panel: Cepheids in the (θ, ln r) space with the same color-coding. The groups identified by t-SNE+HDBSCAN form narrow, linear sequences in this plane, as is expected under the common assumption that spiral arms follow a logarithmic spiral.

Top-right panel: Spatial distribution of the identified groups in the Milky Way plane --> each group forms indeed a section of a given spiral arm. 

<p style="text-align:center;"><img src="plots_spiral/single_age_150.jpg" width="600"></p>

NB: the large groups (1, 3) gathering distant Cepheids do not trace reliable spatial structures. They reflect the fact that t-SNE does not perform well with sparse data (at large distances in the disk, the search for pulsating variable stars is still largely incomplete and their classification is uncertain).
NB: Similarly, a few isolated Cepheids in the outer disk are attributed to likely unreliable groups, for instance, to groups 2 or 15.

NB: the 150 Myr age limit is arbitrary. It is a compromise that identifies a good number of spiral features without including older stars that may have drifted away from their birthplace (see the extensive discussion in the paper).

Properties of a given individual segment/group:

- fit a linear relation in the (θ, ln r) space, $\ln r = a \times \theta + b$, through all group members
- from the slope, derive the pitch angle
- reference angle = midpoint of the minimal and maximal Galactocentric azimuths covered by the group
- radius corresponding to the reference angle = reference (logarithm of the) radius
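The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the original code: θ is taken in radians, the fit is an unweighted least-squares line, the pitch angle is obtained as arctan|a| (the standard relation for a logarithmic spiral), and the reference radius is read off the fitted line at the reference angle. The function name and the returned dictionary keys are hypothetical.

```python
import numpy as np

def fit_segment(theta, ln_r):
    """Fit ln r = a * theta + b to one group and derive its segment properties.

    theta: Galactocentric azimuths of the group members, in radians.
    ln_r: natural logarithm of their Galactocentric distances.
    """
    theta = np.asarray(theta, dtype=float)
    ln_r = np.asarray(ln_r, dtype=float)
    a, b = np.polyfit(theta, ln_r, 1)            # slope, intercept
    pitch_deg = np.degrees(np.arctan(abs(a)))    # pitch angle of a log spiral
    theta_ref = 0.5 * (theta.min() + theta.max())
    ln_r_ref = a * theta_ref + b                 # assumed: taken on the fitted line
    return {"slope": a, "intercept": b, "pitch_deg": pitch_deg,
            "theta_ref": theta_ref, "ln_r_ref": ln_r_ref}
```

Each group would be passed separately, for instance `fit_segment(theta[labels == k], ln_r[labels == k])` for a group label `k`.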

With these values, we trace the spiral segments displayed below.
Several segments located at a similar reference radius can be interpreted as different sections of the same spiral arm.

NB: several large groups are not resolved by the algorithm (see text above) and are ignored.
NB: several groups are formed from a very small number of stars spread over large distances and are probably spurious.

<p style="text-align:center;"><img src="plots_spiral/4plots_150.jpg" width="600"></p>

## Comparison with previous studies

Comparison with previous models. 
Top left: spiral segments (olive) over-plotted on Cepheids identified as members of a group. 
Top right: spiral segments and the model of Reid et al. (2019).
Bottom left: spiral segments and the model of Levine et al. (2006b).
Bottom right: spiral segments and the model of Hou (2021).

A detailed comparison of individual segments and spiral arms is discussed in the paper.

## Outcome of tests on mock data

We have run several tests on mock data to understand the behaviour of the algorithm (see the dedicated notebook).

The mock spiral structure is based on the Reid et al. (2019) model. 
Tests show that:
- the algorithm recovers the mock spiral arms very well, even with large amounts of "inter-arm" Cepheids ("noise").
- a fraction of these "inter-arm" Cepheids is included in the nearest arm, but the impact on the recovered location is marginal.
- the algorithm is sensitive to small gaps (regions without stars) in individual spiral arms --> a given spiral arm may then be split into several segments delimited by those gaps.
- this is more likely to occur when two spiral arms are very close to each other. In such a case, two segments from two different arms may be joined within the same group, leading to a spurious location of a spiral arm at a median distance between the two segments.
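For readers who want to experiment with such tests, a toy mock sample could be built along the following lines. This is not the generator used in the dedicated notebook: the function names, the Gaussian scatter in ln r, the sign convention of the spiral and the uniform "inter-arm" noise are assumptions, and any arm parameters passed in are placeholders rather than the Reid et al. (2019) values.

```python
import numpy as np

def mock_arm(n_stars, r_ref, theta_ref, pitch_deg, theta_range,
             scatter=0.02, rng=None):
    """Draw stars along a logarithmic spiral arm in the (theta, ln r) plane.

    The arm follows ln r = ln r_ref - tan(pitch) * (theta - theta_ref),
    with theta in radians; the sign convention is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(theta_range[0], theta_range[1], size=n_stars)
    ln_r = np.log(r_ref) - np.tan(np.radians(pitch_deg)) * (theta - theta_ref)
    ln_r = ln_r + rng.normal(0.0, scatter, size=n_stars)  # arm thickness
    return theta, ln_r

def mock_noise(n_stars, r_range, theta_range, rng=None):
    """Draw "inter-arm" stars uniformly in azimuth and in radius."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(theta_range[0], theta_range[1], size=n_stars)
    r = rng.uniform(r_range[0], r_range[1], size=n_stars)
    return theta, np.log(r)
```

Concatenating several arms with a noise sample, and removing stars in a chosen azimuth interval of one arm, would mimic the gap and noise cases listed above.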

[Link to the original paper](https://ui.adsabs.harvard.edu/abs/2022A%26A...668A..40L/abstract)