# Finding clusters in RR Lyrae data

Properties of astronomical objects are often correlated. Given a data set of 2D points, one can examine whether the points concentrate in particular regions. If they do, groups of points with similar properties (clusters) can be identified.

In this notebook, the periods and magnitudes of RR Lyrae stars, obtained from the Gaia database, are considered. A Gaussian mixture model (GMM) is used to find the clusters.
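As a minimal sketch of what a Gaussian mixture model does, the cell below fits a two-component mixture to synthetic 2D points. The data, the names `blob_a`, `blob_b`, `points` and `model`, and the chosen centres are all made up for illustration and are not taken from the Gaia query.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the catalogue: two well-separated 2D blobs.
# The centres and spreads are arbitrary illustration values.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.3, 18.5], scale=[0.02, 0.2], size=(200, 2))
blob_b = rng.normal(loc=[0.6, 19.5], scale=[0.03, 0.2], size=(300, 2))
points = np.vstack([blob_a, blob_b])

# Fit a two-component mixture; random_state makes the run repeatable.
model = GaussianMixture(n_components=2, random_state=0).fit(points)

print("Weights:", model.weights_)  # mixing proportions, they sum to 1
print("Means:\n", model.means_)    # one 2D centre per component
```

Each component is described by a weight, a mean and a covariance matrix, so the fitted means can be read as cluster centres.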

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# The older GMM class is no longer available in current scikit-learn; GaussianMixture replaces it
from sklearn.mixture import GaussianMixture
from astroquery.gaia import Gaia
from astropy.table import Table

Query the Gaia database and select the periods and magnitudes of RR Lyrae stars.

In [None]:
# Select the period (p1) and mean G magnitude (int_average_g) of RR Lyrae stars fainter than G = 18
query = "SELECT p1, int_average_g FROM gaiadr1.rrlyrae WHERE int_average_g > 18"

# Job options, passed by keyword so the call does not depend on parameter order
job = Gaia.launch_job_async(
    query,
    name=None,
    output_file='results',
    output_format='csv',
    verbose=False,
    dump_to_file=False,
    background=False,
    upload_resource=None,
    upload_table_name=None,
)

results0 = job.get_results()

In [None]:
# Convert the table rows to a 2D NumPy array; otherwise GaussianMixture.fit raises an error.
results = np.array([list(elem) for elem in results0])
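`GaussianMixture.fit` expects a 2D numeric array of shape (n_samples, n_features) and rejects NaN or infinite values. Whether the query above returns any missing values is not established here. The sketch below uses a hypothetical array named `example` to show one way to check the shape and drop unusable rows before fitting.

```python
import numpy as np

# Hypothetical stand-in for the converted query results: (period, magnitude) rows.
example = np.array([
    [0.31, 18.4],
    [np.nan, 18.9],  # missing period
    [0.58, 19.2],
    [0.62, np.inf],  # unusable magnitude
])

# Basic sanity check on the array handed to the mixture model.
assert example.ndim == 2 and example.shape[1] == 2

# Keep only rows where both columns are finite numbers.
finite_rows = np.isfinite(example).all(axis=1)
clean = example[finite_rows]

print("Kept", clean.shape[0], "of", example.shape[0], "rows")
```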

In [None]:
# Optionally, the results can be printed.
print("Results:\n", results)

In [None]:
x = results[:,0] 
y = results[:,1] 

# Plot the period-magnitude diagram
plt.scatter(x,y, marker='.')
plt.title('Period-Magnitudes')
plt.xlabel('Period (days)')
plt.ylabel('Mean G magnitude (mag)')
plt.show()

The best model is taken to be the one with the lowest Bayesian information criterion (BIC). First, a GMM is fitted for each candidate number of clusters, and the number of components with the lowest BIC is selected. A Gaussian mixture model with that number of components is then fitted to the data.
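For reference, the BIC penalises the maximised log-likelihood by the number of free parameters: BIC = k ln(n) - 2 ln(L), where k is the number of parameters and n the number of samples. The sketch below recomputes it by hand for a full-covariance mixture (the scikit-learn default) and compares it with the value returned by `bic()`. The helper `manual_bic` and the synthetic `data` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def manual_bic(model, data):
    """BIC = k * ln(n) - 2 * ln(L) for a full-covariance mixture."""
    n_samples, n_dims = data.shape
    n_comp = model.n_components
    # Free parameters: means, symmetric covariance matrices, mixing weights.
    n_params = (n_comp * n_dims
                + n_comp * n_dims * (n_dims + 1) // 2
                + (n_comp - 1))
    # score() returns the mean log-likelihood per sample.
    log_likelihood = model.score(data) * n_samples
    return n_params * np.log(n_samples) - 2.0 * log_likelihood

# Synthetic data: two well-separated blobs (illustration only).
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
    rng.normal([4.0, 4.0], 0.5, size=(150, 2)),
])

manual_values, sklearn_values = [], []
for n_comp in (1, 2, 3):
    model = GaussianMixture(n_comp, random_state=0).fit(data)
    manual_values.append(manual_bic(model, data))
    sklearn_values.append(model.bic(data))
    print(n_comp, manual_values[-1], sklearn_values[-1])
```

The penalty term grows with every added component, so an extra cluster is only preferred when it improves the likelihood enough to offset it.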

In [None]:
# Find best number of clusters via BIC
N_possible_clusters = np.arange(1, 8) # Consider models with 1 to 7 clusters
clfs = [GaussianMixture(N, max_iter=500).fit(results) for N in N_possible_clusters] # one fitted GMM per candidate number of clusters
BICs = np.array([clf.bic(results) for clf in clfs]) # one BIC value per fitted GMM
clf = clfs[np.argmin(BICs)] # GMM with the lowest BIC
print("Best number of clusters (number of components of GMM with the lowest BIC):", clf.n_components)

# Refit a mixture with the number of components selected above
gmm_input = GaussianMixture(n_components=clf.n_components)
gmm_input.fit(results) # fit the model to the data
log_dens = gmm_input.score_samples(results) # log-likelihood of each data point under the fitted model

# Print results
print("Cluster", "Mean")
for i in range(clf.n_components):
    mean = clf.means_[i]
    print(i+1, mean)
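Once a mixture is fitted, each point can be assigned to a cluster. The sketch below shows hard and soft assignment with `predict` and `predict_proba` on synthetic data. The names `toy`, `toy_model`, `labels` and `memberships` are illustrative; in the notebook the same calls could be applied to `clf` and `results`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated blobs of 100 points each (illustration only).
rng = np.random.default_rng(2)
toy = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(100, 2)),
    rng.normal([3.0, 3.0], 0.3, size=(100, 2)),
])
toy_model = GaussianMixture(n_components=2, random_state=0).fit(toy)

labels = toy_model.predict(toy)             # hard assignment: one component index per point
memberships = toy_model.predict_proba(toy)  # soft assignment: one probability per component

for k in range(toy_model.n_components):
    print("Cluster", k + 1, "size:", np.sum(labels == k))
```

The component indices carry no meaning of their own, so cluster numbering can change between fits even when the clusters themselves are the same.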