# In-class notebook: 2024-02-07


In this notebook, we look at different methods for finding structure in data. Using the SDSS Great Wall as an example, we compare kernel density estimation (KDE), Gaussian mixture models (GMM), and hierarchical clustering. We then briefly discuss correlation functions.

This notebook is intended to support Chapter 6 of the textbook, and material is taken from the following scripts (from astroML):
* https://github.com/astroML/astroML-notebooks/blob/main/chapter6/astroml_chapter6_Density_Estimation_for_SDSS_Great_Wall.ipynb
* https://github.com/astroML/astroML_figures/blob/main/book_figures/chapter6/fig_great_wall_MST.py
* https://github.com/astroML/astroML_figures/blob/main/book_figures/chapter6/fig_correlation_function.py

## Density Estimation for SDSS "Great Wall"

This region, known as the "SDSS Great Wall", contains an extended cluster of several thousand galaxies approximately 300 Mpc (about 1 billion light years) from Earth. The data set gives the positions of over 8,000 galaxies projected onto a 2D plane with Earth at the point (0, 0).

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm

In [None]:
# Create the grid on which to evaluate the results
Nx = 50
Ny = 125
xmin, xmax = (-375, -175)
ymin, ymax = (-300, 200)

A = np.meshgrid(np.linspace(xmin, xmax, Nx), np.linspace(ymin, ymax, Ny))
Xgrid = np.array((A[0].flatten(),A[1].flatten())).T

In [None]:
# Fetch the great wall data

from astroML.datasets import fetch_great_wall
X = fetch_great_wall()
print(len(X))

# First plot: scatter the points
ax1 = plt.subplot(111, aspect='equal')
ax1.scatter(X[:, 1], X[:, 0], s=1, lw=0, c='k')
ax1.text(0.95, 0.9, "input", ha='right', va='top',
         transform=ax1.transAxes,
         bbox=dict(boxstyle='round', ec='k', fc='w'))

ax1.set_xlim(ymin, ymax - 0.01)
ax1.set_ylim(xmin, xmax)
    
ax1.set_xlabel('$y$ (Mpc)')
ax1.set_ylabel('$x$ (Mpc)')


### Evaluate KDE with Gaussian Kernel

Use a *Gaussian kernel* to evaluate the kernel density. The kernel $K(u)$ is a smooth function that sets the weight each data point contributes at a given position; it is normalized such that $\int K(u)\,du = 1$.  
The expression for the Gaussian kernel is

$$K(u) = \frac{1}{(2\pi)^{\frac{D}{2}}} e^{-\frac{u^2}{2}}$$

where $D$ is the number of dimensions of the parameter space, $u = d(x, x_i) / h$, $d(x, x_i)$ is the distance from the evaluation point $x$ to the data point $x_i$, and $h$ is the bandwidth.
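
To make the role of $K(u)$ and $h$ concrete, here is a minimal brute-force sketch of the standard KDE sum $\hat{f}(x) = \frac{1}{N h^D} \sum_i K\left(d(x, x_i)/h\right)$ on a toy 1D sample. The names `gaussian_kernel`, `kde_brute_force`, and the `toy_*` variables are illustrative assumptions, not part of astroML or scikit-learn, and this is not how `KernelDensity` is implemented internally.

```python
import numpy as np

def gaussian_kernel(u, D):
    """Gaussian kernel K(u), normalized to integrate to 1 in D dimensions."""
    return np.exp(-0.5 * u ** 2) / (2 * np.pi) ** (D / 2)

def kde_brute_force(points, grid, h):
    """Evaluate (1 / (N h^D)) * sum_i K(d(x, x_i) / h) at every grid point."""
    points = np.atleast_2d(points)
    grid = np.atleast_2d(grid)
    N, D = points.shape
    # pairwise Euclidean distances between grid points and data, shape (n_grid, N)
    d = np.sqrt(((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
    return gaussian_kernel(d / h, D).sum(axis=1) / (N * h ** D)

# toy 1D data (hypothetical, only for illustration)
rng = np.random.default_rng(0)
toy_points = rng.normal(size=(200, 1))
toy_grid = np.linspace(-10, 10, 2001)[:, None]
toy_density = kde_brute_force(toy_points, toy_grid, h=0.5)

# the estimated pdf should integrate to approximately 1
integral = toy_density.sum() * (toy_grid[1, 0] - toy_grid[0, 0])
print(integral)
```

Multiplying this pdf by the number of points gives a number density, which is what the notebook plots.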

In [None]:
# evaluate Gaussian kernel
# you can read the documentation here
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html
from sklearn.neighbors import KernelDensity

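def estimate_kde(ker):
    # fit a KDE with bandwidth h=5 (Mpc) to the galaxy positions X
    kde = KernelDensity(bandwidth=5, kernel=ker)
    log_dens = kde.fit(X).score_samples(Xgrid) # log of the probability density on the grid
    # scale the pdf by the number of points to get a number density, then reshape to the grid
    dens = X.shape[0] * np.exp(log_dens).reshape((Ny, Nx))
    return dens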
   
dens1 = estimate_kde('gaussian')

plt.imshow(dens1.T, origin='lower',
               extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

plt.title('Gaussian (h=5)')
plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

In [None]:
# log scale
plt.imshow(dens1.T, origin='lower', norm=LogNorm(),
               extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

plt.title('Gaussian (h=5)')

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

### Evaluate KDE with Top-hat Kernel
Use a *top-hat (box) kernel* to evaluate the kernel density. The expression for the top-hat kernel is

$$K(u) = \left\{ \begin{array}{rcl}
\frac{1}{V_{D}(1)} & \mbox{if} & |u|\leq1, \\
0 & \mbox{if} & |u|>1.
\end{array}\right.$$  

Here $V_D(1)$ is the volume of the unit hypersphere in $D$ dimensions.  
This kernel gives the most "spread out" estimate of each feature in the distribution.

In [None]:
dens2 = estimate_kde('tophat')

plt.imshow(dens2.T, origin='lower',
               extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

plt.title('Tophat (h=5)')

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

### Evaluate KDE with Exponential Kernel
Use an *exponential kernel* to evaluate the kernel density. The expression for the exponential kernel is

$$K(u) = \frac{1}{D!\,V_{D}(1)}e^{-|u|}$$  

where $V_D(r)$ is the volume of a $D$-dimensional hypersphere of radius $r$.  
This kernel gives the "sharpest" estimate of each feature in the distribution.  
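
As a quick check of the normalization constants, the sketch below evaluates $V_D(r) = \pi^{D/2} r^D / \Gamma(D/2 + 1)$ and numerically integrates the top-hat and exponential kernels over spherical shells. The helper names (`hypersphere_volume`, `tophat_kernel`, `exponential_kernel`, `radial_integral`) are hypothetical and written only for this illustration.

```python
import math

def hypersphere_volume(D, r=1.0):
    """V_D(r): volume of a D-dimensional hypersphere of radius r."""
    return math.pi ** (D / 2) / math.gamma(D / 2 + 1) * r ** D

def tophat_kernel(u, D):
    """Top-hat kernel: constant inside the unit hypersphere, zero outside."""
    return 1.0 / hypersphere_volume(D) if abs(u) <= 1 else 0.0

def exponential_kernel(u, D):
    """Exponential kernel with the 1 / (D! V_D(1)) normalization."""
    return math.exp(-abs(u)) / (math.factorial(D) * hypersphere_volume(D))

def radial_integral(kernel, D, u_max=60.0, n_steps=60000):
    """Midpoint-rule integral of K(u) over D-dimensional space, shell by shell."""
    surface = D * hypersphere_volume(D)  # surface area of the unit hypersphere
    du = u_max / n_steps
    total = 0.0
    for k in range(n_steps):
        u = (k + 0.5) * du
        total += kernel(u, D) * surface * u ** (D - 1) * du
    return total

for D in (2, 3):
    print(D, radial_integral(tophat_kernel, D), radial_integral(exponential_kernel, D))
```

Both kernels should integrate to approximately 1 in any dimension, which is what the prefactors are chosen to guarantee.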

In [None]:
dens3 = estimate_kde('exponential')

plt.imshow(dens3.T, origin='lower',
               extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

plt.title('Exponential (h=5)')

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

### Evaluate density using K-Nearest-Neighbor Estimation
Another estimator is the K-nearest-neighbor estimator, originally proposed by [Dressler (1980)](https://ui.adsabs.harvard.edu/abs/1980ApJ...236..351D/abstract). In this method, the implied point density at an arbitrary position $x$ is estimated as

$$\hat{f}_K(x) = \frac{K}{V_D(d_K)}$$

where $d_K$ is the distance from $x$ to its $K$-th nearest neighbor, $V_D(d_K)$ is the volume of a $D$-dimensional hypersphere of radius $d_K$, and $D$ is the problem dimensionality.  
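
The formula above can be sketched directly with a brute-force neighbor search. This is a minimal illustration of $K / V_D(d_K)$ only; the function names `hypersphere_volume` and `knn_density` and the toy data are assumptions made for this example, and the `'bayesian'` method of astroML's `KNeighborsDensity` uses a different weighting of the neighbor distances.

```python
import math
import numpy as np

def hypersphere_volume(D, r):
    """V_D(r): volume of a D-dimensional hypersphere of radius r."""
    return math.pi ** (D / 2) / math.gamma(D / 2 + 1) * r ** D

def knn_density(points, queries, K):
    """Simple K-nearest-neighbor density estimate K / V_D(d_K) at each query."""
    points = np.asarray(points, dtype=float)
    queries = np.asarray(queries, dtype=float)
    D = points.shape[1]
    # distances from every query to every data point, shape (n_queries, n_points)
    d = np.sqrt(((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
    d_K = np.sort(d, axis=1)[:, K - 1]  # distance to the K-th nearest neighbor
    return K / hypersphere_volume(D, d_K)

# hand-checkable case: neighbors at distances 1, 2, 3, 4 from the origin
small_points = [[1, 0], [0, 2], [3, 0], [0, 4]]
small_est = knn_density(small_points, [[0, 0]], K=2)  # d_K = 2, so 2 / (pi * 4)

# uniform toy sample: 2000 points in the unit square, so the true density is 2000
rng = np.random.default_rng(0)
uniform_points = rng.uniform(size=(2000, 2))
uniform_est = knn_density(uniform_points, [[0.5, 0.5]], K=50)
print(small_est, uniform_est)
```

Larger `K` averages over a bigger volume, which lowers the noise of the estimate at the cost of spatial resolution.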

In [None]:
# calculate K Neighbors Density with k = 5
from astroML.density_estimation import KNeighborsDensity

knn5 = KNeighborsDensity('bayesian', 5)
dens_k5 = knn5.fit(X).eval(Xgrid).reshape((Ny, Nx))

plt.imshow(dens_k5.T, origin='lower', #norm=LogNorm(), # show log
               extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

plt.title('KNN (k=5)')

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

In [None]:
# KNeighborsDensity?

In [None]:
# calculate K Neighbors Density with k = 40
knn40 = KNeighborsDensity('bayesian', 40)
dens_k40 = knn40.fit(X).eval(Xgrid).reshape((Ny, Nx))

plt.imshow(dens_k40.T, origin='lower', #norm=LogNorm(), # show log
               extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

plt.title('KNN (k=40)')

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

### Gaussian Mixture Model
A GMM models the underlying pdf of the points as a sum of multi-dimensional Gaussians:

$$\rho(x) = Np(x) = N \sum_{j=1}^{M} \alpha_j \mathcal{N}(\mu_j, \Sigma_j)$$

where $N$ is the number of data points, $M$ is the number of Gaussians, $\alpha_j$ is the weight of the $j$-th Gaussian, $\mu_j$ is its location, and $\Sigma_j$ is its covariance.
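
To show how the sum is evaluated once the parameters are known, here is a small sketch that computes $\rho(x)$ for a hand-picked two-component mixture. The functions `gaussian_pdf` and `mixture_density` and all `toy_*` values are illustrative assumptions; in the notebook the weights, means, and covariances come from fitting `GaussianMixture` to the data.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x | mu, cov) for each row of x."""
    x = np.atleast_2d(x)
    D = x.shape[1]
    diff = x - mu
    cov_inv = np.linalg.inv(cov)
    # squared Mahalanobis distance for each row
    maha = np.einsum('ni,ij,nj->n', diff, cov_inv, diff)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def mixture_density(x, weights, means, covs, N=1):
    """rho(x) = N * sum_j alpha_j * N(x | mu_j, Sigma_j)."""
    p = sum(a * gaussian_pdf(x, m, c) for a, m, c in zip(weights, means, covs))
    return N * p

# hypothetical mixture parameters, chosen by hand for illustration
toy_weights = [0.3, 0.7]
toy_means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
toy_covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

# evaluate on a grid with spacing 0.1 in both directions
gx, gy = np.meshgrid(np.linspace(-8, 12, 201), np.linspace(-8, 10, 181))
toy_grid = np.column_stack([gx.ravel(), gy.ravel()])
rho = mixture_density(toy_grid, toy_weights, toy_means, toy_covs, N=100)

# because the weights sum to 1, rho integrates to approximately N
total = rho.sum() * 0.1 * 0.1
print(total)
```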

In [None]:
from sklearn.mixture import GaussianMixture

# Calculate GMM
def compute_GMM(n_clusters, max_iter=1000, tol=3, covariance_type='full'):
    clf = GaussianMixture(n_clusters, covariance_type=covariance_type,
                          max_iter=max_iter, tol=tol, random_state=0)
    clf.fit(X)
    print("converged:", clf.converged_)
    return clf

clf = compute_GMM(n_clusters=200)
log_dens = clf.score_samples(Xgrid).reshape(Ny, Nx)


In [None]:

plt.imshow(np.exp(log_dens.T), origin='lower', #norm=LogNorm(), # show log
               extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

plt.title('GMM (N=200)')

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

### Hierarchical clustering

Construct a hierarchical tree (a Euclidean minimum spanning tree, MST) of the data and use it to define clusters. We first show the dendrogram (the full MST) connecting the points, then the clustering based on it, created by removing the longest 10% of the graph edges and keeping the remaining connected clusters with 10 or more members.
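
The procedure can be sketched in pure Python on a handful of points: build the MST with Kruskal's algorithm, drop the longest edges, and keep the connected groups that are large enough. This is only an illustration under simplifying assumptions: it builds an exact MST from all pairs, whereas the notebook uses an approximate MST from a nearest-neighbor graph, and the names `find`, `kruskal_mst`, `mst_clusters`, and `toy_points` are hypothetical.

```python
import math
from itertools import combinations

def find(parent, i):
    """Union-find root lookup with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def kruskal_mst(points):
    """Exact Euclidean MST as a list of (length, i, j), sorted by length."""
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    parent = list(range(len(points)))
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:  # the edge joins two separate components: keep it
            parent[ri] = rj
            tree.append((w, i, j))
    return tree

def mst_clusters(points, edge_cutoff=0.9, min_cluster_size=2):
    """Keep the shortest `edge_cutoff` fraction of MST edges, return big clusters."""
    tree = kruskal_mst(points)
    n_keep = int(edge_cutoff * len(tree))
    parent = list(range(len(points)))
    for w, i, j in tree[:n_keep]:
        parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(parent, i), []).append(i)
    return [g for g in groups.values() if len(g) >= min_cluster_size]

# two tight groups of five points plus one distant outlier
toy_points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
              (10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5),
              (30, 30)]
toy_clusters = mst_clusters(toy_points, edge_cutoff=0.8, min_cluster_size=2)
print(toy_clusters)
```

In this toy case the two removed edges are the long links between the groups and to the outlier, so the outlier ends up alone and is discarded by the minimum cluster size.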

In [None]:
from sklearn.neighbors import kneighbors_graph
# see here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html

from scipy.sparse.csgraph import minimum_spanning_tree
# see here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.minimum_spanning_tree.html
# A minimum spanning tree is a graph consisting of the subset of edges which together 
# connect all connected nodes, while minimizing the total sum of weights on the edges. 
# This is computed using the Kruskal algorithm.

XX = np.random.random((100,2))
G = kneighbors_graph(XX, n_neighbors=10, mode = 'distance')
# only keep nearest 10 neighbors
T = minimum_spanning_tree(G)

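# kneighbors_graph returns a sparse matrix: with mode='connectivity', entry (i, j) is 1 if point j
# is among the k nearest neighbors of point i and 0 otherwise; with mode='distance', the nonzero
# entries hold the distances instead. For example: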
# X = [[0], [3], [1]]
# >>> from sklearn.neighbors import kneighbors_graph
# >>> A = kneighbors_graph(X, 2, mode='connectivity', include_self=True)
# >>> A.toarray()
# array([[1., 0., 1.],
#        [0., 1., 1.],
#        [1., 0., 1.]])

plt.figure()
plt.imshow(G.toarray())
plt.colorbar()

plt.figure()
plt.imshow(T.toarray())
plt.colorbar()

In [None]:
from scipy import sparse
from astroML.clustering import HierarchicalClustering, get_graph_segments

# Compute the MST clustering model
n_neighbors = 10 # number of nearest neighbors used to build the graph from which the MST is computed
edge_cutoff = 0.9 # fraction of "edges" to keep, where an edge is one line segment in the MST
cluster_cutoff = 10 # clusters must have at least 10 points
model = HierarchicalClustering(n_neighbors=n_neighbors,
                               edge_cutoff=edge_cutoff,
                               min_cluster_size=cluster_cutoff)
model.fit(X)
print("max scale of cluster: %2g Mpc" % np.percentile(model.full_tree_.data,100 * edge_cutoff))

n_components = model.n_components_
labels = model.labels_
print(n_components, "clusters", labels.shape, len(X))

In [None]:
# Get the x, y coordinates of the beginning and end of each line segment
T_x, T_y = get_graph_segments(model.X_train_,
                              model.full_tree_)
# full tree

T_trunc_x, T_trunc_y = get_graph_segments(model.X_train_,
                                          model.cluster_graph_)
# only keep clusters

In [None]:
# Fit a GaussianMixture to each individual cluster
Nx = 50
Ny = 125
xmin, xmax = (-375, -175)
ymin, ymax = (-300, 200)

A = np.meshgrid(np.linspace(xmin, xmax, Nx), np.linspace(ymin, ymax, Ny))
Xgrid = np.array((A[0].flatten(),A[1].flatten())).T

density = np.zeros(Xgrid.shape[0])

for i in range(n_components):
    ind = (labels == i)
    Npts = ind.sum()
    Nclusters = min(12, Npts // 5)
    gmm = GaussianMixture(Nclusters, random_state=0).fit(X[ind])
    dens = np.exp(gmm.score_samples(Xgrid))
    density += dens / dens.max()

density = density.reshape((Ny, Nx))

In [None]:
plt.figure(figsize=(7, 2.5))

plt.plot(T_y, T_x, c='k', lw=0.5)

plt.title('Full MST')
plt.xlim(ymin, ymax)
plt.ylim(xmin, xmax)

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

In [None]:
plt.figure(figsize=(7.5, 3))

plt.plot(T_trunc_y, T_trunc_x, c='k', lw=0.5)
plt.imshow(density.T, origin='lower', cmap=plt.cm.hot_r,
          extent=[ymin, ymax, xmin, xmax])

plt.title('Clusters')
plt.xlim(ymin, ymax)
plt.ylim(xmin, xmax)

plt.xlabel('$y$ (Mpc)')
plt.ylabel('$x$ (Mpc)')

## Correlation Functions

Here we compute the angular two-point correlation function of SDSS spectroscopic galaxies in the range 0.08 < z < 0.12, with m < 17.7. This is the same sample for which the luminosity function is computed in figure 4.10. Errors are estimated using ten bootstrap samples. Dotted lines are added to guide the eye and correspond to a power law proportional to $\theta^{-0.8}$. Note that the red galaxies (left panel) are clustered more strongly than the blue galaxies (right panel).
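
The code in this section uses `method='landy-szalay'`. As a rough sketch of what that estimator computes, the example below evaluates $\hat{w} = (DD - 2DR + RR)/RR$ from normalized pair counts on a flat 2D toy sample. It uses brute-force Euclidean distances rather than angular separations and a KD tree, and the names `pair_counts`, `landy_szalay`, and the `toy_*` arrays are assumptions made for this illustration, not astroML's implementation.

```python
import numpy as np

def pair_counts(a, b, bins, auto=False):
    """Normalized histogram of pair separations between point sets a and b."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    if auto:
        # count each distinct pair within one set once (upper triangle only)
        d = d[np.triu_indices(len(a), k=1)]
        norm = len(a) * (len(a) - 1) / 2
    else:
        d = d.ravel()
        norm = len(a) * len(b)
    return np.histogram(d, bins=bins)[0] / norm

def landy_szalay(data, randoms, bins):
    """Landy-Szalay estimator (DD - 2 DR + RR) / RR from normalized pair counts."""
    dd = pair_counts(data, data, bins, auto=True)
    rr = pair_counts(randoms, randoms, bins, auto=True)
    dr = pair_counts(data, randoms, bins)
    return (dd - 2 * dr + rr) / rr

# unclustered toy data: the estimate should scatter around zero
rng = np.random.default_rng(0)
toy_data = rng.uniform(size=(400, 2))
toy_randoms = rng.uniform(size=(800, 2))
toy_bins = np.array([0.1, 0.2, 0.3, 0.4])
w_toy = landy_szalay(toy_data, toy_randoms, toy_bins)
print(w_toy)
```

For clustered data, such as the red galaxies here, the same estimator returns positive values at small separations.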


In [None]:
from astroML.utils.decorators import pickle_results
from astroML.datasets import fetch_sdss_specgals
from astroML.correlation import bootstrap_two_point_angular
# you can check out here to see how the two-point calculation is done with a KD tree
# https://github.com/astroML/astroML/blob/main/astroML/correlation.py

#------------------------------------------------------------
# Get data and do some quality cuts
data = fetch_sdss_specgals()
m_max = 17.7

# redshift and magnitude cuts
data = data[data['z'] > 0.08]
data = data[data['z'] < 0.12]
data = data[data['petroMag_r'] < m_max]

# RA/DEC cuts
RAmin, RAmax = 140, 220
DECmin, DECmax = 5, 45
data = data[data['ra'] < RAmax]
data = data[data['ra'] > RAmin]
data = data[data['dec'] < DECmax]
data = data[data['dec'] > DECmin]

ur = data['modelMag_u'] - data['modelMag_r']
flag_red = (ur > 2.22)
flag_blue = ~flag_red

data_red = data[flag_red]
data_blue = data[flag_blue]

print("data size:")
print("  red gals: ", len(data_red))
print("  blue gals:", len(data_blue))


In [None]:
# Set up correlation function computation
#  This calculation takes a long time with the bootstrap resampling,
#  so we'll save the results.
@pickle_results("correlation_functions.pkl")
def compute_results(Nbins=16, Nbootstraps=10,  method='landy-szalay', rseed=0):
    np.random.seed(rseed)
    # Nbins logarithmically spaced bin edges from 1 arcmin to 6 deg
    bins = 10 ** np.linspace(np.log10(1 / 60.), np.log10(6), Nbins)

    results = [bins]
    for D in [data_red, data_blue]:
        results += bootstrap_two_point_angular(D['ra'],
                                               D['dec'],
                                               bins=bins,
                                               method=method,
                                               Nbootstraps=Nbootstraps)

    return results

(bins, r_corr, r_corr_err, r_bootstraps,
 b_corr, b_corr_err, b_bootstraps) = compute_results()

bin_centers = 0.5 * (bins[1:] + bins[:-1])

In [None]:
corr = [r_corr, b_corr]
corr_err = [r_corr_err, b_corr_err]
bootstraps = [r_bootstraps, b_bootstraps]
# use a separate name so the clustering `labels` array defined earlier is not overwritten
panel_labels = ['$u-r > 2.22$\n$N=%i$' % len(data_red),
                '$u-r < 2.22$\n$N=%i$' % len(data_blue)]

fig = plt.figure(figsize=(8, 4))

for i in range(2):
    ax = fig.add_subplot(121 + i, xscale='log', yscale='log')

    ax.errorbar(bin_centers, corr[i], corr_err[i],
                fmt='.k', ecolor='gray', lw=1)

    t = np.array([0.01, 10])
    ax.plot(t, 10 * (t / 0.01) ** -0.8, ':k', linewidth=1)

    ax.text(0.95, 0.95, panel_labels[i],
            ha='right', va='top', transform=ax.transAxes)
    ax.set_xlabel(r'$\theta\ (deg)$')
    if i == 0:
        ax.set_ylabel(r'$\hat{w}(\theta)$')