# **Self-Organizing Map (SOM) for Redshift Estimation**
### **Notebook Summary**
This notebook demonstrates the use of **Self-Organizing Maps (SOMs)** to structure and reduce photometric data for estimating redshift distributions. The main steps are:

1. **Loading Data**: Import fluxes and errors from a **deep-field photometric sample** and a **spectroscopic sample**.  
2. **Training the SOM**: A SOM is trained using the **deep-field photometric sample**, learning the structure of the color-magnitude space.  
3. **Assigning Galaxies to the SOM**: Both the **deep photometric sample** and the **spectroscopic sample** are assigned to the trained SOM.  
4. **Estimating \( p(z) \)**: The redshift distribution is estimated by computing the probability of redshift per SOM cell.  
5. **Analyzing Results**: The final redshift distribution is visualized, along with diagnostics on how well the SOM represents the spectroscopic sample.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import joblib
import astropy.io.fits as pf

- Andresa Campos, https://arxiv.org/pdf/2408.00922
- Code available at https://github.com/AndresaCampos/sompz_y6

In [None]:
import NoiseSOM as ns

### **Loading Data**
- The photometric and spectroscopic samples are loaded.  
- The **deep-field sample** contains galaxies **without redshifts**.  
- The **spectroscopic sample** includes galaxies with measured redshifts.  
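
As a hedged aside, the per-band columns can also be stacked into `(n_galaxies, n_bands)` arrays in one step. The sketch below uses a small synthetic structured array in place of the FITS tables; the `flux_<band>` and `flux_err_<band>` column names follow this notebook, while `stack_columns` and the `toy_` names are hypothetical and exist only for this illustration.

```python
import numpy as np

toy_bands = ['u', 'g', 'r']  # shortened band list for the example

# Synthetic stand-in for a FITS table: one flux and one flux error column per band
dtype = [(f'{prefix}{b}', 'f8') for b in toy_bands for prefix in ('flux_', 'flux_err_')]
toy_table = np.zeros(5, dtype=dtype)
rng = np.random.default_rng(0)
for b in toy_bands:
    toy_table[f'flux_{b}'] = rng.uniform(1.0, 100.0, size=5)
    toy_table[f'flux_err_{b}'] = rng.uniform(0.1, 1.0, size=5)

def stack_columns(data, prefix, bands):
    """Hypothetical helper: return an (n_galaxies, n_bands) array of the prefixed columns."""
    return np.column_stack([data[prefix + b] for b in bands])

toy_fluxes = stack_columns(toy_table, 'flux_', toy_bands)
toy_fluxerrs = stack_columns(toy_table, 'flux_err_', toy_bands)
print(toy_fluxes.shape, toy_fluxerrs.shape)
```

The column order of the result follows the order of the band list, so the same list should be used for fluxes and errors.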

In [None]:
spec_file = pf.open('spec_data_SOM_small.fits')
deep_file = pf.open('deep_data_SOM_small.fits')

In [None]:
spec_data = spec_file[1].data
deep_data = deep_file[1].data

In [None]:
# broad bands in deep photometry catalog
bands =  ['u', 'g', 'r', 'i', 'z', 'J', 'H', 'K']
bands_label =  'flux_'
bands_err_label = 'flux_err_'

# SOM configuration
som_side = 10

In [None]:
# Create flux and flux_err vectors for deep sample
len_sample = len(deep_data[bands_label + bands[0]])
fluxes_d = np.zeros((len_sample, len(bands)))
fluxerrs_d = np.zeros((len_sample, len(bands)))
for i, band in enumerate(bands):
    fluxes_d[:, i] = deep_data[bands_label + band]
    fluxerrs_d[:, i] = deep_data[bands_err_label + band]

# Create flux and flux_err vectors for spec sample
len_sample_spec = len(spec_data[bands_label + bands[0]])
fluxes_s = np.zeros((len_sample_spec, len(bands)))
fluxerrs_s = np.zeros((len_sample_spec, len(bands)))
for i, band in enumerate(bands):
    fluxes_s[:, i] = spec_data[bands_label + band]
    fluxerrs_s[:, i] = spec_data[bands_err_label + band]

### **Training the SOM**
- The SOM is trained using the **deep-field galaxies** (no redshifts).  
- The SOM clusters galaxies based on their **photometric fluxes**.  
- It learns the **color structure of the data**, preserving topological relationships.  
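
To make the training step more concrete, here is a minimal sketch of a plain Kohonen SOM in NumPy. It is not the `NoiseSOM` algorithm: it ignores flux errors and uses a simple Euclidean distance. The arguments `a` and `sigma` are named after the `hFunc` call used in this notebook, but treating them as (start, end) values of a geometrically decaying learning rate and neighborhood width is an assumption made for this example.

```python
import numpy as np

def train_toy_som(data, side=4, n_epochs=5, a=(0.3, 0.1), sigma=(2.0, 0.5), seed=0):
    """Train a plain (noise-free) SOM; returns weights of shape (side*side, n_features)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Initialize cell weights from randomly chosen training objects
    weights = data[rng.choice(n, size=side * side, replace=False)].astype(float)
    # Grid coordinates of every cell, used by the neighborhood function
    rows, cols = np.divmod(np.arange(side * side), side)
    grid = np.column_stack([rows, cols]).astype(float)
    n_steps = n_epochs * n
    step = 0
    for _ in range(n_epochs):
        for idx in rng.permutation(n):
            t = step / n_steps
            # Assumed schedule: geometric decay from the first to the second value
            lr = a[0] * (a[1] / a[0]) ** t
            width = sigma[0] * (sigma[1] / sigma[0]) ** t
            x = data[idx]
            # Best matching unit: cell whose weights are closest to the object
            bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
            # Gaussian neighborhood on the grid, centered on the BMU
            d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = lr * np.exp(-d2 / (2.0 * width ** 2))
            # Pull every cell toward the object, weighted by the neighborhood
            weights += h[:, None] * (x - weights)
            step += 1
    return weights

rng = np.random.default_rng(1)
toy_data = rng.normal(size=(200, 3))
toy_weights = train_toy_som(toy_data)
print(toy_weights.shape)
```

Because neighboring cells are updated together, cells that are close on the grid end up with similar weights, which is what preserves the topology of the input space.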


In [None]:
# Scramble the order of the catalog for purposes of training
indices = np.random.choice(fluxes_d.shape[0], size=fluxes_d.shape[0], replace=False)

In [None]:
# Initialise the learning function
hh = ns.hFunc(fluxes_d.shape[0],  a=(0.3, 0.1), sigma=(5., 1.)) 

# Set the metric 
metric = ns.LinearMetric()
# metric = ns.AsinhMetric(lnScaleSigma=0.4, lnScaleStep=0.03)


In [None]:
# Now training the SOM!
som = ns.NoiseSOM(metric, fluxes_d[indices, :], fluxerrs_d[indices, :],
                  learning=hh,
                  shape=(som_side, som_side),
                  wrap=False, logF=True,
                  initialize='sample',
                  minError=0.02)

### **Assigning Galaxies to the SOM**
- Both the **deep photometric sample** and the **spectroscopic sample** are assigned to the trained SOM.  
- Each galaxy is mapped to its **Best Matching Unit (BMU)** in the SOM grid.  
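
The BMU search can be sketched as follows. This example assumes a simple chi-square distance that down-weights noisy bands; the metric actually used by `NoiseSOM` may differ, and `assign_bmu` is a hypothetical helper written for this illustration.

```python
import numpy as np

def assign_bmu(fluxes, fluxerrs, weights):
    """Return, for each galaxy, the flat index of the cell minimizing chi^2."""
    # chi2[i, c] = sum over bands of ((flux_i - weight_c) / err_i)^2
    resid = fluxes[:, None, :] - weights[None, :, :]
    chi2 = np.sum((resid / fluxerrs[:, None, :]) ** 2, axis=2)
    return np.argmin(chi2, axis=1)

# Two cells on a 1 x 2 grid, two bands
toy_cells = np.array([[1.0, 1.0],
                      [5.0, 5.0]])
toy_flux = np.array([[1.2, 0.9],
                     [4.0, 5.5],
                     [3.1, 1.0]])
toy_err = np.array([[0.1, 0.1],
                    [0.5, 0.5],
                    [0.1, 5.0]])

bmu_flat = assign_bmu(toy_flux, toy_err, toy_cells)
# Convert flat cell indices to (row, column) positions on the grid
bmu_rows, bmu_cols = np.unravel_index(bmu_flat, (1, 2))
# Same search with unit errors, i.e. a plain Euclidean distance
bmu_euclid = assign_bmu(toy_flux, np.ones_like(toy_err), toy_cells)
print(bmu_flat, bmu_euclid)
```

The third galaxy has a very noisy second band, so the chi-square distance assigns it to a different cell than the Euclidean distance does. This is the kind of effect a noise-aware assignment is meant to capture.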

In [None]:
# Assign galaxies to the som
som = ns.NoiseSOM(metric, None, None,
                  learning=None, 
                  shape=(som_side, som_side),
                  wrap=False, logF=True,
                  initialize=som.weights,
                  minError=0.02)

assignment_deep, _ = som.classify(fluxes_d, fluxerrs_d)
assignment_spec, _ = som.classify(fluxes_s, fluxerrs_s)

### **Visualizing the SOM**
Galaxy assignment to cells
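
A hedged alternative for building the occupancy map is `np.bincount`, combined with a mask so that `log10` is never evaluated on empty cells. The sketch assumes the assignments are non-negative integer cell indices; the `toy_` arrays are made up for the example.

```python
import numpy as np

toy_side = 3
toy_assignment = np.array([0, 0, 1, 4, 4, 4, 8])  # toy flat cell indices

# Count galaxies per cell; minlength keeps empty cells in the result
counts = np.bincount(toy_assignment, minlength=toy_side * toy_side)
counts = counts.reshape(toy_side, toy_side)

# Evaluate log10 only where counts > 0; empty cells stay NaN and are left blank by imshow
log_counts = np.full(counts.shape, np.nan)
occupied = counts > 0
log_counts[occupied] = np.log10(counts[occupied])
print(counts)
```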

In [None]:
# Plot how galaxies are distributed in the som
hist, _ = np.histogram(assignment_deep, bins=np.linspace(0, som_side*som_side, som_side*som_side + 1), density=False)
hist = hist.reshape(som_side, som_side)

fig, ax = plt.subplots(figsize=(4,4))
img = ax.imshow(np.log10(hist), cmap='RdYlBu_r')

cbar = fig.colorbar(img, ax=ax, shrink=0.8)

ax.set_title('Deep data assignment')
plt.show()

### **Examining the Distribution of Colors**
- Colors (e.g., \( g - r, r - i, i - z \)) provide key information about galaxy populations.  
- The SOM is expected to **group galaxies with similar colors together**, preserving color relationships.  
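
For reference, a color is the difference between the magnitudes in two bands, which depends only on the flux ratio. The sketch below converts toy fluxes to magnitudes and forms adjacent-band colors; the zero point of 30 is an assumed value for illustration and cancels out in the colors.

```python
import numpy as np

def flux_to_mag(flux, zero_point=30.0):
    """Convert flux to magnitude; zero_point=30 is an assumed value for illustration."""
    return zero_point - 2.5 * np.log10(flux)

toy_bands = ['g', 'r', 'i']
toy_flux = np.array([[100.0, 250.0, 400.0],
                     [10.0, 10.0, 40.0]])

toy_mags = flux_to_mag(toy_flux)
# Adjacent-band colors: mag[band k] - mag[band k+1], here g-r and r-i
toy_colors = toy_mags[:, :-1] - toy_mags[:, 1:]
for k in range(len(toy_bands) - 1):
    print(f'{toy_bands[k]}-{toy_bands[k + 1]}:', toy_colors[:, k])
```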

In [None]:
# obtain the mean magnitude for each cell c 
mag_spec = np.full((som_side * som_side, len(bands)), np.nan)
mag_deep = np.full((som_side * som_side, len(bands)), np.nan)

# Iterate over cells and bands
for c in range(som_side * som_side):
    for m, band in enumerate(bands):
        spec_values = spec_data[f'mag_{band}'][assignment_spec == c]
        deep_values = deep_data[f'mag_{band}'][assignment_deep == c]

        # Check if there are any valid values before computing mean
        if spec_values.size > 0:
            mag_spec[c, m] = np.nanmean(spec_values)
        if deep_values.size > 0:
            mag_deep[c, m] = np.nanmean(deep_values)

In [None]:
# calculate the colors 
colors_spec = np.zeros((som_side*som_side, len(bands)-1))
colors_deep = np.zeros((som_side*som_side, len(bands)-1))
for j in range(som_side*som_side):
    for i in range(len(bands)-1):
        colors_spec[j,i] = mag_spec[j,i] - mag_spec[j,i+1]
        colors_deep[j,i] = mag_deep[j,i] - mag_deep[j,i+1]

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(12, 5))  

for m in range(len(bands)-1):
    col_reshaped = colors_deep.reshape(som_side, som_side, len(bands)-1)
    row = m // 4  
    col = m % 4   

    ax = axes[row, col] 
    img = ax.imshow(col_reshaped[:, :, m], cmap='RdYlBu_r')

    ax.set_title(f'Colors deep {bands[m]}-{bands[m+1]}')
    fig.colorbar(img, ax=ax, shrink=1)

fig.delaxes(axes[1, 3])

plt.tight_layout()
plt.show()



fig, axes = plt.subplots(2, 4, figsize=(12, 5))  

for m in range(len(bands)-1):
    col_reshaped = colors_spec.reshape(som_side, som_side, len(bands)-1)
    row = m // 4  
    col = m % 4   

    ax = axes[row, col] 
    img = ax.imshow(col_reshaped[:, :, m], cmap='RdYlBu_r')

    ax.set_title(f'Colors spec {bands[m]}-{bands[m+1]}')
    fig.colorbar(img, ax=ax, shrink=1)

fig.delaxes(axes[1, 3])

plt.tight_layout()
plt.show()

**If colors vary smoothly across the SOM**, it suggests the SOM is correctly preserving photometric relationships.  
The deep and spectroscopic color maps should look similar. If they do not, the deep sample may contain galaxies in regions of color space that are **not well represented** in the spectroscopic sample.

### **Examining the Distribution of Redshift**

In [None]:
# compute the average z and std for each cell c 
pzc_mean = np.full(som_side * som_side, np.nan)
pzc_std = np.full(som_side * som_side, np.nan)

for c in range(som_side * som_side):
    z_values = spec_data['Z'][assignment_spec == c]

    if z_values.size > 0:  # Only compute if there are valid values
        pzc_mean[c] = np.mean(z_values)
        pzc_std[c] = np.std(z_values)


In [None]:
fig, ax = plt.subplots(1,3, figsize = (15,5))

# ---------------
hist, _ = np.histogram(assignment_spec, bins=np.linspace(0, som_side*som_side, som_side*som_side + 1), density=False)
hist = hist.reshape(som_side, som_side)

img = ax[0].imshow(np.log10(hist), cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[0], shrink=0.8)

ax[0].set_title('Spec data assignment')
 
# ---------------
# Mean redshift per cell, laid out on the SOM grid
hist = pzc_mean.reshape(som_side, som_side)

img = ax[1].imshow(hist, cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[1], shrink=0.8)

ax[1].set_title('<z>')

# ---------------
# Redshift scatter per cell, laid out on the SOM grid
hist = pzc_std.reshape(som_side, som_side)

img = ax[2].imshow(hist, cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[2], shrink=0.8)

ax[2].set_title(r'$ \sigma z$')
# ---------------

plt.tight_layout()
plt.show()

### **Comparison of Feature Distributions Before & After SOM**
- The SOM groups galaxies based on their **photometric features**, preserving structure while reducing dimensionality.  
- Here, we compare the **input feature distributions** (e.g., fluxes, colors, magnitudes) to their **SOM-mapped equivalents**.  
- If the SOM is working correctly:
  - The distribution should **broadly match** between input and mapped data.
  - Some shifts may occur, particularly in **regions where photometry is noisy or undersampled**.
  - Outlier regions might collapse into a **few highly occupied SOM cells**.
- Large discrepancies could indicate:
  - Poor metric choice.
  - Overfitting to the training sample.
  - Missing bands or poor photometric calibration.
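
One hedged way to build the "SOM-mapped" distribution is to replace each galaxy's value by the mean of its cell, which weights every cell by its occupancy. The sketch below does this for a single toy band with random cell assignments; all `toy_` names and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_n_cells = 4
toy_flux = rng.lognormal(mean=2.0, sigma=1.0, size=1000)       # toy single-band fluxes
toy_assignment = rng.integers(0, toy_n_cells, size=toy_flux.size)  # toy cell indices

# Mean flux per cell (NaN for cells with no galaxies)
counts = np.bincount(toy_assignment, minlength=toy_n_cells)
sums = np.bincount(toy_assignment, weights=toy_flux, minlength=toy_n_cells)
cell_means = np.full(toy_n_cells, np.nan)
cell_means[counts > 0] = sums[counts > 0] / counts[counts > 0]

# Replace each galaxy's flux by the mean of its cell: this weights cells by occupancy
mapped_flux = cell_means[toy_assignment]
print(toy_flux.mean(), mapped_flux.mean())
```

By construction the mapped sample has the same mean as the original and a smaller or equal scatter, so differences between the two histograms show how much structure is lost within cells.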


In [None]:
#bands =  ['u', 'g', 'r', 'i', 'z', 'J', 'H', 'K']
# Choose a band

b_index = 3  # Example: index 3 selects the i band (4th column in fluxes)

# Original distribution
plt.hist(fluxes_d[:, b_index], bins=30, alpha=0.5, label="Original", density=True)

# SOM-reduced distribution (take mean feature values per cell)
som_means = [np.mean(fluxes_d[assignment_deep == c, b_index]) for c in range(som_side*som_side)]
plt.hist(som_means, bins=30, alpha=0.5, label="SOM-reduced", density=True)

plt.xlabel("Flux {}-band".format(bands[b_index]))
plt.ylabel("Density")
plt.title("Comparison of Feature Distributions Before & After SOM")
plt.legend()
plt.show()


## **Now let's see what happens if we increase the resolution!**

In [None]:
# increase SOM resolution
som_side = 20

In [None]:
# Retrain the SOM!
som = ns.NoiseSOM(metric, fluxes_d[indices, :], fluxerrs_d[indices, :],
                  learning=hh,
                  shape=(som_side, som_side),
                  wrap=False, logF=True,
                  initialize='sample',
                  minError=0.02)

In [None]:
# Assign galaxies 
som = ns.NoiseSOM(metric, None, None,
                  learning=None, 
                  shape=(som_side, som_side),
                  wrap=False, logF=True,
                  initialize=som.weights,
                  minError=0.02)

assignment_deep, _ = som.classify(fluxes_d, fluxerrs_d)
assignment_spec, _ = som.classify(fluxes_s, fluxerrs_s)

In [None]:
# Plot how galaxies are distributed in the som
hist, _ = np.histogram(assignment_deep, bins=np.linspace(0, som_side*som_side, som_side*som_side + 1), density=False)
hist = hist.reshape(som_side, som_side)

fig, ax = plt.subplots(figsize=(4,4))
img = ax.imshow(np.log10(hist), cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax, shrink=0.8)

ax.set_title('Deep data assignment')
plt.show()

In [None]:
# obtain the mean magnitude for each cell c 
mag_spec = np.full((som_side * som_side, len(bands)), np.nan)
mag_deep = np.full((som_side * som_side, len(bands)), np.nan)

# Iterate over cells and bands
for c in range(som_side * som_side):
    for m, band in enumerate(bands):
        spec_values = spec_data[f'mag_{band}'][assignment_spec == c]
        deep_values = deep_data[f'mag_{band}'][assignment_deep == c]

        # Check if there are any valid values before computing mean
        if spec_values.size > 0:
            mag_spec[c, m] = np.nanmean(spec_values)
        if deep_values.size > 0:
            mag_deep[c, m] = np.nanmean(deep_values)

In [None]:
# calculate the colors 
colors_spec = np.zeros((som_side*som_side, len(bands)-1))
colors_deep = np.zeros((som_side*som_side, len(bands)-1))
for j in range(som_side*som_side):
    for i in range(len(bands)-1):
        colors_spec[j,i] = mag_spec[j,i] - mag_spec[j,i+1]
        colors_deep[j,i] = mag_deep[j,i] - mag_deep[j,i+1]

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(12, 5))  

for m in range(len(bands)-1):
    col_reshaped = colors_deep.reshape(som_side, som_side, len(bands)-1)
    row = m // 4  
    col = m % 4   

    ax = axes[row, col] 
    img = ax.imshow(col_reshaped[:, :, m], cmap='RdYlBu_r')

    ax.set_title(f'Colors deep {bands[m]}-{bands[m+1]}')
    fig.colorbar(img, ax=ax, shrink=1)

fig.delaxes(axes[1, 3])

plt.tight_layout()
plt.show()



fig, axes = plt.subplots(2, 4, figsize=(12, 5))  

for m in range(len(bands)-1):
    col_reshaped = colors_spec.reshape(som_side, som_side, len(bands)-1)
    row = m // 4  
    col = m % 4   

    ax = axes[row, col] 
    img = ax.imshow(col_reshaped[:, :, m], cmap='RdYlBu_r')

    ax.set_title(f'Colors spec {bands[m]}-{bands[m+1]}')
    fig.colorbar(img, ax=ax, shrink=1)

fig.delaxes(axes[1, 3])

plt.tight_layout()
plt.show()

In [None]:
# compute the average z and std for each cell c 
pzc_mean = np.full(som_side * som_side, np.nan)
pzc_std = np.full(som_side * som_side, np.nan)

for c in range(som_side * som_side):
    z_values = spec_data['Z'][assignment_spec == c]

    if z_values.size > 0:  # Only compute if there are valid values
        pzc_mean[c] = np.mean(z_values)
        pzc_std[c] = np.std(z_values)

In [None]:
fig, ax = plt.subplots(1,3, figsize = (15,5))

# ---------------
hist, _ = np.histogram(assignment_spec, bins=np.linspace(0, som_side*som_side, som_side*som_side + 1), density=False)
hist = hist.reshape(som_side, som_side)

img = ax[0].imshow(np.log10(hist), cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[0], shrink=0.8)

ax[0].set_title('Spec data assignment')
 
# ---------------
# Mean redshift per cell, laid out on the SOM grid
hist = pzc_mean.reshape(som_side, som_side)

img = ax[1].imshow(hist, cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[1], shrink=0.8)

ax[1].set_title('<z>')

# ---------------
# Redshift scatter per cell, laid out on the SOM grid
hist = pzc_std.reshape(som_side, som_side)

img = ax[2].imshow(hist, cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[2], shrink=0.8)

ax[2].set_title(r'$ \sigma z$')
# ---------------

plt.tight_layout()
plt.show()

**Missing redshift information** in a significant portion of the SOM leaves some regions of photometric space uncalibrated. This can distort the estimated p(z) and lead to **biases in redshift estimation** and large-scale structure analyses.
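
The size of the uncalibrated region can be quantified with a short sketch like the one below, which counts occupied cells that contain no spectroscopic galaxies and the fraction of the deep sample they hold. The `toy_` arrays are invented for illustration and assume integer cell indices.

```python
import numpy as np

toy_n_cells = 6
toy_deep = np.array([0, 0, 1, 2, 2, 2, 3, 5, 5, 5])  # toy deep-sample assignments
toy_spec = np.array([0, 2, 2, 3])                    # toy spectroscopic assignments

deep_counts = np.bincount(toy_deep, minlength=toy_n_cells)
spec_counts = np.bincount(toy_spec, minlength=toy_n_cells)

# Cells that contain deep galaxies but no spectroscopic redshifts
uncalibrated = (deep_counts > 0) & (spec_counts == 0)
frac_cells = uncalibrated.sum() / (deep_counts > 0).sum()
frac_galaxies = deep_counts[uncalibrated].sum() / deep_counts.sum()
print(frac_cells, frac_galaxies)
```

The galaxy fraction is usually the more relevant number, since a few heavily occupied uncalibrated cells matter more than many nearly empty ones.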

# **Exploring Non-Linear Metrics in SOM**
- So far, we've used a **linear metric** (e.g., Euclidean distance in flux space) to train the SOM.
- However, **high-dimensional spaces behave differently**, and linear distances might not be the best choice.
- **Why?**
  - In **higher dimensions**, pairwise distances concentrate around their mean (the "curse of dimensionality"), so near and far neighbors become harder to tell apart.
  - A **non-linear metric** (e.g., one based on asinh-transformed features) can help preserve relative differences.
  - This could lead to **better clustering** and **improved redshift recovery**.
- Let's switch the SOM metric from `LinearMetric` to a **non-linear metric**, and observe:
  - Changes in **SOM cell assignments**.
  - Differences in **final redshift distributions**.
  - Whether **previously degenerate regions** become better separated.
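
The distance-concentration effect mentioned above can be illustrated with random points. This is a generic sketch on uniform random data, not on the photometric catalog; `relative_distance_spread` is a hypothetical helper.

```python
import numpy as np

def relative_distance_spread(n_dim, n_points=500, seed=0):
    """Std/mean of Euclidean distances from random points to a random reference point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dim))
    ref = rng.uniform(size=n_dim)
    dist = np.linalg.norm(points - ref, axis=1)
    return dist.std() / dist.mean()

# The relative spread of distances shrinks as the number of dimensions grows
for n_dim in (2, 8, 100):
    print(n_dim, relative_distance_spread(n_dim))
```

With 8 bands the effect is moderate, but it already reduces the contrast between the best and second-best matching cells compared to a two-dimensional space.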


In [None]:
# Scramble the order of the catalog for purposes of training
indices = np.random.choice(fluxes_d.shape[0], size=fluxes_d.shape[0], replace=False)

In [None]:
# Initialise the learning function
hh = ns.hFunc(fluxes_d.shape[0],  a=(0.3, 0.1), sigma=(5., 1.)) 

# Set a non-linear metric, optimised for color space
metric = ns.AsinhMetric(lnScaleSigma=0.4, lnScaleStep=0.03)


We won't train this SOM here because it takes a few minutes. Load the saved one instead.

In [None]:
# Load the saved SOM file
saved_som = np.load("trained_som_AsinhMetric.npz")

# Extract weights and shape
weights = saved_som["weights"]
shape = tuple(saved_som["shape"])  # Ensure shape is a tuple

# Initialize a new NoiseSOM from the loaded weights and shape
som = ns.NoiseSOM(metric, None, None,
                  learning=hh,
                  shape=shape,
                  wrap=False, logF=True,
                  initialize=weights,
                  minError=0.02)

We won't assign galaxies to this SOM either, because that takes MORE than a few minutes. Load the saved assignments instead :)

In [None]:
assignment_deep = np.load("assignment_deep_AsinhMetric.npy")
assignment_spec = np.load("assignment_spec_AsinhMetric.npy")

In [None]:
# compute the average z and std for each cell c 
pzc_mean = np.full(som_side * som_side, np.nan)
pzc_std = np.full(som_side * som_side, np.nan)

for c in range(som_side * som_side):
    z_values = spec_data['Z'][assignment_spec == c]

    if z_values.size > 0:  # Only compute if there are valid values
        pzc_mean[c] = np.mean(z_values)
        pzc_std[c] = np.std(z_values)

In [None]:
fig, ax = plt.subplots(1,3, figsize = (15,5))

# ---------------
hist, _ = np.histogram(assignment_spec, bins=np.linspace(0, som_side*som_side, som_side*som_side + 1), density=False)
hist = hist.reshape(som_side, som_side)

img = ax[0].imshow(np.log10(hist), cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[0], shrink=0.8)

ax[0].set_title('Spec data assignment')
 
# ---------------
# Mean redshift per cell, laid out on the SOM grid
hist = pzc_mean.reshape(som_side, som_side)

img = ax[1].imshow(hist, cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[1], shrink=0.8)

ax[1].set_title('<z>')

# ---------------
# Redshift scatter per cell, laid out on the SOM grid
hist = pzc_std.reshape(som_side, som_side)

img = ax[2].imshow(hist, cmap='RdYlBu_r')
cbar = fig.colorbar(img, ax=ax[2], shrink=0.8)

ax[2].set_title(r'$ \sigma z$')
# ---------------

plt.tight_layout()
plt.show()

#### **Why Does a Non-Linear Metric Improve the Flux Distribution in the SOM?**
- A **linear metric** (e.g., Euclidean) distorts flux relationships:
  - Faint sources are **compressed**, and bright sources **stretch apart**.
  - This creates **empty SOM cells** and **biases the mapping**.
- A **non-linear metric (e.g., `AsinhMetric`)** rescales fluxes:
  - **Linear scaling for faint sources** (preserving details).  
  - **Log scaling for bright sources** (preventing outliers from dominating).
- **Impact:**  
  ✅ More uniform flux mapping → **Fewer empty cells**  
  ✅ Better-preserved flux distribution → **More reliable clustering & redshifts**  
  ✅ Smoother transition across SOM cells → **More robust data reduction**  
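
The two regimes listed above are properties of the asinh function itself, which the sketch below checks numerically. How `AsinhMetric` applies the function internally (for example, which softening scale it uses) is not shown in this notebook, so the fluxes here are simply expressed in units of an assumed softening scale.

```python
import numpy as np

x = np.array([1e-3, 1e-2, 1e2, 1e3])  # flux in units of an assumed softening scale

asinh_x = np.arcsinh(x)
# Faint limit: asinh(x) ~ x ; bright limit: asinh(x) ~ ln(2x)
faint_error = np.abs(asinh_x[:2] - x[:2])
bright_error = np.abs(asinh_x[2:] - np.log(2.0 * x[2:]))
print(faint_error, bright_error)
```

Unlike a pure logarithm, asinh stays finite and well behaved for zero or negative fluxes, which is convenient for noisy faint detections.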

Now compare histograms of **flux before & after SOM mapping** with the non-linear metric!  

In [None]:
#bands =  ['u', 'g', 'r', 'i', 'z', 'J', 'H', 'K']
# Choose a band

b_index = 3  # Example: index 3 selects the i band (4th column in fluxes)

# Original distribution
plt.hist(fluxes_d[:, b_index], bins=30, alpha=0.5, label="Original", density=True)

# SOM-reduced distribution (take mean feature values per cell)
som_means = [np.mean(fluxes_d[assignment_deep == c, b_index]) for c in range(som_side*som_side)]
plt.hist(som_means, bins=30, alpha=0.5, label="SOM-reduced", density=True)

plt.xlabel("Flux {}-band".format(bands[b_index]))
plt.ylabel("Density")
plt.title("Comparison of Feature Distributions Before & After SOM")
plt.legend()
plt.show()


# **Finally, estimating the Redshift Distribution \( p(z) \)**
- The redshift distribution is computed using:  
  $p(z) = \sum_c p(z | c) p(c) $  
- Here, $p(c)$ is the probability of a galaxy being assigned to SOM cell $c$,  
  and $p(z | c)$ is the redshift distribution for that cell.  
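
As a small worked example of this sum, the sketch below uses three cells and three redshift bins. All numbers are made up for illustration; each row of `toy_pz_given_c` is normalized as a density over the redshift bins.

```python
import numpy as np

toy_z_edges = np.linspace(0.0, 1.5, 4)   # three redshift bins of width 0.5
toy_dz = np.diff(toy_z_edges)

# p(z|c): one row per cell, each normalized as a density over the redshift bins
toy_pz_given_c = np.array([[2.0, 0.0, 0.0],
                           [0.0, 1.0, 1.0],
                           [0.5, 0.5, 1.0]])
# p(c): fraction of the photometric sample in each cell
toy_p_c = np.array([0.5, 0.3, 0.2])

# p(z) = sum_c p(z|c) p(c)
toy_p_z = toy_p_c @ toy_pz_given_c
print(toy_p_z, np.sum(toy_p_z * toy_dz))
```

Because every $p(z | c)$ integrates to one and the $p(c)$ sum to one, the resulting $p(z)$ is normalized without any further rescaling. A rescaling is only needed when some cells have no spectroscopic redshifts and are dropped from the sum.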

In [None]:
def compute_pzc(spec_assigned_cells, spec_redshifts, num_bins=50, z_range=(0, 3)):
    """Computes p(z | c) for each SOM cell using the spectroscopic sample."""
    z_bins = np.linspace(z_range[0], z_range[1], num_bins + 1)
    nz_som = {}
    for (c, z) in zip(spec_assigned_cells, spec_redshifts):
        nz_som.setdefault(c, []).append(z)

    nz_prob = {}
    for c, z_list in nz_som.items():
        counts, _ = np.histogram(z_list, bins=z_bins)
        # Skip cells whose redshifts all fall outside z_range: normalising
        # an empty histogram would give NaN and contaminate the final p(z)
        if counts.sum() > 0:
            nz_prob[c] = counts / (counts.sum() * np.diff(z_bins))

    return nz_prob, z_bins

def compute_pc(phot_assigned_cells):
    """Computes p(c) for the photometric sample."""
    unique_cells, counts = np.unique(phot_assigned_cells, return_counts=True)
    total_phot = len(phot_assigned_cells)
    return {c: count / total_phot for c, count in zip(unique_cells, counts)}

def estimate_pz(nz_prob, p_c, z_bins):
    """Computes final p(z) using p(z) = sum_c p(z|c) p(c)."""
    p_z = np.zeros(len(z_bins) - 1)
    for c in p_c:
        if c in nz_prob:  # Only use cells with spectroscopic redshifts
            p_z += nz_prob[c] * p_c[c]
    
    # Normalize p(z)
    p_z /= np.sum(p_z)
    
    return z_bins, p_z

In [None]:
# compute p(z) for each cell c
nz_prob, z_bins = compute_pzc(assignment_spec, spec_data['Z'], num_bins=25, z_range=(0, 1.5))

In [None]:
# compute p(c) for each cell c
pc = compute_pc(assignment_deep)

In [None]:
# compute final p(z)
z_bins, pz = estimate_pz(nz_prob, pc, z_bins)

### **Visualizing the Redshift Distribution**
- The final **estimated redshift distribution** is plotted.  
- The **spectroscopic redshifts** are overlaid for comparison.  
- If \( p(z) \) does not change significantly between SOM configurations (e.g., resolution or metric),  
  it suggests the estimate is not sensitive to how galaxies are assigned to cells.  
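
One simple, hedged way to put a number on "does not change significantly" is to compare the mean redshift of two binned estimates. The helper `mean_redshift` and the two toy distributions below are made up for illustration; in practice the inputs would be the `z_bins` and `pz` arrays from two SOM configurations.

```python
import numpy as np

def mean_redshift(z_edges, p_z):
    """Mean redshift of a binned p(z), evaluated at bin centers."""
    z_mid = 0.5 * (z_edges[:-1] + z_edges[1:])
    weights = p_z * np.diff(z_edges)
    return np.sum(z_mid * weights) / np.sum(weights)

toy_z_edges = np.linspace(0.0, 1.5, 4)
toy_pz_a = np.array([1.1, 0.4, 0.5])   # toy estimate from one SOM configuration
toy_pz_b = np.array([1.0, 0.5, 0.5])   # toy estimate from another configuration

delta_mean_z = mean_redshift(toy_z_edges, toy_pz_b) - mean_redshift(toy_z_edges, toy_pz_a)
print(delta_mean_z)
```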


In [None]:
zbinsc = 0.5 * (z_bins[:-1] + z_bins[1:])

# Normalise so that sum(p(z) * bin width) = 1, matching the density=True histogram below
plt.plot(zbinsc, pz / np.sum(pz * np.diff(z_bins)), label='Estimated p(z)')
plt.hist(spec_data['Z'], bins = z_bins, color = 'tab:blue', alpha = 0.3, density = True, label = 'Spectroscopic z')
plt.xlabel('Redshift z')
plt.ylabel('Probability Density n(z)')
plt.title('Estimated Redshift Distribution using SOM')
plt.legend()
plt.show()
