
obs example: SDSS spectra #154

Open
dnelson86 opened this issue Feb 14, 2024 · 5 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@dnelson86
Collaborator

dnelson86 commented Feb 14, 2024

There is a new obs data file available:

/virgotng/mpia/obs/SDSS/sdss-dr17-spectra.hdf5

As with the GAIA file, the description attributes contain unit metadata.

Can you write a code snippet that uses scida to create a 3-panel plot, where each panel is a 2D image of the spectra (color = flux) with redshift z (sorted) on one axis and wavelength on the other? The three panels should be separated by class (0 = galaxy, 1 = star, 2 = qso).

This could start an "obs cookbook", and would be a nice addition to the docs gallery.

@dnelson86 dnelson86 added documentation Improvements or additions to documentation good first issue Good for newcomers labels Feb 14, 2024
@dnelson86
Collaborator Author

Has been added to www.tng-project.org/data/obs/

@dnelson86
Collaborator Author

```python
import matplotlib.pyplot as plt
import numpy as np
import scida

path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'

ds = scida.load(path + filename)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14, 8))

classes = {0: 'GALAXY', 1: 'STAR', 2: 'QSO'}

for cl, label in classes.items():
    # select and redshift-order the spectra of this class
    w = np.where(ds['class'] == cl)[0]
    inds = np.argsort(ds['z'][w])

    im2d = ds['flux'][w[inds], :]
    im = axes[cl].imshow(im2d)
    fig.colorbar(im, ax=axes[cl], label=ds['flux'].units)

plt.savefig("sdss-dr17-spectra.png", dpi=150)
```

This is very slow; not much seems to happen even after a while. What would be a more efficient approach?

@cbyrohl
Owner

cbyrohl commented Feb 16, 2024

Let me give it a try, but I already have the following suggestions:

  1. Be explicit about using dask operations. Right now you are calling numpy functions such as np.where and np.argsort; when these receive dask arrays, the dask wrappers may be used automatically, but it is hard to see when and whether this conversion happens. Prefer da (from import dask.array as da) over np where possible.
  2. Make sure that regular logging warnings from dask are activated (i.e. not disabled); these often point to unintended performance bottlenecks.
  3. im2d does not appear to be a particularly small array (~1/3 * 4864154 * 4700 * 4 bytes ≈ 30 GB each).
  4. Slicing operations in a distributed setup can be very costly. It very much comes down to the chunking along the different dimensions; make sure to read the dask documentation on chunking.
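To make suggestion 1 concrete, here is a minimal, self-contained sketch of keeping the selection lazy with explicit dask.array calls. The arrays are tiny synthetic stand-ins for ds['class'] and ds['flux']; their shapes and values are made up for illustration.

```python
import numpy as np
import dask.array as da

# synthetic stand-ins for ds['class'] and ds['flux'] (shapes/values are made up)
cls = da.from_array(np.array([0, 1, 2, 0, 1, 2, 0, 0]), chunks=4)
flux = da.from_array(np.arange(8 * 3, dtype=float).reshape(8, 3), chunks=(4, 3))

# da.where stays lazy, unlike np.where on a materialized numpy copy
(w,) = da.where(cls == 0)
idx = w.compute()          # the index array is small; cheap to materialize
sel = flux[idx, :]         # still a lazy dask array until .compute()
result = sel.compute()
print(result.shape)        # (4, 3): four spectra of class 0
```

Using da.where instead of np.where makes it explicit in the code where work stays lazy and where data is actually pulled into memory.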

Fixing this will require picking one of two routes:

  1. Identify the performance bottlenecks laid out above, in particular the chunking relevant for the slicing operation, and compute only a reduced data product rather than a ~10 GB array.
  2. If you really want the ~10 GB output displayed via matplotlib, convert all relevant arrays to numpy arrays (e.g. a = ds["a"].compute()), perform all operations in numpy, and use scida only for reading the arrays in. This appears feasible since the intermediate data products have a similar size to the final data product, so no distributed-memory setup is required.
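Route 2 can be sketched as follows, again with tiny synthetic arrays in place of the real ds fields (names and shapes are illustrative only): compute each field once, then stay in numpy.

```python
import numpy as np
import dask.array as da

# tiny stand-ins for ds['z'] and ds['flux'] (illustrative, not the SDSS data)
z = da.from_array(np.array([0.3, 0.1, 0.2]), chunks=2)
flux = da.from_array(np.eye(3), chunks=(2, 3))

# one .compute() per field; everything below is plain in-memory numpy
z_np = z.compute()
flux_np = flux.compute()

order = np.argsort(z_np)       # cheap and well-defined in numpy
sorted_flux = flux_np[order]   # fancy indexing without any dask slicing cost
print(order)                   # [1 2 0]
```

The point is that the dask layer is used only for I/O; once the fields fit in memory, argsort and fancy indexing are trivial numpy operations.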

@dnelson86
Collaborator Author

Can you convert the code into a more efficient version that downsamples the array to e.g. ~1000x1000 pixels (plenty) for the imshow?

@cbyrohl
Owner

cbyrohl commented Feb 16, 2024

Here's an example to get you started:

```python
import matplotlib.pyplot as plt
import numpy as np
import scida

path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'

ds = scida.load(path + filename, units=False)

classes = {0: 'GALAXY', 1: 'STAR', 2: 'QSO'}

def average_bins(array, bin_size=1000):
    """Average bins along the first axis with a given bin_size."""
    a, b = array.shape
    remainder = a % bin_size
    # If there is a remainder, zero-pad the array to a multiple of bin_size
    if remainder != 0:
        pad_size = bin_size - remainder
        padded_array = np.pad(array, ((0, pad_size), (0, 0)), mode='constant', constant_values=0)
    else:
        padded_array = array
    # Now, reshape and compute the mean along the new bin axis
    reshaped_array = padded_array.reshape(-1, bin_size, b)
    averaged_array = reshaped_array.mean(axis=1)

    return averaged_array

ims = []
for cl, label in classes.items():
    # class and z are small arrays, so we just load them into memory as numpy arrays...
    w = np.where(np.array(ds['class']) == cl)[0]
    z = np.array(ds['z'])[w]
    inds = np.argsort(z)  # ... and argsort is not properly supported by dask anyway
    flux = ds['flux'].rechunk((-1, 100))  # important: rechunk the first dimension, where we index
    im2d = flux[w[inds], :]

    # reduce size by averaging over 10000 spectra each
    bin_size = 10000
    im2d = average_bins(im2d, bin_size)
    im2d = im2d.compute()
    ims.append(im2d)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14, 8))
for i, (ax, im2d) in enumerate(zip(axes, ims)):
    im = ax.imshow(im2d, origin="lower", aspect="auto")
    ax.set_xlabel("spectral direction")
    if i == 0:
        ax.set_ylabel("redshift direction")
    else:
        ax.set_yticks([])
fig.colorbar(im, label="flux")
plt.savefig("sdss-dr17-spectra.png", dpi=150)
```
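As a quick standalone check of the reshape/mean trick inside average_bins, here is the same logic on a toy input whose numbers are easy to verify by hand (note the zero-padding slightly biases the mean of the last, partial bin):

```python
import numpy as np

array = np.arange(10, dtype=float).reshape(5, 2)  # 5 rows, bin_size=2 -> pad to 6
bin_size = 2
pad = (-array.shape[0]) % bin_size
padded = np.pad(array, ((0, pad), (0, 0)))        # zero-pad the trailing rows
out = padded.reshape(-1, bin_size, array.shape[1]).mean(axis=1)
print(out)
# rows 0+1 -> [1.0, 2.0]; rows 2+3 -> [5.0, 6.0]; row 4 + zero pad -> [4.0, 4.5]
```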

See the code comments for the relevant changes. This ran in ~1 min in a distributed environment on a vera/p.large node with 8 cores.

[Attached figure: sdss-dr17-spectra.png, the resulting 3-panel flux image]
