
obs example: SDSS spectra #154

Open
dnelson86 opened this issue Feb 14, 2024 · 5 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@dnelson86
Collaborator

dnelson86 commented Feb 14, 2024

There is a new obs data file available:

/virgotng/mpia/obs/SDSS/sdss-dr17-spectra.hdf5

As with the GAIA file, the description attributes contain unit metadata.

Can you write a code snippet that uses scida to create a 3-panel plot, where each panel is a 2D image of the spectra (color = flux) with redshift z (sorted) on one axis and wavelength on the other? The three panels should be separated by class (0 = galaxy, 1 = star, 2 = qso).

This could start an "obs cookbook", and would be a nice addition to the docs gallery.

@dnelson86 dnelson86 added documentation Improvements or additions to documentation good first issue Good for newcomers labels Feb 14, 2024
@dnelson86
Collaborator Author

Has been added to www.tng-project.org/data/obs/

@dnelson86
Collaborator Author

```python
import matplotlib.pyplot as plt
import numpy as np
import scida

path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'

ds = scida.load(path + filename)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14, 8))

classes = {0: 'GALAXY', 1: 'STAR', 2: 'QSO'}

for cl, label in classes.items():
    # select and redshift-order the spectra of this class
    w = np.where(ds['class'] == cl)[0]
    inds = np.argsort(ds['z'][w])

    im2d = ds['flux'][w[inds], :]
    im = axes[cl].imshow(im2d)
    fig.colorbar(im, ax=axes[cl], label=ds['flux'].units)

plt.savefig("sdss-dr17-spectra.png", dpi=150)
```

This is very slow; not much seems to happen even after a while. What would be a more efficient approach?

@cbyrohl
Owner

cbyrohl commented Feb 16, 2024

Let me give it a try, but I already have the following suggestions:

  1. Be explicit about using dask operations. Right now you are calling numpy functions such as np.where and np.argsort; when these receive dask arrays, the dask wrappers may be used automatically, but it is hard to see when and whether this conversion happens. Prefer da (from import dask.array as da) over np where possible.
  2. Make sure that regular logging warnings from dask are activated (i.e. not disabled); these often point to unintended performance bottlenecks.
  3. im2d does not appear to be a particularly small array (~1/3 * 4864154 * 4700 * 4 bytes ≈ 30 GB each).
  4. Slicing operations in a distributed setup can be very costly. It very much comes down to the chunking along the different dimensions; make sure to read the dask documentation on chunking.
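To make suggestion 1 concrete, here is a minimal, self-contained sketch of keeping the selection lazy with explicit dask.array calls. The arrays are tiny synthetic stand-ins for ds['class'] and ds['flux']; their shapes and values are made up for illustration.

```python
import numpy as np
import dask.array as da

# synthetic stand-ins for ds['class'] and ds['flux'] (shapes/values are made up)
cls = da.from_array(np.array([0, 1, 2, 0, 1, 2, 0, 0]), chunks=4)
flux = da.from_array(np.arange(8 * 3, dtype=float).reshape(8, 3), chunks=(4, 3))

# da.where stays lazy, unlike np.where on a materialized numpy copy
(w,) = da.where(cls == 0)
idx = w.compute()          # the index array is small; cheap to materialize
sel = flux[idx, :]         # still a lazy dask array until .compute()
result = sel.compute()
print(result.shape)        # (4, 3): four spectra of class 0
```

Using da.where instead of np.where makes it explicit in the code where work stays lazy and where data is actually pulled into memory.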

Fixing this will require picking one of two routes:

  1. Identify the performance bottlenecks laid out above, in particular the chunking relevant for the slicing operation, and compute only a reduced data product rather than a ~10 GB array.
  2. If you really want the ~10 GB output displayed via matplotlib, convert all relevant arrays to numpy arrays (e.g. a = ds["a"].compute()), perform all operations in numpy, and use scida only for reading the arrays in. This appears feasible since the intermediate data products have a similar size to the final data product, so no distributed-memory setup is required.
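Route 2 can be sketched as follows, again with tiny synthetic arrays in place of the real ds fields (names and shapes are illustrative only): compute each field once, then stay in numpy.

```python
import numpy as np
import dask.array as da

# tiny stand-ins for ds['z'] and ds['flux'] (illustrative, not the SDSS data)
z = da.from_array(np.array([0.3, 0.1, 0.2]), chunks=2)
flux = da.from_array(np.eye(3), chunks=(2, 3))

# one .compute() per field; everything below is plain in-memory numpy
z_np = z.compute()
flux_np = flux.compute()

order = np.argsort(z_np)       # cheap and well-defined in numpy
sorted_flux = flux_np[order]   # fancy indexing without any dask slicing cost
print(order)                   # [1 2 0]
```

The point is that the dask layer is used only for I/O; once the fields fit in memory, argsort and fancy indexing are trivial numpy operations.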

@dnelson86
Collaborator Author

Can you convert the code into a more efficient version that downsamples the array to e.g. ~1000x1000 pixels (plenty) for the imshow?

@cbyrohl
Owner

cbyrohl commented Feb 16, 2024

Here's an example to get you started:

```python
import matplotlib.pyplot as plt
import numpy as np
import scida

path = '/virgotng/mpia/obs/SDSS/'
filename = 'sdss-dr17-spectra.hdf5'

ds = scida.load(path + filename, units=False)

classes = {0: 'GALAXY', 1: 'STAR', 2: 'QSO'}

def average_bins(array, bin_size=1000):
    """Average bins along the first axis with a given bin_size."""
    a, b = array.shape
    remainder = a % bin_size
    # If there is a remainder, zero-pad the array to a multiple of bin_size
    if remainder != 0:
        pad_size = bin_size - remainder
        padded_array = np.pad(array, ((0, pad_size), (0, 0)), mode='constant', constant_values=0)
    else:
        padded_array = array
    # Now, reshape and compute the mean along the new bin axis
    reshaped_array = padded_array.reshape(-1, bin_size, b)
    averaged_array = reshaped_array.mean(axis=1)

    return averaged_array

ims = []
for cl, label in classes.items():
    # class and z are small arrays, so we just load them into memory as numpy arrays...
    w = np.where(np.array(ds['class']) == cl)[0]
    z = np.array(ds['z'])[w]
    inds = np.argsort(z)  # ... and argsort is not properly supported by dask anyway
    flux = ds['flux'].rechunk((-1, 100))  # important: rechunk the first dimension, where we index
    im2d = flux[w[inds], :]

    # reduce size by averaging over 10000 spectra each
    bin_size = 10000
    im2d = average_bins(im2d, bin_size)
    im2d = im2d.compute()
    ims.append(im2d)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(14, 8))
for i, (ax, im2d) in enumerate(zip(axes, ims)):
    im = ax.imshow(im2d, origin="lower", aspect="auto")
    ax.set_xlabel("spectral direction")
    if i == 0:
        ax.set_ylabel("redshift direction")
    else:
        ax.set_yticks([])
fig.colorbar(im, label="flux")
plt.savefig("sdss-dr17-spectra.png", dpi=150)
```
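As a quick standalone check of the reshape/mean trick inside average_bins, here is the same logic on a toy input whose numbers are easy to verify by hand (note the zero-padding slightly biases the mean of the last, partial bin):

```python
import numpy as np

array = np.arange(10, dtype=float).reshape(5, 2)  # 5 rows, bin_size=2 -> pad to 6
bin_size = 2
pad = (-array.shape[0]) % bin_size
padded = np.pad(array, ((0, pad), (0, 0)))        # zero-pad the trailing rows
out = padded.reshape(-1, bin_size, array.shape[1]).mean(axis=1)
print(out)
# rows 0+1 -> [1.0, 2.0]; rows 2+3 -> [5.0, 6.0]; row 4 + zero pad -> [4.0, 4.5]
```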

See the code comments for the relevant changes. This ran in ~1 min in a distributed environment on a vera/p.large node with 8 cores.

[Attached figure: sdss-dr17-spectra.png, the resulting 3-panel flux image]
