How to handle data with samples #19

Open
CJ-Wright opened this issue Nov 15, 2017 · 5 comments

@CJ-Wright (Member)

Usually, when we take data, it gets put into a databroker. The databroker takes care of storing large data sets. However, we may get data with the samples (e.g. XRD patterns taken on a lab source) that is not in the databroker. How should we handle this?

The way I see it, there are two options (although there may be more):

  1. Sideload the data into the databroker (although we are missing tons of metadata, some of it critical (x-ray wavelength?)).
  2. Put the data into filestore and hand the sample database the tokens. On retrieval we can open the data back up (see the sketch after this list).
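Roughly what I have in mind for option 2, as a minimal sketch: plain `.npy` files stand in for filestore here, and the token layout is invented for illustration, not a real filestore schema.

```python
# Minimal sketch of option 2: numpy .npy files stand in for filestore, and
# the token layout here is made up for illustration (not a real filestore
# document schema).
import uuid
from pathlib import Path

import numpy as np

DATA_ROOT = Path("/tmp/sample_data")  # placeholder storage location

def stash_array(arr):
    """Write the array to disk and return a token for the sample database."""
    DATA_ROOT.mkdir(parents=True, exist_ok=True)
    uid = str(uuid.uuid4())
    path = DATA_ROOT / f"{uid}.npy"
    np.save(path, arr)
    return {"datum_id": uid, "path": str(path)}

def retrieve_array(token):
    """On retrieval, open the data back up from the token."""
    return np.load(token["path"])

# The sample document carries only the token, never the array itself.
sample_doc = {
    "sample_name": "Ni_calibrant",  # hypothetical sample entry
    "xrd_pattern": stash_array(np.random.rand(2048)),
}
pattern = retrieve_array(sample_doc["xrd_pattern"])
```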
@sbillinge

This probably speaks to a more general discussion about what should be in a filestore and what should be in databrokers.

For my money this is less a philosophical issue and more a file-size issue. If the file-size is greater than XXXX it should go into filestore?

It may also be a searchability issue though. We don't want to search through large files of data for metadata, but we don't want to lose large amounts of metadata into large datafiles that are not in databroker.

What are your thoughts? The file-size limit may be the simplest thing.
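Something along these lines, say, where the cutoff is just a placeholder for XXXX and `stash_file` stands in for whatever filestore insertion we settle on:

```python
# Sketch of the size-threshold routing: small payloads stay inline in the
# metadata document, large ones go out as a token. SIZE_CUTOFF_BYTES is a
# placeholder for whatever XXXX ends up being, and stash_file stands in
# for the actual filestore insertion call.
import os

SIZE_CUTOFF_BYTES = 1_000_000  # placeholder, not a recommendation

def ingest(path, doc, stash_file):
    if os.path.getsize(path) <= SIZE_CUTOFF_BYTES:
        with open(path, "rb") as f:
            doc["data"] = f.read()            # small: keep it in the db, searchable
    else:
        doc["data_token"] = stash_file(path)  # large: hand back a token
    return doc
```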

@sbillinge

On a similar topic: wherever the x-ray data go, when we find out later what the x-ray wavelength was, how can we then associate it? A mutable databroker? Non-mutable, but with some kind of event stream that zips the info together?

@CJ-Wright (Member Author)

I am talking strictly about the numerical array data. I presume we'd parse any metadata in those files out into a dict somewhere?

To your second post:
I'm not certain. There were discussions a long time ago about making databroker documents amendable (such that the history of amendments is kept, so you could go back to the original data), but I don't know where that discussion currently stands. We could have two streams, one for the data and one for the energy, and then change which energy we point to using some search capability (this is what is planned at XPD, I think).
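A rough sketch of that two-stream zipping, with invented event shapes:

```python
# Sketch of the two-stream idea: one stream of data events, one of energy
# events, zipped after the fact by timestamp. Event shapes are invented here.
import bisect

energy_events = [
    {"time": 100.0, "energy_keV": 74.1},  # hypothetical calibration entries
    {"time": 500.0, "energy_keV": 66.7},
]
data_events = [
    {"time": 120.0, "pattern": "pattern_a"},
    {"time": 510.0, "pattern": "pattern_b"},
]

energy_times = [e["time"] for e in energy_events]  # assumed sorted

def energy_at(t):
    """Most recent energy event at or before time t (clamped to the first)."""
    i = bisect.bisect_right(energy_times, t) - 1
    return energy_events[max(i, 0)]

zipped = [(d, energy_at(d["time"])) for d in data_events]
# Re-pointing a measurement at a corrected energy is just re-running the
# zip against an amended energy stream.
```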

@sbillinge

I am not sure of the best way forward. I thought about parsing out the numerical array data, but then we need tools to do that reliably, which is fine when people are using known file formats but could be a big overhead to maintain.

The reason it is an important question is that if we are generating thousands of processed PDFs, F(Q)s, etc., when we decide to store rather than recompute them, do we parse out those arrays to a filestore and propagate a token, or do we just store the arrays in databroker? I don't know the answer.

The current issue is just forcing our hand to make this decision, I guess.

Pro parsing out arrays to filestore:

  1. elegance
  2. doesn't slow down searches in db
  3. it's just the right thing to do
  4. ????

Con:

  1. oof, in the future we may be keeping track of every file format and its family, turning this into a full-time job.
  2. overkill?
  3. creating a complex solution where a simple one works just fine?

The answers to pro points 2 and 3 will depend on performance, I guess.

@CJ-Wright (Member Author)

In the Pro category we should add:

  1. Space friendly (the data could live separately from the metadata database, e.g. on tape)

The file format issue is a problem no matter which way we turn. If we are going to store data, we will either:

a) need to parse it on the way into some uniform storage method (filestore, hdf5, filestore+hdf5, raw json, etc.), or
b) need to parse it every time we want to look at the data, if we leave it on disk in its current format.
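For (a), a sketch of what the parsing layer might look like; the extensions and header conventions are examples only:

```python
# Sketch of option (a): parse known formats once on the way in, normalizing
# everything to plain ndarrays before storage. The extensions and header
# conventions below are examples only; every new format adds a registry
# entry, which is exactly the maintenance cost in the con list above.
from pathlib import Path

import numpy as np

def read_xy(path):
    """Two-column ASCII (e.g. lab-source XRD): angle, intensity."""
    return np.loadtxt(path)

def read_chi(path):
    """Integrated .chi output; assumes a four-line header to skip."""
    return np.loadtxt(path, skiprows=4)

PARSERS = {".xy": read_xy, ".chi": read_chi}

def normalize(path):
    reader = PARSERS.get(Path(path).suffix.lower())
    if reader is None:
        raise ValueError(f"no parser registered for {path}")
    return reader(path)  # uniform array, ready for hdf5/filestore
```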

I would say the definition of overkill is "creating a complex solution where a simple one works just fine?" 😄
