ArrayOfLineRecord should be replaced by a specialized class #17

Closed
riclarsson opened this issue Aug 30, 2019 · 4 comments
Labels: discussion (conversation about feature ideas), enhancement (iterations on existing features)
Milestone

Comments

@riclarsson
Contributor

I have said this before but was advised to open an issue here.

In short

Our current LineRecord duplicates a lot of data across lines that is not line specific but absorption-band or species specific. Additionally, because of the variety of data a LineRecord can contain, it is not possible to store and read an ArrayOfLineRecord reliably and efficiently as binary data. We can address both problems by making ArrayOfLineRecord a class that stores the metadata and size information shared by all the lines in the array.

In more detail: size

We currently have global quantum numbers and local quantum numbers in the same variable. There are 32 quantum numbers stored per level in a LineRecord. This means 1024 bits per line are stored in RAM. Fewer than 10 of these numbers are not band specific, so at least 700 bits can be saved per line if we make the separation and store the global numbers only once.

We currently store line shape information, that is, the way to compute the parameters required for every line, with independent species information. If we can guarantee that the line shape calculations are the same for every line in an ArrayOfLineRecord, several optimizations are possible in the code. Also, the broadening metadata is stored as an Index describing the type, two bools describing whether self and air broadening are present, and an ArrayOfSpeciesTag. A single SpeciesTag contains 64 bits of information, and the average is 2 SpeciesTag per line (for air and self broadening). This means that every line could store about 144 bits less if the broadening data were made global.

Additionally, the species and isotopologue are stored per LineRecord but could be made global.
The line shape normalization, the line shape mirroring, the line shape cutoff frequency, the line shape line mixing pressure limit, the line shape population type, the reference temperature of the lines, and the LineRecord version are also stored per LineRecord but could arguably be made global. This is an additional 72 bits per line that could instead be stored once globally.

In total, per LineRecord, we could store something like 900 bits less. This is a significant part of the size of a LineRecord, which today is about 1904 bits on average. So the storage gains would be good.
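The arithmetic above can be checked with a short Python sketch. The numbers are the estimates given in this comment; the dictionary keys and variable names are only illustrative.

```python
# Per-line savings estimated above, in bits (figures from this comment).
savings_bits = {
    "global quantum numbers": 700,  # "at least 700 bits"
    "broadening metadata": 144,     # "about 144 bits"
    "other shared metadata": 72,    # normalization, mirroring, cutoff, ...
}
current_size_bits = 1904  # average LineRecord size quoted above

total_saved = sum(savings_bits.values())
fraction_saved = total_saved / current_size_bits

print(f"saved per line: {total_saved} bits")
print(f"fraction of current size: {fraction_saved:.0%}")
```

The three estimates add up to 916 bits, which is the "something like 900 bits" quoted above, or a little under half of the current average size.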

In more detail: storage

Another advantage of this change would be that the entire ArrayOfLineRecord would have a predictable size. This means that each LineRecord can be read from and written to binary files. This could give a large speed-up when reading and writing larger line databases for multiple uses.

The tag of such an XML file would look something like:

<ArrayOfLineRecord version="0" nlines="40" species="O2-66" broadeningspecies="SELF N2 H2O CO2 H2 He BATH" lineshape="VP" mirroringtype="LP" zeemaneffect="true" normalizationtype="VVH" cutofflimit="810e9" linemixinglimit="1e0" populationdistribution="LTE">

This might look like a beast, but it really does not change much from how the data looks today, even if we store the ArrayOfLineRecord with just a single line in ASCII form. All the sizes should be made clear from the tag, so that each line is known to occupy the same number of bits. This allows fast and efficient binary IO.
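A minimal Python sketch of the idea, not an implementation of any existing ARTS format: the header attributes fix the size of every record, so the binary payload can be read without inspecting individual lines. The per-line layout (two fixed values plus two broadening parameters per broadening species) and the function names are assumptions made for illustration. The example tag is closed so that it parses as XML, and nlines is reduced to 2.

```python
import struct
import xml.etree.ElementTree as ET

# Header based on the example tag above (nlines reduced to 2, tag closed).
HEADER = (
    '<ArrayOfLineRecord version="0" nlines="2" species="O2-66" '
    'broadeningspecies="SELF N2 H2O CO2 H2 He BATH" lineshape="VP" '
    'mirroringtype="LP" zeemaneffect="true" normalizationtype="VVH" '
    'cutofflimit="810e9" linemixinglimit="1e0" '
    'populationdistribution="LTE"></ArrayOfLineRecord>'
)


def record_layout(attrib):
    """Build a fixed-size record layout from the header attributes.

    Hypothetical layout: frequency and strength, followed by two broadening
    parameters per broadening species, all as little-endian doubles.
    """
    n_species = len(attrib["broadeningspecies"].split())
    return struct.Struct("<" + "d" * (2 + 2 * n_species))


def write_lines(layout, lines):
    """Concatenate the packed records; each one occupies layout.size bytes."""
    return b"".join(layout.pack(*line) for line in lines)


def read_lines(layout, payload, nlines):
    """Read exactly nlines records; the expected size is known in advance."""
    if len(payload) != nlines * layout.size:
        raise ValueError("payload size does not match the header")
    return list(layout.iter_unpack(payload))


attrib = ET.fromstring(HEADER).attrib
layout = record_layout(attrib)
nlines = int(attrib["nlines"])

# Two made-up lines: 2 fixed values plus 2 values for each of the 7 species.
lines = [tuple(float(i + k) for k in range(16)) for i in range(nlines)]
payload = write_lines(layout, lines)
restored = read_lines(layout, payload, nlines)
```

With seven broadening species, this assumed layout gives 16 doubles per line, so the payload size follows directly from nlines and the header.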

Additional benefits: code readability

Lastly, another advantage is that ArrayOfLineRecord specific optimizations in the line-by-line code could be made.

One such optimization concerns the partition function. Today it has to be checked for every absorption line that the isotopologue and reference temperature have not changed since the previous line. If the partition function is instead computed just once, the code becomes more predictable, since the partition function is then constant.

The same can be said about the line shape volume mixing vector. If this is known to be constant the first time it is computed, the code checking that it is the same can be removed.

Also, the Doppler broadening would be known from the beginning of the cross-section calculations.

These changes would likely not affect the speed of the code execution by much, but making as many things as possible constant expressions helps greatly with readability. And the fewer non-constants we have, the easier it will be to ensure that parallel code executes efficiently.
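As a rough Python sketch of the partition function point, assuming hypothetical Line and LineArray types that are not existing ARTS classes: with the isotopologue and reference temperature stored once per array, the ratio Q(T_ref) / Q(T) can be computed before the loop over lines. Only this ratio is shown; the other temperature factors of a real line strength conversion are left out, and the partition function is a placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Line:
    """Hypothetical per-line data: only what differs between lines."""
    f0: float        # line centre frequency
    strength: float  # line strength at the reference temperature


@dataclass(frozen=True)
class LineArray:
    """Hypothetical array with the shared metadata stored once."""
    isotopologue: str
    t_ref: float
    lines: List[Line]


def scaled_strengths(array, temperature, partition_function):
    """Scale all line strengths by Q(T_ref) / Q(T).

    The isotopologue and T_ref are shared, so the ratio is computed once
    before the loop instead of being re-checked for every line.
    """
    q_ratio = (partition_function(array.isotopologue, array.t_ref)
               / partition_function(array.isotopologue, temperature))
    return [line.strength * q_ratio for line in array.lines]


calls = []


def toy_partition_function(isotopologue, temperature):
    """Placeholder, not a physical model: Q(T) = T. Logs every call."""
    calls.append((isotopologue, temperature))
    return temperature


# Made-up line data for illustration only.
array = LineArray("O2-66", 296.0, [Line(118.75e9, 1.0), Line(60.0e9, 2.0)])
strengths = scaled_strengths(array, 148.0, toy_partition_function)
```

The partition function is called twice in total here, regardless of how many lines the array holds.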

@stefanbuehler
Contributor

Hi Richard,

Which granularity do you envision this to have? Do I understand correctly that settings like broadeningspecies, etc. would apply to the entire array?

I think that makes sense. But I'm slightly confused about the notion of bands, which you also use in the beginning. And, related to this, what to do with global and local quantum numbers.

Do you propose to use one ArrayOfLineRecord per band? And, in that case, should the band-global quantum numbers be part of the XML header, extending the list of tags in the example you give under "In more details: storage"?

@riclarsson
Contributor Author

Hi Stefan @stefanbuehler

Note that I have not made an implementation of this yet because it is quite a massive change, so there are likely many things that I have not thought about and that could therefore be glaring errors. From your comment I spotted at least one thing I missed (more further down).

I will get slightly technical. So before that, Oliver suggested going with the name "ArtsCatalog" for the ArrayOfLineRecord-class. I suggest that the smaller version of LineRecord be called "AbsorptionLine". I will use these names here to explain what I intend each of them to contain.

In short

The short answers first. The broadeningspecies should apply to the whole ArtsCatalog. The ArtsCatalog should have a sliding level of granularity that allows it to separate bands if we want it to and to ignore this if we want it to. The XML-header was indeed wrong and should hold two more tags for local and global quantum numbers.

Not so short

First, about the granularity I think we want to allow. The finest granularity must be line-by-line. The coarsest granularity should not be more than all the lines of a single isotopologue. A middle ground would be to define a subset of lines of a single isotopologue. We can deal with all of these cases today. However, today our coarsest granularity is not on isotopologue level but on species level. I believe having this on species level rather than isotopologue level has made a lot of code more difficult to read and write, so the loss of coarseness is likely good going forward.

Next, why we want to support each level. The line-by-line granularity is useful for the Zeeman effect or for experimental line shapes. The coarsest granularity is useful for things like broadband absorption or standard line-by-line calculations. The middle ground is useful for line mixing or for infrared non-LTE effects. I believe we should continue to support these cases but make this support cleaner.

In all three cases, the ArtsCatalog should have just one array broadeningspecies that applies to all AbsorptionLine(s). For the coarsest granularity, broadeningspecies should be the only thing besides the isotopologue that matters. For the finer levels of granularity, I think that quantum numbers should encode the granularity. I missed this in my XML-tag above and I believe that the tags would be better if they contained this information.

This XML-tag should contain 2 more variables. To keep describing the 60 GHz band in the tag, these extra tags should be globalquantum="UP v0 0 Hund 1 Lambda 0 S 1 LO v0 0 Hund 1 Lambda 0 S 1" and localquantum="J N". The length of the global quantum number variable inside ARTS might be just the same as QuantumIdentifier is today, but the local variables could be replaced by just a few select numbers. The "band" I talk about here is therefore the combination of all the AbsorptionLine(s) that share the same globalquantum. I believe this is enough to allow all levels of granularity and flexible enough to not waste much RAM.
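To make the proposed attribute format concrete, here is a small Python sketch that splits a globalquantum string into upper and lower level dictionaries. It assumes only what the example above shows: the markers UP and LO, each followed by alternating names and values. The function name is hypothetical, and values are kept as strings because the proposal does not fix their type.

```python
def parse_global_quantum(text):
    """Split 'UP name value ... LO name value ...' into two dictionaries."""
    levels = {"UP": {}, "LO": {}}
    current = None
    tokens = text.split()
    i = 0
    while i < len(tokens):
        if tokens[i] in levels:
            # A level marker switches the dictionary being filled.
            current = tokens[i]
            i += 1
            continue
        if current is None or i + 1 >= len(tokens):
            raise ValueError(f"unexpected token {tokens[i]!r}")
        levels[current][tokens[i]] = tokens[i + 1]
        i += 2
    return levels["UP"], levels["LO"]


upper, lower = parse_global_quantum(
    "UP v0 0 Hund 1 Lambda 0 S 1 LO v0 0 Hund 1 Lambda 0 S 1"
)
local_names = "J N".split()  # localquantum="J N"
```

An empty string yields two empty dictionaries, which corresponds to globalquantum="" in this sketch.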

This flexibility is important. With it we can have

  • a way to catch just the 118 GHz O2-66 AbsorptionLine in a tag. This would be to write globalquantum="UP v0 0 Hund 1 Lambda 0 S 1 J 1 N 1 LO v0 0 Hund 1 Lambda 0 S 1 N 1 J 0" and localquantum="".
  • a way to state that all the AbsorptionLine(s) of an isotopologue should be contained in the ArtsCatalog. This would be to write globalquantum="" with anything from none (localquantum="") to all (localquantum="*") of the quantum numbers in the local list, or with only selected numbers (localquantum="J").
  • a way to catch each AbsorptionLine separately, distinguished by all of its available quantum numbers. This would be to write globalquantum="*".

I believe this level of flexibility is important since we do not always care about the quantum numbers, but when we do we need to be able to find them easily. We also already have this level of flexibility with current day HITRAN and others, so there should be no problem adopting methods that select the granularity at which you want to generate your ArrayOfArtsCatalog or ArrayOfArrayOfArtsCatalog.
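The three levels of granularity can be sketched in Python as a grouping step. This is a simplification under stated assumptions: lines are plain dictionaries of quantum numbers, upper and lower levels are not distinguished, and split_into_bands is a hypothetical helper rather than an ARTS method.

```python
from collections import defaultdict


def split_into_bands(lines, global_names):
    """Group lines that share the same values of the global quantum numbers.

    An empty global_names gives a single group with every line; listing all
    names gives one group per distinct set of quantum numbers.
    """
    bands = defaultdict(list)
    for line in lines:
        key = tuple((name, line.get(name)) for name in global_names)
        # Whatever is not global stays with the line as local data.
        local = {k: v for k, v in line.items() if k not in global_names}
        bands[key].append(local)
    return dict(bands)


# Made-up quantum numbers for three lines of one isotopologue.
lines = [
    {"v0": 0, "J": 1, "N": 1},
    {"v0": 0, "J": 2, "N": 3},
    {"v0": 1, "J": 1, "N": 1},
]

coarsest = split_into_bands(lines, [])               # globalquantum=""
by_band = split_into_bands(lines, ["v0"])            # one group per v0
finest = split_into_bands(lines, ["v0", "J", "N"])   # globalquantum="*"
```

Each group then corresponds to one ArtsCatalog in the sense used above, with the group key playing the role of globalquantum.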

@erikssonpatrick
Contributor

Hi,

In case it is not clear, a reminder that the aim this week is to work on things already started. For the moment we don't want any new "Baustellen" (construction sites).

Bye,

Patrick

@stefanbuehler
Contributor

Dear Richard,

I like your suggestion and think it goes in the right direction. It is much better to specify as many parameters as possible for the whole collection of lines, instead of for each individual line.

For the name of the class, I think ArtsCatalog is slightly misleading, since Catalog is more used for the whole structure (including all species). How about "LineList", "LineCollection", or something like that?

How do you see the connection between this and abs_species? I think it would work very well if each element of abs_species (each absorber, associated with a VMR field) were associated with an ArrayOfLineList. So the catalog as a whole could be viewed as an ArrayOfArrayOfLineList. This actually fits the philosophy of abs_species, which already is an ArrayOfArrayOfSpeciesTag.

The LBL absorption calculation core routine could take one LineList and return the associated absorption. A higher level method would then just loop over all LineLists and call the core routine. Is this how you were thinking?
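A rough Python sketch of this structure, with all names hypothetical: a core routine handles one LineList, a higher-level function loops over the LineLists of one absorber, and the catalog as a whole is a list of such lists. The Lorentzian used here is only a stand-in for the real line shape, and VMR scaling is left out.

```python
import math


def line_list_absorption(line_list, frequencies):
    """Stand-in for the core routine: absorption of one LineList.

    Placeholder model: every line is a Lorentzian with the half-width
    'gamma' stored once in the list metadata.
    """
    gamma = line_list["gamma"]
    result = []
    for f in frequencies:
        total = 0.0
        for f0, strength in line_list["lines"]:
            total += strength * gamma / (math.pi * ((f - f0) ** 2 + gamma ** 2))
        result.append(total)
    return result


def species_absorption(array_of_line_list, frequencies):
    """Higher-level method: sum the core routine over all LineLists."""
    total = [0.0] * len(frequencies)
    for line_list in array_of_line_list:
        for i, value in enumerate(line_list_absorption(line_list, frequencies)):
            total[i] += value
    return total


def all_absorption(catalog, frequencies):
    """One result per abs_species element (an ArrayOfArrayOfLineList)."""
    return [species_absorption(lists, frequencies) for lists in catalog]


# Made-up catalog: one absorber with two LineLists, one absorber with none.
catalog = [
    [
        {"gamma": 1.0, "lines": [(0.0, math.pi)]},
        {"gamma": 1.0, "lines": [(10.0, math.pi)]},
    ],
    [],
]
result = all_absorption(catalog, [0.0])
```

The core routine only ever sees one LineList, so the shared metadata is constant for the whole inner loop.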

I assume you can include a way to bring HITRAN data into the new internal format? (This is the core requirement for replacing the existing format, because we have to remain able to run calculations directly with HITRAN files.)

All the best,

Stefan

@olemke added the discussion and enhancement labels Sep 3, 2019
@olemke closed this as completed in 0909c85 Dec 12, 2019
@olemke added this to the ARTS 3 milestone Sep 18, 2020
4 participants