Save & load responses as parquet #8684

Merged
yngve-sk merged 1 commit into equinor:main from responses-as-parquet
Oct 7, 2024

Conversation

@yngve-sk (Contributor) commented Sep 12, 2024

(semeio PR: equinor/semeio#648)

Issue
Working toward combining datasets without xarray NaN artifacts and similar issues.

Approach
Read and write parquet files with polars.

Closes: #6525
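
For illustration, a minimal sketch of the parquet round-trip with polars (column names and dtypes mirror the long-format frame shown later in this thread; this is not the PR's exact code):

from datetime import datetime

import polars

# Responses in long format: one row per (realization, response_key, time).
df = polars.DataFrame(
    {
        "realization": polars.Series([0, 1], dtype=polars.UInt16),
        "response_key": ["FOPR", "FOPR"],
        "time": polars.Series(
            [datetime(2014, 9, 10)] * 2, dtype=polars.Datetime("ms")
        ),
        "values": polars.Series([-1.603837, 0.0641], dtype=polars.Float32),
    }
)

df.write_parquet("responses.parquet")  # replaces the xarray/NetCDF storage
loaded = polars.read_parquet("responses.parquet")
assert loaded.equals(df)  # dtypes survive the round-trip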

Some benchmarking:

Drogon AHM case, main vs parquet:
                                         main     parquet
Open Manage Experiments
Open plotter                             3.9s     4.3s
Select FGORH (w/ all ensembles active)   1.8s     1.2s
Select w1 (w/ all ensembles active)      4.0s     3.2s
Select w2 (w/ all ensembles active)      4.1s     3.3s


SLOWPLOT case                main      parquet
Open GUI with migration      3m15s     5m
Open GUI w/o migration       16s       15s
Migrate to 7                 2m59s     4m45s
Open plotter                 51s       11s
Select summary vector        21.9s     7s
Open Manage Experiments      1s        1s
Select experiment            1s        1s
Select ensemble              >1s       >1s
Ensemble -> Observations     12s       19s (slower)
Select realization           1s        6s

@yngve-sk yngve-sk marked this pull request as draft September 12, 2024 06:37
@yngve-sk yngve-sk force-pushed the responses-as-parquet branch 29 times, most recently from 3e063f7 to 378376d Compare September 17, 2024 08:21
@yngve-sk yngve-sk force-pushed the responses-as-parquet branch 6 times, most recently from 31a824e to 0d8ddd4 Compare September 30, 2024 08:02
@codecov-commenter commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 98.24561% with 6 lines in your changes missing coverage. Please review.

Project coverage is 91.47%. Comparing base (d1c3a88) to head (c0fb05c).
Report is 11 commits behind head on main.

Files with missing lines            Patch %   Lines
src/ert/config/ert_config.py        73.33%    4 Missing ⚠️
src/ert/config/observations.py      95.83%    1 Missing ⚠️
src/ert/config/summary_config.py    93.33%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8684      +/-   ##
==========================================
+ Coverage   91.42%   91.47%   +0.05%     
==========================================
  Files         344      344              
  Lines       21120    21243     +123     
==========================================
+ Hits        19308    19433     +125     
+ Misses       1812     1810       -2     
Flag                 Coverage Δ
cli-tests            39.58% <35.67%> (-0.05%) ⬇️
gui-tests            73.30% <54.67%> (-0.25%) ⬇️
performance-tests    50.15% <49.70%> (+<0.01%) ⬆️
unit-tests           80.24% <83.33%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown.


@yngve-sk yngve-sk self-assigned this Sep 30, 2024
@oyvindeide (Collaborator) left a comment:
I think this improves readability and makes the responses more generic, which is good 👍

assert all(
    fopr.data.columns.get_level_values("data_index").values == list(range(200))
)
# Why 210, not 200?
Collaborator:
Outdated comment?

@@ -1,212 +1,212 @@
------------ ------------------- ----- ----- ----- ----- ------ ----- ------
FOPR 2010-01-10T00:00:00 0.002 0.100 5.657 0.566 0.076 0.105 Active
Collaborator:
To decrease your diff you could probably just fix the formatting here 😅 Not a big deal though, I see that it is only the formatting that changed.

pivoted = responses_for_type.pivot(
    on="realization",
    index=["response_key", *response_cls.primary_key],
    aggregate_function="mean",
)
Collaborator:
What is the implication of mean?

@oyvindeide (Collaborator) commented Oct 2, 2024:
It said so in the comment 😅

Collaborator:
Will that be output somewhere? Is it possible to, for example, log it?

@yngve-sk (Contributor, author) commented Oct 2, 2024:
It is for the edge case where we end up with duplicate values for one response at one index, for example at a given time. In that case we need to aggregate them for the pivoted table to make sense; otherwise the index used to pivot contains duplicates. Taking the average of the duplicate response values at that timestep seems "close enough" to what we want. We could set it to use min, max, median, first, etc., and could make it configurable, but I'm not sure users would find that interesting.

Example from running test_that_duplicate_summary_time_steps_does_not_fail:

responses_for_type.pivot(
    on="realization",
    index=["response_key", *response_cls.primary_key],
    aggregate_function="mean",
)
Out[9]: 
shape: (1, 5)
┌──────────────┬─────────────────────┬───────────┬────────┬──────────┐
│ response_key ┆ time                ┆ 0         ┆ 1      ┆ 2        │
│ ---          ┆ ---                 ┆ ---       ┆ ---    ┆ ---      │
│ str          ┆ datetime[ms]        ┆ f32       ┆ f32    ┆ f32      │
╞══════════════╪═════════════════════╪═══════════╪════════╪══════════╡
│ FOPR         ┆ 2014-09-10 00:00:00 ┆ -1.603837 ┆ 0.0641 ┆ 0.740891 │
└──────────────┴─────────────────────┴───────────┴────────┴──────────┘
responses_for_type
Out[10]: 
shape: (4, 4)
┌─────────────┬──────────────┬─────────────────────┬───────────┐
│ realization ┆ response_key ┆ time                ┆ values    │
│ ---         ┆ ---          ┆ ---                 ┆ ---       │
│ u16         ┆ str          ┆ datetime[ms]        ┆ f32       │
╞═════════════╪══════════════╪═════════════════════╪═══════════╡
│ 0           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ -1.603837 │
│ 1           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ 0.0641    │
│ 2           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ 0.740891  │
│ 2           ┆ FOPR         ┆ 2014-09-10 00:00:00 ┆ 0.740891  │
└─────────────┴──────────────┴─────────────────────┴───────────┘

Alternatively, we could strive to achieve something like this:

┌──────────────┬─────────────────────┬───────────┬────────┬──────────┐
│ response_key ┆ time                ┆ 0         ┆ 1      ┆ 2        │
│ ---          ┆ ---                 ┆ ---       ┆ ---    ┆ ---      │
│ str          ┆ datetime[ms]        ┆ f32       ┆ f32    ┆ f32      │
╞══════════════╪═════════════════════╪═══════════╪════════╪══════════╡
│ FOPR         ┆ 2014-09-10 00:00:00 ┆ -1.603837 ┆ 0.0641 ┆ 0.740891 │
│ FOPR         ┆ 2014-09-10 00:00:00 ┆    NaN    ┆  NaN   ┆ 0.740891 │
└──────────────┴─────────────────────┴───────────┴────────┴──────────┘

Contributor (author):
It could be logged or given as a warning somehow. I'm not so familiar with when/why it happens, which may be relevant to what the warning/logging message should say.

Contributor (author):
(Performance-wise it might be slow to always check whether some values were aggregated. An alternative is a naive try-catch around the pivot, since it will pass if there are no duplicate values.)
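
A rough sketch of that try-first idea (a hypothetical helper, assuming polars raises a ComputeError when the pivot index contains duplicates and no aggregate_function is given):

import logging

import polars


def pivot_responses(
    responses_for_type: polars.DataFrame, primary_key: list[str]
) -> polars.DataFrame:
    index = ["response_key", *primary_key]
    try:
        # Fast path: no duplicates in the pivot index, no aggregation needed.
        return responses_for_type.pivot(on="realization", index=index)
    except polars.exceptions.ComputeError:
        # Duplicates found: fall back to averaging, and warn so it is visible.
        logging.warning("Duplicate response values detected, averaging them")
        return responses_for_type.pivot(
            on="realization", index=index, aggregate_function="mean"
        )

Catching only ComputeError keeps unrelated pivot failures visible instead of silently averaging over them.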

Collaborator:
If there is a good, somewhat performant way of warning the user that this has happened, that would be good. My hunch is that this would typically happen in pressure tests, where the time resolution is quite high and the simulator does not have the same resolution.

Contributor (author):
Would it be OK to do this in a separate PR? I think the try-catch, first trying without an aggregation and then with one, should be easy to add and easy to remove if it turns out to have bad side effects. It should maybe be tested as its own thing just to be sure.

# We need to either assume that if there is a time column
# we will approx-join that, or we could specify in response configs
# that there is a column that requires an approx "asof" join.
# Suggest we simplify and assume that there is always only
Collaborator:
Agreed; if and when we add new response types where this might be relevant, we can add it then.
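
For reference, a hedged sketch of what such an approximate time join could look like with polars' join_asof (illustrative data and column names, not code from this PR):

from datetime import datetime

import polars

# join_asof requires both frames to be sorted on the "on" key.
obs = polars.DataFrame(
    {
        "response_key": ["FOPR"],
        "time": [datetime(2014, 9, 10, 1)],  # observation slightly off-grid
        "observed": [0.9],
    }
).sort("time")

resp = polars.DataFrame(
    {
        "response_key": ["FOPR", "FOPR"],
        "time": [datetime(2014, 9, 10), datetime(2014, 9, 11)],
        "values": [0.74, 0.81],
    }
).sort("time")

# Match each observation to the nearest response time within one day,
# instead of requiring exact timestamp equality.
joined = obs.join_asof(
    resp, on="time", by="response_key", strategy="nearest", tolerance="1d"
)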

-        self.observations: Dict[str, xr.Dataset] = self.enkf_obs.datasets
+        self.observations: Dict[str, polars.DataFrame] = self.enkf_obs.datasets

     def write_observations_to_folder(self, dest: Path) -> None:
Collaborator:
This is a nitpick, but should this function be here? Maybe it belongs with the observations?

Contributor (author):
Moved it to enkf_obs.
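
Roughly what that helper can look like after the move — a sketch assuming each observation dataset is now a polars.DataFrame keyed by response type (the names follow the diff above; the body is illustrative, not the PR's code):

from pathlib import Path

import polars


def write_observations_to_folder(
    datasets: dict[str, polars.DataFrame], dest: Path
) -> None:
    # One parquet file per response type, mirroring how responses are stored.
    dest.mkdir(parents=True, exist_ok=True)
    for response_type, df in datasets.items():
        df.write_parquet(dest / f"{response_type}.parquet")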

return polars.DataFrame(
    {
        "report_step": polars.Series(
Collaborator:
This made it much easier to read!

if self.observation_type == EnkfObservationImplementationType.GEN_OBS:
    datasets = []
    actual_response_key = self.data_key
Collaborator:
Just use self.data_key directly? Same on the next line; it seems it is only used once.

@@ -61,8 +80,12 @@ def __getitem__(self, key: str) -> ObsVector:
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, EnkfObs):
            return False

        if self.datasets.keys() != other.datasets.keys():
Collaborator:
Isn't this duplicated in ErtConfig?

Contributor (author):
It appears so, but this one is for EnkfObs, while the one in ErtConfig is for the dict mapping response type to observation dataset. Long-term we should maybe cut out EnkfObs and keep only the dict, but right now it is a bit duplicated and necessary.

@yngve-sk yngve-sk force-pushed the responses-as-parquet branch 4 times, most recently from 2769d67 to 41aae5d Compare October 4, 2024 10:35
@oyvindeide (Collaborator) left a comment:
LGTM! Nice job, just some minor comments.

@@ -183,6 +183,9 @@ def summary_observations(
        "error_mode": draw(st.sampled_from(ErrorMode)),
        "value": draw(positive_floats),
    }

    assume(kws["error_mode"] == ErrorMode.ABS or kws["error"] < 2)
Collaborator:
This is in a separate commit, but I think it affects logic from the first commit? If so, they should be squashed so the tests pass on all commits.

@@ -236,3 +236,36 @@ def test_that_mismatched_responses_gives_nan_measured_data(ert_config, prior_ens
    assert pd.isna(fopr_2.loc[0].iloc[0])
    assert pd.isna(fopr_2.loc[1].iloc[0])
    assert pd.isna(fopr_1.loc[2].iloc[0])


def test_reading_past_2263_is_ok(ert_config, storage, prior_ensemble):
Collaborator:
This should be squashed into the previous commit, since the bug is fixed there and the test belongs alongside it. Feel free to write a longer commit body for the first commit explaining the reason behind this change and its implications.

* Datetime reading past 2263 should now work; added a test asserting that it does
* Enforced f32 precision for observations & responses
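
For context, a small illustration of why millisecond-resolution datetimes matter here: pandas' nanosecond timestamps overflow shortly after year 2262, while polars' datetime[ms] comfortably represents later dates (illustrative, not the PR's test):

from datetime import datetime

import polars

# Representable with millisecond resolution; a datetime64[ns] representation
# would overflow for dates past 2262-04-11.
s = polars.Series("time", [datetime(2500, 1, 1)], dtype=polars.Datetime("ms"))
print(s[0])  # 2500-01-01 00:00:00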
@yngve-sk added the release-notes:unreleased-feature-changes (PR with changes to a feature which is not yet released; not for introduction of new features!), release-notes:bug-fix (automatically categorise as bug fix in release notes), and release-notes:breaking-change (automatically categorise as breaking change in release notes) labels on Oct 7, 2024.
@yngve-sk yngve-sk merged commit a52cebf into equinor:main Oct 7, 2024
55 of 56 checks passed
@yngve-sk removed the release-notes:unreleased-feature-changes label on Oct 7, 2024.
Successfully merging this pull request may close these issues.

Failed internalizing data from forward model