CsvComparison NaN Strategy [PRIORITY] #170

Open
andkay opened this issue Jan 11, 2022 · 8 comments
@andkay
Contributor

andkay commented Jan 11, 2022

Recent changes to the main branch have altered the behaviour of the CsvComparison.build method: after merging simulation data into benchmark data, records with no match are now dropped rather than infilled with 0s (pandas fillna -> dropna).

This change was required to calculate some newer benchmarks correctly, but in general it is dangerous. In some cases, the absence of evidence is itself evidence of absence (a missing record may be a legitimate zero).

To fix this, we need to introduce a drop_na attribute to this class that can be set by individual handler implementations as required.
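The two behaviours being discussed can be sketched in a few lines of pandas. This is a minimal illustration, not the actual CsvComparison.build implementation; the column and key names (`link_id`, `bench_count`, `sim_count`) are made up for the example.

```python
import pandas as pd

# Hypothetical benchmark and simulation tables, keyed by link id.
benchmark = pd.DataFrame({"link_id": ["a", "b", "c"], "bench_count": [100, 50, 75]})
simulation = pd.DataFrame({"link_id": ["a", "b"], "sim_count": [90, 60]})

merged = benchmark.merge(simulation, on="link_id", how="left")

# Old behaviour: treat missing simulation records as legitimate zeros.
filled = merged.fillna({"sim_count": 0})

# New behaviour: drop records with no simulation match entirely.
dropped = merged.dropna(subset=["sim_count"])

print(len(filled))   # 3 -- link "c" kept, with sim_count 0
print(len(dropped))  # 2 -- link "c" silently removed
```

The difference only shows up for benchmark records with no simulation counterpart, which is exactly the case at issue here.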

@andkay andkay added the invalid This doesn't seem right label Jan 11, 2022
@andkay andkay self-assigned this Jan 11, 2022
@fredshone
Contributor

Personally I think this behaviour is a useful default. It allows users to hack mode-specific or subpopulation-specific benchmarks, for example, by only providing the items they actually want to compare.

The OG benchmarks had a lot of logic to try to validate the join of data. I think this was of marginal use; ultimately, responsibility for benchmark data must sit with the user (benchmarks are an "advanced user" feature).

Perhaps we just add an index consistency check with a warning or debug log?

Also worth noting that I would like to relax all benchmarks (including the painful and complex OG cordons etc.) to use CSV comparisons. These will compare a subset of link volume counts, so will drop 99% of the table. Similar story for link speeds, trip data and so on, where we compare only a subset of sim data.

@andkay
Contributor Author

andkay commented Jan 17, 2022

Thanks for your thoughts. I definitely understand your motivation for this change, but I was imagining that benchmark authors could set this behaviour as appropriate for a specific benchmark. I'm not too worried about the benchmark data format, but rather about the simulation data.

When comparing simulation data that relies on accumulating or counting events, the absence of relevant events against the index should often be interpreted as a legitimate 0 value. Dropping these records (rather than filling them) could skew the benchmark score, in addition to confusing users.
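The point about event counting can be made concrete. In the sketch below (hypothetical data and names, not the actual handler code), links with zero events simply never appear in the counted output, so a join that drops non-matches would erase them even though zero is the correct value:

```python
import pandas as pd

# Hypothetical event log: only links that saw traffic appear at all.
events = pd.DataFrame({"link_id": ["a", "a", "b"]})
counts = events.groupby("link_id").size()

# Links in the benchmark index with no events are genuine zeros,
# not missing data -- so reindex and fill, rather than drop.
benchmark_index = ["a", "b", "c"]
counts = counts.reindex(benchmark_index, fill_value=0)

print(counts.to_dict())  # {'a': 2, 'b': 1, 'c': 0}
```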

@andkay
Contributor Author

andkay commented Jan 17, 2022

I don't think this is a problem for any of the existing tools, but as we've discussed before, we may want to rely increasingly on CsvComparison rather than building bespoke tools going forward. Exposing this behaviour in the interface will also have the benefit of alerting devs that it may need to be modified when creating new tools (depending on their goals, of course).

@fredshone
Contributor

Could we make two CsvComparison parent classes, e.g. CsvComparison and CsvComparisonZeroFill? This would be enough to put devs on the right track.

I am wary of option proliferation (I am aware I'm responsible).

@andkay
Contributor Author

andkay commented Jan 17, 2022

We could, theoretically, but it's not a great option in my opinion. Having two classes that are identical except for five characters controlling a specific pandas behaviour seems like a maintenance burden later.

To be clear, I'm not proposing that we expose this option to users via the config, only to developers through a class attribute.
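A sketch of what the class-attribute approach could look like. Only CsvComparison, build, and drop_na come from this thread; the merge key, column handling, and the SomeNewBenchmark subclass are invented for illustration and the real build signature will differ:

```python
import pandas as pd

class CsvComparison:
    # Class attribute: handler implementations override this as needed.
    drop_na = False

    def build(self, benchmark: pd.DataFrame, simulation: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical join key; the real method merges on the handler's index.
        merged = benchmark.merge(simulation, on="link_id", how="left")
        if self.drop_na:
            return merged.dropna()
        return merged.fillna(0)

class SomeNewBenchmark(CsvComparison):
    # Opt in to the dropping behaviour only where it is known to be safe.
    drop_na = True
```

This keeps a single class hierarchy and makes the choice an explicit, visible decision in each handler, rather than a hidden default or a near-duplicate class.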

@Theodore-Chatziioannou
Contributor

Silently dropping links with no simulation traffic or no ID match is very risky behaviour for a benchmark tool.
@fredshone I feel that a "left join" (with the left table being the benchmark one) would still support the use cases you mention, as opposed to the current dropna behaviour, which amounts to an inner join?

@Theodore-Chatziioannou
Contributor

Theodore-Chatziioannou commented Jan 17, 2022

  • outer join: all benchmark and simulation data are kept in the output file. I believe this is the case @fredshone wishes to avoid, because it keeps a lot of redundant information.
  • inner join: only the intersection of benchmark and simulation data is kept, hiding potential errors. This is the case that alarms @akay-arup and me.
  • left join: all benchmarks are kept. If a benchmark is not found in the simulation, a zero or NA value flags the issue for further investigation. If the user still wants less information in the output table, they can simply create a filtered version of the benchmark file themselves.
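The three options above map directly onto the `how` parameter of pandas merge. A minimal sketch with made-up data (column and key names are illustrative only):

```python
import pandas as pd

benchmark = pd.DataFrame({"link_id": ["a", "b"], "bench": [10, 20]})
simulation = pd.DataFrame({"link_id": ["b", "c"], "sim": [15, 30]})

outer = benchmark.merge(simulation, on="link_id", how="outer")  # 3 rows: a, b, c
inner = benchmark.merge(simulation, on="link_id", how="inner")  # 1 row: b
left = benchmark.merge(simulation, on="link_id", how="left")    # 2 rows: a, b

# With the left join, benchmark "a" survives with sim = NaN,
# flagging the missing simulation data instead of hiding it.
print(left["sim"].isna().sum())  # 1
```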

@fredshone
Contributor

Great. We just need to be clear about which table is the one on the left :)
