CsvComparison NaN Strategy [PRIORITY] #170

andkay · 2022-01-11T19:37:40Z

Recent changes to the main branch have changed the behaviour of the CsvComparison.build method to drop records with no match after merging simulation data to benchmark data, rather than infilling with 0s (pandas dataframe fillna -> dropna).

This change was required to calculate some newer benchmarks correctly, but in general is dangerous. In some cases, the absence of evidence <-> evidence of absence.

To fix this, we need to introduce a drop_na attribute to this class that can be set by individual handler implementations as required.

The text was updated successfully, but these errors were encountered:

fredshone · 2022-01-17T17:24:01Z

personally i think this behaviour is a useful default. It allows users to hack mode specific or subpop specific bms for example (by only providing the items they want to actually occur).

The OG bms had a lot of logic to try and validate the join of data - I think this was of marginal use - ultimately responsibility for bm data must sit with the user (bms are a "advances user" feature).

Perhaps we just add an index check for consistency and warning or debug log?

Also worth noting that i would like to relax all bms (including the painful and complex OG cordons etc) to use csv comparisons. These will compare a subset of link volume counts - so will drop 99% of the table. Similar story for link speeds, trip data and so on where we comparing only subset of sim data.

andkay · 2022-01-17T17:49:14Z

Thanks for your thoughts. Definitely understand your motivation for this change, but was imagining that benchmark authors could set this behaviour as appropriate for a specific benchmark. I'm not too worried about the bm data format -- but rather the simulation data.

When comparing simulation data that relies on accumulating/counting events, the absence of relevant events against the index should (often) be interpreted as a legitimate 0 value. Dropping these records (rather than filling) could potentially skew the bm score in addition to confusing users.

andkay · 2022-01-17T17:54:18Z

I don't think this is a problem for any of the existing tools -- but as we've discussed before, we may want to rely increasingly on CsvComparison rather than building bespoke tools going forward. Exposing this behaviour to the interface will also have the benefit of alerting devs that it may need to be modified when creating new tools (depending on their goals, of course).

fredshone · 2022-01-17T17:56:55Z

Could we make two CSVComparison parents? eg CSVComparison and CSVComparisonZeroFill? This would be enough to put devs on right track.

I am wary of option proliferation (I am aware I'm responsible).

andkay · 2022-01-17T18:05:58Z

We could, theoretically – but it’s not a great option in my opinion. Having two classes that are exactly the same except for 5-characters that control a specific pandas behaviour seems like a burden later.

To be clear, I’m not proposing that we expose this option to users via the config – only to developers through a class-attribute.

Theodore-Chatziioannou · 2022-01-17T18:15:05Z

Silently dropping links with no simulation traffic or no ID match is very risky behaviour for a benchmark tool.
@fredshone I feel that a "left join" (with the left table being the benchmarks one) would still work with the use cases you mention (as opposed to the current dropna behaviour, which is an inner join)?

Theodore-Chatziioannou · 2022-01-17T18:20:31Z

outer join: all benchmark and simulation data are kept in the output file. I believe this is the case @fredshone wishes to avoid, because it keeps a lot of redundant information
inner join: the intersection of benchmark and simulation data is kept, hiding potential errors. This is the case that alarms myself and @akay-arup
left join: all benchmarks are kept. If a benchmark is not found in the simulation, then a zero or NA value flags the issue for further investigation. If the user still wishes to have less information in the output table, they can simply create a filtered version of the benchmark file themselves.

fredshone · 2022-01-17T18:25:15Z

great - just need to be clear about what's the one on the left :)

andkay added the invalid This doesn't seem right label Jan 11, 2022

andkay self-assigned this Jan 11, 2022

andkay assigned Theodore-Chatziioannou Jan 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CsvComparison NaN Strategy [PRIORITY] #170

CsvComparison NaN Strategy [PRIORITY] #170

andkay commented Jan 11, 2022

fredshone commented Jan 17, 2022

andkay commented Jan 17, 2022 •

edited

Loading

andkay commented Jan 17, 2022

fredshone commented Jan 17, 2022

andkay commented Jan 17, 2022

Theodore-Chatziioannou commented Jan 17, 2022

Theodore-Chatziioannou commented Jan 17, 2022 •

edited

Loading

fredshone commented Jan 17, 2022

CsvComparison NaN Strategy [PRIORITY] #170

CsvComparison NaN Strategy [PRIORITY] #170

Comments

andkay commented Jan 11, 2022

fredshone commented Jan 17, 2022

andkay commented Jan 17, 2022 • edited Loading

andkay commented Jan 17, 2022

fredshone commented Jan 17, 2022

andkay commented Jan 17, 2022

Theodore-Chatziioannou commented Jan 17, 2022

Theodore-Chatziioannou commented Jan 17, 2022 • edited Loading

fredshone commented Jan 17, 2022

andkay commented Jan 17, 2022 •

edited

Loading

Theodore-Chatziioannou commented Jan 17, 2022 •

edited

Loading