-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CsvComparison NaN Strategy [PRIORITY] #170
Comments
personally i think this behaviour is a useful default. It allows users to hack mode specific or subpop specific bms for example (by only providing the items they want to actually occur). The OG bms had a lot of logic to try and validate the join of data - I think this was of marginal use - ultimately responsibility for bm data must sit with the user (bms are a "advances user" feature). Perhaps we just add an index check for consistency and warning or debug log? Also worth noting that i would like to relax all bms (including the painful and complex OG cordons etc) to use csv comparisons. These will compare a subset of link volume counts - so will drop 99% of the table. Similar story for link speeds, trip data and so on where we comparing only subset of sim data. |
Thanks for your thoughts. Definitely understand your motivation for this change, but was imagining that benchmark authors could set this behaviour as appropriate for a specific benchmark. I'm not too worried about the bm data format -- but rather the simulation data. When comparing simulation data that relies on accumulating/counting events, the absence of relevant events against the index should (often) be interpreted as a legitimate |
I don't think this is a problem for any of the existing tools -- but as we've discussed before, we may want to rely increasingly on CsvComparison rather than building bespoke tools going forward. Exposing this behaviour to the interface will also have the benefit of alerting devs that it may need to be modified when creating new tools (depending on their goals, of course). |
Could we make two CSVComparison parents? eg I am wary of option proliferation (I am aware I'm responsible). |
We could, theoretically – but it’s not a great option in my opinion. Having two classes that are exactly the same except for 5-characters that control a specific pandas behaviour seems like a burden later. To be clear, I’m not proposing that we expose this option to users via the config – only to developers through a class-attribute. |
Silently dropping links with no simulation traffic or no ID match is very risky behaviour for a benchmark tool. |
|
great - just need to be clear about what's the one on the left :) |
Recent changes to the main branch have changed the behaviour of the
CsvComparison.build
method to drop records with no match after merging simulation data to benchmark data, rather than infilling with 0s (pandas dataframefillna
->dropna
).This change was required to calculate some newer benchmarks correctly, but in general is dangerous. In some cases, the absence of evidence <-> evidence of absence.
To fix this, we need to introduce a
drop_na
attribute to this class that can be set by individual handler implementations as required.The text was updated successfully, but these errors were encountered: