Fix loopy simplification #107
Conversation
genet/core.py
Outdated
Simplifies network graph, retaining only nodes that are junctions
:param no_processes: Number of processes to split some computation across. The method is pretty fast though
and 1 process is often preferable --- there is overhead for splitting and joining the data.
:param keep_loops: bool, simplification often leads to
I think some of this comment fell down the back of the sofa... 😝
BUT WHAT DOES SIMPLIFICATION LEAD TO?!! The suspense is killing me!
genet/core.py
Outdated
pt_stop_loops = set(self.schedule.stop_attribute_data(keys=['linkRefId'])['linkRefId'])
to_remove = loops - pt_stop_loops
if to_remove:
    logging.info(f'Simplification led to {len(loops)} in the network. {len(to_remove)} are not connected'
This may even be worthy of a warning, rather than info
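The suggested change is a one-word swap; a minimal sketch of how the message would surface as a warning (the sets here are hypothetical stand-ins for the ones computed in core.py):

```python
import logging

# hypothetical values standing in for the real sets computed in core.py
loops = {"link_1", "link_2", "link_3"}
pt_stop_loops = {"link_3"}
to_remove = loops - pt_stop_loops

if to_remove:
    # a warning is harder to miss than info when links are about to be deleted
    logging.warning(
        f"Simplification led to {len(loops)} self-loops in the network. "
        f"{len(to_remove)} are not connected to PT stops and will be removed."
    )
```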
genet/core.py
Outdated
# mark graph as having been simplified
self.graph.graph["simplified"] = True

return to_remove
Is the idea to return the IDs of the nodes that were removed? If so, when keep_loops is true, should we be returning an empty set?
ah, maybe I should rename this. I want to return the same IDs either way, so if keep_loops is true someone can easily see what was removed. They could also go to the changelog for that, but I feel this is cleaner, and the changelog could contain other remove events too.
I decided against this in the end; no one has any business with link IDs that don't exist. After removal, those IDs can be reused for other things, so this would just get confusing --- something I didn't think of before and that came back to bite me in the ass in NZ the other day.
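The ID-reuse hazard described here can be shown with a minimal sketch (a hypothetical link store, not GeNet's actual data structures):

```python
# hypothetical link store illustrating the ID-reuse hazard
links = {"42": {"length": 10.0}}   # link "42" is a useless self-loop

removed_ids = set(links)           # a caller holds on to the removed IDs
links.clear()                      # the self-loop links are deleted

links["42"] = {"length": 99.0}     # later, ID "42" is reused for a new link

# the stale removed_ids set now names a live, unrelated link
stale = removed_ids & set(links)
```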
tests/test_core_network.py
Outdated
        puma_network_pre_simplify, puma_network_post_simplify):
    n = puma_network_pre_simplify

    stops_at_risk = ['5221390681543854913', '5221390302070799085', '5221390323679791901']
Are these PT stop IDs? Maybe they should belong to the fixture, so you have, say, a puma_network_with_pt_stops fixture that is the network and a list of the PT stop IDs that belong to the network. That way the implicit knowledge of the network, i.e. what the IDs of the PT stop nodes are, is contained in just one place - the fixture - rather than leaking out into the test (or more than one test).
So the test would look more like this (with a helper method assert_link_length - maybe there is a better, higher-level version of that, like assert_is_blah, that hides the detail of link length == 1 - assert_valid_pt_node, or whatever):
def test_simplify_does_not_oversimplify_PT_endpoints(puma_network_with_pt_stops):
network, pt_stop_ids = puma_network_with_pt_stops
for stop in pt_stop_ids:
assert_link_length(stop, 1, network)
network.simplify()
for stop in pt_stop_ids:
assert_link_length(stop, 1, network)
That looks more like the usual basic structure of a unit test:
- setup
- call code under test
- validate expectations
Also known as "Arrange, Act, Assert"
Or... maybe there is a way you can query the network and discover the PT stops inside the test, rather than having a pre-prepared list of the PT stop IDs? So then, with a helper method get_pt_stops, something like:
def test_simplify_does_not_oversimplify_PT_endpoints(puma_network_with_pt_stops):
for stop_id in get_pt_stops(puma_network_with_pt_stops):
assert_link_length(stop_id, 1, puma_network_with_pt_stops)
puma_network_with_pt_stops.simplify()
for stop_id in get_pt_stops(puma_network_with_pt_stops):
assert_link_length(stop_id, 1, puma_network_with_pt_stops)
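The suggested shape, made runnable with stand-ins (FakeNetwork, get_pt_stops and assert_link_length are hypothetical helpers for illustration, not GeNet's API):

```python
class FakeNetwork:
    """Hypothetical stand-in for genet's Network, just enough for the sketch."""

    def __init__(self, link_lengths, pt_stops):
        self.link_lengths = dict(link_lengths)
        self.pt_stops = list(pt_stops)

    def simplify(self):
        # pretend simplification merges away every link longer than 1
        self.link_lengths = {k: v for k, v in self.link_lengths.items() if v == 1}

def get_pt_stops(network):
    return network.pt_stops

def assert_link_length(link_id, expected, network):
    assert network.link_lengths[link_id] == expected

# Arrange
network = FakeNetwork({"stop_a": 1, "stop_b": 1, "chain": 7}, ["stop_a", "stop_b"])
# Act
network.simplify()
# Assert: PT stop links of length 1 survive simplification
for stop_id in get_pt_stops(network):
    assert_link_length(stop_id, 1, network)
```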
nah, so these are only a small subset of specific stops that were affected by the bug; the majority of other PT stops were fine. I guess there is a way to pull out all of those stops with some conditions, but that would require using a non-trivial amount of genet code which we're not testing here, so I'm not sure which is better?
Maybe keep the magic number hardcoding, but move it into the fixture with a descriptive name?
moved to a fixture with a descriptive name
tests/test_core_network.py
Outdated
def test_simplify_removes_loops_by_default_but_keeps_pt_stop_loops(
        puma_network_pre_simplify, puma_network_post_simplify):
    n = puma_network_pre_simplify
Again, the code under test is kind of hidden away in a fixture. Fixtures should just provide data to use for the test, really, they shouldn't be calling the code under test. They're part of the setup. We've also got some implicit knowledge of the network (particular link IDs) lurking in this test - is there a way to stop that knowledge from being spread around like this?
I have to test saving a simplified network because the data types change (because we merge multiple links, they will have an average or max of a numeric attribute, but, for example, osm tags will all be retained in a set or list) and I need to be sure that the network is still valid under MATSim. I would like to keep using a puma network that we simplify here because it's close to what we deal with in real life. I added something like this - is that a good compromise? If something goes wrong with simplification, it's caught and a message directs away from this test.
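The attribute-merging behaviour described here can be sketched as follows (a hedged illustration of the idea; the aggregation choices and attribute names are assumptions, not GeNet's exact merge rules):

```python
def merge_link_attributes(links):
    """Merge attributes of links being collapsed into one simplified link."""
    merged = {}
    # numeric attributes get aggregated (sum for length here; other numeric
    # attributes might use an average or max)
    merged["length"] = sum(link["length"] for link in links)
    # categorical OSM tags are all retained, collected into a set
    merged["osm:way:highway"] = {link["osm:way:highway"] for link in links}
    return merged

links = [
    {"length": 10.0, "osm:way:highway": "primary"},
    {"length": 5.0, "osm:way:highway": "primary_link"},
]
merged = merge_link_attributes(links)
```

The type change is the point: a scalar string attribute on the inputs becomes a set on the merged link, which is why serialisation of simplified networks needs its own validation.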
tests/test_core_network.py
Outdated
def test_simplified_network_saves_to_correct_dtds(tmpdir, network_dtd, schedule_dtd, puma_network_post_simplify):
    deleted_self_loops, n = puma_network_post_simplify
    n.write_to_matsim(tmpdir)
We're kind of testing two things at once here - network simplification, and network serialisation. This test could therefore presumably fail if either the serialisation code was broken, or there was something funky going on in the simplification that meant it couldn't be serialised properly. We want the number of things that can cause our tests to fail to be as small as possible. Is there a way to reduce the number of things that can make this test fail?
@@ -32,7 +32,7 @@ jobs:
          pip install -e .
      - name: Lint with flake8
        run: |
-         flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,tests,notebooks
+         flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,tests,notebooks,venv
Lately I've been thinking that we should be linting our tests too.
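Dropping tests from the exclude list would bring the test suite under flake8 too; the run line would become something like (a sketch of the suggested change, not a committed config):

```shell
flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,notebooks,venv
```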
tests/test_core_network.py
Outdated
-    report = n.generate_validation_report()
+    report = puma_network.generate_validation_report()
Validating the report feels like a separate test.
moved to two separate tests
tests/test_core_network.py
Outdated
try:
    puma_network.simplify()
except Exception as e:
    raise RuntimeError(f"Error simplifying network: {e}, check other simplification tests")
I wasn't thinking about simplify possibly throwing an exception - if it does that, it will be clear what caused the test to fail, without wrapping and rethrowing the exception. I don't think rethrowing the exception adds anything.
What I was driving at is that we're calling simplify - what if it doesn't throw an exception, but it's broken in some way? Testing around that, and that alone, would have a network, simplify it, then make assertions about it (without serialising it before making those assertions). I think we have tests like that, right?
But if we also write the network and it doesn't validate against the DTD, is that because simplify has a bug, or because write_to_matsim has a bug? Or both? In an ideal world, we test write by saying "given a known network, when we serialise it to disk, it should blah, blah (validate against the DTD, etc.)". Whether or not the network was simplified first shouldn't matter; we're just testing that a network serialises properly.
It seems like the thing we're testing here is that a simplified network serialises correctly, so is there a way to have a simplified network with which to test that without calling simplify? If that's possible, it would be a better way to test the serialisation of the simplified network. As it stands, this is kind of a pipeline of two operations with some assertions at the end of the pipeline.
changed to testing a dedicated fixture with data schema representative of a simplified network
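A sketch of that dedicated-fixture idea: data hand-built to look like post-simplification output, so serialisation can be tested without calling simplify (the attribute names and schema here are hypothetical, not GeNet's exact representation):

```python
# Hand-built record shaped like post-simplification output: merged links
# carry collection-valued OSM tags instead of scalars (schema hypothetical).
simplified_link = {
    "id": "link_1",
    "length": 52.0,                                # aggregated over merged links
    "osm:way:highway": {"primary", "secondary"},   # tags retained as a set
}

def looks_simplified(link):
    """A merged link is recognisable by its collection-valued OSM tag."""
    return isinstance(link["osm:way:highway"], (set, list))
```

A serialisation test can then feed this fixture data straight to the writer, so a failure implicates only the serialisation code.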
link_ids_post_simplify = set(dict(n.links()).keys())
def test_simplifing_puma_network_results_in_correct_record_of_simplified_links(puma_network):
I like this - easier for my brain to parse :-)
def test_simplified_network_saves_to_correct_dtds(tmpdir, network_dtd, network_with_simplified_schema):
    network_with_simplified_schema.write_to_matsim(tmpdir)
👍
Problem
During the simplification process, some links get simplified to loops. In cases of teleported PT, this results in undesirable routes.
For example, given a rail stop sequence, the expected teleported route travels directly between consecutive stops. Right now GeNet oversimplifies the end points, resulting in long loops in the route. The geometry is retained, so the links look like lines, but they are actually very big loop links.
Not tested, but the likely MATSim consequence is agents not being able to access the service at stops A and D, as MATSim does not use our complex geometry (it's just an additional attribute).
Fix
Changed the definition of 'end-points' during simplification - these are the nodes retained in the network during the process. We now also retain nodes that have self-loops to begin with, thus protecting our PT stops from being simplified away. This results in a small increase in retained nodes and fewer links being simplified.
Difference in the number of links being simplified in the test network:
Before: 6838
After: 6829
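The endpoint rule can be sketched with a tiny stdlib-only graph (a simplified predicate for illustration, not GeNet's actual implementation):

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets for a small toy graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def is_endpoint(adj, node):
    """Simplified endpoint rule: keep terminals/junctions and self-loop nodes."""
    if node in adj[node]:       # self-loop nodes (e.g. PT stops) are protected
        return True
    return len(adj[node]) != 2  # degree-2 nodes are pass-throughs

# path a-b-c-d, with a small self-loop at b standing in for a PT stop link
adj = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("b", "b")])
endpoints = {n for n in adj if is_endpoint(adj, n)}
```

Without the self-loop check, b would look like an ordinary pass-through and get merged away, which is exactly the oversimplification the fix prevents.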
Bonus
Now, by default, all links that were simplified to self-loops will be removed (as they are useless), unless they are used by PT (we use small self-loops as PT stops) or an additional parameter is given to retain them.
This does not affect the connectivity of the graph.
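The default removal behaviour boils down to simple set arithmetic, as in the core.py snippet discussed earlier (IDs and parameter name here are illustrative):

```python
# links that ended up as self-loops after simplification (hypothetical IDs)
loops = {"loop_1", "loop_2", "pt_loop"}
# self-loop links referenced by PT stops (via linkRefId) must survive
pt_stop_loops = {"pt_loop"}

keep_loops = False  # the new opt-in parameter retains all loops
to_remove = set() if keep_loops else loops - pt_stop_loops
```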