
Fix loopy simplification #107

Merged

13 commits merged into master from fix-loopy-simplification on Feb 24, 2022

Conversation

KasiaKoz
Collaborator

@KasiaKoz KasiaKoz commented Jan 31, 2022

Problem

During the simplification process, some links get simplified to loops. In cases of teleported PT, this results in undesirable routes.

For example, given a rail stop sequence:

[A, B, C, D]

And the expected, teleported, route is:

[A-A, A-B, B-B, B-C, C-C, C-D, D-D]

Right now GeNet oversimplifies the end points, resulting in long loops; the route becomes:

[S1, B-B, B-C, C-C, S2]

The geometry is retained, so these look like lines, but they are actually very big loop links, e.g.

S1: 
   from node: B 
   to node: B 

Not tested, but the likely MATSim consequence is agents being unable to access the service at stops A and D, since MATSim does not use our complex geometry (it's just an additional attribute).
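For illustration, the buggy links can be spotted by comparing each link's from and to nodes. A minimal sketch (the dict-based link structure and the lengths here are invented for illustration, not GeNet's actual data model):

```python
# Illustrative link records after oversimplification: S1 and S2 are
# topological self-loops (from == to) despite carrying long geometry.
links = {
    'S1': {'from': 'B', 'to': 'B', 'length': 1200.0},
    'B-C': {'from': 'B', 'to': 'C', 'length': 300.0},
    'S2': {'from': 'C', 'to': 'C', 'length': 950.0},
}

def find_self_loops(links):
    """Return IDs of links whose from and to node are the same."""
    return {link_id for link_id, data in links.items()
            if data['from'] == data['to']}

print(sorted(find_self_loops(links)))  # ['S1', 'S2']
```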

Fix

Change how 'end-points' are defined during simplification; these are the nodes retained in the network during the process. We now retain nodes that are self-loops to begin with, thus protecting our PT stops from being simplified away. This results in a small increase in retained nodes and fewer links being simplified.
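The idea behind the fix can be sketched in plain Python. This is a simplified stand-in for GeNet's real endpoint logic; the edge list and the pass-through rule are illustrative only:

```python
# Directed edges of a toy network: A carries a self-loop (a PT stop);
# B and C are plain pass-through nodes.
edges = [('A', 'A'), ('A', 'B'), ('B', 'C'), ('C', 'D')]

def is_endpoint(edges, node):
    """Decide whether a node is retained during simplification.

    Self-loop nodes are always retained (the fix, protecting PT stops);
    otherwise only junctions and dead ends count as endpoints.
    """
    if (node, node) in edges:  # self-loop: always retain
        return True
    neighbours = ({v for u, v in edges if u == node}
                  | {u for u, v in edges if v == node})
    neighbours.discard(node)
    return len(neighbours) != 2  # junction or dead end

print(is_endpoint(edges, 'A'))  # True: has a self-loop
print(is_endpoint(edges, 'B'))  # False: plain pass-through node
```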

Difference in the number of links being simplified in the test network:
Before: 6838
After: 6829

Bonus

Now, by default, all links that were simplified to self-loops will be removed (as they are useless), unless they are used by PT (we use small self-loops as PT stops) or an additional parameter is given to retain them.

This does not affect the connectivity of the graph.
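The default clean-up described above boils down to a set difference, roughly (the function and argument names here are hypothetical, not GeNet's actual API):

```python
def loops_to_remove(simplification_loops, pt_stop_link_refs):
    """Self-loop links created by simplification, minus those that
    serve as PT stops (which must be kept)."""
    return set(simplification_loops) - set(pt_stop_link_refs)

# 'stop_1' is referenced by a PT stop, so it survives the clean-up.
print(loops_to_remove({'S1', 'S2', 'stop_1'}, {'stop_1'}))
```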

Matesto pass

[Three screenshots, 7 Feb 2022]

@KasiaKoz KasiaKoz added the bug Something isn't working label Jan 31, 2022
@KasiaKoz KasiaKoz marked this pull request as ready for review February 7, 2022 17:19
genet/core.py Outdated
Simplifies network graph, retaining only nodes that are junctions
:param no_processes: Number of processes to split some computation across. The method is pretty fast though
and 1 process is often preferable --- there is overhead for splitting and joining the data.
:param keep_loops: bool, simplification often leads to
Contributor

I think some of this comment fell down the back of the sofa... 😝

Contributor

BUT WHAT DOES SIMPLIFICATION LEAD TO?!! The suspense is killing me!

genet/core.py Outdated
pt_stop_loops = set(self.schedule.stop_attribute_data(keys=['linkRefId'])['linkRefId'])
to_remove = loops - pt_stop_loops
if to_remove:
logging.info(f'Simplification led to {len(loops)} in the network. {len(to_remove)} are not connected'
Contributor

This may even be worthy of a warning, rather than info

genet/core.py Outdated
# mark graph as having been simplified
self.graph.graph["simplified"] = True

return to_remove
Contributor

Is the idea to return the IDs of the nodes that were removed? If so, when keep_loops is true, should we be returning an empty set?

Collaborator Author

Ah, maybe I should rename this. I want to return the same IDs either way, so if keep_loops is true someone knows easily what was removed. They could also go to the changelog for that, but I feel this is cleaner; with the changelog there could be other remove events.

Collaborator Author

I decided against this in the end; no one has any business with link IDs that don't exist. After removal, those IDs can be reused for other things, so this would just get confusing. Something I didn't think of before, and it came back to bite me in NZ the other day.

tests/test_core_network.py Outdated (thread resolved)
tests/test_core_network.py (thread resolved)
puma_network_pre_simplify, puma_network_post_simplify):
n = puma_network_pre_simplify

stops_at_risk = ['5221390681543854913', '5221390302070799085', '5221390323679791901']
Contributor

Are these PT stop IDs? Maybe they should belong to the fixture, so you have, say, a puma_network_with_pt_stops fixture that is the network and a list of the PT stop IDs that belong to the network. That way the implicit knowledge of the network, i.e. what the IDs of the PT stop nodes are, is contained in just one place - the fixture - rather than leaking out into the test (or more than one test).

So the test would look more like this, with a helper method assert_link_length (maybe there is a better, higher-level version of that which hides the detail of link length == 1, like assert_valid_pt_node or whatever):

def test_simplify_does_not_oversimplify_PT_endpoints(puma_network_with_pt_stops):
    network, pt_stop_ids = puma_network_with_pt_stops
    for stop in pt_stop_ids:
        assert_link_length(stop, 1, network)

    network.simplify()

    for stop in pt_stop_ids:
        assert_link_length(stop, 1, network)

That looks more like the usual basic structure of a unit test:

  • setup
  • call code under test
  • validate expectations

Also known as "Arrange, Act, Assert"

Or... maybe there is a way you can query the network and discover the PT stops, inside the test, rather than having a pre-prepared list of the PT stop IDs? So then, with a helper method get_pt_stops, something like:

def test_simplify_does_not_oversimplify_PT_endpoints(puma_network_with_pt_stops):
    for stop_id in get_pt_stops(puma_network_with_pt_stops):
        assert_link_length(stop_id, 1, puma_network_with_pt_stops)
        
    puma_network_with_pt_stops.simplify()
    
    for stop_id in get_pt_stops(puma_network_with_pt_stops):
        assert_link_length(stop_id, 1, puma_network_with_pt_stops)

Collaborator Author

Nah, these are only a small subset of specific stops that were affected by the bug; the majority of other PT stops were fine. I guess there is a way to pull out all of those stops with some conditions, but that would require using a non-trivial amount of GeNet code which we're not testing here, so I'm not sure which is better?

Contributor

Maybe keep the magic number hardcoding, but move it into the fixture with a descriptive name?

Collaborator Author

moved to a fixture with a descriptive name


def test_simplify_removes_loops_by_default_but_keeps_pt_stop_loops(
puma_network_pre_simplify, puma_network_post_simplify):
n = puma_network_pre_simplify
Contributor

Again, the code under test is kind of hidden away in a fixture. Fixtures should just provide data to use for the test, really, they shouldn't be calling the code under test. They're part of the setup. We've also got some implicit knowledge of the network (particular link IDs) lurking in this test - is there a way to stop that knowledge from being spread around like this?

Collaborator Author

I have to test saving a simplified network because the data types change (because we merge multiple links, they will have an average or max of a numeric attribute, but, for example, OSM tags will all be retained in a set or list), and I need to be sure that the network is still valid under MATSim. I would like to keep using a puma network that we simplify here because it's close to what we deal with in real life. I added something like this; is that a good compromise? If something goes wrong with simplification, it's caught and a message directs away from this test.

[Screenshot of the test code, 10 Feb 2022]


def test_simplified_network_saves_to_correct_dtds(tmpdir, network_dtd, schedule_dtd, puma_network_post_simplify):
deleted_self_loops, n = puma_network_post_simplify
n.write_to_matsim(tmpdir)
Contributor

We're kind of testing two things at once here - network simplification, and network serialisation. This test could therefore presumably fail if either the serialisation code was broken, or there was something funky going on in the simplification that meant it couldn't be serialised properly. We want the number of things that can cause our tests to fail to be as small as possible. Is there a way to reduce the number of things that can make this test fail?

@KasiaKoz KasiaKoz requested a review from mfitz February 10, 2022 13:16
@@ -32,7 +32,7 @@ jobs:
pip install -e .
- name: Lint with flake8
run: |
flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,tests,notebooks
flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,tests,notebooks,venv
Contributor

Lately I've been thinking that we should be linting our tests too.


report = n.generate_validation_report()
report = puma_network.generate_validation_report()
Contributor

Validating the report feels like a separate test.

Collaborator Author

moved to two separate tests

try:
    puma_network.simplify()
except Exception as e:
    raise RuntimeError(f"Error simplifying network: {e}, check other simplification tests")
Contributor

I wasn't thinking about simplify possibly throwing an exception - if it does that, it will be clear what caused the test to fail, without wrapping and rethrowing the exception. I don't think rethrowing the exception adds anything.

What I was driving at is that we're calling simplify - what if it doesn't throw an exception, but it's broken in some way? Testing around that and that alone would have a network, simplify it, then make assertions about it (without serialising it before making those assertions). I think we have tests like that, right?

But if we also write the network and it doesn't validate against the DTD, is that because simplify has a bug, or because write_to_matsim has a bug? Or both? In an ideal world, we test write by saying "given a known network, when we serialise it to disk, it should blah, blah (validate against the DTD, etc.)". Whether or not the network was simplified first shouldn't matter, we're just testing that a network serialises properly.

It seems like the thing we're testing here is that a simplified network serialises correctly, so is there a way to have a simplified network with which to test that without calling simplify? If that's possible, it would be a better way to test the serialisation of the simplified network. As it stands, this is kind of a pipeline of two operations with some assertions at the end of the pipeline.

Collaborator Author

changed to testing a dedicated fixture with data schema representative of a simplified network
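That compromise could look something like a hand-built record that mimics the post-simplification data schema, so the serialisation test never has to call simplify(). All names and attribute values below are hypothetical, sketched from the data-type changes described above:

```python
def make_simplified_link():
    """A link record mimicking post-simplification data types: merged
    numeric attributes, set-valued OSM tags, and the original link IDs."""
    return {
        'id': '12',
        'from': 'A',
        'to': 'D',
        'length': 1520.0,  # summed over the merged links
        'attributes': {'osm:way:highway': {'primary', 'secondary'}},
        'ids': ['1', '4', '7'],  # pre-merge link IDs
    }

link = make_simplified_link()
print(len(link['ids']))  # 3
```

A pytest fixture wrapping make_simplified_link() would then feed the write_to_matsim/DTD test without exercising the simplification code at all.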

@KasiaKoz KasiaKoz requested a review from mfitz February 23, 2022 18:30

link_ids_post_simplify = set(dict(n.links()).keys())
def test_simplifing_puma_network_results_in_correct_record_of_simplified_links(puma_network):
Contributor

I like this - easier for my brain to parse :-)



def test_simplified_network_saves_to_correct_dtds(tmpdir, network_dtd, network_with_simplified_schema):
network_with_simplified_schema.write_to_matsim(tmpdir)
Contributor

👍

@KasiaKoz KasiaKoz merged commit 1597daa into master Feb 24, 2022
@KasiaKoz KasiaKoz deleted the fix-loopy-simplification branch February 24, 2022 13:49