Fix loopy simplification #107
Conversation
genet/core.py
Outdated
Simplifies network graph, retaining only nodes that are junctions
:param no_processes: Number of processes to split some computation across. The method is pretty fast though
and 1 process is often preferable --- there is overhead for splitting and joining the data.
:param keep_loops: bool, simplification often leads to
I think some of this comment fell down the back of the sofa... 😝
BUT WHAT DOES SIMPLIFICATION LEAD TO?!! The suspense is killing me!
genet/core.py
Outdated
pt_stop_loops = set(self.schedule.stop_attribute_data(keys=['linkRefId'])['linkRefId'])
to_remove = loops - pt_stop_loops
if to_remove:
    logging.info(f'Simplification led to {len(loops)} in the network. {len(to_remove)} are not connected'
This may even be worthy of a warning, rather than info
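The suggested change is a one-word swap; a minimal sketch of how the message would surface as a warning (the sets here are hypothetical stand-ins for the ones computed in core.py):

```python
import logging

# hypothetical values standing in for the real sets computed in core.py
loops = {"link_1", "link_2", "link_3"}
pt_stop_loops = {"link_3"}
to_remove = loops - pt_stop_loops

if to_remove:
    # a warning is harder to miss than info when links are about to be deleted
    logging.warning(
        f"Simplification led to {len(loops)} self-loops in the network. "
        f"{len(to_remove)} are not connected to PT stops and will be removed."
    )
```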
genet/core.py
Outdated
# mark graph as having been simplified
self.graph.graph["simplified"] = True

return to_remove
Is the idea to return the IDs of the nodes that were removed? If so, when keep_loops is true, should we be returning an empty set?
ah, maybe I should rename this. I want to return the same IDs either way, so if keep_loops is true someone can easily see what was removed. They could also go to the changelog for that, but I feel this is cleaner, and the changelog could contain other remove events too.
I decided against this in the end; no one has any business with link IDs that don't exist. After removal, those IDs can be reused for other things, so this would just get confusing --- something I didn't think of before and that came back to bite me in the ass in NZ the other day.
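The ID-reuse hazard described here can be shown with a minimal sketch (a hypothetical link store, not GeNet's actual data structures):

```python
# hypothetical link store illustrating the ID-reuse hazard
links = {"42": {"length": 10.0}}   # link "42" is a useless self-loop

removed_ids = set(links)           # a caller holds on to the removed IDs
links.clear()                      # the self-loop links are deleted

links["42"] = {"length": 99.0}     # later, ID "42" is reused for a new link

# the stale removed_ids set now names a live, unrelated link
stale = removed_ids & set(links)
```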
tests/test_core_network.py
Outdated
        puma_network_pre_simplify, puma_network_post_simplify):
    n = puma_network_pre_simplify

    stops_at_risk = ['5221390681543854913', '5221390302070799085', '5221390323679791901']
Are these PT stop IDs? Maybe they should belong to the fixture, so you have, say, a puma_network_with_pt_stops fixture that is the network and a list of the PT stop IDs that belong to the network. That way the implicit knowledge of the network, i.e. what the IDs of the PT stop nodes are, is contained in just one place - the fixture - rather than leaking out into the test (or more than one test).
So the test would look more like this (with a helper method assert_link_length - maybe there is a better, higher-level version of that, like assert_is_blah, that hides the detail of link length == 1 - assert_valid_pt_node, or whatever):
def test_simplify_does_not_oversimplify_PT_endpoints(puma_network_with_pt_stops):
network, pt_stop_ids = puma_network_with_pt_stops
for stop in pt_stop_ids:
assert_link_length(stop, 1, network)
network.simplify()
for stop in pt_stop_ids:
assert_link_length(stop, 1, network)
That looks more like the usual basic structure of a unit test:
- setup
- call code under test
- validate expectations
Also known as "Arrange, Act, Assert"
Or... maybe there is a way you can query the network and discover the PT stops inside the test, rather than having a pre-prepared list of the PT stop IDs? So then, with a helper method get_pt_stops, something like:
def test_simplify_does_not_oversimplify_PT_endpoints(puma_network_with_pt_stops):
for stop_id in get_pt_stops(puma_network_with_pt_stops):
assert_link_length(stop_id, 1, puma_network_with_pt_stops)
puma_network_with_pt_stops.simplify()
for stop_id in get_pt_stops(puma_network_with_pt_stops):
assert_link_length(stop_id, 1, puma_network_with_pt_stops)
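The suggested shape, made runnable with stand-ins (FakeNetwork, get_pt_stops and assert_link_length are hypothetical helpers for illustration, not GeNet's API):

```python
class FakeNetwork:
    """Hypothetical stand-in for genet's Network, just enough for the sketch."""

    def __init__(self, link_lengths, pt_stops):
        self.link_lengths = dict(link_lengths)
        self.pt_stops = list(pt_stops)

    def simplify(self):
        # pretend simplification merges away every link longer than 1
        self.link_lengths = {k: v for k, v in self.link_lengths.items() if v == 1}

def get_pt_stops(network):
    return network.pt_stops

def assert_link_length(link_id, expected, network):
    assert network.link_lengths[link_id] == expected

# Arrange
network = FakeNetwork({"stop_a": 1, "stop_b": 1, "chain": 7}, ["stop_a", "stop_b"])
# Act
network.simplify()
# Assert: PT stop links of length 1 survive simplification
for stop_id in get_pt_stops(network):
    assert_link_length(stop_id, 1, network)
```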
nah, so these are only a small subset of specific stops that were affected by the bug; the majority of other PT stops were fine. I guess there is a way to pull out all of those stops with some conditions, but that would require using a non-trivial amount of genet code which we're not testing here, so I'm not sure which is better?
Maybe keep the magic number hardcoding, but move it into the fixture with a descriptive name?
moved to a fixture with a descriptive name
tests/test_core_network.py
Outdated
def test_simplify_removes_loops_by_default_but_keeps_pt_stop_loops(
        puma_network_pre_simplify, puma_network_post_simplify):
    n = puma_network_pre_simplify
Again, the code under test is kind of hidden away in a fixture. Fixtures should just provide data to use for the test, really, they shouldn't be calling the code under test. They're part of the setup. We've also got some implicit knowledge of the network (particular link IDs) lurking in this test - is there a way to stop that knowledge from being spread around like this?
I have to test saving a simplified network because the data types change (because we merge multiple links, they will have an average or max of a numeric attribute, but, for example, osm tags will all be retained in a set or list) and I need to be sure that the network is still valid under MATSim. I would like to keep using a puma network that we simplify here because it's close to what we deal with in real life. I added something like this - is that a good compromise? If something goes wrong with simplification, it's caught and a message directs away from this test.
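The attribute-merging behaviour described here can be sketched as follows (a hedged illustration of the idea; the aggregation choices and attribute names are assumptions, not GeNet's exact merge rules):

```python
def merge_link_attributes(links):
    """Merge attributes of links being collapsed into one simplified link."""
    merged = {}
    # numeric attributes get aggregated (sum for length here; other numeric
    # attributes might use an average or max)
    merged["length"] = sum(link["length"] for link in links)
    # categorical OSM tags are all retained, collected into a set
    merged["osm:way:highway"] = {link["osm:way:highway"] for link in links}
    return merged

links = [
    {"length": 10.0, "osm:way:highway": "primary"},
    {"length": 5.0, "osm:way:highway": "primary_link"},
]
merged = merge_link_attributes(links)
```

The type change is the point: a scalar string attribute on the inputs becomes a set on the merged link, which is why serialisation of simplified networks needs its own validation.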
tests/test_core_network.py
Outdated
def test_simplified_network_saves_to_correct_dtds(tmpdir, network_dtd, schedule_dtd, puma_network_post_simplify):
    deleted_self_loops, n = puma_network_post_simplify
    n.write_to_matsim(tmpdir)
We're kind of testing two things at once here - network simplification, and network serialisation. This test could therefore presumably fail if either the serialisation code was broken, or there was something funky going on in the simplification that meant it couldn't be serialised properly. We want the number of things that can cause our tests to fail to be as small as possible. Is there a way to reduce the number of things that can make this test fail?
@@ -32,7 +32,7 @@ jobs:
          pip install -e .
      - name: Lint with flake8
        run: |
-         flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,tests,notebooks
+         flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,tests,notebooks,venv
Lately I've been thinking that we should be linting our tests too.
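Dropping tests from the exclude list would bring the test suite under flake8 too; the run line would become something like (a sketch of the suggested change, not a committed config):

```shell
flake8 . --max-line-length 120 --count --show-source --statistics --exclude=scripts,notebooks,venv
```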
tests/test_core_network.py
Outdated
-    report = n.generate_validation_report()
+    report = puma_network.generate_validation_report()
Validating the report feels like a separate test.
moved to two separate tests
tests/test_core_network.py
Outdated
try:
    puma_network.simplify()
except Exception as e:
    raise RuntimeError(f"Error simplifying network: {e}, check other simplification tests")
I wasn't thinking about simplify possibly throwing an exception - if it does that, it will be clear what caused the test to fail, without wrapping and rethrowing the exception. I don't think rethrowing the exception adds anything.
What I was driving at is that we're calling simplify - what if it doesn't throw an exception, but it's broken in some way? Testing around that, and that alone, would have a network, simplify it, then make assertions about it (without serialising it before making those assertions). I think we have tests like that, right?
But if we also write the network and it doesn't validate against the DTD, is that because simplify has a bug, or because write_to_matsim has a bug? Or both? In an ideal world, we test write by saying "given a known network, when we serialise it to disk, it should blah, blah (validate against the DTD, etc.)". Whether or not the network was simplified first shouldn't matter; we're just testing that a network serialises properly.
It seems like the thing we're testing here is that a simplified network serialises correctly, so is there a way to have a simplified network with which to test that without calling simplify? If that's possible, it would be a better way to test the serialisation of the simplified network. As it stands, this is kind of a pipeline of two operations with some assertions at the end of the pipeline.
changed to testing a dedicated fixture with data schema representative of a simplified network
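A sketch of that dedicated-fixture idea: data hand-built to look like post-simplification output, so serialisation can be tested without calling simplify (the attribute names and schema here are hypothetical, not GeNet's exact representation):

```python
# Hand-built record shaped like post-simplification output: merged links
# carry collection-valued OSM tags instead of scalars (schema hypothetical).
simplified_link = {
    "id": "link_1",
    "length": 52.0,                                # aggregated over merged links
    "osm:way:highway": {"primary", "secondary"},   # tags retained as a set
}

def looks_simplified(link):
    """A merged link is recognisable by its collection-valued OSM tag."""
    return isinstance(link["osm:way:highway"], (set, list))
```

A serialisation test can then feed this fixture data straight to the writer, so a failure implicates only the serialisation code.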
link_ids_post_simplify = set(dict(n.links()).keys())
def test_simplifing_puma_network_results_in_correct_record_of_simplified_links(puma_network):
I like this - easier for my brain to parse :-)
def test_simplified_network_saves_to_correct_dtds(tmpdir, network_dtd, network_with_simplified_schema):
    network_with_simplified_schema.write_to_matsim(tmpdir)
👍
Problem
During the simplification process, some links get simplified to loops. In cases of teleported PT, this results in undesirable routes.
For example, given a rail stop sequence, the expected teleported route travels directly between consecutive stops. Right now GeNet oversimplifies the end points, resulting in long loops in the route. The geometry is retained, so the links look like lines, but they are actually very big loop links.
Not tested, but the likely MATSim consequence is agents not being able to access the service at stops A and D, as MATSim does not use our complex geometry (it's just an additional attribute).
Fix
Changed the definition of 'end-points' during simplification - these are the nodes retained in the network during the process. We now also retain nodes that have self-loops to begin with, thus protecting our PT stops from being simplified away. This results in a small increase in retained nodes and fewer links being simplified.
Difference in the number of links being simplified in the test network:
Before: 6838
After: 6829
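The endpoint rule can be sketched with a tiny stdlib-only graph (a simplified predicate for illustration, not GeNet's actual implementation):

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets for a small toy graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def is_endpoint(adj, node):
    """Simplified endpoint rule: keep terminals/junctions and self-loop nodes."""
    if node in adj[node]:       # self-loop nodes (e.g. PT stops) are protected
        return True
    return len(adj[node]) != 2  # degree-2 nodes are pass-throughs

# path a-b-c-d, with a small self-loop at b standing in for a PT stop link
adj = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("b", "b")])
endpoints = {n for n in adj if is_endpoint(adj, n)}
```

Without the self-loop check, b would look like an ordinary pass-through and get merged away, which is exactly the oversimplification the fix prevents.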
Bonus
Now, by default, all links that were simplified to self-loops will be removed (as they are useless), unless they are used by PT (we use small self-loops as PT stops) or an additional parameter is given to retain them.
This does not affect the connectivity of the graph.
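The default removal behaviour boils down to simple set arithmetic, as in the core.py snippet discussed earlier (IDs and parameter name here are illustrative):

```python
# links that ended up as self-loops after simplification (hypothetical IDs)
loops = {"loop_1", "loop_2", "pt_loop"}
# self-loop links referenced by PT stops (via linkRefId) must survive
pt_stop_loops = {"pt_loop"}

keep_loops = False  # the new opt-in parameter retains all loops
to_remove = set() if keep_loops else loops - pt_stop_loops
```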