
Extend clean_intersections to return a new graph, topologically equal to the input graph, but with clusters of nodes merged into a single centroid node. #249

Merged
29 commits merged into gboeing:clean_intersections from ppintosilva:master on Jun 10, 2019

Conversation

ppintosilva
Contributor

Hi @gboeing. I'm happy to put forward the first version of the new clean_intersections() method. The method now returns the simplified graph instead of just the intersection centroids.

I've added comments to the code, which hopefully will be easy to follow, but I'm happy to write down the main steps of the algorithm. The method doesn't appear to be (too) slow, but there is probably room for improvement.

There are a few nuances to look out for, e.g. when choosing between multiple geometries, but I think that the final decision was sensible.

Finally, the simplified graph can end up having a lot of nodes of degree 2 (in its undirected version). Therefore, further simplification can be achieved by removing these nodes. I'm working on this, but maybe this should be a separate method.

I'm looking forward to your input. Cheers, Pedro.

@ppintosilva
Contributor Author

ppintosilva commented Jan 28, 2019

Before and after pics:

  • 2700 Shattuck Ave, Berkeley, CA
  • Newcastle main road network (secondary and higher ranked roads)
[Before/after plots: shattuck_before → shattuck_after; nwc_main_before → nwc_main_after]

Edit: visuals

@coveralls

coveralls commented Jan 28, 2019


Coverage increased (+0.1%) to 93.685% when pulling 877ba56 on ppintosilva:master into 54f5f76 on gboeing:master.

@samuelduchesne
Contributor

This is a feature I was looking forward to! I will try it out and report back with any issues.

@ppintosilva
Contributor Author

ppintosilva commented Jan 30, 2019

Thanks for trying it out, @samuelduchesne! I'm sure there are a few glitches I may have missed, and room for improvement overall.

Owner

@gboeing gboeing left a comment


@PedrosWits I've left you some preliminary comments. Also, would you mind providing a comment in this PR documenting the logic of how this new functionality works in a few sentences?

@gboeing
Owner

gboeing commented Jan 30, 2019

@PedrosWits also, can you explain a bit more about why this is the case:

Finally, the simplified graph can end up having a lot of nodes of degree 2 (in its undirected version).

@ppintosilva
Contributor Author

@PedrosWits also, can you explain a bit more about why this is the case:

Take for instance the network below. If we only care about intersections where we have multiple turning options (and ignore u-turns), then a lot of these nodes are redundant, as the only option is moving forward along the same geometry. I realise that this is a gross oversimplification of the road network, but I'm guessing it can be useful when analysing major road infrastructure. On the other hand, it may be important to keep these nodes as they represent entry/exit points from the major network. I'm not 100% sure on this one.

However, identifying these nodes is a bit tricky as not all nodes of degree 2 are redundant, especially the ones near the edge of the graph.
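
For reference, here is a minimal sketch (not the PR code; it assumes networkx plus the OSMnx API of the time) of how the degree-2 candidates could be listed:

import networkx as nx
import osmnx as ox

# Hypothetical sketch: list the nodes of degree 2 in the undirected version
# of the graph; these are the candidates discussed above, some of which
# (e.g. near the edge of the graph) may not actually be redundant.
G = ox.graph_from_address('2700 Shattuck Ave, Berkeley, CA',
                          distance=1000, network_type='drive')
G_undirected = nx.Graph(G)  # collapse the MultiDiGraph to a simple undirected graph
degree2_candidates = [node for node, degree in G_undirected.degree() if degree == 2]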

[Image: candidate_nodes_degree2_to_remove]

@ppintosilva
Contributor Author

Main logic behind the method (a rough code sketch follows the list):

  1. For each node in G, we find the closest centroid (the cluster to which that node belongs)
  2. We create the new graph G' and add the nodes, labelled 1 to n, where n is the number of centroids.
  3. To create the edges in the new graph, we iterate through each centroid:
    a. We get the list of nodes that belong to that centroid (we call these 'elements')
    b. For each element of our current centroid, we get its outgoing edges.
    c. We map the second node of each out-edge (its successor) onto nodes of G'.
    d. We compute the successors of the current centroid as the set of (mapped) successors of its elements, thus removing duplicates.
    e. For each successor centroid, we merge the existing attributes of the outgoing edges to that particular centroid.
    f. The 'geometry' with the largest length is kept and "fixed" to avoid gaps in the resulting graph (is there a better way of choosing the geometry?)
    g. Finally, we add the edge (centroid, successor) to G'. There is no need to handle predecessors, as the current centroid will be a successor to other centroids.
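
A stripped-down sketch of those steps (illustrative only, not the PR code: it skips attribute merging and geometry selection, and nearest_centroid is a placeholder name for the node-to-centroid mapping from step 1):

import networkx as nx

def merge_nodes_into_centroids(G, nearest_centroid):
    """Illustrative only: build a new graph H whose nodes are centroid labels
    and whose edges connect centroids whose member nodes share an edge in G.
    nearest_centroid maps each node of G to its centroid label (step 1)."""
    H = nx.MultiDiGraph(**G.graph)
    # Step 2: one node per centroid, labelled 1 to n.
    H.add_nodes_from(set(nearest_centroid.values()))
    # Step 3: for each original out-edge, add an edge between the two centroids.
    # The real method also merges edge attributes and picks/"fixes" a geometry.
    for u, v, data in G.edges(data=True):
        c_u, c_v = nearest_centroid[u], nearest_centroid[v]
        if c_u != c_v:
            H.add_edge(c_u, c_v, **data)
    return H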

@samuelduchesne
Contributor

samuelduchesne commented Feb 1, 2019

Hi @PedrosWits, when I use the new function and then apply ox.get_undirected() to the new graph, I get an error inside the is_duplicate_edge() function. Your method seems to add osmid attributes as ndarrays instead of lists (I have highlighted the array below in data_other). The error occurs on line 326 in the set generator.

data= {'lanes': '1', 'bridge': 'yes', 'highway': 'primary', 'oneway': True, 'geometry': <shapely.geometry.linestring.LineString object at 0x1a296900b8>, 'to': 94, 'length': 271.10933596417567, 'ref': 'A695', 'from': 98, 'osmid': [5016040, 95817240], 'name': 'Derwenthaugh Road', 'maxspeed': '70 mph'}

data_other= {'lanes': ['1', '1'], 'bridge': 'yes', 'highway': ['primary', 'primary'], 'oneway': [True, True], 'from': 94, 'to': 98, 'length': 406.7024179676412, 'geometry': <shapely.geometry.linestring.LineString object at 0x1a29693128>, 'ref': ['A695', 'A695'], 'osmid': [[114882696, 60893148], 505407462], 'name': 'Derwenthaugh Road', 'maxspeed': ['70 mph', '70 mph']}

These values come from the same example you provided in the tests.

@ppintosilva
Contributor Author

Thanks for noticing this, @samuelduchesne. I don't think it is an ndarray, just a nested list. I wasn't sure whether to flatten the lists when merging attributes of several candidate edges between the centroids, but now I have my answer! I wasn't sure because we lose the mapping of attributes to their prior edges, that is:

[114882696, 60893148] -> osmids of previous edge A
505407462 -> osmid of previous edge B

So, we could potentially get the previous edges back from each new edge, and reverse the clean_intersections transformation if we wanted to. But this is probably not a big deal, as we don't care about the reverse transformation, i.e. going from the cleaned intersections graph back to the original one. We just have the two graphs in memory or store them to disk.

I'll just flatten the resulting lists and submit a new commit.
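
For illustration, flattening the merged values could look something like this (a hedged sketch, not the actual commit):

def flatten_attribute(value):
    """Flatten one level of nesting in a merged edge attribute, e.g.
    [[114882696, 60893148], 505407462] -> [114882696, 60893148, 505407462]."""
    if not isinstance(value, list):
        return value
    flat = []
    for item in value:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)
    return flat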

@ppintosilva
Contributor Author

I also want to do a quick speed benchmark to see how the method performs with increasing size of G.

@gboeing
Owner

gboeing commented Feb 6, 2019

@PedrosWits would you mind resolving the branch conflicts so we can run the CI tests?

@ppintosilva
Contributor Author

@PedrosWits would you mind resolving the branch conflicts so we can run the CI tests?

Whoops, I'm sorry, I didn't realise this was something I had to do. I thought I had made sure the two master branches didn't diverge, but I guess that wasn't the case. And this was right in front of me for so long... erm... anyway, it should be ok now.


@schoeller

@PedrosWits
When trying

G = ox.graph_from_address('Bandung, Jawa, Indonesia', distance=3000, network_type='drive', infrastructure='way["highway"~"trunk|primary|secondary|tertiary"]', clean_periphery=True, simplify=True)
G = ox.project_graph(G, to_crs='+proj=utm +zone=48 +ellps=WGS84 +datum=WGS84 +units=m +no_defs +south')
G = ox.clean_intersections(G, tolerance=50, dead_ends=False)
gdf_nodes, gdf_edges = ox.graph_to_gdfs(G)
G = ox.gdfs_to_graph(gdf_nodes, gdf_edges)

I receive the following error

  File "<ipython-input-6-22a16ce1b815>", line 1, in <module>
    G = ox.gdfs_to_graph(gdf_nodes, gdf_edges)

  File "C:\ProgramData\Miniconda3\lib\site-packages\osmnx\save_load.py", line 610, in gdfs_to_graph
    attribute_values = {k:v for k, v in attributes[attribute_name].items() if pd.notnull(v)}

  File "C:\ProgramData\Miniconda3\lib\site-packages\osmnx\save_load.py", line 610, in <dictcomp>
    attribute_values = {k:v for k, v in attributes[attribute_name].items() if pd.notnull(v)}

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

@ppintosilva
Contributor Author

I receive the following error
...

attribute_values = {k:v for k, v in attributes[attribute_name].items() if pd.notnull(v)}

Hi @schoeller, thanks for trying out the new method!

Issue:
The 'osmid' attribute for nodes in the new cleaned graph can be a list as opposed to a single value. This is because the centroids (nodes in the cleaned graph) can be composed of multiple nodes (of the original graph), each with its own osmid. Thus, when the condition pd.notnull(v) is evaluated and v is a list of osmids, we get an array of booleans instead of a single true/false value, which causes the ValueError exception.

Solution:
We can store the entire list in a separate attribute, say 'osmids', and store a single value from that list under the 'osmid' key. Does this sound reasonable @gboeing?


But other than that it looks like the method is doing the job right.

[Before/after plots: bandung_before → bandung_after]

@gboeing
Owner

gboeing commented Mar 21, 2019

That sounds like a good solution.

@ppintosilva
Contributor Author

That sounds like a good solution.

Oops, that solution didn't actually work, as we were iterating over every k,v pair.
A few lines below in gdfs_to_graph, the condition (isinstance(v, list) or pd.notnull(v)) is used instead of pd.notnull(v), and that works for cases where v is a list. Matching the two seems to solve the issue.
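
For clarity, the filter that works with list-valued attributes looks roughly like this (a paraphrase of the condition quoted above, using attribute values from the earlier comment; not the exact osmnx source):

import pandas as pd

def keep_value(v):
    # A list (e.g. of osmids) passes through untouched; calling pd.notnull on
    # a list yields an array of booleans, which is ambiguous as a condition.
    return isinstance(v, list) or pd.notnull(v)

attributes = {'osmid': [5016040, 95817240], 'name': 'Derwenthaugh Road', 'ref': None}
attribute_values = {k: v for k, v in attributes.items() if keep_value(v)}
# -> {'osmid': [5016040, 95817240], 'name': 'Derwenthaugh Road'}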

@ppintosilva
Contributor Author

* [ ]  Routes and their corresponding distances should remain similar for the original and cleaned graphs. Is this the case?

* [ ]  Is the choice of geometry for the new edges adequate? It seems to be the case by looking at the plots. However, I notice many instances of new edges with length greater than the expected maximum: max length (among the candidate edges) + 2*tolerance (the maximum distance of an old node to its centroid, for the _u_ and _v_ centroids). So far I haven't come up with an explanation for this.

@gboeing, so I just had a look into the two points I highlighted previously.

I computed the shortest-path distances for all pairs in both the original and cleaned graphs and then, for each pair, took the absolute difference between the two.

One thing I noticed is that some nodes in the original graph had a shortest-path distance greater than 2 times the threshold, yet after cleaning they belong to the same cluster! It's obvious to me now why the second bullet point quoted above doesn't make sense. Two nodes might be close together on a straight line, and hence be assigned to the same cluster, but that doesn't imply that their shortest distance along the network is anywhere close to that value. That obviously depends on the topology of the network. Please see the example below.

So that solves that mystery. But that makes it really difficult to test for correctness. By cleaning the graph we're effectively throwing away information, so there will always be some differences between paths computed in the original vs cleaned graphs. But can we do better than just visual inspection or is that enough?
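
For context, the comparison described above can be sketched roughly as follows (node_to_centroid is an assumed mapping from original node ids to cleaned-graph node ids, and both graphs are assumed to carry 'length' edge attributes):

import networkx as nx

def path_length_difference(G, G_clean, node_to_centroid, source, target):
    """Absolute difference between shortest-path lengths for one (source, target)
    pair in the original graph G and in the cleaned graph G_clean."""
    d_orig = nx.shortest_path_length(G, source, target, weight='length')
    c_u, c_v = node_to_centroid[source], node_to_centroid[target]
    # Nodes merged into the same cluster are zero distance apart in G_clean,
    # even if their network distance in G was large (the case shown below).
    d_clean = nx.shortest_path_length(G_clean, c_u, c_v, weight='length')
    return abs(d_orig - d_clean)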


Example

  • The dictionary key represents the ids of the nodes in the original graph (source, target) of the shortest path.
  • new_edge is the ids of the nodes in the cleaned graph
  • value is the shortest distance difference between the two (zero in the cleaned graph because they belong to the same cluster)

(4550426634, 13800765): {'new_edge': (277, 277), 'value': 455.99300000000005}

[Image: close_but_far]

@ppintosilva
Contributor Author

ppintosilva commented Jun 6, 2019

* [ ]  More detailed performance profiling

So I reran the performance test for the new method:

Params:

  • address = '2700 Shattuck Ave, Berkeley, CA'
  • distance = 10000
  • tolerance = 30

Overall times:

  • Got all network data: 24.20 sec
  • Created graph: 4.21 sec
  • Simplified graph: 4.17 sec
  • Projected graph: 48.28 sec
  • Cleaned graph: 20.25 sec

Times inside the method:

  • checkpoint 1 (copy graph, remove dead ends, calculate centroids): 15.01 sec
  • checkpoint 2 (calling get_nearest_nodes and computing the dicts nearest_centroid and centroid_elements): 0.49 sec
  • checkpoint 3 (main loop where new graph is built): 4.71 sec

Overall, the clean_intersections method does not take much time compared to the other existing methods. My code only adds the time measured at checkpoints 2 and 3; everything up to checkpoint 1 is the same code that clean_intersections ran before this pull request.
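
As a rough guide, the overall times above correspond to a pipeline like this (a sketch assuming the OSMnx API at the time of this PR and the parameters listed above; timings will obviously vary by machine):

import time
import osmnx as ox

start = time.time()
# graph_from_address downloads, builds and simplifies the network in one call
G = ox.graph_from_address('2700 Shattuck Ave, Berkeley, CA',
                          distance=10000, network_type='drive')
print('Built graph: {:.2f} sec'.format(time.time() - start))

start = time.time()
G_proj = ox.project_graph(G)
print('Projected graph: {:.2f} sec'.format(time.time() - start))

start = time.time()
G_clean = ox.clean_intersections(G_proj, tolerance=30, dead_ends=False)
print('Cleaned graph: {:.2f} sec'.format(time.time() - start))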

So both of the issues I highlighted previously don't appear to be issues at all.
I think we're in a good position to merge this PR, @gboeing.

@gboeing
Owner

gboeing commented Jun 7, 2019

@ppintosilva thanks! Can you resolve the merge conflict and I'll take a final reviewing pass then merge?

In cases where there is a single edge between the centroid and its neighbor, we can keep rather than discard parallel edges

@gboeing gboeing changed the base branch from master to clean_intersections June 10, 2019 13:43
@gboeing gboeing merged commit 51fbb6f into gboeing:clean_intersections Jun 10, 2019
@gboeing
Owner

gboeing commented Jun 10, 2019

@ppintosilva thanks!

@ppintosilva
Contributor Author

@ppintosilva thanks!

@gboeing thanks for the support, patience and guidance getting this PR through! Cheers 🙌

@psohouenou

Hi @gboeing, it seems that the new function is not integrated/merged into master (the latest version of OSMnx still has the version that returns a GeoSeries). Is there a reason for that?

@gboeing
Owner

gboeing commented Jan 20, 2020

@psohouenou it was merged into a feature branch that went stale due to some unresolved issues (see discussion in #299). A new PR for similar functionality is now open (see #339) but needs a little more work before it's ready.

@psohouenou

All right. Thank you very much for your prompt reply!

@gboeing
Owner

gboeing commented May 12, 2020

Superseded by #450
