Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/graph analysis metrics #42

Merged
merged 10 commits into from Mar 12, 2021

Conversation

herbiebradley
Copy link
Member

@herbiebradley herbiebradley commented Mar 9, 2021

This PR adds metrics to GeoGraph by defining a Metric dataclass. A Metric object is then returned when a metric is calculated; each metric calculation function accepts only a GeoGraph, so that you can calculate any metric by calling the get_metric method on any GeoGraph and passing only the name of the metric. The PR also changes Habitats and Components to be sub-classes of GeoGraph, so you can now have recursively nested HabitatGeoGraphs, each with a ComponentGeoGraph, and you can calculate metrics on all of them!

Other features:

  • A merge_classes function which can be used to avoid some extra lines when combining class labels in the dataframe.
  • Tested support for string class labels as well as integer ones.
  • An apply_to_habitats function which allows you to apply a GeoGraph method, like merge_nodes, to every HabitatGeoGraph attached to a GeoGraph in one go, assuming each application uses the same arguments.
  • Further speedup in add_habitat.
  • Improved get_graph_components function, which now supports optionally calculating the union polygons for each component (by default this is not done for a GeoGraph, but is done for a HabitatGeoGraph since habitats are often small enough that this is fairly cheap).
  • Habitat-specific saving and loading.
  • Implemented __eq__ for GeoGraphs which checks if the two GeoGraphs are isomorphic.
  • Similarly, you can compare Metric objects with < <= == >= > and they will work as you would expect by comparing the underlying values.
  • ComponentGeoGraphs support optionally adding an edge between every pair of nodes (each node represents a component in the original graph) with the distance between the node polygons as an edge attribute. This is very slow for any habitat with more than about 100 components, so is disabled by default, but it is necessary for the avg_component_isolation metric.

Copy link
Member

@rdnfn rdnfn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, nice work! Great to have the full metrics functionality for habitats as well!

My questions are mainly about using it: how to ensure the GeoGraph and its habitats have all the attributes necessary to compute all component metrics.

Otherwise, happy to merge as is! 🚀

Comment on lines +519 to 520
barrier_classes: Optional[List[Union[str, int]]] = None,
max_travel_distance: float = 0.0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have this now 🚀

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually go into the computation yet, or is this just for future-proofing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just for future proofing at the moment.

Comment on lines +651 to +652
self, calc_polygons: bool = True, add_distance_edges: bool = False
) -> ComponentGeoGraph:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to be able to turn off the calc polygons 👍

self.components = comp_geograph
return comp_geograph

def get_metric(self, name: str) -> metrics.Metric:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding: so if I want to get all available metrics I could just use the line below?

all_metrics_of_graph = [graph.get_metric(name) for name in metrics.METRICS_DICT]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that avg_component_isolation doesn't work unless you have edges with distance between the components in the ComponentsGeoGraph of the GeoGraph you call the metric on, and avg_component_area doesn't work when the ComponentsGeoGraph doesn't have a dataframe with polygons, which is the default for the main GeoGraph. So if you do the above then you will get an error - any ideas on how to resolve this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a property like available_metrics or so would be nice? This way a user could check what possible arguments are allowed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing you can do is

for name in metrics.METRICS_DICT:
    try:
        all_metrics.append(graph.get_metric(name))
    except ValueError:
        continue

which would not give any errors and should return all the metrics it is possible to calculate.

Copy link
Member

@rdnfn rdnfn Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the computational cost to create some of these metrics scales so poorly we should keep the GeoGraph defaults scalable (i.e. not compute all the component edges, df etc.). The most important immidiate thing will be to add good warnings (and docstrings with examples) that let the user know how to add the relevant info to compute a more sophisticated metric.

In the next step, we could let the metrics themselves trigger the computation of these data structures. So e.g. if you want to compute avg_component_area it calls a GeoGraph method get_components(datafram=True), that automatically computes the components with df if the current component attribute doesn't have this yet. This would make it easier for the user, but maintain default scalability.

EDIT: this method would be in addition to the regular components attribute.

EDIT: this method could be an extension of the existing get_graph_components method, so that we always call that method to get components, sometimes with them being newly computed and sometimes with the existing components attribute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can certainly make it so that the metrics functions trigger the computation of the needed data structures, but that would then make it impractical to use the code above iterating over all metrics, because it would trigger some incredibly expensive computation unless your graph is small. I personally slightly prefer allowing users to get all metrics calculatable with the defaults easily. Maybe I should add a get_all_metrics function which gets all the metrics that are valid for a geograph - then additional ones can be called optionally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm yeah I think you're right, it would be nice for the user to have an easy way to get the simple metrics. Maybe we could have get_standard_metrics (for computationally inexpensive) and get_extended_metrics (for all metrics including expensive) functions?

Copy link
Member

@rdnfn rdnfn Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I think triggering these computations is what a user would expect, if you try to get a complicated metric. But you're absolutely right, it shouldn't be the default anywhere (as by calling that iteration above).

EDIT: thought about it more .. I am somewhat indifferent between throwing an error when the graph is not fully computed yet, or a warning that it may take long, when trying to start a complicated metric. In case of the error it should be clear from the error how the user can fix it (with the example). Otherwise, up to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, in b196db1 I added a STANDARD_METRICS list and an ALL_METRICS list, so you can just iterate over the standard metrics list to get all of the cheap metrics. Then I made the metrics functions compute the needed data structures automatically, with a warning.

Comment on lines +850 to +851
def save_habitat(self, save_path: pathlib.Path) -> None:
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding: how is this different from the main save method of the GeoGraph class?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the only difference is the data = { ... } part?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also saves all the metadata for the habitat, such as the list of valid classes & barrier classes, the max_travel_distance, etc. And the loading function loads all that stuff as well.

Comment on lines +10 to +11
if TYPE_CHECKING:
from src.models import geograph
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, didn't know there is a variable like this. Should start to use this as well!

Comment on lines +930 to +931
self.graph = nx.empty_graph(len(self.components_list))
self.metrics: Dict[str, Metric] = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding: this would then just be a number of nodes without any additional info?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought it best to create the graph so that users can at least add edges and node attributes if they want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sounds good.

Comment on lines 80 to 81
comp_geograph = geo_graph.components
if not comp_geograph.has_df:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for usage: does a GeoGraph instance by default have this attribute, with a ComponentGeoGraph with df? Or how can I make sure that this metric can be computed for a GeoGraph?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default in the GeoGraph init we call self.components = self.get_graph_components(calc_polygons=False), which means the resulting ComponentsGeoGraph does not have a df so this metric will not work there. You can calculate the df and store it in components with geograph.components = geograph.get_graph_components(calc_polygons=True). However, by default the HabitatGeoGraph does get the components along with the component df in its init function, so this metric will work fine for those.

Comment on lines 93 to 95
def _avg_component_isolation(geo_graph: geograph.GeoGraph) -> Metric:
"""Calculate the average distance to the next-nearest component."""
comp_geograph = geo_graph.components
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that we can compute this, at least for smaller habitats.

Again usage-question: how to I make sure that a GeoGraph instance has all the relevant attributes to be able to compute this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: how hard would it be to extend this to just regular patch isolation as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do geograph.components = geograph.get_graph_components(calc_polygons=True, add_distance_edges=True), to calculate the distance edges and make this metric work, but as warned this will be very expensive for graphs with more than about 100 components.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For patch isolation I think the approach I've been taking to calculate distances between everything is too inefficient, so I think you would have to change the idea of the metric from being distance to next-nearest patch to being average distance between neighbouring patches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also also: should we set this to zero if there is only a single component? I think this would make sense, as we would want this metric to go down as more components become connected, and in the limit (i.e. all components are connected) it should be the lowest = 0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done in f5b80c0

Copy link
Member

@rdnfn rdnfn Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info, I think the best quick addition for this PR will be to add this fix

geograph.components = geograph.get_graph_components(calc_polygons=True, add_distance_edges=True)

to the warning and docstring of the two component metric functions. Then as a next step (probably after this PR) we could extend the get components function as I suggested above.

Copy link
Member

@rdnfn rdnfn Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Again beyond the scope of this PR) About patch and component isolation: didn't give this extensive thought but I think we can use the existing graph info to make this more efficient:

  1. only computing distances where there are edges (because if there is an edge, the minimal distance patch must has such an edge)
  2. slowly searching for nearby patches (like creating a larger max_travel_distance habitat) by adding a double_max travel distance buffer and then adding the edges with distance to the graph, and doing this in incriments of the max_travel_distance, until we found an edge.

Together this should still be more efficient in most cases than the complete distance computation at the moment.

Copy link
Collaborator

@Croydon-Brixton Croydon-Brixton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as well! Mainly added a few questions for my understadning. Great job @herbiebradley 🚀

from numpy import ndarray
from src.models.polygon_utils import (
connect_with_interior_bulk,
connect_with_interior_or_edge_bulk,
connect_with_interior_or_edge_or_corner_bulk,
)

if TYPE_CHECKING:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice hack! 😉

from tqdm import tqdm

pd.options.mode.chained_assignment = None # default='warn'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's that? 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because in the add_habitat function, we create a new habitat by slicing the main df to get the rows which correspond to the habitat. Further operations on the habitat df then give a Pandas warning about chained assignment, which is explained here in the first answer: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas, but for our use case we don't care about it so I disabled the warning.

def __eq__(self, o: object) -> bool:
if not isinstance(o, GeoGraph):
return False
return nx.is_isomorphic(self.graph, o.graph)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Just out of curiosity, do you know how complex such an operation is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I just realised it's exponential in the number of vertices - for some reason that function is still very fast when comparing graphs which have quite a few differences though, but impossibly slow for graphs which are the same. I've changed it to nx.fast_could_be_isomorphic which takes about half a second to compare even for graphs which are equal. It's only an approximation of course but I think it will be quite accurate for the differences usually found between GeoGraphs.

if final_index is None:
final_index = self.df.last_valid_index()
final_index = self.df.last_valid_index() + 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that would have been bad - good catch!

"""
if not set(class_list).issubset(self.df["class_label"].unique()):
raise ValueError("`class_list` must only contain valid class names.")
self.df.loc[self.df["class_label"].isin(class_list), "class_label"] = new_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we deal with potential touching polygons once the classes are merged? Should we also merge them into the respective superpolygons ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be able to find these cases and call the very convenient merge_nodes function to deal with it 😄 - I'll add it to the PR tomorrow probably.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taking more time than I thought it would, so I'll create an issue and add it in a future PR.

def _load_from_dataframe(
self,
df: gpd.GeoDataFrame,
tolerance: float = 0.0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the best-practice in this case actually? Is it to keep redundant arguments running (e.g. the way you did or with kwargs) or is it to remove them? I don't know but would be curious to learn

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's to create the redundant arguments so that if people accidentally copy a function call from the superclass and apply it to the child class, it all still works. See python/mypy#1237 (comment)
https://en.wikipedia.org/wiki/Liskov_substitution_principle

Comment on lines +958 to +959
# Using this list and iterating through it is slightly faster than
# iterating through df due to the dataframe overhead
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you benchmark with df.geometry.values (which gives you a geometry array object, that has references to the underlying C objects? How does this compare?
(Feel free to address this after the report and just open an issue - I think before the report submission performance is not priority)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking about this, I'll test it out before I merge the PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Croydon-Brixton I did some tests and it turns out making the list is slower than just getting the .values, but accessing a random element in the list is ~25x faster than accessing a random element in the geometry array (which is built on top of numpy I think). So we can stick to lists, this post might explain why they are better for accessing elements: https://stackoverflow.com/questions/29281680/numpy-individual-element-access-slower-than-for-lists

class Metric:
"""Class to represent a metric for a GeoGraph, with associated metadata."""

value: Any # No good Numpy type hints
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly 😢 I think ppl are working on it though

desc="Calculating edge weights",
total=len(self.graph.edges),
):
attrs["distance"] = geom[u].distance(geom[v])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That loop and distance calculation is probably the very slow computation you mentioned right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's that one

self.components = comp_geograph
return comp_geograph

def get_metric(self, name: str) -> metrics.Metric:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a property like available_metrics or so would be nice? This way a user could check what possible arguments are allowed.

@herbiebradley
Copy link
Member Author

All points dealt with, time to merge! 🥳

@herbiebradley herbiebradley merged commit f78684e into feature/graph-analysis Mar 12, 2021
@herbiebradley herbiebradley deleted the feature/graph-analysis-metrics branch March 12, 2021 13:58
GeoGraph automation moved this from In progress to Done Mar 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
GeoGraph
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants