Feature/graph analysis metrics #42

herbiebradley · 2021-03-09T18:58:21Z

This PR adds metrics to GeoGraph by defining a Metric dataclass. A Metric object is then returned when a metric is calculated; each metric calculation function accepts only a GeoGraph, so that you can calculate any metric by calling the get_metric method on any GeoGraph and passing only the name of the metric. The PR also changes Habitats and Components to be sub-classes of GeoGraph, so you can now have recursively nested HabitatGeoGraphs, each with a ComponentGeoGraph, and you can calculate metrics on all of them!

Other features:

A merge_classes function which can be used to avoid some extra lines when combining class labels in the dataframe.
Tested support for string class labels as well as integer ones.
An apply_to_habitats function which allows you to apply a GeoGraph method, like merge_nodes, to every HabitatGeoGraph attached to a GeoGraph in one go, assuming each application uses the same arguments.
Further speedup in add_habitat.
Improved get_graph_components function, which now supports optionally calculating the union polygons for each component (by default this is not done for a GeoGraph, but is done for a HabitatGeoGraph since habitats are often small enough that this is fairly cheap).
Habitat-specific saving and loading.
Implemented __eq__ for GeoGraphs which checks if the two GeoGraphs are isomorphic.
Similarly, you can compare Metric objects with < <= == >= > and they will work as you would expect by comparing the underlying values.
ComponentGeoGraphs support optionally adding an edge between every pair of nodes (each node represents a component in the original graph) with the distance between the node polygons as an edge attribute. This is very slow for any habitat with more than about 100 components, so is disabled by default, but it is necessary for the avg_component_isolation metric.

rdnfn

Awesome, nice work! Great to have the full metrics functionality for habitats as well!

My questions are mainly about using it: how to ensure the GeoGraph and its habitats have all the attributes necessary to compute all component metrics.

Otherwise, happy to merge as is! 🚀

rdnfn · 2021-03-10T18:16:35Z

src/models/geograph.py

+        barrier_classes: Optional[List[Union[str, int]]] = None,
        max_travel_distance: float = 0.0,


Nice to have this now 🚀

Does this actually go into the computation yet, or is this just for future-proofing?

It's just for future proofing at the moment.

rdnfn · 2021-03-10T18:17:00Z

src/models/geograph.py

+        self, calc_polygons: bool = True, add_distance_edges: bool = False
+    ) -> ComponentGeoGraph:


Good to be able to turn off the calc polygons 👍

rdnfn · 2021-03-10T18:19:43Z

src/models/geograph.py

+            self.components = comp_geograph
+        return comp_geograph
+
+    def get_metric(self, name: str) -> metrics.Metric:


Just for my understanding: so if I want to get all available metrics I could just use the line below?

all_metrics_of_graph = [graph.get_metric(name) for name in metrics.METRICS_DICT]

The problem is that avg_component_isolation doesn't work unless you have edges with distance between the components in the ComponentsGeoGraph of the GeoGraph you call the metric on, and avg_component_area doesn't work when the ComponentsGeoGraph doesn't have a dataframe with polygons, which is the default for the main GeoGraph. So if you do the above then you will get an error - any ideas on how to resolve this?

Maybe a property like available_metrics or so would be nice? This way a user could check what possible arguments are allowed.

One thing you can do is

for name in metrics.METRICS_DICT: try: all_metrics.append(graph.get_metric(name)) except ValueError: continue

which would not give any errors and should return all the metrics it is possible to calculate.

I think given the computational cost to create some of these metrics scales so poorly we should keep the GeoGraph defaults scalable (i.e. not compute all the component edges, df etc.). The most important immidiate thing will be to add good warnings (and docstrings with examples) that let the user know how to add the relevant info to compute a more sophisticated metric.

In the next step, we could let the metrics themselves trigger the computation of these data structures. So e.g. if you want to compute avg_component_area it calls a GeoGraph method get_components(datafram=True), that automatically computes the components with df if the current component attribute doesn't have this yet. This would make it easier for the user, but maintain default scalability.

EDIT: this method would be in addition to the regular components attribute.

EDIT: this method could be an extension of the existing get_graph_components method, so that we always call that method to get components, sometimes with them being newly computed and sometimes with the existing components attribute.

We can certainly make it so that the metrics functions trigger the computation of the needed data structures, but that would then make it impractical to use the code above iterating over all metrics, because it would trigger some incredibly expensive computation unless your graph is small. I personally slightly prefer allowing users to get all metrics calculatable with the defaults easily. Maybe I should add a get_all_metrics function which gets all the metrics that are valid for a geograph - then additional ones can be called optionally.

Hm yeah I think you're right, it would be nice for the user to have an easy way to get the simple metrics. Maybe we could have get_standard_metrics (for computationally inexpensive) and get_extended_metrics (for all metrics including expensive) functions?

Generally, I think triggering these computations is what a user would expect, if you try to get a complicated metric. But you're absolutely right, it shouldn't be the default anywhere (as by calling that iteration above).

EDIT: thought about it more .. I am somewhat indifferent between throwing an error when the graph is not fully computed yet, or a warning that it may take long, when trying to start a complicated metric. In case of the error it should be clear from the error how the user can fix it (with the example). Otherwise, up to you.

Alright, in b196db1 I added a STANDARD_METRICS list and an ALL_METRICS list, so you can just iterate over the standard metrics list to get all of the cheap metrics. Then I made the metrics functions compute the needed data structures automatically, with a warning.

rdnfn · 2021-03-10T18:24:48Z

src/models/geograph.py

+    def save_habitat(self, save_path: pathlib.Path) -> None:
+        """


For my understanding: how is this different from the main save method of the GeoGraph class?

I guess the only difference is the data = { ... } part?

This also saves all the metadata for the habitat, such as the list of valid classes & barrier classes, the max_travel_distance, etc. And the loading function loads all that stuff as well.

rdnfn · 2021-03-10T18:34:04Z

src/models/metrics.py

+if TYPE_CHECKING:
+    from src.models import geograph


Nice, didn't know there is a variable like this. Should start to use this as well!

rdnfn · 2021-03-10T18:36:35Z

src/models/geograph.py

+            self.graph = nx.empty_graph(len(self.components_list))
+        self.metrics: Dict[str, Metric] = {}


For my understanding: this would then just be a number of nodes without any additional info?

Yeah, I thought it best to create the graph so that users can at least add edges and node attributes if they want.

Yep, sounds good.

rdnfn · 2021-03-10T18:39:38Z

src/models/metrics.py

+    comp_geograph = geo_graph.components
+    if not comp_geograph.has_df:


Question for usage: does a GeoGraph instance by default have this attribute, with a ComponentGeoGraph with df? Or how can I make sure that this metric can be computed for a GeoGraph?

By default in the GeoGraph init we call self.components = self.get_graph_components(calc_polygons=False), which means the resulting ComponentsGeoGraph does not have a df so this metric will not work there. You can calculate the df and store it in components with geograph.components = geograph.get_graph_components(calc_polygons=True). However, by default the HabitatGeoGraph does get the components along with the component df in its init function, so this metric will work fine for those.

rdnfn · 2021-03-10T18:41:58Z

src/models/metrics.py

+def _avg_component_isolation(geo_graph: geograph.GeoGraph) -> Metric:
+    """Calculate the average distance to the next-nearest component."""
+    comp_geograph = geo_graph.components


Great that we can compute this, at least for smaller habitats.

Again usage-question: how to I make sure that a GeoGraph instance has all the relevant attributes to be able to compute this?

Also: how hard would it be to extend this to just regular patch isolation as well?

You can do geograph.components = geograph.get_graph_components(calc_polygons=True, add_distance_edges=True), to calculate the distance edges and make this metric work, but as warned this will be very expensive for graphs with more than about 100 components.

For patch isolation I think the approach I've been taking to calculate distances between everything is too inefficient, so I think you would have to change the idea of the metric from being distance to next-nearest patch to being average distance between neighbouring patches.

Also also: should we set this to zero if there is only a single component? I think this would make sense, as we would want this metric to go down as more components become connected, and in the limit (i.e. all components are connected) it should be the lowest = 0.

Good idea, done in f5b80c0

Thanks for the info, I think the best quick addition for this PR will be to add this fix

geograph.components = geograph.get_graph_components(calc_polygons=True, add_distance_edges=True)

to the warning and docstring of the two component metric functions. Then as a next step (probably after this PR) we could extend the get components function as I suggested above.

(Again beyond the scope of this PR) About patch and component isolation: didn't give this extensive thought but I think we can use the existing graph info to make this more efficient:

only computing distances where there are edges (because if there is an edge, the minimal distance patch must has such an edge)

slowly searching for nearby patches (like creating a larger max_travel_distance habitat) by adding a double_max travel distance buffer and then adding the edges with distance to the graph, and doing this in incriments of the max_travel_distance, until we found an edge.

Together this should still be more efficient in most cases than the complete distance computation at the moment.

Croydon-Brixton

Looks good to me as well! Mainly added a few questions for my understadning. Great job @herbiebradley 🚀

Croydon-Brixton · 2021-03-10T11:31:41Z

src/models/binary_graph_operations.py

 from numpy import ndarray
 from src.models.polygon_utils import (
    connect_with_interior_bulk,
    connect_with_interior_or_edge_bulk,
    connect_with_interior_or_edge_or_corner_bulk,
 )

+if TYPE_CHECKING:


Nice hack! 😉

Croydon-Brixton · 2021-03-11T09:46:42Z

src/models/geograph.py

 from tqdm import tqdm

+pd.options.mode.chained_assignment = None  # default='warn'


What's that? 😉

This is because in the add_habitat function, we create a new habitat by slicing the main df to get the rows which correspond to the habitat. Further operations on the habitat df then give a Pandas warning about chained assignment, which is explained here in the first answer: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas, but for our use case we don't care about it so I disabled the warning.

Croydon-Brixton · 2021-03-11T09:47:43Z

src/models/geograph.py

+    def __eq__(self, o: object) -> bool:
+        if not isinstance(o, GeoGraph):
+            return False
+        return nx.is_isomorphic(self.graph, o.graph)


Cool! Just out of curiosity, do you know how complex such an operation is?

Yeah I just realised it's exponential in the number of vertices - for some reason that function is still very fast when comparing graphs which have quite a few differences though, but impossibly slow for graphs which are the same. I've changed it to nx.fast_could_be_isomorphic which takes about half a second to compare even for graphs which are equal. It's only an approximation of course but I think it will be quite accurate for the differences usually found between GeoGraphs.

Croydon-Brixton · 2021-03-11T09:49:38Z

src/models/geograph.py

        if final_index is None:
-            final_index = self.df.last_valid_index()
+            final_index = self.df.last_valid_index() + 1


Oops, that would have been bad - good catch!

Croydon-Brixton · 2021-03-11T09:53:22Z

src/models/geograph.py

+        """
+        if not set(class_list).issubset(self.df["class_label"].unique()):
+            raise ValueError("`class_list` must only contain valid class names.")
+        self.df.loc[self.df["class_label"].isin(class_list), "class_label"] = new_name


How do we deal with potential touching polygons once the classes are merged? Should we also merge them into the respective superpolygons ?

I should be able to find these cases and call the very convenient merge_nodes function to deal with it 😄 - I'll add it to the PR tomorrow probably.

This is taking more time than I thought it would, so I'll create an issue and add it in a future PR.

Croydon-Brixton · 2021-03-11T10:21:30Z

src/models/geograph.py

+    def _load_from_dataframe(
+        self,
+        df: gpd.GeoDataFrame,
+        tolerance: float = 0.0,


What's the best-practice in this case actually? Is it to keep redundant arguments running (e.g. the way you did or with kwargs) or is it to remove them? I don't know but would be curious to learn

I think it's to create the redundant arguments so that if people accidentally copy a function call from the superclass and apply it to the child class, it all still works. See python/mypy#1237 (comment)
https://en.wikipedia.org/wiki/Liskov_substitution_principle

Croydon-Brixton · 2021-03-11T10:22:33Z

src/models/geograph.py

+        # Using this list and iterating through it is slightly faster than
+        # iterating through df due to the dataframe overhead


Did you benchmark with df.geometry.values (which gives you a geometry array object, that has references to the underlying C objects? How does this compare?
(Feel free to address this after the report and just open an issue - I think before the report submission performance is not priority)

Yeah I was thinking about this, I'll test it out before I merge the PR.

@Croydon-Brixton I did some tests and it turns out making the list is slower than just getting the .values, but accessing a random element in the list is ~25x faster than accessing a random element in the geometry array (which is built on top of numpy I think). So we can stick to lists, this post might explain why they are better for accessing elements: https://stackoverflow.com/questions/29281680/numpy-individual-element-access-slower-than-for-lists

Croydon-Brixton · 2021-03-11T10:24:02Z

src/models/metrics.py

+class Metric:
+    """Class to represent a metric for a GeoGraph, with associated metadata."""
+
+    value: Any  # No good Numpy type hints


Sadly 😢 I think ppl are working on it though

Croydon-Brixton · 2021-03-11T10:26:04Z

src/models/geograph.py

+                desc="Calculating edge weights",
+                total=len(self.graph.edges),
+            ):
+                attrs["distance"] = geom[u].distance(geom[v])


That loop and distance calculation is probably the very slow computation you mentioned right?

Yeah that's that one

Croydon-Brixton · 2021-03-11T10:29:39Z

src/models/geograph.py

+            self.components = comp_geograph
+        return comp_geograph
+
+    def get_metric(self, name: str) -> metrics.Metric:


Maybe a property like available_metrics or so would be nice? This way a user could check what possible arguments are allowed.

herbiebradley · 2021-03-12T13:58:16Z

All points dealt with, time to merge! 🥳

herbiebradley added 6 commits March 3, 2021 19:35

feature: basic metrics code structure

5c0bdc5

fix: type hint

17770ca

fix: faster _get_num_components

5874a67

feature: add order to metrics

eace337

feature: allow comparisons between any metrics

c47b9ed

feature: Habitat and Component GeoGraphs

f1aae9e

herbiebradley added kind: feature concerns: GeoGraph labels Mar 9, 2021

herbiebradley requested review from Croydon-Brixton and rdnfn March 9, 2021 18:58

herbiebradley self-assigned this Mar 9, 2021

herbiebradley added this to In progress in GeoGraph via automation Mar 9, 2021

rdnfn approved these changes Mar 10, 2021

View reviewed changes

Croydon-Brixton approved these changes Mar 11, 2021

View reviewed changes

herbiebradley added 4 commits March 11, 2021 10:50

fix: remove redundant arg in CompGeoGraph init

e21e7d5

fix: Make __eq__ faster for GeoGraphs

051ac83

fix: avg_comp_isolation = 0 if only one component

f5b80c0

feature: Metrics automatically calc requirements

b196db1

herbiebradley mentioned this pull request Mar 12, 2021

Merge polygons in the GeoGraph merge_classes function #44

Closed

herbiebradley merged commit f78684e into feature/graph-analysis Mar 12, 2021

herbiebradley deleted the feature/graph-analysis-metrics branch March 12, 2021 13:58

GeoGraph automation moved this from In progress to Done Mar 12, 2021

		barrier_classes: Optional[List[Union[str, int]]] = None,
		max_travel_distance: float = 0.0,

		self, calc_polygons: bool = True, add_distance_edges: bool = False
		) -> ComponentGeoGraph:

		self.graph = nx.empty_graph(len(self.components_list))
		self.metrics: Dict[str, Metric] = {}

		comp_geograph = geo_graph.components
		if not comp_geograph.has_df:

		from tqdm import tqdm

		pd.options.mode.chained_assignment = None # default='warn'

		# Using this list and iterating through it is slightly faster than
		# iterating through df due to the dataframe overhead

Feature/graph analysis metrics #42

Feature/graph analysis metrics #42

Conversation

herbiebradley commented Mar 9, 2021 • edited

rdnfn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdnfn Mar 11, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdnfn Mar 11, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdnfn Mar 11, 2021 • edited

Choose a reason for hiding this comment

rdnfn Mar 11, 2021 • edited

Choose a reason for hiding this comment

Croydon-Brixton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

herbiebradley Mar 11, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

herbiebradley Mar 11, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

herbiebradley commented Mar 12, 2021

herbiebradley commented Mar 9, 2021 •

edited

rdnfn Mar 11, 2021 •

edited

rdnfn Mar 11, 2021 •

edited

rdnfn Mar 11, 2021 •

edited

rdnfn Mar 11, 2021 •

edited

herbiebradley Mar 11, 2021 •

edited

herbiebradley Mar 11, 2021 •

edited