Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for graph series/algorithms #55

Open
furlat opened this issue Jan 22, 2024 · 8 comments
Open

Support for graph series/algorithms #55

furlat opened this issue Jan 22, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@furlat
Copy link

furlat commented Jan 22, 2024

I have in some situations columns containing lists of edges, or even rows of adjacency matrix like representations for my data, it would be lovely to apply graph based operations like collecting all the rows in a shortpath from a row (node) to another.

Happy to elaborate it this does not sound stupid

@abstractqqq
Copy link
Owner

abstractqqq commented Jan 22, 2024

Hi @furlat ,

It is something I find interesting, although dataframes may not be the most optimal data structure to do this efficiently. That said I think some graph support should be added. Can you please provide a minimal example of a dataframe, and the operation you want to perform on it? Thanks

@furlat
Copy link
Author

furlat commented Jan 22, 2024

Hi thanks a lot for the fast reply!!

Something like this? Totally made up and ignoring the various df context just showing expressions

Simulated Polars DataFrame for Graph Representation

import polars as pl

# Sample input data
data = {
    'node_id': [1, 2, 3, 4, 5],
    'edges': [[2, 3], [1, 4], [1, 5], [2, 5], [3, 4]]
}

GraphDataFrame = pl.DataFrame(data)
print(GraphDataFrame.head(5))

shape: (5, 2)
┌─────────┬─────────┐
│ node_idedges   │
│ ------     │
│ i64list[i] │
╞═════════╪═════════╡
│ 1       ┆ [2, 3]  │
│ 2       ┆ [1, 4]  │
│ 3       ┆ [1, 5]  │
│ 4       ┆ [2, 5]  │
│ 5       ┆ [3, 4]  │
└─────────┴─────────┘
# Degree of each node
degree_series = GraphDataFrame.graph.node_metrics.degree()
print(degree_series.head(5))

# Output
shape: (5, 1)
┌────────┐
│ degree │
│ ---    │
│ i64    │
╞════════╡
│ 2      │
│ 2      │
│ 2      │
│ 2      │
│ 2      │
└────────┘
# Eigenvector centrality of each node
centrality_series = GraphDataFrame.graph.node_metrics.eigenvector_centrality()
print(centrality_series.head(5))

# Output
shape: (5, 1)
┌────────────────────┐
│ eigenvector_centr  │
│ ---                │
│ f64                │
╞════════════════════╡
│ 0.5                │
│ 0.6                │
│ 0.4                │
│ 0.7                │
│ 0.5                │
└────────────────────┘
# Shortest path distances from all nodes to node_id 3, with path represented as list of node IDs
distances = GraphDataFrame.graph.distances.short_path(3, algorithm='dijkstra', create_path_column=True)
print(distances.head(5))

# Output
shape: (5, 3)
┌─────────┬──────────┬────────────┐
│ node_iddistancepath       │
│ ---------        │
│ i64i64list[i64]  │
╞═════════╪══════════╪════════════╡
│ 11        ┆ [1, 3]     │
│ 21        ┆ [2, 3]     │
│ 30        ┆ [3]        │
│ 42        ┆ [4, 2, 3]  │
│ 51        ┆ [5, 3]     │
└─────────┴──────────┴────────────┘
# Shortest path distances from all nodes to node_ids 3 and 4
targets = [3, 4]
distances = GraphDataFrame.graph.distances.short_path(targets, algorithm='dijkstra', create_path_column=True)
print(distances.head(5))

# Output
shape: (5, 5)
┌─────────┬─────────────────┬───────────────┬─────────────────┬───────────────┐
│ node_iddistance_to_3path_to_3distance_to_4path_to_4     │
│ ---------------           │
│ i64i64list[i64]     ┆ i64list[i64]     │
╞═════════╪═════════════════╪═══════════════╪═════════════════╪═══════════════╡
│ 11               ┆ [1, 3]        ┆ 2               ┆ [1, 2, 4]     │
│ 21               ┆ [2, 3]        ┆ 1               ┆ [2, 4]        │
│ 30               ┆ [3]           ┆ 1               ┆ [3, 4]        │
│ 41               ┆ [4, 2, 3]     ┆ 0               ┆ [4]           │
│ 51               ┆ [5, 3]        ┆ 1               ┆ [5, 4]        │
└─────────┴─────────────────┴───────────────┴─────────────────┴───────────────┘
# Grouping based on node clusters and aggregating
clustered_grouping = GraphDataFrame.graph_group_by.aggolmerative_clustering(num_clusters=2, depth=1).agg([
    pl.col("cluster"),
    pl.col("edges").list().alias("connected_nodes")
])
print(clustered_grouping.head(5))

# Output
shape: (2, 2)
┌─────────┬──────────────────────────┐
│ clusterconnected_nodes          │
│ ------                      │
│ i64list[list[i]]            │
╞═════════╪══════════════════════════╡
│ 0       ┆ [[2, 3], [1, 4]]         │
│ 1       ┆ [[1, 5], [2, 5], [3, 4]] │
└─────────┴──────────────────────────┘

@furlat
Copy link
Author

furlat commented Jan 22, 2024

I forgot to mention why I am posting this here, because this sort of graphs a lof of time comes out of nearest neighbor filtering of some vector space and i see it just a naturally next step of processing after obtaining the list of neighbors from your example

df.with_columns(
    pl.col("id").num.knn_ptwise(
        pl.col("val1"), pl.col("val2"), 
        k = 3, dist = "haversine", parallel = True
    ).alias("nearest neighbor ids")
)

shape: (5, 6)
┌─────┬──────────┬──────────┬──────────┬──────────┬──────────────────────┐
│ idval1val2val3val4nearest neighbor ids │
│ ------------------                  │
│ i64f64f64f64f64list[u64]            │
╞═════╪══════════╪══════════╪══════════╪══════════╪══════════════════════╡
│ 00.8042260.9370550.4010050.119566 ┆ [0, 3, … 0]          │
│ 10.5266910.5623690.0614440.520291 ┆ [1, 4, … 4]          │
│ 20.2250550.0803440.4259620.924262 ┆ [2, 1, … 1]          │
│ 30.6972640.1122530.6662380.45823  ┆ [3, 1, … 0]          │
│ 40.2278070.7349950.2256570.668077 ┆ [4, 4, … 0]          │
└─────┴──────────┴──────────┴──────────┴──────────┴──────────────────────┘

I think being able to natively parallelize and combine graph operations with all the rest of polars filtering over the nodes attribute (an incredibly underspecified situation for graph frameworks). Also in all honesty I am being a bit biased but I would love to move all my non-gpu stack to polars to stay away from native python cpu processing. I end up always spending so much time in managing things that are trivial queries in polars for whatever data structure the package of the day wants.

@abstractqqq
Copy link
Owner

@furlat

I believe I have a working eigenvector centrality. See the branch here: https://github.com/abstractqqq/polars_ds_extension/tree/graph and take a look in graph.py.

It is pretty slow rn, taking 2.8ms for 1000 rows, each with 10-150 neighbors. I also added query_radius_ptwise to help me generate these edge list with unequal sizes. You can find it in examples.ipynb.

Will you be interested in testing it? (That would require you to build and compile the package locally). The reason I am asking is that I am mostly learning these topics as I go and it would be great if someone can help me with testing and debugging.

@furlat
Copy link
Author

furlat commented Jan 25, 2024

whaooo! Definetely yes, unfortunately I can't help for the immediate next 3 days, but from next week I can dedicate quite some time to testing and helping on the python side!

Also interesting github https://github.com/Splines/fast-louvain .
fast-louvain is quite a formidable algo and easily extend to temporal or multiplexed graphs, again it does not seems fully completed but it has also a nice manual on the graph theory part https://splines.github.io/fast-louvain/modularity/formula.html

@abstractqqq abstractqqq added the enhancement New feature or request label Jan 30, 2024
@abstractqqq
Copy link
Owner

Some graph stuff is there. But on second thought I really think a separate data structure is ideal for Graph. As long as the graph structure can return arrays/things that can easily be turned into Series, I think they can work with dataframes. Having graph queries work with lazy frames is like a dream.. Building a graph takes a lot of effort, and they cannot really be re-used when they are expressions, and graph algorithms are typically slower than the rest of the expressions...

So after some experimentation with graph expressions, I think the API you proposed above makes more sense, although it has to be altered to some extent.. I will try to make more progress in v.0.3.x

@furlat
Copy link
Author

furlat commented Feb 8, 2024

Hi! Unfortunately I have a big presentation tomorrow and got a bit delayed on my testing I will start playing around with the package again from the week-end and will be able to give you some proper feedback.

I have some ideas regarding your points above but I am not sure they are correct, so I will first take my time to play around with the graph algos (and benchmark vs dumping to numpy) and update you asap!

@abstractqqq
Copy link
Owner

abstractqqq commented Feb 9, 2024

Hi! Unfortunately I have a big presentation tomorrow and got a bit delayed on my testing I will start playing around with the package again from the week-end and will be able to give you some proper feedback.

I have some ideas regarding your points above but I am not sure they are correct, so I will first take my time to play around with the graph algos (and benchmark vs dumping to numpy) and update you asap!

No worries and no rush. I have decided to implement most graph algos first as expressions, just to make it convenient to use first, despite it being slow (possibly cannot be any faster).

FYI, I am planning to release v0.3.1 soon. Graph is refactored and the petgraph crate is used instead of Pathfinding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants