Before filing
Kind of addition
Motivation
Following preliminary discussion in #492 , this proposal aims to bring the popular personalized node scoring/ranking algorithm to BGL. Personalized node scoring can be used to score nodes with respect to their structural proximity to a (weighted) set of seed nodes, so it can be used for overlapping community member scoring and ranking. It can also be used as a building block for more complex community detection schemes that satisfy theoretical properties like finding structural communities of low conductance given that the seed nodes lie closely together. Finally, node scores can be used for link prediction by starting from one node and finding structural close alternatives.
The proposal below presents some basic information for bringing these kinds of methods, and can be used as future basis for also implementing more spectral graph filters (not to be confused with full-blown spectral processing in that spectral filters act on eigenvalue-based properties by actually working only on the vertex domain for near-linear instead of quadratic time complexity).
Proposed addition
In linear algebra terms, the algorithm implements the iterative scheme:
$p = (1-dmp)Ap+dmp p_0$
where A is broadly a normalized graph adjacency matrix, dmp is the equivalent of the dampening factor (equal to 1-restart probability of equivalent random walk with restart schemes), and $p_0$ a starting personalization vector. As an API, this would mainly follow the existing page_rank interface:
template <typename Graph, typename RankMap, typename Done>
void personalized_page_rank(const Graph& g, RankMap rank_map, Done done,
typename property_traits<RankMap>::value_type damping,
typename graph_traits<Graph>::vertices_size_type n,
Normalization normalization,
Optimizer optimizer,
RankMap1 rank_map1,
RankMap2 rank_map2);
Simplifications analogous to page_rank's will be added, for example so that done uses some common default (see below for planned options for convergence tracking; it is not as simple a problem as it appears).
Important differences:
A. rank_map will hold both input personalization $p_0$ and the eventually outputted $p$, in alignment with page_rank's interface. This can change but I think it's elegant. Two more storage containers for node values are inserted, as we need to keep track the previous and next node values, as well s the personalization. All vertex value containers would be ReadWritePropertyMaps.
B. I am proposing that the default for done should not be a fixed number of iterations, as this creates a high computational overhead or may not be enough for robustness in scoring or rank order. Since I have worked on research determining robust stopping orders [2], I am proposing implementation of three different schemes, of which the first one would be the default, the second an alternative when correct ranking order does not matter as much as correct scoring, and the third is a practically unknown but a more theoretically rigorous version of the first:
- Run for
max(10, ceil(1/dmp)) iterations to account for most random walks of the equivalent random walk with restart scheme per [2]
- Use
mean(abs(p-p_previous))<tol as the stopping criterion with tol=10E-9 as the default (the popular Python library networkx does this, but its default tol=10E-6 is not sufficient for most graphs).
- [Optional implementation - may consume a PR by itself] Use the adaptive criteria of [2] as a more advanced version of the previous bullet point, where a confidence score is provided. The exact criterion is a bit complicated to write as markdown (see the paper) but would require an interface akin to the above bullet point.
This means that Done would in general require as inputs const RankMap& rank_map1 and const RankMap& rank_map2 to compare, and contain early stopping parameter as well as a user-provided upper bound to the number of iterations that can be 100 by default. The last one is needed because, depending on what input are provided, the computations may never converge and it would be nice for applications to stop. I see that other components throw boost::throw_exception and we could also do it here.
C. Normalization means that each edge is weighted by either 1/out_degree(u) or, more common for results compatible with spectral graph theory, by 1/sqrt(out_degree(u)*in_degree(v)) (would collapse to 1/sqrt(out_degree(u)*out_degree(v)) for edge (u,v) of an IncidenceGraph). I have a sense that this may be more computationally intensive than needed and that there should be a mechanism to cache edge weights. So, for this, I have a question if there is some appropriate structure to use.
In order to not overcomplicate things, a first implementation would just run the computations eagerly, making normalization above would be an enum enum class Normalization { Markovian, Symmetric, Laplacian }, corresponding to divisin with outdegree, the geometric means of degree, and the last case while using (I-A) instead of A to obtain a high-pass instead of low-pass filter.
Extensibility considerations: Future versions could accept as argument a storage container for edge weights (like WeightMap for betweeness centrality) as well as means of precomputing and sharing those, but that would be too much potentially interface-risky work for my first PR given that I also need to practically familiarize myself with the codebase's abstractions. Such an extension would be backwards compatible with the above interface.
D. The above design has delegated an optimizer parameter, which will account for future schemas for speeding up convergence speed. There are many means of performing such optimizations, some of which have grown a bit dim in my mind and I would need to re-read material to implement them. However, the most important of those to be included as a first version, and aside from not making any "optimization", is to again normalize vectors after each iteration.
This renormalization has the advantage of performing numerically stable computations by keeping track of the L2 norm of the input/personalization, normalizing the latter, normalizing p after each iteration, and then multiplying the results with the tracked norm. This also speeds up convergence a lot (and even makes it possible if node scores would otherwise be numerically divergent), but has the issue of losing some nuance in the resulting rank magnitude and should therefore be an opt-in (or opt-out) scheme.
Since I am not sure what extensions would be made in the future, I am proposing the optimizer to be of enum class Optimizer { None, Renormalize } (if needed, it could be combined with Normalization as a bitfield-based approach but would prefer to keep type-based explicitness).
Relation to existing Boost.Graph components
This is similar to the page_rank algorithm.
Prior art
References:
Introductory paper is [1], but the proposal aims to follow the implementation of [3], which different from the original paper by adding options for more recent esearch results (see above). The stopping criteria would be determined per [2].
- https://api.semanticscholar.org/CorpusID:1508503
- 10.1109/ASONAM49781.2020.9381435
- https://github.com/MKLab-ITI/pygrank/blob/master/pygrank/algorithms/filters/adhoc/pagerank/pagerank.py
Are you willing to contribute?
Edit: typo fixing
Before filing
Kind of addition
Motivation
Following preliminary discussion in #492 , this proposal aims to bring the popular personalized node scoring/ranking algorithm to BGL. Personalized node scoring can be used to score nodes with respect to their structural proximity to a (weighted) set of seed nodes, so it can be used for overlapping community member scoring and ranking. It can also be used as a building block for more complex community detection schemes that satisfy theoretical properties like finding structural communities of low conductance given that the seed nodes lie closely together. Finally, node scores can be used for link prediction by starting from one node and finding structural close alternatives.
The proposal below presents some basic information for bringing these kinds of methods, and can be used as future basis for also implementing more spectral graph filters (not to be confused with full-blown spectral processing in that spectral filters act on eigenvalue-based properties by actually working only on the vertex domain for near-linear instead of quadratic time complexity).
Proposed addition
In linear algebra terms, the algorithm implements the iterative scheme:
where A is broadly a normalized graph adjacency matrix, dmp is the equivalent of the dampening factor (equal to 1-restart probability of equivalent random walk with restart schemes), and$p_0$ a starting personalization vector. As an API, this would mainly follow the existing
page_rankinterface:Simplifications analogous to
page_rank's will be added, for example so thatdoneuses some common default (see below for planned options for convergence tracking; it is not as simple a problem as it appears).Important differences:
A.$p_0$ and the eventually outputted $p$ , in alignment with
rank_mapwill hold both input personalizationpage_rank's interface. This can change but I think it's elegant. Two more storage containers for node values are inserted, as we need to keep track the previous and next node values, as well s the personalization. All vertex value containers would beReadWritePropertyMaps.B. I am proposing that the default for
doneshould not be a fixed number of iterations, as this creates a high computational overhead or may not be enough for robustness in scoring or rank order. Since I have worked on research determining robust stopping orders [2], I am proposing implementation of three different schemes, of which the first one would be the default, the second an alternative when correct ranking order does not matter as much as correct scoring, and the third is a practically unknown but a more theoretically rigorous version of the first:max(10, ceil(1/dmp))iterations to account for most random walks of the equivalent random walk with restart scheme per [2]mean(abs(p-p_previous))<tolas the stopping criterion with tol=10E-9 as the default (the popular Python library networkx does this, but its default tol=10E-6 is not sufficient for most graphs).This means that
Donewould in general require as inputsconst RankMap& rank_map1andconst RankMap& rank_map2to compare, and contain early stopping parameter as well as a user-provided upper bound to the number of iterations that can be 100 by default. The last one is needed because, depending on what input are provided, the computations may never converge and it would be nice for applications to stop. I see that other components throwboost::throw_exceptionand we could also do it here.C. Normalization means that each edge is weighted by either
1/out_degree(u)or, more common for results compatible with spectral graph theory, by1/sqrt(out_degree(u)*in_degree(v))(would collapse to1/sqrt(out_degree(u)*out_degree(v))for edge(u,v)of anIncidenceGraph). I have a sense that this may be more computationally intensive than needed and that there should be a mechanism to cache edge weights. So, for this, I have a question if there is some appropriate structure to use.In order to not overcomplicate things, a first implementation would just run the computations eagerly, making
normalizationabove would be an enumenum class Normalization { Markovian, Symmetric, Laplacian }, corresponding to divisin with outdegree, the geometric means of degree, and the last case while using (I-A) instead of A to obtain a high-pass instead of low-pass filter.Extensibility considerations: Future versions could accept as argument a storage container for edge weights (like
WeightMapfor betweeness centrality) as well as means of precomputing and sharing those, but that would be too much potentially interface-risky work for my first PR given that I also need to practically familiarize myself with the codebase's abstractions. Such an extension would be backwards compatible with the above interface.D. The above design has delegated an
optimizerparameter, which will account for future schemas for speeding up convergence speed. There are many means of performing such optimizations, some of which have grown a bit dim in my mind and I would need to re-read material to implement them. However, the most important of those to be included as a first version, and aside from not making any "optimization", is to again normalize vectors after each iteration.This renormalization has the advantage of performing numerically stable computations by keeping track of the L2 norm of the input/personalization, normalizing the latter, normalizing p after each iteration, and then multiplying the results with the tracked norm. This also speeds up convergence a lot (and even makes it possible if node scores would otherwise be numerically divergent), but has the issue of losing some nuance in the resulting rank magnitude and should therefore be an opt-in (or opt-out) scheme.
Since I am not sure what extensions would be made in the future, I am proposing the optimizer to be of
enum class Optimizer { None, Renormalize }(if needed, it could be combined with Normalization as a bitfield-based approach but would prefer to keep type-based explicitness).Relation to existing Boost.Graph components
This is similar to the
page_rankalgorithm.Prior art
References:
Introductory paper is [1], but the proposal aims to follow the implementation of [3], which different from the original paper by adding options for more recent esearch results (see above). The stopping criteria would be determined per [2].
Are you willing to contribute?
Edit: typo fixing