Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avg. clustering coefficient calculated incorrectly on graph #625

Closed
brycethomas opened this issue May 16, 2012 · 4 comments
Closed

Avg. clustering coefficient calculated incorrectly on graph #625

brycethomas opened this issue May 16, 2012 · 4 comments
Milestone

Comments

@brycethomas
Copy link
Contributor

Hi,

Consider the following graph:

http://i.imgur.com/7hMZq.png

Gephi (0.8.1) calculates the following local clustering coefficients:

http://i.imgur.com/ilT1f.png

Gephi averages these: (1.0 + 0.33333 + 1.0 + 0.0) / 4 = 0.583, and reports this number (0.583) as the Avg. Clustering Coefficient. This result is inconsistent with the algorithm from Main-memory Triangle Computations for Very Large (Sparse (Power-Law)) Graphs that Gephi claims to implement. The author of the aforementioned paper (Latapy) has his own C implementation of the algorithm, which produces 0.7777.

I surmise that the mistake Gephi is making is that it follows Latapy's advice of not calculating a local clustering coefficient for nodes with degree <= 1, but then includes this node in the count of total nodes anyway. In the example I gave above, I am referring to node Four, whose clustering coefficient is calculated as 0.0. Note that if we were to do (1.0 + 0.33333 + 1.0 + 0.0) / 3 instead of (1.0 + 0.33333 + 1.0 + 0.0)/4 we would get 0.7777, which is consistent with the result of Latapy's implementation.

To throw another spanner in the works, there is a second potential problem, but I am waiting for this to be confirmed by Latapy. This second problem is that it appears to me that Latapy himself implements the clustering coefficient algorithm differently to how it was first described in Collective dynamics of small world networks by Watts & Strogatz, which he references. I didn't see any mention in Watts & Strogatz paper to suggest nodes with degree <= 1 should be excluded from the calculations at all.

To summarise, I believe clustering coefficient is implemented incorrectly in Gephi. I suggest as a first step at least ensuring it is consistent with Latapy's implementation, and then later on figure out whether Latapy's implementation is itself inconsistent with the original definition of Avg. clustering coefficient. I haven't worked on the Gephi code base before, but if someone can point me to the right files, I'm willing to have a go at fixing it.

@brycethomas
Copy link
Contributor Author

Just to clarify, I believe the current Gephi implementation to be incorrect regardless of whether you consider Latapy's or Watts & Strogatz to be the "proper" way of implementating Avg. clustering coefficient.

@phreakocious
Copy link
Member

I can't speak to your claim, but you can certainly see the source code here: https://github.com/gephi/gephi/blob/master/StatisticsPlugin/src/org/gephi/statistics/plugin/ClusteringCoefficient.java

@brycethomas
Copy link
Contributor Author

I have heard back from Latapy, here is an excerpt from his email:

Clustering coefficient
for degree 1 nodes actually is poorly defined: what is the
probability for any two neighbors of a degree 1 node to be
linked togather? ...

As a consequence, some authors (including me) ignore them,
some consider their cc to be 0 and others to be 1.

As in practice many nodes may have degree 1, this has a huge
impact on the average cc in the graph. Consider for instance
a binary tree: no triangle at all, but half nodes have degree
1. If one considers them to have cc 1 then the average is 0.5
still with no triangle.

In general, authors do not explicitly state how then handle
degree 1 nodes, as you noticed, and I even suspect that
some authors choose the definition best suited to their
expected outcome.

To make the story short, there is no satisfactory way to
handle degree 1 nodes in cc computations. One should first
give the number of such nodes and then the average cc for
all other nodes (or, better, the degree-cc correlations:
then degree 1 nodes are naturally separated from others and
the obtained information is much richer)

What I garner from this is that depending on how you define local clustering coefficient for degree 1 nodes, either the old implementation or the implementation in my pull request could both be considered correct. Given that Gephi references Latapy's paper though, it is at least in my opinion best that the implementation also be consistent with Latapy's.

sheymann added a commit that referenced this issue May 20, 2012
Fixes calculation of clustering coefficient on graph (Issue #625)
@sheymann
Copy link
Member

Thanks for the report, we should indeed be consistent with Latapy's implementation. I merged your code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants