Consider the following graph:
Gephi (0.8.1) calculates the following local clustering coefficients:
Gephi averages these: (1.0 + 0.33333 + 1.0 + 0.0) / 4 = 0.583, and reports this number (0.583) as the Avg. Clustering Coefficient. This result is inconsistent with the algorithm from Main-memory Triangle Computations for Very Large (Sparse (Power-Law)) Graphs that Gephi claims to implement. The author of the aforementioned paper (Latapy) has his own C implementation of the algorithm, which produces 0.7777.
I surmise that the mistake Gephi is making is that it follows Latapy's advice of not calculating a local clustering coefficient for nodes with degree <= 1, but then still includes such nodes in the count of total nodes. In the example above, this is node Four, whose clustering coefficient is reported as 0.0. Note that if we were to do (1.0 + 0.33333 + 1.0 + 0.0) / 3 instead of (1.0 + 0.33333 + 1.0 + 0.0) / 4 we would get 0.7777, which is consistent with the result of Latapy's implementation.
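To make the arithmetic concrete, here is a minimal Python sketch of the three ways a low-degree node can enter the average. The edge list is an assumption on my part (a triangle One-Two-Three with Four attached to Two), chosen only because it reproduces the local coefficients quoted above; the function names and the `low_degree` parameter are hypothetical and are not taken from Gephi or from Latapy's code.

```python
from itertools import combinations

# Assumed edge list: a triangle One-Two-Three with Four hanging off Two.
# It is used only because it yields the local coefficients
# 1.0, 0.33333, 1.0 and an undefined value for Four.
EDGES = [("One", "Two"), ("One", "Three"), ("Two", "Three"), ("Two", "Four")]


def adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_cc(adj, node):
    """Fraction of neighbour pairs that are linked; None if degree < 2."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return None
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def average_cc(adj, low_degree="exclude"):
    """Average local cc; low_degree is 'exclude', 'zero' or 'one'."""
    values = []
    for node in adj:
        cc = local_cc(adj, node)
        if cc is None:
            if low_degree == "exclude":
                continue  # node is left out of numerator AND denominator
            cc = 0.0 if low_degree == "zero" else 1.0
        values.append(cc)
    return sum(values) / len(values) if values else 0.0


adj = adjacency(EDGES)
for mode in ("exclude", "zero", "one"):
    print(mode, round(average_cc(adj, mode), 4))
```

With this assumed graph, `"exclude"` divides by 3 and `"zero"` divides by 4, which is the difference between the two figures discussed above.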
To throw another spanner in the works, there is a second potential problem, but I am waiting for this to be confirmed by Latapy. It appears to me that Latapy himself implements the clustering coefficient algorithm differently to how it was first described in Collective dynamics of small world networks by Watts & Strogatz, which he references. I didn't see any mention in the Watts & Strogatz paper to suggest that nodes with degree <= 1 should be excluded from the calculations at all.
To summarise, I believe the clustering coefficient is implemented incorrectly in Gephi. I suggest, as a first step, at least ensuring it is consistent with Latapy's implementation, and then later on figuring out whether Latapy's implementation is itself inconsistent with the original definition of Avg. clustering coefficient. I haven't worked on the Gephi code base before, but if someone can point me to the right files, I'm willing to have a go at fixing it.
Just to clarify, I believe the current Gephi implementation to be incorrect regardless of whether you consider Latapy's or Watts & Strogatz's to be the "proper" way of implementing Avg. clustering coefficient.
I can't speak to your claim, but you can certainly see the source code here: https://github.com/gephi/gephi/blob/master/StatisticsPlugin/src/org/gephi/statistics/plugin/ClusteringCoefficient.java
I have heard back from Latapy, here is an excerpt from his email:
for degree 1 nodes actually is poorly defined: what is the
probability for any two neighbors of a degree 1 node to be
linked together? ...
As a consequence, some authors (including me) ignore them,
some consider their cc to be 0 and others to be 1.
As in practice many nodes may have degree 1, this has a huge
impact on the average cc in the graph. Consider for instance
a binary tree: no triangle at all, but half nodes have degree
1. If one considers them to have cc 1 then the average is 0.5
still with no triangle.
In general, authors do not explicitly state how they handle
degree 1 nodes, as you noticed, and I even suspect that
some authors choose the definition best suited to their
To make the story short, there is no satisfactory way to
handle degree 1 nodes in cc computations. One should first
give the number of such nodes and then the average cc for
all other nodes (or, better, the degree-cc correlations:
then degree 1 nodes are naturally separated from others and
the obtained information is much richer)
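The binary tree example in the excerpt can be checked numerically. The sketch below is only an illustration under my own assumptions: it builds a complete binary tree of depth 10 using heap-style indexing, and the helper names and the `degree_one_value` parameter are hypothetical.

```python
def complete_binary_tree(depth):
    """Adjacency map of a complete binary tree; node i has parent (i - 1) // 2."""
    n = 2 ** (depth + 1) - 1
    adj = {i: set() for i in range(n)}
    for child in range(1, n):
        parent = (child - 1) // 2
        adj[parent].add(child)
        adj[child].add(parent)
    return adj


def local_cc(adj, node, degree_one_value):
    """Local cc, with a caller-chosen value for nodes of degree < 2."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return degree_one_value
    links = sum(
        1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:] if b in adj[a]
    )
    return 2 * links / (k * (k - 1))


tree = complete_binary_tree(10)
leaves = sum(1 for node in tree if len(tree[node]) == 1)

# Same triangle-free graph, two conventions for the leaves.
avg_if_one = sum(local_cc(tree, v, 1.0) for v in tree) / len(tree)
avg_if_zero = sum(local_cc(tree, v, 0.0) for v in tree) / len(tree)

print("nodes:", len(tree), "degree-1 nodes:", leaves)
print("average cc, leaves counted as 1:", round(avg_if_one, 4))
print("average cc, leaves counted as 0:", round(avg_if_zero, 4))
```

Reporting the number of degree 1 nodes next to the average, as the script does, follows the suggestion at the end of the excerpt.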
What I gather from this is that, depending on how you define the local clustering coefficient for degree 1 nodes, either the old implementation or the implementation in my pull request could be considered correct. Given that Gephi references Latapy's paper though, it is, at least in my opinion, best that the implementation also be consistent with Latapy's.
Thanks for the report, we should indeed be consistent with Latapy's implementation. I merged your code.