<center>
<hr>
<h1>Complexity in Social Systems</h1>
<h2>Master's Degree in Physics of Complex Systems</h2>
<h2>Academic Year 2016/17</h2>
<h3>Dr. Daniela Paolotti, Dr. Michele Tizzoni</h3>
<h3>Fitting power law distributions</h3>
<hr>
</center>

In [None]:
import networkx as nx
import seaborn as sns

In [None]:
import numpy as np

In [None]:
%pylab inline

We use the Python toolbox [powerlaw](https://github.com/jeffalstott/powerlaw) that implements a method proposed by Aaron Clauset and collaborators in [this paper](https://arxiv.org/abs/0706.1062).

The paper explains why fitting a power law distribution by linear regression on the logarithm of the data is not correct.
A sounder approach is based on a maximum likelihood estimator.
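
To make the maximum likelihood idea concrete, here is a minimal sketch of the continuous-case estimator $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{min})$ from the paper, checked on synthetic data. The helper names `sample_power_law` and `mle_alpha`, the seed and the sample size are illustrative assumptions, not part of `powerlaw`; degree data are discrete, so the library's own estimate can differ.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n continuous power-law samples by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def mle_alpha(data, xmin):
    """Continuous-case MLE of the exponent and its approximate standard error."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    sigma = (alpha - 1.0) / math.sqrt(n)
    return alpha, sigma


rng = random.Random(42)  # fixed seed so the sketch is reproducible
samples = sample_power_law(20000, alpha=2.5, xmin=1.0, rng=rng)
alpha_hat, sigma_hat = mle_alpha(samples, xmin=1.0)
print(alpha_hat, sigma_hat)
```

The estimate should land close to the true exponent 2.5, with an error of the order of `sigma_hat`.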

The package can be installed using `pip` as `pip install powerlaw`.

Full documentation is available [here](http://pythonhosted.org/powerlaw/).

Several examples and a detailed description of the library have been published in [PLOS ONE](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085777).

In [None]:
import powerlaw as pwl

# Analysis of ca-AstroPh

We analyze the network file `ca-AstroPh` from the SNAP repository.

In [None]:
G = nx.Graph()

# Read the edge list, skipping blank lines and '#' comment lines.
with open('./ca-AstroPh.txt', 'r') as fh:
    for line in fh:
        s = line.strip().split()
        if s and not s[0].startswith('#'):
            origin = int(s[0])
            dest = int(s[1])
            G.add_edge(origin, dest)

In [None]:
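# dict(...) works whether G.degree() returns a dict (networkx 1.x) or a DegreeView (2.x+).
degree = np.array(list(dict(G.degree()).values()))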

Let's plot the degree distribution

In [None]:
plt.figure(figsize=(10,7))
plt.hist(degree, bins=100, density=True, log=True, histtype='stepfilled')

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$P(k)$', fontsize=22)

In [None]:
plt.figure(figsize=(10,7))
plt.hist(degree, bins=100, density=True, log=True, histtype='stepfilled')

pwl.plot_pdf(degree, color='r', linear_bins=True, linewidth=1)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$P(k)$', fontsize=22)

In [None]:
plt.figure(figsize=(10,7))
plt.hist(degree, bins=100, density=True, log=True, histtype='stepfilled')

pwl.plot_pdf(degree, color='r', linear_bins=False, linewidth=2)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$P(k)$', fontsize=22)
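
With `linear_bins=False`, `pwl.plot_pdf` uses logarithmically spaced bins, which suits heavy-tailed data on log-log axes. The sketch below illustrates the idea in plain Python. The function `log_binned_pdf` and the list `sample_degrees` are hypothetical and only for illustration; the library's own binning, especially for discrete data, may differ in detail.

```python
import bisect


def log_binned_pdf(data, n_bins=10):
    """Histogram positive data into geometrically spaced bins, normalized as a density."""
    lo, hi = min(data), max(data)
    ratio = (hi / lo) ** (1.0 / n_bins)
    edges = [lo * ratio ** i for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for x in data:
        idx = bisect.bisect_right(edges, x) - 1
        # Clamp so that the maximum (and any rounding at the edges) stays in range.
        counts[min(max(idx, 0), n_bins - 1)] += 1
    total = len(data)
    density = [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    return edges, density


# Made-up degree sequence with a heavy tail (illustrative only).
sample_degrees = [1] * 50 + [2] * 25 + [4] * 12 + [8] * 6 + [16] * 3 + [32] * 2 + [64]
edges, density = log_binned_pdf(sample_degrees, n_bins=6)
```

Each bin is wider than the previous one by a constant factor, so the sparse tail is averaged over wider intervals instead of leaving many empty linear bins.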

## Parameter estimation

The `powerlaw` library estimates both the exponent $\alpha$ and the minimum value $x_{min}$ above which the power-law scaling holds.
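
When no `xmin` is given, `powerlaw` follows Clauset et al. and picks the $x_{min}$ whose fit minimizes the Kolmogorov-Smirnov distance to the data. The sketch below mimics that scan with the continuous-case formulas. The functions `fit_alpha`, `ks_distance`, `scan_xmin` and the `min_tail` cutoff are illustrative assumptions, not the library's code.

```python
import math


def fit_alpha(tail, xmin):
    """Continuous-case MLE of the exponent for values >= xmin."""
    log_sum = sum(math.log(x / xmin) for x in tail)
    return 1.0 + len(tail) / log_sum


def ks_distance(tail, alpha, xmin):
    """Largest gap between the empirical CDF of the tail and the power-law CDF."""
    ordered = sorted(tail)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(model - i / n), abs(model - (i + 1) / n))
    return gap


def scan_xmin(data, min_tail=10):
    """Try every distinct value as xmin and keep the one with the smallest KS distance."""
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:
            break  # too few points left for a meaningful fit
        if max(tail) == xmin:
            continue  # all tail values equal: the MLE is undefined
        alpha = fit_alpha(tail, xmin)
        d = ks_distance(tail, alpha, xmin)
        if best is None or d < best[2]:
            best = (xmin, alpha, d)
    return best


# Deterministic synthetic data: quantiles of a power law with alpha = 2.5, xmin = 1.
n = 500
data = [(1.0 - (i + 0.5) / n) ** (-1.0 / 1.5) for i in range(n)]
xmin_hat, alpha_hat, d_hat = scan_xmin(data)
```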

In [None]:
fit_function = pwl.Fit(degree)

In [None]:
fit_function.xmin

In [None]:
fit_function.power_law

In [None]:
fit_function.power_law.alpha

In [None]:
fit_function.power_law.sigma

In [None]:
fit_function.fixed_xmin

We now fix the minimum value for the scaling at $x_{min}=10$.

In [None]:
fit_function_fixmin = pwl.Fit(degree, xmin=10)

In [None]:
fit_function_fixmin.xmin

In [None]:
fit_function_fixmin.power_law.alpha

In [None]:
fit_function_fixmin.power_law.sigma

We look at the values of the Kolmogorov-Smirnov distance of the two fits to compare them.
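
For reference, the Kolmogorov-Smirnov distance is the largest gap between the empirical and the fitted cumulative distributions over the fitted range. In the notation of Clauset et al. (the symbols $S$ and $P$ do not appear elsewhere in this notebook):

$$D = \max_{x \geq x_{min}} \left| S(x) - P(x) \right|$$

where $S(x)$ is the empirical CDF of the observations with $x \geq x_{min}$ and $P(x)$ is the CDF of the fitted power law. A smaller $D$ means the fit follows the data more closely.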

In [None]:
fit_function.power_law.D

In [None]:
fit_function_fixmin.power_law.D

## Visualize distributions and fit

In [None]:
fig=plt.figure(figsize=(10,7))

fig=pwl.plot_pdf(degree, color='r', linewidth=2)

#fig=pwl.plot_pdf([x for x in degree if x>123], color='r', linewidth=2)

fit_function.power_law.plot_pdf(ax=fig, color='b', linestyle='-', linewidth=1)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$P(k)$', fontsize=22)

In [None]:
fig=plt.figure(figsize=(10,7))

fig=pwl.plot_pdf([x for x in degree if x>=10], color='r', linewidth=2, label='Data')

fit_function_fixmin.power_law.plot_pdf(ax=fig, color='b', linestyle='-', linewidth=1, label='Fit')

fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$P(k)$', fontsize=22)

## Comparing Candidate Distributions

We can compare the goodness of fit of several distributions.

In [None]:
fit_function.supported_distributions

In [None]:
R,p = fit_function.distribution_compare('power_law', 'exponential', normalized_ratio=True)

In [None]:
R,p

R is the log-likelihood ratio between the two candidate distributions (normalized by its standard deviation here, since `normalized_ratio=True`). It is positive if the data are more likely under the first distribution and negative if they are more likely under the second. The significance of that sign is given by p.
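
As a rough illustration of what R and p measure, the sketch below compares a continuous power law with a shifted exponential, following the likelihood ratio test described by Clauset et al. The function `loglikelihood_ratio_test` and the synthetic data are assumptions made for this example; `powerlaw` has its own implementation, and its values can differ on discrete data.

```python
import math


def loglikelihood_ratio_test(tail, xmin):
    """Normalized log-likelihood ratio (power law vs exponential) and its p-value."""
    n = len(tail)
    # Maximum likelihood parameters of both candidates for x >= xmin.
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    lam = 1.0 / (sum(tail) / n - xmin)
    # Pointwise log-likelihoods under each candidate.
    ll_power = [math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin) for x in tail]
    ll_expon = [math.log(lam) - lam * (x - xmin) for x in tail]
    diffs = [a - b for a, b in zip(ll_power, ll_expon)]
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / n
    ratio = sum(diffs) / math.sqrt(n * variance)    # normalized ratio
    p_value = math.erfc(abs(ratio) / math.sqrt(2.0))  # two-sided significance
    return ratio, p_value


# Deterministic synthetic tail: quantiles of a power law with alpha = 2.5, xmin = 1.
n = 2000
tail = [(1.0 - (i + 0.5) / n) ** (-1.0 / 1.5) for i in range(n)]
R_demo, p_demo = loglikelihood_ratio_test(tail, xmin=1.0)
```

Because the synthetic data are drawn from a power law, `R_demo` comes out positive (the first candidate is favored) with a small `p_demo`.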

In [None]:
R2,p2 = fit_function.distribution_compare('power_law', 'lognormal_positive', normalized_ratio=True)

In [None]:
R2,p2

In [None]:
R3,p3 = fit_function.distribution_compare('power_law', 'truncated_power_law', normalized_ratio=True)

In [None]:
R3,p3

In [None]:
R4,p4 = fit_function.distribution_compare('power_law', 'stretched_exponential', normalized_ratio=True)

In [None]:
R4,p4

We now repeat the comparison for the power law fitted with $x_{min}=10$.

Here, the truncated power law gives the best fit to the data; even an exponential fits the data better than the pure power law.

In [None]:
R,p = fit_function_fixmin.distribution_compare('power_law', 'exponential', normalized_ratio=True)

In [None]:
R,p

In [None]:
R3, p3 = fit_function_fixmin.distribution_compare('power_law', 'truncated_power_law', normalized_ratio=True)
R3, p3

In [None]:
fig=plt.figure(figsize=(10,7))

fig=pwl.plot_pdf([x for x in degree if x>=10], color='r', linewidth=2, label='Data')

fit_function_fixmin.exponential.plot_pdf(ax=fig, color='b', linestyle='-', linewidth=1, label='Fit')

fig.legend(fontsize=22)

plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$P(k)$', fontsize=22)

In [None]:
fig=plt.figure(figsize=(10,7))

fig=pwl.plot_pdf([x for x in degree if x>=10], color='r', linewidth=2, label='Data')

fit_function_fixmin.truncated_power_law.plot_pdf(ax=fig, color='b', linestyle='-', linewidth=1, label='Fit')

fig.legend(fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=22)
plt.xlabel('Degree', fontsize=22)
plt.ylabel('$P(k)$', fontsize=22)