### 39. Methodenseminar
## Big Data Module II: Introduction to Social Network Science with Python
# 3.2 Scale-Free Networks (Exercise)
**Author**: <a href='https://www.gesis.org/person/haiko.lietz'>Haiko Lietz</a>, GESIS - Leibniz Institute for the Social Sciences

**Date**: 17 July 2019

**Library versions**: ``networkx`` 2.2 ([documentation](https://networkx.github.io/documentation/))

***

## Exercise 1
In the demo, in section 3.2.4.1, we found preferential attachment to be linear for how citations in 2007-2009 predict citations in 2010-2012.
#### How does preferential attachment evolve over time?
In other words, does the Matthew Effect increase over time? Write a ``for`` loop that repeats the fit for consecutive time slices. Then compare these cultural dynamics to the emergence of the Small-World social (co-authorship) network.

In [None]:
import pandas as pd
import numpy as np

In [None]:
citations = pd.read_csv('../data/sns/citations.txt', header='infer', delimiter='\t', encoding='utf-8')
references = pd.read_csv('../data/sns/references.txt', header='infer', delimiter='\t', encoding='utf-8')
cited_references = pd.merge(left=citations, right=references, on='reference_id')
publications = pd.read_csv('../data/sns/publications.txt', header='infer', delimiter='\t', encoding='utf-8')
publications['time'] = (3*np.floor(publications['time']/3)+2).astype('int')
cited_references_time = pd.merge(left=cited_references, right=publications[['publication_id', 'time']], on='publication_id')
cited_references_time = cited_references_time.groupby(['time', 'reference']).size().reset_index(name='citations')
cited_references_time.head()
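
The expression ``3*np.floor(time/3)+2`` used above labels every publication year with the last year of its three-year slice. A minimal sketch with made-up years (the array ``example_years`` is illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical publication years, for illustration only
example_years = np.array([2007, 2008, 2009, 2010, 2011, 2012])

# Same binning expression as in the cell above
slices = (3 * np.floor(example_years / 3) + 2).astype('int')

# Show which slice label each year receives
print(dict(zip(example_years.tolist(), slices.tolist())))
```

Here 2007-2009 are labeled 2009 and 2010-2012 are labeled 2012, which matches the two periods compared in the demo.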

In [None]:
years = list(cited_references_time['time'].drop_duplicates())

In [None]:
def ols_reg(a):
    # log and reshape data
    x_log10 = np.log10(a[:, 0])
    x_log10_reshape = x_log10.reshape(len(x_log10), 1)
    y_log10 = np.log10(a[:, 1])
    y_log10_reshape = y_log10.reshape(len(y_log10), 1)
    # fit linear model in log space
    import sklearn.linear_model as sk_lm
    reg = sk_lm.LinearRegression()
    reg.fit(x_log10_reshape, y_log10_reshape)
    y_log10_reshape_predict = reg.predict(x_log10_reshape)
    # create output
    x_min = min(a[:, 0])
    x_max = max(a[:, 0])
    d = 10**reg.intercept_[0]
    beta = reg.coef_[0][0]
    from sklearn.metrics import r2_score
    r2 = r2_score(y_log10_reshape, y_log10_reshape_predict)
    a_fit = np.array([[x_min, d*x_min**beta], [x_max, d*x_max**beta]])
    return beta, r2, a_fit
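
One possible way to structure the loop is sketched below. It assumes that references are matched across two consecutive slices with an inner merge, which may differ from the demo, and it uses ``np.polyfit`` in log space instead of ``ols_reg`` only to keep the example self-contained. The function name ``attachment_exponents`` and the toy data are hypothetical; the column names follow ``cited_references_time``.

```python
import numpy as np
import pandas as pd

def attachment_exponents(df):
    """Fit citations(t_next) ~ d * citations(t)**beta for consecutive time slices."""
    years = sorted(df['time'].unique())
    rows = []
    for t, t_next in zip(years[:-1], years[1:]):
        # keep only references cited in both slices (an assumption of this sketch)
        pair = pd.merge(left=df[df['time'] == t], right=df[df['time'] == t_next],
                        on='reference', suffixes=('_t', '_next'))
        # least-squares line in log-log space; the slope is the exponent beta
        x_log10 = np.log10(pair['citations_t'])
        y_log10 = np.log10(pair['citations_next'])
        beta, log10_d = np.polyfit(x_log10, y_log10, 1)
        rows.append({'time': t, 'time_next': t_next, 'beta': beta, 'n': len(pair)})
    return pd.DataFrame(rows)

# Synthetic toy data: citations triple from one slice to the next, so beta should be 1
toy = pd.DataFrame({
    'time': [2006]*4 + [2009]*4 + [2012]*4,
    'reference': ['a', 'b', 'c', 'd']*3,
    'citations': [1, 2, 4, 8, 3, 6, 12, 24, 9, 18, 36, 72],
})
result = attachment_exponents(toy)
print(result)
```

Applied to ``cited_references_time``, such a function would return one row per pair of consecutive slices. An exponent close to 1 corresponds to linear preferential attachment.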

#### Emergence of the co-authorship Small World for time slices
<img src='images/social_emergence.png'>

#### Emergence of the co-authorship Small World for cumulative time increases
<img src='images/social_emergence_cum.png'>

## Exercise 2
Four distributions can be extracted from the BTW13 dataset of political tweets:
- number of followee selections
- number of hashtag selections
- number of mentionee selections
- number of retweetee selections

#### What functions best fit these distributions? How plausible are power law fits?
Be careful when setting the lower cutoff ``xmin``: a cutoff is only meaningful if the best fit is a power law. Plot the distribution first to decide which ``xmin`` to set. When comparing functions, disregard the ``lognormal`` (but not the ``lognormal_positive``) because the unrestricted lognormal can have a negative mean. Think about possible underlying mechanisms and what the patterns mean for the Twitter users.
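
To get a feeling for how the cutoff influences the estimated exponent, the following sketch uses the approximate maximum-likelihood estimator for discrete power laws given by Clauset, Shalizi and Newman, $\hat{\alpha} \approx 1 + n \left[\sum_i \ln \frac{x_i}{x_{min} - 1/2}\right]^{-1}$. It is a numpy-only illustration on synthetic data and does not claim to reproduce the internals of the ``powerlaw`` package. The function name ``alpha_approx`` and the sample ``counts`` are assumptions made for this example.

```python
import numpy as np

def alpha_approx(data, xmin):
    """Approximate MLE of the exponent of a discrete power law for x >= xmin."""
    tail = data[data >= xmin]
    return 1 + len(tail) / np.sum(np.log(tail / (xmin - 0.5)))

# Synthetic counts: rounded continuous power-law draws with exponent 2.5 (illustrative only)
rng = np.random.default_rng(0)
u = rng.random(200000)
counts = np.floor(0.5 * (1 - u) ** (-1 / (2.5 - 1)) + 0.5)

# Estimate the exponent and count the remaining observations for several cutoffs
cutoffs = (1, 2, 5, 10)
estimates = {xmin: alpha_approx(counts, xmin) for xmin in cutoffs}
tail_sizes = {xmin: int(np.sum(counts >= xmin)) for xmin in cutoffs}
for xmin in cutoffs:
    print(xmin, tail_sizes[xmin], round(estimates[xmin], 2))
```

A larger ``xmin`` leaves fewer observations in the tail, so the estimate rests on less data. This trade-off is one reason to inspect the plotted distribution before fixing a cutoff.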

In [None]:
import powerlaw as pl

In [None]:
def plot_pdf(l):
    fit = pl.Fit(l, discrete=True, xmin=1)
    fit.plot_pdf(marker='o', ls='', linear_bins=True)
    fit.plot_pdf(marker='o', ls='', linear_bins=False)
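
``plot_pdf`` draws the distribution twice, once with linear and once with logarithmic bins. The sketch below illustrates the difference with plain ``np.histogram`` on synthetic data. The bin edges are chosen for this example and are not necessarily the ones ``powerlaw`` uses.

```python
import numpy as np

# Synthetic heavy-tailed counts (illustrative only)
rng = np.random.default_rng(1)
data = np.floor(0.5 * (1 - rng.random(10000)) ** (-1 / 1.5) + 0.5)

# Linear bins: one bin per integer value
lin_edges = np.arange(1, data.max() + 2)
# Logarithmic bins: edges grow multiplicatively, duplicates removed after rounding down
log_edges = np.unique(np.floor(np.logspace(0, np.log10(data.max() + 1), 20)))

lin_pdf, _ = np.histogram(data, bins=lin_edges, density=True)
log_pdf, _ = np.histogram(data, bins=log_edges, density=True)

# Count bins without any observation in each binning
empty_lin = int(np.sum(lin_pdf == 0))
empty_log = int(np.sum(log_pdf == 0))
print(empty_lin, empty_log)
```

With linear bins, most bins in the sparse tail stay empty and the plot becomes noisy there. Logarithmic bins pool the rare large values, which makes the tail easier to read.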

In [None]:
def compare_functions(f):
    function = ['exponential', 'stretched_exponential', 'lognormal', 'lognormal_positive', 'power_law', 'truncated_power_law']
    from numpy import zeros
    f_compare_R = zeros((6, 6), dtype=float)
    f_compare_p_R = zeros((6, 6), dtype=float)
    for i in range(0, 6):
        for j in range(0, 6):
            R, p_R = f.distribution_compare(function[i], function[j])
            f_compare_R[i, j] = R
            f_compare_p_R[i, j] = p_R
    from pandas import DataFrame
    return DataFrame(f_compare_R, index=function, columns=function), DataFrame(f_compare_p_R, index=function, columns=function)

In [None]:
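def p_value(f, sims=2500):
    # semi-parametric bootstrap: share of synthetic datasets that fit a power law worse than the data
    from random import random, sample
    from powerlaw import Fit, Power_Law
    # probability that an observation belongs to the power-law tail (x >= xmin)
    prob = f.n_tail/len(f.data_original)
    # observations below xmin are resampled as they are
    body = [x for x in f.data_original if x < f.xmin]
    l = []
    for i in range(0, sims):
        x = []
        for j in range(0, len(f.data_original)):
            if random() <= prob:
                # draw one value from the fitted power law
                x.append(int(Power_Law(discrete=True, xmin=f.xmin, parameters=[f.power_law.alpha]).generate_random(1)[0]))
            else:
                x.append(sample(body, 1)[0])
        # refit the synthetic dataset and compare Kolmogorov-Smirnov distances
        x_fit = Fit(x, discrete=True).power_law
        l.append(x_fit.KS() > f.power_law.KS())
    p = sum(l)/sims
    return p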

#### Following

In [None]:
follow = pd.read_csv('../data/btw13/follow.txt', header='infer', delimiter='\t', encoding='utf-8')
follow.head()

In [None]:
number_of_followers = list(follow.groupby('user_id_followee').size())

In [None]:
plot_pdf(number_of_followers)

#### Tagging

In [None]:
tag = pd.read_csv('../data/btw13/tag.txt', header='infer', delimiter='\t', encoding='utf-8')
tag.head()

In [None]:
number_of_tags = list(tag.groupby('hashtag_id').size())

#### Mentioning

In [None]:
mention = pd.read_csv('../data/btw13/mention.txt', header='infer', delimiter='\t', encoding='utf-8')
mention.head()

In [None]:
number_of_mentioners = list(mention.groupby('user_id_mentionee').size())

In [None]:
plot_pdf(number_of_mentioners)

#### Retweeting

In [None]:
retweet = pd.read_csv('../data/btw13/retweet.txt', header='infer', delimiter='\t', encoding='utf-8')
retweet.head()

In [None]:
number_of_retweeters = list(retweet.groupby('user_id_retweetee').size())

In [None]:
plot_pdf(number_of_retweeters)

## Exercise 3
The goal of Network Science is to find patterns that universally appear in complex systems (not only social systems). Ten complex networks, described [here](http://www.networksciencebook.com/translations/en/resources/data.html), are studied by Barabási in his book *Network Science*.
#### What's the evidence for scale-freeness in the networks selected by Barabási?
Can you reproduce the results [here](http://www.networksciencebook.com/chapter/4#advanced-c) (table 4.1)? Networks are ordered by size, so the collaboration network is fastest to assess.

#### Barabási's Datasets
Of these, the citation, email, metabolic, phonecalls, and www networks are directed.
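
For a directed network, in-degree and out-degree form two separate distributions that are assessed one at a time. Nodes with degree zero are dropped before fitting because a power law is not defined at zero. A minimal sketch with an invented toy edge list:

```python
import networkx as nx

# Toy directed edge list (source, target), invented for illustration
edges = [(1, 2), (1, 3), (2, 3), (4, 3), (4, 1)]
G = nx.DiGraph()
G.add_edges_from(edges)

# Keep only positive degrees: zeros cannot be shown on a log axis or fitted with a power law
out_degree = [d for (node, d) in G.out_degree() if d > 0]
in_degree = [d for (node, d) in G.in_degree() if d > 0]

print(sorted(out_degree), sorted(in_degree))
```

Node 3 has no outgoing edge and node 4 has no incoming edge, so each list contains three of the four nodes.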

In [None]:
collaboration = pd.read_csv('../data/networksciencebook/collaboration.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
phonecalls = pd.read_csv('../data/networksciencebook/phonecalls.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
email = pd.read_csv('../data/networksciencebook/email.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
www = pd.read_csv('../data/networksciencebook/www.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
citation = pd.read_csv('../data/networksciencebook/citation.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
actor = pd.read_csv('../data/networksciencebook/actor.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
#internet = pd.read_csv('../data/networksciencebook/internet.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
#metabolic = pd.read_csv('../data/networksciencebook/metabolic.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
#powergrid = pd.read_csv('../data/networksciencebook/powergrid.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')
#protein = pd.read_csv('../data/networksciencebook/protein.edgelist.zip', header='infer', delimiter='\t', encoding='utf-8')

In [None]:
import matplotlib.pyplot as plt

def fit_power_law(l, xmin=None):
    # xmin=None lets powerlaw search for the optimal lower cutoff
    fit = pl.Fit(l, discrete=True, xmin=xmin)
    # empirical PDF with linear and with logarithmic bins
    fit.plot_pdf(marker='o', ls='', linear_bins=True)
    fit.plot_pdf(marker='o', ls='')
    # fitted candidate functions
    #fit.exponential.plot_pdf(label='Exponential')
    fit.stretched_exponential.plot_pdf(label='Stretched Exponential')
    fit.lognormal_positive.plot_pdf(label='Lognormal')
    fit.power_law.plot_pdf(label='Power Law')
    fit.truncated_power_law.plot_pdf(label='Truncated Power Law')
    plt.legend()
    return fit

In [None]:
import networkx as nx

#### The Collaboration Dataset

In [None]:
Co = nx.Graph(name='collaboration')
Co.add_edges_from(collaboration.values)
Co_degree = [degree for (node, degree) in Co.degree]
print(nx.info(Co))

#### The Phonecalls Dataset

In [None]:
P = nx.DiGraph(name='phonecalls')
P.add_edges_from(phonecalls.values)
P_out_degree = [out_degree for (node, out_degree) in P.out_degree if out_degree > 0]
P_in_degree = [in_degree for (node, in_degree) in P.in_degree if in_degree > 0]
print(nx.info(P))

#### The Email Dataset

In [None]:
E = nx.DiGraph(name='email')
E.add_edges_from(email.values)
E_out_degree = [out_degree for (node, out_degree) in E.out_degree if out_degree > 0]
E_in_degree = [in_degree for (node, in_degree) in E.in_degree if in_degree > 0]
print(nx.info(E))

#### The WWW Dataset

In [None]:
W = nx.DiGraph(name='www')
W.add_edges_from(www.values) # takes a few minutes
W_out_degree = [out_degree for (node, out_degree) in W.out_degree if out_degree > 0]
W_in_degree = [in_degree for (node, in_degree) in W.in_degree if in_degree > 0]
print(nx.info(W))

#### The Citation Dataset

In [None]:
Ci = nx.DiGraph(name='citation')
Ci.add_edges_from(citation.values) # takes a few minutes
Ci_out_degree = [out_degree for (node, out_degree) in Ci.out_degree if out_degree > 0]
Ci_in_degree = [in_degree for (node, in_degree) in Ci.in_degree if in_degree > 0]
print(nx.info(Ci))

#### The Actor Dataset

In [None]:
A = nx.Graph(name='actor')
A.add_edges_from(actor.values) # takes a few minutes
A_degree = [degree for (node, degree) in A.degree]
print(nx.info(A))