# Social Networks

Social networks are probably the kind of network you are all most familiar with. A decade ago the concept was far less prevalent, and coming across it was entrancing and intoxicating (ask anyone who watched the first season of The Wire in 2002). With the advent of Facebook, Twitter, and LinkedIn (where the network is explicit, easily accessed, and roughly shown to you), social networks appear *obvious*. However, there is still much to be done before we understand these networks at a deeper level.

Importantly, Social Networks differ from other networks, such as transportation networks, in that they exhibit two properties:

(i) nodes are very *close* to each other
(ii) the entire network is very *small* (it is generally quick to navigate from one side to the other)

As a demonstration of this, let's examine `soc-hamsterster.edges`, a social network from the hamsterster website (a virtual habitat for hamsters and gerbils, or rather for the owners who impersonate them).



In [None]:
#Colab USAGE!!
# !apt install libcairo2-dev libgif-dev libjpeg-dev 
# !pip install pycairo
# !echo "deb http://downloads.skewed.de/apt bionic main" >> /etc/apt/sources.list
# !apt-key adv --keyserver keys.openpgp.org --recv-key 612DEFB798507F25
# !apt-get update
# !apt-get install python3-graph-tool

In [None]:
import graph_tool.all as gt

In [None]:
!head ../Data/soc-hamsterster.edges

In [None]:
#Exercise -- Read in the edge list


Now we can count the nodes and edges:

In [None]:
print(f'Edges: {G.num_edges()}')
print(f'Nodes: {G.num_vertices()}')

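So we have approximately 8 times as many edges as nodes. Counting each edge once per node gives $E(k)\sim8$; if the graph is treated as undirected, each edge adds to the degree of both of its endpoints and the mean degree is twice that. What do you expect for the average clustering of a node given that degree? What do you expect for how long it would take to get from one side of the network to the other given that clustering?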

In [None]:
#Exercise  - Average clustering & path length


These two properties are, on the face, at odds with each other. **Why?**

# Small worlds

This week we read D.J. Watts and S.H. Strogatz. (1998). Collective dynamics of 'small-world' networks. Nature 393, 440-442. What was the main insight of this paper? 

We can actually reproduce these findings using an implementation of the Watts-Strogatz model built with graph_tool.

In [None]:
#This is how we'll create the random graph
#Credit https://nabble.skewed.de/Functions-producing-small-world-scale-free-ER-networks-td3218897.html
def get_n_levels_neighbours(maxn,n,k):
    l=[x for x in range(int(n-(k/2)),n)]
    l+=[x for x in range(n+1, int(n+(k/2)+1))]
    for i in range(len(l)):
        if (l[i]<0):
            l[i]=((l[i])%(maxn-1))+1
        if(l[i]>(maxn-1)):
            l[i]=l[i]%maxn
    return l

def watts_strogatz_network(n,k,p):
    '''
    n - number of nodes
    k - number of nearest neighbours each node is joined to (reduced by one if odd)
    p - probability of rewiring
    returns:
    g - graph tool graph
    '''
    import random
    g= gt.lattice([n])
    # make a ring
    g.add_edge(g.vertex(0), g.vertex(n-1))
    # add edges to k neighbours of each vertex in the ring
    # k-1 neighbours if k is odd
    if ((k%2!=0) and k>1):
        k-=1
    if k>2:
        for v in range(n):
            l_n=  get_n_levels_neighbours(n, v, k)
            for v_n in l_n:
                if not g.edge(v_n,v):
                    g.add_edge(g.vertex(v), g.vertex(v_n))
    # replace each edge u-v by an edge u-w with probability p
    for u in range(n):
        # copy the neighbour list first: the loop body modifies the graph
        for v in list(g.vertex(u).all_neighbours()):
            # strict < so that p = 0.0 never rewires
            if (random.random()<p):
                l1=range(n)
                l2=get_n_levels_neighbours(n,u,k)
                l=[i for i in l1 if i not in l2]
                l.remove(u)
                w=random.choice(l)
                while w==u or g.edge(u,w):
                    w=random.choice(l)
                g.remove_edge(g.edge(u, v))
                g.add_edge(g.vertex(u), g.vertex(w))
    return g

g = watts_strogatz_network(1000, 6, 0.0)
gt.graph_draw(g)

In [None]:
#Exercise



What's amazing, and frequently occurs in networks, is that the shortest path length effectively 'falls' off a cliff (reducing 10 fold) just with a minor change to $p$. We can see that it stabilizes quickly.

Clustering has a more gradual decline, since most edges are not actually changed. 

Given that social networks typically have a "high" average clustering (relative to other types of networks), we can see that there is a *sweet spot* of randomization, starting from a highly structured graph, where they are likely to exist. Surprisingly, it takes very little randomization of a fully structured graph to reach this point.
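The sweep described above can also be sketched without graph_tool. The block below is a minimal, standard-library-only illustration and not the notebook's solution: the helper names (`ring_lattice`, `rewire`, `average_clustering`, `average_path_length`), the network size of 200 nodes, and the list of $p$ values are all assumptions chosen to keep it fast. Each lattice edge is rewired once with probability $p$, which is a simplification of the `watts_strogatz_network` function above.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Return (adjacency dict, edge list) for a ring where each node
    is joined to its k nearest neighbours (k even)."""
    edges = [(v, (v + d) % n) for v in range(n) for d in range(1, k // 2 + 1)]
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj, edges

def rewire(adj, edges, p, rng):
    """With probability p, replace lattice edge (u, v) by (u, w) for a random w."""
    n = len(adj)
    for u, v in edges:
        if rng.random() < p:
            options = [w for w in range(n) if w != u and w not in adj[u]]
            if options:
                w = rng.choice(options)
                adj[u].remove(v)
                adj[v].remove(u)
                adj[u].add(w)
                adj[w].add(u)
    return adj

def average_clustering(adj):
    """Mean over nodes of (links among neighbours) / (possible links)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # such a node contributes zero
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(42)  # fixed seed so the sketch is repeatable
results = {}
for p in [0.0, 0.01, 0.1, 1.0]:
    adj, edges = ring_lattice(200, 6)
    adj = rewire(adj, edges, p, rng)
    results[p] = (average_clustering(adj), average_path_length(adj))
    print(f"p={p:<5} clustering={results[p][0]:.3f} path length={results[p][1]:.2f}")
```

At $p=0$ the clustering is that of the regular lattice; comparing the printed rows shows how differently the two quantities respond to the same amount of rewiring.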

The degree distribution is also instructive here. Let's look at the hamsterster social graph.

In [None]:
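import numpy as np
import matplotlib.pyplot as plt

#Empirical complementary CDF: fraction of nodes with degree greater than k
deg = np.sort(G.degree_property_map('total').a)
ccdf = 1 - np.arange(1, len(deg) + 1) / len(deg)
plt.plot(deg, ccdf)
plt.semilogx()
plt.xlabel('$k$')
plt.ylabel('$1-cdf$')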

What does the randomized graph look like at its extremes ($p=0.0$ and $p=1.0$)?

In [None]:
# Exercise


Why does this matter so much? Why was the detailing of these properties such a large "hit" in the scientific world?

One big answer: this exposes a critical flaw in the assumption of bulk mixing in models. We'll turn our attention to a specific application now.

# Contagion

Contagion is a process that is used to describe a rather large variety of phenomena, although its origin is in biology. Contagion is how we describe the spread of a disease from one organism to another (most frequently we care about human-to-human transmission). The typical model used is an SI model, where S stands for the Susceptible population of agents and I stands for the Infected population of agents.

$\frac{dS}{dt}= -p s(t) I(t)$

where $S(t)$ is the number of susceptible people, $s(t)$ is the fraction of the population that is susceptible, $I(t)$ is the number of infected individuals at time $t$, and $p$ is the probability of an infected person infecting a susceptible one.

Without any type of recovery from the illness, this model will eventually convert all beginning susceptibles to infecteds. 
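To see that claim numerically, the equation above can be stepped forward in time. This is a minimal sketch under the bulk-mixing assumption (everyone can contact everyone), using simple Euler steps; the function name `si_mean_field`, the population of 1000, the single initial infection, and the 200 steps are illustrative assumptions, not values from the text.

```python
def si_mean_field(n_total, infected0, p, steps, dt=1.0):
    """Euler-integrate dS/dt = -p * s(t) * I(t), with s = S/N and I = N - S."""
    S = float(n_total - infected0)
    I = float(infected0)
    history = [I]
    for _ in range(steps):
        new_infections = p * (S / n_total) * I * dt
        new_infections = min(new_infections, S)  # cannot infect more than remain
        S -= new_infections
        I += new_infections
        history.append(I)
    return history

infected = si_mean_field(n_total=1000, infected0=1, p=0.10, steps=200)
print(f"infected after 50 steps:  {infected[50]:.1f}")
print(f"infected after 200 steps: {infected[200]:.1f}")
```

The curve is S-shaped: growth is slow while few people are infected, fastest in the middle, and it flattens only once almost no susceptibles remain.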

Fortunately, graph-tool has the SI model implemented:

In [None]:
state = gt.SIState(g, beta=0.10)
outdata = []
for t in range(100):
    ret = state.iterate_sync()
    outdata.append(state.get_state().fa.sum())
    
plt.plot(outdata)
plt.xlabel('timestep')
plt.ylabel('Infected Node Count')

Now let's sweep again through the small world model rewiring probabilities, but capture how many timesteps it takes for the network to become fully infected. We will plot the average time for full infection against the rewiring probability.

* Run 10 networks per rewiring probability
* Seed only one infected node (at random)
* infection probability = 0.10

In [None]:
#Exercise



We effectively see the same behavior: as soon as the graph transitions to a small-world network, the time it takes for an infection to spread falls off a cliff (almost 7 times quicker).
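For readers without graph_tool at hand, the experiment can be approximated in plain Python. The sketch below is an assumption-laden stand-in rather than the notebook's solution: `small_world` and `time_to_full_infection` are hypothetical helpers, the network has 200 nodes with $k=6$, and the update rule (each infected-susceptible contact transmits independently with probability 0.10 per synchronous step) is my reading of a per-neighbour SI model, not a statement about how `gt.SIState` is implemented.

```python
import random

def small_world(n, k, p, rng):
    """Ring lattice (k nearest neighbours); each edge rewired with probability p."""
    lattice_edges = [(v, (v + d) % n) for v in range(n) for d in range(1, k // 2 + 1)]
    adj = {v: set() for v in range(n)}
    for u, v in lattice_edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in lattice_edges:
        if rng.random() < p:
            options = [w for w in range(n) if w != u and w not in adj[u]]
            if options:
                w = rng.choice(options)
                adj[u].remove(v)
                adj[v].remove(u)
                adj[u].add(w)
                adj[w].add(u)
    return adj

def time_to_full_infection(adj, beta, rng):
    """Count synchronous SI steps until no susceptible node can still be reached."""
    infected = {rng.randrange(len(adj))}  # seed a single random node
    steps = 0
    while True:
        # one entry per (infected, susceptible) contact
        exposed = [w for u in infected for w in adj[u] if w not in infected]
        if not exposed:
            return steps
        # each contact transmits independently with probability beta
        infected.update(w for w in exposed if rng.random() < beta)
        steps += 1

rng = random.Random(0)  # fixed seed so the sketch is repeatable
mean_time = {}
for p in [0.0, 0.01, 0.1, 1.0]:
    times = [time_to_full_infection(small_world(200, 6, p, rng), 0.10, rng)
             for _ in range(10)]  # 10 networks per rewiring probability
    mean_time[p] = sum(times) / len(times)
    print(f"p={p:<5} mean steps to full infection: {mean_time[p]:.1f}")
```

The exact ratio between the $p=0$ and rewired cases depends on network size and seed, so the numbers printed here should not be expected to match the figure quoted above.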

This both helps explain how pandemics occur and greatly intrigues researchers: how can we intervene effectively when our natural desire to connect is what creates so much of the risk?

This behavior is exacerbated when we consider a network with a heavy tailed distribution for the degree (as most on-line social networks have).

**TE** Who are the hubs in meatspace? 

In [None]:
#Exercise Hamsterster


The addition of "recovery" is relatively simple:

$\frac{dr}{dt} = ki(t)$

where $r(t)$ and $i(t)$ are the recovered and infected fractions of the population and $k$ is the probability that an infected person transitions to recovered. You can then decide whether recovered people become susceptible again or not (as you are starting to guess, there are a giant number of variations on the SIR model).

If the recovered population does not transition back to susceptible, then the network will eventually stabilize with every node either susceptible or recovered (no infected nodes remain). If recovered individuals become susceptible again, then the fraction of susceptibles stabilizes at some time $t$ while the infection persists.
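Both outcomes can be checked with a small bulk-mixing calculation. This is a sketch under stated assumptions: it works in population fractions, uses Euler steps, and the function name `simulate` along with the values $p=0.3$, $k=0.1$, and an initial infected fraction of 0.01 are illustrative choices that do not come from the text.

```python
def simulate(p, k, steps, i0=0.01, recovered_become_susceptible=False, dt=1.0):
    """Euler steps for ds/dt = -p*s*i, di/dt = p*s*i - k*i, dr/dt = k*i.

    If recovered_become_susceptible is True, recoveries flow straight back
    into s instead of accumulating in r.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    for _ in range(steps):
        infections = p * s * i * dt
        recoveries = k * i * dt
        s -= infections
        i += infections - recoveries
        if recovered_become_susceptible:
            s += recoveries
        else:
            r += recoveries
    return s, i, r

permanent = simulate(p=0.3, k=0.1, steps=1000)
recurring = simulate(p=0.3, k=0.1, steps=1000, recovered_become_susceptible=True)
print("recovery is permanent: s=%.3f i=%.3f r=%.3f" % permanent)
print("recovery is temporary: s=%.3f i=%.3f r=%.3f" % recurring)
```

With permanent recovery the infected fraction decays to zero and the population splits between susceptible and recovered. With temporary recovery the susceptible fraction settles at $k/p$ in this simple model, and the rest remain infected.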

# Generalizing contagion

Contagion on a network can be further generalized (and made more amenable to non-pathogenic contagions, such as ideas or emotions) with the framework proposed in Dodds and Watts (2004), doi:10.1103/PhysRevLett.92.218701.

This model allows for the implementation of dosing (multiple exposures required to transition from susceptible to infected) and thresholds (agents can be more or less susceptible). 
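To make dosing and thresholds concrete, here is a deliberately simplified sketch in the spirit of that framework. It is not the model from the paper. The assumptions are: a well-mixed population, unit-sized doses, one shared threshold instead of a distribution of thresholds, a fixed memory window, and immediate return to susceptible once the remembered dose falls below the threshold. The function `generalized_contagion` and all parameter values are hypothetical.

```python
import random
from collections import deque

def generalized_contagion(n, p, threshold, memory, steps, seed_fraction, rng):
    """Each step, every agent contacts one random agent; an infected contact
    delivers a unit dose with probability p. An agent is infected while the
    doses remembered over the last `memory` steps sum to at least `threshold`.
    Requires threshold <= memory."""
    doses = []
    for _ in range(n):
        seeded = rng.random() < seed_fraction
        # seeds start with just enough remembered doses to count as infected
        start = ([0] * (memory - threshold) + [1] * threshold) if seeded else [0] * memory
        doses.append(deque(start, maxlen=memory))
    infected = [sum(d) >= threshold for d in doses]
    history = []
    for _ in range(steps):
        snapshot = list(infected)  # synchronous update
        for i in range(n):
            j = rng.randrange(n)
            got_dose = j != i and snapshot[j] and rng.random() < p
            doses[i].append(1 if got_dose else 0)  # oldest dose is forgotten
            infected[i] = sum(doses[i]) >= threshold
        history.append(sum(infected) / n)
    return history

rng = random.Random(1)  # fixed seed so the sketch is repeatable
single = generalized_contagion(500, 0.5, 1, 5, 60, 0.1, rng)    # one exposure suffices
multiple = generalized_contagion(500, 0.5, 3, 5, 60, 0.1, rng)  # three exposures needed
no_doses = generalized_contagion(500, 0.0, 1, 5, 60, 0.1, rng)  # contacts never transmit
print(f"threshold 1: final infected fraction {single[-1]:.2f}")
print(f"threshold 3: final infected fraction {multiple[-1]:.2f}")
```

With the same contact rate and seeding, requiring several exposures within the memory window changes whether the contagion can take hold at all, which is the kind of behavior a single-exposure SI model cannot express.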

# Discussion

N.A. Christakis and J.H. Fowler. (2007) The spread of obesity in a large social network over 32 years. *New England Journal of Medicine* **357**, 370-379.

# Homophily

<img src='../images/homophily_blogs.png' alt='political blog network'></img>
Adamic and Glance. (2005) The political blogosphere. 

<img src='../images/homophily_school.png' alt='friendships at a school, nodes colored by race'></img>
Moody. (2001) Race, school integration, and friendship segregation in America. *AJS*. 

<img src='../images/homophily_school_table.png' alt='friendships at a school, table'>

What does this all mean? 

# Exogenous versus Endogenous

Contagion implicitly means that the illness/adoption/etc. is caused by exposure to another person who has already contracted the illness/purchased the product/etc. When this isn't an illness, we would term this social influence, and it would be an exogenous force.

An individual choosing to purchase a product, based on its attributes and the fit of those attributes to the person's own sensibilities, would be making a decision that is driven from within (endogenous).

"In between" these two would be decisions that are driven externally but not by social influence, e.g. marketing.

**The major difficulty with observed network adoption data is selection bias**. With **only** the known adoption data and network structure, you **do not** know whether someone would have adopted a product regardless of whether their friends adopted it.

# What can be done?

## Experimentation

The "simplest" answer is to run an experiment, perturbing the conditions for specific individuals and comparing their response to a control group.  

Aral and Walker. (2011) Creating social contagion through viral product design: randomized trial of peer influence in networks. *Management Science* 57(9):1623-1639.

Aral and Walker. (2012) Identifying influential and susceptible members of social networks. *Science* 337(6092):337-341.

Kramer, Guillory, and Hancock. (2014) Experimental evidence of massive-scale emotional contagion through social networks. *PNAS* 111(24):8788-8790, doi: 10.1073/pnas.1320040111.

**TE** How far apart must your target nodes be in a network to control for spillover effects?

## Paired sampling

Use demographic variables and propensity score matching to identify matched pairs of users, then estimate the effect of influence from the difference in the number of neighbors that had adopted.

With this approach we generally attempt to estimate the propensity to have been treated (exposed to influence from a neighbor), $p_{it}$ at time $t$ with logistic regression as

$p_{it} = P(T_{it} = 1|X_{it}) = \frac{1}{1 + \exp[-(\alpha_{it} + \beta_{it}X_{it} + \epsilon_{it})]}$

where $X_{it}$ is the vector of demographic and behavioral covariates of node $i$. The **major** difficulty is that $X$ varies over time, so a match in week 1 of the adoption observation is not necessarily a match in week 10.

However, that does not mean that logistic regression is the only method to generate matched pairs (just the most common for social scientists). There are a number of other clustering techniques that could be used instead (although the evaluation of their goodness of fit would differ dramatically).

Aral, Muchnik, and Sundararajan. (2009) Distinguishing influence-based contagion from homophily driven diffusion in dynamic networks. 

### Exercise

We can detail an example of this process with Python.

First, we will need to create two populations.

In [None]:
#Populations will have an Age [20-50], Education [0,1], and Married [0,1] attribute
#for simplicity's sake
import random
import pandas as pd


data = []
#Treatment condition yes or no
for condition in [0, 1]:
    #1000 people per population
    for i in range(1000):
        data.append([condition, random.randint(20, 50), random.choice([0, 1]), random.choice([0, 1])])
        
exampledf = pd.DataFrame(data, columns = ['Treated', 'Age', 'Education', 'Married'])

In [None]:
#We can fit LogisticRegression with scikit learn
from sklearn.linear_model import LogisticRegression

propensity = LogisticRegression()
propensity = propensity.fit(exampledf.iloc[:, 1:], exampledf.Treated)
#Returns the probability of being in the class
pscores = propensity.predict_proba(exampledf.iloc[:, 1:])
#Only want the second column
exampledf['Propensity'] = pscores[:,1]

exampledf.head()

In [None]:
#Or we can fit a Logistic Regression with statsmodels
import statsmodels.formula.api as smf 

fitLogit = smf.logit('Treated ~ Age + Education + Married', data = exampledf).fit()
#A simple predict call gives us the probabilities
pscores = fitLogit.predict(exampledf)
exampledf['Propensity'] = pscores

From this point, we would find the closest match from the control group for each member of the treated group (within some cutoff).
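One simple way to do that matching is greedy nearest-neighbour matching with a caliper. The sketch below is one option among many, not necessarily what the cited papers do. It uses only the standard library, and the function name `match_on_propensity`, the unit ids, the scores, and the caliper of 0.05 are all hypothetical.

```python
def match_on_propensity(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, control: dicts mapping a unit id to its propensity score.
    Returns (treated_id, control_id) pairs whose scores differ by at most
    `caliper`; treated units with no close enough control stay unmatched.
    """
    available = dict(control)  # controls not yet used in a pair
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical propensity scores for a handful of units
treated = {"t1": 0.62, "t2": 0.35, "t3": 0.90}
control = {"c1": 0.60, "c2": 0.33, "c3": 0.41, "c4": 0.10}
pairs = match_on_propensity(treated, control, caliper=0.05)
print(pairs)
```

Here `t3` is left unmatched because no control lies within the caliper. Greedy matching depends on the order in which treated units are processed, so optimal matching methods may pair units differently.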

## Natural experiments

Use some exogenous shock to the system to quantitatively assess the influence of connections with a difference-in-differences (diff-in-diff) regression. A typical example: you have an on-line social network, and an avalanche happens in Denver but not in Boston. This serves as the natural experiment, and you can now see the difference between the two cities. However, the research question usually concerns the difference in behavior between two groups, say Group A and Group B. What we look at is the change in behavior from before to after the event for the 'treatment' and 'control' groups in Denver and Boston, and then the difference between those changes (thus, the difference in the difference). Based on a non-extensive literature review, a common shock used is...rain (not enough avalanches!).
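The arithmetic behind the estimate is short enough to write out. The numbers below are invented purely for illustration (some behavioral outcome per user, before and after the event), and `group_mean` is a hypothetical helper.

```python
from statistics import mean

# (city, period, outcome): hypothetical observations, not real data
observations = [
    ("Denver", "pre", 10.0), ("Denver", "pre", 12.0),
    ("Denver", "post", 18.0), ("Denver", "post", 20.0),
    ("Boston", "pre", 9.0), ("Boston", "pre", 11.0),
    ("Boston", "post", 12.0), ("Boston", "post", 14.0),
]

def group_mean(city, period):
    """Average outcome for one city in one period."""
    return mean(y for c, t, y in observations if c == city and t == period)

# Change over time in the city that experienced the shock (treatment)
change_treated = group_mean("Denver", "post") - group_mean("Denver", "pre")
# Change over time in the city that did not (control)
change_control = group_mean("Boston", "post") - group_mean("Boston", "pre")
# The difference in the differences
did_estimate = change_treated - change_control
print(f"treated change: {change_treated}, control change: {change_control}, "
      f"diff-in-diff: {did_estimate}")
```

In regression form the same quantity is the coefficient on the interaction term in $y = \beta_0 + \beta_1\,\text{treated} + \beta_2\,\text{post} + \beta_3\,(\text{treated}\times\text{post})$. The estimate is only meaningful if the two groups would have followed parallel trends without the shock.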