# Introduction to Social Network Analysis

[Alex Hanna](http://alex-hanna.com), University of Toronto/Google

This is an introduction to social network analysis in R. This introduction aims to accomplish several objectives:

1. Motivate the use of social network analysis
2. Introduce the theoretical basis of network analysis
3. Discuss the potential data sources for network analysis
4. Define network terminology and types of networks
5. Understand how to import network data into R
6. Introduce the R network ecology

### What is Social Network Analysis?

Social network analysis (SNA, also called network analysis, network science, or graph theory) is a body of methods used to study the relationships between entities. Entities can be individuals, organizations, Twitter users, countries, or a mix of any of the above. Network analysis distinguishes itself from other types of analysis by focusing on the relationships between entities rather than solely on the properties of the entities themselves.

### Motivation

Typically, we want to use network analysis if we have a sense that the relationships or transactions between actors/entities are the most critical part of the story. If the interlocking connections between entities are more important than their individual attributes, or if networked connections between entities force an endogeneity problem which cannot be resolved by more conventional quantitative regression techniques, then network analysis may be a good approach for you to follow.

#### Two examples

Although network analysis as it is presented today is sometimes treated as synonymous with computational social science or "big data" methods, it has a long history within sociology and political science. Some of the more innovative uses of network analysis work with [archival and historical data](http://www.themacroscope.org/?page_id=308).

<img src="img/padgett_ansell.png" alt="Padgett and Ansell 1994" width="450px">

In a classic article, [Padgett and Ansell](https://www.jstor.org/stable/2781822) discuss the rise of the Medici family in early 15th-century Florence, Italy. While many families in Florence were attempting to accumulate power and influence within this Renaissance state, the Medici did so in a way that outpaced all others. Padgett and Ansell use network methods to show how -- namely, the Medici were able to position themselves as a critical broker in the marriage and economic (trade, partnership, and real estate) networks among the larger group of elite Florentine families.

This is one of my favorite examples because it illustrates a highly complicated network based on data which are roughly six centuries old. First, the **actor or entity** here is the elite family. Second, there is not a single type of relationship in this network -- this is a **multiplex** network, meaning there are multiple types of relationships which exist between entities. Relationships are both **undirected** and **directed**, which means they are not all mutual. More on that below. Furthermore, the **medium** that travels across the network tie is different for each network: it is a combination of trust, cooperation, and resources. Third, the data are archival and based upon historical research. The network must be constructed and operationalized explicitly, unlike something like a Twitter retweet, which tends to be accepted somewhat uncritically.

<img src="img/adamic_glance.png" alt="Adamic and Glance 2005" width="450"/>

Compare that to this classic network from [Adamic and Glance](http://www.ramb.ethz.ch/CDstore/www2005-ws/workshop/wf10/AdamicGlanceBlogWWW.pdf) on the political blogosphere circa 2005. This is one of the articles which illustrate the growing political polarization in US political life. The actor here is a political blog. The relationship here is the hyperlink between one blog and another. Because a hyperlink points from one blog to another and need not be returned, this is a directed network.

This graph introduces something new: the **colors** and **size** in the network. Red actors signify conservative blogs, while blue actors signify liberal ones. Furthermore, red lines represent conservative-to-conservative links, and blue lines liberal-to-liberal links. Orange lines, however, represent cross-ideological relationships. Lastly, size indicates how many links are coming into a particular blog. Those colors and sizes indicate something about the actors themselves; they are **attributes** of the actors.

In the former example, we used network analysis to focus on the *centrality of a particular actor* -- the Medici family. However, in the latter example, we used network analysis to focus on the *structure of the network as a whole* -- namely, how cross-ideology linkage is much less common than within-ideology linkage. Neither of these could be achieved by looking at each entity on its own.

#### Theoretical grounding: Relational sociology

The theoretical groundings of network analysis are scattered around various disciplines of social science, but [Emirbayer](https://www.jstor.org/stable/10.1086/231209?seq=1) argues most forcefully for its necessity in social science research. Emirbayer poses *relational* or *transactional* analysis as more ontologically sound than *variable-based* or *substantialist* analysis in the social sciences. The focus should be on the interaction between entities, rather than their properties. Individuals may be said to contain attributes (e.g. gender, race, sexual orientation), but a transactional view would indicate that those attributes are all relational (e.g. Desmond and Emirbayer's discussion of [racial domination](https://scholar.harvard.edu/files/mdesmond/files/what_is_racial_domination.pdf)).

This is heady stuff, and I'm not a theorist, so I don't want to get lost in the weeds. The main takeaway is that, as network analysts, we start to see networks everywhere, and this is our starting point. We also have to take caution in determining whether a network is the best way to approach the problem or whether there are better methods to do this.

#### Exercise 1

1. Reflect on your own research interests. What is a type of network which would be of interest to you in your work?

In [None]:
## loading igraph for some basic drawings
library(igraph)

## repr for resizing plots
library(repr)

### Terms

Let's define some terms before we go on, so that we're on the same page. So far I've been using the terms *entity* or *actor* to talk about individuals in network analysis. From now on, I will more often use the term **node** for the entity that is part of the relationship. The terms actor and vertex are synonyms for this.

The connection between two nodes is called an **edge** (or arc, link, or relation). A network with two nodes and a single edge is called a **dyad**.

In [None]:
## shrink plot size
options(repr.plot.width = 4, repr.plot.height = 3)

## a dyad
## using lgl layout so that the dyad lays flat
net <- graph_from_literal(A-B)
plot(net, 
     vertex.color = "gray", 
     vertex.label = NA, 
     vertex.size = 60, 
     layout=layout_with_lgl)

A network with three nodes, with any type of configuration of edges between them, is called a **triad**.

In [None]:
## a triad 
net <- graph_from_literal(A-B-C)
plot(net, vertex.color = "gray", vertex.label = NA, vertex.size = 60)
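
To check these definitions in code, here is a minimal sketch that counts the nodes and edges of the triad. It only relies on `vcount`, `ecount`, and `as_edgelist` from `igraph`, applied to the same toy graph as above.

```r
library(igraph)

## the same triad as above: a path A - B - C
net <- graph_from_literal(A-B-C)

V(net)$name        # the node names
vcount(net)        # number of nodes
ecount(net)        # number of edges
as_edgelist(net)   # one row per edge, one column per endpoint
```

Note that a triad does not need all three possible edges: this one has only two.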

You'll see that I'm using a funny little function called `graph_from_literal` to draw the networks above. This lets me literally draw some basic networks using a rudimentary syntax. What I want to draw attention to is the first word of the function's name: **graph**. A **graph** is another name for a network, and the term is much more common in computer science. A **subgraph** is any subset of a graph's nodes together with edges between them. In the network below, the nodes C, D, and E (highlighted in gold) form a subgraph of the larger graph.

In [None]:
net <- graph_from_literal(A:B:C:D:E - A:B:C:D:E)
V(net)$color <- c('gray', 'gray', 'gold', 'gold', 'gold')
plot(net, vertex.size = 60)
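
As a sketch of how a subgraph can be pulled out in code, `induced_subgraph` keeps a chosen set of nodes and every edge among them. I'm assuming here that the gold nodes are the ones we want, which in the code above are C, D, and E.

```r
library(igraph)

## the complete five-node graph from above
net <- graph_from_literal(A:B:C:D:E - A:B:C:D:E)

## keep the gold nodes and all edges that run between them
sub <- induced_subgraph(net, c("C", "D", "E"))

V(sub)$name   # nodes kept in the subgraph
ecount(sub)   # edges kept in the subgraph
```

Because every pair of nodes in the original graph is connected, the three-node subgraph is itself fully connected.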

A **component** is a maximal connected subgraph: every node in it can reach every other node in it, and no node outside it is connected to it. In the plot below, nodes A through E form a component. G-H, I-J, K-L-M, and N are each components as well. The largest component is called the **major component** while the others are called **minor components**. N is a special kind of component which is by itself and is thus called an **isolate**.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
net <- graph_from_literal(A:B:C:D:E - A:B:C:D:E, G-H, I-J, K-L-M, N)
plot(net, vertex.color = 'gray', vertex.size = 25)
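
The components in this plot can also be found in code. This is a minimal sketch using `components` and `degree` from `igraph`; the variable names (`comp`, `major`, `isolates`) are just my own labels.

```r
library(igraph)

## the same network as in the plot above
net <- graph_from_literal(A:B:C:D:E - A:B:C:D:E, G-H, I-J, K-L-M, N)

comp <- components(net)
comp$no           # how many components there are
comp$csize        # how many nodes are in each component
comp$membership   # which component each node belongs to

## nodes in the largest (major) component
major <- V(net)$name[comp$membership == which.max(comp$csize)]

## isolates are nodes with no edges at all
isolates <- V(net)$name[degree(net) == 0]
```

Filtering on `comp$membership` like this is a common first step when an analysis should only cover the major component.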

#### Types of networks

Networks differ from one another in a handful of common ways. Let's begin with properties of edges. As noted in the Medici example, edges in a network can be either **directed** or **undirected**. The most readily available illustration comes from social media -- on Facebook, your friendships are mutual. This is an undirected network. Aliya is friends with Brianna and vice versa. On Twitter, however, follower networks are asymmetrical. Aliya may follow Brianna, but Brianna doesn't follow Aliya. This is a directed network. Directed edges are denoted with an arrow in a visualization.

In [None]:
## shrink plot size
options(repr.plot.width = 4, repr.plot.height = 3)

net <- graph_from_literal(Aliya --+ Brianna)
plot(net, 
     vertex.color = "gray", 
     vertex.size = 60, 
     vertex.label.color = "black",
     vertex.label.dist = 10,
     edge.arrow.size = 3,
     layout=layout_with_lgl)
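
To see what direction changes beyond the arrowhead, here is a small sketch comparing a directed and an undirected version of the same pair. The object names `directed` and `undirected` are mine; the functions are standard `igraph` ones.

```r
library(igraph)

directed   <- graph_from_literal(Aliya --+ Brianna)  # Aliya follows Brianna
undirected <- graph_from_literal(Aliya - Brianna)    # a mutual friendship

is_directed(directed)
is_directed(undirected)

## in a directed network, outgoing and incoming ties are counted separately
degree(directed, mode = "out")   # ties each person sends
degree(directed, mode = "in")    # ties each person receives

## rows are senders and columns are receivers, so the matrix is not symmetric
m <- as_adjacency_matrix(directed, sparse = FALSE)
m
```

In the undirected case the two degree counts would be identical and the adjacency matrix would be symmetric.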

Edges can be **weighted** (valued) or **unweighted** (unvalued). A weight denotes some attribute of the edge, such as strength or distance. A Facebook friendship with a good deal of interaction may have a higher weight than one without. Visually, this can be represented in a number of ways. Below, it's denoted with edge width. We'll get into that more in the next module.

In [None]:
net <- graph_from_literal(Aliya - Brianna - Chlöe)
E(net)$width <- c(8, 2)
plot(net, 
     vertex.color = "gray", 
     vertex.size = 60, 
     vertex.label.color = "black",
     vertex.label.dist = 10)
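
The plot above only sets a drawing width. As a sketch of how weights can be stored as data, the same three-person network can be built from a data frame with a `weight` column. The numbers are made-up interaction counts, and `ties` is a name I'm introducing for illustration.

```r
library(igraph)

## one row per tie; the weights are made-up interaction counts
ties <- data.frame(from   = c("Aliya", "Brianna"),
                   to     = c("Brianna", "Chlöe"),
                   weight = c(8, 2))
net <- graph_from_data_frame(ties, directed = FALSE)

is_weighted(net)   # TRUE once the edges carry a "weight" attribute
E(net)$weight      # the weight of each edge
strength(net)      # sum of the weights on each node's edges
```

Passing `edge.width = E(net)$weight` to `plot` would then draw the heavier tie with a thicker line.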

Networks can also be **multiplex**, meaning they can denote multiple relationships. Again, this was the case with the Medici example. Furthermore, nodes can link to themselves, which is called a **self-loop**. Think about retweets -- people on Twitter can retweet others as well as themselves. In the graph below, imagine that retweets are the gray network and @replies are the dark red network. Aliya is retweeting and replying to Brianna. Aliya also retweets herself. Brianna retweets Chlöe.

In [None]:
net <- graph_from_literal(Aliya -+ Brianna -+ Chlöe, 
                          Aliya -+ Brianna,  
                          Aliya-+Aliya, simplify = FALSE)
plot(net, 
     edge.color = c("dark red", "grey", "grey", "grey"),
     vertex.color = "gray", 
     vertex.size = 60,
     edge.width = 2,
     edge.arrow.size = 3,
     vertex.label.color = "black",
     vertex.label.dist = 10)
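
Colors are one way to tell the relationships apart; another is to store the relationship as an edge attribute. This is a rough sketch that rebuilds the same four ties from an edge list. The attribute name `relation` is my own choice, and I'm assuming the first tie is the reply, as in the color vector above.

```r
library(igraph)

## the same four ties as above, one row per edge
ties <- rbind(c("Aliya",   "Brianna"),   # reply
              c("Brianna", "Chlöe"),     # retweet
              c("Aliya",   "Brianna"),   # retweet
              c("Aliya",   "Aliya"))     # retweet of herself
net <- graph_from_edgelist(ties, directed = TRUE)

## record which relationship each edge belongs to
E(net)$relation <- c("reply", "retweet", "retweet", "retweet")

which_multiple(net)   # flags repeated edges between the same pair
which_loop(net)       # flags edges that start and end at the same node

## pull out one layer of the multiplex network
retweets <- delete_edges(net, which(E(net)$relation != "retweet"))
ecount(retweets)
```

Keeping the relationship on the edge makes it possible to analyze each layer separately or all layers together.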

Lastly, networks can change over time. They can be either **static** or **dynamic**. We won't really touch on these much in this workshop, so we'll avoid the plotting.

#### Two-mode networks and the Breiger duality

So far we've only discussed networks in which all nodes are of a single type. However, in political and social networks we often observe networks of more than one type of entity. The simplest extension of this is the **two-mode network**, that is, a network where there are two types of nodes.

Two-mode network analysis is most common in the case of individuals and groups, or individuals and events. Take this example from Kieran Healy on [using metadata to identify Paul Revere](https://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere). The data here is a matrix with individuals on its rows and organizations on its columns; each cell records whether or not the person was a member of that organization.

In [None]:
## if you have downloaded the repository
#data.revere <- as.matrix(read.csv("data/PaulRevereAppD.csv",row.names=1))

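## install RCurl only if it is missing, then load it
if (!requireNamespace("RCurl", quietly = TRUE)) install.packages("RCurl")
library(RCurl)

## download the CSV as text, then parse it into a person-by-organization matrix
data.revere <- getURL("https://raw.githubusercontent.com/alexhanna/zurich-sna/master/data/PaulRevereAppD.csv?token=AAwvDYpalfRS-CrL0aiZUSvqgcUIPzIwks5bRNIKwA%3D%3D", ssl.verifypeer = FALSE)
data.revere <- as.matrix(read.csv(textConnection(data.revere), row.names = 1))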
## show the first six rows
head(data.revere)

A network which illustrates both types of nodes, with edges running only between nodes of different types, is called a **bipartite graph**. A graph like this can be helpful in illustrating common membership in organizations.

In [None]:
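## build the two-mode graph; people (rows) get type FALSE, organizations (columns) get type TRUE
## recent igraph versions call this graph_from_biadjacency_matrix(); older ones use graph_from_incidence_matrix()
bipartite.revere <- graph_from_biadjacency_matrix(data.revere)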
options(repr.plot.width = 10, repr.plot.height = 10)
colors <- c("gray", "dark red")
plot(bipartite.revere, 
     vertex.color = ifelse(V(bipartite.revere)$type, colors[2], colors[1]),
     vertex.size = 5,
     vertex.label = ifelse(V(bipartite.revere)$type, V(bipartite.revere)$name, NA),
     vertex.label.dist = 1,
     vertex.label.color = "black")

However, a major insight of these types of networks highlighted by [Ron Breiger](https://www.jstor.org/stable/2576011) is that two-mode networks have a duality -- they can be transformed into one-mode networks (i.e. networks with only one type of node) such that we can highlight the importance of one mode in terms of the other, or vice versa. In terms of the Paul Revere dataset, we can see the importance of the individuals based on their group memberships, or we can see the importance of the groups based on the individual memberships.

The transformation is more or less this: if we have a matrix $A$ like the dataset above, with the first mode on its rows and the second mode on its columns, then to obtain the one-mode network of the first mode we multiply the matrix by its transpose:

$$ G_1 = A (A^T) $$

Conversely, to obtain the one-mode network of the second mode, we multiply the transpose of the matrix by the matrix:

$$ G_2 = (A^T) A $$
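
Before applying this to the real data, it may help to work the two products by hand on a tiny matrix. This is a sketch with invented memberships: three hypothetical people (`p1` to `p3`) and two hypothetical groups (`g1`, `g2`). It uses base R only.

```r
## rows are people, columns are groups; 1 means "is a member"
A <- matrix(c(1, 1,
              1, 0,
              0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("p1", "p2", "p3"), c("g1", "g2")))

## person-by-person: each cell counts the groups two people share
G1 <- A %*% t(A)

## group-by-group: each cell counts the people two groups share
G2 <- t(A) %*% A

G1
G2
```

Here `p1` belongs to both groups, so `p1` shares one group with each of the others, while `p2` and `p3` share none. The diagonal of each product counts a node's own memberships (or members), which is why it gets overwritten before a graph is built.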

In [None]:
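## person-by-person matrix: each cell counts the organizations two people share
person.net <- data.revere %*% t(data.revere)
## zero out the diagonal (a person's own membership count); recent igraph versions reject NA here
diag(person.net) <- 0

## weighted = NULL turns a count of k into k parallel edges
revere.person.g <- graph_from_adjacency_matrix(person.net, mode = "undirected", weighted = NULL, diag = FALSE)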

In [None]:
plot(revere.person.g,
     vertex.color = "grey",
     vertex.size = 5,
     vertex.label = ifelse(V(revere.person.g)$name == "Revere.Paul", V(revere.person.g)$name, ""),
     vertex.label.dist = 1,
     edge.color = "grey90",
     vertex.label.color = "black")

Paul Revere himself is highlighted to illustrate his importance to the connectivity of this network. While we lose some information in terms of group membership, we gain a good deal in highlighting the importance of relationships in the mode we may be more interested in.
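
One piece of information that does not have to be lost is how many groups two people share. As a sketch of an alternative to the projection above (not a replacement for it), passing `weighted = TRUE` stores each count as an edge weight. The membership matrix here is invented for illustration.

```r
library(igraph)

## hypothetical memberships: 3 people (rows) by 3 groups (columns)
A <- matrix(c(1, 1, 0,
              1, 1, 1,
              0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("p1", "p2", "p3"), c("g1", "g2", "g3")))

## person-by-person counts of shared groups
person.net <- A %*% t(A)

## weighted = TRUE stores each count as an edge weight instead of repeating the edge
g <- graph_from_adjacency_matrix(person.net, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

as_data_frame(g, what = "edges")   # columns: from, to, weight
```

In this toy case `p1` and `p2` share two groups, so their tie gets twice the weight of the tie between `p2` and `p3`.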

#### Exercise 2

1. If you haven't already, install the `igraph` package with `install.packages('igraph')`. Then load it with `library(igraph)`.
2. The syntax for creating undirected graphs with `graph_from_literal` is `A - B`. Create an undirected graph in which Zurich is connected to Geneva and Bern, and Geneva is connected to Bern.
3. Generate and plot the one-mode network for the groups by filling out the blanks below. What do you notice?

In [None]:
## 3
revere.group.net <- ____ %*% ____
diag(revere.group.net) <- 0
revere.group.g <- graph_from_adjacency_matrix(____, mode = "undirected", weighted = NULL, diag = FALSE)

plot(revere.group.g,
     vertex.color = "grey",
     vertex.size = 5,
     vertex.label.dist = 1,
     edge.color = "grey",
     vertex.label.color = "black")

<dd>4. `simplify` is a function we often use to clean up a network. Its main argument is the network itself. Run `simplify` below and fill in the blanks for plotting. What's different?</dd>

In [None]:
## 4
## simplify() returns a new graph and leaves its input unchanged
simplify(____)
plot(____,
     vertex.color = "grey",
     vertex.size = 5,
     vertex.label.dist = 1,
     edge.color = "grey",
     vertex.label.color = "black")

### Data sources and loading

- Data sources
    - Ethnographic
    - Interviews
    - Surveys
    - Digital
    - Archival
- Data formats
    - Adjacency
    - Edgelist
    - Incidence matrix
- R packages
    - `igraph`
    - `sna`
    - `statnet`
    - `tidygraph` + `ggraph`
- Exercise 3
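
To make the three data formats in the outline above concrete, here is a rough sketch that writes one toy network in each of them and loads it into `igraph`. All node and group names (`A`, `B`, `C`, `club1`, `club2`) are invented, and the object names are my own.

```r
library(igraph)

## 1. edge list: one row per tie
el <- data.frame(from = c("A", "A", "B"),
                 to   = c("B", "C", "C"))
g.el <- graph_from_data_frame(el, directed = FALSE)

## 2. adjacency matrix: a node-by-node grid, 1 where a tie exists
adj <- as_adjacency_matrix(g.el, sparse = FALSE)
g.adj <- graph_from_adjacency_matrix(adj, mode = "undirected")

## 3. incidence matrix: rows are one mode, columns are the other
inc <- matrix(c(1, 0,
                1, 1,
                0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("club1", "club2")))

## turn the incidence matrix into a two-mode edge list, then a graph
idx <- which(inc == 1, arr.ind = TRUE)
el2 <- data.frame(person = rownames(inc)[idx[, "row"]],
                  group  = colnames(inc)[idx[, "col"]])
g.two <- graph_from_data_frame(el2, directed = FALSE)
V(g.two)$type <- V(g.two)$name %in% colnames(inc)   # TRUE for groups
```

Edge lists tend to be the most compact format for large, sparse networks, while matrices are convenient for the kind of multiplication used in the two-mode section.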