![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Social Network Analysis: Fundamental Concepts

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
September 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Aims" data-toc-modified-id="Aims-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Aims</a></span></li><li><span><a href="#Lesson-details" data-toc-modified-id="Lesson-details-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Lesson details</a></span></li></ul></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Overview" data-toc-modified-id="Overview-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Overview</a></span><ul class="toc-item"><li><span><a href="#What-is-Social-Network-Analysis-(SNA)?" data-toc-modified-id="What-is-Social-Network-Analysis-(SNA)?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What is Social Network Analysis (SNA)?</a></span></li><li><span><a href="#What-are-the-principles-underpinning-SNA?" data-toc-modified-id="What-are-the-principles-underpinning-SNA?-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>What are the principles underpinning SNA?</a></span></li><li><span><a href="#Why-should-you-consider-SNA-for-your-research?" data-toc-modified-id="Why-should-you-consider-SNA-for-your-research?-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Why should you consider SNA for your research?</a></span></li><li><span><a href="#When-should-you-use-SNA?" 
data-toc-modified-id="When-should-you-use-SNA?-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>When should you use SNA?</a></span></li><li><span><a href="#What-does-SNA-involve?" data-toc-modified-id="What-does-SNA-involve?-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>What does SNA involve?</a></span></li><li><span><a href="#How-do-you-implement-SNA-in-your-research?" data-toc-modified-id="How-do-you-implement-SNA-in-your-research?-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>How do you implement SNA in your research?</a></span></li></ul></li><li><span><a href="#Key-Concepts" data-toc-modified-id="Key-Concepts-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Key Concepts</a></span><ul class="toc-item"><li><span><a href="#Entities" data-toc-modified-id="Entities-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Entities</a></span></li><li><span><a href="#Connections" data-toc-modified-id="Connections-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Connections</a></span></li><li><span><a href="#Networks" data-toc-modified-id="Networks-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Networks</a></span></li></ul></li><li><span><a href="#Representing-Networks" data-toc-modified-id="Representing-Networks-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Representing Networks</a></span><ul class="toc-item"><li><span><a href="#Matrices" data-toc-modified-id="Matrices-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Matrices</a></span></li><li><span><a href="#Edgelists" data-toc-modified-id="Edgelists-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Edgelists</a></span></li><li><span><a href="#Graphs" data-toc-modified-id="Graphs-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Graphs</a></span></li></ul></li><li><span><a href="#A-Simple-Analysis" data-toc-modified-id="A-Simple-Analysis-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>A Simple Analysis</a></span><ul class="toc-item"><li><span><a 
href="#Defining-the-study" data-toc-modified-id="Defining-the-study-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Defining the study</a></span></li><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Preliminaries</a></span></li><li><span><a href="#Getting-relational-data" data-toc-modified-id="Getting-relational-data-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Getting relational data</a></span></li><li><span><a href="#Network-level-summaries" data-toc-modified-id="Network-level-summaries-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Network-level summaries</a></span></li><li><span><a href="#Node-level-measures" data-toc-modified-id="Node-level-measures-6.5"><span class="toc-item-num">6.5&nbsp;&nbsp;</span>Node-level measures</a></span></li><li><span><a href="#Components" data-toc-modified-id="Components-6.6"><span class="toc-item-num">6.6&nbsp;&nbsp;</span>Components</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Bibliography</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Further reading and resources</a></span></li><li><span><a href="#Appendices" data-toc-modified-id="Appendices-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Appendices</a></span><ul class="toc-item"><li><span><a href="#Advanced-topics-and-concepts" data-toc-modified-id="Advanced-topics-and-concepts-10.1"><span class="toc-item-num">10.1&nbsp;&nbsp;</span>Advanced topics and concepts</a></span></li><li><span><a href="#Matrix-conventions" data-toc-modified-id="Matrix-conventions-10.2"><span class="toc-item-num">10.2&nbsp;&nbsp;</span>Matrix 
conventions</a></span></li></ul></li></ul></div>

## Introduction

Vast swathes of our social interactions and personal behaviours are now conducted online and/or captured digitally. Thus, computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit.

In this training series we cover some of the essential knowledge and skills needed to engage in **Social Network Analysis (SNA)**, a methodological approach that provides concepts, tools and techniques for uncovering and understanding social structures, relations and networks of association. We focus on the three major stages of SNA:
1. Understanding fundamental concepts and terms [Focus of this notebook].
2. Collecting and cleaning social network data from various sources.
3. Performing basic and intermediate analyses of social network data. 

By the end of these lessons you should be confident in your understanding of key SNA concepts and terms, proficient in the handling and cleaning of social network data, and able to apply a range of analytical techniques to derive substantive insight about social structures and relations. In addition, you will gain fluency in the use of the Python programming language for SNA and other computational social science tasks.

### Aims

This lesson - **Social Network Analysis: Fundamental Concepts** - has two aims:
1. Define and examine fundamental concepts and terms underpinning SNA.
2. Cultivate your computational skills through coding examples. For example, there are a number of opportunities for you to interactively explore key concepts, as well as examine data stored in a network format.

### Lesson details

* **Level**: Introductory, for individuals with no prior knowledge or experience of social network analysis.
* **Duration**: 30-60 minutes.
* **Pre-requisites**: None.
* **Audience**: Researchers and analysts from any disciplinary background interested in employing network analysis for social science research purposes.
* **Programming language**: Python.
* **Learning outcomes**:
	1. Understand fundamental concepts and terms associated with SNA.
	2. Understand how social network data are structured and represented.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is Social Network Analysis?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about social network analysis!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Overview

Social networks are commonplace in modern life, mainly due to the influence of social media platforms like Twitter, Facebook, Instagram and many others, though a multitude of networks also arise from offline interactions e.g., business/academic networking and community groups organised around causes or neighbourhood projects (Scott, 2017).

They exist even in fictional worlds...

![Star Wars Social Network](./images/starwars.png)

In the words of the graph's creator:
> Here the nodes represent characters in the movies. The characters are connected by a link if they both speak in the same scene. And the more the characters speak together, the thicker the link between them. The size of each node corresponds to the total number of scenes the character appears in.

[(Gabasova, 2015)](http://evelinag.com/blog/2015/12-15-star-wars-social-network/)

Even such a trivial example &mdash; apologies Star Wars fans &mdash; demonstrates some of the core features of social networks. Let's define and establish these features more formally.

### What is Social Network Analysis (SNA)?

Social network analysis (SNA) is a methodological and conceptual toolbox for the measurement, systematic description, and analysis of patterns in relational structures in the social world (Schneider, 2008). A relation is a distinctive type of connection or tie between two entities (Wasserman & Faust, 1994). For example, a married couple share a spousal relation, a brother and sister share a sibling relation, co-workers share a collegial relation etc. Relations are the building blocks of networks, and thus SNA is concerned with and most appropriate for analyses of data capturing relations between units of analysis (Scott, 2017).

SNA is heavily defined by contributions from *network theory* &mdash; a branch of *graph theory* &mdash; which seeks to generate measurable representations of patterns of relationships between entities in an abstract or actual space (Owen-Smith, 2017). As a result of these origins, SNA is a highly technical and mathematical approach to sociological analysis, replete with an intimidating vocabulary of terms and concepts not encountered in other forms of social science research (Scott, 2017). In addition, the data underpinning SNA have a distinctive structure and thus require specialised approaches to data manipulation and analysis.

### What are the principles underpinning SNA?

Borgatti et al. (2002) establish five key principles of SNA:
1. *SNA focuses on relations (connections) between actors.* Actors and their relations are seen as interdependent rather than independent units.
2. *The relations between actors are the most meaningful focus of analysis.* Your data may allow you to perform other types of analyses &mdash; e.g., how does income vary across individuals and is it associated with variation in subjective wellbeing? &mdash; but the focus of SNA is on examining and understanding how actors are connected e.g., to what extent are individuals connected and are these patterns associated with income and subjective wellbeing? Put another way, the central analytical concern is the relation, not the individual.
3. *The structural and/or relational features of these actors constitute the analytically relevant characteristics of them.* To quote Freeman (2006, p. #):
>...these patterns [interactions between people] are important features of the lives of the individuals who display them.
4. *Relational ties between these actors are the channels for the flow of both material and non-material resources.* In essence, the connections are important and offer opportunities for sharing of valuable resources.
5. *The complete web of actors, their positions and their linkages - the network structure - provides opportunities for (and constraints upon) action.* This is one of the most important points to remember: networks are not only constructed from the relations between actors, they in turn influence the behaviour, opportunities, constraints and outcomes of said actors. We'll reflect on some of the mechanisms through which networks affect outcomes at the micro level in a later lesson [ [LINK] ]().

It is this focus on relational rather than attributional aspects of the units of analysis that makes SNA distinct as a methodology (Caiani, 2014).

### Why should you consider SNA for your research?

From an analytical perspective, SNA can be employed for a variety of valuable purposes (Caiani, 2014; Owen-Smith, 2017).

1. The social phenomenon of interest takes the form of a network i.e., it is the thing you are trying to describe and explain in your research project. For example, a researcher may be interested in analysing the London Underground rail network (like in this [study](https://doi.org/10.1016/j.jtrangeo.2017.11.018)). In such instances SNA provides a powerful and rich set of analytical tools for describing important features of networks.

2. The features and properties of a network can be important explanatory factors ('right-hand side' or independent variables) for understanding other social phenomena. Recall a key feature of social networks elucidated in the above principles: network properties can explain patterns of action and processes of change for individuals/groups. Note also how network/structural properties map onto important concepts in social theories e.g., strong and weak ties (Granovetter, 1973), structural holes (Burt, 1992), social capital (Bourdieu, 1986) etc. 
For example, in a review of the impact of social networks on health outcomes, Smith and Christakis (2008, p. 420) conclude that:
>  illness, disability, health behaviors, health care use, and death in one person are associated with similar outcomes in numerous others to whom that person is tied, and there can be a nonbiological transmission of illness.

### When should you use SNA?

When you are dealing with **relational data** i.e., data capturing relationships and connections between units of analysis. This is in contrast to **attributional data**, which captures the attributes &mdash; characteristics, demographics etc &mdash; of your units of analysis.

Attributional data tends to look like this:

(*Execute the cell below - click on the cell and press `Run` or `Shift + Enter` on your keyboard*)

In [2]:
import pandas as pd
att = pd.read_csv("./data/attributional-data-simple-example.csv", index_col = False)
att

Unnamed: 0,name,sex,age,employed
0,John,male,52,yes
1,Joan,female,45,yes
2,Jenny,female,25,no
3,Juliet,female,67,yes
4,Jack,male,19,no


While relational data tends to look like this:

In [1]:
import pandas as pd # needed if the previous cell has not been run

rel = pd.read_csv("./data/relational-data-simple-example.csv", index_col = 0)
rel
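If the example files are not to hand, the contrast between the two kinds of data can be sketched with a few records built in memory. This is only an illustration: the attribute values are copied from the attributional example above, but the ties in `rel_sketch` are made up and are not the contents of the relational CSV file.

```python
import pandas as pd

# Attributional data: one row per person, one column per characteristic.
# (Values copied from the attributional example above.)
att_sketch = pd.DataFrame({
    "name": ["John", "Joan", "Jenny"],
    "sex": ["male", "female", "female"],
    "age": [52, 45, 25],
})

# Relational data: one row AND one column per person; each cell records
# whether a tie exists between the row person and the column person.
# The ties themselves are hypothetical.
people = att_sketch["name"].tolist()
rel_sketch = pd.DataFrame(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    index=people,
    columns=people,
)

print(att_sketch)
print(rel_sketch)
```

Notice that the attributional table describes each person in isolation, whereas every cell of the relational table refers to a pair of people.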

### What does SNA involve?

Employing SNA in your research typically involves the following activities (Scott, 2017):
* Identifying and visualising patterns of relations between units of analysis.
* Examining structural properties/characteristics of these relations.
* Analysing implications of these relations on outcomes experienced by units of analysis.

As a result of its focus on the relational characteristics of the units of analysis, SNA requires distinctive data structures, methods of analysis and data visualisation techniques (Caiani, 2014). 

### How do you implement SNA in your research?

There are a number of key steps in conducting social network analysis (Hanneman & Riddle, 2005; Owen-Smith, 2017):
1. Pose a carefully articulated research question that requires understanding and/or analysis of a network.
2. Decide which units of analysis and types of relations to analyse i.e., who is connected and which relationships matter? 
3. Collect or select a data set that provides relational data on your units of analysis. This data set can include attributional data also: for example, Twitter data can provide information on which accounts follow each other (relational), as well as details about the accounts themselves (attributional).
4. Summarise the network and its key features using appropriate measures e.g., network size, density, cohesion, components etc.
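As a rough sketch of step 4, two of the summaries mentioned (size and density) can be computed from a simple list of ties. The names and ties are hypothetical, and the density calculation assumes an undirected, binary network, where density is the number of observed ties divided by the number of possible ties.

```python
# Hypothetical undirected, binary ties between four people.
ties = [("Jim", "Jane"), ("Jim", "Josie"), ("Jane", "Josie"), ("Josie", "John")]

# Network size: the number of distinct nodes.
nodes = sorted({person for tie in ties for person in tie})
size = len(nodes)

# Density: observed ties divided by the number of possible ties.
# In an undirected network with n nodes there are n * (n - 1) / 2 possible ties.
possible_ties = size * (size - 1) / 2
density = len(ties) / possible_ties

print("Size:", size)
print("Density:", round(density, 2))
```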

## Key Concepts

A network is constructed from two key components (Owen-Smith, 2017):
1. The **entities** that are (or can be) connected.
2. The **connections** that exist (or could exist) between entities.

For example, a family tree is a network containing individuals (entities) that are related through some type of familial tie (connection). Therefore a network is an aggregation or collection of these entities and their connections. For example, here is the familial network of the members of the UK Royal Family ([BBC, 2020](https://www.bbc.com/news/uk-23272491)):

![UK Royal Family](./images/royal-family.png)

### Entities

The entities in a network are known as **nodes**, a term derived from network theory. Nodes can be individuals, organisations, countries, animals, events, computers, train stations etc. We could also refer to entities as *actors* or *agents* (terms specific to SNA), *vertices* (a term used in geometry) or *points* (a term used in graph theory). For consistency, we will use the term **node**.

Nodes can be differentiated: if there is a particular node of interest in a network, it is often referred to as a **focal node** or **ego**. Nodes that are or could be connected to an ego are known as **alters**. For example, the largest, red node is the ego and the smaller, blue nodes are the alters in the example below.

![Ego Network Example](./images/ego-network-visualisation.png)

Whether a node is designated as an **ego** or an **alter** is a researcher decision and will be informed by the particular analytical and substantive focus of your study.

The patterns of connections between nodes also give rise to further distinctions: two nodes that are connected are called a **dyad**, while three nodes that are connected are called a **triad**.

**Dyad**

![Example Dyad](./images/dyad-2020-08-26.png)

**Triad**

![Example Triad](./images/triad-2020-08-26.png)
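To make the two terms concrete, here is a minimal sketch that lists the dyads and triads in a small set of ties. The names and ties are hypothetical, and "triad" is taken here to mean three nodes that are all tied to one another; other definitions (e.g., any three nodes joined by at least two ties) would need a different condition.

```python
from itertools import combinations

# Hypothetical undirected ties, stored as frozensets so that
# ("Jim", "Jane") and ("Jane", "Jim") count as the same tie.
ties = {frozenset(pair) for pair in [
    ("Jim", "Jane"), ("Jim", "Josie"), ("Jane", "Josie"), ("Josie", "John"),
]}
nodes = sorted({person for tie in ties for person in tie})

# Dyads: pairs of nodes joined by a tie.
dyads = [pair for pair in combinations(nodes, 2) if frozenset(pair) in ties]

# Triads (assumed here to mean three nodes that are all tied to each other).
triads = [
    trio for trio in combinations(nodes, 3)
    if all(frozenset(pair) in ties for pair in combinations(trio, 2))
]

print(dyads)
print(triads)
```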

Finally, it is worth posing the following questions about nodes in advance of your analysis:
* Who are they and how many of them are there in a network?
* What connections/ties exist between them?
* What positions do they occupy in the network? For example, do they broker connections between other nodes?

### Connections

Connections or relations between entities are known as **ties**. Other terms include *edges* (network theory), *lines* (graph theory), or *links*. There are a multitude of different types of ties present in the social world e.g., family relations, friendships, event attendance, club memberships, communal living, collegial relations etc. It is also possible for two entities to be connected by many different types of ties: for example, a pair of colleagues may also be good friends, part of the same sports club, and have attended the same university. As stated previously, it is therefore crucial to define clearly which ties you are interested in measuring, and to acknowledge that your data are most likely a sample of **all** possible ties that exist between your nodes (Hanneman & Riddle, 2005).

#### Measuring ties

There are many measurement scales we can use to measure ties (Hanneman & Riddle, 2005):
* *Nominal or Binary*: presence or absence of a relation.
* *Multinomial*: type of relation e.g., friend, colleague etc - usually transformed into binary.
* *Ordinal (grouped)*: Likert scale of relation strength or frequency e.g., on a scale of 1-5 (1 = weak, 5 = strong), classify the relationship you have with each family member.
* *Ordinal (full-rank)*: rank nodes in order of relation e.g., best friend, second best friend.
* *Interval*: numerical scores assigned to relations e.g., relations between charities and funders, valued by how much funding is provided. It is possible to switch from interval to binary by specifying a threshold or cut point.

However, many methods and measures of analysis focus on ties measured on a *binary* or *interval* scale, and thus we focus on these throughout the rest of this and future lessons.
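Switching from an interval to a binary scale, as mentioned in the list above, can be sketched as follows. The funders, charities, amounts and the cut point of 1,000 are all hypothetical; choosing a sensible threshold is a substantive decision for your own study.

```python
import pandas as pd

# Hypothetical valued ties: funding (in pounds) from funders (rows)
# to charities (columns).
funding = pd.DataFrame(
    [[0, 5000, 250],
     [12000, 0, 0]],
    index=["Funder A", "Funder B"],
    columns=["Charity X", "Charity Y", "Charity Z"],
)

# Cut point chosen purely for illustration: a tie "exists" only if
# the funding is at least 1,000.
threshold = 1000
funding_binary = (funding >= threshold).astype(int)

print(funding_binary)
```

Note that the small grant from Funder A to Charity Z disappears once the threshold is applied, which is exactly the kind of information lost when valued ties are dichotomised.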

#### Tie dimensions

We can define a tie as possessing two dimensions (Scott, 2017):
1. **Numeration / Strength**:
    * *Binary*: a tie exists between two entities. These are known as **edges**.
    * *Valued*: a tie can be assigned a value representing greater / lesser importance, strength, prominence etc. These are known as **arcs**.
2. **Directionality**:
    * *Directed*: the tie flows from one entity to another (and is potentially reciprocated); put another way, the relation has a *source node* and *target node*. For example, an individual (source) donates money to a charity (target).
    * *Undirected*: the tie does not originate from or terminate at a particular node e.g., if John is married to Jane, Jane must be married to John. Therefore an undirected tie is *symmetric* or *reciprocal* by default. It is also possible to treat directed ties as undirected: for example, I am sharing this lesson with you, therefore a connection exists between us.

These dimensions can be combined in order to define different types of network ties. Let's say we have a network of four friends and we're interested in whether they spoke to each other in the previous week. In this instance we want to examine the *undirected*, *binary* ties that are present in the network - see figure below.

![Undirected Binary Tie](./images/ub-tie-2020-08-26.png)

Now let's say we want to visualise how often these individuals spoke in the past week. We can do this by examining the *undirected*, *valued* ties - in this example we see that the lines are weighted by the number of times each pair of individuals spoke: for example, Jim spoke to Jane 12 times, and Josie 20 times.

![Undirected Valued Tie](./images/uv-tie-2020-08-26.png)

We can extend this example by incorporating who most often initiates the contact between each pair of individuals; we are also only interested in whether two individuals spoke, not how often. In such a scenario we are interested in the *directed*, *binary* ties in the network: for example, Josie usually contacts Jane first, while John is typically the one who initiates contact with Josie.

![Directed Binary Ties](./images/db-tie-2020-08-26.png)

Finally we can once again incorporate how often these individuals contact each other to examine *directed*, *valued* ties: for example, Josie and Jim are the pair who speak most often, and Josie is the one who usually initiates contact.

![Directed Valued Ties](./images/dv-tie-2020-08-26.png)
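The four combinations of tie dimensions can also be derived from one another in code. The sketch below starts from a directed, valued matrix and collapses it into the other three types. Apart from the Jim-Jane (12) and Jim-Josie (20) totals mentioned above, every count is hypothetical, and "directed, binary" is interpreted, as in the text, as who initiates contact more often.

```python
import pandas as pd

names = ["Jim", "Jane", "Josie", "John"]

# Directed, valued ties: number of times the row person contacted the
# column person. Counts are hypothetical, chosen so that Jim-Jane totals 12
# and Jim-Josie totals 20.
directed_valued = pd.DataFrame(
    [[0, 7, 6, 0],
     [5, 0, 1, 0],
     [14, 3, 0, 2],
     [0, 0, 4, 0]],
    index=names,
    columns=names,
)

# Undirected, valued: ignore who initiated by adding each cell to its mirror.
undirected_valued = directed_valued + directed_valued.T

# Undirected, binary: did the pair speak at all?
undirected_binary = (undirected_valued > 0).astype(int)

# Directed, binary: 1 if the row person initiated contact more often
# than the column person did.
directed_binary = (directed_valued > directed_valued.T).astype(int)

print(undirected_valued.loc["Jim", "Jane"], undirected_valued.loc["Jim", "Josie"])
print(directed_binary)
```

Each step discards information (first direction, then strength), which is why it is usually safer to store ties in their richest form and simplify later.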

#### Direct vs indirect ties

Thus far we have used the word **tie** to indicate whether two nodes are *directly* connected or not, like in this example of a friendship tie:

![Direct Tie](./images/direct-tie-2020-08-26.png)

However it is also possible for two nodes to be *indirectly* connected, like John and Josie in this example - in common parlance we would say Josie is a friend of a friend of John, or that they share a mutual acquaintance:

![Indirect Tie](./images/indirect-tie-2020-08-26.png)

This distinction between direct and indirect ties is important as some analytical approaches focus on the former, some on the latter. As a rule of thumb however, if you see it stated that two nodes are connected, then assume it is through a direct tie. 
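A minimal sketch of the distinction, using plain Python sets: the friendship ties are hypothetical, and the choice of "Jane" as the mutual friend of John and Josie is an assumption made for illustration.

```python
# Hypothetical friendship ties; "Jane" as the mutual friend is an assumption.
friends = {
    "John": {"Jane"},
    "Jane": {"John", "Josie"},
    "Josie": {"Jane"},
}

def directly_tied(a, b):
    """True if a and b share a direct tie."""
    return b in friends[a]

def mutual_friends(a, b):
    """Nodes tied directly to both a and b (the 'friend of a friend' link)."""
    return friends[a] & friends[b]

print(directly_tied("John", "Josie"))
print(mutual_friends("John", "Josie"))
```

John and Josie have no direct tie, but the non-empty set of mutual friends shows they are indirectly connected.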

### Networks

In simple terms, a network is an aggregation or collection of entities and the connections that exist between them. For example, the London Underground is a network of tube stations that are connected by rail lines.

Networks tend to be multi-modal i.e., nested within other networks (Hanneman & Riddle, 2005). For example, school pupils are nested within schools, which are nested within local authorities, which are nested within countries etc. Therefore using our understanding of nodes (entities) and ties (connections), we can identify different network modalities:
* One-mode network contains one type of node (e.g., students) - this is also known as a *unipartite* network.
* Two-mode network contains two types of node (e.g., students and schools) - this is also known as a *bipartite* or *affiliation* network.

In general we can say a multi-mode/multipartite network contains *k* types of node.
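One common way of working with two-mode data is to convert ("project") it into a one-mode network. The sketch below assumes a small, hypothetical student-by-school affiliation matrix and uses matrix multiplication to count the schools each pair of students has in common; it is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical two-mode data: rows are students, columns are schools.
# A 1 means the student attends (is affiliated with) the school.
affiliation = pd.DataFrame(
    [[1, 0],
     [1, 0],
     [0, 1]],
    index=["Student A", "Student B", "Student C"],
    columns=["School 1", "School 2"],
)

# Multiplying the matrix by its transpose gives a one-mode,
# student-by-student matrix: each cell counts the schools a pair shares.
counts = affiliation.to_numpy() @ affiliation.to_numpy().T

# A node's tie to itself is not meaningful here, so zero out the diagonal.
np.fill_diagonal(counts, 0)

shared = pd.DataFrame(counts, index=affiliation.index, columns=affiliation.index)
print(shared)
```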

Finally, it is important to distinguish between network types, at least from an analytical perspective:
* *Whole network*: interested in the totality of connections between a set of nodes.
* *Ego-centric network/egonet*: interested in a focal node (ego) and what other nodes form part of its network.
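The relationship between the two network types can be illustrated by extracting an ego network from a whole network. The names, ties and the helper function `ego_network` are hypothetical, and the sketch assumes the ego network includes the ties among the alters as well as the ties to the ego.

```python
# Hypothetical undirected ties in a whole network.
ties = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"), ("Cat", "Dan"), ("Dan", "Eve")]

def ego_network(ego, ties):
    """Return the ego's alters and the ties among the ego and those alters."""
    alters = {b if a == ego else a for a, b in ties if ego in (a, b)}
    members = alters | {ego}
    ego_ties = [(a, b) for a, b in ties if a in members and b in members]
    return alters, ego_ties

alters, ego_ties = ego_network("Ann", ties)
print(sorted(alters))
print(ego_ties)
```

Changing the first argument to a different name shows how the same whole network yields a different ego network for each focal node.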

## Representing Networks

Networks can be represented using three complementary methods:
1. Matrices
2. Edgelists
3. Graphs

### Matrices

A matrix is an arrangement of elements into rows (i) and columns (j). Matrices will be very familiar to anyone who has worked with data stored in a spreadsheet: each row represents an observation, each column a variable, and each cell a value for a given variable and observation. For example, here is a data set containing basic details for charities registered in Manchester, U.K.:

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("./data/manchester-charities-2020-07-03.csv", index_col = False)
data.index += 1
print(data.shape) # get number of rows and columns
data.head(5) # view first five observations

(1049, 6)


Unnamed: 0,regno,Name,Latest income,Latest expenditure,Address,Postcode
1,207071,ABRAHAM ALGY BLOOM FOUNDATION,5041,8525,"ONWARD BUILDINGS, 207 DEANSGATE, MANCHESTER",M3 3NW
2,214684,THE DOWAGER COUNTESS ELEANOR PEEL TRUST,716970,557777,"HILL DICKINSON, 50 FOUNTAIN STREET, MANCHESTER",M2 2AS
3,215728,CHARITY KNOWN AS THE SALE ALMSHOUSES,30840,21237,"Mayes Gardens Staff Office, Harrison Street, M...",M4 7FN
4,217014,THE MANCHESTER CHARTERED ACCOUNTANTS STUDENTS'...,37773,42185,"RSM, 3 Hardman Street, Spinningfields, Manchester",M3 3HF
5,221438,PHARMACIST SUPPORT,0,0,"Pharmacist Support, 196 Deansgate, MANCHESTER",M3 3WF


The data set is stored as a matrix containing 1049 rows and 6 columns: each row is a charity, each column a variable capturing an organisational characteristic, and each cell a value. For instance, we can see that the charity number for THE DOWAGER COUNTESS ELEANOR PEEL TRUST is contained in row 2, column 1 - we can express this using matrix notation as follows: 

\begin{equation} \text{X(2, 1)} = \text{214684} \end{equation}

Where:

$X$ is the name of the matrix (this can be of your choosing);

$2$ is the row identifier; and

$1$ is the column identifier.

Keep this notation in mind as we progress, as this is how ties between nodes in a network are represented i.e., as pairs of nodes.
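The same lookup can be sketched in pandas. The small data frame below reproduces only the first two rows and columns of the charity data shown above; the one assumption to keep in mind is that matrix notation counts rows and columns from 1, whereas `.iloc` counts from 0.

```python
import pandas as pd

# First two rows and columns of the charity data shown above.
X = pd.DataFrame(
    {"regno": [207071, 214684],
     "Name": ["ABRAHAM ALGY BLOOM FOUNDATION",
              "THE DOWAGER COUNTESS ELEANOR PEEL TRUST"]},
    index=[1, 2],
)

# Matrix notation counts from 1, whereas .iloc counts from 0,
# so X(2, 1) corresponds to X.iloc[1, 0].
row, col = 2, 1
value = X.iloc[row - 1, col - 1]

print(value)
```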

#### Social networks as matrices

The relations between nodes in a social network are often stored in matrix format, the simplest of which is a "square" matrix i.e., the number of rows equals the number of columns. Let's take some simple, fictional social network data and examine how they can be represented as matrices:

##### Undirected networks

In [4]:
estreet = pd.read_csv("./data/estreet-band-members-2020-08-18.csv", index_col = 0)
pd.options.display.float_format = '{:,.0f}'.format # change display of decimal numbers

print(estreet.shape)
estreet # view data set

(8, 8)


Unnamed: 0,Bruce Springsteen,Garry Tallent,Roy Bittan,Max Weinberg,Steven Van Zandt,Nils Lofgren,Patti Scialfa,Jake Clemons
Bruce Springsteen,,0.0,1.0,1.0,1.0,0.0,1.0,0.0
Garry Tallent,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
Roy Bittan,1.0,0.0,,1.0,1.0,1.0,0.0,1.0
Max Weinberg,1.0,0.0,1.0,,1.0,1.0,1.0,1.0
Steven Van Zandt,1.0,0.0,1.0,1.0,,0.0,1.0,1.0
Nils Lofgren,0.0,0.0,1.0,1.0,0.0,,0.0,0.0
Patti Scialfa,1.0,0.0,0.0,1.0,1.0,0.0,,1.0
Jake Clemons,0.0,0.0,1.0,1.0,1.0,0.0,1.0,


Above we have a square matrix containing one type of node: current members of the [E Street Band](https://en.wikipedia.org/wiki/E_Street_Band). The matrix captures friendship ties between each member of the band: the ties are *binary* - are a pair of band members friends or not - and *undirected* - there is no source of the friendship. Because the ties are undirected, we can say that the matrix is *symmetric*. For example, saying Bruce Springsteen is friends with Roy Bittan is the same as saying Roy Bittan is friends with Bruce Springsteen - in matrix notation: 

$X(\text{Bruce Springsteen}, \text{Roy Bittan}) = X(\text{Roy Bittan}, \text{Bruce Springsteen})$.

Diagonal values represent a node's tie to itself: this does not have any meaning in our example, which is why the value is missing (*NaN*: Not a Number).

Therefore we can describe this social network as an *adjacency matrix*: it maps who is next to whom in a social space. Saying two nodes are adjacent is another way of describing the presence of a tie between them. 
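The symmetry of an adjacency matrix can be checked rather than taken on trust. The sketch below re-enters the E Street Band matrix shown above by hand (so it runs without the CSV file) and, as an assumption for the purposes of the check, treats the missing diagonal as 0.

```python
import numpy as np
import pandas as pd

members = ["Bruce Springsteen", "Garry Tallent", "Roy Bittan", "Max Weinberg",
           "Steven Van Zandt", "Nils Lofgren", "Patti Scialfa", "Jake Clemons"]
nan = np.nan
ties = [
    [nan, 0, 1, 1, 1, 0, 1, 0],
    [0, nan, 0, 0, 0, 0, 0, 0],
    [1, 0, nan, 1, 1, 1, 0, 1],
    [1, 0, 1, nan, 1, 1, 1, 1],
    [1, 0, 1, 1, nan, 0, 1, 1],
    [0, 0, 1, 1, 0, nan, 0, 0],
    [1, 0, 0, 1, 1, 0, nan, 1],
    [0, 0, 1, 1, 1, 0, 1, nan],
]
estreet_sketch = pd.DataFrame(ties, index=members, columns=members)

# Replace the missing diagonal with 0 so the comparison is not upset by NaN.
m = estreet_sketch.fillna(0).to_numpy()

# A matrix is symmetric if it equals its own transpose.
is_symmetric = np.array_equal(m, m.T)

# Each undirected tie appears twice in a symmetric matrix, so halve the total.
n_ties = int(m.sum() / 2)

print(is_symmetric, n_ties)
```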

In summary, the friendship ties between E Street Band members are stored as a binary, undirected, square matrix.
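A quick way to confirm that a matrix is symmetric is to compare it with its transpose. The sketch below does this for a four-member subset of the matrix above; the values are typed in by hand rather than read from the CSV file, and the names `members` and `small` are only for illustration.

```python
import numpy as np
import pandas as pd

members = ["Bruce Springsteen", "Garry Tallent", "Roy Bittan", "Max Weinberg"]

# values copied from the first four rows/columns of the matrix above;
# the diagonal (a node's tie to itself) is left missing
small = pd.DataFrame(
    [[np.nan, 0, 1, 1],
     [0, np.nan, 0, 0],
     [1, 0, np.nan, 1],
     [1, 0, 1, np.nan]],
    index=members, columns=members,
)

# NaN never equals NaN, so fill the diagonal with 0 before comparing
filled = small.fillna(0).to_numpy()

# a matrix is symmetric if it is identical to its transpose
is_symmetric = bool(np.array_equal(filled, filled.T))
print("Symmetric:", is_symmetric)
```

The same comparison should work on the full `estreet` data frame, assuming it has been loaded as shown earlier.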

##### Directed networks

My (Diarmuid) wife is part of a book sharing network with some family members. She can send books to others in the network either unprompted or in response to receiving a book herself. Let's store these book sharing ties in matrix format: 

In [5]:
books = pd.read_csv("./data/book-sharing-members-directed-2020-08-25.csv", index_col = 0)
books

Unnamed: 0,Wife,Aunt,Cousin,Gran,Sister-in-law
Wife,,0.0,1.0,0.0,1.0
Aunt,1.0,,0.0,1.0,0.0
Cousin,1.0,0.0,,1.0,0.0
Gran,1.0,0.0,1.0,,0.0
Sister-in-law,1.0,0.0,0.0,0.0,


You can see that my wife has sent her cousin at least one book (*row 1, column 3*), AND the cousin has sent her at least one (*row 3, column 1*). While her aunt has sent my wife some books (*row 2, column 1*), the reverse isn't true (*row 1, column 2*).
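Reading pairs of cells by eye works for a small matrix, but the same check can be written as a small helper. The sketch below retypes the matrix shown above by hand; `is_reciprocal` is a hypothetical function name, not part of `pandas`.

```python
import numpy as np
import pandas as pd

people = ["Wife", "Aunt", "Cousin", "Gran", "Sister-in-law"]

# rows are senders, columns are receivers (values copied from the output above)
books_toy = pd.DataFrame(
    [[np.nan, 0, 1, 0, 1],
     [1, np.nan, 0, 1, 0],
     [1, 0, np.nan, 1, 0],
     [1, 0, 1, np.nan, 0],
     [1, 0, 0, 0, np.nan]],
    index=people, columns=people,
)

def is_reciprocal(matrix, a, b):
    """Return True if a has sent to b AND b has sent to a."""
    return bool(matrix.loc[a, b] == 1 and matrix.loc[b, a] == 1)

print(is_reciprocal(books_toy, "Wife", "Cousin"))  # both directions present
print(is_reciprocal(books_toy, "Wife", "Aunt"))    # only Aunt -> Wife present
```

You could reuse the helper for any other pair in the network.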

**QUESTION**: is there a reciprocal book-sharing relation between my wife and her sister-in-law?

Once again these ties are *binary*, meaning they capture whether an individual sent a book or not to another member of the network. We could also represent these book-sharing ties by **how many** books were sent between individuals in this network:

In [None]:
books_count = pd.read_csv("./data/book-sharing-members-directed-valued-2020-08-25.csv", index_col = 0)
books_count

The same connections are present in this network as in the one above, but this time the ties are *valued* rather than *binary*: that is, they capture the strength of the connections between members, not just whether a connection exists or not. For instance, my wife has sent two books to her cousin (*row 1, column 3*) and received four in return (*row 3, column 1*). 

Because we are dealing with directed ties, we can calculate how many books people sent by summing the values contained in each row:

In [None]:
books_count.sum(axis=1)

And calculate how many books people received by summing the values contained in each column:

In [None]:
books_count.sum(axis=0)

**QUESTION**: who has received the most books? Who has sent the fewest? And how many people sent more books than they received?

Finally, in a directed network the rows represent the *source* of the tie and the columns the *target* or *receiver* of the tie; while this perspective does not have any meaning for undirected ties, this terminology may still be used and is worth keeping in mind.

### Edgelists

An *edgelist* is simply a list of the ties in a network, with the ties represented as pairs of nodes. For example, the E Street Band network can be represented as an edgelist like so:

In [None]:
estreet_el = pd.read_csv("./data/estreet-band-edgelist-2020-08-25.csv", index_col = False)
estreet_el

An edgelist typically doesn't contain pairs of nodes that are not connected, nor duplicate ties - notice how there is no row where Roy Bittan is the source node and Bruce Springsteen is the target node; in this context the terms *source* and *target* do not have an inherent meaning as the ties are undirected.
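To make the link between the two formats concrete, here is a minimal sketch that turns a small undirected adjacency matrix into an edgelist. It uses a hand-typed, four-member subset of the E Street Band matrix (with 0 on the diagonal for convenience); the column names `source` and `target` are assumed. Only the cells above the diagonal are read, which is why each tie appears once.

```python
import numpy as np
import pandas as pd

members = ["Bruce Springsteen", "Garry Tallent", "Roy Bittan", "Max Weinberg"]

# hand-typed subset of the friendship matrix, diagonal set to 0
adj = pd.DataFrame(
    [[0, 0, 1, 1],
     [0, 0, 0, 0],
     [1, 0, 0, 1],
     [1, 0, 1, 0]],
    index=members, columns=members,
)

# keep only the cells above the diagonal so each undirected tie is listed once
rows, cols = np.nonzero(np.triu(adj.to_numpy(), k=1))

edgelist = pd.DataFrame({
    "source": adj.index[rows],
    "target": adj.columns[cols],
})
print(edgelist)
```

Garry Tallent does not appear in the result because he has no ties in this subset.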

It is possible to include an additional column capturing the strength of the tie, like in our book sharing network:

In [6]:
books_count_el = pd.read_csv("./data/book-sharing-edgelist-2020-08-25.csv", index_col = False)
books_count_el

Unnamed: 0,source,target,weight
0,Wife,Cousin,2
1,Wife,Sister-in-law,1
2,Aunt,Wife,3
3,Aunt,Gran,2
4,Cousin,Wife,4
5,Cousin,Gran,3
6,Gran,Wife,1
7,Gran,Cousin,3
8,Sister-in-law,Wife,2
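An edgelist like the one above can also be loaded directly into `networkx` as a directed, weighted graph. The sketch below retypes the nine rows by hand (so it does not depend on the CSV file) and uses `nx.from_pandas_edgelist()`; the variable names are illustrative.

```python
import networkx as nx
import pandas as pd

# the nine directed, weighted ties shown in the output above
el = pd.DataFrame({
    "source": ["Wife", "Wife", "Aunt", "Aunt", "Cousin", "Cousin", "Gran", "Gran", "Sister-in-law"],
    "target": ["Cousin", "Sister-in-law", "Wife", "Gran", "Wife", "Gran", "Wife", "Cousin", "Wife"],
    "weight": [2, 1, 3, 2, 4, 3, 1, 3, 2],
})

# create_using=nx.DiGraph keeps the direction of each tie;
# edge_attr="weight" stores the number of books on each tie
book_graph = nx.from_pandas_edgelist(
    el, source="source", target="target",
    edge_attr="weight", create_using=nx.DiGraph,
)

print(book_graph.number_of_nodes(), book_graph.number_of_edges())
print(book_graph["Cousin"]["Wife"]["weight"])  # books sent from Cousin to Wife
```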


#### TASK: using the edgelist above, calculate how many books my wife sent in total, and how many were received by her grandmother. Do your figures match our previous calculations (run code below)?

In [None]:
# books_count.sum(axis=1)

### Graphs

Many people are familiar with the visual representation of networks, known as **graphs** (network theory) or **sociograms** (SNA). A graph is a set of lines connecting points, and graph theory is a "*body of mathematical axioms and formulae that describe the properties of the patterns formed by the lines*" (Scott, 2017: #). 

Let's take our E Street Band social network and visualise the friendship ties it contains:

![E Street Band Sociogram](./images/estreet-band-sociogram-2020-08-25.png)

In a network graph (Hanneman & Riddle, 2005):
* Nodes are represented as circles
* Ties are represented as lines (with arrow heads if the tie is directed)
* Colours, shapes and sizes can be used to differentiate nodes by their attributes or network characteristics.
* Colours, shapes and sizes can also be used to differentiate relations by their type or amount.

By default, **there is no inherent meaning or information conveyed by these features**. The fact that Steven Van Zandt is positioned closer to Max Weinberg than to Bruce Springsteen on the graph does not say anything about the nature or strength of their ties. Likewise, Garry Tallent could have been placed anywhere on the graph: it still would not change his status in the network as an isolate (i.e., he has no ties).

It is possible to imbue nodes and ties on a graph with meaning: for example, thicker lines can indicate the strength of the tie, different colours can distinguish nodes by an attribute (e.g., males and females).

Visualising networks is an appealing activity, and there are some excellent examples (see [here](http://www-personal.umich.edu/~mejn/networks/)). However it is better to focus on the structure and analysis of social network data, as they are more revealing about the patterns of connections in a network (Hanneman & Riddle, 2005).

The binary, directed ties in our book sharing network can be visualised as follows:

![Book Sharing Sociogram](./images/book-sharing-sociogram-2020-08-25.png)

## A Simple Analysis

The networks you have encountered so far have been incredibly simple and/or fictional. It's time to solidify your understanding of the core concepts in SNA using real relational data.

### Defining the study

1. **Research question**: What degree of board interlock occurs in the UK charity sector? Board interlock is a measure of the degree to which organisations are connected through shared board members.
2. **Nodes and connections**: Registered charities and whether they have trustees in common. That is, two charities are connected if they have at least one individual who acts as a trustee of each organisation.
3. **Data set**: Current trustees of charities headquartered in Manchester.
4. **Analysis**: Interested in analysing the size of the network, how cohesive it is, and which charities are the most connected.

### Preliminaries

We need to import the Python modules needed for working with network data.

In [19]:
import numpy as np # numerical arrays (used when building the adjacency matrix)
import pandas as pd # data manipulation
import networkx as nx # network analysis
import matplotlib.pyplot as plt # data visualisation

### Getting relational data

We begin with data on the current trustees (board members) of our group of Manchester charities:

In [20]:
data = pd.read_csv("./data/manchester-trustees-2020-08-27.csv", index_col = False)
print(data.shape)
data.head(12)

(2735, 4)


Unnamed: 0,trustee_id,trustee_name,trustee_total_tships,regno
0,5873,Rabbi Abraham Hassan,3,1071809
1,5873,Rabbi Abraham Hassan,3,1095687
2,5873,Rabbi Abraham Hassan,3,1013846
3,14229,David Neuwirth,9,1123674
4,14229,David Neuwirth,9,1166641
5,14229,David Neuwirth,9,1083461
6,14229,David Neuwirth,9,1109132
7,14229,David Neuwirth,9,1084316
8,14229,David Neuwirth,9,1136917
9,14229,David Neuwirth,9,1183303


As you can see, the first individual is a trustee of three charities, the second a trustee of nine etc. Because we have the unique id (`regno`) of the charity a trustee is connected to, this data set contains *relational information* on how charities are connected to each other by the presence of a common individual. For instance, three of the charities - 1071809, 1095687, 1013846 - are all linked through the trustee Rabbi Abraham Hassan. 

Our first task is to extract the relational information contained in this data set: the end result of this process will be an adjacency (node-by-node) matrix containing the *binary*, *undirected* ties linking charities together. That is, a data set where every row and column represents a charity, and the cells indicate whether a pair of charities are linked through at least one trustee.

First, let's see how many charities are in the data:

In [21]:
len(data.drop_duplicates("regno")) # how many unique charity ids are there in the data?

1123

OK, that means we'll have a matrix with 1123 rows and 1123 columns (one for each charity). If a matrix has the same number of rows and columns it is known as a *square matrix*.

#### Adjacency matrix

Let's use some clever Python code to create an adjacency matrix of charities from our trustees data set.
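Before running this on the full data, it may help to see the logic on a tiny, made-up data set. In the sketch below the trustee ids and charity numbers are invented; the steps (self-merge on the trustee, cross-tabulate the charity pairs, blank the diagonal, recode to binary) mirror the cell that follows, except that the diagonal is cleared on a copy of the values rather than in place.

```python
import numpy as np
import pandas as pd

# invented data: trustee 1 sits on charities 101 and 102, trustee 2 on 102 and 103,
# trustee 3 only on 104
toy = pd.DataFrame({
    "trustee_id": [1, 1, 2, 2, 3],
    "regno": [101, 102, 102, 103, 104],
})

# merging the table with itself pairs up every charity a trustee serves on
pairs = toy.merge(toy, on="trustee_id")

# cross-tabulating the pairs counts shared trustees for each pair of charities
counts = pd.crosstab(pairs.regno_x, pairs.regno_y)

# recode to binary ties and remove each charity's tie to itself
values = (counts.to_numpy() > 0).astype(int)
np.fill_diagonal(values, 0)
toy_mat = pd.DataFrame(values, index=counts.index, columns=counts.columns)
print(toy_mat)
```

Charity 104 ends up with a row of zeroes in this toy example because its only trustee sits on no other board.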

In [22]:
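# pair up every charity a trustee serves on by merging the data with itself
data_merge = data.merge(data, on="trustee_id")
# cross-tabulate the pairs: each cell counts the trustees two charities share
charity_mat = pd.crosstab(data_merge.regno_x, data_merge.regno_y)
np.fill_diagonal(charity_mat.values, 0) # remove each charity's tie to itself
charity_mat[charity_mat >= 1] = 1 # recode counts as binary ties
charity_mat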

regno_x \ regno_y,208879,209174,210037,210563,212479,212755,213258,214684,215728,216533,...,1186986,1187340,1187446,1187493,1188245,1188334,1188662,1188791,1188851,1188892
208879,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
209174,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
210037,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
210563,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
212479,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1188334,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1188662,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1188791,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1188851,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As is probably obvious at the outset, there are lots of zeroes; that is, most charities are not connected to most other charities. But they are all connected to at least one other: we can see this by calculating the row/column totals:

In [23]:
charity_mat.sum(axis=1).describe() # row total

count   1,123
mean        3
std         3
min         1
25%         1
50%         2
75%         3
max        23
dtype: float64

In [24]:
charity_mat.sum(axis=0).describe() # column total

count   1,123
mean        3
std         3
min         1
25%         1
50%         2
75%         3
max        23
dtype: float64

**QUESTION:** Why are the summaries of the row and column totals the same?

#### Convert matrix to a `networkx` graph object

Finally, we need to import the matrix into a graph object in Python using the `networkx` module and its `from_pandas_adjacency()` function - this will then allow us to access the rich analytical measures provided by this Python module.

In [29]:
chargraph = nx.from_pandas_adjacency(charity_mat)
chargraph # return what type of object 'chargraph' is

<networkx.classes.graph.Graph at 0x25e1df14898>

### Network-level summaries

We know from our data that we are dealing with a network of charities, which are connected through *undirected*, *binary* ties. We can use `networkx` to learn much more about our network and its properties and structure.

#### Size

First, let's get a sense of how large the network is in terms of nodes and ties:

In [30]:
print(nx.info(chargraph))

Name: 
Type: Graph
Number of nodes: 1123
Number of edges: 1542
Average degree:   2.7462


There are 1,542 ties between the 1,123 Manchester charities. *Average degree* is the mean number of ties a node has with other nodes. In our example, a Manchester charity is connected, on average, to just under three other charitable organisations through its trustees.
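Note that `nx.info()` was removed in version 3.0 of `networkx`, so the cell above may fail on a recent installation. The same figures can be computed directly, as in this minimal sketch on a toy graph; `summarise` is a hypothetical helper name.

```python
import networkx as nx

def summarise(graph):
    """Return number of nodes, number of ties and average degree for an undirected graph."""
    n = graph.number_of_nodes()
    m = graph.number_of_edges()
    # every undirected tie adds one to the degree of two nodes
    avg_degree = 2 * m / n if n > 0 else 0.0
    return n, m, avg_degree

# toy graph: a triangle (A, B, C) with one extra node D attached to C
toy = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
n, m, avg_degree = summarise(toy)
print("Number of nodes:", n)
print("Number of edges:", m)
print("Average degree:", avg_degree)

# the same arithmetic applied to the figures reported above
print(round(2 * 1542 / 1123, 4))
```

Calling `summarise(chargraph)` should reproduce the numbers in the output above, assuming the graph was built as shown.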

#### Density

How cohesive or dense is this network? That is, how many of the possible connections between charities have been realised? We can use the `nx.density()` function to calculate a measure ranging from 1 (all connections realised) to 0 (no connections between nodes). The results below show that our network of charities is not very dense at all, though this is to be expected: our c.1,100 charities have c.2,700 trustees, who are drawn from a city with a population of c.500,000 residents.

In [31]:
density = nx.density(chargraph)
print("Network density:", density)

Network density: 0.0024476073923457506
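For an undirected network with $n$ nodes and $m$ ties, density is the number of ties divided by the number of possible pairs of nodes, i.e. $2m / (n(n-1))$. The sketch below applies that formula to the node and tie counts reported earlier and checks it against `nx.density()` on a toy graph; `undirected_density` is a hypothetical helper name.

```python
import networkx as nx

def undirected_density(n_nodes, n_ties):
    """Share of all possible undirected ties that are present."""
    possible_ties = n_nodes * (n_nodes - 1) / 2
    return n_ties / possible_ties

# figures reported for the charity network: 1,123 nodes and 1,542 ties
charity_density = undirected_density(1123, 1542)
print("Density from formula:", charity_density)

# toy check: a path of 4 nodes has 3 of the 6 possible ties
toy = nx.path_graph(4)
print(nx.density(toy), undirected_density(4, 3))
```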


#### Clustering

To what extent are nodes in the network clustered together? That is, do groups of nodes tend to realise all possible connections between them? *Transitivity* is one such measure of clustering: it is defined as the ratio of all triads realised to all possible triads. A possible triad exists when one node is connected to two others: in such a scenario we can assume that the other two nodes have a good opportunity to connect to each other. Put another way, *transitivity* calculates the probability that two individuals who share a common acquaintance will end up connecting with each other directly i.e., a friend of a friend becomes a friend. See the simple example below:

![Add in triad and possible triad examples]()
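As a rough numerical illustration, the sketch below builds two toy graphs (not taken from the charity data): an open triad, where one node is tied to two others that are not tied to each other, and a graph containing one closed triad plus an extra node. `nx.transitivity()` counts each closed triad three times, once for each node that could sit in the middle, and divides by the number of possible triads.

```python
import networkx as nx

# open triad: B is tied to A and C, but A and C are not tied to each other
open_triad = nx.Graph([("A", "B"), ("B", "C")])

# closed triad A-B-C, plus node D tied only to C
mixed = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

print("Open triad:", nx.transitivity(open_triad))
print("Mixed graph:", nx.transitivity(mixed))
```

In the mixed graph there are five possible triads (one centred on A, one on B and three on C), three of which are closed by the A-B-C triangle, giving 3/5.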

In [None]:
triadic_closure = nx.transitivity(chargraph)
print("Triadic closure:", triadic_closure)

The transitivity measure is high, though this is likely because there are few possible triads to begin with - remember that our network is not very dense, therefore there are few instances where two charities share a common connection with a third organisation. Where such possible triads exist, however, it is likelier than not that a triad will be formed i.e., those two charities will establish a direct connection with each other.

### Node-level measures

Now let's focus on summarising some of the relational properties of nodes in the network e.g., which nodes possess the most ties?

#### Centrality

Now we are concerned with the charities that are most important in the network, that is, which nodes are best connected. The first measure is *degree centrality*, which helps us identify hubs. We start by calculating the number of ties for each node and adding this as a node attribute:

In [None]:
degree_dict = dict(chargraph.degree(chargraph.nodes()))
nx.set_node_attributes(chargraph, degree_dict, "degree")

In [None]:
chargraph.nodes[225116] # number of ties for charity 225116
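Once the number of ties is stored for each node, finding hubs is a matter of sorting. The sketch below does this on a toy graph (the node names are invented) and also shows `nx.degree_centrality()`, which reports the same information as a proportion of the other nodes in the network.

```python
import networkx as nx

# toy graph: C is tied to every other node
toy = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

# number of ties per node, stored as a node attribute
degree_dict = dict(toy.degree())
nx.set_node_attributes(toy, degree_dict, "degree")

# rank nodes from most to fewest ties
ranked = sorted(degree_dict.items(), key=lambda item: item[1], reverse=True)
print(ranked[:3])

# normalised version: number of ties divided by (number of nodes - 1)
centrality = nx.degree_centrality(toy)
print(centrality["C"], centrality["D"])
```

The same sorting step applied to the `degree_dict` created from `chargraph` would list the best-connected charities first.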

### Components

In [None]:
# If your Graph has more than one component, this will return False:
print(nx.is_connected(chargraph))

In [None]:
# Next, use nx.connected_components to get the list of components,
# then use the max() command to find the largest one:
components = list(nx.connected_components(chargraph))
largest_component = max(components, key=len)

print(len(components)) # lots of subgraphs in our network
print(largest_component) # returns the set of charities that form the largest component in the network

In [None]:
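# nx.diameter() is only defined for a connected graph: if the check above printed False, this line raises a NetworkXError
diameter = nx.diameter(chargraph)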

The diameter is the longest shortest path between any pair of nodes, so we calculate it for the largest component:

In [None]:
subgraph = chargraph.subgraph(largest_component)
diameter = nx.diameter(subgraph)
print("Network diameter of largest component:", diameter)
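The steps in this section can be traced on a toy graph with two components (invented node labels). The sketch shows that `nx.diameter()` raises an error on a disconnected graph but works on the subgraph formed by the largest component.

```python
import networkx as nx

# toy graph with two components: a path 1-2-3-4 and a separate pair 5-6
toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (3, 4), (5, 6)])

print("Connected:", nx.is_connected(toy))

# the diameter is undefined when some nodes cannot reach each other
try:
    nx.diameter(toy)
except nx.NetworkXError as err:
    print("Diameter failed:", err)

# restrict the calculation to the largest component instead
components = list(nx.connected_components(toy))
largest = max(components, key=len)
toy_diameter = nx.diameter(toy.subgraph(largest))
print(len(components), sorted(largest), toy_diameter)
```

Here the longest shortest path runs from node 1 to node 4 and is three ties long.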

`networkx` is primarily a network analysis package, though it does possess some functions for visualising networks:

In [None]:
import matplotlib.pyplot as plt

nx.draw(chargraph)
plt.show()

## Conclusion

*SNA and its value and limitations and opportunities*.

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org

Tagliaferri, L. (n.d.). *How to Code in Python 3*. https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf

## Further reading and resources

We maintain a list of useful books, papers, websites and other resources on our SNA Github repository: <a href="https://github.com/UKDataServiceOpen/social-network-analysis/tree/master/reading-list/" target=_blank>[Reading list]</a>

The help documentation for the `networkx` and `pandas` modules is refreshingly readable and useful:
* <a href="LINK" target=_blank>`networkx`</a>
* <a href="LINK" target=_blank>`pandas`</a> 

You may also be interested in the following articles and lessons specifically relating to social network analysis:
* <a href="https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python" target=_blank>Exploring and Analyzing Network Data with Python </a>
* <a href="https://programminghistorian.org/en/lessons/creating-network-diagrams-from-historical-sources" target=_blank>From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources</a>

## Appendices

### Advanced topics and concepts

[*Talk about multiplex networks*]

### Matrix conventions

Let's quickly cover some conventions around the use and manipulation of matrices.