![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Social Network Analysis: Getting and Marshalling Data

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and data generated by simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
September 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Aims" data-toc-modified-id="Aims-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Aims</a></span></li><li><span><a href="#Lesson-details" data-toc-modified-id="Lesson-details-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Lesson details</a></span></li></ul></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Social-Network-Analysis:-The-Basics" data-toc-modified-id="Social-Network-Analysis:-The-Basics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Social Network Analysis: The Basics</a></span><ul class="toc-item"><li><span><a href="#What-is-Social-Network-Analysis?" 
data-toc-modified-id="What-is-Social-Network-Analysis?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What is Social Network Analysis?</a></span></li><li><span><a href="#Key-concepts" data-toc-modified-id="Key-concepts-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Key concepts</a></span></li></ul></li><li><span><a href="#Collecting-social-network-data" data-toc-modified-id="Collecting-social-network-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Collecting social network data</a></span><ul class="toc-item"><li><span><a href="#Importing-credentials" data-toc-modified-id="Importing-credentials-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Importing credentials</a></span></li><li><span><a href="#Authenticating-access" data-toc-modified-id="Authenticating-access-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Authenticating access</a></span></li><li><span><a href="#Requesting-data" data-toc-modified-id="Requesting-data-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Requesting data</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Summary</a></span></li></ul></li><li><span><a href="#Converting-to-social-network-data" data-toc-modified-id="Converting-to-social-network-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Converting to social network data</a></span><ul class="toc-item"><li><span><a href="#Creating-an-adjacency-matrix" data-toc-modified-id="Creating-an-adjacency-matrix-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Creating an adjacency matrix</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Bibliography</a></span></li><li><span><a href="#Further-reading-and-resources" 
data-toc-modified-id="Further-reading-and-resources-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further reading and resources</a></span></li></ul></div>

## Introduction

Vast swathes of our social interactions and personal behaviours are now conducted online and/or captured digitally. Thus, computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit.

In this training series we cover some of the essential knowledge and skills needed to engage in **Social Network Analysis (SNA)**, a methodological approach that provides concepts, tools and techniques for uncovering and understanding social structures, relations and networks of association. We focus on the three major stages of SNA:
1. Understanding fundamental concepts and terms. [ [LINK] ](https://github.com/UKDataServiceOpen/social-network-analysis/blob/master/code/ukds-sna-fundamentals-2020-09-01.ipynb)
2. Collecting and cleaning social network data from various sources [Focus of this notebook].
3. Performing basic and intermediate analyses of social network data. 

By the end of these lessons you should be confident in your understanding of key SNA concepts and terms, proficient in the handling and cleaning of social network data, and able to apply a range of analytical techniques to derive substantive insight about social structures and relations. In addition, you will gain fluency in the use of the Python programming language for SNA and other computational social science tasks.

### Aims

This lesson - **Social Network Analysis: Getting and Marshalling Data** - has two aims:
1. Delineate the key steps in collecting, cleaning and repurposing data for social network analysis.
2. Cultivate your computational skills through coding examples. For example, there are a number of opportunities for you to execute the data collection code for your own purposes.

### Lesson details

* **Level**: Introductory, for individuals with no prior knowledge or experience of social network analysis.
* **Duration**: 45-60 minutes.
* **Pre-requisites**: You are encouraged to complete the following previous lessons:
    * Social Network Analysis: Basic Concepts [LINK]
    * [APIs as a Source of Data](https://github.com/UKDataServiceOpen/web-scraping/tree/master/webinars)
* **Audience**: Researchers and analysts from any disciplinary background interested in employing network analysis for social science research purposes.
* **Programming language**: Python.
* **Learning outcomes**:
	1. Understand the main steps in collecting, cleaning and reshaping data for social network analysis.
	2. Be able to use Python for working with social network data.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is Social Network Analysis?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and SNA!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Social Network Analysis: The Basics

We will quickly cover the essential concepts and elements of Social Network Analysis (SNA), though we strongly advise you to work through our [previous webinar](https://www.youtube.com/watch?v=PJOM0m_WeTA) on this topic.

### What is Social Network Analysis?

Social network analysis (SNA) is a methodological and conceptual toolbox for the measurement, systematic description, and analysis of patterns in relational structures in the social world (Caiani, 2014). 

A relation is a distinctive type of connection or tie between two entities (Wasserman & Faust, 1994). For example, a married couple share a spousal relation, a brother and sister share a sibling relation, co-workers share a collegial relation, and so on.

Relations are the building blocks of networks, and thus SNA is concerned with and most appropriate for analyses of data capturing relations between units of analysis (Scott, 2017).

### Key concepts

A network is constructed from two key components (Owen-Smith, 2017):
1. The **entities** that are (or can be) connected.
2. The **connections** that exist (or could exist) between entities.

For example, a family tree is a network containing individuals (entities) that are related through some type of familial tie (connection). Therefore a network is an aggregation or collection of these entities and their connections. For example, here is the familial network of the members of the UK Royal Family ([BBC, 2020](https://www.bbc.com/news/uk-23272491)):

![UK Royal Family](./images/royal-family.png)
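
To make the two components concrete, here is a minimal sketch of a small family network built with the `networkx` package. The names and relations are hypothetical and chosen purely for illustration; they are not taken from the figure above.

```python
import networkx as nx  # network analysis

# Hypothetical family: the names are illustrative only
family = nx.Graph()

# Entities (nodes)
family.add_nodes_from(["Ann", "Bob", "Cat", "Dan"])

# Connections (edges), each labelled with the type of familial tie
family.add_edge("Ann", "Bob", relation="spouse")
family.add_edge("Ann", "Cat", relation="parent")
family.add_edge("Ann", "Dan", relation="parent")
family.add_edge("Bob", "Cat", relation="parent")
family.add_edge("Bob", "Dan", relation="parent")
family.add_edge("Cat", "Dan", relation="sibling")

print(family.number_of_nodes(), "entities")
print(family.number_of_edges(), "connections")
print("Cat is connected to:", sorted(family.neighbors("Cat")))
```

The graph object stores both components together: the nodes are the entities and the edges are the connections between them.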

## Collecting social network data

In this section we demonstrate how to collect data from the Twitter API.

In [None]:
import tweepy # Twitter API
import json # JSON manipulation
import pandas as pd # data manipulation
from datetime import datetime as dt # datetime parsing and manipulation

### Importing credentials

The first thing we need to do is load in our Twitter API credentials: these are the consumer and/or access details that we need to provide to the API for authentication purposes (i.e., to prove who we are).

These details are confidential and sensitive, therefore it is best to store them in a separate file that no-one else has access to. To show you what these details look like, we'll import some fake credentials from a file.

In [None]:
with open("./twitter-api-credentials-fake.json", "r") as f:
    tokens = json.load(f)

consumer_key = tokens["consumer_key"] # user credentials
consumer_secret = tokens["consumer_secret"] # user credentials
access_token = tokens["access_token"] # access/use credentials
access_token_secret = tokens["access_token_secret"] # access/use credentials

In [None]:
print("User credentials: ", consumer_key, " | ", consumer_secret)
print("\r")
print("Access credentials: ", access_token, " | ", access_token_secret)
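
If you want to create a credentials file of your own, the sketch below shows one plausible layout: a plain JSON object with one entry per credential. The key names mirror those read in the cell above; the values are placeholders, and the file is written to a temporary folder so that nothing on your machine is overwritten.

```python
import json
import os
import tempfile

# Placeholder values only: never share or publish real credentials
fake_tokens = {
    "consumer_key": "xxxxxxxx",
    "consumer_secret": "xxxxxxxx",
    "access_token": "xxxxxxxx",
    "access_token_secret": "xxxxxxxx",
}

# Write the placeholders to a file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "twitter-api-credentials-fake.json")
with open(path, "w") as f:
    json.dump(fake_tokens, f, indent=2)

# Read the file back, exactly as we would with real credentials
with open(path, "r") as f:
    loaded = json.load(f)

print(sorted(loaded.keys()))
```

Keeping credentials in a separate file like this means the notebook itself can be shared without revealing them.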

OK, now let's load in the real credentials, this time without displaying them on screen. As the rest of this lesson focuses on requesting **publicly available** information on Twitter, we do not need the access tokens.

In [None]:
with open("./twitter-api-credentials.json", "r") as f:
    tokens = json.load(f)

consumer_key = tokens["api_key"] # user credentials
consumer_secret = tokens["api_key_secret"] # user credentials

**If you are executing the code in this notebook, you must supply your own credentials as mine cannot be shared with you.**

### Authenticating access

We now supply these credentials to the Twitter API using the `tweepy` module:

In [None]:
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)

As a quick demonstration, let's confirm we have authenticated access to the API by performing a search of public tweets for my (Diarmuid) name:

In [None]:
for tweet in tweepy.Cursor(api.search, q="diarmuid").items(10):
    print(tweet.text)

### Requesting data

Let's run through some examples of requesting data from the Twitter API using the `tweepy` module. First we will request information about my own user account:

In [None]:
api.get_user("DiarmuidMc")

The `get_user()` method returns lots of information about the user account *DiarmuidMc*. If we were so inclined, we could request information about a different user:

In [None]:
api.get_user("BorisJohnson")

We won't concern ourselves with the content of the information that is returned by the request just yet, but you can view the help documentation for [`tweepy`](http://docs.tweepy.org/en/latest/api.html) and the [Twitter API](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/user-object) to learn more. 

Now let's move on to a more substantive concern: the one I stated as the purpose of my use of the Twitter API. Let's see if we can search through a charity's Twitter account for tweets that solicit donations or relate to fundraising more generally.

We'll focus on the [Royal National Lifeboat Institution (RNLI)](https://rnli.org/) Twitter account.

In [None]:
rnli_recent_timeline = api.user_timeline(id="RNLI")
rnli_recent_timeline

We can access some of the content and metadata associated with each status by looping through the list of results and pulling out fields of interest:

In [None]:
for status in rnli_recent_timeline:
    print("Posted on: ", status.created_at)
    print("Content: ", status.text)
    print("\r")

The `user_timeline()` method returns the 20 most recent statuses (tweets and retweets) for a specified account. We can extend the number of statuses that are returned using the `Cursor` object from `tweepy`, which pages through the results for us.

In [None]:
rnli_timeline = [] 
for status in tweepy.Cursor(api.user_timeline, id="RNLI").items():
    rnli_timeline.append(status._json)

In [None]:
len(rnli_timeline)

In [None]:
rnli_timeline[0]

As you can see, there is an abundance of content and metadata associated with each status - a full breakdown of what fields are available is [here](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object).

Now we have all of the recent statuses (as far back as we are allowed using the API, anyway) of the RNLI Twitter account. We'll make our lives simpler by extracting only the fields we are interested in: date the status was posted, the id of the status, and the text.

In [None]:
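rnli_data = []
for status in rnli_timeline:
    # each status is a plain dictionary (we stored status._json), so we use key access
    status_info = {}
    # created_at is a string such as "Wed Oct 10 20:19:24 +0000 2018"
    created = dt.strptime(status["created_at"], "%a %b %d %H:%M:%S %z %Y")
    status_info["date"] = created.strftime("%Y-%m-%d %H:%M:%S")
    status_info["id"] = status["id"]
    status_info["content"] = status["text"]
    rnli_data.append(status_info)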

In [None]:
rnli_data[0:5]
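
The date handling deserves a closer look. In the raw JSON the `created_at` field is a piece of text rather than a date object, so it needs to be parsed before it can be reformatted. Here is a minimal sketch using a single illustrative timestamp (not one taken from the RNLI account):

```python
from datetime import datetime as dt  # datetime parsing and manipulation

# Illustrative timestamp in the format used by the Twitter API (v1.1)
raw = "Wed Oct 10 20:19:24 +0000 2018"

# Parse the text into a datetime object...
parsed = dt.strptime(raw, "%a %b %d %H:%M:%S %z %Y")

# ...then write it back out in a more conventional format
formatted = parsed.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
```

Dates stored as year-month-day sort correctly as plain text, which is convenient once the data are saved to a file.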

Great, now we can save the extracted data to a file:

In [None]:
with open("./data/rnli-timeline-2020-09-15.json", "w") as f:
    json.dump(rnli_data, f)

We can also convert it to a more familiar data structure (e.g., a dataframe):

In [None]:
df = pd.DataFrame(rnli_data)
df.sample(5)

In [None]:
df.to_csv("./data/rnli-timeline-2020-09-15.csv", index = False)
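
With the statuses in a dataframe, we can return to the original question: which tweets relate to donations or fundraising? One simple approach is a keyword search on the text column. The sketch below uses a handful of made-up statuses rather than real RNLI tweets, and the search pattern is an assumption you would want to refine for your own research.

```python
import pandas as pd  # data manipulation

# Hypothetical statuses standing in for the RNLI timeline
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "content": [
        "Please donate to support our volunteer crews",
        "Our lifeboat launched at 3am this morning",
        "Fundraising event this Saturday, all welcome",
        "Stay safe at the coast this weekend",
    ],
})

# Match "donate", "donation", "fundraise", "fundraising" etc., ignoring case
pattern = r"donat|fundrais"
mask = sample["content"].str.contains(pattern, case=False, regex=True)

fundraising = sample[mask]
print(fundraising)
```

Applying the same two lines to `df` would flag the matching RNLI statuses, though a keyword search will inevitably miss some relevant tweets and catch some irrelevant ones.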

### Summary

That concludes our brief demonstration of connecting to and requesting data from the Twitter API. Let's unpack the major steps you can employ to conduct your own activities using the Twitter API:

1. Register for a Twitter API developer account and create a project/application that requires use of said API.

2. Use the `tweepy` Python package to interact with the Twitter API.

3. Connect to the Twitter API using your credentials.

4. Make a request for data.

5. Clean and export data for later analysis.

## Converting to social network data

In this section we demonstrate how to extract and structure relational information contained in an attributional data set. We will draw on a data set of the current trustees (board members) of all registered charities in Manchester, UK.

In [None]:
import pandas as pd # data manipulation
import numpy as np # mathematical operations
import networkx as nx # network analysis

In [None]:
data = pd.read_csv("./data/manchester-trustees-2020-08-27.csv", index_col = False)
print(data.shape)
data.head(15)

As you can see, the first individual is a trustee of three charities, the second a trustee of nine etc. Because we have the unique id (`regno`) of the charity a trustee is connected to, this data set contains *relational information* on how charities are connected to each other by the presence of a common individual. For instance, three of the charities &mdash; 1071809, 1095687, 1013846 &mdash; are all linked through the trustee Rabbi Abraham Hassan. 

Our first task is to extract the relational information contained in this data set &mdash; that is, the connections that exist between charities as a result of having trustees in common.

The end result of this process will be an adjacency (node-by-node) matrix containing the *binary*, *undirected* ties linking charities together. That is, a data set where every row and column represents a charity, and the cells indicate whether a pair of charities are linked through at least one trustee.
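
Before building the real thing, it may help to see what such a matrix looks like. This is a hand-made sketch for three hypothetical charities (A, B and C), where A and B share a trustee, as do B and C:

```python
import pandas as pd  # data manipulation

charities = ["A", "B", "C"]  # hypothetical charities

# 1 = the pair share at least one trustee, 0 = they do not
adj = pd.DataFrame(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    index=charities, columns=charities,
)
print(adj)

# Undirected ties mean the matrix is symmetric: A-B is the same tie as B-A
is_symmetric = bool((adj.values == adj.values.T).all())
print("Symmetric:", is_symmetric)
```

*Binary* means each cell is either 0 or 1, and *undirected* means the matrix reads the same across its rows as down its columns.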

### Creating an adjacency matrix

Creating the adjacency matrix involves the simple but clever trick of merging the data set with itself. Remember that the goal is to see which charities are connected &mdash; another way of conceptualising this is to think of a **cross-tabulation**. That is, examine the frequency with which every pair of charities occurs in the data.

Let's use the `merge()` method on a `pandas` dataframe to merge the charity data with itself.

In [None]:
data_merge = data.merge(data, on="trustee_id")

In [None]:
data[["trustee_name", "regno"]].head(3)

In [None]:
data_merge[["trustee_name_x", "regno_x", "regno_y"]].head(9)

Note the results of the merge: the process has produced a new data set (`data_merge`) containing all possible combinations of charity numbers for each trustee. For example, the trustee Rabbi Abraham Hassan appears three times in the original data (left cell), once for each charity they are a trustee of. The same person appears nine times in the merged data (right cell), once for each combination of charities they are a trustee of.

For instance, Rabbi Abraham Hassan is a trustee of charity *1071809*, therefore this organisation is connected to all of the other charities (and itself) this person is a trustee of.
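
The same behaviour can be seen on a toy data set, which makes the arithmetic easier to follow. In this sketch the trustee ids and charity numbers are made up: trustee 1 sits on three boards and trustee 2 on two.

```python
import pandas as pd  # data manipulation

# Hypothetical data: one row per trustee-charity pairing
toy = pd.DataFrame({
    "trustee_id": [1, 1, 1, 2, 2],
    "regno": [101, 102, 103, 101, 103],
})

# Merge the data set with itself on the trustee id
toy_merge = toy.merge(toy, on="trustee_id")

print(len(toy), "rows before the merge;", len(toy_merge), "rows after")
print(toy_merge[toy_merge["trustee_id"] == 2])
```

A trustee of three charities contributes 3 x 3 = 9 rows and a trustee of two contributes 2 x 2 = 4, giving 13 rows in total. Note that each pairing appears in both directions, and each charity is also paired with itself.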

In [None]:
adj_matrix = pd.crosstab(data_merge.regno_x, data_merge.regno_y)

In [None]:
adj_matrix.loc[[1071809, 1095687, 1013846], [1071809, 1095687, 1013846]]

We see that charity *1071809* has two connections to charity *1013846*. We can account for one of these connections through the trustee Rabbi Abraham Hassan. We can see where the other connection exists as follows:

In [None]:
subset = data_merge.loc[((data_merge["regno_x"] == 1071809) & (data_merge["regno_y"] == 1013846))
                       | ((data_merge["regno_x"] == 1013846) & (data_merge["regno_y"] == 1071809))]
subset[["trustee_name_x", "regno_x", "regno_y"]]

We have two final tasks before we are satisfied our adjacency matrix captures the relations we are interested in:
1. Remove self-loops
2. Convert to binary relations

In [None]:
np.fill_diagonal(adj_matrix.values, 0)

In [None]:
adj_matrix[adj_matrix >= 1] = 1

Let's see the effect of these two operations on the final data set:

In [None]:
adj_matrix.loc[[1071809, 1095687, 1013846], [1071809, 1095687, 1013846]]
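
As a cross-check, there is a different route to the same matrix that some readers may find more intuitive: build a charity-by-trustee table first and multiply it by its own transpose. This is an alternative technique rather than the one used above, and the sketch below relies on made-up trustee ids and charity numbers.

```python
import numpy as np  # mathematical operations
import pandas as pd  # data manipulation

# Hypothetical data: one row per trustee-charity pairing
toy = pd.DataFrame({
    "trustee_id": [1, 1, 1, 2, 2],
    "regno": [101, 102, 103, 101, 103],
})

# Charity-by-trustee table: 1 = the trustee sits on that charity's board
incidence = pd.crosstab(toy["regno"], toy["trustee_id"])

# Multiplying by the transpose counts the trustees each pair of charities share
shared = incidence.dot(incidence.T)

# The merge-and-crosstab approach gives the same counts
merged = toy.merge(toy, on="trustee_id")
via_merge = pd.crosstab(merged["regno_x"], merged["regno_y"])
same_result = bool((shared.to_numpy() == via_merge.to_numpy()).all())
print("Both routes agree:", same_result)

# Same clean-up as before: remove self-loops, then convert to binary
counts = shared.to_numpy().copy()
np.fill_diagonal(counts, 0)
binary = pd.DataFrame((counts >= 1).astype(int),
                      index=shared.index, columns=shared.columns)
print(binary)
```

Charities 101 and 103 share two trustees in this toy example, so the count matrix holds a 2 for that pair, which the final step reduces to a 1.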

Now we are ready to convert the matrix into a `networkx` graph object:

In [None]:
chargraph = nx.from_pandas_adjacency(adj_matrix)
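# nx.info() was removed in networkx 3.0, so we report the counts directly
print(chargraph.number_of_nodes(), "nodes,", chargraph.number_of_edges(), "edges")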
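
Once the data are held in a graph object, a few short commands give a first impression of the network's structure. The sketch below uses a small hand-made adjacency matrix with hypothetical charity numbers rather than the Manchester data, so that the results are easy to verify by eye.

```python
import networkx as nx  # network analysis
import pandas as pd  # data manipulation

labels = [101, 102, 103, 104]  # hypothetical charity numbers

# Charities 101, 102 and 103 are all linked; 104 shares no trustees
toy_adj = pd.DataFrame(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0]],
    index=labels, columns=labels,
)

toy_graph = nx.from_pandas_adjacency(toy_adj)

# Number of ties each charity has
print(dict(toy_graph.degree()))

# Number of separate groups of connected charities
print(nx.number_connected_components(toy_graph))

# Charities with no ties at all
print(list(nx.isolates(toy_graph)))
```

The same three commands can be applied to `chargraph`, although the output will be much longer for a network containing every registered charity in Manchester.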

## Conclusion

Social Network Analysis (SNA) is a broad, rich and increasingly relevant methodology for investigating patterns in social structures and relations. Data on these structures &mdash; *networks* &mdash; are increasingly and abundantly available, either through dedicated social media and networking platforms (e.g., Twitter) or lurking in more traditional social survey and administrative data sets. Developing proficiency in collecting, cleaning and repurposing data for social network analysis is therefore a worthwhile activity for a social scientist to pursue.

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Bourdieu, P. (1986). The Forms of Capital. In J. Richardson (Ed.), *Handbook of Theory and Research for the Sociology of Education* (pp. 241-258). Westport, CT: Greenwood.

Burt, R. S. (1992). *Structural Holes: The Social Structure of Competition*. Cambridge, MA: Harvard University Press.

Caiani, M. (2014). Social Network Analysis. In D. Della Porta (Ed.), *Methodological Practices in Social Movement Research* (pp. 368-396). Oxford: Oxford University Press.

Granovetter, M. (1973). The Strength of Weak Ties. *American Journal of Sociology, 78*(6), pp. 1360-1380.

Hanneman, R. A., & Riddle, M. (2005). *Introduction to social network methods*. <a href="http://faculty.ucr.edu/~hanneman/nettext/" target=_blank>http://faculty.ucr.edu/~hanneman/nettext/</a>.

Owen-Smith, J. (2017). Networks: The Basics. In I. Foster et al. (Eds.), *Big Data and Social Science: A Practical Guide to Methods and Tools* (pp. 215-240). Boca Raton, FL: CRC Press.

Scott, J. (2017). *Social Network Analysis* (4th edition). London: SAGE Publications Inc.

Smith, K. P. & Christakis, N. A. (2008). Social Networks and Health. *Annual Review of Sociology, 34*, pp. 405-429.

Wasserman, S. & Faust, K. (1994). *Social Network Analysis*. Cambridge: Cambridge University Press.

## Further reading and resources

We maintain a list of useful books, papers, websites and other resources on our SNA GitHub repository: <a href="https://github.com/UKDataServiceOpen/social-network-analysis/tree/master/reading-list/" target=_blank>[Reading list]</a>

The help documentation for the `tweepy` module is refreshingly readable and useful: <a href="http://docs.tweepy.org/en/latest/" target=_blank>http://docs.tweepy.org/en/latest/</a>

In addition we strongly recommend getting familiar with the Twitter API by reading its clear and comprehensive help documentation: <a href="https://developer.twitter.com/en/docs" target=_blank>https://developer.twitter.com/en/docs</a>