![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Being a Computational Social Scientist

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research, including data collected from websites and social media platforms, text data, and data generated by simulations (agent-based modelling). To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Knowing-your-computational-environment" data-toc-modified-id="Knowing-your-computational-environment-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Knowing your computational environment</a></span><ul class="toc-item"><li><span><a href="#File-system-and-working-directory" data-toc-modified-id="File-system-and-working-directory-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>File system and working directory</a></span></li><li><span><a href="#Hardware-and-software" data-toc-modified-id="Hardware-and-software-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Hardware and software</a></span></li></ul></li></ul></div>

## Introduction

Computational Social Science: it can be a scary, alluring, mystifying term. You may even be thinking, what's the big deal? Surely almost all social science involves the use of computers: we code our interviews using software such as NVivo; build our statistical models in SPSS, Stata etc; and generally conduct our research and teaching activities within a computational environment (e.g., personal desktop/laptop, Dropbox or iCloud for file storage). However, computational social science (CSS) refers to activities and technologies that go beyond what we're typically familiar with as social scientists:
* The use of datasets that are too large to store on your personal machine; 
* Writing programming scripts to access information held in online databases; 
* Employing analytical techniques, derived from computer science, that reveal structures and patterns in large or unfamiliar datasets (e.g. network analysis, text mining). 

More formally, computational social science is an interdisciplinary branch of research, defined more by its methods and data than its substantive topics (Heiberger & Riebling, 2016). It is not limited to certain analytical approaches (e.g., Machine Learning) or data types (e.g., just text data). So what makes computational social science different from "traditional" social science? What types of knowledge and skills do you need to engage in computational methods? And why would you want to be a computational social scientist?

In this lesson we will describe and demonstrate five key domains of computational social science:
1. Thinking computationally
2. Writing code
3. Computational environments
4. Manipulating structured and unstructured data
5. Reproducibility of the scientific workflow

We cover key theories and ideas behind each domain and provide example code, written in the popular Python programming language, that demonstrates some of the key skills computational social scientists need to develop.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> put it:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Knowing your computational environment*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
# Ask for a name, then print a personalised greeting.
name = input("Enter your name and press enter: ")
print(f"Hello {name}, enjoy learning more about Python and computational social science!")

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Knowing your computational environment

All computational social science activities depend on knowing how to set up, manage and share a computational environment. This ranges from simply understanding how and where files are located on your machine to defining and documenting which software packages, versions and configurations are necessary to execute your data analysis. Whether you are thinking about scraping a web page or implementing an advanced machine learning algorithm, it all begins with establishing your computational environment. First, let's understand how files are stored and accessed on your machine.

### File system and working directory

It is critical that you think *logically* and in an *organised* way about how you manage and store files for your project. This goes beyond just keeping your files and folders tidy using a graphical user interface, and requires that you know how to move around in and interpret command line interfaces. Although this may look a bit unfamiliar or even scary, the black window with stark contrast text and a blinky cursor will become your friend!

The first thing to know is that files and folders stored on your machine's hard drive can be accessed in two ways:
* Absolute path 
* Relative path

Both the absolute and the relative path are like directions to the location of a file or folder, but they differ in that a relative path assumes that whoever is giving the directions is in the same place as whoever is getting them, while an absolute path does not.

For example, if someone were to ask me "Where is your office?" I would answer differently in different contexts. If I was talking on the phone to someone who wanted to send a book to my office, I would respond with an absolute path type answer and give the full postal address of my office. But if someone were standing in the lobby of my building and asked me where to drop something off after lunch, I would give a relative path answer and say which floor and which hallway to take out of the stairwell, plus my office number.
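To make the distinction concrete, here is a minimal sketch using Python's built-in `os.path` tools. The file name `example.csv` is hypothetical and does not need to exist, because these functions only build and inspect the path text.

```python
import os

# A relative path: directions that only make sense from the current working directory.
relative_path = os.path.join("data", "example.csv")  # hypothetical file name

# An absolute path: the full "postal address", valid from anywhere on this machine.
absolute_path = os.path.abspath(relative_path)

print("Relative:", relative_path)
print("Absolute:", absolute_path)
print("Is the relative path absolute?", os.path.isabs(relative_path))
print("Is the absolute path absolute?", os.path.isabs(absolute_path))
```

The absolute version is simply the current working directory with the relative path added to the end, which is why the same relative path can point to different places depending on where you run it.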

It is not always easy to know whether you (the one giving the directions) are in the same file system location as the computer (the one getting the directions), so you need to know how to ask for the current working directory (i.e., where *this* notebook, the one you are working in right now, is located).

One way to ask is to double-click in the code cell below, just after the end of %cd%. Then either
- hit the "Run" button at the top of the page or 
- use the keyboard shortcut Shift + Enter

In [1]:
!echo %cd%

C:\Users\t95171dm\projects\comp-soc-sci\code


What this is doing is asking the computer to repeat out loud (or 'echo') its current working directory (or 'cd'). This happens to be structured very much like an English language command, with the verb at the front and the object at the end, not unlike "Pass the salt" or "Close the door". Note that `%cd%` is specific to the Windows command prompt; on macOS or Linux, the equivalent command is `!pwd`.

When you run a command like this in a code cell, the computer will execute the code and echo back to you its current directory. 

Another way to do that is to import a library called os (which stands for operating system) and ask the computer to use an os command called getcwd (short for get current working directory). Like the echo command above, this tells the computer to report where in the file system it currently is. Try double clicking in the code cell below and hitting "Run" or Shift + Enter.

In [2]:
import os

os.getcwd()

'C:\\Users\\t95171dm\\projects\\comp-soc-sci\\code'

Unlike the echo command, this one is not structured so much like an English language command. Instead, it translates (more or less) to "Using os, run the getcwd command (here)". 

Although it is less English-like, os is very useful. For example, you can use it to get a list of the contents of a directory. Put another way, that means "Tell me everything that is in this folder". Go ahead and double click/Run in the next cell.

In [3]:
os.listdir()

['.ipynb_checkpoints',
 'bcss-code-2020-05-06.ipynb',
 'bcss-notebook-four-2020-02-12.ipynb',
 'bcss-notebook-one-2020-02-12.ipynb',
 'bcss-notebook-three-2020-02-12.ipynb',
 'bcss-notebook-two-2020-02-12.ipynb',
 'convert-data-structures-2010-03-16.ipynb',
 'data',
 'images',
 'README.md']

Roughly translated, this command says "Using os, list the contents of the directory (here)". If you did not run the commands in the previous cell block (the command to import os), you would get an error here. If so, make sure you go back and run the commands to import os and then try this command block again. 

As well as getting os to list the contents of *here*, you can ask it to list the contents of directories that are *there*, without having to move there first.

Double click/Run in the next code cell to see how that works. 

In [4]:
os.listdir("./data/")

['oxfam-csv-2020-03-16.csv',
 'oxfam-csv-2020-03-16.json',
 'oxfam-csv-2020-03-16.xml']

If you look back at the results of asking os to list the contents of *here*, you will see that one of the items in the list was 'data'. When we asked os to list the contents of "./data/", we were asking it to list the contents of a directory or folder called 'data' that is located here.

To translate that a bit more, the "./" at the beginning of "./data/" means "this directory here" or "this directory where we are now". The "data/" part at the end of "./data/" means a directory called "data". If you put them together, it means "a directory called 'data' that is in the directory here".

You can tell that both are directories because of the "/" after "." and after "data". By convention, a "/" means that whatever precedes it is a directory, rather than a file or something else.
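The output of `os.listdir()` does not include a trailing "/", so it is not always obvious which items are folders and which are files. As a small sketch, you can ask os directly using `os.path.isdir`. The helper name `describe_entries` is our own invention for illustration and is not part of the os library.

```python
import os

def describe_entries(path="."):
    """Return (name, kind) pairs for everything inside `path`."""
    entries = []
    for name in sorted(os.listdir(path)):
        full_path = os.path.join(path, name)
        # isdir is True for folders; everything else is labelled as a file here.
        kind = "directory" if os.path.isdir(full_path) else "file"
        entries.append((name, kind))
    return entries

# Label everything in the current working directory.
for name, kind in describe_entries("."):
    print(f"{kind:<10} {name}")
```

Run from the same folder as this notebook, you would expect 'data' and 'images' to be labelled as directories and the .ipynb files as files.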

### Hardware and software
Your computational environment consists of hardware (e.g., the physical machine and its Central Processing Unit) and software (e.g., operating system, programming languages and their versions, files). For instance, here is a snapshot of the environment of one of our work computers as of 2020-03-30. First, the operating system:
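One way to ask your own machine about its operating system is Python's built-in `platform` library. This is a minimal sketch rather than the original snapshot, so the results will describe your machine and not ours.

```python
import platform

os_name = platform.system()     # e.g. 'Windows', 'Linux' or 'Darwin' (macOS)
os_release = platform.release()
machine = platform.machine()    # the type of processor architecture
summary = platform.platform()   # a single descriptive string

print("Operating system:", os_name)
print("Release:", os_release)
print("Machine type:", machine)
print("Summary:", summary)
```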

Next, the version of Python running on the computer, plus some of the additional packages (libraries) that were installed:
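To check this on your own machine, the sketch below uses the built-in `sys` and `importlib.metadata` libraries (the latter assumes Python 3.8 or newer). It prints only the first five packages to keep the output short. Running `!pip freeze` in a notebook cell should give a similar, fuller list.

```python
import sys
from importlib import metadata

python_version = sys.version.split()[0]
print("Python version:", python_version)

# Collect the name and version of every installed package that reports a name.
installed = {}
for dist in metadata.distributions():
    name = dist.metadata.get("Name")
    if name:
        installed[name] = dist.version

print("Number of installed packages:", len(installed))
for name in sorted(installed, key=str.lower)[:5]:
    print(f"{name}=={installed[name]}")
```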


Computational environments tend to be unique: for example, you may have different software applications installed on your machine compared to your classmate; or some machines in your computer lab run Windows 10, others Windows 7. This customisability presents considerable challenges for conducting, sharing and reproducing scientific work. In the words of the Turing Institute:<sup>[5]</sup>
> The analysis should be *mobile*. Mobility of compute is defined as the ability to define, create, and maintain a workflow locally while remaining confident that the workflow can be executed elsewhere.

Trying and failing to reproduce a piece of work after switching to a new machine is, frankly, soul-destroying. Thankfully, there are numerous, simple technological solutions for capturing and sharing your computational environment.
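As one possible illustration, the sketch below asks pip (Python's package installer) to list every installed package and its version, then saves that list to a text file you could share alongside your code. It assumes pip is available in your environment, and the file name `requirements-snapshot.txt` is hypothetical.

```python
import subprocess
import sys

# Ask the pip that belongs to *this* Python interpreter for its package list.
result = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    # Hypothetical file name; pick one that does not overwrite your own files.
    with open("requirements-snapshot.txt", "w", encoding="utf-8") as f:
        f.write(result.stdout)
    print(f"Saved {len(result.stdout.splitlines())} package versions.")
else:
    print("pip does not seem to be available here:", result.stderr.strip())
```

Someone else could then try to install the same package versions on their machine with `pip install -r requirements-snapshot.txt`.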

#### Capturing a computational environment

If you run multiple projects, you may need more than one environment, achieved by using more than one machine or by _partitioning_ your machine into separate units. Each of these environments can then be customised for the kind of work you do on the different projects. 

For example, on one of my machines I have two environments: one for collecting charity data for Australia; and another for interacting with the [Companies House API](https://developer.companieshouse.gov.uk/api/docs/). Each environment has Python installed but they have different Python packages. I do not perform any web-scraping for the Companies House project, so I did not install the `requests` or `BeautifulSoup` packages in that environment.

I find this beneficial because the work I do on the different projects requires different packages, each of which can be picky about which versions of *other* packages I have installed. By keeping them separate, I can install or update only the packages I need, when I need them, without worrying that I will break the chain of requirements for one project by improving it for another. If I were to use one environment for all the different kinds of work I do, some of my scripts might break whenever I tried to upgrade or install something that I need for only one project. Running separate environments for different projects allows me to manage these package dependencies carefully and correctly.
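There are several tools for creating separate environments, and we are not claiming that the one below is what was used for the projects described above. As a minimal sketch, Python's built-in `venv` library can create a fresh, isolated environment in a folder. The folder name `charity-data-env` is hypothetical, and the environment is built in a temporary location so the demo leaves nothing behind.

```python
import os
import tempfile
import venv

with tempfile.TemporaryDirectory() as project_dir:
    env_dir = os.path.join(project_dir, "charity-data-env")  # hypothetical name

    # with_pip=False keeps this demo quick; for real work you would normally
    # use with_pip=True so that you can install packages into the environment.
    venv.create(env_dir, with_pip=False)

    # Every environment created this way gets its own configuration file.
    has_config = os.path.exists(os.path.join(env_dir, "pyvenv.cfg"))
    print("Environment contents:", sorted(os.listdir(env_dir)))
    print("Has its own configuration file?", has_config)
```

Each environment created like this has its own place for installed packages, which is what lets one project upgrade a package without affecting another.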

Interacting with and understanding your computer at a more fundamental level is also excellent training for running your own server for research (or other) purposes. What is a server? Think of it as a more powerful form of personal computer, running in the cloud, where your primary means of communicating with it is through the Command Line Interface (CLI). It is always on (barring any planned or unplanned downtime) and thus is particularly useful for running automated, scheduled tasks, e.g., conducting a weekly scrape of a particular web page.

[5]: https://the-turing-way.netlify.com/reproducible_environments/reproducible_environments.html

<div style="text-align: right"><a href="./bcss-notebook-three-2020-02-12.ipynb" target=_blank><i>Previous section: Writing code</i></a> &nbsp;&nbsp;&nbsp;&nbsp; | &nbsp;&nbsp;&nbsp;&nbsp;<a href="./bcss-notebook-five-2020-02-12.ipynb" target=_blank><i>Next section: Manipulating data</i></a></div>