![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Setting Up Your Computational Environment

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and data generated by simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

## Introduction

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit. Central to engaging in these methods is the ability to write readable and effective code using a programming language.

In this training series we demonstrate core programming concepts and methods through the use of social science examples. In particular we focus on four areas of programming/computational social science:
1. Introduction to Python.
2. Collecting data I: web-scraping. 
3. Collecting data II: APIs.
4. Setting up your computational environment. [Focus of this notebook]

### Aims

This lesson - **Setting up your computational environment** - has two aims:
1. Demonstrate how to create, manage, capture, share and reproduce a Python computational environment.
2. Cultivate your computational thinking skills through coding examples. In particular, how to define and solve a data analysis problem using computational methods.

### Lesson details

* **Level**: Introductory
* **Time**: 30-60 minutes
* **Pre-requisites**: None, though you may find it useful to work through our <a href="https://github.com/UKDataServiceOpen/code-demos/blob/master/code/ukds-intro-to-python-2020-05-06.ipynb" target=_blank>*Introduction to Python for social scientists*</a> lessons first.
* **Audience**: Researchers and analysts from any disciplinary background. The materials are slightly tailored for social scientists through the use of social data.
* **Learning outcomes**:
    1. Understand what a computational environment is.
    2. Be able to install and import Python modules on your machine.
    3. Be able to capture your computational environment and share it with others.
    4. Be able to reproduce a computational environment in order to execute a programming script.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is a computational environment?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [5]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

Enter your name and press enter:
Diarmuid

Hello Diarmuid, enjoy learning more about Python and web-scraping!


### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is a computational environment?

> Every computer has its own unique computational environment consisting of its operating system, what software it has installed, what versions of software packages are installed, and other features [e.g., hardware]...(The Turing Way Community, 2019) 



### Why do you need to understand your computational environment?

In the digital era, computation is becoming central to research and analytical work. Researchers and analysts are increasingly dependent on specialised hardware, software, data and other technological tools to produce their results. In conjunction, there are valid concerns around the reproducibility of scientific work (Christensen et al., 2019; Wynants et al., 2020). Therefore it is critical that attempts are made to ensure a piece of research or analytical work can be reproduced by others. One solution to this problem is to capture and make available the computational environment (or the core aspects of it) in which the work was undertaken.

> The analysis should be mobile. Mobility of compute is defined as the ability to define, create, and maintain a workflow locally while remaining confident that the workflow can be executed elsewhere. (The Turing Way Community, 2019)

### How do you capture a computational environment?

Capturing a computational environment involves recording the computational features of your work:
* What type of machine did you use? How much memory does it have?
* What operating system do you have e.g., Windows, Mac, Linux?
* Which programming language or statistical software did you use to conduct your analysis e.g., Python, R, Stata, NVivo, Qualtrics?
* Which version of the programming language did you use e.g., Python 3.5, 3.6, 3.7 or 3.8?
* What additional packages or modules did you install and make use of in your analysis e.g., Python `nltk` module for natural language processing?
* What other files are necessary for producing the results e.g., data sets?

A basic but valid way of capturing your computational environment is simply to document all of the elements above. Thankfully there are technological solutions that make this process simpler, quicker and more robust.
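Some of these elements can be recorded from within Python itself. The following is a minimal sketch using only the standard library `platform` and `sys` modules. The function name `describe_environment` and the choice of fields are our own illustration, not part of the lesson's scripts, and the sketch does not record memory, installed packages or data files.

```python
import json
import platform
import sys


def describe_environment():
    """Return a dictionary describing the machine and Python installation."""
    return {
        "machine": platform.machine(),                 # hardware type e.g., x86_64
        "operating_system": platform.system(),         # e.g., Windows, Linux, Darwin
        "os_release": platform.release(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,           # which copy of Python is running
    }


# Print the description in a form that could be pasted into project notes
print(json.dumps(describe_environment(), indent=2))
```

Saving this output alongside your code gives readers a starting point for understanding where the work was originally run.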

There are two key methods, which can be combined, for capturing your computational environment:
1. Using package management systems (e.g., Conda, pip) to record the software used in your research or analytical work.
2. Using virtual machines or containers to capture the entire computational environment i.e., the operating system **and** the software, files, data sets etc.

## A social science example

Let's set ourselves a data collection task: downloading the Republic of Ireland Register of Charities using Python.

<a id="comp-env"></a>
###  Creating a computational environment

#### Preliminaries - Check Python is installed

In [6]:
!python --version

Python 3.7.3


In [7]:
import os, sys

os.path.dirname(sys.executable)

'C:\\ANACONDA3'

#### Step 1 - Create project folder

Open your command prompt/terminal/command line interface (CLI) and type the following one at a time:

*Windows/Linux/Mac*

```
mkdir charity-data-download
cd charity-data-download
```

This creates a new folder called `charity-data-download` on your machine and navigates to it.

#### Step 2 - Create computational environment

Next, we need to install a copy of Python in this folder (also known as a virtual environment). With the command prompt/terminal/command line interface (CLI) still open, type the following:

*Windows/Linux/Mac*

```
python -m venv env
```

In essence the above command creates an isolated copy of Python (stored in a folder called `env`) for our `charity-data-download` project. (Technically, it creates a file called `pyvenv.cfg` which points to the original Python installation).

What this means is we can tweak and configure this version of Python without affecting the main installation or any other Python environment that exists on your machine. 
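If you are curious about what the command produces, the same step can be sketched from within Python using the standard library `venv` module. This is an illustration only: the helper name `create_environment` is hypothetical, the environment is created in a temporary folder that is deleted afterwards, and we pass `with_pip=False` to keep the example quick (the `python -m venv env` command installs `pip` by default).

```python
import tempfile
import venv
from pathlib import Path


def create_environment(project_dir, name="env"):
    """Create a virtual environment called `name` inside `project_dir`."""
    env_dir = Path(project_dir) / name
    venv.create(env_dir, with_pip=False)  # assumption: skip pip to keep the sketch fast
    return env_dir


# Use a temporary folder so nothing is left behind on your machine
with tempfile.TemporaryDirectory() as project_dir:
    env_dir = create_environment(project_dir)
    # pyvenv.cfg records where the original Python installation lives
    config = (env_dir / "pyvenv.cfg").read_text()
    print(config)
```

The `home` entry in the printed configuration is the pointer back to the original Python installation.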

#### Step 3 - Activate computational environment

Now that we have a distinct Python computational environment set up, we need to tell our machine to use this environment for our subsequent coding session. With the command prompt/terminal/command line interface (CLI) still open, type the following:

*Windows*

```
env\Scripts\activate.bat
```

*Linux/Mac*

```
source env/bin/activate
```

(Now you start to notice differences in the operating systems and their handling of computational environments)

This means that all of our subsequent work will use the copy of Python contained in the `env` folder, **not** the main installation of Python.
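One way to confirm which copy of Python is in use is to ask Python itself. The sketch below relies on the fact that, inside an environment created by `venv`, `sys.prefix` differs from `sys.base_prefix`. The function name `in_virtual_environment` is our own.

```python
import sys


def in_virtual_environment():
    """Return True if Python is running from a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

If you start Python from the command line after activating `env`, the interpreter path should point inside the `env` folder.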

### Managing a computational environment

If you followed the steps above then you should have a distinct computational environment that is just for any work relating to our `charity-data-download` project. Now let's focus on our coding activity: downloading the Republic of Ireland Register of Charities. This is a .xlsx file containing a list of all charitable organisations registered with the Republic of Ireland charity regulator. The file is located at the following web page: <a href="https://www.charitiesregulator.ie/en/information-for-the-public/search-the-register-of-charities" target=_blank>https://www.charitiesregulator.ie/en/information-for-the-public/search-the-register-of-charities</a>

We're going to move out of Jupyter Notebook at this stage and execute the download code directly in the command prompt/terminal/command line interface (CLI). The code is contained in the script linked below (it is not executable in this notebook).

First, save the following file to the `charity-data-download` folder on your machine:

Script: [ire-register-of-charities-download.py](files/ire-register-of-charities-download.py)

Next, with the `charity-data-download` computational environment active, type the following into your command prompt/terminal/command line interface (CLI):

*Windows/Linux/Mac*

```
python ire-register-of-charities-download.py
```

This executes/runs the code contained in the file.

Hmm, it seems there's an error in our code. Helpfully Python tells us what kind of error it is: `ModuleNotFoundError: No module named 'requests'`. This means that Python cannot find the `requests` module on your machine.
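You can check whether a module is available without triggering an error. The sketch below uses `importlib.util.find_spec` from the standard library, which returns `None` when a top-level module cannot be found. The helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# `random` ships with Python; `requests` has to be installed separately
for name in ["random", "requests"]:
    print(name, "available:", is_installed(name))
```

In a freshly created environment we would expect `requests` to be reported as unavailable, which matches the error above.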

#### Step 1 - Installing Python modules

Python comes with a lot of functionality that is available straight away when you launch it. For example, you can perform calculations like so:

In [8]:
(100 * 50) / 10

500.0

However many of the more interesting and complicated programming tasks require additional functionality that you need to import into your Python session. For example, if we want to randomly select a number between 1 and 100 inclusive:

(Try running the code a couple of times to see what happens - is it truly "random"?)

In [9]:
import random

random.randint(1,100)

39

Note how we needed to import (load) a module called `random` in order to use the `randint` function. **Modules** are Python programming scripts that contain additional techniques or functions for you to use. Some modules are automatically included with the standard installation of Python: in this scenario, all you need to do is **import** the module into your Python session (e.g., `import random`).
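On the question of whether the result is truly "random": the numbers come from a pseudo-random generator, so fixing its starting point (the *seed*) makes the sequence repeatable. The short sketch below illustrates this. The seed value `2020` is an arbitrary choice of ours.

```python
import random

# Draw five numbers after setting the seed
random.seed(2020)
first = [random.randint(1, 100) for _ in range(5)]

# Reset the seed to the same value and draw again
random.seed(2020)
second = [random.randint(1, 100) for _ in range(5)]

print(first == second)  # True: the same seed reproduces the same sequence
```

Setting a seed is a small but useful habit when you want others to reproduce an analysis that involves random sampling.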

(You can view the standard library of Python modules here: <a href="https://docs.python.org/3/library/" target=_blank>https://docs.python.org/3/library/</a>)

However, many of the modules you will need for your computational social science / data science / data analysis work have to be installed on your machine before they can be imported. For example, we need to install the `requests` module in order to download the Register of Charities (and for web-scraping more generally).

(Technically you are installing a *package* that contains the module; think of this like a zip file containing a programming script: the zip file is the package, the script is the module).

If you've followed the steps above then the `charity-data-download` environment should be active. If so, open your command prompt/terminal/command line interface (CLI) and type the following:

*Windows/Linux/Mac*

```
pip install requests
```

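This command installs the `requests` module on your machine. Because the `charity-data-download` environment is active, the module is installed in the `env` folder rather than in the main installation of Python.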
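To confirm which version was installed, you could query the package metadata from Python. This is a sketch that assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library (it is not available in Python 3.7). The helper name `installed_version` is our own.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print("requests:", installed_version("requests"))
```

A result of `None` suggests the package is not installed in the environment that is currently active.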

Now let's try to execute our data download code once more. Type the following into your command prompt/terminal/command line interface (CLI):

*Windows/Linux/Mac*

```
python ire-register-of-charities-download.py
```

How do we know the file was downloaded? Simply navigate to the `charity-data-download` folder and see if a file called *ire-register-of-charities.xlsx* exists.
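The same check can be done in code rather than by eye. The sketch below uses `pathlib` from the standard library. The helper name `file_was_downloaded` is hypothetical, and treating an empty file as a failed download is our own assumption.

```python
from pathlib import Path


def file_was_downloaded(filename, folder="."):
    """Return True if `filename` exists in `folder` and is not empty."""
    path = Path(folder) / filename
    return path.is_file() and path.stat().st_size > 0


# Run this from inside the charity-data-download folder
print(file_was_downloaded("ire-register-of-charities.xlsx"))
```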

#### Step 2 - Updating modules

If we wanted to update to the latest version of the `requests` module, we can type the following into the command prompt/terminal/command line interface (CLI):

*Windows/Linux/Mac*

```
pip install requests --upgrade
```

If you want to check which installed packages have newer versions available:

*Windows/Linux/Mac*

```
pip list --outdated
```

### Capturing a computational environment

We've done the hard work of setting up the environment and installing additional modules (packages). We've also successfully executed our programming script. How do we ensure that the programming script can be executed on a different machine/by a different team member etc?

#### Step 1 - Saving a list of installed modules

We can save a list of the additional (i.e., those we've installed ourselves after the initial installation of Python) modules and their versions by typing the following into the command prompt/terminal/command line interface (CLI):

*Windows/Linux/Mac*

```
pip freeze > requirements.txt
```

`pip freeze` displays the additional installed modules, and `> requirements.txt` saves this list to a file.

Note that although we only installed one extra module (`requests`), there is a much longer list of modules displayed using `pip freeze`. This is because `requests` automatically installs other modules that it needs in order to function properly.

We can use this knowledge to refine our *requirements.txt* file so that it only contains one line:

```
requests==2.23.0
```
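For larger projects, trimming the list by hand becomes tedious. The sketch below shows one possible way of keeping only the packages you installed directly. The function name `pin_top_level` is hypothetical, and the `example_freeze` text is illustrative output made up for this example (apart from the `requests==2.23.0` line used above), not output captured from the lesson.

```python
def pin_top_level(freeze_output, top_level):
    """Keep only the `name==version` lines for packages named in `top_level`."""
    wanted = {name.lower() for name in top_level}
    pinned = []
    for line in freeze_output.splitlines():
        name, separator, _version = line.partition("==")
        if separator and name.strip().lower() in wanted:
            pinned.append(line.strip())
    return pinned


# Illustrative `pip freeze` output; the dependency versions are assumptions
example_freeze = """certifi==2020.4.5.1
chardet==3.0.4
idna==2.9
requests==2.23.0
urllib3==1.25.9
"""

print("\n".join(pin_top_level(example_freeze, ["requests"])))
```

Keeping only top-level packages makes the file easier to read, at the cost of letting `pip` choose the versions of the supporting modules when the environment is rebuilt.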

#### Step 2 - Sharing a list of installed modules

This is the easy part: because our programming activity is quite simple and only relies on a) a standard installation of Python, b) one additional module installation (`requests`), and c) the file containing the code, this is all another individual needs in order to reproduce the work. That is, the *requirements.txt* file captures the computational environment necessary to execute this programming task.

Therefore you just need to make the *requirements.txt* and *ire-register-of-charities-download.py* files available to whoever needs them. You could just zip these files and email them; better practice would be to make them as open as possible by placing them on an online public repository (e.g., GitHub).

### Reproducing a computational environment

Our final task is to reproduce someone else's work. We'll keep our example simple: let's try and execute a programming script from a previous coding demo (manipulating census data). First, we need to create and navigate to a project folder on our machine:

*Windows/Linux/Mac*

```
mkdir census-data-cleaning
cd census-data-cleaning
```

Second, we create and activate a computational environment:

*Windows*

```
python -m venv env
env\Scripts\activate.bat
```

*Linux/Mac*

```
python -m venv env
source env/bin/activate
```


Third, we download and move these files to the `census-data-cleaning` folder:

Script: [census-1961-data.py](files/census-1961-data.py)

Data: [census_1961.csv](files/./responses/census_1961.csv)

Requirements: [census-requirements.txt](files/census-requirements.txt)

Fourth, we try and execute the programming script prior to customising the computational environment:

*Windows/Linux/Mac*

```
python census-1961-data.py
```

You should get the following error: `ModuleNotFoundError: No module named 'pandas'`.

Fifth, we customise the computational environment by installing the modules contained in the `census-requirements.txt` file:

*Windows/Linux/Mac*

```
pip install -r census-requirements.txt
```
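Before re-running the script, you could check that the environment now matches the requirements file. The following is a rough sketch, assuming Python 3.8 or later (for `importlib.metadata`) and a requirements file that only uses exact `name==version` pins. The function name `unmet_requirements` is our own, and lines in any other format are simply skipped.

```python
from importlib import metadata


def unmet_requirements(requirement_lines):
    """Return (name, wanted, found) for each pinned package that is missing or mismatched."""
    problems = []
    for line in requirement_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        name, separator, wanted = line.partition("==")
        if not separator:
            continue  # this sketch only handles exact pins
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems.append((name, wanted, found))
    return problems


# Example with a made-up package name that should not be installed anywhere
print(unmet_requirements(["no-such-package-xyz==1.0"]))
```

To check a real file you could pass in its lines, for example `unmet_requirements(open("census-requirements.txt"))`; an empty list would suggest every pinned package is installed at the stated version.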

Finally, we execute the `census-1961-data.py` script once more:

*Windows/Linux/Mac*

```
python census-1961-data.py
```

And voilà, we have successfully reproduced a computational environment!

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **What a computational environment is**. It consists of your operating system, the software you have installed, the versions of that software, and other features such as hardware.
* **How to create and activate a computational environment**. Using `python -m venv env` you can give each project its own isolated copy of Python.
* **How to install, update and import modules**. Python does come with some methods and functions that are ready to use straight away, but for computational social science tasks you'll almost certainly need to install (`pip install`) and import some additional modules.
* **How to capture and share a computational environment**. `pip freeze > requirements.txt` records the modules (and their versions) that your work depends on, which you can share alongside your code.
* **How to reproduce a computational environment**. `pip install -r` installs the modules listed in someone else's requirements file so that you can execute their programming script.

## Conclusion

Creating, capturing and sharing a computational environment is a simple yet powerful way of making your research or analytical work reproducible. It provides a relatively gentle introduction to working with the command line and managing Python modules, also. With a virtual environment and a *requirements.txt* file, others can execute your code using the same modules you did, and you can do the same for theirs.

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Christensen, G., Freese, J., & Miguel, E. (2019). *Transparent and reproducible social science research: How to do open science*. Oakland, California: University of California Press.

The Turing Way Community. (2019). *The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4)*. Zenodo. http://doi.org/10.5281/zenodo.3233986

Wynants, L., Van Calster, B., Bonten, M. M. J., Collins, G. S., Debray, T. P. A., De Vos, M., . . . van Smeden, M. (2020). Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. *BMJ, 369*, m1328. doi: 10.1136/bmj.m1328

## Further reading and resources

We hope this brief lesson has whetted your appetite for learning more about Python computational environments. There are some fantastic learning materials available to you, many of them free. We highly recommend the materials referenced in the Bibliography.

In addition, you may find the following resources useful:
* <a href="https://github.com/UKDataServiceOpen/web-scraping" target=_blank>**Web-scraping for Social Science Research**</a> - a free UK Data Service training series on web-scraping and APIs, with three webinars and lots of detailed coding examples.
* <a href="https://automatetheboringstuff.com/" target=_blank>**Automate the Boring Stuff with Python**</a> - a free ebook covering lots of interesting, practical uses of Python. Chapter 16 covers APIs and JSON files.
* <a href="https://github.com/public-apis/public-apis" target=_blank>**Public APIs**</a> - a list of free APIs covering a range of interesting domains e.g., social data, food & drink, weather etc.

## Appendices

### Appendix A - Stop-and-search data for every police force

It would be inefficient to request data for every police force one at a time. We can make use of lists and loops to speed up the downloading of data.

First, let's get our list of police forces:

In [10]:
# Import modules

import os
import requests
import json
from datetime import datetime
print("Successfully imported necessary modules")

# Define web address and search terms

baseurl = "https://data.police.uk/api/"
forces = "forces"
webadd = baseurl + forces

# Make call to API

response = requests.get(webadd)
response.status_code

# Store data in variable

forces_data = response.json()

Successfully imported necessary modules


Next, we extract a list of force ids:

In [11]:
force_ids = [el["id"] for el in forces_data]

Then, for each of these ids we request stop-and-search data and store the results in a list:

In [12]:
baseurl = "https://data.police.uk/api/"
sas = "stops-force"

forces_sas_data = [] # create a blank list for storing results of request

for force in force_ids: 
    webadd = baseurl + sas + "?force=" + force
    
    response = requests.get(webadd)
    
    if response.status_code==200:
        sas_data = response.json()
    
        for el in sas_data:
            el["force"] = force
            el["code"] = response.status_code
            el["note"] = "Downloaded data"
    else:
        sas_data = {"force": force, "note": "Could not download", "code": response.status_code}
        
    forces_sas_data.append(sas_data)

You'll see we added a conditional statement (`if, else`) to check whether we made a successful request for data: if yes, then store the data in the `sas_data` variable; if no, then define the `sas_data` variable as a dictionary containing some notes about the unsuccessful attempt.

Let's check the results. Our `forces_sas_data` list should contain the results for 44 police forces:

In [13]:
len(forces_sas_data)

44

We'll leave it to you to examine the contents of the `forces_sas_data` variable.
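As a starting point, here is one possible way of summarising the results. It is a sketch that assumes the structure built by the loop above: a list of records for each successful request and a single dictionary for each failed one. The function name `summarise_downloads` and the `example_results` data (including the force ids) are made up for illustration.

```python
def summarise_downloads(results):
    """Return (records per force, list of forces whose request failed)."""
    downloaded = {}
    failed = []
    for result in results:
        if isinstance(result, dict):
            # A dictionary means the request did not succeed
            failed.append(result["force"])
        elif result:
            # A non-empty list of records; every record carries the force id
            downloaded[result[0]["force"]] = len(result)
        # Empty lists (successful request, no records) are skipped
    return downloaded, failed


# Hypothetical results with the same shape as `forces_sas_data`
example_results = [
    [
        {"force": "force-a", "code": 200, "note": "Downloaded data"},
        {"force": "force-a", "code": 200, "note": "Downloaded data"},
    ],
    {"force": "force-b", "note": "Could not download", "code": 502},
    [],
]

downloaded, failed = summarise_downloads(example_results)
print("Records per force:", downloaded)
print("Failed requests:", failed)
```

Calling `summarise_downloads(forces_sas_data)` on the real results should show how many records were returned for each force and which requests failed.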

--END OF FILE--