![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Setting Up Your Computational Environment

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and data generated by simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

## Introduction

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit. Central to engaging in these methods is the ability to write readable and effective code using a programming language.

In this training series we demonstrate core programming concepts and methods through the use of social science examples. In particular we focus on four areas of programming/computational social science:
1. Introduction to Python.
2. Collecting data I: web-scraping. 
3. Collecting data II: APIs.
4. Setting up your computational environment. [Focus of this notebook]

### Aims

This lesson - **Setting up your computational environment** - has two aims:
1. Demonstrate how to create, manage, share and reproduce a Python computational environment.
2. Cultivate your computational thinking skills through coding examples. In particular, how to define and solve a data analysis problem using computational methods.

### Lesson details

* **Level**: Introductory
* **Time**: 30-60 minutes
* **Pre-requisites**: None, though you may find it useful to work through our <a href="https://github.com/UKDataServiceOpen/code-demos/blob/master/code/ukds-intro-to-python-2020-05-06.ipynb" target=_blank>*Introduction to Python for social scientists*</a> and <a href="https://github.com/UKDataServiceOpen/code-demos/blob/master/code/ukds-web-scraping-2020-05-13.ipynb" target=_blank>*Collecting data I: web-scraping*</a>  lessons first.
* **Audience**: Researchers and analysts from any disciplinary background. The materials are slightly tailored for social scientists through the use of social data.
* **Learning outcomes**:
    1. Understand what a computational environment is.
    2. Be able to install and import Python modules on your machine.
    3. Be able to capture your computational environment and share it online.
    4. Be able to reproduce a computational environment in order to execute a programming script.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is a computational environment?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [1]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

Enter your name and press enter:
Diarmuid

Hello Diarmuid, enjoy learning more about Python and web-scraping!


### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is a computational environment?

An Application Programming Interface (API) is
> a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service (Oxford English Dictionary).

In essence: an API acts as an intermediary between software applications. Think of an API's role as similar to that of a translator facilitating a conversation between two individuals who do not speak the same language. Neither individual needs to know the other's language, just how to formulate their response in a way the translator can understand. Similarly, an API **simplifies** how applications communicate with each other.

It performs this role by providing a set of protocols/standards for making *requests* and formulating *responses* between applications. For example, a smart phone application might need real-time traffic data from an online database. An API can validate the application's request for data, and handle the online database's response (i.e., the transfer of data to the application). In the absence of an API, the smart phone application would need to know a lot more technical information about the online database in order to communicate with it (e.g., what commands does the database understand?). But thanks to the API, the smart phone application only needs to know how to formulate a request that the API understands, which then communicates the request to the database and handles the response.

Run the code below for a graphical representation of how an API works.

In [2]:
from IPython.display import IFrame
IFrame("./images/ukds-apis-slides.mp4", width=900, height=600)

### Why do you need to understand your computational environment?



### How do you create, manage, share and reproduce a computational environment?

1. Create:
    * Install Python
    * Create a folder/directory for your project
    * Create a computational environment i.e., set up a distinct version of Python, separate from the main installation
    * Activate the computational environment i.e., use this version of Python for subsequent coding work
2. Manage:
    * Customise your computational environment by installing and updating Python modules
3. Share:
    * Capture your computational environment using a `requirements.txt` file
    * Upload your environment to a public repository
    * Optionally, make the repository runnable online using mybinder.org
4. Reproduce:
    * Install someone else's computational environment on your machine in order to reproduce some Python code
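
As a rough illustration of the "Share" stage, the sketch below lists the packages installed in the current environment in the `name==version` format used by a `requirements.txt` file. It assumes Python 3.8 or later (for `importlib.metadata`), and the helper name `list_installed_packages` is ours. In practice, most people capture an environment with the `pip freeze` command instead.

```python
from importlib import metadata  # part of the standard library from Python 3.8 onwards


def list_installed_packages():
    """Return sorted 'name==version' strings for installed packages (hypothetical helper)."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip any package with incomplete metadata
            lines.add("{}=={}".format(name, version))
    return sorted(lines, key=str.lower)


requirements = list_installed_packages()
print("\n".join(requirements))
```

Saving this output to a text file would give a collaborator a list of modules (and versions) to install in their own environment.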

## A social science example

Let's set ourselves a data collection task: downloading the Republic of Ireland Register of Charities using Python.

<a id="comp-env"></a>
###  Creating a computational environment

#### Preliminaries - Check Python is installed

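Before creating an environment, confirm that Python is installed on your machine and check which version you have: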

In [3]:
!python --version

Python 3.7.3
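
The same check can be made from inside Python itself, which is handy when working in a notebook. A minimal sketch using the standard `sys` module (the variable name is illustrative):

```python
import sys

# Version of the interpreter running this code, as a (major, minor, micro) tuple
version = sys.version_info[:3]
print("Python version: {}.{}.{}".format(*version))

# Full path to the interpreter; useful for confirming which installation is in use
print("Interpreter location:", sys.executable)
```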


#### Step 1 - Create project folder

Open your command prompt/terminal/command line interface (CLI) and type the following one at a time:

*Windows/Linux/Mac*

```
mkdir charity-data-download
cd charity-data-download
```

This creates a new folder called `charity-data-download` on your machine and navigates to it.

#### Step 2 - Create computational environment

Next, we need to install a copy of Python in this folder (known as a virtual environment). With the command prompt/terminal/command line interface (CLI) still open, type the following:

*Windows/Linux/Mac*

```
python -m venv env
```

In essence the above command creates an isolated version of Python (stored in a folder called `env`) for our `charity-data-download` project. (Technically, it creates a file called `pyvenv.cfg` which points to the original Python installation).

What this means is we can tweak and configure this version of Python without affecting the main installation or any other Python environment that exists on your machine. 
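
To see what the command produces, the sketch below builds a throwaway environment with Python's built-in `venv` module and lists what is inside it. It works in a temporary folder so that your project is left untouched, and it skips installing `pip` (`with_pip=False`) purely to keep the demonstration quick. Both choices are assumptions made for illustration and are not part of the lesson's workflow.

```python
import os
import tempfile
import venv

with tempfile.TemporaryDirectory() as tmp_dir:
    env_dir = os.path.join(tmp_dir, "env")

    # Rough equivalent of `python -m venv env`, minus pip
    venv.create(env_dir, with_pip=False)

    # Folders and files that make up the environment
    contents = sorted(os.listdir(env_dir))
    print(contents)

    # pyvenv.cfg records where the original Python installation lives
    with open(os.path.join(env_dir, "pyvenv.cfg")) as f:
        config_text = f.read()
    print(config_text)
```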

#### Step 3 - Activate computational environment

Now that we have a distinct Python computational environment set up, we need to tell our machine to use this environment for our subsequent coding session. With the command prompt/terminal/command line interface (CLI) still open, type the following:

*Windows*

```
env\Scripts\activate.bat
```

*Linux/Mac*

```
source env/bin/activate
```

(Now you start to notice differences in the operating systems and their handling of computational environments)

This means that all of our subsequent work will use the version of Python contained in the `env` folder, **not** the main installation of Python.
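
One way to confirm that activation worked is to ask Python itself. The sketch below relies on the fact that, inside an environment created by `venv`, `sys.prefix` points to the environment folder while `sys.base_prefix` points to the main installation. The function name is ours.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

If you launch Python from the terminal where `env` is activated, the interpreter path should sit inside the `env` folder.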

### Managing a computational environment

If you followed the steps above then you should have an active computational environment contained in the `charity-data-download` folder. Now let's focus on our coding activity: downloading the Republic of Ireland Register of Charities. This is an .xlsx file containing a list of all charitable organisations registered with the Republic of Ireland charity regulator. The file is located at the following web page: <a href="https://www.charitiesregulator.ie/en/information-for-the-public/search-the-register-of-charities" target=_blank>https://www.charitiesregulator.ie/en/information-for-the-public/search-the-register-of-charities</a>

OK, let's execute our Python code and see if it works:

In [8]:
# Import modules

import os # module for working with the operating system


# Request file

file_webadd = "https://www.charitiesregulator.ie/media/1931/public-register-22052020.xlsx"
response_file = requests.get(file_webadd)


# Save files (data)

outfile = "ire-register-of-charities.xlsx"

if os.path.isfile(outfile): # do not overwrite existing file
    print("File already exists, no need to overwrite")
else: # file does not currently exist, therefore create
    with open(outfile, "wb") as f:
        f.write(response_file.content)

NameError: name 'requests' is not defined

Hmm, it seems there's an error in our code. Helpfully, Python tells us what kind of error it is: `NameError: name 'requests' is not defined`. This means that Python does not recognise the name `requests`. However, I know from experience that this is indeed the correct method - `requests.get()` - for requesting web pages/downloading files. The problem is that the module providing this method has not yet been installed and imported.
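
The same kind of error can be reproduced without any downloading. In this small sketch we use a module that ships with Python (`statistics`, chosen only for illustration) before and after importing it:

```python
message = None

try:
    # `statistics` has not been imported yet, so Python does not know the name
    statistics.mean([1, 2, 3])
except NameError as err:
    message = str(err)
    print("NameError:", message)

import statistics  # importing the module defines the name

average = statistics.mean([1, 2, 3])
print(average)
```

The first attempt fails with a `NameError`, while the second succeeds because the import has made the name available.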

#### Step 1 - Installing Python modules

Python comes with a lot of functionality that is available straight away when you launch it. For example, you can perform calculations like so:

In [5]:
(100 * 50) / 10

500.0

However many of the more interesting and complicated programming tasks require additional functionality that you need to import into your Python session. For example, if we want to randomly select a number between 1 and 100 inclusive:

(Try running the code a couple of times to see what happens - is it truly "random"?)

In [11]:
import random

random.randint(1,100)

92

Note how we needed to import (load) a module called `random` in order to use the `randint` method. **Modules** are Python programming scripts that contain additional techniques or functions for you to use. Some modules are automatically included with the standard installation of Python: in this scenario, all you need to do is **import** the module into your Python session (e.g., `import random`).
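
On the question of whether the numbers are truly random: Python's `random` module is a pseudo-random number generator, so fixing its starting point (the seed) reproduces the same sequence. A brief sketch, in which the seed value 42 is arbitrary:

```python
import random

random.seed(42)  # fix the starting point of the generator
first_run = [random.randint(1, 100) for _ in range(5)]

random.seed(42)  # reset to the same starting point
second_run = [random.randint(1, 100) for _ in range(5)]

print(first_run)
print(second_run)
print("Identical sequences:", first_run == second_run)
```

Setting a seed is a common way of making analyses that involve random numbers reproducible.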

(You can view the standard library of Python modules here: <a href="https://docs.python.org/3/library/" target=_blank>https://docs.python.org/3/library/</a>)

However, many of the modules you will need for your computational social science / data science / data analysis work have to be installed on your machine before they can be imported. For example, we need to install the `requests` module in order to download the Register of Charities (and for web-scraping more generally).

Open your command prompt/terminal/command line interface (CLI) and type the following:

*Windows/Linux/Mac*

```
pip install requests
```

(Remember to activate your computational environment using the [instructions above](#comp-env) before executing the above command, else the `requests` module gets installed somewhere else).

The `pip install requests` command downloads the `requests` module and installs it in the currently active Python environment.
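
If you are unsure whether a module is already available in the active environment, you can check before importing it. A small sketch using the standard `importlib` module; the helper name `is_installed` and the made-up module name are ours:

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


for module_name in ["os", "requests", "not_a_real_module_abc123"]:
    print(module_name, "->", is_installed(module_name))
```

The result for `requests` depends on whether you have installed it in the environment that is running the code.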

Now let's try to execute our data download code once more, this time importing the `requests` module:

In [12]:
# Import modules

import os # module for working with the operating system
import requests # module for requesting urls


# Request file

file_webadd = "https://www.charitiesregulator.ie/media/1931/public-register-22052020.xlsx"
response_file = requests.get(file_webadd)


# Save files (data)

outfile = "ire-register-of-charities.xlsx"

if os.path.isfile(outfile): # do not overwrite existing file
    print("File already exists, no need to overwrite")
else: # file does not currently exist, therefore create
    with open(outfile, "wb") as f:
        f.write(response_file.content)

How do we know the file was downloaded? The simplest way is to check whether a) the file was created, and b) the data were written to it.

In [13]:
# Check presence of file in current folder

os.listdir()

['.ipynb_checkpoints',
 'downloads',
 'images',
 'ire-register-of-charities.xlsx',
 'README.md',
 'responses',
 'sampling-frame',
 'ukds-apis-2020-05-20.ipynb',
 'ukds-computational-environments-2020-05-27.ipynb',
 'ukds-intro-to-python-2020-05-06.ipynb',
 'ukds-intro-to-python-notes-2020-05-06.ipynb',
 'ukds-web-scraping-2020-05-13.ipynb']
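
Listing the folder confirms (a) that the file exists, but not (b) that it contains any data. One possible check is to look at the file's size as well. The helper name below is ours, and the sketch writes a tiny temporary file so that it runs even if the download has not taken place:

```python
import os
import tempfile


def file_has_data(path):
    """Return True if the file exists and is larger than zero bytes."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "demo.xlsx")  # stand-in for the real download
    with open(demo_file, "wb") as f:
        f.write(b"example bytes")

    exists_with_data = file_has_data(demo_file)
    missing = file_has_data(os.path.join(tmp_dir, "no-such-file.xlsx"))

print(exists_with_data, missing)
```

For the download in this lesson you would call `file_has_data("ire-register-of-charities.xlsx")` instead.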

In order to open

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for computational social science tasks you'll almost certainly need to import some additional modules.
* **How to make requests (calls) for data to an API**. You can use Python to request data from an API.
* **How to handle and save the data that is returned by the API**. APIs tend to return data in JSON format, which requires different data manipulation techniques than you may be used to. You can process this data and save it to a file for future use.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

Interacting with an API is a simple yet powerful computational method for collecting data of value for social science research. It also provides a relatively gentle introduction to programming. However, "with great power comes great responsibility" (sorry). APIs take you into the realm of data protection, Terms of Service/Use, and many murky ethical issues. Wielded sensibly and sensitively, collecting data from APIs is a valuable and exciting social science research method.

Good luck on your data-driven travels!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. https://www.textbook.ds100.org

Tagliaferri, L. (n.d.). *How to Code in Python 3*. https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf

## Further reading and resources

We hope this brief lesson has whetted your appetite for learning more about web-scraping and Python programming in general. There are some fantastic learning materials available to you, many of them free. We highly recommend the materials referenced in the Bibliography.

In addition, you may find the following resources useful:
* <a href="https://github.com/UKDataServiceOpen/web-scraping" target=_blank>**Web-scraping for Social Science Research**</a> - a free UK Data Service training series on web-scraping and APIs, with three webinars and lots of detailed coding examples.
* <a href="https://automatetheboringstuff.com/" target=_blank>**Automate the Boring Stuff with Python**</a> - a free ebook covering lots of interesting, practical uses of Python. Chapter 16 covers APIs and JSON files.
* <a href="https://github.com/public-apis/public-apis" target=_blank>**Public APIs**</a> - a list of free APIs covering a range of interesting domains e.g., social data, food & drink, weather etc.

## Appendices

### Appendix A - Stop-and-search data for every police force

It would be inefficient to request data for every police force one at a time. We can make use of lists and loops to speed up the downloading of data.

First, let's get our list of police forces:

In [25]:
# Import modules

import os
import requests
import json
from datetime import datetime
print("Successfully imported necessary modules")

# Define web address and search terms

baseurl = "https://data.police.uk/api/"
forces = "forces"
webadd = baseurl + forces

# Make call to API

response = requests.get(webadd)
response.status_code

# Store data in variable

forces_data = response.json()

Successfully imported necessary modules


Next, we extract a list of force ids:

In [None]:
force_ids = [el["id"] for el in forces_data]

Then, for each of these ids we request stop-and-search data and store the results in a list:

In [None]:
baseurl = "https://data.police.uk/api/"
sas = "stops-force"

forces_sas_data = [] # create a blank list for storing results of request

for force in force_ids: 
    webadd = baseurl + sas + "?force=" + force
    
    response = requests.get(webadd)
    
    if response.status_code==200:
        sas_data = response.json()
    
        for el in sas_data:
            el["force"] = force
            el["code"] = response.status_code
            el["note"] = "Downloaded data"
    else:
        sas_data = {"force": force, "note": "Could not download", "code": response.status_code}
        
    forces_sas_data.append(sas_data)

You'll see we added a conditional statement (`if, else`) to check whether we made a successful request for data: if yes, then store the data in the `sas_data` variable; if no, then define the `sas_data` variable as a dictionary containing some notes about the unsuccessful attempt.

Let's check the results. Our `forces_sas_data` list should contain the results for 44 police forces:

In [None]:
len(forces_sas_data)

We'll leave it to you to examine the contents of the `forces_sas_data` variable.
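
As a possible starting point, the sketch below summarises a list with the same structure as `forces_sas_data`: each entry is either a list of records (successful request) or a single dictionary of notes (unsuccessful request). The function name and the sample values, including the force names and the 502 status code, are made up for illustration so that the code runs without calling the API.

```python
def summarise(results):
    """Count successful requests, failed requests and total records."""
    summary = {"succeeded": 0, "failed": 0, "records": 0}
    for item in results:
        if isinstance(item, list):  # successful request: a list of records
            summary["succeeded"] += 1
            summary["records"] += len(item)
        else:  # failed request: a single dictionary of notes
            summary["failed"] += 1
    return summary


# Made-up sample mimicking the structure built by the loop above
sample = [
    [
        {"force": "force-a", "code": 200, "note": "Downloaded data"},
        {"force": "force-a", "code": 200, "note": "Downloaded data"},
    ],
    {"force": "force-b", "note": "Could not download", "code": 502},
    [],  # a successful request that returned no records
]

sample_summary = summarise(sample)
print(sample_summary)
```

Calling `summarise(forces_sas_data)` on the real results would show how many forces returned data and how many records were collected in total.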

--END OF FILE--