![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Introduction to Python for Social Scientists

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *Computational Social Science*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and data generated by simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/computational-social-science" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

## Introduction

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit. Central to engaging in these methods is the ability to write readable and effective code using a programming language.

This training notebook introduces Python to social scientists by demonstrating core programming concepts and methods. Throughout, we use social science examples to show how these concepts and methods apply outside of computer science domains.

### Aims

This lesson - **Introduction to Python for social scientists** - has two aims:
1. Demonstrate how to use Python for a typical social science research activity.
2. Cultivate your computational thinking skills through a coding example.

### Lesson details

* **Level**: Introductory
* **Time**: 30-60 minutes
* **Pre-requisites**: None, though you may find it useful to work through our <a href="https://github.com/UKDataServiceOpen/code-demos/blob/master/code/ukds-intro-to-python-2020-05-06.ipynb" target=_blank>*Introduction to Python for social scientists*</a> lesson first.
* **Audience**: Researchers and analysts from any disciplinary background. The materials are slightly tailored for social scientists through the use of social data.
* **Learning outcomes**:
    1. Understand what programming is as a practice, and common terms associated with it (e.g., coding, script, debugging). 
    2. Understand how files are stored on a machine, and how to locate them using absolute and relative paths.
    3. Be able to use Python for finding, importing, appending and saving data sets stored on a machine.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is a programming language?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is a programming language?

In essence, a *programming language* is a set of instructions through which humans can interact with a computer. Similar to a spoken language, there are grammatical (e.g., specifying commands correctly) and syntactical (e.g., arranging commands in the correct order) rules that need to be followed.

### Vocabulary

Like learning any new language (spoken or programming), much of the difficulty arises from getting to grips with an unfamiliar vocabulary. The following are some general programming terms, adapted from Brooker (2020), that are worth keeping in mind as you progress on your computational odyssey:

* **Programming language** - a means of interacting with and issuing instructions to a computer.
* **Programming** - the practice of using a programming language (also known as *coding*).
* **Code** - the written down instructions that result from programming.
* **Script** - a collection of code.
* **Shell** - a tool that allows you to write and execute code, e.g., using R without RStudio, or using the Command Line Interface (CLI) on your computer.
* **Debugging** - fixing errors or issues with your code.
* **Testing** - the process of discovering errors or issues with your code.

### What is Python?

Python is an open-source, general purpose, high level and extensible programming language. These terms may be unfamiliar, especially in this context, so let's take them one-by-one:
* **Open source** - the source code that underpins Python is freely available for others to use, modify and share, as long as these activities comply with the <a href="https://docs.python.org/3/license.html" target=_blank>license</a>. Watch this neat <a href="https://www.youtube.com/watch?v=Tyd0FO0tko8" target=_blank>overview</a> of open-source software to learn more.
* **General purpose** - Python can be used for a multitude of programming activities, such as web development, scientific computing, software development, system administration, and more.
* **High level** - the way Python is written is highly abstracted from the language your machine uses to send, receive and store information. Computers receive instructions in a language called *binary*, which is a series of 1s and 0s that represent a range of characters. For example, here is my name ("Diarmuid") represented as binary: <br><br>01000100 01101001 01100001 01110010 01101101 01110101 01101001 01100100<br><br>The first sequence of 8 bits (known as a "byte") represents the letter "D", the second sequence the letter "i" and so on. It would be difficult and tedious in the extreme to write programming code in sequences of binary, hence the creation of high-level languages that are easier to read, write and understand by humans.
* **Extensible** - Python's functionality, capabilities and range of uses can be expanded. This is often achieved through the creation and sharing of additional, open-source add-ons (known as *packages*). For example, the `BeautifulSoup` module was created to enable Python users to extract information from web pages. 
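The binary example above can be reproduced in Python itself. The following is a minimal sketch that converts each character of a name into its 8-bit representation; it assumes the name contains only ASCII characters, and the variable names are ours, chosen for illustration.

```python
# Convert a piece of text into the binary form described above.
# Assumption: the text contains only ASCII characters (one byte per character).
name = "Diarmuid"

# encode() turns the text into bytes; format(byte, "08b") writes each byte as 8 binary digits
binary_name = " ".join(format(byte, "08b") for byte in name.encode("ascii"))

print(binary_name)
```

The printed sequence should match the eight bytes shown above, one per letter of the name.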

## Python for social science research

Python provides some really nice methods and techniques for social scientists. It can be used to download and view files:

In [None]:
import requests

accounts = "https://register-of-charities.charitycommission.gov.uk/charity-search?p_p_id=uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet&p_p_lifecycle=2&p_p_state=maximized&p_p_mode=view&p_p_resource_id=%2Faccounts-resource&p_p_cacheability=cacheLevelPage&_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_fileName=0000202918_AC_20190331_E_C.pdf&_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_objectiveId=A9972313&_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_priv_r_p_mvcRenderCommandName=%2Faccounts-and-annual-returns&_uk_gov_ccew_onereg_charitydetails_web_portlet_CharityDetailsPortlet_priv_r_p_organisationNumber=202918"
outfile = "annual-accounts-oxfam-2019.pdf"

response = requests.get(accounts, allow_redirects=True)

with open(outfile, "wb") as f: # with the file open in "write binary" mode, and giving it a shorter name (f)
    f.write(response.content)

In [None]:
from IPython.display import IFrame
IFrame(outfile, width=800, height=500)

It can produce some interesting, interactive visualisations:

In [None]:
IFrame("images/cast-sankey-diagram-2020-03-27.html", width=800, height=600)

It can do lots more, including natural language processing, machine learning, text mining etc. However, in this lesson we're going to focus on learning to walk before we start to run. Therefore, we'll focus on some core data manipulation skills that many social scientists could benefit from possessing.

### Manipulating data

Let's introduce some of the fundamentals of Python through a typical social science research activity: **creating a sampling frame**. 
> A sampling frame is a list or other device used to define a researcher's population of interest. (Lewis-Beck et al., 2004)

Though not as exciting as other social science activities (e.g., data visualisation), constructing a sampling frame requires a reasonable level of data and computational literacy, as we will soon demonstrate. It is also perfect for demonstrating how Python can be used to interact with your computer and data.

### Defining the problem

Programming is fundamentally about solving problems, so let's reframe our social science research activity in those terms.

> Jane is a sociologist who is completing a mixed methods PhD. The research design involves three waves of surveys sent to individuals, followed by semi-structured interviews with a subset of the survey respondents. The surveys have been completed and now her main task is to construct a sampling frame of individuals who could participate in the interview phase.

While this is an important first step, the problem can seem quite daunting when defined in such broad terms. Therefore it helps to *decompose* the problem into smaller steps:
1. Locate the files containing the data we need.
2. Open each file and extract its contents.
3. Append the contents of each file together to create a master file of all responses (i.e., our sampling frame).
4. Save the sampling frame as a separate file.
5. Produce a random sample of responses using the sampling frame and save this as a separate file; the random sample represents individuals who would then be contacted about participating in the semi-structured interviews.

What we've just written is called *pseudo-code*, as it captures the main tasks and the order in which they need to be run, but isn't code that a programming language can understand (e.g., you can't just tell Python "*Hey, locate the files I need for my project*").

We're going to use some real, open data in this lesson: individual responses to the 1961, 1971 and 1981 UK censuses [<a href="https://www.ukdataservice.ac.uk/get-data/open-data.aspx" target=_blank>Available here</a>]. These data sets contain a 1% sample of all individuals who responded to each census, and contain a subset of variables relating to each respondent (e.g., sex, age). We won't concern ourselves too much with the contents of these data sets, just how we can manipulate them to solve our research problem (creating a sampling frame).

### Setting up Python

Python already has lots of functionality available when you first launch it. For example, we can perform calculations like so:

In [None]:
43 * 105

Or we can print statements to the screen:

In [None]:
print("Python cannot be that difficult to learn, right?")

Often though, we need to *import* additional functionality specific to the activity at hand. 

In [None]:
# Import modules

import os # module for navigating your machine (e.g., file directories)
import csv # module for working with delimited text files
import pandas as pd # module for handling data
from datetime import datetime # module for working with dates and time

print("Successfully imported necessary modules")

In [None]:
!pip freeze

Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.
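If you are unsure whether a module is installed on your machine, you can ask Python before trying to import it. The sketch below uses the standard library's `importlib.util.find_spec()` function; the helper name `is_installed` is our own invention for this illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a module with this name, False otherwise."""
    return importlib.util.find_spec(module_name) is not None

# "os" and "csv" ship with Python; "pandas" must be installed separately
for module_name in ["os", "csv", "pandas"]:
    print(module_name, is_installed(module_name))
```

A result of `False` suggests the module needs to be installed (for example with `pip`) before it can be imported.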

### Working with directories

(A directory is another name for a folder.)

A key task in any research project is setting up the directory structure in a logical and organised manner (Ferretti et al., 2019). While we're sure you won't find this task very interesting, not giving this enough thought leads to some very frustrating scenarios, such as: 
* Raw data and clean data being stored in the same folder.
* File names which only serve to confuse - which is the latest version: *Thesis Chapter 5 final.docx* or *Thesis Chapter 5 final UPDATED.docx*?
* Accidentally deleting files because they are stored in the wrong folder.

In addition, a messy, nonsensical directory structure severely hinders your ability to collaborate with others (or yourself in the future). Think of being near the end of a project and having to update a chapter or article: can you find the clean data and recreate the steps that produced the table? Can you find the quotes critical to evidencing a particular theme?

Thankfully Python provides techniques for creating and navigating your directory structure.

#### Locating current folder

The first thing we can ask Python to do is tell us where we currently are on our computer. That is, where is the file we are currently using located?

In [None]:
os.getcwd()

Let's unpack the `os.getcwd()` command: There is a module called `os` which provides various methods for working with files and folders on your computer. One of these methods is called `getcwd()`, which returns a value indicating which folder is acting as the current working directory.

And what files are located in this folder?

In [None]:
os.listdir()

#### Examining folders

Let's take a look at the folder containing the data files we need to process:

In [None]:
os.listdir("responses")

The `os.listdir()` command lists the contents of a given directory.
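Folders often contain a mix of file types, so it can be handy to narrow a listing down to the files you care about. The following sketch filters the output of `os.listdir()` to CSV files only; the function name `list_csv_files` is hypothetical, and the example simply looks in the current working directory.

```python
import os

def list_csv_files(folder):
    """Return the names of the CSV files in a folder, in alphabetical order."""
    return sorted(name for name in os.listdir(folder) if name.lower().endswith(".csv"))

# Example: look for CSV files in the current working directory
print(list_csv_files("."))
```

Calling `list_csv_files("responses")` would apply the same filter to the folder holding the census files.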

Remember how we said Python understands where we are currently on our machine? This is a very useful feature, as it saves us having to worry about the *absolute* path to a file. For example, here is the full path to the data files:

In [None]:
os.path.abspath("responses")

Using *relative* rather than *absolute* paths to locate files and folders really comes in handy when you need to move your files to another computer, or you're working as part of a team and each member has their own machine.
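To make the distinction concrete, here is a short sketch using functions from the `os.path` module. The file does not need to exist for these commands to run, since they only manipulate the text of the path.

```python
import os

# A relative path: interpreted starting from the current working directory
relative_path = os.path.join("responses", "census_1961.csv")

# The same location written out in full
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))
print(os.path.isabs(absolute_path))

# Converting the absolute path back into one relative to the current folder
print(os.path.relpath(absolute_path, start=os.getcwd()))
```

The first check returns `False` and the second `True`. Building paths with `os.path.join()` rather than typing the separator yourself is a small habit that helps code run on both Windows and Mac/Linux machines.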

#### Creating folders

Finally, we will create a folder to store the sampling frame file:

In [None]:
os.makedirs("sampling-frame", exist_ok=True) # create the folder; do nothing if it already exists
os.listdir() # list contents of current working directory
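One thing to be aware of is that `os.mkdir()` raises an error if the folder already exists, which is awkward when a notebook is run more than once. A common way around this is `os.makedirs()` with `exist_ok=True`. The sketch below demonstrates the difference inside a temporary scratch folder, so nothing in your project is changed; the variable names are ours.

```python
import os
import tempfile

# Work inside a temporary scratch folder so nothing on your machine is changed
with tempfile.TemporaryDirectory() as scratch_folder:
    new_folder = os.path.join(scratch_folder, "sampling-frame")

    os.makedirs(new_folder, exist_ok=True)  # creates the folder
    os.makedirs(new_folder, exist_ok=True)  # folder already exists: nothing happens

    try:
        os.mkdir(new_folder)  # plain mkdir() raises an error instead
        mkdir_raised_error = False
    except FileExistsError:
        mkdir_raised_error = True

    folder_exists = os.path.isdir(new_folder)

print(folder_exists, mkdir_raised_error)
```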

### Working with files

We saw in the previous section that the files we need are located in a folder called "responses". Our first task is to open one of these files, and we begin by telling Python where to find it:

In [None]:
census_1961_file = "responses/census_1961.csv"
print(census_1961_file)

Before opening the file, let's address the new element in the previous command. We defined a variable called `census_1961_file` to store the location and name of the 1961 census file. Note how the value of this variable is enclosed in double quotes (""): this denotes that it is a *string* variable i.e., it stores values that should be treated as text.

Defining a variable means we no longer have to type `"responses/census_1961.csv"` when we want to refer to or interact with this file. Instead, Python knows the `census_1961_file` variable stores this information for us.
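You can ask Python what kind of value a variable holds, and string variables come with their own built-in methods. A brief sketch, reusing the same variable name as above:

```python
census_1961_file = "responses/census_1961.csv"

print(type(census_1961_file))             # the kind of value stored in the variable
print(census_1961_file.endswith(".csv"))  # does the text end with ".csv"?
print(census_1961_file.upper())           # the same text in capital letters
```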

Let's open and read in the content of the file using the `pandas` module we imported earlier:

In [None]:
census_1961_data = pd.read_csv(census_1961_file, encoding = "ISO-8859-1", index_col=False)

Let's unpack what the command is doing, starting on the right-hand side of the "=" sign. We use the `pandas` module, referring to it by its abbreviation `pd`. From `pandas` we employ the `read_csv()` method and supply it with three arguments: a CSV file to read (`census_1961_file`), a means of interpreting the contents of the file (`encoding = "ISO-8859-1"`), and an instruction not to create what's called an index column (`index_col=False`).

Don't stress about knowing which arguments are necessary and which are optional: just refer to the help documentation for a given module (e.g., <a href="https://pandas.pydata.org/docs/" target=_blank>pandas</a>).
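To see why the `encoding` argument matters, consider the following sketch. It builds a tiny, made-up data set in memory (the names and ages are hypothetical, not taken from the census files), stores it as ISO-8859-1 bytes, and reads it back with the same arguments used above.

```python
import io
import pandas as pd

# Hypothetical data: two made-up respondents whose names contain accented characters
raw_bytes = "name,age\nZoë,34\nJosé,51\n".encode("ISO-8859-1")

# Telling pandas which encoding was used lets it turn the bytes back into the right characters
example_data = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1", index_col=False)

print(example_data)
```

If the encoding supplied to `read_csv()` does not match the one used to create the file, accented characters may appear garbled or pandas may raise an error.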

`pandas` provides lots of useful functionality for manipulating and exploring data sets, such as viewing a sample of observations:

In [None]:
census_1961_data.sample(10)

Now that we've solved the issue of finding and reading files, let's apply this solution to the other census data sets:

In [None]:
census_1971_file = "responses/census_1971.csv"
census_1981_file = "responses/census_1981.csv"

census_1971_data = pd.read_csv(census_1971_file, encoding = "ISO-8859-1", index_col=False)
census_1981_data = pd.read_csv(census_1981_file, encoding = "ISO-8859-1", index_col=False)
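Because the three file names differ only by year, the paths could also be generated rather than typed out. The following is a sketch of that idea, assuming the files follow the `census_<year>.csv` naming pattern seen above; the `census_files` dictionary is our own construction.

```python
import os

years = [1961, 1971, 1981]

# Build one path per census year and store them in a dictionary keyed by year
census_files = {year: os.path.join("responses", f"census_{year}.csv") for year in years}

for year, path in census_files.items():
    # os.path.exists() reports whether the file is actually present on this machine
    print(year, path, os.path.exists(path))
```

This approach becomes more attractive as the number of files grows, since adding a year only requires extending the `years` list.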

### Creating a sampling frame

If you're somebody with experience of data analysis using SPSS or Stata for example, you may have noticed a significant advantage of Python: namely the ability to hold multiple data sets in memory at the same time (R can do this also).

Back to our next problem: combining the three census data sets to produce a master data set of all respondents (our sampling frame). The first thing we'll do is create a new variable by copying one of the existing data sets:

In [None]:
census_all_data = census_1961_data.copy() # take a copy so the original 1961 data set is left untouched
census_all_data.sample(5)

Now we append observations from the other data sets to the bottom of `census_all_data`:

In [None]:
# pd.concat() stacks the data sets on top of each other
# (it replaces the DataFrame.append() method, which was removed in pandas 2.0)
census_all_data = pd.concat([census_all_data, census_1971_data, census_1981_data], ignore_index=True)

Let's quickly highlight an efficient bit of coding. Note that we supplied the `pd.concat()` function with all three data sets at once (`[census_all_data, census_1971_data, census_1981_data]`). The use of square brackets indicates to Python it is working with a *list* variable (i.e., it contains more than one value). The alternative would be two separate commands:

(Note how you can tell Python not to run certain commands by prefixing them with the `#` symbol)

In [None]:
# census_all_data = pd.concat([census_all_data, census_1971_data])
# census_all_data = pd.concat([census_all_data, census_1981_data])
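Lists are worth a brief detour, as they appear throughout Python code. A minimal sketch using the census file names as list items:

```python
# Square brackets create a list; commas separate the items
file_names = ["census_1961.csv", "census_1971.csv", "census_1981.csv"]

print(len(file_names))  # number of items in the list
print(file_names[0])    # items are counted from 0, so this is the first one
print(file_names[-1])   # negative positions count from the end, so this is the last one
```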

We need to check whether we have the correct number of observations in the new data set. Using simple arithmetic, we know the number of observations in the sampling frame should equal the sum of the number of observations across the three census data sets. First, we need to know how to capture the number of observations in each data set:

In [None]:
len(census_all_data)

In [None]:
len(census_1961_data)

The `len()` function - which is available for use when you launch Python, no importing necessary - returns the number of observations in a data set.

Now we can ask Python to evaluate whether the length of the master data set equals the sum of the lengths of each census data set:

In [None]:
len(census_all_data) == len(census_1961_data) + len(census_1971_data) + len(census_1981_data)

The above command is a Boolean expression, as evidenced by the fact it returns only one of the following two values: `True` or `False`. Boolean expressions are tremendously useful for evaluating whether a condition is met and thus controlling the *flow* of your code: if a condition is met, do one thing; if not, do something else. You can learn more about Boolean logic in chapter 21 of <a href="https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf" target=_blank>How to Code in Python</a>. 
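To illustrate how a Boolean expression can control the flow of code, here is a small sketch using two made-up data sets (they are hypothetical stand-ins, not the census files). The `if`/`else` statement runs one branch when the condition is `True` and the other when it is `False`.

```python
import pandas as pd

# Hypothetical miniature data sets standing in for two survey waves
wave_one = pd.DataFrame({"respondent_id": [1, 2, 3]})
wave_two = pd.DataFrame({"respondent_id": [4, 5]})

combined = pd.concat([wave_one, wave_two], ignore_index=True)

# The Boolean expression decides which branch of the code runs
if len(combined) == len(wave_one) + len(wave_two):
    check_message = "All observations accounted for"
else:
    check_message = "Some observations are missing or duplicated"

print(check_message)
```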

### Producing a random sample

By combining three census data sets, we now have a sampling frame containing `1,562,660` respondents; obviously this is far too many individuals to contact about participating in follow-up interviews, so let's take a simple random sample:

In [None]:
census_random_sample = census_all_data.sample(frac=.01)
len(census_random_sample)

The above command takes a 1% (`frac=.01`) random sample of observations from the sampling frame and saves the result as a new data set (`census_random_sample`).
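By default, `sample()` draws a different set of observations every time it is run. If you need to be able to reproduce a sample exactly, one option is the `random_state` argument, which fixes the starting point of the random number generator. The sketch below uses a made-up sampling frame of 1,000 respondents, and the seed value of 42 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical sampling frame of 1,000 respondents
example_frame = pd.DataFrame({"respondent_id": range(1, 1001)})

# Supplying the same random_state (seed) produces the same sample each time
first_draw = example_frame.sample(frac=0.01, random_state=42)
second_draw = example_frame.sample(frac=0.01, random_state=42)

print(len(first_draw))
print(first_draw.equals(second_draw))
```

Both draws contain the same 10 respondents, so anyone re-running the code with the same seed and data would contact the same individuals.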

Because we used `pandas` to create our data sets, we are able to use this module's methods directly on the data set variable. This probably sounds confusing, so let's see how else we could have generated the random sample:

In [None]:
census_random_sample_alt = pd.DataFrame.sample(census_all_data, frac=.01)
len(census_random_sample_alt)

### Saving our work

The final problem to be solved: saving our work. Once again we can lean on the `pandas` module to simplify this task for us:

In [None]:
census_all_data.to_csv("sampling-frame/census-sampling-frame.csv", index=False)
census_random_sample.to_csv("sampling-frame/census-random-sample.csv", index=False)

How can we tell it worked? We could ask Python to list the contents of the "sampling-frame" folder:

In [None]:
os.listdir("sampling-frame")

Python has certainly created the files, though it's worth checking if the contents are correct:

In [None]:
random_sample_data = pd.read_csv("sampling-frame/census-random-sample.csv", index_col=False) # no encoding needed: pandas saved the file as UTF-8, which is also its default when reading
random_sample_data

And voilà, we have successfully solved our sampling frame research problem!
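Rather than inspecting a saved file by eye, you can ask Python to confirm that what was written matches what is read back. The following sketch does this with a small, made-up data set saved to a temporary folder; the variable names and values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical miniature sample standing in for the real random sample
small_sample = pd.DataFrame({"respondent_id": [101, 102, 103], "age": [34, 51, 27]})

with tempfile.TemporaryDirectory() as scratch_folder:
    saved_file = os.path.join(scratch_folder, "census-random-sample.csv")
    small_sample.to_csv(saved_file, index=False)
    reloaded = pd.read_csv(saved_file, index_col=False)

# Same shape and same values means the file was written and read back faithfully
print(reloaded.shape == small_sample.shape)
print(reloaded.equals(small_sample))
```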

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for computational social science tasks you'll almost certainly need to import some additional modules.
* **How to navigate and create folders**. You can use Python to navigate your directory structure using *relative* or *absolute* paths (with the former much preferred for reasons of collaboration and project portability).
* **How to read and manipulate data in files**. Plenty of software packages (e.g., Stata, SPSS) and programming languages (e.g., R) provide functionality for working with data. Python is no different, and in our view it offers considerable advantages over other tools.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

Python is a very powerful programming language, brimful of methods - data manipulation, web-scraping, natural language processing, interactive data visualisations - that are of great use to social scientists.

Jane's research task, creating a sampling frame, could have been solved using a manual approach: creating folders using `right-click > New folder`, opening each file individually and copy-and-pasting rows into a new file etc. While we do not advocate using Python for every task or project, we encourage you to think clearly about the advantages of adopting a computational approach:
* **Scalability** - What happens if there are 10, 100, or 1000 files instead of 3?
* **Accuracy** - What if Jane makes a mistake when copy-and-pasting records from the individual files? How will she know she's made a mistake?
* **Reproducibility** - What if Jane loses the sampling frame file? What if Jane wants to collaborate with another researcher? Does that person need to manually create the same folders, recreate or update the sampling frame?
* **Automatability** - What if new data becomes available on a monthly basis? Does Jane need to set a reminder in her calendar to download the latest data?
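To illustrate the scalability point, the sketch below combines every CSV file in a folder, however many there are, using a single function. The function name `build_sampling_frame` and the demonstration files are hypothetical, and the code assumes the folder contains at least one CSV file and that all files share the same columns.

```python
import tempfile
from pathlib import Path
import pandas as pd

def build_sampling_frame(folder):
    """Read every CSV file in a folder and stack the rows into one data set."""
    csv_files = sorted(Path(folder).glob("*.csv"))
    data_sets = [pd.read_csv(csv_file, index_col=False) for csv_file in csv_files]
    return pd.concat(data_sets, ignore_index=True)

# Demonstration with five made-up files of two rows each
with tempfile.TemporaryDirectory() as scratch_folder:
    for wave in range(1, 6):
        wave_data = pd.DataFrame({"wave": [wave, wave], "respondent_id": [1, 2]})
        wave_data.to_csv(Path(scratch_folder) / f"wave_{wave}.csv", index=False)
    demo_frame = build_sampling_frame(scratch_folder)

print(len(demo_frame))
```

The same function call works whether the folder holds 3 files or 1,000, which is difficult to match with a copy-and-paste approach.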

Like learning a spoken language, your initial attempts at writing code will be rudimentary and frustrating. You'll find yourself only able to write the same simple commands, wondering when Python's conventions will become second nature. However, with practice and immersion (through a relevant project) you will find your ability increases rapidly. In our assessment, the learning curve for social scientists developing their programming skills is steep at the beginning ("What the hell is a loop!?"), gentle in the middle ("Huh, so that's how you scrape a website"), and steep once again at the end ("Why did God invent neural networks...").

We promise though, with a modest investment of time and energy you will be surprised what opportunities emerge from knowing a little bit of Python.

Good luck on your programming adventures!

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Brooker, P. (2020). *Programming with Python for Social Scientists*. London: SAGE Publications Ltd.

Ferretti, M., Reades, J., & Millington, J. (2019). *Code Camp: 2019 (v1.0)*. <a href="http://doi.org/10.5281/zenodo.3474043" target=_blank>http://doi.org/10.5281/zenodo.3474043</a>

Tagliaferri, L. (n.d.). *How to Code in Python 3*. <a href="https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf" target=_blank>https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf</a>

## Further reading and resources

We hope this brief lesson has whetted your appetite for learning more about Python and programming in general. There are some fantastic learning materials available to you, many of them free. We highly recommend the materials referenced in the Bibliography. In addition, you may find the list of useful books, papers, websites and other resources on our web-scraping GitHub repository worth referencing: <a href="https://github.com/UKDataServiceOpen/web-scraping/tree/master/reading-list/" target=_blank>[Reading list]</a>
