![UKDS Logo](./images/UKDS_Logos_Col_Grey_300dpi.png)

# Being a Computational Social Scientist

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and data generated by simulations (agent-based modelling), to name a few. To help you get to grips with these new forms of data, we provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Acquiring,-understanding-and-manipulating-unstructured/unfamiliar-data" data-toc-modified-id="Acquiring,-understanding-and-manipulating-unstructured/unfamiliar-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Acquiring, understanding and manipulating unstructured/unfamiliar data</a></span><ul class="toc-item"><li><span><a href="#Acquiring-data" data-toc-modified-id="Acquiring-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Acquiring data</a></span></li><li><span><a href="#Understanding-and-manipulating-data" data-toc-modified-id="Understanding-and-manipulating-data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Understanding and manipulating data</a></span></li></ul></li></ul></div>

-------------------------------------

<div style="text-align: center"><i><b>This is notebook 5 of 6 in this lesson</b></i></div>

-------------------------------------

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> put it:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Acquiring, understanding and manipulating unstructured/unfamiliar data*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and computational social science!".format(name)) 

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## Acquiring, understanding and manipulating unstructured/unfamiliar data

### Acquiring data

There are LOADS of ways to get data, some more 'computational' than others. You are all surely familiar with surveys and interviews, as well as official data sources and data requests. You may also be familiar with (at least the concepts of):
* scraped data that comes from web pages or APIs
* “found” data that is captured alongside the originally intended data targets
* meta-data, which is data about data
* repurposed data, or data collected for some other purpose that is used in new and creative ways
* other... because this list is definitely not exhaustive.
 
To some extent, using these data sources requires that you keep your ear to the ground so that you know when relevant new sources become available. But once you know *about* them, you still need to know *what* they are and *how* to access and use them.
 
So, we will set data acquisition aside for the moment and instead focus on data literacy, which is knowledge of the types of data that you might find.

### Understanding and manipulating data

Being data literate involves understanding two key properties of datasets:
1. How the contents of the dataset are stored (e.g., as numbers, text, etc.).
2. How the contents of the dataset are structured (e.g., as rows of observations, or networks of relations).

#### Data types

Data types provide a means of classifying the contents (values) of your dataset. For example, in [Understanding Society](https://www.understandingsociety.ac.uk/) there are questions where the answers are recorded as numbers e.g., [`prfitb`](https://www.understandingsociety.ac.uk/documentation/mainstage/dataset-documentation/variable/prfitb) which captures total personal income.

Data types are important as they determine which values can be assigned to a variable, and what operations can be performed on it e.g., you cannot calculate the mean value of a piece of text (Tagliaferri, 2019). Let's cover some of the main data types in Python.

##### Numbers

These can be integers or floats (decimals), both of which behave a little differently.

In [1]:
# Integers

myint = 5
yourint = 10

You double clicked, you hit run, but nothing happened, right? That is because naming and defining variables does not come with any commands that produce output. Basically, we ran a command that has no visible reaction. But maybe we want to check that it worked? To do that, we can call a print command. 

The cell below has a print command that includes some text (within the quotation marks) and the result of a numerical operation over the variables we defined. Go ahead, double click in the cell and hit Run/Shift+Enter.

In [2]:
print("Summing integers: ", myint + yourint)

Summing integers:  15


Great! The print command worked and we see that it correctly summed the numerical value of the two variables that we defined. 

Let's try it again with Floats. Click in the code block below and hit Run/Shift+Enter. 

In [3]:
# Floats

myflo = 5.5
yourflo = 10.7
print("Summing floats: ", myflo + yourflo)

Summing floats:  16.2


It might not be surprising, but it worked again. This time, the resulting sum had a decimal point and a following digit, which is how we know it was a float rather than an integer. 

What happens when we sum an integer and a float? Find out with the next code block!

In [4]:
# Combining integers and floats

newnum = myint + myflo

print("Value of summing an integer and a float: ", newnum)
print("Data type when we sum an integer and a float: ", type(newnum))

Value of summing an integer and a float:  10.5
Data type when we sum an integer and a float:  <class 'float'>


In this case, we create a new variable, called *newnum*, and assign it the value of the sum of one of our previous integers and one of our previous floats.

Then, we have two print statements. One returns the value of *newnum* while the other returns the *type* of *newnum*. 

You can always ask for the type. Go ahead and double click in the cell above again. This time, instead of just running the code, copy and paste the final print statement onto a new line. Before you run the code again, change your new line by rewriting the text inside the quotation marks to anything you like and changing *newnum* to *myflo* or *myint* or any of the other variables we defined.

You can even define a whole new variable and then ask for the type of your new variable. 
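
To make the point that integers and floats "behave a little differently" more concrete, here is a small sketch. The variable names (`whole`, `decimal`, `product`) are made up for this illustration and are not used anywhere else in the notebook.

```python
# Hypothetical variables, used only in this illustration
whole = 7        # an integer
decimal = 2.0    # a float

print(type(whole), type(decimal))

# Ordinary division (/) always returns a float, even for two integers
half = whole / 2
print(half, type(half))

# Floor division (//) of two integers returns an integer
floor_half = whole // 2
print(floor_half, type(floor_half))

# Mixing an integer and a float in one calculation gives a float
product = whole * decimal
print(product, type(product))

# int() drops the decimal part of a float; float() turns an integer into a float
print(int(5.9), float(5))
```

Notice that `int()` does not round: it simply cuts off everything after the decimal point.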

##### Strings

This data type stores text information. This should be a bit familiar, as we used text information in the previous code blocks within quotation marks. 

Strings are immutable in Python i.e., you cannot change a string's value in place after creating it. You can, however, assign a new string to the same variable name, and you can see what type of variable a string is (just like with the numerical variables above).

In [5]:
# Strings

mystring = "Thsi is my feurst string."
print(mystring)

print("What type is mystring: ", type(mystring))

mystring = "This is my correct first string."
print(mystring)

yourstring = mystring.replace("my", "your") # replace the word "my" with "your"
print(yourstring)

splitstring = yourstring.split("your") # split into separate strings
print(splitstring)

Thsi is my feurst string.
What type is mystring:  <class 'str'>
This is my correct first string.
This is your correct first string.
['This is ', ' correct first string.']
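
If "immutable" still feels abstract, the following sketch may help. It uses a made-up variable called `greeting` (not part of the cells above) to show what Python allows and what it refuses.

```python
# Hypothetical example string, separate from the variables used above
greeting = "Hello world"

# Trying to change one character in place raises an error,
# because strings are immutable
try:
    greeting[0] = "J"
except TypeError as error:
    print("Python refused:", error)

# String methods return a *new* string and leave the original untouched
shouted = greeting.upper()
print(greeting)  # still the original text
print(shouted)

# To "change" a string, reassign the variable name to the new string
greeting = greeting.replace("Hello", "Goodbye")
print(greeting)
```

This is why the cell above stores the result of `.replace()` in a new variable (*yourstring*): the method hands back a fresh string rather than editing *mystring*.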


Manipulating strings will be a common and crucial task during your computational social science work. We'll cover intermediate and advanced string manipulation throughout these training materials but for now we highly suggest you consult the resources listed below.

*Further Resources*:
* [Principles and Techniques of Data Science](https://www.textbook.ds100.org) - Chapter 8.
* [Python 101](https://python101.pythonlibrary.org) - Chapter 2.

##### Boolean

This data type captures values that are true or false, 1 or 0, yes or no, etc. These will be like dummy or indicator variables, if you have used those in Stata, SPSS or other stats programmes. 

Boolean data allow us to evaluate expressions or calculations (e.g., is one variable equal to another? Is this word found in a paragraph?).

In [7]:
# Boolean

result = (10+5) == (14+1) # check if two sums are equal
print(result) # print the value of the "result" object
print(type(result)) # print the data type of the "result" object

True
<class 'bool'>


It is important to note that we did not define *result* as the value of 10+5 or the value of 14+1. We defined *result* as the value of whether 10+5 was exactly equal to 14+1. 

In this case, 10+5 is exactly equal to 14+1, so *result* was defined as True, which we can see in the output of the *print(result)* command.

Booleans are very useful for controlling the flow of your code: in the below example, we assign somebody a grade and then use boolean logic to test whether the grade is above a threshold, which determines whether or not that grade receives a pass or fail notification.

Double click in the code block below and hit Run/Shift+Enter. 

Then redefine grade as a different number by changing the number after the '=' and then hitting Run/Shift+Enter again.

In [8]:
grade = 71

if grade >= 40:
    print("Congratulations, you have passed!")
else:
    print("Uh oh, exam resits for you.")

Congratulations, you have passed!


You can write a boolean statement more concisely, as demonstrated in the next code block. This time, you don't get the nicely worded pass/fail messages, but those will not always be important.

Double click in the code block below and hit Run/Shift+Enter. Try it again, but change the number: this changes the threshold that *grade* is compared against, and so whether the expression evaluates to True or False.

Remember that you can redefine *grade* at any point, either by changing the definition in the code block above and re-running that code block or by copy/pasting/editing the grade = 71 line from above into this code block and re-running it here. 

In [9]:
print(grade >= 40) # evaluate this expression

True
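
Booleans also answer the other kind of question mentioned earlier ("Is this word found in a paragraph?"). The sketch below uses a made-up sentence and a made-up grade (`paragraph` and `exam_grade` are hypothetical names, chosen so that *grade* above is left alone).

```python
# Hypothetical example text and grade, used only in this illustration
paragraph = "Computational social science mixes code, data and theory."
exam_grade = 38

# Is this word found in the paragraph? The "in" operator returns a boolean
has_data = "data" in paragraph
has_survey = "survey" in paragraph
print(has_data, has_survey)

# Booleans can be combined with "and", "or" and "not"
passed = exam_grade >= 40
near_miss = (not passed) and exam_grade >= 35
print(passed, near_miss)
```

Try changing `exam_grade` to see when `near_miss` switches between True and False.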


*Further Resources*:
* [How To Code in Python](https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf) - Chapter 21.

##### Lists

The list data type stores an ordered, mutable (i.e., you can change its values) sequence of elements. A list is defined by naming a variable and setting it equal to a set of elements inside square brackets, separated by commas.

Double click in the code block below and hit Run/Shift+Enter. 


In [10]:
# Creating a list

numbers = [1,2,3,4,5]
print("numbers is: ", numbers, type(numbers))

strings = ["Hello", "world"]
print("strings is: ", strings, type(strings))

mixed = [1,2,3,4,5,"Hello", "World"]
print("mixed is: ", mixed, type(mixed))

mixed_2 = [numbers, strings]
print("mixed_2 is: ", mixed_2, type(mixed_2)) # this is a list of lists

numbers is:  [1, 2, 3, 4, 5] <class 'list'>
strings is:  ['Hello', 'world'] <class 'list'>
mixed is:  [1, 2, 3, 4, 5, 'Hello', 'World'] <class 'list'>
mixed_2 is:  [[1, 2, 3, 4, 5], ['Hello', 'world']] <class 'list'>


Notice that most of these print commands print the value of the variable and also the type of variable.

Also notice that you can define a list variable by listing all of the elements that you want to be in that list inside of square brackets (like the code that defines 'mixed') or you can define a list by including *other* lists inside of the square brackets for a new list (like the code that defines 'mixed_2').

As you can see, mixed has only one set of square brackets, but mixed_2 has square brackets nested inside of other square brackets to create a list of lists. 

Feel free to re-define these variables or add/define new variables too (but leave 'numbers' alone as we need it for the next several steps). 

When you are done testing out how to define and work with lists, go on to run the next code block.

In [11]:
# List length

length_numbers = len(numbers)
print("The numbers list has {} items".format(length_numbers)) 
# the curly braces act as a placeholder for what we reference in .format()

The numbers list has 5 items


This one creates a new variable, called 'length_numbers' that is defined as the "len" of the "numbers" variable we defined above. 

The print statement underneath then goes on to tell us the value of the 'length_numbers' variable, but embeds that value inside of a sentence. We use the curly brackets as a placeholder for where the value should get embedded and use '.format(length_numbers)' to say what is to be embedded there.

Try re-running the print command with other values embedded (by changing the variable that is to be embedded), or embedding the variable in different places (by repositioning the curly brackets).
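
As a rough guide to what else the placeholders can do, here is a short sketch. The `shopping` list and the sentences are invented for this illustration. The last line uses an f-string, an alternative syntax available from Python 3.6 onwards that this notebook does not otherwise rely on.

```python
# Hypothetical values, used only in this illustration
shopping = ["apples", "pears", "plums"]

# Several placeholders are filled in order, left to right
sentence = "The list has {} items and the first one is {}.".format(len(shopping), shopping[0])
print(sentence)

# Placeholders can be named, so the order inside .format() no longer matters
named = "{first} comes before {last}.".format(last=shopping[2], first=shopping[0])
print(named)

# An f-string embeds the expression directly inside the braces
short = f"The list has {len(shopping)} items"
print(short)
```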

In [None]:
# Accessing items (elements) within a list

print("{} is the second item in the list".format(numbers[1]))
# note that the position of items in a list (known as its 'index position')
# begins at zero i.e., [0] represents the first item in a list

# We can also loop through the items in a list:

print("\r") # add a new line to the output to aid readability
for item in numbers:
    print(item)
# note that the word 'item' in the for loop is not special and
# can instead be defined by the user - see below   

print("\r")
for chicken in numbers:
    print(chicken)
# of course, such a silly name does nothing to aid interpretability of the code    

In [None]:
# Adding or removing items in a list

numbers.append(6) # add the number six to the end of the list
print(numbers)

numbers.remove(3) # remove the number three from the list
print(numbers)
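
The two cells above do not show any saved output, so here is a self-contained sketch of the same ideas that you can run and check. It uses a separate, made-up list called `letters` so that 'numbers' is not changed any further.

```python
# A separate, hypothetical list so that 'numbers' is left unchanged
letters = ["a", "b", "c", "d"]

# Index positions start at zero; negative positions count from the end
print(letters[0], letters[-1])

# A slice returns a new list; the end position is not included
middle = letters[1:3]
print(middle)

# Lists are mutable, so an item can be replaced in place
letters[0] = "z"
print(letters)

# append() adds to the end; remove() deletes the first matching value
letters.append("e")
letters.remove("c")
print(letters)
```

Note that `remove()` looks for a *value*, not an index position: `letters.remove("c")` deletes the item equal to "c", wherever it sits in the list.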

##### Dictionaries

The dictionary data type maps keys (i.e., variables) to values; thus, data in a dictionary are stored in key-value pairs (known as items). Dictionaries are useful for storing data that are related e.g., variables and their values for an observation in a dataset.

In [12]:
# Creating a dictionary

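# note: "dict" is also the name of Python's built-in dictionary type, so using it
# as a variable name hides the built-in; outside a short demo, pick another name
dict = {"name": "Diarmuid", "age": 32, "occupation": "Researcher"}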
print(dict)

{'name': 'Diarmuid', 'age': 32, 'occupation': 'Researcher'}


In [13]:
# Accessing items in a dictionary

print(dict["name"]) # print the value of the "name" key

Diarmuid


In [14]:
print(dict.keys()) # print the dictionary keys

dict_keys(['name', 'age', 'occupation'])


In [15]:
print(dict.items()) # print the key-value pairs

dict_items([('name', 'Diarmuid'), ('age', 32), ('occupation', 'Researcher')])


In [16]:
# Combining with lists

obs = [] # create a blank list

ind_1 = dict # create dictionaries for three individuals
ind_2 = {"name": "Jeremy", "age": 50, "occupation": "Nurse"}
ind_3 = {"name": "Sandra", "age": 41, "occupation": "Chef"}

for ind in ind_1, ind_2, ind_3: # for each dictionary, add to the blank list
    obs.append(ind)

print(obs) # print the list
print("\r")
print(type(obs)) # now we have a list of dictionaries

[{'name': 'Diarmuid', 'age': 32, 'occupation': 'Researcher'}, {'name': 'Jeremy', 'age': 50, 'occupation': 'Nurse'}, {'name': 'Sandra', 'age': 41, 'occupation': 'Chef'}]

<class 'list'>
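
A list of dictionaries behaves a lot like a small dataset: each dictionary is an observation and each key is a variable. The sketch below re-creates the three records so that it runs on its own; the list name `people` and the "interviewed" key are hypothetical additions for illustration only.

```python
# The same three records as above, re-created so this block runs on its own
people = [
    {"name": "Diarmuid", "age": 32, "occupation": "Researcher"},
    {"name": "Jeremy", "age": 50, "occupation": "Nurse"},
    {"name": "Sandra", "age": 41, "occupation": "Chef"},
]

# Loop over the observations and pull out one "variable" from each
ages = []
for person in people:
    ages.append(person["age"])

mean_age = sum(ages) / len(ages)
print("Mean age:", mean_age)

# Add a new key-value pair to one observation (a made-up flag)
people[0]["interviewed"] = True

# .get() returns a default instead of raising an error when a key is missing
for person in people:
    print(person["name"], person.get("interviewed", "unknown"))
```

Using `person["interviewed"]` in the final loop would raise a `KeyError` for the second and third records, which is why `.get()` with a default is the safer choice when not every observation has every key.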


#### Data structures

Indulge me: close your eyes and visualise a dataset. What do you picture? Heiberger and Riebling (2016, p. 4) are confident they can predict what you visualise:

> Ask any social scientist to visualize data; chances are they will picture a rectangular table consisting of observations along the rows and variables as columns.

##### Data frame

A data frame is a rectangular data structure and is often stored in a Comma-Separated Values (CSV) file format. A CSV stores observations in rows, and separates (or "delimits") each value in an observation using a comma (','). Let's examine a CSV dataset in Python:

In [17]:
import pandas as pd # module for handling data frames

df = pd.read_csv("./data/oxfam-csv-2020-03-16.csv") # open the file and store its contents in the "df" object
df # view the data frame

Unnamed: 0,name,regno,fys,fye,inc,exp
0,Oxfam,202918,01/05/2008 00:00,30/04/2009 00:00,308300000,318600000
1,Oxfam,202918,01/05/2009 00:00,31/03/2010 00:00,318000000,294800000
2,Oxfam,202918,01/04/2010 00:00,31/03/2011 00:00,367500000,361100000
3,Oxfam,202918,01/04/2011 00:00,31/03/2012 00:00,385500000,378700000
4,Oxfam,202918,01/04/2012 00:00,31/03/2013 00:00,367900000,384600000
5,Oxfam,202918,01/04/2013 00:00,31/03/2014 00:00,389100000,365100000
6,Oxfam,202918,01/04/2014 00:00,31/03/2015 00:00,401400000,387800000
7,Oxfam,202918,01/04/2015 00:00,31/03/2016 00:00,414700000,420700000
8,Oxfam,202918,01/04/2016 00:00,31/03/2017 00:00,408600000,402600000
9,Oxfam,202918,01/04/2017 00:00,31/03/2018 00:00,427200000,438700000
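
To see what "delimited by a comma" means underneath the data frame, the sketch below reads two of the rows shown above as raw text using Python's built-in `csv` module. It is only an illustration: the text is typed in by hand rather than read from the file, and it assumes that `inc` and `exp` stand for income and expenditure, which the dataset itself does not state.

```python
import csv
import io

# Two rows copied from the output above, written as raw comma-separated text
raw_csv = """name,regno,fys,fye,inc,exp
Oxfam,202918,01/05/2008 00:00,30/04/2009 00:00,308300000,318600000
Oxfam,202918,01/05/2009 00:00,31/03/2010 00:00,318000000,294800000
"""

# io.StringIO lets us treat the text as if it were an open file
reader = csv.DictReader(io.StringIO(raw_csv))

rows = list(reader)  # each row becomes a dictionary keyed by column name
for row in rows:
    # every value is read as a string, so convert before doing arithmetic
    # (assumption: inc = income, exp = expenditure)
    difference = int(row["inc"]) - int(row["exp"])
    print(row["fye"], difference)
```

One thing pandas did for us above is guess the data type of each column. The `csv` module makes no such guess, which is a useful reminder that a CSV file itself contains nothing but text.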


##### Dictionaries

A dictionary is a hierarchical data structure based on key-value pairs. Dictionaries are often stored as JavaScript Object Notation (JSON) files. Let's examine a JSON dataset in Python:

In [18]:
import json # import Python module for handling JSON files

with open('./data/oxfam-csv-2020-03-16.json', 'r') as f: # open file in 'read mode' and parse its contents into a Python dictionary called 'data'
    data = json.load(f)
          
data # view the contents of the JSON file

{'name': {'0': 'Oxfam',
  '1': 'Oxfam',
  '2': 'Oxfam',
  '3': 'Oxfam',
  '4': 'Oxfam',
  '5': 'Oxfam',
  '6': 'Oxfam',
  '7': 'Oxfam',
  '8': 'Oxfam',
  '9': 'Oxfam'},
 'regno': {'0': 202918,
  '1': 202918,
  '2': 202918,
  '3': 202918,
  '4': 202918,
  '5': 202918,
  '6': 202918,
  '7': 202918,
  '8': 202918,
  '9': 202918},
 'fys': {'0': '01/05/2008 00:00',
  '1': '01/05/2009 00:00',
  '2': '01/04/2010 00:00',
  '3': '01/04/2011 00:00',
  '4': '01/04/2012 00:00',
  '5': '01/04/2013 00:00',
  '6': '01/04/2014 00:00',
  '7': '01/04/2015 00:00',
  '8': '01/04/2016 00:00',
  '9': '01/04/2017 00:00'},
 'fye': {'0': '30/04/2009 00:00',
  '1': '31/03/2010 00:00',
  '2': '31/03/2011 00:00',
  '3': '31/03/2012 00:00',
  '4': '31/03/2013 00:00',
  '5': '31/03/2014 00:00',
  '6': '31/03/2015 00:00',
  '7': '31/03/2016 00:00',
  '8': '31/03/2017 00:00',
  '9': '31/03/2018 00:00'},
 'inc': {'0': 308300000,
  '1': 318000000,
  '2': 367500000,
  '3': 385500000,
  '4': 367900000,
  '5': 389100000,
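
The output above is organised by variable first and observation second, which is the reverse of the row-by-row CSV layout. The sketch below works with a shortened, hand-typed extract of the same shape (`raw_json`, `data_extract` and `records` are hypothetical names) to show how to reach a single value and how to rebuild row-style records.

```python
import json

# A shortened extract with the same shape as the output above
raw_json = """
{
  "name": {"0": "Oxfam", "1": "Oxfam"},
  "fye": {"0": "30/04/2009 00:00", "1": "31/03/2010 00:00"},
  "inc": {"0": 308300000, "1": 318000000}
}
"""

data_extract = json.loads(raw_json)  # parse JSON text into a Python dictionary
print(type(data_extract))

# Values are reached by chaining keys: first the variable, then the observation
print(data_extract["inc"]["1"])

# Rebuild row-style records (one dictionary per observation)
records = []
for obs_id in data_extract["name"]:
    record = {}
    for variable in data_extract:
        record[variable] = data_extract[variable][obs_id]
    records.append(record)

print(records)
```

Note that the observation keys ('0', '1', ...) are strings rather than integers, because JSON only allows text as keys.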


<div style="text-align: right"><a href="./bcss-notebook-four-2020-02-12.ipynb" target=_blank><i>Previous section: Computational environments</i></a> &nbsp;&nbsp;&nbsp;&nbsp; | &nbsp;&nbsp;&nbsp;&nbsp;<a href="./bcss-notebook-six-2020-02-12.ipynb" target=_blank><i>Next section: Reproducibility</i></a></div>