# Getting and processing data

This week, we will cover the topic of getting and processing data. Given a research problem, where can you find the relevant data? How do you obtain the data? And how do you actually process the data? This notebook aims to guide you through the process.

**Important**: this notebook requires you to download and install several items. Please install them before class.

## Where to find data

**Curated**

* Corpora (Brown ([NLTK version](http://www.nltk.org/book/ch02.html)), [OANC](http://www.anc.org/data/oanc/download/), [UMBC WebBase](http://ebiquity.umbc.edu/resource/html/id/351))
* Psycholinguistic data (sometimes known as 'norms' in the Psychology literature)
* DBpedia
* Open data (e.g. [Dutch](https://data.overheid.nl/), [American](https://www.data.gov/))

**The web**

* [USENET](http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)
* [Internet Archive](https://archive.org/)
* [Project Gutenberg](https://www.gutenberg.org/)
* Wikipedia ([dumps](https://dumps.wikimedia.org/), [export]())
* [Web data commons](http://webdatacommons.org/)

**Do it yourself**

* [BootCat](http://bootcat.sslmit.unibo.it/)
* Experiments
* Annotating
* Crowdsourcing
* ...

## How to get the data

### Downloading directly

Here are three ways to download data from the web, each with their own use cases.

* Browser (loads of data available online)
* Command line: `wget` ([manual](https://www.gnu.org/software/wget/manual/wget.html))
* Python: `requests`, `urllib`

If you see some dataset online, or you just want to download a webpage, there is no better way than to use your browser and either save the page (from the File menu), or right-click and choose "Save as...". But for more complex cases, you'll want to automate the process.

The command line `wget` tool is like a Swiss Army knife for downloading stuff in bulk. For example, if you have a list of URLs in a text file called `list_of_urls.txt`, you can just use `wget -i list_of_urls.txt` to download all the files. You can also use the `wget` module in Python. For more complicated procedures, it's easier to just use the `requests` or `urllib` library.

Here is how we downloaded the Linguist List data for this course:

```python
import os
import urllib.request
import time

base_url = 'http://listserv.linguistlist.org/pipermail/linglite/'
years = [str(year) for year in range(1997,2016)]
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

for year in years:
    # OS-independent way of creating the path to the folder.
    path = os.path.join('..', 'linguistlist', year)
    # Make the necessary folder (no error if it already exists, so the script can be rerun).
    os.makedirs(path, exist_ok=True)
    
    for month in months:
        # Update variables.
        filename = '{}-{}.txt.gz'.format(year, month)
        path_with_file = os.path.join(path, filename)
        url = base_url + filename
        
        # Write the data to disk.
        with urllib.request.urlopen(url) as response:
            # Use the 'wb' flag because the response contents are bytes.
            with open(path_with_file, 'wb') as outfile:
                data = response.read()
                outfile.write(data)
        
        # Be nice to the server.
        time.sleep(2)
```

How did we do this?

* First, we went to the [Linguist List archive website](http://listserv.linguistlist.org/pipermail/linglite/). The archive looks nice, but it's a lot of work to download all of those files by hand!
* Then, we inspected the **source** of the webpage. In Firefox, you can do this by going to `Tools/Developer/Page Source`. In Chrome: `View/Developer/View Source`. Most other browsers offer this functionality as well.
* We saw that the URLs for the monthly archives are very regular. This is good: it means that we can exploit this regularity.
* Then, we decided on a local structure: we want to have one folder for every year, in which all the archives for that year are stored. This structure determined the structure of our program.
* If you don't download files often, search online for a good way to do this. Many programmers would be lost without Google/StackOverflow! The first thing we found was the `urllib` library. But a solution using the `requests` library would also be OK! That would look like this:

```python
import requests

# Get the data:
r = requests.get('http://listserv.linguistlist.org/pipermail/linglite/2016-September.txt.gz')

# Use the 'wb' flag because the response contents are bytes.
with open('September.txt.gz', 'wb') as f:
    # Write the data:
    f.write(r.content)
```

* It turns out that you can use a context manager (`with`-statement) to treat online sources as files. Cool! That means we can use two context managers (1) to get the file from the internet, and (2) to write the file to disk.
* It's good practice to make your computer wait a little between requests. So we used the `sleep` function from the `time` module to wait 2 seconds after each download.
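The regularity of the archive URLs can also be combined with the `wget -i` approach mentioned earlier. The following is a minimal sketch, not the method we actually used: the function names `build_archive_urls` and `write_url_list` are made up for illustration. It only builds the list of URLs and can write them to a text file; it does not download anything itself.

```python
def build_archive_urls(base_url, years, months):
    """Return one archive URL per (year, month) combination."""
    return ['{}{}-{}.txt.gz'.format(base_url, year, month)
            for year in years
            for month in months]


def write_url_list(urls, path):
    """Write the URLs to a text file, one per line."""
    with open(path, 'w') as outfile:
        for url in urls:
            outfile.write(url + '\n')


base_url = 'http://listserv.linguistlist.org/pipermail/linglite/'
years = [str(year) for year in range(1997, 2016)]
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

urls = build_archive_urls(base_url, years, months)
print(len(urls), 'URLs, starting with', urls[0])
```

Calling `write_url_list(urls, 'list_of_urls.txt')` and then running `wget -w 2 -i list_of_urls.txt` should download the same files, with `-w 2` making `wget` wait two seconds between requests. Unlike the script above, this puts all files in a single folder.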


#### Class discussion
This was a simple example that doesn't require us to do any parsing of the webpage itself. But how would you write a function that takes a URL like [this one](http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html) and returns all job descriptions? What would be your approach (on a high level)? 

We will revisit this problem below.

### Using an API

An API (*application programming interface*) provides a way for programs to interact with applications running independently. Those applications could either be running on your own computer, or they could be running somewhere else. We will be working with online APIs, specifically APIs providing the interface to some database. 

General guidelines for using APIs:

1. Try to minimize the number of requests you make. Can you be selective before putting in your requests? 
2. Try to spread your requests so that you don't overload the server.
3. Try to cache your results so that you don't request the same thing twice. (Think about multiple sessions and testing your code.)

In short: developers providing APIs are doing us a favor. Acting nice to them is the least we can do.
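Guidelines 2 and 3 can be sketched in a few lines. This is one possible approach under simple assumptions, not a standard recipe: the function name `cached_get`, the `cache` folder, and the idea of using a hash of the URL as the file name are all choices made for this illustration.

```python
import hashlib
import os
import time
import urllib.request


def fetch_bytes(url):
    """Download the raw bytes behind a URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_get(url, cache_dir='cache', fetch=fetch_bytes, delay=2):
    """Return the data for a URL, contacting the server only the first time."""
    os.makedirs(cache_dir, exist_ok=True)
    # URLs contain characters that are awkward in file names,
    # so we use a hash of the URL as the name of the cache file.
    filename = hashlib.sha256(url.encode('utf-8')).hexdigest()
    path = os.path.join(cache_dir, filename)

    # Guideline 3: if we already have the data on disk, don't ask again.
    if os.path.exists(path):
        with open(path, 'rb') as infile:
            return infile.read()

    data = fetch(url)
    with open(path, 'wb') as outfile:
        outfile.write(data)

    # Guideline 2: wait a little after every real request.
    time.sleep(delay)
    return data
```

Because the cache lives on disk, it also survives between sessions, which is useful while you are still testing your code. The `fetch` argument is there so that the download step can be swapped out, for example by a fake function in a test.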

#### Bare APIs and wrappers

APIs work like this: you send them a request (possibly with some additional information), and they send you the relevant data back. Sometimes you have to send these requests explicitly in your code, but other times there will be a *wrapper* where people have written code to provide a nice interface for you to use.

**Geopy** is a nice example of a wrapper around several geolocation APIs. Read the documentation [here](https://geopy.readthedocs.io/en/1.10.0/). You can install Geopy using `pip install geopy`. 


When there is no wrapper, you just treat the API as if you are downloading something from the URL. Let's go through some examples. Both of these provide output in JSON format.

**Recipepuppy** is a website where you can search for recipes you can make with a particular set of ingredients. The description of their API is [here](http://www.recipepuppy.com/about/api/). So how do we make this work?

```python
# This library comes pre-installed with Anaconda. We use it to send requests to the web.
import requests

# Get the ingredients
ingredients = input('Please enter the ingredients as a comma-separated list.\n')

# Remove spaces if there are any. (This makes the script more robust.)
# Strings are immutable, so we have to store the result of `replace`.
ingredients = ingredients.replace(' ', '')

# Prepare the API request URL
base_url = "http://www.recipepuppy.com/api/?i="
api_request = base_url + ingredients

# Get the response
response = requests.get(api_request)

# And print it
print(response.content)
```
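As an alternative to gluing strings together, the query part of a URL can be built with `urlencode` from the standard library. The sketch below only constructs the URL; it does not send a request. The function name `build_request_url` is made up for this example, and the parameter names `i`, `q`, and `p` are taken from the Recipepuppy example URLs used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.recipepuppy.com/api/'


def build_request_url(ingredients, query=None, page=None):
    """Build a request URL from a comma-separated string of ingredients."""
    # Remove spaces, as in the example above.
    params = {'i': ingredients.replace(' ', '')}
    if query is not None:
        params['q'] = query
    if page is not None:
        params['p'] = page
    # safe=',' keeps the commas readable instead of encoding them as %2C.
    return BASE_URL + '?' + urlencode(params, safe=',')


print(build_request_url('onions, garlic', query='omelet', page=3))
```

The advantage over plain concatenation is that `urlencode` takes care of characters that are not allowed in URLs, such as spaces in a search term.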

We know from last week that JSON objects are just like Python dictionaries, and you can load them using the `json` module. Let's try that!

```python
import json

recipe_data = json.loads(response.content)
```

.

.

.

Whoops! It turns out that data from the internet arrives as bytes. In Python versions before 3.6, `json.loads` only accepts a string, so the call above fails there (newer versions also accept bytes).
To be explicit about what is going on, we use the `decode` method to turn the bytes into a Unicode string. If this sounds like magic to you, don't worry: this is something all programmers have struggled with at some point.

For the next class, please watch the video [Pragmatic Unicode, or: How do I stop the pain?](http://nedbatchelder.com/text/unipain.html). And, if you want to learn more about Unicode, read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html).
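To make the difference between bytes and strings a bit more concrete, here is a tiny example that needs no internet connection. The JSON text is made up for illustration.

```python
import json

# Encoding turns a string into bytes; this is what travels over the network.
raw = '{"title": "Crème brûlée"}'.encode('utf-8')

# Decoding turns the bytes back into a string.
text = raw.decode('utf-8')

# The two are different types, and they don't even have the same length:
# in UTF-8, accented characters like 'è' take up more than one byte.
print(type(raw), len(raw))
print(type(text), len(text))

# Once we have a string, loading the JSON works as expected.
print(json.loads(text)['title'])
```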

Now, let's just convert the bytes and continue working with the recipe data.

```python
# Decode the UTF-8 bytes into a (Unicode) string.
decoded_data = response.content.decode('utf-8')

# Load the data.
recipe_data = json.loads(decoded_data)

# Print the keys.
print(recipe_data.keys())
```

It worked! A nice way to inspect JSON response dictionaries is to use the built-in pretty printer from the `pprint` module:

```python
# Import the pretty printer:
from pprint import pprint

# Print the recipe data:
pprint(recipe_data)
```

So now we understand the basics of how this API works: ingredients are passed to the website as a comma-separated string, and we get a JSON response back that we can load as a dictionary. The dictionary contains a key called 'results', which maps to a list of results (dictionaries as well). 
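Once the response is a dictionary, working with the results is ordinary Python. The sketch below uses a hand-written stand-in for an API response, so it is not real Recipepuppy output: apart from `results`, the field names (`title`, `href`, `ingredients`) are assumptions made for illustration. Use `pprint` on a real response to check which keys are actually there.

```python
import json

# A made-up response, shaped like the description above.
sample_response = '''
{"results": [
    {"title": "Potato omelette",
     "href": "http://example.com/recipe/1",
     "ingredients": "eggs, potato"},
    {"title": "Spanish tortilla",
     "href": "http://example.com/recipe/2",
     "ingredients": "eggs, onions, potato"}
]}
'''

sample_data = json.loads(sample_response)

# 'results' maps to a list of dictionaries, one per recipe.
for recipe in sample_data['results']:
    # Turn the comma-separated ingredients into a proper list.
    ingredients = [item.strip() for item in recipe['ingredients'].split(',')]
    print(recipe['title'], '->', ingredients)

# A list comprehension is a compact way to collect one field.
titles = [recipe['title'] for recipe in sample_data['results']]
```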

But there is more to this API. You aren't limited to a single page of results: you can request multiple pages. [Here](http://www.recipepuppy.com/api/?i=onions,garlic&q=omelet&p=3) is their example. Some questions:

* How can you get more results?
* How do you know whether you have *all* results for a given query?

.

.

.

.

.

Play with the URL and see what happens! Try stuff like p=500000 (or some other high number).
We can assume that the website will give a similar page when there are no more results.
That's when the algorithm to get all the results needs to stop.

We will work with [this URL for omelettes containing potatoes](http://www.recipepuppy.com/api/?i=potato&q=omelette&p=1), for the simple reason that there aren't that many recipes matching this query. It's nice to have examples like these, because you can easily test your code.
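The stopping idea described above could look roughly like this. It is a sketch under one important assumption, namely that a page beyond the last one comes back with an empty `results` list; check this yourself before relying on it. The function names are made up for this example, and `max_pages` is a safety net so the loop can never run forever.

```python
import json
import time
import urllib.request


def fetch_recipe_page(page_number):
    """Download and decode one page of results for the potato omelette query."""
    url = 'http://www.recipepuppy.com/api/?i=potato&q=omelette&p={}'.format(page_number)
    with urllib.request.urlopen(url) as response:
        data = response.read().decode('utf-8')
    # Be nice to the server.
    time.sleep(2)
    return json.loads(data)


def get_all_results(get_page, max_pages=100):
    """Keep requesting pages until one comes back without results."""
    all_results = []
    for page_number in range(1, max_pages + 1):
        page = get_page(page_number)
        results = page.get('results', [])
        # ASSUMPTION: an empty list means that we have seen everything.
        if not results:
            break
        all_results.extend(results)
    return all_results
```

With a working connection, `get_all_results(fetch_recipe_page)` would collect the results from all pages. Passing the page-fetching function in as an argument makes it easy to try out the loop with a fake function first, without bothering the server.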

**Hackernews** is a website where people can post URLs to interesting stories, submit polls, show the community something, or ask the community a question. The description of their API is [here](https://github.com/HackerNews/API). **Question**: what kind of things could you do with this data?

We will use the Hackernews API in the exercises.

Many APIs require you to authenticate yourself to the server, before they actually return any results. This is a means to prevent abuse (e.g. overloading the server). This usually means you have to register for the service in order to get an *API key*. We won't cover these in class (we don't want to force you to register for anything), but know there are many public APIs out there!

## How to process your data

### Processing the data: HTML

Let's take a look at a simple webpage. [Here](http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html) is one with all postings from the Linguist List in September 2016. Our goal will be to get a list with all the Job postings, including the URL. How do we go about this?

Step 1. **Look at the source code first**. We can't do anything without knowing how the page is structured. You can open the page with your browser and inspect the source, or right-click the link and choose "Save as.." to save the file and inspect it with a text editor. What would be a good approach?

.

.

.

.

.

.


**Possible approaches**

1. Use string-methods, look for all the lines with the word 'Jobs' in it, and extract the URL and title from them.
2. Use regular expressions, write a pattern to match all links with 'Jobs' in the text.
3. Use a module to parse the HTML first, then look for all links with the word 'Jobs'.

Let me first emphasize: *There is no wrong way to do this.* If it works, it works. But as the problems you are trying to solve get more complex, a high-level approach becomes increasingly attractive. To illustrate: how would you get the full text of [this article](http://www.bbc.com/news/disability-35881779) from the webpage?

Step 2. **Create a working solution for the problem at hand.** Let's try all three approaches. 

```python
# Python 3 only imports libraries that it hasn't already imported.
import requests

# Get the data, and convert to string.
response = requests.get('http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html')
# We'll use this variable as the starting point for this exercise.
contents = response.content.decode('utf-8')
```

First, try to find all URLs and titles of job announcements using string methods.
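To give an idea of what a string-method solution might look like, here is a sketch that runs on a small, made-up snippet of HTML. The snippet only imitates the general shape of a mailing list archive page (the titles and file names are invented), so the real page may need a different approach, for instance if a link is spread over several lines.

```python
from urllib.parse import urljoin

page_url = 'http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html'

# Made-up HTML for illustration; NOT copied from the real page.
sample_html = '''<UL>
<LI><A HREF="003001.html">27.3501, Jobs: Phonetics: Lecturer, Example University</A>
<LI><A HREF="003002.html">27.3502, Books: An Example Grammar</A>
<LI><A HREF="003003.html">27.3503, Jobs: Syntax: Post Doc, Sample Institute</A>
</UL>'''


def find_jobs_with_string_methods(html, page_url):
    """Return a list of (url, title) pairs for all job announcements."""
    jobs = []
    for line in html.splitlines():
        # Skip lines that are not links to job announcements.
        if 'Jobs:' not in line or 'HREF="' not in line:
            continue
        # The link sits between HREF=" and the next double quote.
        href = line.split('HREF="', 1)[1].split('"', 1)[0]
        # The title sits between the end of the opening tag and </A>.
        title = line.split('">', 1)[1].split('</A>', 1)[0].strip()
        # The links are relative, so combine them with the URL of the page.
        jobs.append((urljoin(page_url, href), title))
    return jobs


for url, title in find_jobs_with_string_methods(sample_html, page_url):
    print(title, url)
```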

Now try to find all URLs and titles of job announcements using regular expressions.
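A regular expression can capture the link and the title in one go. As before, this is a sketch on made-up HTML (invented titles and file names), and the pattern assumes that links are written exactly as `<A HREF="...">...</A>`; a real page may well use different capitalization or extra attributes.

```python
import re
from urllib.parse import urljoin

page_url = 'http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html'

# Made-up HTML for illustration; NOT copied from the real page.
sample_html = '''<UL>
<LI><A HREF="003001.html">27.3501, Jobs: Phonetics: Lecturer, Example University</A>
<LI><A HREF="003002.html">27.3502, Books: An Example Grammar</A>
<LI><A HREF="003003.html">27.3503, Jobs: Syntax: Post Doc, Sample Institute</A>
</UL>'''

# Group 1: everything between the quotes after HREF.
# Group 2: link text that contains 'Jobs:' and no other tags.
JOB_LINK = re.compile(r'<A HREF="([^"]+)">([^<]*Jobs:[^<]*)</A>')


def find_jobs_with_regex(html, page_url):
    """Return a list of (url, title) pairs for all job announcements."""
    return [(urljoin(page_url, href), title.strip())
            for href, title in JOB_LINK.findall(html)]


for url, title in find_jobs_with_regex(sample_html, page_url):
    print(title, url)
```

Compared to the line-by-line approach, the pattern does not care about line breaks, because `[^<]*` also matches newlines inside the link text.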

Finally, let's use the `lxml` module to find all URLs and titles of job announcements.
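With `lxml`, the HTML is parsed into a tree first, after which we can simply ask for all the links. The sketch below again uses made-up HTML (invented titles and file names) and is only one of several ways to do this with `lxml`. It requires `lxml` to be installed, which is the case if you use Anaconda.

```python
from urllib.parse import urljoin

import lxml.html

page_url = 'http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html'

# Made-up HTML for illustration; NOT copied from the real page.
sample_html = '''<UL>
<LI><A HREF="003001.html">27.3501, Jobs: Phonetics: Lecturer, Example University</A>
<LI><A HREF="003002.html">27.3502, Books: An Example Grammar</A>
<LI><A HREF="003003.html">27.3503, Jobs: Syntax: Post Doc, Sample Institute</A>
</UL>'''


def find_jobs_with_lxml(html, page_url):
    """Return a list of (url, title) pairs for all job announcements."""
    tree = lxml.html.fromstring(html)
    jobs = []
    # The XPath expression selects every link that has an href attribute.
    # (The HTML parser lowercases tag and attribute names for us.)
    for link in tree.xpath('//a[@href]'):
        title = link.text_content().strip()
        if 'Jobs:' in title:
            jobs.append((urljoin(page_url, link.get('href')), title))
    return jobs


for url, title in find_jobs_with_lxml(sample_html, page_url):
    print(title, url)
```

Note that this version never looks at the raw text of the tags: it does not matter how the link is capitalized, how it is spread over lines, or which other attributes it has.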

Step 3. **How generalizable is your solution?** How many steps does it take to change our solutions to, for example:

* Use a different URL (maybe you want to do this in October as well).
* Search for a different set of announcements, e.g. *Books*, or *Conferences*.

You don't need to implement these changes, though you can if you want to! (Use the code boxes below.) But just read through your solutions to this problem and think about what changes should be made.

### Processing data: NLP tools

The common idea for all NLP tools is that they try to structure or transform text in some meaningful way. The question of which tool you should use is only secondary to the question of what you want to achieve. To give you a sense of the things you can achieve with standard NLP techniques, we will now look at two tools that you can use to analyze text: **SpaCy** and **pyspotlight**.

#### SpaCy: quickly parsing documents

SpaCy provides a small NLP pipeline: it takes a raw document, tokenizes it, tags all the tokens, and parses each sentence. On top of that, it also recognizes different types of entities, such as numbers, locations, and persons. The advantage of SpaCy is that it is really fast and has good accuracy. The downside is that, at the moment, it only works for English and German. There are other tools available for different languages, but those are a bit more difficult to set up. (We can help you with this; ask us after class.)

**Installing** 

To install SpaCy, enter the following commands on the command line:

* `conda config --add channels spacy`
* `conda install spacy`
* `python -m spacy.en.download` (if this doesn't work, see [here](http://spacy.io/docs/#getting-started) for updated instructions)

**Using SpaCy**

First let's load SpaCy.

```python
# Load the English parser.
from spacy.en import English

# The English parser is a class. 
# If you call it without any arguments, you will get a parser object.
# You can use this object to parse documents.
parser = English()
```

```python
# Here's how to parse a document.
parsed_document = parser("I have a cat. It's sitting on the mat.")
```

```python
# Now you can loop over the document and print each sentence.
for sentence in parsed_document.sents:
    print(sentence)
```

#### pyspotlight: 'interpret' sentences using DBpedia

Pyspotlight provides an easy way to use DBpedia Spotlight, which is a service you can use to find DBpedia entities in a text. DBpedia is, roughly, a machine-readable version of Wikipedia. In short, this tool enables us to figure out which entities a text is about.

**Installing**

To install pyspotlight, enter the following command on the command line.

* `pip install pyspotlight`

**Using pyspotlight**

EXAMPLES AND EXERCISES.

#### Other tools (not covered in class)

Unfortunately we cannot cover all NLP tools in this course. Below is a short list of tools that might be useful to you in the future. You can either use these tools as standalone programs (and then process their output using Python), or you can choose to use a *wrapper* that allows you to call these tools from inside Python.

* Treetagger is a tool for tokenization and part-of-speech tagging in many languages. [Here](https://github.com/miotto/treetagger-python) is a Python interface for it. 
* Stanford CoreNLP is a suite of NLP tools (constituting a full pipeline). [Here](https://github.com/dasmith/stanford-corenlp-python) is a library to interact with those tools.

## Exercises