# 10. Data acquisition

Data science projects typically start with the acquisition of data. In many cases, such data sets consist of secondary data made available on the web by commercial or non-commercial organisations. This part of the tutorial explains how you can obtain such online data sets.

In this tutorial, we distinguish three methods of data acquisition: downloading data files, accessing data through APIs and webscraping. You usually choose one of these methods to acquire your data, based on what the data provider offers.

## Direct downloads

If the resources that you are interested in are available directly via the web, you can download these files by making use of the [`requests`](https://requests.readthedocs.io/) library. As is the case for all libraries, the `requests` library needs to be imported before you can use it. 

In [None]:
import requests

The `requests` library can be used to make requests according to the Hypertext Transfer Protocol (HTTP), which was developed to enable the exchange of information across computers. The computer that can provide information is typically referred to as a server, and the computer that requests information from this server is referred to as a client. In the HTTP protocol, the GET method is used to request data from a specified server. 

In Python, such a GET request can be sent to a server using the `get()` method in `requests`, as demonstrated below. Evidently, it is important that you are online when you run this code.

In [None]:
response = requests.get( 'https://www.universiteitleiden.nl')

This method returns a so-called `Response` object. It is an object which represents information about the downloaded web resource. In the example above, the result of the method is assigned to a variable named `response`.

Once this `Response` object has been created successfully, you can use various pieces of information about the resource that was downloaded.
The property `status_code`, for instance, indicates the HTTP status code that was returned by the server.
The status code 200 indicates that the request was successful and the infamous status code 404 indicates that the file was not found.
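
Status codes are standardised, so their meaning can be looked up without contacting any server. The sketch below uses `HTTPStatus` from Python's standard library (not part of `requests`) to print the official reason phrase for a few common codes; the selection of codes is only an illustration.

```python
from http import HTTPStatus

# Map a few common status codes to their standard reason phrases
phrases = {code: HTTPStatus(code).phrase for code in (200, 301, 404, 500)}

for code, phrase in phrases.items():
    print(code, phrase)
```

Codes in the 200 range signal success, codes in the 300 range signal redirection, and codes from 400 upwards signal an error on the client side (400 range) or the server side (500 range).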

If the status code is indeed 200, the contents of the resource are accessible through the response's `content` property. However, this property holds the contents as bytes. Typically, when we download a webpage, we want to work with the data as text. In these cases, the `text` property of the `Response` object contains the full contents of the downloaded website, dataset or other kind of file as a string.

Note that `requests` may not always detect a file's [character encoding](https://www.w3.org/International/questions/qa-what-is-encoding) correctly. You can set the correct character encoding explicitly by assigning a value to the `encoding` property before you read the `text` property.
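
To see why the encoding matters, it can help to decode the same bytes in two different ways. The sketch below needs no network access: the variable `raw_bytes` is a made-up stand-in for the raw bytes of a downloaded file, and the word used is only an example.

```python
# Made-up payload: the word 'café' encoded as UTF-8, standing in for downloaded bytes
raw_bytes = 'café'.encode('utf-8')

# Decoding with the encoding that was actually used gives the original text
decoded_correctly = raw_bytes.decode('utf-8')

# Decoding with a different encoding silently produces garbled characters
decoded_wrongly = raw_bytes.decode('latin-1')

print(raw_bytes)
print(decoded_correctly)
print(decoded_wrongly)
```

The accented letter takes up two bytes in UTF-8. When these bytes are read as Latin-1, each byte is turned into a separate character, which is the kind of garbling you see when the wrong encoding is applied to a downloaded text.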

When you run the code that is given below, the contents of the webpage that is specified at the beginning (or more specifically, the HTML code that was created to build the webpage) becomes available as a string, assigned to the variable named `contents`.

In [None]:
import requests

contents = ""
response = requests.get('https://www.universiteitleiden.nl')
print( response.status_code )

if response.status_code == 200:
    response.encoding = 'utf-8'
    contents = response.text
    print (contents)


Using the `requests` library, you can basically download any type of file from the web, as long as it is retrievable via HTTP(s). The code below, for instance, downloads a specific text file from the Project Gutenberg website.

In [None]:
url = "https://www.gutenberg.org/files/98/98-0.txt"

response = requests.get(url)

if response:
    response.encoding = 'utf-8' 
    print (response.text) 


Note that the `if` statement in the code above does not explicitly test whether the status code is 200. The `Response` object, which is created when you use the `get()` method from `requests`, evaluates to `True` when the status code is below 400 (in other words, when no client or server error was reported) and to `False` otherwise.
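
The behaviour of a `Response` object in an `if` test can be explored without going online. In the sketch below, an empty `Response` is constructed by hand and given a status code; this is done purely for illustration, as the object is normally created for you by `get()`. The helper name `is_truthy` is made up for this example.

```python
import requests

def is_truthy(status_code):
    """Build an empty Response by hand (no network access) and test it as a boolean."""
    response = requests.Response()
    response.status_code = status_code
    return bool(response)

for code in (200, 204, 301, 404, 500):
    print(code, is_truthy(code))
```

If you need to be certain that the status code is exactly 200, compare `response.status_code` with 200 explicitly.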

The `requests` library can also be used to retrieve data from an API.

## Acquiring data via APIs

Organisations which aim to make their data available for reuse often do this through an *Application Programming Interface* (API). An API, simply put, is the interface through which (online) services and applications provide access to their information and functionalities. 

It enables organisations to share some of the data that they have in a structured format, so that external parties can make use of these data in new applications.

The communication between the sender and the recipient of such requests needs to take place according to a specific protocol. The requests need to be formulated according to certain rules. 

For many APIs, you need to create an access key before you can send requests. This is the case, for instance, for the Twitter API. 

### Example: MusicBrainz

There are also many APIs which are fully open, however. One example is the [MusicBrainz](https://musicbrainz.org/doc/MusicBrainz_API) API. *MusicBrainz* is a large online encyclopedia containing information about musicians and their work. You can send requests to this API without having to provide an access key. 

The root URL of this API is [https://musicbrainz.org/ws/2/](https://musicbrainz.org/ws/2/)

On *MusicBrainz*, you can request information about a number of different entities, including artists, genres, instruments, labels and releases. The entity type you are interested in first needs to be appended to the root URL. If you want to search for information about an artist, for example, you need to work with the following URL structure: https://musicbrainz.org/ws/2/artist

You can then work with the following parameters:

```
query = [search term]
fmt = [json or xml]
limit = [integer]
```

Following the `query` parameter, you can supply the name of the artist you want to search for. Using the `fmt` parameter, you can specify whether you want to receive the result in [XML](https://www.w3.org/XML/) or in [JSON](https://www.json.org/) format. The API returns XML data by default. If the API returns many results, you can reduce the number of results by working with the `limit` parameter.

The following API call returns information about *The Beatles* in the JSON format. 

https://musicbrainz.org/ws/2/artist?query=The%20Beatles&fmt=json
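
The URL above can also be assembled from a dictionary of parameters, so that special characters such as spaces are encoded for you. The sketch below uses `urlencode()` and `quote()` from the standard library's `urllib.parse` module; the variable names are chosen for this example only.

```python
from urllib.parse import urlencode, quote

root_url = 'https://musicbrainz.org/ws/2/'
entity = 'artist'
params = {'query': 'The Beatles', 'fmt': 'json'}

# quote_via=quote encodes the space as %20 (the default would encode it as '+')
query_string = urlencode(params, quote_via=quote)

api_call = f'{root_url}{entity}?{query_string}'
print(api_call)
```

The `get()` function in `requests` also accepts such a dictionary directly through its `params` argument, in which case the encoding is handled by the library.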

In Python, you can also send out such API calls using the `requests` library. 

In [None]:
import requests
from requests.utils import requote_uri

root_url = 'https://musicbrainz.org/ws/2/'

## The parameters for the API call are defined as variables
entity = 'artist'
query = 'David Bowie'
limit = 5
fmt = 'json'

query = requote_uri(query)

api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'
print(api_call)

response = requests.get( api_call )


In the code above, the data that are returned by the *MusicBrainz* API are saved as an object named `response`. These data are structured according to the format we specified, namely JSON. To process these data, we can work with the `json()` method of the `Response` object from the `requests` library. This method parses the JSON data into regular Python data structures. JSON objects are converted into dictionaries, and JSON arrays become regular Python lists. 

The JSON data returned by *MusicBrainz* form an object in which the 'artists' key contains all the artists whose names or descriptions contain the search term you provided. For each individual artist, we can retrieve the name, using the `name` key, and the type, using the `type` key, among many other properties. 


In [None]:
musicbrainz_results = response.json()

for artist in musicbrainz_results['artists']:
    name = artist.get('name','[unknown]')
    artist_type = artist.get('type','[unknown]')
    print(f'{name} ({artist_type})')
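
To see how JSON text maps onto Python data structures without calling the API, you can parse a small string with the standard `json` module. The sample below is made up and only imitates the overall shape of a search result; the names and values in it are not real data.

```python
import json

# Made-up sample text that only imitates the overall shape of a search result
sample = '{"count": 2, "artists": [{"name": "Example Singer", "type": "Person"}, {"name": "Example Band"}]}'

data = json.loads(sample)

# The JSON object has become a dictionary, the JSON array a list
print(type(data).__name__)
print(type(data['artists']).__name__)

summary = []
for artist in data['artists']:
    # get() supplies a default value when a key is missing
    name = artist.get('name', '[unknown]')
    artist_type = artist.get('type', '[unknown]')
    summary.append(f'{name} ({artist_type})')

print(summary)
```

The second artist in the sample has no `type` key, which shows why working with `get()` and a default value is safer than indexing with square brackets.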


## Webscraping

When a website does not offer access to its structured data via a well-defined API, it may be an option to acquire the data that can be viewed on a site by making use of web scraping. This is a process in which a computer program processes the contents of a given webpage and extracts the data values that are needed. The aim of such an application is generally to copy information from a web page and to store it in a local database.

To get the most out of web scraping, you need to have a basic understanding of HTML. Many [basic introductions](https://bookandbyte.universiteitleiden.nl/DMT/PDF/HTML.pdf) can be found on the web. Web scraping should be used with caution, because it may not be allowed to download large quantities of data from a specific website. In this tutorial we will only look at extracting information from single pages.

To scrape webpages, you firstly need to download them. This can be done using the `requests` library that was explained above. The code below downloads a page from the [Internet Movie Database](https://www.imdb.com) website, listing the top rated movies.

In [None]:
import requests

url = 'https://www.imdb.com/chart/top?ref_=ft_250'
response = requests.get( url )

if response:
    response.encoding = 'utf-8'
    html_page = response.text 

Once you have obtained the contents of a webpage, in the form of an HTML document, you can begin to extract the data values that you are interested in. This tutorial explains how you can extract the titles of these movies and the URLs of their pages on IMDb using web scraping. 

If you inspect the output of the previous cell (the HTML code), you can see that the information about the movies is encoded as follows:


```
<td class="titleColumn">

<a href="/title/tt0068646/">
The Godfather
</a>

</td>

```

The data can be found in a &lt;td&gt; element whose 'class' attribute has the value 'titleColumn'. The actual title is given in a hyperlink, encoded using &lt;a&gt;. The URL of the page for the movie is given in an 'href' attribute. 'Scraping' the page really means that we need to extract the values we need from these HTML elements.  


One of the libraries that you can use in Python for scraping online resources is `Beautiful Soup`. The code below firstly transforms the HTML code that was downloaded into a BeautifulSoup object. Once the `BeautifulSoup` class has been imported from the `bs4` library, you can call `BeautifulSoup()` to create such an object. It requires the full contents of an HTML document as its first parameter. As a second parameter, you need to provide the name of one of the parsers that are available. Generally, a parser is an application which can process and analyse data. In this context, it refers to a program which can analyse the HTML file. One of the parsers that we can use is `lxml`, which needs to be installed separately. Using this parser, `BeautifulSoup()` converts the downloaded HTML page into a BeautifulSoup object. 

The `prettify()` method of this object creates a more readable version of the HTML file by adding indents and end of line characters.

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup( html_page,"lxml")

print( soup.prettify() )



The BeautifulSoup object that was created above has a `find_all()` method, which you can use to find all occurrences of a specific HTML tag. The name of the tag (or element) needs to be mentioned as the first parameter.  

In our example, we need to focus on specific types of &lt;td&gt; elements: those which have a 'class' attribute with value 'titleColumn'. Such criteria for the attributes can be given as the second parameter in `find_all()`.

As we saw in the HTML snippet above, the &lt;td&gt; elements do not contain the title and the url directly. These values are given in the &lt;a&gt; child element. Such child elements, or subelements, can be found using `findChildren()`. As a parameter, you need to give the name of the tag you want to find underneath the current element. In the code below, the variable `children` represents all the &lt;a&gt; elements found underneath &lt;td&gt;.    

To retrieve only the text of the tag (i.e. the text which is encoded using the tag), we can use the `text` property. To retrieve the value of an attribute of this element, we can use the `get()` method. As an argument, this method requires the name of the attribute we are interested in, `href` in this case.   
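
These calls can be tried out on a small HTML string before they are applied to a live page. The sketch below parses a snippet modelled on the one shown earlier; the second row and the 'ratingColumn' cell are made up for illustration. It uses Python's built-in `html.parser` rather than `lxml`, so that no extra parser needs to be installed, and it calls `find_all()` with `recursive=False`, which in Beautiful Soup 4 does the same job as `findChildren()`.

```python
from bs4 import BeautifulSoup

# Snippet modelled on the example above; the second and third rows are made up
html_snippet = """
<table>
<tr><td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a></td></tr>
<tr><td class="titleColumn"><a href="/title/tt0000000/">Made-up Title</a></td></tr>
<tr><td class="ratingColumn">9.2</td></tr>
</table>
"""

# 'html.parser' is the parser that ships with Python
soup = BeautifulSoup(html_snippet, 'html.parser')

found = []
# Select only the <td> elements whose class is 'titleColumn'
for cell in soup.find_all('td', {'class': 'titleColumn'}):
    # Look for <a> elements directly underneath the cell
    for link in cell.find_all('a', recursive=False):
        found.append((link.text.strip(), link.get('href')))

print(found)
```

The cell with the class 'ratingColumn' is skipped, because it does not match the attribute criteria given to `find_all()`.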

In [None]:
movies = soup.find_all('td', {'class': 'titleColumn'} )

for m in movies:
    # Find links (a elements) within the cell
    children = m.findChildren("a" , recursive=False)
    for c in children:
        movie_title = c.text
        url = c.get('href')
        ## This is an internal link, so we need to prepend the base URL
        url = 'https://imdb.com' + url
        print( f'{movie_title}: {url}' )  
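
Internal links can also be turned into full URLs with `urljoin()` from the standard library's `urllib.parse` module, as an alternative to concatenating strings. The sketch below is a minimal illustration; the variable names are chosen for this example only.

```python
from urllib.parse import urljoin

page_url = 'https://www.imdb.com/chart/top'
relative_link = '/title/tt0068646/'

# A link starting with '/' is resolved against the root of the site
absolute_url = urljoin(page_url, relative_link)
print(absolute_url)

# A link that is already absolute is returned unchanged
unchanged = urljoin(page_url, 'https://www.gutenberg.org/files/98/98-0.txt')
print(unchanged)
```

An advantage of this approach is that it also gives sensible results when a page mixes internal and external links.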

Once you have created a list of URLs using the method outlined above, you can also download all the pages that were found, using the `get()` method from the `requests` library.

As you can see, web scraping can easily become rather difficult. You need to inspect the structure of the HTML source quite carefully, and you often need to work with fairly complicated code to extract only the values that you need. 


### Advanced scraping: Scrapy

This tutorial has only scratched the surface of web scraping. To get specific data from webpages or APIs, you will need to dig into the data that you get and probably learn more about the data formats. A more advanced framework (or toolkit) for web scraping with Python is [Scrapy](https://scrapy.org). This framework simplifies the process of building a scraper/crawler considerably by providing basic functionalities out of the box. Although Scrapy does not understand what parts of webpages are of interest to you, it does many things for you, such as making sure you don't send too many requests at the same time or retrying requests that fail. Feel free to look at the [Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html) if you want to experiment with this library. 

# Exercises

## Exercise 10.1.

The list below contains a number of URLs. They are the web addresses of texts created for the [Project Gutenberg](https://www.gutenberg.org) website.

```
urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]
```

Write a program in Python that downloads all the files in this list and stores them in the current directory.
As filenames, use the same names that are used by Project Gutenberg (e.g. '580-0.txt' or '1400-0.txt').
The basename in a URL can be extracted using the [`os.path.basename()`](https://docs.python.org/3/library/os.path.html#os.path.basename) function.
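
As a small illustration of this function, the sketch below extracts the basename from one of the URLs in the list. It only processes the string and does not download anything.

```python
import os

url = 'https://www.gutenberg.org/files/580/580-0.txt'

# basename() returns everything after the final slash
filename = os.path.basename(url)
print(filename)
```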


In [None]:
import requests
import os 

# Recreate the given list using copy and paste
urls = [  
]

# We use a for-loop to take the same steps for each item in the list:
for url in urls:
    # 1. Download the file contents
    
    # 1a. Force the textual contents to be interpreted as UTF-8 encoded, because the website does not send the text encoding
    
    # 2. Use basename to get a suitable filename
    
    # 3. Open the file in write mode and write the downloaded file contents to the file
    
    # 4. Close the file
    
    

## Exercise 10.2.

Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. Your code needs to display the first 30 results only.

*Hint: the tutorial covers the Wikipedia API.*

In [None]:
import requests

baseURL = 'https://en.wikipedia.org/w/api.php?action=opensearch'

# Get the search results and display them



## Exercise 10.3.

Write an application in Python that extracts all the publications that have been added to a specific ORCID account, using the ORCID API.

Information about individual ORCID accounts can be obtained by appending their ID to the base URL <https://pub.orcid.org/v2.0/>. The ORCID API returns data in XML by default. In the XML, the list of publications can be found using the XPath `r:record/a:activities-summary/a:works/a:group` (using the namespace declarations given below).

*Note: we use the [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) library to process the XML data. It is very powerful, but has a quite steep learning curve.*
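
To get a feel for namespace-aware XPath expressions in ElementTree, you can first try them on a tiny document. The XML in the sketch below is entirely made up, including its namespaces and element names; it is not ORCID data and only shows how the prefixes in a dictionary are combined with `findall()`.

```python
import xml.etree.ElementTree as ET

# Made-up XML with made-up namespaces; real ORCID records are far larger
xml_text = """
<l:library xmlns:l="http://example.org/ns/library" xmlns:b="http://example.org/ns/book">
  <l:shelf>
    <b:book><b:title>First made-up title</b:title></b:book>
    <b:book><b:title>Second made-up title</b:title></b:book>
  </l:shelf>
</l:library>
""".strip()

# The prefixes used in the path are looked up in this dictionary
ns = {'l': 'http://example.org/ns/library',
      'b': 'http://example.org/ns/book'}

root = ET.fromstring(xml_text)

# The path is relative to the element on which findall() is called
titles = [title.text for title in root.findall('l:shelf/b:book/b:title', ns)]
print(titles)
```

Note that the path does not start with the root element itself, because `findall()` searches underneath the element it is called on.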

In [None]:
# Choose an ORCID to look up, e.g. 0000-0002-8469-6804
orcid = ''


import re
import requests
import xml.etree.ElementTree as ET


ns = {'r': 'http://www.orcid.org/ns/record' ,
'o': 'http://www.orcid.org/ns/orcid' ,
's' : 'http://www.orcid.org/ns/search' ,
'h': 'http://www.orcid.org/ns/history' ,
'p': 'http://www.orcid.org/ns/person' ,
'pd': 'http://www.orcid.org/ns/personal-details' ,
'a': 'http://www.orcid.org/ns/activities' ,
'e': 'http://www.orcid.org/ns/employment' ,
'c': 'http://www.orcid.org/ns/common' , 
'w': 'http://www.orcid.org/ns/work'}

# We expect that there may be an error and therefore use `try` and `except`
try:
    # Construct the API call
    orcidUrl = "https://pub.orcid.org/v2.0/" + orcid
    print( orcidUrl )
    
    # Find and print the record creation date
    
    
    # Find and print the titles of the publications

            
except:
    print("Data could not be downloaded")

## Exercise 10.4.

The API developed by [OpenStreetMap](https://www.openstreetmap.org/) can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search. 

Following the `q` parameter, you need to supply a string describing the locations whose latitude and longitude you want to find. As values for the `format` parameter, you can use `xml` for XML-formatted data or `json` for JSON-formatted data. Use this API to find the longitude and the latitude of the addresses in the following list:

```
addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']
```

The JSON data received via the OpenStreetMap API can be converted to regular Python lists and dictionaries using the `json()` method: 

```json_data = response.json()```

If the result is saved as variable named `json_data`, you should be able to access the latitude and the longitude as follows:

```
latitude = json_data[0]['lat']
longitude = json_data[0]['lon']
```
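
The sketch below shows one way to make this access a little more robust. It works on a made-up sample in the assumed shape (a list with one dictionary per match, with the coordinates stored as strings) and does not contact the API; the function name `first_coordinates` is hypothetical.

```python
import json

# Made-up response text in the assumed shape: a list with one dictionary per match
sample_text = '[{"display_name": "Made-up place", "lat": "52.1500", "lon": "4.4800"}]'
json_data = json.loads(sample_text)

def first_coordinates(results):
    """Return (latitude, longitude) of the first match as floats, or None without a match."""
    if not results:
        return None
    return float(results[0]['lat']), float(results[0]['lon'])

print(first_coordinates(json_data))
print(first_coordinates([]))
```

Checking for an empty list first avoids an `IndexError` when an address cannot be found.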

In [None]:

import requests
import urllib.parse
import re
import string
from os.path import isfile, join , isdir
import os

addresses = ['Grote Looiersstraat 17 Maastricht' , 
             'Witte Singel 27 Leiden','Singel 425 Amsterdam' , 
             'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']

for a in addresses:
    # create the API call, with the address in the 'q' parameter
    
    # Get the JSON data and process the data using json()
    
    # Find the latitude and the longitude
    #latitude = json_data[0]['lat']
    #longitude = json_data[0]['lon']
    
    



## Exercise 10.5

The webpage below offers access to the complete work of the author H.P. Lovecraft. 

http://www.hplovecraft.com/writings/texts/

    
Write code in Python to find the URLs of all the texts that are listed. The links are all encoded in an element named &lt;a&gt;. The attribute `href` mentions the links, and the body of the &lt;a&gt; element mentions the title. List only the web pages that end in '.aspx'. 


In [None]:
from bs4 import BeautifulSoup
import requests
import re

base_url = "http://www.hplovecraft.com/writings/texts/"


## Exercise 10.6

Using `requests` and `BeautifulSoup`, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.

Also collect data about the capital, the population and the area of all of these countries. 

## Exercise 10.7

Download all the images shown on the following page: https://www.bbc.com/news/in-pictures-61014501 

You can follow these steps:

* Download the HTML file
* 'Scrape' the HTML file you downloaded. As images in HTML are encoded using the `<img>` element, try to create a list containing all occurrences of this element. 
* Find the URLs of all the images. Within each `<img>` element, there should be a `src` attribute containing the URL of the image. 
* The bbc.com website uses images as part of the user interface. These images all have the word 'line' in their filenames. Try to exclude the images whose filenames contain the word 'line'. 
* Download all the images that you found in this way, using the `requests` library. In the `Response` object that is created following a successful download, you need to work with the `content` property to obtain the actual file. Save all these images on your computer, using `open()` and `write()`. In the `open()` function, use the mode `'wb'` as a second parameter (instead of only `'w'`) to make sure that the contents are saved as bytes.
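
The last step can be sketched without downloading anything. In the example below, `image_bytes` is a made-up stand-in for the `content` of a downloaded image (it holds only the eight bytes with which every PNG file starts), and the file is written to a temporary folder rather than to the current directory.

```python
import os
import tempfile

# Stand-in for response.content: the 8-byte signature that opens every PNG file
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, 'example.png')

    # 'wb' opens the file for writing in binary mode
    with open(target, 'wb') as file_handle:
        written = file_handle.write(image_bytes)

    # 'rb' reads the bytes back unchanged
    with open(target, 'rb') as file_handle:
        saved = file_handle.read()

print(written, saved == image_bytes)
```

Using `with` makes sure that the file is closed automatically once the block ends.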


## Exercise 10.8

As was discussed in this notebook, you can use the *MusicBrainz* API to request information about musicians. Via the code that is provided, you can request the names and the type (i.e. are we dealing with a person or with a group?). This specific API can make much more information available, however. Try to add some code which can add the following data about each artist: 

* The date of birth (in the case of a person) or formation (in the case of a group)
* The date of death or breakup
* The place of birth or formation
* The place of death or breakup
* Aliases
* Tags associated with the artist

Tip: 'Uncomment' the print statement in the second cell to be able to explore the structure of the JSON data. 


In [None]:
import requests
from requests.utils import requote_uri


root_url = 'https://musicbrainz.org/ws/2/'

## The parameters for the API call are defined as variables
entity = 'artist'
query = 'David Bowie'
limit = 5
fmt = 'json'

query = requote_uri(query)

api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'
response = requests.get( api_call )

In [None]:
import json

musicbrainz_results = response.json()

for artist in musicbrainz_results['artists']:
    #print(json.dumps(artist, indent=4))
    name = artist.get('name','[unknown]')
    artist_type = artist.get('type','[unknown]')
    print(f'{name} ({artist_type})')
    ## Add your code below
    
    

## Exercise 10.9

*[PLOS One](https://journals.plos.org/plosone/)* is a peer reviewed open access journal. The *PLOS One* API can be used to request metadata about all the articles that have been published in the journal. In this API, you can refer to specific articles using their [DOI](https://www.doi.org/).

Such requests can be sent using API calls with the following structure:

https://api.plos.org/search?q=id:{doi}

To acquire data about the article with DOI [10.1371/journal.pone.0270739](https://doi.org/10.1371/journal.pone.0270739), for example, you can use the following API call:

https://api.plos.org/search?q=id:10.1371/journal.pone.0270739

Try to write code which can get hold of metadata about the articles with the following DOIs:

* 10.1371/journal.pone.0169045
* 10.1371/journal.pone.0271074
* 10.1371/journal.pone.0268993

For each article, print the title, the publication date, the article type, a list of all the authors and the abstract. 


In [1]:
import requests

dois = [ '10.1371/journal.pone.0169045',
        '10.1371/journal.pone.0268993',
        '10.1371/journal.pone.0271074' ]


    