# Data acquisition - Solutions


## Exercise 10.1.

The list below contains a number of URLs. They are the web addresses of texts created for the [Project Gutenberg](https://www.gutenberg.org) website.

```
urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]
```

Write a program in Python that downloads all the files in this list and stores them in the current directory.
As filenames, use the same names that are used by Project Gutenberg (e.g. '580-0.txt' or '1400-0.txt').
The basename in a URL can be extracted using the [`os.path.basename()`](https://docs.python.org/3/library/os.path.html#os.path.basename) function.
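
As a quick illustration of that function, the sketch below applies `os.path.basename()` to the first URL from the list. It downloads nothing, and the variable names are chosen here for illustration only.

```python
import os

# First URL from the exercise list
url = 'https://www.gutenberg.org/files/580/580-0.txt'

# basename() returns everything after the final slash
filename = os.path.basename(url)
print(filename)
```

This works because these URLs end in a plain file name. A URL with a query string (`?...`) after the file name would need extra handling.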


In [None]:
import requests
import os 

# Recreate the given list using copy and paste
urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]

# We use a for-loop to take the same steps for each item in the list:
for url in urls:
    # 1. Download the file contents
    response = requests.get(url)
    # 1a. Force the textual contents to be interpreted as UTF-8 encoded, because the website does not send the text encoding
    response.encoding = 'utf-8'
    # 2. Use basename to get a suitable filename
    filename = os.path.basename(url)
    # 3. Open the file in write mode and write the downloaded file contents to the file
    # 4. The with-statement closes the file automatically, even if writing fails
    with open( filename , mode = 'w', encoding= 'utf-8' ) as out:
        out.write( response.text )
    

## Exercise 10.2.

Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. Your code needs to display the first 30 results only.

*Hint: the tutorial covers the Wikipedia API.*

In [None]:
import requests

# Let's construct the full API call (which is a URL) piece by piece
baseURL = 'https://en.wikipedia.org/w/api.php?action=opensearch'

searchTerm = "Dutch"
limit = 30
data_format = 'json'

apiCall = '{}&search={}&limit={}&format={}'.format( baseURL, searchTerm , limit , data_format )

# Get the data using the Requests library
responseData = requests.get( apiCall )

# Because we asked for and got JSON-formatted data, Requests lets us access
# the data as a Python data structure using the .json() method
wikiResults = responseData.json()

# Now we print the search results 
for i in range( 0 , len(wikiResults[1]) ):
    print( 'Title: ' + wikiResults[1][i] )
    print( 'Tagline: ' + wikiResults[2][i] )
    print( 'Url: ' + wikiResults[3][i] + '\n')
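
The indexing in the loop above relies on the shape of the opensearch response: a list holding the search term, a list of titles, a list of descriptions and a list of URLs. The sketch below uses a small hypothetical response (not real API output) to show that structure, and pairs titles with URLs using `zip()` as an alternative to indexing.

```python
# Hypothetical response, shaped like the list that .json() returns for opensearch:
# [search term, list of titles, list of descriptions, list of URLs]
sample_results = [
    'Dutch',
    ['Dutch', 'Dutch language'],
    ['', ''],
    ['https://en.wikipedia.org/wiki/Dutch',
     'https://en.wikipedia.org/wiki/Dutch_language'],
]

# Unpack the four parts into separate names
search_term, titles, descriptions, urls = sample_results

# zip() pairs the items that share an index
pairs = list(zip(titles, urls))
for title, url in pairs:
    print('Title: ' + title)
    print('Url: ' + url + '\n')
```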
    


## Exercise 10.3.

Write an application in Python that extracts all the publications that have been added to a specific ORCID account, using the ORCID API.

Information about individual ORCID accounts can be obtained by appending their ID to the base URL <https://pub.orcid.org/v2.0/>. The ORCID API returns data in XML by default. In the XML, the list of publications can be found using the XPath `r:record/a:activities-summary/a:works/a:group` (using the namespace declarations given below). Because `r:record` is the root element of the response, a search that starts from the root uses the path from `a:activities-summary` onwards.

*Note: we use the [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) library to process the XML data. It is very powerful, but has quite a steep learning curve.*
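
To show how namespace abbreviations work in ElementTree, here is a minimal sketch that runs without any download. The XML string is a hypothetical, heavily simplified stand-in for an ORCID record; a real response contains many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, reduced to the elements needed for this sketch
sample_xml = """
<r:record xmlns:r="http://www.orcid.org/ns/record"
          xmlns:a="http://www.orcid.org/ns/activities"
          xmlns:w="http://www.orcid.org/ns/work"
          xmlns:c="http://www.orcid.org/ns/common">
  <a:activities-summary>
    <a:works>
      <a:group>
        <w:work-summary>
          <w:title><c:title>First example title</c:title></w:title>
        </w:work-summary>
      </a:group>
      <a:group>
        <w:work-summary>
          <w:title><c:title>Second example title</c:title></w:title>
        </w:work-summary>
      </a:group>
    </a:works>
  </a:activities-summary>
</r:record>
""".strip()

# Map each abbreviation used in the paths to its full namespace URI
ns = {'a': 'http://www.orcid.org/ns/activities',
      'w': 'http://www.orcid.org/ns/work',
      'c': 'http://www.orcid.org/ns/common'}

# The root is the r:record element, so paths start below it
root = ET.fromstring(sample_xml)
groups = root.findall('a:activities-summary/a:works/a:group', ns)

titles = []
for group in groups:
    titles.append(group.find('w:work-summary/w:title/c:title', ns).text)
print(titles)

# find() returns None when nothing matches, so test before using .text
missing = groups[0].find('c:external-ids', ns)
print(missing is None)
```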

In [None]:
orcid = '0000-0002-8469-6804'


import requests
import xml.etree.ElementTree as ET

# Declare namespace abbreviations
ns = {'o': 'http://www.orcid.org/ns/orcid' ,
's' : 'http://www.orcid.org/ns/search' ,
'h': 'http://www.orcid.org/ns/history' ,
'p': 'http://www.orcid.org/ns/person' ,
'pd': 'http://www.orcid.org/ns/personal-details' ,
'a': 'http://www.orcid.org/ns/activities' ,
'e': 'http://www.orcid.org/ns/employment' ,
'c': 'http://www.orcid.org/ns/common' , 
'w': 'http://www.orcid.org/ns/work',
'r': 'http://www.orcid.org/ns/record'}


try:
    # Construct the API call and retrieve the data
    orcidUrl = "https://pub.orcid.org/v2.0/" + orcid
    print( orcidUrl )
    
    response = requests.get( orcidUrl )
    
    # Parse XML string into its Python ElementTree object representation
    root = ET.fromstring(response.text)
    
    # Find and print the ORCID creation date
    creationDate = root.find('h:history/h:submission-date' , ns ).text
    
    print('\nORCID created on:')
    print(creationDate)
    
    # Print the title and DOI of each work (DOI only when available)
    print('\nWorks:')
    
    works = root.findall('a:activities-summary/a:works/a:group' , ns )
    for w in works:
        title = w.find('w:work-summary/w:title/c:title' , ns ).text
        print(title)
        doiEl = w.find('c:external-ids/c:external-id/c:external-id-url' , ns )
        if doiEl is not None:
            doi = doiEl.text
            print(doi)
            
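except Exception as err:
    # Covers network errors, malformed XML and missing elements alike
    print("Data could not be downloaded or parsed: {}".format(err))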

## Exercise 10.4.

The API developed by [OpenStreetMap](https://www.openstreetmap.org/) can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search. As the value of the `q` parameter, you need to supply a string describing the location whose latitude and longitude you want to find. As values for the `format` parameter, you can use `xml` for XML-formatted data or `json` for JSON-formatted data. Use this API to find the longitude and the latitude of the addresses in the following list:

```
addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']
```

In [None]:
import requests
import xml.etree.ElementTree as ET
import re

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' , 'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']


for a in addresses:
    # Construct the API query and get the data
    # Note that spaces are not allowed in URLs, so they need to be 'escaped', i.e. replaced by %20
    url = 'https://nominatim.openstreetmap.org/search?q='+ a + '&format=xml'
    # A raw string (r'...') keeps the backslash in the regular expression intact
    url = re.sub( r'\s+' , '%20' , url )
    # See the note below about URL escaping

    print("Trying {}...".format(url))
    response = requests.get( url )
    
    # Parse the XML using the ElementTree library and find places
    root = ET.fromstring( response.text )
    places = root.findall('place')
    
    # Print the first result, if there are results
    # findall() returns a list, which is empty (never None) when nothing matches
    if places:
        place = places[0]
        lat = place.attrib['lat']
        lon = place.attrib['lon']
        print( '{}: {},{}\n'.format( a, lat , lon ) )


### URL escaping

Replacing spaces with the escape sequence for spaces is just one part of 'URL escaping'. There are other, preferred ways of escaping illegal characters in URLs, such as [urllib.parse.urlencode](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode), which is specifically meant for query parameters.
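
As a minimal sketch of that approach, the code below builds the query URL for one of the addresses with `urlencode()` and, for comparison, escapes the same address with `urllib.parse.quote()`. It only constructs strings and sends no request; the variable names are chosen here for illustration.

```python
from urllib.parse import quote, urlencode

base_url = 'https://nominatim.openstreetmap.org/search'
address = 'Witte Singel 27 Leiden'

# urlencode() escapes every key and value of the query;
# by default it encodes spaces as '+'
query = urlencode({'q': address, 'format': 'xml'})
url = base_url + '?' + query
print(url)

# quote() escapes a single string and encodes spaces as %20
escaped = quote(address)
print(escaped)
```

The Requests library applies comparable escaping when the parameters are passed as a dictionary through the `params` argument of `requests.get()`, so the URL does not have to be assembled by hand.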