# 10. Data acquisition


## Exercise 10.1.

The list below contains a number of URLs. They are the web addresses of texts created for the [Project Gutenberg](https://www.gutenberg.org) website.

```
urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]
```

Write a program in Python that downloads all the files in this list and stores them in the current directory.
As filenames, use the same names that are used by Project Gutenberg (e.g. '580-0.txt' or '1400-0.txt').
The basename in a URL can be extracted using the [`os.path.basename()`](https://docs.python.org/3/library/os.path.html#os.path.basename) function.
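
As a minimal sketch of that last step, the snippet below applies `os.path.basename()` to one of the URLs above. The helper name `filename_from_url` is our own. Parsing the URL with `urllib.parse.urlparse` first is an optional precaution, so that a query string, if a URL had one, would not end up in the filename.

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path segment of a URL (hypothetical helper)."""
    # urlparse separates the path from any query string or fragment
    path = urlparse(url).path
    return os.path.basename(path)

print(filename_from_url('https://www.gutenberg.org/files/580/580-0.txt'))
```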


In [None]:
import requests
import os 

# Recreate the given list using copy and paste
urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]

# We use a for-loop to take the same steps for each item in the list:
for url in urls:
    # 1. Download the file contents
    response = requests.get(url)
    # 1a. Force the textual contents to be interpreted as UTF-8 encoded, because the website does not send the text encoding
    response.encoding = 'utf-8'
    # 2. Use basename to get a suitable filename
    filename = os.path.basename(url)
    # 3. Open the file in write mode and write the downloaded file contents to it.
    #    The with-statement closes the file automatically, even if writing fails.
    with open( filename , mode = 'w', encoding= 'utf-8' ) as out:
        out.write( response.text )
    

## Exercise 10.2.

Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. Your code needs to display the first 30 results only.

*Hint: the tutorial covers the Wikipedia API.*

In [None]:
import requests
import json

# Let's construct the full API call (which is a URL) piece by piece
baseURL = 'https://en.wikipedia.org/w/api.php?action=opensearch'

searchTerm = "Dutch"
limit = 30
data_format = 'json'

apiCall = '{}&search={}&limit={}&format={}'.format( baseURL, searchTerm , limit , data_format )

# Get the data using the Requests library
responseData = requests.get( apiCall )

# Because we asked for and got JSON-formatted data, Requests lets us access
# the data as a Python data structure using the .json() method
wikiResults = responseData.json()

# Now we print the search results 
for i in range( 0 , len(wikiResults[1]) ):
    print( 'Title: ' + wikiResults[1][i] )
    print( 'Tagline: ' + wikiResults[2][i] )
    print( 'Url: ' + wikiResults[3][i] + '\n')
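
The same API call can also be assembled with `urllib.parse.urlencode`, which takes care of escaping special characters in the search term. The sketch below assumes the response layout used in the solution above (search term, list of titles, list of descriptions, list of URLs). The `sample` data is made up for illustration and the helper names are our own.

```python
from urllib.parse import urlencode

def build_opensearch_url(search_term, limit=30):
    """Build the OpenSearch API call for a search term (hypothetical helper)."""
    params = {
        'action': 'opensearch',
        'search': search_term,
        'limit': limit,
        'format': 'json',
    }
    return 'https://en.wikipedia.org/w/api.php?' + urlencode(params)

def pair_titles_and_urls(results):
    """Combine the title list (index 1) and the URL list (index 3) into pairs."""
    return list(zip(results[1], results[3]))

# Made-up sample in the same four-part layout as the response used above
sample = ['Dutch',
          ['Dutch', 'Dutch language'],
          ['', ''],
          ['https://en.wikipedia.org/wiki/Dutch',
           'https://en.wikipedia.org/wiki/Dutch_language']]

print(build_opensearch_url('Dutch'))
for title, url in pair_titles_and_urls(sample):
    print(f'{title}: {url}')
```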
    


## Exercise 10.3.

Write an application in Python that extracts all the publications that have been added to a specific ORCID account, using the ORCID API.

Information about individual ORCID accounts can be obtained by appending their ID to the base URL <https://pub.orcid.org/v2.0/>. The ORCID API returns data in XML by default. The root element of the XML is `r:record`; from there, the list of publications can be found using the XPath `a:activities-summary/a:works/a:group` (using the namespace declarations given below).

*Note: we use the [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) library to process the XML data. It is very powerful, but has a quite steep learning curve.*
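
To illustrate how namespace prefixes work in ElementTree, the sketch below parses a small hand-written XML fragment. The fragment only imitates the nesting described above; it is not a real ORCID response, and the title in it is invented. The second argument of `find()` and `findall()` maps each prefix in the path to its namespace URI.

```python
import xml.etree.ElementTree as ET

# Only the namespaces needed for this fragment
ns = {'r': 'http://www.orcid.org/ns/record',
      'a': 'http://www.orcid.org/ns/activities',
      'w': 'http://www.orcid.org/ns/work',
      'c': 'http://www.orcid.org/ns/common'}

# Hand-written, simplified XML (an assumption, not actual API output)
xml_text = '''
<r:record xmlns:r="http://www.orcid.org/ns/record"
          xmlns:a="http://www.orcid.org/ns/activities"
          xmlns:w="http://www.orcid.org/ns/work"
          xmlns:c="http://www.orcid.org/ns/common">
  <a:activities-summary>
    <a:works>
      <a:group>
        <w:work-summary>
          <w:title><c:title>An example title</c:title></w:title>
        </w:work-summary>
      </a:group>
    </a:works>
  </a:activities-summary>
</r:record>
'''

# The root element is r:record, so paths are written relative to it
root = ET.fromstring(xml_text)
titles = []
for group in root.findall('a:activities-summary/a:works/a:group', ns):
    title_el = group.find('w:work-summary/w:title/c:title', ns)
    # find() returns None when the path matches nothing
    if title_el is not None:
        titles.append(title_el.text)

print(titles)
```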

In [None]:
orcid = '0000-0002-8469-6804'


import re
import requests
import xml.etree.ElementTree as ET

# Declare namespace abbreviations
ns = {'o': 'http://www.orcid.org/ns/orcid' ,
's' : 'http://www.orcid.org/ns/search' ,
'h': 'http://www.orcid.org/ns/history' ,
'p': 'http://www.orcid.org/ns/person' ,
'pd': 'http://www.orcid.org/ns/personal-details' ,
'a': 'http://www.orcid.org/ns/activities' ,
'e': 'http://www.orcid.org/ns/employment' ,
'c': 'http://www.orcid.org/ns/common' , 
'w': 'http://www.orcid.org/ns/work',
'r': 'http://www.orcid.org/ns/record'}


try:
    # Construct the API call and retrieve the data
    orcidUrl = "https://pub.orcid.org/v2.0/" + orcid
    print( orcidUrl )
    
    response = requests.get( orcidUrl )
    
    # Parse XML string into its Python ElementTree object representation
    root = ET.fromstring(response.text)
    
    # Find and print the ORCID creation date
    creationDate = root.find('h:history/h:submission-date' , ns ).text
    
    print('\nORCID created on:')
    print(creationDate)
    
    # Print the title and DOI of each work (DOI only when available)
    print('\nWorks:')
    
    works = root.findall('a:activities-summary/a:works/a:group' , ns )
    for w in works:
        title = w.find('w:work-summary/w:title/c:title' , ns ).text
        print(title)
        doiEl = w.find('c:external-ids/c:external-id/c:external-id-url' , ns )
        if doiEl is not None:
            doi = doiEl.text
            print(doi)
            
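except (requests.RequestException, ET.ParseError, AttributeError) as error:
    # Network problems, malformed XML, or an element that is missing from the data
    print(f"Data could not be retrieved: {error}")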

## Exercise 10.4.

The API developed by [OpenStreetMap](https://www.openstreetmap.org/) can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search. As the value of the `q` parameter, supply a string describing the location whose latitude and longitude you want to find. As the value of the `format` parameter, use `xml` for XML-formatted data or `json` for JSON-formatted data. Use this API to find the longitude and the latitude of the addresses in the following list:

```
addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']
```
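
Addresses contain spaces, which need to be percent-encoded before they can be placed in a URL. A minimal sketch of building the API call, using `urllib.parse.quote`; the helper name `build_search_url` is our own:

```python
from urllib.parse import quote

BASE_URL = 'https://nominatim.openstreetmap.org/search'

def build_search_url(address, data_format='json'):
    """Return the API call for one address (hypothetical helper)."""
    # quote() replaces characters such as spaces with %20
    return f'{BASE_URL}?q={quote(address)}&format={data_format}'

print(build_search_url('Witte Singel 27 Leiden'))
```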

In [None]:

import requests
import urllib.parse

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']


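for a in addresses:
    # Percent-encode the address so that it can be used safely in a URL
    a_encoded = urllib.parse.quote(a)
    url = 'https://nominatim.openstreetmap.org/search?q='+ a_encoded + '&format=json'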

    response = requests.get( url )
    json_data = response.json() 
    latitude = json_data[0]['lat']
    longitude = json_data[0]['lon']
    print( f'{latitude},{longitude}')
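
A search that finds nothing yields an empty list, in which case `json_data[0]` raises an `IndexError`. The sketch below guards against that. The helper name and the sample data are invented for illustration; only the `lat` and `lon` keys are taken from the solution above.

```python
def first_coordinates(json_data):
    """Return (lat, lon) of the first search result, or None if there is none."""
    if not json_data:
        return None
    first = json_data[0]
    # The values are returned unchanged; convert them with float() for calculations
    return first['lat'], first['lon']

# Invented sample data with the same keys as used above
sample = [{'lat': '52.0', 'lon': '4.5'}]

print(first_coordinates(sample))
print(first_coordinates([]))
```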


The code below also creates an interactive map, with markers on the locations described in the list. 

In [None]:

import requests
import urllib.parse
import re

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,
'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']

locations_coord = dict()


for a in addresses:
    a_encoded = urllib.parse.quote(a)
    url = 'https://nominatim.openstreetmap.org/search?q='+ a_encoded + '&format=json'
    print(url)

    response = requests.get( url )
    json_data = response.json() 
    latitude = json_data[0]['lat']
    longitude = json_data[0]['lon']
    locations_coord[a] = (latitude,longitude)
    print( f'{latitude},{longitude}')


out = open( 'map.html' , 'w' , encoding = 'utf-8')

out.write('''
<!DOCTYPE html>
<html>
<head>

                <title>Locations</title>

                <meta charset="utf-8" />
                <meta name="viewport" content="width=device-width, initial-scale=1.0">

                <link rel="shortcut icon" type="image/x-icon" href="docs/images/favicon.ico" />

    <link rel="stylesheet" href="https://unpkg.com/leaflet@1.7.1/dist/leaflet.css" integrity="sha512-xodZBNTC5n17Xt2atTPuE1HxjVMSvLVW9ocqUKLsCC5CXdbqCmblAshOMAS6/keqq/sMZMZ19scR4PsZChSR7A==" crossorigin=""/>
    <script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js" integrity="sha512-XQoYMqMTK8LvdxXYG3nZ448hOEQiglfqkJs1NOQV44cWnUrBc8PkAOcXy20w0vlaXaVUearIOBhiXZ5V3ynxwA==" crossorigin=""></script>



</head>
<body>



<div id="mapid" style="width: 800px; height: 600px;"></div>
<script>

                var mymap = L.map('mapid').setView([52.1568157,4.4850392], 6);

                L.tileLayer('https://api.mapbox.com/styles/v1/{id}/tiles/{z}/{x}/{y}?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpejY4NXVycTA2emYycXBndHRqcmZ3N3gifQ.rJcFIG214AriISLbB6B5aw', {
                                maxZoom: 18,
                                attribution: 'Map data &copy; <a href="https://www.openstreetmap.org/">OpenStreetMap</a> contributors, ' +
                                                '<a href="https://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, ' +
                                                'Imagery &copy; <a href="https://www.mapbox.com/">Mapbox</a>',
                                id: 'mapbox/streets-v11',
                                tileSize: 512,
                                zoomOffset: -1
                }).addTo(mymap); 
''')

for l in locations_coord:
    display_name = re.sub( '\'' , '' , l )
    out.write( f' L.marker([ { locations_coord[l][0] }, { locations_coord[l][1] }  ]).addTo(mymap) ')
    out.write( f" .bindPopup('{display_name}.') ")  
    out.write( ';' )
    
out.write(
'''
</script>



</body>
</html>

''')

out.close()
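
Removing apostrophes keeps the generated JavaScript valid, but it also changes the displayed name. As an alternative sketch, `json.dumps` can turn a Python string into a quoted string literal with the necessary escaping. The function name `marker_js` and the coordinates in the example are our own inventions.

```python
import json

def marker_js(name, latitude, longitude):
    """Return one line of JavaScript that adds a Leaflet marker with a popup."""
    # json.dumps adds the surrounding double quotes and escapes special characters
    popup_text = json.dumps(name)
    return f'L.marker([{latitude}, {longitude}]).addTo(mymap).bindPopup({popup_text});'

print(marker_js("Example 's-Hertogenbosch", '51.7', '5.3'))
```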


## Exercise 10.5.

The webpage below offers access to the complete works of the author H.P. Lovecraft.

http://www.hplovecraft.com/writings/texts/

    
Write code in Python to find the URLs of all the texts that are listed. Each link is encoded in an `<a>` element: the `href` attribute contains the link, and the body of the `<a>` element contains the title. List only the web pages whose URLs end in '.aspx'.


In [None]:
from bs4 import BeautifulSoup
import requests
import re

base_url = "http://www.hplovecraft.com/writings/texts/"

response = requests.get(base_url)
if response: 
    #print(response.text)
    soup = BeautifulSoup( response.text ,"lxml")
    links = soup.find_all("a")
    for link in links:
        if link.get('href') is not None:
            title = link.string
            url = base_url + link.get('href')
            if re.search( r'aspx$' , url): 
                print( f'{title}\n{url}')
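
Concatenating `base_url` and the `href` value works as long as every link is relative to the page itself. A slightly more general sketch uses `urllib.parse.urljoin`, which also resolves absolute paths and full URLs. The `href` values below are invented examples, not actual links from the site.

```python
from urllib.parse import urljoin

base_url = 'http://www.hplovecraft.com/writings/texts/'

# Invented href values to show how different kinds of links are resolved
hrefs = ['fiction/example.aspx', '/writings/', 'http://www.example.org/page.aspx']

aspx_urls = []
for href in hrefs:
    url = urljoin(base_url, href)
    # str.endswith() is a simple way to test the file extension
    if url.endswith('.aspx'):
        aspx_urls.append(url)

for url in aspx_urls:
    print(url)
```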



## Exercise 10.6.

Using `requests` and `BeautifulSoup`, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.

Also collect data about the capital, the population and the area of all of these countries. 

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.scrapethissite.com/pages/simple/'

response = requests.get(url)

# Stop with an error if the download failed, so that html_page is always defined below
response.raise_for_status()
response.encoding = 'utf-8'
html_page = response.text

soup = BeautifulSoup( html_page,"lxml")
    
countries = soup.find_all('div', {'class': 'col-md-4 country'} )


for c in countries:
    
    name = c.find('h3' , { 'class':'country-name'})
    print(name.text.strip())
    
    # Find all <span> elements inside the <div> of this country
    span = c.find_all("span" )
    
    capital = ''
    population = 0
    area = 0
    
    for s in span:

        if s['class'][0] == 'country-capital':
            capital = s.text
        
        if s['class'][0] == 'country-population':
            population = s.text
            
        if s['class'][0] == 'country-area':
            area = s.text
    
    print(f'  Capital: {capital}')
    print(f'  Population: {population}')
    print(f'  Area: {area}')
    print()
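
The solution above prints the data as it goes. To keep the data for later use, the values can be collected in a list of dictionaries, converting the numbers from text along the way. A minimal sketch, assuming the population is written as a whole number and the area as a decimal number; the function name and the sample values are made up.

```python
def make_record(name, capital, population, area):
    """Bundle the scraped strings of one country into a dictionary."""
    return {
        'name': name.strip(),
        'capital': capital.strip(),
        # The scraped values are strings, so convert them to numbers
        'population': int(population),
        'area': float(area),
    }

country_data = []
# Made-up values standing in for the text of the scraped elements
country_data.append(make_record('  Exampleland ', 'Example City', '12345', '678.9'))

print(country_data)
```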


## Exercise 10.7.

Download all the images shown on the following page: https://www.bbc.com/news/in-pictures-61014501 

You can follow these steps:

* Download the HTML file
* 'Scrape' the HTML file you downloaded. As images in HTML are encoded using the `img` element, try to create a list containing all occurrences of this element. 
* Find the URLs of all the images. Within each `img` element, there should be a `src` attribute containing the URL of the image.
* The bbc.com website also uses images as part of the user interface. These images all have the word 'line' in their filenames. Exclude the images whose filenames contain the word 'line'.
* Download all the images that you found in this way, using the `requests` library. In the `Response` object that is created following a successful download, you need to work with the `content` property to obtain the actual file. Save all these images on your computer, using `open()` and `write()`. In the `open()` function, pass `'wb'` as the second argument (instead of only `'w'`) to make sure that the contents are saved as bytes.
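
The filtering and saving steps can be sketched without a network connection. In the snippet below, the list of `src` values and the image bytes are invented stand-ins for what the scraper and `requests` would deliver; `None` represents an `img` element without a `src` attribute. The helper name `keep_image` is our own.

```python
import os
import tempfile

def keep_image(src):
    """Return True for image URLs that should be downloaded."""
    # Skip missing src attributes and interface images with 'line' in the filename
    return src is not None and 'line' not in os.path.basename(src)

# Invented src values
srcs = ['https://example.org/img/photo1.jpg',
        'https://example.org/img/line-divider.png',
        None]
selected = [s for s in srcs if keep_image(s)]
print(selected)

# Invented bytes standing in for response.content
fake_content = b'not a real image'
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, os.path.basename(selected[0]))
    # 'wb' opens the file for writing bytes rather than text
    with open(target, 'wb') as out:
        out.write(fake_content)
    saved_size = os.path.getsize(target)

print(saved_size)
```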


In [None]:
import os
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/in-pictures-61014501'

response = requests.get(url)

if response:
    html_page = response.text
    soup = BeautifulSoup( html_page,"lxml")
    images = soup.find_all('img')
    for i in images:
        img_url = i.get('src')
        # Skip <img> elements without a src attribute, as well as interface images
        if img_url and 'line' not in img_url:
            response = requests.get(img_url)
            if response:
                file_name = os.path.basename(img_url)
                print(file_name)
                out = open( file_name , 'wb' )
                out.write(response.content)
                out.close()
    