# Data Collection

## Downloading Data
The built-in Python *urllib.request* module has functions which help in downloading content from HTTP URLs using minimal code.

In [7]:
import urllib.request
url = "http://mlg.ucd.ie/modules/COMP41680/ucd.txt"
response = urllib.request.urlopen(url)
text = response.read().decode()
print(text)

History of UCD

Originally known as the Catholic University of Ireland and subsequently as the Royal University, the university became UCD in 1908 and a constituent college of the National University of Ireland (NUI). 

In 1997, UCD became an autonomous university within the loose federal structure of the NUI and UCD students are awarded degrees of the National University of Ireland.

UCD has been a major contributor to the making of modern Ireland. Many UCD students and staff participated in the struggle for Irish independence and the university has produced numerous Irish Presidents and Taoisigh (Prime Ministers) in addition to generations of Irish business, professional, cultural and sporting leaders. 

Among UCD's well-known graduates are authors (Maeve Binchy, Roddy Doyle, Flann O'Brien), actors (Gabriel Byrne, Brendan Gleeson), directors (Neil Jordan, Jim Sheridan) and sports stars such as Irish rugby captain Brian O'Driscoll and former Manchester United and Ireland captain Kevin

In practice, we may often want to wrap code to fetch URLs in a try block, to handle the case where we cannot access the URL.

In [5]:
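import urllib.error
url = "http://somemissinglink.ucd.ie/ucd.txt"
try:
    response = urllib.request.urlopen(url)
    text = response.read().decode()
except urllib.error.URLError:
    # raised when the host cannot be reached or the server returns an HTTP error
    print("Failed to retrieve %s" % url)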

Failed to retrieve http://somemissinglink.ucd.ie/ucd.txt
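
A try block like this is often wrapped in a small reusable function. The sketch below is one possible version, not part of the original notebook: the name `fetch_text` and the 10-second timeout are assumptions made for illustration. It catches only `urllib.error.URLError` (which also covers HTTP error codes such as 404), closes the connection with a `with` block, and returns `None` on failure.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10):
    """Return the decoded body of a URL, or None if it cannot be retrieved."""
    try:
        # the with block closes the connection even if reading fails
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # use the charset declared in the response, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset)
    except urllib.error.URLError as err:
        # HTTPError is a subclass of URLError, so it is caught here too
        print("Failed to retrieve %s (%s)" % (url, err.reason))
        return None
```

A caller would then test the result before using it, for example `text = fetch_text(url)` followed by `if text is not None:`.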


## Working with CSV Data

The CSV ("Comma Separated Values") file format is often used to exchange tabular data between different applications, such as Excel. Essentially, a CSV file is a plain text file where the values on each line are separated by commas. Alternatively, the values can be separated by tabs or spaces.

As an example, we will look at a CSV file containing details of Premier League goal scorers. We could download this CSV file using *urllib.request* and manually parse it...

In [11]:
# Download the CSV and store as a string
url = "http://mlg.ucd.ie/modules/COMP41680/goal_scorers.csv"
response = urllib.request.urlopen(url)
raw_csv = response.read().decode()
# Parse each line
lines = raw_csv.split("\n")
for line in lines:
    line = line.strip()
    if len(line) > 0:
        # split based on a comma separator
        parts = line.split(",")
        print(parts)

['Player', 'Club', 'Total Goals', 'Home Goals', 'Away Goals']
['Jamie Vardy', 'Leicester City', '17', '8', '9']
['Sergio Aguero', 'Manchester City', '16', '8', '8']
['P. Aubameyang', 'Arsenal', '15', '6', '9']
['Danny Ings', 'Southampton', '15', '8', '7']
['Marcus Rashford', 'Manchester Utd', '14', '10', '4']
['Mohamed Salah', 'Liverpool', '14', '12', '2']
['Tammy Abraham', 'Chelsea', '13', '5', '8']
['Sadio Mané', 'Liverpool', '12', '7', '5']
['Raúl Jiménez', 'Wolverhampton', '11', '5', '6']
['Harry Kane', 'Tottenham', '11', '6', '5']
['D. Calvert-Lewin', 'Everton', '11', '6', '5']
['Raheem Sterling', 'Manchester City', '11', '2', '9']
['Teemu Pukki', 'Norwich City', '11', '7', '4']
['Chris Wood', 'Burnley', '10', '6', '4']
['Son Heungmin', 'Tottenham', '9', '6', '3']
['Anthony Martial', 'Manchester Utd', '9', '4', '5']
['Richarlison', 'Everton', '9', '5', '4']
['Kevin De Bruyne', 'Manchester City', '9', '6', '3']
['Gabriel Jesus', 'Manchester City', '9', '3', '6']
['Roberto Firmino',

Python also includes a built-in module called *csv* which simplifies the process of reading and writing CSV data.

See https://docs.python.org/3/library/csv.html

In [13]:
import csv
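
One reason to prefer the *csv* module over splitting strings by hand is that it understands quoted values and alternative separators. The sketch below uses made-up sample strings (they are not taken from the course files) to compare a manual split with `csv.reader`, and then parses tab-separated text through the `delimiter` argument.

```python
import csv
import io

# hypothetical sample: the first value contains a comma, so it is quoted
quoted_text = 'Player,Club,Total Goals\n"Calvert-Lewin, D.",Everton,11\n'

# manual splitting treats the comma inside the quotes as a separator
naive_parts = quoted_text.splitlines()[1].split(",")
print(len(naive_parts), naive_parts)

# csv.reader applies the quoting rules and returns three values
csv_rows = list(csv.reader(io.StringIO(quoted_text)))
print(len(csv_rows[1]), csv_rows[1])

# tab-separated data only needs a different delimiter
tab_text = "Player\tClub\tTotal Goals\nJamie Vardy\tLeicester City\t17\n"
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))
print(tab_rows)
```

Here `io.StringIO` simply lets the reader treat a string as if it were an open file, which keeps the example independent of any file on disk.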

In [15]:
# first download the file and save it
url = "http://mlg.ucd.ie/modules/COMP30760/goal_scorers.csv"
response = urllib.request.urlopen(url)
data = response.read().decode()
# the with block closes the file automatically
with open("goal_data.csv", "w") as fout:
    fout.write(data)

In [21]:
# next, use the csv module to read each line (row) into a dictionary
rows = []
# the csv documentation recommends newline="" when opening CSV files
with open("goal_data.csv", "r", newline="") as fin:
    reader = csv.DictReader(fin)
    for row in reader:
        print(row)
        rows.append(row)
print("Read %d rows of data" % len(rows))

{'Player': 'Jamie Vardy', 'Club': 'Leicester City', 'Total Goals': '17', 'Home Goals': '8', 'Away Goals': '9'}
{'Player': 'Sergio Aguero', 'Club': 'Manchester City', 'Total Goals': '16', 'Home Goals': '8', 'Away Goals': '8'}
{'Player': 'P. Aubameyang', 'Club': 'Arsenal', 'Total Goals': '15', 'Home Goals': '6', 'Away Goals': '9'}
{'Player': 'Danny Ings', 'Club': 'Southampton', 'Total Goals': '15', 'Home Goals': '8', 'Away Goals': '7'}
{'Player': 'Marcus Rashford', 'Club': 'Manchester Utd', 'Total Goals': '14', 'Home Goals': '10', 'Away Goals': '4'}
{'Player': 'Mohamed Salah', 'Club': 'Liverpool', 'Total Goals': '14', 'Home Goals': '12', 'Away Goals': '2'}
{'Player': 'Tammy Abraham', 'Club': 'Chelsea', 'Total Goals': '13', 'Home Goals': '5', 'Away Goals': '8'}
{'Player': 'Sadio Mané', 'Club': 'Liverpool', 'Total Goals': '12', 'Home Goals': '7', 'Away Goals': '5'}
{'Player': 'Raúl Jiménez', 'Club': 'Wolverhampton', 'Total Goals': '11', 'Home Goals': '5', 'Away Goals': '6'}
{'Player': 'Har
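
Note that `DictReader` returns every value as a string, including the goal counts. As a minimal sketch, assuming we want to do arithmetic on those columns, the example below converts them to integers. It uses the first two rows from the output above, held in a string so that it runs without the downloaded file.

```python
import csv
import io

# first rows of the goal scorers data, copied from the output above
sample = """Player,Club,Total Goals,Home Goals,Away Goals
Jamie Vardy,Leicester City,17,8,9
Sergio Aguero,Manchester City,16,8,8
"""

numeric_fields = ["Total Goals", "Home Goals", "Away Goals"]
typed_rows = []
for row in csv.DictReader(io.StringIO(sample)):
    # convert the goal counts from strings to integers
    for field in numeric_fields:
        row[field] = int(row[field])
    typed_rows.append(row)

# numeric operations now behave as expected
total = sum(row["Total Goals"] for row in typed_rows)
print(total)
```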

We can also use the *csv* module to write data out to a CSV file. In the example below, we will write out the data which we collected above, row by row:

In [1]:
import csv
# open the output file for writing
# (newline="" stops the csv module adding blank lines between rows on Windows)
with open("output.csv", "w", newline="") as fout:
    # specify the ordered list of fields in our file
    fields = ["Player", "Club", "Total Goals", "Home Goals", "Away Goals"]
    writer = csv.DictWriter(fout, fieldnames=fields)
    # write the header row
    writer.writeheader()
    # write each row of data
    for row in rows:
        writer.writerow(row)

## Working with JSON

[JSON](http://json.org/) is a lightweight format which is becoming increasingly popular for online data exchange. It was originally based on the JavaScript language and is (relatively) easy for humans to read and write.

The built-in module *json* provides an easy way to encode and decode data in JSON in Python.

In [25]:
import json

Let's try downloading and parsing a simple JSON file which contains information about a number of books, originally from librarything.com:

In [27]:
url = "http://mlg.ucd.ie/modules/COMP30760/books.json"
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

In [29]:
print(raw_json)

[{
	"book_id": "13585350",
	"title": "The World Treasury of Science Fiction",
	"ISBN": "0316349410",
	"year": 1989,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "124205572",
	"title": "The War of the Worlds",
	"ISBN": "1936594056",
	"year": 2013,
	"rating": 4,
	"language": "eng"
}, {
	"book_id": "127360065",
	"title": "Under the Dome: A Novel",
	"ISBN": "1439149038",
	"year": 2013,
	"rating": 2,
	"language": "eng"
}, {
	"book_id": "13908800",
	"title": "The Ultimate Hitchhiker's Guide to the Galaxy",
	"ISBN": "0345453743",
	"year": 2002,
	"rating": 5,
	"language": "eng"
}, {
	"book_id": "123734934",
	"title": "The Time Traveler's Wife",
	"ISBN": "1476764832",
	"year": 2014,
	"rating": 5,
	"language": "eng"
}, {
	"book_id": "13603020",
	"title": "Salem's Lot",
	"ISBN": "0451098277",
	"year": 1976,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "124173974",
	"title": "Republic",
	"ISBN": "039395501X",
	"year": 1985,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "123102859",
	

We can now parse the JSON, converting it from a string into a useful Python data structure:

In [31]:
data = json.loads(raw_json)
for book in data:
    print(book)

{'book_id': '13585350', 'title': 'The World Treasury of Science Fiction', 'ISBN': '0316349410', 'year': 1989, 'rating': 3, 'language': 'eng'}
{'book_id': '124205572', 'title': 'The War of the Worlds', 'ISBN': '1936594056', 'year': 2013, 'rating': 4, 'language': 'eng'}
{'book_id': '127360065', 'title': 'Under the Dome: A Novel', 'ISBN': '1439149038', 'year': 2013, 'rating': 2, 'language': 'eng'}
{'book_id': '13908800', 'title': "The Ultimate Hitchhiker's Guide to the Galaxy", 'ISBN': '0345453743', 'year': 2002, 'rating': 5, 'language': 'eng'}
{'book_id': '123734934', 'title': "The Time Traveler's Wife", 'ISBN': '1476764832', 'year': 2014, 'rating': 5, 'language': 'eng'}
{'book_id': '13603020', 'title': "Salem's Lot", 'ISBN': '0451098277', 'year': 1976, 'rating': 3, 'language': 'eng'}
{'book_id': '124173974', 'title': 'Republic', 'ISBN': '039395501X', 'year': 1985, 'rating': 3, 'language': 'eng'}
{'book_id': '123102859', 'title': 'The Road', 'ISBN': '0307387895', 'year': 2006, 'rating': 

We can now iterate through the books in the list and extract the relevant information that we require.

In [33]:
for book in data:
    print( "%s = %d" % ( book["title"], book["year"] ) )

The World Treasury of Science Fiction = 1989
The War of the Worlds = 2013
Under the Dome: A Novel = 2013
The Ultimate Hitchhiker's Guide to the Galaxy = 2002
The Time Traveler's Wife = 2014
Salem's Lot = 1976
Republic = 1985
The Road = 2006
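
Because the parsed data is made up of ordinary Python lists and dictionaries, it can be filtered and then converted back to JSON with `json.dumps()`. The sketch below uses two records copied from the books data above, with some fields left out for brevity. The rating threshold of 4 is an arbitrary choice for illustration.

```python
import json

# two records copied (and trimmed) from the books data shown earlier
raw = """[
  {"book_id": "13585350", "title": "The World Treasury of Science Fiction", "year": 1989, "rating": 3},
  {"book_id": "13908800", "title": "The Ultimate Hitchhiker's Guide to the Galaxy", "year": 2002, "rating": 5}
]"""

books = json.loads(raw)
# JSON numbers become Python ints, JSON strings become Python strs
print(type(books[0]["year"]), type(books[0]["book_id"]))

# keep only the books rated 4 or higher
top_rated = [b for b in books if b["rating"] >= 4]

# convert back to a JSON string; indent makes the output easier to read
print(json.dumps(top_rated, indent=2))
```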


## Working with XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. XML is a widely-adopted format. Python includes several built-in modules for parsing XML data.

The *xml.etree.ElementTree* module can be used to extract data from a simple XML file based on its tree structure. 

In [9]:
# download the content
url = "http://mlg.ucd.ie/modules/COMP30760/books.xml"
response = urllib.request.urlopen(url)
raw_xml = response.read().decode()
print(raw_xml)

<?xml version="1.0" encoding="UTF-8"?>
<booklist>
   <book id="13585350">
      <title>The World Treasury of Science Fiction</title>
      <ISBN>0316349410</ISBN>
      <year>1989</year>
      <rating>3</rating>
      <language>eng</language>
   </book>
   <book id="124205572">
      <title>The War of the Worlds</title>
      <ISBN>1936594056</ISBN>
      <year>2013</year>
      <rating>4</rating>
      <language>eng</language>
   </book>
   <book id="127360065">
      <title>Under the Dome: A Novel</title>
      <ISBN>1439149038</ISBN>
      <year>2013</year>
      <rating>2</rating>
      <language>eng</language>
   </book>
   <book id="13908800">
      <title>The Ultimate Hitchhiker's Guide to the Galaxy</title>
      <ISBN>0345453743</ISBN>
      <year>2002</year>
      <rating>5</rating>
      <language>eng</language>
   </book>
   <book id="123734934">
      <title>The Time Traveler's Wife</title>
      <ISBN>1476764832</ISBN>
      <year>2014</year>
      <rating>5</rating>
    

We can use the *xml.etree.ElementTree.fromstring()* function to parse content from a string containing XML data.

In [11]:
import xml.etree.ElementTree
tree = xml.etree.ElementTree.fromstring(raw_xml)
print(tree)

<Element 'booklist' at 0x000002D464C944A0>


An XML tree has a root node (i.e. the top level of the document), with child nodes at lower levels. We can iterate over these:

In [41]:
for child in tree:
    # get the name of the tag, along with any XML attributes which the tag has
    print( child.tag, child.attrib )

book {'id': '13585350'}
book {'id': '124205572'}
book {'id': '127360065'}
book {'id': '13908800'}
book {'id': '123734934'}
book {'id': '13603020'}
book {'id': '124173974'}
book {'id': '123102859'}


We can also query to find tags with specific names, such as 'book' and then in turn find child nodes of that tag with a specific name.

In [15]:
for book in tree.findall("book"):
    # get the text inside a <title> tag, contained within a <book> tag
    title = book.find("title").text
    print(type(title))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
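
Since the text inside every tag is a string, numeric fields such as the year need to be converted explicitly. As a rough sketch, the example below collects each book into a dictionary. It works on a trimmed copy of the XML shown earlier (two books, with the ISBN and language tags left out), and the `"unknown"` default is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# two <book> entries copied (and trimmed) from the XML shown earlier
sample_xml = """<booklist>
   <book id="13585350">
      <title>The World Treasury of Science Fiction</title>
      <year>1989</year>
      <rating>3</rating>
   </book>
   <book id="124205572">
      <title>The War of the Worlds</title>
      <year>2013</year>
      <rating>4</rating>
   </book>
</booklist>"""

root = ET.fromstring(sample_xml)
records = []
for book in root.findall("book"):
    records.append({
        # attributes are read with get(), element text with findtext()
        "id": book.get("id"),
        "title": book.findtext("title"),
        # convert the numeric fields from strings to integers
        "year": int(book.findtext("year")),
        "rating": int(book.findtext("rating")),
        # findtext() returns the default when a tag is missing
        "language": book.findtext("language", default="unknown"),
    })

for record in records:
    print(record)
```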


## Working with APIs

Instead of manually scraping HTML data, some web sites provide a convenient "official" way of retrieving their data via a Web API.

### Example - Wikipedia

As a simple example of using an online API, we will use the Wikipedia web API to perform a search for Wikipedia pages with titles which match a particular query keyword.

The endpoint for this API is given below. The complete documentation for using this endpoint is [available online](https://en.wikipedia.org/w/api.php):

In [45]:
end_point = "https://en.wikipedia.org/w/api.php"

We build a URL that includes the endpoint and the query parameters which specify what we are looking for:

In [47]:
# the keyword in page titles that we are searching for
keyword = "Dublin"
url = "%s?action=query&list=search&format=json&srsearch=%s" % (end_point, keyword)
print(url)

https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch=Dublin
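
Building the URL with string formatting works for a single word, but a keyword containing spaces or other special characters needs to be escaped first. One way to do this is `urllib.parse.urlencode()`, which builds the query string from a dictionary. The sketch below uses the keyword "Dublin Airport", chosen for illustration.

```python
import urllib.parse

end_point = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "search",
    "format": "json",
    # this keyword contains a space, which must be escaped in a URL
    "srsearch": "Dublin Airport",
}
# urlencode escapes each value and joins the pairs with "&"
url = end_point + "?" + urllib.parse.urlencode(params)
print(url)
```

The space in the keyword is encoded as `+`, so the printed URL ends in `srsearch=Dublin+Airport`.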


We send our request to the API using a standard HTTP request:

In [49]:
response = urllib.request.urlopen(url)
raw_json = response.read().decode("utf-8")

Once we have downloaded the JSON data into a string, we parse it using the *loads()* function, which will convert it into an actual Python dictionary that we can work with.

In [51]:
response_data = json.loads(raw_json)
response_data

{'batchcomplete': '',
 'continue': {'sroffset': 10, 'continue': '-||'},
 'query': {'searchinfo': {'totalhits': 81729},
  'search': [{'ns': 0,
    'title': 'Dublin',
    'pageid': 8504,
    'size': 177642,
    'wordcount': 16252,
    'snippet': '<span class="searchmatch">Dublin</span> (/ˈdʌblɪn/ ; Irish: Baile Átha Cliath, pronounced [ˈbˠalʲə aːhə ˈclʲiə] or [ˌbʲlʲaː ˈclʲiə]) is the capital of Ireland. On a bay at the mouth of',
    'timestamp': '2024-09-18T04:35:54Z'},
   {'ns': 0,
    'title': 'List of Dublin postal districts',
    'pageid': 961703,
    'size': 22595,
    'wordcount': 1946,
    'snippet': '<span class="searchmatch">Dublin</span> postal districts have been used by Ireland\'s postal service, known as An Post, to sort mail in <span class="searchmatch">Dublin</span>. The system is similar to that used in cities',
    'timestamp': '2024-09-25T07:30:50Z'},
   {'ns': 0,
    'title': 'County Dublin',
    'pageid': 6514,
    'size': 184379,
    'wordcount': 17260,
    'snippet

We can now process the results sent back by the API. For instance, we could print the top results returned for our keyword query:

In [53]:
for result in response_data["query"]["search"]:
    print(result["title"])

Dublin
List of Dublin postal districts
County Dublin
Dublin City
Kingdom of Dublin
In Dublin
Dublin Township
Dublin Regulation
Dublin (disambiguation)
Dublin Airport
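
The response contains more than the titles: it also reports the total number of hits and a `continue` entry which can be merged into the request parameters to ask for the next page of results. The sketch below works on a trimmed copy of the response shown above, so it runs without contacting the API. The function names `summarise_search` and `next_page_params` are hypothetical helpers, not part of any library.

```python
# a trimmed copy of the response structure shown above
sample_response = {
    "continue": {"sroffset": 10, "continue": "-||"},
    "query": {
        "searchinfo": {"totalhits": 81729},
        "search": [
            {"title": "Dublin", "pageid": 8504, "wordcount": 16252},
            {"title": "County Dublin", "pageid": 6514, "wordcount": 17260},
        ],
    },
}


def summarise_search(response_data):
    """Return (total hits, list of (title, wordcount)) from a search response."""
    query = response_data.get("query", {})
    total = query.get("searchinfo", {}).get("totalhits", 0)
    results = [(r["title"], r["wordcount"]) for r in query.get("search", [])]
    return total, results


def next_page_params(params, response_data):
    """Merge the 'continue' values into the parameters for the next request."""
    if "continue" not in response_data:
        return None  # no further pages are available
    return {**params, **response_data["continue"]}


total, results = summarise_search(sample_response)
print(total, results)

params = {"action": "query", "list": "search", "format": "json", "srsearch": "Dublin"}
print(next_page_params(params, sample_response))
```

Using `get()` with a default value means a response without a `query` section yields an empty result rather than raising a `KeyError`.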


### Example - Currency Exchange Rates

In the next example, we will use the *frankfurter.app* (formerly *Fixer.io*) API to get currency exchange rate information: https://frankfurter.app

To retrieve all rates relative to the euro, we retrieve data from the API endpoint https://api.frankfurter.app/latest. We do not need to specify any parameters in the URL in this case. Here we will use the *urllib.request* module in a slightly different way, using the *build_opener()* function:

In [55]:
end_point = "https://api.frankfurter.app/latest"
# note that we need to add a special header to appear to be a proper web browser
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# retrieve the data from the API endpoint
response = opener.open(end_point)
raw_json = response.read().decode("utf-8")
print(raw_json)

{"amount":1.0,"base":"EUR","date":"2024-10-04","rates":{"AUD":1.6121,"BGN":1.9558,"BRL":6.057,"CAD":1.4952,"CHF":0.9394,"CNY":7.7407,"CZK":25.347,"DKK":7.4579,"GBP":0.83735,"HKD":8.5629,"HUF":401.33,"IDR":17165,"ILS":4.2022,"INR":92.61,"ISK":149.1,"JPY":161.69,"KRW":1478.24,"MXN":21.269,"MYR":4.6542,"NOK":11.6845,"NZD":1.7779,"PHP":62.126,"PLN":4.3145,"RON":4.9769,"SEK":11.3375,"SGD":1.4314,"THB":36.484,"TRY":37.776,"USD":1.1029,"ZAR":19.2809}}
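
An alternative to configuring an opener is to attach the header to a single request using `urllib.request.Request`. The sketch below shows this approach. The helper name `fetch_json` and the 10-second timeout are assumptions made for illustration. Building the `Request` object does not contact the server, so the header can be inspected without a network connection.

```python
import json
import urllib.request


def fetch_json(url, user_agent="Mozilla/5.0", timeout=10):
    """Download a URL with a custom User-Agent header and parse it as JSON."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# creating the request object does not send anything yet
request = urllib.request.Request(
    "https://api.frankfurter.app/latest", headers={"User-Agent": "Mozilla/5.0"}
)
# header names are stored in capitalised form, e.g. "User-agent"
print(request.get_header("User-agent"))
```

Calling `fetch_json("https://api.frankfurter.app/latest")` would then return the parsed dictionary directly.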


Next, we parse the JSON data:

In [57]:
data = json.loads(raw_json)
# list all the rates
data

{'amount': 1.0,
 'base': 'EUR',
 'date': '2024-10-04',
 'rates': {'AUD': 1.6121,
  'BGN': 1.9558,
  'BRL': 6.057,
  'CAD': 1.4952,
  'CHF': 0.9394,
  'CNY': 7.7407,
  'CZK': 25.347,
  'DKK': 7.4579,
  'GBP': 0.83735,
  'HKD': 8.5629,
  'HUF': 401.33,
  'IDR': 17165,
  'ILS': 4.2022,
  'INR': 92.61,
  'ISK': 149.1,
  'JPY': 161.69,
  'KRW': 1478.24,
  'MXN': 21.269,
  'MYR': 4.6542,
  'NOK': 11.6845,
  'NZD': 1.7779,
  'PHP': 62.126,
  'PLN': 4.3145,
  'RON': 4.9769,
  'SEK': 11.3375,
  'SGD': 1.4314,
  'THB': 36.484,
  'TRY': 37.776,
  'USD': 1.1029,
  'ZAR': 19.2809}}

In [59]:
# get a specific rate
data["rates"]["CHF"]

0.9394
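
Once the rates are in a dictionary, converting an amount is simple arithmetic. The sketch below uses three rates copied from the output above. The function name `convert` is a hypothetical helper. It assumes that all rates are quoted against the euro, so a conversion between two other currencies goes via the euro.

```python
# a few rates against the euro, copied from the API output above
rates = {"CHF": 0.9394, "GBP": 0.83735, "USD": 1.1029}


def convert(amount, rates, target, source="EUR"):
    """Convert an amount between currencies using euro-based rates."""
    # the base currency is not listed in the rates, so treat it as 1.0
    source_rate = 1.0 if source == "EUR" else rates[source]
    target_rate = 1.0 if target == "EUR" else rates[target]
    # go via the euro: source -> EUR -> target
    return amount / source_rate * target_rate


# 100 euro in US dollars
print(round(convert(100, rates, "USD"), 2))
# 100 US dollars in Swiss francs
print(round(convert(100, rates, "CHF", source="USD"), 2))
```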

We can change the URL to get rates for a different currency, such as US Dollars (USD):

In [65]:
# create a URL based on the end point, with extra parameters
url = "%s?base=USD" % end_point
# note that we need to add a special header to appear to be a proper web browser
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# retrieve the data from the new URL (not the original end point)
response = opener.open(url)
raw_json = response.read().decode("utf-8")
# parse the JSON; the 'base' field of the response should now be 'USD'
data = json.loads(raw_json)
print(data)
# display the rates data for US dollars
data["rates"]