# Data Formats

## Downloading Data

The built-in Python *urllib.request* module provides functions for downloading content from HTTP URLs with minimal code.

In [5]:
import urllib.request
import urllib.error

url = "http://mlg.ucd.ie/modules/python/ucd.txt"
# open the url and read the response
response = urllib.request.urlopen(url)
# decode the response content from bytes to a utf-8 string
text = response.read().decode("utf-8")
print(text)

History of UCD

Originally known as the Catholic University of Ireland and subsequently as the Royal University, the university became UCD in 1908 and a constituent college of the National University of Ireland (NUI). 

In 1997, UCD became an autonomous university within the loose federal structure of the NUI and UCD students are awarded degrees of the National University of Ireland.

UCD has been a major contributor to the making of modern Ireland. Many UCD students and staff participated in the struggle for Irish independence and the university has produced numerous Irish Presidents and Taoisigh (Prime Ministers) in addition to generations of Irish business, professional, cultural and sporting leaders. 

Among UCD's well-known graduates are authors (Maeve Binchy, Roddy Doyle, Flann O'Brien), actors (Gabriel Byrne, Brendan Gleeson), directors (Neil Jordan, Jim Sheridan) and sports stars such as Irish rugby captain Brian O'Driscoll and former Manchester United and Ireland captain Kevin

In practice, we should always wrap code to fetch URLs in proper error handling blocks to handle the various cases where we cannot access the URL. This includes network issues, server errors, and malformed responses.

In [2]:
# specify a URL that does not exist
url = "http://somemissinglink.ucd.ie/ucd.txt"

try:
    # attempt to retrieve the URL
    response = urllib.request.urlopen(url)
    text = response.read().decode("utf-8")
    print(text)
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Failed to retrieve {url}")
except urllib.error.URLError as e:
    print(f"Network Error: Failed to retrieve {url} - {e.reason}")
except Exception as e:
    print(f"Unexpected error retrieving {url}: {e}")

Network Error: Failed to retrieve http://somemissinglink.ucd.ie/ucd.txt - [Errno 11001] getaddrinfo failed
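
Since the same error handling is needed every time we fetch a URL, one option is to wrap it in a small helper function. The sketch below is only an illustration: `fetch_text` is a hypothetical name (it is not part of `urllib`), and the 10-second timeout and the choice to return `None` on failure are assumptions rather than requirements.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of `url`, or None if it cannot be retrieved.

    `fetch_text` is a hypothetical helper name, not part of urllib.
    """
    try:
        # the with-block closes the connection once the body has been read
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as e:
        # the server answered, but with an error status such as 404
        print(f"HTTP Error {e.code}: Failed to retrieve {url}")
    except urllib.error.URLError as e:
        # no usable response, e.g. a DNS failure or an unknown URL scheme
        print(f"Network Error: Failed to retrieve {url} - {e.reason}")
    except UnicodeDecodeError as e:
        # the bytes were downloaded but are not valid in the given encoding
        print(f"Decoding Error: {url} is not valid {encoding} - {e}")
    return None
```

A call such as `text = fetch_text("http://mlg.ucd.ie/modules/python/ucd.txt")` would then return either the page text or `None`, so the calling code only has to check for `None`.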


## Working with CSV Data

The CSV ("Comma Separated Values") file format is often used to exchange tabular data between different applications, such as Excel. Essentially, a CSV file is a plain text file in which the values on each line are separated by commas. Some variants use a tab or a space as the separator instead.
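
As a brief illustration of a non-comma separator, the sketch below reads tab-separated text with Python's built-in `csv` module by passing `delimiter="\t"`. The sample string is made up for this example (it reuses one row of the goal scorer data) and is held in memory with `io.StringIO` rather than read from a file.

```python
import csv
import io

# a small tab-separated sample, made up for illustration
tsv_text = "Player\tClub\tTotal Goals\nJamie Vardy\tLeicester City\t17\n"

# io.StringIO lets csv.reader treat the string as if it were an open file
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
tsv_rows = list(reader)
print(tsv_rows)
```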

As an example, we will look at a CSV file containing details of Premier League goal scorers. We could download this CSV file using the `urllib.request.urlopen()` function and manually parse it...

In [3]:
# download the CSV with proper error handling
url = "http://mlg.ucd.ie/modules/python/goal_scorers.csv"

try:
    response = urllib.request.urlopen(url)
    raw_csv = response.read().decode("utf-8")
    
    # parse each line
    lines = raw_csv.split("\n")
    for line in lines:
        line = line.strip()
        if len(line) > 0:
            # split based on a comma separator
            parts = line.split(",")
            print(parts)
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download CSV file")
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
except Exception as e:
    print(f"Unexpected error: {e}")

['Player', 'Club', 'Total Goals', 'Home Goals', 'Away Goals']
['Jamie Vardy', 'Leicester City', '17', '8', '9']
['Sergio Aguero', 'Manchester City', '16', '8', '8']
['P. Aubameyang', 'Arsenal', '15', '6', '9']
['Danny Ings', 'Southampton', '15', '8', '7']
['Marcus Rashford', 'Manchester Utd', '14', '10', '4']
['Mohamed Salah', 'Liverpool', '14', '12', '2']
['Tammy Abraham', 'Chelsea', '13', '5', '8']
['Sadio Mané', 'Liverpool', '12', '7', '5']
['Raúl Jiménez', 'Wolverhampton', '11', '5', '6']
['Harry Kane', 'Tottenham', '11', '6', '5']
['D. Calvert-Lewin', 'Everton', '11', '6', '5']
['Raheem Sterling', 'Manchester City', '11', '2', '9']
['Teemu Pukki', 'Norwich City', '11', '7', '4']
['Chris Wood', 'Burnley', '10', '6', '4']
['Son Heungmin', 'Tottenham', '9', '6', '3']
['Anthony Martial', 'Manchester Utd', '9', '4', '5']
['Richarlison', 'Everton', '9', '5', '4']
['Kevin De Bruyne', 'Manchester City', '9', '6', '3']
['Gabriel Jesus', 'Manchester City', '9', '3', '6']
['Roberto Firmino',

Python also includes a built-in module called `csv` which simplifies the process of reading and writing CSV data.

See https://docs.python.org/3/library/csv.html
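
One reason to prefer the `csv` module over splitting on commas by hand is that a value may itself contain a comma, in which case it is normally wrapped in quotes. The sketch below compares the two approaches on a single made-up line; the quoted club value is invented for illustration and does not come from the goal scorers file.

```python
import csv
import io

# a made-up line whose second field contains a comma inside quotes
line = 'Jamie Vardy,"Leicester City, England",17'

# naive splitting breaks the quoted field into two pieces
naive_parts = line.split(",")
# csv.reader respects the quotes and keeps the field intact
csv_parts = next(csv.reader(io.StringIO(line)))

print(naive_parts)
print(csv_parts)
```

The manual split produces four parts, while `csv.reader` returns the three intended fields.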

In [5]:
# Download the file and save it as a CSV file using modern file handling
url = "http://mlg.ucd.ie/modules/python/goal_scorers.csv"

try:
    response = urllib.request.urlopen(url)
    data = response.read().decode("utf-8")
    
    with open("goal_data.csv", "w", encoding="utf-8") as fout:
        fout.write(data)
    print("Successfully downloaded and saved goal_data.csv")
    
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download file")
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
except IOError as e:
    print(f"File Error: Could not write to goal_data.csv - {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Successfully downloaded and saved goal_data.csv


Once we've saved the data, we can use the `csv` module to read each line (row) into a dictionary with proper error handling:

In [6]:
# Import the module
import csv

try:
    with open("goal_data.csv", "r", encoding="utf-8") as fin:
        reader = csv.DictReader(fin)
        rows = []
        for row_num, row in enumerate(reader, 1):
            try:
                print(row)
                rows.append(row)
            except Exception as e:
                print(f"Error processing row {row_num}: {e}")
    print(f"Successfully read {len(rows)} rows of data")
    
except IOError as e:
    print(f"File Error: Could not read goal_data.csv - {e}")
except csv.Error as e:
    print(f"CSV Error: Invalid CSV format - {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

{'Player': 'Jamie Vardy', 'Club': 'Leicester City', 'Total Goals': '17', 'Home Goals': '8', 'Away Goals': '9'}
{'Player': 'Sergio Aguero', 'Club': 'Manchester City', 'Total Goals': '16', 'Home Goals': '8', 'Away Goals': '8'}
{'Player': 'P. Aubameyang', 'Club': 'Arsenal', 'Total Goals': '15', 'Home Goals': '6', 'Away Goals': '9'}
{'Player': 'Danny Ings', 'Club': 'Southampton', 'Total Goals': '15', 'Home Goals': '8', 'Away Goals': '7'}
{'Player': 'Marcus Rashford', 'Club': 'Manchester Utd', 'Total Goals': '14', 'Home Goals': '10', 'Away Goals': '4'}
{'Player': 'Mohamed Salah', 'Club': 'Liverpool', 'Total Goals': '14', 'Home Goals': '12', 'Away Goals': '2'}
{'Player': 'Tammy Abraham', 'Club': 'Chelsea', 'Total Goals': '13', 'Home Goals': '5', 'Away Goals': '8'}
{'Player': 'Sadio Mané', 'Club': 'Liverpool', 'Total Goals': '12', 'Home Goals': '7', 'Away Goals': '5'}
{'Player': 'Raúl Jiménez', 'Club': 'Wolverhampton', 'Total Goals': '11', 'Home Goals': '5', 'Away Goals': '6'}
{'Player': 'Har
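
Note that every value in these dictionaries is a string, including the goal counts. If we want to do arithmetic on them, we need to convert them ourselves. The following sketch shows one way to do this, using two rows copied from the output above and held in memory; the variable names `numeric_fields` and `typed_rows` are our own choices.

```python
import csv
import io

# two rows copied from the output above, held in memory instead of goal_data.csv
sample = (
    "Player,Club,Total Goals,Home Goals,Away Goals\n"
    "Jamie Vardy,Leicester City,17,8,9\n"
    "Sergio Aguero,Manchester City,16,8,8\n"
)

numeric_fields = ["Total Goals", "Home Goals", "Away Goals"]
typed_rows = []
for row in csv.DictReader(io.StringIO(sample)):
    for field in numeric_fields:
        # csv always yields strings, so convert the counts explicitly
        row[field] = int(row[field])
    typed_rows.append(row)

# with integer values we can now compute totals
total = sum(row["Total Goals"] for row in typed_rows)
print(total)
```

In this sketch `int()` raises a `ValueError` if a value is not a whole number, so real data may need an extra check or a `try`/`except` around the conversion.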

We can also use the `csv` module to write data out to a CSV file. In the example below, we will write out the data which we collected above, row by row:

In [7]:
# Write data to CSV file using modern file handling
try:
    with open("output.csv", "w", encoding='utf-8', newline='') as fout:
        # Specify the ordered list of fields in our file
        fields = ["Player", "Club", "Total Goals", "Home Goals", "Away Goals"]
        writer = csv.DictWriter(fout, fieldnames=fields)
        # Write the header row
        writer.writeheader()
        # Write each row of data
        for row in rows:
            writer.writerow(row)
            
    print(f"Successfully wrote {len(rows)} rows to output.csv")    
except IOError as e:
    print(f"File Error: Could not write to output.csv - {e}")
except csv.Error as e:
    print(f"CSV Error: Could not write CSV data - {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Successfully wrote 30 rows to output.csv


## Working with JSON

[JSON](http://json.org/) is a lightweight format which is becoming increasingly popular for online data exchange. It was originally based on the JavaScript language and is (relatively) easy for humans to read and write.

The built-in module *json* provides an easy way to encode and decode data in JSON in Python.

In [11]:
import json

Let's try downloading and parsing a simple JSON file which contains information about a number of books, originally from librarything.com:

In [12]:
url = "http://mlg.ucd.ie/modules/python/books.json"

try:
    response = urllib.request.urlopen(url)
    raw_json = response.read().decode("utf-8")
    print("Successfully downloaded JSON data")
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download JSON file")
    raw_json = None
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
    raw_json = None
except Exception as e:
    print(f"Unexpected error: {e}")
    raw_json = None

Successfully downloaded JSON data


In [14]:
print(raw_json)

[{
	"book_id": "13585350",
	"title": "The World Treasury of Science Fiction",
	"ISBN": "0316349410",
	"year": 1989,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "124205572",
	"title": "The War of the Worlds",
	"ISBN": "1936594056",
	"year": 2013,
	"rating": 4,
	"language": "eng"
}, {
	"book_id": "127360065",
	"title": "Under the Dome: A Novel",
	"ISBN": "1439149038",
	"year": 2013,
	"rating": 2,
	"language": "eng"
}, {
	"book_id": "13908800",
	"title": "The Ultimate Hitchhiker's Guide to the Galaxy",
	"ISBN": "0345453743",
	"year": 2002,
	"rating": 5,
	"language": "eng"
}, {
	"book_id": "123734934",
	"title": "The Time Traveler's Wife",
	"ISBN": "1476764832",
	"year": 2014,
	"rating": 5,
	"language": "eng"
}, {
	"book_id": "13603020",
	"title": "Salem's Lot",
	"ISBN": "0451098277",
	"year": 1976,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "124173974",
	"title": "Republic",
	"ISBN": "039395501X",
	"year": 1985,
	"rating": 3,
	"language": "eng"
}, {
	"book_id": "123102859",
	

We can now parse the JSON, converting it from a string into a useful Python data structure, by using the `json.loads()` function.

In [15]:
try:
    data = json.loads(raw_json)
    print("Successfully parsed JSON data")
    for book in data:
        print(book)
except json.JSONDecodeError as e:
    print(f"JSON Error: Invalid JSON format - {e}")
except Exception as e:
    print(f"Unexpected error parsing JSON: {e}")

Successfully parsed JSON data
{'book_id': '13585350', 'title': 'The World Treasury of Science Fiction', 'ISBN': '0316349410', 'year': 1989, 'rating': 3, 'language': 'eng'}
{'book_id': '124205572', 'title': 'The War of the Worlds', 'ISBN': '1936594056', 'year': 2013, 'rating': 4, 'language': 'eng'}
{'book_id': '127360065', 'title': 'Under the Dome: A Novel', 'ISBN': '1439149038', 'year': 2013, 'rating': 2, 'language': 'eng'}
{'book_id': '13908800', 'title': "The Ultimate Hitchhiker's Guide to the Galaxy", 'ISBN': '0345453743', 'year': 2002, 'rating': 5, 'language': 'eng'}
{'book_id': '123734934', 'title': "The Time Traveler's Wife", 'ISBN': '1476764832', 'year': 2014, 'rating': 5, 'language': 'eng'}
{'book_id': '13603020', 'title': "Salem's Lot", 'ISBN': '0451098277', 'year': 1976, 'rating': 3, 'language': 'eng'}
{'book_id': '124173974', 'title': 'Republic', 'ISBN': '039395501X', 'year': 1985, 'rating': 3, 'language': 'eng'}
{'book_id': '123102859', 'title': 'The Road', 'ISBN': '0307387

We can now iterate through the books in the list and extract the relevant information that we require.

In [16]:
for book in data:
    print(f"{book['title']} = {book['year']}")

The World Treasury of Science Fiction = 1989
The War of the Worlds = 2013
Under the Dome: A Novel = 2013
The Ultimate Hitchhiker's Guide to the Galaxy = 2002
The Time Traveler's Wife = 2014
Salem's Lot = 1976
Republic = 1985
The Road = 2006
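
The examples above only decode JSON. Going in the other direction, `json.dumps()` encodes a Python data structure as a JSON string. The sketch below encodes one book record copied from the data above and then decodes it again to check that nothing is lost; the use of `indent=2` is just a formatting choice.

```python
import json

# one record copied from the books data above
book = {
    "book_id": "13585350",
    "title": "The World Treasury of Science Fiction",
    "ISBN": "0316349410",
    "year": 1989,
    "rating": 3,
    "language": "eng",
}

# encode to a JSON string; indent=2 pretty-prints the output
encoded = json.dumps(book, indent=2)
print(encoded)

# decoding the string gives back an equal dictionary
decoded = json.loads(encoded)
print(decoded == book)
```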


## Working with XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. XML is a widely-adopted format. Python includes several built-in modules for parsing XML data.

The `xml.etree.ElementTree` module can be used to extract data from a simple XML file based on its tree structure. 

In [6]:
# Download XML content with proper error handling
url = "http://mlg.ucd.ie/modules/python/books.xml"

try:
    response = urllib.request.urlopen(url)
    raw_xml = response.read().decode("utf-8")
    print("Successfully downloaded XML data:")
    print(raw_xml)
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download XML file")
    raw_xml = None
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
    raw_xml = None
except Exception as e:
    print(f"Unexpected error: {e}")
    raw_xml = None

Successfully downloaded XML data:
<?xml version="1.0" encoding="UTF-8"?>
<booklist>
   <book id="13585350">
      <title>The World Treasury of Science Fiction</title>
      <ISBN>0316349410</ISBN>
      <year>1989</year>
      <rating>3</rating>
      <language>eng</language>
   </book>
   <book id="124205572">
      <title>The War of the Worlds</title>
      <ISBN>1936594056</ISBN>
      <year>2013</year>
      <rating>4</rating>
      <language>eng</language>
   </book>
   <book id="127360065">
      <title>Under the Dome: A Novel</title>
      <ISBN>1439149038</ISBN>
      <year>2013</year>
      <rating>2</rating>
      <language>eng</language>
   </book>
   <book id="13908800">
      <title>The Ultimate Hitchhiker's Guide to the Galaxy</title>
      <ISBN>0345453743</ISBN>
      <year>2002</year>
      <rating>5</rating>
      <language>eng</language>
   </book>
   <book id="123734934">
      <title>The Time Traveler's Wife</title>
      <ISBN>1476764832</ISBN>
      <year>2014</y

We can use the `xml.etree.ElementTree.fromstring()` function to parse content from a string containing XML data.

In [7]:
import xml.etree.ElementTree

try:
    tree = xml.etree.ElementTree.fromstring(raw_xml)
    print("Successfully parsed XML data")
except xml.etree.ElementTree.ParseError as e:
    print(f"XML Parse Error: Invalid XML format - {e}")
    tree = None
except Exception as e:
    print(f"Unexpected error parsing XML: {e}")
    tree = None

Successfully parsed XML data


An XML tree has a root node (i.e. the top level of the document), with child nodes at lower levels. We can iterate over these:

In [8]:
# loop through the immediate children of the root element
for child in tree:
    # Get the name of the tag, along with any XML attributes which the tag has
    print(child.tag, child.attrib)

book {'id': '13585350'}
book {'id': '124205572'}
book {'id': '127360065'}
book {'id': '13908800'}
book {'id': '123734934'}
book {'id': '13603020'}
book {'id': '124173974'}
book {'id': '123102859'}
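
Looping over an element only visits its immediate children. To walk through every level of the tree, we can use the element's `iter()` method instead. The sketch below applies it to a cut-down, in-memory copy of the book list, so the names `sample_xml` and `root` are our own and the XML is shortened for illustration.

```python
import xml.etree.ElementTree as ET

# a cut-down copy of the book list, with a single <book> element
sample_xml = """<booklist>
   <book id="13585350">
      <title>The World Treasury of Science Fiction</title>
      <year>1989</year>
   </book>
</booklist>"""

root = ET.fromstring(sample_xml)
# iter() visits the element itself first, then every descendant in document order
tags = [element.tag for element in root.iter()]
print(tags)
```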


We can also query the tree to find tags with a specific name, such as 'book', and then in turn find child nodes of each of those tags by name.

In [9]:
titles = []
# loop through all <book> elements in the XML tree
for book in tree.findall("book"):
    # get the text inside a <title> tag, contained within a <book> tag
    title_element = book.find("title")
    # check that the <title> tag exists and contains text
    if title_element is not None and title_element.text:
        title = title_element.text.strip()
        titles.append(title)
    else:
        print("Warning: Book found without a title")

# sort and print the collected titles
for title in sorted(titles):
    print(f"- {title}")

- Republic
- Salem's Lot
- The Road
- The Time Traveler's Wife
- The Ultimate Hitchhiker's Guide to the Galaxy
- The War of the Worlds
- The World Treasury of Science Fiction
- Under the Dome: A Novel
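
The same approach can be extended to collect several fields per book, giving a list of dictionaries similar to the one we obtained from the JSON file. The sketch below is one possible way to do this on a shortened, in-memory copy of the XML; it assumes every book has `<year>` and `<rating>` tags containing whole numbers, and would raise an error otherwise.

```python
import xml.etree.ElementTree as ET

# a shortened copy of the book list with two <book> elements
sample_xml = """<booklist>
   <book id="13585350">
      <title>The World Treasury of Science Fiction</title>
      <year>1989</year>
      <rating>3</rating>
   </book>
   <book id="124205572">
      <title>The War of the Worlds</title>
      <year>2013</year>
      <rating>4</rating>
   </book>
</booklist>"""

records = []
for book in ET.fromstring(sample_xml).findall("book"):
    records.append({
        # get() reads an attribute and returns None if it is absent
        "book_id": book.get("id"),
        # findtext() returns the text of the first matching child element
        "title": book.findtext("title", default="").strip(),
        # element text is always a string, so convert numbers explicitly
        "year": int(book.findtext("year")),
        "rating": int(book.findtext("rating")),
    })

for record in records:
    print(record)
```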
