# Data Formats

## Downloading Data

The built-in Python *urllib.request* module has functions which help in downloading content from HTTP URLs using minimal code.

In [None]:
import urllib.request
import urllib.error

url = "http://mlg.ucd.ie/modules/python/ucd.txt"
# open the url and read the response
response = urllib.request.urlopen(url)
# decode the response content from bytes to a utf-8 string
text = response.read().decode("utf-8")
print(text)

In practice, we should always wrap code to fetch URLs in proper error handling blocks to handle the various cases where we cannot access the URL. This includes network issues, server errors, and malformed responses.

In [None]:
# specify a URL that does not exist
url = "http://somemissinglink.ucd.ie/ucd.txt"

try:
    # attempt to retrieve the URL
    response = urllib.request.urlopen(url)
    text = response.read().decode("utf-8")
    print(text)
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Failed to retrieve {url}")
except urllib.error.URLError as e:
    print(f"Network Error: Failed to retrieve {url} - {e.reason}")
except Exception as e:
    print(f"Unexpected error retrieving {url}: {e}")
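
Because the same try/except pattern recurs for every download, it can be convenient to wrap it in a small helper. The following is a minimal sketch of one way to do this. The function name `fetch_text` and its `timeout` and `encoding` parameters are illustrative choices, not part of `urllib`. The sketch passes a timeout to `urlopen()` so that a stalled connection does not hang indefinitely, and uses a `with` block so that the response is closed after reading.

```python
import urllib.request
import urllib.error


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, or None if it cannot be retrieved.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    try:
        # the with block closes the response once we have read it
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as e:
        # the server responded, but with an error status code
        print(f"HTTP Error {e.code}: Failed to retrieve {url}")
    except urllib.error.URLError as e:
        # the request never completed (e.g. unknown host, refused connection)
        print(f"Network Error: Failed to retrieve {url} - {e.reason}")
    except OSError as e:
        # other low-level failures, e.g. a timeout while reading the body
        print(f"I/O Error: Failed to retrieve {url} - {e}")
    except UnicodeDecodeError as e:
        # the body was downloaded but is not valid text in the given encoding
        print(f"Decoding Error: {url} is not valid {encoding} - {e}")
    return None
```

For example, `text = fetch_text("http://mlg.ucd.ie/modules/python/ucd.txt")` would return the page text, or print a message and return `None` if the request fails. Returning `None` is one possible design; re-raising the exception may be preferable when the caller needs to react to specific failures.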

## Working with CSV Data

The CSV ("Comma Separated Values") file format is often used to exchange tabular data between different applications, such as Excel. Essentially, a CSV file is a plain text file in which the values on each line are separated by commas. Other delimiters, such as tabs or spaces, are sometimes used instead.

As an example, we will look at a CSV file containing details of Premier League goal scorers. We could download this CSV file using the `urllib.request.urlopen()` function and manually parse it...

In [None]:
# download the CSV with proper error handling
url = "http://mlg.ucd.ie/modules/python/goal_scorers.csv"

try:
    response = urllib.request.urlopen(url)
    raw_csv = response.read().decode("utf-8")
    
    # parse each line
    lines = raw_csv.split("\n")
    for line in lines:
        line = line.strip()
        if len(line) > 0:
            # split based on a comma separator
            parts = line.split(",")
            print(parts)
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download CSV file")
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
except Exception as e:
    print(f"Unexpected error: {e}")

Python also includes a built-in module called `csv` which simplifies the process of reading and writing CSV data.

See https://docs.python.org/3/library/csv.html
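
One reason to prefer the `csv` module over splitting lines manually is that CSV values may themselves contain commas or quotation marks, in which case they are wrapped in quotes. The short sketch below compares both approaches on a made-up string (the sample data is hypothetical and not taken from the goal scorers file). It uses `io.StringIO` so that `csv.reader` can read from a string as if it were a file.

```python
import csv
import io

# Hypothetical sample: the first value contains a comma, the last contains quotes
sample = 'Player,Club,Note\n"Smith, John",Example FC,"said ""hello"""\n'

# Naive approach: splitting on commas breaks the quoted value apart
naive = sample.splitlines()[1].split(",")
print(len(naive), naive)

# csv.reader understands quoting, so each row has the expected three values
parsed = list(csv.reader(io.StringIO(sample)))
print(len(parsed[1]), parsed[1])
```

Here the naive split produces four pieces for a row that only has three columns, while `csv.reader` recovers the three intended values.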

In [None]:
# Download the file and save it as a CSV file using modern file handling
url = "http://mlg.ucd.ie/modules/python/goal_scorers.csv"

try:
    response = urllib.request.urlopen(url)
    data = response.read().decode("utf-8")
    
    with open("goal_data.csv", "w", encoding="utf-8", newline="") as fout:
        fout.write(data)
    print("Successfully downloaded and saved goal_data.csv")
    
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download file")
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
except IOError as e:
    print(f"File Error: Could not write to goal_data.csv - {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Once we've saved the data, we can use the `csv` module to read each line (row) into a dictionary with proper error handling:

In [None]:
# Import the module
import csv

try:
    with open("goal_data.csv", "r", encoding="utf-8", newline="") as fin:
        reader = csv.DictReader(fin)
        rows = []
        for row_num, row in enumerate(reader, 1):
            try:
                print(row)
                rows.append(row)
            except Exception as e:
                print(f"Error processing row {row_num}: {e}")
    print(f"Successfully read {len(rows)} rows of data")
    
except IOError as e:
    print(f"File Error: Could not read goal_data.csv - {e}")
except csv.Error as e:
    print(f"CSV Error: Invalid CSV format - {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
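
Note that `csv.DictReader` returns every value as a string, so numeric columns need to be converted before we can do arithmetic or comparisons with them. The sketch below illustrates this on a small in-memory sample. The rows are invented for illustration, and the column names are assumed to match the field names used for this dataset elsewhere in this notebook.

```python
import csv
import io

# Hypothetical sample data, made up for illustration only
sample_csv = (
    "Player,Club,Total Goals,Home Goals,Away Goals\n"
    "Player A,Club X,10,6,4\n"
    "Player B,Club Y,15,9,6\n"
    "Player C,Club X,7,n/a,3\n"
)

numeric_fields = ["Total Goals", "Home Goals", "Away Goals"]
converted = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    try:
        # replace each numeric string with an integer
        for field in numeric_fields:
            row[field] = int(row[field])
    except (ValueError, TypeError) as e:
        # skip rows where a value is missing or is not a valid integer
        print(f"Skipping row for {row['Player']}: {e}")
        continue
    converted.append(row)

# with integers in place, numeric comparisons behave as expected
top = max(converted, key=lambda r: r["Total Goals"])
print(f"{top['Player']} scored {top['Total Goals']} goals")
```

Skipping invalid rows is only one option; depending on the application, it may be more appropriate to substitute a default value or to stop with an error.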

We can also use the `csv` module to write data out to a CSV file. In the example below, we will write out the data which we collected above, row by row:

In [None]:
# Write data to CSV file using modern file handling
try:
    with open("output.csv", "w", encoding='utf-8', newline='') as fout:
        # Specify the ordered list of fields in our file
        fields = ["Player", "Club", "Total Goals", "Home Goals", "Away Goals"]
        writer = csv.DictWriter(fout, fieldnames=fields)
        # Write the header row
        writer.writeheader()
        # Write each row of data
        for row in rows:
            writer.writerow(row)
            
    print(f"Successfully wrote {len(rows)} rows to output.csv")    
except IOError as e:
    print(f"File Error: Could not write to output.csv - {e}")
except csv.Error as e:
    print(f"CSV Error: Could not write CSV data - {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
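
By default, `csv.DictWriter` raises a `ValueError` if a row contains keys that are not listed in `fieldnames`. When we only want to write a subset of the columns, we can pass `extrasaction="ignore"` to drop the extra keys silently. The sketch below shows both behaviours using made-up rows, writing to an in-memory `io.StringIO` buffer rather than a file.

```python
import csv
import io

# Hypothetical rows; the second one carries an extra "Notes" key
sample_rows = [
    {"Player": "Player A", "Total Goals": 10},
    {"Player": "Player B", "Total Goals": 15, "Notes": "made-up value"},
]
fields = ["Player", "Total Goals"]

# Default behaviour: unexpected keys cause a ValueError
strict_writer = csv.DictWriter(io.StringIO(), fieldnames=fields)
try:
    strict_writer.writerow(sample_rows[1])
    strict_failed = False
except ValueError as e:
    strict_failed = True
    print(f"Strict writer rejected the row: {e}")

# With extrasaction="ignore", keys outside fieldnames are simply dropped
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(sample_rows)
csv_text = buffer.getvalue()
print(csv_text)
```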

## Working with JSON

[JSON](http://json.org/) is a lightweight format which is becoming increasingly popular for online data exchange. It was originally based on the JavaScript language, and is (relatively) easy for humans to read and write.

The built-in module *json* provides an easy way to encode and decode data in JSON in Python.

In [None]:
import json

Let's try downloading and parsing a simple JSON file which contains information about a number of books, originally from librarything.com:

In [None]:
url = "http://mlg.ucd.ie/modules/python/books.json"

# default to None so that later cells can tell if the download failed
raw_json = None
try:
    response = urllib.request.urlopen(url)
    raw_json = response.read().decode("utf-8")
    print("Successfully downloaded JSON data")
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download JSON file")
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
except Exception as e:
    print(f"Unexpected error: {e}")

In [None]:
print(raw_json)

We can now parse the JSON, converting it from a string into a useful Python data structure, by using the `json.loads()` function.

In [None]:
try:
    data = json.loads(raw_json)
    print("Successfully parsed JSON data")
    for book in data:
        print(book)
except json.JSONDecodeError as e:
    print(f"JSON Error: Invalid JSON format - {e}")
except Exception as e:
    print(f"Unexpected error parsing JSON: {e}")

We can now iterate through the books in the list and extract the relevant information that we require.

In [None]:
for book in data:
    print(f"{book['title']} = {book['year']}")
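
The `json` module also works in the other direction, encoding Python data structures as JSON text. The sketch below uses `json.dumps()` to convert a list of dictionaries into a formatted string, and then checks that decoding the string gives back the original data. The records are invented for illustration and are not taken from the books file.

```python
import json

# Hypothetical records, invented for illustration
sample_books = [
    {"title": "Example Title One", "year": 2001},
    {"title": "Example Title Two", "year": 1999},
]

# encode the Python list as a JSON string, indented for readability
encoded = json.dumps(sample_books, indent=2)
print(encoded)

# decoding the string again recovers an equal Python data structure
decoded = json.loads(encoded)
print(decoded == sample_books)
```

To write JSON directly to an open file rather than to a string, the related function `json.dump()` takes the data and a file object as its arguments.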

## Working with XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable. XML is a widely adopted format, and Python includes several built-in modules for parsing XML data.

The `xml.etree.ElementTree` module can be used to extract data from a simple XML file based on its tree structure. 

In [None]:
# Download XML content with proper error handling
url = "http://mlg.ucd.ie/modules/python/books.xml"

try:
    response = urllib.request.urlopen(url)
    raw_xml = response.read().decode("utf-8")
    print("Successfully downloaded XML data:")
    print(raw_xml)
except urllib.error.HTTPError as e:
    print(f"HTTP Error {e.code}: Could not download XML file")
    raw_xml = None
except urllib.error.URLError as e:
    print(f"Network Error: {e.reason}")
    raw_xml = None
except Exception as e:
    print(f"Unexpected error: {e}")
    raw_xml = None

We can use the `xml.etree.ElementTree.fromstring()` function to parse content from a string containing XML data.

In [None]:
import xml.etree.ElementTree

try:
    tree = xml.etree.ElementTree.fromstring(raw_xml)
    print("Successfully parsed XML data")
except xml.etree.ElementTree.ParseError as e:
    print(f"XML Parse Error: Invalid XML format - {e}")
    tree = None
except Exception as e:
    print(f"Unexpected error parsing XML: {e}")
    tree = None

An XML tree has a root node (i.e. the top level of the document), with child nodes at lower levels. We can iterate over these:

In [None]:
# loop through the immediate children of the root element
for child in tree:
    # Get the name of the tag, along with any XML attributes which the tag has
    print(child.tag, child.attrib)

We can also search for tags with a specific name, such as 'book', and then in turn find child nodes of each matching tag by name.

In [None]:
titles = []
# loop through all <book> elements in the XML tree
for book in tree.findall("book"):
    # get the text inside a <title> tag, contained within a <book> tag
    title_element = book.find("title")
    # check that the <title> tag exists and contains text
    if title_element is not None and title_element.text:
        title = title_element.text.strip()
        titles.append(title)
    else:
        print("Warning: Book found without a title")

# sort and print the collected titles
for title in sorted(titles):
    print(f"- {title}")
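
The same kind of extraction can be written more compactly using a few other `ElementTree` methods: `get()` reads an XML attribute, `findtext()` returns the text of a child tag (or a default value if the tag is missing), and `iter()` visits matching tags at any depth. The sketch below applies these to a small made-up document. The tag and attribute names are assumptions for illustration and may not match the structure of books.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the second book has no <title> tag
sample_xml = """<catalog>
  <book id="b1"><title>Example Title One</title><year>2001</year></book>
  <book id="b2"><year>1999</year></book>
</catalog>"""

root = ET.fromstring(sample_xml)

records = []
for book in root.findall("book"):
    records.append({
        # get() returns the attribute value, or None if it is absent
        "id": book.get("id"),
        # findtext() falls back to the default when the tag is missing
        "title": book.findtext("title", default="(untitled)"),
        "year": book.findtext("year"),
    })

for record in records:
    print(record)

# iter() walks the whole tree, not just the immediate children
all_years = [element.text for element in root.iter("year")]
print(all_years)
```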