## COMP20008 2021 Semester 1 Workshop 3
### Why XML and when do we see it?
- Extensible Markup Language (XML) is a widely used markup language that defines rules for encoding documents and data structures (closer to HTML than to Python).
- It is commonly used for documents, but also for SOAP requests (SOAP is an XML-based messaging protocol) when working with some web APIs, so yes, you will eventually come across it in industry.
- Note that SOAP has largely been superseded by REST APIs (Application Programming Interfaces), but SOAP services are still widespread.

### XML and Python
- To parse XML data structures in Python, we will use the `lxml` package (one of the fastest XML libraries available for Python).
- Combining `lxml` with `requests` (a library for sending HTTP requests) is a powerful way of working with online APIs.
- The main module we use from `lxml` is `etree` (ElementTree), which parses XML data into a tree-like structure.
- Documentation: https://lxml.de/api/index.html

In [1]:
from lxml import etree

For this section we will work with the royal.xml file, which contains the names of some members of the British royal family. The code below simply displays the contents of that file; you can also open the file in a web browser or text editor. Look through the file and make sure you understand its contents.

In [None]:
with open("royal.xml", "r") as f:  # the file is closed automatically
    text = f.read()
print(text)

In order to load an XML file and to represent it as a tree in computer memory, you need to parse the XML file. The etree.parse() function parses the XML file that is passed in as a parameter.  

In [None]:
xmltree = etree.parse("royal.xml")

The *parse()* function returns an XML *ElementTree* object, which represents the whole XML tree. Each node in the tree is translated into an *Element* object.

Use the *getroot()* method of an *ElementTree* object to get the root element of the XML tree. You can print out the XML tag of an element using its *tag* property.

In [None]:
root = xmltree.getroot()
print (root.tag)
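
If you do not have *royal.xml* at hand, the same steps can be tried on an in-memory string. The snippet below is only a sketch: the XML content is made up for illustration (it is not the actual content of *royal.xml*), and it uses *etree.fromstring()* instead of *etree.parse()* because the input is a string rather than a file. The fallback import is an assumption for convenience and only matters if `lxml` is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical sample data, NOT the real royal.xml
sample_xml = (
    '<queen title="Queen Example">'
    '<prince title="Prince A"><prince title="Prince A1"/></prince>'
    '<princess title="Princess B"/>'
    '</queen>'
)

sample_root = etree.fromstring(sample_xml)  # returns the root Element directly
print(sample_root.tag)   # tag of the root element
print(len(sample_root))  # number of direct children of the root
```

Unlike *parse()*, *fromstring()* hands back the root *Element*, so no *getroot()* call is needed.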

### Traversing the XML Tree

The following sections describe various methods for traversing the XML tree.

To visit all of the children of an element, you can iterate over the XML *Element* itself:

In [None]:
for e in root:
   print (e.tag)

You can use indexing to access the children of an element:


In [None]:
oldest_prince = root[0]
#print(type(oldest_prince))
print (oldest_prince.get("title"))

The *find()* method returns only the first matching child.



In [None]:
the_first_child_with_prince_tag = root.find("prince")
print (the_first_child_with_prince_tag.get('title'))

The *iterchildren()* method allows you to iterate over only the children with a particular tag:



In [None]:
for child in root.iterchildren(tag="prince"):
    print(child.get('title'))

There is also an *iterdescendants()* method, which iterates over all descendants of a particular node (children, grandchildren and so on).
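
To see how these traversal methods differ, the sketch below runs them side by side on a small, made-up family tree (hypothetical data, not *royal.xml*). It uses *findall()* and *iter()*, which are not covered above but exist in both `lxml` and the standard library. Note that *iter()* also visits the starting element itself if its tag matches, whereas *iterdescendants()* does not.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical family tree, made up for illustration
family = etree.fromstring(
    '<queen title="Q">'
    '<prince title="P1"><prince title="P1a"/><prince title="P1b"/></prince>'
    '<princess title="S1"/>'
    '<prince title="P2"/>'
    '</queen>'
)

# Direct children only, in document order
child_titles = [child.get("title") for child in family]

# find() returns the first matching child, findall() every matching child
first_prince = family.find("prince").get("title")
prince_children = [e.get("title") for e in family.findall("prince")]

# iter() walks the whole subtree, so nested princes are included
all_princes = [e.get("title") for e in family.iter("prince")]

print(child_titles)     # ['P1', 'S1', 'P2']
print(first_prince)     # P1
print(prince_children)  # ['P1', 'P2']
print(all_princes)      # ['P1', 'P1a', 'P1b', 'P2']
```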

### Exercise 1

Using the *royal.xml*:

i) Write Python code to get the title attribute of each of the queen's grandsons.

ii) Write Python code to get the full title of the only princess in the family tree.

In [None]:
#insert answer to 1 here



### Accessing XML attributes


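You can access the XML attributes of an element using the *get()* method
or the *attrib* property of the element. *get()* returns the value of a single attribute, while *attrib* behaves like a dictionary of all attributes.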



In [None]:
print (root.attrib)
print (root.get("title"))
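
The sketch below shows what happens when an attribute is missing, which is a common source of confusion. The element and its attribute names are hypothetical; the point is that *get()* returns `None` (or a default you supply) rather than raising an error, and that *set()* changes an attribute.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical element, made up for illustration
element = etree.fromstring('<prince title="Prince Example" house="Example House"/>')

print(element.get("title"))            # value of an existing attribute
print(element.get("nickname"))         # None: no error for a missing attribute
print(element.get("nickname", "n/a"))  # optional default value

attrs = dict(element.attrib)           # plain-dict copy of all attributes
print(attrs)

element.set("title", "King Example")   # add or overwrite an attribute
print(element.get("title"))
```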


### Accessing XML text


The *book.xml* file used below differs from the royal family XML above in that it has
text content within each element. To access the text content of an
element (the text between its start and end tags), use the *text* property of that
element.

In [None]:
from lxml import etree
xmltree = etree.parse('book.xml')
root = xmltree.getroot()
for child in root:
    print (child.tag + ": " + child.text)
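
One caveat: *text* is `None` for an element without text content, so `child.tag + ": " + child.text` would raise a `TypeError` for such an element. The sketch below guards against that on a made-up document (not the actual *book.xml*) and also uses *findtext()*, which is not covered above but exists in both `lxml` and the standard library.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical document, made up for illustration
book = etree.fromstring(
    "<book><title>Example Title</title><price>23.95</price><note/></book>"
)

for child in book:
    text = child.text or ""  # .text is None for an empty element such as <note/>
    print(child.tag + ": " + text)

# Text is always a string, so convert explicitly before doing arithmetic
price = float(book.findtext("price"))

# findtext() can return a default when the child element does not exist
missing = book.findtext("publisher", default="Unknown")
print(price, missing)
```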

### Building XML data



Let's go back to the *book.xml* example above. As usual, use the *lxml* library to parse the XML and get the root of the tree:



In [None]:
from lxml import etree
xmltree = etree.parse('book.xml')
root = xmltree.getroot()

To create a new XML element, use the *etree.Element()* function:



In [None]:
new_element = etree.Element('genre')
new_element.text = 'Novel'
root.append(new_element)
print(etree.tostring(root[-1],pretty_print=True,encoding='unicode'))   # the last element, the newly appended element


Tip: You can create a completely new XML tree by constructing the root element:

In [None]:
root = etree.Element('book')

You can also create a new element using the *SubElement()* function, which creates the element and appends it to the given parent in one step:


In [None]:
new_element = etree.SubElement(root, "price")
new_element.text = '23.95'
for e in root: # check whether the new element is added
    print(e.tag)

Use *insert()* to insert a new element at a specific location:

In [None]:
root.insert(1,etree.Element("country"))
root[1].text = "United States"
print(etree.tostring(root[1],pretty_print=True,encoding='unicode'))
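
Putting *Element()*, *SubElement()* and *insert()* together, a small tree can be built from scratch. The following is a sketch with made-up tag names and values (they are assumptions for illustration, not the content of *book.xml*):

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical data, made up for illustration
new_book = etree.Element("book", id="book999")  # root element with one attribute

title = etree.SubElement(new_book, "title")     # created and appended in one step
title.text = "Example Title"
price = etree.SubElement(new_book, "price")
price.text = "10.00"

author = etree.Element("author")                # created detached from any tree
author.text = "Example Author"
new_book.insert(1, author)                      # placed between title and price

tags = [e.tag for e in new_book]
print(tags)  # ['title', 'author', 'price']
print(etree.tostring(new_book, encoding="unicode"))
```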

### Serialising XML data (printing as web content or writing into a file)


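You can get the whole XML document as a string by calling *etree.tostring()* with the root of the tree as the first parameter (with *encoding="UTF-8"* the result is a bytes object, which can be written directly to a file opened in binary mode):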



In [None]:
output = etree.tostring(root, pretty_print=True, encoding="UTF-8")
print(output.decode("UTF-8"))  # tostring() returned bytes, so decode before printing

In [None]:
with open('output.xml', 'wb') as f:  # binary mode, because output is a bytes object
    f.write(output)
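
The difference between the bytes and string forms of *tostring()*, and a full write-then-read round trip, can be sketched as follows. The document content and the file name `roundtrip.xml` are assumptions for illustration, and the file is written to a temporary directory so nothing is left behind. *ElementTree.write()* is an alternative to opening the file yourself.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical document, made up for illustration
doc_root = etree.Element("book")
etree.SubElement(doc_root, "title").text = "Example Title"

as_bytes = etree.tostring(doc_root, encoding="UTF-8")   # bytes, for files opened with "wb"
as_text = etree.tostring(doc_root, encoding="unicode")  # str, for print()
print(type(as_bytes).__name__, type(as_text).__name__)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.xml")
    # Write the tree, including an XML declaration, then parse it back
    etree.ElementTree(doc_root).write(path, encoding="UTF-8", xml_declaration=True)
    reloaded = etree.parse(path).getroot()

print(reloaded.find("title").text)
```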

### Exercise 2

Write Python code to load in the file "book.xml", change the ISBN to "Unknown" and then write out the file to "book-new.xml"

In [None]:
#insert answer to 2 here


## JSON

Python has a built-in json module that allows you to process JSON files. You can find out more about it by reading [its page at python.org](https://docs.python.org/3/library/json.html). W3Schools also provides a good [introductory tutorial](https://www.w3schools.com/python/python_json.asp), while Real Python has a [more comprehensive one](https://realpython.com/python-json/).

Below you can see a sample JSON document, stored in a Python string, consisting of some information about a book.

In [None]:
str_json = '''
{
"id": "book001",
"author": "Salinger, J. D.",
"title": "The Catcher in the Rye",
"price": "44.95",
"language": "English",
"publish_date": "1951-07-16",
"publisher": "Little, Brown and Company",
"isbn": "0-316-76953-3",
"description": "A story about a few important days in the life of Holden Caulfield"
}
'''

Using the *json* library we are able to manipulate the JSON data as follows.

In [None]:
import json
Data = json.loads(str_json)
print(type(Data))
print(Data["price"])

# modify any property
Data["isbn"] = "Unknown"

# save to a JSON file
with open('book_test.json', 'w') as f:
    json.dump(Data, f, indent=2)

# load Json file
with open('book_test.json') as f:
    Data = json.load(f)
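
It helps to know how JSON types map to Python types after loading. The sketch below uses a made-up record (hypothetical keys and values) to show the mapping, and to illustrate that a number stored as a JSON string, like the price above, stays a string until you convert it.

```python
import json

# Hypothetical record, made up for illustration
raw = '{"title": "Example", "price": "44.95", "in_stock": true, "tags": ["a", "b"], "sequel": null}'
record = json.loads(raw)

print(type(record).__name__)          # dict: a JSON object becomes a Python dict
print(type(record["tags"]).__name__)  # list: a JSON array becomes a Python list
print(record["in_stock"], record["sequel"])  # True None

# The price is a JSON string, so convert it before doing arithmetic
price = float(record["price"])

# dumps() is the string counterpart of dump(); loading the result gives back equal data
text = json.dumps(record, indent=2, sort_keys=True)
round_trip = json.loads(text)
print(round_trip == record)
```

*loads()* and *dumps()* work on strings, while *load()* and *dump()* work on open file objects.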


### Exercise 3
Add Spanish and German to the JSON file above as two extra languages represented as an array. Save this file as book2.json. Validate it on JSONLint.

In [None]:
#insert answer to 3 here and save as book2.json



In [None]:
# load and check the answer

with open('book2.json') as f:
    Data = json.load(f)
    
Data

### Exercise 4 (If you have time)
Now modify the publish_date property. Make it an array of two objects, each with
the properties edition (first, second) and date (1951-07-16, 1979-01-01) respectively. Save
this file as book3.json.

In [None]:
#insert answer to 4 here and save as book3.json



### Additional Task: Git Resources 

Please go through the git PDF manual uploaded on Canvas. The manual will help you familiarise yourself with the commands used when working with a git repository.
You can also access a git tutorial video using this link: https://canvas.lms.unimelb.edu.au/courses/107611/files/6845808?module_item_id=2714691 which explains the general environment of git and some of 
