# XML
XML (Extensible Markup Language) is a markup language that looks a lot like HTML. Data stored in XML can look messy and deeply nested, so I will use BeautifulSoup's parsing tools to make extracting information from it less painful.

An XML element follows the same basic structure as an HTML element: a tag name, optional attributes, and inner text.

```xml
<tagname attribute="value">text</tagname>
```
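
To make those three parts concrete, here is a minimal sketch that parses the one-line element above and reads each part back. It assumes `bs4` is installed and uses Python's built-in `html.parser` backend only to keep the example free of extra dependencies; the variable names are illustrative.

```python
import bs4

# A one-element document matching the pattern above.
snippet = '<tagname attribute="value">text</tagname>'

soup_example = bs4.BeautifulSoup(snippet, 'html.parser')
element = soup_example.find('tagname')

print(element.name)              # the tag name
print(element.get('attribute'))  # the attribute's value
print(element.text)              # the inner text
```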

### Imports
- `bs4` to parse the XML
- `requests` to fetch the data from the URL
- `pandas` to store the extracted data in a DataFrame

In [4]:
import bs4
import pandas
import requests

### Set the url
You can also parse XML stored on your file system: open the file and pass the file object (or its contents) to BeautifulSoup instead of downloading from a URL.
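
A minimal sketch of the file-based route, under a few assumptions: the sample XML, the temporary file, and the names `xml_path` and `file_soup` are made up for illustration, and `html.parser` is used only so the example needs nothing beyond `bs4`.

```python
import os
import tempfile

import bs4

# Write a tiny stand-in document to disk so the example is self-contained.
sample_xml = '<records><record id="1"><pubdate>1921</pubdate></record></records>'
with tempfile.NamedTemporaryFile(
    'w', suffix='.xml', delete=False, encoding='utf-8'
) as tmp:
    tmp.write(sample_xml)
    xml_path = tmp.name

# Open the file and hand the file object to BeautifulSoup.
with open(xml_path, encoding='utf-8') as xml_file:
    file_soup = bs4.BeautifulSoup(xml_file, 'html.parser')

os.remove(xml_path)  # clean up the temporary file
print(len(file_soup.find_all('record')))
```

Passing the path string itself would not work, because BeautifulSoup treats a string argument as markup rather than as a file name.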

In [3]:
xml_url = 'https://data.gov.au/data/dataset/4b7b5b50-774f-4416-90ce-5b7df85ff8ce/resource/aa0499fa-e19d-417d-bb1b-9589c0a19dbf/download/immigration.xml'

### Create a request object and send a GET request to the site
The response comes back with status code 200, which means the request succeeded and it is OK to proceed.

In [5]:
r = requests.get(xml_url)

In [6]:
r.status_code

200
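
If you would rather not check the status code by eye, one option is to let `requests` raise an error for failed responses. The wrapper below is a hypothetical helper (the name `fetch_xml` and the 30-second timeout are my own choices, not part of the original notebook).

```python
import requests


def fetch_xml(url, timeout=30):
    """Download `url` and return the raw response bytes.

    Raises requests.HTTPError for 4xx/5xx responses, so an error page
    is never parsed as if it were the data.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```

With this in place, `fetch_xml(xml_url)` would return the same bytes as `r.content` whenever the request succeeds.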

### Convert the content from the request into a soup object with BeautifulSoup, specifying the parser (lxml)

In [7]:
soup = bs4.BeautifulSoup(r.content, 'lxml')

### Inspect soup object
I'm going to preview the first 1500 characters of the parsed document to get an idea of the structure; you can also do this in a browser. Right away you can see a `<records>` tag that contains multiple `<record>` tags, so we can assume each `<record>` tag holds row-level data. The `<html>` and `<body>` wrappers are not part of the source file: `'lxml'` selects an HTML parser, which adds them (passing `'xml'` instead selects lxml's XML parser).

In [19]:
print(soup.prettify()[:1500])

<?xml version="1.0"?>
<html>
 <body>
  <records>
   <record id="1479486">
    <ni_url>
     &lt;subfield_u&gt;https://stors.tas.gov.au/SWD4-1-1&lt;/subfield_u&gt;&lt;subfield_z&gt;SWD4/1/1&lt;/subfield_z&gt;
    </ni_url>
    <tasmanian>
     Published in Tasmania
    </tasmanian>
    <tasmanian>
     About Tasmania
    </tasmanian>
    <tasmanian>
     By a Tasmanian
    </tasmanian>
    <ni_name_facet>
     Cullen, Ira S.
    </ni_name_facet>
    <ni_index>
     Immigration
    </ni_index>
    <pubdate_range>
     1921
    </pubdate_range>
    <ni_name_full_display>
     Cullen, Ira S.
    </ni_name_full_display>
    <cover_image_url>
     https://stors.tas.gov.au/SWD4-1-1$givethumb
    </cover_image_url>
    <ni_doc_date>
     12 Jul 1921
    </ni_doc_date>
    <linc_tas_avail>
     Online
    </linc_tas_avail>
    <ni_name>
     Cullen, Ira S.
    </ni_name>
    <relevance_sort>
     0
    </relevance_sort>
    <pubdate>
     1921
    </pubdate>
    <format>
     VIEW
    </format>

### Storing all record tags
Calling the `find_all` method on the soup object returns a list of **3530** `<record>` tags. (`findAll` is an older spelling of the same method.)

In [10]:
records = soup.find_all('record')

In [11]:
len(records)

3530
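
The same call can be tried on a small, made-up document to see what comes back. This is only a sketch: the three sample records are invented, and `html.parser` stands in for lxml to keep the example light.

```python
import bs4

# Invented sample data with the same <records>/<record> shape.
sample_xml = """
<records>
  <record id="1"><pubdate>1921</pubdate></record>
  <record id="2"><pubdate>1920</pubdate></record>
  <record id="3"></record>
</records>
"""

sample_soup = bs4.BeautifulSoup(sample_xml, 'html.parser')
sample_records = sample_soup.find_all('record')

# The result behaves like a list: it has a length and can be iterated.
print(len(sample_records))
print([rec.get('id') for rec in sample_records])
```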

### Creating a helper function
This helper returns the inner text of a child tag. I wrap the lookup in a try/except because `find` returns `None` when the tag does not exist, and reading `.text` from `None` raises an `AttributeError`.

In [42]:
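def getTagText(tag, tag_name):
    """Return the inner text of the first `tag_name` child, or None if absent."""
    try:
        return tag.find(tag_name).text
    except AttributeError:
        # find() returned None because the tag is missing
        return None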
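
A quick way to see both outcomes is to run the helper against a tag that exists and one that does not. The sketch below repeats the helper so it runs on its own; catching only `AttributeError` is a choice made for this sketch, and the one-record sample is invented.

```python
import bs4


def getTagText(tag, tag_name):
    # A copy of the helper, repeated so this example is self-contained.
    try:
        return tag.find(tag_name).text
    except AttributeError:
        return None


sample = bs4.BeautifulSoup(
    '<record id="1"><pubdate>1921</pubdate></record>', 'html.parser'
)
record = sample.find('record')

found = getTagText(record, 'pubdate')  # tag exists: its text is returned
missing = getTagText(record, 'title')  # tag absent: None, no exception
print(found, missing)
```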

### Process the tags
- First, I create an empty `pandas.DataFrame()` to store the extracted data in.
- Next, I iterate through each record using `enumerate`.
- Then I target particular tags within the `<record>` tag I observed earlier, using the `getTagText` helper function.
- Finally, I store the results in the `pandas.DataFrame()` object using the `.loc` indexer.

Note: the `record_id` value does not come from the inner text of a tag. It is an attribute value, so we have to use the `.get()` method instead.
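
The difference between inner text and attributes can be seen on a single made-up record. This is a sketch with invented sample markup, again using `html.parser` for convenience.

```python
import bs4

sample = bs4.BeautifulSoup(
    '<record id="1479486"><pubdate>1921</pubdate></record>', 'html.parser'
)
record = sample.find('record')

record_id = record.get('id')     # attribute value, returned as a string
no_such = record.get('missing')  # absent attribute: None rather than an error
all_attrs = record.attrs         # every attribute on the tag, as a dict
print(record_id, no_such, all_attrs)
```

Because `.get()` returns `None` for a missing attribute, it needs no try/except of its own.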

In [46]:
data = pandas.DataFrame()

# Named `record` rather than `r` so the loop does not overwrite the response object
for index, record in enumerate(records):
    # The id is an attribute of <record>, not a child tag
    record_id = record.get('id')
    # The remaining fields are the inner text of child tags (None if missing)
    pubdate = getTagText(record, 'pubdate')
    ni_doc_date = getTagText(record, 'ni_doc_date')
    title = getTagText(record, 'title')
    ni_index = getTagText(record, 'ni_index')
    cover_image_url = getTagText(record, 'cover_image_url')

    # Write one row per record; .loc creates rows and columns as needed
    data.loc[index, 'record_id'] = record_id
    data.loc[index, 'pubdate'] = pubdate
    data.loc[index, 'ni_doc_date'] = ni_doc_date
    data.loc[index, 'ni_index'] = ni_index
    data.loc[index, 'title'] = title
    data.loc[index, 'cover_image_url'] = cover_image_url
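
Assigning cell by cell with `.loc` works, but it grows the DataFrame one value at a time. A common alternative, sketched here and not what the notebook above does, is to collect one dict per record and build the DataFrame once at the end. The sample XML and the `text_or_none` helper are hypothetical stand-ins.

```python
import bs4
import pandas

# Invented sample data; the second record has no <pubdate>.
sample_xml = """
<records>
  <record id="1"><pubdate>1921</pubdate><ni_index>Immigration</ni_index></record>
  <record id="2"><ni_index>Immigration</ni_index></record>
</records>
"""


def text_or_none(tag, tag_name):
    """Return the child's inner text, or None when the child is missing."""
    child = tag.find(tag_name)
    return child.text if child is not None else None


sample_soup = bs4.BeautifulSoup(sample_xml, 'html.parser')

# One dict per record; the keys become the column names.
rows = [
    {
        'record_id': record.get('id'),
        'pubdate': text_or_none(record, 'pubdate'),
        'ni_index': text_or_none(record, 'ni_index'),
    }
    for record in sample_soup.find_all('record')
]
sample_data = pandas.DataFrame(rows)
print(sample_data.shape)
```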

### Inspect final data
First I check the shape to see whether the row count matches the number of `<record>` tags found earlier. I also run `.info()` on the DataFrame to see how many values are missing in each column. Only a few are missing: one for `title` and two each for `cover_image_url`, `pubdate`, and `ni_doc_date`. Finally, I call `.head()` to preview the first 5 rows.

In [48]:
data.shape

(3530, 6)

In [49]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3530 entries, 0 to 3529
Data columns (total 6 columns):
record_id          3530 non-null object
pubdate            3528 non-null object
ni_doc_date        3528 non-null object
ni_index           3530 non-null object
title              3529 non-null object
cover_image_url    3528 non-null object
dtypes: object(6)
memory usage: 353.0+ KB
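
The non-null counts from `.info()` can also be turned into direct counts of missing values. A small sketch on an invented three-row DataFrame (the names `sample_data`, `missing_per_column`, and `rows_with_gaps` are illustrative):

```python
import pandas

# Invented sample with one gap in each of two columns.
sample_data = pandas.DataFrame(
    {
        'record_id': ['1', '2', '3'],
        'pubdate': ['1921', None, '1920'],
        'title': ['Title A', 'Title B', None],
    }
)

# Number of missing values in each column.
missing_per_column = sample_data.isna().sum()

# The rows that have at least one missing value.
rows_with_gaps = sample_data[sample_data.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```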


In [47]:
data.head()

| | record_id | pubdate | ni_doc_date | ni_index | title | cover_image_url |
|---|---|---|---|---|---|---|
| 0 | 1479486 | 1921 | 12 Jul 1921 | Immigration | Cullen, Ira S. | https://stors.tas.gov.au/SWD4-1-1$givethumb |
| 1 | 1479487 | 1920 | 4 Nov 1920 | Immigration | Newton, A.A. | https://stors.tas.gov.au/SWD4-1-2$givethumb |
| 2 | 1479488 | 1920 | 1 Jan 1920 | Immigration | Hurst, Frederick | https://stors.tas.gov.au/SWD4-1-3$givethumb |
| 3 | 1479489 | 1920 | 1 Jan 1920 | Immigration | Rhodes, H.A. | https://stors.tas.gov.au/SWD4-1-4$givethumb |
| 4 | 1479490 | 1920 | 1 Jan 1920 | Immigration | Whitton, B.J. | https://stors.tas.gov.au/SWD4-1-5$givethumb |
