# Get Hansard URLs
This notebook automatically downloads the RSS (XML) search-result feeds that list the URLs of all the Hansards.

The Senate Hansard page is available [here](http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansards,hansards80;resCount%3DDefault)

The House of Representatives Hansard page is available [here](http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault)

## NOTE
Currently only for the SENATE
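
The RSS feed URLs requested later in this notebook follow the same pattern as the search pages linked above, with only the dataset names and the page number changing. A minimal sketch of a helper that builds them for either chamber might look like this. The function name `build_rss_url` and the `DATASETS` mapping are hypothetical, and the Senate feed URL is an assumption derived from the House feed URL pattern rather than something the notebook confirms.

```python
# Hypothetical helper: names and structure are illustrative, not part of the original script.
BASE = "https://parlinfo.aph.gov.au/parlInfo/feeds/rss.w3p"

# Dataset query fragments taken from the two search-page links above
DATASETS = {
    "senate": "hansards,hansards80",
    "house": "hansardr,hansardr80",
}

def build_rss_url(chamber, page_index):
    """Return the feed URL for a 0-based page index of the given chamber."""
    parts = [BASE, "adv=yes", "orderBy=_fragment_number,doc_date-rev"]
    # The first page carries no page parameter; later pages are 1-based
    if page_index > 0:
        parts.append(f"page={page_index + 1}")
    parts.append(f"query=Dataset%3A{DATASETS[chamber]}")
    parts.append("resCount=Default")
    return ";".join(parts)

print(build_rss_url("senate", 0))
print(build_rss_url("house", 2))
```

Keeping the chamber as a parameter makes it harder for the query dataset and the output directory to drift apart.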

In [5]:
import requests

In [310]:
class xmlSnippet:
    """A flat view of a snippet of XML data (e.g. one RSS <item>)."""

    def __init__(self, rawStr, selfAnalyse, **kwargs):
        # str(bytes) leaves newlines as the two characters "\" + "n", so remove those
        self.rawStr = rawStr.strip().replace("\\n", "")
        self.properties = kwargs.get("properties", [])
        self.errorList = []
        if selfAnalyse:
            self.errorList = self.analyse()

    def analyse(self):
        """Fill self.properties and return the fragments that could not be parsed."""
        errorEntries = []
        for item in self.rawStr.split("<"):
            # Skip empty fragments (the text before a leading "<") and closing tags
            if not item or item.startswith("/"):
                continue
            try:
                key = item.split(">")[0]
                value = item.split(">")[1]
                attributes = {}

                # If the tag has attributes, separate them from the tag name
                if " " in key:
                    keyList = key.split(" ")
                    key = keyList[0]
                    for attribute in keyList[1:]:
                        attrPair = attribute.split("=")
                        attributes[attrPair[0]] = attrPair[1]

                self.properties.append(
                    {"key": key, "value": value, "attr": attributes}
                )
            except IndexError:
                # No ">" in the fragment, or an attribute without "="
                errorEntries.append(item)
        return errorEntries

    def display(self):
        print(f"{'Key':<15} | {'Attr Count':<15} | Value")
        print(f"{'-'*16}|{'-'*17}|{'-'*15}")
        for item in self.properties:
            print(f"{item['key']:<15} | {str(len(item['attr'])):<15} | {item['value']}")
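
For comparison, the same key/attribute/value view can be produced with the standard library's `xml.etree.ElementTree` rather than string splitting. This is a sketch of an alternative technique, not the approach used in this notebook. The function name `item_properties` and the sample `<item>` string are made up for illustration, and it assumes the input is a well-formed XML fragment with real newlines, not the escaped text that `str()` produces from bytes.

```python
import xml.etree.ElementTree as ET

def item_properties(item_xml):
    """Return a list of {"key", "value", "attr"} dicts, one per child of an <item>."""
    root = ET.fromstring(item_xml)
    return [
        {"key": child.tag, "value": (child.text or "").strip(), "attr": dict(child.attrib)}
        for child in root
    ]

# Made-up sample in the shape of one RSS item
sample = """<item>
  <title>Title Unavailable</title>
  <link>https://example.org/display?id=1</link>
  <guid isPermaLink="false">https://example.org/display?id=1</guid>
  <pubDate>Tue, 26 Aug 2008 00:00:00 +1100</pubDate>
</item>"""

for prop in item_properties(sample):
    print(f"{prop['key']:<15} | {len(prop['attr']):<15} | {prop['value']}")
```

A real parser handles quoted attribute values and nested tags, which the split-based approach does not.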


In [423]:
# Browser-like User-Agent so the request is not rejected as coming from a bot
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
startPageNumber = 0 # Index of the first page of results to download (0-based)
endPageNumber = 16 # Stop before this page index (range() excludes the upper bound)
# otherURL is not defined in the cells above; it must be set to an RSS feed URL before running this
rssData = requests.get(otherURL, headers=headers)

In [424]:
content = str(rssData.content)
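
Calling `str()` on a `bytes` object returns its printable representation (`b'...'`, with each newline shown as the two characters `\` and `n`), which is why `xmlSnippet` removes the literal `\n` sequences. The sketch below contrasts that with decoding the bytes. The sample payload is made up, and UTF-8 is an assumption about the feed's encoding.

```python
# Made-up payload standing in for rssData.content
payload = b"<item>\n<title>Title Unavailable</title>\n</item>"

as_repr = str(payload)             # repr text: b'...' wrapper, newlines escaped
as_text = payload.decode("utf-8")  # decoded text with real newline characters (UTF-8 assumed)

print(as_repr.startswith("b'"))    # does the repr keep the b'...' wrapper?
print("\\n" in as_repr)            # backslash + n as two separate characters?
print("\n" in as_text)             # an actual newline character?
```

`requests` also exposes the decoded body as `rssData.text`; switching to it would mean stripping real newlines instead of the escaped ones.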

In [461]:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
for pageNumber in range(startPageNumber, endPageNumber):
    # NOTE: this query selects the House of Representatives dataset (hansardr),
    # although the files are saved under ../data/senate; the Senate dataset is hansards,hansards80
    # Pages after the first take an explicit 1-based page parameter
    rssURL = f"https://parlinfo.aph.gov.au/parlInfo/feeds/rss.w3p;adv=yes;orderBy=_fragment_number,doc_date-rev;page={pageNumber+1};query=Dataset%3Ahansardr,hansardr80;resCount=Default"
    if pageNumber == 0:
        # The first page uses the same URL without the page parameter
        rssURL = "https://parlinfo.aph.gov.au/parlInfo/feeds/rss.w3p;adv=yes;orderBy=_fragment_number,doc_date-rev;query=Dataset%3Ahansardr,hansardr80;resCount=Default"

    rssData = requests.get(rssURL, headers=headers)
    # str() on bytes gives the b'...' repr with escaped newlines, which xmlSnippet strips out
    content = str(rssData.content)
    with open(f"../data/senate/urlLists/list{pageNumber}.txt", "w", encoding="utf-8") as listFile:
        listFile.write(content)
    print(f"Just wrote data for page {pageNumber}")
    # display() prints the table itself and returns None, so it is not wrapped in print()
    xmlSnippet(content.split("<item>")[2], True).display()
    xmlSnippet(content.split("<item>")[-2], True).display()

Just wrote data for page 2
Key             | Attr Count      | Value
----------------|-----------------|---------------
title           | 0               | Title Unavailable
link            | 0               | https://parlinfo.aph.gov.au:443/parlInfo/search/display/display.w3p;query=Id%3A%22chamber%2Fhansardr%2F2008-08-26%2F0000%22
guid            | 1               | https://parlinfo.aph.gov.au:443/parlInfo/search/display/display.w3p;query=Id%3A%22chamber%2Fhansardr%2F2008-08-26%2F0000%22
pubDate         | 0               | Tue, 26 Aug 2008 00:00:00 +1100
Key             | Attr Count      | Value
----------------|-----------------|---------------
title           | 0               | Title Unavailable
link            | 0               | https://parlinfo.aph.gov.au:443/parlInfo/search/display/display.w3p;query=Id%3A%22chamber%2Fhansardr%2F2000-10-09%2F0000%22
guid            | 1               | https://parlinfo.aph.gov.au:443/parlInfo/search/display/display.w3p;query=Id%3A%22chamber%