# [LEGALST-123] Lab 17: Parsing XML Data

This lab will cover parsing XML and attribute lookup, Beautiful Soup, and web scraping.

*Estimated Time: 45 Minutes*

### Topics Covered:
- XML syntax
- locating content with Beautiful Soup
- Web scraping

### Table of Contents

[The Data](#section-data)<br>
1 - [Web Scraping](#section-1)<br>
2 - [XML Syntax](#section-2)<br>
3 - [Using Beautiful Soup to parse XML](#section-3)<br>
4 - [Putting it all in a dataframe](#section-4)<br>

**Dependencies:**

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET #XML Parser (cElementTree was removed in Python 3.9)
from lxml import etree #ElementTree and lxml allow us to parse the XML file.
import requests #make request to server
import time #pause loop
from bs4 import BeautifulSoup

----
## The Data<a id='section-data'></a>

In this notebook, you'll be working with XML files from the Old Bailey API (https://www.oldbaileyonline.org/obapi/). These files contain the proceedings of all trials from 1674 to 1913. For this lab, we'll go through the trials from 1754-1756. XML (eXtensible Markup Language) provides a hierarchical representation of data contained within different tags and nodes. We'll go over XML syntax later. We will learn how to parse through these XML files from Old Bailey and grab information from sections of an XML file.

----
## Section 1: Web Scraping<a id='section-1'></a>

First we will go through how to parse one XML file. The Old Bailey API has a total of **197751** cases. Fortunately, we are only going to use the ones from 1754-1756, although that still leaves a little over 1,300 cases!

Don't worry though, you're not going to manually download each case yourself. This is where web scraping comes into play. With web scraping, we can automate data collection to get all the cases. 

Before we start scraping, we need to know how `requests` works. The `requests` library gets (`.get`!) you a response object from a web server and will automatically decode the content from the server, from which you can use `.json()` to see the document! Requests through the Old Bailey API will return a dictionary, embedded in which is the XML representation of the trial account, which we can then write as a file and save.

Let's take a look at all of the terms we can use to choose the specific cases we want. We use `.json()` here since the parameters are stored as a JSON object.
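As a rough illustration of what `.json()` does, the sketch below parses a short, hand-typed JSON string with the standard library's `json.loads`, which produces the same kind of Python lists and dictionaries. The string `sample_body` is an abbreviated stand-in for a server reply and is an assumption for the example, not something fetched from the API.

```python
import json

# Abbreviated, hand-typed stand-in for a response body (an assumption, not fetched data).
sample_body = (
    '[{"name": "trialtext", "type": "text"}, '
    '{"name": "defgen", "type": "select", "terms": ["female", "indeterminate", "male"]}]'
)

# Calling .json() on a response performs this kind of parsing on the text the server sent.
terms = json.loads(sample_body)

# The result is ordinary Python data: here, a list of dictionaries.
term_names = [term["name"] for term in terms]
print(term_names)
```

Once the data is a list of dictionaries, everything else in this section is plain indexing and looping.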

<span style="color:red">**Note**: The Old Bailey Online website has changed quite a lot and so we will be using the archived site to get this JSON object. The archived site will disappear in September 2024.</span>

In [2]:
rjson = requests.get('https://www.dhi.ac.uk/oldbaileyonline/obapi/terms').json()
rjson 

[{'name': 'trialtext', 'type': 'text'},
 {'name': 'defgen',
  'type': 'select',
  'terms': ['female', 'indeterminate', 'male']},
 {'name': 'offcat',
  'type': 'select',
  'terms': ['breakingPeace',
   'damage',
   'deception',
   'kill',
   'miscellaneous',
   'royalOffences',
   'sexual',
   'theft',
   'violentTheft']},
 {'name': 'offsubcat',
  'type': 'select',
  'terms': ['',
   'animalTheft',
   'arson',
   'assault',
   'assaultWithIntent',
   'assaultWithSodomiticalIntent',
   'bankrupcy',
   'barratry',
   'bigamy',
   'burglary',
   'coiningOffences',
   'concealingABirth',
   'conspiracy',
   'embezzlement',
   'extortion',
   'forgery',
   'fraud',
   'gameLawOffence',
   'grandLarceny',
   'habitualCriminal',
   'highwayRobbery',
   'housebreaking',
   'illegalAbortion',
   'indecentAssault',
   'infanticide',
   'keepingABrothel',
   'kidnapping',
   'libel',
   'mail',
   'manslaughter',
   'murder',
   'other',
   'perjury',
   'pervertingJustice',
   'pettyLarceny',
   

As I noted before, the [Old Bailey Online](https://www.oldbaileyonline.org/) website has [changed](https://www.oldbaileyonline.org/about/whats-new) quite a bit. The changes are not just to the front end but throughout the site, including the [API](https://www.oldbaileyonline.org/about/api). For example, the API now returns just ten trial account files at a time, and they are no longer just the XML object. The query parameters, like `fromdate_` and `todate_` have also changed, so this lab has had to be revised.

Now that you've had a chance to look through some of the terms, let's see how to grab the specific trial account files.

Clicking the URL below returns a web search result that lists all the cases containing the term "sheffield" in the offence category "deception" from June 14th, 1847 onward, together with the total number of such cases. Also, each trial ID that satisfies the terms is returned; the count parameter in this case returns 74 trial IDs. The API returns how many trial IDs it finds, but it only hands out the trial account files ten at a time. The query parameter for start year is `year_gte=` and for end year is `year_lte=`.

https://www.oldbaileyonline.org/search/advanced?month_gte=6&offence=deception&text=Sheffield&year_gte=1847#results
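Query strings like the one above can be typed by hand, but they can also be assembled from a dictionary of parameters. Here is a minimal sketch using the standard library's `urlencode`; the helper name `build_search_url` is hypothetical, and the parameter names are simply the ones visible in the link above.

```python
from urllib.parse import urlencode

# Base address copied from the search link above.
SEARCH_URL = "https://www.oldbaileyonline.org/search/advanced"

def build_search_url(**params):
    """Hypothetical helper: append query parameters to the search address."""
    return SEARCH_URL + "?" + urlencode(params)

example_url = build_search_url(month_gte=6, offence="deception", text="Sheffield", year_gte=1847)
print(example_url)
```

Building the URL this way keeps each parameter on its own and takes care of escaping any characters that are not allowed in a URL.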


**Question 1.1:** Use requests.get(...) to get all the trials between the years 1754 and 1756 and return them as a JSON object, and then find how many total trial accounts there are.

In [3]:
trials = ...

At this point it might pay to look at the JSON object that `requests.get` returned to see what the list of trial accounts actually looks like. You can see that the first `hits` is a dictionary index which records a `total` for the query, and that the second `hits` is the key for the list of trial accounts.

You can see below that `trials` is a big dictionary, inside of which is a list of trials, each of which has a trial ID. The API documentation describes the [data endpoints](https://www.oldbaileyonline.org/about/api#data_endpoints) and [query parameters](https://www.oldbaileyonline.org/about/api#supported_parameters) you can use. Unfortunately, the data now come in a format that is more complicated than the earlier format, so we will have to do some work to extract all the trials from the period (1754-1756) that we want to look at. The `total` key below tells us that there are a total of 1312 trial records in the period 1754-1756. The numbering scheme for the `idkey` also tells us something about the organization of the records; the prefix `f` is for front matter, while `t` is for trial account. After that is the date with digits for year, month, and day, followed by a hyphen and the account's number in the series for that day. Note that many trials occurred on a day when the Old Bailey was in session.
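To make the ID scheme concrete, here is a small sketch that splits an ID into its parts with a regular expression. The pattern is an assumption based only on the IDs that appear in this lab (one letter, eight date digits, a hyphen, a number), and `parse_idkey` is a hypothetical helper name.

```python
import re

# Assumed pattern: prefix letter, YYYYMMDD date digits, hyphen, number in the series.
IDKEY_PATTERN = re.compile(r"^([a-z])(\d{4})(\d{2})(\d{2})-(\d+)$")

def parse_idkey(idkey):
    """Hypothetical helper: return the parts of an ID, or None if it does not fit the pattern."""
    match = IDKEY_PATTERN.match(idkey)
    if match is None:
        return None
    prefix, year, month, day, number = match.groups()
    return {
        "prefix": prefix,
        "year": int(year),
        "month": int(month),
        "day": int(day),
        "number": int(number),
    }

parsed = parse_idkey("t17540116-5")
print(parsed)
```

A check like `parse_idkey(some_id)["prefix"] == "t"` is one way to separate trial accounts from front matter later on.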

In [4]:
trials

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 1312,
  'max_score': None,
  'hits': [{'_index': 'dhids_oldbailey_record',
    '_type': 'doc',
    '_id': 'AYxEMNu7Pld_y58Rz_mD',
    '_score': None,
    '_source': {'idkey': 'f17540116-1',
     'images': ['https://www.dhi.ac.uk/san/ob/1750s/175401160001.gif',
      'https://www.dhi.ac.uk/san/ob/1750s/175401160002.gif'],
     'text': "THE PROCEEDINGS ON THE King's Commissions of the Peace, Oyer and Terminer, and Gaol Delivery FOR THE CITY of LONDON; And also the Gaol Delivery for the County of MIDDLESEX, HELD AT JUSTICE-HALL in the OLD-BAILEY, On Wednesday the 16th, Thursday the 17th, Friday the 18th, Saturday the 19th, and Monday the 21st, of JANUARY. In the 27th Year of His MAJESTY's Reign. NUMBER II. for the Year 1754. BEING THE Second SESSIONS in the MAYORALTY of the Right Hon. Thomas Rawlinson , Esq; LORD-MAYOR of the CI",
     'title': 'Front matter. 16th Janu

In [5]:
total_hits = ...
total_hits

1312

It is clear that the JSON object is pretty complicated. There is a top-level dictionary with the key 'hits' and then an embedded dictionary with the key 'total' (that is where we got the total number of trial accounts above) and then a second 'hits' and then a list of dictionaries and more embedded dictionaries. This is considerably more complicated than the old representation. Notice also that the very first XML document is not a trial record but is instead the front matter from that printed version of the Old Bailey Proceedings.

So with the new API it looks like we may have to walk through each group of ten trial account documents until we have gotten the entire list, and then eventually parse the XML tree for each document using Beautiful Soup. For the problem set we will use the XML [dataset archived](https://www.oldbaileyonline.org/about/data#toc3) at the University of Sheffield so we will not have to scrape them from the site. Even the zipped file of Sessions Papers is very large (335MB) so you will have to plan for where you can do the work.

In the JSON object that we got back, [object.hits.hits](https://www.oldbaileyonline.org/about/api#response_format) is the list of the XML docs.
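The two levels of `hits` are easy to mix up, so here is a sketch on a trimmed, hand-made dictionary with the same nesting. The values in `sample_reply` are illustrative assumptions, not real API output.

```python
# Trimmed, hand-made stand-in with the same nesting as the reply (illustrative values).
sample_reply = {
    "hits": {
        "total": 3,
        "hits": [
            {"_source": {"title": "Front matter. 16th January 1754."}},
            {"_source": {"title": "A made-up second record."}},
        ],
    }
}

outer = sample_reply["hits"]   # first 'hits': a dictionary that also holds 'total'
records = outer["hits"]        # second 'hits': the list of records on this page

print(type(outer).__name__, type(records).__name__)
print(outer["total"], len(records))
```

Note that `total` counts every match for the query, while the inner list only holds the records on the current page.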

In [6]:
len(trials['hits']['hits'])

10

In [7]:
# iteratively call requests.get for the years we want until we have a list of all the XMLs of trials
# start_yr = 1754
# end_yr = 1756

# the nested loops are cumbersome but I am not sure of any other way of producing a list of trial contents in one step
i = 0
xml_list = []
while i < total_hits:
    trials = requests.get(f'https://www.dhi.ac.uk/api/data/oldbailey_record?from={i}&year_gte=1754&year_lte=1756').json()
    ten_xmls = ...
    j=0
    while j < len(ten_xmls):
        one_xml = ...
        xml_list...
        j = j + 1
    i = i + 10
    time.sleep(0.1)    # avoid overloading server
len(xml_list)

1312
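The counter `i` in the loop above takes the values 0, 10, 20, and so on. The same offsets can be produced with `range`, as in this sketch; `page_offsets` is a hypothetical helper, and the page size of ten is taken from the behavior of the API described earlier.

```python
def page_offsets(total, page_size=10):
    """Hypothetical helper: starting offset of each page needed to cover `total` records."""
    return list(range(0, total, page_size))

offsets = page_offsets(1312)
print(len(offsets), offsets[:3], offsets[-1])
```

Each offset would be passed as the `from=` value in one request, so 1312 records take 132 requests.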

In [8]:
# inspect the first trial record in the list
xml_list[0]

{'_index': 'dhids_oldbailey_record',
 '_type': 'doc',
 '_id': 'AYxEMNu7Pld_y58Rz_mD',
 '_score': None,
 '_source': {'idkey': 'f17540116-1',
  'images': ['https://www.dhi.ac.uk/san/ob/1750s/175401160001.gif',
   'https://www.dhi.ac.uk/san/ob/1750s/175401160002.gif'],
  'text': "THE PROCEEDINGS ON THE King's Commissions of the Peace, Oyer and Terminer, and Gaol Delivery FOR THE CITY of LONDON; And also the Gaol Delivery for the County of MIDDLESEX, HELD AT JUSTICE-HALL in the OLD-BAILEY, On Wednesday the 16th, Thursday the 17th, Friday the 18th, Saturday the 19th, and Monday the 21st, of JANUARY. In the 27th Year of His MAJESTY's Reign. NUMBER II. for the Year 1754. BEING THE Second SESSIONS in the MAYORALTY of the Right Hon. Thomas Rawlinson , Esq; LORD-MAYOR of the CI",
  'title': 'Front matter. 16th January 1754.'},
 'sort': [-6814972800000, 0]}

In [9]:
# and another
xml_list[5]

{'_index': 'dhids_oldbailey_record',
 '_type': 'doc',
 '_id': 'AYxEMNu7Pld_y58Rz_mI',
 '_score': None,
 '_source': {'idkey': 't17540116-5',
  'images': ['https://www.dhi.ac.uk/san/ob/1750s/175401160003.gif'],
  'text': '84. (M.) John Allen was indicted for stealing one linen shirt, value 1 s. 6 d. the property of Thomas Fazakerley , Dec. 15 . ++ Acquitted .',
  'title': 'John Allen. Theft; grand larceny (to 1827). 16th January 1754.'},
 'sort': [-6814972800000, 5]}

Now we can see clearly what the requests call got us. It is a list of dictionaries that has various keys, including 'idkey' for the document ID, 'images' for the page images of the paper Old Bailey Proceedings, and 'text', which shows **just the plain text** for the document. We did not get the XML representation of the trial record, which has tags and labels for the parties etc., as we saw when we got the list of terms above.

What we need to do next is
* get the list of ID keys for all 1312 documents from 1754 to 1756 (some of these will not be trial session records but will be front matter or advertising--we won't worry about that now)
* use the list of ID keys to make requests for the xml files for each ID key
* write the xml files to the local data directory
* use those xml files to build a dataframe

First we need to pick out the document ID from each item in the list.

In [10]:
xml_list[5]...

't17540116-5'

Now, let's get a list of document IDs we can work with. Then we can use the first ten to demonstrate the next step of calling the API to get the xml file for each document ID.

In [11]:
doc_ids = [...]
first_10 = doc_ids[:10]
first_10

['f17540116-1',
 't17540116-1',
 't17540116-2',
 't17540116-3',
 't17540116-4',
 't17540116-5',
 't17540116-6',
 't17540116-7',
 't17540116-8',
 't17540116-9']

Using the trial IDs from the previous cell, we are going to format the URL in a way so that we can get the XML file for each trial. In order to get the XML file using the Old Bailey API, we must follow this URL format:

<p style="text-align: center;">`https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=`  </p>

For example, https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t16740429-1 gives you the link to the XML file of the first proceeding in the database.


**Get the XML file of the fifth trial in first_10.** A successful `.get` request returns `<Response [200]>`.

In [12]:
fifth_trial_id = first_10[4]
url = 'https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey={}'.format(fifth_trial_id)
response = requests.get(url)
response

<Response [200]>

Run the next cell to see the XML format of the text! 

In [13]:
response.json()['hits']['hits'][0]['_source']['xml']

'<div1 type="trialAccount" id="t17540116-4"> <interp inst="t17540116-4" type="prevdiv" value="t17540116-3" divtype="trialAccount"/> <interp inst="t17540116-4" type="nextdiv" value="t17540116-5" divtype="trialAccount"/> <interp inst="t17540116-4" type="div0" value="17540116" divtype="sessionsPaper"/> <interp inst="t17540116-4" type="assocrec" value="ar_4437_48007" title="Document. 00 1754. London Metropolitan Archives MJ/SP/1754/01/041."/> <interp inst="t17540116-4" type="collection" value="BAILEY"/> <interp inst="t17540116-4" type="year" value="1754"/> <interp inst="t17540116-4" type="uri" value="sessionsPapers/17540116"/> <interp inst="t17540116-4" type="date" value="17540116"/> <xptr imgpath="ob/1750s/175401160003.gif" imgtitle="Proceedings of the Old Bailey, 16th January ." imgrights="This image is reproduced by permission of Harvard University Library from the microfilm, &quot;The Old Bailey Proceedings&quot;, (Harvester Microform, a former imprint of the Gale Group, 1983). Commerc

We can save just the XML portion in a local file:

In [14]:
with open(f'data/old-bailey/old-bailey-{fifth_trial_id}.xml', 'w') as file:
    file.write(...)
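Writing to `data/old-bailey/` fails if that folder does not exist yet. One way to guard against this is sketched below with `pathlib`; `save_xml` is a hypothetical helper, and the demonstration uses a throwaway folder and a made-up XML string rather than a real trial.

```python
from pathlib import Path
import tempfile

def save_xml(xml_text, trial_id, directory):
    """Hypothetical helper: write one XML string to <directory>/old-bailey-<trial_id>.xml."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    path = folder / f"old-bailey-{trial_id}.xml"
    path.write_text(xml_text, encoding="utf-8")
    return path

# Demonstration in a temporary folder with a made-up XML string.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_xml('<div1 type="trialAccount" id="example"/>', "example", Path(tmp) / "data" / "old-bailey")
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")

print(saved_name)
```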

Now we'll get trial `t17031013-13` specifically, for the examples below:

In [15]:
davis_trial_id = 't17031013-13'
response = ...
with open(f'data/old-bailey/old-bailey-{davis_trial_id}.xml', 'w') as file:
    file.write(...)

### Challenge: Scraping all trials from 1754 - 1756

Now we have extracted trial IDs and trial XML trees using `requests.get(some_url)`, so we can iterate through each ID in a list of trials (use `doc_ids` from above for the list of IDs). You can choose how many trials you want to save--maybe 30 to start?

In [16]:
for doc_id in doc_ids[:30]:
    #format URL
    url =  'https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey={}'.format(doc_id)
    print(url)
    #get the JSON from URL
    tree = requests.get(url).json()
    #save just the xml of the trial in the file
    with open('data/old-bailey/old-bailey-' + doc_id + '.xml', 'w') as file:
        file.write(...)
    #one second pause so servers aren't overloaded
    time.sleep(1)

https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=f17540116-1
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-1
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-2
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-3
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-4
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-5
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-6
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-7
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-8
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-9
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-10
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-11
https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey=t17540116-12
https://www.dhi.ac.uk/api/data/oldbailey_record_

You can check if you saved the XML files by executing the cell below!

In [17]:
!ls data/old-bailey/

old-bailey--t17540116-1.xml  old-bailey--t17540227-13.xml
old-bailey--t17540116-10.xml old-bailey--t17540227-14.xml
old-bailey--t17540116-11.xml old-bailey--t17540227-15.xml
old-bailey--t17540116-12.xml old-bailey--t17540227-16.xml
old-bailey--t17540116-13.xml old-bailey--t17540227-17.xml
old-bailey--t17540116-14.xml old-bailey--t17540227-18.xml
old-bailey--t17540116-15.xml old-bailey--t17540227-19.xml
old-bailey--t17540116-16.xml old-bailey--t17540227-2.xml
old-bailey--t17540116-17.xml old-bailey--t17540227-20.xml
old-bailey--t17540116-18.xml old-bailey--t17540227-21.xml
old-bailey--t17540116-19.xml old-bailey--t17540227-22.xml
old-bailey--t17540116-2.xml  old-bailey--t17540227-23.xml
old-bailey--t17540116-20.xml old-bailey--t17540227-24.xml
old-bailey--t17540116-21.xml old-bailey--t17540227-25.xml
old-bailey--t17540116-22.xml old-bailey--t17540227-26.xml
old-bailey--t17540116-23.xml old-bailey--t17540227-27.xml
old-bailey--t17540116-24.xml old-bailey--t17540227-28.xml

This cell will show you the XML file.

In [18]:
!cat data/old-bailey/old-bailey-t17540116-1.xml

<div1 type="trialAccount" id="t17540116-1"> <interp inst="t17540116-1" type="prevdiv" value="f17540116-1" divtype="frontMatter"/> <interp inst="t17540116-1" type="nextdiv" value="t17540116-2" divtype="trialAccount"/> <interp inst="t17540116-1" type="div0" value="17540116" divtype="sessionsPaper"/> <interp inst="t17540116-1" type="collection" value="BAILEY"/> <interp inst="t17540116-1" type="year" value="1754"/> <interp inst="t17540116-1" type="uri" value="sessionsPapers/17540116"/> <interp inst="t17540116-1" type="date" value="17540116"/> <xptr imgpath="ob/1750s/175401160002.gif" imgtitle="Proceedings of the Old Bailey, 16th January ." imgrights="This image is reproduced by permission of Harvard University Library from the microfilm, &quot;The Old Bailey Proceedings&quot;, (Harvester Microform, a former imprint of the Gale Group, 1983). Commercial use is prohibited without permission of the owner of the original." type="pageFacsimile" value="preceding" doc="175401160002"/> <join result

## Section 2: XML Syntax<a id='section-2'></a>

First, we'll go over the syntax of an XML file. The basic unit of XML code is called an "element" or "node" and has a start and ending tag. The tags for each element look something like this:

 `<exampletag>some text</exampletag>`  

Run the next cell to look at the XML file of one of the cases from the Old Bailey API!

In [19]:
# use requests to get an example of xml
example = requests.get(f'https://www.dhi.ac.uk/api/data/oldbailey_record_single?idkey={davis_trial_id}').json()
example

{'took': 3,
 'timed_out': False,
 '_shards': {'total': 20, 'successful': 20, 'skipped': 0, 'failed': 0},
 'hits': {'total': 1,
  'max_score': 10.208518,
  'hits': [{'_index': 'dhids_oldbailey_record',
    '_type': 'doc',
    '_id': 'AYxEMK2rPld_y58Rz-bk',
    '_score': 10.208518,
    '_source': {'metadata': '<table xmlns:dhids="https://www.dhi.ac.uk/data/" class="table small"><tbody><tr><th scope="row">Text type</th><td>Trial account</td></tr><tr><th scope="row">Defendants</th><td>Samuel Davis</td></tr><tr><th scope="row">Offences</th><td><a href="../about/crimes#theft">Theft</a> &gt; <a href="../about/crimes#grandlarceny">Grand larceny</a></td></tr><tr><th scope="row">Session Date</th><td><a href="17031013#t17031013-13">13th October 1703</a></td></tr><tr><th scope="row">Reference Number</th><td>t17031013-13</td></tr><tr><th scope="row">Verdicts</th><td><a href="../about/verdicts#guilty">Guilty</a></td></tr><tr><th scope="row">Punishments</th><td><a href="../about/punishment#miscellane

A ha! Old Bailey Online revised the way it represents data. It used to be that each file contained just the XML, but now each one is a dictionary, among the keys of which is the XML. It is the XML we want to parse, so let's separate that out, like we did up above when we wrote out the Old Bailey Online files locally. You can see that the key 'xml' is shortly after the 'idkey' that we used when we made the list of trial records above. Let's inspect the XML from Samuel Davis's trial.

In [20]:
# get just the xml from the Samuel Davis trial account
example['hits']['hits'][0]['_source']['xml']

'<div1 type="trialAccount" id="t17031013-13"> <interp inst="t17031013-13" type="prevdiv" value="t17031013-12" divtype="trialAccount"/> <interp inst="t17031013-13" type="nextdiv" value="t17031013-14" divtype="trialAccount"/> <interp inst="t17031013-13" type="div0" value="17031013" divtype="sessionsPaper"/> <interp inst="t17031013-13" type="collection" value="BAILEY"/> <interp inst="t17031013-13" type="year" value="1703"/> <interp inst="t17031013-13" type="uri" value="sessionsPapers/17031013"/> <interp inst="t17031013-13" type="date" value="17031013"/> <xptr imgpath="ob/1700s/17031013002.gif" imgtitle="Proceedings of the Old Bailey, 13th October ." imgrights="This image is reproduced courtesy of the British Library. Commercial use is prohibited without permission of the owner of the original." type="pageFacsimile" value="preceding" doc="17031013002"/> <join result="criminalCharge" id="t17031013-13-off60-c52" targOrder="Y" targets="t17031013-13-defend52 t17031013-13-off60 t17031013-13-ver

The `interp` tags at the beginning of the file are empty elements: they carry attributes but no plain text content. An empty element may be written as `<exampletag/>`, which is equivalent to `<exampletag></exampletag>`.

Elements may also contain other elements, which we call "children". Most children are indented, but the indents aren't necessary in XML and are used for clarity to show nesting. For example, if we go down to `<persName id="t17540116-4-defend46" type="defendantName">`, we see that the `rs` tag is a child of `persName`. We will explore children in XML more in the next section.

Lastly, elements may have attributes, which are in the format `<exampletag name_of_attribute="somevalue">`. Attributes are designed to store data related to a specific element. Attributes **must** follow the quotes format (`name = "value"`). As you can tell, in this XML file, attributes are everywhere!
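The three ideas above (empty elements, children, attributes) can be seen together in a tiny example. This sketch uses the standard library's `xml.etree.ElementTree` on a hand-written snippet that only imitates the style of the trial files; the names and values in it are made up.

```python
import xml.etree.ElementTree as ET

# Hand-written snippet in the style of the trial files (made-up content).
snippet = (
    '<p><persName type="defendantName">Jane '
    '<rs type="occupation">Weaver</rs>'
    '<interp type="gender" value="female"/></persName></p>'
)
root = ET.fromstring(snippet)

pers = root.find("persName")
children = [child.tag for child in pers]   # elements nested inside persName
empty = pers.find("interp")                # an empty element

print(pers.attrib["type"])                 # an attribute value
print(children)
print(empty.text, empty.attrib["value"])   # no text; the data lives in the attributes

# <x/> and <x></x> describe the same empty element.
same = ET.tostring(ET.fromstring("<x/>")) == ET.tostring(ET.fromstring("<x></x>"))
print(same)
```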

-----
**Question 2.1:** What was the verdict of this case? Was there a punishment and if so, what was it? List both and state whether you found it as plain text content or as an attribute.

_Write your answer here :_



----
## Section 3: Using Beautiful Soup to parse XML<a id='section-3'></a>

Now that we know the syntax and structure of an XML file, let's figure out how to parse through one! We are going to load the same file from the second section and use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to navigate through elements in this file.

First, we need to import the file into a Beautiful Soup instance. 

In [21]:
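xml_path = f'data/old-bailey/old-bailey-{davis_trial_id}.xml' #local file
with open(xml_path) as f: #the file is closed again once it has been read
    xml_file = f.read()
davis_trial_soup = BeautifulSoup(xml_file, 'lxml') #name the parser explicitly; lxml's HTML parser adds the html and body wrappers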
print(davis_trial_soup.prettify())

We can examine `davis_trial_soup` using `.contents`, which puts all children of a tag in a list.

In [22]:
davis_trial_soup.contents

[<html><body><div1 id="t17031013-13" type="trialAccount"> <interp divtype="trialAccount" inst="t17031013-13" type="prevdiv" value="t17031013-12"></interp> <interp divtype="trialAccount" inst="t17031013-13" type="nextdiv" value="t17031013-14"></interp> <interp divtype="sessionsPaper" inst="t17031013-13" type="div0" value="17031013"></interp> <interp inst="t17031013-13" type="collection" value="BAILEY"></interp> <interp inst="t17031013-13" type="year" value="1703"></interp> <interp inst="t17031013-13" type="uri" value="sessionsPapers/17031013"></interp> <interp inst="t17031013-13" type="date" value="17031013"></interp> <xptr doc="17031013002" imgpath="ob/1700s/17031013002.gif" imgrights="This image is reproduced courtesy of the British Library. Commercial use is prohibited without permission of the owner of the original." imgtitle="Proceedings of the Old Bailey, 13th October ." type="pageFacsimile" value="preceding"></xptr> <join id="t17031013-13-off60-c52" result="criminalCharge" target

We notice that all the information we care about is contained within the `<div1>` and `</div1>` tags, so we navigate to that element. The simplest way to navigate the parse tree is to write the name of the tag you want after a dot (`.`). In this case, we want to access `div1` under the `body` tag, which is under the `html` tag.

In [23]:
body = davis_trial_soup.html.body.div1

We can now start working down the tree! With the body, we can find each child of the body by printing the tags. This will also help us for future reference, if we ever want to go through other children in the XML file.

In [24]:
for child in body.children:
    if child.name:
        print(child.name)

interp
interp
interp
interp
interp
interp
interp
xptr
join
p
p


Now that we have a list of children to work with let's select one using `.`. Using `.` navigates through the hierarchical structure of XML and helps us keep track of the path we are taking through this file.

In [25]:
choose_p = body.p
for child in choose_p.children:
    if child.name:
        print(child.name)

persname
placename
interp
interp
join
interp
rs
interp
interp
interp
interp
interp
interp
persname
rs
join
rs
interp
join
rs


This isn't very helpful, since we're still left with a bunch of tags and on top of that, we have a lot of repeating tags and names. Let's choose `placename` as our next tag and see what happens.

In [26]:
place_name = body.p.placename
for child in place_name.children:
    if child.name:
        print(child.name)

Nothing was printed, so it looks like we hit the end! Let's use `.string` to examine the data in this element, following the `.` path we used to get here.

In [27]:
print(body.p.placename.string)

St. James Westminster
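`.string` works here because `placename` holds a single piece of text. As a rough sketch on hand-written markup, the cell below shows what `.string` gives when a tag holds a mix of text and child tags, and how `.stripped_strings` collects the text fragments instead. The markup is made up, and `html.parser` is chosen only so that the sketch runs without any extra parser installed.

```python
from bs4 import BeautifulSoup

# Hand-written markup (made-up content), parsed with Python's built-in HTML parser.
markup = (
    '<p><placename>Somewhere</placename>'
    '<persname>Jane <interp type="gender" value="female"></interp>Doe</persname></p>'
)
soup = BeautifulSoup(markup, "html.parser")

single = soup.p.placename.string                  # one child, so .string is that text
mixed = soup.p.persname.string                    # several children, so .string is None
pieces = list(soup.p.persname.stripped_strings)   # every text fragment, whitespace trimmed

print(single, mixed, pieces)
```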


**Question 3.1:** Find the defendant's name by traversing through the correct elements. You can check your answer against the printed XML using `davis_trial_soup.contents`.

You may find that `body.p.persname.string` returns `None`. If a tag, `body.p.persname` in this case, contains more than one thing, then it's not clear what `.string` should refer to, so `.string` is defined to be `None`. Which functions could help us locate the name instead?

In [28]:
...

 Samuel Davis 


**Question 3.2:** Since the textual data is pretty messy in the XML files of these proceedings, where do you think the data you need might be held and how might you go about extracting this data?

*Write your response here*

----
## Section 4: Putting it all in a dataframe<a id='section-4'></a>

Now that we have a bunch of XML files and know how to parse through them to extract data, let's put the data from the XML files into a dataframe. Take a look at the XML for [this trial](https://www.oldbaileyonline.org/record/t17031013-3?text=t17031013-3) (and even better, look at what is or isn't consistent between that one and some others), and think about the structure of the data. How would you identify the people involved in a case? How would you identify their roles (witness/defendant/victim/other), or their genders? What can you learn about the alleged offence?

*Note:* Some cases have multiple defendants, multiple victims or multiple witnesses; however, most cases only have at most one of each. You can represent this in a dataframe by having $N$ columns for each property of a defendant, victim, etc., but this results in many many empty cells, and may not be amenable to analysis for the questions you come up with.

Think about the kinds of questions you may want to ask about this data, and refer to the XML for how you might answer them. For example, you may be interested in

- the words used specifically in describing the crime (notice that the text specifically between `<rs id="..." type="offenceDescription">` and `</rs>` gives you this)

- whether any victim was female

- whether any defendant was female

- the `category` (or `subCategory`) of the offense, etc.

- the entire text of the trial (sans tags)

These are questions that can be answered for most if not all cases, so they make good candidates for names of columns.
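Questions such as "was any defendant female?" turn a variable-length list of people into a single true/false value per case, which fits neatly into one column. Here is a minimal sketch of that idea; `gender_flags` is a hypothetical helper, the flag names are just one possible naming, and the two person dictionaries follow the shape of the Samuel Davis output shown later in this lab.

```python
def gender_flags(people):
    """Hypothetical helper: collapse a list of person dictionaries into one boolean per role and gender."""
    flags = {}
    for role in ("defendant", "victim"):
        for gender in ("female", "male"):
            flags[f"any_{role}_{gender}"] = any(
                person.get("type") == f"{role}Name" and person.get("gender") == gender
                for person in people
            )
    return flags

# Two people in the shape used for the Samuel Davis trial.
people = [
    {"type": "defendantName", "surname": "Davis", "given": "Samuel", "gender": "male"},
    {"type": "victimName", "surname": "Herbert", "given": "Catherine", "gender": "female"},
]
flags = gender_flags(people)
print(flags)
```

Because every case yields the same four keys, the resulting table has no empty cells for these columns, however many people a case involves.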

**Question 4.1**: Start by completing the following function to get the date of the case and return it.

In [29]:
def case_date(case_soup):
    for element in case_soup.body.div1.contents:
        if element.name == ...:
            if element.attrs['type'] == ...:
                return ...
            
print("Case", davis_trial_id, "happened on", case_date(davis_trial_soup))

Case t17031013-13 happened on 17031013
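The date comes back as an eight-digit string. If you want to sort or compare dates later, one option is to convert it to a date object, as in this sketch with the standard library; `parse_case_date` is a hypothetical helper, and it assumes the string always has the form year, month, day.

```python
from datetime import datetime

def parse_case_date(date_string):
    """Hypothetical helper: turn an eight-digit YYYYMMDD string into a date object."""
    return datetime.strptime(date_string, "%Y%m%d").date()

davis_date = parse_case_date("17031013")
print(davis_date.isoformat())
```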


**Question 4.2**: Complete the following function to find every person in a trial and return a list of dictionaries of their attributes (e.g. `[{"surname": "FINCH", "given": "JOHN", "gender": "male", "type": "witnessName"}]`). Test it on `davis_trial_soup` used before. **Note:** If you use `find_all`, specify the tag name in lowercase, as Beautiful Soup lowercases all tag names when it uses an HTML parser.

In [30]:
def people_in_case(case_soup):
    people = []
    for persName in case_soup.body.div1.find_all('persname'):
        person = {}
        person["type"] = persName.attrs.get("type") # `thing.get(key)` is like `thing[key]` but returns None if key is not in thing instead of raising an exception
        for interp in persName.find_all('interp'):
            fieldName = ...
            fieldValue = ...
            person[fieldName] = fieldValue
        people.append(person)
    return people

print("Case", davis_trial_id, "describes these people:\n", people_in_case(davis_trial_soup))


Case t17031013-13 describes these people:
 [{'type': 'defendantName', 'surname': 'Davis', 'given': 'Samuel', 'gender': 'male'}, {'type': 'victimName', 'victimNameLabel': 'Lady', 'surname': 'Herbert', 'given': 'Catherine', 'gender': 'female'}]


**Question 4.3**: Complete the following function to find the `offenceDescription` and `verdictDescription` in a trial. Think about how the XML expresses the offenceDescription and verdictDescription, and see if you can write the code without specifically looking for the labels "offenceDescription" and "verdictDescription" (i.e. so that it will work even if a case came up with something like `<rs type="sentencingDescription">`). Get the category, subCategory and textual description of the offence:

In [31]:
# note: myString.strip() removes the whitespace at the beginning and end of myString

def case_descriptions(case_soup):
    descriptions = {}
    for rs in ...:
        desc = {}
        for interp in ...:
            fieldName = ...
            fieldValue = ...
            desc[fieldName] = ...
        desc["text"] = ... 
        descriptions[rs.attrs['type']] = desc
    return descriptions
print("Case", davis_trial_id, "has these various descriptions in <rs> elements:\n", case_descriptions(davis_trial_soup))


Case t17031013-13 has these various descriptions in <rs> elements:
 {'offenceDescription': {'offenceCategory': 'theft', 'offenceSubcategory': 'grandLarceny', 'text': 'feloniously Stealing 58 Diamonds set in Silver gilt, value 250 l.'}, 'occupation': {'text': 'Coachman'}, 'crimeDate': {'text': '28th of July'}, 'verdictDescription': {'verdictCategory': 'guilty', 'verdictSubcategory': 'guiltyNoDetail', 'plea': 'notGuilty', 'text': 'guilty'}, 'punishmentDescription': {'punishmentCategory': 'miscPunish', 'punishmentSubcategory': 'brandingOnCheek', 'text': '[Branding. See summary.]'}}


**Question 4.4:** Once you get this far, you've learned how to parse most of the data provided in the Old Bailey XML. Now, think about the data you have access to for each case, and complete the following function to create a dataframe describing all of the trials in `trials["hits"][:100]`.

One easy way to do this is to make a list of dictionaries, and pass this to `pd.DataFrame`, as in the following example:

In [32]:
pd.DataFrame([{"x": 1, "y": 10}, {"y": 12, "z": 111}])

| | x | y | z |
|---|---|---|---|
| 0 | 1.0 | 10 | NaN |
| 1 | NaN | 12 | 111.0 |


Consider the questions you may want to ask about the data, and complete the following function to put it all together in a DataFrame:

*Note!:* This is not easy. Take it one step at a time, initially just making a DataFrame with one column, and building up from there. I made several little errors while writing it up: take note of capitalization in property names like `offenceSubcategory`, and remember to spell it 'offence', not 'offense' (i.e. the British way). **The rewarding thing** is that once you've written this up, you'll be able to easily add columns to use in your analysis as you come up with new questions to ask about the case data.

*If you are stuck*, try looking at the *data output* of the solutions (avoid looking at the code until you've worked through it), picking *one* column, and thinking of how you can answer that with the functions you've made or learned already.
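As an illustration of working on one group of columns at a time, the sketch below computes only the four gender flags from a list of people dictionaries shaped like the output shown earlier for the Davis trial. The helper name `gender_flags` is hypothetical, and this is just one way to organize the logic, not necessarily how the solution file does it.

```python
def gender_flags(people):
    """Return the four any_<role>_<gender> flags for a list of person dicts."""
    flags = {
        "any_defendant_female": False,
        "any_defendant_male": False,
        "any_victim_female": False,
        "any_victim_male": False,
    }
    # Map the "type" values seen in the data onto the role used in column names.
    roles = {"defendantName": "defendant", "victimName": "victim"}
    for person in people:
        role = roles.get(person.get("type"))
        gender = person.get("gender")
        # Skip people with another role, or with a missing or unexpected gender.
        if role is not None and gender in ("female", "male"):
            flags[f"any_{role}_{gender}"] = True
    return flags

# The two people listed in the earlier output for the Davis trial.
people = [
    {"type": "defendantName", "surname": "Davis", "given": "Samuel", "gender": "male"},
    {"type": "victimName", "victimNameLabel": "Lady", "surname": "Herbert",
     "given": "Catherine", "gender": "female"},
]
flags = gender_flags(people)
print(flags)
```

Once a piece like this works on its own, it can be merged into the row dictionary with `row.update(flags)`.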

In [33]:
from math import nan
def table_of_cases(xml_file_names):
    rows = []
    for doc_id in xml_file_names:
        # use a separate name for the loop variable so the open file handle does not shadow it
        with open(f'data/old-bailey/old-bailey-{doc_id}.xml', "r") as xml_file:
            case = BeautifulSoup(xml_file)
        people = ...
        date = ...
        descriptions = ...
        row = {
            "date": date,
            "id": case.div1.attrs["id"],
            "text": " ".join(case.text.split()), # split on all whitespace, then join on " ", to remove long sequences of whitespace
            "any_defendant_female": False,
            "any_defendant_male": False,
            "any_victim_female": False,
            "any_victim_male": False,
        }
        if "offenceDescription" in descriptions:
            row["offenceText"] = descriptions["offenceDescription"].get("text", nan) # `dictionary.get(key, default)` is the same as `dictionary[key] if key in dictionary else default`
            row["offenceCategory"] = ...
            row["offenceSubcategory"] = ...
        if "verdictDescription" in descriptions:
            row["verdictText"] = ...
            row["verdictCategory"] = ...
        if "punishmentDescription" in descriptions:
            row["punishmentText"] = ...
            row["punishmentCategory"] = ...
            row["punishmentSubcategory"] = ...
        for person in people:
            if person.get("type") == "defendantName" and person.get("gender") == "female":
                row["any_defendant_female"] = True
            if person.get("type") == ...
            if person.get("type") == ...
            if person.get("type") == ...
        rows.append(row)
    return pd.DataFrame(rows)

table_of_cases(doc_ids[:30])

| | date | id | text | any_defendant_female | any_defendant_male | any_victim_female | any_victim_male | offenceText | offenceCategory | offenceSubcategory | verdictText | verdictCategory | punishmentText | punishmentCategory | punishmentSubcategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17540116 | f17540116-1 | THE PROCEEDINGS ON THE King's Commissions of t... | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 17540116 | t17540116-1 | 80. Hannah Ash , spinster , was indicted for s... | True | False | False | False | stealing one linen shift, one cotton gown, one... | theft | grandLarceny | pleaded guilty | guilty | [Transportation. See summary.] | transport | transportNoDetail |
| 2 | 17540116 | t17540116-2 | 81. (M.) Peter Foreman and Mary his wife were ... | True | True | False | False | stealing one pair of linen sheets, value 6 d. ... | theft | theftFromPlace | Guilty | guilty | [Branding. See summary.] | miscPunish | branding |
| 3 | 17540116 | t17540116-3 | 82. (M.) Sarah Williams , spinster , was indic... | True | False | False | False | stealing one brass kettle, value 10 s. | theft | grandLarceny | Guilty | guilty | [Transportation. See summary.] | transport | transportNoDetail |
| 4 | 17540116 | t17540116-4 | 83. (M.) Elizabeth wife of Joseph Kempster , w... | True | False | False | False | stealing one feather-bed, value 14 s. one bols... | theft | theftFromPlace | Guilty | guilty | [Transportation. See summary.] | transport | transportNoDetail |
| 5 | 17540116 | t17540116-5 | 84. (M.) John Allen was indicted for stealing ... | False | True | False | False | stealing one linen shirt, value 1 s. 6 d. | theft | grandLarceny | Acquitted | notGuilty | NaN | NaN | NaN |
| 6 | 17540116 | t17540116-6 | 85. (M.) William Derter was indicted for steal... | False | True | False | False | stealing 70 lb. weight of rags, value 4 s. | theft | grandLarceny | Acquitted | notGuilty | NaN | NaN | NaN |
| 7 | 17540116 | t17540116-7 | 86. (M.) William Ford was indicted for stealin... | False | True | False | False | stealing one mare, of a black colour, value 12 l. | theft | animalTheft | Guilty | guilty | Death | death | deathNoDetail |
| 8 | 17540116 | t17540116-8 | 87. (L.) Anne Beezley , spinster , was indicte... | True | False | False | False | stealing a set of green bed curtains and valle... | theft | theftFromPlace | Guilty | guilty | [Transportation. See summary.] | transport | transportNoDetail |
| 9 | 17540116 | t17540116-9 | 88. Robert Barber was indicted for that he, to... | False | True | False | False | forging a certain acquittance for the sum of 1... | deception | forgery | acquitted | notGuilty | [Transportation. See summary.] | transport | transportNoDetail |


Note that the first row (id `f17540116-1`), which was front matter in the printed Old Bailey Proceedings, has NaN in the `offenceCategory` column: it is not a criminal trial record. You could easily eliminate rows like this that are not trial records.
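Dropping the non-trial rows can be sketched as follows, assuming that a missing `offenceCategory` is a good enough signal that a document is not a trial record. The small DataFrame here is a hand-built stand-in that reuses three ids and categories from the output above, and the variable names are illustrative.

```python
import pandas as pd
from math import nan

# Stand-in for the result of table_of_cases, with only a few columns.
cases = pd.DataFrame([
    {"id": "f17540116-1", "offenceCategory": nan, "verdictCategory": nan},
    {"id": "t17540116-1", "offenceCategory": "theft", "verdictCategory": "guilty"},
    {"id": "t17540116-5", "offenceCategory": "theft", "verdictCategory": "notGuilty"},
])

# Keep only rows that have an offence category, then renumber the index.
trials_only = cases.dropna(subset=["offenceCategory"]).reset_index(drop=True)
print(trials_only)
```

An alternative under a different assumption would be to keep only ids that start with `t`, for example with `cases[cases["id"].str.startswith("t")]`.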

Phew, that's it! Now you know how to parse through XML files using Beautiful Soup and web scrape using the `requests` library! This was a long lab, so pat yourself on the back for working through it. It will help you on problem set 3 enormously.

You will note that the lab does not have much to fill in, relative to the solution file. Working with the Old Bailey Online data is challenging, so **ask questions when you don't quite get what's happening**. You can test your understanding by extending the solutions with more columns.
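As one small example of adding columns, the sketch below derives `year` and `month` from a `date` column holding values like `17540116`, as in the output above. It assumes the dates are eight-digit `YYYYMMDD` values (stored as strings or integers); the column names `year` and `month` are illustrative.

```python
import pandas as pd

# Stand-in for the result of table_of_cases, with only the id and date columns.
cases = pd.DataFrame({
    "id": ["t17540116-1", "t17540116-7"],
    "date": ["17540116", "17540116"],
})

# Slice the YYYYMMDD string rather than parsing it into a timestamp.
date_text = cases["date"].astype(str)
cases["year"] = date_text.str[:4].astype(int)
cases["month"] = date_text.str[4:6].astype(int)
print(cases)
```

With a column like this in place, a question such as "how many trials were held each year?" becomes `cases["year"].value_counts()`.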

## Bibliography

 - All files from the Old Bailey API: https://www.oldbaileyonline.org/obapi/
 - ElementTree information adapted from Driscoll, Mike. (2013, April). Python 101 – Intro to XML Parsing with ElementTree. https://www.blog.pythonlibrary.org/2013/04/30/python-101-intro-to-xml-parsing-with-elementtree/
 - Web scraping code adapted from the MEDST-250 notebook developed by Tejas Priyadarshan. https://github.com/ds-modules/MEDST-250/tree/master/04%20-%20XML_Day_1
 - Image source: https://www.researchgate.net/publication/257631377_Efficient_XML_Path_Filtering_Using_GPUs
 - Beautiful Soup 4 [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

----
Notebook developed by: Jason Jiang, Iland Leigh, Violet Yao, and Wilson Berkow; adjustment to new Old Bailey Online API by Jon Marshall 2024

Data Science Modules: http://data.berkeley.edu/education/modules