# [LEGALST-123] Lab 17: Parsing XML Data

This lab will cover parsing XML and attribute lookup, Beautiful Soup, and web scraping.

*Estimated Time: 45 Minutes *

### Topics Covered:
- XML syntax
- locating content with Beautiful Soup
- Web scraping

### Table of Contents

[The Data](#section-data)<br>
1 - [Web Scraping](#section-1)<br>
2 - [XML Syntax](#section-2)<br>
3 - [Using Beautiful Soup to parse XML](#section-3)<br>
4 - [Putting it all in a dataframe](#section-4)<br>

**Dependencies:**

In [168]:
import pandas as pd
import xml.etree.cElementTree as ET #XML Parser
from lxml import etree #ElementTree and lxml allow us to parse the XML file.
import requests #make request to server
import time #pause loop
from bs4 import BeautifulSoup

----
## The Data<a id='section-data'></a>

In this notebook, you'll be working with XML files from the Old Bailey API (https://www.oldbaileyonline.org/obapi/). These files contain the proceedings of all trials from 1674 to 1913. For this lab, we'll go through the trials from 1754-1756 and 1824-1826. XML (eXtensible Markup Language) provides a hierarchical representation of data contained within different tags and nodes. We'll go over XML syntax later. We will learn how to parse through these XML files from Old Bailey and grab information from sections of an XML file.

---

----
## Section 1: Web Scraping<a id='section-1'></a>

First we will go through how to parse one XML file. The Old Bailey API has a total of **197751** cases. Fortunately, we are only going to use the ones from 1754-1756 and 1824-1826, but that still only narrows the number of cases to 6506! 

Don't worry though, you're not going to manually download each case yourself. This is where web scraping comes into play. With web scraping, we can automate data collection to get all 6506 cases. 

Before we start scraping, we need to know how `requests` works. The `requests` library gets (`.get`!) you a response object from a web server and will automatically decode the content from the server, from which you can use `.text` to see the document! Requests through the Old Bailey API will return an XML file, which we can then write as a file and save.

Let's take a look at all of the terms we can use to choose the specific cases we want. We use `.json()` here since the parameters are stored as a JSON object.

In [169]:
rjson = requests.get('http://www.oldbaileyonline.org/obapi/terms').json()
rjson 

[{'name': 'trialtext', 'type': 'text'},
 {'name': 'defgen',
  'type': 'select',
  'terms': ['female', 'indeterminate', 'male']},
 {'name': 'offcat',
  'type': 'select',
  'terms': ['breakingPeace',
   'damage',
   'deception',
   'kill',
   'miscellaneous',
   'royalOffences',
   'sexual',
   'theft',
   'violentTheft']},
 {'name': 'offsubcat',
  'type': 'select',
  'terms': ['',
   'animalTheft',
   'arson',
   'assault',
   'assaultWithIntent',
   'assaultWithSodomiticalIntent',
   'bankrupcy',
   'barratry',
   'bigamy',
   'burglary',
   'coiningOffences',
   'concealingABirth',
   'conspiracy',
   'embezzlement',
   'extortion',
   'forgery',
   'fraud',
   'gameLawOffence',
   'grandLarceny',
   'habitualCriminal',
   'highwayRobbery',
   'housebreaking',
   'illegalAbortion',
   'indecentAssault',
   'infanticide',
   'keepingABrothel',
   'kidnapping',
   'libel',
   'mail',
   'manslaughter',
   'murder',
   'other',
   'perjury',
   'pervertingJustice',
   'pettyLarceny',
   

If you wanted to explore the full list in your web browser, click [this link](https://www.oldbaileyonline.org/obapi/terms). 

Now that you've had a chance to look through some of the terms, let's see how to grab the specific XML files.

Clicking the URL below returns a JSON object of the number of IDs and the frequency of each term in which every trial contains the term "sheffield" and the offence categrory "deception" from June 14th, 1847 onward. Also, each trial ID that satisfies the terms is returned; the count parameter in this case returns 10 trial IDs, but if left unspecified, the API will return a maximum count of 1000 IDs. 

https://www.oldbaileyonline.org/obapi/ob?term0=trialtext_sheffield&term1=offcat_deception&term2=fromdate_18470614&breakdown=offsubcat&count=10&start=0

Although the terms for time are listed as numbers, the format for the term is
`fromdate_(starting date)` and `todate_(ending date)` without the parentheses.

**Question 1.1:** Use requests.get(...) to get the all trial IDs between the years 1754 and 1756 and return it as a JSON object.

In [170]:
trials = requests.get('https://www.oldbaileyonline.org/obapi/ob?term0=fromdate_17540116&term1=todate_17561208&&start=0').json()


Now, lets pick some trials from `trial['hits']`, so we have a list of IDs we can work with. 

**Question 1.2:** Select the first 10 trials by splicing through the list that we retrieved from the previous cell.

In [171]:
first_10 = trials['hits'][:10]


Using the trial IDs from the previous cell, we are going to format the URL in a way so that we can get the XML file for each trial. In order to get the XML file using the Old Bailey API, we must follow this URL format:

<p style="text-align: center;">`http://www.oldbaileyonline.org/obapi/text?div=(enter trial ID here without parenthesis)`  </p>

For example, http://www.oldbaileyonline.org/obapi/text?div=t16740429-1 gives you the link to the XML file of the first proceeding in the database.


**Question  1.3:** Get the XML file of the first trial in first_10. A successful `.get` request returns `<Response [200]>`.

In [172]:
first_trial_id = first_10[0]
url = 'http://www.oldbaileyonline.org/obapi/text?div={}'.format(first_10[0])
response = requests.get(url)
response

<Response [200]>

Run the next cell to see the XML format of the text! 

In [173]:
print(response.text)

<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t17540116-1">
               <interp inst="t17540116-1" type="collection" value="BAILEY"></interp>
               <interp inst="t17540116-1" type="year" value="1754"></interp>
               <interp inst="t17540116-1" type="uri" value="sessionsPapers/17540116"></interp>
               <interp inst="t17540116-1" type="date" value="17540116"></interp>
               <join result="criminalCharge" id="t17540116-1-off2-c29" targOrder="Y" targets="t17540116-1-defend30 t17540116-1-off2 t17540116-1-verdict4"></join>
         
               <p>80. 
               
                  <persName id="t17540116-1-defend30" type="defendantName">
                     Hannah 
                     Ash 
                  <interp inst="t17540116-1-defend30" type="surname" value="Ash"></interp>
                     <interp inst="t17540116-1-defend30" type="given" value="Hannah"></interp>
                     <interp inst="t17540116-1-defe

We can save the XML file:

In [174]:
with open(f'data/old-bailey/old-bailey-{first_trial_id}.xml', 'w') as file:
    file.write(response.text)

Now we'll get trial `t17031013-13` specifically, for the examples below:

In [175]:
davis_trial_id = 't17031013-13'
response = requests.get(f'http://www.oldbaileyonline.org/obapi/text?div={davis_trial_id}')
with open(f'data/old-bailey/old-bailey-{davis_trial_id}.xml', 'w') as file:
    file.write(response.text)

### Challenge: Scraping all trials from 1754 - 1756

Now that you know how to find the trial IDs for certain parameters as well as get an XML file using `requests.get(some_url)`, iterate through each ID in the list of trials (use `trials['hits']` for the list of IDs) we got from 1754-1756 earlier. You can choose how many trials you want to save.

In [176]:
for trial in trials['hits'][:30]:
    #format URL
    url =  'http://www.oldbaileyonline.org/obapi/text?div={}'.format(trial)
    print(url)
    #get text from URL
    text = requests.get(url).text
    #save the file
    with open('data/old-bailey/old-bailey-' + trial + '.xml', 'w') as file:
        file.write(text)
    #one second pause so servers aren't overloaded
    time.sleep(1)

http://www.oldbaileyonline.org/obapi/text?div=t17540116-1
http://www.oldbaileyonline.org/obapi/text?div=t17540116-2
http://www.oldbaileyonline.org/obapi/text?div=t17540116-3
http://www.oldbaileyonline.org/obapi/text?div=t17540116-4
http://www.oldbaileyonline.org/obapi/text?div=t17540116-5
http://www.oldbaileyonline.org/obapi/text?div=t17540116-6
http://www.oldbaileyonline.org/obapi/text?div=t17540116-7
http://www.oldbaileyonline.org/obapi/text?div=t17540116-8
http://www.oldbaileyonline.org/obapi/text?div=t17540116-9
http://www.oldbaileyonline.org/obapi/text?div=t17540116-10
http://www.oldbaileyonline.org/obapi/text?div=t17540116-11
http://www.oldbaileyonline.org/obapi/text?div=t17540116-12
http://www.oldbaileyonline.org/obapi/text?div=t17540116-13
http://www.oldbaileyonline.org/obapi/text?div=t17540116-14
http://www.oldbaileyonline.org/obapi/text?div=t17540116-15
http://www.oldbaileyonline.org/obapi/text?div=t17540116-16
http://www.oldbaileyonline.org/obapi/text?div=t17540116-17
http:/

You can check if you saved the XML files by executing the cell below!

In [177]:
!ls data/old-bailey/

old-bailey-t17031013-13.xml old-bailey-t17540116-23.xml
old-bailey-t17540116-1.xml  old-bailey-t17540116-24.xml
old-bailey-t17540116-10.xml old-bailey-t17540116-25.xml
old-bailey-t17540116-11.xml old-bailey-t17540116-26.xml
old-bailey-t17540116-12.xml old-bailey-t17540116-27.xml
old-bailey-t17540116-13.xml old-bailey-t17540116-28.xml
old-bailey-t17540116-14.xml old-bailey-t17540116-29.xml
old-bailey-t17540116-15.xml old-bailey-t17540116-3.xml
old-bailey-t17540116-16.xml old-bailey-t17540116-30.xml
old-bailey-t17540116-17.xml old-bailey-t17540116-4.xml
old-bailey-t17540116-18.xml old-bailey-t17540116-5.xml
old-bailey-t17540116-19.xml old-bailey-t17540116-6.xml
old-bailey-t17540116-2.xml  old-bailey-t17540116-7.xml
old-bailey-t17540116-20.xml old-bailey-t17540116-8.xml
old-bailey-t17540116-21.xml old-bailey-t17540116-9.xml
old-bailey-t17540116-22.xml


This cell will show you the XML file.

In [178]:
!cat data/old-bailey/old-bailey-t17540116-1.xml

<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t17540116-1">
               <interp inst="t17540116-1" type="collection" value="BAILEY"></interp>
               <interp inst="t17540116-1" type="year" value="1754"></interp>
               <interp inst="t17540116-1" type="uri" value="sessionsPapers/17540116"></interp>
               <interp inst="t17540116-1" type="date" value="17540116"></interp>
               <join result="criminalCharge" id="t17540116-1-off2-c29" targOrder="Y" targets="t17540116-1-defend30 t17540116-1-off2 t17540116-1-verdict4"></join>
         
               <p>80. 
               
                  <persName id="t17540116-1-defend30" type="defendantName">
                     Hannah 
                     Ash 
                  <interp inst="t17540116-1-defend30" type="surname" value="Ash"></interp>
                     <interp inst="t17540116-1-defend30" type="given" value="Hannah"></interp>
                     <interp inst="t

## Section 2: XML Syntax<a id='section-2'></a>

First, we'll go over the syntax of a XML file. The basic unit of XML code is called an "element" or "node" and has a start and ending tag. The tags for each element look something like this:

<p style="text-align: center;"> `<exampletag>some text</exampletag>`  </p>

Run the next cell to look at the XML file of one of the cases from the OldBailey API!

In [179]:
# use requests to get an example of xml
example = requests.get(f'https://www.oldbaileyonline.org/obapi/text?div={davis_trial_id}')
print(example.text)

<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t17031013-13">
               <interp inst="t17031013-13" type="collection" value="BAILEY"></interp>
               <interp inst="t17031013-13" type="year" value="1703"></interp>
               <interp inst="t17031013-13" type="uri" value="sessionsPapers/17031013"></interp>
               <interp inst="t17031013-13" type="date" value="17031013"></interp>
               <join result="criminalCharge" id="t17031013-13-off60-c52" targOrder="Y" targets="t17031013-13-defend52 t17031013-13-off60 t17031013-13-verdict64"></join>
         
               <p>
            
                  <persName id="t17031013-13-defend52" type="defendantName">
                  Samuel 
                  Davis
               <interp inst="t17031013-13-defend52" type="surname" value="Davis"></interp>
                     <interp inst="t17031013-13-defend52" type="given" value="Samuel"></interp>
                     <interp inst="t17031013-13-d

The `interp` tags at the beginning of the file are elements that don't have any plain text content. Note that elements may possibly be empty and not contain any text (i.e. `interp` elements mentioned earlier). If the element is empty, the tag may follow a format that looks similar to `<exampletag/>`, which is equivalent to `<exampletag></exampletag>`.

Elements may also contain other elements, which we call "children". Most children are indented, but the indents aren't necessary in XML and are used for clarity to show nesting. For example, if we go down to `<persName id="t17540116-4-defend46" type="defendantName">` , we see that the `rs` tag is a child of `persName`. We will explore about children in XML more in the next section. 

Lastly, elements may have attributes, which are in the format `<exampletag name_of_attribute="somevalue">`. Attributes are designed to store data related to a specific elements. Attributes **must** follow the quotes format (`name = "value"`). As you can tell, in this XML file, attributes are everywhere!

-----
**Question 2.1:** What was the verdict of this case? Was there a punsihment and if so, what was it? List both and state whether you found it as plain text content or as an attribute.

Write your answer here :

Verdict: guilty, plain text content

Punishment: brandingOnCheek, attribute

----
## Section 3: Using Beautiful Soup to parse XML<a id='section-3'></a>

Now that we know what the syntax and structure of an XML file, let's figure out how to parse through one! We are going to load the same file from the second section and use Beautiful Soup to navigate through elements in this file. 

First, we need to import the file into a Beautiful Soup instance. 

In [180]:
xml_file = f'data/old-bailey/old-bailey-{davis_trial_id}.xml'
xml_file = open(xml_file).read()
davis_trial_soup = BeautifulSoup(xml_file)

We can examine `davis_trial_soup` using `.contents`, which puts all children of a tag in a list.

In [181]:
davis_trial_soup.contents

['xml version="1.0" encoding="UTF-8"?',
 <html><body><div1 id="t17031013-13" type="trialAccount">
 <interp inst="t17031013-13" type="collection" value="BAILEY"></interp>
 <interp inst="t17031013-13" type="year" value="1703"></interp>
 <interp inst="t17031013-13" type="uri" value="sessionsPapers/17031013"></interp>
 <interp inst="t17031013-13" type="date" value="17031013"></interp>
 <join id="t17031013-13-off60-c52" result="criminalCharge" targets="t17031013-13-defend52 t17031013-13-off60 t17031013-13-verdict64" targorder="Y"></join>
 <p>
 <persname id="t17031013-13-defend52" type="defendantName">
                   Samuel 
                   Davis
                <interp inst="t17031013-13-defend52" type="surname" value="Davis"></interp>
 <interp inst="t17031013-13-defend52" type="given" value="Samuel"></interp>
 <interp inst="t17031013-13-defend52" type="gender" value="male"></interp>
 </persname>
             , of the Parish of <placename id="t17031013-13-defloc59">St. James Westmins

We notice that all information we care about is contained within `<div>` and `</div1>` tags, so we navigate to it. The simplest way to navigate the parse tree is to say the name of the tag you want (`.`). In this case, we want to access div1 under body tag, which is under html tag.

In [182]:
body = davis_trial_soup.html.body.div1

We can now start working down the tree! With the body, we can find each child of the body by printing the tags. This will also help us for future reference, if we every want to go through other children in the XML file.

In [183]:
for child in body.children:
    if child.name:
        print(child.name)

interp
interp
interp
interp
join
p
p


Now that we have a list of children to work with let's select one using `.`. Using `.` navigates through the hierarchical structure of XML and helps us keep track of the path we are taking through this file.

In [184]:
choose_p = body.p
for child in choose_p.children:
    if child.name:
        print(child.name)

persname
placename
interp
interp
join
rs
persname
rs
join
rs
join
rs


This isn't very helpful, since we're still left with a bunch of tags and on top of that, we have a lot of repeating tags and names. Let's choose `placename` as our next tag and see what happens.

In [185]:
place_name = body.p.placename
for child in place_name.children:
    if child.name:
        print(child.name)

Nothing was printed, so it looks like we hit the end! Let's use `.string` to examine the data in this element, following the `.` path we used to get here.

In [186]:
print(body.p.placename.string)

St. James Westminster


**Question 2.1:** Find the defendant's name by traversing through the correct elements. You can check your answer with printed XML using `soup.contents`

You may find `body.p.persname.string` returns None. It is becasue if a tag, `body.p.persname.string` in this case, contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None. Which functions could help us locate the name instead?

In [187]:
#SOLUTION
print(body.p.persname.contents[0])


                  Samuel 
                  Davis
               


**Question 2.2:** Since the textual data is pretty messy in the XML files of these proceedings, where do you think the data you need might be held and how might you go about extracting this data? 

*Write your response here*

----
## Section 4: Putting it all in a dataframe<a id='section-4'></a>

Now that we have a bunch of XML files and know how to parse through them to extract data, let's put the data from the XML files into a dataframe. Take a look at the XML for [this trial](https://www.oldbaileyonline.org/obapi/text?div=t17031013-13) (and even better, look at what is or isn't consistent between that one and some others), and think about the structure of the data. How would you identify the people involved in a case? How would you identify their roles (witness/defendant/victim/other), or their genders? What can you learn about the alleged offence?

*Note:* Some cases have multiple defendants, multiple victims or multiple witnesses; however, most cases only have at most one of each. You can represent this in a dataframe by having $N$ columns for each property of a defendant, victim, etc., but this results in many many empty cells, and may not be amenable to analysis for the questions you come up with.

Think about the kinds of questions you may want to ask about this data, and refer to the XML for how you might answer them. For example, you may be interested in

- the words used specifically in describing the crime (notice that the text specifically between `<rs id="..." type="offenceDescription">` and `</rs>` gives you this)

- whether any victim was female

- whether any defendant was female

- the `category` (or `subCategory`) of the offense, etc.

- the entire text of the trial (sans tags)

These are questions that can be answered for most if not all cases, so they make good candidates for names of columns.

**Question 4.1**: Start by completing the following function to get the date of the case and return it.

In [188]:
def case_date(case_soup):
    for element in case_soup.body.div1.contents:
        if element.name == 'interp':
            if element.attrs['type'] == 'date':
                return element.attrs['value']
            
print("Case", davis_trial_id, "happened on", case_date(davis_trial_soup))

Case t17031013-13 happened on 17031013


**Question 4.2**: Complete the following function finding every person in a trial, and returning a list of dictionaries of their attributes (e.g. `[{"surname": "FINCH", "given": "JOHN", "gender": "male", "type": "witnessName"}]`). Test it on `davis_trial_soup` used before. **Note:** If you use `find_all`, specify the tag name in lowercase, as beautifulsoup lowercases all tag names.

In [189]:
def people_in_case(case_soup):
    people = []
    for persName in case_soup.body.div1.find_all('persname'):
        person = {}
        person["type"] = persName.attrs.get("type") # `thing.get(key)` is like `thing[key]` but returns None if key is not in x instead of raising an exception
        for interp in persName.find_all('interp'):
            fieldName = interp.attrs["type"]
            fieldValue = interp.attrs["value"]
            person[fieldName] = fieldValue
        people.append(person)
    return people

print("Case", davis_trial_id, "describes these people:\n", people_in_case(davis_trial_soup))


Case t17031013-13 describes these people:
 [{'type': 'defendantName', 'surname': 'Davis', 'given': 'Samuel', 'gender': 'male'}, {'type': 'victimName', 'surname': 'Herbert', 'given': 'Catherine', 'gender': 'female'}]


**Question 4.3**: Complete the following function to find the `offenseDescription` and `verdictDescription` in a trial. Think about how [the XML](https://www.oldbaileyonline.org/obapi/text?div=t17031013-13) expresses the offenseDescription and verdictDescription, and see if you can write the code without specifically looking for the labels "offenseDescription" and "verdictDescription" (i.e. so that it will work even if a case came up with something like `<rs type="sentencingDescription">`). Get the category, subCategory and textual description of the offense:

In [190]:
def case_descriptions(case_soup):
    descriptions = {}
    for rs in case_soup.find_all('rs'):
        desc = {}
        for interp in rs.find_all('interp'):
            fieldName = interp.attrs["type"]
            fieldValue = interp.attrs["value"]
            desc[fieldName] = fieldValue
        desc["text"] = rs.text.strip() # myString.strip() removes the whitespace at the beginning and end of myString
        descriptions[rs.attrs['type']] = desc
    return descriptions
print("Case", davis_trial_id, "has these various descriptions in <rs> elements:\n", case_descriptions(davis_trial_soup))


Case t17031013-13 has these various descriptions in <rs> elements:
 {'offenceDescription': {'offenceCategory': 'theft', 'offenceSubcategory': 'grandLarceny', 'text': 'feloniously Stealing 58 Diamonds set in Silver gilt, value 250 l.'}, 'occupation': {'text': 'Coachman'}, 'crimeDate': {'text': '28th of July'}, 'verdictDescription': {'verdictCategory': 'guilty', 'text': 'guilty'}, 'punishmentDescription': {'punishmentCategory': 'miscPunish', 'punishmentSubcategory': 'brandingOnCheek', 'text': '[Branding. See summary.]'}}


**Question 4.4:** Once you get this far, you've learned how to parse most of the data provided in the Old Bailey XML. Now, think about the data you now have access to for each case, and complete the following function creating a dataframe describing all of trials in `trials["hits"][:100]`.

One easy way to do this is to make a list of dictionaries, and pass this to `pd.DataFrame`, as in the following example:

In [191]:
pd.DataFrame([{"x": 1, "y": 10}, {"y": 12, "z": 111}])

Unnamed: 0,x,y,z
0,1.0,10,
1,,12,111.0


Consider the questions you may want to ask about the data, and complete the following function to put it all together in a DataFrame:

*Note!:* This is not easy. Take it one step at a time, initially just making one a DataFrame with one column, and building up from there. I made several little errors while writing it up (take note of capitalization in property names like `offenceSubcategory` and to spell it 'offence' not 'offense' (i.e. the British way)). **The rewarding thing** is that once you've written this up, as you come up with new questions to ask about the case data you'll be able to easily add columns to use in your analysis.

*If you are stuck*, try looking at the *data output* of the solutions (avoid looking at the code until you've worked through it), picking *one* column, and thinking of how you can answer that with the functions you've made or learned already.

In [194]:
from math import nan
def table_of_cases(xml_file_names):
    rows = []
    for xml_file in xml_file_names:
        with open(f'data/old-bailey/old-bailey-{xml_file}.xml', "r") as xml_file:
            case = BeautifulSoup(xml_file)
        people = people_in_case(case)
        date = case_date(case)
        descriptions = case_descriptions(case)
        row = {
            "date": date,
            "id": case.div1.attrs["id"],
            "text": " ".join(case.text.split()), # split on all whitespace, then join on " ", to remove long sequences of whitespace
            "any_defendant_female": False,
            "any_defendant_male": False,
            "any_victim_female": False,
            "any_victim_male": False,
        }
        if "offenceDescription" in descriptions:
            row["offenceText"] = descriptions["offenceDescription"].get("text", nan) # `dictionary.get(key, default)` is the same as `dictionary[key] if key in dictionary else default`
            row["offenceCategory"] = descriptions["offenceDescription"].get("offenceCategory", nan)
            row["offenceSubcategory"] = descriptions["offenceDescription"].get("offenceSubcategory", nan)
        if "verdictDescription" in descriptions:
            row["verdictText"] = descriptions["verdictDescription"].get("text", nan)
            row["verdictCategory"] = descriptions["verdictDescription"].get("verdictCategory", nan)
        if "punishmentDescription" in descriptions:
            row["punishmentText"] = descriptions["punishmentDescription"].get("text", nan)
            row["punishmentCategory"] = descriptions["punishmentDescription"].get("punishmentCategory", nan)
            row["punishmentSubcategory"] = descriptions["punishmentDescription"].get("punishmentSubcategory", nan)
        for person in people:
            if person.get("type") == "defendantName" and person.get("gender") == "female":
                row["any_defendant_female"] = True
            if person.get("type") == "defendantName" and person.get("gender") == "male":
                row["any_defendant_male"] = True
            if person.get("victim") == "victimName" and person.get("gender") == "female":
                row["any_victim_female"] = True
            if person.get("victim") == "victimName" and person.get("gender") == "male":
                row["any_victim_male"] = True
        rows.append(row)
    return pd.DataFrame(rows)

table_of_cases(trials["hits"][:30])

Unnamed: 0,date,id,text,any_defendant_female,any_defendant_male,any_victim_female,any_victim_male,offenceText,offenceCategory,offenceSubcategory,verdictText,verdictCategory,punishmentText,punishmentCategory,punishmentSubcategory
0,17540116,t17540116-1,"80. Hannah Ash , spinster , was indicted for s...",True,False,False,False,"stealing one linen shift, one cotton gown, one...",theft,grandLarceny,pleaded guilty,guilty,[Transportation. See summary.],transport,
1,17540116,t17540116-2,81. (M.) Peter Foreman and Mary his wife were ...,True,True,False,False,"stealing one pair of linen sheets, value 6 d. ...",theft,theftFromPlace,Guilty,guilty,[Branding. See summary.],miscPunish,branding
2,17540116,t17540116-3,"82. (M.) Sarah Williams , spinster , was indic...",True,False,False,False,"stealing one brass kettle, value 10 s.",theft,grandLarceny,Guilty,guilty,[Transportation. See summary.],transport,
3,17540116,t17540116-4,"83. (M.) Elizabeth wife of Joseph Kempster , w...",True,False,False,False,"stealing one feather-bed, value 14 s. one bols...",theft,theftFromPlace,Guilty,guilty,[Transportation. See summary.],transport,
4,17540116,t17540116-5,84. (M.) John Allen was indicted for stealing ...,False,True,False,False,"stealing one linen shirt, value 1 s. 6 d.",theft,grandLarceny,Acquitted,notGuilty,,,
5,17540116,t17540116-6,85. (M.) William Derter was indicted for steal...,False,True,False,False,"stealing 70 lb. weight of rags, value 4 s.",theft,grandLarceny,Acquitted,notGuilty,,,
6,17540116,t17540116-7,86. (M.) William Ford was indicted for stealin...,False,True,False,False,"stealing one mare, of a black colour, value 12 l.",theft,animalTheft,Guilty,guilty,Death,death,
7,17540116,t17540116-8,"87. (L.) Anne Beezley , spinster , was indicte...",True,False,False,False,stealing a set of green bed curtains and valle...,theft,theftFromPlace,Guilty,guilty,[Transportation. See summary.],transport,
8,17540116,t17540116-9,"88. Robert Barber was indicted for that he, to...",False,True,False,False,forging a certain acquittance for the sum of 1...,deception,forgery,acquitted,notGuilty,[Transportation. See summary.],transport,
9,17540116,t17540116-10,"89, 90. (M.) Elizabeth Eaton and Catherine Dav...",True,False,False,False,"stealing one steel tobacco box, val. 2 s. and ...",theft,pocketpicking,guilty of felony only,guilty,[Transportation. See summary.],transport,


Phew, that's it! Now you know how to parse through XML files using Beautiful Soup and web scrape using the `requests` library! This was a long lab, so pat yourself on the back for working through it. It will help you on problem set 3 enormously.

If you leaned heavily on the given lab solutions, make sure you understand every line, and **ask questions when you don't**. You can test your understanding by adding extending the solutions with more columns.

## Bibliography

 - All files from Old Bailey API - https://www.oldbaileyonline.org/obapi/
 - ElementTree information adapted from Driscoll, Mike. (2013, April). Python 101 – Intro to XML Parsing with ElementTree.
 https://www.blog.pythonlibrary.org/2013/04/30/python-101-intro-to-xml-parsing-with-elementtree/

 - Web Scraping code adapted from MEDST-250 Notebook developed by Tejas Priyadarshan.
 https://github.com/ds-modules/MEDST-250/tree/master/04%20-%20XML_Day_1
 
 - Image source from https://www.researchgate.net/publication/257631377_Efficient_XML_Path_Filtering_Using_GPUs

----
Notebook developed by: Jason Jiang, Iland Leigh, Violet Yao, and Wilson Berkow

Data Science Modules: http://data.berkeley.edu/education/modules