Code to scrape trial data from https://www.oldbaileyonline.org

In [7]:
import xml.etree.ElementTree as ET  # XML parser (cElementTree was removed in Python 3.9)
import requests  # make requests to web servers
import time  # lets us pause between requests inside the for loop
import json

## Step 1: Choose your parameters


In [22]:
# Possible fields, types, and values for queries. JSON
fields = requests.get('http://www.oldbaileyonline.org/obapi/terms').json()


In [23]:
fields

[{'name': 'trialtext', 'type': 'text'},
 {'name': 'defgen',
  'terms': ['*** OTHER ***,', 'female', 'indeterminate', 'male'],
  'type': 'select'},
 {'name': 'offcat',
  'terms': ['breakingPeace',
   'damage',
   'deception',
   'kill',
   'miscellaneous',
   'royalOffences',
   'sexual',
   'theft',
   'violentTheft'],
  'type': 'select'},
 {'name': 'offsubcat',
  'terms': ['animalTheft',
   'arson',
   'assault',
   'assaultWithIntent',
   'assaultWithSodomiticalIntent',
   'bankrupcy',
   'barratry',
   'bigamy',
   'burglary',
   'coiningOffences',
   'concealingABirth',
   'conspiracy',
   'embezzlement',
   'extortion',
   'forgery',
   'fraud',
   'gameLawOffence',
   'grandLarceny',
   'habitualCriminal',
   'highwayRobbery',
   'housebreaking',
   'illegalAbortion',
   'indecentAssault',
   'infanticide',
   'keepingABrothel',
   'kidnapping',
   'libel',
   'mail',
   'manslaughter',
   'murder',
   'other',
   'perjury',
   'pervertingJustice',
   'pettyLarceny',
   'pettyTre

## Step 2: Get a list of trial IDs based on your chosen parameters*
\*max IDs returned = 1000

http://www.oldbaileyonline.org/obapi/ob

This servlet receives search terms and additional instructions as HTTP query parameters. It returns either a JSON object representing a result set, or a ZIP file of texts from the result set.

Search terms are specified as follows:

    term[term number]=[field]_[term] 

Results are paged as follows:

    start=[result to start from]
    count=[number of results to return] 

It is a good idea to always specify a “count” parameter, as the default is to return all results.
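
The paging parameters above can be sketched as a small generator. This is only an illustration: `page_params` is a hypothetical helper (not part of the Old Bailey API), and the total of 250 results is a made-up number; in practice the total would have to come from a real response.

```python
def page_params(total, count=100):
    """Yield start/count parameter dicts that cover `total` results page by page.

    `page_params` is a hypothetical helper written for this notebook.
    """
    for start in range(0, total, count):
        # the last page may be shorter than `count`
        yield {"start": start, "count": min(count, total - start)}


# 250 is an assumed total, used only to show the paging arithmetic
pages = list(page_params(250, count=100))
print(pages)
```

Each dict could then be merged into the query parameters of one request, with a pause between requests.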

If you would like the returned JSON object to include a frequency table of terms in the results, you can specify the field you would like to examine:

    breakdown=[field] 

If you would like the full text of the results returned in a zip file, you can specify:

    return=zip 

For example:

http://www.oldbaileyonline.org/obapi/ob?term0=trialtext_sheffield&term1=offcat_deception&breakdown=offsubcat&count=10&start=0

will return a JSON object representing a result set where every trial contains the term “sheffield” and has an offence category of “deception”. The object will also include a frequency table of offence subcategories within this result set. The hit list returned will start at the first result and continue for ten results.
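
Rather than typing the query string by hand, the example URL can be assembled from its parts. The sketch below is one possible way to do that with the standard library; `build_search_url` is a hypothetical helper name, and it assumes field names and terms need no encoding beyond what `urlencode` applies.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.oldbaileyonline.org/obapi/ob"


def build_search_url(terms, breakdown=None, count=10, start=0):
    """Build a search URL from (field, term) pairs.

    `build_search_url` is a hypothetical helper, not part of the API.
    """
    # term[term number]=[field]_[term]
    params = [(f"term{i}", f"{field}_{term}") for i, (field, term) in enumerate(terms)]
    if breakdown is not None:
        params.append(("breakdown", breakdown))
    params.append(("count", count))
    params.append(("start", start))
    return BASE_URL + "?" + urlencode(params)


url = build_search_url(
    [("trialtext", "sheffield"), ("offcat", "deception")],
    breakdown="offsubcat",
)
print(url)
```

The printed URL matches the example above and can be passed to `requests.get(url).json()`.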


In [28]:
# request the IDs of trials between 16 January 1754 and 8 December 1756
trials = requests.get('https://www.oldbaileyonline.org/obapi/ob?term0=fromdate_17540116&term1=todate_17561208&start=0').json()

In [29]:
trials

{'hits': ['t17540116-1',
  't17540116-2',
  't17540116-3',
  't17540116-4',
  't17540116-5',
  't17540116-6',
  't17540116-7',
  't17540116-8',
  't17540116-9',
  't17540116-10',
  't17540116-11',
  't17540116-12',
  't17540116-13',
  't17540116-14',
  't17540116-15',
  't17540116-16',
  't17540116-17',
  't17540116-18',
  't17540116-19',
  't17540116-20',
  't17540116-21',
  't17540116-22',
  't17540116-23',
  't17540116-24',
  't17540116-25',
  't17540116-26',
  't17540116-27',
  't17540116-28',
  't17540116-29',
  't17540116-30',
  't17540116-31',
  't17540116-32',
  't17540116-33',
  't17540116-34',
  't17540116-35',
  't17540116-36',
  't17540116-37',
  't17540116-38',
  't17540116-39',
  't17540116-40',
  't17540116-41',
  't17540116-42',
  't17540116-43',
  't17540116-44',
  't17540116-45',
  't17540116-46',
  't17540116-47',
  't17540116-48',
  't17540116-49',
  't17540116-50',
  't17540116-51',
  't17540116-52',
  't17540116-53',
  't17540116-54',
  't17540116-55',
  't1754011

## Step 3: Iterate through list to get text of each trial

http://www.oldbaileyonline.org/obapi/text

This servlet receives the ID of a single trial, and optionally a list of words to highlight, as HTTP query parameters. It returns the XML of the requested trial.

    div=[trial id]
    highlightTerms=[terms to highlight] 
    
For example:

http://www.oldbaileyonline.org/obapi/text?div=t17690112-9&highlightTerms=sheffield+fraud

will return the XML of the trial t17690112-9, with <span> tags around the words “sheffield” and “fraud”. 
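
Once the XML is in hand, it can be read with `ElementTree`. The following is a minimal sketch that parses a shortened sample modeled on the response for trial t17690112-9; it assumes the trial-level metadata sits in `interp` elements whose `inst` attribute equals the trial ID, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# shortened sample modeled on the response for trial t17690112-9
sample = """<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t17690112-9">
  <interp inst="t17690112-9" type="collection" value="BAILEY"></interp>
  <interp inst="t17690112-9" type="year" value="1769"></interp>
  <interp inst="t17690112-9" type="date" value="17690112"></interp>
</div1>"""

# parse from bytes so the encoding declaration is honoured
root = ET.fromstring(sample.encode("utf-8"))
trial_id = root.get("id")

# keep only the interp elements that describe the trial itself
meta = {
    el.get("type"): el.get("value")
    for el in root.iter("interp")
    if el.get("inst") == trial_id
}
print(trial_id, meta)
```

With a live response, `ET.fromstring(response.content)` would take the place of the sample string.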

In [31]:
# single trial example
trial = requests.get('http://www.oldbaileyonline.org/obapi/text?div=t17690112-9&highlightTerms=sheffield+fraud')
trial.text

'<?xml version="1.0" encoding="UTF-8"?>\n<div1 type="trialAccount" id="t17690112-9">\n               <interp inst="t17690112-9" type="collection" value="BAILEY"></interp>\n               <interp inst="t17690112-9" type="year" value="1769"></interp>\n               <interp inst="t17690112-9" type="uri" value="sessionsPapers/17690112"></interp>\n               <interp inst="t17690112-9" type="date" value="17690112"></interp>\n               <join result="criminalCharge" id="t17690112-9-off38-c105" targOrder="Y" targets="t17690112-9-defend123 t17690112-9-off38 t17690112-9-verdict42"></join>\n               <join result="criminalCharge" id="t17690112-9-off43-c105" targOrder="Y" targets="t17690112-9-defend123 t17690112-9-off43 t17690112-9-verdict42"></join>\n         \n               <p>90. (L.) \n               \n                  <persName id="t17690112-9-defend123" type="defendantName">\n                     Matthew \n                     Skinner \n                  <interp inst="t176901

In [38]:
print(trial.text)

<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t17690112-9">
               <interp inst="t17690112-9" type="collection" value="BAILEY"></interp>
               <interp inst="t17690112-9" type="year" value="1769"></interp>
               <interp inst="t17690112-9" type="uri" value="sessionsPapers/17690112"></interp>
               <interp inst="t17690112-9" type="date" value="17690112"></interp>
               <join result="criminalCharge" id="t17690112-9-off38-c105" targOrder="Y" targets="t17690112-9-defend123 t17690112-9-off38 t17690112-9-verdict42"></join>
               <join result="criminalCharge" id="t17690112-9-off43-c105" targOrder="Y" targets="t17690112-9-defend123 t17690112-9-off43 t17690112-9-verdict42"></join>
         
               <p>90. (L.) 
               
                  <persName id="t17690112-9-defend123" type="defendantName">
                     Matthew 
                     Skinner 
                  <interp inst="t17690112-9-defend123"

In [40]:


import os

# make sure the output directory exists
os.makedirs('data/old-bailey', exist_ok=True)

# iterate through the first 100 trial IDs
for trial_id in trials['hits'][:100]:

    # build url
    url = 'http://www.oldbaileyonline.org/obapi/text?div={}'.format(trial_id)

    # get the response
    res = requests.get(url).text

    # create a file name, e.g. old-bailey-t17540116-1.xml
    fname = 'data/old-bailey/old-bailey-' + trial_id + '.xml'

    # save the file
    with open(fname, 'w', encoding='utf-8') as f:
        f.write(res)

    # pause for a second so we don't overload their servers
    time.sleep(1)



In [41]:
# list the saved files (portable alternative to `!ls`, which fails on Windows)
import os
os.listdir('data/old-bailey')


In [42]:
# print one saved trial (portable alternative to `!cat`, which fails on Windows)
with open('data/old-bailey/old-bailey-t17540116-1.xml', encoding='utf-8') as f:
    print(f.read())


## Bibliography
- Text explaining Old Bailey API adapted from https://www.oldbaileyonline.org/static/API.jsp
- Web scraping code adapted from code by Chris Hench: https://github.com/ds-modules/MEDST-250/tree/master/04%20-%20XML_Day_1

Notebook by Keeley Takimoto