# Old Bailey Classification: PSET 3

1. [Introduction](#intro)
2. [Getting the Data](#data)
3. [Descriptive Statistics](#stats)
4. [Discussion Questions](#dq)

In [1]:
import requests
from bs4 import BeautifulSoup
import xmltodict
from datascience import *
from urllib.request import urlopen
import re
import numpy as np
from utils import *

## 1. Introduction <a id='intro'></a>
In this assignment, you will learn how to apply machine learning techniques to a corpus of textual data. The source for this assignment is the Old Bailey Corpus. This dataset contains the court records of the Old Bailey, London's central criminal court, from 1674 to 1913. You will:
1. Download data from the corpus and process the text into data.
2. Define a dictionary and a set of rules for using the words for classification.
3. Apply your algorithm to classify documents and assess accuracy.

Make sure to start early and ask lots of questions! The dataset, along with other publicly available data, is available at: https://www.oldbaileyonline.org/obapi/
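
As a rough illustration of steps 2 and 3 above, here is a minimal sketch of a dictionary-based classifier. Everything in it is an assumption made for illustration: the `KEYWORDS` word lists, the category names, the "most keyword hits wins" rule, and the helper names `tokenize`, `classify`, and `accuracy` are hypothetical and are not the assignment's intended solution. Only the first sample text comes from the corpus; the other two are invented.

```python
import re

# Hypothetical keyword dictionary: category -> indicative words.
# These word lists are placeholders, not the assignment's answer.
KEYWORDS = {
    "theft": {"stealing", "stole", "goods", "value"},
    "kill": {"murder", "killed", "wound", "death"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def classify(text, keywords=KEYWORDS, default="unknown"):
    """Rule: pick the category whose keywords occur most often in the text."""
    tokens = tokenize(text)
    scores = {cat: sum(tok in words for tok in tokens)
              for cat, words in keywords.items()}
    best = max(scores, key=scores.get)
    # If no keyword matched at all, fall back to the default label.
    return best if scores[best] > 0 else default

def accuracy(predicted, actual):
    """Fraction of documents whose predicted label matches the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# (text, true label) pairs; the second and third texts are invented examples.
samples = [
    ("112. Thomas Biggs , was indicted for stealing one pair o ...", "theft"),
    ("was indicted for the wilful murder of ...", "kill"),
    ("was indicted for forging a bill ...", "deception"),
]
predicted = [classify(text) for text, _ in samples]
actual = [label for _, label in samples]
print(predicted, accuracy(predicted, actual))
```

The third sample matches no keyword, so it falls back to `"unknown"` and counts as a miss. Widening the dictionary or refining the rule is exactly the kind of iteration the assignment asks for.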

## 2. Getting the Data <a id='data'></a>

We have implemented a couple of scripts to complete the messier parts of downloading the data we'll be working with. You can run these scripts with the following command:

```python
!python wgetxmls.py ARG1 ARG2 ARG3
```

- ARG1 should be the first year of the range you'd like to investigate
- ARG2 should be the last year of the range you'd like to investigate
- ARG3 **must be *AT MOST* the number of trials in that range** (preferably fewer, or it will take forever to run). Find this number [here](http://www.oldbaileyonline.org/forms/formMain.jsp), specifying the years in your range under `time period`, then selecting `calculate totals` at the top of the results page. The number will be the total number of results found.

Here is one example of how to use this script. We have chosen the 1750s as our range, and we know there are 4435 results from January 1750 through December 1759. The cell below requests only 10 trials to keep the download short:

***NOTE:*** This will take a few minutes to complete! Just be patient. When it finishes running, the circle at the top of the page next to the words "Python 3" will turn back from black to white.

In [2]:
# The first time you run this, it should print "(Some number) archives were successfully processed."
# After the first time, it may print several lines that say "cannot create directory '...' ".
# This is normal, just ignore these warnings.

wget_files('1750', '1759', 10)
!./extractFiles.sh '1750'

mkdir: ../1750-trialxmls: File exists


In [3]:
#files = requests.get("http://www.oldbaileyonline.org/obapi/ob?term0=fromdate_17500114&term1=todate_17591216&count=10&start=100&return=zip")
#files.text

**TODO:** Improve the data-pull system using the `requests` library: http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
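
One possible way to move toward parameterized requests is to build the query from a dictionary instead of a hand-written URL string. The sketch below uses only the standard library so it runs without network access. The helper name `build_query` is hypothetical, the parameter names (`term0`, `term1`, `count`, `start`, `return`) are copied from the commented-out request above, and whether the endpoint still accepts them has not been verified here.

```python
from urllib.parse import urlencode

# Base endpoint taken from the commented-out request above.
OBAPI_URL = "http://www.oldbaileyonline.org/obapi/ob"

def build_query(from_date, to_date, count, start=0):
    """Return query parameters for a date-bounded trial search.

    Dates are YYYYMMDD strings, e.g. "17500114".
    (Hypothetical helper; not part of the provided scripts.)
    """
    return {
        "term0": f"fromdate_{from_date}",
        "term1": f"todate_{to_date}",
        "count": count,
        "start": start,
        "return": "zip",
    }

params = build_query("17500114", "17591216", count=10, start=100)
url = f"{OBAPI_URL}?{urlencode(params)}"
print(url)
```

The same dictionary can be handed to `requests.get(OBAPI_URL, params=params)`, which performs the URL encoding itself.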

We have also implemented a function that will convert the text files we just downloaded into useful tables of information that we may now analyze. All we need to specify is the decade with which we want to work.

In [4]:
trials = make_table('1750')
trials

| collection | date | defendant gender | defendant given | defendant surname | file name | offenceCategory | offenceSubcategory | trial summary | uri | verdictCategory | victim gender | victim given | victim surname | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BAILEY | 17500117 | male | Thomas | Biggs | t17500117-3 | theft | grandLarceny | 112. Thomas Biggs , was indicted for stealing one pair o ... | sessionsPapers/17500117 | notGuilty | male | William | Gordon | 1750 |
| BAILEY | 17500117 | male | William | Heyden | t17500117-2 | theft | theftFromPlace | 110, 111. Nicholas Bond , and William Heyden , late of F ... | sessionsPapers/17500117 | guilty | male | Henry | Page | 1750 |
| BAILEY | 17500117 | male | John | Price | t17500117-6 | theft | grandLarceny | 116, 117. Ralph Darvel , and John Price , were indicted ... | sessionsPapers/17500117 | guilty | male | John | Clark | 1750 |
| BAILEY | 17500117 | female | Jane, | the | t17500117-7 | theft | grandLarceny | 118. Jane, the wife of John Holmes , was indicted for st ... | sessionsPapers/17500117 | guilty | male | John | Mitchel | 1750 |
| BAILEY | 17500117 | female | Margaret | Richards | t17500117-5 | theft | grandLarceny | 114, 115. Susannah Lowe , and Margaret Richards , widows ... | sessionsPapers/17500117 | notGuilty | male | Ezekiel | Wolse | 1750 |
| BAILEY | 17500117 | female | Elizabeth | Wanless | t17500117-4 | theft | grandLarceny | 113. Elizabeth Wanless , otherwise Newbey , spinster, wa ... | sessionsPapers/17500117 | guilty | male | Thomas | Broadhurst | 1750 |
| BAILEY | 17500117 | male | Henry | Shuter | t17500117-10 | theft | grandLarceny | 121. Henry Shuter , was indicted for stealing seven silk ... | sessionsPapers/17500117 | guilty | male | Edmund | Jervice | 1750 |
| BAILEY | 17500117 | male | William | Clark | t17500117-11 | theft | grandLarceny | 122. William Clark , was indicted for stealing one chees ... | sessionsPapers/17500117 | guilty | male | James | Crawforth | 1750 |
| BAILEY | 17500117 | male | Thomas | Chessam | t17500117-9 | theft | grandLarceny | 120. Thomas Chessam , was indicted for stealing three ir ... | sessionsPapers/17500117 | guilty | male | Francis | Cooper | 1750 |
| BAILEY | 17500117 | male | George | Fear | t17500117-8 | theft | grandLarceny | 119. George Fear , was indicted for stealing one portman ... | sessionsPapers/17500117 | guilty | female | Martha | Rigby | 1750 |
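
As a quick sanity check on the table, the verdicts in the ten rows displayed above can be tallied by hand. This is only a sketch: the `verdicts` list is copied from the `verdictCategory` column shown, and the variable names are illustrative.

```python
from collections import Counter

# verdictCategory values copied, in order, from the ten rows displayed above.
verdicts = ["notGuilty", "guilty", "guilty", "guilty", "notGuilty",
            "guilty", "guilty", "guilty", "guilty", "guilty"]

counts = Counter(verdicts)                       # label -> number of trials
share_guilty = counts["guilty"] / len(verdicts)  # fraction of guilty verdicts

print(counts)
print(f"Share guilty: {share_guilty:.0%}")
```

On the full `trials` table, `trials.group('verdictCategory')` from the `datascience` library should produce the equivalent counts without copying values by hand.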


## 3. Descriptive Statistics <a id='stats'></a>

## 4. Discussion Questions <a id='dq'></a>