# Replacing the Associate - Legal Machine Learning

This post is part of a series that looks at applying machine learning to legal information. We will start off by looking at case law on excluded subject matter in the United Kingdom and Europe.

## Corpus Creation

This post will cover the steps needed to create a corpus of UK and European patent case law. This corpus can be used for machine learning / automated processing.

### Prerequisites

Before you start you need to install Python and a bucketful of useful libraries. The best way to do this is to use [Anaconda](https://www.continuum.io/downloads). On my ten-year-old laptop running Puppy Linux (which was in the loft for a year or so covered in woodlouse excrement) this simply involved running the script. No compiling from source. No version errors. No messing with pip. 

The code below also uses pywget (imported as `wget`), which is not installed in the Anaconda environment by default. To install it run the following from a terminal / command line:
```
conda install pywget
```
I find that Jupyter (formerly IPython) notebooks are a great way to code iteratively. You can test out ideas block by block, shift stuff around, output and document all in the same tool. You can also easily export to HTML with one click (hence this post). To start a notebook after installing Anaconda, run the following:
```
jupyter notebook
```
This will start the notebook server on your local machine and open your browser. By default the notebooks are served at *localhost:8888*. To access the server across a local network, use the *--ip* flag with your IP address (e.g. --ip=192.168.1.2) and then point your browser at *[your-ip]:8888* (use --port to change the port).

#### Imports

In [1]:
# Import requests for web requests
import requests
# Import Beautiful Soup for HTML parsing
from bs4 import BeautifulSoup
# Import pywget to download PDFs from URLs
import wget
# Import urlparsing tool for relative links
from urllib.parse import urljoin
# Import time to allow for delays (so we don't get booted for being a bot)
import time
# Import os to allow file manipulation
import os
# Import SQLAlchemy to save case law details in a DB
import sqlalchemy
# Import datetime for string to date conversion
from datetime import datetime

In [2]:
# Save current working directory as a variable for use later
current_dir = os.getcwd()

## GB Excluded Subject Matter Hearings

This section discusses how to generate a corpus of UK Intellectual Property Office decisions regarding excluded subject matter.

***
1) Define URLs for excluded subject matter allowed and refused result tables from UKIPO.

In [3]:
base_url = "https://www.ipo.gov.uk/"

In [4]:
url_excluded_allowed = "https://www.ipo.gov.uk/p-challenge-decision-results/p-challenge-decision-results-gen.htm?hearingtype=All&number=&MonthFrom=&YearFrom=&MonthTo=&YearTo=&hearingofficer=&party=&provisions=&keywords1=Excluded+fields+%28allowed%29&keywords2=&keywords3=&submit=Go+%BB"

In [5]:
url_excluded_refused = "https://www.ipo.gov.uk/p-challenge-decision-results/p-challenge-decision-results-gen.htm?hearingtype=All&number=&MonthFrom=&YearFrom=&MonthTo=&YearTo=&hearingofficer=&party=&provisions=&keywords1=Excluded+fields+%28refused%29&keywords2=&keywords3=&submit=Go+%BB"
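
The two search URLs differ only in the `keywords1` value, so they could be generated from the form fields rather than pasted in. A minimal sketch, assuming the query parameters are exactly those visible in the URLs above; `build_search_url` is a hypothetical helper, and the Latin-1 encoding is inferred from the `%BB` that stands for the `»` character in those URLs.

```python
from urllib.parse import urlencode

# Results page taken from the URLs above
SEARCH_PAGE = ("https://www.ipo.gov.uk/p-challenge-decision-results/"
               "p-challenge-decision-results-gen.htm")

def build_search_url(keyword):
    """Hypothetical helper: rebuild the search URL for one keyword."""
    params = [
        ("hearingtype", "All"), ("number", ""),
        ("MonthFrom", ""), ("YearFrom", ""),
        ("MonthTo", ""), ("YearTo", ""),
        ("hearingofficer", ""), ("party", ""), ("provisions", ""),
        ("keywords1", keyword), ("keywords2", ""), ("keywords3", ""),
        ("submit", "Go \u00bb"),
    ]
    # Latin-1 encoding turns the "»" character into %BB, as in the original URLs
    return SEARCH_PAGE + "?" + urlencode(params, encoding="latin-1")

allowed_url = build_search_url("Excluded fields (allowed)")
refused_url = build_search_url("Excluded fields (refused)")
print(allowed_url)
```

Keeping the parameters in a list of pairs preserves their order, which makes the generated string easy to compare with the hand-copied one.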

***
2) Make a request to the results page for cases that were allowed, i.e. that were deemed not to relate to excluded subject matter. Load the result into Beautiful Soup.

In [6]:
r = requests.get(url_excluded_allowed)
data = r.text
soup = BeautifulSoup(data, "lxml")

***
3) Have a look at the fields we need to store - get these from the table's header cells.

In [7]:
# Get headers for results table
for tableheader in soup.find_all('th'):
    print(tableheader.get_text())

BL Number
Application / Patent Number
Person(s) or Company(s) involved
Hearing Officer
Decision Date


***
4) Generate DB using SQLAlchemy and define class to hold data.

In [8]:
# Define name and path for SQLite3 DB
db_name = "gb_excluded_sm.db"
db_path = os.path.join(current_dir, db_name)

# Create DB
from sqlalchemy import create_engine
engine = create_engine('sqlite:///' + db_path, echo=False)

# Setup base class
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()

# Define Class for Excluded Matter Case Details
from sqlalchemy import Column, Integer, String, Date, Boolean, Text
class Decision(Base):
    __tablename__ = 'decisions'
    
    id = Column(Integer, primary_key=True)
    # Hearing number
    bl_number = Column(String(15))
    # Application number
    appln_no = Column(String(15))
    # Publication number / patent number
    pub_no = Column(String(15))
    # Country code e.g. GB
    country_code = Column(String(2))
    # Applicant / proprietor 
    applicant = Column(String(256))
    
    hearing_officer = Column(String(128))
    decision_date = Column(Date)
    # Link for the decision page
    link = Column(String(128))
    # Summary text
    summary = Column(Text)
    # File name for the PDF
    filelink = Column(String(128))
    
    # Whether case was deemed to relate to excluded subject matter
    excluded = Column(Boolean)
    
    
    def as_dict(self):
        """ Return object as a dictionary. """
        temp_dict = {}
        temp_dict['object_type'] = self.__class__.__name__
        for c in self.__table__.columns:
            cur_attr = getattr(self, c.name)
            # If date or datetime generate string representation
            # (a Date column returns datetime.date, which is not a datetime instance)
            if hasattr(cur_attr, 'strftime'):
                cur_attr = cur_attr.strftime('%d %B %Y')
            temp_dict[c.name] = cur_attr
        return temp_dict
    
    def populate(self, data):
        """ Populates matching attributes of class instance. 
        param dict data: dict where for each entry key, value equal attributename, attributevalue."""
        for key, value in data.items():
            if hasattr(self, key):
                # Convert string dates into datetimes
                if isinstance(getattr(self, key), datetime) or str(self.__table__.c[key].type) == 'DATE':
                    value = datetime.strptime(value, "%d %B %Y")
                setattr(self, key, value)
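
The `populate` method relies on `datetime.strptime` with the pattern `%d %B %Y` to read dates such as `6 July 2015`. A small sketch of that conversion on its own, assuming an English locale for the month names; `parse_decision_date` is an illustrative name, not part of the class above.

```python
from datetime import date, datetime

def parse_decision_date(text):
    """Convert a string like '6 July 2015' into a datetime.date."""
    # %d = day of month, %B = full month name, %Y = four-digit year
    return datetime.strptime(text.strip(), "%d %B %Y").date()

parsed = parse_decision_date("6 July 2015")
print(parsed.isoformat())
# Formatting back with the same pattern pads the day with a zero
print(parsed.strftime("%d %B %Y"))
```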

In [9]:
# Clear existing DB
Base.metadata.drop_all(engine)
# Create new DB    
Base.metadata.create_all(engine)

***
5) Investigate getting data entries from results list.

In [10]:
# Have a look at each result
tableitem = soup.find('tbody')
tableitem

<tbody>
<tr>
<td><a href="p-challenge-decision-results-bl?BL_Number=O/312/15">O/312/15</a></td>
<td>GB1109923.1</td>
<td>The Boeing Company</td>
<td>Miss J Pullen</td>
<td class="no-wrap">6 July 2015</td>
</tr>
</tbody>

In [11]:
# Get the link as below
relative_link = tableitem.find('a').get('href')
urljoin(base_url, relative_link)

'https://www.ipo.gov.uk/p-challenge-decision-results-bl?BL_Number=O/312/15'
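
`urljoin` resolves a relative link against whatever base it is given, so the choice of base matters. The sketch below only illustrates the standard-library behaviour; which of the two results is the right address on the UKIPO site is not something it can tell you.

```python
from urllib.parse import urljoin

relative_link = "p-challenge-decision-results-bl?BL_Number=O/312/15"

# Joined against the site root, as in the cell above
from_root = urljoin("https://www.ipo.gov.uk/", relative_link)

# Joined against the URL of a page: the last path segment is replaced
page_url = ("https://www.ipo.gov.uk/p-challenge-decision-results/"
            "p-challenge-decision-results-gen.htm")
from_page = urljoin(page_url, relative_link)

print(from_root)
print(from_page)
```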

In [12]:
# Get individual data items from that result
for td in tableitem.find_all('td'):
    print(td.get_text())

O/312/15
GB1109923.1
The Boeing Company
Miss J Pullen
6 July 2015


In [13]:
# Define list of DB fields
fields = ['bl_number', 'appln_no', 'applicant' , 'hearing_officer', 'decision_date']

***
6) Define routine to get details and store as a series of dicts

In [14]:
# Get details for results table as a list of dicts
results = []
# For each result entry
for tableitem in soup.find_all('tbody'):
    
    # Get individual data items from that result
    result = {field:td.get_text() for td, field in zip(tableitem.find_all('td'),fields)}
    result['link'] = urljoin(base_url, tableitem.find('a').get('href'))
    result['excluded'] = False
    # Get link to PDF by following result link
    r = requests.get(result['link'])
    resultsoup = BeautifulSoup(r.text, "lxml")
    result['filelink'] = resultsoup.find('a', rel='pdf').get('href')
    result['summary'] = resultsoup.find('p', class_="summary").get_text()
    # Add delay to avoid server overload / denial
    time.sleep(1)
    results.append(result)
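
The dict comprehension in the loop pairs each table cell with a field name by position. A minimal sketch of that pairing with plain strings, using the cell values printed earlier; stripping whitespace is an added assumption, not something the loop above does.

```python
fields = ['bl_number', 'appln_no', 'applicant', 'hearing_officer', 'decision_date']

# Cell text from the first row of the results table shown earlier
cells = ['O/312/15', 'GB1109923.1', 'The Boeing Company', 'Miss J Pullen', '6 July 2015']

# zip stops at the shorter sequence, so extra cells or fields are silently dropped
row = {field: cell.strip() for field, cell in zip(fields, cells)}
row['excluded'] = False  # allowed decisions are labelled as not excluded

print(row)
```

Because the pairing is purely positional, a change in the column order of the results table would need a matching change to `fields`.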

In [15]:
# Check we have all results - at time of writing = 42
print("There are " + str(len(results)) + " decisions in the results")

There are 42 decisions in the results


In [16]:
# Format for a result dict
results[6]

{'applicant': 'Fisher-Rosemount Systems, Inc',
 'appln_no': 'GB0809880.8',
 'bl_number': 'O/390/12',
 'decision_date': '9 October 2012',
 'excluded': False,
 'filelink': 'o39012.pdf',
 'hearing_officer': 'Mr B Buchanan',
 'link': 'https://www.ipo.gov.uk/p-challenge-decision-results-bl?BL_Number=O/390/12',
 'summary': 'The invention relates to accessing parameters in a process control system and controlling a process using a parameter value. A universal communication interface enables communication between field devices and other data sources in a process control system. As well as enhancing compatibility between components of the system, the universal interface caches parameter values. This means that in the event of a communications failure between one or more components of the system, when a parameter value cannot be obtained directly, a stored copy of the parameter value can be obtained from the local cache memory and can be used to control the process.\r\n'}

***
7) Save results in Database

In [17]:
# Setup SQLAlchemy session
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
session = Session()

for result in results:
    decision = Decision()
    decision.populate(result)
    session.add(decision)
session.commit()

In [18]:
# Check all results are added
session.query(Decision).count()

42

***
8) Get and save PDF of each decision if it doesn't already exist.

In [19]:
# Set directory to save the allowed decisions to
directory = "allowed"

# If there is no folder called [directory] in the current working folder then make one
if not os.path.exists(directory):
    os.makedirs(directory)
    
# Set save path as current directory + new sub directory
savepath = os.path.join(current_dir, directory)

# For each result
for result in results:
    # Set save path
    savename = os.path.join(savepath, result['filelink'])
    # Skip files that have already been downloaded
    if os.path.isfile(savename):
        print(result['bl_number'] + ' - File: ' + result['filelink'] + ' - Already Exists', end='; ')
    else:
        print('Downloading ' + result['bl_number'] + ' - File: ' + result['filelink'], end='; ')
        # This downloads file into directory defined above
        filename = wget.download(urljoin(base_url, result['filelink']), out=savename)
        # Space out requests in time
        time.sleep(1)

O/312/15 - File: o31215.pdf - Already Exists; O/453/14 - File: o45314.pdf - Already Exists; O/435/14 - File: o43514.pdf - Already Exists; O/371/14 - File: o37114.pdf - Already Exists; O/179/14 - File: o17914.pdf - Already Exists; O/438/12 - File: o43812.pdf - Already Exists; O/390/12 - File: o39012.pdf - Already Exists; O/096/12 - File: o09612.pdf - Already Exists; O/018/12 - File: o01812.pdf - Already Exists; O/466/11 - File: o46611.pdf - Already Exists; O/415/11 - File: o41511.pdf - Already Exists; O/373/11 - File: o37311.pdf - Already Exists; O/163/11 - File: o16311.pdf - Already Exists; O/361/10 - File: o36110.pdf - Already Exists; O/117/10 - File: o11710.pdf - Already Exists; O/089/10 - File: o08910.pdf - Already Exists; O/058/10 - File: o05810.pdf - Already Exists; O/136/09 - File: o13609.pdf - Already Exists; O/107/09 - File: o10709.pdf - Already Exists; O/312/08 - File: o31208.pdf - Already Exists; O/238/08 - File: o23808.pdf - Already Exists; O/224/08 - File: o22408.pdf - Alre
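
The download loop is safe to re-run because it checks for each file before fetching it. A sketch of just that bookkeeping, with no network access; `missing_files` is a hypothetical helper and the temporary directory stands in for the `allowed` folder.

```python
import os
import tempfile

def missing_files(savepath, filenames):
    """Return the names in filenames that are not yet present in savepath."""
    # exist_ok=True avoids a separate os.path.exists check
    os.makedirs(savepath, exist_ok=True)
    return [name for name in filenames
            if not os.path.isfile(os.path.join(savepath, name))]

with tempfile.TemporaryDirectory() as tmp:
    savepath = os.path.join(tmp, "allowed")
    wanted = ["o31215.pdf", "o39012.pdf"]
    first_pass = missing_files(savepath, wanted)
    # Simulate one completed download
    with open(os.path.join(savepath, "o31215.pdf"), "wb") as f:
        f.write(b"%PDF-")
    second_pass = missing_files(savepath, wanted)

print(first_pass, second_pass)
```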

***
9) Repeat the above steps for decisions that refused the case on the grounds of excluded subject matter.

In [27]:
r = requests.get(url_excluded_refused)
data = r.text
soup = BeautifulSoup(data, "lxml")
# Get details for results table as a list of dicts
results = []
# For each result entry
for tableitem in soup.find_all('tbody'):
    # Get individual data items from that result
    result = {field:td.get_text() for td, field in zip(tableitem.find_all('td'),fields)}
    result['link'] = urljoin(base_url, tableitem.find('a').get('href'))
    result['excluded'] = True
    results.append(result)
    
#Check number of results
print("There are " + str(len(results)) + " decisions in the results")

There are 400 decisions in the results


Sending requests on this many decisions eventually resulted in some "IncompleteRead" errors when fetching the PDF links.

In [29]:
processed_results = 0
for result in results:
    print("Processing " + result['bl_number'], end='; ')
    # Get link to PDF by following result link
    error = True
    # Try the request up to 4 times, sleeping 5 seconds after each failure
    for x in range(0, 4):
        try:
            r = requests.get(result['link'])
            error = False
            break
        except Exception:
            # Python 3 unbinds the "as" name when the except block ends,
            # so track failure with a flag instead of the exception variable
            print('Error occurred', end='; ')
            time.sleep(5)
            
    if not error:  
        print('Getting link', end='; ')
        resultsoup = BeautifulSoup(r.text, "lxml")
        result['filelink'] = resultsoup.find('a', rel='pdf').get('href')
        result['summary'] = resultsoup.find('p', class_="summary").get_text()
        processed_results = processed_results + 1
    
    # Add delay to avoid server overload / denial
    time.sleep(5)
    
# Check number of results
print("Number of processed results = " + str(processed_results))

Processing O/253/16; Getting link; Processing O/137/16; Getting link; Processing O/136/16; Getting link; Processing O/124/16; Getting link; Processing O/111/16; Getting link; Processing O/023/16; Getting link; Processing O/002/16; Getting link; Processing O/599/15; Getting link; Processing O/597/15; Getting link; Processing O/544/15; Getting link; Processing O/543/15; Getting link; Processing O/519/15; Getting link; Processing O/479/15; Getting link; Processing O/477/15; Getting link; Processing O/326/15; Getting link; Processing O/239/15; Getting link; Processing O/186/15; Getting link; Processing O/184/15; Getting link; Processing O/139/15; Getting link; Processing O/118/15; Getting link; Processing O/106/15; Getting link; Processing O/071/15; Getting link; Processing O/057/15; Getting link; Processing O/044/15; Getting link; Processing O/042/15; Getting link; Processing O/038/15; Getting link; Processing O/037/15; Getting link; Processing O/566/14; Getting link; Processing O/551/14;
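
The retry logic can be pulled out into a small helper so that the request and download loops share it. This is a sketch rather than the code used for the run above; `retry`, `attempts` and `delay` are illustrative names, and the flaky function only simulates a failing request.

```python
import time

def retry(func, attempts=4, delay=5):
    """Call func until it succeeds, up to `attempts` times.

    Sleeps `delay` seconds between failed attempts and re-raises
    the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Simulated request that fails twice before succeeding
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated IncompleteRead")
    return "ok"

outcome = retry(flaky, attempts=4, delay=0)
print(outcome, calls["count"])
```

Inside the loop this could be called as `retry(lambda: requests.get(result['link']))`, with a `try`/`except` around it to skip decisions that still fail after the final attempt.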

In [31]:
results[234]

{'applicant': 'Ewise.Com.Au Pty Ltd',
 'appln_no': 'GB0203654.9',
 'bl_number': 'O/327/07',
 'decision_date': '2 November 2007',
 'excluded': True,
 'filelink': 'o32707.pdf',
 'hearing_officer': 'Mr P Slater',
 'link': 'https://www.ipo.gov.uk/p-challenge-decision-results-bl?BL_Number=O/327/07',
 'summary': 'The invention concerns an arrangement for accessing over a computer network such as the Internet, secure sites which contain personal or financial information by employing an active content agent (ACA), which in the disclosed embodiment is a piece of software which when downloaded from the Internet runs on the user’s personal computer and accesses, for example, their bank account(s) on their behalf using passwords which are stored locally on their computer in an encrypted form. The advantages of this are there is no need for the user to remember their password(s) or to disclosure it to a third party.\r\n'}

In [32]:
#Save data to database
for result in results:
    decision = Decision()
    decision.populate(result)
    session.add(decision)
session.commit()

In [33]:
session.query(Decision).count()

442
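
With both sets saved, the label balance can be checked straight from the SQLite file. A sketch using the standard-library `sqlite3` module; it runs against an in-memory table holding only the two example decisions shown in this post, and assumes the Boolean column is stored as 0/1. Point `sqlite3.connect` at `gb_excluded_sm.db` to query the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for the "decisions" table
conn.execute("CREATE TABLE decisions (bl_number TEXT, excluded INTEGER)")
conn.executemany(
    "INSERT INTO decisions VALUES (?, ?)",
    [("O/390/12", 0),   # allowed, so not excluded
     ("O/327/07", 1)],  # refused, so excluded
)

# Count decisions per label
counts = dict(conn.execute(
    "SELECT excluded, COUNT(*) FROM decisions GROUP BY excluded"
).fetchall())
conn.close()

print(counts)
```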

In [37]:
# Download PDFs

# Set directory to save the refused decisions to
directory = "refused"

# If there is no folder called [directory] in the current working folder then make one
if not os.path.exists(directory):
    os.makedirs(directory)
    
# Set save path as current directory + new sub directory
savepath = os.path.join(current_dir, directory)

# For each result
for result in results:
    # Set save path
    savename = os.path.join(savepath, result['filelink'])
    # Skip files that have already been downloaded
    if os.path.isfile(savename):
        print(result['bl_number'] + ' - File: ' + result['filelink'] + ' - Already Exists', end='; ')
    else:
        print('Downloading ' + result['bl_number'] + ' - File: ' + result['filelink'], end='; ')
        # This downloads file into directory defined above
       
        # Try the download up to 4 times, sleeping 5 seconds after each failure
        for x in range(0, 4):
            try:
                filename = wget.download(urljoin(base_url, result['filelink']), out=savename)
                break
            except Exception:
                # Python 3 unbinds the "as" name when the except block ends,
                # so retry directly here rather than testing it afterwards
                print('Error occurred', end='; ')
                time.sleep(5)
        
        # Space out requests in time
        time.sleep(5)

O/253/16 - File: o25316.pdf - Already Exists; O/137/16 - File: o13716.pdf - Already Exists; O/136/16 - File: o13616.pdf - Already Exists; O/124/16 - File: o12416.pdf - Already Exists; O/111/16 - File: o11116.pdf - Already Exists; O/023/16 - File: o02316.pdf - Already Exists; O/002/16 - File: o00216.pdf - Already Exists; O/599/15 - File: o59915.pdf - Already Exists; O/597/15 - File: o59715.pdf - Already Exists; O/544/15 - File: o54415.pdf - Already Exists; O/543/15 - File: o54315.pdf - Already Exists; O/519/15 - File: o51915.pdf - Already Exists; O/479/15 - File: o47915.pdf - Already Exists; O/477/15 - File: o47715.pdf - Already Exists; O/326/15 - File: o32615.pdf - Already Exists; O/239/15 - File: o23915.pdf - Already Exists; O/186/15 - File: o18615.pdf - Already Exists; O/184/15 - File: o18415.pdf - Already Exists; O/139/15 - File: o13915.pdf - Already Exists; O/118/15 - File: o11815.pdf - Already Exists; O/106/15 - File: o10615.pdf - Already Exists; O/071/15 - File: o07115.pdf - Alre

There we go. Now we have a database of results with an "excluded" label ("True" = subject matter is excluded; "False" = subject matter is not excluded). We have the application numbers so we can make requests to get application data. We also have the PDFs of these decisions to parse at our leisure.
