# Replacing the Associate - Legal Machine Learning

This post is part of a series that looks at applying machine learning to legal information. We will start off by looking at case law for excluded subject matter in the United Kingdom and Europe.

## Corpus Creation

This post will cover the steps needed to create a corpus of UK and European patent case law. This corpus can be used for machine learning / automated processing.

### Prerequisites

Before you start you need to install Python and a bucketful of useful libraries. The best way to do this is to use [Anaconda](https://www.continuum.io/downloads). On my ten-year-old laptop running Puppy Linux (which was in the loft for a year or so covered in woodlouse excrement) this simply involved running the script. No compiling from source. No version errors. No messing with pip. 

I find that Jupyter (formerly IPython) notebooks are a great way to code iteratively. You can test out ideas block by block, shift stuff around, and output and document everything in the same tool. You can also easily export to HTML with one click (hence this post). Once Anaconda is installed, start a notebook by running the following:
```
jupyter notebook
```
This will start the notebook server on your local machine and open your browser. By default the notebooks are served at *localhost:8888*. To access them across a local network, use the *--ip* flag with your IP address (e.g. --ip=192.168.1.2) and then point your browser at *[your-ip]:8888* (use *--port* to change the port).

This notebook also makes use of a library I hacked together for accessing the EPO OPS API. You can clone this library from: https://github.com/benhoyle/EPOops.

#### Imports

In [3]:
import feedparser
from datetime import datetime

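The EPO Boards of Appeal search can return its results as an RSS feed, which we can read with *feedparser*. The *num* parameter in the feed URL sets the number of results returned. If *num* is omitted the default is 10. *num* is limited to 1000 entries.

To work around this limit we will need to split the search into several date ranges, each returning fewer than 1000 results.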
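
As a rough sketch of the date-range idea, the snippet below turns a list of boundary dates into consecutive windows and builds the query fragment for each one. The `inmeta:dg3DecisionDate:START..END` form is copied from the feed URLs used later in this post; the helper names `date_windows` and `date_query` and the particular boundary dates are my own illustrative choices, not part of any EPO API.

```python
from datetime import date


def date_windows(boundaries):
    """Pair consecutive boundary dates into (start, end) windows."""
    return list(zip(boundaries[:-1], boundaries[1:]))


def date_query(start, end):
    """Build the q= value that restricts results to a decision-date window.

    The format mirrors the feed URLs in this post; it is an assumption
    that the search backend keeps accepting it.
    """
    return "PDF%2Binmeta:dg3DecisionDate:{}..{}".format(
        start.isoformat(), end.isoformat()
    )


# Illustrative boundaries; the last one could equally be date.today().
boundaries = [
    date(1970, 1, 1),
    date(1995, 1, 1),
    date(2005, 1, 1),
    date(2012, 1, 1),
    date(2018, 4, 4),
]

queries = [date_query(start, end) for start, end in date_windows(boundaries)]
for query in queries:
    print(query)
```

If any single window still returns close to 1000 entries, it can be split again by adding another boundary date to the list.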

In [28]:
# Get RSS feed URL by running a search 
epo_feed_url = "https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=(dg3DecisionPRL:EN).(dg3BOAnDot:3401%7Cdg3BOAnDot:3402%7Cdg3BOAnDot:3403%7Cdg3BOAnDot:3501%7Cdg3BOAnDot:3502%7Cdg3BOAnDot:3503%7Cdg3BOAnDot:3504%7Cdg3BOAnDot:3505%7Cdg3BOAnDot:3506%7Cdg3BOAnDot:3507).(dg3CaseType:T)&partialfields=dg3ArtRef:56&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en%7Clang_fr%7Clang_de&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=1050"

In [29]:
d = feedparser.parse(epo_feed_url)

In [5]:
d['feed']['title']

'Search results for :'

In [31]:
d.feed.description

'Results <strong></strong> - <strong></strong> of about <strong></strong> for<strong> </strong>'

In [30]:
len(d.entries)

0

In [11]:
import difflib

In [21]:
feed_pg1 = "https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=%28dg3DecisionPRL:EN%29.%28dg3DecisionDistributionKey:A|dg3DecisionDistributionKey:B|dg3DecisionDistributionKey:C|dg3DecisionDistributionKey:D%29.%28dg3BOAnDot:3201|dg3BOAnDot:3202|dg3BOAnDot:3203|dg3BOAnDot:3204|dg3BOAnDot:3205|dg3BOAnDot:3206|dg3BOAnDot:3207|dg3BOAnDot:3208|dg3BOAnDot:3401|dg3BOAnDot:3402|dg3BOAnDot:3403|dg3BOAnDot:3501|dg3BOAnDot:3502|dg3BOAnDot:3503|dg3BOAnDot:3504|dg3BOAnDot:3505|dg3BOAnDot:3506|dg3BOAnDot:3507%29.%28dg3CaseType:T%29&partialfields=&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en|lang_fr|lang_de&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=50"
feed_pg2 = "https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=%28dg3DecisionPRL:EN%29.%28dg3DecisionDistributionKey:A|dg3DecisionDistributionKey:B|dg3DecisionDistributionKey:C|dg3DecisionDistributionKey:D%29.%28dg3BOAnDot:3201|dg3BOAnDot:3202|dg3BOAnDot:3203|dg3BOAnDot:3204|dg3BOAnDot:3205|dg3BOAnDot:3206|dg3BOAnDot:3207|dg3BOAnDot:3208|dg3BOAnDot:3401|dg3BOAnDot:3402|dg3BOAnDot:3403|dg3BOAnDot:3501|dg3BOAnDot:3502|dg3BOAnDot:3503|dg3BOAnDot:3504|dg3BOAnDot:3505|dg3BOAnDot:3506|dg3BOAnDot:3507%29.%28dg3CaseType:T%29&partialfields=&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en|lang_fr|lang_de&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=50"
# Compare the two feed URLs character by character and report any differences
a = feed_pg1
b = feed_pg2
for i, s in enumerate(difflib.ndiff(a, b)):
    if s[0] == ' ':
        continue
    elif s[0] == '-':
        print('Delete "{}" from position {}'.format(s[-1], i))
    elif s[0] == '+':
        print('Add "{}" to position {}'.format(s[-1], i))
    print()

The two feed URLs are identical: moving to a different page of results does not produce a different RSS feed URL, and I can't see any other mechanism for changing pages.
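
A character-level diff gets noisy on URLs this long. As an alternative sketch, the two URLs can be compared parameter by parameter with the standard library's `urllib.parse`. The helper name `differing_params` and the example URLs (including the `start` parameter) are made up for illustration; they are not claims about what the EPO search supports.

```python
from urllib.parse import urlsplit, parse_qs


def differing_params(url_a, url_b):
    """Return the query parameters whose values differ between two URLs."""
    params_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    params_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (params_a.get(key), params_b.get(key))
        for key in sorted(set(params_a) | set(params_b))
        if params_a.get(key) != params_b.get(key)
    }


# Hypothetical URLs, used only to show the shape of the result.
example_a = "https://example.org/search?site=BoA&num=50&start=0"
example_b = "https://example.org/search?site=BoA&num=50&start=50"

print(differing_params(example_a, example_a))  # {}
print(differing_params(example_a, example_b))  # {'start': (['0'], ['50'])}
```

An empty dictionary for `feed_pg1` and `feed_pg2` would confirm the same conclusion as the diff above.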

In [4]:
now = datetime.now().strftime("%Y-%m-%d")
print("Today's date is ", now)

Today's date is  2018-04-04


Here is a feed URL that limits the results to decisions referencing Article 56 EPC (via *partialfields=dg3ArtRef:56*).

http://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=technical+inmeta:dg3DecisionDateGSA:2014-04-04..2018-04-04&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=(dg3DecisionPRL:EN).(dg3DecisionDistributionKey:A%7Cdg3DecisionDistributionKey:B%7Cdg3DecisionDistributionKey:C%7Cdg3DecisionDistributionKey:D).(dg3BOAnDot:3401%7Cdg3BOAnDot:3402%7Cdg3BOAnDot:3403%7Cdg3BOAnDot:3501%7Cdg3BOAnDot:3502%7Cdg3BOAnDot:3503%7Cdg3BOAnDot:3504%7Cdg3BOAnDot:3505%7Cdg3BOAnDot:3506%7Cdg3BOAnDot:3507).(dg3CaseType:T)&partialfields=dg3ArtRef:56&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en%7Clang_fr%7Clang_de&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=200

In [6]:
# Feed URLs for all distribution keys, 3.4 and 3.5 Boards, Technical boards, English only,
# split by decision date (the first runs to 01.01.1995) - THESE DO NOT LIMIT TO ArtRef:56
# 868 results
url_to_1995 = "https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=PDF%2Binmeta:dg3DecisionDate:1970-01-01..1995-01-01&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=%28dg3DecisionPRL:EN%29.%28dg3BOAnDot:3401|dg3BOAnDot:3402|dg3BOAnDot:3403|dg3BOAnDot:3501|dg3BOAnDot:3502|dg3BOAnDot:3503|dg3BOAnDot:3504|dg3BOAnDot:3505|dg3BOAnDot:3506|dg3BOAnDot:3507%29.%28dg3CaseType:T%29&partialfields=&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=900"
url_from_95_to_2005 = "https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=PDF%2Binmeta:dg3DecisionDate:1995-01-01..2005-01-01&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=%28dg3DecisionPRL:EN%29.%28dg3BOAnDot:3401|dg3BOAnDot:3402|dg3BOAnDot:3403|dg3BOAnDot:3501|dg3BOAnDot:3502|dg3BOAnDot:3503|dg3BOAnDot:3504|dg3BOAnDot:3505|dg3BOAnDot:3506|dg3BOAnDot:3507%29.%28dg3CaseType:T%29&partialfields=&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=999"
url_from_05_to_2012 = "https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=PDF%2Binmeta:dg3DecisionDate:2005-01-01..2012-01-01&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=%28dg3DecisionPRL:EN%29.%28dg3BOAnDot:3401|dg3BOAnDot:3402|dg3BOAnDot:3403|dg3BOAnDot:3501|dg3BOAnDot:3502|dg3BOAnDot:3503|dg3BOAnDot:3504|dg3BOAnDot:3505|dg3BOAnDot:3506|dg3BOAnDot:3507%29.%28dg3CaseType:T%29&partialfields=&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=999"
url_from_12_to_now = "https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=PDF%2Binmeta:dg3DecisionDate:2012-01-01..{now}&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=%28dg3DecisionPRL:EN%29.%28dg3BOAnDot:3401|dg3BOAnDot:3402|dg3BOAnDot:3403|dg3BOAnDot:3501|dg3BOAnDot:3502|dg3BOAnDot:3503|dg3BOAnDot:3504|dg3BOAnDot:3505|dg3BOAnDot:3506|dg3BOAnDot:3507%29.%28dg3CaseType:T%29&partialfields=&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=999".format(now=now)
print(url_from_12_to_now)

https://www.epo.org/footer/search.html?site=BoA&entqr=0&output=xml_no_dtd&client=BoA_AJAX&ud=1&oe=UTF-8&ie=UTF-8&q=PDF%2Binmeta:dg3DecisionDate:2012-01-01..2018-04-04&getfields=dg3TLE.dg3DecisionOnline.dg3APN.dg3DecisionDate.dg3DecisionPDF.dg3CaseIPC.dg3DecisionBoard.dg3DecisionPRL.dg3KEY.dg3DecisionDistributionKey.dg3ECLI&requiredfields=%28dg3DecisionPRL:EN%29.%28dg3BOAnDot:3401|dg3BOAnDot:3402|dg3BOAnDot:3403|dg3BOAnDot:3501|dg3BOAnDot:3502|dg3BOAnDot:3503|dg3BOAnDot:3504|dg3BOAnDot:3505|dg3BOAnDot:3506|dg3BOAnDot:3507%29.%28dg3CaseType:T%29&partialfields=&advOpts=hide&ulang=en&access=p&entqrm=0&lr=lang_en&wc=200&wc_mc=1&proxystylesheet=BoA_RSS&sort=date:D:S:d1&filter=0&num=999


In [7]:
# Looks like we will need four date ranges
data_to_95 = feedparser.parse(url_to_1995)
print("To 01.01.1995 - no. of entries = " + str(len(data_to_95.entries)))
data_from_95_to_2005 = feedparser.parse(url_from_95_to_2005)
print("From 01.01.1995 to 01.01.2005 - no. of entries = " + str(len(data_from_95_to_2005.entries)))
data_from_05_to_2012 = feedparser.parse(url_from_05_to_2012)
print("From 01.01.2005 to 01.01.2012 - no. of entries = " + str(len(data_from_05_to_2012.entries)))
data_from_12_to_now = feedparser.parse(url_from_12_to_now)
print("From 01.01.2012 to now - no. of entries = " + str(len(data_from_12_to_now.entries)))

To 01.01.1995 - no. of entries = 462
From 01.01.1995 to 01.01.2005 - no. of entries = 737
From 01.01.2005 to 01.01.2012 - no. of entries = 704
From 01.01.2012 to now - no. of entries = 843


In [8]:
import pickle


def savedata(data, filename):
    """Helper function to save data to a pickle file."""
    with open(filename, "wb") as f:
        pickle.dump(data, f)
        print("Data saved")


def loaddata(filename):
    """Helper function to load data from a pickle file."""
    with open(filename, "rb") as f:
        print("Loading data")
        data = pickle.load(f)
    return data

In [12]:
data_from_95_to_2005.entries[0]

{'guidislink': False,
 'id': 'http://www.epo.org/law-practice/case-law-appeals/recent/t950036eu1.html',
 'lang': 'en',
 'link': 'http://www.epo.org/law-practice/case-law-appeals/recent/t950036eu1.html',
 'links': [{'href': 'http://www.epo.org/law-practice/case-law-appeals/recent/t950036eu1.html',
   'rel': 'alternate',
   'type': 'text/html'},
  {'href': 'http://www.epo.org/law-practice/case-law-appeals/pdf/t950036eu1.pdf',
   'rel': 'enclosure',
   'type': 'application/pdf'}],
 'published': '',
 'published_parsed': None,
 'summary': '<b>...</b> Download and more information: Decision text in EN (<b>PDF</b>, 29.063K).<br> Documentation of the appeal procedure can be found in the Register. <b>...</b>  \n          <br/>\n          <b>Online on</b>: 09.12.2014\n          | <b>Board</b>: 3.4.01\n          | <b>Decision date</b>: 4.5.1999\n          | <b>Proc. language</b>: EN\n          | <b>IPC</b>: G07B 17/02\n          | <b>Application no.</b>: 86114320\n          <br/>\n          <b>Ke
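
Most of the useful metadata sits inside the *summary* string as `<b>Label</b>: value` pairs. Here is a minimal sketch of pulling those pairs into a dictionary with a regular expression. The sample string is copied from the visible part of the entry above; the function name `parse_summary` and the pattern are my own, and they assume that other entries follow the same layout.

```python
import re

# Matches "<b>Label</b>: value", where the value runs up to the next "|" or tag.
FIELD_PATTERN = re.compile(r"<b>([^<]+)</b>:\s*([^|<]+)")


def parse_summary(summary):
    """Extract label/value pairs from a feed entry's summary HTML."""
    return {
        label.strip(): value.strip()
        for label, value in FIELD_PATTERN.findall(summary)
    }


# Fragment of the summary shown above (the truncated tail is left out).
sample = (
    "<b>Online on</b>: 09.12.2014\n"
    "          | <b>Board</b>: 3.4.01\n"
    "          | <b>Decision date</b>: 4.5.1999\n"
    "          | <b>Proc. language</b>: EN\n"
    "          | <b>IPC</b>: G07B 17/02\n"
    "          | <b>Application no.</b>: 86114320\n"
    "          <br/>"
)

fields = parse_summary(sample)
for label, value in fields.items():
    print(label, "->", value)
```

Bold text that is not followed by a colon, such as the `<b>PDF</b>` in the download sentence, is skipped by the pattern.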

In [14]:
feeds = [
    data_to_95,
    data_from_95_to_2005,
    data_from_05_to_2012,
    data_from_12_to_now
]

cases = []
for feed in feeds:
    cases.extend(feed.entries)

savedata(cases, "feeds.pik")

Data saved
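
Because neighbouring date ranges share a boundary date, a decision dated exactly on a boundary might be returned by two feeds. I have not checked whether the ranges are inclusive, so treat that as an assumption. A small sketch for dropping repeats by the entry's *id* field is below; the helper name `dedupe_by_id` and the sample data are illustrative only.

```python
def dedupe_by_id(entries):
    """Keep the first entry seen for each 'id', preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        key = entry.get("id")
        if key in seen:
            continue
        seen.add(key)
        unique.append(entry)
    return unique


# Stand-in entries; real feedparser entries are dict-like and also expose "id".
sample_cases = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
unique_cases = dedupe_by_id(sample_cases)
print(len(sample_cases), "->", len(unique_cases))
```

Running this over *cases* before saving would keep the pickle file free of duplicate decisions.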
