# 2. Case Law Prediction - Getting the Data

## Processing Feeds

At this point we have a list of feed entries for cases that match our search criteria: decisions relating to inventive step that contain the word "technical", issued by boards 3.04 and 3.05, which cover our area (we cannot search by G06 alone). We need to process these entries to create structured data we can use in our prediction algorithms.


In [1]:
def loaddata(filename):
    """Helper function to load data from a pickle file."""
    import pickle
    with open(filename, "rb") as f:
        print("Loading data")
        data = pickle.load(f)
    return data

feedentries = loaddata("feeds.pik")

Loading data


### Looking at Feed Entry Structure

In [2]:
# Let's have a look at the structure of each feed
feedentries[0]

{'guidislink': False,
 'id': 'http://www.epo.org/law-practice/case-law-appeals/recent/t930324eu1.html',
 'lang': 'en',
 'link': 'http://www.epo.org/law-practice/case-law-appeals/recent/t930324eu1.html',
 'links': [{'href': 'http://www.epo.org/law-practice/case-law-appeals/recent/t930324eu1.html',
   'rel': 'alternate',
   'type': 'text/html'},
  {'href': 'http://www.epo.org/law-practice/case-law-appeals/pdf/t930324eu1.pdf',
   'rel': 'enclosure',
   'type': 'application/pdf'}],
 'published': '',
 'published_parsed': None,
 'summary': '<b>...</b> Download and more information: Decision text in EN (<b>PDF</b>, 17.099K).<br> Documentation of the appeal procedure can be found in the Register. <b>...</b>  \n          <br/>\n          <b>Online on</b>: 02.02.1995\n          | <b>Board</b>: 3.5.02\n          | <b>Decision date</b>: 7.12.1994\n          | <b>Proc. language</b>: EN\n          | <b>IPC</b>: G11B 27/10\n          | <b>Application no.</b>: 88903369\n          <br/>\n          <b>K

In [3]:
# And check that the last entry has the same structure
feedentries[-1]

{'guidislink': False,
 'id': 'http://www.epo.org/law-practice/case-law-appeals/recent/t111474eu1.html',
 'lang': 'en',
 'link': 'http://www.epo.org/law-practice/case-law-appeals/recent/t111474eu1.html',
 'links': [{'href': 'http://www.epo.org/law-practice/case-law-appeals/recent/t111474eu1.html',
   'rel': 'alternate',
   'type': 'text/html'},
  {'href': 'http://www.epo.org/law-practice/case-law-appeals/pdf/t111474eu1.pdf',
   'rel': 'enclosure',
   'type': 'application/pdf'}],
 'published': '',
 'published_parsed': None,
 'summary': '<b>...</b> Download and more information: Decision text in EN (<b>PDF</b>, 15.312K).<br> Documentation of the appeal procedure can be found in the Register. <b>...</b>  \n          <br/>\n          <b>Online on</b>: 18.01.2012\n          | <b>Board</b>: 3.4.02\n          | <b>Decision date</b>: 2.1.2012\n          | <b>Proc. language</b>: EN\n          | <b>IPC</b>: G01J 5/34, G01J 5/20, G01J 5/10, G01J 5/02\n          | <b>Application no.</b>: 06075345\n

In [4]:
feedentries[-1].keys()

dict_keys(['lang', 'published_parsed', 'summary_detail', 'summary', 'published', 'links', 'guidislink', 'title', 'title_detail', 'id', 'link'])
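
Looking at the first and last entries is only a spot check. A minimal sketch of a check over every entry follows, assuming each entry behaves like a dictionary as shown above. The helper name `key_set_counts` and the toy entries are made up for illustration.

```python
from collections import Counter


def key_set_counts(entries):
    """Count how many entries share each distinct set of keys."""
    return Counter(frozenset(entry.keys()) for entry in entries)


# Toy entries standing in for feedentries (illustrative data only)
sample_entries = [
    {"title": "a", "link": "x", "summary": "s"},
    {"title": "b", "link": "y", "summary": "t"},
    {"title": "c", "link": "z"},  # this one is missing 'summary'
]

counts = key_set_counts(sample_entries)
for keys, n in counts.most_common():
    print(n, sorted(keys))
```

If `len(key_set_counts(feedentries))` comes back as 1, every entry has the same fields.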

So useful fields appear to be:

* title (or title_detail>value, which holds the same data) - has the case number and the date;
* id / link - link to the HTML page;
* links>href for pdf link;
* summary - HTML with useful information (we could also get this from the case page itself).
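
The fields listed above could be pulled out with something like the sketch below. It assumes the PDF is the link whose `type` is `application/pdf` and that the summary keeps the `<b>Label</b>: value` layout seen in the two entries above. The function names `pdf_link` and `summary_fields` are hypothetical, and the sample entry is a shortened copy of the last feed entry.

```python
import re


def pdf_link(entry):
    """Return the href of the first link typed as a PDF, or None."""
    for link in entry.get("links", []):
        if link.get("type") == "application/pdf":
            return link.get("href")
    return None


def summary_fields(summary_html):
    """Pull '<b>Label</b>: value' pairs out of the summary HTML."""
    # A value runs until the next '|' separator or the next tag
    pairs = re.findall(r"<b>([^<]+)</b>:\s*([^|<]+)", summary_html)
    return {label.strip(): value.strip() for label, value in pairs}


# Shortened copy of the last feed entry shown above
sample_entry = {
    "link": "http://www.epo.org/law-practice/case-law-appeals/recent/t111474eu1.html",
    "links": [
        {"href": "http://www.epo.org/law-practice/case-law-appeals/recent/t111474eu1.html",
         "rel": "alternate", "type": "text/html"},
        {"href": "http://www.epo.org/law-practice/case-law-appeals/pdf/t111474eu1.pdf",
         "rel": "enclosure", "type": "application/pdf"},
    ],
    "summary": "<b>Online on</b>: 18.01.2012\n          | <b>Board</b>: 3.4.02\n"
               "          | <b>Decision date</b>: 2.1.2012\n"
               "          | <b>IPC</b>: G01J 5/34, G01J 5/20, G01J 5/10, G01J 5/02\n",
}

pdf_url = pdf_link(sample_entry)
fields = summary_fields(sample_entry["summary"])
print(pdf_url)
print(fields)
```

A regular expression is enough here because the summary is a flat run of label and value pairs. For anything more nested, parsing the case page itself is likely the safer route.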

### Get HTML of Case from Feed

In [5]:
case_page_url = feedentries[-1]['link']
print(case_page_url)

http://www.epo.org/law-practice/case-law-appeals/recent/t111474eu1.html


In [6]:
import requests

r = requests.get(case_page_url)

In [7]:
# Check request worked - i.e. got a 200
r.status_code == requests.codes.ok

True
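
When fetching many case pages in a loop, it may be more convenient to fail loudly than to inspect a boolean each time. The sketch below wraps the request in a helper, using the `timeout` argument and `raise_for_status()` method that `requests` provides. The name `fetch_case_page` and its `get` parameter (which lets a stand-in be passed in for testing) are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_case_page(url, timeout=10, get=requests.get):
    """Fetch a page and return its text, raising on HTTP errors."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage (makes a network call, so left commented out here):
# case_page_text = fetch_case_page(case_page_url)
```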

In [8]:
case_page_text = r.text

In [9]:
case_page_text

'<!DOCTYPE html>\n\n<html class="no-js forceScrollBar"><head><meta content="IE=edge" http-equiv="X-UA-Compatible" /><title>EPO - T 1474/11 () of 2.1.2012</title><meta charset="utf-8" /><meta content="text/javascript" http-equiv="Content-Script-Type" /><meta content="text/css" http-equiv="Content-Style-Type" /><meta content="no" http-equiv="imagetoolbar" />\n<meta name="DCTERMS.available"  content="20120118" />\n<meta name="dg3aDCI"\t\tcontent="" />\n<meta name="dg3aDCIT"\t\tcontent="" /> \n<meta name="dg3ANR"             content="T111474EU1" />\n<meta name="dg3APN"\t\tcontent="06075345" />\n<meta name="dg3APNwc"\t\tcontent="06075345 0607534 060753" />\n<meta name="dg3ApplicantA"\tcontent="delphi technologies, inc." />\n<meta name="dg3Applicant"\tcontent="Delphi Technologies, Inc." />\n<meta name="dg3ArtRef"\t\tcontent="108" />\n<meta name="dg3BOAnDot"\t\tcontent="3402" />\n<meta name="dg3CaseIPC"\t\tcontent="G01J 5/34, G01J 5/20, G01J 5/10, G01J 5/02" />\n<meta name="dg3CaseType"\tcont

### Use Beautiful Soup to Parse Case Law HTML

In [14]:
# Import Beautiful Soup for HTML parsing
from bs4 import BeautifulSoup

In [16]:
soup = BeautifulSoup(case_page_text, "lxml")

We can use our browser to inspect the HTML. It appears the data we are interested in is within the `<div>` tag having `id="pagebody"`.

In [21]:
pagebody = soup.find("div", {"id": "pagebody"})

In [22]:
pagebody

<div class=" " data-pagetype="contentPage" id="pagebody"><h1>T 1474/11 () of 2.1.2012</h1><a name="Content"></a> <!--googleon: all--> <div id="body" lang="en">
<table class="tableType3">
<tbody>
<tr>
<th>European Case Law Identifier:</th>
<td>ECLI:EP:BA:2012:T147411.20120102</td>
</tr>
<tr>
<th>Date of decision:</th>
<td>02 January 2012</td>
</tr>
<tr>
<th>Case number:</th>
<td>T 1474/11</td>
</tr>
<tr>
<th>Application number:</th>
<td><a class="xint" href="https://register.epo.org/espacenet/application?number=EP06075345" title="Open in the European Patent Register: About this file">06075345.6</a></td>
</tr>
<tr>
<th>IPC class:</th>
<td><a class="intx" href="http://worldwide.espacenet.com/classification?locale=en_EP#!/CPC=G01J5/34" title="Look up IPC in Espacenet using the CPC browser">G01J 5/34</a><br/>
<a class="intx" href="http://worldwide.espacenet.com/classification?locale=en_EP#!/CPC=G01J5/20" title="Look up IPC in Espacenet using the CPC browser">G01J 5/20</a><br/>
<a class="int

#### Parse Case Details from Page Table

In [29]:
# Then we can convert the th / td pair in each tr row into a dictionary entry
table_entries = pagebody.find_all("tr")
table_entries

[<tr>
 <th>European Case Law Identifier:</th>
 <td>ECLI:EP:BA:2012:T147411.20120102</td>
 </tr>, <tr>
 <th>Date of decision:</th>
 <td>02 January 2012</td>
 </tr>, <tr>
 <th>Case number:</th>
 <td>T 1474/11</td>
 </tr>, <tr>
 <th>Application number:</th>
 <td><a class="xint" href="https://register.epo.org/espacenet/application?number=EP06075345" title="Open in the European Patent Register: About this file">06075345.6</a></td>
 </tr>, <tr>
 <th>IPC class:</th>
 <td><a class="intx" href="http://worldwide.espacenet.com/classification?locale=en_EP#!/CPC=G01J5/34" title="Look up IPC in Espacenet using the CPC browser">G01J 5/34</a><br/>
 <a class="intx" href="http://worldwide.espacenet.com/classification?locale=en_EP#!/CPC=G01J5/20" title="Look up IPC in Espacenet using the CPC browser">G01J 5/20</a><br/>
 <a class="intx" href="http://worldwide.espacenet.com/classification?locale=en_EP#!/CPC=G01J5/10" title="Look up IPC in Espacenet using the CPC browser">G01J 5/10</a><br/>
 <a class="intx"

In [32]:
te = table_entries[0]
print("Field = {0}; data = {1}".format(te.th.text, te.td.text))

Field = European Case Law Identifier:; data = ECLI:EP:BA:2012:T147411.20120102
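
The identifier itself appears to carry structured data. The sketch below splits it apart, assuming the layout inferred from this single example: five colon-separated parts, the last being a case code and a `YYYYMMDD` date joined by a dot. The helper name `parse_ecli` and the dictionary keys are hypothetical.

```python
from datetime import datetime


def parse_ecli(ecli):
    """Split an identifier shaped like the one above into its parts."""
    # Assumed layout: ECLI:<country>:<court>:<year>:<case code>.<date>
    _prefix, country, court, year, ordinal = ecli.split(":")
    case_code, date_text = ordinal.split(".")
    return {
        "country": country,
        "court": court,
        "year": int(year),
        "case_code": case_code,
        "decision_date": datetime.strptime(date_text, "%Y%m%d").date(),
    }


parsed = parse_ecli("ECLI:EP:BA:2012:T147411.20120102")
print(parsed)
```

The date recovered this way matches the "Date of decision" row for this case, which gives a cheap consistency check.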


In [35]:
for te in table_entries:
    if te.th and te.td:
        print("Field = {0}; data = {1}".format(te.th.text.strip(), te.td.text.strip()))

Field = European Case Law Identifier:; data = ECLI:EP:BA:2012:T147411.20120102
Field = Date of decision:; data = 02 January 2012
Field = Case number:; data = T 1474/11
Field = Application number:; data = 06075345.6
Field = IPC class:; data = G01J 5/34
G01J 5/20
G01J 5/10
G01J 5/02
Field = Language of proceedings:; data = EN
Field = Distribution:; data = D
Field = Download and more information:; data = Decision text in  EN (PDF,   15.312K)


Documentation of the appeal procedure can be found in the Register


Bibliographic information is available in:
EN


Versions:
Unpublished
Field = Bibliographic information is available in:; data = EN
Field = Versions:; data = Unpublished
Field = Title of application:; data = Apparatus and method for providing thermal conductance in thermally responsive photonic imaging devices
Field = Applicant name:; data = Delphi Technologies, Inc.
Field = Opponent name:; data = -
Field = Board:; data = 3.4.02
Field = Headnote:; data = -
Field = Relevant legal pr

In [36]:
"string\n with some \n random new \n\nlines".split("\n")

['string', ' with some ', ' random new ', '', 'lines']
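
The plain split leaves padding around each piece and an empty string where two newlines meet. A small sketch that strips each piece and drops the blanks (the name `clean_lines` is made up for illustration):

```python
def clean_lines(text):
    """Split text on newlines, strip each piece, and drop empty pieces."""
    return [line.strip() for line in text.splitlines() if line.strip()]


cleaned = clean_lines("string\n with some \n random new \n\nlines")
print(cleaned)
# ['string', 'with some', 'random new', 'lines']
```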

In [39]:
from collections import OrderedDict

case_data = OrderedDict()
for te in table_entries:
    if te.th and te.td:
        fieldname = te.th.text.strip()
        # Get rid of colon
        if ":" in fieldname:
            fieldname = fieldname.split(":")[0]
            
        data = te.td.text.strip()
        # Split multiline data into list
        if "\n" in data:
            data = [d.strip() for d in data.split("\n")]
        # Dash indicates no data
        if data == "-":
            data = None
        
        # Skip the "Download and more information" field
        if "and more information" not in fieldname:
            case_data[fieldname] = data
        

In [40]:
case_data

OrderedDict([('European Case Law Identifier',
              'ECLI:EP:BA:2012:T147411.20120102'),
             ('Date of decision', '02 January 2012'),
             ('Case number', 'T 1474/11'),
             ('Application number', '06075345.6'),
             ('IPC class',
              ['G01J 5/34', 'G01J 5/20', 'G01J 5/10', 'G01J 5/02']),
             ('Language of proceedings', 'EN'),
             ('Distribution', 'D'),
             ('Bibliographic information is available in', 'EN'),
             ('Versions', 'Unpublished'),
             ('Title of application',
              'Apparatus and method for providing thermal conductance in thermally responsive photonic imaging devices'),
             ('Applicant name', 'Delphi Technologies, Inc.'),
             ('Opponent name', None),
             ('Board', '3.4.02'),
             ('Headnote', None),
             ('Relevant legal provisions',
              ['European Patent Convention Art 108',
               'European Patent Convention R
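
To apply the same steps to every case page, the cell above can be wrapped in a function. The following is a sketch rather than the notebook's own code: the name `parse_case_table` is hypothetical, it drops blank lines from multi-line cells (the original cell keeps them), it uses a plain `dict` (which preserves insertion order from Python 3.7), and the toy HTML is parsed with the built-in `html.parser` instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup


def parse_case_table(pagebody):
    """Map each <th> label in the details table to its <td> text."""
    case_data = {}
    for row in pagebody.find_all("tr"):
        if not (row.th and row.td):
            continue
        # Keep the label up to the first colon
        fieldname = row.th.text.strip().split(":")[0]
        if "and more information" in fieldname:
            continue
        lines = [line.strip() for line in row.td.text.split("\n") if line.strip()]
        if not lines or lines == ["-"]:
            case_data[fieldname] = None      # a dash indicates no data
        elif len(lines) == 1:
            case_data[fieldname] = lines[0]
        else:
            case_data[fieldname] = lines     # multi-line cell becomes a list
    return case_data


# Toy HTML modelled on the table above (illustrative only)
sample_html = """
<div id="pagebody"><table><tbody>
<tr><th>Case number:</th><td>T 1474/11</td></tr>
<tr><th>IPC class:</th><td><a>G01J 5/34</a><br/>
<a>G01J 5/20</a></td></tr>
<tr><th>Opponent name:</th><td>-</td></tr>
</tbody></table></div>
"""
sample_body = BeautifulSoup(sample_html, "html.parser").find("div", {"id": "pagebody"})
result = parse_case_table(sample_body)
print(result)
```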

#### Parse Decision Details from Paragraph Tags

Bold tags mark the headings for our different sections.

We need to ignore paragraph tags that do not contain any alphanumeric characters. The paragraph structure is useful, so we'll save the text as a list of paragraphs.

In [28]:
decision_text = pagebody.find_all("p")
decision_text

[<p>
 			 - 
 			</p>,
 <p><b>Summary of Facts and Submissions</b></p>,
 <p>I. The appellant contests the decision of the examining division of the European Patent Office dated 18 January 2011 refusing European patent application No. 06075345.6.</p>,
 <p>The appellant filed a notice of appeal on 2 March 2011 and paid the appeal fee on the same day.</p>,
 <p>A written statement setting out the grounds of appeal was not filed within the four-month time limit provided for in Article 108 EPC.</p>,
 <p>II. In a communication dated 11 July 2011, the Board informed the appellant that no statement setting out the grounds of appeal had been received and that the appeal could be expected to be rejected as inadmissible. The appellant was informed that any observations should be filed within two months.</p>,
 <p>III. The appellant filed no observations in response to said communication.</p>,
 <p><b>Reasons for the Decision</b></p>,
 <p>No written statement setting out the grounds of appeal was fil

In [46]:
len(decision_text[0].text.strip())

1

In [47]:
decision_text[1].text.strip()

'Summary of Facts and Submissions'

In [49]:
decision_text[0].b

In [48]:
decision_text[1].b

<b>Summary of Facts and Submissions</b>

In [50]:
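decision_dict = OrderedDict()
current_title = "No title"
for p in decision_text:
    text = p.text.strip()
    # If the paragraph contains bold text, treat it as a section title
    if p.b:
        current_title = text
        decision_dict[current_title] = list()
    # Otherwise keep it, skipping near-empty paragraphs such as "-"
    elif len(text) > 1:
        # setdefault avoids a KeyError for text before the first title
        decision_dict.setdefault(current_title, list()).append(text)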

decision_dict

OrderedDict([('Summary of Facts and Submissions',
              ['I. The appellant contests the decision of the examining division of the European Patent Office dated 18 January 2011 refusing European patent application No. 06075345.6.',
               'The appellant filed a notice of appeal on 2 March 2011 and paid the appeal fee on the same day.',
               'A written statement setting out the grounds of appeal was not filed within the four-month time limit provided for in Article 108 EPC.',
               'II. In a communication dated 11 July 2011, the Board informed the appellant that no statement setting out the grounds of appeal had been received and that the appeal could be expected to be rejected as inadmissible. The appellant was informed that any observations should be filed within two months.',
               'III. The appellant filed no observations in response to said communication.']),
             ('Reasons for the Decision',
              ['No written statement set
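
The grouping step can also be written as a reusable function. The sketch below makes two assumptions that go slightly beyond the cell above: a paragraph counts as a heading only when its bold tag covers the whole paragraph (so a bold word inside a sentence does not start a new section), and a paragraph is kept only if it contains at least one alphanumeric character. The name `parse_decision` and the toy HTML are illustrative, and `html.parser` is used instead of `lxml` to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def parse_decision(pagebody, default_title="No title"):
    """Group paragraph texts under the bold heading that precedes them."""
    sections = {}
    current_title = default_title
    for p in pagebody.find_all("p"):
        text = p.text.strip()
        # Treat a paragraph as a heading only if it is entirely bold
        if text and p.b and p.b.text.strip() == text:
            current_title = text
            sections.setdefault(current_title, [])
        # Skip paragraphs with no letters or digits, such as "-"
        elif any(ch.isalnum() for ch in text):
            sections.setdefault(current_title, []).append(text)
    return sections


# Toy HTML modelled on the decision text above (illustrative only)
sample_html = """
<div id="pagebody">
<p> - </p>
<p><b>Summary of Facts and Submissions</b></p>
<p>I. The appellant contests the decision.</p>
<p>II. The Board sent a <b>communication</b>.</p>
<p><b>Reasons for the Decision</b></p>
<p>The appeal is inadmissible.</p>
</div>
"""
sample_body = BeautifulSoup(sample_html, "html.parser").find("div", {"id": "pagebody"})
sections = parse_decision(sample_body)
print(sections)
```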