# 2. Case Law Prediction - Downloading the Data

This post looks at the steps needed to build a case law dataset. Given a set of feed entries from a case law search, we download the HTML for each case page and then parse that page into structured data.

To be kind to the EPO servers we need to throttle our requests, and we also rotate IP addresses and user agent strings.

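Robots.txt from the EPO (under `/law-practice/case-law-appeals` only the `pdf` folder is disallowed, not the HTML case pages fetched further down):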
```
### robots.txt for www.epo.org
# Wed Jan  5 12:38:07 CET 2011
User-agent: *
Disallow: /search.html
Disallow: /search_de.html
Disallow: /search_fr.html
Disallow: /*?update*
Disallow: /*?hp=*
Disallow: /law-practice/case-law-appeals/pdf
Disallow: /applying/online-services/security/reader/
Disallow: /gemsafe
Disallow: /learning-events/events/conferences/focus-marocco.html
Disallow: /learning-events/events/conferences/focus-marocco_de.html
Disallow: /learning-events/events/conferences/focus-marocco_fr.html
Disallow: /applying/online-services/online-filing/download/pdf.html
Disallow: /applying/online-services/online-filing/download/pdf_de.html
Disallow: /applying/online-services/online-filing/download/pdf_fr.html
##
```
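
As a quick sanity check, rules like the ones above can be fed to Python's standard `urllib.robotparser` to confirm that the HTML case pages are not disallowed. This is only a minimal sketch: the rules are pasted in as a string rather than fetched, only two of the lines are included, and the wildcard rules are left out because `urllib.robotparser` does simple prefix matching. The case page URL is taken from the download log later in this post; the PDF URL is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# Subset of the EPO rules shown above (wildcard rules omitted, since
# urllib.robotparser only does simple prefix matching).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search.html
Disallow: /law-practice/case-law-appeals/pdf
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

case_page = "http://www.epo.org/law-practice/case-law-appeals/recent/t960278eu1.html"
# Hypothetical PDF path, used only to show a disallowed URL
pdf_page = "http://www.epo.org/law-practice/case-law-appeals/pdf/example.pdf"

case_allowed = parser.can_fetch("*", case_page)
pdf_allowed = parser.can_fetch("*", pdf_page)
print(case_allowed, pdf_allowed)
```

The case page should be reported as fetchable and the PDF path as not.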

In [14]:
# Imports
import time
import random
import os, pickle

from collections import OrderedDict

from lxml.html import fromstring
import requests
from itertools import cycle

# Import Beautiful Soup for HTML parsing
from bs4 import BeautifulSoup

## Load Feed Data

In [1]:
def loaddata(filename):
    """Helper function to load data from a pickle file."""
    with open(filename, "rb") as f:
        print("Loading data")
        data = pickle.load(f)
    return data

feedentries = loaddata("feeds.pik")
# Run in reverse date order 
feedentries.reverse()

Loading data


## Set Up System for Multiple Requests

In [3]:
# Define a sleep function
def sleep():
    """Wait for a time period in seconds between 1 and 2 s in 0.2 s increments"""
    # Throttling Requests
       
    time.sleep(random.randint(5, 10)/5)

#### Set Up a Pool of Proxies

In use, many of these proxies turned out to be flaky or already blacklisted. The methods below start out with a long list of about 50 servers and then cull this list based on errors and timeouts.

In [64]:
# Rotating Proxies

def get_proxies():
    """ Get a list of proxies."""
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr'):
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies


#If you are copy pasting proxy ips, put in the list below
#proxies = ['121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080']
proxies = get_proxies()
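
The notebook culls proxies lazily, as requests fail. An alternative would be to test each proxy once up front and keep only the ones that respond. The sketch below is an assumption rather than what was actually run: `proxy_works` and `cull_proxies` are hypothetical helpers, the test URL and timeout are arbitrary choices, and it uses the standard library's `urllib.request` instead of `requests` purely to stay dependency-free.

```python
import urllib.request


def proxy_works(proxy, test_url="http://www.epo.org/", timeout=2):
    """Return True if test_url can be fetched through proxy ('host:port')."""
    handler = urllib.request.ProxyHandler(
        {"http": "http://" + proxy, "https": "http://" + proxy}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # Connection errors, timeouts and HTTP errors all count as failures
        return False


def cull_proxies(proxies, check=proxy_works):
    """Keep only the proxies for which check(proxy) is truthy."""
    return [proxy for proxy in proxies if check(proxy)]
```

With this in place, `proxies = cull_proxies(get_proxies())` would return a list of proxies that answered at least once. The `check` argument is injectable so the culling logic can be exercised without touching the network.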

In [70]:
proxies

['147.135.210.114:54566',
 '151.80.140.233:54566',
 '80.211.4.187:8080',
 '92.53.73.138:8118']

In [49]:
random.choice(list(proxies))

'151.80.140.233:54566'

#### Define a List of User Agent Strings to Use

In [6]:
# Rotating User Agent Strings
user_agent_list = [
   #Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    #Internet Explorer
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]
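
The download loop further down picks a user agent with `random.choice`, although `cycle` is imported at the top of the notebook. As an illustration of the round-robin alternative, here is a small sketch using `itertools.cycle`. The two placeholder strings and the `next_headers` helper are hypothetical and only stand in for the full `user_agent_list`.

```python
from itertools import cycle

# Two placeholder strings stand in for the full user_agent_list above
example_agents = ["agent-a", "agent-b"]
agent_pool = cycle(example_agents)


def next_headers(pool):
    """Build request headers using the next user agent in the pool."""
    return {"User-Agent": next(pool)}


first = next_headers(agent_pool)
second = next_headers(agent_pool)
third = next_headers(agent_pool)  # wraps around to the first agent again
```

Round-robin guarantees every string is used equally often, whereas random choice is simpler and less predictable; either should work for this purpose.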

### Make Requests

In [7]:
def savedata(data, filename):
    """Helper function to save data to a pickle file."""
    import os, pickle
    with open(filename, "wb") as f:
        pickle.dump(data, f)
        print("Data saved")

In [66]:
# This method could likely be improved along the lines
# discussed here - https://www.peterbe.com/plog/best-practice-with-retries-with-requests
def make_request(url, headers, proxies):
    proxy = random.choice(proxies)
    
    try:
        # Timeout of less than 1s leads to lots of connection errors
        r = requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy}, timeout=2)
    except Exception as e:
        print("URL - {0} - raised error {1} with proxy {2}".format(url, e.__class__.__name__, proxy))
        proxies.remove(proxy)
        if proxies:
            print("{} proxies remaining".format(len(proxies)))
            sleep()
            return make_request(url, headers, proxies)
        # No proxies left: return here, otherwise `r` is undefined below
        return None, proxies
        
    if r.status_code == requests.codes.ok:
        return r.text, proxies
    elif r.status_code in [403, 407, 503]:
        # Remove and recycle proxy
        print("URL - {0} - returned status code {1} with proxy {2}".format(url, r.status_code, proxy))
        proxies.remove(proxy)
        if proxies:
            print("{} proxies remaining".format(len(proxies)))
            sleep()
            return make_request(url, headers, proxies)
        else:
            return None, proxies  
    else:
        print("URL - {0} - returned status code {1} and text:\n{2}".format(url, r.status_code, r.text))
        return None, proxies
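
The same rotate-and-discard logic can also be written as a loop instead of recursion. The following is a minimal sketch under stated assumptions, not the code that produced the logs below: `fetch_with_rotation` is a hypothetical name, the HTTP call is passed in as a `fetch` callable returning `(status_code, text)`, and the `sleep()` between retries is left out for brevity.

```python
import random


def fetch_with_rotation(url, proxies, fetch, drop_statuses=(403, 407, 503)):
    """Try proxies at random until one succeeds or none are left.

    fetch(url, proxy) must return (status_code, text) or raise on failure.
    Returns (text or None, remaining_proxies), like make_request.
    """
    proxies = list(proxies)  # work on a copy so the caller's list is untouched
    while proxies:
        proxy = random.choice(proxies)
        try:
            status, text = fetch(url, proxy)
        except Exception:
            proxies.remove(proxy)  # unreachable or timed out: discard
            continue
        if status == 200:
            return text, proxies
        if status in drop_statuses:
            proxies.remove(proxy)  # blocked or misbehaving proxy: discard
            continue
        return None, proxies  # other status: give up on this URL, keep the proxy
    return None, proxies
```

With `requests`, `fetch` could wrap the `requests.get(...)` call from `make_request` and return `(r.status_code, r.text)`. Passing the HTTP call in as an argument also makes the proxy handling easy to test with a fake.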

In [29]:
def parse_html(case_page_text):
    """ Parse HTML into structured data."""
    soup = BeautifulSoup(case_page_text, "lxml")
    pagebody = soup.find("div", {"id": "pagebody"})
    # Parse table data
    table_entries = pagebody.find_all("tr")
    case_data = OrderedDict()
    for te in table_entries:
        if te.th and te.td:
            fieldname = te.th.text.strip()
            # Get rid of colon
            if ":" in fieldname:
                fieldname = fieldname.split(":")[0]
            
            data = te.td.text.strip()
            # Split multiline data into list
            if "\n" in data:
                data = [d.strip() for d in data.split("\n")]
            # Dash indicates no data
            if data == "-":
                data = None
        
            # Skip the "Download and more information field"
            if "and more information" not in fieldname:
                case_data[fieldname] = data
                
    # Parse decision details
    decision_text = pagebody.find("div", {"id": "body"}).find_all("p", recursive=False)
    decision_dict = OrderedDict()
    current_title = "No title"
    for p in decision_text:
        if p:
            text = p.text.strip()
            # If bold mark as title
            if p.b:
                current_title = text
                decision_dict[current_title] = list()
            elif len(text) > 1:
                # Else add text under title (setdefault covers text before the first bold title)
                decision_dict.setdefault(current_title, list()).append(text)
    
    case_data["Decision Text"] = decision_dict
    return case_data
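
To make the table clean-up rules in `parse_html` easier to see in isolation, here they are pulled out into a stand-alone function. This is a sketch only: `normalise_field` is a hypothetical helper, and the field names in the usage note are invented examples rather than the actual headers on the EPO pages.

```python
def normalise_field(fieldname, data):
    """Apply parse_html's clean-up rules to one table row's header and cell text."""
    fieldname = fieldname.strip().split(":")[0]  # drop the colon and anything after it
    data = data.strip()
    if "\n" in data:
        data = [d.strip() for d in data.split("\n")]  # multiline cell -> list
    if data == "-":
        data = None  # a lone dash means "no data"
    return fieldname, data
```

For example, `normalise_field("Case number:", " T 0278/96 ")` gives `("Case number", "T 0278/96")`, and a cell containing only `-` comes back as `None`.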

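If a proxy returns a 403 we want to remove it from the list of proxies, so we first convert the set returned by `get_proxies()` into a list (`random.choice` needs a sequence).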

In [68]:
proxies = list(proxies)
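
The loop below appends each result to a `cases` list as `(i, title, link, parsed_data)` and pickles that list to `parsed_data.pkl` every ten feeds. If a run stops part-way (for instance when the proxies run out), it would be handy to pick up from the last checkpoint rather than start again. A minimal sketch of that idea follows; `load_checkpoint` and `pending_feeds` are hypothetical helpers and not part of the original notebook, and the sketch assumes `feedentries` is in the same order on every run.

```python
import os
import pickle


def load_checkpoint(filename):
    """Return the list pickled in filename, or an empty list if it does not exist."""
    if not os.path.exists(filename):
        return []
    with open(filename, "rb") as f:
        return pickle.load(f)


def pending_feeds(feedentries, cases):
    """Yield (index, feed) pairs whose index is not already stored in cases."""
    done = {case[0] for case in cases}  # first tuple element is the feed index
    for i, feed in enumerate(feedentries):
        if i not in done:
            yield i, feed
```

A resumed run could then start with `cases = load_checkpoint("parsed_data.pkl")` and iterate over `pending_feeds(feedentries, cases)` instead of `enumerate(feedentries)`.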

In [69]:
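# We'll process in reverse date order.
# `cases` collects the results; start from an empty list here, or load a
# previous partial run with loaddata("parsed_data.pkl") to resume.
cases = []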
for i, feed in enumerate(feedentries, start=0):
    print("Saving Feed {0} - {1}.".format(i, feed['title']))
    
    case_page_url = feed['link']
    #Pick a random user agent
    user_agent = random.choice(user_agent_list)
    #Set the headers 
    headers = {'User-Agent': user_agent}
    
    if not proxies:
        print("No working proxies")
        break
    
    html, proxies = make_request(case_page_url, headers, proxies)
    
    if not html:
        print("Feed {0} - {1} - {2} did not work.".format(i, feed['title'], case_page_url))
    else:
        parsed_data = parse_html(html)
        cases.append(
            (
                i,
                feed['title'],
                feed['link'],
                parsed_data
            )
        )
        
    sleep()
    
    if i > 0 and ((i % 10) == 0):
        print("Cases data is length: ", len(cases))
        savedata(cases, "parsed_data.pkl")
    

Saving Feed 2027 - T 0278/96 () of 10.7.2001.
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t960278eu1.html - raised error ConnectionError with proxy 213.6.40.142:80
48 proxies remaining
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t960278eu1.html - returned status code 403 with proxy 144.217.233.5:3128
47 proxies remaining
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t960278eu1.html - raised error ConnectTimeout with proxy 147.75.113.108:8080
46 proxies remaining
Saving Feed 2028 - T 0121/99 () of 11.9.2001.
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t990121eu1.html - raised error ProxyError with proxy 188.254.78.108:8888
45 proxies remaining
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t990121eu1.html - returned status code 403 with proxy 5.196.169.98:3128
44 proxies remaining
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t990121eu1.html - raised error ReadTimeout with proxy 180.244.
...

URL - http://www.epo.org/law-practice/case-law-appeals/recent/t981144eu1.html - raised error ConnectTimeout with proxy 194.126.183.141:53281
20 proxies remaining
Saving Feed 2058 - T 0492/00 (Channel switching system/NEC) of 12.10. ....
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t000492eu1.html - raised error ConnectTimeout with proxy 5.189.146.57:80
19 proxies remaining
Saving Feed 2059 - T 0502/99 () of 5.2.2002.
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t990502eu1.html - raised error ReadTimeout with proxy 188.120.232.181:8118
18 proxies remaining
Saving Feed 2060 - T 0807/99 () of 7.2.2002.
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t990807eu1.html - raised error ProxyError with proxy 142.44.194.14:3128
17 proxies remaining
URL - http://www.epo.org/law-practice/case-law-appeals/recent/t990807eu1.html - raised error ConnectTimeout with proxy 31.28.5.185:63909
16 proxies remaining
Cases data is length:  2059
Data saved
...

Cases data is length:  2159
Data saved
Saving Feed 2161 - T 0213/01 () of 8.5.2003.
Saving Feed 2162 - T 0498/99 (Insulated gate type bipolar-transistor with ....
Saving Feed 2163 - T 0353/99 (Method and apparatus for the classification ....
Saving Feed 2164 - T 0364/01 () of 10.4.2003.
Saving Feed 2165 - T 0124/01 (VCR scheduling/STARSIGHT TELECAST) ....
Saving Feed 2166 - T 0904/00 (Interconnection-point memory/TEXAS ....
Saving Feed 2167 - T 0971/00 () of 7.5.2003.
Saving Feed 2168 - T 0280/01 () of 1.7.2003.
Saving Feed 2169 - T 0778/02 () of 1.7.2003.
Saving Feed 2170 - T 0214/01 (Teletext receiver/EDICO) of 7.3.2003.
Cases data is length:  2169
Data saved
Saving Feed 2171 - T 0128/01 () of 11.6.2003.
Saving Feed 2172 - T 0937/00 () of 12.6.2003.
Saving Feed 2173 - T 0825/00 () of 11.7.2003.
Saving Feed 2174 - T 0659/00 () of 1.7.2003.
Saving Feed 2175 - T 0050/01 (Image Processing/CANON) of 9.5.2003.
Saving Feed 2176 - T 0088/01 () of 8.7.2003.
Saving Feed 2177 - T 0611/99 () of
...

Saving Feed 2296 - T 0192/84 (Interruption in delivery of mail) of 9.11.1984.
Saving Feed 2297 - T 0035/82 () of 9.1.1985.
Saving Feed 2298 - T 0094/83 () of 1.3.1985.
Saving Feed 2299 - T 0287/84 (Restitutio) of 11.6.1985.
Saving Feed 2300 - T 0287/84 (Wiedereinsetzung) of 11.6.1985.
Cases data is length:  2299
Data saved
Saving Feed 2301 - T 0287/84 (Re-establishment) of 11.6.1985.
Saving Feed 2302 - T 0287/84 (Re-establishment) of 11.6.1985.
Saving Feed 2303 - T 0166/82 () of 12.6.1985.
Saving Feed 2304 - T 0096/83 () of 10.7.1985.
Saving Feed 2305 - T 0080/84 () of 3.10.1985.
Saving Feed 2306 - T 0040/83 () of 4.11.1985.
Saving Feed 2307 - T 0114/84 () of 12.11.1985.
Saving Feed 2308 - T 0146/82 () of 2.12.1985.
Saving Feed 2309 - T 0054/83 () of 9.12.1985.
Saving Feed 2310 - T 0011/84 () of 10.12.1985.
Cases data is length:  2309
Data saved
Saving Feed 2311 - T 0167/82 () of 11.12.1985.
Saving Feed 2312 - T 0043/83 () of 11.12.1985.
Saving Feed 2313 - T 0130/85 () of 10.4.1986.
...

Saving Feed 2444 - T 0265/88 () of 7.11.1989.
Saving Feed 2445 - T 0015/89 () of 1.12.1989.
Saving Feed 2446 - T 0128/88 () of 12.12.1989.
Saving Feed 2447 - T 0240/87 () of 8.12.1989.
Saving Feed 2448 - T 0568/89 () of 10.1.1990.
Saving Feed 2449 - T 0591/88 () of 12.12.1989.
Saving Feed 2450 - T 0126/89 () of 10.1.1990.
Cases data is length:  2449
Data saved
Saving Feed 2451 - T 0282/85 () of 5.12.1989.
Saving Feed 2452 - T 0420/89 () of 8.2.1990.
Saving Feed 2453 - T 0397/88 () of 11.1.1990.
Saving Feed 2454 - T 0657/89 () of 7.2.1990.
Saving Feed 2455 - T 0321/88 () of 6.2.1990.
Saving Feed 2456 - T 0400/87 () of 1.3.1990.
Saving Feed 2457 - T 0594/88 () of 8.2.1990.
Saving Feed 2458 - T 0618/88 () of 6.3.1990.
Saving Feed 2459 - T 0186/86 () of 5.12.1989.
Saving Feed 2460 - T 0239/84 () of 7.3.1990.
Cases data is length:  2459
Data saved
Saving Feed 2461 - T 0618/89 () of 7.3.1990.
Saving Feed 2462 - T 0143/88 () of 4.4.1990.
Saving Feed 2463 - T 0213/89 () of 10.4.1990.
...

Saving Feed 2602 - T 0255/92 () of 9.9.1992.
Saving Feed 2603 - T 0743/91 () of 10.9.1992.
Saving Feed 2604 - T 0693/90 () of 11.8.1992.
Saving Feed 2605 - T 0881/90 () of 9.9.1992.
Saving Feed 2606 - T 0313/91 () of 6.8.1992.
Saving Feed 2607 - T 0835/91 () of 7.10.1992.
Saving Feed 2608 - T 0200/91 () of 6.10.1992.
Saving Feed 2609 - T 0141/91 () of 12.8.1992.
Saving Feed 2610 - T 0241/92 () of 2.9.1992.
Cases data is length:  2609
Data saved
Saving Feed 2611 - T 0669/92 () of 2.11.1992.
Saving Feed 2612 - T 0281/92 () of 6.11.1992.
Saving Feed 2613 - T 0519/92 () of 1.12.1992.
Saving Feed 2614 - T 0940/91 () of 12.10.1992.
Saving Feed 2615 - T 0249/91 () of 4.12.1992.
Saving Feed 2616 - T 0218/92 () of 9.12.1992.
Saving Feed 2617 - T 0135/91 () of 4.12.1992.
Saving Feed 2618 - T 0319/91 () of 8.12.1992.
Saving Feed 2619 - T 0043/92 () of 2.12.1992.
Saving Feed 2620 - T 0119/91 () of 2.2.1993.
Cases data is length:  2619
Data saved
Saving Feed 2621 - T 0101/91 () of 1.12.1992.
...

So now we have the data for 2745 cases. The next step is to build a simple MongoDB database to store this information.