# Replacing the Associate - Legal Machine Learning

This post is part of a series that looks at applying machine learning to legal information. We will start off by looking at case law for excluded subject matter in the United Kingdom and Europe.

## Feature Generation

Machine learning algorithms typically operate on a vector of numbers (integers or floats). The patent information we have tends to be text-based. This post will explore how to turn our text into numeric feature vectors. 
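
To make the idea concrete, here is a minimal bag-of-words sketch in plain Python. The function names and the two sample sentences are illustrative only and are not part of the notebook; in practice a library class such as scikit-learn's `CountVectorizer` would likely do this job.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) where each vector holds term counts."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary fixes the meaning of each position in the vectors
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Illustrative sample texts (not taken from the database)
docs = [
    "A method of treating skin",
    "A system for browsing a music catalog",
]
vocabulary, vectors = bag_of_words(docs)
```

Each document becomes a list of integers of the same length, which is the shape of input most classifiers expect.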

A good place to start is to read these tutorials:
* Working with text data (from scikit-learn) http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* Learning to Classify Text (from nltk) http://www.nltk.org/book/ch06.html

### Prerequisites

Before you start you need to install Python and a bucketful of useful libraries. The best way to do this is to use [Anaconda](https://www.continuum.io/downloads). On my ten-year-old laptop running Puppy Linux (which was in the loft for a year or so covered in woodlouse excrement) this simply involved running the script. No compiling from source. No version errors. No messing with pip. 

I find that Jupyter (formerly IPython) notebooks are a great way to code iteratively. You can test out ideas block by block, shift stuff around, and output and document all in the same tool. You can also easily export to HTML with one click (hence this post). To start a notebook once Anaconda is installed, run the following:
```
jupyter notebook
```
This will start the notebook server on your local machine and open your browser. By default the notebooks are served at *localhost:8888*. To access the server across a local network, use the *--ip* flag with your IP address (e.g. --ip=192.168.1.2) and then point your browser at *[your-ip]:8888* (use --port to change the port).

This notebook also makes use of a library I hacked together for accessing the EPO OPS API. You can clone this library from: https://github.com/benhoyle/EPOops.

We will want to save the retrieved data, either in a database or as a series of flat files. We also need to investigate how to build a scikit-learn corpus.
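
As a rough sketch of the flat-file option, the helper below writes each description into a folder named after its label. The function name, folder names and file naming are assumptions for illustration, not something the notebook does yet. The one-folder-per-category layout is chosen because it is the layout scikit-learn's `sklearn.datasets.load_files` expects, which may be one route to building a corpus.

```python
from pathlib import Path


def save_description(root, pub_no, text, excluded):
    """Write one description to <root>/<label>/<pub_no>.txt and return the path.

    Hypothetical helper: the folder names 'excluded' / 'not_excluded'
    are placeholders chosen for this sketch.
    """
    label = "excluded" if excluded else "not_excluded"
    folder = Path(root) / label
    # Create the label folder on first use
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / (pub_no + ".txt")
    path.write_text(text, encoding="utf-8")
    return path
```

Calling `save_description("corpus", "GB2415387", desctext, True)` would then create `corpus/excluded/GB2415387.txt`.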

#### Imports

In [1]:
import os

In [2]:
from EPOops.epo_ops import EPOops, keysearch

In [3]:
epo = EPOops()

In [4]:
desctext = epo.get_published_desc("WO2006084269")
print(desctext[:1000])

SYSTEM FOR BROWSING THROUGH A MUSIC CATALOG USING CORRELATION METRICS OF A KNOWLEDGE BASE OF MEDIASETS
Related Applications
[0001] This application claims priority from U.S. Provisional Application No. 60/649,945 filed February 4, 2005, incorporated herein by this reference in its entirety as though fully set forth.
Technical Field
[0002] This invention relates generally to systems for assisting users to navigate media item catalogs with the ultimate goal of building mediasets and/or discover media items. More specifically, the present invention pertains to computer software methods and products to enable users to interactively browse through an electronic catalog by leveraging media item association metrics.
Background of the Invention
[0003] New technologies combining digital media item players with dedicated software, together with new media distribution channels through networks are quickly changing the way people organize and play media items. As a direct consequence of such evolu

Set up a connection to our case law database. (This code could be placed in a file and imported.)

In [5]:
# Save current working directory as a variable for use later
current_dir = os.getcwd()

# Define name and path for SQLite3 DB
db_name = "gb_excluded_sm.db"
db_path = os.path.join(current_dir, db_name)

# Create DB
from sqlalchemy import create_engine
engine = create_engine('sqlite:///' + db_path, echo=False)

# Setup base class
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()

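# datetime is needed by the date handling in as_dict() and populate() below
from datetime import datetime
from sqlalchemy import Column, Integer, String, Date, Boolean, Text

# Define Class for Excluded Matter Case Details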
class Decision(Base):
    __tablename__ = 'decisions'
    
    id = Column(Integer, primary_key=True)
    # Hearing number
    bl_number = Column(String(15))
    # Application number
    appln_no = Column(String(15))
    # Publication number / patent number
    pub_no = Column(String(15))
    # Country code e.g. GB
    country_code = Column(String(2))
    # Applicant / proprietor 
    applicant = Column(String(256))
    
    hearing_officer = Column(String(128))
    decision_date = Column(Date)
    # Link for the decision page
    link = Column(String(128))
    # Summary text
    summary = Column(Text)
    # File name for the PDF
    filelink = Column(String(128))
    
    # Whether case was deemed to relate to excluded subject matter
    excluded = Column(Boolean)
    
    
    def as_dict(self):
        """ Return object as a dictionary. """
        temp_dict = {}
        temp_dict['object_type'] = self.__class__.__name__
        for c in self.__table__.columns:
            cur_attr = getattr(self, c.name)
            # If a date or datetime, generate a string representation
            # (Date columns return date objects, which are not datetime instances)
            if hasattr(cur_attr, 'strftime'):
                cur_attr = cur_attr.strftime('%d %B %Y')
            temp_dict[c.name] = cur_attr
        return temp_dict
    
    def populate(self, data):
        """ Populates matching attributes of class instance. 
        param dict data: dict where for each entry key, value equal attributename, attributevalue."""
        for key, value in data.items():
            if hasattr(self, key):
                # Convert string dates into datetimes
                if isinstance(getattr(self, key), datetime) or str(self.__table__.c[key].type) == 'DATE':
                    value = datetime.strptime(value, "%d %B %Y")
                setattr(self, key, value)
                
# Setup SQLAlchemy session
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
session = Session()

In [6]:
# Check Database is connected
session.query(Decision).count()

442

In [7]:
for record in session.query(Decision).all()[0:10]:
    print(record.appln_no)

GB1109923.1
GB1119833.0
GB1313219.6
GB 1201052.6
GB1217392.8
GB0817814.7
GB0809880.8
GB 0807865.1, GB 0807867.7
GB0624556.7
GB 0906015.3


The application numbers in our database require a bit of cleaning. The following problems were spotted:
* Use of one or more spaces after initial GB but before the application number;
* Some cases have multiple application numbers;
* Some cases have a publication number instead of an application number (and end in 'A' or 'B');
* Some cases have an extra '0' at the start of the application number (8 digits before the '.' instead of 7);
* Some cases have the Hearing Number instead of the application number;
* Some cases have an extra '.' at the end of the application number; and
* Some cases are missing the initial 'GB'. 



Regular expressions seem a good way to parse the numbers. 
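
One way to handle several of the problems listed above in a single pass is sketched below. The helper name and the decision to drop an extra leading '0' are assumptions made for illustration; fields holding only a publication number or a hearing number simply return an empty list and would need separate handling.

```python
import re

# 7 digits, a literal dot and a check digit, optionally preceded by one
# extra '0'. The lookarounds stop the match starting or ending mid-number.
_APPLN_RE = re.compile(r'(?<!\d)0?(\d{7}\.\d)(?!\d)')


def extract_appln_nos(raw):
    """Return every GB application number found in a raw database field.

    Hypothetical helper: copes with spaces after 'GB', a missing 'GB',
    several numbers in one field, a trailing '.', and an extra leading '0'.
    """
    if not raw:
        return []
    return ['GB' + number for number in _APPLN_RE.findall(raw)]
```

For example, `extract_appln_nos("GB 0807865.1, GB 0807867.7")` gives a list of two normalised numbers, each with the `GB` prefix and no internal spaces.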

In [8]:
import re

In [9]:
# GB Application Number - 7 digits, then a literal dot, then a digit
re_appln_no = re.compile(r'\d{7}\.\d')
# GB Publication Number - 7 digits followed by end of string or not a dot
# (the lookahead keeps the following character out of the match)
re_pub_no = re.compile(r'\d{7}(?!\.)')

Aside on GB application number check digit:

"The check digit is calculated using a modulus 10 algorithm. Each digit
of the base, from right to left is multiplied by 2,1,2,1 etc respectively.
The separate digits of the products are summed and then divided by
10, the remainder being subtracted from 10 to give the check digit"

http://www.wipo.int/export/sites/www/standards/en/pdf/07-02-06.pdf
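
The quoted algorithm can be sketched as follows. The function names are mine, and treating a result of 10 as a check digit of 0 is an assumption, although it agrees with numbers in our database such as GB0613207.0.

```python
def gb_check_digit(base):
    """Compute the check digit for the 7-digit base of a GB application number."""
    total = 0
    # Walk the digits right to left, alternating multipliers 2, 1, 2, 1, ...
    for position, char in enumerate(reversed(base)):
        product = int(char) * (2 if position % 2 == 0 else 1)
        # Sum the separate digits of the product (e.g. 12 -> 1 + 2)
        total += product // 10 + product % 10
    # Assumption: when the remainder is 0 the check digit is 0, not 10
    return (10 - total % 10) % 10


def has_valid_check_digit(appln_no):
    """Check a number of the form '0414654.4' (without the GB prefix)."""
    base, _, check = appln_no.partition('.')
    if not (base.isdigit() and check.isdigit()):
        return False
    return gb_check_digit(base) == int(check)
```

A check like this could be used to flag application numbers that were mistyped in the source data.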

In [10]:
for record in session.query(Decision).all()[10:20]:
    search_text = record.appln_no
    print(search_text)
    # Skip records with a blank appln_no but keep processing the rest
    if not search_text:
        print("No application number")
        continue
    located = re_appln_no.search(search_text)
    if located:
        print("Match Appln No: " + located.group(0))
    else:
        pub_located = re_pub_no.search(search_text)
        if pub_located:
            print("Match Pub No: " + pub_located.group(0))
        else:  
            print("No match")

GB0616189.7
Match Appln No: 0616189.7
GB0902255.9
Match Appln No: 0902255.9
GB0919376.4
Match Appln No: 0919376.4
GB 0710637.0
Match Appln No: 0710637.0
GB0710612.3
Match Appln No: 0710612.3
GB0613207.0
Match Appln No: 0613207.0
GB 2415387
Match Pub No: 2415387
GB0419580.6, GB0419583.0, GB0724070.8, GB0724072.4
Match Appln No: 0419580.6
GB 0424655.9
Match Appln No: 0424655.9
GB 0414654.4
Match Appln No: 0414654.4


Test converting the application number into a publication number and getting the full text description.

In [11]:
pub_no = epo.convert_number("GB", "0414654.4")
print(pub_no)

desctext = epo.get_published_desc(pub_no)
print(desctext)

GB20040014654
(404, '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<fault xmlns="http://ops.epo.org">\n    <code>SERVER.EntityNotFound</code>\n    <message>No results found</message>\n</fault>\n')


Ah - it seems the application number is being converted into the EPODOC number format, not into the publication number for the application. This will need a tweak to the EPOops code.

In [12]:
pub_nos = epo.appln_to_pub("GB", "0414654.4")

In [13]:
pub_nos

'GB2410364'

In [14]:
desctext = epo.get_published_desc(pub_nos[0]['number'])
print(desctext)

TypeError: string indices must be integers

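The call fails because `appln_to_pub` returned a plain string here rather than a list of dictionaries, so the result cannot be indexed with `[0]['number']`.

Because we need to retrieve up to 442 full text descriptions, we will adapt George Song's OPS client, which includes support for throttling. The code can be cloned from https://github.com/55minutes/python-epo-ops-client (my fork is at https://github.com/benhoyle/python-epo-ops-client/).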

In [4]:
# Add the path for my fork of the client to the system path
import os
import sys
sys.path.insert(0, '/root/projects/caselawml/python-epo-ops-client')

In [5]:
import epo_ops

In [6]:
# Load Key and Secret from config file called "config.ini" in the same directory as this notebook
import configparser
parser = configparser.ConfigParser()
parser.read(os.path.join(os.getcwd(), 'config.ini'))
consumer_key = parser.get('Login Parameters', 'C_KEY')
consumer_secret = parser.get('Login Parameters', 'C_SECRET')
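
For reference, the layout the cell above expects from `config.ini` can be sketched by parsing an equivalent string. The section and option names come from the cell above; the values are placeholders, not real credentials.

```python
import configparser

# Placeholder contents mirroring the expected config.ini layout
EXAMPLE_CONFIG = """
[Login Parameters]
C_KEY = your-consumer-key
C_SECRET = your-consumer-secret
"""

example_parser = configparser.ConfigParser()
example_parser.read_string(EXAMPLE_CONFIG)
# Option names are matched case-insensitively by default
example_key = example_parser.get('Login Parameters', 'C_KEY')
example_secret = example_parser.get('Login Parameters', 'C_SECRET')
```

Keeping the key and secret in a separate file like this means they stay out of the exported notebook.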

In [7]:
# Setup a new registered EPO OPS client that returns JSON
registered_client = epo_ops.RegisteredClient(
    key=consumer_key, 
    secret=consumer_secret, 
    accept_type='json')

In [8]:
registered_client.published_description('GB2415387')[0:10]

['241 5387 Cosmetic Uses Of Electromagnetic Radiation The present invention',
 'relates to the cosmetic use of electromagnetic radiation for the reduction or alleviation or removal or diminishing of wrinkles or fine lines, especially but not exclusively facial and neck wrinkles and other signs of aging. The present invention also provides the use of electromagnetic radiation for generally rejuvenating skin, retarding signs of aging and improving skin elasticity, tone and appearance. The invention also provides for a method of treating skin so as to reduce or alleviate or retard or reverse visible signs of aging and for beautifying skin and an apparatus for effecting such cosmetic treatments.',
 'BACKGROUND',
 "In young skin, the collagen just beneath the surface of the skin forms an organised lattice with good elasticity and flexibility. As women go through menopause and men age, both experience increased skin wrinkling and decreased skin thickness. During aging, the collagen changes i

In [23]:
claims = registered_client.published_claims('GB2415387')
print(claims[0:10])

['Claims 1. A method of cosmetically treating a superficial area of', 'mammalian skin comprising irradiating the skin with a source of divergent electromagnetic radiation of between 900nm to 1500nm.', '2. A method according to claim 1 where the cosmetic treatment is reducing or alleviating or removing or diminishing wrinkles or fine lines, rejuvenating skin, retarding or reversing visible signs of aging, improving skin elasticity, tone, texture and appearance and beautifying the skin.', '3. A method according to either claim 1 or 2 wherein the skin includes the outermost epidermis, basal layer and dermis of face, breast, arm, buttock, thigh, stomach or neck.', '4. A method according to any preceding claim wherein the divergent light is between 10  to 50 .', '5. A method according to any preceding claim wherein the electromagnetic radiation has a bandwidth of about 10 to 120nm.', '6. A method according to any preceding claim wherein the wavelength of the electromagnetic radiation is cen