# Replacing the Associate - Legal Machine Learning

This post is part of a series that looks at applying machine learning to legal information. We will start off by looking at case law for excluded subject matter in the United Kingdom and Europe.

## Feature Generation

Machine learning algorithms typically operate on a vector of numbers (integers or floats). The patent information we have tends to be text-based. This post will explore how to turn our text into numeric feature vectors. 

A good place to start is to read these tutorials:
* Working with text data (from scikit-learn) http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* Learning to Classify Text (from NLTK) http://www.nltk.org/book/ch06.html

### Prerequisites

Before you start you need to install Python and a bucketful of useful libraries. The best way to do this is to use [Anaconda](https://www.continuum.io/downloads). On my ten-year-old laptop running Puppy Linux (which was in the loft for a year or so covered in woodlouse excrement) this simply involved running the script. No compiling from source. No version errors. No messing with pip. 

I find that Jupyter (formerly IPython) notebooks are a great way to code iteratively. You can test out ideas block by block, shift stuff around, and output and document all in the same tool. You can also easily export to HTML with one click (hence this post). Having installed Anaconda, start a notebook by running the following:
```
jupyter notebook
```
This will start the notebook server on your local machine and open your browser. By default the notebooks are served at *localhost:8888*. To access the server across a local network, use the *--ip* flag with your IP address (e.g. --ip=192.168.1.2) and then point your browser at *[your-ip]:8888* (use --port to change the port).

This notebook also makes use of a library I hacked together for accessing the EPO OPS API. You can clone this library from: https://github.com/benhoyle/EPOops.

We will want to save the retrieved data, either in a database or as a series of flat files. We also need to investigate how to build a scikit-learn corpus from it.
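As a rough illustration of the flat-file option, the sketch below writes each publication's paragraphs to its own JSON file and reads them back. The function names `save_document` and `load_document` and the one-file-per-publication layout are assumptions made for this example, not part of either OPS client.

```python
import json
import tempfile
from pathlib import Path

def save_document(directory, number, paragraphs):
    """Write a list of paragraph strings to <directory>/<number>.json."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{number}.json"
    path.write_text(json.dumps(paragraphs), encoding="utf-8")
    return path

def load_document(directory, number):
    """Read the paragraphs for one publication back as a list of strings."""
    path = Path(directory) / f"{number}.json"
    return json.loads(path.read_text(encoding="utf-8"))

# Round trip against a temporary directory (hypothetical sample paragraphs)
with tempfile.TemporaryDirectory() as tmp:
    save_document(tmp, "WO2006084269", ["Related Applications", "Technical Field"])
    restored = load_document(tmp, "WO2006084269")
print(restored)
```

Caching the text this way would also avoid repeated calls to the API when re-running the notebook.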

#### Imports

Because we need to retrieve up to 442 full text descriptions, we will adapt George Song's OPS Client, which includes support for throttling. The code can be cloned from here: https://github.com/55minutes/python-epo-ops-client (my fork is here: https://github.com/benhoyle/python-epo-ops-client/).

In [1]:
# Add the path for my fork of the client to the system path
import os
import sys
sys.path.insert(0, '/root/projects/caselawml/python-epo-ops-client')

In [2]:
import epo_ops

In [3]:
# Load Key and Secret from config file called "config.ini" in the same directory as this notebook
import configparser
parser = configparser.ConfigParser()
parser.read(os.path.join(os.getcwd(), 'config.ini'))
consumer_key = parser.get('Login Parameters', 'C_KEY')
consumer_secret = parser.get('Login Parameters', 'C_SECRET')
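For reference, the cell above expects a `config.ini` with a `Login Parameters` section holding the two keys. A minimal sketch of that layout, parsed from a string rather than a file, might look like this (the values are placeholders, and `SAMPLE_CONFIG` and `sample_parser` are names invented for the example):

```python
import configparser

# Placeholder values: substitute the key and secret issued for your OPS account
SAMPLE_CONFIG = """
[Login Parameters]
C_KEY = your-consumer-key
C_SECRET = your-consumer-secret
"""

# Parse the sample text exactly as the notebook parses the real file
sample_parser = configparser.ConfigParser()
sample_parser.read_string(SAMPLE_CONFIG)
print(sample_parser.get('Login Parameters', 'C_KEY'))
```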

In [4]:
# Setup a new registered EPO OPS client that returns JSON
registered_client = epo_ops.RegisteredClient(
    key=consumer_key, 
    secret=consumer_secret, 
    accept_type='json')

Test that the client code is working...

In [5]:
registered_client.published_description('GB2415387')[0:3]

['241 5387 Cosmetic Uses Of Electromagnetic Radiation The present invention',
 'relates to the cosmetic use of electromagnetic radiation for the reduction or alleviation or removal or diminishing of wrinkles or fine lines, especially but not exclusively facial and neck wrinkles and other signs of aging. The present invention also provides the use of electromagnetic radiation for generally rejuvenating skin, retarding signs of aging and improving skin elasticity, tone and appearance. The invention also provides for a method of treating skin so as to reduce or alleviate or retard or reverse visible signs of aging and for beautifying skin and an apparatus for effecting such cosmetic treatments.',
 'BACKGROUND']

In [6]:
claims = registered_client.published_claims('GB2415387')
print(claims[0:10])

['Claims 1. A method of cosmetically treating a superficial area of', 'mammalian skin comprising irradiating the skin with a source of divergent electromagnetic radiation of between 900nm to 1500nm.', '2. A method according to claim 1 where the cosmetic treatment is reducing or alleviating or removing or diminishing wrinkles or fine lines, rejuvenating skin, retarding or reversing visible signs of aging, improving skin elasticity, tone, texture and appearance and beautifying the skin.', '3. A method according to either claim 1 or 2 wherein the skin includes the outermost epidermis, basal layer and dermis of face, breast, arm, buttock, thigh, stomach or neck.', '4. A method according to any preceding claim wherein the divergent light is between 10  to 50 .', '5. A method according to any preceding claim wherein the electromagnetic radiation has a bandwidth of about 10 to 120nm.', '6. A method according to any preceding claim wherein the wavelength of the electromagnetic radiation is cen

Now we use some tools from the NLTK to help us process the text.

In [7]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk import stem
import nltk

We need to download the Punkt tokeniser model and a set of stopwords to make this work. To do this we run nltk.download() below and download the Punkt (from models) and English stopwords (from corpora) via the pop-up interface.

In [8]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

As a test case we'll use an Apple case that was the recent subject of a Board of Appeal decision: WO2006084269.

In [9]:
desc_text = registered_client.published_description('WO2006084269')

In [10]:
desc_text[0:5]

['SYSTEM FOR BROWSING THROUGH A MUSIC CATALOG USING CORRELATION METRICS OF A KNOWLEDGE BASE OF MEDIASETS',
 'Related Applications',
 '[0001] This application claims priority from U.S. Provisional Application No. 60/649,945 filed February 4, 2005, incorporated herein by this reference in its entirety as though fully set forth.',
 'Technical Field',
 '[0002] This invention relates generally to systems for assisting users to navigate media item catalogs with the ultimate goal of building mediasets and/or discover media items. More specifically, the present invention pertains to computer software methods and products to enable users to interactively browse through an electronic catalog by leveraging media item association metrics.']

Now we join the paragraphs as one long string separated by new lines.

In [11]:
text_string = "\n".join(desc_text)

In [12]:
words = word_tokenize(text_string)

In [13]:
print(len(words))
words[15:25]

3716


['MEDIASETS',
 'Related',
 'Applications',
 '[',
 '0001',
 ']',
 'This',
 'application',
 'claims',
 'priority']

It would be good to strip out punctuation and remove common stop words to reduce this length.

In [14]:
# Remove punctuation
words_no_punct = [w.lower() for w in words if w.isalpha()]
print("Without punctuation: " + str(len(words_no_punct)))

stopwords = nltk.corpus.stopwords.words('english')
words_no_stop = [w for w in words_no_punct if w not in stopwords]
print("Without punctuation and stop words: " + str(len(words_no_stop)))

Without punctuation: 3019
Without punctuation and stop words: 1690


That is good: we have cut the 3716 tokens down to 1690, less than half the original length. Looking at some of the words, I suspect that a set of patent-specific stopwords would reduce this list even further.

In [15]:
words_no_stop[0:10]

['system',
 'browsing',
 'music',
 'catalog',
 'using',
 'correlation',
 'metrics',
 'knowledge',
 'base',
 'mediasets']

Now we apply a stemming algorithm such that only the stem of a word (i.e. without endings) is retained. Stemming has shown good results with Support Vector Machines. The Porter stemmer was written by [Martin Porter](https://en.wikipedia.org/wiki/Martin_Porter), a Cambridge computer scientist, in 1980. It is the de facto standard.

In [19]:
porter = stem.porter.PorterStemmer()
words_stemmed = [porter.stem(w) for w in words_no_stop]
words_stemmed[0:10]

['system',
 'brows',
 'music',
 'catalog',
 'use',
 'correl',
 'metric',
 'knowledg',
 'base',
 'mediaset']

I have a set of Patent Stopwords [I created](https://ipchimp.co.uk/2014/01/29/patents-another-language/). This resulted from an analysis of terms from all US patent publications in 2001. 

If you create a text file with each word on a new line and save it in ~/nltk_data/corpora/stopwords/ with the filename 'patent' (no extension), then you can load the stopwords in the same manner as the English stopwords.
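Creating that file can itself be scripted. The following is a sketch only: `write_stopword_file` is a helper name made up for this example, and it is run here against a temporary directory so that nothing is written to your real NLTK data folder.

```python
import tempfile
from pathlib import Path

def write_stopword_file(words, directory, name="patent"):
    """Write one stopword per line to <directory>/<name> (no extension)."""
    directory = Path(directory).expanduser()
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path

# Demonstrated against a temporary directory; for real use pass
# "~/nltk_data/corpora/stopwords" as the directory instead.
with tempfile.TemporaryDirectory() as tmp:
    path = write_stopword_file(["said", "wherein", "compris"], tmp)
    written = path.read_text(encoding="utf-8").splitlines()
print(written)
```

Note that the words are written in their stemmed form, because the list is applied after stemming.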

In [28]:
patent_stopwords = nltk.corpus.stopwords.words('patent')

In [27]:
print(patent_stopwords, end='; ')

['said', 'use', 'first', 'one', 'form', 'invent', 'thi', 'may', 'second', 'data', 'claim', 'wherein', 'accord', 'control', 'signal', 'present', 'devic', 'provid', 'portion', 'includ', 'embodi', 'compris', 'method', 'layer', 'surfac', 'system', 'process', 'exampl', 'step', 'ha', 'shown', 'connect', 'posit', 'prefer', 'oper', 'gener', 'mean', 'inform', 'circuit', 'imag', 'unit', 'time', 'materi', 'also', 'end', 'wa', 'member', 'line', 'film', 'side', 'least', 'select', 'apparatu', 'output', 'element', 'refer', 'receiv', 'describ', 'direct', 'base', 'light', 'section', 'set', 'show', 'substrat', 'contain', 'display', 'view', 'valu', 'part', 'cell', 'two', 'plural', 'group', 'structur', 'number', 'optic', 'electrod', 'input', 'result', 'abov', 'respect', 'region', 'memori', 'plate', 'case', 'differ', 'user', 'detect', 'illustr', 'determin', 'support', 'within', 'addit', 'record', 'temperatur', 'open', 'power', 'termin', 'area']; 

In [30]:
print("Before filtering patent stopwords, number of words = " + str(len(words_stemmed)))
# Filter our list of stemmed terms to remove the patent stopwords
no_patent_sws = [word for word in words_stemmed if word not in patent_stopwords]
print("After filtering patent stopwords, number of words = " + str(len(no_patent_sws)))
words = no_patent_sws

Before filtering patent stopwords, number of words = 1690
After filtering patent stopwords, number of words = 1234


Now we want to count the terms. We can use [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) from collections to do this.

In [31]:
from collections import Counter
counts = Counter(words)
print(counts)

Counter({'item': 102, 'media': 69, 'mediaset': 49, 'metric': 41, 'knowledg': 25, 'list': 23, 'catalog': 18, 'fig': 15, 'navig': 15, 'pair': 14, 'j': 12, 'music': 12, 'initi': 11, 'consid': 11, 'associ': 11, 'new': 11, 'song': 10, 'collect': 9, 'togeth': 9, 'k': 9, 'book': 8, 'similar': 8, 'interact': 8, 'defin': 8, 'etc': 8, 'approach': 7, 'indic': 7, 'mani': 7, 'express': 6, 'descript': 6, 'way': 6, 'matrix': 6, 'figur': 6, 'appear': 6, 'anoth': 6, 'might': 5, 'given': 5, 'video': 5, 'comput': 5, 'weight': 5, 'build': 5, 'everi': 5, 'softwar': 5, 'order': 5, 'applic': 5, 'follow': 5, 'detail': 5, 'brows': 4, 'limit': 4, 'specif': 4, 'effect': 4, 'certain': 4, 'recommend': 4, 'skill': 4, 'movi': 4, 'normal': 4, 'represent': 4, 'advantag': 4, 'could': 4, 'diagram': 4, 'repres': 4, 'edg': 4, 'interest': 3, 'add': 3, 'herein': 3, 'analyz': 3, 'help': 3, 'simplifi': 3, 'variou': 3, 'discov': 3, 'problem': 3, 'draw': 3, 'guid': 3, 'transit': 3, 'call': 3, 'correspond': 3, 'implement': 3, 'l

In [39]:
print("We now have a vector of length: " + str(len(counts.keys())))

We now have a vector of length: 421
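A Counter only holds the terms that occur in one document, so two documents will generally give Counters with different keys. To compare documents we need vectors of equal length in a fixed term order. One way to get there, sketched below with made-up counts and the hypothetical helper names `build_vocabulary` and `to_vector`, is to build a shared vocabulary and read each Counter out against it:

```python
from collections import Counter

def build_vocabulary(counters):
    """Return a sorted list of every term seen in any of the Counters."""
    return sorted(set().union(*counters))

def to_vector(counter, vocabulary):
    """Return the counts in vocabulary order; absent terms count as zero."""
    return [counter[term] for term in vocabulary]

# Made-up counts for two short documents
doc_a = Counter({"item": 3, "media": 2})
doc_b = Counter({"media": 1, "skin": 4})

vocabulary = build_vocabulary([doc_a, doc_b])
vectors = [to_vector(c, vocabulary) for c in (doc_a, doc_b)]
print(vocabulary)
print(vectors)
```

This relies on a Counter returning 0 for a missing key rather than raising an error.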


Putting it all together into a function...

In [42]:
# Load English stopwords
ENG_STOPWORDS = nltk.corpus.stopwords.words('english')
# Load patent stopwords
PAT_STOPWORDS = nltk.corpus.stopwords.words('patent')

def word_vector(text):
    """Take a long string of patent text, process and return a Counter object."""
    # Tokenise text
    words = word_tokenize(text)
    # Remove punctuation
    words_no_punct = [w.lower() for w in words if w.isalpha()]
    # Remove English stopwords (our Patent stopwords are stemmed so we do them later)
    words_no_stop = [w for w in words_no_punct if w not in ENG_STOPWORDS]
    # Stem
    porter = stem.porter.PorterStemmer()
    words_stemmed = [porter.stem(w) for w in words_no_stop]
    # Remove patent stopwords
    words_no_pat_stop = [w for w in words_stemmed if w not in PAT_STOPWORDS]
    # Return counter object
    return Counter(words_no_pat_stop)

In [43]:
count_words = word_vector(text_string)

In [47]:
print(count_words)

Counter({'item': 102, 'media': 69, 'mediaset': 49, 'metric': 41, 'knowledg': 25, 'list': 23, 'catalog': 18, 'fig': 15, 'navig': 15, 'pair': 14, 'j': 12, 'music': 12, 'initi': 11, 'consid': 11, 'associ': 11, 'new': 11, 'song': 10, 'collect': 9, 'togeth': 9, 'k': 9, 'book': 8, 'similar': 8, 'interact': 8, 'defin': 8, 'etc': 8, 'approach': 7, 'indic': 7, 'mani': 7, 'express': 6, 'descript': 6, 'way': 6, 'matrix': 6, 'figur': 6, 'appear': 6, 'anoth': 6, 'might': 5, 'given': 5, 'video': 5, 'comput': 5, 'weight': 5, 'build': 5, 'everi': 5, 'softwar': 5, 'order': 5, 'applic': 5, 'follow': 5, 'detail': 5, 'brows': 4, 'limit': 4, 'specif': 4, 'effect': 4, 'certain': 4, 'recommend': 4, 'skill': 4, 'movi': 4, 'normal': 4, 'represent': 4, 'advantag': 4, 'could': 4, 'diagram': 4, 'repres': 4, 'edg': 4, 'interest': 3, 'add': 3, 'herein': 3, 'analyz': 3, 'help': 3, 'simplifi': 3, 'variou': 3, 'discov': 3, 'problem': 3, 'draw': 3, 'guid': 3, 'transit': 3, 'call': 3, 'correspond': 3, 'implement': 3, 'l

Now we need to apply a TF-IDF weighting to the resulting frequency distributions across a set of documents.
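As a preview, here is a minimal sketch of TF-IDF over a list of Counters, using the plain textbook form: term frequency (count divided by document length) multiplied by log(N / document frequency). The counts are made up and the function name `tf_idf` is an assumption for illustration. Libraries such as scikit-learn use smoothed variants, so their numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(counters):
    """Return one {term: weight} dict per document Counter."""
    n_docs = len(counters)
    # Document frequency: the number of documents containing each term
    doc_freq = Counter()
    for counter in counters:
        doc_freq.update(counter.keys())
    weighted = []
    for counter in counters:
        total = sum(counter.values())
        weighted.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counter.items()
        })
    return weighted

# Made-up counts for two short documents
docs = [
    Counter({"item": 2, "media": 2}),
    Counter({"media": 1, "skin": 3}),
]
weights = tf_idf(docs)
print(weights)
```

A term that appears in every document, such as "media" here, gets a weight of zero, which is the behaviour we want for words that do not help to tell documents apart.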