# Exploration of the EP full-text data

**Table of contents**

* [Introduction](#Introduction)
* [1. Load data](#1.-Load-data)
* [2. Different types of text data, stored in different formats](#2.-Different-types-of-text-data,-stored-in-different-formats)
* [3. Language of the claims](#3.-Language-of-the-claims)
* [4. Parsing the xml content](#4.-Parsing-the-xml-content)
    * [4.1. Test example with xml from the documentation](#4.1.-Test-example-with-xml-from-the-documentation)
    * [4.2. Applying to the patent text data](#4.2.-Applying-to-the-patent-text-data)

## Introduction

We explore the textual content of the EP-full-text data and define some methods to access the data. In particular, we show how to parse the `xml` content of the `CLAIM` and `DESCR` fields.  

In [2]:
import pandas as pd
# library to parse the xml
# library doc: https://docs.python.org/3/library/xml.etree.elementtree.html
import xml.etree.ElementTree as ET 

## 1. Load data

In [28]:
data_sample = pd.read_csv('../data/raw/wind_tech_1990_2020_with_publications_full_text.csv')

In [33]:
# we drop the first column when the data come from '../data/raw/wind_tech_1990_2020_with_publications_full_text.csv'
cols = list(data_sample)
data_sample = data_sample[cols[1:]]
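
As an alternative to dropping the first column after loading, `pd.read_csv` can use it as the index directly via `index_col=0`. A minimal sketch on a small in-memory CSV; the toy rows and column names below are made up for illustration and only stand in for the real file:

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (assumption: the first, unnamed
# column is a saved row index, as is typical after DataFrame.to_csv).
toy_csv = io.StringIO(
    ",authority,number,kind\n"
    "0,EP,92,A1\n"
    "1,EP,1611,A3\n"
)

# index_col=0 uses the first column as the index instead of loading it as data.
toy = pd.read_csv(toy_csv, index_col=0)
print(list(toy.columns))
```

Whether this fits depends on the first column really being a saved index; if it carries data, dropping it explicitly as above is the safer choice.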

We rename the columns for clarity:

In [34]:
data_sample.columns = ['publication_authority', # will always have the value "EP"
                'publication_number', # a seven-digit number
                'publication_kind', # see https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/definitions.html for help.
                'publication_date', # in format YYYY-MM-DD
                'language_text_component', # de, en, fr; xx means unknown
                'text_type', # TITLE, ABSTR, DESCR, CLAIM, AMEND, ACSTM, SRPRT, PDFEP
                'text' # it contains, where appropriate, XML tags for better structure. You will find the DTD applicable to all parts of the publication at: http://docs.epoline.org/ebd/doc/ep-patent-document-v1-5.dtd
               ]
data_sample.head()

Unnamed: 0,publication_authority,publication_number,publication_kind,publication_date,language_text_component,text_type,text
0,EP,92,A1,1978-12-20,de,TITLE,Verfahren zur Herstellung von Ascorbinsäure un...
1,EP,92,A1,1978-12-20,en,TITLE,Method of preparing ascorbic acid and intermed...
2,EP,92,A1,1978-12-20,fr,TITLE,Procédé de préparation de l'acide ascorbique e...
3,EP,92,A1,1978-12-20,en,ABSTR,"<p id=""pa01"" num=""0001"">The invention relates ..."
4,EP,92,A1,1978-12-20,en,DESCR,"<p id=""p0001"" num=""0001"">This invention relate..."


## 2. Different types of text data, stored in different formats

In [35]:
data_sample.text_type.value_counts()

TITLE    114744
CLAIM     49796
PDFEP     28035
DESCR     24625
ABSTR     15324
SRPRT      5246
AMEND       297
ACSTM         4
Name: text_type, dtype: int64

We display how each type of data is encoded:

* TITLE: plain text
* CLAIM: xml
* PDFEP: URL (hyperlink)
* DESCR: xml
* ABSTR: xml
* AMEND: xml
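
The three encodings listed above can be told apart with a rough heuristic. The sketch below is only an illustration: `guess_encoding` is a hypothetical helper, not part of the dataset tooling, and real fields with undefined entities or malformed markup would fall through to "plain text".

```python
import xml.etree.ElementTree as ET

def guess_encoding(text: str) -> str:
    """Hypothetical helper: classify a `text` value as 'url', 'xml' or 'plain text'."""
    stripped = text.strip()
    if stripped.startswith(("http://", "https://")):
        return "url"
    if "<" in stripped:
        try:
            # Wrap in a dummy root because a field can hold several top-level elements.
            ET.fromstring("<root>" + stripped + "</root>")
            return "xml"
        except ET.ParseError:
            pass
    return "plain text"

print(guess_encoding("Monitoring oscillator with threshold"))
print(guess_encoding('<p id="pa01" num="0001">The invention relates to ...</p>'))
print(guess_encoding("https://data.epo.org/publication-server/pdf-document?cc=EP&pn=0001611&ki=A3&pd=1979-09-05"))
```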

In [36]:
# iterate over the text types, from the most frequent to the least frequent
for text_type in data_sample.text_type.value_counts().index:

    print('__________________')
    print(text_type)

    condition1 = data_sample.text_type == text_type
    condition2 = data_sample.language_text_component == 'en'
    df = data_sample[condition1 & condition2]['text'].to_frame()
    # NOTE: iloc[112] raises an IndexError for any text type with fewer than
    # 113 English rows (ACSTM, for instance, has only 4 rows in total)
    example_xml = df.iloc[112]['text']
    print(example_xml)

__________________
TITLE
Monitoring oscillator with threshold
__________________
CLAIM
<claim id="c-en-01-0001" num=""><claim-text>1. A compressor unit enclosed by a rectangular housing comprising a drive motor (2), a screw compressor (5) and a ventilator wheel (3), wherein the ventilator wheel (3) is disposed between the motor (2) and the screw compresor (5) characterized in that there are provided a cooling air inlet at a face side of the rectangular housing (1) and a passage for the air intaken therethrough extending upwards over the ventilator wheel (3) and passing over to a horizontally extending section (13) followed by a section (14) extending upwards again and in an opposite direction to said horizontal section, which is disposed outside of the housing (1) at its upper side, wherein the outlet opening (15) of this exhaust air passage is directed to that side opposite to the air inlet (6).</claim-text></claim><claim id="c-en-01-0002" num=""><claim-text>2. A compressor unit accor

IndexError: single positional indexer is out-of-bounds
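
The `IndexError` comes from asking for row 112 of a selection that has fewer rows. One way to avoid it, sketched here on a made-up toy frame, is to fall back to the last available row and to skip empty selections. `pick_example` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Toy frame standing in for data_sample (rows are made up for illustration).
toy = pd.DataFrame({
    "text_type": ["TITLE", "TITLE", "CLAIM", "PDFEP"],
    "language_text_component": ["en", "en", "en", "xx"],
    "text": ["Title A", "Title B", "<claim>...</claim>", "https://example.org/doc.pdf"],
})

def pick_example(df, text_type, language="en", position=112):
    """Hypothetical helper: return the text at `position`, the last available
    row if there are fewer rows, or None if the selection is empty."""
    mask = (df.text_type == text_type) & (df.language_text_component == language)
    texts = df.loc[mask, "text"]
    if texts.empty:
        return None
    return texts.iloc[min(position, len(texts) - 1)]

for text_type in toy.text_type.value_counts().index:
    print("__________________")
    print(text_type)
    print(pick_example(toy, text_type))
```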

## 3. Language of the claims

In [37]:
data_sample.language_text_component.value_counts()

en    115569
de     68059
fr     54119
xx       324
Name: language_text_component, dtype: int64
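
To see which text types the `xx` rows belong to, a cross-tabulation of text type against language may help. This is a sketch on a made-up toy frame; the counts it prints say nothing about the real data.

```python
import pandas as pd

# Toy frame standing in for data_sample (values are made up for illustration).
toy = pd.DataFrame({
    "text_type": ["TITLE", "TITLE", "CLAIM", "PDFEP", "PDFEP"],
    "language_text_component": ["en", "de", "en", "en", "xx"],
})

# Rows: text types, columns: languages, cells: number of rows.
counts = pd.crosstab(toy.text_type, toy.language_text_component)
print(counts)
```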

In [41]:
# 'xx' marks rows whose language is unknown; we look at one of them
condition2 = data_sample.language_text_component == 'xx'
df = data_sample[condition2]['text'].to_frame()
df.iloc[1]['text']

'https://data.epo.org/publication-server/pdf-document?cc=EP&pn=0001611&ki=A3&pd=1979-09-05'
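
The query string of this link can be split into its parameters with the standard library. The parameters appear to mirror the publication authority, number, kind and date columns; that mapping is an inference from this single example, not something checked against the EPO documentation.

```python
from urllib.parse import urlparse, parse_qs

url = "https://data.epo.org/publication-server/pdf-document?cc=EP&pn=0001611&ki=A3&pd=1979-09-05"

# parse_qs maps each query key to a list of values; we keep the first value of each.
params = {key: values[0] for key, values in parse_qs(urlparse(url).query).items()}
print(params)
```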

## 4. Parsing the xml content

We use a claim in English as an example.

In [42]:
data_sample

Unnamed: 0,publication_authority,publication_number,publication_kind,publication_date,language_text_component,text_type,text
0,EP,92,A1,1978-12-20,de,TITLE,Verfahren zur Herstellung von Ascorbinsäure un...
1,EP,92,A1,1978-12-20,en,TITLE,Method of preparing ascorbic acid and intermed...
2,EP,92,A1,1978-12-20,fr,TITLE,Procédé de préparation de l'acide ascorbique e...
3,EP,92,A1,1978-12-20,en,ABSTR,"<p id=""pa01"" num=""0001"">The invention relates ..."
4,EP,92,A1,1978-12-20,en,DESCR,"<p id=""p0001"" num=""0001"">This invention relate..."
...,...,...,...,...,...,...,...
238066,EP,3597095,A1,2020-01-22,en,ABSTR,"<p id=""pa01"" num=""0001"">A dishwasher (10) incl..."
238067,EP,3597095,A1,2020-01-22,en,DESCR,"<!-- EPO <DP n=""1""> --><heading id=""h0001""><b>..."
238068,EP,3597095,A1,2020-01-22,en,CLAIM,"<!-- EPO <DP n=""26""> --><claim id=""c-en-0001"" ..."
238069,EP,3597095,A1,2020-01-22,en,SRPRT,<srep-fields-searched><minimum-documentation><...


In [49]:
condition1 = data_sample.text_type == 'CLAIM'
condition2 = data_sample.language_text_component == 'en'
condition3 = data_sample.publication_number == 1173915
# we combine all three conditions on the full frame, without overwriting data_sample
df = data_sample[condition1 & condition2 & condition3]['text'].to_frame()

example_xml = df.iloc[0]['text']
example_xml

Unnamed: 0,text
25879,"<claim id=""c-en-01-0001"" num=""0001""><claim-tex..."


'<claim id="c-en-01-0001" num="0001"><claim-text>Generator, preferably for a windmill and especially of the kind driven directly by the rotor of the windmill without any gearbox (5) installed between the rotor and the generator, wherein at least the stator of the generator (12) is made with at least two modules (20) which are fully enclosed and sealed, and that these at least two modules (20) may be mounted and dismantled independently of each other one or more at a time without dismantling the entire winding (25), <b>characterised in that</b> each single stator module (20) is individually contained in an enclosure (23) with a degree of sealing substantially corresponding to the degree of sealing which is desired in the finished generator (12), and that a given number of juxtaposed enclosures (23) abutting on each other form a closed ring of stator modules (20).</claim-text></claim><claim id="c-en-01-0002" num="0002"><claim-text>Generator according to claim 1, <b>characterised in that<

### 4.1. Test example with xml from the documentation

In [10]:
xml = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
"""

In [11]:
root = ET.fromstring(xml)

In [12]:
root.tag

'data'

In [13]:
for child in root:
    print(child.tag, child.attrib)

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}


In [14]:
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    print(name, rank)

Liechtenstein 1
Singapore 4
Panama 68
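
`findall` also accepts a path, which is the form used on the patent data in the next section. A short sketch on a trimmed copy of the documentation example:

```python
import xml.etree.ElementTree as ET

xml = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
"""
root = ET.fromstring(xml)

# "./country/neighbor" selects every <neighbor> that is a child of a <country>
# located directly under the root.
neighbors = [n.get("name") for n in root.findall("./country/neighbor")]
print(neighbors)
```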


### 4.2. Applying to the patent text data

In [55]:
# We remove the <b>...</b> bold markers: otherwise the .text of a claim would stop at the first <b> tag
example_xml_modified = example_xml.replace('<b>', '')
example_xml_modified = example_xml_modified.replace('</b>', '')

# We wrap the XML in a root node <data>: the field holds several top-level <claim> elements,
# and the parser requires a single root
example_xml_modified = "<data>" + example_xml_modified + '</data>'
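
An alternative to stripping the bold markers is to leave the XML untouched and collect the text with `itertext()`. The sketch below uses a shortened, made-up claim in the same shape as the example and shows the difference with `.text`.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up claim in the same shape as the example above.
snippet = (
    '<claim id="c-en-01-0002" num="0002"><claim-text>Generator according to claim 1, '
    '<b>characterised in that</b> each module is sealed.</claim-text></claim>'
)
root = ET.fromstring("<data>" + snippet + "</data>")
claim_text = root.find("./claim/claim-text")

# .text stops at the first child element (<b>), so the rest of the claim is lost.
print(repr(claim_text.text))
# itertext() walks the element and all its descendants, yielding every text piece.
print("".join(claim_text.itertext()))
```

This keeps the original string intact and also covers other inline tags, at the cost of losing the information about which words were emphasised.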

In [56]:
# Here is the modified xml
example_xml_modified

'<data><claim id="c-en-01-0001" num="0001"><claim-text>Generator, preferably for a windmill and especially of the kind driven directly by the rotor of the windmill without any gearbox (5) installed between the rotor and the generator, wherein at least the stator of the generator (12) is made with at least two modules (20) which are fully enclosed and sealed, and that these at least two modules (20) may be mounted and dismantled independently of each other one or more at a time without dismantling the entire winding (25), characterised in that each single stator module (20) is individually contained in an enclosure (23) with a degree of sealing substantially corresponding to the degree of sealing which is desired in the finished generator (12), and that a given number of juxtaposed enclosures (23) abutting on each other form a closed ring of stator modules (20).</claim-text></claim><claim id="c-en-01-0002" num="0002"><claim-text>Generator according to claim 1, characterised in that each

In [17]:
# we parse it with the ElementTree XML API
root = ET.fromstring(example_xml_modified)

In [18]:
# as expected, the root tag is the one we have just added
root.tag

'data'

In [19]:
# And this is how we access the text of the claims
claims = root.findall("./claim/claim-text")
claims_text = [claim.text for claim in claims]
claims_text

['1. A method of cleaning teeth by applying thereto a cation of one or more elements selected from yttrium, scandium, lanthanum, cerium, praseodymium, neodymium, promethium, samarium, europium, gadolinium, terbium, dysprosium, holmium, erbium, thulium, ytterbium and lutetium.',
 '2. A method as claimed in claim 1 wherein the cation is the lanthanum cation.',
 '3. A composition for use in the method of claim 1 or claim 2 which is in a form for use in a non-sequential manner.',
 '4. A composition as claimed in claim 3 which is in the form of a single pack for use on its own.',
 '5. A composition as claimed in claim 3 or 4 which is a mouthwash, toothpaste, toothpowder or dental gel.']
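
The steps of this section can be gathered into one function. This is a minimal sketch, not a definitive implementation: `parse_claims` is a hypothetical helper, the sample input is made up, and it assumes the field is well-formed XML once wrapped in a root node.

```python
import xml.etree.ElementTree as ET

def parse_claims(claims_xml: str) -> list:
    """Hypothetical helper: return one dict per <claim> with its id, num and full text."""
    # Wrap in a dummy root because the field holds several top-level <claim> elements.
    root = ET.fromstring("<data>" + claims_xml + "</data>")
    records = []
    for claim in root.findall("./claim"):
        # itertext() keeps the text inside inline tags such as <b>.
        text = "".join(claim.itertext()).strip()
        records.append({"id": claim.get("id"), "num": claim.get("num"), "text": text})
    return records

# Made-up input in the same shape as the CLAIM field, including an EPO page comment.
sample = (
    '<!-- EPO <DP n="26"> -->'
    '<claim id="c-en-0001" num="0001"><claim-text>A generator with '
    '<b>sealed</b> modules.</claim-text></claim>'
    '<claim id="c-en-0002" num="0002"><claim-text>Generator according to '
    'claim 1.</claim-text></claim>'
)
for record in parse_claims(sample):
    print(record)
```

The resulting list of dicts can be passed to `pd.DataFrame` to get one row per claim.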