# Exploration of the EP full-text data

**Table of contents**

* [Introduction](#Introduction)
* [1. Load data](#1.-Load-data)
* [2. Different types of text data, stored in different formats](#2.-Different-types-of-text-data,-stored-in-different-formats)
* [3. Language of the claims](#3.-Language-of-the-claims)
* [4. Parsing the xml content](#4.-Parsing-the-xml-content)
    * [4.1. Test example with xml from the documentation](#4.1.-Test-example-with-xml-from-the-documentation)
    * [4.2. Applying to the patent text data](#4.2.-Applying-to-the-patent-text-data)

## Introduction

We explore the textual content of the EP-full-text data and define some methods to access the data. In particular, we show how to parse the `xml` content of the `CLAIM` and `DESCR` fields.  

In [131]:
import pandas as pd
# library to parse the xml
# library doc: https://docs.python.org/3/library/xml.etree.elementtree.html
import xml.etree.ElementTree as ET 

## 1. Load data

In [132]:
data_sample = pd.read_csv('../data/ep_full_text_database/2020_edition/EP0000000.txt',
                          sep = '\t',  header = None)

We rename the columns for clarity:

In [133]:
data_sample.columns = ['publication_authority', # will always have the value "EP"
                'publication_number', # a seven-digit number
                'publication_kind', # see https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/definitions.html for help.
                'publication_date', # in format YYYY-MM-DD
                'language_text_component', # de, en, fr; xx means unknown
                'text_type', # TITLE, ABSTR, DESCR, CLAIM, AMEND, ACSTM, SREPT, PDFEP
                'text' # it contains, where appropriate, XML tags for better structure. You will find the DTD applicable to all parts of the publication at: http://docs.epoline.org/ebd/doc/ep-patent-document-v1-5.dtd
               ]
data_sample.head()

| | publication_authority | publication_number | publication_kind | publication_date | language_text_component | text_type | text |
|---|---|---|---|---|---|---|---|
| 0 | EP | 2 | A1 | 1978-12-20 | de | TITLE | Tetrahydrofuran-Derivate, Verfahren zu ihrer H... |
| 1 | EP | 2 | A1 | 1978-12-20 | en | TITLE | Tetrahydrofurane derivatives, processes for th... |
| 2 | EP | 2 | A1 | 1978-12-20 | fr | TITLE | Dérivés du tétrahydrofuranne, leurs procédés d... |
| 3 | EP | 2 | A1 | 1978-12-20 | de | ABSTR | `<p id="pa01" num="0001">Die vorliegende Erfind...` |
| 4 | EP | 2 | A1 | 1978-12-20 | de | DESCR | `<p id="p0001" num="0001">Die vorliegende Erfin...` |
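
Note that `publication_number` is displayed as `2` rather than as a seven-digit number, because pandas infers an integer type. The following is a minimal sketch of a load that keeps the number as text and parses the date, assuming the raw file stores the number zero-padded. The two in-memory rows and the names `COLUMNS`, `sample_file` and `toy_sample` are hypothetical stand-ins, not part of the dataset.

```python
import io
import pandas as pd

# Column names as used in this notebook
COLUMNS = ['publication_authority', 'publication_number', 'publication_kind',
           'publication_date', 'language_text_component', 'text_type', 'text']

# Tiny in-memory stand-in for the real tab-separated file (hypothetical rows)
sample_file = io.StringIO(
    "EP\t0000002\tA1\t1978-12-20\ten\tTITLE\tTetrahydrofurane derivatives\n"
    "EP\t0000002\tA1\t1978-12-20\tde\tTITLE\tTetrahydrofuran-Derivate\n"
)

toy_sample = pd.read_csv(sample_file, sep='\t', header=None, names=COLUMNS,
                         dtype={'publication_number': str},   # keep leading zeros
                         parse_dates=['publication_date'])    # YYYY-MM-DD to datetime
print(toy_sample.dtypes)
```

Passing `names=` at load time also makes the separate renaming step unnecessary.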


## 2. Different types of text data, stored in different formats

In [134]:
data_sample.text_type.value_counts()

TITLE    591983
CLAIM    318266
PDFEP    189194
DESCR    163572
ABSTR    116290
AMEND       211
Name: text_type, dtype: int64

Each type of text is encoded as follows:

* TITLE: plain text
* CLAIM: xml
* PDFEP: html link
* DESCR: xml
* ABSTR: xml
* AMEND: xml
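
One rough way to check this mapping is to measure, per text type, the share of rows whose text starts with an XML tag. This is only a sketch on a hypothetical toy frame (`toy`, `looks_like_xml` and `share_xml` are illustrative names); a leading `<` is a heuristic, not proof of well-formed XML.

```python
import pandas as pd

# Hypothetical toy rows mimicking the 'text_type' and 'text' columns
toy = pd.DataFrame({
    'text_type': ['TITLE', 'CLAIM', 'ABSTR'],
    'text': ['Method for the oxidation of quinine',
             '<claim id="c-en-0001" num=""><claim-text>1. A method.</claim-text></claim>',
             '<p id="pa01" num="0001">An abstract.</p>'],
})

# Heuristic: XML-encoded fields start with an opening tag
looks_like_xml = toy['text'].str.lstrip().str.startswith('<')

# Share of rows per text type that look like XML (1.0 = all, 0.0 = none)
share_xml = looks_like_xml.groupby(toy['text_type']).mean()
print(share_xml)
```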

In [135]:
# Print one English example for each text type, most frequent type first
for text_type in data_sample.text_type.value_counts().index:

    print('__________________')
    print(text_type)

    condition1 = data_sample.text_type == text_type
    condition2 = data_sample.language_text_component == 'en'
    df = data_sample[condition1 & condition2]['text'].to_frame()
    # arbitrary example: the English row at position 110 for this type
    example_xml = df.iloc[110]['text']
    print(example_xml)

__________________
TITLE
Method for the oxidation of quinine to quininone and quinidinone.
__________________
CLAIM
<claim id="c-en-0001" num=""><claim-text>1. A method of cleaning teeth by applying thereto a cation of one or more elements selected from yttrium, scandium, lanthanum, cerium, praseodymium, neodymium, promethium, samarium, europium, gadolinium, terbium, dysprosium, holmium, erbium, thulium, ytterbium and lutetium.</claim-text></claim><claim id="c-en-0002" num=""><claim-text>2. A method as claimed in claim 1 wherein the cation is the lanthanum cation.</claim-text></claim><claim id="c-en-0003" num=""><claim-text>3. A composition for use in the method of claim 1 or claim 2 which is in a form for use in a non-sequential manner.</claim-text></claim><claim id="c-en-0004" num=""><claim-text>4. A composition as claimed in claim 3 which is in the form of a single pack for use on its own.</claim-text></claim><claim id="c-en-0005" num=""><claim-text>5. A composition as claimed in cl

<heading id="h0006">Amended claims in accordance with Rule 86(2) EPC.</heading><claim id="ac-en-0001" num=""><claim-text>1. An improved edge dam assembly for use with an applicator for applying a coating liquid to a web of moving paper carried on a movable support, wherein the applicator is of a type having a body portion defining a chamber therein with an elongate opening thereto positionable generally below, adjacent to and transversely of the web, the chamber receiving coating liquid and directing the same generally upwardly through the opening and onto the web, said edge dam assembly comprising seal means at each end of the opening mountable in the opening generally below the web for sealing along side surfaces thereof with the body portion on opposite sides of the opening and for extending at upper surfaces thereof toward and closely adjacent to but spaced from the web for sealing therewith, each said seal means having a plurality of spaced grooves formed therein along said side a

## 3. Language of the claims

In [136]:
data_sample.language_text_component.value_counts()

en    546740
de    444896
fr    336723
xx     51157
Name: language_text_component, dtype: int64

In [137]:
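# 'xx' means the language was not recorded; the text itself is still written
# in one of the three languages (EN, FR, DE). The example below is in German.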
condition2 = data_sample.language_text_component == 'xx'
df = data_sample[condition2]['text'].to_frame()
df.iloc[789]['text']

'<p id="pa01" num="0001">Herstellung von durch Chloratome und/oder Sulfochloridgruppen substituierten Alkanen durch Umsetzung von Alkanen mit Chlor und gegebenenfalls Schwefeldioxid in einem bestimmten Durchsatzverhättnis und in einem zur Horizontalen geneigten Reaktionsraum, wobei die Ausgangsstoffe im Gleichstrom von unten her durch den ReaktIonsraum geleitet werden und der ReaktIonsraum im unteren Teil mit Licht der Wellenlängen von 500 bis 700 Nanometem und im oberen Teil mit Licht der Wellenlängen zwischen 200 und 500 Nanometem belichtet wird.</p><p id="pa02" num="0002">Die nach dem Verfahren der Erfindung herstellbaren substituierten Alkane oder Paraffine sind Schädingsbekämpfungsmittel, Weichmacher, Lösungsmittel und wertvolle Ausgangsstoffe für die Herstellung solcher Stoffe, sowie von Lederfettungsmitteln, Netzmitteln, Waschmitteln, Schmierölen, Kunstharzen, Gleitmitteln.</p>'
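
The counts above cover all text types together. To look at the claims specifically, or to see which text types carry the `xx` code, a cross-tabulation can help. The sketch below runs on a hypothetical toy frame (`toy`), not on the real data, so the numbers it prints say nothing about the dataset.

```python
import pandas as pd

# Hypothetical toy rows with the two columns of interest
toy = pd.DataFrame({
    'text_type': ['TITLE', 'TITLE', 'TITLE', 'CLAIM', 'ABSTR', 'ABSTR'],
    'language_text_component': ['de', 'en', 'fr', 'en', 'de', 'xx'],
})

# Rows: text type, columns: language code, cells: number of records
language_by_type = pd.crosstab(toy['text_type'], toy['language_text_component'])
print(language_by_type)

# Language distribution restricted to the claims
claim_languages = toy.loc[toy['text_type'] == 'CLAIM',
                          'language_text_component'].value_counts()
print(claim_languages)
```

On the real data, the same two calls would be applied to `data_sample`.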

## 4. Parsing the xml content

We use an English claim as an example.

In [138]:
condition1 = data_sample.text_type == 'CLAIM'
condition2 = data_sample.language_text_component == 'en'
df = data_sample[condition1 & condition2]['text'].to_frame()
example_xml = df.iloc[110]['text']
example_xml

'<claim id="c-en-0001" num=""><claim-text>1. A method of cleaning teeth by applying thereto a cation of one or more elements selected from yttrium, scandium, lanthanum, cerium, praseodymium, neodymium, promethium, samarium, europium, gadolinium, terbium, dysprosium, holmium, erbium, thulium, ytterbium and lutetium.</claim-text></claim><claim id="c-en-0002" num=""><claim-text>2. A method as claimed in claim 1 wherein the cation is the lanthanum cation.</claim-text></claim><claim id="c-en-0003" num=""><claim-text>3. A composition for use in the method of claim 1 or claim 2 which is in a form for use in a non-sequential manner.</claim-text></claim><claim id="c-en-0004" num=""><claim-text>4. A composition as claimed in claim 3 which is in the form of a single pack for use on its own.</claim-text></claim><claim id="c-en-0005" num=""><claim-text>5. A composition as claimed in claim 3 or 4 which is a mouthwash, toothpaste, toothpowder or dental gel.</claim-text></claim>'

### 4.1. Test example with xml from the documentation

In [139]:
xml = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
"""

In [140]:
root = ET.fromstring(xml)

In [141]:
root.tag

'data'

In [142]:
for child in root:
    print(child.tag, child.attrib)

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}


In [143]:
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    print(name, rank)

Liechtenstein 1
Singapore 4
Panama 68


### 4.2. Applying to the patent text data

In [144]:
# The field is a sequence of sibling <claim> elements without a single root,
# so we wrap it in a <data> root node to make it parseable as XML
example_xml_modified = "<data>" + example_xml + '</data>'

In [145]:
# Here is the modified xml
example_xml_modified

'<data><claim id="c-en-0001" num=""><claim-text>1. A method of cleaning teeth by applying thereto a cation of one or more elements selected from yttrium, scandium, lanthanum, cerium, praseodymium, neodymium, promethium, samarium, europium, gadolinium, terbium, dysprosium, holmium, erbium, thulium, ytterbium and lutetium.</claim-text></claim><claim id="c-en-0002" num=""><claim-text>2. A method as claimed in claim 1 wherein the cation is the lanthanum cation.</claim-text></claim><claim id="c-en-0003" num=""><claim-text>3. A composition for use in the method of claim 1 or claim 2 which is in a form for use in a non-sequential manner.</claim-text></claim><claim id="c-en-0004" num=""><claim-text>4. A composition as claimed in claim 3 which is in the form of a single pack for use on its own.</claim-text></claim><claim id="c-en-0005" num=""><claim-text>5. A composition as claimed in claim 3 or 4 which is a mouthwash, toothpaste, toothpowder or dental gel.</claim-text></claim></data>'
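
To see why the wrapper is needed, the sketch below tries to parse a shortened, hypothetical two-claim fragment with and without the `<data>` root. An XML document must have exactly one root element, so `ET.fromstring` raises `ET.ParseError` on the bare fragment.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical fragment with two sibling <claim> elements
fragment = ('<claim id="c-en-0001" num=""><claim-text>1. A method.</claim-text></claim>'
            '<claim id="c-en-0002" num=""><claim-text>2. A method as claimed in claim 1.</claim-text></claim>')

# Without a single root element the parser rejects the input
try:
    ET.fromstring(fragment)
    parsed_without_root = True
except ET.ParseError as error:
    parsed_without_root = False
    print('ParseError:', error)

# With the wrapper the same content parses and both claims are found
root = ET.fromstring('<data>' + fragment + '</data>')
number_of_claims = len(root.findall('./claim'))
print(number_of_claims)
```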

In [146]:
# We parse it with the ElementTree XML API
root = ET.fromstring(example_xml_modified)

In [147]:
# As expected, the root tag is the one we have just added
root.tag

'data'

In [148]:
# And this is how we access the text of the claims
claims = root.findall("./claim/claim-text")
claims_text = [claim.text for claim in claims]
claims_text

['1. A method of cleaning teeth by applying thereto a cation of one or more elements selected from yttrium, scandium, lanthanum, cerium, praseodymium, neodymium, promethium, samarium, europium, gadolinium, terbium, dysprosium, holmium, erbium, thulium, ytterbium and lutetium.',
 '2. A method as claimed in claim 1 wherein the cation is the lanthanum cation.',
 '3. A composition for use in the method of claim 1 or claim 2 which is in a form for use in a non-sequential manner.',
 '4. A composition as claimed in claim 3 which is in the form of a single pack for use on its own.',
 '5. A composition as claimed in claim 3 or 4 which is a mouthwash, toothpaste, toothpowder or dental gel.']
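
The steps above can be collected into a small reusable function. The sketch below is one possible version, not a definitive implementation: `extract_texts` is a hypothetical helper, and the two fragments are made-up examples. It differs from the cell above in two ways. It uses `itertext()` instead of `.text`, because `.text` stops at the first child element and would truncate claims that contain inline markup. It also returns an empty list when a fragment is not well-formed XML, which is a design choice that may hide problems worth inspecting.

```python
import xml.etree.ElementTree as ET


def extract_texts(fragment, path='./claim/claim-text'):
    """Return the full text of every element matching `path` in an XML fragment.

    Hypothetical helper: wraps the fragment in a <data> root, as above, and
    returns an empty list when the fragment is not well-formed XML.
    """
    try:
        root = ET.fromstring('<data>' + fragment + '</data>')
    except ET.ParseError:
        return []
    # itertext() also collects the text nested inside child elements
    return [''.join(element.itertext()).strip() for element in root.findall(path)]


# Made-up fragments: a claim with inline markup and a two-paragraph description
claims_fragment = ('<claim id="c-en-0001" num=""><claim-text>1. A compound of '
                   'formula <b>I</b> or a salt thereof.</claim-text></claim>')
descr_fragment = ('<p id="p0001" num="0001">First paragraph.</p>'
                  '<p id="p0002" num="0002">Second paragraph.</p>')

claims = extract_texts(claims_fragment)
paragraphs = extract_texts(descr_fragment, path='./p')
print(claims)
print(paragraphs)
```

With `path='./p'` the same function reads the paragraphs of a `DESCR` or `ABSTR` field. On the full data it could be applied row by row, for instance with `data_sample.loc[condition1, 'text'].apply(extract_texts)`.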