# The EP full-text library - Lesson 2
This notebook builds on lesson 1 and covers more advanced features of EPAB, the TIP implementation of the EP full-text library. We will introduce querying by full-text fields, divisionals and parents, and search report fields. As in the first notebook, we start by creating an instance of the EPAB client. Remember that by default the client connects to a test database, so here we explicitly request the production database.

In [3]:
# Importing the EPAB client
from epo.tipdata.epab import EPABClient

# creating an instance of the EPAB client with the production database
epab = EPABClient(env='PROD')


## Querying by full text fields
Much like the [EP full-text search](https://www.epo.org/en/searching-for-patents/technical/ep-full-text), the EPAB library gives you access to the description, claims, title and abstract of the publications in the EPAB database. This is one of its most powerful features.

### Querying by the title
You can search for applications containing one or more terms in the title. When you first search for patent publications on a given technological concept, the title is generally a good place to start: a publication that contains the search term in its title is likely to be a good match for your query. If you followed lesson 1, you can probably guess the name of the search method: `query_title`.

In [4]:
# querying by the title of the publication with the word 'covid'
q = epab.query_title('covid')
q.get_results("title", limit=5, output_type='list')


[{'title': {'de': 'UNTERDRÜCKUNG DER COVID-19-REPLIKATION DURCH COVID-19-EINTRITTSHEMMER',
   'en': 'SUPPRESSION OF COVID-19 REPLICATION BY COVID-19 ENTRY INHIBITORS',
   'fr': "SUPPRESSION DE LA RÉPLICATION DE COVID-19 PAR DES INHIBITEURS D'ENTRÉE DE COVID-19"}},
 {'title': {'de': 'VERWENDUNG VON MASITINIB ZUR BEHANDLUNG VON CORONAVIRUS-KRANKHEIT 2019 (COVID-19)',
   'en': 'USE OF MASITINIB FOR THE TREATMENT OF CORONAVIRUS DISEASE 2019 (COVID-19)',
   'fr': 'UTILISATION DE MASITINIB POUR LE TRAITEMENT DE LA MALADIE À CORONAVIRUS 2019 (COVID-19)'}},
 {'title': {'de': '2-DESOXY-D-GLUCOSE ZUR VORBEUGUNG UND BEHANDLUNG EINER VIRUSERKRANKUNG, INSBESONDERE VON COVID-19',
   'en': '2-DEOXY-D-GLUCOSE FOR PREVENTION AND TREATMENT OF A VIRAL DISEASE, IN PARTICULAR OF COVID-19',
   'fr': "2-DÉSOXY-D-GLUCOSE DESTINÉ À LA PRÉVENTION ET AU TRAITEMENT D'UNE MALADIE VIRALE, EN PARTICULIER DE LA COVID-19"}},
 {'title': {'de': 'LÖSLICHES ACE2 ZUR BEHANDLUNG VON COVID-19',
   'en': 'SOLUBLE ACE2 FOR TRE

#### Understanding fulltext languages
You can see in the result that the title field contains a dictionary with three titles. It is very important, when working with fulltext, to take into consideration that the EPO publishes the fulltext fields in the three official languages: German, English, and French.
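
Because each title is a dictionary keyed by language code, you will often want to pull out a single language. The following is a minimal sketch in plain Python, not part of the EPAB library: `pick_title` is a hypothetical helper, the first record is copied from the output above, and the second record (with only a German title) is invented to exercise the fallback.

```python
def pick_title(record, preferred=("en", "de", "fr")):
    """Return the first non-empty title among the preferred languages, or None."""
    titles = record.get("title") or {}
    for lang in preferred:
        text = titles.get(lang)
        if text:
            return text
    return None


results = [
    # Record copied from the list output above.
    {"title": {
        "de": "UNTERDRÜCKUNG DER COVID-19-REPLIKATION DURCH COVID-19-EINTRITTSHEMMER",
        "en": "SUPPRESSION OF COVID-19 REPLICATION BY COVID-19 ENTRY INHIBITORS",
        "fr": "SUPPRESSION DE LA RÉPLICATION DE COVID-19 PAR DES INHIBITEURS D'ENTRÉE DE COVID-19",
    }},
    # Hypothetical record with only a German title, to show the fallback.
    {"title": {"de": "LÖSLICHES ACE2 ZUR BEHANDLUNG VON COVID-19"}},
]

english_first = [pick_title(r) for r in results]
german_first = [pick_title(r, preferred=("de", "en", "fr")) for r in results]
print(english_first)
```

The same idea applies to any multilingual field returned with `output_type='list'`, as long as it follows the `{'de': ..., 'en': ..., 'fr': ...}` shape shown above.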

When you search for a term in a fulltext field, by default you will search in all three languages. This can be problematic. A good example of a search query that would yield different results in English and German is the word "Gift."

In English, "gift" refers to a present, something given willingly to someone without payment. In German, however, "Gift" means "poison". To avoid mixing the two, you can restrict the search by specifying one or more of the official languages with the strings `EN`, `DE` and `FR`.

In [5]:
# searching for publications with the word gift only in the English title
q = epab.query_title('gift', language="EN")
q.get_results("title", limit=5)

Unnamed: 0,title.de,title.en,title.fr
0,VERFAHREN UND SYSTEM UM ELEKTRONISCH EIN ONLIN...,METHODS AND SYSTEMS FOR ELECTRONICALLY ACCEPTI...,PROCEDES ET SYSTEMES POUR ACCEPTER ET ECHANGER...
1,SYSTEM UND VERFAHREN ZUM SCHENKEN VON VIRTUELL...,SYSTEM AND METHOD FOR FACILITATING GIFTING OF ...,SYSTÈME ET PROCÉDÉ POUR FACILITER LE DON D'ÉLÉ...
2,Geschenkschachtel,Box for gift objects,Boîte à cadeaux
3,Behälter für Geschenke,A container for gifts,Récipient pour cadeaux
4,GESCHENKKARTONBEHÄLTER,GIFT BOX CONTAINER,PAQUET-CADEAU


#### Refresher of query combination
We saw in lesson 1 that we can combine queries to build more complex ones. Let's check whether any publication contains the word "gift" in both its German and its English title.

In [6]:
# a second query: publications with the word Gift (poison) in the German title
r = epab.query_title('gift', language="DE")
print('publications with the word Gift in German', r)

# combining the two queries
s = q & r

print('Poisonous gifts found:', s)

publications with the word Gift in German 1520 publications
Poisonous gifts found: 0 publications


### Case sensitivity
You have seen that we query in lowercase while many titles are displayed in uppercase. It will come as no surprise that the search for full-text terms is case insensitive by default. This can be overridden with `ignore_case=False`. Below we run the same query with and without this parameter to compare the results.

In [7]:
# searching for publications with the word gift only in the English title, ignoring case (the default)
q = epab.query_title('gift', language="EN")
print('Publications with the word gift in any combination of lower and upper case', q)

q.get_results('title', limit=5)




Publications with the word gift in any combination of lower and upper case 173 publications


Unnamed: 0,title.de,title.en,title.fr
0,VERFAHREN UND SYSTEM UM ELEKTRONISCH EIN ONLIN...,METHODS AND SYSTEMS FOR ELECTRONICALLY ACCEPTI...,PROCEDES ET SYSTEMES POUR ACCEPTER ET ECHANGER...
1,SYSTEM UND VERFAHREN ZUM SCHENKEN VON VIRTUELL...,SYSTEM AND METHOD FOR FACILITATING GIFTING OF ...,SYSTÈME ET PROCÉDÉ POUR FACILITER LE DON D'ÉLÉ...
2,Geschenkschachtel,Box for gift objects,Boîte à cadeaux
3,Behälter für Geschenke,A container for gifts,Récipient pour cadeaux
4,GESCHENKKARTONBEHÄLTER,GIFT BOX CONTAINER,PAQUET-CADEAU


In [8]:
# searching for publications with the word gift only in the English title, with a case-sensitive match
r = epab.query_title('gift', language="EN", ignore_case=False)
print('Publications with the word gift in lowercase', r)

r.get_results('title', limit=5)

Publications with the word gift in lowercase 46 publications


Unnamed: 0,title.de,title.en,title.fr
0,Verfahren und Vorrichtung zum Schenken über ei...,Methods and apparatus for gifting over a data ...,Procédés et appareil pour donner des cadeaux d...
1,Verriegelter Geschenkschachtel,Locking gift box,Boîte à cadeaux scellée
2,Personalisiertes Geschenkartifakt,Personalized gift artifact,Artéfact de cadeaux personnalisés
3,Simuliertes Geschenk bildende Dose,Simulated gift wrap box,Boîte d'emballage simulant un cadeau
4,Kombinationsbehälter für Überraschungsgeschenk...,A container for surprise gifts which can be co...,"Récipient pour un cadeau-surprise, pouvant êtr..."
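
To see why the case-sensitive query returns fewer publications, here is a small local illustration in plain Python. It does not call EPAB, and it assumes simple substring matching, which is only a guess based on the sample output (titles containing "gifting" and "gifts" were returned); the actual matching rules of the search engine may differ. `title_contains` is a hypothetical helper.

```python
def title_contains(title, term, ignore_case=True):
    """Substring test; lowercases both sides when ignore_case is True."""
    if ignore_case:
        return term.lower() in title.lower()
    return term in title


# English titles taken from the two result tables above.
titles = [
    "GIFT BOX CONTAINER",
    "Locking gift box",
    "Personalized gift artifact",
]

insensitive = [t for t in titles if title_contains(t, "gift")]
sensitive = [t for t in titles if title_contains(t, "gift", ignore_case=False)]

# Every case-sensitive hit is also a case-insensitive hit, never the reverse.
print(len(insensitive), len(sensitive))
```

Under this assumption the case-sensitive result is always a subset of the case-insensitive one, which is consistent with the counts reported above (46 versus 173).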


### Multiple search terms
We can pass several search terms, separated by commas, to the full-text queries we run on EPAB. By default, multiple terms are combined with `OR`.

In [15]:
# Searching a set of possible terms (e.g. synonyms)
q = epab.query_title("covid, corona virus, coronavirus", language="EN")
print (q)
q.get_results("title.en", output_type="list", limit=10)

1002 publications


[{'title': {'en': 'LIVE, ATTENUATED CORONAVIRUS COMPRISING A VARIANT REPLICASE GENE ENCODING POLYPROTEINS COMPRISING A MUTATION IN NSP-10.'}},
 {'title': {'en': "2'-SUBSTITUTED-N6-SUBSTITUTED PURINE NUCLEOTIDES FOR CORONA VIRUS TREATMENT"}},
 {'title': {'en': 'PEPTIDE FOR PREVENTION OR TREATMENT OF COVID-19'}},
 {'title': {'en': 'THE USE OF CHITOSAN POLYMER IN THE TREATMENT AND PREVENTION OF INFECTIONS CAUSED BY CORONAVIRUSES'}},
 {'title': {'en': 'COMPOSITIONS AGAINST SARS-CORONAVIRUS AND USES THEREOF'}},
 {'title': {'en': 'Peptide compounds for detecting or inhibiting SARS coronavirus and application thereof'}},
 {'title': {'en': 'SUGAR CHAIN AND COMPOSITIONS THEREOF AND USE THEREOF IN PREVENTION AND/OR TREATMENT OF CORONAVIRUS INFECTION'}},
 {'title': {'en': 'USE OF A GRAPE EXTRACT AS A VIRUCIDE AGAINST VIRUSES FROM THE CORONAVIRUS FAMILY'}},
 {'title': {'en': 'Nucleic acid sequences that can be used as primers and probes in the amplification and detection of SARS coronavirus'}},
 {

#### Multiple search terms combined with AND
We can also query with several strings, and specify that they all should be present, with the `match_all` parameter.

In [17]:
# Requiring all the terms to appear in the same title
q = epab.query_title("coronavirus, vaccine", match_all=True, language="EN")
print(q)
q.get_results("title.en", limit=5)

144 publications


Unnamed: 0,title.en
0,NUCLEIC ACID VACCINES FOR CORONAVIRUS
1,USE OF VIRAL VECTORS FOR CORONAVIRUS VACCINE P...
2,INFLUENZA VIRUS VECTOR-BASED NOVEL CORONAVIRUS...
3,FELINE SEVERE ACUTE RESPIRATORY SYNDROME CORON...
4,"Coronavirus, nucleic acid, protein, and method..."


#### Multiple search terms with advanced combinations
What if you want to mix `AND` and `OR` when combining terms? This is where combining queries comes in handy.

In [None]:
# searching for synonyms of Covid
q = epab.query_title(search_terms="covid, corona virus, coronavirus", language="EN")

# searching for synonyms of vaccine
r = epab.query_title(search_terms="vaccine%, immun%", language="EN")

s = q & r

s.get_results('title.en', limit = 10)
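
The logic of the cell above can be pictured with ordinary Python sets: `OR` behaves like a union, `AND` like an intersection. The sketch below is only an illustration of that logic. The publication numbers are invented, and `any_of` and `all_of` are hypothetical helpers, not EPAB methods.

```python
# Hypothetical publication numbers per search term (not real query results).
hits = {
    "covid": {"EP1", "EP2"},
    "coronavirus": {"EP2", "EP3", "EP4"},
    "vaccine": {"EP3", "EP5"},
    "immun": {"EP1", "EP6"},
}


def any_of(*terms):
    """OR: publications matching at least one term (set union)."""
    return set().union(*(hits[t] for t in terms))


def all_of(*terms):
    """AND: publications matching every term (set intersection)."""
    return set.intersection(*(hits[t] for t in terms))


disease = any_of("covid", "coronavirus")   # default behaviour: terms combined with OR
remedy = any_of("vaccine", "immun")
combined = disease & remedy                # like `q & r` on two query objects
strict = all_of("coronavirus", "vaccine")  # like match_all=True

print(sorted(combined), sorted(strict))
```

Here `combined` keeps the publications that mention at least one disease term and at least one remedy term, which a single `match_all=True` query cannot express.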

### Querying abstract, claims and description
You can query the other parts of the full text (the abstract, the claims and the description) in the same way. Only the method name changes to reflect the part you are searching, as in `query_abstract` below.

In [18]:
# abstract search
q = epab.query_abstract("handover, base station", match_all=True, ignore_case=True)
print(q)
q.get_results("abstract", output_type="list", limit=2)

1428 publications


[{'abstract': {'language': 'EN',
   'text': '<p id="pa01" num="0001">Disclosed is a feedback control method in closed-loop transmit diversity in which feedback information representing amounts of amplitude and phase control is transmitted from a mobile station to a radio base station. The mobile station receives downlink pilot signals, which are transmitted by a handover-destination base station, during handover control, calculates feedback information, which represents amounts of amplitude and phase and phase control transmitted to the handover-destination base station, beforehand based upon the pilot signals received, and transmits the feedback information to the handover-destination base station before completion of base-station changeover by handover.<img id="iaf01" file="imgaf001.tif" wi="126" he="94" img-content="drawing" img-format="tif"/></p>'}},
 {'abstract': {'language': 'EN',
   'text': '<p id="pa01" num="0001">A hardware is used to perform an SSDT processing, thereby avoidi

## Retrieving statistics from a query
Sometimes you will want statistics on the results of a query before processing them further. The method `get_stats` returns a dataframe with statistics for one or more selected fields. When you run this method on a query object, you get the following information for the selected field(s):

- the `count` column reports the total number of occurrences of the corresponding field(s) value
- the `unique_publications` column reports the number of unique publications having that value
- the last two lines of the table are used to report the remainder and the total
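
The difference between the two columns is easiest to see on toy data. The sketch below reproduces the idea in plain Python. It is not how `get_stats` is implemented, and the publication numbers and inventor countries are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical publications, each listing the country of every named inventor.
publications = {
    "EP1": ["DE", "DE", "FR"],
    "EP2": ["DE"],
    "EP3": ["US", "FR"],
}

count = Counter()                  # occurrences of each value
pubs_per_value = defaultdict(set)  # publications in which each value appears

for pub, countries in publications.items():
    for country in countries:
        count[country] += 1
        pubs_per_value[country].add(pub)

stats = {
    value: {"count": count[value], "unique_publications": len(pubs_per_value[value])}
    for value in count
}

for value, row in stats.items():
    print(value, row)
```

In this toy data "DE" occurs three times but in only two publications, because EP1 names two inventors from Germany. That is why `count` can exceed `unique_publications`.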

### Statistics on patents about wireless communication networks
Let's look at an example. We will query for publications in the field of wireless communication networks, which are grouped under H04W in the IPC (the CPC uses the same subclass symbol).

In [None]:
# Running a query for all publications with IPC symbols starting with H04W
q = epab.query_ipc("H04W%")
q

In [None]:
# We want to see the distribution of the countries where the inventors mentioned in the publications resulting from the query live
q.get_stats("inventor.country")

Notice that the total number of unique publications matches the size of the query result, as expected. You can also see that there are more inventors than publications, because one application typically lists more than one inventor. We can also see which applicants are most active in the field of wireless communication networks.

In [None]:
# We want to see the distribution of the applicants named in the publications resulting from the query
q.get_stats("applicant.name")

Again, remember that a patent application can name more than one applicant, so the sum of the `count` column can be higher than the sum of the `unique_publications` column.