# Downloading the EP full-text data for text analytics to a server

Documentation available at:
* [EP full-text data for text analytics](https://www.epo.org/searching-for-patents/data/bulk-data-sets/text-analytics.html)
* [Cloud Storage Client Libraries](https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python)

## Import model configuration

In [1]:
import sys
sys.path.append("../settings")
import settings
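
The `settings` module itself is not shown in this notebook. Judging only from the attributes used in the cells below, it might look roughly like the following sketch. Every value is a placeholder chosen for illustration, not the author's actual configuration.

```python
# settings.py: hypothetical sketch, every value below is a placeholder.

# Google Cloud project billed for the requests (passed to `gsutil -u`).
google_project_id = "my-gcp-project"

# Local folders receiving the downloaded files.
local_storage_destination_sample = "/data/ep_fulltext/sample"
local_storage_destination_2019 = "/data/ep_fulltext/2019"
local_storage_destination_2020 = "/data/ep_fulltext/2020"
```

Keeping these values in a separate module outside version control avoids leaking the project identifier and local paths when the notebook is shared.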

## Installing new libraries

In [2]:
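# Upgrading pip (this call targets the Python 3.8 interpreter; the kernel itself runs Python 3.6.9, as the next cell shows)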
!/usr/bin/python3.8 -m pip install --upgrade pip

Defaulting to user installation because normal site-packages is not writeable
Requirement already up-to-date: pip in /home/antoine/.local/lib/python3.8/site-packages (20.2.2)


In [3]:
from platform import python_version
print(python_version())

3.6.9


In [4]:
# Installing the Google cloud storage API
!pip3.6 install google
!pip3.6 install google-cloud-storage

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


## Authentication with the Google Cloud service

To run the client library, you must first set up authentication by creating a service account and setting an environment variable. 

> * Complete the steps on the Google page [Cloud Storage Client Libraries](https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python) to set up authentication.
> * The keys and confidential settings are stored in a dedicated folder.
> * Run `gsutil config` in a terminal, then follow the steps to authenticate.

> * <span style="background-color: #FFFF00">/!\ From this point on, actions are billed to the Google Cloud account!</span>
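
Before running any billed command, it can help to confirm that the environment variable mentioned above is in place. The sketch below assumes the standard `GOOGLE_APPLICATION_CREDENTIALS` variable used by the Google client libraries. The helper name `credentials_configured` is hypothetical.

```python
import os


def credentials_configured(env=None):
    """Return True if GOOGLE_APPLICATION_CREDENTIALS points to an existing file."""
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(key_path) and os.path.isfile(key_path)


if not credentials_configured():
    print("GOOGLE_APPLICATION_CREDENTIALS is not set or does not point to a file.")
```

This only checks that a key file exists. It does not verify that the service account has access to the bucket.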

## First glance at the content of the bucket

In [5]:
# Visualisation of the files in the EPO public repository
# Here we see that two versions of the database exist (the 2019 and the 2020 version)
!gsutil -u {settings.google_project_id} ls gs://epo-public

gs://epo-public/EP-fulltext-for-text-analytics_2019week31/
gs://epo-public/EP-fulltext-for-text-analytics_2020week08/


In [9]:
# Estimating the size of the 2019 edition of the EP data: 232 GB
!gsutil -u {settings.google_project_id} du -s gs://epo-public/EP-fulltext-for-text-analytics_2019week31/

232717793416  gs://epo-public/EP-fulltext-for-text-analytics_2019week31
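
`gsutil du` reports sizes in bytes. The sketch below converts the figure above to decimal gigabytes and binary gibibytes, and gives a rough transfer-time estimate. The function names are hypothetical, and the 59 MiB/s rate is an assumption for illustration only, since real throughput varies.

```python
def describe_size(n_bytes):
    """Return the size in decimal gigabytes (GB) and binary gibibytes (GiB)."""
    return n_bytes / 10**9, n_bytes / 2**30


def transfer_time_minutes(n_bytes, rate_mib_per_s):
    """Estimate the transfer time in minutes at a constant rate given in MiB/s."""
    return n_bytes / (rate_mib_per_s * 2**20) / 60


size_2019 = 232717793416  # bytes, as reported by `gsutil du -s` above
gb, gib = describe_size(size_2019)
print(f"{gb:.1f} GB = {gib:.1f} GiB")
# 59 MiB/s is an assumed rate, used here only to get an order of magnitude.
print(f"about {transfer_time_minutes(size_2019, 59.0):.0f} minutes at 59 MiB/s")
```

The distinction matters when sizing the local disk: tools such as `gsutil cp` print GiB, while the "232 GB" quoted in the comment is a decimal figure.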


In [7]:
# Estimating the size of the 2020 edition of the EP data
# Here `du -s` reports only 11 bytes for the 2020 folder. This figure looks unreliable: the copy at the end of this notebook transfers several GiB from the same folder.
!gsutil -u {settings.google_project_id} du -s gs://epo-public/EP-fulltext-for-text-analytics_2020week08/

11           gs://epo-public/EP-fulltext-for-text-analytics_2020week08


## Copying a single file (6 GB) to the local repository
* The EP full-text data for text analytics comprises around 35 data files of about 5-8 GB each.
* Each file contains the publications associated with a block of 100 000 publication numbers.
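
The file names seen in this notebook (`EP0000000.txt`, `EP0100000.txt`, `EP1400000.txt`, ...) suggest that each file is named after the first publication number of its block. Under that assumption, which the dataset documentation should confirm, the file holding a given publication can be derived as sketched below. The helper name is hypothetical.

```python
def file_for_publication(publication_number, block_size=100_000):
    """Return the name of the data file assumed to hold a publication number."""
    block_start = (publication_number // block_size) * block_size
    return f"EP{block_start:07d}.txt"


print(file_for_publication(1400022))  # prints EP1400000.txt
```

This makes it possible to download only the files relevant to a list of publication numbers instead of the whole edition.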

In [6]:
!gsutil -u {settings.google_project_id} cp gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP1400000.txt {settings.local_storage_destination_sample}

Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP1400000.txt...
- [1 files][  5.9 GiB/  5.9 GiB]   59.0 MiB/s                                   
Operation completed over 1 objects/5.9 GiB.                                      
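
The same copy can presumably be done from Python with the `google-cloud-storage` library installed earlier. The sketch below is untested against this bucket and assumes that `user_project` is the equivalent of the `gsutil -u` flag. The function names are hypothetical.

```python
def split_gs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket_name, _, object_name = uri[len("gs://"):].partition("/")
    return bucket_name, object_name


def download_object(uri, destination_path, project_id):
    """Download one object, billing the request to `project_id`."""
    # Imported here so that the helper above works without the library installed.
    from google.cloud import storage

    bucket_name, object_name = split_gs_uri(uri)
    client = storage.Client(project=project_id)
    # user_project names the project to bill, like `gsutil -u`.
    bucket = client.bucket(bucket_name, user_project=project_id)
    bucket.blob(object_name).download_to_filename(destination_path)
```

A call would look like `download_object("gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP1400000.txt", "EP1400000.txt", settings.google_project_id)`, and it is billed in the same way as the `gsutil` command.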


## Inspecting the data we have just downloaded

In [2]:
import pandas as pd
data = pd.read_csv(settings.local_storage_destination_sample + '/EP1400000.txt', sep='\t', header=None)
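
Loading a file of roughly 6 GB in a single call needs a machine with a lot of memory. As an alternative, the sketch below streams the file line by line using only the standard library. It assumes one record per line with seven tab-separated fields, in the column order given in the next cell, and no raw newlines inside the text field. The function name is hypothetical.

```python
from collections import Counter
import io


def count_text_types(lines):
    """Count rows per (language, text_type) from an iterable of TSV lines."""
    counts = Counter()
    for line in lines:
        # Split into at most 7 fields so that tabs inside the last field survive.
        fields = line.rstrip("\n").split("\t", 6)
        if len(fields) != 7:
            continue  # skip malformed lines rather than fail
        language, text_type = fields[4], fields[5]
        counts[(language, text_type)] += 1
    return counts


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    "EP\t1400022\tA2\t2004-03-24\ten\tTITLE\tMETHOD AND APPARATUS FOR CODING AND DECODING DATA\n"
    "EP\t1400299\tA1\t2004-03-24\tde\tTITLE\tELEKTRODENDRAHT FÜR DRAHTEROSIONSMASCHINE\n"
)
print(count_text_types(sample))
```

For the real file, pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")`. The encoding is an assumption to verify against the dataset description.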

In [4]:
data.columns = ['publication_authority', # will always have the value "EP"
                'publication_number', # a seven-digit number
                'publication_kind', # see https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/definitions.html for help.
                'publication_date', # in format YYYY-MM-DD
                'language_text_component', # de, en, fr; xx means unknown
                'text_type', # TITLE, ABSTR, DESCR, CLAIM, AMEND, ACSTM, SREPT, PDFEP
                'text' # it contains, where appropriate, XML tags for better structure. You will find the DTD applicable to all parts of the publication at: http://docs.epoline.org/ebd/doc/ep-patent-document-v1-5.dtd
               ]
data.head(10)

| | publication_authority | publication_number | publication_kind | publication_date | language_text_component | text_type | text |
|---|---|---|---|---|---|---|---|
| 0 | EP | 1400022 | A2 | 2004-03-24 | de | TITLE | VERFAHREN UND EINRICHTUNG ZUR KODIERUNG, BEZIE... |
| 1 | EP | 1400022 | A2 | 2004-03-24 | en | TITLE | METHOD AND APPARATUS FOR CODING AND DECODING DATA |
| 2 | EP | 1400022 | A2 | 2004-03-24 | fr | TITLE | PROCEDE ET APPAREIL PERMETTANT DE CODER ET DE ... |
| 3 | EP | 1400299 | A1 | 2004-03-24 | de | TITLE | ELEKTRODENDRAHT FÜR DRAHTEROSIONSMASCHINE |
| 4 | EP | 1400299 | A1 | 2004-03-24 | en | TITLE | ELECTRODE WIRE FOR WIRE ELECTRICAL DISCHARGE M... |
| 5 | EP | 1400299 | A1 | 2004-03-24 | fr | TITLE | FIL-ELECTRODE POUR ENSEMBLE D'USINAGE PAR ETIN... |
| 6 | EP | 1400299 | A1 | 2004-03-24 | en | ABSTR | `<p id="pa01" num="0001">The present invention ...` |
| 7 | EP | 1400299 | A1 | 2004-03-24 | en | DESCR | `<heading id="h0001">TECHNICAL FIELD</heading><...` |
| 8 | EP | 1400299 | A1 | 2004-03-24 | en | CLAIM | `<claim id="c-en-0001" num="0001"><claim-text>A...` |
| 9 | EP | 1400299 | A1 | 2004-03-24 | en | PDFEP | https://data.epo.org/publication-server/pdf-do... |
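
The `text` column carries XML tags for the abstract, description and claims. For a first pass at text analytics, the tags can be removed with a regular expression, as sketched below. This is a rough approach rather than a real XML parser. The name `strip_tags` is hypothetical, and parsing against the DTD mentioned above would be more robust.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Replace XML tags with spaces, decode entities, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(html.unescape(without_tags).split())


print(strip_tags('<heading id="h0001">TECHNICAL FIELD</heading>'))  # prints TECHNICAL FIELD
```

It could be applied to the DataFrame with `data['text'].map(strip_tags)`, restricted to the rows whose `text_type` holds XML content.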


## Downloading the entire database to the local server

In [5]:
# 2019 edition
!gsutil -u {settings.google_project_id} cp gs://epo-public/EP-fulltext-for-text-analytics_2019week31/* {settings.local_storage_destination_2019}

Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/$description.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/$license.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP0000000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP0100000.txt...
- [4 files][ 10.4 GiB/ 10.4 GiB]   24.6 MiB/s                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP0200000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP0300000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP0400000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2019week31/EP0500000.txt...
C
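
The note in the log above recommends `gsutil -m` for copies involving many files. A minimal sketch of launching such a parallel copy from Python follows. It assumes `gsutil` is on the `PATH`, and the two function names are hypothetical.

```python
import subprocess


def build_parallel_copy_command(project_id, source, destination):
    """Build a `gsutil -m cp` command as an argument list."""
    # -m enables parallel transfers; -u names the project billed for the requests.
    return ["gsutil", "-m", "-u", project_id, "cp", source, destination]


def run_parallel_copy(project_id, source, destination):
    """Run the copy and raise CalledProcessError if gsutil fails."""
    command = build_parallel_copy_command(project_id, source, destination)
    subprocess.run(command, check=True)
```

Because the command is passed as a list without a shell, the `*` wildcard in a `gs://` source reaches `gsutil` unchanged and is expanded by `gsutil` itself.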

In [6]:
# 2020 edition
!gsutil -u {settings.google_project_id} cp gs://epo-public/EP-fulltext-for-text-analytics_2020week08/* {settings.local_storage_destination_2020}

Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/$description.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/$license.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/EP0000000.txt...
| [3 files][  5.2 GiB/  5.2 GiB]   58.7 MiB/s                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/EP0100000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/EP0200000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/EP0300000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/EP0400000.txt...
Copying gs://epo-public/EP-fulltext-for-text-analytics_2020week08/EP0500000.txt...
C
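
One way to check that a bulk copy completed is to compare the total size of the local folder with the figure reported by `gsutil du -s`. The sketch below sums the local file sizes. The helper name is hypothetical, and matching sizes do not prove the content is intact.

```python
import os


def total_size_bytes(directory):
    """Sum the sizes of all regular files directly inside `directory`."""
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                total += entry.stat().st_size
    return total
```

For the 2019 edition, `total_size_bytes(settings.local_storage_destination_2019)` would be expected to equal 232717793416 if every file arrived in full.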

## Success!