## Applying the BERT tokenizer to the dataset texts

In [1]:
import pandas as pd
import numpy as np
import os
import sys

In [3]:
# To use BERT, I clone the repository that contains all the modules and add its directory to the path
# git clone https://github.com/google-research/bert bert
if 'bert' not in sys.path:
    sys.path += ['bert']

import tokenization

Next, I set up the directories where the pre-trained model is stored.

The links to download the models are at:
https://github.com/google-research/bert

Several pre-trained models are available, each with a different number of layers. The more layers a model has, the longer training takes, so I start testing with the smallest models, in this case uncased_L-2_H-128_A-2.
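
As a small illustration of how these model names encode the architecture, the sketch below splits a name into its parts. In the BERT repository's naming scheme, `L` is the number of layers, `H` the hidden size and `A` the number of attention heads. The helper `parse_bert_model_name` is hypothetical and not part of the BERT code; it only assumes the name follows the `<casing>_L-<n>_H-<n>_A-<n>` pattern.

```python
import re

def parse_bert_model_name(name):
    """Split a name like 'uncased_L-2_H-128_A-2' into its parts (hypothetical helper)."""
    pattern = r"(?P<casing>\w+?)_L-(?P<layers>\d+)_H-(?P<hidden>\d+)_A-(?P<heads>\d+)"
    match = re.fullmatch(pattern, name)
    if match is None:
        raise ValueError(f"Unrecognized model name: {name!r}")
    return {
        "casing": match["casing"],
        "layers": int(match["layers"]),
        "hidden_size": int(match["hidden"]),
        "attention_heads": int(match["heads"]),
    }

config = parse_bert_model_name("uncased_L-2_H-128_A-2")
print(config)
# {'casing': 'uncased', 'layers': 2, 'hidden_size': 128, 'attention_heads': 2}
```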

In [4]:
bert_path = r'D:\DS\COVID\BERT models'  # raw string so the backslashes are not read as escape sequences
BERT_MODEL = 'uncased_L-2_H-128_A-2'
BERT_PRETRAINED_DIR = os.path.join(bert_path, BERT_MODEL)

VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

I test the BERT tokenizer.


In [5]:
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)
result = tokenizer.tokenize("Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset")

print(result)
print(tokenizer.convert_tokens_to_ids(result))

['call', 'to', 'action', 'to', 'the', 'tech', 'community', 'on', 'new', 'machine', 'read', '##able', 'co', '##vid', '-', '19', 'data', '##set']
[2655, 2000, 2895, 2000, 1996, 6627, 2451, 2006, 2047, 3698, 3191, 3085, 2522, 17258, 1011, 2539, 2951, 13462]
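
The output above shows words such as "readable" and "dataset" split into pieces like `read` + `##able`. A minimal sketch of the greedy longest-match-first idea behind this WordPiece splitting follows. It is a simplified illustration, not the code from `tokenization.py`: the function `wordpiece` and the tiny `toy_vocab` are assumptions made up for the example, and the real tokenizer first lowercases the text and splits on whitespace and punctuation (which is why `-` and `19` appear as separate tokens).

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single, already lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical); the real one is loaded from vocab.txt.
toy_vocab = {"read", "##able", "co", "##vid", "data", "##set", "machine"}

print(wordpiece("readable", toy_vocab))  # ['read', '##able']
print(wordpiece("covid", toy_vocab))     # ['co', '##vid']
print(wordpiece("machine", toy_vocab))   # ['machine']
print(wordpiece("xyz", toy_vocab))       # ['[UNK]']
```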


I create a function to apply the tokenizer to all the texts.

In [7]:
def tokenize_text(text):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

I load the data, already converted to a DataFrame.

In [9]:
%%time
df_path = r'D:\DS\COVID\outputs'  # raw string so the backslashes are not read as escape sequences

df = pd.read_csv(
    os.path.join(df_path, 'result.csv'),
    dtype={'title_x': str, 'abstract_x': str, 'body_text': str, 'has_full_text': str},
)

In [10]:
df.head(2)

Unnamed: 0,sha,title_x,abstract_x,body_text,source_x,title_y,doi,pmcid,pubmed_id,license,abstract_y,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file
0,ffed5d2a31a0c1a0db11905fe378e7735b6d70ca,"Supplemental material for the paper ""Evidence ...",Israel. *Corresponding author (TT): tamirtul@p...,20min). We trimmed the poly-A adaptors from th...,,,,,,,,,,,,,,
1,ffe718db1820f27bf274e3fc519ab78e450de288,Replication enhancer elements within the open ...,We provide experimental evidence of a replicat...,Tick-borne encephalitis virus (TBEV) is a huma...,PMC,Replication enhancer elements within the open ...,10.1093/nar/gkr237,PMC3303483,21622960.0,cc-by-nc,We provide experimental evidence of a replicat...,2011 Sep 27,"Tuplin, A.; Evans, D. J.; Buckley, A.; Jones, ...",Nucleic Acids Res,,,True,noncomm_use_subset


I apply it to the abstracts.

In [43]:
%%time
# apply to all the abstracts
abstracts = df[df['abstract_y'].notnull()]['abstract_y']
tokenized_abstract = [tokenize_text(text) for text in abstracts]
len(tokenized_abstract)

Wall time: 2min 17s


I apply it to the full texts.


In [47]:
%%time
# apply to all the body texts
body_texts = df[df['body_text'].notnull()]['body_text']
tokenized_body = [tokenize_text(text) for text in body_texts]
len(tokenized_body)

Wall time: 48min 11s


I save the results to CSV. Note: there is probably a more suitable format for this; I still need to look into it.

In [59]:
%%time
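# Build the column straight from the list: np.array() on lists of different lengths raises an error in recent NumPy versions
df_body_token = pd.DataFrame({'body': tokenized_body})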
df_body_token.to_csv(os.path.join(df_path, 'tokenized_body.csv'), index=False)

Wall time: 52.5 s


In [58]:
%%time
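# Build the column straight from the list: np.array() on lists of different lengths raises an error in recent NumPy versions
df_abs_token = pd.DataFrame({'abstract': tokenized_abstract})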
df_abs_token.to_csv(os.path.join(df_path, 'tokenized_abstract.csv'), index=False)

Wall time: 2.91 s
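
On the open question of a more suitable format: CSV stores each list of token ids as its text representation (for example `"[2655, 2000, 2895]"`), so the lists have to be parsed again after loading. One possible alternative, shown here only as a sketch, is JSON Lines, with one list of ids per line. The helpers `save_token_ids` and `load_token_ids` and the file name are hypothetical, and the example uses a temporary directory instead of `df_path`.

```python
import json
import os
import tempfile

def save_token_ids(sequences, path):
    """Write one JSON list of token ids per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for ids in sequences:
            f.write(json.dumps(ids) + "\n")

def load_token_ids(path):
    """Read the file back into a list of lists of ints (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Two short sequences of different lengths, standing in for tokenized_abstract.
example = [[2655, 2000, 2895], [2522, 17258, 1011, 2539]]

path = os.path.join(tempfile.mkdtemp(), "tokenized_example.jsonl")
save_token_ids(example, path)
restored = load_token_ids(path)
print(restored == example)  # True
```

Because each line is independent, the file can also be read back one sequence at a time, which may matter for the much larger body texts.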
