# Mining COVID-19 Kaggle competition scientific papers to build an understanding of viruses
## Part 2. Processing and featurizing data

Working from the clean metadata file, in this notebook we featurize the subset of the JSON files that we downloaded from the AI2 S3 repository.

# Imports

In [1]:
import cudf
import dask_cudf
import pandas as pd
import json
import re
import cupy

from distributed import Client

pd.options.display.max_rows = 100

# Read the processed files
All the data is located in our S3 bucket: `s3://bsql/data/covid`. Since the cleaning process is quite lengthy, we have already uploaded the cleaned-up version.

In [2]:
data_dir = 's3://bsql/data/covid'

In [3]:
client = Client('54.209.32.248:8786')
client.restart()

Client: Scheduler: tcp://54.209.32.248:8786, Dashboard: http://54.209.32.248:8787/status
Cluster: Workers: 4, Cores: 16, Memory: 65.93 GB


In [4]:
articles = dask_cudf.read_parquet(f'{data_dir}/full_text_clean.parquet').persist()

Let's check how many paragraphs we have in the DataFrame.

In [5]:
print(f'There are {len(articles):,} paragraphs in the articles DataFrame')
print(f'The DataFrame consumes {(articles.memory_usage().sum().compute()) / 1024**3:,.2}GB')

There are 2,850,235 paragraphs in the articles DataFrame
The DataFrame consumes 2.3GB


# Data featurization

The first step in featurizing our dataset is to create a vocabulary file. The vocabulary needs to conform to the format expected by the BERT models.

## Build vocabulary

First, we tokenize the strings into words, convert them to lowercase, and remove the punctuation marks and digits we don't need. The `tokenize()` method splits a string on a space and puts every tokenized word in a `cudf.Series`. Next, we aggregate and count the occurrences of each word, keeping only the words that appear more than `min_count` times.

In [6]:
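def tokenize_frame(frame, col):
    # Split each paragraph into tokens; every token becomes its own row
    temp = frame[col].str.tokenize().to_frame()
    # Normalize case, then strip punctuation and digits
    temp['text'] = temp['text'].str.lower()
    temp['text'] = temp['text'].str.replace(r'[\.?,#"$!;:=\(\)\-\+0-9]', '')
    # Helper column that the groupby below counts per token
    temp['counter'] = 1
    return temp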

min_count = 300

token_counts = (
    articles
    .map_partitions(tokenize_frame, 'text')
    .groupby('text')
    .count()
    .reset_index()
    .sort_values(by='counter', ascending=False)
    .query(f'counter > {min_count}')
)

token_counts = token_counts.compute().to_pandas()

print(f'Total number of tokens: {len(token_counts)}')

Total number of tokens: 31093
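
To make the steps in `tokenize_frame` concrete, here is a minimal CPU-only sketch of the same idea in plain Python. It is an illustration, not the cuDF code path: `count_tokens` and the sample paragraphs are hypothetical, and it assumes that splitting on whitespace is close enough to what `tokenize()` does on the GPU.

```python
import re
from collections import Counter

# Same character class as in tokenize_frame: punctuation and digits to drop.
PUNCTUATION = re.compile(r'[\.?,#"$!;:=\(\)\-\+0-9]')


def count_tokens(paragraphs, min_count=0):
    """Split on whitespace, lowercase, strip punctuation, and count tokens."""
    counts = Counter()
    for paragraph in paragraphs:
        for token in paragraph.split():
            counts[PUNCTUATION.sub('', token.lower())] += 1
    # Keep only tokens seen more than min_count times, like the query() call.
    return {token: n for token, n in counts.items() if n > min_count}


# Hypothetical sample paragraphs, for illustration only.
sample = ['The virus binds ACE2.', 'The virus (SARS-CoV-2) spreads.']
print(count_tokens(sample, min_count=1))
```

Note that stripping digits and hyphens turns `(SARS-CoV-2)` into `sarscov`, which is the kind of normalization the regular expression above applies to the real corpus.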


Let's see what this looks like.

In [7]:
token_counts.head()

| | text | counter |
|---|---|---|
| 102067 | dvgs | 301 |
| 224936 | perth | 301 |
| 315619 | complacency | 301 |
| 538209 | rpi | 301 |
| 595962 | skp | 301 |


To create the final vocabulary we will use the `SubwordTextEncoder` from this repository: https://github.com/kwonmha/bert-vocab-builder/. We use a slightly modified version of the script that removes the dependency on TensorFlow.

The algorithm scans the words and iteratively builds a vocabulary of the longest subwords that the original words can be subdivided into.
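
As a rough illustration of what "subdividing a word into subwords" means, the sketch below greedily splits a word into the longest pieces found in a given vocabulary. It only shows the segmentation idea and is not the vocabulary-building loop of `SubwordTextEncoder`; the function name and the toy vocabulary are hypothetical.

```python
def split_into_subwords(word, vocabulary):
    """Greedily split a word into the longest subwords found in the vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until one is known.
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known subword starts here, so fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocabulary = {'corona', 'virus', 'vir', 'al', 'anti'}
print(split_into_subwords('coronavirus', toy_vocabulary))
print(split_into_subwords('antiviral', toy_vocabulary))
```

With a vocabulary like this, a word the model has never seen as a whole can still be represented by a handful of known pieces.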

In [8]:
from scripts import text_encoder

sw = text_encoder.SubwordTextEncoder()

The `SubwordTextEncoder` expects a dictionary whose keys are the words and whose values are the word counts.

In [9]:
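# Convert the two-column DataFrame into a {word: count} dictionary
token_counts_dict = dict(token_counts.to_dict('split')['data'])

sw.build_from_token_counts(
    token_counts_dict,
    3000,  # min_count in the upstream bert-vocab-builder script
    4,     # num_iterations in the upstream bert-vocab-builder script
)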

total vocab size : 12989, 2.406862735748291 seconds elapsed 
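
The conversion to a dictionary relies on the shape that `to_dict('split')` returns: a dictionary with `index`, `columns`, and `data` keys, where `data` is a list of rows. The sketch below mimics that structure with toy values (no pandas needed) to show why wrapping `data` in `dict()` works for a two-column frame.

```python
# Shape of token_counts.to_dict('split') for a two-column frame (toy values).
split_form = {
    'index': [0, 1, 2],
    'columns': ['text', 'counter'],
    'data': [['virus', 1200], ['cell', 950], ['protein', 800]],
}

# dict() treats every two-element row as a (key, value) pair.
toy_token_counts = dict(split_form['data'])
print(toy_token_counts)
```

This only works because each row has exactly two elements; a frame with a third column would make `dict()` raise a `ValueError`.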


Let's sort the resulting subtokens and save them to a vocabulary file.

In [10]:
vocab = (
    cudf.Series(sw._all_subtoken_strings)
    .sort_values()
    .reset_index(drop=True)
)

with open('vocabulary_LARGE.txt', 'w') as f:
    f.writelines([f'{item}\n' for item in list(vocab.to_array())])

## Build the hash version of the vocabulary
The `subword_tokenizer` requires a hashed version of the vocabulary in order to tokenize text into the representation BERT expects. The script from CLX does that: https://github.com/rapidsai/clx/blob/80d3198dfe54bef704d177404873d2312a77f2c9/python/clx/analytics/perfect_hash.py.

In [11]:
from scripts import perfect_hash

perfect_hash.hash_vocab(
    'vocabulary_LARGE.txt'
    , 'vocabulary_hash_LARGE.txt'
    , False
)

Attempting to build table using 5.467472n space
Longest bin was 169
Number of bins: 3247
Processing bin 0 size 3
Processing bin 1000 size 2
Processing bin 2000 size 9
Processing bin 3000 size 6
Final table size 64523 elements compared to 12989 for original
Max bin length was 169


  return ((a*k + b) % PRIME) % size


All present tokens return correct value.
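
The log above mentions bins, a final table size, and a hash of the form `((a*k + b) % PRIME) % size`, which suggests a two-level perfect-hashing scheme. The sketch below shows the general technique under that assumption; it is not the CLX implementation. The value of `PRIME`, the quadratic bin sizing, and all function names are hypothetical choices, and the keys are small integers standing in for tokens.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, a hypothetical choice for this sketch


def universal_hash(key, a, b, size):
    """Hash an integer key into [0, size) with coefficients a and b."""
    return ((a * key + b) % PRIME) % size


def build_two_level_table(keys, num_bins, seed=0):
    """Build a collision-free two-level table for distinct integer keys < PRIME."""
    rng = random.Random(seed)
    outer = (rng.randrange(1, PRIME), rng.randrange(PRIME), num_bins)

    # First level: scatter the keys into bins.
    bins = [[] for _ in range(num_bins)]
    for key in keys:
        bins[universal_hash(key, *outer)].append(key)

    # Second level: per bin, retry coefficients until no two keys collide.
    tables = []
    for bin_keys in bins:
        size = max(1, len(bin_keys) ** 2)
        while True:
            a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
            slots = [None] * size
            for key in bin_keys:
                pos = universal_hash(key, a, b, size)
                if slots[pos] is not None:
                    break  # collision, try new coefficients
                slots[pos] = key
            else:
                tables.append((a, b, slots))
                break
    return outer, tables


def lookup(key, outer, tables):
    """Return True if the key is stored in the table."""
    a, b, slots = tables[universal_hash(key, *outer)]
    return slots[universal_hash(key, a, b, len(slots))] == key


keys = [7, 42, 1001, 31093, 64523, 12989]
outer, tables = build_two_level_table(keys, num_bins=3)
print(all(lookup(k, outer, tables) for k in keys))
```

Because every bin gets its own collision-free table, a lookup costs two hash evaluations and never needs to probe, at the price of a table that is larger than the original vocabulary.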


# Tokenize text
Now we are ready to tokenize the text. First, we need to push `vocabulary_hash_LARGE.txt` to each worker so the `subword_tokenizer` can read it. Let's first check how big our file is.

In [12]:
!du -h vocabulary_hash_LARGE.txt

432K	vocabulary_hash_LARGE.txt


Tiny. Let's upload.

In [13]:
client.upload_file('vocabulary_hash_LARGE.txt')

Now we can tokenize.

In [14]:
import os, re

def subword_tokenize_frame(frame):
    # Each Dask worker keeps uploaded files in its own worker-* folder
    directory = '/mnt/blazingsql/dask-worker-space/dask-worker-space'
    list_dirs = os.listdir(directory)
    
    # Match the worker folder itself, not its .dirlock companion
    pat = re.compile(r'(worker-[a-z0-9_]+)(?!\.dirlock)')
    local_folder = list(filter(pat.fullmatch, list_dirs))[0]
    vocab_file = f'{directory}/{local_folder}/vocabulary_hash_LARGE.txt'
    
    num_strings = len(frame.text)
    num_bytes = frame.text.str.byte_count().sum()

    tokens, attention = frame.text.str.subword_tokenize(
        vocab_file
        , 256
        , 256
        , max_num_strings=num_strings
        , max_num_chars=num_bytes
        , max_rows_tensor=num_strings
        , do_lower=False
        , do_truncate=True
    )[:2]
    
    temp = cudf.DataFrame()
    temp['tokens'] = tokens
    temp['attention'] = attention
    
    return temp

tokenized = articles.partitions[:].map_partitions(subword_tokenize_frame)
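
The directory lookup inside `subword_tokenize_frame` hinges on one regular expression. The sketch below applies the same pattern to a made-up directory listing to show which entry it selects; the file names are hypothetical stand-ins for what Dask creates in its worker space.

```python
import re

# Same pattern as in subword_tokenize_frame, written as a raw string.
pat = re.compile(r'(worker-[a-z0-9_]+)(?!\.dirlock)')

# Hypothetical directory listing; real names are generated by Dask.
list_dirs = ['global.lock', 'worker-a1b2c3_d4', 'worker-a1b2c3_d4.dirlock']

# fullmatch() must consume the whole name, so the .dirlock entry is rejected.
matching = [name for name in list_dirs if pat.fullmatch(name)]
print(matching)
```

Taking element `[0]` of the matches, as the function does, assumes there is exactly one worker folder in that directory on each machine.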

Let's check how many tokens we get from all the articles we read.

In [15]:
%%time
tokens_cnt = len(tokenized)
print(f'There are {tokens_cnt:,} tokens in the dataset.')

There are 729,660,160 tokens in the dataset.
CPU times: user 51.1 ms, sys: 7.24 ms, total: 58.4 ms
Wall time: 804 ms


In [16]:
tokenized.head()

| | tokens | attention |
|---|---|---|
| 0 | 2338 | 1 |
| 1 | 7127 | 1 |
| 2 | 3573 | 1 |
| 3 | 383 | 1 |
| 4 | 1941 | 1 |


Since each paragraph is truncated or padded to exactly 256 tokens, dividing the total number of tokens by 256 should give us the total number of paragraphs in our corpus.

In [17]:
articles_cnt = len(articles)

assert tokens_cnt / 256 == articles_cnt
print(f'Number of paragraphs derived from tokens: {int(tokens_cnt / 256):,}, actual number of paragraphs: {articles_cnt:,}')

Number of paragraphs derived from tokens: 2,850,235, actual number of paragraphs: 2,850,235
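
The reason this check works can be sketched in plain Python: if every paragraph is truncated or padded to a fixed length, the flattened token column always has `256 * number_of_paragraphs` rows. The helper below is hypothetical and assumes a padding id of 0; it is not how cuDF implements `subword_tokenize`.

```python
def pad_or_truncate(token_ids, max_len=256, pad_id=0):
    """Return (ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]
    padding = max_len - len(kept)
    # Real tokens get attention 1, padding positions get attention 0.
    return kept + [pad_id] * padding, [1] * len(kept) + [0] * padding


# Toy token ids: a short paragraph, an overly long one, and an empty one.
paragraphs = [[2338, 7127, 3573], list(range(300)), []]

flat_tokens, flat_attention = [], []
for ids in paragraphs:
    tokens, attention = pad_or_truncate(ids)
    flat_tokens.extend(tokens)
    flat_attention.extend(attention)

# The flattened length is always a multiple of the sequence length.
print(len(flat_tokens) // 256 == len(paragraphs))
```

The attention mask is what lets a downstream model tell real tokens from padding, which is why it is kept alongside the token ids.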


How about memory usage?

In [18]:
print(f'The DataFrame consumes {(tokenized.memory_usage().sum().compute()) / 1024**3:,.2}GB')

The DataFrame consumes 5.4GB


This might seem small, even for a T4 with 16GB of memory. However, as it stands today, the `subword_tokenize` method requires 21x as many bytes as are stored in the string column (see https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.column.string.StringMethods.subword_tokenize).

If you run `.compute()` on this DataFrame, memory usage will grow very quickly and the GPU will run out of memory (OOM).
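
A back-of-the-envelope estimate makes the problem visible. The sketch below applies the 21x factor quoted above to the size of the `articles` DataFrame and asks how many batches would be needed to stay within a per-GPU budget. The helper names are hypothetical, the 2.3GB figure is the whole DataFrame (so an upper bound on the text column), and the 12 GiB budget is an assumed amount of headroom on a 16GB T4.

```python
GIB = 1024 ** 3


def estimate_tokenizer_bytes(string_bytes, overhead_factor=21):
    """Estimate the working memory the tokenizer needs for a string column."""
    return string_bytes * overhead_factor


def min_batches(string_bytes, gpu_budget_bytes, overhead_factor=21):
    """Smallest number of equal batches that keeps each batch within budget."""
    total = estimate_tokenizer_bytes(string_bytes, overhead_factor)
    # Integer ceiling division avoids floating-point rounding surprises.
    return (total + gpu_budget_bytes - 1) // gpu_budget_bytes


text_bytes = int(2.3 * GIB)  # upper bound taken from the output above
budget = 12 * GIB            # assumed usable memory on a 16GB T4

print(f'{estimate_tokenizer_bytes(text_bytes) / GIB:.1f} GiB needed in total')
print(f'{min_batches(text_bytes, budget)} batches at this budget')
```

Under these assumptions the tokenizer needs far more memory than a single GPU offers, so the work has to stay split across partitions and workers rather than being gathered with `.compute()`.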