<a href="https://colab.research.google.com/github/adimyth/datascience_stuff/blob/master/nlp/DataProcessingQA(WordPiece%2CByteLevelBPE%26SentencePiece).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Processing (Question Answering)


In [None]:
# !pip install tokenizers

In [None]:
import os
import json
from tokenizers import BertWordPieceTokenizer, ByteLevelBPETokenizer, SentencePieceBPETokenizer

## BERT

[Kaggle Kernel](https://www.kaggle.com/akensert/tweet-bert-base-with-tf2-1-mixed-precision/comments)

BERT uses WordPiece tokenization.

In [None]:
tweet = "Sooo SAD I will miss you here in San Diego!!!"
selected_text = "Sooo SAD"
sentiment = "negative"

In [None]:
idx_start, idx_end = None, None

In [None]:
for index in (i for i, c in enumerate(tweet) if c == selected_text[0]):
    if tweet[index:index+len(selected_text)] == selected_text:
        idx_start = index
        idx_end = index + len(selected_text)
        break

In [None]:
idx_start, idx_end, tweet[idx_start: idx_end]

(0, 8, 'Sooo SAD')

In [None]:
intersection = [0]*len(tweet)

In [None]:
for idx in range(idx_start, idx_end):
    intersection[idx] = 1

In [None]:
print(intersection)

[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
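
The search-and-mark steps above can be folded into one helper. The sketch below swaps the explicit loop for `str.find`; the name `char_mask` is a hypothetical helper and not part of the original notebook.

```python
def char_mask(text, selected):
    """Return a 0/1 list marking the characters of `text` covered by `selected`."""
    start = text.find(selected)          # first occurrence, -1 if absent
    if start == -1:
        raise ValueError("selected text does not occur in text")
    end = start + len(selected)
    return [1 if start <= i < end else 0 for i in range(len(text))]


tweet = "Sooo SAD I will miss you here in San Diego!!!"
mask = char_mask(tweet, "Sooo SAD")
print(sum(mask), mask[:10])
```

Raising on a missing selection is a design choice made here; the loop above instead leaves `idx_start` and `idx_end` as `None`.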


In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

--2020-07-16 08:43:27--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.64.238
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.64.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt’


2020-07-16 08:43:29 (307 KB/s) - ‘bert-base-uncased-vocab.txt’ saved [231508/231508]



In [None]:
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", 
                                   lowercase=True)

In [None]:
tweet = "Sooo SAD I will miss you here in San Diego!!! Unfortunately, he will not be coming"

In [None]:
enc = tokenizer.encode(tweet, add_special_tokens=False)

In [None]:
print(f"IDS: {enc.ids}\n")
print(f"TOKENS: {enc.tokens}\n")
print(f"OFFSET: {enc.offsets}\n")

IDS: [17111, 2080, 6517, 1045, 2097, 3335, 2017, 2182, 1999, 2624, 5277, 999, 999, 999, 6854, 1010, 2002, 2097, 2025, 2022, 2746]

TOKENS: ['soo', '##o', 'sad', 'i', 'will', 'miss', 'you', 'here', 'in', 'san', 'diego', '!', '!', '!', 'unfortunately', ',', 'he', 'will', 'not', 'be', 'coming']

OFFSET: [(0, 3), (3, 4), (5, 8), (9, 10), (11, 15), (16, 20), (21, 24), (25, 29), (30, 32), (33, 36), (37, 42), (42, 43), (43, 44), (44, 45), (46, 59), (59, 60), (61, 63), (64, 68), (69, 72), (73, 75), (76, 82)]
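
Each offset is a half-open character span into the original tweet, so slicing the tweet with it recovers the surface text behind a token. A small check, with the first five tokens and offsets copied from the output above:

```python
tweet = "Sooo SAD I will miss you here in San Diego!!! Unfortunately, he will not be coming"
tokens = ["soo", "##o", "sad", "i", "will"]
offsets = [(0, 3), (3, 4), (5, 8), (9, 10), (11, 15)]

for token, (start, end) in zip(tokens, offsets):
    # The slice keeps the original casing; the token is the lowercased WordPiece
    print(f"{token:>5} -> {tweet[start:end]!r}")

pieces = [tweet[start:end] for start, end in offsets]
```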



### Calculating Offsets based on tokens

The function below calculates offsets from the tokens alone. However, it fails when a token is just a single character. For example:

```
He is studying! Oh no, he's playing!
```
Here, "he's" is split so that "s" becomes a token of its own. The substring search for "s" matches the "s" in "is" first, so the offset generated for "s" is wrong.


In [None]:
def find_all_indexes(input_str, substring):
    l2 = []
    length = len(input_str)
    index = 0
    while index < length:
        i = input_str.find(substring, index)
        if i == -1:
            return l2
        l2.append(i)
        index = i + 1
    return l2

In [None]:
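offsets = []
counts = {}  # occurrences seen so far for each repeated token
for idx, x in enumerate(enc.tokens):
    # Remove only the leading "##" continuation marker (BERT);
    # str.strip("##") would also remove "#" characters at the end of a token
    y = x[2:] if x.startswith("##") else x
    counts[y] = counts.get(y, -1) + 1
    o1 = find_all_indexes(tweet.lower(), y)[counts[y]]
    if x.startswith("##"):
        # A continuation piece starts where the previous piece ended
        o1 = offsets[idx-1][1]
    o2 = o1+len(y)
    offsets.append((o1, o2))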

In [None]:
print(offsets)

[(0, 3), (3, 4), (5, 8), (9, 10), (11, 15), (16, 20), (21, 24), (25, 29), (30, 32), (33, 36), (37, 42), (42, 43), (43, 44), (44, 45)]
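
One way around the single-character problem is to search from a moving cursor instead of counting repeats. The following is only a sketch: `offsets_from_tokens` is a hypothetical helper, the token list is written by hand for illustration, and the function assumes every token appears in order in the lowercased text (so tokens such as `[UNK]` are not handled).

```python
def offsets_from_tokens(text, tokens):
    """Recover (start, end) character spans by scanning left to right."""
    lowered = text.lower()
    cursor = 0
    spans = []
    for token in tokens:
        piece = token[2:] if token.startswith("##") else token
        start = lowered.find(piece, cursor)   # never look behind the cursor
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(piece)
        spans.append((start, end))
        cursor = end
    return spans


text = "He is studying! Oh no, he's playing!"
# Hand-written tokens for illustration, not tokenizer output
tokens = ["he", "is", "studying", "!", "oh", "no", ",", "he", "'", "s", "playing", "!"]
spans = offsets_from_tokens(text, tokens)
print(spans[9], repr(text[spans[9][0]:spans[9][1]]))
```

Because the search never goes back before the end of the previous token, the lone `s` resolves to the one in "he's" rather than the one in "is".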


In [None]:
target_idx = []
for i, (o1, o2) in enumerate(enc.offsets):
    if sum(intersection[o1: o2]) > 0:
        print(o1, o2, enc.tokens[i])
        target_idx.append(i)
    
target_start = target_idx[0]
target_end = target_idx[-1]

0 3 soo
3 4 ##o
5 8 sad


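The selected text may cover only part of a word, so the character indices `idx_start` and `idx_end` calculated previously cannot be used directly as targets. Instead, the targets are recalculated at the token level so that every token overlapping the selection is included in full.

Try changing the selected text to `Sooo SA` and compare.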
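
The comparison can be sketched without the tokenizer by reusing the first few offsets printed earlier. `token_span` is a hypothetical helper name introduced for this illustration.

```python
def token_span(text, selected, offsets):
    """Indices of the first and last token overlapping `selected` in `text`."""
    start = text.find(selected)
    if start == -1:
        raise ValueError("selected text does not occur in text")
    end = start + len(selected)
    # A token overlaps when its half-open span intersects [start, end)
    hits = [i for i, (o1, o2) in enumerate(offsets) if o1 < end and o2 > start]
    return hits[0], hits[-1]


tweet = "Sooo SAD I will miss you here in San Diego!!!"
offsets = [(0, 3), (3, 4), (5, 8), (9, 10), (11, 15)]   # first five tokens only

full = token_span(tweet, "Sooo SAD", offsets)
partial = token_span(tweet, "Sooo SA", offsets)
print(full, partial)
```

Both calls give `(0, 2)`: the partial word `SA` still maps to the whole token `sad`.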

In [None]:
print(f"Target start token index: {target_start}")
print(f"Target end token index: {target_end}")

Target start token index: 0
Target end token index: 2


In [None]:
with open("bert-base-uncased-vocab.txt", "r") as file:
    vocab = [x.strip() for x in file.readlines()]

In [None]:
vocab[101], vocab[102], vocab[3893], vocab[4997], vocab[8699]

('[CLS]', '[SEP]', 'positive', 'negative', 'neutral')

In [None]:
sentiment_map = {'positive': 3893, 
                 'negative': 4997,
                 'neutral': 8699,
}

BERT question answering uses the following input format:

`[CLS][q1, q2, q3, ....][SEP][c1, c2, c3, ....][SEP]`

* `[q1, q2, q3, ...]` are the token ids for question tokens
* `[c1, c2, c3, ...]` are the token ids for context tokens
* `[CLS]` - Classification token (used for NSP during pre-training)
* `[SEP]` - Separator token
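
The layout above can be assembled in a few lines. This is a sketch with ids taken from this notebook's vocabulary lookups (`101` for `[CLS]`, `102` for `[SEP]`, `4997` for `negative`); `build_bert_inputs` is a hypothetical helper, and it mirrors the cells in this section, including segment id 0 on the final `[SEP]`.

```python
CLS_ID, SEP_ID = 101, 102


def build_bert_inputs(question_ids, context_ids):
    """Assemble [CLS] question [SEP] context [SEP] with matching masks."""
    input_ids = [CLS_ID] + question_ids + [SEP_ID] + context_ids + [SEP_ID]
    # Segment 0 for [CLS], the question and the first [SEP]; 1 for the context
    n_prefix = len(question_ids) + 2
    token_type_ids = [0] * n_prefix + [1] * len(context_ids) + [0]
    attention_mask = [1] * len(input_ids)
    return input_ids, token_type_ids, attention_mask, n_prefix


context_ids = [17111, 2080, 6517]          # 'soo', '##o', 'sad'
ids, types, mask, shift = build_bert_inputs([4997], context_ids)
print(ids)
print(shift)   # how far the context token indices move to the right
```

The returned `shift` is the amount by which any token index computed on the context alone has to be increased.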

In [None]:
input_ids = [101] + [sentiment_map[sentiment]] + [102] + enc.ids + [102]

In [None]:
input_type_ids = [0, 0, 0] + [1]*len(enc.ids) + [0]

In [None]:
attention_mask = [1]*(len(enc.ids)+4)

In [None]:
# Offsets for [CLS] [sentiment] [SEP] followed by actual offsets & [SEP] at end
offsets = [(0, 0), (0, 0), (0, 0)]+enc.offsets+[(0, 0)]

In [None]:
target_start += 3
target_end += 3

Because `[CLS] [sentiment] [SEP]` was added before the actual token ids, the target start and end indices shift right by 3 tokens.

In [None]:
print(f"Input IDS: {input_ids}\n")
print(f"Tokens: {' '.join([vocab[i] for i in input_ids])}\n")
print(f"Input Type IDS: {input_type_ids}\n")
print(f"Offsets: {offsets}\n")
print(f"Start Target Index: {target_start}\tEnd Target Index: {target_end}")

Input IDS: [101, 4997, 102, 17111, 2080, 6517, 1045, 2097, 3335, 2017, 2182, 1999, 2624, 5277, 999, 999, 999, 102]

Tokens: [CLS] negative [SEP] soo ##o sad i will miss you here in san diego ! ! ! [SEP]

Input Type IDS: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

Offsets: [(0, 0), (0, 0), (0, 0), (0, 3), (3, 4), (5, 8), (9, 10), (11, 15), (16, 20), (21, 24), (25, 29), (30, 32), (33, 36), (37, 42), (42, 43), (43, 44), (44, 45), (0, 0)]

Start Target Index: 3	End Target Index: 5


In [None]:
MAX_LEN = 512 # hyperparameter

In [None]:
# Measure against the full sequence, which already contains the special tokens
padding_length = MAX_LEN - len(input_ids)

In [None]:
if padding_length > 0:
    input_ids = input_ids+([0]*padding_length)
    attention_mask = attention_mask+([0]*padding_length)
    input_type_ids = input_type_ids+([0]*padding_length)
    offsets = offsets+([(0, 0)]*padding_length)
elif padding_length < 0:
    # adding [SEP] token at the end
    input_ids = input_ids[:padding_length-1]+[102]
    attention_mask = attention_mask[:padding_length-1]+[1]
    input_type_ids = input_type_ids[:padding_length-1]+[1]
    offsets = offsets[:padding_length-1]+[(0, 0)]
    if target_start >= MAX_LEN:
        target_start = MAX_LEN - 1
    if target_end >= MAX_LEN:
        target_end = MAX_LEN - 1
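
The padding and truncation branches can be wrapped in one helper that works on any of the parallel lists. A minimal sketch, assuming a hypothetical `pad_or_truncate` function and assuming that a truncated list should end with a closing value such as the `[SEP]` id:

```python
def pad_or_truncate(values, max_len, pad_value, last_value=None):
    """Return a copy of `values` cut or padded to exactly `max_len` items."""
    if len(values) > max_len:
        cut = values[:max_len]
        if cut and last_value is not None:
            cut[-1] = last_value       # keep a closing token after truncation
        return cut
    return values + [pad_value] * (max_len - len(values))


ids = [101, 4997, 102, 17111, 2080, 6517, 102]
padded = pad_or_truncate(ids, 10, pad_value=0)
clipped = pad_or_truncate(ids, 5, pad_value=0, last_value=102)
print(padded)
print(clipped)
```

The same call with `pad_value=(0, 0)` would cover the offsets list, and `pad_value=0` with `last_value=1` the attention mask.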

## RoBERTa

[Abhishek Thakur's Kernel](https://www.kaggle.com/abhishek/roberta-inference-5-folds)

RoBERTa uses byte-level Byte Pair Encoding (BPE), the same scheme used by OpenAI's GPT-2 model.

RoBERTa builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

RoBERTa doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separator token `</s>`.

Special tokens in RoBERTa differ from BERT:
* `</s>` - Separator token, end-of-sequence (*eos*) token
* `<s>` - CLS token, beginning-of-sequence (*bos*) token
* `<pad>` - Padding token

A RoBERTa sequence has the following format:

* *single sequence:* `<s> X </s>`

* *pair of sequences:* `<s> A </s></s> B </s>`

Notice the **space** at the beginning and the end
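
The pair format can be sketched as follows. The ids are the ones this notebook reads from the RoBERTa vocabulary and tokenizer output further down (`0` for `<s>`, `2` for `</s>`, `33407` for `negative`); `build_roberta_inputs` is a hypothetical helper.

```python
BOS_ID, EOS_ID = 0, 2


def build_roberta_inputs(question_ids, context_ids):
    """Assemble <s> question </s></s> context </s>; no token type ids needed."""
    input_ids = [BOS_ID] + question_ids + [EOS_ID, EOS_ID] + context_ids + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    shift = len(question_ids) + 3      # <s>, the question, </s>, </s>
    return input_ids, attention_mask, shift


ids, mask, shift = build_roberta_inputs([33407], [98, 3036, 5074])
print(ids)
print(shift)
```

With a one-token question the context starts at position 4, one later than in the BERT layout because of the doubled `</s>`.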

In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json

In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt

In [None]:
tweet = "Sooo SAD I will miss you here in San Diego!!!"
selected_text = "Sooo SAD"
sentiment = "negative"

Add a space at the start

In [None]:
tweet = " " + " ".join(str(tweet).split())
selected_text = " " + " ".join(str(selected_text).split())

In [None]:
idx_start, idx_end = None, None

In [None]:
# comparing from index 1
for index in (i for i, c in enumerate(tweet) if c == selected_text[1]):
    if " "+tweet[index:index+len(selected_text)-1] == selected_text:
        idx_start = index
        idx_end = index + len(selected_text)-1
        break

In [None]:
idx_start, idx_end, tweet[idx_start: idx_end]

(1, 9, 'Sooo SAD')
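
The whitespace normalisation and the search can also be written with `str.find`. A sketch, where `find_span_with_prefix` is a hypothetical name and the extra spaces in the sample input are added only to show the normalisation:

```python
def find_span_with_prefix(tweet, selected_text):
    """Collapse whitespace, add a leading space, and locate the selection."""
    tweet = " " + " ".join(tweet.split())
    selected = " ".join(selected_text.split())
    start = tweet.find(selected)
    if start == -1:
        raise ValueError("selected text does not occur in tweet")
    return tweet, start, start + len(selected)


tweet, idx_start, idx_end = find_span_with_prefix(
    "Sooo   SAD I will miss you here in San Diego!!!", "Sooo SAD"
)
print(idx_start, idx_end, repr(tweet[idx_start:idx_end]))
```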

In [None]:
intersection = [0]*len(tweet)

In [None]:
for idx in range(idx_start, idx_end):
    intersection[idx] = 1

In [None]:
tokenizer = ByteLevelBPETokenizer(vocab_file="roberta-base-vocab.json",
                                  merges_file="roberta-base-merges.txt", 
                                  lowercase=True, add_prefix_space=True)

In [None]:
enc = tokenizer.encode(tweet)

In [None]:
print(f"IDS: {enc.ids}\n")
print(f"TOKENS: {enc.tokens}\n")
print(f"OFFSET: {enc.offsets}\n")

IDS: [98, 3036, 5074, 939, 40, 2649, 47, 259, 11, 15610, 1597, 2977, 16506]

TOKENS: ['Ġso', 'oo', 'Ġsad', 'Ġi', 'Ġwill', 'Ġmiss', 'Ġyou', 'Ġhere', 'Ġin', 'Ġsan', 'Ġdie', 'go', '!!!']

OFFSET: [(0, 3), (3, 5), (5, 9), (9, 11), (11, 16), (16, 21), (21, 25), (25, 30), (30, 33), (33, 37), (37, 41), (41, 43), (43, 46)]



In [None]:
target_idx = []
for i, (o1, o2) in enumerate(enc.offsets):
    if sum(intersection[o1: o2]) > 0:
        print(o1, o2, enc.tokens[i])
        target_idx.append(i)
    
target_start = target_idx[0]
target_end = target_idx[-1]

0 3 Ġso
3 5 oo
5 9 Ġsad
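
Offsets also give the way back from token indices to text, which is how a predicted token span would be turned into a string. A sketch using the first few offsets printed above; `span_to_text` is a hypothetical helper and the final `strip()` is an assumption about how the leading space should be handled.

```python
def span_to_text(text, offsets, start_token, end_token):
    """Slice `text` from the start of one token to the end of another."""
    begin = offsets[start_token][0]
    end = offsets[end_token][1]
    return text[begin:end].strip()


tweet = " Sooo SAD I will miss you here in San Diego!!!"
offsets = [(0, 3), (3, 5), (5, 9), (9, 11), (11, 16)]   # first five tokens

recovered = span_to_text(tweet, offsets, 0, 2)
print(repr(recovered))
```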


In [None]:
with open("roberta-base-vocab.json", "r") as file:
    vocab = json.load(file)

In [None]:
print(f"Positive: {vocab['positive']}")
print(f"Negative: {vocab['negative']}")
print(f"Neutral: {vocab['neutral']}")
print(f"BOS: {vocab['<s>']}")
print(f"EOS: {vocab['</s>']}")

Positive: 22173
Negative: 33407
Neutral: 12516
BOS: 0
EOS: 2


Recall the format for a pair of sequences: `<s> A </s></s> B </s>`

In [None]:
# Look the sentiment id up in the RoBERTa vocab; sentiment_map still holds BERT ids
input_ids = [0] + [vocab[sentiment]] + [2] + [2] + enc.ids + [2]

RoBERTa doesn't have *token_type_ids*, so there is no need to indicate which token belongs to which segment.

In [None]:
attention_mask = [1]*(len(enc.ids)+5)

In [None]:
# Offsets for <s> [sentiment] </s> </s> followed by actual offsets & </s> at end
offsets = [(0, 0), (0, 0), (0, 0), (0, 0)]+enc.offsets+[(0, 0)]

In [None]:
target_start += 4
target_end += 4

In [None]:
MAX_LEN = 512 # hyperparameter

In [None]:
# Measure against the full sequence, which already contains the special tokens
padding_length = MAX_LEN - len(input_ids)

In [None]:
if padding_length > 0:
    input_ids = input_ids+([1]*padding_length) # {<pad>: 1}
    attention_mask = attention_mask+([0]*padding_length)
    offsets = offsets+([(0, 0)]*padding_length)

## XLNet

[Kaggle Kernel](https://www.kaggle.com/abhishek/sentencepiece-tokenizer-with-offsets/notebook)

XLNet uses the SentencePiece tokenizer.

* Pre-tokenization is not required
* No language-dependent logic
* BPE and unigram language model supported
* Same tokenization/detokenization is obtained as long as the same model is used

SentencePiece has four components:
* Normalizer
* Trainer
* Encoder
* Decoder

$$\mathrm{Decode}(\mathrm{Encode}(\mathrm{Normalize}(\mathit{text}))) = \mathrm{Normalize}(\mathit{text})$$



In [None]:
# !pip install tensorflow_text sentencepiece

In [None]:
# !wget https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip

In [None]:
!unzip cased_L-12_H-768_A-12.zip

Archive:  cased_L-12_H-768_A-12.zip
   creating: xlnet_cased_L-12_H-768_A-12/
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.index  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.data-00000-of-00001  
  inflating: xlnet_cased_L-12_H-768_A-12/spiece.model  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.meta  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_config.json  


In [None]:
tweet = "Sooo SAD I will miss you here in San Diego!!!"
selected_text = "Sooo SAD"
sentiment = "negative"

In [None]:
import sentencepiece as spm

In [None]:
from sentencepiece_pb2 import SentencePieceText

In [None]:
class SentencePieceTokenizer:
    def __init__(self, model_path):
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(os.path.join(model_path, "spiece.model"))
    
    def encode(self, sentence):
        spt = SentencePieceText()
        spt.ParseFromString(self.sp.encode_as_serialized_proto(sentence))
        tokenized_str = self.sp.encode(sentence, out_type=str)
        offsets = []
        tokens = []
        for piece in spt.pieces:
            tokens.append(piece.id)
            offsets.append((piece.begin, piece.end))
        return tokens, offsets, tokenized_str

In [None]:
spt = SentencePieceTokenizer(model_path="xlnet_cased_L-12_H-768_A-12")

In [None]:
tokens, offsets, tokenized_str = spt.encode(tweet)

In [None]:
print(f"Tokens: {tokens}\n")
print(f"Offsets: {offsets}\n")
print(f"Tokenized String: {tokenized_str}")

Tokens: [346, 5449, 4763, 417, 35, 53, 3706, 44, 193, 25, 647, 4223, 12791]

Offsets: [(0, 2), (2, 4), (4, 7), (7, 8), (8, 10), (10, 15), (15, 20), (20, 24), (24, 29), (29, 32), (32, 36), (36, 42), (42, 45)]

Tokenized String: ['▁So', 'oo', '▁SA', 'D', '▁I', '▁will', '▁miss', '▁you', '▁here', '▁in', '▁San', '▁Diego', '!!!']
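
The pieces above show why decoding is lossless: SentencePiece writes whitespace as the ordinary symbol `▁` (U+2581), so joining the pieces and mapping `▁` back to a space recovers the text. A small check on the printed pieces, without loading the model; removing the leading space at the end is an assumption about the dummy prefix added to the first word.

```python
pieces = ['▁So', 'oo', '▁SA', 'D', '▁I', '▁will', '▁miss', '▁you',
          '▁here', '▁in', '▁San', '▁Diego', '!!!']

# Concatenate the pieces, then turn the whitespace marker back into spaces
joined = "".join(pieces).replace("\u2581", " ")
restored = joined[1:] if joined.startswith(" ") else joined
print(repr(restored))
```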


In [None]:
for index in (i for i, c in enumerate(tweet) if c == selected_text[0]):
    if tweet[index:index+len(selected_text)] == selected_text:
        idx_start = index
        idx_end = index + len(selected_text)
        break

In [None]:
idx_start, idx_end, tweet[idx_start: idx_end]

(0, 8, 'Sooo SAD')

In [None]:
intersection = [0]*len(tweet)

In [None]:
for idx in range(idx_start, idx_end):
    intersection[idx] = 1

In [None]:
target_idx = []
for i, (o1, o2) in enumerate(offsets):
    if sum(intersection[o1: o2]) > 0:
        target_idx.append(i)
    
target_start = target_idx[0]
target_end = target_idx[-1]


An XLNet sequence has the following format:

*single sequence:* `X <sep> <cls>`

*pair of sequences:* `A <sep> B <sep> <cls>`
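
A sketch of the pair layout, using the special token ids that this notebook works with (`4` for `<sep>`, `3` for `<cls>`) and `2981` for `negative`; `build_xlnet_inputs` is a hypothetical helper.

```python
SEP_ID, CLS_ID = 4, 3


def build_xlnet_inputs(first_ids, second_ids):
    """Assemble A <sep> B <sep> <cls>; the special tokens come at the end."""
    input_ids = first_ids + [SEP_ID] + second_ids + [SEP_ID, CLS_ID]
    attention_mask = [1] * len(input_ids)
    return input_ids, attention_mask


ids, mask = build_xlnet_inputs([346, 5449, 4763, 417], [2981])
print(ids)
```

When the tweet is passed as the first sequence, nothing is placed in front of it, so the target token indices need no shift.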

In [None]:
print(f"Positive: {spt.encode('positive')[0][0]}")
print(f"Negative: {spt.encode('negative')[0][0]}")
print(f"Neutral: {spt.encode('neutral')[0][0]}")

Positive: 1654
Negative: 2981
Neutral: 9201


Special token ids in the XLNet vocabulary:

* `<s>` (BOS) - 1
* `</s>` (EOS) - 2
* `<cls>` (CLS) - 3
* `<sep>` (SEP) - 4
* `<pad>` (PAD) - 5

In [None]:
sentiment_map = {'positive': 1654, 
                 'negative': 2981,
                 'neutral': 9201
                 }

In [None]:
input_ids = tokens + [4] + [sentiment_map[sentiment]] + [4] + [3] 

In [None]:
token_type_ids = [1]*len(tokens)+[0]*4

In [None]:
attention_mask = [1]*(len(tokens)+4)

In [None]:
# One (0, 0) pair per added special token; (0, 0)*4 would build a single 8-tuple
offsets = offsets+[(0, 0)]*4

In [None]:
print(f"Input IDS: {input_ids}\n")
print(f"Input Type IDS: {token_type_ids}\n")
print(f"Offsets: {offsets}\n")
print(f"Start Target Index: {target_start}\tEnd Target Index: {target_end}")

Input IDS: [346, 5449, 4763, 417, 35, 53, 3706, 44, 193, 25, 647, 4223, 12791, 4, 2981, 4, 3]

Input Type IDS: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Offsets: [(0, 2), (2, 4), (4, 7), (7, 8), (8, 10), (10, 15), (15, 20), (20, 24), (24, 29), (29, 32), (32, 36), (36, 42), (42, 45), (0, 0), (0, 0), (0, 0), (0, 0)]

Start Target Index: 0	End Target Index: 3


In [None]:
MAX_LEN = 192
# Measure against the full sequence, which already contains the special tokens
padding_length = MAX_LEN - len(input_ids)

In [None]:
if padding_length > 0:
    input_ids = input_ids+([5]*padding_length)
    attention_mask = attention_mask+([0]*padding_length)
    token_type_ids = token_type_ids+([0]*padding_length)
    offsets = offsets+([(0, 0)]*padding_length)