


# Using RoBERTa Tokenizer and RoBERTa Model from Hugging Face

Resources:

- https://huggingface.co/course/chapter6/6?fw=pt
- https://towardsdatascience.com/a-beginners-guide-to-use-bert-for-the-first-time-2e99b8c5423
- http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
- https://huggingface.co/docs/transformers/model_doc/roberta
- https://jesusleal.io/2020/10/20/RoBERTA-Text-Classification/

Next Steps:

- https://www.analyticsvidhya.com/blog/2021/06/why-and-how-to-use-bert-for-nlp-text-classification/
- https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/

# Aggregated Preprocessing Steps

In [1]:
import pandas as pd

In [2]:
!pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.72-py2.py3-none-any.whl (8.3 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.72 pyahocorasick-1.4.4 textsearch-0.0.24


In [3]:
# Convert all reviews to lower case (optional according to study)
def to_lower(data: pd.Series):
    return data.str.lower()


In [4]:
def remove_accented_characters(data: pd.Series):
    """Removes accented characters from the Series

    Args:
        data (pd.Series): Series of string

    Returns:
        _type_: pd.Series
    """
    import unicodedata

    return data.apply(lambda x: unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("utf-8", "ignore"))
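
To see what the NFKD-then-ASCII round trip does to a single string, here is a minimal standalone sketch that uses only the standard library. The helper name `strip_accents` and the sample strings are illustrative and not part of the notebook.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks,
    # along with any other character that has no ASCII form
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("café résumé"))
```

Characters that have no decomposition into an ASCII base letter (for example "ß") are dropped entirely rather than transliterated, so this step can shorten words.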


In [5]:
def remove_html_encodings(data: pd.Series):
  return data.str.replace(r"&#\d+;", " ", regex=True)

In [6]:
def remove_html_tags(data: pd.Series):
  return data.str.replace(r"<[a-zA-Z]+\s?/?>", " ", regex=True)

In [7]:
def remove_url(data: pd.Series):
  return data.str.replace(r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+", " ", regex=True)

In [8]:
def remove_html_and_url(data: pd.Series):
    """Function to remove
             1. HTML encodings
             2. HTML tags (both closed and open)
             3. URLs

    Args:
        data (pd.Series): A Pandas series of type string

    Returns:
        _type_: pd.Series
    """
    # `str.replace` returns a new Series, so each result must be assigned back

    # Remove HTML encodings
    data = data.str.replace(r"&#\d+;", " ", regex=True)

    # Remove HTML tags (both open and closed)
    data = data.str.replace(r"<[a-zA-Z]+\s?/?>", " ", regex=True)

    # Remove URLs
    data = data.str.replace(r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+", " ", regex=True)

    return data


In [9]:
# Remove non-alphanumeric characters (letters, digits and whitespace are kept)
def remove_non_alpha_characters(data: pd.Series):
    return data.str.replace(r"_+|\\|[^a-zA-Z0-9\s]", " ", regex=True)


In [10]:
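# Collapse runs of whitespace into a single space.
# Note: `^\s*` also matches the empty string at the start of a value, so every
# cleaned string begins with one space; trailing whitespace is collapsed, not removed.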
def remove_extra_spaces(data: pd.Series):
    return data.str.replace(r"^\s*|\s\s*", " ", regex=True)
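
The pattern in the cell above also matches the empty string at the start of each value, so cleaned strings begin with a space (as in the `cleaned_data["sentences"][2]` output further down). If leading and trailing spaces are not wanted, one possible alternative is sketched below on a plain string. The helper name `collapse_spaces` is hypothetical; this is not the function the notebook ran.

```python
import re


def collapse_spaces(text: str) -> str:
    # Replace every run of whitespace with one space, then trim both ends
    return re.sub(r"\s+", " ", text).strip()


print(repr(collapse_spaces("  spotify   may\tchange  ")))
```

On a `pd.Series`, the same idea would be a `str.replace` with `r"\s+"` followed by `str.strip()`.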


In [11]:
# Expanding contractions
def fix_contractions(data: pd.Series):
    import contractions

    def contraction_fixer(txt: str):
        return " ".join([contractions.fix(word) for word in txt.split()])

    return data.apply(contraction_fixer)


In [12]:
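# Intended to remove bracket placeholders such as "-lrb-" and "-rrb-".
# Note: `[^a-zA-Z]` is a negated class, so the pattern matches three non-letters
# between hyphens and leaves "-lrb-" / "-rrb-" in place; the hyphens are later
# stripped by remove_non_alpha_characters, which is why "lrb" and "rrb" survive cleaning.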
def remove_special_words(data: pd.Series):
  return data.str.replace(r"\-[^a-zA-Z]{3}\-", " ", regex=True)
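
A minimal sketch of a pattern that does match the bracket placeholders, shown on a plain string with the standard library. It assumes the placeholders are the lower-cased Penn Treebank bracket escapes (`-lrb-`, `-rrb-`, `-lsb-`, `-rsb-`, `-lcb-`, `-rcb-`); the helper name `remove_bracket_tokens` and the sample sentence are illustrative.

```python
import re

# Assumed set of placeholders: left/right round, square and curly brackets
BRACKET_TOKEN = re.compile(r"-(?:l|r)(?:r|s|c)b-")


def remove_bracket_tokens(text: str) -> str:
    # Replace each placeholder with a space; other hyphenated words are untouched
    return BRACKET_TOKEN.sub(" ", text)


sample = "pre-paid period -lrb- for periods not yet paid for -rrb- or codes"
print(remove_bracket_tokens(sample))
```

Listing the placeholders explicitly avoids the side effect of a generic "hyphen, three letters, hyphen" pattern, which would also cut the middle out of phrases such as "state-of-the-art".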

In [13]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


### Load Data

In [14]:
train_dataset_path = "/content/drive/MyDrive/nlp_data/tos_clauses_train.csv"
test_dataset_path = "/content/drive/MyDrive/nlp_data/tos_clauses_dev.csv"

In [15]:
train_df = pd.read_csv(train_dataset_path, header=0)
test_df = pd.read_csv(test_dataset_path, header=0)

In [16]:
train_df.head()

Unnamed: 0,label,sentences
0,0,content license and intellectual property rights
1,0,reactivated skype credit is not refundable .
2,1,spotify may change the price for the paid subs...
3,0,the term of your licenses under this eula shal...
4,0,the arbitrator may award declaratory or injunc...


In [17]:
# A dictionary containing the columns and a list of functions to perform on it in order
data_cleaning_pipeline = {
    "sentences": [
        to_lower,
        remove_special_words,
        remove_accented_characters,
        remove_html_encodings,
        remove_html_tags,
        remove_url,
        fix_contractions,
        remove_non_alpha_characters,
        remove_extra_spaces,
    ]
}

cleaned_data = train_df.copy()

# Process all the cleaning instructions
for col, pipeline in data_cleaning_pipeline.items():
    # Get the column to perform cleaning on
    temp_data = cleaned_data[col].copy()

    # Perform all the cleaning functions sequentially
    for func in pipeline:
        print(f"Starting: {func.__name__}")
        temp_data = func(temp_data)
        print(f"Ended: {func.__name__}")

    # Replace the old column with cleaned one.
    cleaned_data[col] = temp_data.copy()


Starting: to_lower
Ended: to_lower
Starting: remove_special_words
Ended: remove_special_words
Starting: remove_accented_characters
Ended: remove_accented_characters
Starting: remove_html_encodings
Ended: remove_html_encodings
Starting: remove_html_tags
Ended: remove_html_tags
Starting: remove_url
Ended: remove_url
Starting: fix_contractions
Ended: fix_contractions
Starting: remove_non_alpha_characters
Ended: remove_non_alpha_characters
Starting: remove_extra_spaces
Ended: remove_extra_spaces
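
The loop above is a left-to-right function composition: each step receives the output of the previous one, so the order of the list matters. A minimal sketch of the same idea on a single string, with hypothetical names (`apply_pipeline`, `steps`) and two built-in string methods standing in for the cleaning functions:

```python
from functools import reduce
from typing import Callable, List


def apply_pipeline(value: str, steps: List[Callable[[str], str]]) -> str:
    # Feed the output of each step into the next, in list order
    return reduce(lambda acc, step: step(acc), steps, value)


steps = [str.lower, str.strip]
print(apply_pipeline("  Content License  ", steps))
```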


In [18]:
cleaned_data.head()

Unnamed: 0,label,sentences
0,0,content license and intellectual property rights
1,0,reactivated skype credit is not refundable
2,1,spotify may change the price for the paid sub...
3,0,the term of your licenses under this eula sha...
4,0,the arbitrator may award declaratory or injun...


In [19]:
cleaned_data["sentences"][2]

' spotify may change the price for the paid subscriptions pre paid period lrb for periods not yet paid for rrb or codes from time to time and will communicate any price changes to you in advance and if applicable how to accept those changes '

### Using RoBERTa Tokenizer and RoBERTa Model to get Embeddings

In [20]:
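# Note: these are BERT-style special tokens. The roberta-base vocabulary uses
# "<s>", "</s>" and "<pad>", so "[PAD]" is not a known token and is converted
# to the unknown-token ID when it is used for padding below.
cls = "[CLS]"
sep = "[SEP]"
pad = "[PAD]"
bert_pad_len = 512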

In [21]:
import logging
import torch
import numpy as np
import warnings
from transformers import RobertaTokenizer, RobertaModel

In [22]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


### Functions

In [23]:
def create_tensors_ROBERTA(text):
  """
    Tokenize every string in the pd.Series with the RoBERTa tokenizer
  """
  print("Tokenizing text...")
  logging.basicConfig(level = logging.INFO)

  # Load the `roberta-base` tokenizer
  tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

  # Tokenize every sentence in the pd.Series
  tokenized_text = [tokenizer.tokenize(x) for x in text]

  # Pad the tokens to a fixed length so they can be stacked into one tensor
  # (sentences longer than `bert_pad_len` tokens are not truncated here)
  tokenized_text = [x + ([pad] * (bert_pad_len - len(x))) for x in tokenized_text]

  # Convert the tokens to their IDs
  indexed_text = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_text]

  # One value of 1 per position (padding included); get_embeddings passes this
  # tensor as the model's second positional argument
  segment_ids = [[1] * len(x) for x in tokenized_text]

  # Convert to tensor
  torch_idx_text = torch.LongTensor(indexed_text)
  torch_seg_ids = torch.LongTensor(segment_ids)

  return tokenized_text, torch_idx_text, torch_seg_ids
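
Hugging Face models accept an `attention_mask` in which 1 marks a real token and 0 marks padding to be ignored. The sketch below shows, in plain Python, how a fixed-length row and its mask could be built together. The helper `pad_with_mask` is hypothetical, and `pad_id=1` is only a placeholder value; in practice the ID would be read from `tokenizer.pad_token_id`.

```python
from typing import List, Tuple


def pad_with_mask(token_ids: List[int], pad_id: int, max_len: int) -> Tuple[List[int], List[int]]:
    # Truncate so that over-long inputs still produce a fixed-length row
    ids = token_ids[:max_len]
    n_pad = max_len - len(ids)
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask


ids, mask = pad_with_mask([10, 11, 12], pad_id=1, max_len=5)
print(ids, mask)
```

Calling the tokenizer object directly with `padding="max_length"`, `truncation=True` and `return_tensors="pt"` produces both tensors in one step and is an alternative to padding by hand.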

In [24]:
# Takes the index and segment tensors and returns the RoBERTa embeddings as a list
def get_embeddings(torch_idx_text, torch_seg_ids):
    """
      Create RoBERTa embeddings from tokens
    """
    print("Getting Embeddings...")

    # Load pretrained `roberta-base` model, and set to inference
    model = RobertaModel.from_pretrained('roberta-base', output_hidden_states = True)
    model.eval()

    # Keep the full tensors on the CPU; rows are moved to `device` one at a time
    torch_idx_text, torch_seg_ids = torch_idx_text.to("cpu"), torch_seg_ids.to("cpu")
    model.to(device)

    # Disable gradients and get RoBERTa embeddings
    with torch.no_grad():
        roberta_embeddings = []
        for i in range(len(torch_idx_text)):
            print(i, end = "\r")
            text_temp = torch.unsqueeze(torch_idx_text[i], dim = 0).to(device)
            sgmt_temp = torch.unsqueeze(torch_seg_ids[i], dim = 0).to(device)
            # The second positional argument of RobertaModel.forward is
            # `attention_mask`, so the all-ones tensor means padding is attended to
            output = model(text_temp, attention_mask = sgmt_temp)
            # output[0] is the last hidden state: one vector per token position
            roberta_embeddings.append(output[0])
            del text_temp, sgmt_temp
    del model
    return roberta_embeddings

In [25]:
def embeddings_to_words(tokenized_text, embeddings):
    """
      Clubbing same word tokens to reduce dimensionality
      Note: Need to run this locally and tweak virtual memory, as the colab
      runtime crashes
    """
    print("Untokenizing text and embeddings...")
    embeddings = [x.cpu().detach().numpy() for x in embeddings]
    embeddings = np.concatenate(embeddings, axis = 0)
    sentences = []
    final_emb = []

    # Iterate over every sentence
    for i in range(len(tokenized_text)):
        txt = tokenized_text[i]
        sub_len = 0
        sent = []
        sub = []
        emb = []
        sub_emb = None
        # Only the tokens before the first padding token are processed
        try:
            idx = txt.index(pad)
        except ValueError:
            idx = len(txt)
        for j in range(idx):
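            # For a token that starts with ##, remove the ## and
            # club it with the previous token;
            # for the embedding, take the average of the tokens' embeddings.
            # Note: "##" is BERT's WordPiece continuation marker. RoBERTa's
            # tokenizer marks word starts with "Ġ" instead, so this branch
            # is not expected to fire on the RoBERTa tokens produced above.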
            if txt[j].startswith("##"):
                if sub == []:
                    sub.append(sent.pop())
                    emb.pop()
                    sub_emb = embeddings[i][j - 1] + embeddings[i][j]
                    sub.append(txt[j][2:])
                    sub_len = 2
                else:
                    sub.append(txt[j][2:])
                    sub_emb += embeddings[i][j]
                    sub_len += 1
            else:
                if sub != []:
                    sent.append("".join(sub))
                    emb.append(sub_emb / sub_len)
                    sub = []
                    sub_emb = None
                    sub_len = 0
                sent.append(txt[j])
                emb.append(embeddings[i][j])
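        # Flush a word whose last token was a "##" continuation
        if sub != []:
            sent.append("".join(sub))
            emb.append(sub_emb / sub_len)
        sentences.append(sent)
        final_emb.append(emb)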
    return sentences, final_emb
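
RoBERTa's byte-level BPE tokens carry a "Ġ" prefix when they follow a space (as in the `'Ġcontent'` output further down), and continuation pieces have no prefix. A minimal sketch of word-level merging under that convention is given below, in plain Python with lists of floats standing in for embedding vectors. The helper `merge_bpe_tokens` is hypothetical, and the sample tokens and vectors are made up for illustration rather than taken from the tokenizer.

```python
from typing import List, Tuple


def merge_bpe_tokens(
    tokens: List[str], vectors: List[List[float]]
) -> Tuple[List[str], List[List[float]]]:
    words: List[str] = []
    groups: List[List[List[float]]] = []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith("Ġ") or not words:
            # "Ġ" marks a token that follows a space, i.e. the start of a new word
            words.append(tok.lstrip("Ġ"))
            groups.append([vec])
        else:
            # No marker: this piece continues the previous word
            words[-1] += tok
            groups[-1].append(vec)
    # Average the vectors of each word's pieces, dimension by dimension
    merged = [[sum(dim) / len(g) for dim in zip(*g)] for g in groups]
    return words, merged


words, merged = merge_bpe_tokens(
    ["Ġsub", "scri", "ptions", "Ġrights"],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
)
print(words, merged)
```

Collecting the pieces of each word first and averaging at the end also handles a word that ends the sentence, with no separate flush step.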

In [26]:
tokenized_text, torch_idx_text, torch_seg_ids = create_tensors_ROBERTA(cleaned_data.sentences)

Tokenizing text...



In [27]:
roberta_embeddings = get_embeddings(torch_idx_text, torch_seg_ids)


Getting Embeddings...



Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




In [28]:
text, roberta = embeddings_to_words(tokenized_text[:10], roberta_embeddings[:10])


Untokenizing text and embeddings...


In [29]:
text[0], roberta[0]

(['Ġcontent', 'Ġlicense', 'Ġand', 'Ġintellectual', 'Ġproperty', 'Ġrights'],
 [array([-7.76480325e-03,  6.05275147e-02, -3.16386521e-02, -1.49603859e-01,
          5.62893152e-02, -1.38195336e-01, -3.99483331e-02, -4.13960144e-02,
          1.63958266e-01, -1.87585484e-02, -8.19408000e-02, -2.13369122e-03,
          8.09758008e-02, -3.77253210e-03,  3.67125981e-02,  2.09599156e-02,
         -1.48334309e-01,  6.09356165e-02,  4.01592441e-02, -8.78157243e-02,
         -1.10403582e-01,  1.12902811e-02, -1.82267241e-02,  1.90555260e-01,
         -2.18550041e-02,  5.10467514e-02,  7.58060515e-02,  1.12534679e-01,
         -4.63841669e-03,  2.14812186e-04,  2.83237477e-03, -6.74470365e-02,
          9.30240601e-02, -1.25537207e-02,  1.15030587e-01,  4.39038016e-02,
          4.39697877e-02,  7.46276230e-03, -1.58268809e-01,  4.83999774e-02,
         -3.50974090e-02,  1.10237658e-01,  2.47333534e-02,  5.49434684e-02,
          2.99700908e-02,  1.24676591e-02, -2.03984920e-02,  2.99594831e-03,
