# Using the BERT Tokenizer and BERT Model from Hugging Face

Resources:

- https://huggingface.co/course/chapter6/6?fw=pt
- https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model
- https://stackoverflow.com/questions/58636587/how-to-use-bert-for-long-text-classification
- https://towardsdatascience.com/a-beginners-guide-to-use-bert-for-the-first-time-2e99b8c5423
- http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

Next Steps:

- https://www.analyticsvidhya.com/blog/2021/06/why-and-how-to-use-bert-for-nlp-text-classification/
- https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/

# Aggregated Preprocessing Steps

In [1]:
import pandas as pd

In [2]:
!pip install contractions


In [3]:
# Convert all reviews to lower case (optional according to study)
def to_lower(data: pd.Series):
    return data.str.lower()


In [4]:
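def remove_accented_characters(data: pd.Series):
    """Removes accented characters from the Series

    Args:
        data (pd.Series): Series of string

    Returns:
        pd.Series: Series with accented letters reduced to their ASCII base letters
    """
    import unicodedata

    # NFKD splits an accented letter into base letter + combining mark; the ASCII
    # round trip then drops the marks (and any other non-ASCII character).
    return data.apply(lambda x: unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("utf-8", "ignore"))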
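
To see what the NFKD round trip does, here is a minimal standalone sketch on plain strings rather than a `pd.Series`. The helper name `strip_accents` is hypothetical and not part of the notebook. Note that characters without an ASCII decomposition are dropped entirely, which may or may not be desired.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD decomposes "é" into "e" plus a combining accent; encoding to ASCII
    # with errors="ignore" then discards everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("café naïve résumé"))  # cafe naive resume
```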


In [5]:
def remove_html_encodings(data: pd.Series):
  return data.str.replace(r"&#\d+;", " ", regex=True)

In [6]:
def remove_html_tags(data: pd.Series):
  return data.str.replace(r"<[a-zA-Z]+\s?/?>", " ", regex=True)

In [7]:
def remove_url(data: pd.Series):
  return data.str.replace(r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+", " ", regex=True)

In [8]:
def remove_html_and_url(data: pd.Series):
    """Function to remove
             1. HTML encodings
             2. HTML tags (both closed and open)
             3. URLs

    Args:
        data (pd.Series): A Pandas series of type string

    Returns:
        pd.Series: The cleaned series
    """
    # `str.replace` returns a new Series, so each result must be assigned back

    # Remove HTML encodings
    data = data.str.replace(r"&#\d+;", " ", regex=True)

    # Remove HTML tags (both open and closed)
    data = data.str.replace(r"<[a-zA-Z]+\s?/?>", " ", regex=True)

    # Remove URLs
    data = data.str.replace(r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+", " ", regex=True)

    return data
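
The three patterns used above can be checked on a small sample with Python's `re` module. This is only a sketch: the sample string and the `example.com` URL are made up for illustration. It suggests that, as written, the tag pattern matches `<b>` and `<br/>` but not a closing tag such as `</b>`, and the URL pattern needs a path after the host.

```python
import re

HTML_ENTITY = re.compile(r"&#\d+;")
HTML_TAG = re.compile(r"<[a-zA-Z]+\s?/?>")
URL = re.compile(r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+")

# Hypothetical sample text, not taken from the dataset
sample = "fees&#160;apply<br/>see <b>terms</b> at https://example.com/tos?v=2"

step1 = HTML_ENTITY.sub(" ", sample)  # numeric entities only, not "&amp;"
step2 = HTML_TAG.sub(" ", step1)      # opening and self-closing tags
step3 = URL.sub(" ", step2)           # URLs that include a path

print(repr(step3))
print(URL.search("https://example.com"))  # no path, so no match
```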


In [9]:
# Remove non-alphanumeric characters (including underscores and backslashes);
# letters, digits and whitespace are kept
def remove_non_alpha_characters(data: pd.Series):
    return data.str.replace(r"_+|\\|[^a-zA-Z0-9\s]", " ", regex=True)


In [10]:
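# Collapse every run of whitespace into a single space.
# Note: `^\s*` also matches the empty string at the start of the text, so every
# result begins with one space; a trailing run is likewise reduced to one space.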
def remove_extra_spaces(data: pd.Series):
    return data.str.replace(r"^\s*|\s\s*", " ", regex=True)
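
A minimal sketch of how this pattern behaves on a plain string, using `re.sub` in place of the pandas string accessor. The alternative shown (`\s+` followed by `strip()`) is an assumption about what may be wanted, not the notebook's method.

```python
import re

text = "hello   world  "

# Same pattern as above: the empty match at the start is replaced by a space too
as_written = re.sub(r"^\s*|\s\s*", " ", text)
print(repr(as_written))  # ' hello world '

# Possible alternative: collapse runs, then trim both ends
trimmed = re.sub(r"\s+", " ", text).strip()
print(repr(trimmed))  # 'hello world'
```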


In [11]:
# Expanding contractions
def fix_contractions(data: pd.Series):
    import contractions

    def contraction_fixer(txt: str):
        return " ".join([contractions.fix(word) for word in txt.split()])

    return data.apply(contraction_fixer)


In [12]:
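# Intended to remove placeholder tokens such as "-lrb-" and "-rrb-".
# Caution: [^a-zA-Z]{3} matches three NON-letters between the hyphens, so "-lrb-"
# itself is not matched; a letter class, [a-zA-Z]{3}, would match it.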
def remove_special_words(data: pd.Series):
  return data.str.replace(r"\-[^a-zA-Z]{3}\-", " ", regex=True)
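
A quick standalone check, using only Python's `re` module, of how a negated character class and its non-negated counterpart behave on the placeholder `-lrb-`. The sample sentence is a shortened, hypothetical variant of a clause.

```python
import re

negated = re.compile(r"\-[^a-zA-Z]{3}\-")  # three NON-letters between hyphens
letters = re.compile(r"\-[a-zA-Z]{3}\-")   # three letters between hyphens

print(bool(negated.search("-lrb-")))  # False
print(bool(letters.search("-lrb-")))  # True

sample = "pre-paid period -lrb- not yet paid -rrb- or codes"
cleaned = letters.sub(" ", sample)
print(cleaned)
```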

In [13]:
!pip install transformers


In [14]:
from transformers import BertTokenizer

### Load Data

In [16]:
train_dataset_path = "/content/drive/MyDrive/nlp_data/tos_clauses_train.csv"
test_dataset_path = "/content/drive/MyDrive/nlp_data/tos_clauses_dev.csv"

In [17]:
train_df = pd.read_csv(train_dataset_path, header=0)
test_df = pd.read_csv(test_dataset_path, header=0)

In [18]:
train_df.head()

Unnamed: 0,label,sentences
0,0,content license and intellectual property rights
1,0,reactivated skype credit is not refundable .
2,1,spotify may change the price for the paid subs...
3,0,the term of your licenses under this eula shal...
4,0,the arbitrator may award declaratory or injunc...


### Clean the Data

In [19]:
# Maps each column name to the ordered list of cleaning functions to apply to it
data_cleaning_pipeline = {
    "sentences": [
        to_lower,
        remove_special_words,
        remove_accented_characters,
        remove_html_encodings,
        remove_html_tags,
        remove_url,
        fix_contractions,
        remove_non_alpha_characters,
        remove_extra_spaces,
    ]
}

cleaned_data = train_df.copy()

# Process all the cleaning instructions
for col, pipeline in data_cleaning_pipeline.items():
    # Get the column to perform cleaning on
    temp_data = cleaned_data[col].copy()

    # Perform all the cleaning functions sequentially
    for func in pipeline:
        print(f"Starting: {func.__name__}")
        temp_data = func(temp_data)
        print(f"Ended: {func.__name__}")

    # Replace the old column with cleaned one.
    cleaned_data[col] = temp_data.copy()


Starting: to_lower
Ended: to_lower
Starting: remove_special_words
Ended: remove_special_words
Starting: remove_accented_characters
Ended: remove_accented_characters
Starting: remove_html_encodings
Ended: remove_html_encodings
Starting: remove_html_tags
Ended: remove_html_tags
Starting: remove_url
Ended: remove_url
Starting: fix_contractions
Ended: fix_contractions
Starting: remove_non_alpha_characters
Ended: remove_non_alpha_characters
Starting: remove_extra_spaces
Ended: remove_extra_spaces
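
The loop above is essentially a left fold: each function receives the output of the previous one, so the order of the list matters. A minimal sketch of the same idea on plain strings follows; `run_pipeline` and the three toy steps are hypothetical and only stand in for the `pd.Series` functions used in the notebook.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]


def run_pipeline(text: str, steps: Iterable[Step]) -> str:
    # Feed the result of each step into the next one, left to right
    return reduce(lambda acc, step: step(acc), steps, text)


steps = [
    str.lower,                          # normalise case first
    lambda s: s.replace("&", " and "),  # expand a symbol
    lambda s: " ".join(s.split()),      # collapse whitespace last
]

print(run_pipeline("Terms  &  CONDITIONS", steps))  # terms and conditions
```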


In [20]:
cleaned_data.head()

Unnamed: 0,label,sentences
0,0,content license and intellectual property rights
1,0,reactivated skype credit is not refundable
2,1,spotify may change the price for the paid sub...
3,0,the term of your licenses under this eula sha...
4,0,the arbitrator may award declaratory or injun...


In [21]:
cleaned_data["sentences"][2]

' spotify may change the price for the paid subscriptions pre paid period lrb for periods not yet paid for rrb or codes from time to time and will communicate any price changes to you in advance and if applicable how to accept those changes '

### Using BERT Tokenizer and BERT Model to get Embeddings

In [22]:
cls = "[CLS]"
sep = "[SEP]"
pad = "[PAD]"
bert_pad_len = 512

In [23]:
import logging
import torch
import numpy as np
import warnings
from transformers import BertTokenizer, BertModel

In [24]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


### Functions

In [25]:
def create_tensors_BERT(text):
  """
    Tokenize every sentence of the pd.Series with the BERT tokenizer and return
    the padded tokens, their IDs and the segment IDs (the last two as tensors).
    Note: no [CLS] / [SEP] tokens are added here.
  """
  print("Tokenizing text...")
  logging.basicConfig(level = logging.INFO)

  # Load the `bert-base-uncased` tokenizer
  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

  # Tokenize every sentence in the pd.Series
  tokenized_text = [tokenizer.tokenize(x) for x in text]

  # BERT takes fixed-length sequences of at most 512 positions: truncate
  # anything longer than `bert_pad_len`, then pad the rest up to that length
  tokenized_text = [x[:bert_pad_len] for x in tokenized_text]
  tokenized_text = [x + ([pad] * (bert_pad_len - len(x))) for x in tokenized_text]

  # Convert the tokens to their IDs
  indexed_text = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_text]

  # One segment ID per token; each sentence is a single segment, so all IDs are equal
  segment_ids = [[1] * len(x) for x in tokenized_text]

  # Convert to tensor
  torch_idx_text = torch.LongTensor(indexed_text)
  torch_seg_ids = torch.LongTensor(segment_ids)

  return tokenized_text, torch_idx_text, torch_seg_ids
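
Fixed-length input has two cases to handle: sequences shorter than the target length need padding, and longer ones need truncation. Multiplying a list by a negative number yields an empty list in Python, so padding alone leaves over-long sequences untouched and the rows end up with different lengths. A minimal sketch covering both cases, where the helper name `pad_or_truncate` is hypothetical:

```python
from typing import List

PAD = "[PAD]"


def pad_or_truncate(tokens: List[str], max_len: int, pad_token: str = PAD) -> List[str]:
    # Clip first, so the padding count below can never be negative
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))


short = pad_or_truncate(["spot", "##ify", "may"], 5)
long = pad_or_truncate(list("abcdefg"), 5)

print(short)  # ['spot', '##ify', 'may', '[PAD]', '[PAD]']
print(long)   # ['a', 'b', 'c', 'd', 'e']
```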

In [26]:
def get_embeddings(torch_idx_text, torch_seg_ids):
    """
      Create BERT embeddings from the token-ID and segment-ID tensors.
      Returns a list with one tensor per sentence.
    """
    print("Getting Embeddings...")

    # Load pretrained `bert-base-uncased` model, and set to inference
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = True)
    model.eval()

    # Keep the full tensors on the CPU; only one sentence at a time is moved to `device`
    torch_idx_text, torch_seg_ids = torch_idx_text.to("cpu"), torch_seg_ids.to("cpu")
    model.to(device)

    # Disable gradient and get BERT embeddings
    with torch.no_grad():
        bert_embeddings = []
        for i in range(len(torch_idx_text)):
            print(i, end = "\r")
            text_temp = torch.unsqueeze(torch_idx_text[i], dim = 0).to(device)
            sgmt_temp = torch.unsqueeze(torch_seg_ids[i], dim = 0).to(device)
            # The second positional argument of BertModel's forward pass is
            # `attention_mask`, so the all-ones tensor acts as the mask here
            # (padding tokens are attended to as well)
            output = model(text_temp, attention_mask = sgmt_temp)
            # output[0] is the last hidden state of the model
            bert_embeddings.append(output[0])
            del text_temp, sgmt_temp
    del model

    return bert_embeddings
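
`BertModel` accepts an `attention_mask` argument in which 1 marks tokens to attend to and 0 marks padding. A minimal sketch of deriving such a mask from a padded token list follows; the helper name `build_attention_mask` is hypothetical, and passing a mask like this to the model would change the embeddings relative to the ones recorded in this notebook.

```python
from typing import List


def build_attention_mask(tokens: List[str], pad_token: str = "[PAD]") -> List[int]:
    # 1 for real tokens, 0 for padding positions
    return [0 if tok == pad_token else 1 for tok in tokens]


padded = ["content", "license", "[PAD]", "[PAD]"]
mask = build_attention_mask(padded)
print(mask)  # [1, 1, 0, 0]
```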


Note: As an optional refinement, the word-piece tokens of a single word (for example, `running` -> "run", "##ing") can be merged back into one token whose embedding is the average of its pieces. This reduces the number of token embeddings per sentence. The step is not mandatory and can be used when trying out the embeddings.

In [27]:
def embeddings_to_words(tokenized_text, embeddings):
    """
    Merge word-piece tokens of the same word to reduce the number of tokens
    Note: Need to run this locally and tweak virtual memory, as the colab
    runtime crashes
    """
    print("Untokenizing text and embeddings...")
    embeddings = [x.cpu().detach().numpy() for x in embeddings]
    embeddings = np.concatenate(embeddings, axis = 0)
    sentences = []
    final_emb = []

    # Iterate over every sentence
    for i in range(len(tokenized_text)):
        txt = tokenized_text[i]
        sub_len = 0
        sent = []
        sub = []
        emb = []
        sub_emb = None
        # Only the tokens before the first padding token are processed
        try:
            idx = txt.index(pad)
        except ValueError:
            idx = len(txt)
        for j in range(idx):
            # For the token that starts with ## process it; remove ## and
            # club that with previous token;
            # For the embedding, take the average of token's embeddings
            if txt[j].startswith("##"):
                if sub == []:
                    sub.append(sent.pop())
                    emb.pop()
                    sub_emb = embeddings[i][j - 1] + embeddings[i][j]
                    sub.append(txt[j][2:])
                    sub_len = 2
                else:
                    sub.append(txt[j][2:])
                    sub_emb += embeddings[i][j]
                    sub_len += 1
            else:
                if sub != []:
                    sent.append("".join(sub))
                    emb.append(sub_emb / sub_len)
                    sub = []
                    sub_emb = None
                    sub_len = 0
                sent.append(txt[j])
                emb.append(embeddings[i][j])
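        # Flush a merged word that is still pending when the sentence ends on a "##" piece
        if sub != []:
            sent.append("".join(sub))
            emb.append(sub_emb / sub_len)
        sentences.append(sent)
        final_emb.append(emb)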
    return sentences, final_emb

### Tokenize and Create Embeddings

In [28]:
tokenized_text, torch_idx_text, torch_seg_ids = create_tensors_BERT(cleaned_data.sentences)

Tokenizing text...


In [29]:
bert_embeddings = get_embeddings(torch_idx_text, torch_seg_ids)


Getting Embeddings...


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




Computing the embeddings is the most time-consuming step.

In [30]:
text, bert = embeddings_to_words(tokenized_text[:10], bert_embeddings[:10])


Untokenizing text and embeddings...


In [31]:
text[0], bert[0]

(['content', 'license', 'and', 'intellectual', 'property', 'rights'],
 [array([-5.04065976e-02,  6.26452088e-01, -4.53739136e-01,  2.54993677e-01,
         -7.93713108e-02, -3.74242663e-01,  8.41267347e-01,  7.93136537e-01,
          3.02752823e-01, -6.97149456e-01, -1.14608669e+00, -5.00292122e-01,
         -7.55346239e-01,  8.14944923e-01,  4.99583602e-01,  1.22355735e+00,
          3.58899295e-01,  1.18482709e+00, -3.63670826e-01,  6.45519495e-01,
          9.65986490e-01,  2.66659051e-01,  8.48129168e-02,  1.10067296e+00,
          6.21268809e-01, -3.02487195e-01,  8.91415834e-01, -1.01979768e+00,
         -4.82328057e-01,  2.05055282e-01,  9.41085935e-01, -3.93895917e-02,
          1.70143932e-01, -2.46444363e-02,  5.70479870e-01, -7.46426046e-01,
         -3.07868570e-01, -1.23606160e-01,  9.52045992e-02,  7.44842470e-01,
         -1.10866904e+00, -1.08360016e+00,  5.14029860e-01, -4.00640160e-01,
         -2.93298304e-01, -1.04004812e+00, -1.35488045e+00,  2.12775081e-01,
      