# Using the XLNet Tokenizer and XLNet Model from Hugging Face

Resources:

- https://github.com/shanayghag/Sentiment-classification-using-XLNet/blob/master/Sentiment_Analysis_Series_part_1.ipynb
- https://medium.com/swlh/using-xlnet-for-sentiment-classification-cfa948e65e85
- https://stackoverflow.com/questions/70951556/how-to-get-pre-trained-xlnet-sentence-embeddings
- https://huggingface.co/xlnet-base-cased?text=My+name+is+Thomas+and+my+main


Next Steps:

- Using XLNet Classification

# Aggregated Preprocessing Steps

In [1]:
import pandas as pd

In [2]:
!pip install contractions


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
# Convert all reviews to lower case (optional according to study)
def to_lower(data: pd.Series):
    return data.str.lower()


In [4]:
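def remove_accented_characters(data: pd.Series) -> pd.Series:
    """Removes accented characters from the Series

    Args:
        data (pd.Series): Series of strings

    Returns:
        pd.Series: Series with accented characters reduced to their ASCII base letters
    """
    import unicodedata

    # NFKD splits an accented character into a base letter plus combining marks;
    # encoding to ASCII with "ignore" then drops the marks and any other non-ASCII character
    return data.apply(lambda x: unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("utf-8", "ignore"))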

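As a quick check of what the NFKD-plus-ASCII round trip does, the sketch below applies the same expression to a single plain string instead of a Series. The helper name `strip_accents` and the sample text are illustrative only and are not part of the notebook.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # Hypothetical single-string version of remove_accented_characters
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("utf-8", "ignore")


accented = "café naïve résumé"  # made-up sample text
stripped = strip_accents(accented)
print(stripped)
```

Characters that have no ASCII base letter are dropped entirely rather than transliterated.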

In [5]:
def remove_html_encodings(data: pd.Series):
  return data.str.replace(r"&#\d+;", " ", regex=True)

In [6]:
def remove_html_tags(data: pd.Series):
  return data.str.replace(r"<[a-zA-Z]+\s?/?>", " ", regex=True)

In [7]:
def remove_url(data: pd.Series):
  return data.str.replace(r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+", " ", regex=True)

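To see what the three patterns above actually match, here is a small check with the standard `re` module on plain strings. The patterns are copied verbatim from the cells above; the sample strings are made up and do not come from the dataset.

```python
import re

# The three patterns used by the cells above, copied verbatim
HTML_ENCODING = r"&#\d+;"
HTML_TAG = r"<[a-zA-Z]+\s?/?>"
URL = r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+"

# Made-up sample strings
encoded = re.sub(HTML_ENCODING, " ", "don&#39;t")
tagged = re.sub(HTML_TAG, " ", "line<br/>break")
closing = re.sub(HTML_TAG, " ", "<p>text</p>")
linked = re.sub(URL, " ", "see https://example.com/terms?id=1 now")

print(repr(encoded), repr(tagged), repr(closing), repr(linked))
```

Note that the tag pattern, as written, leaves closing tags such as `</p>` and tags with attributes untouched, and the URL pattern only matches URLs that contain a path after the host.
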
In [8]:
def remove_html_and_url(data: pd.Series) -> pd.Series:
    """Function to remove
             1. HTML encodings
             2. HTML tags (both closed and open)
             3. URLs

    Args:
        data (pd.Series): A Pandas series of type string

    Returns:
        pd.Series: The cleaned series
    """
    # `str.replace` returns a new Series, so each result must be assigned back

    # Remove HTML encodings
    data = data.str.replace(r"&#\d+;", " ", regex=True)

    # Remove HTML tags (both open and closed)
    data = data.str.replace(r"<[a-zA-Z]+\s?/?>", " ", regex=True)

    # Remove URLs
    data = data.str.replace(r"https?://([\w\-\._]+){2,}/[\w\-\.\-/=\+_\?]+", " ", regex=True)

    return data


In [9]:
# Remove non-alphanumeric characters (letters, digits and whitespace are kept)
def remove_non_alpha_characters(data: pd.Series):
    return data.str.replace(r"_+|\\|[^a-zA-Z0-9\s]", " ", regex=True)


In [10]:
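# Collapse runs of whitespace into a single space.
# Note: the `^\s*` alternative also matches the empty string at the start of a sentence,
# so every cleaned sentence begins with one space (visible in the outputs further down).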
def remove_extra_spaces(data: pd.Series):
    return data.str.replace(r"^\s*|\s\s*", " ", regex=True)

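The effect of the pattern is easiest to see on a single string. The sketch below compares the pattern as written with a hypothetical alternative that collapses whitespace and then trims both ends; the sample input is made up, and the alternative is not what the notebook ran.

```python
import re

sample = "a   b "  # made-up input with repeated and trailing spaces

# Pattern used by remove_extra_spaces above
as_written = re.sub(r"^\s*|\s\s*", " ", sample)

# Hypothetical alternative: collapse whitespace runs, then trim both ends
trimmed = re.sub(r"\s+", " ", sample).strip()

print(repr(as_written), repr(trimmed))
```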

In [11]:
# Expanding contractions
def fix_contractions(data: pd.Series):
    import contractions

    def contraction_fixer(txt: str):
        return " ".join([contractions.fix(word) for word in txt.split()])

    return data.apply(contraction_fixer)


In [12]:
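# Intended to remove bracket tokens such as "-lrb-" and "-rrb-".
# Note: `[^a-zA-Z]{3}` matches three NON-letters, so "-lrb-" is not matched by this pattern
# and "lrb"/"rrb" still appear in the cleaned outputs further down.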
def remove_special_words(data: pd.Series):
  return data.str.replace(r"\-[^a-zA-Z]{3}\-", " ", regex=True)

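A short check on a made-up sentence shows the difference between the character class as written and a letters-only class. The second pattern is an assumption about the intent and is not what the notebook ran.

```python
import re

sample = "does not -lrb- i -rrb- guarantee"  # made-up sentence with bracket tokens

# Pattern as written above: three NON-letters between the hyphens
as_written = re.sub(r"\-[^a-zA-Z]{3}\-", " ", sample)

# Assumed intent: three letters between the hyphens
letters_only = re.sub(r"\-[a-zA-Z]{3}\-", " ", sample)

print(repr(as_written))
print(repr(letters_only))
```
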
In [13]:
!pip install sentencepiece


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [14]:
!pip install transformers


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Load Data

In [15]:
train_dataset_path = "./data/tos_clauses_train.csv"
test_dataset_path = "./data/tos_clauses_dev.csv"

In [16]:
train_df = pd.read_csv(train_dataset_path, header=0)
test_df = pd.read_csv(test_dataset_path, header=0)

In [17]:
train_df.head()

Unnamed: 0,label,sentences
0,0,content license and intellectual property rights
1,0,reactivated skype credit is not refundable .
2,1,spotify may change the price for the paid subs...
3,0,the term of your licenses under this eula shal...
4,0,the arbitrator may award declaratory or injunc...


### Clean the Data

In [18]:
# Clean a DataFrame using a dictionary that maps each column to the ordered list of functions to apply to it
def cleaning(df):
  data_cleaning_pipeline = {
      "sentences": [
          to_lower,
          remove_special_words,
          remove_accented_characters,
          remove_html_encodings,
          remove_html_tags,
          remove_url,
          fix_contractions,
          remove_non_alpha_characters,
          remove_extra_spaces,
      ]
  }

  cleaned_data = df.copy()

  # Process all the cleaning instructions
  for col, pipeline in data_cleaning_pipeline.items():
      # Get the column to perform cleaning on
      temp_data = cleaned_data[col].copy()

      # Perform all the cleaning functions sequentially
      for func in pipeline:
          print(f"Starting: {func.__name__}")
          temp_data = func(temp_data)
          print(f"Ended: {func.__name__}")

      # Replace the old column with cleaned one.
      cleaned_data[col] = temp_data.copy()

  return cleaned_data

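The same column-to-ordered-steps pattern can be sketched without pandas, which makes the control flow easy to follow. Every name below (`lower`, `drop_non_alnum`, `collapse_spaces`, `rows`) is illustrative, and the steps are simplified stand-ins for the notebook's functions.

```python
import re


def lower(text: str) -> str:
    return text.lower()


def drop_non_alnum(text: str) -> str:
    # Replace anything that is not a letter, digit or whitespace with a space
    return re.sub(r"[^a-zA-Z0-9\s]", " ", text)


def collapse_spaces(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


# Column name -> ordered list of cleaning steps
pipeline = {"sentences": [lower, drop_non_alnum, collapse_spaces]}

# Made-up rows standing in for a DataFrame column
rows = {"sentences": ["Skype Credit is NOT refundable .", "Terms & Conditions"]}

cleaned = {}
for col, steps in pipeline.items():
    values = rows[col]
    # Each step receives the output of the previous one
    for step in steps:
        values = [step(v) for v in values]
    cleaned[col] = values

print(cleaned["sentences"])
```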

In [19]:
train_df = cleaning(train_df)
test_df = cleaning(test_df)

train_df.head(), test_df.head()

Starting: to_lower
Ended: to_lower
Starting: remove_special_words
Ended: remove_special_words
Starting: remove_accented_characters
Ended: remove_accented_characters
Starting: remove_html_encodings
Ended: remove_html_encodings
Starting: remove_html_tags
Ended: remove_html_tags
Starting: remove_url
Ended: remove_url
Starting: fix_contractions
Ended: fix_contractions
Starting: remove_non_alpha_characters
Ended: remove_non_alpha_characters
Starting: remove_extra_spaces
Ended: remove_extra_spaces
Starting: to_lower
Ended: to_lower
Starting: remove_special_words
Ended: remove_special_words
Starting: remove_accented_characters
Ended: remove_accented_characters
Starting: remove_html_encodings
Ended: remove_html_encodings
Starting: remove_html_tags
Ended: remove_html_tags
Starting: remove_url
Ended: remove_url
Starting: fix_contractions
Ended: fix_contractions
Starting: remove_non_alpha_characters
Ended: remove_non_alpha_characters
Starting: remove_extra_spaces
Ended: remove_extra_spaces


(   label                                          sentences
 0      0   content license and intellectual property rights
 1      0        reactivated skype credit is not refundable 
 2      1   spotify may change the price for the paid sub...
 3      0   the term of your licenses under this eula sha...
 4      0   the arbitrator may award declaratory or injun...,
    label                                          sentences
 0      0   uber reserves the right to withhold or deduct...
 1      0   niantic s failure to enforce any right or pro...
 2      0   14 3 if you feel that any member you interact...
 3      0   blizzard entertainment has the right to obtai...
 4      0   myfitnesspal does not lrb i rrb guarantee the...)

In [20]:
train_df["sentences"][2]

' spotify may change the price for the paid subscriptions pre paid period lrb for periods not yet paid for rrb or codes from time to time and will communicate any price changes to you in advance and if applicable how to accept those changes '

### Using XLNet Tokenizer and XLNet Model to get Embeddings

In [21]:
import torch
import numpy as np

In [22]:
import logging
import torch
import numpy as np
import warnings
from transformers import XLNetTokenizer, XLNetModel


In [23]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


### Functions

In [24]:
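# XLNet's special tokens; the BERT-style "[CLS]", "[SEP]" and "[PAD]" are not in the
# XLNet vocabulary and would be mapped to the unknown-token ID
cls = "<cls>"
sep = "<sep>"
pad = "<pad>"
max_pad_length = 512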

In [25]:
PRE_TRAINED_MODEL_NAME = 'xlnet-base-cased'

In [26]:
def create_tensors_XLNET(text):
  """
    Tokenize every sentence in the pd.Series using the XLNet tokenizer
  """
  print("Tokenizing text...")
  logging.basicConfig(level = logging.INFO)

  # Load the tokenizer for the `xlnet-base-cased` model
  tokenizer = XLNetTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

  # Tokenize every sentence in the pd.Series, truncating anything longer than the fixed length
  tokenized_text = [tokenizer.tokenize(x)[:max_pad_length] for x in text]

  # Pad the tokens on the right so that every sequence has the same fixed length
  tokenized_text = [x + ([pad] * (max_pad_length - len(x))) for x in tokenized_text]

  # Convert the tokens to their IDs
  indexed_text = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_text]

  # Every sentence is a single segment, so all segment IDs are set to 1
  segment_ids = [[1] * len(x) for x in tokenized_text]

  # Convert to tensors
  torch_idx_text = torch.LongTensor(indexed_text)
  torch_seg_ids = torch.LongTensor(segment_ids)

  return tokenized_text, torch_idx_text, torch_seg_ids

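The padding step can be illustrated on plain token lists. The sketch below also clips over-long sequences and builds a 0/1 mask marking real tokens. The mask is an assumption about what one might pass to the model as `attention_mask`; the function above does not build one. The helper name `pad_or_truncate` and the toy tokens are hypothetical.

```python
from typing import List, Tuple


def pad_or_truncate(tokens: List[str], max_len: int, pad_token: str = "<pad>") -> Tuple[List[str], List[int]]:
    # Clip sequences that are longer than max_len
    clipped = tokens[:max_len]
    n_pad = max_len - len(clipped)
    # Right-pad; in the mask, 1 marks a real token and 0 marks padding
    padded = clipped + [pad_token] * n_pad
    mask = [1] * len(clipped) + [0] * n_pad
    return padded, mask


short_tokens, short_mask = pad_or_truncate(["hello", "world"], max_len=4)
long_tokens, long_mask = pad_or_truncate(["a", "b", "c", "d", "e"], max_len=4)
print(short_tokens, short_mask)
print(long_tokens, long_mask)
```

Hugging Face tokenizers can also pad and truncate when called directly, for example with `padding="max_length"`, `truncation=True` and `max_length=512`, which avoids doing this by hand.
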
In [27]:
# Takes in the index and segment tensors and returns the XLNet embeddings as a list
def get_embeddings(torch_idx_text, torch_seg_ids):
    """
      Create XLNet embeddings from token IDs
    """
    print("Getting Embeddings...")

    # Load the pretrained `xlnet-base-cased` model and set it to inference mode
    model = XLNetModel.from_pretrained(PRE_TRAINED_MODEL_NAME, output_hidden_states = True)
    model.eval()

    torch_idx_text, torch_seg_ids = torch_idx_text.to("cpu"), torch_seg_ids.to("cpu")
    model.to(device)

    # Disable gradients and get the XLNet embeddings one sentence at a time
    with torch.no_grad():
        xlnet_embeddings = []
        for i in range(len(torch_idx_text)):
            print(i, end = "\r")
            text_temp = torch.unsqueeze(torch_idx_text[i], dim = 0).to(device)
            sgmt_temp = torch.unsqueeze(torch_seg_ids[i], dim = 0).to(device)
            # Pass the segment IDs by keyword: the second positional argument of XLNetModel is `attention_mask`
            output = model(input_ids = text_temp, token_type_ids = sgmt_temp)
            # output[0] is the last hidden state; move it to the CPU so results do not pile up on the device
            xlnet_embeddings.append(output[0].cpu())
            del text_temp, sgmt_temp
    del model

    return xlnet_embeddings


### Tokenize and Create Embeddings

In [28]:
train_tokenized_text, train_torch_idx_text, train_torch_seg_ids = create_tensors_XLNET(train_df.sentences)
test_tokenized_text, test_torch_idx_text, test_torch_seg_ids = create_tensors_XLNET(test_df.sentences)

Tokenizing text...
Tokenizing text...


In [29]:
train_xlnet_embeddings = get_embeddings(train_torch_idx_text, train_torch_seg_ids)
test_xlnet_embeddings = get_embeddings(test_torch_idx_text, test_torch_seg_ids)


Getting Embeddings...


Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetModel: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Getting Embeddings...


Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetModel: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


1882

In [30]:
train_xlnet_embeddings = torch.cat(train_xlnet_embeddings)
test_xlnet_embeddings = torch.cat(test_xlnet_embeddings)

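Concatenating the per-sentence outputs yields one tensor of shape (number of sentences, 512, 768), which can be large. The rough estimate below assumes 32-bit floats (4 bytes per value) and the hidden size of 768 used by `xlnet-base-cased`; the sentence count of 1000 is an arbitrary example, not the size of this dataset.

```python
def embedding_bytes(n_sentences: int, seq_len: int = 512, hidden_size: int = 768, bytes_per_value: int = 4) -> int:
    # One value per (sentence, token position, hidden unit)
    return n_sentences * seq_len * hidden_size * bytes_per_value


size = embedding_bytes(1000)
print(f"{size / 2**30:.2f} GiB")
```
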
In [None]:
import pickle as pkl
train_embeddings_file_path = "./embeddings/train_xlnet_embeddings.pkl"
test_embeddings_file_path = "./embeddings/test_xlnet_embeddings.pkl"

def save_embeddings(embeddings_file_path, embeddings):
  with open(embeddings_file_path, mode="wb") as file:
    pkl.dump({"embeddings": embeddings}, file, protocol=pkl.HIGHEST_PROTOCOL)

In [None]:
save_embeddings(train_embeddings_file_path, train_xlnet_embeddings)
save_embeddings(test_embeddings_file_path, test_xlnet_embeddings)
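
To read the saved files back later, a matching loader could look like the sketch below. `load_embeddings` is a hypothetical counterpart to `save_embeddings`, and the round trip uses a small nested list in a temporary directory as a stand-in for the real tensors.

```python
import os
import pickle as pkl
import tempfile


def load_embeddings(embeddings_file_path):
    # Hypothetical counterpart to save_embeddings: read the dict back and return its payload
    with open(embeddings_file_path, mode="rb") as file:
        return pkl.load(file)["embeddings"]


# Round trip with a small nested list standing in for the embedding tensor
stand_in = [[0.1, 0.2], [0.3, 0.4]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_embeddings.pkl")
    with open(path, mode="wb") as file:
        pkl.dump({"embeddings": stand_in}, file, protocol=pkl.HIGHEST_PROTOCOL)
    restored = load_embeddings(path)

print(restored == stand_in)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.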