## Load Dataset

In [1]:
from datasets import load_dataset
emotions = load_dataset("emotion") # loading emotions dataset

No config specified, defaulting to: emotion/split
Found cached dataset emotion (/Users/leisun/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)



In [2]:
from transformers import AutoTokenizer, AutoModel # the Auto* classes pick the concrete tokeniser/model class from the checkpoint name (model_ckpt)

## Tokenising and Embedding

This section uses the tokeniser of the pre-trained **distilbert** model. The AutoTokenizer class determines the right tokeniser from the checkpoint name (i.e. model_ckpt); otherwise the specific tokeniser class can be used directly. As shown below, AutoTokenizer returns a DistilBertTokenizerFast instance.

In [7]:
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

type(tokenizer)

transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast

### Padding, Truncation, Tokens & IDs

- The device can be set to "cpu", "cuda" or "mps", where "mps" is for Apple silicon
- The tokeniser takes a list of texts and produces tensors; use "pt" for PyTorch or "tf" for TensorFlow
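
Hard-coding one device name fails on machines that lack that hardware. Below is a minimal sketch of a fallback order. `pick_device_name` is a hypothetical helper (not part of PyTorch) that takes the availability flags as plain booleans, so the logic can be run anywhere. In a notebook the flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device_name(has_cuda: bool, has_mps: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, then CPU."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


# Example: an Apple-silicon laptop without an NVIDIA GPU.
device_name = pick_device_name(has_cuda=False, has_mps=True)
print(device_name)  # mps
```

The returned string can be passed straight to `torch.device(...)`.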

**To ensure texts are the same size**
- Padding means shorter texts are padded with 0s up to the length of the longest text in the batch.
- Truncation means longer texts are truncated to the context length of the model (max tokens).
- [CLS] and [SEP] are added at the beginning and end of each text.
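
To make the three rules concrete without loading a tokeniser, here is a plain-Python sketch. It is an illustration only: `pad_and_truncate` is a hypothetical helper, the word ids are made up, and a real tokeniser also does sub-word splitting. The special ids (101, 102, 0) are the ones that appear in the DistilBERT output in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in this notebook's DistilBERT output


def pad_and_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad to the longest sequence."""
    sequences = []
    for ids in token_ids:
        # Reserve two positions for the special tokens before truncating.
        body = ids[: max_length - 2]
        sequences.append([CLS_ID] + body + [SEP_ID])

    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy ids (made up for illustration); the second text is one token too long.
batch = pad_and_truncate([[7, 8], [7, 8, 9, 10, 11]], max_length=6)
print(batch["input_ids"])       # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```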

#### Example Below

In [9]:
import torch
device = torch.device("mps") # utilising apple silicon

# this is a sample text
text = ["this is a blazing fast test", "this is another test, but a lot slower and longer"]

# apply the tokeniser
inputs = tokenizer(text, padding = True, truncation = True, return_tensors = "pt") # return_tensors = "pt" gives PyTorch tensors; without it the tokeniser returns Python lists

print(inputs) # the first sentence is padded with zeros at the end; the ids index into the vocab
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][1])) # note that [CLS] and [SEP] are added

{'input_ids': tensor([[  101,  2023,  2003,  1037, 17162,  3435,  3231,   102,     0,     0,
             0,     0,     0],
        [  101,  2023,  2003,  2178,  3231,  1010,  2021,  1037,  2843, 12430,
          1998,  2936,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'this', 'is', 'another', 'test', ',', 'but', 'a', 'lot', 'slower', 'and', 'longer', '[SEP]']


## Model 

The model is loaded from the same checkpoint, model_ckpt.

The model accepts the tokenised inputs and generates their hidden states. (Hidden states are the per-token output vectors of the transformer layers; the last hidden state comes from the final layer.)

First, the inputs (batch) are converted into a dictionary whose values are moved to the device for compute. The dictionary is then passed to the model. Wrapping the call in torch.no_grad() disables gradient tracking, which saves memory and compute; gradients are not needed because we only want the outputs.

In [21]:
model = AutoModel.from_pretrained(model_ckpt).to(device)

print(tokenizer.model_input_names)

inputs_ = {k: v.to(device) for k, v in inputs.items() if k in tokenizer.model_input_names} # carry to device

with torch.no_grad():
    model_outputs = model(**inputs_)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['input_ids', 'attention_mask']
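
The key filter matters once the batch carries extra columns such as the label. Here is a small sketch of why: `toy_model` is a hypothetical stand-in for the forward pass (it only counts unmasked tokens) and the batch values are made up. Like the real model call, it accepts only the two input names printed above.

```python
MODEL_INPUT_NAMES = ["input_ids", "attention_mask"]  # as printed by tokenizer.model_input_names


def toy_model(input_ids, attention_mask):
    """Hypothetical stand-in for a forward pass: counts unmasked tokens per row."""
    return [sum(mask) for mask in attention_mask]


batch = {
    "label": [0, 3],
    "input_ids": [[101, 7, 102, 0], [101, 7, 8, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}

# toy_model(**batch) would raise TypeError because of the extra "label" key,
# so keep only the keys the model expects.
inputs = {k: v for k, v in batch.items() if k in MODEL_INPUT_NAMES}
print(toy_model(**inputs))  # [3, 4]
```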


In [26]:
model_outputs.last_hidden_state[:,0].shape

torch.Size([2, 768])
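
The full last hidden state has shape (batch, sequence length, hidden size), here (2, 13, 768) because both sentences were padded to 13 tokens. Indexing with [:, 0] keeps only position 0 of each sequence, which is the [CLS] token. The sketch below mimics that slice with nested lists and made-up values (hidden size 4 instead of 768).

```python
# Toy "last hidden state" with shape (batch=2, seq_len=3, hidden=4); values are made up.
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Equivalent of tensor[:, 0]: keep position 0 (the [CLS] token) of every sequence.
cls_vectors = [sequence[0] for sequence in last_hidden_state]

print(len(cls_vectors), len(cls_vectors[0]))  # 2 4
print(cls_vectors[1])                         # [0.5, 0.6, 0.7, 0.8]
```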

In [17]:
# this function applies the tokeniser to the batch text (the 'text' feature of emotions['train'])
# it also applies padding and truncation
def tokenize(batch):
    return tokenizer(batch['text'], padding = True, truncation = True, return_tensors = "pt")


# this function extracts the last hidden state from a batch
def extract_hidden_states(batch):
    inputs = {k: v.to(device) for k, v in batch.items() # the model only expects input_ids and attention_mask
                 if k in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state # last hidden state for every token; index [:, 0] to keep only [CLS]

    return {"hidden_state": last_hidden_state.cpu().numpy()}


In [11]:
emotions_encoded = emotions.map(tokenize, batched = True, batch_size = None) # batch_size = None processes each split as a single batch, so every text is padded to the same length
emotions_encoded.set_format('torch', columns = ['input_ids', 'attention_mask', 'label']) # make the dataset return torch tensors for these columns
                                                
print(emotions_encoded['train'][0])

Loading cached processed dataset at /Users/leisun/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-340b8d5609895ea7.arrow
Loading cached processed dataset at /Users/leisun/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-edaf99feeb6ed202.arrow
Loading cached processed dataset at /Users/leisun/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-ba243ea584e486d4.arrow


{'label': tensor(0), 'input_ids': tensor([  101,  1045,  2134,  2102,  2514, 26608,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0,
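
With batch_size = None the whole split is tokenised as one batch, so padding = True pads every example to the longest text in that split. That is why the 7-token example above is printed with 87 ids. The sketch below illustrates the difference from per-batch padding. `padded_lengths` is a hypothetical helper and the token counts are made up.

```python
def padded_lengths(token_counts, batch_size=None):
    """Return the padded length each example ends up with.

    batch_size=None mimics treating the whole split as a single batch.
    """
    size = batch_size or max(len(token_counts), 1)
    lengths = []
    for start in range(0, len(token_counts), size):
        chunk = token_counts[start:start + size]
        # Every example in a chunk is padded to that chunk's longest example.
        lengths.extend([max(chunk)] * len(chunk))
    return lengths


counts = [7, 23, 12, 87]  # hypothetical token counts per example
print(padded_lengths(counts))                # [87, 87, 87, 87]
print(padded_lengths(counts, batch_size=2))  # [23, 23, 87, 87]
```

Per-batch padding would give rows of different lengths across batches, which is awkward once the columns are converted to tensors.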

In [12]:
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched = True, batch_size = 128) # process 128 examples at a time; passing everything at once would run out of memory

Loading cached processed dataset at /Users/leisun/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-5a647c2500552338.arrow
Loading cached processed dataset at /Users/leisun/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-1996d999c6672d2b.arrow
Loading cached processed dataset at /Users/leisun/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-ac8ec274b73b2b38.arrow
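
As a rough picture of what batch_size = 128 means, the rows are handed to the function in consecutive slices of at most 128. This is only a sketch of the idea, not how the datasets library implements it. `iter_batches` is a hypothetical helper.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = list(range(300))  # stand-in for 300 encoded examples
sizes = [len(batch) for batch in iter_batches(rows, batch_size=128)]
print(sizes)  # [128, 128, 44]
```

Only one slice is on the device at a time, which keeps peak memory bounded by the batch size.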


In [14]:
import pickle

with open("./hidden_state.pickle", "wb") as f: # the context manager closes the file after writing
    pickle.dump(emotions_hidden, f)
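
Reading the file back is the mirror image of writing it. The round trip below is a self-contained sketch: `hidden` is a hypothetical stand-in for emotions_hidden, and it is written to a temporary directory so nothing in the working folder is overwritten.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for emotions_hidden; any picklable object behaves the same way.
hidden = {"hidden_state": [[0.1, 0.2], [0.3, 0.4]], "label": [0, 1]}

path = os.path.join(tempfile.mkdtemp(), "hidden_state.pickle")

# Context managers close the file even if dump/load raises.
with open(path, "wb") as f:
    pickle.dump(hidden, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == hidden)  # True
```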

In [68]:
{k: v.to(device) for k, v in emotions_encoded['train'][1].items()}

{'label': tensor(0),
 'input_ids': tensor([  101,  1045,  2064,  2175,  2013,  3110,  2061, 20625,  2000,  2061,
          9636, 17772,  2074,  2013,  2108,  2105,  2619,  2040, 14977,  1998,
          2003,  8300,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0