# Supplemental Materials

---
- Installation (Conda & Packages)
- View Attention Weights

Instructors: Yen-Chieh Liao and Stefan Müller

Date: 22 April 2024
  

__Understanding BERT Input Data Preparation: A Breakdown of Key Components__

- __input_ids:__ These are token IDs from BERT's vocabulary. Each number identifies one token (a word or subword piece) of the input text. For example, 101 and 102 are the IDs of the special tokens [CLS] and [SEP]: [CLS] opens every sequence and is used by BERT for classification tasks, and [SEP] separates segments.

- __token_type_ids:__ These indicate which segment each token belongs to. In tasks involving single text inputs or those that don’t distinguish between multiple sequences, this is usually set to all zeros. In pair tasks (like question answering), different segments would be marked with different numbers (e.g., 0 for the first segment and 1 for the second).

- __attention_mask:__ This mask indicates to the model which tokens should be attended to, and which should not. All tokens that are real (not padding) are marked by 1, which means "pay attention to this," while 0 would indicate padding tokens, which should not influence the context.
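The three components above can be sketched by hand for a sentence pair with padding. This is a toy illustration only, not the Hugging Face tokenizer: `encode_pair` and `TOY_VOCAB` are hypothetical names, and every token ID other than 0 ([PAD]), 101 ([CLS]) and 102 ([SEP]) is made up for the example.

```python
# Toy illustration only: a hand-rolled encoder, NOT the Hugging Face tokenizer.
# The IDs 0, 101 and 102 match [PAD], [CLS] and [SEP] in bert-base-uncased;
# every other ID below is invented for this example.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
TOY_VOCAB = {"my": 1001, "dog": 1002, "is": 1003, "cute": 1004,
             "he": 1005, "likes": 1006, "eating": 1007}


def encode_pair(first, second, max_length):
    """Build BERT-style inputs for two whitespace-tokenized segments."""
    ids_a = [TOY_VOCAB[word] for word in first.split()]
    ids_b = [TOY_VOCAB[word] for word in second.split()]

    # Layout: [CLS] first segment [SEP] second segment [SEP]
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], the first segment and its [SEP]; segment 1 the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Every real (non-padding) token is marked with 1.
    attention_mask = [1] * len(input_ids)

    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("pair is longer than max_length")
    # Padding positions get the [PAD] ID, segment 0 and mask 0.
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


encoded = encode_pair("my dog is cute", "he likes eating", max_length=12)
for name, values in encoded.items():
    print(name, values)
```

The pair occupies ten positions, so the last two positions are padding: they carry the [PAD] ID in `input_ids` and 0 in `attention_mask`, while `token_type_ids` switches from 0 to 1 right after the first [SEP].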

In [None]:
# pip install bertviz
# pip install transformers

In [4]:
from transformers import BertTokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 

In [6]:
text = "Using transformers is easy!" 
tokened_text = tokenizer(text) 
tokened_text

{'input_ids': [101, 2478, 19081, 2003, 3733, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [9]:
tokened_text['input_ids']

[101, 2478, 19081, 2003, 3733, 999, 102]

In [10]:
tokened_text['token_type_ids']

[0, 0, 0, 0, 0, 0, 0]

In [None]:
tokened_text['attention_mask']

In [None]:
encoded_input = tokenizer(text, return_tensors="pt")
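To see roughly why positions marked 0 in `attention_mask` do not influence the result, the sketch below applies a mask to one row of attention scores before the softmax. It is a simplified illustration and not BERT's actual implementation: `masked_attention_weights` is a hypothetical helper, the scores are invented, and it assumes at least one position is unmasked.

```python
import math


def masked_attention_weights(scores, attention_mask):
    """Softmax over one row of attention scores, ignoring masked positions."""
    # Masked positions are set to -inf, so exp() maps them to exactly 0.
    masked = [score if keep == 1 else float("-inf")
              for score, keep in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in masked]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores of one query token against five positions;
# the last two positions are padding.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
attention_mask = [1, 1, 1, 0, 0]
weights = masked_attention_weights(scores, attention_mask)
print([round(weight, 3) for weight in weights])
```

Although the padding positions have the largest raw scores here, they receive zero weight, and the remaining weights still sum to 1 over the real tokens.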

__Load Model and Retrieve Attention Weights__

In [1]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel 
from bertviz.neuron_view import show


In [2]:
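# Load the tokenizer and the BertViz neuron-view version of BERT from the same checkpoint
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
# [SEP] separates the two segments inside a single input string
text = "my dog is cute [SEP] he likes eating"
# layer and head are zero-based indices of the layer and head selected initially
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)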
