<a href="https://colab.research.google.com/github/amkayhani/DSML24/blob/main/5_2_bert_tokenisation_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BERT Tokenisation Example**
This notebook demonstrates tokenisation, padding and attention masks in BERT using a simple example.

**Key Topics Covered:**
- Tokenisation
- Subword Splitting
- Special Tokens ([CLS], [SEP])
- Padding
- Attention Masks

Let's get started!

## **Step 1: Install and Import Dependencies**
We'll install the required libraries and import the necessary modules.

In [1]:
!pip install transformers datasets



In [2]:
from transformers import BertTokenizer
from datasets import load_dataset
import torch

## **Step 2: Load the IMDb Dataset**
We'll use the IMDb dataset, which consists of movie reviews labelled as positive or negative.

## **Step 5: Subword Splitting**
Tokenisation converts raw text into numerical representations for the model.

**Why Tokenisation?**
- Breaks text into smaller components (tokens)
- Converts words into subwords for better vocabulary handling
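- Assigns a unique numerical ID to each token

BERT uses **WordPiece tokenisation**, which keeps common words whole and breaks rare words into subword pieces; continuation pieces are marked with a `##` prefix.
This lets the model represent a word it has never seen as a sequence of known pieces instead of a single unknown token.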
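To make the splitting rule concrete, here is a minimal sketch of greedy longest-match-first splitting, the matching strategy WordPiece applies at tokenisation time. The vocabulary `TOY_VOCAB` and the helper `wordpiece_split` are hypothetical names invented for this illustration; the real `bert-base-uncased` vocabulary is far larger, so actual splits may differ.

```python
# Toy greedy longest-match-first subword splitter in the spirit of WordPiece.
# TOY_VOCAB is a hypothetical vocabulary made up for this sketch; it is NOT
# the bert-base-uncased vocabulary.
TOY_VOCAB = {"token", "##isation", "##s", "play", "##ing", "movie", "[UNK]"}


def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_split("movie", TOY_VOCAB))         # whole word is in the vocabulary
print(wordpiece_split("tokenisation", TOY_VOCAB))  # split into a stem and a ## piece
print(wordpiece_split("xyz", TOY_VOCAB))           # nothing matches
```

With this toy vocabulary, `movie` stays whole, `tokenisation` becomes `token` plus `##isation`, and `xyz` falls back to `[UNK]`.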

In [5]:
from transformers import BertTokenizer

# Load the pre-trained BERT tokenizer (the uncased model lowercases text before splitting)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = 'I really enjoyed this movie. The story was engaging and the characters were well-developed.'

# Split the text into WordPiece tokens, then map each token to its vocabulary ID
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print('Original Text:', text)
print('Tokenised:', tokens)
print('Token IDs:', token_ids)


Original Text: I really enjoyed this movie. The story was engaging and the characters were well-developed.
Tokenised: ['i', 'really', 'enjoyed', 'this', 'movie', '.', 'the', 'story', 'was', 'engaging', 'and', 'the', 'characters', 'were', 'well', '-', 'developed', '.']
Token IDs: [1045, 2428, 5632, 2023, 3185, 1012, 1996, 2466, 2001, 11973, 1998, 1996, 3494, 2020, 2092, 1011, 2764, 1012]


## **Step 6: Special Tokens ([CLS], [SEP])**
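- `[CLS]` is placed at the start of every input sequence; its final hidden state is commonly used as the sequence representation for classification tasks.
- `[SEP]` separates the two sentences in a sentence pair and marks the end of the input.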
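The separating role of `[SEP]` is easiest to see with a sentence pair. The sketch below lays out two already-tokenised sentences in the `[CLS] A [SEP] B [SEP]` pattern and builds the matching segment IDs, which the Hugging Face tokenizer returns as `token_type_ids`. The helper `build_pair_input` and the example sentences are hypothetical, written only for this illustration.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out two token lists as [CLS] A [SEP] B [SEP] with segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


pair_tokens, pair_segments = build_pair_input(
    ["i", "enjoyed", "it"],
    ["the", "story", "was", "engaging"],
)
print(pair_tokens)
print(pair_segments)
```

For a single sentence, as in the rest of this notebook, every position belongs to segment 0.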

In [6]:
# Add the special tokens by hand for illustration; calling tokenizer(text) adds them automatically
tokens_with_special = ['[CLS]'] + tokens + ['[SEP]']
print('Tokens with Special Tokens:', tokens_with_special)

Tokens with Special Tokens: ['[CLS]', 'i', 'really', 'enjoyed', 'this', 'movie', '.', 'the', 'story', 'was', 'engaging', 'and', 'the', 'characters', 'were', 'well', '-', 'developed', '.', '[SEP]']


## **Step 7: Padding**
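Since models process text in batches, all input sequences in a batch must have the same length.
Padding extends shorter sequences with the `[PAD]` token (ID `0`) until they reach a target length.
The cell below uses `padding='max_length'` to pad to a fixed length of 50; `padding=True` would instead pad to the longest sequence in the batch. `truncation=True` cuts off any sequence longer than `max_length`.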
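As a rough illustration of what padding does under the hood, the sketch below pads a small batch of ID lists to the length of its longest member. The helper `pad_batch` is a hypothetical name for this example, and the pad ID of `0` is taken from the padded output shown in this notebook; in practice the tokenizer handles this step.

```python
PAD_ID = 0  # ID of [PAD], matching the zeros in the padded output in this notebook


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every ID list so all match the longest list in the batch."""
    max_len = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]


# Two short sequences built from IDs that appear in this notebook:
# [CLS] i really enjoyed [SEP]  and  [CLS] movie [SEP]
batch = [[101, 1045, 2428, 5632, 102], [101, 3185, 102]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

The first sequence is already the longest, so only the second one gains trailing zeros.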

In [7]:
tokenized_output = tokenizer([text], padding='max_length', truncation=True, max_length=50)
print('Padded Input IDs:', tokenized_output['input_ids'][0])

Padded Input IDs: [101, 1045, 2428, 5632, 2023, 3185, 1012, 1996, 2466, 2001, 11973, 1998, 1996, 3494, 2020, 2092, 1011, 2764, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


## **Step 8: Attention Masks**
The attention mask tells the model which positions hold real tokens (`1`) and which hold padding (`0`), so that padding is ignored in the attention computation.
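The sketch below shows one way such a mask can take effect: masked positions are given a score of negative infinity before the softmax, so they end up with zero attention weight. This is a simplified illustration in plain Python, not BERT's actual implementation; the helpers `attention_mask_from_ids` and `masked_softmax` are hypothetical names, and deriving the mask from the IDs assumes the pad ID never occurs as a real token.

```python
import math


def attention_mask_from_ids(input_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if tok == pad_id else 1 for tok in input_ids]


def masked_softmax(scores, mask):
    """Softmax over scores, giving masked (0) positions zero weight.

    Assumes at least one position is unmasked.
    """
    # Padded positions get a score of -inf, so exp() maps them to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


input_ids = [101, 3185, 102, 0, 0]      # [CLS] movie [SEP] [PAD] [PAD]
mask = attention_mask_from_ids(input_ids)
scores = [1.0, 1.0, 1.0, 5.0, 5.0]      # made-up raw attention scores
weights = masked_softmax(scores, mask)
print(mask)
print(weights)
```

Even though the padded positions have the highest raw scores here, the mask forces their weights to zero and the remaining weight is shared among the three real tokens.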

In [8]:
print('Attention Mask:', tokenized_output['attention_mask'][0])

Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
