Question 1 - Download the Yelp review dataset “Yelp/yelp_review_full”. Split each sample by calling the string method “.split()” and choose the correct statements about the dataset.
 A. The dataset contains close to 99 million words
 B. There are more than 300 samples that contain a single word
 C. There are less than 300 samples that contain only a single word
 D. “Cheesy-melty-roasted-cauliflower-with-fresh-bread-crumbs-on top.\\nTo-die-for.” is one of the single words in the dataset
 E. The average length of a sample is 134.1
 F. The distribution of the length of the samples is right skewed

In [1]:
!pip install datasets


In [1]:
from datasets import load_dataset
import numpy as np

dataset = load_dataset("Yelp/yelp_review_full")
print(dataset)



DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})


In [2]:
reviews = dataset['train']['text']
split_reviews = [review.split() for review in reviews]
lengths = [len(words) for words in split_reviews]
total_words = sum(lengths)
# Collect the reviews that consist of exactly one whitespace-separated word.
single_words = {words[0] for words in split_reviews if len(words) == 1}
single_word_reviews = sum(1 for n in lengths if n == 1)
average_length = round(np.mean(lengths), 1)
# A right-skewed distribution has a long right tail, which pulls the mean above the median.
right_skewed = np.mean(lengths) > np.median(lengths)
# The quoted string contains a space, so str.split() can never return it as a single word.
target = "Cheesy-melty-roasted-cauliflower-with-fresh-bread-crumbs-on top.\\nTo-die-for."

print(f"Total number of words: {total_words}")
print(f"Number of reviews with a single word: {single_word_reviews}")
print(f"Average length of a review: {average_length}")
print(f"Is the distribution of review lengths right skewed? {right_skewed}")
print("A. The dataset contains close to 99 million words:", total_words > 98_000_000 and total_words < 100_000_000)
print("B. There are more than 300 samples that contain a single word:", single_word_reviews > 300)
print("C. There are less than 300 samples that contain only a single word:", single_word_reviews < 300)
print("D. ‘Cheesy-melty-roasted-cauliflower-with-fresh-bread-crumbs-on top.\\nTo-die-for.’ is one of the single words in the dataset:", target in single_words)
print("E. The average length of a sample is 134.1:", np.isclose(average_length, 134.1))
print("F. The distribution of the length of the samples is right skewed:", right_skewed)

Total number of words: 87163758
Number of reviews with a single word: 337
Average length of a review: 134.1
Is the distribution of review lengths right skewed? True
A. The dataset contains close to 99 million words: False
B. There are more than 300 samples that contain a single word: True
C. There are less than 300 samples that contain only a single word: False
D. ‘Cheesy-melty-roasted-cauliflower-with-fresh-bread-crumbs-on top.\nTo-die-for.’ is one of the single words in the dataset: False
E. The average length of a sample is 134.1: True
F. The distribution of the length of the samples is right skewed: True
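
The two ideas this question relies on can be tried on a toy example. The sketch below uses made-up review strings (not taken from the dataset) to show how `str.split()` counts whitespace-separated words and how comparing the mean with the median hints at the direction of skew. The comparison is a rule of thumb, not a formal skewness test.

```python
import statistics

# Hypothetical toy reviews, invented for illustration only.
toy_reviews = [
    "Great!",
    "Loved the food",
    "Too-salty-for-me.",
    "Service was slow but the pasta made up for it in the end",
    "Good coffee here",
]

# str.split() with no argument splits on any run of whitespace,
# so hyphenated text without spaces counts as one word.
lengths = [len(review.split()) for review in toy_reviews]
single_word = [r for r in toy_reviews if len(r.split()) == 1]

mean_len = statistics.mean(lengths)
median_len = statistics.median(lengths)
# A long right tail pulls the mean above the median.
right_skewed = mean_len > median_len

print(lengths)
print(single_word)
print(mean_len, median_len, right_skewed)
```

Here one long review is enough to lift the mean above the median, which is the pattern expected from review lengths in general.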


Question 2 - Load the “bert-base-uncased” pre-trained tokenizer and choose the correct statements about the tokenizer.
 A. The tokenizer is used for the BERT model with the context length of 512
 B. The tokenizer has 5 special tokens
 C. Tokenizing a sample that contains more than 512 words would result in truncation of all tokens beyond the length 512
 D. Tokenizer inserts all the special tokens when it processes a single sample as an input
 E. Tokenizer inserts [CLS] and [SEP] special tokens when it processes a single sample as an input
 F. Tokenizer inserts only the [CLS] special token when it processes a single sample as an input

In [3]:
!pip install transformers



In [7]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
context_length = tokenizer.model_max_length
print(f"Context length (max length) of the tokenizer: {context_length}")
special_tokens = tokenizer.all_special_tokens
print(f"Special tokens: {special_tokens}")
print(f"Number of special tokens: {len(special_tokens)}")

# A long input: five words repeated 1000 times.
sample_text = "This is a sample text " * 1000
tokens = tokenizer(sample_text, truncation=False, padding=False)
print(f"Tokenized length without truncation: {len(tokens['input_ids'])}")
# Truncation is off by default, so this matches the truncation=False call.
tokens = tokenizer(sample_text, padding=False)
print(f"Tokenized length with default settings: {len(tokens['input_ids'])}")
tokens = tokenizer(sample_text, truncation=True, padding=False)
print(f"Tokenized length with truncation: {len(tokens['input_ids'])}")

# add_special_tokens defaults to True, so both calls below return the same IDs.
sample_input = "This is a sample input."
tokens = tokenizer(sample_input)
print(f"Tokenized input: {tokens['input_ids']}")
tokens = tokenizer(sample_input, add_special_tokens=True)
print(f"Tokenized input with special tokens: {tokens['input_ids']}")

Context length (max length) of the tokenizer: 512
Special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
Number of special tokens: 5


Token indices sequence length is longer than the specified maximum sequence length for this model (5002 > 512). Running this sequence through the model will result in indexing errors


Tokenized length without truncation: 5002
Tokenized length with default settings: 5002
Tokenized length with truncation: 512
Tokenized input: [101, 2023, 2003, 1037, 7099, 7953, 1012, 102]
Tokenized input with special tokens: [101, 2023, 2003, 1037, 7099, 7953, 1012, 102]


In [9]:
cls_token_id = tokenizer.cls_token_id
sep_token_id = tokenizer.sep_token_id
pad = tokenizer.pad_token_id
unk = tokenizer.unk_token_id
mask = tokenizer.mask_token_id
print(f"Token ID of [CLS]: {cls_token_id}")
print(f"Token ID of [SEP]: {sep_token_id}")
print(f"Token ID of [PAD]: {pad}")
print(f"Token ID of [UNK]: {unk}")
print(f"Token ID of [MASK]: {mask}")

Token ID of [CLS]: 101
Token ID of [SEP]: 102
Token ID of [PAD]: 0
Token ID of [UNK]: 100
Token ID of [MASK]: 103
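
The lengths observed in this question (5002 without truncation, 512 with it) follow from two rules: a single sample is wrapped as [CLS] ... [SEP], and truncation has to leave room for those two tokens. The sketch below mimics this with a hypothetical helper, `wrap_and_truncate`, and placeholder token IDs. It is a toy stand-in, not the tokenizer's actual implementation.

```python
CLS_ID, SEP_ID = 101, 102   # IDs printed above for [CLS] and [SEP]
MAX_LEN = 512               # model_max_length printed above

def wrap_and_truncate(token_ids, max_len=None):
    """Toy stand-in: add [CLS]/[SEP] and optionally truncate to max_len."""
    if max_len is not None:
        # Reserve two slots for the special tokens.
        token_ids = token_ids[: max_len - 2]
    return [CLS_ID] + token_ids + [SEP_ID]

# 5000 placeholder IDs stand in for the tokens of "This is a sample text " * 1000.
fake_ids = [1000] * 5000

untruncated = wrap_and_truncate(fake_ids)
truncated = wrap_and_truncate(fake_ids, max_len=MAX_LEN)
print(len(untruncated), len(truncated))
```

Only [CLS] and [SEP] appear in the output; [PAD], [UNK] and [MASK] are added only when padding, unknown words or masking are involved.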


Question 3 - Use “BertConfig” and “BertForMaskedLM” to construct the default (original) BERT model. Choose the correct statements.
 A. The model has 12 Bert layers
 B. The model has 6 Bert layers
 C. The model uses absolute position embeddings
 D. The word embedding (token embedding) layer has about 23 million learnable parameters
 E. The total number of parameters in the model is close to 110 million

In [13]:
from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config)

numlayers = config.num_hidden_layers
print(f"Number of layers in the model: {numlayers}")

pos_embed_type = config.position_embedding_type
print(f"Position embedding type: {pos_embed_type}")

num_parameters = model.num_parameters()
print(f"Total number of parameters in the model: {num_parameters}")

embedding_size = config.hidden_size
vocab_size = config.vocab_size
embedding_params = embedding_size * vocab_size
print(f"Embedding layer parameters: {embedding_params}")

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in the model: {total_params / 1e6} million")

Number of layers in the model: 12
Position embedding type: absolute
Total number of parameters in the model: 109514298
Embedding layer parameters: 23440896
Total number of parameters in the model: 109.514298 million
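
The embedding count for option D can be checked by hand. The sketch below assumes the standard bert-base-uncased values `vocab_size = 30522` and `hidden_size = 768`, which are consistent with the product printed above.

```python
vocab_size = 30522   # assumed: config.vocab_size for bert-base-uncased
hidden_size = 768    # assumed: config.hidden_size for bert-base-uncased

# The word embedding layer is one vocab_size x hidden_size matrix with no bias.
word_embedding_params = vocab_size * hidden_size
print(word_embedding_params)
print(round(word_embedding_params / 1e6, 1), "million")
```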


Question 4 - Double the context length from 512 to 1024 (you can change it in the configuration). Count the number of parameters and enter the change in the number of parameters (in millions) compared to the default configuration.

In [14]:
config.max_position_embeddings = 1024
updated_model = BertForMaskedLM(config)
updated_params = sum(p.numel() for p in updated_model.parameters())
change_in_params = (updated_params - total_params) / 1e6
print(f"Original number of parameters: {total_params / 1e6} million")
print(f"Updated number of parameters: {updated_params / 1e6} million")
print(f"Change in parameters: {change_in_params} million")

Original number of parameters: 109.514298 million
Updated number of parameters: 109.907514 million
Change in parameters: 0.393216 million
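
The change can also be derived without building the model. With absolute position embeddings, only the position embedding table grows with the context length, one row of `hidden_size` values per extra position. A minimal check, assuming `hidden_size = 768`:

```python
hidden_size = 768                       # assumed: config.hidden_size
old_positions, new_positions = 512, 1024

# Each additional position adds one embedding row of hidden_size weights.
extra_params = (new_positions - old_positions) * hidden_size
print(extra_params, extra_params / 1e6)
```

This agrees with the difference printed above.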


Question 5 - Pack (chunk) the samples such that the length of all the samples in the dataset is 512 (for efficient training). Define a mapping function that implements the following procedure
 1. Take a batch of 1000 samples
 2. Tokenize it to get input IDs and attention mask
 3. Concatenate all the input IDs
 4. Chunk the concatenated IDs into a size of 512
 5. Drop the last chunk if its length is less than 512
 6. Pack all the chunks
 7. Iterate over all the batches in the dataset
Store the resulting dataset in the variable “ds_chunked”. Enter the total number of samples in the new dataset.
Note: the batch size should be kept at 1000 while calling "ds.map()" for the answer to match.
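
Steps 3 to 6 of the procedure can be sketched on toy data before touching the real dataset. The helper `pack` and the tiny token lists below are hypothetical and use a chunk size of 4 instead of 512 so the result is easy to read.

```python
def pack(tokenized_batch, chunk_size):
    """Concatenate per-sample ID lists and cut them into full chunks."""
    flat = [tok for ids in tokenized_batch for tok in ids]
    # Everything after the last full chunk is dropped.
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]

# Four "tokenized samples" of different lengths (10 IDs in total).
toy_batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
chunks = pack(toy_batch, chunk_size=4)
print(chunks)
```

Ten IDs give two full chunks of four, and the remaining two IDs are discarded. Because the leftover is dropped once per batch, the final count depends on the batch size, which is why the note above fixes it at 1000.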

In [15]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})


In [17]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
CHUNK_SIZE = 512

def chunk_samples(batch):
    # Tokenize without padding or truncation so that no token is lost or invented.
    inputs = tokenizer(batch['text'], padding=False, truncation=False)
    # Concatenate the per-sample lists into one long list per field.
    input_ids = [tok for ids in inputs['input_ids'] for tok in ids]
    attention_mask = [m for mask in inputs['attention_mask'] for m in mask]
    # Keep only full chunks; the trailing partial chunk is dropped.
    usable = (len(input_ids) // CHUNK_SIZE) * CHUNK_SIZE
    return {
        'input_ids': [input_ids[i:i + CHUNK_SIZE] for i in range(0, usable, CHUNK_SIZE)],
        'attention_mask': [attention_mask[i:i + CHUNK_SIZE] for i in range(0, usable, CHUNK_SIZE)],
    }

# The number of rows changes, so the original columns have to be removed.
ds_chunked = dataset.map(
    chunk_samples,
    batched=True,
    batch_size=1000,
    remove_columns=dataset['train'].column_names,
)
# Each row of the new dataset is one chunk of 512 token IDs.
total_samples = len(ds_chunked['train'])
print(f"Total number of samples in the chunked dataset: {total_samples}")


Question 6 - Split the new dataset into training and test sets with the test_size=0.05 and seed=42. Use the appropriate data collator function for the MLM objective and set the masking probability to 0.2. Use the data loader from PyTorch to load a batch of samples, and enter the token ID corresponding to the unmasked token
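
Before running the collator, it helps to see the labelling convention it follows: positions chosen for masking keep their original ID in `labels`, and every other position gets the label -100 so the loss ignores it. The sketch below is a simplified toy version with a hypothetical `toy_mask` helper. The real `DataCollatorForLanguageModeling` additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
import random

MASK_ID = 103    # [MASK] ID printed in Question 2
IGNORE = -100    # label used for positions not selected for masking

def toy_mask(input_ids, special_ids, prob, rng):
    """Mask each non-special token with probability prob (simplified)."""
    masked, labels = [], []
    for tok in input_ids:
        if tok not in special_ids and rng.random() < prob:
            masked.append(MASK_ID)   # the model sees [MASK]
            labels.append(tok)       # and has to predict the original ID
        else:
            masked.append(tok)       # token is left as it is
            labels.append(IGNORE)    # and is ignored by the loss
    return masked, labels

rng = random.Random(42)
ids = [101, 2023, 2003, 1037, 7099, 7953, 1012, 102]
masked, labels = toy_mask(ids, special_ids={101, 102}, prob=0.2, rng=rng)
print(masked)
print(labels)
```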

In [27]:
from transformers import DataCollatorForLanguageModeling, BertTokenizer
from torch.utils.data import DataLoader

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
split_dataset = ds_chunked['train'].train_test_split(test_size=0.05, seed=42)

# ds_chunked is already tokenized, so no second tokenization pass is needed.
# Keep only the columns the collator can turn into tensors.
train_set = split_dataset['train'].select_columns(['input_ids', 'attention_mask'])

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2
)
train_dataloader = DataLoader(
    train_set,
    batch_size=16,
    collate_fn=data_collator
)

# Load a single batch and inspect the first sample.
batch = next(iter(train_dataloader))
input_ids = batch['input_ids']
labels = batch['labels']
# The collator sets the label to -100 at every position it did not select for masking.
unmasked = labels[0] == -100
print(f"Unmasked token ID: {input_ids[0][unmasked][0].item()}")
print(f"Label ID at unmasked positions: {labels[0][unmasked][0].item()}")