Where BERT is Still Used Today

1. Search & Retrieval (Vector Search, RAG Base Models)

- For generating dense embeddings (e.g., Sentence-Transformers, MiniLM).

- Stored in FAISS, Pinecone, Milvus for fast similarity search.

- Used in smaller LLM pipelines (retriever + generator architecture).

2. Enterprise-level NLP Tasks (Fast & Cost-Effective)

- Named Entity Recognition (NER)

- Sentiment Analysis

- Classification tasks (spam detection, intent classification)

- Extractive summarization using lightweight variants (DistilBERT).

3. Hybrid Pipelines with LLMs

- BERT embeddings for the retriever, then an LLM generates the answer (RAG architecture).

4. Multilingual NLP

- XLM-R (a multilingual RoBERTa-style encoder) is still a top choice, covering about 100 languages.

- Used for cross-lingual search and classification; translation itself needs a sequence-to-sequence model such as mT5.

5. On-Device / Low-Latency Inference

- For mobile apps and edge devices where GPT/Claude can’t run.

- Quantized DistilBERT/MiniLM models for chatbots and offline NLP tasks.
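
The retriever role described in items 1 and 3 can be sketched without any model at all. The snippet below is a minimal illustration, assuming the documents have already been embedded; the three-dimensional vectors, the document names, and the `top_k` helper are hypothetical stand-ins for real encoder output and a vector store such as FAISS.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional "embeddings"; a real encoder produces hundreds of dimensions.
doc_vecs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "api_reference": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "How do I get my money back?"

results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

In a RAG pipeline, the text of the top-ranked documents would then be placed in the prompt of the generator model.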

| **Model**           | **Main Use-Cases**                                  | **Why Still Used**                    |
| ------------------- | --------------------------------------------------- | ------------------------------------- |
| BERT / DistilBERT   | NER, classification, embeddings                     | Small, fast, cheap inference          |
| RoBERTa / DeBERTa   | Classification, QA, summarization                   | High accuracy, Kaggle/enterprise use  |
| MPNet / MiniLM      | Vector search, semantic retrieval (RAG)             | Best for FAISS/Pinecone retrieval     |
| T5 / Flan-T5        | Summarization, translation, instruction tasks       | Lightweight text-to-text generation   |
| BART / Pegasus      | Abstractive summarization                           | Less resource-hungry than LLMs        |
| XLNet / Electra     | Classification, QA (legacy setups)                  | Still optimized for speed             |
| XLM-R / mT5 / LaBSE | Multilingual NLP, translation, cross-lingual search | 100+ language support, enterprise use |


In [None]:
!pip install --upgrade datasets fsspec transformers

# First Part

In [None]:
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

In [None]:
# Alternative, more complex datasets to try:
# load_dataset("ag_news")
# load_dataset("dbpedia_14")

In [None]:
# Customer feedback classification (positive/negative/neutral)
# Support ticket intent detection (billing, technical, general)
# Email/topic categorization

In [None]:
from datasets import load_dataset
from transformers import BertTokenizer
# Load IMDB dataset and subset
dataset = load_dataset("imdb")

In [None]:
dataset

In [None]:
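# Shuffle before subsetting: the IMDB splits are stored grouped by label,
# so taking the first rows unshuffled would yield reviews of a single class.
train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))
test_dataset = dataset["test"].shuffle(seed=42).select(range(500))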

In [None]:
# Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")



##### The tokenizer reads the raw `text` column (each row in the IMDB dataset contains one review)
##### It pads every review to the same length (256 tokens)
##### Any review longer than 256 tokens is truncated (cut off)
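
A toy illustration of what padding and truncation to a fixed length do to a list of token IDs. This is a simplified sketch, not the tokenizer's own implementation: the `pad_or_truncate` helper is hypothetical, a pad ID of 0 is assumed, and unlike the real tokenizer it does not keep the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncate if too long
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad slots still needed
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions.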

    


In [None]:
# Tokenization function
def tokenize_fn(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

In [None]:
# Apply tokenization + rename + format in a single flow
def preprocess(ds):
    ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])  # remove raw text (saves memory)
    ds = ds.rename_column("label", "labels")
    ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    return ds

In [None]:
train_dataset = preprocess(train_dataset)

In [None]:
test_dataset = preprocess(test_dataset)

In [None]:
# 3. Model load
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

In [None]:
for layer in model.bert.encoder.layer:
  print(layer)


In [None]:
# # The classifier head stays trainable by default

# for param in model.bert.parameters():
#     param.requires_grad = False  # Freeze BERT encoder

# #Unfreeze last 2 encoder layers

# for layer in model.bert.encoder.layer[-2:]:
#     for param in layer.parameters():
#         param.requires_grad = True

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./bert-finetuned-imdb",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    logging_dir="./logs",
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none"
)

In [None]:
# 5. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

In [None]:
# 6. Train
trainer.train()

In [None]:
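# Note: training above used report_to="none", so no TensorBoard event files are written to ./logs.
# Set report_to="tensorboard" in TrainingArguments before training to populate it.
!tensorboard --logdir=./logs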

In [None]:
trainer.save_model("./bert-finetuned-imdb")

In [None]:
tokenizer.save_pretrained("./bert-finetuned-imdb")

In [None]:
# 7. Evaluate
metrics = trainer.evaluate()

In [None]:
print(metrics)
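
As configured above, the Trainer receives no `compute_metrics` function, so the printed metrics contain the evaluation loss and timing figures but no accuracy. A minimal sketch of a function that could be passed as `Trainer(..., compute_metrics=compute_metrics)`, assuming NumPy is installed; the toy logits and labels are made up for the check.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn (logits, labels) into an accuracy score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}

# Toy check with made-up logits for four examples and two classes
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
toy_labels = np.array([0, 1, 1, 1])
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```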

## Prediction

In [None]:
# Load from the same relative path used when saving (resolves to /content/... on Colab)
tokenizer = BertTokenizer.from_pretrained("./bert-finetuned-imdb")
model = BertForSequenceClassification.from_pretrained("./bert-finetuned-imdb")

In [None]:
from transformers import pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [None]:
# Predict
text = "This movie was amazing and I loved the acting!"
result = classifier(text)

In [None]:
print(result)  # Example shape: [{'label': 'LABEL_1', 'score': 0.98}]; LABEL_1 = positive, since no id2label mapping was set

### Pushing it to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from huggingface_hub import whoami
print(whoami())

In [None]:
tokenizer.push_to_hub("sunny199/my-bert-imdb2")

In [None]:
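# Trainer.push_to_hub() takes a commit message as its first argument, not a repo ID,
# so push the model weights directly to the target repository instead.
model.push_to_hub("sunny199/my-bert-imdb2")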

In [None]:
# from datasets import load_dataset
# from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding

# # 1. Load IMDB dataset (subset for speed)
# dataset = load_dataset("imdb")
# train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))
# test_dataset = dataset["test"].shuffle(seed=42).select(range(500))

# # 2. Initialize tokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# # 3. Tokenization function (no fixed padding here)
# def tokenize_fn(examples):
#     return tokenizer(examples["text"], truncation=True, max_length=256)

# # 4. Preprocess dataset (map + rename + torch format)
# def preprocess(ds):
#     ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])  # remove raw text
#     ds = ds.rename_column("label", "labels")
#     ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
#     return ds

# train_dataset = preprocess(train_dataset)
# test_dataset = preprocess(test_dataset)

# # 5. Initialize model
# model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# # 6. Data collator (dynamic padding)
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# # 7. Training arguments
# training_args = TrainingArguments(
#     output_dir="./bert-finetuned-imdb",
#     num_train_epochs=1,
#     per_device_train_batch_size=8,
#     per_device_eval_batch_size=8,
#     logging_dir="./logs",
#     logging_steps=50,
#     learning_rate=2e-5,
#     weight_decay=0.01,
#     eval_steps=500,
#     save_steps=500,
#     save_total_limit=1,
#     report_to="none"
# )

# # 8. Trainer
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=test_dataset,
#     data_collator=data_collator,  # dynamic padding here
# )

# # 9. Train
# trainer.train()


# Fine-tuning on multiple problems

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    BertTokenizer,
    BertTokenizerFast,
    BertForSequenceClassification,
    BertForTokenClassification,
    BertForQuestionAnswering,
    get_linear_schedule_with_warmup  # Warms the learning rate up linearly, then decays it linearly, for stable BERT training
)
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, f1_score
import numpy as np
from tqdm import tqdm
from torch.optim import AdamW  # Adam optimizer with decoupled weight decay

In [None]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Dataset size = 800 samples

batch_size=8 → 1 epoch = 800/8 = 100 batches

epochs=3

So:

total_steps = 100 × 3 = 300

There will be 300 optimizer updates in total.

If warmup is 10%, the first 30 steps are warmup, and the LR decreases linearly over the remaining 270 steps.
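
The arithmetic above can be checked with a small, dependency-free sketch of a linear warmup and decay multiplier. The `linear_warmup_decay` function is a hypothetical re-implementation written for illustration; the base learning rate of 2e-5 is the value used elsewhere in this notebook.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # ramp up from 0 to 1
    # ramp down from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

dataset_size, batch_size, epochs = 800, 8, 3
steps_per_epoch = dataset_size // batch_size   # batches per epoch
total_steps = steps_per_epoch * epochs         # optimizer updates in total
warmup_steps = total_steps // 10               # 10% warmup
base_lr = 2e-5

for step in (0, 15, 30, 165, 300):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The learning rate peaks at step 30 (the end of warmup), is back to half its peak by step 165, and reaches zero at step 300.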

There are two ways to feed data to the model:

- Hugging Face’s direct map() + set_format() approach (used in the IMDB example)

- A custom PyTorch Dataset class (TextClassificationDataset)

When should the class below be used?

- You haven’t preprocessed the dataset into Hugging Face format (with map() + set_format()).

- You only have Python lists (train_texts, train_labels).

- You need custom preprocessing logic (e.g., a different tokenizer, extra transformations).

In [None]:
class Basket:
    def __init__(self, fruits):
        self.fruits = fruits  # List of fruits

    def __len__(self):
        return len(self.fruits)  # Total number of fruits

    def __getitem__(self, idx):
        return self.fruits[idx]  # Fetch a specific fruit by index


In [None]:
basket = Basket(["Apple", "Banana", "Mango"])

In [None]:
print(len(basket))  # calls Basket.__len__
print(basket[0])    # calls Basket.__getitem__(0)
print(basket[1])    # calls Basket.__getitem__(1)

In [None]:
# Dataset class
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

1. Initialize the classifier – load BERT model, tokenizer, and move to device.

2. Load IMDb data – sample train and test texts with labels.

3. Train the model –

- Convert texts and labels into a dataset.

- Use DataLoader for batching.

- For each epoch and batch: forward pass, compute loss, backward pass, update weights, adjust learning rate.

4. Evaluate the model –

- Run on test data without gradients.

- Collect predictions and compute accuracy, F1, and report.

5. Predict new texts –

- Tokenize, run through the model, apply softmax, and return predictions with probabilities.
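
The last step of the outline (softmax, then picking the most probable class) can be sketched in plain Python. The logits below are hypothetical values standing in for the model output of a single review.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]            # hypothetical scores for [negative, positive]
probs = softmax(logits)
pred = probs.index(max(probs))  # index of the most probable class

print(pred, [round(p, 3) for p in probs])
```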

In [None]:
# BERT Text Classifier
class BERTTextClassifier:
    """BERT for Text Classification (Sentiment, Spam etc.)"""

    def __init__(self, model_name='bert-base-uncased', num_classes=2, max_length=512):
        self.model_name = model_name
        self.num_classes = num_classes
        self.max_length = max_length

        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        self.model = BertForSequenceClassification.from_pretrained(
            model_name, num_labels=num_classes
        )
        self.model.to(device)

    def load_imdb_data(self, sample_size=5000):

        """Load IMDb movie reviews dataset"""

        print("Loading IMDb dataset...")

        dataset = load_dataset("imdb")

        # Sample data for faster training
        train_indices = np.random.choice(len(dataset['train']),
                                       min(sample_size, len(dataset['train'])),
                                       replace=False)

        test_indices = np.random.choice(len(dataset['test']),
                                      min(sample_size//4, len(dataset['test'])),
                                      replace=False)

        # Convert numpy.int64 → int for indexing
        train_texts = [dataset['train'][int(i)]['text'] for i in train_indices]
        train_labels = [dataset['train'][int(i)]['label'] for i in train_indices]

        test_texts = [dataset['test'][int(i)]['text'] for i in test_indices]
        test_labels = [dataset['test'][int(i)]['label'] for i in test_indices]

        print(f"Train samples: {len(train_texts)}")
        print(f"Test samples: {len(test_texts)}")

        return train_texts, train_labels, test_texts, test_labels

    def train(self, train_texts, train_labels, epochs=1, batch_size=8, learning_rate=2e-5):

        """Train the text classifier"""

        train_dataset = TextClassificationDataset(
            train_texts, train_labels, self.tokenizer, self.max_length
        )

        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        optimizer = AdamW(self.model.parameters(), lr=learning_rate, weight_decay=0.01)

        total_steps = len(train_loader) * epochs

        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=total_steps//10, num_training_steps=total_steps
        )

        self.model.train()

        for epoch in range(epochs):
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}')

            for batch in progress_bar:
                optimizer.zero_grad()

                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )

                loss = outputs.loss
                total_loss += loss.item()

                loss.backward()

                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

                optimizer.step()

                scheduler.step()

                progress_bar.set_postfix({'Loss': f'{loss.item():.4f}'})

            print(f'Epoch {epoch+1}, Average Loss: {total_loss/len(train_loader):.4f}')

    def evaluate(self, test_texts, test_labels, batch_size=8):

        """Evaluate the text classifier"""

        test_dataset = TextClassificationDataset(
            test_texts, test_labels, self.tokenizer, self.max_length
        )

        test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

        self.model.eval()

        predictions = []

        true_labels = []

        with torch.no_grad():
            for batch in tqdm(test_loader, desc='Evaluating'):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                preds = torch.argmax(logits, dim=1).cpu().numpy()

                predictions.extend(preds)
                true_labels.extend(labels.cpu().numpy())

        accuracy = accuracy_score(true_labels, predictions)
        f1 = f1_score(true_labels, predictions, average='weighted')
        report = classification_report(true_labels, predictions,
                                     target_names=['Negative', 'Positive'])

        return accuracy, f1, report

    def predict(self, texts):

        """Predict sentiment for new texts"""
        predictions = []
        probabilities = []

        self.model.eval()

        for text in texts:
            encoding = self.tokenizer(
                text,
                truncation=True,
                padding='max_length',
                max_length=self.max_length,
                return_tensors='pt'
            )

            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)

            with torch.no_grad():
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                probs = torch.softmax(logits, dim=1).cpu().numpy()[0]
                pred = torch.argmax(logits, dim=1).cpu().numpy()[0]

                predictions.append(pred)
                probabilities.append(probs)

        return predictions, probabilities

Tokens: ["John", "lives", "in", "London"]

Labels: [1, 0, 0, 2] (toy label IDs used only in this walkthrough)

1 = B-PER (John is a person)

0 = O (Outside entity)

2 = B-LOC (London is a location)

Step 1: Tokenizer output

Original words: John | lives | in | London

BERT tokens:    [CLS], John, lives, in, Lon, ##don, [SEP], [PAD]...

Word IDs:       None,   0,    1,    2,   3,    3,    None, None...

(The Lon / ##don split and the input IDs below are illustrative; the actual tokenizer output may differ.)

Step 2: Label alignment

Input IDs:     [101, 1001, 2002, 1999, 3001, 3010, 102, 0, 0, 0]

Word IDs:      [None,   0,   1,   2,   3,   3,  None, None, None, None]

Aligned Labels:[-100,   1,   0,   0,   2, -100, -100, -100, -100, -100]

Step 3: Final output passed to the model

{
  'input_ids': tensor([...]),         # token IDs

  'attention_mask': tensor([1,1,1,...]),  # 1 for real tokens, 0 for pads

  'labels': tensor([-100,1,0,0,2,-100,-100,...])  # aligned labels
}
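
The alignment rule from Step 2 can be reproduced with a few lines of plain Python. This is a sketch using the hand-written word IDs from the walkthrough rather than real tokenizer output; the `align_labels` helper is a hypothetical name.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token; ignore everything else."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:                 # [CLS], [SEP], [PAD]
            aligned.append(ignore_index)
        elif word_id != previous_word:      # first sub-token of a word
            aligned.append(word_labels[word_id])
        else:                               # continuation sub-token such as ##don
            aligned.append(ignore_index)
        previous_word = word_id
    return aligned

word_ids = [None, 0, 1, 2, 3, 3, None, None, None, None]
word_labels = [1, 0, 0, 2]   # John, lives, in, London
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Positions labeled -100 are skipped by PyTorch's cross-entropy loss, so only the first sub-token of each word contributes to training.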


In [None]:
import torch
from torch.utils.data import Dataset

class NERDataset(Dataset):
    """Dataset for Named Entity Recognition"""

    def __init__(self, tokens_list, labels_list, tokenizer, max_length=512):
        self.tokens_list = tokens_list
        self.labels_list = labels_list
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.tokens_list)

    def __getitem__(self, idx):
        tokens = self.tokens_list[idx]
        labels = self.labels_list[idx]

        # Tokenize with word-level alignment
        encoding = self.tokenizer(
            tokens,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            is_split_into_words=True,
            return_tensors='pt'
        )

        # Get word alignment for batch index 0
        word_ids = encoding.word_ids(batch_index=0)
        aligned_labels = []
        previous_word_idx = None

        for word_idx in word_ids:
            if word_idx is None:
                aligned_labels.append(-100)  # Ignore special tokens
            elif word_idx != previous_word_idx:
                aligned_labels.append(labels[word_idx] if word_idx < len(labels) else 0)
            else:
                aligned_labels.append(-100)  # Ignore subword tokens
            previous_word_idx = word_idx

        # Ensure aligned_labels length matches max_length
        if len(aligned_labels) < self.max_length:
            aligned_labels += [-100] * (self.max_length - len(aligned_labels))
        elif len(aligned_labels) > self.max_length:
            aligned_labels = aligned_labels[:self.max_length]

        return {
            'input_ids': encoding['input_ids'].squeeze(0),  # Shape: (max_length,)
            'attention_mask': encoding['attention_mask'].squeeze(0),  # Shape: (max_length,)
            'labels': torch.tensor(aligned_labels, dtype=torch.long)  # Shape: (max_length,)
        }


| **Label**  | **Full Form**          | **Meaning**                                                               | **Example**                                                       |
| ---------- | ---------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| **O**      | Outside                | Not an entity (ordinary word)                                             | "works", "at"                                                     |
| **B-PER**  | Begin - Person         | First word of a person entity (start of the name)                         | "John" → `B-PER`                                                  |
| **I-PER**  | Inside - Person        | Continuation of a person entity (remaining words of the name)             | "Mary Jane" → `Mary = B-PER`, `Jane = I-PER`                      |
| **B-ORG**  | Begin - Organization   | First word of an organization/company name                                | "Google" → `B-ORG`                                                |
| **I-ORG**  | Inside - Organization  | Remaining words of an organization name                                   | "New York Times" → `New = B-ORG`, `York = I-ORG`, `Times = I-ORG` |
| **B-LOC**  | Begin - Location       | First word of a location/place name                                       | "London" → `B-LOC`                                                |
| **I-LOC**  | Inside - Location      | Remaining words of a location name                                        | "New York" → `New = B-LOC`, `York = I-LOC`                        |
| **B-MISC** | Begin - Miscellaneous  | First word of a miscellaneous entity (event, product, nationality, etc.)  | "Indian" (nationality) → `B-MISC`                                 |
| **I-MISC** | Inside - Miscellaneous | Remaining words of a miscellaneous entity (if it spans multiple words)    | "South Korean" → `South = B-MISC`, `Korean = I-MISC`              |


B → Begin (the first word of an entity)

I → Inside (the words that continue an entity)

O → Outside (not an entity; an ordinary word)
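
To see how B/I/O tags turn back into entities, here is a small sketch that groups tagged tokens into spans. The `bio_to_entities` helper is a hypothetical name, and the sentence is assembled from the examples in the table above; a stray I- tag that does not continue an open entity is simply dropped in this simplified version.

```python
def bio_to_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:                      # close the previous entity
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)            # extend the open entity
        else:                                       # "O" or a stray I- tag
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:                              # entity that ends the sentence
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Mary", "Jane", "works", "at", "Google", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
entities = bio_to_entities(tokens, tags)
print(entities)
```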

WikiAnn (Wikipedia + Annotation) dataset

Complete flow:

1.   The WikiAnn English dataset is loaded from Hugging Face.
2.   Training (1000) and test (250) samples are selected at random.
3.   The tokens and their NER tags are extracted into separate lists.
4.   These lists are returned so they can be used in the training pipeline.










In [None]:
class BERTNERClassifier:
    """BERT for Named Entity Recognition"""

    def __init__(self, model_name='bert-base-uncased', num_labels=9, max_length=512):
        self.model_name = model_name
        self.num_labels = num_labels
        self.max_length = max_length

        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)

        self.model = BertForTokenClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        self.model.to(device)

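        # Label mapping (CoNLL-style, 9 tags). WikiAnn only uses the first seven
        # (O plus PER, ORG, LOC); the two MISC slots stay unused with that dataset.
        self.labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
                       'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']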

    def load_wikiann_data(self, sample_size=1000):
        """Load WikiAnn (Wikipedia + Annotation) dataset"""
        print("Loading wikiann NER dataset...")
        dataset = load_dataset("wikiann", "en")

        train_indices = np.random.choice(len(dataset['train']),
                                       min(sample_size, len(dataset['train'])),
                                       replace=False)

        test_indices = np.random.choice(len(dataset['test']),
                                      min(sample_size//4, len(dataset['test'])),
                                      replace=False)

        # Convert numpy.int64 to int
        train_tokens = [dataset['train'][int(i)]['tokens'] for i in train_indices]
        train_labels = [dataset['train'][int(i)]['ner_tags'] for i in train_indices]

        test_tokens = [dataset['test'][int(i)]['tokens'] for i in test_indices]
        test_labels = [dataset['test'][int(i)]['ner_tags'] for i in test_indices]

        print(f"Train samples: {len(train_tokens)}")
        print(f"Test samples: {len(test_tokens)}")

        return train_tokens, train_labels, test_tokens, test_labels

    def train(self, train_tokens, train_labels, epochs=1, batch_size=8, learning_rate=2e-5):
        """Train the NER model"""
        train_dataset = NERDataset(
            train_tokens, train_labels, self.tokenizer, self.max_length
        )
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        optimizer = AdamW(self.model.parameters(), lr=learning_rate, weight_decay=0.01)
        total_steps = len(train_loader) * epochs
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=total_steps//10, num_training_steps=total_steps
        )

        self.model.train()

        for epoch in range(epochs):
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}')

            for batch in progress_bar:
                optimizer.zero_grad()

                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )

                loss = outputs.loss
                total_loss += loss.item()

                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()

                progress_bar.set_postfix({'Loss': f'{loss.item():.4f}'})

            print(f'Epoch {epoch+1}, Average Loss: {total_loss/len(train_loader):.4f}')

    def evaluate(self, test_tokens, test_labels, batch_size=8):
        """Evaluate the NER model"""
        test_dataset = NERDataset(
            test_tokens, test_labels, self.tokenizer, self.max_length
        )
        test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

        self.model.eval()
        predictions = []
        true_labels = []

        with torch.no_grad():
            for batch in tqdm(test_loader, desc='Evaluating'):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                preds = torch.argmax(logits, dim=2).cpu().numpy()
                labels = labels.cpu().numpy()

                # Collect valid predictions (ignore -100)
                for i in range(preds.shape[0]):
                    for j in range(preds.shape[1]):
                        if labels[i][j] != -100:
                            predictions.append(preds[i][j])
                            true_labels.append(labels[i][j])

        accuracy = accuracy_score(true_labels, predictions)
        f1 = f1_score(true_labels, predictions, average='weighted')

        return accuracy, f1

    def predict(self, tokens_list):
        """Predict NER tags for new tokens"""
        predictions = []

        self.model.eval()
        for tokens in tokens_list:
            encoding = self.tokenizer(
                tokens,
                truncation=True,
                padding='max_length',
                max_length=self.max_length,
                return_tensors='pt',
                is_split_into_words=True
            )

            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)

            with torch.no_grad():
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                preds = torch.argmax(logits, dim=2).cpu().numpy()[0]

                # Get alignment for original words
                word_ids = encoding.word_ids(batch_index=0)
                token_predictions = []
                previous_word_idx = None

                for i, word_idx in enumerate(word_ids):
                    if word_idx is not None and word_idx != previous_word_idx:
                        if word_idx < len(tokens):
                            token_predictions.append(self.labels[preds[i]])
                    previous_word_idx = word_idx

                predictions.append(token_predictions)

        return predictions


In [None]:
class QADataset(Dataset):
    """Dataset for Question Answering (SQuAD-style)"""

    def __init__(self, questions, contexts, answers, tokenizer, max_length=512):
        self.questions = questions
        self.contexts = contexts
        self.answers = answers
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        context = self.contexts[idx]
        answer = self.answers[idx]  # dict: {'text': [...], 'answer_start': [...]}

        # Encode inputs with offsets to locate answer
        encoding = self.tokenizer(
            question,
            context,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_offsets_mapping=True,
            return_tensors='pt'
        )

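        offset_mapping = encoding.pop("offset_mapping")[0].tolist()  # (max_length, 2)
        # sequence_ids: None for special tokens, 0 for question tokens, 1 for context tokens
        sequence_ids = encoding.sequence_ids(0)
        # Default to position 0 ([CLS]) when the answer is missing or truncated away
        start_positions = torch.tensor(0, dtype=torch.long)
        end_positions = torch.tensor(0, dtype=torch.long)

        if answer and 'answer_start' in answer and answer['answer_start']:
            answer_start = answer['answer_start'][0]
            answer_text = answer['text'][0]
            answer_end = answer_start + len(answer_text)

            # Find the context tokens whose character span covers the answer start/end.
            # Offsets are relative to each sequence, so question tokens must be skipped.
            for token_idx, (start, end) in enumerate(offset_mapping):
                if sequence_ids[token_idx] != 1:
                    continue
                if start <= answer_start < end:
                    start_positions = torch.tensor(token_idx, dtype=torch.long)
                if start < answer_end <= end:
                    end_positions = torch.tensor(token_idx, dtype=torch.long)
                    break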

        return {
            'input_ids': encoding['input_ids'].squeeze(0),      # (max_length,)
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'start_positions': start_positions,
            'end_positions': end_positions
        }
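
A tokenizer-free sketch of how a character-level answer span maps to token positions. The offsets and sequence IDs are written by hand for the pair ("Where?", "John lives in London") and the `char_span_to_token_span` helper is hypothetical. Offsets are relative to each sequence, so the sketch only considers context tokens (sequence ID 1); otherwise a question token with overlapping offsets could be matched by mistake.

```python
def char_span_to_token_span(offsets, sequence_ids, answer_start, answer_end):
    """Map a character span in the context to (start_token, end_token).

    Returns (0, 0) when the span is not fully covered by context tokens,
    for example because the answer was truncated away.
    """
    start_token = end_token = None
    for i, (start, end) in enumerate(offsets):
        if sequence_ids[i] != 1:            # skip question and special tokens
            continue
        if start <= answer_start < end:
            start_token = i
        if start < answer_end <= end:
            end_token = i
            break
    if start_token is None or end_token is None:
        return 0, 0
    return start_token, end_token

# tokens: [CLS] where ? [SEP] john lives in london [SEP]
offsets = [(0, 0), (0, 5), (5, 6), (0, 0), (0, 4), (5, 10), (11, 13), (14, 20), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, None]

context = "John lives in London"
answer_start = context.index("London")
answer_end = answer_start + len("London")
span = char_span_to_token_span(offsets, sequence_ids, answer_start, answer_end)
print(span)
```

For the answer "John" (characters 0 to 4), the question token "where" also starts at offset 0, which is exactly the collision the sequence-ID check avoids.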


In [None]:
class BERTQuestionAnswering:
    """BERT for Question Answering (SQuAD)"""

    def __init__(self, model_name='bert-base-uncased', max_length=512):
        self.model_name = model_name
        self.max_length = max_length

        self.tokenizer = BertTokenizerFast.from_pretrained(model_name)
        self.model = BertForQuestionAnswering.from_pretrained(model_name)
        self.model.to(device)

    def load_squad_data(self, sample_size=2000):
        """Load and sample SQuAD dataset"""
        print("Loading SQuAD dataset...")
        dataset = load_dataset("squad")

        train_indices = np.random.choice(len(dataset['train']),
                                         min(sample_size, len(dataset['train'])),
                                         replace=False)
        val_indices = np.random.choice(len(dataset['validation']),
                                       min(sample_size//4, len(dataset['validation'])),
                                       replace=False)

        train_questions = [dataset['train'][int(i)]['question'] for i in train_indices]
        train_contexts = [dataset['train'][int(i)]['context'] for i in train_indices]
        train_answers = [dataset['train'][int(i)]['answers'] for i in train_indices]

        val_questions = [dataset['validation'][int(i)]['question'] for i in val_indices]
        val_contexts = [dataset['validation'][int(i)]['context'] for i in val_indices]
        val_answers = [dataset['validation'][int(i)]['answers'] for i in val_indices]

        print(f"Train samples: {len(train_questions)}")
        print(f"Validation samples: {len(val_questions)}")

        return (train_questions, train_contexts, train_answers,
                val_questions, val_contexts, val_answers)

    def train(self, questions, contexts, answers, epochs=1, batch_size=8, learning_rate=2e-5):
        """Train QA model using QADataset (offset mapping based)"""
        train_dataset = QADataset(questions, contexts, answers, self.tokenizer, self.max_length)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        optimizer = AdamW(self.model.parameters(), lr=learning_rate, weight_decay=0.01)

        total_steps = len(train_loader) * epochs

        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=total_steps//10, num_training_steps=total_steps
        )

        self.model.train()

        for epoch in range(epochs):
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}')

            for batch in progress_bar:

                optimizer.zero_grad()

                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                start_positions = batch['start_positions'].to(device)
                end_positions = batch['end_positions'].to(device)

                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    start_positions=start_positions,
                    end_positions=end_positions
                )

                loss = outputs.loss

                total_loss += loss.item()

                loss.backward()

                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

                optimizer.step()

                scheduler.step()

                progress_bar.set_postfix({'Loss': f'{loss.item():.4f}'})

            print(f'Epoch {epoch+1}, Average Loss: {total_loss/len(train_loader):.4f}')

    def answer_question(self, question, context, max_answer_len=30):
        """Answer a single question given context"""
        # No padding is needed for a single example; padding only adds
        # positions that the argmax below could pick by mistake.
        encoding = self.tokenizer(
            question,
            context,
            truncation='only_second',  # truncate the context, never the question
            max_length=self.max_length,
            return_tensors='pt'
        )
        input_ids = encoding['input_ids'].to(device)
        attention_mask = encoding['attention_mask'].to(device)

        self.model.eval()
        with torch.no_grad():
            outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

        start_idx = torch.argmax(outputs.start_logits, dim=1).item()
        end_idx = torch.argmax(outputs.end_logits, dim=1).item()

        # Ensure a valid span of at most max_answer_len tokens
        if end_idx < start_idx:
            end_idx = start_idx
        if (end_idx - start_idx + 1) > max_answer_len:
            end_idx = start_idx + max_answer_len - 1

        # Decode predicted tokens
        answer_tokens = input_ids[0][start_idx:end_idx + 1]
        return self.tokenizer.decode(answer_tokens, skip_special_tokens=True).strip()
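
`answer_question` picks the start and end positions with two independent argmax calls and then repairs the span if it turns out invalid. A common alternative is to score start/end pairs jointly and keep the best pair that is already valid. The block below is a minimal sketch of that idea on plain Python lists; `best_span` and its `top_k` parameter are hypothetical names introduced here, not part of the notebook's classes.

```python
def best_span(start_logits, end_logits, max_answer_len=30, top_k=20):
    """Return (start, end, score) maximizing start_logit + end_logit over valid spans."""
    # Only the top_k candidates on each side are considered, to keep the search small.
    top_starts = sorted(range(len(start_logits)),
                        key=lambda i: start_logits[i], reverse=True)[:top_k]
    top_ends = sorted(range(len(end_logits)),
                      key=lambda i: end_logits[i], reverse=True)[:top_k]

    best = None
    for s in top_starts:
        for e in top_ends:
            # A valid span ends at or after its start and has at most max_answer_len tokens.
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best  # None when no candidate pair forms a valid span


# Toy logits: the independent argmax would give start=2, end=1 (an invalid span).
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 6.0, 0.2, 4.0, 0.1]
print(best_span(start_logits, end_logits))  # (2, 3, 9.0)
```

Inside the class, the inputs could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. Restricting the candidates to context tokens (rather than question tokens) would be a further refinement that this sketch does not attempt.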


In [None]:
def run_text_classification_demo():

    """Demo for text classification"""

    print("\n" + "="*60)
    print("TEXT CLASSIFICATION (Sentiment Analysis) DEMO")
    print("="*60)

    classifier = BERTTextClassifier(num_classes=2)

    # Load data
    train_texts, train_labels, test_texts, test_labels = classifier.load_imdb_data(sample_size=1000)

    # Show sample
    print(f"\nSample Review: {train_texts[0][:200]}...")
    print(f"Label: {'Positive' if train_labels[0] == 1 else 'Negative'}")

    # Train for 1 epoch (small for demo)
    classifier.train(train_texts, train_labels, epochs=1, batch_size=8)

    # Evaluate
    accuracy, f1, report = classifier.evaluate(test_texts, test_labels, batch_size=8)

    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")

    # Test custom examples
    custom_reviews = [
        "This movie was fantastic! Amazing acting and great plot.",
        "Boring and terrible. Waste of time.",
        "Not bad, could be better though."
    ]

    predictions, probabilities = classifier.predict(custom_reviews)

    print("\nCustom Predictions:")

    for text, pred, prob in zip(custom_reviews, predictions, probabilities):

        sentiment = "Positive" if pred == 1 else "Negative"

        confidence = prob[pred] * 100

        print(f"'{text[:50]}...' -> {sentiment} ({confidence:.1f}%)")
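
The confidence printed above is the predicted class's probability scaled to a percentage. Assuming `classifier.predict` derives its probabilities from a softmax over the two output logits (the class definition is not shown in this section, so that is an assumption), the computation can be sketched in plain Python as follows. The function names and the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, labels=("Negative", "Positive")):
    """Map raw logits to (label, confidence in percent)."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return labels[pred], probs[pred] * 100


# Hypothetical logits for one review: index 0 = negative, index 1 = positive.
label, confidence = label_with_confidence([-1.2, 2.3])
print(f"{label} ({confidence:.1f}%)")
```

With only two classes, a mixed review such as "Not bad, could be better though." is still forced into one of them; a confidence close to 50% is the only hint that the model found it ambiguous.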


In [None]:
def run_ner_demo():

    """Demo for Named Entity Recognition"""

    print("\n" + "="*60)
    print("NAMED ENTITY RECOGNITION DEMO")
    print("="*60)

    ner_model = BERTNERClassifier(num_labels=9)

    # Load small subset for demo (fast training)
    train_tokens, train_labels, test_tokens, test_labels = ner_model.load_wikiann_data(sample_size=500)

    # Show sample
    print(f"\nSample tokens: {train_tokens[0][:10]}")

    label_names = [ner_model.labels[label_id] if label_id < len(ner_model.labels) else "O"
                   for label_id in train_labels[0][:10]]

    print(f"Sample labels: {label_names}")

    # Train for 1 epoch
    ner_model.train(train_tokens, train_labels, epochs=1, batch_size=8)

    # Evaluate
    accuracy, f1 = ner_model.evaluate(test_tokens, test_labels, batch_size=8)
    print(f"\nAccuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")

    # Custom examples
    custom_sentences = [
        ["John", "Smith", "works", "at", "Google", "in", "California"],
        ["Apple", "Inc.", "was", "founded", "by", "Steve", "Jobs"]
    ]

    predictions = ner_model.predict(custom_sentences)
    print("\nCustom Predictions:")
    for tokens, preds in zip(custom_sentences, predictions):
        print("Tokens:", tokens)
        print("Labels:", preds)
        print()
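
The demo prints one label per token. If the predictions are BIO-style strings such as `B-PER` and `I-PER` (an assumption; the format returned by `ner_model.predict` is not shown in this section), they can be merged into whole entities. The helper below is a hypothetical sketch of that grouping step, and the tags in the example are hand-written rather than model output.

```python
def group_entities(tokens, tags):
    """Merge BIO tags into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # Start a new entity; a stray I- tag is treated as a beginning.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or any unrecognized tag) closes the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["John", "Smith", "works", "at", "Google", "in", "California"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]  # hand-written example tags
print(group_entities(tokens, tags))
# [('John Smith', 'PER'), ('Google', 'ORG'), ('California', 'LOC')]
```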


In [None]:
def run_qa_demo():

    """Demo for Question Answering"""

    print("\n" + "="*60)
    print("QUESTION ANSWERING DEMO")
    print("="*60)

    qa_model = BERTQuestionAnswering()

    # Load small subset (for speed)
    (train_questions, train_contexts, train_answers,
     val_questions, val_contexts, val_answers) = qa_model.load_squad_data(sample_size=500)

    # Safe sample print
    ans_text = train_answers[0]['text'][0] if train_answers[0]['text'] else 'No answer'

    print(f"\nSample Question: {train_questions[0]}")

    print(f"Sample Context: {train_contexts[0][:200]}...")

    print(f"Sample Answer: {ans_text}")

    # Train model (1 epoch for demo; the QA head starts untrained, so expect rough answers)
    qa_model.train(train_questions, train_contexts, train_answers, epochs=1, batch_size=4)

    # Test on custom questions
    print("\nCustom Q&A Examples:")

    test_cases = [
        {
            "question": "What is the capital of France?",
            "context": "France is a country in Europe. Paris is the capital and largest city of France. The city is known for the Eiffel Tower and the Louvre Museum."
        },
        {
            "question": "Who founded Apple?",
            "context": "Apple Inc. is an American technology company. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976. The company is known for products like iPhone and Mac."
        }
    ]

    for case in test_cases:
        answer = qa_model.answer_question(case["question"], case["context"], max_answer_len=30)
        print(f"Q: {case['question']}")
        print(f"A: {answer}")
        print()
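
The demo loads `val_questions`, `val_contexts`, and `val_answers` but never scores the model on them. SQuAD predictions are usually scored with exact match (EM) and token-level F1 after light text normalization. The block below is a self-contained sketch of those two metrics, modeled on the normalization steps of the SQuAD v1.1 evaluation script; the function names are chosen here for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction, gold):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_prediction(prediction, gold_answers):
    """Best EM and F1 over all reference answers (expects at least one)."""
    em = max(exact_match(prediction, g) for g in gold_answers)
    f1 = max(token_f1(prediction, g) for g in gold_answers)
    return em, f1


print(score_prediction("the Eiffel Tower", ["Eiffel Tower", "Eiffel Tower in Paris"]))
# (1.0, 1.0)
```

In the notebook, each validation item could be scored with `score_prediction(qa_model.answer_question(q, c), a['text'])` for `q, c, a` drawn from the three validation lists, and the per-item scores averaged.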


In [None]:
print("BERT Multi-Task Demo")
print("Choose a task to run:")
print("1. Text Classification (Sentiment Analysis)")
print("2. Named Entity Recognition (NER)")
print("3. Question Answering")
print("4. Run All Tasks")

choice = input("\nEnter your choice (1-4): ").strip()

try:
    if choice == "1":
        run_text_classification_demo()
    elif choice == "2":
        run_ner_demo()
    elif choice == "3":
        run_qa_demo()
    elif choice == "4":
        run_text_classification_demo()
        run_ner_demo()
        run_qa_demo()
    else:
        print("Invalid choice! Please run again.")
except Exception as e:
    print("\n--- ERROR OCCURRED ---")
    print(f"Error: {e}")
    print("\nMake sure you have:")
    print("1. Installed required packages:")
    print("   pip install torch transformers datasets scikit-learn tqdm numpy")
    print("2. Loaded all classes & dataset helpers (BERTTextClassifier, BERTNERClassifier, BERTQuestionAnswering, TextClassificationDataset, NERDataset, QADataset)")
    print("3. Used the fixed versions (int casting, offset_mapping for QA, word_ids fix for NER)")
