<a href="https://colab.research.google.com/github/aditya0589/notebooks/blob/main/Natural%20Language%20Processing/NLP07.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NLP07 BERT MODEL**


## **BERT (Bidirectional Encoder Representations from Transformers)**

**BERT** is a pretrained deep learning language model based on the Transformer encoder architecture that learns contextual representations of words by looking at both left and right context simultaneously.

Unlike left-to-right language models, which condition each word only on the words that precede it, BERT is bidirectional: every token's representation depends on the words on both sides, so the same word receives different vectors in different contexts.
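The bidirectional behaviour comes from self-attention applied without a causal mask: each token scores every other token, left and right alike. The sketch below is a minimal single-head version on plain Python lists. It assumes the queries, keys, and values are the input vectors themselves (a real Transformer layer applies learned projections first), and the token list and `toy_vectors` are made up for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product attention with no causal mask.

    Assumption: queries, keys and values are the raw input vectors.
    """
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Score this token against every token, on both sides of it
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights

tokens = ["the", "river", "bank"]                       # hypothetical tokens
toy_vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]      # hypothetical embeddings
outputs, weights = self_attention(toy_vectors)
for tok, w in zip(tokens, weights):
    print(tok, [round(x, 3) for x in w])
```

Every row of `weights` has non-zero entries for positions both before and after the token, which is what "looking at both left and right context" means in practice.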

## **Why BERT is Powerful**

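1. Uses self-attention to model relationships between every pair of tokens in a sentence

2. Pretrained on large unlabeled corpora (BooksCorpus and English Wikipedia) with self-supervised objectives: masked language modeling and next sentence prediction

3. Can be fine-tuned for many downstream NLP tasks with minimal changes, typically by adding a small task-specific output layer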
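The pretraining in item 2 needs no human labels because the training signal comes from the text itself. A rough sketch of the masked language modeling input corruption follows, assuming the recipe from the original BERT paper (select about 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). The function name `mask_tokens`, the toy vocabulary, and the seed are illustrative, not part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected for prediction
        labels[i] = tok                   # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"          # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab) # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]   # hypothetical vocabulary
sentence = "the cat sat on the mat".split()
# A higher mask_prob than 0.15 is used here only so a short sentence shows some masking
masked, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.5)
print(masked)
print(labels)
```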

## **NLP Using the BERT model**

In [1]:
!pip install transformers torch datasets



In [2]:
import torch
from transformers import BertTokenizer, BertModel, BertForSequenceClassification



## **Load BERT Tokenizer**

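Load the WordPiece tokenizer that matches the `bert-base-uncased` checkpoint. It lowercases the text, splits it into subword pieces, and maps each piece to a vocabulary id.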

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')




In [4]:
text = "BERT is amazing for NLP tasks"
encoded = tokenizer(text, return_tensors="pt")
encoded

{'input_ids': tensor([[  101, 14324,  2003,  6429,  2005, 17953,  2361,  8518,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
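The output holds nine ids for a six-word sentence. In `bert-base-uncased`, id 101 is `[CLS]` and id 102 is `[SEP]`, which accounts for two of them; the remaining extra id appears because one word was split into two subword pieces. The sketch below imitates WordPiece's greedy longest-match-first splitting with a hypothetical seven-entry vocabulary. The split of "nlp" into `nl` + `##p` is an assumption that is consistent with the id count above, and the real vocabulary has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]                       # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, just large enough for this sentence
toy_vocab = {"bert", "is", "amazing", "for", "nl", "##p", "tasks"}

text = "BERT is amazing for NLP tasks"
tokens = ["[CLS]"]
for word in text.lower().split():
    tokens.extend(wordpiece(word, toy_vocab))
tokens.append("[SEP]")
print(tokens)
print(len(tokens))
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])` shows the actual pieces behind each id.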

## **Load Pretrained BERT Model**

In [5]:
model = BertModel.from_pretrained('bert-base-uncased')
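To get a feel for the size of what was just loaded, the parameter count can be worked out by hand. The arithmetic below assumes the published `bert-base-uncased` configuration (12 layers, hidden size 768, feed-forward size 3072, vocabulary of 30,522, 512 positions, 2 segment types); none of these values appear in the notebook itself except the hidden size.

```python
# Assumed bert-base-uncased configuration values
hidden = 768
layers = 12
intermediate = 3072
vocab_size = 30522
max_positions = 512
type_vocab = 2

def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and segment embeddings share the hidden size, then one LayerNorm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value and output projections, then one LayerNorm
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (expand, then contract), then one LayerNorm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
# Dense layer applied to the [CLS] hidden state
pooler = linear(hidden, hidden)

total = embeddings + layers * (attention + feed_forward) + pooler
print(f"{total:,} parameters")                   # 109,482,240 parameters
print(f"{total * 4 / 1e6:.0f} MB in float32")    # 438 MB in float32
```

On the loaded model, `sum(p.numel() for p in model.parameters())` gives the count directly and can be compared against this estimate.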


## **Generate Embeddings**

In [6]:
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**encoded)

# Shape is (batch_size, sequence_length, hidden_size)
outputs.last_hidden_state.shape

torch.Size([1, 9, 768])

## **Sentence Embedding**

In [7]:
# Position 0 holds the [CLS] token; its final hidden state is a common sentence-level summary
cls_embedding = outputs.last_hidden_state[:, 0, :]
cls_embedding.shape

torch.Size([1, 768])
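The `[CLS]` vector is one way to get a sentence embedding. A frequently used alternative is to average the token vectors while ignoring padding positions, then compare sentences by cosine similarity. The sketch below shows that arithmetic on plain Python lists so each step is visible; `mean_pool`, `cosine`, and the tiny 2-dimensional vectors are illustrative stand-ins for the real 768-dimensional hidden states.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention_mask entry is 1."""
    dim = len(hidden_states[0])
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hidden states for two short sequences; the last row of the
# first sequence is padding and must not affect the average
seq_a = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask_a = [1, 1, 0]
seq_b = [[2.0, 3.0], [2.0, 3.0]]
mask_b = [1, 1]

emb_a = mean_pool(seq_a, mask_a)
emb_b = mean_pool(seq_b, mask_b)
print(emb_a, emb_b)
print(cosine(emb_a, emb_b))
```

With tensors, the same idea is a masked sum over the sequence dimension of `last_hidden_state` divided by the number of unmasked tokens.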

## **Text Classification with BERT**

In [8]:
classifier = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

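Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This warning is expected. `bert-base-uncased` ships only the pretrained encoder, so the classification head (`classifier.weight`, `classifier.bias`) starts from random values, and the predictions below carry no meaning until the model is fine-tuned on labeled data.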


### Example Inputs

In [9]:
texts = [
    "I love this movie",
    "This product is terrible"
]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
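`padding=True` matters whenever the texts in a batch tokenize to different lengths: shorter sequences are filled with the pad id so the batch forms a rectangular tensor, and `attention_mask` marks which positions are real. A minimal sketch of that bookkeeping follows. It assumes pad id 0 (the `[PAD]` id in `bert-base-uncased`); the ids between 101 and 102 are invented for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Hypothetical token ids; only 101 ([CLS]) and 102 ([SEP]) are real BERT ids
batch = [
    [101, 7, 8, 9, 102],
    [101, 7, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```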

### Run Classification

In [10]:
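# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = classifier(**inputs)

# One row of logits per input text, one column per label
logits = outputs.logits
# Pick the highest-scoring label index for each text
predictions = torch.argmax(logits, dim=1)
predictions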

tensor([1, 0])
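`argmax` returns only the winning label index. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below does this on plain lists; `example_logits` is hypothetical and was chosen only so that its argmax reproduces `[1, 0]`. Because the classification head here is untrained, real values from this notebook would be close to arbitrary.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two texts and two labels
example_logits = [[-0.2, 0.3], [0.4, -0.1]]

probs = [softmax(row) for row in example_logits]
preds = [row.index(max(row)) for row in probs]
for p, label in zip(probs, preds):
    print([round(x, 3) for x in p], "-> label", label)
```

On the tensor from the cell above, `torch.softmax(logits, dim=1)` performs the same computation.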

## **Named Entity Recognition (NER)**

In [12]:
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple"
)

sentence = "Aditya studies Computer Science at JNTU Hyderabad"
ner_pipeline(sentence)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.9988004),
  'word': 'Ad',
  'start': 0,
  'end': 2},
 {'entity_group': 'PER',
  'score': np.float32(0.9526341),
  'word': '##ity',
  'start': 2,
  'end': 5},
 {'entity_group': 'MISC',
  'score': np.float32(0.63641477),
  'word': 'Computer Science',
  'start': 15,
  'end': 31},
 {'entity_group': 'ORG',
  'score': np.float32(0.980435),
  'word': 'J',
  'start': 35,
  'end': 36},
 {'entity_group': 'ORG',
  'score': np.float32(0.88837737),
  'word': '##NTU Hyderabad',
  'start': 36,
  'end': 49}]
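The output shows that `aggregation_strategy="simple"` can still leave a single name split across subword pieces (`Ad` + `##ity`, `J` + `##NTU Hyderabad`). One possible post-processing step, sketched below, merges neighbouring entries that share a label and touch at their character offsets, then recovers the text by slicing the original sentence. The helper `merge_adjacent` is hypothetical, and keeping the minimum score of the merged pieces is a design choice, not something the pipeline prescribes. The pipeline also accepts word-level strategies such as `"first"`, `"average"`, and `"max"`, which may avoid the split in the first place.

```python
def merge_adjacent(entities, text):
    """Merge consecutive entities with the same label and contiguous offsets."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if prev and prev["entity_group"] == ent["entity_group"] and prev["end"] == ent["start"]:
            prev["end"] = ent["end"]
            prev["score"] = min(prev["score"], ent["score"])   # conservative confidence
            prev["word"] = text[prev["start"]:prev["end"]]     # rebuild from the source text
        else:
            merged.append(dict(ent))                           # copy so the input stays intact
    return merged

sentence = "Aditya studies Computer Science at JNTU Hyderabad"
# Pipeline output from above, with scores written as plain floats
raw = [
    {"entity_group": "PER", "score": 0.9988004, "word": "Ad", "start": 0, "end": 2},
    {"entity_group": "PER", "score": 0.9526341, "word": "##ity", "start": 2, "end": 5},
    {"entity_group": "MISC", "score": 0.63641477, "word": "Computer Science", "start": 15, "end": 31},
    {"entity_group": "ORG", "score": 0.980435, "word": "J", "start": 35, "end": 36},
    {"entity_group": "ORG", "score": 0.88837737, "word": "##NTU Hyderabad", "start": 36, "end": 49},
]

merged = merge_adjacent(raw, sentence)
for ent in merged:
    print(ent["entity_group"], repr(ent["word"]), ent["start"], ent["end"], ent["score"])
```

The merged person span reads `Adity` rather than `Aditya` because the model's predicted span ends at offset 5, so the final character was not tagged as part of the entity.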