# Natural Language Processing with Transformers
## Detailed Step-by-Step Solution
### 1. Introduction
This assignment involves using a pre-trained BERT model from Hugging Face to compute sentence embeddings, measure cosine similarity between sentence pairs, and predict whether sentences are semantically similar.

### Key Tasks:
- Apply BERT for sentence encoding.
- Extract token-level embeddings.
- Compute cosine similarity between embeddings.
- Predict similarity based on a threshold (0.7).
- Evaluate accuracy against manual labels.

In [5]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.53.3-py3-none-any.whl.metadata (40 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.9/40.9 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
Downloading transformers-4.53.3-py3-none-any.whl (10.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.8/10.8 MB[0m [31m99.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.53.2
    Uninstalling transformers-4.53.2:
      Successfully uninstalled transformers-4.53.2
Successfully installed transformers-4.53.3


In [6]:
!pip install tf-keras



### 2. Task Completion
### Step 1: Import Libraries & Load Pre-trained BERT
We use transformers for BERT, tensorflow for model execution, and sklearn for cosine similarity.

In [7]:
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')



TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were 

### Step 2: Define Sentence Pairs & Labels
We extend the given 5 sentence pairs with 5 new ones and manually label them (1=similar, 0=not similar).

In [8]:
sentence_pairs = [
    ("How do I learn Python?", "What is the best way to study Python?"),
    ("What is AI?", "How to cook pasta?"),
    ("How do I bake a chocolate cake?", "Give me a chocolate cake recipe."),
    ("How can I improve my coding skills?", "Tips for becoming better at programming."),
    ("Where can I buy cheap laptops?", "Best sites to find affordable computers."),
    # New pairs
    ("What is the weather today?", "Is it raining outside?"),
    ("How to train a dog?", "Best ways to teach a puppy tricks."),
    ("What is machine learning?", "How does deep learning work?"),
    ("How to make coffee?", "Steps to prepare tea."),
    ("Best restaurants in town?", "Top places to eat nearby.")
]

labels = [1, 0, 1, 1, 1, 1, 1, 1, 0, 1]  # Manual ground truth

### Step 3: Define Function to Get BERT Embeddings
BERT produces a contextual embedding for every token in the input. We take the embedding of the [CLS] token, which the tokenizer places at position 0, as the sentence-level representation.

In [9]:
def get_sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors='tf', padding=True, truncation=True)
    outputs = bert_model(inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding
    return cls_embedding.numpy()
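
The slice `[:, 0, :]` relies on `last_hidden_state` having the shape `(batch, sequence_length, hidden_size)`, with [CLS] at position 0 of every sequence. As a rough illustration only, the sketch below mimics that indexing on nested Python lists. The tiny fake tensor and the helper name `take_cls` are made up for this example and are not part of the notebook; for `bert-base-uncased` the hidden size is 768 rather than 4.

```python
def take_cls(last_hidden_state):
    """Mimic last_hidden_state[:, 0, :] on nested lists (hypothetical helper)."""
    return [sequence[0] for sequence in last_hidden_state]

# Fake model output: batch of 1 sentence, 3 token positions, hidden size 4.
fake_hidden = [
    [
        [0.1, 0.2, 0.3, 0.4],  # position 0: [CLS]
        [0.5, 0.6, 0.7, 0.8],  # position 1: first word piece
        [0.9, 1.0, 1.1, 1.2],  # position 2: [SEP]
    ]
]

cls_vectors = take_cls(fake_hidden)
# One vector per sentence, each of length hidden_size.
print(len(cls_vectors), len(cls_vectors[0]))
```

The result keeps the batch dimension, which is why `get_sentence_embedding` returns a 2-D array that can be passed directly to `cosine_similarity`.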

### Step 4: Compute Cosine Similarity & Predictions
For each pair, we:

1. Get the embeddings of both sentences.
2. Compute the cosine similarity between them.
3. Predict similar (1) if the score is above 0.7, otherwise not similar (0).
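
Cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal pure-Python sketch of that formula and the 0.7 cutoff is shown here for reference. The helper names `cosine` and `predict_similar` and the toy vectors are assumptions for illustration; the notebook itself uses `cosine_similarity` from `sklearn` on the BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_similar(score, threshold=0.7):
    """Return 1 when the score is strictly above the threshold, else 0."""
    return 1 if score > threshold else 0

# Toy 3-dimensional vectors, not real BERT embeddings.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u
w = [3.0, 0.0, -1.0]  # orthogonal to u: 3 + 0 - 3 = 0

print(cosine(u, v), predict_similar(cosine(u, v)))
print(cosine(u, w), predict_similar(cosine(u, w)))
```

Because the score depends only on direction, scaling a vector does not change it: `u` and `v` point the same way and score close to 1, while the orthogonal pair scores 0.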

In [10]:
predictions = []
for sent1, sent2 in sentence_pairs:
    emb1 = get_sentence_embedding(sent1)
    emb2 = get_sentence_embedding(sent2)
    sim_score = cosine_similarity(emb1, emb2)[0][0]
    pred = 1 if sim_score > 0.7 else 0
    predictions.append(pred)

    print(f"\nSentence 1: {sent1}")
    print(f"Sentence 2: {sent2}")
    print(f"Cosine Similarity: {sim_score:.4f} → Predicted Similar: {pred}")



Sentence 1: How do I learn Python?
Sentence 2: What is the best way to study Python?
Cosine Similarity: 0.9743 → Predicted Similar: 1

Sentence 1: What is AI?
Sentence 2: How to cook pasta?
Cosine Similarity: 0.9033 → Predicted Similar: 1

Sentence 1: How do I bake a chocolate cake?
Sentence 2: Give me a chocolate cake recipe.
Cosine Similarity: 0.8938 → Predicted Similar: 1

Sentence 1: How can I improve my coding skills?
Sentence 2: Tips for becoming better at programming.
Cosine Similarity: 0.8633 → Predicted Similar: 1

Sentence 1: Where can I buy cheap laptops?
Sentence 2: Best sites to find affordable computers.
Cosine Similarity: 0.8750 → Predicted Similar: 1

Sentence 1: What is the weather today?
Sentence 2: Is it raining outside?
Cosine Similarity: 0.9476 → Predicted Similar: 1

Sentence 1: How to train a dog?
Sentence 2: Best ways to teach a puppy tricks.
Cosine Similarity: 0.9343 → Predicted Similar: 1

Sentence 1: What is machine learning?
Sentence 2: How does deep learni
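
All seven scores printed above exceed 0.86, including the unrelated pair about AI and pasta (0.9033). One way to see how sensitive the predictions are to the cutoff is to sweep a few thresholds over those scores. The sketch below hard-codes the seven reported scores together with their manual labels from Step 2; the remaining three pairs are left out because their scores are not shown, and the candidate thresholds are arbitrary values chosen for illustration.

```python
# Scores and labels for the first seven pairs, copied from the output above.
reported_scores = [0.9743, 0.9033, 0.8938, 0.8633, 0.8750, 0.9476, 0.9343]
reported_labels = [1, 0, 1, 1, 1, 1, 1]

def accuracy_at(threshold):
    """Fraction of the seven pairs whose thresholded score matches the label."""
    preds = [1 if s > threshold else 0 for s in reported_scores]
    matches = sum(p == y for p, y in zip(preds, reported_labels))
    return matches / len(reported_labels)

for t in (0.70, 0.88, 0.90, 0.91, 0.95):  # illustrative cutoffs
    print(f"threshold={t:.2f}  accuracy={accuracy_at(t):.2f}")
```

On these seven pairs no single cutoff separates the unrelated pair from the similar ones, because its score (0.9033) is higher than the scores of three pairs labeled similar (0.8938, 0.8750, 0.8633).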

### Step 5: Evaluate Accuracy
Compare predictions with ground truth labels.

In [11]:
correct = sum(p == y for p, y in zip(predictions, labels))  # count matching predictions
accuracy = correct / len(labels)
print(f"\nAccuracy: {accuracy:.2%}")


Accuracy: 80.00%
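
As a point of comparison, it can help to check how a trivial classifier that always predicts the most common label would score on the same ground truth. The sketch below repeats the `labels` list from Step 2; `majority_baseline` is a hypothetical helper name introduced only for this example.

```python
from collections import Counter

labels = [1, 0, 1, 1, 1, 1, 1, 1, 0, 1]  # manual ground truth from Step 2

def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

print(f"Majority-class baseline: {majority_baseline(labels):.2%}")
```

Eight of the ten labels are 1, so this baseline is 80%, the same as the accuracy reported above. With such an imbalanced label set, accuracy alone does not show whether the 0.7 threshold is separating similar from dissimilar pairs.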
