## 1. Install and Import Dependencies

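In this section, we install and import the libraries used in this notebook:
- `datasets` for loading data from Hugging Face
- `spaCy` for sentence tokenization
- `transformers` and `torch` for BERT embeddings
- `numpy`, `scikit-learn`, and `matplotlib` for similarity computations and plotting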

In [None]:
# Install if not already installed (uncomment the lines you need)
# !pip3 install datasets --quiet
# !pip3 install nltk --quiet
# !pip3 install spacy
# !pip3 install transformers
# !pip3 install torch
# !python -m spacy download en_core_web_sm
# !pip3 install scikit-learn
# !pip3 install matplotlib

# Imports
from datasets import load_dataset
import spacy

from transformers import AutoTokenizer, AutoModel
import torch

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

## 2. Load a 1% Sample from the Wikipedia Dataset

We'll load a small subset (`train[:1%]`) to keep things fast.

In [19]:
wiki = load_dataset("wikipedia", "20220301.en", split="train[:1%]", trust_remote_code=True)
print(f"Sampled {len(wiki)} documents\n")
print("Example text:\n")
print(wiki[6000]['text'][:2000]) 

Sampled 64587 documents


 Example text:

Glass is a non-crystalline, often transparent amorphous solid, that has widespread practical, technological, and decorative use in, for example, window panes, tableware, and optics. Glass is most often formed by rapid cooling (quenching) of the molten form; some glasses such as volcanic glass are naturally occurring. The most familiar, and historically the oldest, types of manufactured glass are "silicate glasses" based on the chemical compound silica (silicon dioxide, or quartz), the primary constituent of sand. Soda-lime glass, containing around 70% silica, accounts for around 90% of manufactured glass. The term glass, in popular usage, is often used to refer only to this type of material, although silica-free glasses often have desirable properties for applications in modern communications technology. Some objects, such as drinking glasses and eyeglasses, are so commonly made of silicate-based glass that they are simply called by the name of

## 3. Sentence Splitting

We'll split the article at index 6000 (the "Glass" article printed above) into sentences using spaCy's `en_core_web_sm` pipeline and its `doc.sents` iterator.
Then we'll keep only the **first 10 sentences** for our BERT embedding example.


In [20]:
nlp = spacy.load("en_core_web_sm")

doc = nlp(wiki[6000]['text'])
sentences = [sent.text for sent in doc.sents]

print("\n SpaCy sentence split:\n")
for i, sent in enumerate(sentences[:10]):
    print(f"{i+1}. {sent}")


 SpaCy sentence split:

1. Glass is a non-crystalline, often transparent amorphous solid, that has widespread practical, technological, and decorative use in, for example, window panes, tableware, and optics.
2. Glass is most often formed by rapid cooling (quenching) of the molten form; some glasses such as volcanic glass are naturally occurring.
3. The most familiar, and historically the oldest, types of manufactured glass are "silicate glasses" based on the chemical compound silica (silicon dioxide, or quartz), the primary constituent of sand.
4. Soda-lime glass, containing around 70% silica, accounts for around 90% of manufactured glass.
5. The term glass, in popular usage, is often used to refer only to this type of material, although silica-free glasses often have desirable properties for applications in modern communications technology.
6. Some objects, such as drinking glasses and eyeglasses, are so commonly made of silicate-based glass that they are simply called by the name o

## 4. Load BERT Model and Tokenizer

We'll use the `bert-base-uncased` model from Hugging Face. 

In [23]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Put model in eval mode to disable dropout
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False
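
The sizes in the printed module summary allow a quick back-of-the-envelope parameter count. The sketch below is only arithmetic on the numbers shown above; it assumes each `Embedding(rows, 768)` stores one 768-value vector per row, each `LayerNorm` with `elementwise_affine=True` holds a weight and a bias vector, and each `Linear(768, 768, bias=True)` holds a 768 x 768 weight matrix plus a bias. The variable names are illustrative.

```python
# Sizes copied from the printed BertEmbeddings module.
vocab_size, max_positions, token_types, hidden_size = 30522, 512, 2, 768

word_params = vocab_size * hidden_size         # word_embeddings
position_params = max_positions * hidden_size  # position_embeddings
token_type_params = token_types * hidden_size  # token_type_embeddings
layer_norm_params = 2 * hidden_size            # LayerNorm weight + bias

embedding_params = (
    word_params + position_params + token_type_params + layer_norm_params
)
print(f"Embedding block parameters: {embedding_params:,}")  # 23,837,184

# One Linear(768, 768, bias=True), e.g. the query, key, or value projection.
linear_params = hidden_size * hidden_size + hidden_size
print(f"Parameters per 768x768 Linear: {linear_params:,}")  # 590,592
```

The `position_embeddings` table has 512 rows, which is why inputs longer than 512 tokens need to be truncated or split before they reach this model.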

## 5. Generate Embeddings for the First 10 Sentences

For each sentence:
1. Tokenize with the BERT tokenizer
2. Run it through the model to get embeddings
3. Print the **first 5 values** of each token's embedding vector

**Note**: The model's `last_hidden_state` has shape `[batch_size, sequence_length, hidden_size]`,
so we take the single batch element and loop over the `sequence_length` dimension to inspect each token's embedding.
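
To make the shape concrete, here is a minimal sketch that uses a random NumPy array as a stand-in for the model output. The array values, the sequence length of 7, and the variable names are assumptions for illustration only; no BERT model is involved.

```python
import numpy as np

# Random stand-in for `outputs.last_hidden_state`; these are not real BERT values.
batch_size, seq_length, hidden_size = 1, 7, 768
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_length, hidden_size))

# Indexing [0] selects the only sentence in the batch,
# leaving a matrix of shape [seq_length, hidden_size].
sentence_matrix = last_hidden_state[0]
print(sentence_matrix.shape)  # (7, 768)

# Iterating over the first remaining axis yields one vector per token.
for token_idx, token_vec in enumerate(sentence_matrix):
    print(token_idx, token_vec.shape, token_vec[:5].round(3).tolist())
```

Each row of `sentence_matrix` corresponds to one token position, including the special `[CLS]` and `[SEP]` tokens that the BERT tokenizer adds.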


In [24]:
# Process each of the first 10 sentences
for i, sentence in enumerate(sentences[:10]):
    # Tokenize
    inputs = tokenizer(sentence, return_tensors="pt")
    
    # Forward pass through BERT (no gradients needed)
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get the last hidden state (the embeddings)
    # shape = [batch_size=1, seq_length, hidden_size=768]
    embeddings = outputs.last_hidden_state
    
    print(f"\n===== Sentence {i+1} Embeddings =====")
    print(f"Original Sentence:\n{sentence}\n")
    
    # Convert input IDs back to tokens to see which token matches which embedding
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    
    for token_idx, token_emb in enumerate(embeddings[0]):
        # Print the token and the first 5 values of its embedding
        print(f"Token: {tokens[token_idx]:<15} | Embedding first 5 dims: {token_emb[:5].tolist()}")



===== Sentence 1 Embeddings =====
Original Sentence:
Glass is a non-crystalline, often transparent amorphous solid, that has widespread practical, technological, and decorative use in, for example, window panes, tableware, and optics.

Token: [CLS]           | Embedding first 5 dims: [-0.22782164812088013, -0.18914073705673218, -0.3277425169944763, -0.05423904210329056, -0.3639874756336212]
Token: glass           | Embedding first 5 dims: [1.0369515419006348, 0.39504313468933105, -0.634232759475708, -0.02918941341340542, 0.8679481744766235]
Token: is              | Embedding first 5 dims: [-0.02266383171081543, 0.23825767636299133, -0.024853739887475967, -0.37907376885414124, 0.2955845892429352]
Token: a               | Embedding first 5 dims: [-0.1935730278491974, 0.4670875072479248, 0.28465571999549866, -0.12208503484725952, 0.38772886991500854]
Token: non             | Embedding first 5 dims: [0.002641589380800724, -0.5909901857376099, 0.2815864682197571, -0.4769973158836365, -0.10

In [31]:
# Cosine similarity between two 1-D vectors
def cosine_sim(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# After the loop above, `embeddings` and `tokens` belong to the last sentence
# processed, so re-encode sentence 1, which contains the token "glass".
inputs = tokenizer(sentences[0], return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# 1. Build a dictionary { token_text: embedding_vector }
#    zip pairs each token string with its vector; enumerate would produce
#    integer positions as keys, so a lookup by "glass" would always fail.
#    Repeated tokens (e.g. ",") keep only their last occurrence.
token_dict = {}
for token_text, emb_vec in zip(tokens, embeddings[0]):
    token_dict[token_text] = emb_vec.numpy()

# 2. Retrieve the embedding for "glass"
if "glass" not in token_dict:
    print("'glass' not found in token_dict!")
else:
    glass_emb = token_dict["glass"]

    # 3. Compute similarity for all other tokens
    similarities = {}
    for word, emb_vec in token_dict.items():
        if word == "glass":
            continue
        sim = cosine_sim(glass_emb, emb_vec)
        similarities[word] = sim

    # 4. Print all tokens and their similarity to "glass"
    print("\nSimilarity to 'glass':")
    for w, sim_val in similarities.items():
        print(f"{w} -> {sim_val:.4f}")
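
As a sanity check on the cosine-similarity formula and on keying a dictionary by token text, the sketch below uses small made-up vectors in place of BERT embeddings. The token list and the vector values are illustrative assumptions, not model output. It also shows that `enumerate` yields integer positions, so a lookup by token string only succeeds when the keys come from pairing tokens with their vectors, for example with `zip`.

```python
import numpy as np

def cosine_sim(vec1, vec2):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

# Made-up 3-dimensional vectors standing in for 768-dimensional embeddings.
toy_tokens = ["glass", "is", "solid"]
toy_vectors = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # same direction as "glass"
    [3.0, 0.0, -1.0],  # orthogonal to "glass"
])

# Keys produced by enumerate are integer positions: 0, 1, 2.
by_index = {idx: vec for idx, vec in enumerate(toy_vectors)}
# Keys produced by zip are the token strings themselves.
by_token = {tok: vec for tok, vec in zip(toy_tokens, toy_vectors)}

print("glass" in by_index)  # False
print("glass" in by_token)  # True

# Compare "glass" with every other toy token.
for tok in ("is", "solid"):
    print(tok, round(cosine_sim(by_token["glass"], by_token[tok]), 4))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score close to 0, which is why cosine similarity is a common choice for comparing embeddings.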
