# 📘 Notebook 1: Introduction to Tokens and Embeddings
This notebook introduces the basic concepts of tokens and embeddings used in Large Language Models.

## What is an LLM?
A Large Language Model (LLM) is a type of AI model trained on massive amounts of text to understand and generate human-like language. It doesn’t “know” facts the way a database does. Instead, it learns patterns of grammar, usage, and meaning from billions of examples.


## 🧩 What is a Token?
LLMs don’t see raw text. A tokenizer first breaks the input into tokens: pieces of text that are often whole words, subwords, or symbols. Working from a fixed vocabulary of tokens standardizes the input and keeps it compact for processing.

## 🔢 Tokenization Example: Word vs. Tokens vs. Token IDs

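Before we can compute embeddings, we break a sentence into tokens.
Here's an illustrative example. The exact splits and IDs depend on the tokenizer's vocabulary (many vocabularies keep common words whole), so treat the values below as placeholders rather than the output of a specific model: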

| Word      | Token(s)     | Token ID(s)     |
|-----------|--------------|-----------------|
| running   | run, ##ning  | 2142, 2075      |
| happily   | happ, ##ily  | 6203, 12973     |

- Some words break into **subwords** based on what the model has seen during training.
- Each token is mapped to a unique ID from the model's vocabulary.

This is what gets fed into the model — not the raw text!
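
To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting, the general strategy behind WordPiece-style tokenizers. The vocabulary `TOY_VOCAB`, its IDs, and the helper names are all hypothetical and chosen only to reproduce the splits in the table above; a real tokenizer uses a learned vocabulary with tens of thousands of entries.

```python
# Hypothetical toy vocabulary: token string -> token ID (all values are made up).
TOY_VOCAB = {"[UNK]": 0, "run": 1, "##ning": 2, "happ": 3, "##ily": 4, "cat": 5}


def wordpiece_tokenize(word, vocab):
    """Split one lowercase word into subword tokens, longest match first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def tokens_to_ids(tokens, vocab):
    """Look up each token's ID in the vocabulary."""
    return [vocab[token] for token in tokens]


for word in ["running", "happily", "cat", "dog"]:
    toks = wordpiece_tokenize(word, TOY_VOCAB)
    print(word, toks, tokens_to_ids(toks, TOY_VOCAB))
```

Words that appear in the vocabulary stay whole, words that don't are split into known pieces, and anything that cannot be covered falls back to the unknown token.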


In [None]:
import logging
import os
import warnings

from transformers import AutoTokenizer
from transformers.utils import logging as hf_logging

# Suppress all user warnings, including HF_TOKEN ones
warnings.filterwarnings("ignore", category=UserWarning)

# Suppress Hugging Face progress bars and warnings
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
hf_logging.set_verbosity_error()
logging.getLogger("huggingface_hub").setLevel(logging.ERROR)

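# Load the WordPiece tokenizer that ships with bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentence = "The cat sat on the mat."

# Split the sentence into tokens, then look up each token's vocabulary ID
tokens = tokenizer.tokenize(sentence)
print("Tokens:", tokens)
print("Token IDs:", tokenizer.convert_tokens_to_ids(tokens))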

## 📐 What is an Embedding?
Once a sentence is tokenized, each token is converted into a vector: a list of numbers that captures its meaning. These vectors are called embeddings, and they let machines compare meaning using math. Token vectors can also be combined into a single embedding for a whole sentence.
Embeddings capture meaning, not just surface wording. For example, the sentences:
*   "The dog ran away."
*   "The canine escaped."

…may use different words, but their embeddings are similar because their meaning is similar.
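
As a rough sketch of the token-to-vector step, the snippet below looks up each token ID in a small table and averages the token vectors into one sentence vector. Everything here is an assumption made for illustration: the table `EMBEDDING_TABLE`, its 3-dimensional made-up values, and the helper names. Real models learn their tables during training, use hundreds of dimensions, and pass the vectors through further layers before averaging or otherwise combining them.

```python
# Hypothetical 3-dimensional embedding table: token ID -> vector (made-up values).
EMBEDDING_TABLE = {
    0: [0.0, 0.0, 0.0],  # [UNK]
    1: [1.0, 0.0, 0.5],  # "dog"
    2: [0.0, 0.5, 0.5],  # "ran"
}


def embed_tokens(token_ids, table):
    """Replace each token ID with its vector from the table."""
    return [table[token_id] for token_id in token_ids]


def mean_pool(vectors):
    """Average the token vectors dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


token_vectors = embed_tokens([1, 2], EMBEDDING_TABLE)
sentence_vector = mean_pool(token_vectors)
print("Token vectors:", token_vectors)
print("Sentence vector:", sentence_vector)
```

Averaging is only one simple way to combine token vectors; it is used here because it keeps the arithmetic easy to follow.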

The code below takes an English sentence and shows how a computer represents it: as a long list of numbers called an embedding.
* `SentenceTransformer` loads a pretrained model that maps a sentence to an embedding, a numeric fingerprint of the sentence's meaning.
* `embedding = model.encode([sentence])[0]` converts the sentence into a vector of 384 numbers (for this model). Individual numbers do not map to human-readable labels such as tone or topic; the meaning is spread across all dimensions together.
* The bar chart (`plt.bar(...)`) plots those numbers. The x-axis is the dimension number (just the position in the list), and the y-axis is the value stored in that dimension.

In [None]:

import logging
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from transformers.utils import logging as hf_logging


# Suppress Hugging Face logs
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
hf_logging.set_verbosity_error()
logging.getLogger("huggingface_hub").setLevel(logging.ERROR)

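# Load a small pretrained sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Reuses `sentence` from the tokenization cell above.
# encode() returns one vector per input sentence, so take the first.
embedding = model.encode([sentence])[0]

# Plot one bar per embedding dimension
plt.figure(figsize=(12, 3))
plt.bar(range(len(embedding)), embedding)
plt.title(f"Embedding Vector for: '{sentence}'")
plt.xlabel("Dimension")
plt.ylabel("Value")
plt.show()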


## 💬 Why Do We Use Embeddings?

Embeddings turn text into numbers — but why does that matter?

- **Similar sentences → similar vectors** (they are *close* in space).
- **Different sentences → distant vectors** (they are *far apart* in space).

This lets the computer compare meanings with simple math, such as the cosine similarity between two vectors.
For example:

| Sentence A                  | Sentence B                  | Cosine Similarity |
|----------------------------|-----------------------------|-------------------|
| The dog barked.            | A canine made noise.        | High              |
| I like pizza.              | The economy is slowing down.| Low               |
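
Cosine similarity measures the angle between two vectors: values near 1 mean they point in almost the same direction, and values near 0 mean they are unrelated. The sketch below computes it in plain Python so the arithmetic stays visible. The three vectors are made-up stand-ins for sentence embeddings, not real model output, and `cosine_sim` is a hypothetical helper name chosen to avoid clashing with scikit-learn's `cosine_similarity`.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real sentence embeddings.
dog_barked = [0.9, 0.8, 0.1]      # "The dog barked."
canine_noise = [0.8, 0.9, 0.2]    # "A canine made noise."
economy_slowing = [0.1, 0.0, 0.9]  # "The economy is slowing down."

similar = cosine_sim(dog_barked, canine_noise)
different = cosine_sim(dog_barked, economy_slowing)
print(f"Similar pair:   {similar:.3f}")
print(f"Different pair: {different:.3f}")
```

With real embeddings the same function applies unchanged; only the vectors get longer.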

We use embeddings for:
- **Semantic Search**: Find texts that *mean* the same thing.
- **Chatbots**: Match new questions to known answers.
- **Classification**: Label text by analyzing its vector.
- **Clustering**: Group similar ideas together.
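
The semantic search and chatbot use cases both come down to ranking stored texts by how close their embeddings are to the embedding of a query. Here is a minimal sketch under stated assumptions: the texts in `DOCS`, their 3-dimensional vectors, the query vector, and the helper names are all hypothetical. In practice the vectors would come from an embedding model such as the one loaded earlier.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, doc_vecs, top_k=2):
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_sim(query_vec, vec), text) for text, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for real embeddings of each stored question.
DOCS = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "What are your opening hours?": [0.1, 0.9, 0.1],
    "How can I change my login details?": [0.8, 0.2, 0.1],
}

# Pretend this is the embedding of the query "I forgot my password".
query_vec = [0.85, 0.15, 0.05]

results = semantic_search(query_vec, DOCS)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two account-related questions rank above the unrelated one even though they share few exact words with the query, which is the point of searching by meaning rather than by keyword.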