<a href="https://colab.research.google.com/github/chenwh0/Natural-Language-Processing-work/blob/main/module5/HowTokenizationAndEmbeddingsWorksInsideLLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploration and Analysis of Tokenization and Embeddings in Modern LLMs**

**Background:**  
This lab examines how multiple real tokenizers operate, dissects how their output affects the resulting embeddings, and analyzes why these choices matter for both accuracy and efficiency in language tasks.

# *Sources/References*

* Academic Data source: https://pubmed.ncbi.nlm.nih.gov/18270555/

### Instructions & Deliverables

#### 1. **Tokenization Deep Dive (2 points)**

- Select *five diverse sentences*:  
  - Two with formal academic language  
  - One with slang or social media language  
  - One with an emoji  
  - One with code/math notation  
- For each sentence, tokenize with **three different pretrained tokenizers** (choose from e.g. `bert-base-uncased`, `gpt2`, `microsoft/Phi-3-mini-4k-instruct`, `google/flan-t5-small`, etc.).
- Display for each:
  - The original text
  - The sequence of tokens and their decoded forms (subwords)
  - The token IDs

#### 2. **Cross-Tokenizer Comparison (2 points)**

- Place your results in a **comparison table**:
  - For each sentence and tokenizer, show:  
    - Number of tokens  
    - How words or special features (names/emoji/code) are split
    - Presence of [UNK] or unknown tokens
- In a *markdown cell*, answer:  
  - Which tokenization schemes are more robust to slang, emojis, and code?  
  - Which produce the longest and shortest sequences? Why?

#### 3. **Token Embedding Visualization (3 points)**

- Pick one sentence and one tokenizer from your previous results.
- Use the tokenizer’s pretrained embedding layer (from its associated model) to produce the embedding vector for each token in the sentence.
- Use PCA or t-SNE to project the token embeddings to 2D and create a **scatter plot**:
  - Each point should be labeled with the decoded token.
  - Color points differently for subwords, whole words, and special tokens.
- Comment on the geometry: Do related words/subwords cluster? Are special tokens outliers?

#### 4. **Prompt Engineering & Model Output (2 points)**

- Take two tokenized prompts that yielded notably different token splits across tokenizers (e.g. one with code/math and one with informal language).
- For each:
  - Use two *different* language models (“matching” the tokenizer used) to generate text completions.
  - In a short table, report:
    - Length (in tokens and characters) of the generated output
    - Are any [UNK] tokens, empty outputs, or odd/non-conversational results observed?
- Discuss how the tokenizer choice might affect downstream output quality and efficiency.

#### 5. **Reflection (1 point)**

- In a paragraph (markdown), summarize:
  - How does the choice of tokenizer and embedding scheme affect which kinds of input a model can “understand”?
  - Why must LLM practitioners consider both the *efficiency* (sequence length) and the *semantic coverage* (handling unknowns, subwords, emoji) of each tokenizer?

**Submission**:  
Produce a Jupyter notebook with clearly separated code and markdown cells for each section. All code must run under Python and Hugging Face Transformers. Include all required tables, plots, and discussion.

**Grading Rubric:**

| Section                           | Points |
|:-----------------------------------|:------:|
| Tokenization Deep Dive             | 2      |
| Cross-Tokenizer Comparison Table & Analysis | 2 |
| Embedding Visualization            | 3      |
| Prompt Engineering & Model Output  | 2      |
| Reflection                        | 1      |
| **Total**                         | **10** |

# *Installs & Imports*

In [None]:
# Data preprocessing libraries
import pandas

# Tokenizer library
from transformers import AutoModelForCausalLM, AutoTokenizer

# Data visualization
import matplotlib.pyplot as pyplot

# 1. **Tokenization Deep Dives**

a. Selected *5 diverse sentences*
  - 2 with formal academic language  
  - 1 with slang or social media language  
  - 1 with an emoji  
  - 1 with code/math notation  

b. Tokenized each sentence with **3 different pretrained tokenizers** (`bert-base-uncased`, `gpt2`, `microsoft/Phi-3-mini-4k-instruct`).

c. Displayed for each sentence:
  - The original text
  - The sequence of tokens and their decoded forms (subwords)
  - The token IDs

In [None]:
texts = [
    "New methodologies and prototype systems for dynamically incorporating automatic feature extraction, visual selection, and knowledge-rich semantics for content-based image database management and retrieval are needed to assist image analysts.",
    "GeoIRIS can be best described by its architecture as shown in Fig. 1. There are six modules: feature extraction (FE), indexing structures (IS), semantic framework (SF), GeoName server (GS), fusion and ranking (FR), and retrieval visualization (RV).",
    "If I get one more L I'm gonna yeet my controller out a window no cap",
    "One will need a 🖥️ first to make a 🌏 application.",
    "To ensure features are rotationally insensitive, we order each bin, i ∈ [1, F], from W+y and W−y, such that S[i] = max {W +yi , W −yi } and S[i + F] = min {W +yi , W −yi }"
]

In [None]:
# List of RGB color codes for highlighting tokens in the output
colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    """Print each token of `sentence` on a colored background, then its token IDs."""
    # Load the specified tokenizer from Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # Tokenize the input sentence and get token IDs (special tokens included)
    token_ids = tokenizer(sentence).input_ids
    print("Original text:", sentence)
    # Decode each token ID on its own and print it with a colored background
    for idx, token_id in enumerate(token_ids):
        text = tokenizer.decode(token_id)
        print(
            # ANSI escape code for colored background using RGB values from colors_list
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            text +
            '\x1b[0m',
            end=' '
        )

    print("\nToken IDs:", token_ids)
    print()
    return token_ids

In [None]:
def tokenizer_split(tokenizer_name, texts):
    tokenized_prompts = []
    for text in texts:
        token_ids = show_tokens(text, tokenizer_name)
        tokenized_prompts.append(token_ids)
    return tokenized_prompts

In [None]:
bert_tokenized_prompts = tokenizer_split("bert-base-uncased", texts)

Original text: New methodologies and prototype systems for dynamically incorporating automatic feature extraction, visual selection, and knowledge-rich semantics for content-based image database management and retrieval are needed to assist image analysts.
[CLS] new method ##ologies and prototype systems for dynamic ##ally incorporating automatic feature extraction , visual selection , and knowledge -

In [None]:
gpt2_tokenized_prompts = tokenizer_split("gpt2", texts)

Original text: New methodologies and prototype systems for dynamically incorporating automatic feature extraction, visual selection, and knowledge-rich semantics for content-based image database management and retrieval are needed to assist image analysts.
New  method ologies  and  prototype  systems  for  dynamically  incorporating  automatic  feature  extraction ,  visual  selection ,  and  knowledge - rich  semantics

In [None]:
phi3_mini4K_tokenized_prompts = tokenizer_split("microsoft/Phi-3-mini-4k-instruct", texts)

Original text: New methodologies and prototype systems for dynamically incorporating automatic feature extraction, visual selection, and knowledge-rich semantics for content-based image database management and retrieval are needed to assist image analysts.
New method ologies and prototype systems for dynamically incorpor ating automatic feature extra ction , visual selection , and knowledge - ri

# 2. **Cross-Tokenizer Comparison**

## Tokenization scheme more robust to slang, emojis, and code

Phi-3-mini-4k-instruct is more robust to emojis and math symbols because its tokenizer could represent the emojis, as well as the math symbol ∈, instead of collapsing them into an unknown token.

## Tokenization scheme that produced longest/shortest sequences
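GPT2 produced the shortest sequences overall (195 tokens across the five sentences) and Phi-3-mini-4k-instruct produced the longest (212), with BERT in between (209). This might be because GPT2 is the least aggressive in its word splits and Phi-3-mini-4k-instruct is the most aggressive. Note that BERT's count includes the [CLS] and [SEP] tokens it adds to every sentence.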
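
To put the length comparison on a common scale, the short sketch below takes the three totals recorded in this notebook's comparison table and expresses each one relative to the shortest. It also subtracts the two special tokens that `bert-base-uncased` adds per sentence. The variable names are illustrative and not part of the original notebook.

```python
# Total token counts over the five sentences, as recorded in the comparison table
token_counts = {
    "BERT uncased": 209,
    "GPT2": 195,
    "Phi-3-mini-4k-instruct": 212,
}

# bert-base-uncased wraps every sentence in [CLS] ... [SEP]
NUM_SENTENCES = 5
bert_content_tokens = token_counts["BERT uncased"] - 2 * NUM_SENTENCES

# Express every total as a multiple of the shortest one
shortest = min(token_counts.values())
relative_length = {
    name: round(count / shortest, 3) for name, count in token_counts.items()
}

for name, ratio in relative_length.items():
    print(f"{name}: {token_counts[name]} tokens, {ratio:.3f}x the shortest")
print("BERT tokens excluding [CLS]/[SEP]:", bert_content_tokens)
```

On these numbers the spread between the three tokenizers is under 9%, so the differences in sequence length for this small sample are modest.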

In [None]:
def calculate_variables(name, tokenized_prompts, emoji_split, unk_token):
    total_tokens = 0
    total_unk = 0
    for prompt in tokenized_prompts:
        total_tokens += len(prompt)
        total_unk += prompt.count(unk_token)

    row = {"Tokenizer name": name, "# of tokens": total_tokens,
           "emoji split": emoji_split, "Unknown tokens count": total_unk}
    return row

In [None]:
row1 = calculate_variables("BERT uncased", bert_tokenized_prompts, "[UNK]", 100)
row2 = calculate_variables("GPT2", gpt2_tokenized_prompts, " ÔøΩ ÔøΩ ÔøΩ", 50256)
row3 = calculate_variables("Phi-3-mini-4k-instruct", phi3_mini4K_tokenized_prompts, "ÔøΩ ÔøΩ ÔøΩ ÔøΩ",32000)
comparison_dataframe = pandas.DataFrame([row1, row2, row3])
display(comparison_dataframe)

Unnamed: 0,Tokenizer name,# of tokens,emoji split,Unknown tokens count
0,BERT uncased,209,[UNK],2
1,GPT2,195,ÔøΩ ÔøΩ ÔøΩ,0
2,Phi-3-mini-4k-instruct,212,ÔøΩ ÔøΩ ÔøΩ ÔøΩ,0


# 3. **Token Embedding Visualization**

- Pick one sentence and one tokenizer from your previous results.
- Use the tokenizer’s pretrained embedding layer (from its associated model) to produce the embedding vector for each token in the sentence.
- Use PCA or t-SNE to project the token embeddings to 2D and create a **scatter plot**:
  - Each point should be labeled with the decoded token.
  - Color points differently for subwords, whole words, and special tokens.
- Comment on the geometry: Do related words/subwords cluster? Are special tokens outliers?

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load Phi-3-mini-4k-instruct tokenizer and model
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Token IDs for one sentence from Section 1 (the emoji sentence, chosen here for illustration)
token_ids = tokenizer(texts[3]).input_ids

# Get embeddings from model's embedding layer
with torch.no_grad():
    embeddings = model.get_input_embeddings()(torch.tensor(token_ids))

# Reduce to 2D
# Option 1: PCA
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(embeddings.numpy())

# Option 2: t-SNE (uncomment if you want)
# tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42)
# emb_2d = tsne.fit_transform(embeddings.numpy())

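# Decode tokens (used as plot labels)
tokens = [tokenizer.decode([tid]) for tid in token_ids]
# Raw vocabulary entries keep the '▁' word-start marker that decode() strips
raw_tokens = tokenizer.convert_ids_to_tokens(token_ids)

# Determine token type for coloring
special_tokens = set(tokenizer.all_special_tokens)
colors = []
for t in raw_tokens:
    if t in special_tokens:
        colors.append("red")  # special tokens
    elif t.startswith("▁"):  # SentencePiece uses '▁' for start of words
        colors.append("green")  # whole word
    else:
        colors.append("blue")  # subword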

# Plot
plt.figure(figsize=(15, 10))
for i, (x, y) in enumerate(emb_2d):
    plt.scatter(x, y, color=colors[i])
    plt.text(x, y, tokens[i], fontsize=6)
plt.title("Token Embeddings Projected to 2D")
plt.show()
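
The coloring step depends on telling special tokens, word-start tokens, and continuation subwords apart. The sketch below isolates that rule as a small function, assuming SentencePiece-style raw tokens in which `▁` (U+2581) marks the start of a word. The function name, the color mapping, and the sample tokens are hypothetical; in the notebook the raw tokens would come from the tokenizer rather than a hand-written list.

```python
WORD_START = "\u2581"  # '▁', the SentencePiece word-start marker

def token_kind(raw_token, special_tokens):
    """Classify a raw vocabulary token as special, word_start, or subword."""
    if raw_token in special_tokens:
        return "special"
    if raw_token.startswith(WORD_START):
        return "word_start"
    return "subword"

COLOR_BY_KIND = {"special": "red", "word_start": "green", "subword": "blue"}

# Hypothetical inputs, written by hand so the sketch runs without a model
special = {"<s>", "</s>", "<unk>"}
raw_tokens = ["<s>", WORD_START + "New", WORD_START + "method", "ologies"]

kinds = [token_kind(t, special) for t in raw_tokens]
colors = [COLOR_BY_KIND[k] for k in kinds]
print(list(zip(raw_tokens, kinds)))
```

The check has to run on raw tokens: once a token has been decoded to plain text, the word-start marker is gone and every ordinary token looks like a subword.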


# 4. **Prompt Engineering & Model Output**

- Take two tokenized prompts that yielded notably different token splits across tokenizers (e.g. one with code/math and one with informal language).
- For each:
  - Use two *different* language models (“matching” the tokenizer used) to generate text completions.
  - In a short table, report:
    - Length (in tokens and characters) of the generated output
    - Are any [UNK] tokens, empty outputs, or odd/non-conversational results observed?
- *Discuss how the tokenizer choice might affect downstream output quality and efficiency.*

BERT - It is reasonably efficient for retrieval, using 209 tokens for the five sentences (more than GPT2 but fewer than Phi-3-mini-4k-instruct), but it cannot represent emojis and some Unicode characters, which it maps to [UNK]. It is better suited to classification than to generation, and the uncased model discards capitalization.

GPT2 - It is good for retrieval because it handles unknown words by breaking them down into subwords, and it has some representation for Unicode and emojis, though each emoji costs several tokens. It produced the fewest tokens overall (195). In generation tasks, it is sensitive to spacing, since a leading space is part of the token.

Phi-3-mini-4k-instruct - It can represent Unicode and emojis without unknown tokens, but it used the most tokens of the three (212), which makes it the least efficient here. In generation tasks, it preserves capitalization and can have a more global view than WordPiece-based BERT.
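
The per-completion report asked for in this section (length in tokens and characters, unknown tokens, empty output) can be sketched as a small helper. This is a minimal illustration and not code from the original notebook: it assumes a decoder-only model whose generated ID sequence begins with the prompt IDs, and `summarize_generation`, `toy_vocab`, and `toy_decode` are hypothetical names. In a real run, `decode` would be the tokenizer's decode method and `unk_id` its unknown-token ID.

```python
def summarize_generation(prompt_ids, output_ids, decode, unk_id):
    """Report length and unknown-token count for the newly generated tokens."""
    # Assumption: the output starts with the prompt, so skip that prefix
    new_ids = list(output_ids[len(prompt_ids):])
    text = decode(new_ids)
    return {
        "tokens": len(new_ids),
        "characters": len(text),
        "unk_count": new_ids.count(unk_id),
        "empty": text.strip() == "",
    }

# Hypothetical stand-ins so the sketch runs without downloading a model
toy_vocab = {0: "<unk>", 1: "no", 2: " cap", 3: " fr"}

def toy_decode(ids):
    return "".join(toy_vocab[i] for i in ids)

row = summarize_generation([1, 2], [1, 2, 3, 0], toy_decode, unk_id=0)
print(row)
```

One such row per model and prompt could be collected into a `pandas.DataFrame`, in the same way as the comparison table in Section 2.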

# **Technical Reflection**

## Tokenizer choice affects which types of input a model can "understand"

BERT uses WordPiece. WordPiece maps unseen emojis to the [UNK] token, breaks math expressions down into individual characters, handles code syntax only at a basic level, and, in the uncased model used here, ignores capitalization when tokenizing proper nouns.
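
How a WordPiece-style tokenizer arrives at splits such as `method` + `##ologies`, and why an unseen symbol becomes `[UNK]`, can be sketched with greedy longest-match-first lookup. The code below is a simplified illustration with a tiny hypothetical vocabulary, not the actual `bert-base-uncased` implementation or vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry '##'
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"method", "##ologies", "dynamic", "##ally", "new"}

print(wordpiece("methodologies", toy_vocab))
print(wordpiece("dynamically", toy_vocab))
print(wordpiece("\U0001F30F", toy_vocab))  # an emoji missing from the vocabulary
```

With this toy vocabulary, the first two calls reproduce the splits seen in the BERT output in Section 1, and the emoji falls through to `[UNK]` because no vocabulary entry covers it.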

GPT2 uses Byte-Pair Encoding (BPE). Its byte-level BPE breaks unseen emojis down into the bytes of their UTF-8 encoding, breaks unseen math expressions down into characters, tokenizes code syntax better than WordPiece, and preserves capitalization when tokenizing proper nouns.
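
The reason BPE never needs an unknown token can be seen in a minimal merge loop: input starts as its smallest units, and learned merges are applied in priority order until none applies. The sketch below works on characters for readability, whereas GPT2 operates on bytes, and the merge list is a hypothetical toy example instead of GPT2's learned merges.

```python
def bpe_encode(word, merges):
    """Apply ranked BPE merges to a word, starting from single characters."""
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank merges first
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, highest priority first
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode("lower", toy_merges))
print(bpe_encode("xyz", toy_merges))  # nothing merges, so base units remain
```

Input that matches no merge simply stays as base units, which lengthens the sequence but never produces an unknown token.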

## Efficiency vs Semantic coverage tradeoffs
In some cases, a more efficient tokenizer (one that produces shorter sequences) captures less meaning and lacks the semantic coverage needed for general-purpose use, while a tokenizer with broader semantic coverage captures finer distinctions in meaning but incurs higher computational and time costs in training.
