# Researcher 2: Segmentation Verifier (Week 2)

As Researcher 2 in Week 2, my role is to verify the tokenization process and ensure that the segmentation into P (system text), U (user text), and A (assistant's prior text) is correct. I will do this by:
1. Loading and validating the tokenizer for the Llama-2-7b-chat model (`meta-llama/Llama-2-7b-chat-hf`).
2. Mapping tokens back to text segments for each example.
3. Ensuring that the spans for P, U, and A are correctly calculated and that there are no overlaps in these spans.
4. Performing a visual check by decoding the tokens back to text and comparing them with the original segments.
5. Saving the segmentation metadata to be used later in the pipeline.

I will also check that the span lengths are consistent with the total token count. Following the professor's note, I tokenize the whole text (P + U + A) in one pass rather than tokenizing each segment separately, because a subword tokenizer can split text differently near a segment boundary when each segment is encoded in isolation.
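
To make the professor's point concrete, here is a toy sketch of why separate and joint tokenization can disagree. It uses a made-up greedy longest-match tokenizer over a tiny hypothetical vocabulary (`TOY_VOCAB` and `toy_tokenize` are illustrative names). It is not the Llama tokenizer and only illustrates the boundary effect.

```python
# Toy illustration only: a greedy longest-match tokenizer over a tiny,
# made-up vocabulary. This is NOT how the Llama tokenizer is implemented.
TOY_VOCAB = {"ab", "a", "b", "c"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text greedily, always taking the longest vocabulary match."""
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


# Two "segments" whose boundary falls inside a mergeable pair ("a" + "b").
separate = toy_tokenize("ca") + toy_tokenize("bc")
joint = toy_tokenize("ca" + "bc")

print("separate:", separate)
print("joint:   ", joint)
```

In this toy case the joint pass merges `a` and `b` across the boundary, so the token counts differ. Span lengths measured on separately tokenized segments would therefore not line up with the joint token sequence.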


In [None]:
# Cell 1: Install necessary packages and authenticate to HuggingFace
!pip install -q transformers accelerate

from huggingface_hub import login
import getpass

# Authenticate to HuggingFace (Needed for LLaMA access)
print("Enter your HuggingFace token (starts with 'hf_'): ")
hf_token = getpass.getpass()

# Login to HF Hub
login(token=hf_token)

# Load the tokenizer for LLaMA-2-7b-chat
from transformers import AutoTokenizer

HF_TOKENIZER_NAME = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(
    HF_TOKENIZER_NAME,
    use_fast=True
)

print("Loaded tokenizer:", HF_TOKENIZER_NAME)
print("Special tokens:", tokenizer.special_tokens_map)


Enter your HuggingFace token (starts with 'hf_'): 
··········
Loaded tokenizer: meta-llama/Llama-2-7b-chat-hf
Special tokens: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}


# Tokenization and Segmentation Process

In this step, I will:
1. **Tokenize the full context**: I will tokenize the entire concatenated text (system prompt P, user query U, and assistant prior text A) in a single pass, so that segment boundaries do not change how the text is split into tokens.
2. **Calculate spans for P, U, A**: I will derive the start and end token positions of each segment (P, U, A) within that single tokenization.
3. **Sanity check**: I will check that the lengths of the P, U, and A spans sum to the total number of input tokens.
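
One caveat on the sanity check: when the U and A lengths are obtained by subtraction from prefix lengths, the length sum matches the total by construction. A stricter check, sketched below, compares the token IDs of a separately tokenized prefix with the start of the full tokenization. The helper name `prefix_is_consistent` and the ID lists are hypothetical and only for illustration.

```python
def prefix_is_consistent(prefix_ids, full_ids):
    """Return True if prefix_ids equals the first len(prefix_ids) IDs of full_ids."""
    prefix_ids = list(prefix_ids)
    return list(full_ids[:len(prefix_ids)]) == prefix_ids


# Hypothetical token IDs, not real tokenizer output.
full_ids = [11, 12, 13, 14, 15]
good_prefix = [11, 12, 13]   # agrees with the start of the full sequence
bad_prefix = [11, 12, 99]    # last token was split differently in isolation

print(prefix_is_consistent(good_prefix, full_ids))
print(prefix_is_consistent(bad_prefix, full_ids))
```

If the check fails for an example, the prefix length is not a reliable boundary inside the full token sequence, and that example's spans should be inspected by hand.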


In [None]:
# Cell 2: Load the dataset (selected_all_shuffled.jsonl)
import json
from pathlib import Path

# Define path for dataset
selected_path = Path("selected_all_shuffled.jsonl")

# Ensure the file exists
assert selected_path.exists(), f"File not found: {selected_path}"

# Load examples from the JSONL file
examples = []
with selected_path.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        examples.append(json.loads(line))

len(examples), examples[0].keys()  # Display the number of examples and the keys of the first one


(300,
 dict_keys(['system_text', 'user_text', 'assistant_prior_text', 'constraint_tags', 'dataset', 'id']))

# Segment Span Calculation

For each example:
- **P (system_text)**: Represents the instructions given to the assistant.
- **U (user_text)**: Represents the query or the input provided by the user.
- **A (assistant_prior_text)**: Represents the assistant’s prior response.

I will:
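1. Tokenize the entire context (P + U + A) together, with the segments joined by a fixed `\n\n` separator.
2. Calculate the token spans for each segment (P, U, A) by tokenizing the character prefixes that end at P and at P + U, then taking length differences. As a result, the separator tokens before U and before A are counted inside the U and A spans.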
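
As an alternative sketch, token spans can be derived from character offsets instead of prefix lengths. Fast tokenizers in `transformers` can return per-token character offsets when called with `return_offsets_mapping=True`. The helper below (`token_span_for_chars`, a hypothetical name) maps a character range to a half-open token span. The offsets in the example are written by hand for illustration and are not Llama tokenizer output.

```python
def token_span_for_chars(offsets, char_start, char_end):
    """Map a character range to a half-open token span.

    offsets: list of (start, end) character offsets, one per token.
    Returns [first, last + 1] over tokens lying fully inside
    [char_start, char_end), or None if no token does.
    Zero-width tokens (start == end) are ignored.
    """
    inside = [
        i for i, (s, e) in enumerate(offsets)
        if e > s and s >= char_start and e <= char_end
    ]
    if not inside:
        return None
    return [inside[0], inside[-1] + 1]


# Hand-written example: text "Hi\n\nyo" with P = "Hi", separator "\n\n", U = "yo".
P, sep, U = "Hi", "\n\n", "yo"
offsets = [(0, 2), (2, 3), (3, 4), (4, 6)]  # assumed offsets, one token per entry

p_span = token_span_for_chars(offsets, 0, len(P))
u_start_char = len(P) + len(sep)
u_span = token_span_for_chars(offsets, u_start_char, u_start_char + len(U))

print(p_span, u_span)
```

With this approach the separator tokens belong to neither segment. A token that straddles a segment boundary is excluded from both spans, so whether that is acceptable depends on how the spans are used later in the pipeline.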


In [None]:
# Cell 3: Segmenting and verifying spans
def build_context_and_spans(example, tokenizer):
    """
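    Given one example dict with keys:
        - id
        - system_text (P)
        - user_text (U)
        - assistant_prior_text (A)

    Returns a dict containing:
        - id
        - dataset (None if the example has no dataset field)
        - input_ids (token IDs of the full context P + sep + U + sep + A)
        - p_span ([start, end) token indices for system_text)
        - u_span ([start, end) token indices for user_text, plus its leading separator)
        - a_span ([start, end) token indices for assistant_prior_text, plus its leading separator)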
    """
    # Extract texts
    P = example.get("system_text", "") or ""
    U = example.get("user_text", "") or ""
    A = example.get("assistant_prior_text", "") or ""

    # Concatenate P, U, A together with fixed separator
    sep = "\n\n"
    context_text = P + sep + U + sep + A

    # Tokenize full context
    full_enc = tokenizer(
        context_text,
        add_special_tokens=False,
        return_attention_mask=False,
        return_tensors=None,
    )
    input_ids = full_enc["input_ids"]
    total_len = len(input_ids)

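    # Token length of P: tokenize the character prefix that ends where P ends.
    # Assumption (not verified here): the prefix tokenizes the same way inside the full context.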
    enc_P = tokenizer(
        context_text[:len(P)],
        add_special_tokens=False,
        return_attention_mask=False,
        return_tensors=None,
    )
    len_p = len(enc_P["input_ids"])

    # For P + U
    prefix_PU_end = len(P) + len(sep) + len(U)
    enc_PU = tokenizer(
        context_text[:prefix_PU_end],
        add_special_tokens=False,
        return_attention_mask=False,
        return_tensors=None,
    )
    len_pu = len(enc_PU["input_ids"])
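    len_u = len_pu - len_p  # U span length; includes the separator tokens between P and U

    # The rest is A, including the separator tokens between U and A
    len_a = total_len - len_pu

    # Sanity check (always true here, because len_u and len_a are defined as differences)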
    assert len_p + len_u + len_a == total_len, (
        f"Length mismatch for id={example.get('id')}: "
        f"len_p+len_u+len_a={len_p + len_u + len_a}, total_len={total_len}"
    )

    # Define the span for each segment (0-based, half-open: [start, end))
    p_start, p_end = 0, len_p
    u_start, u_end = p_end, p_end + len_u
    a_start, a_end = u_end, u_end + len_a

    # Sanity checks for spans
    assert 0 <= p_start <= p_end <= total_len
    assert 0 <= u_start <= u_end <= total_len
    assert 0 <= a_start <= a_end <= total_len
    assert a_end == total_len

    return {
        "id": example["id"],
        "dataset": example.get("dataset", None),
        "input_ids": input_ids,
        "p_span": [p_start, p_end],
        "u_span": [u_start, u_end],
        "a_span": [a_start, a_end],
    }

# Run the function for the first 2 examples in the dataset
test_meta = [build_context_and_spans(ex, tokenizer) for ex in examples[:2]]
test_meta


[{'id': 'alpaca:22052',
  'dataset': 'alpaca',
  'input_ids': [6991,
   3034,
   675,
   278,
   1749,
   14433,
   411,
   402,
   7982,
   1904,
   297,
   694,
   901,
   1135,
   29871,
   29947,
   3838,
   29889,
   13,
   13,
   13,
   13],
  'p_span': [0, 18],
  'u_span': [18, 20],
  'a_span': [20, 22]},
 {'id': 'alpaca:24364',
  'dataset': 'alpaca',
  'input_ids': [14350,
   263,
   26576,
   1048,
   6709,
   29889,
   10604,
   881,
   367,
   3109,
   1135,
   29871,
   29947,
   29900,
   3838,
   29889,
   13,
   13,
   13,
   13],
  'p_span': [0, 16],
  'u_span': [16, 18],
  'a_span': [18, 20]}]

# Visual Check for Segmenting

I will visualize the output to ensure that:
1. Each segment (P, U, A) has been correctly tokenized.
2. The spans for P, U, and A are correct and do not overlap.
3. The decoded tokens match the original text segments, apart from the leading separator newlines that the U and A spans include.
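
The overlap and coverage conditions can also be checked programmatically, which scales better than reading decoded text for every example. The sketch below assumes the metadata layout used in this notebook (`input_ids`, `p_span`, `u_span`, `a_span` with half-open spans). The function name `validate_spans` is hypothetical, and the token IDs in the example are placeholders because only their count matters.

```python
def validate_spans(meta):
    """Raise AssertionError unless the P, U, A spans tile input_ids exactly."""
    total = len(meta["input_ids"])
    p_start, p_end = meta["p_span"]
    u_start, u_end = meta["u_span"]
    a_start, a_end = meta["a_span"]

    assert p_start == 0, "P must start at token 0"
    assert p_start <= p_end, "P span has negative length"
    assert u_start == p_end, "gap or overlap between P and U"
    assert u_start <= u_end, "U span has negative length"
    assert a_start == u_end, "gap or overlap between U and A"
    assert a_start <= a_end, "A span has negative length"
    assert a_end == total, "A must end at the last token"
    return True


# Spans taken from the first example's output; IDs are placeholders.
ok_meta = {
    "input_ids": list(range(22)),
    "p_span": [0, 18],
    "u_span": [18, 20],
    "a_span": [20, 22],
}
print(validate_spans(ok_meta))
```

Running this over all metadata records before saving would catch gaps, overlaps, and negative-length spans without relying on visual inspection.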


In [None]:
# Cell 4: Visual check for decoded tokens
def pretty_visual_check(example, meta, tokenizer, max_chars=300):
    """
    This function prints the original text and the decoded tokenized segments (P, U, A)
    to verify that the spans are correctly calculated.
    """
    print("=" * 80)
    print("ID:", example["id"])
    print("Dataset:", example.get("dataset", ""))
    print("-" * 80)

    input_ids = meta["input_ids"]
    p_start, p_end = meta["p_span"]
    u_start, u_end = meta["u_span"]
    a_start, a_end = meta["a_span"]

    P_dec = tokenizer.decode(input_ids[p_start:p_end], skip_special_tokens=False)
    U_dec = tokenizer.decode(input_ids[u_start:u_end], skip_special_tokens=False)
    A_dec = tokenizer.decode(input_ids[a_start:a_end], skip_special_tokens=False)

    print("[system_text] original:")
    print((example.get("system_text") or "")[:max_chars])  # tolerate None, as in build_context_and_spans
    print("\n[P_span decoded]:")
    print(P_dec[:max_chars])
    print("-" * 80)

    print("[user_text] original:")
    print((example.get("user_text") or "")[:max_chars])  # tolerate None, as in build_context_and_spans
    print("\n[U_span decoded]:")
    print(U_dec[:max_chars])
    print("-" * 80)

    print("[assistant_prior_text] original:")
    print((example.get("assistant_prior_text") or "")[:max_chars])  # tolerate None, as in build_context_and_spans
    print("\n[A_span decoded]:")
    print(A_dec[:max_chars])
    print("=" * 80)

# Run visual check for the first 2 examples
for ex, meta in zip(examples[:2], test_meta[:2]):
    pretty_visual_check(ex, meta, tokenizer)


ID: alpaca:22052
Dataset: alpaca
--------------------------------------------------------------------------------
[system_text] original:
Summarize the our goals with GPT model in no more than 8 words.

[P_span decoded]:
Summarize the our goals with GPT model in no more than 8 words.
--------------------------------------------------------------------------------
[user_text] original:


[U_span decoded]:



--------------------------------------------------------------------------------
[assistant_prior_text] original:


[A_span decoded]:



ID: alpaca:24364
Dataset: alpaca
--------------------------------------------------------------------------------
[system_text] original:
Write a poem about spring. Output should be less than 80 words.

[P_span decoded]:
Write a poem about spring. Output should be less than 80 words.
--------------------------------------------------------------------------------
[user_text] original:


[U_span decoded]:



-----------------------------------------

# Save Segmentation Metadata

Once the segmentation is validated, I will save the segmentation metadata in a JSONL file for later use in the pipeline.


In [None]:
# Save segmentation metadata for all examples
segmentation_metadata = []
for ex in examples:
    meta = build_context_and_spans(ex, tokenizer)
    segmentation_metadata.append(meta)

# Save the metadata
out_path = Path("segmentation_metadata.jsonl")
with out_path.open("w", encoding="utf-8") as f:
    for meta in segmentation_metadata:
        f.write(json.dumps(meta) + "\n")

print(f"Segmentation metadata saved to {out_path}, total examples: {len(segmentation_metadata)}")


Segmentation metadata saved to segmentation_metadata.jsonl, total examples: 300
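
For later pipeline stages, a minimal sketch of reading the metadata back is shown here. It assumes the one-JSON-object-per-line layout written above. The function name `load_segmentation_metadata` and the sample record are hypothetical, and the demo writes to a temporary directory so it does not touch the real file.

```python
import json
import tempfile
from pathlib import Path


def load_segmentation_metadata(path):
    """Read a JSONL file and return a list of metadata dicts, skipping blank lines."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Made-up record with the same fields as the saved metadata.
sample = [{
    "id": "demo:0",
    "dataset": "demo",
    "input_ids": [1, 2, 3, 4],
    "p_span": [0, 2],
    "u_span": [2, 3],
    "a_span": [3, 4],
}]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "segmentation_metadata.jsonl"
    with demo_path.open("w", encoding="utf-8") as f:
        for rec in sample:
            f.write(json.dumps(rec) + "\n")
    reloaded = load_segmentation_metadata(demo_path)

print(reloaded == sample)
```

Because spans are stored as JSON lists, they come back as Python lists rather than tuples, which matches the `[start, end]` form written by `build_context_and_spans`.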
