# Llama-2-7b-chat Tokenizer

1. Logs in with Hugging Face credentials (an access token is required to run the notebook).
2. Loads the tokenizer (the tokenization step does not load the full model).
3. Runs the tokenizer on the JSONL dataset and saves the output file to the working directory.
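
For orientation, here is a sketch of what one line of the input JSONL file might look like. The field names are the ones the tokenization cell reads with `example.get(...)`; the values are made up for illustration and do not come from the real dataset.

```python
import json

# Field names are taken from the tokenization cell; the values are made up.
example = {
    "id": "demo:0",
    "dataset": "demo",
    "system_text": "Answer with yes or no only.",
    "user_text": "Is water wet?",
    "assistant_prior_text": "",
    "constraint_tags": ["no_explanations"],
}

# One record per line, which is the layout the tokenization cell reads.
line = json.dumps(example, ensure_ascii=False)
print(line)

# Round trip: each line of the input file must parse back to one dict.
assert json.loads(line) == example
```

Missing fields are tolerated by the tokenization cell: text fields default to an empty string and `id`/`dataset` default to `'unknown'`.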



In [None]:
!pip install -U transformers



## Local Inference on GPU
Model page: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

⚠️ If the generated code snippets do not work, please open an issue on the [model repo](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and/or on [huggingface.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts) 🙏

The model you are trying to use is gated. Make sure you have access to it by visiting the model page. To run inference, either set `HF_TOKEN` in your environment variables or Secrets, or run the following cell to log in. 🤗
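
As a minimal sketch of the environment-variable route, the snippet below checks whether `HF_TOKEN` is set before falling back to the interactive login. The helper name `hf_token_from_env` is hypothetical and is not part of `huggingface_hub`.

```python
import os


def hf_token_from_env(env=None):
    """Return HF_TOKEN with surrounding whitespace removed, or None if unset or empty."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


token = hf_token_from_env()
if token:
    print("HF_TOKEN is set; the interactive login can be skipped.")
else:
    print("HF_TOKEN is not set; run the login cell below.")
```

A token on its own is not enough for a gated model: the account it belongs to must also have been granted access on the model page.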

In [None]:
from huggingface_hub import login
login(new_session=False)

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

Device set to use cuda:0


[{'generated_text': [{'role': 'user', 'content': 'Who are you?'},
   {'role': 'assistant',
    'content': "  Hello! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm here to help you with any questions or topics you'd like to discuss. I'm trained on a massive dataset of text from the internet and can generate human-like responses, so feel free to chat with me like you would with a friend! 😊"}]}]
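
The pipeline returns the whole conversation, with the model's reply appended as the last message. A minimal sketch for pulling out just the reply text, using a shortened stand-in for the output above (the helper name `last_assistant_reply` is hypothetical):

```python
def last_assistant_reply(pipe_output):
    """Return the content of the final assistant message in a chat pipeline result."""
    conversation = pipe_output[0]["generated_text"]
    for message in reversed(conversation):
        if message["role"] == "assistant":
            # The reply above starts with leading spaces, so strip them.
            return message["content"].strip()
    return None


# Stand-in with the same structure as the output above (reply text shortened).
sample_output = [{"generated_text": [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "  Hello! I'm LLaMA, an AI assistant developed by Meta AI."},
]}]

print(last_assistant_reply(sample_output))
```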

In [None]:
# ============================================
# TOKENIZATION - CPU ONLY
# ============================================

# Force CPU usage - no GPU required for tokenization
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import json
from pathlib import Path
from typing import Dict, List, Tuple
from dataclasses import dataclass, asdict
from transformers import AutoTokenizer

# Log in to Hugging Face (needed only if not already logged in)
from huggingface_hub import login
login(new_session=False)

@dataclass
class TokenizedExample:
    """Container for a tokenized example with all metadata"""
    id: str
    dataset: str
    system_text: str
    user_text: str
    assistant_prior_text: str
    constraint_tags: List[str]
    p_token_ids: List[int]
    u_token_ids: List[int]
    a_token_ids: List[int]
    p_span: Tuple[int, int]
    u_span: Tuple[int, int]
    a_span: Tuple[int, int]
    total_tokens: int
    sum_len_matches: bool
    no_overlap: bool
    full_token_ids: List[int]


class DatasetTokenizer:
    """Handles tokenization of the dataset with segmentation"""

    def __init__(self, tokenizer_name: str = "meta-llama/Llama-2-7b-chat-hf"):
        print(f"Loading tokenizer: {tokenizer_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        print(f"✓ Tokenizer loaded. Vocab size: {len(self.tokenizer)}")

    def tokenize_example(self, example: Dict) -> TokenizedExample:
        """Tokenize a single example and compute segment spans"""
        system_text = example.get('system_text', '')
        user_text = example.get('user_text', '')
        assistant_prior_text = example.get('assistant_prior_text', '')

        # Tokenize each segment individually
        p_tokens = self.tokenizer.encode(system_text, add_special_tokens=False)
        u_tokens = self.tokenizer.encode(user_text, add_special_tokens=False)
        a_tokens = self.tokenizer.encode(assistant_prior_text, add_special_tokens=False)

        # Concatenate: [P] + [U] + [A]
        full_tokens = p_tokens + u_tokens + a_tokens

        # Compute spans
        p_start, p_end = 0, len(p_tokens)
        u_start, u_end = p_end, p_end + len(u_tokens)
        a_start, a_end = u_end, u_end + len(a_tokens)

        # Validation flags (true by construction here; they guard against future changes to the span logic)
        total_tokens = len(full_tokens)
        sum_len_matches = (len(p_tokens) + len(u_tokens) + len(a_tokens)) == total_tokens
        no_overlap = (p_end == u_start and u_end == a_start and a_end == total_tokens)

        return TokenizedExample(
            id=example.get('id', 'unknown'),
            dataset=example.get('dataset', 'unknown'),
            system_text=system_text,
            user_text=user_text,
            assistant_prior_text=assistant_prior_text,
            constraint_tags=example.get('constraint_tags', []),
            p_token_ids=p_tokens,
            u_token_ids=u_tokens,
            a_token_ids=a_tokens,
            p_span=(p_start, p_end),
            u_span=(u_start, u_end),
            a_span=(a_start, a_end),
            total_tokens=total_tokens,
            sum_len_matches=sum_len_matches,
            no_overlap=no_overlap,
            full_token_ids=full_tokens
        )

    def validate_tokenization(self, tokenized: TokenizedExample) -> bool:
        """Validate tokenization correctness"""
        if not tokenized.sum_len_matches:
            print(f"WARNING [{tokenized.id}]: Sum of segment lengths doesn't match total")
            return False

        if not tokenized.no_overlap:
            print(f"WARNING [{tokenized.id}]: Segments overlap or have gaps")
            return False

        reconstructed = (tokenized.p_token_ids + tokenized.u_token_ids + tokenized.a_token_ids)
        if reconstructed != tokenized.full_token_ids:
            print(f"WARNING [{tokenized.id}]: Reconstructed tokens don't match full sequence")
            return False

        # Verify spans point to correct tokens
        p_from_full = tokenized.full_token_ids[tokenized.p_span[0]:tokenized.p_span[1]]
        u_from_full = tokenized.full_token_ids[tokenized.u_span[0]:tokenized.u_span[1]]
        a_from_full = tokenized.full_token_ids[tokenized.a_span[0]:tokenized.a_span[1]]

        if p_from_full != tokenized.p_token_ids:
            print(f"WARNING [{tokenized.id}]: P span doesn't match P tokens")
            return False
        if u_from_full != tokenized.u_token_ids:
            print(f"WARNING [{tokenized.id}]: U span doesn't match U tokens")
            return False
        if a_from_full != tokenized.a_token_ids:
            print(f"WARNING [{tokenized.id}]: A span doesn't match A tokens")
            return False

        return True

    def process_jsonl(self, input_path: str, output_path: str) -> Dict:
        """Process entire JSONL file"""
        print(f"\nProcessing: {input_path}")
        print(f"Output will be saved to: {output_path}")

        input_path = Path(input_path)
        output_path = Path(output_path)

        if not input_path.exists():
            raise FileNotFoundError(f"Input file not found: {input_path}")

        stats = {
            'total_examples': 0,
            'valid_examples': 0,
            'invalid_examples': 0,
            'total_tokens': 0,
            'avg_tokens_per_example': 0,
            'avg_p_tokens': 0,
            'avg_u_tokens': 0,
            'avg_a_tokens': 0,
            'examples_by_dataset': {}
        }

        tokenized_examples = []

        # Process each line
        with open(input_path, 'r', encoding='utf-8') as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue  # skip blank lines instead of reporting them as JSON errors
                try:
                    example = json.loads(line)
                    tokenized = self.tokenize_example(example)
                    is_valid = self.validate_tokenization(tokenized)

                    tokenized_examples.append(tokenized)
                    stats['total_examples'] += 1

                    if is_valid:
                        stats['valid_examples'] += 1
                    else:
                        stats['invalid_examples'] += 1

                    stats['total_tokens'] += tokenized.total_tokens
                    stats['avg_p_tokens'] += len(tokenized.p_token_ids)
                    stats['avg_u_tokens'] += len(tokenized.u_token_ids)
                    stats['avg_a_tokens'] += len(tokenized.a_token_ids)

                    dataset = tokenized.dataset
                    if dataset not in stats['examples_by_dataset']:
                        stats['examples_by_dataset'][dataset] = 0
                    stats['examples_by_dataset'][dataset] += 1

                except json.JSONDecodeError:
                    print(f"ERROR: Could not parse JSON on line {line_num}")
                    continue
                except Exception as e:
                    print(f"ERROR on line {line_num}: {str(e)}")
                    continue

        # Compute averages
        if stats['total_examples'] > 0:
            stats['avg_tokens_per_example'] = stats['total_tokens'] / stats['total_examples']
            stats['avg_p_tokens'] /= stats['total_examples']
            stats['avg_u_tokens'] /= stats['total_examples']
            stats['avg_a_tokens'] /= stats['total_examples']

        # Write output
        print(f"\nWriting tokenized data to: {output_path}")
        with open(output_path, 'w', encoding='utf-8') as f:
            for tokenized in tokenized_examples:
                output_dict = asdict(tokenized)
                f.write(json.dumps(output_dict, ensure_ascii=False) + '\n')

        print(f"\n✓ Successfully tokenized {stats['total_examples']} examples")
        print(f"  Valid: {stats['valid_examples']}, Invalid: {stats['invalid_examples']}")

        return stats

    def print_statistics(self, stats: Dict):
        """Print detailed statistics"""
        print("\n" + "="*60)
        print("TOKENIZATION STATISTICS")
        print("="*60)
        print(f"Total examples: {stats['total_examples']}")
        print(f"  ✓ Valid: {stats['valid_examples']}")
        print(f"  ✗ Invalid: {stats['invalid_examples']}")
        print(f"\nToken counts:")
        print(f"  Total tokens: {stats['total_tokens']:,}")
        print(f"  Avg tokens/example: {stats['avg_tokens_per_example']:.1f}")
        print(f"  Avg P (system): {stats['avg_p_tokens']:.1f}")
        print(f"  Avg U (user): {stats['avg_u_tokens']:.1f}")
        print(f"  Avg A (assistant): {stats['avg_a_tokens']:.1f}")
        print(f"\nExamples by dataset:")
        for dataset, count in stats['examples_by_dataset'].items():
            print(f"  {dataset}: {count}")
        print("="*60)

    def display_sample(self, input_path: str, num_samples: int = 3):
        """Display sample tokenizations"""
        print("\n" + "="*60)
        print("SAMPLE TOKENIZATIONS")
        print("="*60)

        with open(input_path, 'r', encoding='utf-8') as f:
            for i, line in enumerate(f):
                if i >= num_samples:
                    break

                example = json.loads(line.strip())
                tokenized = self.tokenize_example(example)

                print(f"\n--- Sample {i+1}: {tokenized.id} ---")
                print(f"Dataset: {tokenized.dataset}")
                print(f"Constraint tags: {tokenized.constraint_tags}")
                print(f"\nP (system): '{tokenized.system_text[:80]}...'")
                print(f"  Tokens: {len(tokenized.p_token_ids)}, Span: {tokenized.p_span}")
                print(f"  First 5 IDs: {tokenized.p_token_ids[:5]}")

                print(f"\nU (user): '{tokenized.user_text[:80]}...'")
                print(f"  Tokens: {len(tokenized.u_token_ids)}, Span: {tokenized.u_span}")
                print(f"  First 5 IDs: {tokenized.u_token_ids[:5]}")

                print(f"\nA (assistant): '{tokenized.assistant_prior_text[:80]}...'")
                print(f"  Tokens: {len(tokenized.a_token_ids)}, Span: {tokenized.a_span}")
                print(f"  First 5 IDs: {tokenized.a_token_ids[:5]}")

                print(f"\nTotal: {tokenized.total_tokens} tokens")
                print(f"Valid: sum_len={tokenized.sum_len_matches}, no_overlap={tokenized.no_overlap}")

        print("="*60)


# ============================================
# RUN TOKENIZATION
# ============================================

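print("Upload selected_all_shuffled.jsonl:")
from google.colab import files
# Note: if a file with this name already exists, Colab saves the upload under a
# new name (e.g. "selected_all_shuffled (1).jsonl") and the paths below read the older copy.
uploaded = files.upload()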

# Initialize tokenizer
print("\n🔧 Initializing LLaMA-2-7b-chat tokenizer...")
tokenizer = DatasetTokenizer(tokenizer_name="meta-llama/Llama-2-7b-chat-hf")

# Show 3 sample tokenizations
tokenizer.display_sample("selected_all_shuffled.jsonl", num_samples=3)

# Process full dataset
print("\n Processing full dataset...")
stats = tokenizer.process_jsonl(
    input_path="selected_all_shuffled.jsonl",
    output_path="selected_all_tokenized.jsonl"
)

# Print statistics
tokenizer.print_statistics(stats)

# Download result
print("\n💾 Downloading output file...")
files.download("selected_all_tokenized.jsonl")

print("\n Your tokenized data is ready.")
print("The output file contains:")
print("  - Token IDs for P/U/A segments")
print("  - Span indices for attention extraction")
print("  - Validation flags (all should be True)")

Upload selected_all_shuffled.jsonl:


Saving selected_all_shuffled.jsonl to selected_all_shuffled (1).jsonl

🔧 Initializing LLaMA-2-7b-chat tokenizer...
Loading tokenizer: meta-llama/Llama-2-7b-chat-hf
✓ Tokenizer loaded. Vocab size: 32000

SAMPLE TOKENIZATIONS

--- Sample 1: flan:664 ---
Dataset: flan
Constraint tags: ['no_explanations']

P (system): 'Can we conclude from "A man is jumping into a screened-in outdoor pool." that "T...'
  Tokens: 279, Span: (0, 279)
  First 5 IDs: [1815, 591, 17668, 515, 376]

U (user): '...'
  Tokens: 0, Span: (279, 279)
  First 5 IDs: []

A (assistant): '...'
  Tokens: 0, Span: (279, 279)
  First 5 IDs: []

Total: 279 tokens
Valid: sum_len=True, no_overlap=True

--- Sample 2: flan:896 ---
Dataset: flan
Constraint tags: ['no_explanations']

P (system): 'Given the sentence "A snowboarder slides across an icy table." is it true that "...'
  Tokens: 279, Span: (0, 279)
  First 5 IDs: [11221, 278, 10541, 376, 29909]

U (user): '...'
  Tokens: 0, Span: (279, 279)
  First 5 IDs: []

A (assistant


 Your tokenized data is ready.
The output file contains:
  - Token IDs for P/U/A segments
  - Span indices for attention extraction
  - Validation flags (all should be True)
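
To illustrate how the span indices can be used downstream, here is a small sketch that reads records in the output format and slices `full_token_ids` by each span. The record is made up for illustration (the token IDs are placeholders, not real tokenizer output), and `segment_slices` is a hypothetical helper.

```python
import io
import json


def segment_slices(record):
    """Slice full_token_ids into P/U/A segments using the stored half-open spans."""
    ids = record["full_token_ids"]
    return {
        name: ids[record[f"{name}_span"][0]:record[f"{name}_span"][1]]
        for name in ("p", "u", "a")
    }


# Made-up record in the output schema; JSON stores the span tuples as lists.
fake_jsonl = io.StringIO(json.dumps({
    "id": "demo:0",
    "p_span": [0, 3], "u_span": [3, 5], "a_span": [5, 5],
    "p_token_ids": [11, 12, 13], "u_token_ids": [21, 22], "a_token_ids": [],
    "full_token_ids": [11, 12, 13, 21, 22],
}) + "\n")

for line in fake_jsonl:
    record = json.loads(line)
    segments = segment_slices(record)
    # Each slice should reproduce the separately stored segment IDs.
    for name in ("p", "u", "a"):
        assert segments[name] == record[f"{name}_token_ids"]
    print(record["id"], {k: len(v) for k, v in segments.items()})
```

An empty segment, as in the samples above where U and A have zero tokens, yields a zero-width span such as `(279, 279)` and an empty slice.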
