# Tutorial 1: Data and Vocabulary

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/byu-matrix-lab/torchlingo/blob/main/docs/docs/tutorials/01-data-and-vocab.ipynb)

Learn how to load parallel data, build vocabularies, and prepare your data for neural machine translation.

**⚡ Running in Google Colab?** Make sure to:
1. Go to **Runtime → Change runtime type → GPU** (optional but faster)
2. Uncomment and run the `%pip install torchlingo` cell below

In [None]:
# Install TorchLingo (uncomment in Google Colab)
# %pip install torchlingo

# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.9.1
CUDA available: False


In [None]:
# Import TorchLingo modules
from torchlingo.preprocessing import load_data, parallel_txt_to_dataframe
from torchlingo.data_processing import SimpleVocab, NMTDataset
from torchlingo.config import Config, get_default_config
from pathlib import Path

print("✓ Imports successful!")

✓ Imports successful!


## Part 1: Loading Data

TorchLingo's `load_data()` reads several file formats (TSV, CSV, JSON, and Parquet). Let's create a small sample dataset, save it as TSV, and load it back.

In [15]:
import pandas as pd

# Create sample parallel data
sample_data = {
    "src": [
        "Hello world",
        "Hello neighbor",
        "Cruel world",
        "How are you today",
        "Good morning",
        "Thank you very much",
        "I love programming",
        "The cat is sleeping",
        "What is your name",
        "Nice to meet you",
    ],
    "tgt": [
        "Hola mundo",
        "Hola vecino",
        "Mundo cruel",
        "Cómo estás hoy",
        "Buenos días",
        "Muchas gracias",
        "Me encanta programar",
        "El gato está durmiendo",
        "Cuál es tu nombre",
        "Mucho gusto",
    ]
}

df = pd.DataFrame(sample_data)
print("Sample data:")
df

Sample data:


| | src | tgt |
|---|-----|-----|
| 0 | Hello world | Hola mundo |
| 1 | Hello neighbor | Hola vecino |
| 2 | Cruel world | Mundo cruel |
| 3 | How are you today | Cómo estás hoy |
| 4 | Good morning | Buenos días |
| 5 | Thank you very much | Muchas gracias |
| 6 | I love programming | Me encanta programar |
| 7 | The cat is sleeping | El gato está durmiendo |
| 8 | What is your name | Cuál es tu nombre |
| 9 | Nice to meet you | Mucho gusto |


In [16]:
# Save as TSV (tab-separated values)
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

data_path = data_dir / "sample_train.tsv"
df.to_csv(data_path, sep="\t", index=False)
print(f"Saved to {data_path}")

Saved to data/sample_train.tsv


In [17]:
# Load it back
loaded_df = load_data(data_path)
print(f"Loaded {len(loaded_df)} rows")
loaded_df.head()

Loaded 10 rows


| | src | tgt |
|---|-----|-----|
| 0 | Hello world | Hola mundo |
| 1 | Hello neighbor | Hola vecino |
| 2 | Cruel world | Mundo cruel |
| 3 | How are you today | Cómo estás hoy |
| 4 | Good morning | Buenos días |
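
To make the file layout concrete, here is a minimal sketch that writes and re-reads a two-column parallel file using only Python's standard `csv` module. It illustrates the TSV structure (a `src`/`tgt` header row, then one tab-separated pair per line) and is not TorchLingo's `load_data()` implementation; the helper names `write_parallel_tsv` and `read_parallel_tsv` are hypothetical.

```python
import csv
import io


def write_parallel_tsv(pairs, handle):
    """Write (src, tgt) pairs as tab-separated rows under a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["src", "tgt"])
    writer.writerows(pairs)


def read_parallel_tsv(handle):
    """Read a TSV with a src/tgt header into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


pairs = [("Hello world", "Hola mundo"), ("Good morning", "Buenos días")]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
write_parallel_tsv(pairs, buffer)
buffer.seek(0)
rows = read_parallel_tsv(buffer)

print(rows[0]["src"], "->", rows[0]["tgt"])
```

Each row comes back as a dictionary keyed by the header names, which mirrors the `src` and `tgt` columns of the DataFrame above.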


## Part 2: Building Vocabularies

Neural networks work with numbers, not text. A vocabulary maps words to indices and back.

In [None]:
# Create a vocabulary
vocab = SimpleVocab()

# Build from sentences
vocab.build_vocab(df["src"].tolist())

print(f"Vocabulary size: {len(vocab)}")
print("\nToken to index mapping:")
for token, idx in list(vocab.token2idx.items())[:10]:
    print(f"  '{token}' → {idx}")

Vocabulary size: 8

Token to index mapping:
  '<pad>' → 0
  '<unk>' → 1
  '<sos>' → 2
  '<eos>' → 3
  'Hello' → 4
  'world' → 5
  'you' → 6
  'is' → 7

Only `Hello`, `world`, `you`, and `is` appear alongside the special tokens. Each of them occurs at least twice in the source sentences, so the output suggests that `build_vocab` drops words seen only once by default.


### Special Tokens

Notice the first four tokens are special:

| Token | Index | Purpose |
|-------|-------|------|
| `<pad>` | 0 | Padding for batching |
| `<unk>` | 1 | Unknown words |
| `<sos>` | 2 | Start of sequence |
| `<eos>` | 3 | End of sequence |
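
The mapping above can be reproduced with a few lines of plain Python. The sketch below is an illustration, not `SimpleVocab`'s actual code: it assumes whitespace tokenization and a minimum frequency of 2 (inferred from the output above, where only repeated words were kept), and the name `build_token2idx` is hypothetical.

```python
from collections import Counter

SPECIAL_TOKENS = ["<pad>", "<unk>", "<sos>", "<eos>"]


def build_token2idx(sentences, min_freq=2):
    """Give special tokens indices 0-3, then index each frequent word."""
    counts = Counter(word for s in sentences for word in s.split())
    token2idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for word, freq in counts.items():  # Counter keeps first-seen order
        if freq >= min_freq:           # assumed threshold, see lead-in
            token2idx[word] = len(token2idx)
    return token2idx


sentences = [
    "Hello world", "Hello neighbor", "Cruel world", "How are you today",
    "Good morning", "Thank you very much", "I love programming",
    "The cat is sleeping", "What is your name", "Nice to meet you",
]

token2idx = build_token2idx(sentences)
idx2token = {idx: tok for tok, idx in token2idx.items()}  # reverse lookup

print(len(token2idx))
print(token2idx)
```

Reserving the lowest indices for the special tokens keeps them stable no matter which words the data contains, so the padding index is always 0.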

In [19]:
# Encode a sentence
sentence = "Hello world"
indices = vocab.encode(sentence, add_special_tokens=True)

print(f"Original: '{sentence}'")
print(f"Encoded:  {indices}")
print("Meaning:  [SOS, 'Hello', 'world', EOS]")

Original: 'Hello world'
Encoded:  [2, 4, 5, 3]
Meaning:  [SOS, 'Hello', 'world', EOS]


In [20]:
# Decode back to text
decoded = vocab.decode(indices, skip_special_tokens=True)
print(f"Decoded: '{decoded}'")

# With special tokens
decoded_with_special = vocab.decode(indices, skip_special_tokens=False)
print(f"With special tokens: '{decoded_with_special}'")

Decoded: 'Hello world'
With special tokens: '<sos> Hello world <eos>'


### Handling Unknown Words

In [22]:
# What happens with words not in our vocabulary?
unknown_sentence = "Hello universe"
indices = vocab.encode(unknown_sentence, add_special_tokens=True)

print(f"Original: '{unknown_sentence}'")
print(f"Encoded:  {indices}")
# With special tokens
decoded_with_special = vocab.decode(indices, skip_special_tokens=False)
print(f"With special tokens: '{decoded_with_special}'")

Original: 'Hello universe'
Encoded:  [2, 4, 1, 3]
With special tokens: '<sos> Hello <unk> <eos>'
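
The encode and decode behaviour shown in the cells above can be sketched in plain Python. This is a simplified illustration rather than `SimpleVocab`'s implementation: it assumes whitespace tokenization, hard-codes the mapping printed earlier, and assumes that `skip_special_tokens=True` drops `<pad>`, `<sos>`, and `<eos>` but keeps `<unk>`.

```python
PAD, UNK, SOS, EOS = 0, 1, 2, 3

# Mapping copied from the vocabulary printed earlier in this tutorial.
token2idx = {
    "<pad>": PAD, "<unk>": UNK, "<sos>": SOS, "<eos>": EOS,
    "Hello": 4, "world": 5, "you": 6, "is": 7,
}
idx2token = {idx: tok for tok, idx in token2idx.items()}


def encode(sentence, add_special_tokens=True):
    """Look up each word; anything missing falls back to the <unk> index."""
    ids = [token2idx.get(word, UNK) for word in sentence.split()]
    return [SOS] + ids + [EOS] if add_special_tokens else ids


def decode(indices, skip_special_tokens=True):
    """Turn indices back into text, optionally dropping pad/sos/eos."""
    special = {PAD, SOS, EOS}  # assumption: <unk> is kept when skipping
    kept = [i for i in indices if not (skip_special_tokens and i in special)]
    return " ".join(idx2token.get(i, "<unk>") for i in kept)


ids = encode("Hello universe")
print(ids)
print(decode(ids, skip_special_tokens=False))
```

Because `universe` is replaced by index 1 during encoding, the original word cannot be recovered when decoding; the round trip is lossy for out-of-vocabulary words.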


## Part 3: Creating Datasets

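The `NMTDataset` class combines data loading and vocabulary building: given a file path, it builds separate source and target vocabularies and returns each sentence pair as a pair of index tensors.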

In [23]:
# Create dataset (vocabularies built automatically)
dataset = NMTDataset(data_path)

print(f"Dataset size: {len(dataset)} samples")
print(f"Source vocab: {len(dataset.src_vocab)} tokens")
print(f"Target vocab: {len(dataset.tgt_vocab)} tokens")

Dataset size: 10 samples
Source vocab: 8 tokens
Target vocab: 5 tokens

The target vocabulary has only 5 entries: the four special tokens plus `Hola`, the only target word that occurs at least twice. Note that `mundo` and `Mundo` count as different words because their capitalization differs.


In [24]:
# Access a sample
src_tensor, tgt_tensor = dataset[0]

print("Sample 0:")
print(f"  Source sentence: '{dataset.src_sentences[0]}'")
print(f"  Source tensor:   {src_tensor.tolist()}")
print(f"  Target sentence: '{dataset.tgt_sentences[0]}'")
print(f"  Target tensor:   {tgt_tensor.tolist()}")

Sample 0:
  Source sentence: 'Hello world'
  Source tensor:   [2, 4, 5, 3]
  Target sentence: 'Hola mundo'
  Target tensor:   [2, 4, 1, 3]


In [26]:
# Decode to verify
src_decoded = dataset.src_vocab.decode(src_tensor.tolist(), skip_special_tokens=False)
tgt_decoded = dataset.tgt_vocab.decode(tgt_tensor.tolist(), skip_special_tokens=False)

print(f"Source decoded: '{src_decoded}'")
print(f"Target decoded: '{tgt_decoded}'")

Source decoded: '<sos> Hello world <eos>'
Target decoded: '<sos> Hola <unk> <eos>'
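
To see why `mundo` comes back as `<unk>`, it helps to sketch how a parallel dataset might hold one vocabulary per language. The class below is a toy stand-in, not `NMTDataset` itself: it returns plain lists instead of tensors, assumes whitespace tokenization and a minimum frequency of 2, and the names `ToyParallelDataset`, `build_vocab`, and `encode` are hypothetical.

```python
from collections import Counter

SPECIALS = ["<pad>", "<unk>", "<sos>", "<eos>"]


def build_vocab(sentences, min_freq=2):
    """Special tokens first, then every word seen at least min_freq times."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Wrap the word indices in <sos> ... <eos>; unknown words map to <unk>."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


class ToyParallelDataset:
    """Pairs of sentences, each side encoded with its own vocabulary."""

    def __init__(self, src_sentences, tgt_sentences):
        if len(src_sentences) != len(tgt_sentences):
            raise ValueError("source and target must have the same length")
        self.src_sentences = src_sentences
        self.tgt_sentences = tgt_sentences
        self.src_vocab = build_vocab(src_sentences)
        self.tgt_vocab = build_vocab(tgt_sentences)

    def __len__(self):
        return len(self.src_sentences)

    def __getitem__(self, i):
        return (encode(self.src_sentences[i], self.src_vocab),
                encode(self.tgt_sentences[i], self.tgt_vocab))


toy = ToyParallelDataset(
    ["Hello world", "Hello neighbor", "Cruel world"],
    ["Hola mundo", "Hola vecino", "Mundo cruel"],
)
print(toy[0])
```

On the source side both `Hello` and `world` repeat, so both get indices. On the target side only `Hola` repeats, which leaves `mundo` to fall back to index 1.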


## Part 4: Batching

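Training processes several samples at once, but the sentences in a batch usually differ in length. The `collate_fn` pads shorter sequences with the `<pad>` index (0) so that each batch forms a rectangular tensor. Let's see how that works.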

In [28]:
from torch.utils.data import DataLoader
from torchlingo.data_processing import collate_fn

# Create a data loader
loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    collate_fn=collate_fn,
)

# Get one batch
src_batch, tgt_batch = next(iter(loader))

print("Batch shapes:")
print(f"  Source: {src_batch.shape}  (batch_size, max_src_len)")
print(f"  Target: {tgt_batch.shape}  (batch_size, max_tgt_len)")

Batch shapes:
  Source: torch.Size([4, 6])  (batch_size, max_src_len)
  Target: torch.Size([4, 6])  (batch_size, max_tgt_len)


In [29]:
# Inspect the batch
print("Source batch (notice padding with 0s):")
for i, seq in enumerate(src_batch):
    print(f"  Sample {i}: {seq.tolist()}")

Source batch (notice padding with 0s):
  Sample 0: [2, 1, 7, 1, 1, 3]
  Sample 1: [2, 1, 1, 7, 1, 3]
  Sample 2: [2, 1, 1, 1, 6, 3]
  Sample 3: [2, 1, 1, 1, 3, 0]
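
The padding step can be sketched in plain Python. This is an illustration of the idea, not TorchLingo's `collate_fn`: it works on lists rather than tensors, the names `pad_batch` and `toy_collate` are hypothetical, and the target sequences in the example are made-up values. For tensors, PyTorch's `torch.nn.utils.rnn.pad_sequence` with `batch_first=True` performs the analogous operation.

```python
def pad_batch(sequences, pad_idx=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]


def toy_collate(samples, pad_idx=0):
    """Split (src, tgt) pairs and pad each side independently."""
    src_seqs, tgt_seqs = zip(*samples)
    return pad_batch(list(src_seqs), pad_idx), pad_batch(list(tgt_seqs), pad_idx)


samples = [
    ([2, 1, 7, 1, 1, 3], [2, 1, 1, 1, 1, 3]),  # target values are made up
    ([2, 1, 1, 1, 3], [2, 1, 1, 3]),
]

src_batch, tgt_batch = toy_collate(samples)
for row in src_batch:
    print(row)
```

Source and target are padded separately, so the two sides of a batch can end up with different widths, each set by its own longest sequence.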


## Summary

You've learned:

1. **Loading data**: Use `load_data()` for TSV, CSV, JSON, Parquet
2. **Vocabularies**: Map words ↔ indices with special tokens
3. **Datasets**: `NMTDataset` handles encoding automatically
4. **Batching**: `collate_fn` pads sequences for batching

## Next Steps

Continue to [Tutorial 2: Training a Tiny Model](02-train-tiny-model.ipynb) to build and train your first translation model!