## Tokenization & Dataset Formatting

In this step, we:
- Load our cleaned `train/val/test.csv` files
- Use the Hugging Face `roberta-base` tokenizer
- Convert each article (`full_text`) into numerical tokens (input IDs)
- Add attention masks so the model ignores padding

In [None]:
import pandas as pd
import torch # For working with tensors (required for model training)
from transformers import RobertaTokenizer # Loads the RoBERTa tokenizer for text → tokens
import pickle
import os

In [None]:
train_df = pd.read_csv('../data/train.csv') # Load Training Data
val_df = pd.read_csv('../data/val.csv')     # Load Validation Data
test_df = pd.read_csv('../data/test.csv')   # Load Test Data

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

### Tokenize the Full Text in Each Dataset

We will tokenize the `full_text` column of each dataset:
- Apply truncation so text longer than 512 tokens is cut off
- Apply padding so shorter texts match the longest sequence in the batch (at most 512 tokens)
- Return PyTorch tensors directly (`return_tensors='pt'`)
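
To make the truncation and padding options above concrete, here is a minimal pure-Python sketch of what they do to lists of token IDs. It is an illustration, not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and the padding ID of 1 is assumed to match `roberta-base`. The real tokenizer also keeps the closing `</s>` token when it truncates, which this sketch does not.

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]],
    max_length: int = 512,
    pad_id: int = 1,  # assumption: the padding ID used by roberta-base
) -> Dict[str, List[List[int]]]:
    """Toy version of truncation=True, padding=True on lists of token IDs."""
    # Truncation: cut every sequence down to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads to the longest sequence in the batch, not to max_length.
    target = max((len(ids) for ids in truncated), default=0)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up "articles" of different lengths.
toy = pad_and_truncate([[0, 100, 200, 2], [0, 300, 2]], max_length=512)
print(toy["input_ids"])       # [[0, 100, 200, 2], [0, 300, 2, 1]]
print(toy["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The shorter sequence is padded only up to the longest one in the batch. To force every sequence to exactly 512 tokens, the tokenizer accepts `padding='max_length'` instead of `padding=True`.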

In [None]:
train_encodings = tokenizer(
    train_df['full_text'].to_list(),    # List of strings (news articles)
    truncation = True,                  # Cut off articles longer than 512 tokens
    padding = True,                     # Pad shorter ones to the longest article in the batch
    max_length = 512,                   # Max sequence length roberta-base supports
    return_tensors = 'pt'               # Return PyTorch tensors
)

val_encodings = tokenizer(
    val_df['full_text'].to_list(),
    truncation = True,
    padding = True,
    max_length = 512,
    return_tensors = 'pt'
)

test_encodings = tokenizer(
    test_df['full_text'].to_list(),     # Tokenize the test split (not val_df)
    truncation = True,
    padding = True,
    max_length = 512,                   # Must be an integer, same limit as the other splits
    return_tensors = 'pt'
)


In [None]:
train_labels = torch.tensor(train_df['label'].values) # Converts labels into tensors
val_labels = torch.tensor(val_df['label'].values)
test_labels = torch.tensor(test_df['label'].values)

In [None]:
print(tokenizer.decode(train_encodings['input_ids'][0])) # Decode tokenized article 0
print("Label: ", train_labels[0].item()) # Print its label (0 or 1)
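
Decoding one article is a quick spot check. A slightly broader check, sketched below, confirms that the attention mask is 0 exactly where the input holds padding. `mask_matches_padding` is a hypothetical helper, and it assumes the padding ID never appears as a real token inside an article.

```python
from typing import List


def mask_matches_padding(
    input_ids: List[List[int]],
    attention_mask: List[List[int]],
    pad_id: int = 1,  # assumption: the padding ID used by roberta-base
) -> bool:
    """Return True if the mask is 0 exactly at padding positions."""
    if len(input_ids) != len(attention_mask):
        return False
    for ids, mask in zip(input_ids, attention_mask):
        if len(ids) != len(mask):
            return False
        for token_id, flag in zip(ids, mask):
            # A position is padding if and only if its mask value is 0.
            if (token_id == pad_id) != (flag == 0):
                return False
    return True


# Made-up example: the second row ends with one padding token.
ok = mask_matches_padding([[0, 7, 2], [0, 2, 1]], [[1, 1, 1], [1, 1, 0]])
bad = mask_matches_padding([[0, 7, 2], [0, 2, 1]], [[1, 1, 1], [1, 1, 1]])
print(ok, bad)  # True False
```

In this notebook, the call would likely look like `mask_matches_padding(train_encodings['input_ids'].tolist(), train_encodings['attention_mask'].tolist(), tokenizer.pad_token_id)`.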

In [None]:
os.makedirs("artifacts", exist_ok=True)
# Save tokenized inputs and labels
with open("artifacts/train_encodings.pkl", "wb") as f:
    pickle.dump(train_encodings, f)
with open("artifacts/train_labels.pkl", "wb") as f:
    pickle.dump(train_labels, f)
with open("artifacts/val_encodings.pkl", "wb") as f:
    pickle.dump(val_encodings, f)
with open("artifacts/val_labels.pkl", "wb") as f:
    pickle.dump(val_labels, f)
with open("artifacts/test_encodings.pkl", "wb") as f:
    pickle.dump(test_encodings, f)
with open("artifacts/test_labels.pkl", "wb") as f:
    pickle.dump(test_labels, f)
with open("artifacts/train_df.pkl", "wb") as f:
    pickle.dump(train_df, f)
with open("artifacts/val_df.pkl", "wb") as f:
    pickle.dump(val_df, f)
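
The repeated `with open(...)` blocks above could be folded into a small helper, and a later notebook will need to read the files back. Here is a minimal sketch of both directions. `save_artifacts` and `load_artifact` are hypothetical helper names, and the round trip runs on stand-in data in a temporary directory rather than on the real encodings.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def save_artifacts(objects: Dict[str, Any], out_dir: str = "artifacts") -> None:
    """Pickle each object to <out_dir>/<name>.pkl."""
    os.makedirs(out_dir, exist_ok=True)
    for name, obj in objects.items():
        with open(os.path.join(out_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump(obj, f)


def load_artifact(name: str, out_dir: str = "artifacts") -> Any:
    """Load <out_dir>/<name>.pkl back into memory."""
    with open(os.path.join(out_dir, f"{name}.pkl"), "rb") as f:
        return pickle.load(f)


# Round-trip check on stand-in data, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_artifacts({"demo_labels": [0, 1, 1, 0]}, out_dir=tmp_dir)
    restored = load_artifact("demo_labels", out_dir=tmp_dir)

print(restored)  # [0, 1, 1, 0]
```

With the notebook's variables, the save step would be something like `save_artifacts({"train_encodings": train_encodings, "train_labels": train_labels})`, extended with the remaining splits. Pickle files can execute code when loaded, so only load artifacts you created yourself.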

## Done: Tokenization and Formatting Complete

- Loaded the cleaned `train/val/test` data
- Tokenized each article using `roberta-base`
- Converted the labels into PyTorch tensors
- Verified that tokenization works by decoding a sample article
- Saved the tokenized inputs, labels, and DataFrames to `artifacts/`
