# How can we think of text as numbers for quantitative analysis -- Continued!

We can turn text into embeddings, but how do we fine-tune a transformer model on our own data? What do we need to do to our texts and labels before fine-tuning with `transformers.Trainer`?

Hugging Face’s `Trainer` API works most smoothly with data in a `datasets.Dataset` (a PyTorch `Dataset` also works).

We will:
1. Take raw Python lists of texts (and labels).
2. Turn them into a `Dataset`.
3. Tokenize them for a given transformer model.
4. Feed them into `Trainer` for training.

We’ll use a simple text classification example, but the same pattern applies to many tasks.

### 1. Create a `Dataset` from Raw Python Lists

Assume we have a list of short texts and corresponding labels:

- `0` = negative
- `1` = positive

We'll create a `datasets.Dataset` from them.


In [None]:
from datasets import Dataset

In [None]:
# Example raw data
texts = [
    "I love this movie, it was fantastic!",
    "The plot was boring and predictable.",
    "Amazing performance by the lead actor.",
    "I didn't enjoy the film at all.",
    "It was okay, some parts were good."
]

In [None]:
labels = [1, 0, 1, 0, 1]  # 1 = positive, 0 = negative

Create a `Dataset` from a dictionary:

In [None]:
raw_dataset = Dataset.from_dict({
    "text": texts,
    "label": labels,
})

raw_dataset

For now we’ll pretend this small dataset is our training data.
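
Real data often arrives as a list of records (one dict per example) rather than as parallel lists. As a minimal sketch in plain Python, the hypothetical helper `rows_to_columns` below reshapes such records into the column-oriented dictionary that `Dataset.from_dict` takes. The helper name and the sample rows are illustrative assumptions, not part of the `datasets` library.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists.

    Hypothetical helper for illustration; assumes every row has the same keys.
    """
    if not rows:
        return {}
    keys = list(rows[0].keys())
    for row in rows:
        if set(row.keys()) != set(keys):
            raise ValueError(f"Row has unexpected keys: {sorted(row.keys())}")
    return {key: [row[key] for row in rows] for key in keys}


rows = [
    {"text": "I love this movie, it was fantastic!", "label": 1},
    {"text": "The plot was boring and predictable.", "label": 0},
]
columns = rows_to_columns(rows)
print(columns["label"])  # [1, 0]
```

The resulting `columns` dict has the same layout as the dictionary passed to `Dataset.from_dict` above.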

### 2. Tokenize the Dataset

`Trainer` works with tokenized inputs. We’ll use BERT, but you can swap in any model name from the Hugging Face Hub.

Steps:

1. Load the tokenizer.
2. Define a `tokenize_function`.
3. Use `.map()` to apply it to every example in the dataset.

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"  # or any other model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(example):
    # With batched=True, example["text"] is a list of strings.
    # padding="longest" pads each mapped batch to its longest text.
    # Alternatives: padding="max_length" with max_length=128 for a fixed width,
    # or skip padding here and use a DataCollator for dynamic padding.
    return tokenizer(
        example["text"],
        padding="longest",
        truncation=True,
    )

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

tokenized_dataset[0]

You should now see additional fields like:

- `input_ids`
- `token_type_ids` (for some models)
- `attention_mask`

These are what the model actually consumes.
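
To make `input_ids` and `attention_mask` less abstract, here is a toy sketch in plain Python of what "pad to the longest text" means. It is not how BERT tokenizes: the whitespace splitting, the on-the-fly vocabulary, the `toy_encode` name, and the padding id `0` are all simplifying assumptions. A real BERT tokenizer splits text into WordPiece subwords and adds special tokens such as `[CLS]` and `[SEP]`.

```python
PAD_ID = 0  # assumed id for the padding token in this toy vocabulary


def toy_encode(texts):
    """Whitespace-tokenize, map tokens to ids, and pad to the longest text.

    A toy stand-in for a real tokenizer, for illustration only.
    """
    vocab = {}
    encoded = []
    for text in texts:
        ids = []
        for token in text.lower().split():
            # Assign ids in order of first appearance, starting at 1.
            ids.append(vocab.setdefault(token, len(vocab) + 1))
        encoded.append(ids)

    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["great movie", "the plot was boring"])
print(batch["input_ids"])       # [[1, 2, 0, 0], [3, 4, 5, 6]]
print(batch["attention_mask"])  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

The shorter text is padded with zeros so both rows have the same length, and the attention mask records which positions hold real tokens.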


### 3. (Optional) Train / Evaluation Split

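`Trainer` usually expects separate **train** and **eval** datasets. Because we will ask it to evaluate at the end of every epoch, we need a held-out evaluation set here.

We’ll split our tiny dataset into 80% train and 20% validation. With five examples, that leaves four for training and one for evaluation.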
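
As a rough illustration of the arithmetic behind an 80/20 split, here is a plain-Python sketch. The function name `toy_train_test_split` is hypothetical, and its shuffle order will not match what `Dataset.train_test_split` produces for the same seed; it only shows how five items divide into four and one.

```python
import math
import random


def toy_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle items and split them into (train, test) lists.

    A plain-Python sketch, not the `datasets` implementation.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))  # round the test share up
    return shuffled[n_test:], shuffled[:n_test]


train_items, test_items = toy_train_test_split(range(5), test_size=0.2)
print(len(train_items), len(test_items))  # 4 1
```

Fixing the seed makes the split reproducible, which matters when you want to compare runs on the same held-out examples.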


In [None]:
type(tokenized_dataset)

In [None]:
# Simple train/test split
splits = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = splits["train"]
eval_dataset = splits["test"]

len(train_dataset), len(eval_dataset)

In [None]:
train_dataset

In [None]:
for i in train_dataset[0].items():
    print(i)

### 4. Define the Model and Training Arguments

For classification we’ll use `AutoModelForSequenceClassification`.

Key pieces:

- `num_labels`: number of classes.
- `TrainingArguments`: hyperparameters & output folder.
  * [Doc for version currently on the JupyterHub](https://huggingface.co/docs/transformers/v4.57.5/en/main_classes/trainer#transformers.TrainingArguments)
- `Trainer`: wraps the model, data, and training loop.
  * [Doc for version currently on the JupyterHub](https://huggingface.co/docs/transformers/v4.57.5/en/main_classes/trainer#transformers.Trainer)

In [None]:
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

In [None]:
num_labels = 2

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
)

In [None]:
training_args = TrainingArguments(
    output_dir="./bert-base-uncased-finetuned-sentiment",
    eval_strategy="epoch",      # run eval at end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

### 5. Define Metrics and Create the `Trainer`

We’ll compute the **F1 score** (you can add accuracy, precision, recall, etc.).
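
The metric function in this section takes the argmax of the logits and scores the predictions with F1. As a hand-rolled sketch of that calculation in plain Python (the helper names `argmax` and `binary_f1` and the logits are made up for illustration, and the positive class is assumed to be label `1`):

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def binary_f1(predictions, references, positive=1):
    """F1 for the positive class: 2 * TP / (2 * TP + FP + FN)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up logits for four examples: one [negative, positive] score pair each.
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [0.4, 0.9]]
references = [1, 0, 0, 1]

predictions = [argmax(row) for row in logits]
print(predictions)                         # [1, 0, 1, 1]
print(binary_f1(predictions, references))  # 0.8
```

Here two positives are found correctly and one negative is wrongly flagged as positive, so F1 is 2·2 / (2·2 + 1 + 0) = 0.8.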


In [None]:
import evaluate
import numpy as np

f1_metric = evaluate.load("f1")  # or any other metric name

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return f1_metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

### 6. Train the Model

Now we can call `.train()` and let `Trainer` handle the rest.

This actually fine-tunes the model on our tiny example dataset. In a real project you'd have thousands of examples.
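
To get a feel for how short this training run is, the sketch below estimates the number of optimizer steps from the dataset size, batch size, and epoch count. It assumes a single device and no gradient accumulation; the function name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps when each batch triggers one update.

    Assumes one device and no gradient accumulation; the last, smaller
    batch of an epoch still counts as a step.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Four training examples, batch size 8, three epochs (the values used above).
print(total_optimizer_steps(4, 8, 3))       # 3
# A hypothetical dataset of 10,000 examples with the same settings.
print(total_optimizer_steps(10_000, 8, 3))  # 3750
```

With only three updates, the model barely moves from its pretrained weights, so the metrics on this toy dataset should not be read as meaningful.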

In [None]:
train_result = trainer.train()

In [None]:
train_result

In [None]:
trainer.evaluate()

In summary, to feed your text into `transformers.Trainer`, you typically:

1. Store your raw data in a `Dataset`:
   - `Dataset.from_dict(...)` or `load_dataset("csv", ...)`.
2. Tokenize with a Hugging Face tokenizer:
   - Use `.map(tokenize_function, batched=True)`.
3. Split into `train_dataset` and `eval_dataset`.
4. Instantiate a model (`AutoModelForSequenceClassification` or task-specific variant).
5. Set up `TrainingArguments` and a `Trainer`.
6. Call `.train()` and `.evaluate()`.

You can plug your own text/labels into the same pipeline, and swap the model checkpoint to match your task (e.g. RoBERTa, DistilBERT, domain-specific models, etc.).