Install the Transformers and Datasets libraries to run this notebook.

In [3]:
!pip install datasets transformers[sentencepiece]
!pip install git+https://github.com/huggingface/peft
!pip install git+https://github.com/huggingface/accelerate
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install scikit-learn
!pip install sentencepiece

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed dataset

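You need an authentication token tied to your Hugging Face account to use the `push_to_hub` method. Run `huggingface-cli login` in a terminal, or uncomment the `#!huggingface-cli login` line in the login cell further down (`notebook_login()` is the in-notebook alternative and requires `ipywidgets`).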

In [None]:
!pip install accelerate -U
!pip show accelerate
!pip install -U datasets

In [1]:
import torch
print("Is GPU available? ",torch.cuda.is_available())

Is GPU available?  False


In [2]:
import numpy as np

from datasets import load_dataset, load_metric
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)



In [10]:
! pip install ipywidgets

In [None]:
checkpoint = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [3]:
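# Caution: this replaces the tokenizer loaded from `checkpoint` above. Its vocabulary
# differs from bert-base-cased, so skip this cell when fine-tuning `checkpoint`.
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")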

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [8]:
#!huggingface-cli login
from huggingface_hub import notebook_login

In [4]:
raw_datasets = load_dataset("glue", "mrpc")

Downloading readme: 100%|██████████| 35.3k/35.3k [00:00<00:00, 11.8MB/s]
Downloading data: 100%|██████████| 649k/649k [00:01<00:00, 327kB/s]
Downloading data: 100%|██████████| 75.7k/75.7k [00:01<00:00, 55.1kB/s]
Downloading data: 100%|██████████| 308k/308k [00:01<00:00, 199kB/s]
Generating train split: 100%|██████████| 3668/3668 [00:00<00:00, 53578.56 examples/s]
Generating validation split: 100%|██████████| 408/408 [00:00<00:00, 102049.98 examples/s]
Generating test split: 100%|██████████| 1725/1725 [00:00<00:00, 215608.50 examples/s]


In [11]:
notebook_login()

ImportError: The `notebook_login` function can only be used in a notebook (Jupyter or Colab) and you need the `ipywidgets` module: `pip install ipywidgets`.

In [None]:
raw_datasets = load_dataset("glue", "mrpc")

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

training_args = TrainingArguments(
    "finetuned-bert-mrpc",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    log_level="error",
    push_to_hub=True,
    hub_model_id="test_falcon_model_learning",  # replaces the deprecated `push_to_hub_model_id`
    # To push under an organization, use hub_model_id="<organization>/<model name>".
    # hub_token="my_token",
)

data_collator = DataCollatorWithPadding(tokenizer)

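# `load_metric` is deprecated in `datasets`; the `evaluate` library
# (`evaluate.load("glue", "mrpc")`) is its maintained replacement.
metric = load_metric("glue", "mrpc")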

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  metric = load_metric("glue", "mrpc")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
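
For MRPC, the GLUE metric reports accuracy and F1, which is why the training log has an Accuracy and an F1 column. The sketch below recomputes both from raw logits in plain Python, mirroring what `compute_metrics` does with `np.argmax`. The helper name `accuracy_and_f1` and the sample logits are hypothetical, and F1 is assumed to treat label 1 (`equivalent`) as the positive class.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 from per-class logits and integer labels."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Hypothetical logits for four sentence pairs (columns: class 0, class 1).
sample_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
sample_labels = [1, 0, 0, 1]
result = accuracy_and_f1(sample_logits, sample_labels)
print(result)  # {'accuracy': 0.5, 'f1': 0.5}
```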


In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
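
`DataCollatorWithPadding` pads each batch to the length of its longest example, which is why `tokenize_function` truncates but does not pad. The sketch below mimics that idea in plain Python. `pad_batch` is a hypothetical helper, not the library's implementation, and the pad id of 0 is an assumption.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized examples of different lengths.
padded = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(padded["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to one global length keeps short batches short, which saves computation on datasets with varied sentence lengths.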

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.5491 | 0.456061 | 0.79902 | 0.867314 |
| 2 | 0.3059 | 0.421829 | 0.830882 | 0.884034 |
| 3 | 0.1573 | 0.466558 | 0.833333 | 0.883162 |


TrainOutput(global_step=690, training_loss=0.33743014681166494, metrics={'train_runtime': 61.5671, 'train_samples_per_second': 178.732, 'train_steps_per_second': 11.207, 'total_flos': 446520016497120.0, 'train_loss': 0.33743014681166494, 'epoch': 3.0})
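
The `global_step=690` in this output can be checked by hand. The sketch assumes a single device, no gradient accumulation, and the default of 3 training epochs, since none of these are overridden in the `TrainingArguments`. The 3,668 training examples come from the dataset loading log.

```python
import math

train_examples = 3668      # size of the MRPC train split
batch_size = 16            # per_device_train_batch_size
epochs = 3                 # assumed: the TrainingArguments default
runtime_seconds = 61.5671  # 'train_runtime' reported by the Trainer

steps_per_epoch = math.ceil(train_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
steps_per_second = total_steps / runtime_seconds
samples_per_second = train_examples * epochs / runtime_seconds

print(steps_per_epoch, total_steps)   # 230 690
print(round(steps_per_second, 3))     # 11.207
print(round(samples_per_second, 3))   # 178.732
```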

## Push to hub from the Trainer directly

The `Trainer` has a `push_to_hub` method that uploads the model, tokenizer, and model configuration directly to a repo on the [Hub](https://huggingface.co/). It also auto-generates a draft model card from the hyperparameters and evaluation results.

In [None]:
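# Note: the first positional argument of `Trainer.push_to_hub` is `commit_message`,
# so this string becomes the commit message (see the output below), not the repo name.
# The target repo is the one configured in `TrainingArguments`.
trainer.push_to_hub(
    "rcade/test_falcon_model_learning"
)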

CommitInfo(commit_url='https://huggingface.co/rcade/test_falcon_model_learning/commit/634175393b34578114e6af4a6160755650124069', commit_message='rcade/test_falcon_model_learning', commit_description='', oid='634175393b34578114e6af4a6160755650124069', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/rcade/test_falcon_model_learning/commit/7cc69cfafbab3cc3b3460bfb2b5d31fd0e603a99', commit_message='End of training', commit_description='', oid='7cc69cfafbab3cc3b3460bfb2b5d31fd0e603a99', pr_url=None, pr_revision=None, pr_num=None)

If you are using your own training loop, you can push the model and tokenizer separately (and you will have to write the model card yourself):

In [None]:
# model.push_to_hub("finetuned-bert-mrpc")
tokenizer.push_to_hub("rcade/finetuned-bert-mrpc")

CommitInfo(commit_url='https://huggingface.co/rcade/finetuned-bert-mrpc/commit/3212b575bbd1e4f3f19147621f1450517ad835c7', commit_message='Upload tokenizer', commit_description='', oid='3212b575bbd1e4f3f19147621f1450517ad835c7', pr_url=None, pr_revision=None, pr_num=None)

## You can load your model from anywhere using `from_pretrained`!

In [None]:
from transformers import AutoModelForSequenceClassification

model_name = "sgugger/finetuned-bert-mrpc"
model = AutoModelForSequenceClassification.from_pretrained(model_name)



## You can use your model in a pipeline!

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification", model=model_name)

In [None]:
classifier("My name is Sylvain. [SEP] My name is Sylvain")

[{'label': 'not_equivalent', 'score': 0.8155951499938965}]
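
The `score` in this output is a probability. For a single-label model with two classes, the text-classification pipeline applies a softmax to the logits and reports the top class. A minimal sketch with made-up logits follows; the real logits for this input are not shown in the notebook, and the `id2label` mapping is assumed from the label in the output.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "not_equivalent", 1: "equivalent"}  # assumed mapping
hypothetical_logits = [1.0, -0.5]                  # made-up values for illustration

probs = softmax(hypothetical_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```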

## Updating a problematic file is super easy!

In [None]:
model.config.label2id = {"not equivalent": 0, "equivalent": 1}

In [None]:
model.config.id2label = {0: "not equivalent", 1: "equivalent"}
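
Because `label2id` and `id2label` must mirror each other, one way to avoid a mismatch is to derive one from the other. A small sketch, where the helper name `invert_mapping` is hypothetical:

```python
def invert_mapping(id2label):
    """Build label2id from id2label, rejecting duplicate label names."""
    label2id = {label: idx for idx, label in id2label.items()}
    if len(label2id) != len(id2label):
        raise ValueError("label names must be unique")
    return label2id


id2label = {0: "not equivalent", 1: "equivalent"}
label2id = invert_mapping(id2label)
print(label2id)  # {'not equivalent': 0, 'equivalent': 1}
```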

In [None]:
model.config.push_to_hub("finetuned-bert-mrpc")

CommitInfo(commit_url='https://huggingface.co/rcade/finetuned-bert-mrpc/commit/fcb90f3f8183804c03eeb5dbdad48fc35e9e5ab1', commit_message='Upload config', commit_description='', oid='fcb90f3f8183804c03eeb5dbdad48fc35e9e5ab1', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
classifier = pipeline("text-classification", model=model_name)

classifier("My name is Sylvain. [SEP] My name is Sylvain")