## Fine-tune a BERT model on a custom dataset

### 1. Load/define the dataset

In [2]:
from datasets import load_dataset



In [32]:
imdb = load_dataset("imdb")

Reusing dataset imdb (/home/barinale/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
100%|████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 718.16it/s]


In [33]:
imdb = imdb.shuffle(seed=42)  # fixed seed so the small subsample below is reproducible

In [34]:
for key in imdb.keys():
    imdb[key] = imdb[key].select(range(10))

In [35]:
print(imdb.num_rows)

{'train': 10, 'test': 10, 'unsupervised': 10}
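With only ten randomly chosen rows per split, the two sentiment classes can easily end up unbalanced. Below is a minimal sketch for checking this. `label_counts` is a hypothetical helper (not part of `datasets`), and the label list is made-up illustration data.

```python
from collections import Counter

def label_counts(labels):
    """Return a dict mapping each label value to how often it occurs."""
    return dict(Counter(labels))

# Made-up labels standing in for imdb["train"]["label"]
# (in IMDB, 0 = negative and 1 = positive).
example_labels = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
counts = label_counts(example_labels)
print(counts)
```

In the notebook this would be called as `label_counts(imdb["train"]["label"])`. The `unsupervised` split carries no sentiment annotations (its label column is only a placeholder), so it is not usable for supervised fine-tuning.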


### 2. Install the transformers library

In [39]:
! pip install transformers



### 3. Preprocess the text

In [40]:
from transformers import AutoTokenizer

In [41]:
# Load the tokenizer that the pretrained model was trained with
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [42]:
def preprocess(examples):
    return tokenizer(examples["text"], truncation=True)

In [43]:
tokenized = imdb.map(preprocess, batched=True)

100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 94.75ba/s]
100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.12ba/s]
100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.59ba/s]
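With `batched=True`, the function passed to `map` receives a dict of lists (one list per column) and should return a dict of lists of the same length. The sketch below only mimics that shape. `toy_preprocess` and its whitespace "tokenizer" are hypothetical stand-ins, and `max_length=8` is an arbitrary value chosen to make the truncation visible.

```python
def toy_preprocess(examples, max_length=8):
    """Mimic the shape of a batched map function: dict of lists in, dict of lists out."""
    # Hypothetical whitespace "tokenizer"; the real tokenizer produces subword IDs.
    tokens = [text.lower().split()[:max_length] for text in examples["text"]]
    return {
        "tokens": tokens,
        # One mask entry per kept token, as a real tokenizer would also return.
        "attention_mask": [[1] * len(t) for t in tokens],
    }

batch = {
    "text": [
        "A great movie",
        "Far too long and not nearly as funny as it thinks it is",
    ]
}
out = toy_preprocess(batch)
print([len(t) for t in out["tokens"]])
```

The second review has 13 words and is cut to 8, which is the role `truncation=True` plays for the real tokenizer (there the limit is the model's maximum input length).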


In [44]:
from transformers import DataCollatorWithPadding

In [45]:
collator = DataCollatorWithPadding(tokenizer=tokenizer)
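The collator pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. The following is a simplified pure-Python illustration of that idea, not the library's implementation. `pad_batch` is a hypothetical name, the token IDs are made up, and `pad_token_id=0` is an assumption (the real value comes from `tokenizer.pad_token_id`).

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in one batch to the length of its longest member."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that the model should ignore.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch

# Two made-up token ID sequences of different lengths.
features = [
    {"input_ids": [101, 2307, 102]},
    {"input_ids": [101, 2307, 3185, 999, 102]},
]
padded = pad_batch(features)
print(padded["input_ids"])
```

Because padding is decided per batch, a batch of short reviews stays short, which saves computation compared with padding everything to the longest review in the dataset.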

### 4. Load the pretrained model

In [46]:
from transformers import AutoModelForSequenceClassification

In [47]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

In [48]:
from transformers import TrainingArguments, Trainer

#### a. Define your training hyperparameters in TrainingArguments.

In [49]:
! mkdir -p './tunedbert'

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [50]:
training_args = TrainingArguments(
    output_dir="./tunedbert",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)

#### b. Pass the training arguments to a Trainer along with the model, dataset, tokenizer, and data collator.

In [51]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
    tokenizer=tokenizer,
)
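The `Trainer` above is given an `eval_dataset` but no metric function, so evaluation would report only the loss. Below is a minimal sketch of an accuracy function that could be supplied through the `compute_metrics` argument. It assumes the evaluation input unpacks into `(logits, labels)`, and the logits and labels in the example call are made up.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for three examples and their true labels.
result = compute_metrics(([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3]], [1, 0, 0]))
print(result)
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`, after which `trainer.evaluate()` reports the accuracy alongside the evaluation loss. With only ten test rows the number is far too noisy to mean much; it is mainly a check that the pipeline runs end to end.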


#### c. Call Trainer.train() to fine-tune your model.

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 10
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2
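The step count in the log follows from the dataset size and batch size: 10 examples fit into a single batch of 16, so each epoch is one optimizer step and two epochs give two steps. The arithmetic can be sketched as follows, assuming single-device training with the last partial batch kept. `total_optimization_steps` is a hypothetical helper, not a `transformers` function.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Estimate optimizer steps for single-device training (last partial batch kept)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # With gradient accumulation, several batches share one optimizer update.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs

steps = total_optimization_steps(num_examples=10, batch_size=16, num_epochs=2)
print(steps)
```

Under the same assumptions, the full 25,000-review IMDB training split would take `total_optimization_steps(25000, 16, 2)`, i.e. 3126 steps.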
