# Quick Introduction to Hugging Face's Transformers and Datasets Libraries

Adapted from: https://huggingface.co/transformers/training.html

Other relevant links:
- Transformers docs: https://huggingface.co/transformers/index.html
- Datasets docs: https://huggingface.co/docs/datasets/
- BertTokenizer: https://huggingface.co/transformers/model_doc/bert.html?highlight=berttokenizer#transformers.BertTokenizer (worth a look: it can do most of the preprocessing for you)
- BertModel: https://huggingface.co/transformers/model_doc/bert.html?highlight=bertmodel#transformers.BertModel
- BertForSequenceClassification: https://huggingface.co/transformers/model_doc/bert.html?highlight=bertforsequenceclassification#transformers.BertForSequenceClassification (BertModel-based class for this introduction) 
- BertForTokenClassification: https://huggingface.co/transformers/model_doc/bert.html?highlight=bertfortokenclassification#transformers.BertForTokenClassification (BertModel-based class for the exercise)
- On the model outputs from different transformers-versions: https://huggingface.co/transformers/migration.html


In [1]:
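# scikit-learn is the real PyPI name; `sklearn` (which this notebook's original run installed) is a deprecated placeholder package
!pip install transformers datasets scikit-learn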

Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-win_amd64.whl (3.3 MB)
Collecting regex!=2019.12.17
  Downloading regex-2022.10.31-cp38-cp38-win_amd64.whl (267 kB)
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py): started
  Building wheel for sklearn (setup.py): finished with status 'done'
  Created wheel for sklearn: filename=sklearn-0.0.post1-py3-none-any.whl size=2343 sha256=c0c50fbae6b6b50000c1505c9e95f1e38d9fe2f6f21e681005258bf5d490332d
  Stored in directory: c:\users\damja\appdata\local\pip\cache\wheels\14\25\f7\1cc0956978ae479e75140219088deb7a36f60459df242b1a72
Successfully built sklearn
Installing collected packages: tokenizers, regex, transformers, sklearn
Successfully installed regex-2022.10.31 sklearn-0.0.post1 tokenizers-0.13.2 transformers-4.24.0




In [2]:
import pandas as pd
import datasets
dataset = datasets.load_dataset('sms_spam')

Downloading and preparing dataset sms_spam/plain_text to C:/Users/Damja/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c...

Dataset sms_spam downloaded and prepared to C:/Users/Damja/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c. Subsequent calls will reuse this data.

In [3]:
print(dataset.keys())

dict_keys(['train'])


In [4]:
print(len(dataset['train']))

5574


In [5]:
# If we only want a few examples, we can slice the split when loading
# (other slice forms include 'train[:100]' and 'train[:1%]'):
dataset = datasets.load_dataset('sms_spam', split='train[800:1000]')

Found cached dataset sms_spam (C:/Users/Damja/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c)


In [6]:
from collections import Counter
Counter(dataset['label'])

Counter({0: 165, 1: 35})
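
The counts above show a clear class imbalance, so accuracy alone can look deceptively good. As a rough sketch in plain Python, reusing only the two counts printed above (`majority_baseline_accuracy` is a hypothetical helper name, not part of any library used here), this is the accuracy a classifier would reach by always predicting the majority class:

```python
from collections import Counter

# label counts copied from the output above (0 = ham, 1 = spam)
label_counts = Counter({0: 165, 1: 35})

def majority_baseline_accuracy(counts):
    """Accuracy of always predicting the most frequent label."""
    total = sum(counts.values())
    return max(counts.values()) / total

spam_share = label_counts[1] / sum(label_counts.values())
print(f"spam share: {spam_share:.3f}")
print(f"majority baseline accuracy: {majority_baseline_accuracy(label_counts):.3f}")
```

Any model trained on this slice should be judged against that baseline rather than against 50%.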

In [7]:
dataset

Dataset({
    features: ['sms', 'label'],
    num_rows: 200
})

In [8]:
dataset[0]

{'sms': '"Gimme a few" was  &lt;#&gt;  minutes ago\n', 'label': 0}

In [9]:
dataset[1]

{'sms': 'Last Chance! Claim ur £150 worth of discount vouchers today! Text SHOP to 85023 now! SavaMob, offers mobile! T Cs SavaMob POBOX84, M263UZ. £3.00 Sub. 16\n',
 'label': 1}

In [10]:
dataset[100]

{'sms': 'Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify. Get Usher and Britney. FML, PO Box 5249, MK17 92H. 450Ppw 16\n',
 'label': 1}

In [11]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development

In [12]:
tokenizer(dataset[0]['sms'])

{'input_ids': [101, 107, 144, 4060, 3263, 170, 1374, 107, 1108, 111, 181, 1204, 132, 108, 111, 176, 1204, 132, 1904, 2403, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [13]:
# calling the tokenizer directly replaces the deprecated encode_plus()
tokenizer(dataset[0]['sms'], return_tensors="pt", padding='max_length', truncation=True, max_length=128)

{'input_ids': tensor([[ 101,  107,  144, 4060, 3263,  170, 1374,  107, 1108,  111,  181, 1204,
          132,  108,  111,  176, 1204,  132, 1904, 2403,  102,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0,
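
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified plain-Python sketch. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the ids 0 for `[PAD]` and 102 for `[SEP]` are read off the outputs above.

```python
PAD_ID = 0    # [PAD], the filler id visible in the padded output above
SEP_ID = 102  # [SEP], the last real token id in the output above

def pad_or_truncate(input_ids, max_length):
    """Hypothetical helper imitating padding='max_length' with truncation=True."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # cut the sequence but keep a closing [SEP] as the final token
        ids = ids[:max_length - 1] + [SEP_ID]
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    n_pad = max_length - len(ids)
    # padding positions get id 0 and attention 0 so the model ignores them
    return ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# token ids of the first SMS, copied from the unpadded output above
ids = [101, 107, 144, 4060, 3263, 170, 1374, 107, 1108, 111, 181,
       1204, 132, 108, 111, 176, 1204, 132, 1904, 2403, 102]

padded, mask = pad_or_truncate(ids, 128)
print(len(padded), sum(mask))  # 128 21

short, short_mask = pad_or_truncate(ids, 8)
print(short)  # [101, 107, 144, 4060, 3263, 170, 1374, 102]
```

Fixing every example to the same length is what allows the examples to be stacked into one tensor per batch later on.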

In [14]:
encoded_dataset = [tokenizer(item['sms'], return_tensors="pt", padding='max_length', truncation=True, max_length=128) for item in dataset]


In [24]:
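import torch
# attach the gold label to each encoded example; when a 'labels' tensor is
# passed in, BertForSequenceClassification computes the loss itself
for enc_item, item in zip(encoded_dataset, dataset):
    enc_item['labels'] = torch.LongTensor([item['label']])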

In [21]:
print(len(encoded_dataset))
for key, val in encoded_dataset[0].items():
    print(f'key: {key}, dimensions: {val.size()}')

200
key: input_ids, dimensions: torch.Size([1, 128])
key: token_type_ids, dimensions: torch.Size([1, 128])
key: attention_mask, dimensions: torch.Size([1, 128])
key: labels, dimensions: torch.Size([1])


In [29]:
from random import shuffle
# shuffle in place so that the train/test split does not follow the original data order
shuffle(encoded_dataset)

In [30]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [31]:
train_set = encoded_dataset[:100]
test_set = encoded_dataset[100:]
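
Because `shuffle` above runs without a seed, the split differs on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable (the seed value 42 and the stand-in list `items` are illustrative and not part of the original notebook):

```python
import random

items = list(range(200))  # stand-in for the 200 encoded examples

rng = random.Random(42)   # dedicated, seeded generator (seed chosen arbitrarily)
shuffled = items[:]       # copy, so the original order is preserved
rng.shuffle(shuffled)

train_part = shuffled[:100]
test_part = shuffled[100:]
print(len(train_part), len(test_part))  # 100 100
```

Using a separate `random.Random` instance keeps the split stable even if other code draws from the global random state.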

The PyTorch way to train (shown here for a single example and a single optimizer step):

In [32]:
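from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()  # put the model in training mode (enables dropout)
optimizer.zero_grad()  # clear gradients left over from any earlier step
loss = model(**train_set[0])[0]  # with labels supplied, the first output is the loss
print(loss)
loss.backward()  # backpropagate
optimizer.step()  # update the weights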



tensor(0.9295, grad_fn=<NllLossBackward0>)
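
The printed loss is a cross-entropy (negative log-likelihood) value, so it can be read as a probability. A small sketch in plain Python, assuming natural-log cross-entropy over two classes for a single example as in the step above (`cross_entropy` is a hypothetical helper, not a function of the libraries used here):

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax over the logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label]

loss = 0.9295                      # value printed above
implied_prob = math.exp(-loss)     # probability the model gave to the correct class
chance_loss = cross_entropy([0.0, 0.0], 0)  # loss of an uninformed 50/50 guess

print(f"implied probability of the correct class: {implied_prob:.3f}")
print(f"loss of a 50/50 guess (ln 2): {chance_loss:.4f}")
```

A loss above ln 2 means the freshly initialized classification head gave the correct class less than 50% probability on this example, which is unsurprising before any training.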


In [34]:
train_set[0]

{'input_ids': tensor([[  101,  1267,   117,   178,  1450,  2368,  1128,   170,  2549,   170,
          1374,  1551,   192,  6094,  1233,  1730,  1106,  1128,  1579,  5277,
          1106,  5529, 16408, 11931,  5773,   119,   146,  1108,  6100,   176,
         13374,  1128,   112, 27216,  1141,   117,  1133,   170, 26574,  2137,
         27451,  2349, 18784,  2523,  1110,  1136,  6100,  1243,  1149, 27216,
          1170,   123,   119,  1192,  1444,  1106,  1435,  1313,   119,  1192,
          1444,  1106,  3370,  6894,  1643,  1105,   117,  1191,  1625,   117,
          1128,  1444,  1106,   171, 24084,  3810,  1158,  3811,  2013,   119,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

The easier way to train, using the Trainer API:

In [35]:
# we don't need the batch dimension when using the trainer
# because the trainer does batching for us 
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])
train_set = encoded_dataset[:100]
test_set = encoded_dataset[100:]
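
Why the squeeze matters can be seen from the shapes alone. The following sketch uses nested Python lists as stand-ins for tensors (`shape` and `squeeze_first` are hypothetical helpers, and the batch size of 8 is simply the Trainer default reported in the training log): stacking eight unsqueezed items yields an extra dimension that the model does not expect.

```python
def shape(nested):
    """Shape of a regular, non-empty nested list, e.g. [[0, 0]] -> (1, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

def squeeze_first(nested):
    """Drop a leading dimension of size 1, like torch.squeeze on a [1, N] tensor."""
    return nested[0] if len(nested) == 1 else nested

item = [[0] * 128]  # stand-in for input_ids with shape [1, 128]

batch_unsqueezed = [item for _ in range(8)]
batch_squeezed = [squeeze_first(item) for _ in range(8)]

print(shape(batch_unsqueezed))  # (8, 1, 128)
print(shape(batch_squeezed))    # (8, 128)
```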

In [36]:
train_set

[{'input_ids': tensor([  101,  1267,   117,   178,  1450,  2368,  1128,   170,  2549,   170,
          1374,  1551,   192,  6094,  1233,  1730,  1106,  1128,  1579,  5277,
          1106,  5529, 16408, 11931,  5773,   119,   146,  1108,  6100,   176,
         13374,  1128,   112, 27216,  1141,   117,  1133,   170, 26574,  2137,
         27451,  2349, 18784,  2523,  1110,  1136,  6100,  1243,  1149, 27216,
          1170,   123,   119,  1192,  1444,  1106,  1435,  1313,   119,  1192,
          1444,  1106,  3370,  6894,  1643,  1105,   117,  1191,  1625,   117,
          1128,  1444,  1106,   171, 24084,  3810,  1158,  3811,  2013,   119,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

In [37]:
training_args = TrainingArguments(
    num_train_epochs=1,
    weight_decay=0.01,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_set,
)

In [38]:
trainer.train()

***** Running training *****
  Num examples = 100
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 13
  Number of trainable parameters = 108311810


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=13, training_loss=0.4902255351726825, metrics={'train_runtime': 101.6173, 'train_samples_per_second': 0.984, 'train_steps_per_second': 0.128, 'total_flos': 6577776384000.0, 'train_loss': 0.4902255351726825, 'epoch': 1.0})
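
The "Total optimization steps = 13" line in the log follows directly from the dataset size and batch size. A minimal sketch of that arithmetic, assuming a single device and no gradient accumulation, as reported in the log (`optimization_steps` is a hypothetical helper):

```python
import math

def optimization_steps(num_examples, batch_size, num_epochs=1):
    """Number of optimizer updates; the last, smaller batch still counts as a step."""
    return math.ceil(num_examples / batch_size) * num_epochs

print(optimization_steps(100, 8))  # 13
```

The final batch holds only 4 of the 100 examples (100 = 12 * 8 + 4), which is why the count rounds up.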

In [None]:
preds = trainer.predict(test_set)

***** Running Prediction *****
  Num examples = 100
  Batch size = 4


In [None]:
print(preds.predictions[:2])
print(preds.predictions[:2].argmax(-1))
print(preds.label_ids[:2])
print(preds.metrics)

[[ 0.9839595  -0.77930415]
 [ 1.2374914  -0.9335977 ]]
[0 0]
[0 0]
{'test_loss': 0.22895419597625732, 'test_runtime': 41.3092, 'test_samples_per_second': 2.421, 'test_steps_per_second': 0.605}
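
`preds.predictions` holds raw logits, not probabilities. As a sketch in plain Python (a hand-written `softmax`, not a library call), the two rows printed above can be turned into class probabilities; the arg-max is unaffected by this transformation:

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# the first two logit rows, copied from the output above
logits = [[0.9839595, -0.77930415],
          [1.2374914, -0.9335977]]

probabilities = [softmax(row) for row in logits]
predicted_labels = [max(range(len(row)), key=row.__getitem__) for row in logits]

for label, probs in zip(predicted_labels, probabilities):
    print(label, [round(p, 3) for p in probs])
```

Both rows favor class 0, matching the `[0 0]` printed by `argmax(-1)`.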


In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
predictions = preds.predictions.argmax(-1)
f1_score(preds.label_ids, predictions, average='binary')

0.8750000000000001

In [None]:
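# note the argument order: predictions are passed first, so rows are predicted classes
# and columns are true classes (sklearn's signature is confusion_matrix(y_true, y_pred))
confusion_matrix(predictions, preds.label_ids)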

array([[82,  4],
       [ 0, 14]])
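
The F1 score reported earlier can be recomputed by hand from this matrix. The sketch below assumes the layout produced by the call above, where predictions were passed as the first argument, so rows are predicted classes and columns are true classes:

```python
# confusion matrix copied from the output above: rows = predicted, columns = true
matrix = [[82, 4],
          [0, 14]]

tp = matrix[1][1]  # predicted spam, actually spam
fp = matrix[1][0]  # predicted spam, actually ham
fn = matrix[0][1]  # predicted ham, actually spam

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

No ham message was flagged as spam, while 4 of the 18 spam messages were missed, which gives the F1 of 0.875 shown above.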

# SimpleTransformers

Let's solve the same task in just a few lines.

In [None]:
!pip install simpletransformers
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [None]:
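# first 100 rows for training, the rest for testing; sample(frac=1) shuffles the rows within each part
train_df = pd.DataFrame(dataset).iloc[:100, :].sample(frac=1)
test_df = pd.DataFrame(dataset).iloc[100:, :].sample(frac=1)
# simpletransformers looks for columns named 'text' and 'labels'; because the label column
# keeps the name 'label', it falls back to column order (column 0 = text, column 1 = labels)
train_df = train_df.rename(columns={'sms' : 'text'})
test_df = test_df.rename(columns={'sms' : 'text'})
# training configuration for simpletransformers
model_args = ClassificationArgs(num_train_epochs=1, manual_seed=42, train_batch_size=4, max_seq_length=128)
# create a ClassificationModel (BERT, running on the CPU)
bert_model = ClassificationModel(
    "bert", "bert-base-cased", args=model_args, use_cuda=False
)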

In [None]:
test_df

| | text | label |
| --- | --- | --- |
| 179 | Hey you can pay. With salary de. Only &lt;#&g... | 0 |
| 137 | Since when, which side, any fever, any vomitin.\n | 0 |
| 165 | Are you this much buzy\n | 0 |
| 154 | Also remember to get dobby's bowl from your car\n | 0 |
| 115 | Call me da, i am waiting for your call.\n | 0 |
| ... | ... | ... |
| 132 | Congratulations ore mo owo re wa. Enjoy it and... | 0 |
| 197 | Yetunde i'm in class can you not run water on ... | 0 |
| 134 | What time you think you'll have it? Need to kn... | 0 |
| 107 | all the lastest from Stereophonics, Marley, Di... | 1 |


In [None]:
bert_model.train_model(train_df, output_dir='test_2')

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


Configuration saved in test_2/checkpoint-25-epoch-1/config.json
Model weights saved in test_2/checkpoint-25-epoch-1/pytorch_model.bin
tokenizer config file saved in test_2/checkpoint-25-epoch-1/tokenizer_config.json
Special tokens file saved in test_2/checkpoint-25-epoch-1/special_tokens_map.json
Configuration saved in outputs/config.json
Model weights saved in outputs/pytorch_model.bin
tokenizer config file saved in outputs/tokenizer_config.json
Special tokens file saved in outputs/special_tokens_map.json


(25, 0.4350083839893341)

In [None]:
bert_predictions, _ = bert_model.predict(test_df['text'].tolist())
f1_score(test_df['label'], bert_predictions, average='binary')

0.9

In [None]:
# as before, predictions are passed first: rows are predicted classes, columns are true classes
confusion_matrix(bert_predictions, test_df['label'])

array([[89,  2],
       [ 0,  9]])