# Hugging Face

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score

import torch
from datasets import load_dataset
from transformers import pipeline
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

In [2]:
# constants
# Fall back to the CPU when no CUDA device is available
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
BATCH_SIZE = 32

## Datasets

In [3]:
imdb = load_dataset("imdb")

Using the latest cached version of the module from /home/petruschka/.cache/huggingface/modules/datasets_modules/datasets/imdb/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1 (last modified on Thu Sep 22 10:09:40 2022) since it couldn't be found locally at imdb., or remotely on the Hugging Face Hub.
Found cached dataset imdb (/home/petruschka/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



In [4]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
train_ds = imdb["train"]

In [6]:
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [7]:
train_ds["text"][0], train_ds["label"][0]

('I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, e

In [8]:
train_ds.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}

In [9]:
train_ds.column_names

['text', 'label']

## Pretrained Models

In [10]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [11]:
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [12]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])

In [13]:
results

[{'label': 'POSITIVE', 'score': 0.9997795224189758},
 {'label': 'NEGATIVE', 'score': 0.5308617353439331}]

In [14]:
classifier(train_ds["text"][0:5])

[{'label': 'POSITIVE', 'score': 0.787282407283783},
 {'label': 'NEGATIVE', 'score': 0.9991909861564636},
 {'label': 'NEGATIVE', 'score': 0.998217761516571},
 {'label': 'POSITIVE', 'score': 0.814461350440979},
 {'label': 'NEGATIVE', 'score': 0.9993877410888672}]

In [15]:
train_ds["label"][0:5]

[0, 0, 0, 0, 0]

In [16]:
train_ds["text"][3], train_ds["label"][3]

("This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.",
 0)
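The five pipeline predictions above can be scored against the gold labels by hand. The sketch below copies the predicted labels from the output of cell 14 (scores rounded) and assumes that `NEGATIVE` corresponds to class 0 (`neg`) and `POSITIVE` to class 1 (`pos`); that mapping is an assumption based on the `ClassLabel` names shown earlier, not something the pipeline itself guarantees.

```python
# Predictions copied from the pipeline output above (scores rounded).
predictions = [
    {"label": "POSITIVE", "score": 0.7873},
    {"label": "NEGATIVE", "score": 0.9992},
    {"label": "NEGATIVE", "score": 0.9982},
    {"label": "POSITIVE", "score": 0.8145},
    {"label": "NEGATIVE", "score": 0.9994},
]
gold = [0, 0, 0, 0, 0]  # train_ds["label"][0:5]

# Assumed mapping from pipeline label strings to IMDB class ids.
label_to_id = {"NEGATIVE": 0, "POSITIVE": 1}

predicted_ids = [label_to_id[p["label"]] for p in predictions]

# Fraction of matching labels, and the indices the model got wrong.
accuracy = sum(p == g for p, g in zip(predicted_ids, gold)) / len(gold)
wrong = [i for i, (p, g) in enumerate(zip(predicted_ids, gold)) if p != g]

print(accuracy, wrong)
```

On these five reviews the off-the-shelf SST-2 model misses examples 0 and 3, including the 2/10 review printed above. Five examples are far too few to draw conclusions from, but the misses help motivate fine-tuning on IMDB itself.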

## Fine-Tuning

In [17]:
model_name = "distilbert-base-uncased"

In [18]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [19]:
encodings = tokenizer("Memorizing a library is a bad idea")
encodings

{'input_ids': [101, 24443, 21885, 2075, 1037, 3075, 2003, 1037, 2919, 2801, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [20]:
tokens = tokenizer.convert_ids_to_tokens(encodings.input_ids)
tokens

['[CLS]',
 'memo',
 '##riz',
 '##ing',
 'a',
 'library',
 'is',
 'a',
 'bad',
 'idea',
 '[SEP]']
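The `##` prefix marks a WordPiece that continues the previous token, which is why "memorizing" appears as three pieces. As a rough illustration of how the pieces map back to words, here is a minimal pure-Python sketch; `merge_wordpieces` is a hypothetical helper written for this note, not part of the library (the tokenizer has its own `convert_tokens_to_string` method for this purpose).

```python
tokens = ['[CLS]', 'memo', '##riz', '##ing', 'a', 'library',
          'is', 'a', 'bad', 'idea', '[SEP]']

def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Glue '##' continuation pieces onto the preceding token (hypothetical helper)."""
    words = []
    for tok in tokens:
        if tok in special:
            continue  # drop the special marker tokens
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: append without the '##'
        else:
            words.append(tok)  # start of a new word
    return " ".join(words)

text = merge_wordpieces(tokens)
print(text)  # memorizing a library is a bad idea
```

The result is lowercased because `distilbert-base-uncased` normalizes case before splitting, so the original capital "M" cannot be recovered from the tokens.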

In [21]:
tokenizer.vocab_size

30522

In [22]:
tokenizer.model_max_length

512

In [23]:
tokenizer.model_input_names

['input_ids', 'attention_mask']

In [24]:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

In [25]:
dataset = imdb.map(tokenize, batched=True, batch_size=None)

Loading cached processed dataset at /home/petruschka/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-d10f0321c91dd083.arrow



Loading cached processed dataset at /home/petruschka/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-83b3fcd2685aab7b.arrow


In [26]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [27]:
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels=2)
         .to(DEVICE))

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

In [28]:
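def compute_metrics(pred):
    # pred carries the model logits in .predictions and the gold labels in .label_ids
    labels = pred.label_ids
    predictions = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, predictions)
    # Trainer expects a dict mapping metric names to values, not a bare float
    return {"accuracy": acc}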
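To see what the metric function does without running a training loop, the sketch below feeds it a made-up logits array. `SimpleNamespace` is used here as a stand-in for the prediction object that `Trainer` passes in, the numbers are invented for illustration, and accuracy is computed with NumPy instead of `accuracy_score` to keep the example dependency-light. The function returns a dict, which is the shape `Trainer` expects from `compute_metrics`.

```python
from types import SimpleNamespace

import numpy as np

def compute_metrics_sketch(pred):
    labels = pred.label_ids
    # argmax over the last axis picks the highest-scoring class per example
    predictions = pred.predictions.argmax(-1)
    acc = float((predictions == labels).mean())
    return {"accuracy": acc}

# Invented logits for four examples and two classes (illustration only).
fake_pred = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]),
    label_ids=np.array([0, 1, 0, 0]),
)

metrics = compute_metrics_sketch(fake_pred)
print(metrics)  # {'accuracy': 0.75}
```

The predicted classes are `[0, 1, 1, 0]`, so three of the four labels match.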

In [30]:
args = TrainingArguments(output_dir="../temp/",
                         num_train_epochs=3,
                         learning_rate=1e-5,
                         per_device_eval_batch_size=BATCH_SIZE,
                         per_device_train_batch_size=BATCH_SIZE,
                         evaluation_strategy="epoch",
                         logging_steps=len(dataset["train"]) // BATCH_SIZE)
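The `logging_steps` expression is meant to log roughly once per epoch. The arithmetic can be checked directly; the sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept, so the number of optimizer steps per epoch is the example count divided by the batch size, rounded up.

```python
import math

num_examples = 25_000  # size of the IMDB train split
num_epochs = 3

def steps_per_epoch(batch_size):
    # The last, smaller batch still counts as one step.
    return math.ceil(num_examples / batch_size)

logging_steps = num_examples // 32               # value passed to TrainingArguments
steps_at_32 = steps_per_epoch(32)                # steps per epoch with BATCH_SIZE = 32
total_at_16 = steps_per_epoch(16) * num_epochs   # total steps at a batch size of 16

print(logging_steps, steps_at_32, total_at_16)   # 781 782 4689
```

Note that the training log further down reports a per-device batch size of 16 and 4689 total optimization steps, which matches a batch size of 16, not the `BATCH_SIZE = 32` set in cell 2. The logged run was presumably executed with a different value than the one shown.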

In [31]:
trainer = Trainer(model=model,
                  args=args,
                  compute_metrics=compute_metrics,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer)

In [32]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4689
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 5.81 GiB total capacity; 4.86 GiB already allocated; 113.06 MiB free; 4.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
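The run stops with an out-of-memory error on a GPU with 5.81 GiB. One likely contributor is the padding strategy: `tokenize` is mapped with `batch_size=None` and `padding=True`, so every review in a split is padded to the longest sequence in that split, which truncation caps at 512 tokens. Padding each batch only to its own longest sequence usually means fewer tokens per step. The sketch below illustrates the difference with invented sequence lengths; the numbers are hypothetical and do not come from IMDB.

```python
import random

random.seed(0)
MAX_LEN = 512
BATCH = 32

# Hypothetical token counts per review, truncated at MAX_LEN.
lengths = [min(MAX_LEN, random.randint(50, 700)) for _ in range(BATCH * 10)]
batches = [lengths[i:i + BATCH] for i in range(0, len(lengths), BATCH)]

# Static padding: every example is padded to the longest in the whole set.
static_tokens = len(lengths) * max(lengths)

# Dynamic padding: each batch is padded only to its own longest example.
dynamic_tokens = sum(len(b) * max(b) for b in batches)

print(static_tokens, dynamic_tokens)
```

In `transformers`, one way to get per-batch padding is to drop `padding=True` from `tokenize` and let a `DataCollatorWithPadding` pad at batch time. Other common options, each a trade-off and not a guaranteed fix, are a smaller `per_device_train_batch_size` combined with `gradient_accumulation_steps`, or mixed precision via `fp16=True` in `TrainingArguments`.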