# Hugging Face
### Build, train and deploy state of the art models powered by the reference open source in machine learning. 

## Libraries

In [1]:
from datasets import load_dataset
import spacy
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification, pipeline
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer


Things to watch out for:
- Models and datasets can be very large, so they can consume a lot of RAM and disk space.

## Datasets
https://huggingface.co/datasets
- Hugging Face hosts over 5000 datasets for text, tagged by language, size and task (e.g. classification, question answering).

- Each dataset should have a completed dataset card explaining what it contains, for example:
- https://huggingface.co/datasets/imdb

In [2]:
ag_news = load_dataset("ag_news")


Using custom data configuration default
Reusing dataset ag_news (C:\Users\Adrian\.cache\huggingface\datasets\ag_news\default\0.0.0\bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 124.99it/s]


Let's look at a specific article:

In [3]:
ag_news["train"][1]


{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
 'label': 2}
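
The integer `label` is an index into the dataset's class names. As a small sketch, assuming the usual AG News class order (World, Sports, Business, Sci/Tech), label 2 decodes to Business, which fits the article above. `AG_NEWS_LABELS` and `label_name` are illustrative names, not part of the `datasets` API.

```python
# Assumed class order for AG News; check the dataset card before relying on it.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def label_name(label_id: int) -> str:
    """Return the human-readable class name for an integer label."""
    return AG_NEWS_LABELS[label_id]


# Shortened copy of the record shown above (only the headline is kept).
example = {"text": "Carlyle Looks Toward Commercial Aerospace (Reuters)", "label": 2}
print(label_name(example["label"]))
```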

## Using Pretrained Models out of the box

- Hugging Face is best known for its collection of state-of-the-art models.
- It hosts 47,000 pretrained models that you can use.
- Over 120 model architectures are supported.
- To use a model we need its tokenizer, which makes text machine-readable (different models expect different input formats),
- and the pretrained model itself.
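
To make "machine-readable" concrete, here is a toy sketch of what a sub-word tokenizer does: split each word into pieces from a fixed vocabulary and map the pieces to integer ids. The vocabulary and the greedy WordPiece-style matching are assumptions chosen for illustration. Real models ship their own vocabularies and algorithms, which is why `AutoTokenizer` is loaded per model.

```python
# Toy vocabulary (hypothetical); real tokenizers have tens of thousands of entries.
TOY_VOCAB = {"[UNK]": 0, "hug": 1, "##ging": 2, "face": 3, "model": 4, "##s": 5}


def toy_wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Pieces that continue a word are prefixed with "##".
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # nothing matched: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace and convert every piece to its id."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab[piece] for piece in toy_wordpiece(word, vocab))
    return ids


print(toy_wordpiece("hugging", TOY_VOCAB))
print(encode("Hugging Face models", TOY_VOCAB))
```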

In [4]:
model_name = "Jean-Baptiste/roberta-large-ner-english"

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

The helper below needs to be customised for the model used, because the entity labels differ between models and the output format often varies.

In [6]:
ner_pipeline = pipeline(task="ner", model=model, tokenizer=tokenizer)

def apply_ner(text: str):
    """Run NER on `text` and return displaCy HTML highlighting the entities."""
    # aggregation_strategy="simple" merges sub-word tokens into whole entities.
    classifications = ner_pipeline(text, aggregation_strategy="simple")

    # displaCy's manual mode expects a dict with "text", "ents" and "title".
    spacy_display = {"text": text, "ents": [], "title": None}
    for entity in classifications:
        spacy_display["ents"].append(
            {
                "start": entity["start"],
                "end": entity["end"],
                "label": entity["entity_group"],
            }
        )

    entity_list = ["PER", "LOC", "ORG", "MISC"]
    colors = {
        "PER": "#85DCDF",
        "LOC": "#DF85DC",
        "ORG": "#DCDF85",
        "MISC": "#85ABDF",
    }
    html = spacy.displacy.render(
        spacy_display,
        style="ent",
        minify=True,
        manual=True,
        options={"ents": entity_list, "colors": colors},
    )
    return html
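
To show the data shapes involved without downloading a model, the sketch below converts a hand-written list in the aggregated pipeline format (`entity_group`, `score`, `word`, `start`, `end`) into the dictionary that displaCy's manual mode accepts. The sample sentence, scores and the helper name `to_displacy` are invented for illustration.

```python
# Hypothetical pipeline output; the scores are made up.
sample_text = "Carlyle Group is based in Washington."
sample_output = [
    {"entity_group": "ORG", "score": 0.99, "word": "Carlyle Group", "start": 0, "end": 13},
    {"entity_group": "LOC", "score": 0.98, "word": "Washington", "start": 26, "end": 36},
]


def to_displacy(text, classifications, skip=("O",)):
    """Build the dict expected by displacy.render(..., manual=True)."""
    ents = [
        {"start": c["start"], "end": c["end"], "label": c["entity_group"]}
        for c in classifications
        if c["entity_group"] not in skip
    ]
    return {"text": text, "ents": ents, "title": None}


doc = to_displacy(sample_text, sample_output)
for ent in doc["ents"]:
    # The character offsets slice the entity straight out of the input text.
    print(ent["label"], repr(sample_text[ent["start"]:ent["end"]]))
```

Because the offsets are character positions in the original string, the same dictionary works for any model whose pipeline reports `start` and `end`.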


In [7]:
example_text = ag_news["train"][0]["text"]


In [8]:
apply_ner(example_text)


## Transfer Learning

The WNUT 17 shared task (loaded below as `wnut_17`) focuses on identifying unusual, previously unseen entities in the context of emerging discussions. Its tags are:

0: O
1: B-corporation
2: I-corporation
3: B-creative-work
4: I-creative-work
5: B-group
6: I-group
7: B-location
8: I-location
9: B-person
10: I-person
11: B-product
12: I-product

- B- indicates the beginning of an entity.
- I- indicates that a token is inside the same entity (e.g., the State token is part of an entity like Empire State Building).
- O indicates that a token is not part of any entity.
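
The tagging scheme can be decoded back into entity spans with a short loop. This is a minimal sketch, assuming well-formed tag strings like those listed above. The function name `bio_to_spans` and the example sentence are illustrative.

```python
def bio_to_spans(tags):
    """Group B-/I- tags into (entity_type, start_token, end_token_exclusive) spans."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the entity type matches.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Anything else closes the span that is currently open, if any.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:  # span that runs to the end of the sentence
        spans.append((current, start, len(tags)))
    return spans


tokens = ["I", "visited", "the", "Empire", "State", "Building", "today"]
tags = ["O", "O", "O", "B-location", "I-location", "I-location", "O"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```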

In [33]:
wnut = load_dataset("wnut_17")


Reusing dataset wnut_17 (C:\Users\Adrian\.cache\huggingface\datasets\wnut_17\wnut_17\1.0.0\077c7f08b8dbc800692e8c9186cdf3606d5849ab0e7be662e6135bb10eba54f9)
100%|██████████| 3/3 [00:00<00:00, 535.06it/s]


In [34]:
model_name = "microsoft/deberta-v3-base"


In [35]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [36]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(
            batch_index=i
        )  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif (
                word_idx != previous_word_idx
            ):  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
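
The alignment rule is easier to see on a tiny hand-made example. The sketch below applies the same logic to a single sentence. The `word_ids` list is written by hand and the sub-word split it implies is hypothetical, so no tokenizer is needed to run it.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-word token and ignore_index to the rest."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:  # special tokens such as [CLS] / [SEP]
            aligned.append(ignore_index)
        elif word_idx != previous:  # first piece of a new word
            aligned.append(word_labels[word_idx])
        else:  # later pieces of the same word
            aligned.append(ignore_index)
        previous = word_idx
    return aligned


# words:  ["Empire", "State", "Building"]
# pieces: [CLS] Em pire State Build ing [SEP]   (invented split)
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [7, 8, 8]  # B-location, I-location, I-location
print(align_labels(word_ids, word_labels))  # [-100, 7, -100, 8, 8, -100, -100]
```

Positions labelled -100 are skipped by the loss, so each word contributes exactly one prediction regardless of how many pieces it is split into.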


In [37]:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)


Loading cached processed dataset at C:\Users\Adrian\.cache\huggingface\datasets\wnut_17\wnut_17\1.0.0\077c7f08b8dbc800692e8c9186cdf3606d5849ab0e7be662e6135bb10eba54f9\cache-99fd98fcc80797d7.arrow
Loading cached processed dataset at C:\Users\Adrian\.cache\huggingface\datasets\wnut_17\wnut_17\1.0.0\077c7f08b8dbc800692e8c9186cdf3606d5849ab0e7be662e6135bb10eba54f9\cache-503a40c710afefa3.arrow
Loading cached processed dataset at C:\Users\Adrian\.cache\huggingface\datasets\wnut_17\wnut_17\1.0.0\077c7f08b8dbc800692e8c9186cdf3606d5849ab0e7be662e6135bb10eba54f9\cache-5aaa835acfced39a.arrow


In [38]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
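
Conceptually, the collator pads every example in a batch to the same length and pads the labels with -100 so that padding is ignored by the loss. The following is a simplified pure-Python sketch of that idea, not the library's implementation: the real collator returns tensors and takes the padding id and side from the tokenizer. The token ids are made up.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad input_ids, attention_mask and labels to the longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [1, 512, 733, 2], "labels": [-100, 9, 10, -100]},
    {"input_ids": [1, 88, 2], "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```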


In [39]:
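# Note: the tag list above has 13 entries (ids 0-12), so num_labels=14 leaves
# one output class (LABEL_13) that never appears in the training labels.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=14)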


Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForTokenClassification: ['lm_predictions.lm_head.LayerNorm.weight', 'deberta.embeddings.position_embeddings.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.dense.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias']
- This IS expected if you are initializing DebertaV2ForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a Be
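
Without explicit label names the model config falls back to generic names such as `LABEL_0`. A possible alternative, sketched here under the assumption that the tag order matches the list earlier in this section, is to build `id2label` and `label2id` mappings and pass them when loading the model. The call itself is left as a comment because it needs the model download. `WNUT_LABELS` is an illustrative name.

```python
# Tag names in id order, copied from the list earlier in this section.
WNUT_LABELS = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

id2label = dict(enumerate(WNUT_LABELS))
label2id = {label: idx for idx, label in id2label.items()}
print(len(WNUT_LABELS), id2label[9], label2id["I-product"])

# The mappings could then be passed when loading the model, for example:
# model = AutoModelForTokenClassification.from_pretrained(
#     model_name,
#     num_labels=len(WNUT_LABELS),
#     id2label=id2label,
#     label2id=label2id,
# )
```

With the names stored in the config, the pipeline would report tags like `B-person` directly and a separate lookup table would not be needed.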

In [40]:
lr, bs = 5e-5, 16
wd, epochs = 0.01, 2

training_args = TrainingArguments(
    "/outputs",
    learning_rate=lr,
    warmup_steps=35,
    lr_scheduler_type="cosine",
    fp16=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs * 2,
    num_train_epochs=epochs,
    weight_decay=wd,
    report_to="none",
    save_strategy="no",
)


- Cosine annealing: https://paperswithcode.com/method/cosine-annealing
- Mixed precision training (`fp16=True`): https://paperswithcode.com/paper/mixed-precision-training
- Warmup: train the first few batches with a low learning rate (`warmup_steps=35`)
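
To see how warmup and cosine annealing combine, the sketch below computes the learning rate at a given optimizer step: a linear ramp from zero during warmup, then a half-cosine decay to zero. It reflects the general shape of a cosine schedule with warmup and is not copied from the library. The default of 426 total steps is taken from the training log in this notebook.

```python
import math


def lr_at_step(step, base_lr=5e-5, warmup_steps=35, total_steps=426):
    """Learning rate with linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has been completed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 17, 35, 230, 426):
    print(step, f"{lr_at_step(step):.2e}")
```

The rate peaks at `base_lr` exactly when warmup ends and then falls smoothly, which avoids large updates both at the very start and at the end of training.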

Things to explore:
- Hyperparameter tuning
- K-fold cross-validation and ensembling
- Try different optimizers (the default is AdamW)
- Try a different loss function

In [41]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)


Using amp half precision backend


In [42]:
trainer.train()


The following columns in the training set don't have a corresponding argument in `DebertaV2ForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DebertaV2ForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3394
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 426
 50%|█████     | 213/426 [00:31<00:27,  7.75it/s]The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DebertaV2ForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1287
  Batch size = 32
                                            

{'eval_loss': 0.21272608637809753, 'eval_runtime': 2.4298, 'eval_samples_per_second': 529.678, 'eval_steps_per_second': 16.874, 'epoch': 1.0}


100%|██████████| 426/426 [01:02<00:00,  7.82it/s]The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DebertaV2ForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1287
  Batch size = 32
                                                 
100%|██████████| 426/426 [01:04<00:00,  7.82it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 426/426 [01:04<00:00,  6.58it/s]


{'eval_loss': 0.22114568948745728, 'eval_runtime': 2.341, 'eval_samples_per_second': 549.769, 'eval_steps_per_second': 17.514, 'epoch': 2.0}
{'train_runtime': 64.7783, 'train_samples_per_second': 104.788, 'train_steps_per_second': 6.576, 'train_loss': 0.245234296915117, 'epoch': 2.0}


TrainOutput(global_step=426, training_loss=0.245234296915117, metrics={'train_runtime': 64.7783, 'train_samples_per_second': 104.788, 'train_steps_per_second': 6.576, 'train_loss': 0.245234296915117, 'epoch': 2.0})
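
The step count and throughput figures in the log can be checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation (the log reports 1) and a final partial batch that is kept, which together reproduce the 426 optimization steps.

```python
import math

num_examples, batch_size, epochs = 3394, 16, 2  # from the training log
train_runtime = 64.7783  # seconds, from the training log

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 213 426

# Throughput figures, rounded the way the log reports them.
print(round(total_steps / train_runtime, 3))            # steps per second
print(round(num_examples * epochs / train_runtime, 3))  # samples per second
```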

In [43]:
trainer.save_model("model")


Saving model checkpoint to model
Configuration saved in model\config.json
Model weights saved in model\pytorch_model.bin
tokenizer config file saved in model\tokenizer_config.json
Special tokens file saved in model\special_tokens_map.json


In [44]:
ner_model = AutoModelForTokenClassification.from_pretrained("./model")
ner_tok = AutoTokenizer.from_pretrained("./model")

loading configuration file ./model\config.json
Model config DebertaV2Config {
  "_name_or_path": "./model",
  "architectures": [
    "DebertaV2ForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_12": 12,
    "LABEL_13": 13,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max

In [45]:
ner_pipeline = pipeline(task="ner", model=ner_model, tokenizer=ner_tok)


In [74]:
def apply_ner(text: list):
    """Tag a list of word tokens and return displaCy HTML for the entity spans."""
    classifications = ner_pipeline(text)

    ner_map = {
        "LABEL_0": "O",
        "LABEL_1": "B-corporation",
        "LABEL_2": "I-corporation",
        "LABEL_3": "B-creative-work",
        "LABEL_4": "I-creative-work",
        "LABEL_5": "B-group",
        "LABEL_6": "I-group",
        "LABEL_7": "B-location",
        "LABEL_8": "I-location",
        "LABEL_9": "B-person",
        "LABEL_10": "I-person",
        "LABEL_11": "B-product",
        "LABEL_12": "I-product",
    }
    spacy_display = {}
    spacy_display["ents"] = []
    spacy_display["text"] = " ".join(text)
    spacy_display["title"] = None

    # Character offsets of each word inside the space-joined text.
    offsets, position = [], 0
    for word in text:
        offsets.append((position, position + len(word)))
        position += len(word) + 1  # +1 for the joining space

    # Label each word with the prediction for its first sub-word token.
    labels = [ner_map.get(c[0]["entity"], "O") if c else "O" for c in classifications]

    i = 0
    while i < len(labels):
        if labels[i].startswith("B-"):
            # Extend the span over the I- tags that directly follow.
            j = i + 1
            while j < len(labels) and labels[j].startswith("I-"):
                j += 1
            spacy_display["ents"].append(
                {
                    "start": offsets[i][0],
                    "end": offsets[j - 1][1],
                    "label": labels[i].split("-", 1)[1],
                }
            )
            i = j
        else:
            i += 1

    entity_list = [
        "corporation",
        "creative-work",
        "group",
        "location",
        "person",
        "product",
    ]
    colors = {
        "corporation": "lightblue",
        "creative-work": "lightyellow",
        "group": "lightgreen",
        "location": "lightcoral",  # "lightred" is not a valid CSS colour name
        "person": "orange",
        "product": "black",
    }
    html = spacy.displacy.render(
        spacy_display,
        style="ent",
        minify=True,
        manual=True,
        options={"ents": entity_list, "colors": colors},
    )
    return html


In [125]:
text = wnut["validation"][2]["tokens"]


In [126]:
apply_ner(text)
