In [1]:
labels = ['fraud', 'hacker groups', 'government', 'corporation',
       'unrelated', 'darknet', 'cyber defense', 'hacking', 'security concepts',
       'security products', 'network security', 'cyberwar', 'geopolitical',
       'data breach', 'vulnerability', 'platform', 'cyber attack']
       

id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

# BERT

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv("./reduced_dataset[0,1000].csv")
df.shape

(1000, 19)

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.15, random_state=42)
train, validation = train_test_split(train, test_size=0.20, random_state=42)
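The two successive splits above determine how many rows land in each set. The following is only an arithmetic sketch (it does not call scikit-learn, and the variable names are made up for illustration) showing how the 1000 rows are divided:

```python
# Arithmetic sketch of the two-stage split; not a call into scikit-learn.
n_total = 1000                             # rows in the dataframe (df.shape[0])

n_test = round(n_total * 0.15)             # first split: 15% held out for testing
n_remaining = n_total - n_test             # rows left for train + validation

n_validation = round(n_remaining * 0.20)   # second split: 20% of the remainder
n_train = n_remaining - n_validation

print(n_train, n_validation, n_test)
```

Because the second split is taken from the remaining 850 rows, the validation set is 17% of the full data rather than 20%.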

In [4]:
train.to_csv('train_data.csv', index=False)
validation.to_csv('validation_data.csv', index=False)
test.to_csv('test_data.csv', index=False)

Load the dataset used to fine-tune BERT.

## Load dataset

Next, let's load a multi-label text classification dataset. One option is to download one from the [hub](https://huggingface.co/). To find a relatively small dataset there:

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag.

Note that you can also easily load your local data (e.g. csv files, txt files, Parquet files, JSON, ...) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files). That is what this notebook does with the three CSV files written above.



In [5]:
from datasets import load_dataset

# dataset = load_dataset("sem_eval_2018_task_1", "subtask5.english")

In [6]:
dataset = load_dataset('csv', data_files={'train': ['train_data.csv'],
                                              'test': 'test_data.csv', 'validation' : 'validation_data.csv'})
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'raw_text', 'fraud', 'hacker groups', 'government', 'corporation', 'unrelated', 'darknet', 'cyber defense', 'hacking', 'security concepts', 'security products', 'network security', 'cyberwar', 'geopolitical', 'data breach', 'vulnerability', 'platform', 'cyber attack'],
        num_rows: 680
    })
    test: Dataset({
        features: ['Unnamed: 0', 'raw_text', 'fraud', 'hacker groups', 'government', 'corporation', 'unrelated', 'darknet', 'cyber defense', 'hacking', 'security concepts', 'security products', 'network security', 'cyberwar', 'geopolitical', 'data breach', 'vulnerability', 'platform', 'cyber attack'],
        num_rows: 150
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'raw_text', 'fraud', 'hacker groups', 'government', 'corporation', 'unrelated', 'darknet', 'cyber defense', 'hacking', 'security concepts', 'security products', 'network security', 'cyberwar', 'geopolitical', 'data breach'

In [7]:
example = dataset['train'][1]
example

{'Unnamed: 0': 284,
 'raw_text': 'Operation Diànxùn: Cyberespionage Campaign Targeting Telecommunication Companies\nhttps://www.mcafee.com/blogs/other-blogs/mcafee-labs/operation-dianxun-cyberespionage-campaign-targeting-telecommunication-companies/',
 'fraud': 0,
 'hacker groups': 0,
 'government': 0,
 'corporation': 0,
 'unrelated': 0,
 'darknet': 0,
 'cyber defense': 0,
 'hacking': 0,
 'security concepts': 0,
 'security products': 0,
 'network security': 0,
 'cyberwar': 0,
 'geopolitical': 1,
 'data breach': 0,
 'vulnerability': 0,
 'platform': 0,
 'cyber attack': 1}

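The dataset consists of short security-related texts (typically a headline with a URL or a short summary), each labeled with one or more of the 17 topics in `labels`.

The `labels` list and the two dictionaries that map labels to integers and back (`label2id` and `id2label`) were already created in the first cell.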
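To make the role of the two mappings concrete, here is a minimal sketch that turns the non-zero label columns of the example above into a multi-hot vector and back into label names. The `row` dictionary and the `multi_hot` and `active` names are introduced only for this illustration.

```python
labels = ['fraud', 'hacker groups', 'government', 'corporation',
          'unrelated', 'darknet', 'cyber defense', 'hacking', 'security concepts',
          'security products', 'network security', 'cyberwar', 'geopolitical',
          'data breach', 'vulnerability', 'platform', 'cyber attack']

id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

# Non-zero label columns of the training example shown above (illustrative copy).
row = {'geopolitical': 1, 'cyber attack': 1}

# One float per label, in the fixed order given by `labels`.
multi_hot = [float(row.get(label, 0)) for label in labels]

# Map the positions that are set back to label names.
active = [id2label[idx] for idx, value in enumerate(multi_hot) if value == 1.0]
print(active)
```

The position of each label in the vector is fixed by `label2id`, which is why the same list order has to be used for training and for inference.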

## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
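As a rough illustration of that label matrix, the sketch below builds it in plain Python from a column-oriented batch, which is the layout a batched `map` function receives. The three-label list and the toy batch are assumptions made to keep the example small; the notebook itself uses all 17 labels and a NumPy array.

```python
# Shortened label list, for illustration only.
labels = ["fraud", "hacking", "cyber attack"]

# A toy batch in column-oriented form: one list per column.
examples = {
    "raw_text": ["text a", "text b"],
    "fraud": [0, 1],
    "hacking": [1, 0],
    "cyber attack": [1, 0],
}

batch_size = len(examples["raw_text"])

# One row per example, one float per label, in the order given by `labels`.
labels_matrix = [
    [float(examples[label][i]) for label in labels]
    for i in range(batch_size)
]
print(labels_matrix)
```

The explicit `float(...)` conversion mirrors the requirement above: the loss expects floating-point targets even though the CSV columns hold integers.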

In [8]:
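import re

def remove_punctuation(text):
    # replace every character that is not an ASCII letter or digit with a space
    pattern = r'[^a-zA-Z0-9]'
    text = re.sub(pattern, ' ', text)
    # collapse runs of spaces into a single space
    text = re.sub(r' +', ' ', text)
    return text.strip()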

In [9]:
remove_punctuation("dorasdasdas     asdaw1132!!!")

'dorasdasdas asdaw1132'

In [None]:
txt = {"raw_text": "hello, my name is John and today I will teach you how to be a loser"}

In [10]:
from transformers import AutoTokenizer
import numpy as np
import unicodedata

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples):
  # take a batch of texts
  text = examples["raw_text"]
  # normalize
  text = [unicodedata.normalize('NFKC', x) for x in text] 
  # remove punctuation and lowercase the text
  text = [remove_punctuation(x).lower() for x in text] 
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=512)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding

In [11]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)

In [12]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
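To show what `padding="max_length"` and `truncation=True` do to `input_ids` and `attention_mask`, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are taken from the encoded example printed further down in this notebook.

```python
# Simplified illustration only; the real tokenizer handles this internally.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_or_truncate(content_ids, max_length):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to max_length."""
    # keep at most max_length - 2 content tokens to leave room for [CLS] and [SEP]
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # pad on the right; padded positions get attention_mask 0
    n_pad = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_or_truncate([24095, 2075, 2007], max_length=8)
long = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short)
print(long)
```

Every encoded example ends up with the same length, which is what allows the dataset to be batched without a padding collator, at the cost of spending compute on `[PAD]` positions for short texts.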


In [13]:
i=999
text = unicodedata.normalize('NFKC', df.raw_text.iloc[i])
encoded = tokenizer.encode_plus(text, padding="max_length", truncation=True, max_length=512,)["input_ids"]
tokenizer.decode(encoded)

'[CLS] microsoft wpbt flaw lets hackers install rootkits on windows devices security researchers have found a flaw in the microsoft windows platform binary table ( wpbt ) that could be exploited in easy attacks to install rootkits on all windows computers shipped since 2012. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

In [18]:
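def get_unknown_counts(sent):
  # count the tokens that map to the tokenizer's [UNK] id (100 for bert-base-uncased)
  return (tokenizer.encode_plus(text=sent, return_tensors="pt")["input_ids"] == tokenizer.unk_token_id).sum().item()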

In [19]:
df.raw_text.iloc[[i]].apply(get_unknown_counts)

999    0
Name: raw_text, dtype: int64

In [20]:
df.raw_text.apply(get_unknown_counts).sum()

0

In [21]:
tokenizer.decode(example['input_ids'])

'[CLS] zooming with the grandkids nieces and nephews five free and easy video chat apps for the holidays all the kids are doing it and so can you if you haven t hopped onto a video chat with the family yet the holidays are a great time to give it a whirl while there are plenty of the post zooming with the grandkids five easy video chat apps for the holidays https www mcafee com blogs consumer family safety zooming with the grandkids five easy video chat apps for the holidays appeared first on mcafee blogs https www mcafee com blogs [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

In [22]:
example['labels']

[0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

In [23]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['unrelated']

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html). 

In [24]:
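from transformers import DataCollatorForLanguageModeling

# Note: this collator is meant for masked language modeling and is never passed to the
# Trainer below, so it has no effect on the classification fine-tuning.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="pt"
)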

In [25]:
encoded_dataset.set_format("torch")

## Define model

Here we define a model that consists of a pre-trained base (i.e. the weights from bert-base-uncased are loaded) with a randomly initialized classification head (a linear layer) on top. One should fine-tune this head, together with the pre-trained base, on a labeled dataset.

The warning printed below says the same.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [26]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things: 

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below, for example, we specify that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training and evaluation, the number of epochs to train for, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [27]:
batch_size = 8
metric_name = "f1"

In [28]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    f"bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [29]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
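    # note: ROC AUC is computed here on the thresholded predictions;
    # passing `probs` instead would make it independent of the threshold
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')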
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
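To make the metrics easier to interpret, the following sketch recomputes micro-averaged F1 and exact-match accuracy by hand on a tiny made-up batch. It uses only the standard library; the function name and the toy logits are assumptions for illustration, while the cell above uses scikit-learn for the actual numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def micro_f1_and_subset_accuracy(logits, y_true, threshold=0.5):
    # sigmoid + threshold, applied independently to every label
    y_pred = [[1 if sigmoid(x) >= threshold else 0 for x in row] for row in logits]
    # micro averaging pools true/false positives and false negatives over all labels
    tp = fp = fn = 0
    for pred_row, true_row in zip(y_pred, y_true):
        for p, t in zip(pred_row, true_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    # an example counts as correct only if every one of its labels matches
    exact = sum(int(p == t) for p, t in zip(y_pred, y_true)) / len(y_true)
    return f1, exact

toy_logits = [[2.0, -1.0, -3.0], [-2.0, 0.5, -0.1]]
toy_labels = [[1, 0, 0], [0, 1, 1]]
f1, exact = micro_f1_and_subset_accuracy(toy_logits, toy_labels)
print(f1, exact)
```

For multi-label targets, `accuracy_score` counts an example as correct only when all of its labels are predicted correctly, which is why the accuracy reported during training can sit well below the F1 score.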

Let's verify a batch as well as a forward pass:

In [30]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [31]:
encoded_dataset['train']['input_ids'][0]

tensor([  101, 24095,  2075,  2007,  1996,  2882,  3211,  5104, 12286,  2015,
         1998,  7833,  2015,  2274,  2489,  1998,  3733,  2678, 11834, 18726,
         2005,  1996, 11938,  2035,  1996,  4268,  2024,  2725,  2009,  1998,
         2061,  2064,  2017,  2065,  2017,  4033,  1056, 17230,  3031,  1037,
         2678, 11834,  2007,  1996,  2155,  2664,  1996, 11938,  2024,  1037,
         2307,  2051,  2000,  2507,  2009,  1037,  1059, 11961,  2140,  2096,
         2045,  2024,  7564,  1997,  1996,  2695, 24095,  2075,  2007,  1996,
         2882,  3211,  5104,  2274,  3733,  2678, 11834, 18726,  2005,  1996,
        11938, 16770,  7479, 22432,  7959,  2063,  4012, 23012,  7325,  2155,
         3808, 24095,  2075,  2007,  1996,  2882,  3211,  5104,  2274,  3733,
         2678, 11834, 18726,  2005,  1996, 11938,  2596,  2034,  2006, 22432,
         7959,  2063, 23012, 16770,  7479, 22432,  7959,  2063,  4012, 23012,
          102,     0,     0,     0,     0,     0,     0,     0, 

In [32]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.7759, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.1391,  0.1235, -0.4908,  0.3604, -0.4730,  0.7627,  0.0364,  0.4406,
         -0.3080,  0.2050,  0.5862, -0.4603,  0.0987,  0.4429, -0.3821, -0.0355,
          0.4458]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
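As a sanity check on the loss, the value above can be approximately reproduced by hand from the printed logits. The sketch below assumes the usual binary cross-entropy with logits, averaged over all labels, and uses the label vector of this example (only `unrelated`, index 4, is set). The logits are copied from the output above, so they are rounded to four decimals.

```python
import math

# Logits copied from the forward-pass output above (rounded to 4 decimals).
logits = [-0.1391, 0.1235, -0.4908, 0.3604, -0.4730, 0.7627, 0.0364, 0.4406,
          -0.3080, 0.2050, 0.5862, -0.4603, 0.0987, 0.4429, -0.3821, -0.0355,
          0.4458]

# Multi-hot target of the same example: only index 4 ('unrelated') is 1.
targets = [0.0] * 17
targets[4] = 1.0

def bce_with_logits(x, y):
    # numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# Mean over all labels (a single example, so no extra averaging over the batch).
loss = sum(bce_with_logits(x, y) for x, y in zip(logits, targets)) / len(logits)
print(round(loss, 4))
```

The result should land close to the `0.7759` reported by the model. A value near `log(2) ≈ 0.69` or slightly above is what one would expect from a freshly initialized head whose outputs are close to zero.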

In [33]:
trainer = Trainer(
    model,
    args,
    train_dataset= encoded_dataset["train"],
    eval_dataset=  encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [34]:
trainer.train()

***** Running training *****
  Num examples = 680
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 425
  Number of trainable parameters = 109495313
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | No log | 0.286221 | 0.259587 | 0.574901 | 0.252941 |
| 2 | No log | 0.244513 | 0.480583 | 0.666267 | 0.311765 |
| 3 | No log | 0.230874 | 0.496552 | 0.679092 | 0.358824 |
| 4 | No log | 0.215276 | 0.51954 | 0.688674 | 0.394118 |
| 5 | No log | 0.21313 | 0.53211 | 0.694231 | 0.394118 |


***** Running Evaluation *****
  Num examples = 170
  Batch size = 8
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-85
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-85/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-85/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-85/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-85/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 170
  Batch size = 8
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-170
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-170/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-170/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-170/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-170/spe

TrainOutput(global_step=425, training_loss=0.2526123944450827, metrics={'train_runtime': 6625.2851, 'train_samples_per_second': 0.513, 'train_steps_per_second': 0.064, 'total_flos': 894698068992000.0, 'train_loss': 0.2526123944450827, 'epoch': 5.0})
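The numbers in the training log can be cross-checked with a little arithmetic. The sketch below is illustrative only; the per-epoch F1 values are copied from the table above and the variable names are made up. It derives the step counts, the checkpoint names, the throughput, and the epoch with the best validation F1.

```python
import math

num_examples = 680         # "Num examples" in the training log
batch_size = 8
num_epochs = 5
train_runtime = 6625.2851  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# one checkpoint per epoch, named after the global step at which it was saved
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

# Validation F1 per epoch, copied from the table above.
val_f1 = {1: 0.259587, 2: 0.480583, 3: 0.496552, 4: 0.51954, 5: 0.53211}
best_epoch = max(val_f1, key=val_f1.get)

print(steps_per_epoch, total_steps, checkpoint_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(best_epoch, steps_per_epoch * best_epoch)
```

With `load_best_model_at_end=True` and `metric_for_best_model="f1"`, the checkpoint with the highest validation F1 is the one expected to be reloaded, which here is the last epoch. The `No log` entries in the Training Loss column are most likely because the default logging interval (500 steps, if left unchanged) is larger than the 425 total steps.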

## Evaluate

In [36]:
# trainer.evaluate()
trainer.evaluate(eval_dataset=encoded_dataset['test'])

***** Running Evaluation *****
  Num examples = 150
  Batch size = 8


<transformers.trainer.Trainer object at 0x16a5c69a0>


## Inference

Let's test the model on a new sentence:

In [38]:
text = "Apple Patches Two iOS Zero-Days Abused for Years\nhttps://threatpost.com/apple-patches-two-ios-zero-days-abused-for-years/155042/\n\nResearchers revealed two zero-day security vulnerabilities affecting Apple's stock Mail app on iOS devices."

encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

outputs = trainer.model(**encoding)

print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-2.8815, -2.4201, -2.4742, -0.7778, -1.9879, -2.9931, -2.8872, -1.8126,
         -3.1530, -2.9171, -2.5911, -3.2919, -2.5013, -2.2084, -0.4374, -1.7543,
          0.0959]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


The logits that come out of the model are of shape (batch_size, num_labels). As we are only forwarding a single sentence through the model, the `batch_size` equals 1. The logits are a tensor containing the (unnormalized) scores for every individual label.

In [40]:
logits = outputs.logits
logits.shape

torch.Size([1, 17])

To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, so that every score becomes a number between 0 and 1, which can be interpreted as a "probability" expressing how confident the model is that the given label applies to the input text.

Next, we use a threshold (typically, 0.5) to turn every probability into either a 1 (which means, we predict the label for the given example) or a 0 (which means, we don't predict the label for the given example).

In [41]:
# apply sigmoid + threshold to the logits computed above
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted ids into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)

['cyber attack']
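It can also help to look at the probabilities themselves rather than only the thresholded result. The sketch below redoes the sigmoid and threshold step in plain Python on the logits printed above (copied by hand, so rounded to four decimals) and lists the three highest-scoring labels. The `ranked` variable is introduced only for this illustration.

```python
import math

labels = ['fraud', 'hacker groups', 'government', 'corporation',
          'unrelated', 'darknet', 'cyber defense', 'hacking', 'security concepts',
          'security products', 'network security', 'cyberwar', 'geopolitical',
          'data breach', 'vulnerability', 'platform', 'cyber attack']

# Logits copied from the inference output above.
logits = [-2.8815, -2.4201, -2.4742, -0.7778, -1.9879, -2.9931, -2.8872, -1.8126,
          -3.1530, -2.9171, -2.5911, -3.2919, -2.5013, -2.2084, -0.4374, -1.7543,
          0.0959]

# Sigmoid applied independently to every label score.
probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]

# A threshold of 0.5 on the probability is the same as a threshold of 0 on the logit.
predicted = [label for label, p in zip(labels, probs) if p >= 0.5]

# Rank labels by probability to see which ones came close to the threshold.
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
for label, p in ranked[:3]:
    print(f"{label}: {p:.3f}")
print(predicted)
```

Inspecting the ranking shows which labels narrowly missed the cut, which can inform whether a threshold lower than 0.5 would suit this dataset better.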


In [43]:
# save_metrics needs the split name and the metrics dict returned by evaluate()
trainer.save_metrics("test", trainer.evaluate(eval_dataset=encoded_dataset['test']))