# COMP8325 Workshop 11
## Natural Language Processing for Cyber Security

We're going to use a version of RoBERTa, a large pre-trained transformer language model, that has been fine-tuned to identify spam, and apply it to a dataset of spam (and non-spam) SMS messages.

For those who are keen, we can also apply other ML models, such as XGBoost (an optimised gradient-boosted decision tree method), to this data to explore how effective RoBERTa is on this task.

This workshop is modelled on the [Hugging Face text classification tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification).

# Step 1
## Check you have a GPU enabled

Click on the "RAM... Disk..." display top right (next to "Editing"). This should show you a "Resources" tab.
You should see 3 resource displays: "System RAM", "GPU RAM" and "Disk". 

If "GPU RAM" is missing, click on "Change runtime type" in blue at the bottom of the Resources tab. In the "Hardware accelerator" dropdown, select "GPU". You could also select "TPU" but in my experience you don't often get access to a physical TPU, and for this tutorial a GPU should be sufficient.

Once you have a GPU/TPU, execute the cell below. The middle box of the output shows the GPU memory. I believe you need at least 10,000 MiB to run the models in this tutorial. In the unlikely event you have less, select "Restart runtime" from the "Runtime" menu; I believe this will re-assign your GPU. Then run `!nvidia-smi` again to check the GPU.

In [1]:
!nvidia-smi 

'nvidia-smi' is not recognized as an internal or external command,
operable program or batch file.
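
The message above is what a Windows shell prints when `nvidia-smi` is not on the PATH, so this particular output was presumably captured on a machine without the NVIDIA driver tools rather than on a Colab GPU runtime. As a rough alternative, the sketch below performs the same check from Python using only the standard library. The helper names `parse_gpu_memory_mib` and `gpu_memory_mib` are invented for this illustration, and the 10,000 MiB threshold is the estimate quoted above, not a measured requirement.

```python
import shutil
import subprocess


def parse_gpu_memory_mib(csv_output):
    """Parse one memory value (in MiB) per line, skipping anything non-numeric."""
    values = []
    for line in csv_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def gpu_memory_mib():
    """Return total memory in MiB for each visible GPU, or [] if none is found."""
    if shutil.which("nvidia-smi") is None:
        return []  # the tool is not installed or not on the PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory_mib(result.stdout)


memory = gpu_memory_mib()
if not memory:
    print("No NVIDIA GPU visible: check the runtime type.")
elif max(memory) < 10_000:  # threshold assumed from the text above
    print(f"GPU found with {max(memory)} MiB, which may be too little.")
else:
    print(f"GPU found with {max(memory)} MiB.")
```

On a machine with no GPU the script prints the first message instead of failing, which makes it easier to tell "no GPU assigned" apart from "GPU too small".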


# Step 2
## Install Hugging Face Python packages

Hugging Face has made access to large (transformer) language models much easier.

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.2.1-py3-none-any.whl (342 kB)
Collecting transformers
  Downloading transformers-4.19.1-py3-none-any.whl (4.2 MB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux

In [None]:
from datasets import load_dataset

# Step 3
## Download a dataset
Hugging Face provides a library of community-contributed datasets.

In [None]:
# spam_data = load_dataset('sms_spam')
spam_data = load_dataset('SetFit/enron_spam')

# You can search for other datasets on the huggingface website:
#  https://huggingface.co

Using custom data configuration SetFit--enron_spam-5c67eb35f5974df9


Downloading and preparing dataset json/SetFit--enron_spam to /root/.cache/huggingface/datasets/SetFit___json/SetFit--enron_spam-5c67eb35f5974df9/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/101M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.27M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/SetFit___json/SetFit--enron_spam-5c67eb35f5974df9/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
spam_data

DatasetDict({
    train: Dataset({
        features: ['message_id', 'text', 'label', 'label_text', 'subject', 'message', 'date'],
        num_rows: 31716
    })
    test: Dataset({
        features: ['message_id', 'text', 'label', 'label_text', 'subject', 'message', 'date'],
        num_rows: 2000
    })
})

In [None]:
spam_data["train"][0]  # inspect the first training example

{'date': datetime.datetime(2005, 6, 18, 0, 0),
 'label': 1,
 'label_text': 'spam',
 'message': 'understanding oem software\nlead me not into temptation ; i can find the way myself .\n# 3533 . the law disregards trifles .',
 'message_id': 33214,
 'subject': 'any software just for 15 $ - 99 $',
 'text': 'any software just for 15 $ - 99 $ understanding oem software\nlead me not into temptation ; i can find the way myself .\n# 3533 . the law disregards trifles .'}
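
In this record, `text` looks like `subject` and `message` joined by a single space. The check below confirms that for this one example only, using the field values copied from the output above; whether it holds for every row is an assumption you would need to verify on the full dataset. It matters because a model tokenised on `message` alone never sees the subject line.

```python
# Field values copied from the record printed above.
record = {
    "subject": "any software just for 15 $ - 99 $",
    "message": (
        "understanding oem software\n"
        "lead me not into temptation ; i can find the way myself .\n"
        "# 3533 . the law disregards trifles ."
    ),
    "text": (
        "any software just for 15 $ - 99 $ understanding oem software\n"
        "lead me not into temptation ; i can find the way myself .\n"
        "# 3533 . the law disregards trifles ."
    ),
}

# Rebuild `text` from the other two fields and compare.
rebuilt = record["subject"] + " " + record["message"]
text_is_subject_plus_message = rebuilt == record["text"]
print(text_is_subject_plus_message)
```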

# Step 4
## Prepare the dataset for use in a transformer model
To apply a model to data, you need to pre-process the data in the same way as the data used to train the model (i.e. use the same vocabulary and the same method of dividing the text into tokens).
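
To see why the vocabulary has to match, consider the toy sketch below. It is not RoBERTa's byte-level BPE tokenizer: it splits on whitespace and looks words up in two small, hypothetical vocabularies. The same text maps to different integer IDs under each, so a model trained with one vocabulary would read the other's IDs as entirely different words.

```python
def encode(text, vocab, unk_id=0):
    """Map whitespace-separated tokens to integer IDs; unknown tokens get unk_id."""
    return [vocab.get(token, unk_id) for token in text.lower().split()]


# Two hypothetical vocabularies that assign different IDs to the same words.
vocab_used_in_training = {"<unk>": 0, "free": 1, "software": 2, "meeting": 3}
vocab_mismatched = {"<unk>": 0, "meeting": 1, "free": 2, "offer": 3}

text = "free software offer"
ids_expected = encode(text, vocab_used_in_training)
ids_mismatched = encode(text, vocab_mismatched)

print(ids_expected)    # "offer" is unknown to this vocabulary
print(ids_mismatched)  # "software" is unknown to this vocabulary
```

Loading the tokenizer with `AutoTokenizer.from_pretrained` under the same name as the model avoids this kind of mismatch.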

In [None]:
from transformers import AutoTokenizer

# tokenizer = AutoTokenizer.from_pretrained("mvonwyl/roberta-twitter-spam-classifier")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

In [None]:
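def preprocess_function(examples):
    # For the 'sms_spam' dataset the text column is "sms" rather than "message":
    # return tokenizer(examples["sms"], truncation=True, return_tensors="pt", padding=True)
    # truncation=True cuts inputs that exceed the model's maximum sequence length.
    return tokenizer(examples["message"], truncation=True)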

In [None]:
tokenized_spam = spam_data.map(preprocess_function, batched=True)  # tokenise both splits in batches

  0%|          | 0/32 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
tokenized_spam

DatasetDict({
    train: Dataset({
        features: ['message_id', 'text', 'label', 'label_text', 'subject', 'message', 'date', 'input_ids', 'attention_mask'],
        num_rows: 31716
    })
    test: Dataset({
        features: ['message_id', 'text', 'label', 'label_text', 'subject', 'message', 'date', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [None]:
# Use DataCollatorWithPadding to create a batch of examples. It dynamically pads each text to the length of the longest element in its batch,
# so that they are a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
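
The efficiency claim for dynamic padding can be illustrated with a small standard-library sketch. It is not the `DataCollatorWithPadding` implementation: `pad_batch` and `total_padded_tokens` are hypothetical helpers, and `pad_id=1` is assumed to match RoBERTa's padding token (in practice you would read it from `tokenizer.pad_token_id`).

```python
def pad_batch(batch, pad_id=1):
    """Pad every sequence in one batch to the length of its longest member."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def total_padded_tokens(lengths, batch_size, dynamic=True):
    """Count tokens processed when padding per batch (dynamic) or to the global maximum."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        width = max(chunk) if dynamic else global_max
        total += width * len(chunk)
    return total


padded = pad_batch([[0, 9, 2], [0, 7, 8, 9, 2]])
print(padded["input_ids"])
print(padded["attention_mask"])

lengths = [3, 5, 2, 12]  # made-up token counts for four texts
dynamic_cost = total_padded_tokens(lengths, batch_size=2, dynamic=True)
static_cost = total_padded_tokens(lengths, batch_size=2, dynamic=False)
print(dynamic_cost, static_cost)
```

With these made-up lengths, only the batch containing the 12-token text is padded to width 12; the other batch stops at width 5, so fewer padding tokens pass through the model.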

# Step 5
## Download and run a model 
Hugging Face also maintains a library of community-supplied models. Most of these are variations on the main base models (BERT, RoBERTa, GPT, BART, T5, ...) fine-tuned on a specific type of data and/or for a specific task.

There are also a variety of pre-trained models that have appeared in the research literature. The model collection is quite extensive.

__NOTE:__ Training will typically take several hours; inference (i.e. applying the model to new data) takes a second or so for each text.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# model = AutoModelForSequenceClassification.from_pretrained("mvonwyl/roberta-twitter-spam-classifier", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

Downloading:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,  # the training log below reports 5 epochs, so it came from a run with this set to 5
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_spam["train"],
    eval_dataset=tokenized_spam["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: message_id, date, label_text, text, subject, message. If message_id, date, label_text, text, subject, message are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 31716
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9915


Step,Training Loss
500,0.1394
1000,0.0735
1500,0.0475
2000,0.0511
2500,0.0282
3000,0.0261
3500,0.0228
4000,0.0175
4500,0.0167
5000,0.0141


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_toke
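
The step counts in the log can be reproduced with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (both stated in the log) and a checkpoint every 500 steps (inferred from the checkpoint names above). The function names are made up for this illustration.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch is the number of batches, rounding up for the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def last_checkpoint_step(total_steps, save_steps=500):
    """Largest multiple of save_steps that does not exceed total_steps."""
    return (total_steps // save_steps) * save_steps


steps_5_epochs = total_optimization_steps(31716, 16, 5)
steps_3_epochs = total_optimization_steps(31716, 16, 3)

print(steps_5_epochs, last_checkpoint_step(steps_5_epochs))
print(steps_3_epochs, last_checkpoint_step(steps_3_epochs))
```

Five epochs gives the 9915 steps reported in the log, with the last periodic checkpoint at step 9500. A run with `num_train_epochs=3` would instead take 5949 steps and finish with a checkpoint at step 5500.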

# Step 6
## Download your trained model for future use

Colab virtual machines are ephemeral and have a limited life span in the free tier. If you want to experiment with the model later, you should download it.

Be aware also that Colab limits access to GPUs. I found that access was blocked when attempting to train a model twice.

In [None]:
!ls results

In [None]:
# Combining all checkpoints will likely be too large to download:
# !tar -czf results.tar.gz ./results/

# Edit the checkpoint number (2500 here) to match the last checkpoint listed by `!ls results`.
!tar -czf ckpt-2500-model.tar.gz ./results/checkpoint-2500/pytorch_model.bin

In [None]:
from google.colab import files

# files.download('results.tar.gz')
# Use the same checkpoint number here as in the tar command above.
files.download('ckpt-2500-model.tar.gz')
files.download('./results/checkpoint-2500/config.json')
files.download('./results/checkpoint-2500/tokenizer_config.json')
files.download('./results/checkpoint-2500/special_tokens_map.json')
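
One possible variation is to bundle the weights and the small JSON files into a single archive, so there is one download instead of four. The sketch below uses Python's `tarfile` module and runs on placeholder files in a temporary directory so that it works anywhere. The function name `archive_checkpoint` and the `checkpoint-demo` directory are hypothetical; on Colab you would point it at a real checkpoint directory under `./results`.

```python
import os
import tarfile
import tempfile


def archive_checkpoint(checkpoint_dir, archive_path, filenames):
    """Write the named files into a gzip-compressed tar and return the archived names."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in filenames:
            # arcname stores just the file name, not the full directory path
            archive.add(os.path.join(checkpoint_dir, name), arcname=name)
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())


# File names taken from the checkpoint save log above.
wanted = [
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
]

# Demonstration with placeholder files; a real checkpoint would hold the trained weights.
with tempfile.TemporaryDirectory() as workdir:
    checkpoint_dir = os.path.join(workdir, "checkpoint-demo")
    os.makedirs(checkpoint_dir)
    for name in wanted:
        with open(os.path.join(checkpoint_dir, name), "w") as handle:
            handle.write("placeholder")
    archive_path = os.path.join(workdir, "ckpt-demo.tar.gz")
    archived = archive_checkpoint(checkpoint_dir, archive_path, wanted)

print(archived)
```

Reading the archive back and listing its members is a cheap way to confirm that nothing was missed before the virtual machine is recycled.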