## Text Classification with BERT in PyTorch

### What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers. The name points to two ideas. First, a transformer is a deep learning model built on self-attention, a mechanism that weights each word according to its relation to the other words in the sequence. Based on these attention scores, the model can "pay attention" to the most informative parts of the sequence. Second, BERT is bidirectional: during training it considers the context on both the left and the right of each word, so a BERT model can use context from both directions.
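
To make the idea of attention scores concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification, not BERT's actual implementation: a real transformer first applies learned query, key, and value projections and uses several heads, while this sketch uses the raw vectors for all three roles. The toy vectors are made-up values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the query against every position, to the left and to the right alike.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all the vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional "embeddings" for three tokens (hypothetical values).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
print(weights[0])  # the first token attends most to itself, least to the third
```

Because every position is scored against every other position, the weights for a token depend on its left and right neighbours at once, which is the bidirectional behaviour described above.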
#### BERT BASE and BERT LARGE
BERT BASE: the smaller configuration, with fewer transformer blocks and a smaller hidden size. It was sized to match OpenAI GPT for comparison. [12 Transformer blocks, 12 attention heads, hidden size 768, about 110M parameters]

BERT LARGE: a much larger network with twice as many transformer blocks as BERT BASE. It achieved state-of-the-art results on many NLP tasks when it was released. [24 Transformer blocks, 16 attention heads, hidden size 1024, about 340M parameters]

Differences: BERT BASE has fewer parameters, so it needs less memory and compute. BERT LARGE has more parameters and is generally more accurate, at the cost of slower training and inference.
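
The size gap can be checked with a little arithmetic. The sketch below counts the parameters of a BERT encoder from its configuration. It assumes the original English vocabulary of 30,522 tokens, 512 positions, and a feed-forward size of four times the hidden size; other checkpoints, such as the Japanese one used later, have a different vocabulary and therefore a slightly different total. The function name is made up for this illustration.

```python
def bert_encoder_params(num_layers, hidden, vocab_size=30522, max_positions=512):
    """Count the parameters of a BERT encoder (embeddings + blocks + pooler)."""
    ffn = 4 * hidden  # inner size of the feed-forward network (assumed 4x hidden)
    # Token, position and segment (2 types) embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + 2) * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward network, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    block = attention + feed_forward + layer_norms
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * block + pooler

base = bert_encoder_params(num_layers=12, hidden=768)
large = bert_encoder_params(num_layers=24, hidden=1024)
print(f"BERT BASE : {base:,}")   # 109,482,240
print(f"BERT LARGE: {large:,}")  # 335,141,888
```

The number of attention heads does not appear in the formula: the heads split the hidden size between them, so they change how the projections are used, not how many weights there are.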
#### BERT Input and Output
Input: [CLS] sequence of tokens [SEP]

- [CLS] is the classification token, placed at the start of every input.
- [SEP] is the separator token. It marks the end of a sequence, so BERT can tell which tokens belong to which sequence.
- A BERT model accepts at most 512 tokens. Shorter sequences can be padded with [PAD] tokens up to the required length, and longer sequences must be truncated.
- For each input token, BERT BASE outputs an embedding vector of size 768. These embeddings are the inputs of our classifier; for sequence classification, the embedding of the [CLS] token is typically used.
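
The input layout described above can be sketched in a few lines of plain Python. This is an illustration only: `build_bert_input` is a hypothetical helper, and it works on ready-made word tokens rather than on the subword IDs a real tokenizer produces.

```python
def build_bert_input(tokens, max_length):
    """Wrap tokens in [CLS]/[SEP], then truncate or pad to max_length."""
    # Reserve two positions for the special tokens and drop any overflow.
    kept = tokens[: max_length - 2]
    sequence = ["[CLS]"] + kept + ["[SEP]"]
    # 1 marks a real token (including [CLS] and [SEP]); 0 marks padding.
    attention_mask = [1] * len(sequence)
    padding = max_length - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask

short_seq, short_mask = build_bert_input(["I", "like", "tea"], max_length=8)
long_seq, long_mask = build_bert_input(list("abcdefghij"), max_length=8)
print(short_seq)   # padded with three [PAD] tokens
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long_seq)    # truncated to six tokens between [CLS] and [SEP]
```

Either way the result has exactly `max_length` positions, which is what lets sequences of different lengths be stacked into one tensor.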

#### Experiment with one simple text

In [1]:
from transformers import AutoTokenizer

# Japanese BERT checkpoint; its tokenizer relies on the fugashi and unidic-lite packages
tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')



In [2]:
# Japanese for "I played soccer all day today"
example_text = '今日は一日サッカーをしました'
bert_input = tokenizer(example_text, padding='max_length', max_length=20,
                       truncation=True, return_tensors="pt")

# Uncomment to inspect the three tensors returned by the tokenizer
# print(bert_input['input_ids'])
# print(bert_input['token_type_ids'])
# print(bert_input['attention_mask'])

#### Explanation of the tokenizer arguments
- padding : the padding strategy. With 'max_length', every sequence is padded to the length given by max_length.
- max_length : the maximum length of each sequence. In this example we use 20, but for our actual dataset we will use 512, the maximum sequence length that BERT allows.
- truncation : if True, tokens beyond the maximum length are cut off.
- return_tensors : the type of tensors to return. Since we are using PyTorch, we use pt. If you use TensorFlow, use tf instead.

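#### What is input_ids?
input_ids holds the integer vocabulary ID of each token, including the special tokens and the padding. Decoding the IDs back to text shows what the tokenizer produced.

In [3]:
# Use a new name so the original example_text is not overwritten
decoded_text = tokenizer.decode(bert_input.input_ids[0])
# print(decoded_text)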

#### What are token_type_ids and attention_mask?
- token_type_ids is a binary mask that identifies which sequence each token belongs to. Because we only have one sequence, all tokens belong to class 0.
- attention_mask is a binary mask that indicates whether a token is real input or padding. For a real word, [CLS], or [SEP], the mask is 1; for padding, it is 0.
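
The two masks are easier to tell apart with a sentence pair, where token_type_ids is no longer all zeros. The sketch below is a plain-Python illustration with a hypothetical helper, `encode_pair`. It assumes the pair fits within `max_length` (no truncation) and that padding positions get segment 0.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Build tokens, token_type_ids and attention_mask for a sentence pair."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]   # segment 0
    second = tokens_b + ["[SEP]"]              # segment 1
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(tokens)
    # Pad up to max_length; padding is masked out and assigned segment 0.
    padding = max_length - len(tokens)
    tokens += ["[PAD]"] * padding
    token_type_ids += [0] * padding
    attention_mask += [0] * padding
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = encode_pair(
    ["how", "are", "you"], ["fine", "thanks"], max_length=10)
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

With a single tweet per example, as in this notebook, only the first segment exists, so token_type_ids stays at 0 throughout.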

#### Import the data

In [4]:
import pandas as pd
datasets = pd.read_pickle("/home/danmengcai/datasets.pkl") 

In [13]:
datasets.head()

| | tweets | label |
|---|---|---|
| 0 | まじで今回気合入ってるのでぜひ | 1 |
| 1 | つくばメイクアップミーティング まであと3日\n今回は初心者講座ということでベースメイク... | 1 |
| 2 | 次回開催が518水 19002000に決定致しました\n\nメイク初心者のあなたもそろそ... | 1 |
| 3 | つくばメイクアップミーティング まであと1日\nいよいよ明日がイベント当日です\n下記U... | 1 |
| 4 | つくばメイクアップミーティング まであと3日 \n当日投影する資料をチラ見せ\n全貌は... | 1 |



#### Use the Hugging Face datasets library to split the data into training and test sets

In [5]:
from datasets import Dataset

dataset_packed = Dataset.from_pandas(datasets)
dataset_split = dataset_packed.train_test_split(test_size=0.2, seed=0)
print(dataset_split)

DatasetDict({
    train: Dataset({
        features: ['tweets', 'label'],
        num_rows: 121281
    })
    test: Dataset({
        features: ['tweets', 'label'],
        num_rows: 30321
    })
})
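
The split sizes printed above can be reproduced with a small sketch. It shuffles row indices with a fixed seed and rounds the test share up, which is an assumption that happens to match the 121281 / 30321 split shown. The library uses its own shuffling, so the particular rows selected here would not match it.

```python
import math
import random

def train_test_indices(num_rows, test_size, seed):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Round the test share up; the train split receives the remainder.
    num_test = math.ceil(num_rows * test_size)
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = train_test_indices(num_rows=121281 + 30321, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 121281 30321
```

Fixing the seed is what makes the split repeatable: the same rows land in the test set on every run, so metrics from different experiments stay comparable.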


#### Tokenize the dataset with the BERT tokenizer

In [6]:
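from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')

MAX_LENGTH = 512  # maximum sequence length that BERT accepts

def preprocess_function(examples):
    # Truncate only; padding is applied per batch by the data collator defined later
    return tokenizer(examples["tweets"], max_length=MAX_LENGTH, truncation=True)

# batched=True passes many rows to the tokenizer in each call
tokenized_dataset = dataset_split.map(preprocess_function, batched=True)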



In [7]:
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Pads each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# The checkpoint is a BERT model, so the Auto class selects the matching BERT architecture
model = AutoModelForSequenceClassification.from_pretrained("cl-tohoku/bert-base-japanese-v2", num_labels=2)

The run recorded in this notebook loaded the checkpoint with DistilBertForSequenceClassification instead. That mismatch produced the warning below, and the pretrained weights it lists were not loaded. The logs further down come from that run, which is why they still name the DistilBERT class.

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-v2 were not used when initializing DistilBertForSequenceClassification: ['cls.predictions.decoder.bias', 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.1.output.LayerNorm.bias', ...
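
The tokenization step above truncates but does not pad, because DataCollatorWithPadding pads each batch separately. A minimal sketch of that idea in plain Python follows. `pad_batch` is a hypothetical stand-in for the collator, and both the token IDs and the padding ID of 0 are made-up values for illustration.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for three tweets of different lengths.
batch = pad_batch([[2, 15, 7, 3], [2, 9, 3], [2, 3]])
print(batch["input_ids"])       # [[2, 15, 7, 3], [2, 9, 3, 0], [2, 3, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0]]
```

Padding to the longest sequence in the batch, rather than always to 512, keeps batches of short tweets small and saves memory and compute.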

In [8]:
from sklearn.metrics import accuracy_score, f1_score

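def compute_metrics(pred):
    labels = pred.label_ids
    # Pick the class with the highest logit for each example
    preds = pred.predictions.argmax(-1)
    # Weighted F1 averages the per-class F1 scores by class frequency
    f1 = f1_score(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1}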
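
To show what the two metrics measure, here is a hand-rolled sketch of accuracy and weighted F1 in plain Python. It is meant as an illustration of the definitions, not a replacement for scikit-learn; the labels and predictions are made-up values.

```python
from collections import Counter

def accuracy(labels, preds):
    """Fraction of predictions that equal the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(labels)
    total = 0.0
    for cls, count in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * count
    return total / len(labels)

labels = [1, 1, 1, 0, 0, 1]
preds = [1, 0, 1, 0, 1, 1]
print(round(accuracy(labels, preds), 4))     # 0.6667
print(round(weighted_f1(labels, preds), 4))  # 0.6667
```

In this toy case class 1 scores an F1 of 0.75 and class 0 scores 0.5; weighting by their true counts (4 and 2) gives 2/3. On an imbalanced dataset the weighted average is dominated by the larger class, so it can hide poor performance on the smaller one.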

In [9]:
training_args = TrainingArguments(
    output_dir="./results_230303",
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    save_strategy='epoch',
    save_total_limit=1,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    no_cuda=False, # set to False to use the GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: token_type_ids, tweets. If token_type_ids, tweets are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 121281
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 75805
  Number of trainable parameters = 111207170
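
The "Total optimization steps" figure in the log follows from the dataset size, the batch size, and the number of epochs. A small sketch of the arithmetic, assuming a single device, no gradient accumulation, and a final partial batch that is kept:

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer updates for a run on one device without gradient accumulation."""
    # The last batch of an epoch may be smaller, but it still counts as a step.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs

steps = total_optimization_steps(num_examples=121281, batch_size=8, num_epochs=5)
print(steps)  # 75805
```

Each epoch takes 15161 steps, so halving the batch size to fit in memory would roughly double the number of steps per epoch.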




OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 10.76 GiB total capacity; 4.96 GiB already allocated; 83.69 MiB free; 5.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
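
One way to read the error above is to estimate how large a single attention score tensor becomes. The sketch below assumes 32-bit floats and one tensor of shape batch x heads x sequence x sequence per layer. Under those assumptions, a batch of 8 padded to 512 tokens needs exactly the 96 MiB that the failed allocation requested. The traceback does not say which tensor was being allocated, so this is a plausible reading, not a diagnosis.

```python
def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_value=4):
    """Memory of one layer's attention scores: batch x heads x seq x seq values."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

MIB = 1024 ** 2
for seq_len in (128, 256, 512):
    size = attention_scores_bytes(batch_size=8, num_heads=12, seq_len=seq_len)
    print(f"seq_len={seq_len:>3}: {size / MIB:.0f} MiB")
# seq_len=128: 6 MiB
# seq_len=256: 24 MiB
# seq_len=512: 96 MiB
```

The cost grows with the square of the sequence length and linearly with the batch size, so lowering MAX_LENGTH or per_device_train_batch_size are the most direct ways to reduce memory use.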

In [10]:
import torch
# Release cached GPU memory that PyTorch is no longer using
torch.cuda.empty_cache()
# A quick note: if you are in a Jupyter or Colab notebook, you need to restart the kernel after you hit `RuntimeError: CUDA out of memory`.