# Author Identification with DistilBERT

We implement a text classification algorithm for author identification with DistilBERT, using the embedding of its classification token (`[CLS]`) as the article representation.

Adapted from this notebook: https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379

In [1]:
from google.colab import drive
import os
drive.mount('/content/gdrive')

# Jean
# os.chdir('/content/gdrive/My Drive/NLP_Project_Author_identification')

# Victor
os.chdir('/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification')

Mounted at /content/gdrive


In [2]:
%%bash
# Logistics: install the required packages
pip -q install transformers
pip -q install datasets
pip -q install tqdm
pip -q install sentencepiece


In [3]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


# DistilBERT

## Parsing of the files

In [4]:
### We augment the training set with 25 articles per author taken from the test set

import glob
import numpy as np  # needed below for the train/test re-split
import pandas as pd

create_set = False

if create_set:
  train_df = pd.DataFrame(columns = ["Author", "Article"], dtype = str)
  for i, filename in enumerate(glob.iglob("/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/C50/C50train/*/*.txt")):
    if i % 100 == 0:
      print(i)
    author = filename.split("/")[8]
    with open(filename) as file:
      text = file.readlines()
    # text = text.replace("\\n\',", "").strip("]").replace("\\n", "").replace("\'", "")
    text = "".join(text)
    text = text.replace("\n", " ")
    if i == 1:
      text11 = text
      print(text)
    train_df.loc[i] = [author, text]

  test_df = pd.DataFrame(columns = ["Author", "Article"], dtype = str)
  for i, filename in enumerate(glob.iglob("/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/C50/C50test/*/*.txt")):
    if i % 100 == 0:
      print(i)
    author = filename.split("/")[8]
    with open(filename) as file:
      text = file.readlines()
    # text = text.replace("\\n\',", "").strip("]").replace("\\n", "").replace("\'", "")
    text = "".join(text)
    text = text.replace("\n", " ")
    if i == 1:
      text11 = text
      print(text)
    test_df.loc[i] = [author, text]

  # Next, we re-attribute some examples from the test to the train to get a 75%/25% split

  authors = test_df.Author.unique()
  test_df_final = pd.DataFrame(columns = ["Author", "Article"], dtype = str)

  for a in authors:
    sub_df = test_df[test_df['Author'] == a]
    mask = np.full(50, False)
    mask[:25] = True
    np.random.seed(34)
    np.random.shuffle(mask)
    train_df = pd.concat([train_df, sub_df.iloc[mask]])
    test_df_final = pd.concat([test_df_final, sub_df.iloc[~mask]])
  
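  train_df.to_csv('data/train_DistilBERT.csv',index=False)
  test_df_final.to_csv('data/test_DistilBERT.csv',index=False)
  # Keep only the held-out half as the test set, matching what the else branch loads
  test_df = test_df_final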

else:
  train_df = pd.read_csv('data/train_DistilBERT.csv')
  test_df = pd.read_csv('data/test_DistilBERT.csv')

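The re-attribution step in the cell above moves 25 of each author's 50 test articles into the training set, reusing the same shuffled positions for every author. The following is a minimal standard-library sketch of that idea, not the notebook's exact code: the function name `split_per_author`, the toy records, and the use of `random.Random` in place of NumPy are assumptions made for illustration.

```python
import random
from collections import defaultdict

def split_per_author(records, n_move, seed=34):
    """Move n_move records per author from a test pool into train.

    records: list of (author, article) pairs; hypothetical toy format.
    Returns (moved_to_train, kept_in_test).
    """
    by_author = defaultdict(list)
    for author, article in records:
        by_author[author].append(article)

    moved, kept = [], []
    for author in sorted(by_author):
        articles = by_author[author]
        # Re-seeding per author selects the same positions for every author,
        # mirroring the fixed seed inside the loop of the cell above.
        rng = random.Random(seed)
        chosen = set(rng.sample(range(len(articles)), n_move))
        for pos, article in enumerate(articles):
            target = moved if pos in chosen else kept
            target.append((author, article))
    return moved, kept

# Toy pool: 2 authors with 4 articles each, move 2 per author.
pool = [(a, f"{a}-article-{i}") for a in ("AuthorA", "AuthorB") for i in range(4)]
moved, kept = split_per_author(pool, n_move=2)
print(len(moved), len(kept))
```

Because every author contributes the same number of articles to each side, the resulting split stays balanced across authors.
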
In [5]:
train_df.iloc[0].Article

'British life assurance group Scottish Amicable on Thursday announced plans to shed its 170-year old mutual status in preparation for a 1.0 billion pounds ($1.68 billion) flotation on the London stock market in three to five years\' time. The group, based in Stirling, Scotland, said it is to use capital raised by the two-stage proposals to take advantage of opportunities for growth and increase its share of the life and pensions market. Policyholders are to receive an initial special bonus totalling 75 million pounds after demutualising on May 1 this year, based on the length of time policies have been held. At the top end of the scale, an individual with a 25-year policy maturing this April will receive 1,503 pounds. On flotation another, larger payout - expected to amount to at least 200 million pounds but possibly "substantially higher" - will be made. The move to seek a flotation is the second by a British life group, following plans announced last year by Norwich Union. Scottish A

## Transformers and tokenizer

In [6]:
### For now we truncate each article to its first 512 tokens, the longest input the pretrained positional embeddings support

import transformers

from transformers import DistilBertModel, DistilBertConfig

example = True

### Increasing max_position_embeddings could be a way to handle longer inputs, but this was not worked out here
# configuration = DistilBertConfig(max_position_embeddings=1024)

# Load a pretrained tokenizer with the CLASS.from_pretrained() function
# "cased" means the tokenizer is case-sensitive: "English" and "english" are treated as different
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-cased') # config=configuration


if example:
  context = train_df.iloc[0].Article

  print(context)

  context_ids = tokenizer.encode(context, truncation=True, max_length=512)
  print(context_ids)
  print(tokenizer.convert_ids_to_tokens(context_ids))

British life assurance group Scottish Amicable on Thursday announced plans to shed its 170-year old mutual status in preparation for a 1.0 billion pounds ($1.68 billion) flotation on the London stock market in three to five years' time. The group, based in Stirling, Scotland, said it is to use capital raised by the two-stage proposals to take advantage of opportunities for growth and increase its share of the life and pensions market. Policyholders are to receive an initial special bonus totalling 75 million pounds after demutualising on May 1 this year, based on the length of time policies have been held. At the top end of the scale, an individual with a 25-year policy maturing this April will receive 1,503 pounds. On flotation another, larger payout - expected to amount to at least 200 million pounds but possibly "substantially higher" - will be made. The move to seek a flotation is the second by a British life group, following plans announced last year by Norwich Union. Scottish Am

In [7]:
# Create index for the author, as well as dictionary to go from index to author
idx2author = list(train_df.Author.sort_values().unique()) # we sort values so that it's easy to re-find the order just in case
author2idx = {idx2author[i]: i for i in range(len(idx2author))}
n_authors = len(idx2author)
print(f"We have {n_authors} authors.")

We have 50 authors.


Now, let's process everything with the encoder.

In [8]:
import json
import random
from multiprocessing import Pool
from tqdm import tqdm, trange

create_set = False

def proc_line_init(tokenizer_for_authors):
    global tokenizer
    tokenizer = tokenizer_for_authors


# Preprocess one C50 data point
def proc_line(row, truncation = True, max_length = 512):
    article = row['Article']
    author = row['Author']

    article_ids = tokenizer.encode(article, verbose=False, truncation = truncation, max_length = max_length)

    author_id = author2idx[author]

    samp = {
        "Article": article,
        "Author": author,
        "article_ids": article_ids,
        "author_id": author_id,
    }

    return samp


# Preprocess corpus
def preproc():

    article_proc = list()

    for index, row in train_df.iterrows():
      article_proc.append(proc_line(row))

    json.dump(article_proc, open("/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/data/C50_train_enc_DistilBERT.json", 'w'))

    article_proc_test = list()
    for index, row in test_df.iterrows():
      article_proc_test.append(proc_line(row))

    json.dump(article_proc_test, open("/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/data/C50_test_enc_DistilBERT.json", 'w'))
    
    return article_proc, article_proc_test

if create_set:
  article_proc, article_proc_test = preproc()
else:
  article_proc = json.load(open("/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/data/C50_train_enc_DistilBERT.json", 'r'))
  article_proc_test = json.load(open("/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/data/C50_test_enc_DistilBERT.json", 'r'))

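With `truncation=True` and `max_length=512`, the encoder keeps only the beginning of each article, and the two special tokens count toward the limit. A rough sketch of that bookkeeping on plain Python lists, assuming BERT-style `[CLS]`/`[SEP]` framing; the helper `truncate_with_special_tokens` and the ids `101`/`102` are illustrative assumptions rather than values read from the tokenizer.

```python
def truncate_with_special_tokens(token_ids, max_length=512, cls_id=101, sep_id=102):
    """Frame token_ids as [CLS] ... [SEP], keeping at most max_length ids in total."""
    budget = max_length - 2  # two slots are reserved for [CLS] and [SEP]
    return [cls_id] + list(token_ids[:budget]) + [sep_id]

# 700 hypothetical word-piece ids standing in for a long article
long_article = list(range(1000, 1700))
encoded = truncate_with_special_tokens(long_article)
print(len(encoded))
```

Everything after the first 510 word pieces is dropped, so any stylistic cues late in an article are invisible to the model.
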
In [9]:
print(f"Train has {len(article_proc)} samples.")
print(f"Test has {len(article_proc_test)} samples.")

Train has 3750 samples.
Test has 1250 samples.


## Models

We define 2 models:
* `AuthorDiscoveringModel0` is the model based on the CLS token.
* `AuthorDiscoveringModel1` implements an attention layer on top of the DistilBERT embeddings.

In [10]:
import torch  # torch.bmm is used in AuthorDiscoveringModel1
import torch.nn as nn

class ModelOutputs:
    def __init__(self, author_logits=None, loss=None):
        self.author_logits = author_logits
        self.loss = loss

class AuthorDiscoveringModel0(nn.Module):

    def __init__(self, language_model=None, dropout=0.2):
        '''
        language_model: a pretrained transformer language model (here DistilBERT)
        dropout:        dropout rate applied to the [CLS] embedding
        '''
        super(AuthorDiscoveringModel0, self).__init__()

        self.author_outputs = nn.Linear(language_model.config.dim, n_authors)
        self.language_model = language_model
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input_ids=None, attention_mask=None, author=None):
        '''
        input_ids:          token ids of the padded articles, size (batch_size, seq_len)
        attention_mask:     1 for real tokens and 0 for padding, size (batch_size, seq_len)
        author:             label (idx) of the author of each article, size (batch_size,)
        '''
        
        lm_output = self.language_model(
            input_ids = input_ids,
            attention_mask = attention_mask
        )

        last_hidden_state = lm_output.last_hidden_state # size (batch_size, seq_len, hidden_size)
        # cls token stores a sentence-level embedding, can be found at index 0
        cls_token= last_hidden_state[:, 0, :] # size (batch_size, hidden_size)
        cls_token = self.dropout(cls_token)

        # the linear layer converts from size (batch_size, hidden_size) to size (batch_size, n_authors)
        author_logits = self.author_outputs(cls_token)

        total_loss = None

        if author is not None:

            loss_fct = nn.CrossEntropyLoss()
            
            total_loss = loss_fct(author_logits, author)
        
        return ModelOutputs(
            author_logits = author_logits,
            loss = total_loss
            )
        

In [11]:
class AuthorDiscoveringModel1(nn.Module):

    def __init__(self, language_model=None, dropout=0.2, att_dim = None):
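        '''
        language_model: a pretrained transformer language model (here DistilBERT)
        dropout:        dropout rate applied to the token embeddings
        att_dim:        size of the attention projection; defaults to the hidden size
        '''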
        super(AuthorDiscoveringModel1, self).__init__()

        self.author_outputs = nn.Linear(language_model.config.dim, n_authors)
        self.language_model = language_model
        self.dropout = nn.Dropout(dropout)

        if att_dim is None:
          self.attention = nn.Linear(language_model.config.dim, language_model.config.dim)
          self.context_vector = nn.Linear(language_model.config.dim, 1, bias=False)
        else:
          self.attention = nn.Linear(language_model.config.dim, att_dim)
          self.context_vector = nn.Linear(att_dim, 1, bias=False)

        self.tanh = nn.Tanh()
        self.softmax = nn.Softmax(dim = 1)
    
    def forward(self, input_ids=None, attention_mask=None, author=None):
        '''
        input_ids:          token ids of the padded articles, size (batch_size, seq_len)
        attention_mask:     1 for real tokens and 0 for padding, size (batch_size, seq_len)
        author:             label (idx) of the author of each article, size (batch_size,)
        '''
        
        lm_output = self.language_model(
            input_ids = input_ids,
            attention_mask = attention_mask
        )

        last_hidden_state = lm_output.last_hidden_state # size (batch_size, seq_len, hidden_size)
        last_hidden_state = self.dropout(last_hidden_state) # size (batch_size, seq_len, hidden_size)

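        att = self.tanh(self.attention(last_hidden_state)) # size (batch_size, seq_len, att_dim)
        att = self.context_vector(att).squeeze(2) # size (batch_size, seq_len)
        # note: attention_mask is not applied here, so padding positions also receive attention weight
        att = self.softmax(att) # attention weights over the sequence, size (batch_size, seq_len)

        # weighted sum of the token embeddings
        hidden_state = torch.bmm(last_hidden_state.permute(0,2,1), att.unsqueeze(2)) # size (batch_size, hidden_size, 1)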

        author_logits = self.author_outputs(hidden_state.squeeze(2))  # (batch_size, n_authors)

        total_loss = None

        if author is not None:

            loss_fct = nn.CrossEntropyLoss()
            
            total_loss = loss_fct(author_logits, author)
        
        return ModelOutputs(
            author_logits = author_logits,
            loss = total_loss
            )

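`AuthorDiscoveringModel1` replaces the `[CLS]` embedding with an attention-weighted average of all token embeddings: each token gets a score `v · tanh(W h + b)`, the scores are normalized with a softmax over the sequence, and the embeddings are summed with those weights. A framework-free sketch of that arithmetic for a single sequence follows; the function `attention_pool` and the toy weights are hypothetical, and real inputs would be batched tensors.

```python
import math

def attention_pool(hidden_states, W, b, v):
    """Attention-weighted average of token embeddings for one sequence.

    hidden_states: list of seq_len vectors, each of size hidden_size
    W, b:          attention projection (att_dim x hidden_size matrix, att_dim bias)
    v:             context vector of size att_dim
    """
    scores = []
    for h in hidden_states:
        proj = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
                for row, b_i in zip(W, b)]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, proj)))

    # Softmax over the sequence dimension (shifted by the max for stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    hidden_size = len(hidden_states[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, hidden_states))
              for j in range(hidden_size)]
    return pooled, weights

# Toy example: 3 tokens, hidden_size = 2, att_dim = 2
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]
pooled, weights = attention_pool(tokens, W, b, v)
print(weights, pooled)
```

In this toy case the third token scores highest, so it contributes most to the pooled vector.
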
## Training the model

In [39]:
# Initialize the model and use GPU - can be changed from model 0 to model 1 here
lm_pretrained = transformers.AutoModel.from_pretrained('distilbert-base-cased')
model_name = "model1"


if model_name == "model0":
  model = AuthorDiscoveringModel0(lm_pretrained)
elif model_name == "model1":
  model = AuthorDiscoveringModel1(lm_pretrained)
else:
  raise ValueError(f"Unknown model name {model_name!r}: expected 'model0' or 'model1'")
model = model.cuda()

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Then we define the training hyper-parameters, the optimizer, and the learning rate scheduler. The [documentation](https://huggingface.co/transformers/main_classes/optimizer_schedules.html#transformers.get_linear_schedule_with_warmup) of `get_linear_schedule_with_warmup` explains how linear learning rate scheduling influences the learning process.

In [40]:
import torch

# Hyper-parameters: you could try playing with different settings
num_epochs = 15
learning_rate = 3e-5
weight_decay = 1e-5
eps = 1e-6
batch_size = 32
warmup_rate = 0.05
art_max_length = 512

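# Calculating the total number of optimization steps and the number of warmup steps
import math
num_training_cases = len(article_proc)
t_total = math.ceil(num_training_cases / batch_size) * num_epochs  # batches per epoch (the last one may be partial) x epochs
ext_warmup_steps = int(warmup_rate * t_total)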

# Initializing an AdamW optimizer
ext_optim = torch.optim.AdamW(model.parameters(), lr=learning_rate,
                              eps=eps, weight_decay=weight_decay)

# Initializing the learning rate scheduler [details are in the BERT paper]
ext_sche = transformers.get_linear_schedule_with_warmup(
    ext_optim, num_warmup_steps=ext_warmup_steps, num_training_steps=t_total
)

print("***** Training Info *****")
print("  Num examples = %d" % num_training_cases)
print("  Num Epochs = %d" % num_epochs)
print("  Batch size = %d" % batch_size)
print("  Total optimization steps = %d" % t_total)

***** Training Info *****
  Num examples = 3750
  Num Epochs = 15
  Batch size = 32
  Total optimization steps = 1770

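The scheduler configured above raises the learning rate linearly from 0 to `learning_rate` over the warmup steps and then decays it linearly to 0 at the last step. The sketch below reproduces that multiplier in plain Python to make the shape concrete; `linear_warmup_factor` is an illustrative helper written for this note, not a function from `transformers`.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimization step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Values from the cell above: 1770 total steps, 5% warmup, base rate 3e-5
t_total = 1770
warmup_steps = int(0.05 * t_total)
base_lr = 3e-5
for step in (0, warmup_steps // 2, warmup_steps, t_total // 2, t_total):
    print(step, base_lr * linear_warmup_factor(step, warmup_steps, t_total))
```

The peak learning rate is reached exactly at the end of warmup, after which each step uses a slightly smaller rate than the one before.
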

In [41]:
def gather_batch(batch):
    art_batch  = [x['Article'] for x in batch]
    author_batch = [x['Author'] for x in batch]
    author_id_batch = [x['author_id'] for x in batch]

    return art_batch, author_batch, author_id_batch

In [42]:
def vectorize_batch(batch, tokenizer):
    art_batch, author_batch, author_id_batch = gather_batch(batch)

    # Encode the article
    art_encode = tokenizer.batch_encode_plus(
        art_batch,
        max_length = art_max_length,
        truncation = True,
        padding = 'longest',
        return_attention_mask = True,
        return_tensors = 'pt'
    )

    # Move the training batch to GPU
    art_ids = art_encode['input_ids'].cuda()
    art_attn_mask = art_encode['attention_mask'].cuda()

    # Move the author labels to the GPU
    author_id_batch = torch.LongTensor(author_id_batch).cuda()

    return art_ids, art_attn_mask, author_id_batch

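`padding='longest'` pads every article in a batch to the length of the longest one, and the attention mask marks which positions hold real tokens. A small sketch of that step on plain lists, assuming a pad id of 0; `pad_batch` is a hypothetical helper, while the notebook itself relies on the tokenizer to do this.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build the matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Two toy sequences of different lengths
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Padding to the longest article in the batch, rather than always to 512, keeps batches of short articles cheap to process.
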
In [43]:
Train = False
epoch_to_load = 14 # epoch starts at 0 so 14 means after 15 epochs

if Train:
  model.train()
  max_grad_norm = 1

  step_id = 0
  for epoch in range(num_epochs):

      random.shuffle(article_proc)

      for i in range(0, num_training_cases, batch_size):

          batch = article_proc[i: i + batch_size]
          art_ids, art_attn_mask, author_id_batch = vectorize_batch(batch, tokenizer)

          model.zero_grad() # Does the same as ext_optim.zero_grad()
          
          # Get the model outputs, including logits and losses
          # stored as a ModelOutputs object
          outputs = model(
              input_ids = art_ids, 
              attention_mask = art_attn_mask, 
              author = author_id_batch
          )

          # Back-propagate the loss signal and clip the gradients
          loss = outputs.loss.mean()
          loss.backward()
          torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

          # Update neural network parameters and the learning rate
          ext_optim.step()
          ext_sche.step() # Update learning rate for better convergence


          if step_id % 100 == 0:
              print(f'At step {step_id}, the training loss = {loss.item():.4f}')

          step_id += 1

      torch.save(model.state_dict(), f"/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/models/author_ID_{model_name}_{epoch}epo.pt")

  print('Finished Training')

else:
  model.load_state_dict(torch.load(f"/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/models/author_ID_{model_name}_{epoch_to_load}epo.pt"))

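The call to `clip_grad_norm_` in the loop rescales all gradients together whenever their global L2 norm exceeds `max_grad_norm`. The following sketch shows the idea on flat Python lists; `clip_by_global_norm` is an illustrative stand-in, and details such as the small epsilon are assumptions rather than a description of PyTorch internals.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Two toy gradient vectors whose joint norm is 5
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Since every gradient is multiplied by the same factor, the update direction is preserved and only its length is limited.
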
## Testing the model

For each article, we predict the author with the largest logit.

In [44]:
def logits_to_author(author_logits):
  '''
  Input sizes:
      author_logits.size() = (batch_size, n_authors)
  Output sizes:
      author_id.size() = (batch_size,)
  '''

  # index of the largest logit for each article
  author_id = torch.argmax(author_logits, dim=1)

  return author_id

def accuracy_metric(author_pred_list, author_gt_list):
  return sum([author_pred_list[i] == author_gt_list[i] for i in range(len(author_pred_list))])/len(author_gt_list)

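Prediction picks, for each article, the author index with the largest logit, and accuracy is the fraction of articles where that index matches the label. A tiny list-based sketch with made-up logits for three authors; `predict_authors` and `accuracy` are illustrative names, not functions from the notebook.

```python
def predict_authors(logits):
    """Return the index of the largest logit for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Made-up logits for 4 articles and 3 authors
toy_logits = [[0.1, 2.3, -1.0],
              [1.5, 0.2, 0.3],
              [0.0, 0.1, 0.4],
              [0.9, 0.8, 0.7]]
toy_labels = [1, 0, 2, 1]
preds = predict_authors(toy_logits)
print(preds, accuracy(preds, toy_labels))
```

No softmax is needed at prediction time, because the softmax is monotonic and leaves the position of the maximum unchanged.
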
In [45]:
model.eval()

def eval_model(article_proc_to_use):
  # Prepare the testing set for evaluation
  num_test_cases = len(article_proc_to_use)

  eval_batch_size = 64

  # `author_pred_list` stores the predicted authors
  # in the same order as the articles of the evaluated set
  author_pred_list = []
  author_gt_list = [x['Author'] for x in article_proc_to_use]
  article_text = [x['Article'] for x in article_proc_to_use]
  # If articles are later split into sentences, they would need to be joined again here.

  for i in range(0, num_test_cases, eval_batch_size):
      eval_batch = article_proc_to_use[i: i + eval_batch_size]
      art_batch_test = [x['Article'] for x in eval_batch]

      # Encode the article
      art_encode_test = tokenizer.batch_encode_plus(
          art_batch_test,
          max_length = art_max_length,
          truncation = True,
          padding = 'longest',
          return_attention_mask = True,
          return_tensors = 'pt'
      )

      # Move the testing batch to GPU
      art_ids_test = art_encode_test['input_ids'].cuda()
      art_attn_mask_test = art_encode_test['attention_mask'].cuda()

      with torch.no_grad():
          outputs = model(
              art_ids_test,
              attention_mask = art_attn_mask_test
          )

      author_logits = outputs.author_logits

      author_pred = logits_to_author(author_logits)

      # store the predicted author indices (flattened, so a final batch of size 1 also works)
      author_pred_list = author_pred_list + author_pred.view(-1).tolist()

  # Translate indexes to author names
  author_pred_list = [idx2author[author_pred_list[i]] for i in range(len(author_pred_list))]

  return author_pred_list, author_gt_list, accuracy_metric(author_pred_list, author_gt_list), article_text

# Print the evaluation results

a_pred_train, a_gt_train, accuracy_train, article_text_train = eval_model(article_proc)
a_pred_test, a_gt_test, accuracy_test, article_text_test = eval_model(article_proc_test)

print(f"Training accuracy is {accuracy_train}")
print(f"Testing accuracy is {accuracy_test}")

Training accuracy is 1.0
Testing accuracy is 0.7752


In [46]:
model_output = pd.DataFrame({"author_pred": a_pred_train + a_pred_test,
                             "author_gt": a_gt_train + a_gt_test,
                             # 1 for training rows, 0 for test rows
                             "train": len(a_pred_train)*[1] + len(a_pred_test)*[0],
                             "article_text": article_text_train + article_text_test})


model_output.to_csv(f"/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/Models_outputs/output_author_ID_{model_name}.csv")


## Evaluating the fine-tuned model per epoch

In [48]:
### We load the checkpoint saved after each epoch in turn, compute its accuracy, and append it to the results table

model_accuracy = pd.DataFrame({"model": []
                              ,"epoch": [] 
                              ,"train_accuracy": []
                              ,"test_accuracy": []
                               })

# We need to change the model at 2 points: model_name and in the definition of the model
model_name = "model1"

lm_pretrained = transformers.AutoModel.from_pretrained('distilbert-base-cased')
model = AuthorDiscoveringModel1(lm_pretrained)
model = model.cuda()

for epoch_to_load in range(15):
  # Load the corresponding model
  model.load_state_dict(torch.load(f"/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/models/author_ID_{model_name}_{epoch_to_load}epo.pt"))

  model.eval()

  a_pred_train, a_gt_train, train_accuracy, _ = eval_model(article_proc)
  a_pred_test, a_gt_test, test_accuracy, _ = eval_model(article_proc_test)

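  # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
  new_row = pd.DataFrame([{'model': model_name, "epoch": epoch_to_load, "train_accuracy": train_accuracy, "test_accuracy": test_accuracy}])
  model_accuracy = pd.concat([model_accuracy, new_row], ignore_index=True)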

  if epoch_to_load % 3 == 0:
    print(epoch_to_load)

model_accuracy.to_csv(f"/content/gdrive/MyDrive/6.864 - NLP/NLP_Project_Author_identification/Results/{model_name}_model_accuracy.csv")

0
3
6
9
12


In [49]:
model_accuracy

| | model | epoch | train_accuracy | test_accuracy |
|---|---|---|---|---|
| 0 | model1 | 0.0 | 0.5304 | 0.4928 |
| 1 | model1 | 1.0 | 0.6992 | 0.6328 |
| 2 | model1 | 2.0 | 0.783733 | 0.6744 |
| 3 | model1 | 3.0 | 0.88 | 0.7192 |
| 4 | model1 | 4.0 | 0.9104 | 0.7352 |
| 5 | model1 | 5.0 | 0.9544 | 0.7496 |
| 6 | model1 | 6.0 | 0.969333 | 0.7552 |
| 7 | model1 | 7.0 | 0.991733 | 0.7584 |
| 8 | model1 | 8.0 | 0.996533 | 0.7632 |
| 9 | model1 | 9.0 | 0.997867 | 0.7632 |
