<a href="https://colab.research.google.com/github/alexlimatds/fact_extraction/blob/main/AILA2020/FACTS_AILA_data_augmentation_mixup_SBERT_LaBSE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Mixup Data Augmentation

In this notebook we use the Mixup data augmentation approach to create additional training data from the AILA dataset. Each augmented vector is created from two source vectors that belong to different classes.

The feature vectors are created with an SBERT/LaBSE model.
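
For reference, the interpolation defined in the mixup paper (Zhang et al., 2018) combines two samples $(x_i, y_i)$ and $(x_j, y_j)$ as follows, where $y_i$ and $y_j$ are one-hot label vectors and $\alpha > 0$ is a hyperparameter:

$$
\hat{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \hat{y} = \lambda y_i + (1 - \lambda) y_j, \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha)
$$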

#### Notebook parameters

In [1]:
model_id = 'sentence-transformers/LaBSE'
model_reference = 'SBERT-LaBSE'

#### Installing dependencies

In [2]:
%pip install -U sentence-transformers



#### Loading dataset

In [3]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
g_drive_dir = '/content/gdrive/MyDrive/'
dataset_dir = 'fact_extraction_AILA/'

Mounted at /content/gdrive


In [4]:
# -p creates parent directories as needed and does not fail if they already exist
!mkdir -p data/train
!tar -xf {g_drive_dir}{dataset_dir}train.tar.xz -C data/train

train_dir = 'data/train/'


In [5]:
import pandas as pd
from os import listdir
import csv

def read_docs(dir_name):
  """
  Read the docs in a directory.
  Params:
    dir_name : the directory that contains the documents.
  Returns:
    A dictionary whose keys are the names of the read files and the values are 
    pandas dataframes. Each dataframe has the sentence and label columns.
  """
  docs = {} # key: file name, value: dataframe with sentences and labels
  for f in listdir(dir_name):
    df = pd.read_csv(
        dir_name + f, 
        sep='\t', 
        quoting=csv.QUOTE_NONE, 
        names=['sentence', 'label'])
    docs[f] = df
  return docs

docs_train = read_docs(train_dir)

print(f'TRAIN: {len(docs_train)} documents read.')

TRAIN: 50 documents read.
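
The role of `quoting=csv.QUOTE_NONE` in `read_docs` can be illustrated with the standard `csv` module, which defines the same constant. This is a minimal sketch with two made-up lines in the assumed `sentence<TAB>label` layout; the sample sentences are hypothetical and not taken from the AILA files. With quoting disabled, a stray double quote in a sentence is kept as an ordinary character instead of opening a quoted field.

```python
import csv
import io

# Two hypothetical lines in the assumed "sentence<TAB>label" layout.
# The first sentence starts with an unbalanced double quote.
sample = '"The appellant was absent\tFacts\nThe appeal is dismissed.\tOther\n'

# QUOTE_NONE tells the parser to treat quote characters as plain text
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = [row for row in reader]

for sentence, label in rows:
  print(f'{label}: {sentence}')
```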


#### Splitting documents according to folds

In [6]:
# Reading the file that lists the train and test documents of each fold
train_files_by_fold = {}  # Key: fold ID, value: file names (list of string)

df_folds = pd.read_csv(
  g_drive_dir + dataset_dir + 'train_docs_by_fold.csv', 
  sep=';', 
  names=['fold id', 'train', 'test'], 
  header=0)
for idx, row in df_folds.iterrows():
  train_files_by_fold[row['fold id']] = row['train'].split(',')
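
A minimal sketch of the file layout that the cell above appears to expect: one row per fold, fields separated by `;`, with the `train` and `test` fields holding comma-separated file names. The fold IDs and file names below are hypothetical, and the standard `csv` module stands in for `pd.read_csv`.

```python
import csv
import io

# Hypothetical content of train_docs_by_fold.csv (header row + one row per fold)
sample = (
  'fold id;train;test\n'
  '0;d1.txt,d2.txt,d3.txt;d4.txt\n'
  '1;d1.txt,d2.txt,d4.txt;d3.txt\n'
)

train_by_fold = {}  # key: fold ID, value: list of train file names
for row in csv.DictReader(io.StringIO(sample), delimiter=';'):
  train_by_fold[int(row['fold id'])] = row['train'].split(',')

print(train_by_fold[0])
```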


#### SBERT model

In [7]:
from sentence_transformers import SentenceTransformer

sent_encoder = SentenceTransformer(model_id)

#### Encoding sentences

In [8]:
encoded_docs_train = {} # key: document ID, value: encoded sentences (PyTorch matrix)
for doc_id, doc_df in docs_train.items():
  encoded_docs_train[doc_id] = sent_encoder.encode(doc_df['sentence'].to_list(), convert_to_tensor=True)

#### Data augmentation functions

In [9]:
import numpy as np
from torch.distributions.beta import Beta

def mixup(xi, xj, yi, yj, alpha):
  """
  Mixup function: generates synthetic vectors from pairs of source vectors. For details,
  check the mixup paper (Zhang et al., 2018).
  Arguments:
    xi : the first source vectors (PyTorch matrix, one vector per row).
    xj : the second source vectors (PyTorch matrix with the same shape as xi).
    yi : the one-hot-encoded label vectors of the first source vectors.
    yj : the one-hot-encoded label vectors of the second source vectors.
    alpha : hyperparameter of the beta distribution used to sample the lambda values.
  Returns:
    The generated synthetic vectors.
    The label vectors of the generated synthetic vectors.
  """
  # Sample one lambda per pair of source vectors, with shape (n, 1)
  b = Beta(alpha, alpha)
  lam = b.rsample(sample_shape=(xi.shape[0], 1))
  # Broadcast lambda over the feature dimension: x_hat = lam * xi + (1 - lam) * xj
  lam_x = lam.broadcast_to(xi.shape).to(xi.device)
  x_hat = lam_x * xi + (1 - lam_x) * xj
  # Broadcast lambda over the label dimension: y_hat = lam * yi + (1 - lam) * yj
  lam_y = lam.broadcast_to(yi.shape).to(yi.device)
  y_hat = lam_y * yi + (1 - lam_y) * yj
  return x_hat, y_hat
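
The broadcasting in `mixup` can be checked on toy data. The sketch below is a NumPy stand-in for the PyTorch code, not the function itself: all source vectors of the first class are ones and all of the second class are zeros, so each synthetic row simply equals its own lambda. The array sizes and the `rng` seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n, dim, alpha = 4, 3, 0.5

xi = np.ones((n, dim))            # toy "Facts" vectors
xj = np.zeros((n, dim))           # toy "Other" vectors
yi = np.tile([1.0, 0.0], (n, 1))  # one-hot targets of the first class
yj = np.tile([0.0, 1.0], (n, 1))  # one-hot targets of the second class

lam = rng.beta(alpha, alpha, size=(n, 1))  # one lambda per pair
x_hat = lam * xi + (1 - lam) * xj          # (n, 1) broadcasts over (n, dim)
y_hat = lam * yi + (1 - lam) * yj          # soft labels [lam, 1 - lam]

print(x_hat.shape, y_hat.shape)
```

Because the soft labels are a convex combination of two one-hot vectors, each row of `y_hat` sums to 1.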

In [10]:
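import torch
import random
random.seed(0)        # seeds the random selection of source sentences
torch.manual_seed(0)  # seeds the Beta sampling of the lambda values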

def data_by_class(doc_id_list):
  """
  Returns, grouped per class, the vector embeddings of the sentences in a set of 
  documents.
  Params:
    doc_id_list : a list of document IDs (list of strings).
  Returns:
    The embeddings of the Facts class (PyTorch matrix).
    The embeddings of the Other class (PyTorch matrix).
  """
  sent_embeddings = None
  labels = None
  for doc_id in doc_id_list:
    if sent_embeddings is None:
      sent_embeddings = encoded_docs_train[doc_id]
      labels = docs_train[doc_id]['label'].to_numpy()
    else:
      sent_embeddings = torch.vstack((sent_embeddings, encoded_docs_train[doc_id]))
      labels = np.concatenate((labels, docs_train[doc_id]['label'].to_numpy()))
  
  facts_idx = np.nonzero(labels == 'Facts')[0]
  facts_embeddings = sent_embeddings[facts_idx,:]
  other_idx = np.nonzero(labels == 'Other')[0]
  other_embeddings = sent_embeddings[other_idx,:]

  return facts_embeddings, other_embeddings

def augment_data(alpha, doc_id_list):
  """
  Generates a set of synthetic embedding vectors from the sentences in a provided set of 
  documents. The sentences are selected at random. It uses sentences of different 
  classes to generate a synthetic vector.
  Params:
    alpha: hyperparameter of the beta distribution to be used with the mixup algorithm.
    doc_id_list: a list with the IDs of the source documents (list of strings).
  Returns:
    The generated feature vectors (PyTorch tensor).
    The generated target vectors (PyTorch tensor).
  """
  N_synthetic = 3500 # number of synthetic vectors to be generated
  facts_embeddings, other_embeddings = data_by_class(doc_id_list)
  # random indexes for the Facts class
  idx_i = random.choices(range(facts_embeddings.shape[0]), k=N_synthetic)
  # random indexes for the Other class
  idx_j = random.choices(range(other_embeddings.shape[0]), k=N_synthetic)
  # getting source vectors to generate the augmented vectors
  x_i = facts_embeddings[idx_i, :]
  x_j = other_embeddings[idx_j, :]
  y_i = torch.zeros(x_i.shape[0], 2)
  y_i[:, 0] = 1   # targets of the Facts class: [1, 0]
  y_j = torch.zeros(x_j.shape[0], 2)
  y_j[:, 1] = 1  # targets of the Other class: [0, 1]
  # data augmentation
  X_aug, Y_aug = mixup(x_i, x_j, y_i, y_j, alpha)

  return X_aug, Y_aug
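
The sampling step in `augment_data` draws indexes with replacement, so the number of synthetic vectors can exceed the number of sentences in either class. A minimal sketch with made-up class sizes (the numbers below are hypothetical, not the AILA counts):

```python
import random

random.seed(0)

n_facts, n_other = 5, 8  # hypothetical class sizes
n_synthetic = 20         # more pairs than sentences in either class

# random.choices samples with replacement, so an index may repeat
idx_i = random.choices(range(n_facts), k=n_synthetic)
idx_j = random.choices(range(n_other), k=n_synthetic)

# each pair combines one Facts index with one Other index
pairs = list(zip(idx_i, idx_j))
print(pairs[:3])
```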


#### Generating and writing the augmented data

In [11]:
import os

def write_vectors(X_hat, Y_hat, file_prefix):
  """
  Saves a set of synthetic vectors. The vectors are saved as NumPy arrays (.npy files).
  Arguments:
    X_hat : the set of synthetic vectors (PyTorch matrix).
    Y_hat : the label vectors of the X_hat vectors (PyTorch matrix).
    file_prefix : the prefix of the generated files' names.
  """
  output_dir = f'{g_drive_dir}{dataset_dir}mixup_data_{model_reference}/'
  os.makedirs(output_dir, exist_ok=True)  # create the output directory if it is missing
  np.save(output_dir + file_prefix + '_features.npy', X_hat.detach().cpu().numpy())
  np.save(output_dir + file_prefix + '_targets.npy', Y_hat.detach().cpu().numpy())

alphas = [0.1, 0.5, 1.0, 4.0]
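
The `alpha` values control how strongly the two source vectors are blended. For a symmetric Beta(alpha, alpha) distribution, small values push lambda toward 0 or 1 (the synthetic vector stays close to one of its sources), while larger values concentrate lambda around 0.5. The sketch below illustrates this with the standard library's `random.betavariate`, used here as a stand-in for the PyTorch sampler; the sample size and seed are arbitrary.

```python
import random

random.seed(0)
alphas = [0.1, 0.5, 1.0, 4.0]
n_samples = 10000

mean_dist = {}  # key: alpha, value: mean distance of lambda from 0.5
for a in alphas:
  lams = [random.betavariate(a, a) for _ in range(n_samples)]
  mean_dist[a] = sum(abs(lam - 0.5) for lam in lams) / n_samples
  print(f'alpha={a}: mean |lambda - 0.5| = {mean_dist[a]:.3f}')
```

The mean distance from 0.5 should shrink as `alpha` grows, which is one way to read the four settings: from nearly unmixed vectors at 0.1 to close to evenly mixed ones at 4.0.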

In [12]:
# Generating augmented data with the whole training set
for a in alphas:
  X_hat, Y_hat = augment_data(a, docs_train.keys())
  file_prefix = f'alpha_{str(a).replace(".", "_")}'
  write_vectors(X_hat, Y_hat, file_prefix)

In [13]:
# Generating augmented data for cross-validation
for a in alphas:
  for fold_id, doc_ids in train_files_by_fold.items():
    X_hat, Y_hat = augment_data(a, doc_ids)
    file_prefix = f'alpha_{str(a).replace(".", "_")}_fold_{fold_id}'
    write_vectors(X_hat, Y_hat, file_prefix)
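
For reference, the loops above produce file names of the following shape. The sketch only rebuilds the name strings; the fold ID `0` is a hypothetical value, since the actual IDs come from `train_docs_by_fold.csv`.

```python
alphas = [0.1, 0.5, 1.0, 4.0]
fold_id = 0  # hypothetical fold ID

# the decimal point is replaced so that the prefix contains no dots
prefixes = [f'alpha_{str(a).replace(".", "_")}' for a in alphas]
fold_files = [f'{p}_fold_{fold_id}_features.npy' for p in prefixes]

print(prefixes)
print(fold_files[0])
```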

#### References

- Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, Saptarshi Ghosh, and Adam Wyner. 2019. **Identification of Rhetorical Roles of Sentences in Indian Legal Judgments**. In Proc. International Conference on Legal Knowledge and Information Systems (JURIX).
- Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. 2018. **mixup: Beyond Empirical Risk Minimization**. In Proc. International Conference on Learning Representations (ICLR).