# Part 1: Native Language Identification

The goal is to train a neural network on the Cambridge FCE dataset to predict the native language of the author of some text. This will be a classification task: the six languages/countries we focus on are Chinese (China), French (France), German (Germany), Greek (Greece), Portuguese (Portugal), and Spanish (Spain).

Instead of feeding in the raw documents directly into the neural network, we want to use BERT embeddings as the input.

We will compare the neural network's performance with some baseline models: logistic regression and naive bayes.

More information about the dataset can be found in the GitHub repo.

**Acknowledgement: Much of the base code is from LIN 371 lecture demos, Fall 2024, Jessy Li & Hongli Zhan.**

## Libraries, Data, and connecting to GPU

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
### Import relevant libraries

import numpy as np
import pandas as pd
import random
from tqdm import tqdm

# Pytorch
import torch
import torch.nn as nn
import torch.nn.functional as f
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# BERT
from transformers import BertTokenizer, BertModel

# sklearn
from sklearn.metrics import f1_score
from sklearn.datasets import load_files
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

#from pprint import pprint

In [3]:
# Load data

data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/LIN 371/Project/fce_6languages.csv")

In [7]:
### TODO: This is just for testing. Until one-hot encoded version comes, test on binary classification outcome (Chinese = 0 or not = 1)

blah = data["native_language_encoded"]

data["native_language_encoded"] = blah.where(blah == 0, other = 1)

In [8]:
print(data.shape)
data.head()

(1243, 85)


Unnamed: 0,filename,native_language,native_language_encoded,age,total_score,question_number,exam_score,answer,mistakes_tag,AG,...,UJ,UN,UP,UQ,UT,UV,UY,W,X,unknown
0,0100_2000_6/doc2.xml,Chinese,0,16-20,24.0,1,3.2,Dear Mr Ryan.\nThanks for you letter. I am so ...,"['RP', 'DD', 'RJ', 'MD', 'TV', 'RT', 'MD', 'UT...",0,...,0,0,0,0,1,0,0,0,0,0
1,0100_2000_6/doc2.xml,Chinese,0,16-20,24.0,2,2.3,As our class is going to mark a short video ab...,"['RV', 'TV', 'UQ', 'MD', 'RP', 'MT', 'FN', 'MT...",0,...,0,0,0,1,0,1,0,0,0,0
2,0100_2000_6/doc4.xml,Chinese,0,21-25,17.0,1,2.2,"Dear Madam Helen Ryan,\nI have received your l...","['RN', 'TV', 'RY', 'AGV', 'MA', 'W', 'FN', 'R'...",0,...,0,2,1,0,3,2,0,1,0,0
3,0100_2000_6/doc4.xml,Chinese,0,21-25,17.0,3,2.1,I agree! Sometimes shopping is not always enjo...,"['UT', 'U', 'RP', 'MP', 'L', 'FN', 'RT', 'UT',...",0,...,0,0,5,0,3,1,0,0,0,0
4,0100_2000_6/doc24.xml,Spanish,1,16-20,32.0,1,3.3,"Dear Mrs. Ryan,\nFirst of all, I would like to...","['MT', 'RT', 'TV', 'RT', 'MD', 'AGN', 'RP', 'R...",0,...,0,0,0,0,0,0,1,0,0,0


There are 1243 rows. The relevant columns are `native_language_encoded` and `answer`. Because this is a multi-class classification task, we need to one-hot encode the `native_language_encoded` column.

TODO: Change to multi class after getting onehot encoded ver

Then, what we are actually using for training is actually [this set of columns] and `answer`.

Because the task is complex, we need to connect to one of the GPUs offered (for free) by Google Colab, which has better parallel computing capabilities.

In [9]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

## Dataset processing

We'll first create training, testing, and validation sets. Here is a nice function written by Anjie to help.

One thing to note: the target will be a matrix with 5 columns (for the 6 native languages we are trying to classify to), instead of 1 column (which corresponds with the binary classification tasks that we have been demo-ing in class).

In [10]:
def train_test_split_by_students(df, train_size=0.8, test_size=0.1, valid_size=0.1):
    # By default, we set aside 80% of data for training, 10% for testing, and 10% for validation
    assert train_size + test_size + valid_size == 1

    # Split students
    students = np.unique(df['filename'])
    students_train, students_temp = train_test_split(students, test_size=(test_size + valid_size), random_state=1000)
    students_test, students_valid = train_test_split(students_temp, test_size=(valid_size / (test_size + valid_size)), random_state=1000)
    print(f"Number of Students in Train: {students_train.size}, Test: {students_test.size}, Validation: {students_valid.size}")

    # Split the data by students
    train = df[df['filename'].isin(students_train)]
    test = df[df['filename'].isin(students_test)]
    valid = df[df['filename'].isin(students_valid)]

    # Input and Output

    ### TODO: Change target to correspond with one-hot encoded representations of the output (native language)
    features = ['answer']
    target = ['native_language_encoded']
    X_train, X_test, X_valid, y_train, y_test, y_valid = train[features], test[features], valid[features], train[target], test[target], valid[target]
    print(f"Training Size: {len(X_train)}, Test Size: {len(X_test)}, Validation Size: {len(X_valid)}")

    return X_train, X_test, X_valid, y_train, y_test, y_valid

In [11]:
# Call train_test_split_by_students() to create our training, testing, and validation sets

X_train, X_test, X_valid, y_train, y_test, y_valid = train_test_split_by_students(df = data, train_size=0.8, test_size=0.1, valid_size=0.1)

Number of Students in Train: 498, Test: 62, Validation: 63
Training Size: 993, Test Size: 124, Validation Size: 126


We'll use DataLoader to do batching, which is necessary for something complicated like BERT. This helps process examples in parallel, and it's more efficient than running each example one by one.

Recall for our project: we have the data in a .csv file, and the input (text) is in a column called `answer`, while the output label (native language) is in a column called `native_language`.

In [12]:
# Use BERT's tokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [55]:
### Prepare data for DataLoader with a custom Dataset class

# TODO: Check how necessary this is, can we just put it in the dataloader?

class Dataset(torch.utils.data.Dataset):
  def __init__(self, X, y):
    self.texts = [bert_tokenizer(text, padding='max_length', max_length = 512, truncation=True,
                            return_tensors="pt")
                            for text in X]
    self.labels = y

  def __len__(self):
    return len(self.labels)

  def get_batch_labels(self, idx):
    # Fetch a batch of labels
    return np.array(self.labels[idx])

  def get_batch_texts(self, idx):
    # Fetch a batch of inputs
    return self.texts[idx]

  def __getitem__(self, idx):
    batch_texts = self.get_batch_texts(idx)
    batch_y = self.get_batch_labels(idx)
    return batch_texts, batch_y

In [56]:
# Make instance of Dataset for each of train, test, and val

train_dataset = Dataset(X_train, y_train)
test_dataset = Dataset(X_test, y_test)
valid_dataset = Dataset(X_valid, y_valid)

Now we'll prepare the data for training with DataLoader (https://pytorch.org/tutorials/beginner/basics/data_tutorial.html), which will allow us to iterate through the data in batches.

In [57]:
# Specify a batch size for training
batch_size = 11

# Convert Dataset to DataLoader with specified batch size and shuffling
def dataset_to_loader(dataset):
  return DataLoader(dataset, batch_size = batch_size, shuffle = True)

In [58]:
# Make instance of DataLoader for each of train, test, and val

train_loader = dataset_to_loader(train_dataset)
test_loader = dataset_to_loader(test_dataset)
valid_loader = dataset_to_loader(valid_dataset)

## Model base class

We will customize models from already available base model classes. The `nn.Module` base class from Pytorch is the neural network base class, but we need to add a layer for being able to use BERT embeddings for the CLS token as the model's input.

Above, we set the max length of sequences to be 512, so 512 is the input dimension of the network (each embedding has 512 dimensions). We don't need to have additional functions for padding, etc. because the BERT tokenizer already takes care of it.

#### BERT model for embeddings

In [59]:
### TESTING VERSION FOR BINARY, DELETE LATER (THE DIMENSIONS ARE DIFFERENT)

class feedforward_nn_bert(nn.Module):

  def __init__(self, dropout = 0.2, freeze_layer_count = 6):
    super().__init__()

    # Initialize a BERT model as an attribute to help get embeddings
    self.bert = BertModel.from_pretrained('bert-base-uncased')

    # Regularization technique: dropout
    self.dropout = nn.Dropout(dropout)

    # Layers to the neural network
    ### TODO: Fix this so it has the right number of dimensions
    self.fc1 = nn.Linear(512, 10)
    self.fc2 = nn.Linear(10, 1)    # We have 6 languages to classify to


  def forward(self, input_id, mask):

    # First step in the forward pass: get text-level embeddings of the CLS token
    _, CLS = self.bert(input_ids = input_id, attention_mask = mask, return_dict = False)

    # Regularization technique: dropout
    x = self.dropout(CLS)

    ### Layers to the neural network
    ### Softmax activation function ensures probabilities add up to 1

    # First layer
    x = self.fc1(x)
    x = f.relu()

    # Output layer: multi-class classification
    x = self.fc2(x)
    x = f.sigmoid(x)

    return x

In [60]:
class feedforward_nn_bert(nn.Module):

  def __init__(self, dropout = 0.2, freeze_layer_count = 6):
    super().__init__()

    # Initialize a BERT model as an attribute to help get embeddings
    self.bert = BertModel.from_pretrained('bert-base-uncased')

    # Regularization technique: dropout
    self.dropout = nn.Dropout(dropout)

    # Layers to the neural network
    ### TODO: Fix this so it has the right number of dimensions
    self.fc1 = nn.Linear(512, 10)
    self.fc2 = nn.Linear(10, 1)    # We have 6 languages to classify to


  def forward(self, input_id, mask):

    # First step in the forward pass: get text-level embeddings of the CLS token
    _, CLS = self.bert(input_ids = input_id, attention_mask = mask, return_dict = False)

    # Regularization technique: dropout
    x = self.dropout(CLS)

    ### Layers to the neural network
    ### Softmax activation function ensures probabilities add up to 1

    # First layer
    x = self.fc1(x)
    x = f.relu()

    # Output layer: multi-class classification
    x = self.fc2(x)
    x = f.softmax(x)

    return x

Now we can create the neural network as an instance of the `feedforward_nn_bert` class.

In [61]:
X_train.shape

(993, 1)

In [62]:
# Create our neural network as an instance of the feedforward_nn_bert class
#input_dim = train_x.shape[1]

model = feedforward_nn_bert()

## Training

In [63]:
# Define number of epochs (training iterations) and optimizer (loss function)

epochs = 2

criterion = nn.BCELoss()

optimizer = optim.Adam(model.parameters())

In [64]:
def train_nn(model,
             train_loader, test_loader, valid_loader, ## TODO: add validation set
             use_cuda,
             learning_rate = 2e-5, epochs = epochs, batch_size = batch_size):

  if use_cuda:
    model = model.cuda()

  for epoch in range(epochs):
    model.train()
    for inputs, labels in train_loader:
      optimizer.zero_grad()
      outputs = model(inputs)

      ### TODO: Check this squeeze thingy, we should "match shape of labels" so the dimension might be different?
      outputs = torch.squeeze(outputs, dim = 1)
      loss = criterion(outputs, labels)

      loss.backward()
      optimizer.step()

    # Validation step
    val_loss, val_accuracy = eval_nn(model, batch_size, valid_loader)

    # Print as we train the loss and accruacy for each epoch
    print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item():.3f}, Val Loss: {val_loss:.3f}, Val Accuracy: {val_accuracy:.3f}')

In [65]:
def eval_nn(model, batch_size, valid_loader):

    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        val_loss = 0.0
        correct = 0
        total = 0

        ### TODO: Change this to reflect multiclass prediction accuracy
        for inputs, labels in valid_loader:
            outputs = model(inputs)
            outputs = torch.squeeze(outputs, dim=1)  # Squeeze the output tensor
            val_loss += criterion(outputs, labels).item()
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        val_loss /= len(valid_loader)
        val_accuracy = correct / total
    return val_loss, val_accuracy

We can finally call the functions that we made to make predictions.

In [66]:
train_nn(model,
         train_loader, test_loader, valid_loader,
         use_cuda,
         learning_rate = 2e-5, epochs = epochs, batch_size = batch_size)

IndexError: list index out of range

In [67]:
eval_nn(model, batch_size, valid_loader)

IndexError: list index out of range

## TO DELETE: Load BERT and create word embeddings

For each example, we need to get the embedding of the CLS token. This can be used as input for a classification task.

In [None]:
# Import BERT Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
# Add column with information about tokenized text from BERT
data["answer_tokenized"] = pd.Series([
    tokenizer(text, padding='max_length', max_length = 512, truncation=True,
              return_tensors="pt") for text in data["answer"]])

In [None]:
### Create BERT embeddings

# Load BERT model for embeddings
bert = BertModel.from_pretrained('bert-base-uncased')
bert = bert.to(device)

# Use DataLoader to avoid overloading the GPU by running too many samples at once
data_dataloader = torch.utils.data.DataLoader(
    data, batch_size=8, shuffle=False)

# , worker_init_fn=np.random.seed(0)

# Create a column for the embedding of the CLS token
empty_list = [None] * data.shape[0]

for index, text in data["answer_tokenized"].items():
  input_ids = text["input_ids"].to(device)
  attention_mask = text["attention_mask"].to(device)
  _, pooled_output = bert(input_ids = input_ids, attention_mask = attention_mask, return_dict=False)

  # Add the CLS token embedding to the list
  empty_list[index] = pooled_output

data["cls_embedding"] = pd.Series(empty_list)

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 11.06 MiB is free. Process 285643 has 14.73 GiB memory in use. Of the allocated memory 14.61 GiB is allocated by PyTorch, and 3.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
data.head()

Unnamed: 0,native_language,answer,answer_tokenized
0,0,Dear Mr Ryan.\r\nThanks for you letter. I am s...,"[input_ids, token_type_ids, attention_mask]"
1,0,As our class is going to mark a short video ab...,"[input_ids, token_type_ids, attention_mask]"
2,0,"Dear Madam Helen Ryan,\r\nI have received your...","[input_ids, token_type_ids, attention_mask]"
3,0,I agree! Sometimes shopping is not always enjo...,"[input_ids, token_type_ids, attention_mask]"
10,5,"Dear Mrs. Ryan,\r\nFirst of all, I would like ...","[input_ids, token_type_ids, attention_mask]"
