<a href="https://colab.research.google.com/github/cvillanue/DeepLearning-IdiomaticExpression/blob/main/IdiomaticExpression_StaticBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Learning Based Idiomatic Expression Recognition using BERT**

## Project developed by: Callyn Villanueva 

Articles and peer-reviewed sources used: Rani Horev, Rob Toews.
[A New Approach for Idiom Identification Using Meanings and the Web](https://aclanthology.org/R15-1087) (Verma & Vuppuluri, RANLP 2015)

About the EPIE Corpus Dataset: 
https://arxiv.org/abs/2006.09479 

The dataset contains possible idiomatic expression instances for 717 idioms, divided into two folders:

    Formal Idioms - Idioms which undergo lexical changes.

    Static Idioms - Idioms which stay the same across instances.

Each folder contains 3 sentence-aligned files, where '*' stands for either 'Static_Idioms' or 'Formal_Idioms':

    *_Words.txt :- Original sentences.

    *_Candidates.txt :- Candidate idiom whose instance is present in the corresponding sentence.

    *_Tags.txt :- Sequence labelling tags for each token of the sentence. Each entry delimited by a space is treated as a separate token.

The labelling follows the BIO convention using three tags (B-IDIOM, I-IDIOM, O):

    B-IDIOM:- beginning of possible idiomatic expression span
    I-IDIOM:- continuation of possible idiomatic expression span
    O:- Non-Idiom token
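
To make the BIO convention concrete, here is a minimal sketch of turning a tag sequence back into idiom spans. The function name `extract_idiom_spans` and the example sentence are hypothetical (the sentence is not taken from the corpus); only the three tag names come from the dataset description above.

```python
from typing import List, Tuple


def extract_idiom_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, text) for every B-IDIOM/I-IDIOM span."""
    spans = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags):
        if tag == "B-IDIOM":
            # A new span begins; close any span that was still open
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I-IDIOM" and start is not None:
            # Continuation of the open span
            continue
        else:
            # 'O' (or a stray I-IDIOM with no open span) closes the span
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:
        spans.append((start, len(tags), " ".join(tokens[start:])))
    return spans


# Hypothetical, hand-written example line
tokens = "He finally decided to break the ice at dinner".split()
tags = "O O O O B-IDIOM I-IDIOM I-IDIOM O O".split()
print(extract_idiom_spans(tokens, tags))  # [(4, 7, 'break the ice')]
```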

For this project, I use BERT (Bidirectional Encoder Representations from Transformers) and test it on the Static Idioms. The model performs binary classification: each instance is classified as either an idiom or not an idiom.

## Introduction: 
Language enables us to reason abstractly, to develop complex ideas about what the world is and could be, and to build on these ideas across generations and geographies. Almost nothing about modern civilization would be possible without language. One form of language we use is called **Idiomatic Expressions.** They are used to communicate or convey a feeling or emotion.  


Building machines that can understand this form of language has been a complex problem, particularly when it comes to how idioms are used and understood.


So, what are idioms? They’re a type of figurative language. You can’t rely on the words in an idiom to tell you what the phrase means. That’s because they have a meaning that is different from the literal meanings of the individual words themselves. Let’s look at an example. When someone says *it’s raining cats and dogs*, they don’t mean that there are actual animals falling from the sky. It’s an idiom! The phrase means that it’s raining very heavily.


Additionally, some idioms are context dependent. Example:

*The fisherman broke the ice with his tool.*
Are we to believe that this is a very suave fisherman? Here the literal reading (he physically broke ice) is the more plausible one.

Another question arises: **is it possible to teach an AI to use idiomatic phrases to keep up with human culture?**

Observe that humans do not come linguistically "pre-loaded" with idioms. So we can safely assume that idiom usage is a learning task, and that the only way for any speaker, human or machine, to keep up is to keep learning. If we solve the idiom learning task, we just need to keep our agent online or periodically retrain it on newly collected corpora.



**About BERT:**

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.
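
As a rough illustration of how attention relates each token to the others, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification: BERT additionally applies learned query/key/value projections and uses multiple heads, and the function names and toy vectors below are assumptions made for the example.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """For each query, return a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this token's query with every token's key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy input: two tokens with identical keys, so each attends equally to both
x = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(self_attention(x, x, v))
```

Because both tokens have the same key, the attention weights are uniform and each output is the average of the two value vectors.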


**Masked LM (MLM)**

Before word sequences are fed into BERT during pre-training, about 15% of the tokens in each sequence are selected for prediction, and most of those are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:

    Adding a classification layer on top of the encoder output.
    Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
    Calculating the probability of each word in the vocabulary with softmax.
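
The three steps above can be sketched in plain Python for a single masked position. This is a toy illustration under stated assumptions: the three-word vocabulary, the 2-dimensional vectors, and the function name `mlm_probabilities` are all made up for the example, and the real BERT head also applies a non-linearity, layer normalization, and an output bias.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_probabilities(hidden: List[float],
                      dense_weights: List[List[float]],
                      embedding_matrix: List[List[float]]) -> List[float]:
    # 1. Classification (transform) layer on top of the encoder output
    transformed = [sum(w * h for w, h in zip(row, hidden)) for row in dense_weights]
    # 2. Multiply by the embedding matrix: one logit per vocabulary entry
    logits = [sum(e * t for e, t in zip(emb, transformed)) for emb in embedding_matrix]
    # 3. Softmax turns the logits into a probability for each word
    return softmax(logits)


# Hypothetical toy vocabulary and weights
vocab = ["cats", "dogs", "rain"]
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dense_weights = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the arithmetic easy to follow
hidden = [2.0, 0.5]  # encoder output at the masked position

probs = mlm_probabilities(hidden, dense_weights, embedding_matrix)
best = vocab[max(range(len(vocab)), key=probs.__getitem__)]
print(best)  # rain (logits are 2.0, 0.5, 2.5)
```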
    

In [1]:
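# Note: the PyPI package `bert` is not the BERT model code (the log below shows it pulling in `erlastic`);
# the rest of this notebook only relies on `transformers`, installed in the next cell.
!pip install bert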

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert
  Downloading bert-2.2.0.tar.gz (3.5 kB)
  Preparing metadata (setup.py) ... done
Collecting erlastic
  Downloading erlastic-2.0.0.tar.gz (6.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bert, erlastic
  Building wheel for bert (setup.py) ... done
  Created wheel for bert: filename=bert-2.2.0-py3-none-any.whl size=3763 sha256=9d2d950bc7f971c22a3b64ba598107773c02cdde8539efcc70212020b61f56d8
  Stored in directory: /root/.cache/pip/wheels/81/e5/34/d540d6d58f74eece5ed6a0305c718c18d48f8fa8da359365fb
  Building wheel for erlastic (setup.py) ... done
  Created wheel for erlastic: filename=erlastic-2.0.0-py3-none-any.whl size=6792 sha256=f072c789c4b681c6dac845d5b9423a1e39e8573540175e1ca0f30e3dc170a070
  Stored in directory: /root/.cache/pip/wheels/23/bf/21/6de152eceb51594c538fe8b87584b9dd260cd

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 54.9 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 83.4 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 27.5 MB/s eta 0:00:00
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


In [3]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

In [4]:
# Loading the tokenizer and pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [5]:
!unzip Static_Idioms_Corpus.zip

Archive:  Static_Idioms_Corpus.zip
  inflating: Static_Idioms_Corpus/Static_Idioms_Candidates.txt  
  inflating: Static_Idioms_Corpus/Static_Idioms_Tags.txt  
  inflating: Static_Idioms_Corpus/Static_Idioms_Words.txt  


In [6]:
import os

corpus_path = "Static_Idioms_Corpus/"

# create a list of file paths for all *_Words.txt files in the corpus
corpus_files = [os.path.join(corpus_path, f) for f in os.listdir(corpus_path) if f.endswith("_Words.txt")]

# Create list of sentences and corresponding candidate idioms/tags
sentences = []
candidate_idioms = []
tags = []

# Iterate through each file and load data
for words_path in corpus_files:
    candidates_path = words_path.replace("_Words.txt", "_Candidates.txt")
    tags_path = words_path.replace("_Words.txt", "_Tags.txt")
    
    with open(words_path, 'r') as words_file, \
         open(candidates_path, 'r') as candidates_file, \
         open(tags_path, 'r') as tags_file:
        
        words_lines = words_file.readlines()
        candidates_lines = candidates_file.readlines()
        tags_lines = tags_file.readlines()
        
        for words_line, candidates_line, tags_line in zip(words_lines, candidates_lines, tags_lines):
            words = words_line.strip().split()
            candidates = candidates_line.strip().split('\t')
            sentence_tags = tags_line.strip().split()

            sentence_candidates = []
            candidate_tags = []

            # Iterate through each word in the sentence and create candidate idioms and tags
            for i, tag in enumerate(sentence_tags):
                if tag == 'B-IDIOM':
                    # Start of a candidate idiom
                    candidate = words[i]
                    tag = 1  # 1 indicates idiom
                    j = i + 1
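                    while j < len(sentence_tags) and sentence_tags[j] == 'I-IDIOM':
                        # Add additional words to candidate idiom
                        candidate += ' ' + words[j]
                        # Relabel the continuation token as 'O'. The outer loop iterates over this
                        # same list, so the token is later also emitted on its own with label 0.
                        sentence_tags[j] = 'O'
                        j += 1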
                    sentence_candidates.append(candidate)
                    candidate_tags.append(tag)
                elif tag == 'O':
                    # Not part of a candidate idiom
                    sentence_candidates.append(words[i])
                    candidate_tags.append(0)  # 0 indicates not idiom

            sentences.append(words)
            candidate_idioms.append(sentence_candidates)
            tags.append(candidate_tags)

print(type(tags))

<class 'list'>


In [7]:
print(tags)

[[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [8]:
input_ids = []
attention_masks = []
token_type_ids = []
labels = []

for i, sentence_candidates in enumerate(candidate_idioms):
    for j, candidate in enumerate(sentence_candidates):
        encoded_dict = tokenizer.encode_plus(
                            candidate,
                            add_special_tokens = True,
                            max_length = 64,
                            padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
                            truncation = True,       # explicit, so no truncation warning is emitted
                            return_attention_mask = True,
                            return_token_type_ids = True,
                            return_tensors = 'pt',
                       )

        # Convert each (1, 64) tensor to a flat list of ints and append to the respective lists
        input_ids.append(encoded_dict['input_ids'].squeeze().tolist())
        attention_masks.append(encoded_dict['attention_mask'].squeeze().tolist())
        token_type_ids.append(encoded_dict['token_type_ids'].squeeze().tolist())
        labels.append(int(tags[i][j]))


In [9]:
print("Size of input_ids:", len(input_ids))
print("Size of attention_masks:", len(attention_masks))
print("Size of token_type_ids:", len(token_type_ids))
print("Size of labels:", len(labels))

Size of input_ids: 646935
Size of attention_masks: 646935
Size of token_type_ids: 646935
Size of labels: 646935



With 646,935 encoded candidates and a batch size of 32, the full dataset corresponds to about 20,217 batches of 32 examples each (646,935 / 32 ≈ 20,217); 32 is the size of each batch, not the number of batches. The network has 2 outputs, since this is a binary classification problem.
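
The relationship between dataset size, batch size, and number of batches can be checked with a little arithmetic. This is a minimal sketch: the helper name `batches_per_epoch` is made up, the batch size of 32 is the one set in the DataLoader cell, and 517,548 is the training-split size printed by the shape cell further down.

```python
import math


def batches_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches a DataLoader yields when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


batch_size = 32
print(batches_per_epoch(646_935, batch_size))  # 20217 (all encoded candidates)
print(batches_per_epoch(517_548, batch_size))  # 16174 (training split only)
```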

In [10]:
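# Each element of token_type_ids is itself a list of 64 integers (one per token position),
# so this top-level check is expected to report non-integer elements.
if all(isinstance(elem, int) for elem in token_type_ids):
    print("All elements are integers")
else:
    print("List contains non-integer elements")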

print(type(token_type_ids))

List contains non-integer elements
<class 'list'>


In [12]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and a held-out remainder (20%).
# Passing all arrays in one call guarantees they are split with the same indices.
(train_inputs, val_test_inputs,
 train_masks, val_test_masks,
 train_token_type_ids, val_test_token_type_ids,
 train_labels, val_test_labels) = train_test_split(
    input_ids, attention_masks, token_type_ids, labels, test_size=0.2, random_state=42)

# Split the held-out remainder evenly into validation and test sets
(val_inputs, test_inputs,
 val_masks, test_masks,
 val_token_type_ids, test_token_type_ids,
 val_labels, test_labels) = train_test_split(
    val_test_inputs, val_test_masks, val_test_token_type_ids, val_test_labels,
    test_size=0.5, random_state=42)

In [13]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data import TensorDataset


# Convert the data splits to tensors
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
train_token_type_ids = torch.tensor(train_token_type_ids)

val_inputs = torch.tensor(val_inputs)
val_labels = torch.tensor(val_labels)
val_masks = torch.tensor(val_masks)
val_token_type_ids = torch.tensor(val_token_type_ids)

test_inputs = torch.tensor(test_inputs)
test_labels = torch.tensor(test_labels)
test_masks = torch.tensor(test_masks)
test_token_type_ids = torch.tensor(test_token_type_ids)

# Create TensorDatasets
train_data = TensorDataset(train_inputs, train_masks, train_token_type_ids, train_labels)
val_data = TensorDataset(val_inputs, val_masks, val_token_type_ids, val_labels)
test_data = TensorDataset(test_inputs, test_masks, test_token_type_ids, test_labels)

# Set the batch size for training and validation
batch_size = 32

# Create DataLoaders
train_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
val_loader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)
test_loader = DataLoader(test_data, sampler=SequentialSampler(test_data), batch_size=batch_size)

In [19]:
print(train_inputs.shape)
print(val_inputs.shape)
print(train_labels.shape)
print(val_labels.shape)
print(train_masks.shape)
print(val_masks.shape)
print(train_token_type_ids.shape)
print(val_token_type_ids.shape)
print(test_inputs.shape)


torch.Size([517548, 64])
torch.Size([64693, 64])
torch.Size([517548])
torch.Size([64693])
torch.Size([517548, 64])
torch.Size([64693, 64])
torch.Size([517548, 64])
torch.Size([64693, 64])
torch.Size([64694, 64])


In [20]:
from transformers import BertForSequenceClassification, AdamW
import torch

# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)

# Train the model
epochs = 2  # kept small for testing purposes, since training takes a long time
train_losses = []  # accumulated across epochs, so the reported train loss is a running mean
for epoch in range(epochs):
    # from_pretrained() returns the model in eval mode, so switch dropout on for training
    model.train()
    for batch in train_loader:
        # Load batch to GPU
        batch = tuple(t.to(device) for t in batch)

        # Unpack inputs and labels from batch
        input_ids, attention_mask, token_type_ids, labels = batch

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)

        # Compute loss
        loss = outputs.loss
        train_losses.append(loss.item())

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

        # Zero gradients
        optimizer.zero_grad()

    # Evaluate model on validation set after each epoch, with dropout disabled
    model.eval()
    val_losses = []
    for batch in val_loader:
        # Load batch to GPU
        batch = tuple(t.to(device) for t in batch)

        # Unpack inputs and labels from batch
        input_ids, attention_mask, token_type_ids, labels = batch

        # Forward pass
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)

        # Compute loss
        loss = outputs.loss
        val_losses.append(loss.item())

    print("Epoch {}/{}: Train Loss: {:.4f}, Validation Loss: {:.4f}".format(epoch+1, epochs, sum(train_losses)/len(train_losses), sum(val_losses)/len(val_losses)))

print("Training complete!")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Epoch 1/2: Train Loss: 0.0009, Validation Loss: 0.0003
Epoch 2/2: Train Loss: 0.0006, Validation Loss: 0.0000
Training complete!


In [21]:
# Evaluate model on test set
test_losses = []
num_correct = 0
num_samples = 0
model.eval()
with torch.no_grad():
    for batch in test_loader:
    
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_mask, token_type_ids, labels = batch

    
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)

        loss = outputs.loss
        test_losses.append(loss.item())

        # Compute accuracy
        _, predicted = torch.max(outputs.logits, 1)
        num_correct += (predicted == labels).sum().item()
        num_samples += labels.size(0)

test_loss = sum(test_losses) / len(test_losses)
test_accuracy = num_correct / num_samples

print("Test Loss: {:.4f}, Test Accuracy: {:.4f}".format(test_loss, test_accuracy))

Test Loss: 0.0001, Test Accuracy: 1.0000
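
Most candidates carry label 0 (see the printed tags earlier), so accuracy alone may not show how well the idiom class is recovered. As a minimal sketch, precision, recall, and F1 for the positive class could be computed as follows; the helper `binary_metrics` and the toy labels are hypothetical and are not results from this notebook.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy plus precision/recall/F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: always predicting "not idiom" is 90% accurate but finds no idioms
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(binary_metrics(y_true, y_pred))
```

In the evaluation loop, the `predicted` and `labels` tensors of each batch could be collected with `.tolist()` and passed to such a helper once the loop finishes.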


In [37]:
# Get input text from user
input_text = input("Enter a sentence: ")

# Tokenize input text
input_ids = tokenizer.encode(input_text, add_special_tokens=True, max_length=128, truncation=True)
attention_mask = [1] * len(input_ids)
token_type_ids = [0] * len(input_ids)

input_ids = torch.tensor(input_ids).unsqueeze(0).to(device)
attention_mask = torch.tensor(attention_mask).unsqueeze(0).to(device)
token_type_ids = torch.tensor(token_type_ids).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    logits = outputs.logits
    _, predicted = torch.max(logits, 1)
    
# Print prediction (label 1 = idiom, label 0 = not an idiom, as encoded during preprocessing)
if predicted.item() == 0:
    print("Negative")
else:
    print("Positive")


Enter a sentence: it's raining cats
Positive


In [38]:
# Save the model
torch.save(model.state_dict(), 'IDOM_BERTmodel.pt')

In [40]:
# Print the model architecture
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,