<a href="https://colab.research.google.com/github/cvillanue/DeepLearning-IdiomaticExpression/blob/main/IdiomaticExpression_BERT_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Learning Based Idiomatic Expression Recognition using BERT**

## Project developed by: Callyn Villanueva 

Articles and peer-reviewed sources used: Rani Horev, Rob Toews.
[A New Approach for Idiom Identification Using Meanings and the Web](https://aclanthology.org/R15-1087) (Verma & Vuppuluri, RANLP 2015)

About the EPIE Corpus Dataset: 
https://arxiv.org/abs/2006.09479 

This dataset contains instances of possible idiomatic expressions from 717 idioms, divided into two folders:

    Formal Idioms - Idioms which undergo lexical changes.

    Static Idioms - Idioms which stay the same across instances.

Each folder contains 3 sentence-aligned files, where '*' is replaced with either 'Static_Idioms' or 'Formal_Idioms':

    *_Words.txt :- Original sentences
    *_Candidates.txt :- Candidate idiom whose instance is present in the corresponding sentence
    *_Tags.txt :- Sequence labelling tags for each token of the sentence

Each entry delimited by a space is treated as a separate token. The labelling follows the BIO convention using three tags (B-IDIOM, I-IDIOM, O).

    B-IDIOM:- beginning of possible idiomatic expression span
    I-IDIOM:- continuation of possible idiomatic expression span
    O:- Non-Idiom token
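
To make the BIO convention concrete, here is a minimal sketch that recovers idiom spans from a tag sequence. The helper name `extract_idiom_spans` and the example sentence with its tags are illustrative assumptions, not taken from the corpus files.

```python
def extract_idiom_spans(tokens, tags):
    """Return the idiom spans marked by BIO tags, each as a list of tokens."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must be the same length")
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-IDIOM":
            # A new span starts; close any span that was still open.
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I-IDIOM" and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I-IDIOM with no open span) ends the current span.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


# Hypothetical example in the space-delimited format described above.
sentence = "I will keep an eye on him".split()
tags = "O O B-IDIOM I-IDIOM I-IDIOM I-IDIOM O".split()
spans = extract_idiom_spans(sentence, tags)
print(spans)  # [['keep', 'an', 'eye', 'on']]
```

Because tokens are split on spaces, the tag list lines up one-to-one with the words of the sentence, which is what the length check enforces.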

For this project, I will be using BERT (Bidirectional Encoder Representations from Transformers) and will test Static Idioms. The model is set up for binary classification: each instance is classified as either an idiom or not an idiom.

## Introduction: 
Language enables us to reason abstractly, to develop complex ideas about what the world is and could be, and to build on these ideas across generations and geographies. Almost nothing about modern civilization would be possible without language. One form of language we use is called **Idiomatic Expressions.** They are used to communicate or convey a feeling or emotion.  


Building machines that can understand this form of language has been a complex problem, particularly when it comes to using and interpreting it.


So, what are idioms? They’re a type of figurative language. You can’t rely on the words in an idiom to tell you what the phrase means. That’s because they have a meaning that is different from the literal meanings of the individual words themselves. Let’s look at an example. When someone says *it’s raining cats and dogs*, they don’t mean that there are actual animals falling from the sky. It’s an idiom! The phrase means that it’s raining very heavily.


Additionally, some idioms are context dependent. Example:

*The fisherman broke the ice with his tool.*
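Are we to believe that this is a very suave fisherman? Read literally, he cracked frozen water; read idiomatically, he eased a social situation. Only the context tells the two apart.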

Another question arises: **is it possible to teach an AI to use idiomatic phrases to keep up with the culture of humans?**

Observe that humans do not come linguistically "pre-loaded" with idioms. So we can safely assume that idiom usage is a learning task, and that the only way for speakers to keep up is to keep learning. If we solve the idiom learning task, we then just need to keep our agent online or periodically retrain it on newly emerging corpora.



**About BERT:**

BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.


**Masked LM (MLM)**

Before word sequences are fed into BERT, 15% of the tokens in each sequence are selected for prediction. Most of the selected tokens (80%) are replaced with a [MASK] token, while the rest are either swapped for a random token or left unchanged. The model then attempts to predict the original value of the selected tokens, based on the context provided by the other tokens in the sequence. In technical terms, the prediction of the output words requires:

    Adding a classification layer on top of the encoder output.
    Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
    Calculating the probability of each word in the vocabulary with softmax.
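
The last two steps can be sketched in plain Python with toy numbers. Everything here is an assumption made for illustration: the four-word vocabulary, the 3-dimensional embeddings, and the `hidden` vector, which stands in for the encoder output at a [MASK] position after the classification layer (that learned layer is not modelled).

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and embedding matrix (one row per word, values made up).
vocab = ["rain", "cats", "dogs", "ice"]
embedding_matrix = [
    [1.0, 0.0, 0.0],  # rain
    [0.0, 1.0, 0.0],  # cats
    [0.0, 0.9, 0.1],  # dogs
    [0.0, 0.0, 1.0],  # ice
]

# Stand-in for the transformed encoder output at the masked position.
hidden = [0.1, 2.0, 0.3]

# Multiply the output vector by the embedding matrix: one score per vocabulary word.
logits = [sum(h * e for h, e in zip(hidden, row)) for row in embedding_matrix]

# Softmax over the vocabulary dimension gives a probability for each word.
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
print(predicted)  # cats
```

In the real model the vocabulary has tens of thousands of entries and the vectors have 768 dimensions for `bert-base-uncased`, but the computation has the same shape.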



Generalized Steps to train Corpus Data using BERT:

1. Install the `transformers` library in Python, which provides a high-level interface for working with pre-trained transformer models such as BERT. You can install it with pip by running `pip install transformers`.

2. Load and Tokenize the Corpus Data: You need to load and tokenize the corpus data using the `BertTokenizer` class from the `transformers` library. This class tokenizes the text and maps the tokens to their corresponding IDs for use with BERT.

3. Preprocess the Corpus Data: You will need to preprocess the corpus data to create input examples that can be used for training the BERT model. For example, you can split the data into input sentences of a fixed length and create input sequences by adding special tokens such as [CLS] and [SEP] that are used by BERT.

4. Load the BERT Model: You can load a pre-trained BERT model from the `transformers` library using the `BertForSequenceClassification` class. This class provides a BERT model that has been pre-trained on a large corpus of text and can be fine-tuned for specific NLP tasks.

5. Fine-tune the BERT Model: You can fine-tune the BERT model on your corpus data using a technique called transfer learning. This involves training the model on your corpus data for a specific NLP task such as sentiment analysis, text classification, or question-answering.

6. Evaluate the BERT Model: After fine-tuning the model, you can evaluate its performance on a test dataset to measure its accuracy and other metrics.
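
As a rough illustration of step 3, the sketch below shows what fixed-length preprocessing produces for one sentence: special tokens added, tokens mapped to IDs, padding appended, and an attention mask marking the real tokens. The `encode_sentence` helper and the toy vocabulary with its IDs are assumptions made for this example; the real `BertTokenizer` uses its own vocabulary and sub-word splitting.

```python
def encode_sentence(tokens, vocab, max_length):
    """Add [CLS]/[SEP], map tokens to IDs, then pad and build an attention mask."""
    # Reserve two positions for the special tokens, truncating the sentence if needed.
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    # Unknown words fall back to the [UNK] ID.
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    mask = [1] * len(ids)  # 1 marks a real token
    padding = max_length - len(ids)
    ids += [vocab["[PAD]"]] * padding
    mask += [0] * padding  # 0 marks padding the model should ignore
    return ids, mask


# Toy vocabulary; these IDs are made up and do not match bert-base-uncased.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "break": 4, "the": 5, "ice": 6}

ids, mask = encode_sentence(["break", "the", "ice", "now"], toy_vocab, max_length=8)
print(ids)   # [2, 4, 5, 6, 1, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Every encoded sentence has the same length, so a list of them can be stacked into a single tensor with one row per sentence.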

In [None]:
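# Note: the PyPI package `bert` is an unrelated serialization library (it pulls in `erlastic`); the BERT model used below comes from `transformers`
pip install bert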

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert
  Downloading bert-2.2.0.tar.gz (3.5 kB)
  Preparing metadata (setup.py) ... done
Collecting erlastic
  Downloading erlastic-2.0.0.tar.gz (6.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bert, erlastic
  Building wheel for bert (setup.py) ... done
  Created wheel for bert: filename=bert-2.2.0-py3-none-any.whl size=3763 sha256=7a217785f788669dc8b72cdfe74dd48a2fb42caf2ee1e36e115779f3b3651670
  Stored in directory: /root/.cache/pip/wheels/81/e5/34/d540d6d58f74eece5ed6a0305c718c18d48f8fa8da359365fb
  Building wheel for erlastic (setup.py) ... done
  Created wheel for erlastic: filename=erlastic-2.0.0-py3-none-any.whl size=6792 sha256=f4a846f6e3e8184875f65980103cbb8b9a8236d7fafd1616d2034b0964c1cfd6
  Stored in directory: /root/.cache/pip/wheels/23/bf/21/6de152eceb51594c538fe8b87584b9dd260cd

In [None]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 55.6 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.8/199.8 KB 27.3 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 73.8 MB/s eta 0:00:00
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.3


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

In [None]:
# Loading the tokenizer and pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
!unzip Formal_Idioms_Corpus.zip

Archive:  Formal_Idioms_Corpus.zip
   creating: Formal_Idioms_Corpus/
  inflating: __MACOSX/._Formal_Idioms_Corpus  
  inflating: Formal_Idioms_Corpus/Formal_Idioms_Candidates.txt  
  inflating: __MACOSX/Formal_Idioms_Corpus/._Formal_Idioms_Candidates.txt  
  inflating: Formal_Idioms_Corpus/Formal_Idioms_Tags.txt  
  inflating: __MACOSX/Formal_Idioms_Corpus/._Formal_Idioms_Tags.txt  
  inflating: Formal_Idioms_Corpus/Formal_Idioms_Words.txt  
  inflating: __MACOSX/Formal_Idioms_Corpus/._Formal_Idioms_Words.txt  
  inflating: Formal_Idioms_Corpus/Formal_Idioms_Labels.txt  
  inflating: __MACOSX/Formal_Idioms_Corpus/._Formal_Idioms_Labels.txt  


In [None]:
import os

corpus_path = "Formal_Idioms_Corpus/"

# create a list of file paths for all *_Words.txt files in the corpus
corpus_files = [os.path.join(corpus_path, f) for f in os.listdir(corpus_path) if f.endswith("_Words.txt")]

# read the text from each file and tokenize it
# NOTE: f.read() loads a whole file as one string, so each file becomes a single
# example truncated to max_length tokens, not one example per sentence
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_length = 512
tokenized_texts = []
for file_path in corpus_files:
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    tokenized_text = tokenizer.tokenize(text)[:max_length]
    tokenized_texts.append(tokenized_text)
print(tokenized_texts)
# pad and truncate tokenized sequences to have the same length
# (pad_to_max_length is deprecated in newer transformers releases in favour of padding='max_length')
input_ids = []
attention_masks = []
for text in tokenized_texts:
    encoded_dict = tokenizer.encode_plus(text, add_special_tokens=True, max_length=max_length, pad_to_max_length=True, return_attention_mask=True)
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

# convert lists to tensors (shape: number of examples x max_length)
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[['‘', 'you', 'know', ',', 'the', 'panda', 'who', 'keeps', 'an', 'eye', 'on', 'my', 'drinking', 'habits', '.', '’', 'but', 'i', 'will', 'also', 'be', 'keeping', 'an', 'eye', 'on', 'you', '.', '’', '‘', 'i', 'will', 'keep', 'an', 'eye', 'on', 'him', ',', '’', 'reassured', 'jack', '.', 'moreover', ',', 'international', 'operations', 'director', 'richard', 'ferry', 'claims', 'that', 'the', 'bra', '##ck', '##nell', ',', 'berkshire', '-', 'based', 'company', "'", 's', 'services', 'are', 'unique', '—', 'although', 'most', 'rivals', 'can', 'offer', 'central', '##ised', 'expertise', ',', 'he', 'says', ',', 'no', '-', 'one', 'else', 'can', 'monitor', 'computer', 'systems', 'remotely', ',', 'or', 'keep', 'an', 'eye', 'on', 'such', 'critical', 'issues', 'as', 'temperature', 'and', 'air', 'conditioning', 'levels', 'at', 'any', 'given', 'customer', 'site', '.', 'once', 'duncan', 'had', 'managed', 'to', 'fire', 'off', 'a', 'report', 'and', 'stop', 'his', 'watch', '-', 'replacement', 'from', 'stumbli



In [None]:
print(attention_masks.shape[0])

1


In [None]:
print(input_ids.shape)
print(attention_masks.shape)

torch.Size([1, 512])
torch.Size([1, 512])


In [None]:
with open("Formal_Idioms_Corpus/Formal_Idioms_Labels.txt", "r") as f:
    labels_list = [int(label.strip()) for label in f]

print(labels_list)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [None]:
with open('Formal_Idioms_Corpus/Formal_Idioms_Tags.txt', 'r') as f:
    num_lines = sum(1 for line in f)

print(num_lines)

3136


In [None]:
import torch.nn.functional as F
# note to self: the number of labels must match the first (batch) dimension of input_ids and attention_masks
labels = torch.tensor(labels_list)
labels = labels.reshape(-1)
# check the shape of the resulting tensor
print(labels.shape)

torch.Size([3136])


In [None]:
'''
outputs = model(input_ids, attention_masks)
logits = outputs[0]
print(logits.shape)  # should be (batch_size, num_classes)
print(labels.shape)  # should be (batch_size,)
labels = labels.unsqueeze(1)  # add a dimension to match the expected shape of logits
print(labels.shape)  # should be (batch_size, 1)
'''

'\noutputs = model(input_ids, attention_masks)\nlogits = outputs[0]\nprint(logits.shape)  # should be (batch_size, num_classes)\nprint(labels.shape)  # should be (batch_size,)\nlabels = labels.unsqueeze(1)  # add a dimension to match the expected shape of logits\nprint(labels.shape)  # should be (batch_size, 1)\n'

In [None]:
print(input_ids.shape)
print(attention_masks.shape)
print(labels.shape)

torch.Size([1, 512])
torch.Size([1, 512])
torch.Size([3136])
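
The shapes above show one tokenized example but 3136 labels, so the inputs and labels cannot yet be paired for training. One way to line them up is to read the sentence-aligned files line by line, so that each sentence keeps the label on the same line number. The sketch below assumes that the labels file holds one integer per line; the `read_aligned` helper and the in-memory file contents are made up for illustration.

```python
import io


def read_aligned(words_file, labels_file):
    """Pair each sentence line with the integer label on the same line number."""
    sentences = [line.rstrip("\n") for line in words_file]
    labels = [int(line.strip()) for line in labels_file]
    if len(sentences) != len(labels):
        raise ValueError(f"{len(sentences)} sentences but {len(labels)} labels")
    return list(zip(sentences, labels))


# In-memory stand-ins for the *_Words.txt and *_Labels.txt files (contents are made up).
words_file = io.StringIO("I will keep an eye on him .\nHe broke the ice with his tool .\n")
labels_file = io.StringIO("1\n0\n")

pairs = read_aligned(words_file, labels_file)
print(len(pairs))  # 2
print(pairs[0])    # ('I will keep an eye on him .', 1)
```

Tokenizing each sentence separately would then give one row of `input_ids` per label, which is the match the note in the earlier cell asks for.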


In [None]:
from transformers import BertForSequenceClassification

num_labels = 2 # for binary classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

# get the number of output nodes
last_layer_units = model.classifier.out_features
print(last_layer_units) # should print 2


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

2


In [None]:
print(model.classifier)

Linear(in_features=768, out_features=2, bias=True)


In [None]:
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertModel
import torch.nn as nn

batch_size = 16

# labels are left out of the dataset here: TensorDataset requires every tensor
# to share the same first dimension, and labels has 3136 rows versus 1
data = TensorDataset(input_ids, attention_masks)
# Use the DataLoader class to create batches of the Dataset
dataloader = DataLoader(data, batch_size=batch_size, shuffle=True)

for batch in dataloader:
    inputs = batch[0]
    batch_masks = batch[1]  # renamed so the global attention_masks tensor is not overwritten

    print("Input shape: ", inputs.shape)
    print("Attention mask shape: ", batch_masks.shape)

print(len(dataloader))  # number of batches
print(len(data))        # number of examples

Input shape:  torch.Size([1, 512])
Attention mask shape:  torch.Size([1, 512])
1
1
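
The two trailing numbers are the batch count and the example count. With the default `drop_last=False`, a `DataLoader` yields the example count divided by the batch size, rounded up. The plain-Python sketch below reproduces that arithmetic; the `make_batches` and `num_batches` helpers are hypothetical names used only here.

```python
import math


def make_batches(examples, batch_size):
    """Split a list into consecutive batches; the last batch may be smaller."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def num_batches(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)


print(num_batches(1, 16))     # 1, matching the single example above
print(num_batches(3136, 16))  # 196, if every labelled line were its own example

batches = make_batches(list(range(40)), 16)
print([len(b) for b in batches])  # [16, 16, 8]
```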
