<a href="https://colab.research.google.com/github/devashishk99/Syntactic-abilities-BERT-family/blob/main/BERT_family_syntax_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Installation and imports**

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.15.1 tokenizers-0.13.3 transformers-4.29.2


In [2]:
import pandas as pd

from collections import Counter

from transformers import pipeline

## **2. Dataset**

This section shows how to load and format the dataset so that it can be passed directly to the Transformer model.

In this notebook, we will be utilizing the dataset defined in "Targeted Syntactic Evaluation of Language Models" (Marvin and Linzen, 2018, link to the paper: https://aclanthology.org/D18-1151.pdf). 
The dataset consists of various stimuli designed to assess the language model's proficiency in the following syntactic phenomena:


*   Subject-Verb Agreement;
*   Reflexive Anaphora;
*   Negative Polarity Items.

The dataset can be downloaded from https://github.com/yoavg/bert-syntax.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### **2.1 Data Investigation**

Before proceeding with the data processing, let's examine the structure of the dataset more closely.

In [4]:
test_dataset = "/content/drive/My Drive/lcl 23/data/marvin_linzen_dataset.tsv"

df = pd.read_csv(test_dataset, delimiter='\t', header=None)
df.head()

Unnamed: 0,0,1,2,3
0,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes laughs,the author that the guard likes laugh
1,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes swims,the author that the guard likes swim
2,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes smiles,the author that the guard likes smile
3,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes is tall,the author that the guard likes are tall
4,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes is old,the author that the guard likes are old


In [5]:
# List the distinct types of phenomena present in the dataset
df[0].unique()

array(['obj_rel_across_anim', 'obj_rel_within_anim',
       'obj_rel_across_inanim', 'obj_rel_within_inanim', 'subj_rel',
       'prep_anim', 'prep_inanim', 'obj_rel_no_comp_across_anim',
       'obj_rel_no_comp_within_anim', 'obj_rel_no_comp_across_inanim',
       'obj_rel_no_comp_within_inanim', 'simple_agrmt', 'sent_comp',
       'vp_coord', 'long_vp_coord', 'reflexives_across',
       'simple_reflexives', 'reflexive_sent_comp', 'npi_across_anim',
       'npi_across_inanim', 'simple_npi_anim', 'simple_npi_inanim'],
      dtype=object)

In [6]:
df[0].unique().size

22

In [6]:
# Samples from the dataset with simple subject-verb agreement
df[df[0] == 'simple_agrmt'].head()

Unnamed: 0,0,1,2,3
118160,simple_agrmt,sing_MS_MV,the author laughs,the author laugh
118161,simple_agrmt,sing_MS_MV,the author swims,the author swim
118162,simple_agrmt,sing_MS_MV,the author smiles,the author smile
118163,simple_agrmt,sing_MS_MV,the author is tall,the author are tall
118164,simple_agrmt,sing_MS_MV,the author is old,the author are old


In [7]:
# Samples from the dataset with subject-verb agreement with VP (verb phrase) coordination
df[df[0] == 'vp_coord'].head()

Unnamed: 0,0,1,2,3
119980,vp_coord,sing_MS_MV_MV,the author laughs and swims,the author laughs and swim
119981,vp_coord,sing_MS_MV_MV,the author laughs and smiles,the author laughs and smile
119982,vp_coord,sing_MS_MV_MV,the author laughs and is tall,the author laughs and are tall
119983,vp_coord,sing_MS_MV_MV,the author laughs and is old,the author laughs and are old
119984,vp_coord,sing_MS_MV_MV,the author laughs and is young,the author laughs and are young


In [8]:
# Samples from the dataset with subject-verb agreement with Long VP (verb phrase) coordination
df[df[0] == 'long_vp_coord'].head()

Unnamed: 0,0,1,2,3
120820,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120821,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120822,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120823,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120824,long_vp_coord,sing_MS_LMV_LMV,the author likes to watch television shows and...,the author likes to watch television shows and...


### **2.2 Data Preparation**

In [72]:
def data_process(config):
  prc_dataset = []

  cc = Counter()  # number of samples per construction template
  with open(test_dataset, 'r') as f:
    for line in f:
      sample = line.strip().split("\t")

      # Keep only the samples of the requested phenomenon (e.g. 'simple_agrmt')
      if sample[0] != config:
        continue
      cc[sample[1]] += 1

      # Select the correct ('g') and the erroneous ('ug') sentence
      g, ug = sample[-2].split(), sample[-1].split()
      assert len(g) == len(ug), (g, ug)

      # Identify the position where the two sentences differ;
      # skip pairs that do not differ in exactly one token
      diffs = [i for i, pair in enumerate(zip(g, ug)) if pair[0] != pair[1]]
      if len(diffs) != 1:
        continue

      # Save in 'gv' and 'ugv' the correct and incorrect token
      gv = g[diffs[0]]    # correct
      ugv = ug[diffs[0]]  # incorrect

      # Recreate the input sequence by replacing the target token with the mask token.
      # Note: "<mask>" is the RoBERTa mask token; BERT and DistilBERT tokenizers use "[MASK]".
      g[diffs[0]] = "<mask>"
      g.append(".")

      # Filter out the sentences that contain 'swims' as a possible target token,
      # since 'swims' does not exist in the model vocabulary
      if gv != 'swims' and ugv != 'swims':
        prc_dataset.append((sample[0], sample[1], " ".join(g), gv, ugv))

  return prc_dataset

At the end of the data preparation process, each instance will have the following structure:

*   Phenomenon;
*   Construction template;
*   Sentence with the masked target token;
*   Target token with correct agreement;
*   Target token with incorrect agreement.
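
The masking step can be illustrated on a single sentence pair. The following is a minimal sketch rather than the notebook's own code: `mask_differing_token` is a hypothetical helper that mirrors the logic of `data_process` for one pair, without the file reading or the `swims` filter.

```python
def mask_differing_token(good, bad, mask_token="<mask>"):
    """Return (masked_sentence, good_token, bad_token) for one sentence pair.

    Returns None when the two sentences do not differ in exactly one position.
    """
    g, ug = good.split(), bad.split()
    if len(g) != len(ug):
        return None
    # Positions at which the grammatical and ungrammatical sentences differ
    diffs = [i for i, (a, b) in enumerate(zip(g, ug)) if a != b]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    # Replace the target token with the mask token and add a final period
    masked = g[:i] + [mask_token] + g[i + 1:] + ["."]
    return " ".join(masked), g[i], ug[i]


print(mask_differing_token("the author laughs and is old",
                           "the author laughs and are old"))
# ('the author laughs and <mask> old .', 'is', 'are')
```

Returning `None` for unusable pairs corresponds to the `continue` branch in `data_process`, which skips pairs that differ in more than one token.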



In [73]:
processed_dataset = data_process('simple_agrmt')

In [58]:
processed_dataset = data_process('vp_coord')

In [39]:
processed_dataset = data_process('long_vp_coord')

In [54]:
print("Number of samples:", len(processed_dataset))
print()

print("Input sample:", processed_dataset[2])

Number of samples: 720

Input sample: ('vp_coord', 'sing_MS_MV_MV', 'the author laughs and <mask> old .', 'is', 'are')


## **3. Loading HuggingFace Pipeline**

The *pipeline()* object can be instantiated as follows:

```python
nlp = pipeline(<task_name>, model=<model_name>)
```
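A fill-mask pipeline expects the mask token of its own tokenizer in the input: BERT-style tokenizers use `[MASK]`, while RoBERTa-style tokenizers use `<mask>`. The sketch below shows one way to adapt the sentences produced in Section 2.2 to each model. `use_mask_token` is a hypothetical helper, not part of the original notebook. With a loaded pipeline the token is available as `nlp.tokenizer.mask_token`; literal strings are used here so the example runs without downloading a model.

```python
def use_mask_token(sentence, mask_token, placeholder="<mask>"):
    """Swap the dataset placeholder for the mask token a given model expects."""
    if sentence.count(placeholder) != 1:
        raise ValueError(f"expected exactly one {placeholder!r} in: {sentence!r}")
    return sentence.replace(placeholder, mask_token)


# In the notebook one would pass e.g. bbu.tokenizer.mask_token instead of a literal.
print(use_mask_token("the author <mask> tall .", "[MASK]"))  # the author [MASK] tall .
print(use_mask_token("the author <mask> tall .", "<mask>"))  # the author <mask> tall .
```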

In [12]:
bbu = pipeline("fill-mask", model="bert-base-uncased")
blu = pipeline("fill-mask", model="bert-large-uncased")
rb  = pipeline("fill-mask", model="roberta-base")
rl  = pipeline("fill-mask", model="roberta-large")
dbu  = pipeline("fill-mask", model="distilbert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## **4. Running the pipeline on the dataset**

After loading the dataset and the model, we can simply call the *pipeline* on one item (i.e. a sentence) as follows:

```python
predictions = nlp(<sentence>, targets=<target_tokens>)
```
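For a single masked position, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`. The sketch below shows one way to read the probabilities of the two target tokens out of such a list. `target_probabilities` is a hypothetical helper, and `mock_predictions` is a hand-written stand-in for real pipeline output: its scores and token ids are made up for illustration and are not model results.

```python
def target_probabilities(predictions, targets):
    """Map each requested target token to its score, or None if it is absent."""
    # strip() guards against tokenizers that decode a token with surrounding whitespace
    scores = {p["token_str"].strip(): p["score"] for p in predictions}
    return {t: scores.get(t) for t in targets}


# Illustrative stand-in for the output of nlp(sentence, targets=[...])
mock_predictions = [
    {"score": 0.9, "token": 1, "token_str": "is", "sequence": "the author is tall ."},
    {"score": 0.1, "token": 2, "token_str": "are", "sequence": "the author are tall ."},
]

print(target_probabilities(mock_predictions, ["is", "are"]))
# {'is': 0.9, 'are': 0.1}
```

Looking targets up by name makes a missing target explicit (`None`) instead of assigning its score to the other token, which matters when the pipeline replaces an out-of-vocabulary target with a sub-token.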



In [13]:
def getPredictionsDf(bert_model):
  columns = ["phenomena", "template", "sentence", 
            "correct_token", "prob_correct_token", 
            "incorrect_token", "prob_incorrect_token"]

  # Map the short model names to the pipelines loaded in Section 3
  pipelines = {"bbu": bbu, "blu": blu, "rb": rb, "rl": rl, "dbu": dbu}
  nlp = pipelines[bert_model]

  df = pd.DataFrame(columns=columns)

  for input_sample in processed_dataset:
    sentence = input_sample[2]
    targets = input_sample[3:]  # (correct token, incorrect token)

    dict_results = {"phenomena": [input_sample[0]],
                    "template": [input_sample[1]],
                    "sentence": [sentence],
                    "correct_token": [targets[0]],
                    "incorrect_token": [targets[1]]}

    # Restrict the predictions to the two target tokens
    predictions = nlp(sentence, targets=targets)

    # If a target is missing from the vocabulary, the pipeline substitutes a
    # sub-token, so the corresponding probability may remain empty (NaN)
    for pred in predictions:
      token = pred["token_str"]
      prob = pred["score"]
      if token == targets[0]:
        dict_results["prob_correct_token"] = [prob]
      else:
        dict_results["prob_incorrect_token"] = [prob]

    df = pd.concat([df, pd.DataFrame(dict_results)])
    
  return df

In [69]:
bbu_df = getPredictionsDf('bbu')

In [70]:
blu_df = getPredictionsDf('blu')

In [71]:
dbu_df = getPredictionsDf('dbu')

In [74]:
rb_df = getPredictionsDf('rb')

The specified target token `smiles` does not exist in the model vocabulary. Replacing with `sm`.
The specified target token `smile` does not exist in the model vocabulary. Replacing with `sm`.

In [75]:
rl_df = getPredictionsDf('rl')

The specified target token `smiles` does not exist in the model vocabulary. Replacing with `sm`.
The specified target token `smile` does not exist in the model vocabulary. Replacing with `sm`.

First 5 rows of the resulting DataFrame:

In [41]:
rb_df.head()

Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token
0,simple_agrmt,sing_MS_MV,the author <mask> .,laughs,2.526092e-08,laugh,1.149672e-10
0,simple_agrmt,sing_MS_MV,the author <mask> .,smiles,,smile,9.999931e-10
0,simple_agrmt,sing_MS_MV,the author <mask> tall .,is,0.0005512991,are,2.016677e-06
0,simple_agrmt,sing_MS_MV,the author <mask> old .,is,0.0001157206,are,1.963661e-06
0,simple_agrmt,sing_MS_MV,the author <mask> young .,is,8.944152e-05,are,4.050764e-07


## **5. Evaluation**



In [47]:
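def evaluate(df_type):
  # A prediction is correct when the grammatical token is more probable than
  # the ungrammatical one. Rows with a missing probability match neither query.
  corr = df_type.query('prob_correct_token > prob_incorrect_token')
  miss = df_type.query('prob_correct_token < prob_incorrect_token')

  tot_corr = len(corr)
  tot_miss = len(miss)

  print("Total number of correct predictions:", tot_corr)
  print("Total number of wrong predictions:", tot_miss)
  print()

  accuracy = tot_corr/(tot_corr+tot_miss)
  print("Accuracy:", accuracy)

  # Return the misclassified rows for inspection
  return miss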
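
Because comparisons involving NaN evaluate to false, rows with a missing probability fall into neither `corr` nor `miss` and silently drop out of the accuracy denominator. The sketch below makes that bookkeeping explicit using only the standard library. `summarize` is a hypothetical helper, not the notebook's method, and the three probability pairs are taken from the RoBERTa outputs shown in this notebook.

```python
import math


def summarize(pairs):
    """Count correct, wrong and skipped pairs of (prob_correct, prob_incorrect)."""
    correct = wrong = skipped = 0
    for good, bad in pairs:
        if good is None or bad is None or math.isnan(good) or math.isnan(bad):
            skipped += 1  # a probability is missing
        elif good > bad:
            correct += 1
        elif good < bad:
            wrong += 1
        else:
            skipped += 1  # ties count as neither correct nor wrong
    scored = correct + wrong
    accuracy = correct / scored if scored else None
    return {"correct": correct, "wrong": wrong,
            "skipped": skipped, "accuracy": accuracy}


pairs = [
    (2.526092e-08, 1.149672e-10),  # 'laughs' vs 'laugh': correct
    (float("nan"), 9.999931e-10),  # 'smiles' vs 'smile': probability missing
    (2.003753e-07, 5.314522e-07),  # 'laughs' vs 'laugh' after 'the pilot': wrong
]
print(summarize(pairs))
# {'correct': 1, 'wrong': 1, 'skipped': 1, 'accuracy': 0.5}
```

Reporting the skipped count alongside the accuracy helps when comparing models whose vocabularies cover different subsets of the target tokens.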

In [76]:
bbu_res = evaluate(bbu_df)
bbu_res

Total number of correct predictions: 120
Total number of wrong predictions: 0

Accuracy: 1.0


Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token


In [77]:
blu_res = evaluate(blu_df)
blu_res

Total number of correct predictions: 120
Total number of wrong predictions: 0

Accuracy: 1.0


Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token


In [78]:
rb_res = evaluate(rb_df)
rb_res

Total number of correct predictions: 97
Total number of wrong predictions: 3

Accuracy: 0.97


Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token
0,simple_agrmt,sing_MS_MV,the pilot <mask> .,laughs,2.003753e-07,laugh,5.314522e-07
0,simple_agrmt,plur_MS_MV,the farmers <mask> old .,are,2.388415e-08,is,3.024282e-08
0,simple_agrmt,plur_MS_MV,the senators <mask> short .,are,4.584198e-07,is,6.399495e-07


In [79]:
rl_res = evaluate(rl_df)
rl_res

Total number of correct predictions: 98
Total number of wrong predictions: 2

Accuracy: 0.98


Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token
0,simple_agrmt,sing_MS_MV,the officer <mask> .,laughs,4.627719e-08,laugh,5.846304e-08
0,simple_agrmt,plur_MS_MV,the pilots <mask> .,laugh,7.122033e-07,laughs,7.534084e-07


In [80]:
dbu_res = evaluate(dbu_df)
dbu_res

Total number of correct predictions: 120
Total number of wrong predictions: 0

Accuracy: 1.0


Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token
