### Auto-correcting grammar for modified sentences
- Author: Sushma Anand Akoju, Email: sushmaakoju@arizona.edu

In [3]:
!pip install transformers


In [4]:
import pandas as pd
import numpy as np

In [5]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [6]:
file1= "modified_verb_phrases_dec24.xlsx"
file2 = "modified_noun_phrases_dec_24.xlsx"
file3 = "modified_obj_sick20_dec24.xlsx"
path = "/content/drive/MyDrive/Colab Notebooks/natural-logic/final-datasets"

In [7]:
import os
os.path.exists(path)

True

### Analyze Modified examples

For each of the 20 examples selected from the SICK dataset, we modify:
1. Verb phrases (VP)
2. Subject (Noun Phrases (NP))
3. Objects (Noun Phrases)

To choose the modifiers, we follow Bill MacCartney's NLI dissertation and the FraCaS dataset.

#### Object Modifiers:
- determiners = "every", "some", "exactly one", "all but one", "no"
- adjectives = "green", "happy", "sad", "good", "bad"
- special adjectives = "an abnormal", "an elegant"

Note: we modify only the object itself, not the adjectives preceding it. For example, "a large pond" becomes "an elegant, large pond" and not "an elegantly large pond". Although "elegantly large" is not semantically in use anyway, such cases are not considered.

#### Verb Phrase Modifiers:
- "not"
- adverbs = "abnormally","elegantly","always","never"

Example: is visiting -> is not visiting, is abnormally visiting, is elegantly visiting.
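
The verb phrase modification above can be sketched as a plain token insertion. This is a minimal illustration, not the procedure used to build the spreadsheets: the helper name `modify_verb_phrase` is hypothetical, and it assumes the sentence contains the auxiliary "is" as a separate token.

```python
def modify_verb_phrase(sentence: str, modifier: str, auxiliary: str = "is") -> str:
    """Insert `modifier` right after the first occurrence of `auxiliary`."""
    tokens = sentence.split()
    if auxiliary not in tokens:
        raise ValueError(f"auxiliary {auxiliary!r} not found in: {sentence!r}")
    position = tokens.index(auxiliary) + 1
    return " ".join(tokens[:position] + [modifier] + tokens[position:])


# Apply every verb phrase modifier listed above to one SICK premise.
for modifier in ["not", "abnormally", "elegantly", "always", "never"]:
    print(modify_verb_phrase("an old man is sitting in a field", modifier))
```

Sentences with a different auxiliary (for example "are") would need the `auxiliary` argument set accordingly.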

#### Subject (Noun Phrase) Modifiers:
- determiners = 'every', 'some', 'at least', 'not every', 'exactly one', 'all but one', 'everyone of', 'no'
- adjectives = "green", "happy", "sad", "good", "bad"

Example: An old man -> every old man, some old man, green old man, happy old man.
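
Each modifier is applied to the premise only, the hypothesis only, or both, which gives three variants per sentence pair. A rough sketch of that scheme for subject modifiers follows. The helper names `replace_leading_article` and `make_variants` and the `"Modified"` key are hypothetical, and the code assumes the subject noun phrase opens the sentence with an article.

```python
ARTICLES = {"a", "an", "the"}


def replace_leading_article(sentence, modifier):
    """Swap a sentence-initial article for `modifier` (prepend it if there is no article)."""
    tokens = sentence.split()
    if tokens and tokens[0].lower() in ARTICLES:
        tokens = tokens[1:]
    return " ".join([modifier] + tokens)


def make_variants(premise, hypothesis, modifier):
    """Return the Premise-only, Hypothesis-only and Both variants of one pair."""
    mod_premise = replace_leading_article(premise, modifier)
    mod_hypothesis = replace_leading_article(hypothesis, modifier)
    return [
        {"Premise": mod_premise, "Hypothesis": hypothesis, "Modified": "Premise"},
        {"Premise": premise, "Hypothesis": mod_hypothesis, "Modified": "Hypothesis"},
        {"Premise": mod_premise, "Hypothesis": mod_hypothesis, "Modified": "Both"},
    ]


for variant in make_variants(
    "an old man is sitting in a field", "a man is sitting in a field", "every"
):
    print(variant)
```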


In [8]:
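# Load the verb phrase, subject (noun phrase) and object modification sheets.
vp = pd.read_excel(os.path.join(path, file1), sheet_name="modified_verb_phrases_dec24")
# Caution: this rebinds `np` from the numpy module to the noun-phrase DataFrame.
np = pd.read_excel(os.path.join(path, file2), sheet_name="modified_noun_phrases_dec_24")
obj = pd.read_excel(os.path.join(path, file3), sheet_name="modified_obj_sick20")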
# vp = vp.drop(['Unnamed: 0'], axis=1)
# np = np.drop(['Unnamed: 0'], axis=1)
# vp = vp.fillna("NONE")
# np = np.fillna("NONE")
np['SICK_id'] = np['SICK_id'].astype('int')
np['Sno'] = np['Sno'].astype('int')
vp['SICK_id'] = vp['SICK_id'].astype('int')
vp['Sno'] = vp['Sno'].astype('int')
obj['SICK_id'] = obj['SICK_id'].astype('int')
obj['Sno'] = obj['Sno'].astype('int')

In [9]:
vp.shape[0], np.shape[0], obj.shape[0], vp.shape[0] + np.shape[0] +obj.shape[0]

(320, 800, 726, 1846)

In [10]:
len(vp['SICK_id'].unique()), len(np['SICK_id'].unique()), len(obj['SICK_id'].unique())

(15, 15, 15)

In [11]:
vp['SICK_id'].unique(), np['SICK_id'].unique(), obj['SICK_id'].unique()

(array([ 129,  273,  455,  167,  211,  116,  130,  212,  342,  443, 1131,
          90,  150,  168,  751]),
 array([ 129,  273,  455,  167,  211,  116,  130,  212,  342,  443, 1131,
          90,  150,  168,  751]),
 array([ 129,  273,  455,  167,  211,  116,  130,  212,  342,  443, 1131,
          90,  150,  168,  751]))

In [12]:
vp.head()

Unnamed: 0,Sno,SICK_id,Flipped (Yes/No),Premise,Hypothesis,Modifier,Premise/Hypothesis/Both,Part of Premise/Hypothesis Modified,Label
0,0,129,No,an old man is sitting in a field,a man is sitting in a field,NONE,NONE,NONE,Entailment
1,1,129,No,an old man is not sitting in a field,a man is sitting in a field,not,Premise,Verb,
2,2,129,No,an old man is sitting in a field,a man is not sitting in a field,not,Hypothesis,Verb,
3,3,129,No,an old man is not sitting in a field,a man is not sitting in a field,not,Both,Verb,
4,4,129,No,an old man is abnormally sitting in a field,a man is sitting in a field,abnormally,Premise,Verb,


In [13]:
np.head()

Unnamed: 0,Sno,SICK_id,Flipped (Yes/No),Premise,Hypothesis,Modifier,Premise/Hypothesis/Both,Part of Premise/Hypothesis Modified,Label
0,0,129,No,an old man is sitting in a field,a man is sitting in a field,NONE,NONE,NONE,Entailment
1,1,129,No,every old man is sitting in a field,a man is sitting in a field,every,Premise,Subject,
2,2,129,No,an old man is sitting in a field,every man is sitting in a field,every,Hypothesis,Subject,
3,3,129,No,every old man is sitting in a field,every man is sitting in a field,every,Both,Subject,
4,4,129,No,some old man is sitting in a field,a man is sitting in a field,some,Premise,Subject,


In [14]:
obj.head()

Unnamed: 0,Sno,SICK_id,Flipped (Yes/No),Premise,Hypothesis,Modifier,Premise/Hypothesis/Both,Part of Premise/Hypothesis Modified,Label
0,0,129,No,an old man is sitting in a field,a man is sitting in a field\n,NONE,NONE,NONE,Entailment
1,1,129,No,an old man is sitting in a good field,a man is sitting in a field\n,good,Premise,Object,
2,2,129,No,an old man is sitting in a field,a man is sitting in a good field,good,Hypothesis,Object,
3,3,129,No,an old man is sitting in a good field,a man is sitting in a good field,good,Both,Object,
4,4,129,No,an old man is sitting in all but one field,a man is sitting in a field\n,all but one,Premise,Object,


In [15]:
obj.columns

Index(['Sno', 'SICK_id', 'Flipped (Yes/No)', 'Premise', 'Hypothesis',
       'Modifier', 'Premise/Hypothesis/Both',
       'Part of Premise/Hypothesis Modified', 'Label'],
      dtype='object')

In [16]:
vp.columns

Index(['Sno', 'SICK_id', 'Flipped (Yes/No)', 'Premise', 'Hypothesis',
       'Modifier', 'Premise/Hypothesis/Both',
       'Part of Premise/Hypothesis Modified', 'Label'],
      dtype='object')

In [17]:
all_rows = []
concat_df = pd.concat([np, vp, obj])
# concat_df.to_excel(os.path.join(path,"all_modified_sentences.xlsx"))

In [55]:
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/grammar-synthesis-large',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)

[{'generated_text': 'I can eat a cheeseburger.'}]


In [57]:
raw_text = 'old man not is sitting in the room'
results = corrector(raw_text)
print(results)

[{'generated_text': 'The old man is not sitting in the room.'}]


In [56]:
vp.columns

Index(['Sno', 'SICK_id', 'Flipped (Yes/No)', 'Premise', 'Hypothesis',
       'Modifier', 'Premise/Hypothesis/Both',
       'Part of Premise/Hypothesis Modified', 'Label'],
      dtype='object')
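
With the column names listed above, a row can also be handled by name rather than by position. The sketch below assumes each row is a plain dict keyed by those column names (the form produced by `DataFrame.to_dict(orient="records")`). `correct_row` and `fake_corrector` are hypothetical names, and `fake_corrector` only mimics the list-of-dicts shape returned by the `transformers` pipeline so that the example runs without downloading a model.

```python
def correct_row(row, corrector):
    """Run `corrector` only on the side(s) of the pair that were modified."""
    side = row["Premise/Hypothesis/Both"]
    premise, hypothesis = row["Premise"], row["Hypothesis"]
    if side in ("Premise", "Both"):
        premise = corrector(premise)[0]["generated_text"]
    if side in ("Hypothesis", "Both"):
        hypothesis = corrector(hypothesis)[0]["generated_text"]
    # Copy the row so the uncorrected input stays available for comparison.
    return {**row, "Premise": premise, "Hypothesis": hypothesis}


def fake_corrector(text):
    """Stand-in for the real pipeline: capitalize and add a final period."""
    return [{"generated_text": text[0].upper() + text[1:] + "."}]


row = {
    "SICK_id": 129,
    "Premise": "an old man is not sitting in a field",
    "Hypothesis": "a man is sitting in a field",
    "Modifier": "not",
    "Premise/Hypothesis/Both": "Premise",
}
print(correct_row(row, fake_corrector))
```

Passing the real `corrector` in place of `fake_corrector` should work the same way, since only the `[0]["generated_text"]` shape is relied on.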

In [65]:
modified_vp = []
counter = 0
for this_set in vp.to_records():
  # Drop the record index and Sno: row[0] is SICK_id, row[2] the premise,
  # row[3] the hypothesis, row[4] the modifier and row[5] the modified side.
  row = list(this_set)[2:]

  if row[5] == 'Premise':
    premise = corrector(row[2])
    print(row[5], row[2], premise)
  if row[5] == 'Hypothesis':
    hypothesis = corrector(row[3])
    print(row[5], row[3], hypothesis)
  if row[5] == "Both":
    premise = corrector(row[2])
    hypothesis = corrector(row[3])
    # Print each corrected sentence next to its own input.
    print(row[5], row[2], premise, row[3], hypothesis)
  counter += 1
  # Only inspect the first 10 rows.
  if counter == 10:
    break
  # modified_vp.append({"SICK_id":row[0], "premise": premise, "hypothesis":hypothesis, "modifier":row[4], 'premise/hypothesis/both':row[5],
  #      'part of premise/hypothesis modified':row[6], 'label':row[7]})

Premise an old man is not sitting in a field [{'generated_text': 'The old man is not sitting in a field.'}]
Hypothesis a man is not sitting in a field [{'generated_text': 'I am not sitting in a field.'}]
Both an old man is not sitting in a field [{'generated_text': 'The old man is not sitting in a field.'}] a man is not sitting in a field [{'generated_text': 'I am not sitting in a field.'}]
Premise an old man is abnormally sitting in a field [{'generated_text': 'Is an old man unusually sitting in a field?'}]
Hypothesis a man is abnormally sitting in a field [{'generated_text': 'Is a man abnormally sitting in a field?'}]
Both an old man is abnormally sitting in a field [{'generated_text': 'Is an old man unusually sitting in a field?'}] a man is abnormally sitting in a field [{'generated_text': 'Is a man abnormally sitting in a field?'}]
Premise an old man is elegantly sitting in a field [{'generated_text': 'I see an old man sitting elegantly in a field.'}]
Hypothesis a man is 

In [18]:
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)

[{'generated_text': 'I can have a hamburger.'}]


In [19]:
raw_text = 'old man not is sitting in the room'
results = corrector(raw_text)
print(results)

[{'generated_text': 'An old man is not sitting in the room.'}]
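
A grammar corrector may rewrite more than surface form, so it can help to check that the modifier survives correction. The following is a minimal sketch of such a check, not part of the original notebook: `words` and `modifier_preserved` are hypothetical helpers, and the check only looks for the modifier's words as a contiguous, case-insensitive run in the corrected text.

```python
import re


def words(text):
    """Lowercase `text` and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def modifier_preserved(modifier, corrected):
    """True if the words of `modifier` appear contiguously in `corrected`."""
    mod, cor = words(modifier), words(corrected)
    n = len(mod)
    return any(cor[i:i + n] == mod for i in range(len(cor) - n + 1))


# The negation is kept in this corrected sentence.
print(modifier_preserved("not", "An old man is not sitting in the room."))
# An adverb that the corrector dropped is flagged.
print(modifier_preserved("abnormally", "An old man is sitting in a field."))
```

Baseline rows whose modifier is `NONE` would need to be skipped before applying this check.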


In [20]:
modified_vp = []
counter = 0
for this_set in vp.to_records():
  # Drop the record index and Sno: row[2] is the premise, row[3] the hypothesis
  # and row[5] the modified side (Premise, Hypothesis or Both).
  row = list(this_set)[2:]

  if row[5] == 'Premise':
    premise = corrector(row[2])
    print(row[5], row[2], premise)
  if row[5] == 'Hypothesis':
    hypothesis = corrector(row[3])
    print(row[5], row[3], hypothesis)
  if row[5] == "Both":
    premise = corrector(row[2])
    hypothesis = corrector(row[3])
    # Print each corrected sentence next to its own input.
    print(row[5], row[2], premise, row[3], hypothesis)
  counter += 1
  # Only inspect the first 10 rows.
  if counter == 10:
    break

Premise an old man is not sitting in a field [{'generated_text': 'An old man is not sitting in a field.'}]
Hypothesis a man is not sitting in a field [{'generated_text': 'A man is not sitting in a field.'}]
Both an old man is not sitting in a field [{'generated_text': 'An old man is not sitting in a field.'}] a man is not sitting in a field [{'generated_text': 'A man is not sitting in a field.'}]
Premise an old man is abnormally sitting in a field [{'generated_text': 'An old man is sitting in a field.'}]
Hypothesis a man is abnormally sitting in a field [{'generated_text': 'A man is sitting in a field.'}]
Both an old man is abnormally sitting in a field [{'generated_text': 'An old man is sitting in a field.'}] a man is abnormally sitting in a field [{'generated_text': 'A man is sitting in a field.'}]
Premise an old man is elegantly sitting in a field [{'generated_text': 'An old man is sitting quietly in a field.'}]
Hypothesis a man is elegantly sitting in a field [{'generated