#### Auto-generating block-wise annotator assignment files for the modified sentences
- Author: Sushma Anand Akoju, Email: sushmaakoju@arizona.edu

In [1]:
import pandas as pd
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [9]:
file1= "modified_verb_phrases_dec24.xlsx"
file2 = "modified_noun_phrases_dec_24.xlsx"
file3 = "modified_obj_sick20_dec24.xlsx"
path = "/content/drive/MyDrive/Colab Notebooks/natural-logic/final-datasets"

In [4]:
import os
os.path.exists(path)

True

## Approach followed

The approach followed here is:


1.  We use the SICK dataset (<a href="https://huggingface.co/datasets/sick">HuggingFace SICK dataset</a>).
2.  We select examples that are "compositional" (example: class *full of* students vs. class is *empty*) and "conditioned" on specific tokens in the sentence.
3.  We then select 5 examples for each of the labels Entailment, Contradiction and Neutral from the SICK dataset.
4.  We then take 5 of the examples labelled "Entailment" and flip premise and hypothesis to obtain candidate Reverse Entailment pairs. One could assume that Reverse Entailment holds for these.
5. The SICK dataset provides two additional labels: "entailment_AB" and "entailment_BA".
6. For each sentence pair labelled "Entailment" in SICK, we check the "entailment_AB" value, which is "A_entails_B" for all Entailment examples.
7. We also check the "entailment_BA" column for each Entailment pair and find that the reverse direction is labelled neutral, i.e. "B_neutral_A". SICK therefore marks the flipped cases as Neutral.
8. There are several directions to pursue here. One is to ask annotators to label the flipped Entailment cases (as Reverse Entailment or Neutral). This remains a case to discuss, since NLI systems show varied results on such pairs.
9. Last but not least, we note that all of the Entailment cases hold for Forward Entailment.
10. For the flipped Entailment examples, annotation only needs to establish agreement on Reverse Entailment (RE) versus Neutral.
11. We can additionally provide an analysis of the flipped cases from NLI systems.
12. We additionally selected 5 examples each of Neutral and Contradiction, for which both A_NEUTRAL_B and B_NEUTRAL_A, or both A_CONTRADICTS_B and B_CONTRADICTS_A, hold, respectively.
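
The flipping described in the steps above can be sketched as follows. This is a minimal illustration under assumptions, not the code used to build the spreadsheet: the helper `flip_pair` is hypothetical, rows are plain dictionaries rather than rows of the Excel sheet, and the label "Possible Neutral" is the value used in the `SICK label` column of the example sheet.

```python
def flip_pair(row):
    """Swap premise and hypothesis of an Entailment pair (hypothetical helper)."""
    return {
        "SICK_id": row["SICK_id"],
        "Premise": row["Hypothesis"],
        "Hypothesis": row["Premise"],
        "Flipped yes/no": "Yes",
        # Assumption: flipped Entailment pairs are provisionally marked
        # "Possible Neutral" until annotators decide between RE and Neutral.
        "SICK label": "Possible Neutral",
        # The two directional labels trade places along with the sentences.
        "entailment_AB": row["entailment_BA"],
        "entailment_BA": row["entailment_AB"],
    }


original = {
    "SICK_id": 129,
    "Premise": "an old man is sitting in a field",
    "Hypothesis": "a man is sitting in a field",
    "Flipped yes/no": "No",
    "SICK label": "Entailment",
    "entailment_AB": "A_entails_B",
    "entailment_BA": "B_neutral_A",
}

flipped = flip_pair(original)
print(flipped["Premise"], "->", flipped["Hypothesis"])
```

The flipped row keeps the same `SICK_id`, so the original and flipped versions of a pair can still be grouped together later.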

In [5]:
df = pd.read_excel(os.path.join(path, "modifiers-sentences-sick-dataset.xlsx"), sheet_name="sick-examples")
df['SICK_id'] = df['SICK_id'].astype('int')
df['Sno'] = df['Sno'].astype('int')
df.head()

Unnamed: 0,Sno,SICK_id,Premise,Hypothesis,Flipped yes/no,SICK label,entailment_AB,entailment_BA
0,1,129,an old man is sitting in a field,a man is sitting in a field,No,Entailment,A_entails_B,B_neutral_A
1,2,273,A boy is standing in the cold water,A boy is standing in the water,No,Entailment,A_entails_B,B_neutral_A
2,3,455,Two children are hanging on a large branch,Two children are climbing a tree,No,Entailment,A_entails_B,B_neutral_A
3,4,129,a man is sitting in a field,an old man is sitting in a field,Yes,Possible Neutral,B_neutral_A,A_entails_B
4,5,273,A boy is standing in the water,A boy is standing in the cold water,Yes,Possible Neutral,B_neutral_A,A_entails_B


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Sno             20 non-null     int64 
 1   SICK_id         20 non-null     int64 
 2   Premise         20 non-null     object
 3   Hypothesis      20 non-null     object
 4   Flipped yes/no  20 non-null     object
 5   SICK label      20 non-null     object
 6   entailment_AB   20 non-null     object
 7   entailment_BA   20 non-null     object
dtypes: int64(2), object(6)
memory usage: 1.4+ KB


## Analysing the flipped examples


1.   4 flipped examples are labelled "Possible Neutral".
2.   For these flipped cases the SICK annotation is neutral, as indicated by the `_neutral_` part of `B_neutral_A`.



In [7]:
df[(df['entailment_AB'] == "B_neutral_A") & (df['SICK label'] == "Possible Neutral")]


Unnamed: 0,Sno,SICK_id,Premise,Hypothesis,Flipped yes/no,SICK label,entailment_AB,entailment_BA
3,4,129,a man is sitting in a field,an old man is sitting in a field,Yes,Possible Neutral,B_neutral_A,A_entails_B
4,5,273,A boy is standing in the water,A boy is standing in the cold water,Yes,Possible Neutral,B_neutral_A,A_entails_B
5,6,455,Two children are climbing a tree,Two children are hanging on a large branch,Yes,Possible Neutral,B_neutral_A,A_entails_B
8,9,167,A child is hitting a baseball,A boy is hitting a baseball,Yes,Possible Neutral,B_neutral_A,A_entails_B


### Analyze Modified examples

For each of the 20 examples selected from the SICK dataset, we modify:
1. Verb phrases (VP)
2. Subjects (noun phrases (NP))
3. Objects (noun phrases)

For selecting modifiers, we refer to Bill MacCartney's NLI dissertation and the FraCaS dataset.

#### Object Modifiers:
- determiners = "every", "some", "exactly one", "all but one", "no"
- adjectives = "green", "happy", "sad", "good", "bad"
- special adjectives = "an abnormal", "an elegant"

Note: we only modify the object itself, not the adjectives preceding it. For example, "a large pond" becomes "an elegant, large pond" and not "an elegantly large pond". Although "elegantly large" is not semantically in use, such a case is still not considered.

#### Verb Phrase Modifiers:
- "not"
- adverbs = "abnormally","elegantly","always","never"

Example: is visiting -> is not visiting, is abnormally visiting, is elegantly visiting.

#### Subject (Noun Phrase) Modifiers:
- 'every', 'some', 'at least','not every','exactly one', 'all but one','everyone of', 'no'
- adjectives = "green", "happy", "sad", "good", "bad"

Example: An old man -> every old man, some old man, green old man, happy old man.
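
As a rough sketch of how the Premise / Hypothesis / Both variants could be produced for a verb-phrase modifier, consider the following. The helpers `modify_verb` and `variants` are hypothetical and only cover sentences containing the chosen auxiliary (here "is"); how the actual spreadsheets were generated is not shown in this notebook.

```python
def modify_verb(sentence, modifier, auxiliary="is"):
    """Insert `modifier` right after the first occurrence of `auxiliary`.

    Hypothetical helper: raises ValueError if the auxiliary is absent.
    """
    words = sentence.split()
    idx = words.index(auxiliary)
    return " ".join(words[: idx + 1] + [modifier] + words[idx + 1:])


def variants(premise, hypothesis, modifier):
    """Return (premise, hypothesis, modified_side) rows for one modifier."""
    p_mod = modify_verb(premise, modifier)
    h_mod = modify_verb(hypothesis, modifier)
    return [
        (p_mod, hypothesis, "Premise"),
        (premise, h_mod, "Hypothesis"),
        (p_mod, h_mod, "Both"),
    ]


rows = variants(
    "an old man is sitting in a field",
    "a man is sitting in a field",
    "not",
)
for row in rows:
    print(row)
```

Each modifier thus yields three rows per sentence pair, one for each value of the `Premise/Hypothesis/Both` column.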


In [21]:
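# Load the three workbooks of modified sentences: verb phrases, subject noun phrases and objects.
vp = pd.read_excel(os.path.join(path, file1), sheet_name="modified_verb_phrases_dec24")
# Note: `np` rebinds the NumPy alias imported in the first cell to this DataFrame.
np = pd.read_excel(os.path.join(path, file2), sheet_name="modified_noun_phrases_dec_24")
obj = pd.read_excel(os.path.join(path, file3), sheet_name="modified_obj_sick20")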
# vp = vp.drop(['Unnamed: 0'], axis=1)
# np = np.drop(['Unnamed: 0'], axis=1)
# vp = vp.fillna("NONE")
# np = np.fillna("NONE")
np['SICK_id'] = np['SICK_id'].astype('int')
np['Sno'] = np['Sno'].astype('int')
vp['SICK_id'] = vp['SICK_id'].astype('int')
vp['Sno'] = vp['Sno'].astype('int')
obj['SICK_id'] = obj['SICK_id'].astype('int')
obj['Sno'] = obj['Sno'].astype('int')

In [22]:
vp.head()

Unnamed: 0,Sno,SICK_id,Flipped (Yes/No),Premise,Hypothesis,Modifier,Premise/Hypothesis/Both,Part of Premise/Hypothesis Modified,Label
0,0,129,No,an old man is sitting in a field,a man is sitting in a field,NONE,NONE,NONE,Entailment
1,1,129,No,an old man is not sitting in a field,a man is sitting in a field,not,Premise,Verb,
2,2,129,No,an old man is sitting in a field,a man is not sitting in a field,not,Hypothesis,Verb,
3,3,129,No,an old man is not sitting in a field,a man is not sitting in a field,not,Both,Verb,
4,4,129,No,an old man is abnormally sitting in a field,a man is sitting in a field,abnormally,Premise,Verb,


In [24]:
np.head()

Unnamed: 0,Sno,SICK_id,Flipped (Yes/No),Premise,Hypothesis,Modifier,Premise/Hypothesis/Both,Part of Premise/Hypothesis Modified,Label
0,0,129,No,an old man is sitting in a field,a man is sitting in a field,NONE,NONE,NONE,Entailment
1,1,129,No,every old man is sitting in a field,a man is sitting in a field,every,Premise,Subject,
2,2,129,No,an old man is sitting in a field,every man is sitting in a field,every,Hypothesis,Subject,
3,3,129,No,every old man is sitting in a field,every man is sitting in a field,every,Both,Subject,
4,4,129,No,some old man is sitting in a field,a man is sitting in a field,some,Premise,Subject,


In [25]:
obj.head()

Unnamed: 0,Sno,SICK_id,Flipped (Yes/No),Premise,Hypothesis,Modifier,Premise/Hypothesis/Both,Part of Premise/Hypothesis Modified,Label
0,0,129,No,an old man is sitting in a field,a man is sitting in a field\n,NONE,NONE,NONE,Entailment
1,1,129,No,an old man is sitting in a good field,a man is sitting in a field\n,good,Premise,Object,
2,2,129,No,an old man is sitting in a field,a man is sitting in a good field,good,Hypothesis,Object,
3,3,129,No,an old man is sitting in a good field,a man is sitting in a good field,good,Both,Object,
4,4,129,No,an old man is sitting in all but one field,a man is sitting in a field\n,all but one,Premise,Object,


In [26]:
obj.columns

Index(['Sno', 'SICK_id', 'Flipped (Yes/No)', 'Premise', 'Hypothesis',
       'Modifier', 'Premise/Hypothesis/Both',
       'Part of Premise/Hypothesis Modified', 'Label'],
      dtype='object')

In [27]:
vp.columns

Index(['Sno', 'SICK_id', 'Flipped (Yes/No)', 'Premise', 'Hypothesis',
       'Modifier', 'Premise/Hypothesis/Both',
       'Part of Premise/Hypothesis Modified', 'Label'],
      dtype='object')

In [28]:
# vp = vp.rename(columns={'premise':'Premise', 'hypothesis':'Hypothesis', 'Modifier':'Modifier', 'Premise/Hypothesis/Both':'Premise/Hypothesis/Both',
#        'Part of Premise/Hypothesis Modified':'Part of Premise/Hypothesis Modified', 'SICK_id':'SICK_id'})
# np = np.rename(columns={'premise':'Premise', 'hypothesis':'Hypothesis', 'Modifier':'Modifier', 'Premise/Hypothesis/Both':'Premise/Hypothesis/Both',
#        'Part of Premise/Hypothesis Modified':'Part of Premise/Hypothesis Modified', 'SICK_id':'SICK_id'})

In [29]:
# Copy the slice so the new columns are added to an independent DataFrame
# (this avoids pandas' SettingWithCopyWarning).
sick = df[['Premise','Hypothesis','SICK_id']].copy()
sick['Modifier'] = "NONE"
sick['Premise/Hypothesis/Both'] = "NONE"
sick['Part of Premise/Hypothesis Modified'] = "NONE"


In [32]:
# Stack the subject, verb-phrase and object variants into one table and save it.
concat_df = pd.concat([np, vp, obj])
concat_df.to_excel(os.path.join(path,"all_modified_sentences.xlsx"))

In [41]:
gp = concat_df.groupby(['SICK_id','Flipped (Yes/No)'])
groups = list(gp.groups.keys())

In [43]:
groups

[(90, 'No'),
 (116, 'No'),
 (129, 'No'),
 (129, 'Yes'),
 (130, 'No'),
 (150, 'No'),
 (167, 'No'),
 (167, 'Yes'),
 (168, 'No'),
 (211, 'No'),
 (211, 'Yes'),
 (212, 'No'),
 (273, 'No'),
 (273, 'Yes'),
 (342, 'No'),
 (443, 'No'),
 (455, 'No'),
 (455, 'Yes'),
 (751, 'No'),
 (1131, 'No')]

### Flipped sentences vs. the rest of the sentences, which preserve the SICK label as well as the bidirectional entailment relation
#### Entailment Relations = ["Entailment", "Contradiction", "Neutral"]

In [50]:
# Split the group keys into flipped pairs and original (non-flipped) pairs.
flipped_groups = [g for g in groups if g[1] == 'Yes']
final_groups = [g for g in groups if g[1] == 'No']
len(flipped_groups) == 5, len(final_groups) == 15

(True, True)

In [53]:
type(gp.get_group(final_groups[0]))

pandas.core.frame.DataFrame

In [55]:
final_groups[0:4]

[(90, 'No'), (116, 'No'), (129, 'No'), (130, 'No')]

### Sliding window for block distribution
Pseudocode:
 - Start by selecting the first 8 groups
 - Stride = 3 (consecutive windows overlap by 5)
 - Select the next 8 starting from index 3 (counting from 0), and so on
 - Save to 
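
The sliding-window idea can be sketched generically as below. This is an illustrative helper, not the notebook's own loop: the function name `sliding_windows` is an assumption, and the integers stand in for the `(SICK_id, flipped)` group keys.

```python
def sliding_windows(items, window_size, stride):
    """Return full windows of `window_size` items whose starts are `stride` apart."""
    last_start = len(items) - window_size
    return [items[s:s + window_size] for s in range(0, last_start + 1, stride)]


groups = list(range(15))  # stand-ins for the 15 non-flipped group keys
blocks = sliding_windows(groups, window_size=8, stride=3)
for block in blocks:
    print(block)

# Consecutive windows share window_size - stride items.
overlap = len(set(blocks[0]) & set(blocks[1]))
print(overlap)
```

With 15 items, a window of 8 and a stride of 3, only three full windows fit and the last item falls outside all of them, so any remaining items need to be assigned separately.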

In [207]:
window_size = 8
stride = 3
annotator1 = pd.DataFrame()

In [235]:
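import random
# Build overlapping blocks of `window_size` groups, advancing by `stride` each time.
grouped_keys = []
j = 0
for i in range(0, len(final_groups) - window_size, stride):
  print(i, i + window_size, window_size)
  grouped_keys.append({j: final_groups[i:i + window_size]})
  j += 1

print(i)
# Two extra blocks combine early groups with the tail of the last window.
# Both reuse the same key j; the key is only a label and is not used when the files are written.
grouped_keys.append({j: final_groups[0:4] + final_groups[i+4:i+8]})
grouped_keys.append({j: final_groups[2:6] + final_groups[i+4:i+8]})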

0 8 8
3 11 8
6 14 8
6


In [236]:
s = set()
j = 0
# For each block, print its groups, then the groups it does not share with the next block (cyclically).
for i,a in enumerate(grouped_keys):
  print(list(a.keys())[0], len(list(a.values())[0]), list(a.values())[0])
  j = (j+1)%len(grouped_keys)
  print(i,j,set(list(a.values())[0]).difference(list(grouped_keys[j].values())[0]))

0 8 [(455, 'No'), (212, 'No'), (342, 'No'), (129, 'No'), (167, 'No'), (150, 'No'), (443, 'No'), (90, 'No')]
0 1 {(455, 'No'), (342, 'No'), (212, 'No')}
1 8 [(129, 'No'), (167, 'No'), (150, 'No'), (443, 'No'), (90, 'No'), (116, 'No'), (751, 'No'), (130, 'No')]
1 2 {(150, 'No'), (167, 'No'), (129, 'No')}
2 8 [(443, 'No'), (90, 'No'), (116, 'No'), (751, 'No'), (130, 'No'), (273, 'No'), (168, 'No'), (1131, 'No')]
2 3 {(443, 'No'), (751, 'No'), (116, 'No'), (90, 'No')}
3 8 [(455, 'No'), (212, 'No'), (342, 'No'), (129, 'No'), (130, 'No'), (273, 'No'), (168, 'No'), (1131, 'No')]
3 4 {(455, 'No'), (212, 'No')}
3 8 [(342, 'No'), (129, 'No'), (167, 'No'), (150, 'No'), (130, 'No'), (273, 'No'), (168, 'No'), (1131, 'No')]
4 0 {(130, 'No'), (1131, 'No'), (273, 'No'), (168, 'No')}


In [237]:
print(grouped_keys)

[{0: [(455, 'No'), (212, 'No'), (342, 'No'), (129, 'No'), (167, 'No'), (150, 'No'), (443, 'No'), (90, 'No')]}, {1: [(129, 'No'), (167, 'No'), (150, 'No'), (443, 'No'), (90, 'No'), (116, 'No'), (751, 'No'), (130, 'No')]}, {2: [(443, 'No'), (90, 'No'), (116, 'No'), (751, 'No'), (130, 'No'), (273, 'No'), (168, 'No'), (1131, 'No')]}, {3: [(455, 'No'), (212, 'No'), (342, 'No'), (129, 'No'), (130, 'No'), (273, 'No'), (168, 'No'), (1131, 'No')]}, {3: [(342, 'No'), (129, 'No'), (167, 'No'), (150, 'No'), (130, 'No'), (273, 'No'), (168, 'No'), (1131, 'No')]}]
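
A quick coverage check can show how many annotators see each group and whether any group is left out. The sketch below uses toy data and a hypothetical helper `coverage`; for the notebook's own structure, the list of blocks could be obtained with `[list(d.values())[0] for d in grouped_keys]`.

```python
from collections import Counter


def coverage(blocks, all_groups):
    """Count how often each group is assigned and list groups never assigned."""
    counts = Counter(g for block in blocks for g in block)
    unassigned = [g for g in all_groups if counts[g] == 0]
    return counts, unassigned


# Toy stand-ins for the (SICK_id, flipped) keys.
toy_blocks = [["a", "b", "c"], ["b", "c", "d"]]
toy_groups = ["a", "b", "c", "d", "e"]

counts, unassigned = coverage(toy_blocks, toy_groups)
print(dict(counts))
print(unassigned)
```

Applied to the blocks printed above, a check like this would flag any group from `final_groups` that appears in no block; for instance, `(211, 'No')` is listed in `groups` but does not occur in the printed blocks.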


In [257]:

i = 1
for keys in grouped_keys:
  this_set = list(keys.values())[0]
  print(len(this_set))
  j = 1
  # One workbook per annotator, one sheet per block.
  writer = pd.ExcelWriter(os.path.join(path,"annotator_"+str(i)+".xlsx"))
  for group in this_set:
    sheet_name = "block_"+str(j)+"_"+str(group[0])
    # Fetch the rows of this block so that each sheet holds its own group.
    this_annotator = gp.get_group(group)
    j += 1
    this_annotator.to_excel(writer, sheet_name=sheet_name)
  # Report the size (rows x columns) of the first block in this workbook.
  print(gp.get_group(this_set[0]).size)
  writer.close()
  i += 1

8
837
8
837
8
846
8
837
8
837


In [256]:
# pd.read_excel(os.path.join(path,"annotator_"+str(1)+".xlsx"))