In [1]:
%load_ext autoreload
%autoreload 2
is_cuda = True

# Overview

This notebook is a more advanced demo of how to create perturbations for the Natural Language Inference (NLI) task, using the [SNLI dataset](https://www.google.com/search?hl=zh-CN&q=huggingface+snli). It walks through each step that Tailor runs, which also explains part of the implementation logic. If this is more detail than you need, we recommend the Basic Demo notebook instead.

## Load the data

In [2]:
import datasets
import pandas as pd

In [3]:
def show_as_dataframe(list_of_namedtuples, keys=None):
    if isinstance(list_of_namedtuples[0], list):
        if not keys:
            keys = range(len(list_of_namedtuples))
        return pd.concat([pd.DataFrame(tup_list) for tup_list in list_of_namedtuples], keys=keys)
    return pd.DataFrame(list_of_namedtuples)

In [4]:
# load the data
snli = datasets.load_dataset("snli")
# SNLI consists of (premise, hypothesis) sentence pairs;
# here we choose the hypothesis sentence as the perturbation target.
key_to_perturb = "hypothesis"

Reusing dataset snli (/home/wtshuang/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)



In [5]:
# take a look at the first five sentences in the training set
data_dict = snli["train"][:5]
sentences = data_dict[key_to_perturb]
sentences

['A person is training his horse for a competition.',
 'A person is at a diner, ordering an omelette.',
 'A person is outdoors, on a horse.',
 'They are smiling at their parents',
 'There are children present']

## Preprocess the data

Tailor perturbs sentences by manipulating their semantic roles, so the texts must first be augmented with semantic features. We provide:
- a built-in spaCy processor, and
- an SRL tagger ([default model from AllenNLP](https://demo.allennlp.org/semantic-role-labeling)).

In [6]:
from tailor.steps.process_with_spacy import GetSpacyModel, ProcessWithSpacy

# two additional parameters have the defaults shown below; change them if you need to.
# spacy_model_name="en_core_web_sm"
# use_white_space_tokenizer=False
spacy_model = GetSpacyModel().run(parse=True)
spacy_outputs = ProcessWithSpacy().run(sentences=sentences, spacy_model=spacy_model)
spacy_outputs

[A person is training his horse for a competition.,
 A person is at a diner, ordering an omelette.,
 A person is outdoors, on a horse.,
 They are smiling at their parents,
 There are children present]

In [7]:
import logging
logging.getLogger("allennlp").setLevel("ERROR")

from tailor.steps.get_srl_tags import GetSRLTags
# pass the spaCy-processed docs to get the SRL labels
processed_sentences = GetSRLTags().run(spacy_outputs=spacy_outputs)


Take a look at one result. The result has the following structure:

```
- sentence [str]: The original text.
- spacy_doc [spaCy Doc]: The processed spaCy doc.
- verbs [list of dict]: The predicates and their corresponding arguments.
    {
        verb [str]: The verb.
            'training'
        description [str]: The sentence annotated with semantic roles.
            '[ARG0: A person] is [V: training] [ARG2: his horse] [ARG1: for a competition] .'
        tags [list of str]: Semantic roles in BIO format.
    }[]
```
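
The `tags` list and the `description` string carry the same information in two forms. As a minimal sketch (the helper `bio_to_description` is hypothetical and is not part of Tailor or AllenNLP), BIO tags can be folded back into the bracketed description like this:

```python
def bio_to_description(words, tags):
    """Render BIO-tagged words as a bracketed description string (illustrative helper)."""
    spans = []  # each item is (label or None, list of words)
    for word, tag in zip(words, tags):
        if tag == "O":
            # Words outside any argument are kept as-is.
            spans.append((None, [word]))
        elif tag.startswith("B-") or not spans or spans[-1][0] != tag[2:]:
            # "B-" opens a new span; a stray "I-" tag is treated the same way.
            spans.append((tag[2:], [word]))
        else:
            # "I-" continues the span opened by the previous tag.
            spans[-1][1].append(word)
    return " ".join(
        " ".join(ws) if label is None else f"[{label}: {' '.join(ws)}]"
        for label, ws in spans
    )


words = ["A", "person", "is", "training", "his", "horse", "for", "a", "competition", "."]
tags = ["B-ARG0", "I-ARG0", "O", "B-V", "B-ARG2", "I-ARG2", "B-ARG1", "I-ARG1", "I-ARG1", "O"]
print(bio_to_description(words, tags))
# [ARG0: A person] is [V: training] [ARG2: his horse] [ARG1: for a competition] .
```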

In [8]:
processed_sentences[0]

ProcessedSentence(sentence='A person is training his horse for a competition .', spacy_doc=A person is training his horse for a competition., verbs=[{'verb': 'is', 'description': 'A person [V: is] training his horse for a competition .', 'tags': ['O', 'O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}, {'verb': 'training', 'description': '[ARG0: A person] is [V: training] [ARG2: his horse] [ARG1: for a competition] .', 'tags': ['B-ARG0', 'I-ARG0', 'O', 'B-V', 'B-ARG2', 'I-ARG2', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'O']}])

In [9]:
# see the result in a more structured format
show_as_dataframe(processed_sentences)

Unnamed: 0,sentence,spacy_doc,verbs
0,A person is training his horse for a competiti...,"(A, person, is, training, his, horse, for, a, ...","[{'verb': 'is', 'description': 'A person [V: i..."
1,"A person is at a diner , ordering an omelette .","(A, person, is, at, a, diner, ,, ordering, an,...","[{'verb': 'is', 'description': '[ARG1: A perso..."
2,"A person is outdoors , on a horse .","(A, person, is, outdoors, ,, on, a, horse, .)","[{'verb': 'is', 'description': '[ARG1: A perso..."
3,They are smiling at their parents,"(They, are, smiling, at, their, parents)","[{'verb': 'are', 'description': 'They [V: are]..."
4,There are children present,"(There, are, children, present)","[{'verb': 'are', 'description': 'There [V: are..."


## Tailor perturbations on sentences

Next, we apply perturbations to the processed sentences.

In [10]:
from tailor.common.perturb_function import ChangeVoice, SwapCoreWithContext, SwapCoreWithoutContext, ShortenCoreArgument, ChangeTense
from tailor.steps.perturb_prompt import PerturbPromptWithString, PerturbPromptWithFunction
from tailor.steps.combine_all_prompts import CombineAllPrompts

In [11]:
sentence_prompts = []
perturbs = {
    ChangeVoice: "preserves_meaning",
    ShortenCoreArgument: "preserves_meaning",
    SwapCoreWithContext: "changes_meaning",
    SwapCoreWithoutContext: "changes_meaning",
    ChangeTense: "preserves_meaning"
}
for perturb_fn in [ChangeVoice, SwapCoreWithContext, SwapCoreWithoutContext, ShortenCoreArgument, ChangeTense]:
    perturbations = PerturbPromptWithString().run(
        # the processed sentences
        processed_sentences=processed_sentences,
        # one of the pre-implemented perturbation functions
        perturb_str_func=perturb_fn(),
        # an arbitrary description that is kept with each prompt
        description=perturbs[perturb_fn])
    sentence_prompts.append(perturbations)




In [25]:
# the available pre-defined functions: 
from tailor.common.perturb_function import PerturbStringFunction
PerturbStringFunction.list_available()

['change_voice',
 'change_tense',
 'change_lemma',
 'delete_text',
 'delete_punctuation',
 'swap_core_with_context',
 'swap_core_without_context',
 'shorten_core_argument']

In [13]:
# Additionally, Tailor lets you define your own perturbation functions.
# We will not go into detail here, but feel free to check the implementation.

from tailor.tasks.nli.perturbations import ReplaceCoreWithSubsequence

In [14]:
replace_core_with_subs = PerturbPromptWithFunction().run(
    processed_sentences=processed_sentences, 
    perturb_fn=ReplaceCoreWithSubsequence(), 
    description="changes_meaning")

In [15]:
sentence_prompts.append(replace_core_with_subs)

In [16]:
sentence_prompts = CombineAllPrompts().run(sentence_prompts)
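
Judging from the dataframe shown in the next cell, the combined result is indexed first by sentence and then by prompt. The sketch below illustrates that kind of regrouping on plain lists. `combine_per_sentence` is a hypothetical stand-in, and the input layout (one list of prompts per sentence, per perturbation) is an assumption; the actual `CombineAllPrompts` step may work differently.

```python
def combine_per_sentence(prompt_groups):
    """Merge per-perturbation results into one prompt list per sentence.

    Assumes every group holds one list of prompts per sentence,
    with the sentences in the same order in every group.
    """
    if not prompt_groups:
        return []
    combined = [[] for _ in prompt_groups[0]]
    for group in prompt_groups:
        for sentence_idx, prompts in enumerate(group):
            combined[sentence_idx].extend(prompts)
    return combined


# Two perturbations applied to two sentences (placeholder strings, not real prompts).
groups = [
    [["voice-a", "voice-b"], ["voice-c"]],
    [["tense-a"], []],
]
print(combine_per_sentence(groups))
# [['voice-a', 'voice-b', 'tense-a'], ['voice-c']]
```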

In [17]:
# These are the actual inputs to Tailor (the prompt column).
# Note that the name and description are preserved for future reference.
show_as_dataframe(sentence_prompts)

Unnamed: 0,Unnamed: 1,prompt,answer,meta,name,description
0,0,[VERB+passive+present: be] <extra_id_0> <extr...,A person [VERB: is] training his horse for a c...,"{'noncore_args': [], 'core_args': [], 'blank_i...",change_voice,preserves_meaning
0,1,[VERB+passive+present: train | AGENT+complete:...,[AGENT: A person] is [VERB: training] [GOAL: h...,"{'noncore_args': [{'tlemma': 'his horse', 'tle...",change_voice,preserves_meaning
0,2,[VERB+active+present: train | AGENT+complete: ...,[PATIENT: A person] is [VERB: training] [GOAL:...,"{'noncore_args': [{'tlemma': 'his horse', 'tle...",swap_core_with_context,changes_meaning
0,3,[VERB+active+present: train | AGENT+complete: ...,[PATIENT: A person] is [VERB: training] [GOAL:...,"{'noncore_args': [{'tlemma': 'his horse', 'tle...",swap_core_without_context,changes_meaning
0,4,[VERB+active+present: train | AGENT+complete: ...,[AGENT: A person] is [VERB: training] [GOAL: h...,"{'noncore_args': [], 'core_args': [{'tlemma': ...",shorten_core_argument,preserves_meaning
0,5,[VERB+active+present: train | AGENT+complete: ...,[AGENT: A person] is [VERB: training] [GOAL: h...,"{'noncore_args': [], 'core_args': [{'tlemma': ...",shorten_core_argument,preserves_meaning
0,6,[VERB+active+past: be] A person <extra_id_0> <...,A person [VERB: is] training his horse for a c...,"{'noncore_args': [], 'core_args': [], 'blank_i...",change_tense,preserves_meaning
0,7,[VERB+active+future: train | AGENT+complete: a...,[AGENT: A person] is [VERB: training] [GOAL: h...,"{'noncore_args': [{'tlemma': 'his horse', 'tle...",change_tense,preserves_meaning
0,8,[VERB+active+present: train | AGENT+complete: ...,[AGENT: A person] is [VERB: training] [GOAL: h...,"{'noncore_args': [], 'core_args': [{'tlemma': ...",replace_core_with_subsequence,changes_meaning
1,0,[VERB+passive+present: be | PATIENT+complete: ...,[PATIENT: A person] [VERB: is] [PREDICATE: at ...,"{'noncore_args': [{'tlemma': 'at a diner', 'tl...",change_voice,preserves_meaning


In [18]:
# finally, we generate the perturbed sentences from the prompts.

from tailor.steps.generate_from_prompts import GenerateFromPrompts

generations = GenerateFromPrompts().run(
    processed_sentences=processed_sentences,
    prompts=sentence_prompts,
    spacy_model=spacy_model,
    compute_perplexity=True,
)

In [22]:
show_as_dataframe(generations[0])

Unnamed: 0,prompt_no_header,sentence,meta,annotations,words,vidx,name,description,is_valid,perplexities
0,- [GOAL: his horse]'s [VERB: trained] [AGENT: ...,- his horse 's trained by a person | for . os .,"{'match': '<re.Match object; span=(0, 29), mat...","[{'tag': 'GOAL', 'start': 1, 'end': 3, 'pred':...","[-, his, horse, 's, trained, by, a, person, |,...",4,change_voice,preserves_meaning,,"{'pr_sent': 60.997093200683594, 'pr_phrase': -..."
1,'PREDIOBOMO: FEIRUINE: BENESISTS BASE OF WALLI...,' FEIRUINE : BENESISTS BASE OF WALLINSIDE VALL...,"{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'TEMPORAL', 'start': 1, 'end': 8, 'pr...","[', FEIRUINE, :, BENESISTS, BASE, OF, WALLINSI...",-1,swap_core_with_context,changes_meaning,,"{'pr_sent': 463.1181106567383, 'pr_phrase': 14..."
2,"- [AGENT: a person], [VERB: training] [PATIENT...","- a person , training for competition","{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'AGENT', 'start': 1, 'end': 3, 'pred'...","[-, a, person, ,, training, for, competition]",4,shorten_core_argument,preserves_meaning,,"{'pr_sent': 5.577220916748047, 'pr_phrase': 7...."
3,- [AGENT: a person]'s [VERB: training] [PATIEN...,- a person 's training for tte competition,"{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'AGENT', 'start': 1, 'end': 3, 'pred'...","[-, a, person, 's, training, for, tte, competi...",4,shorten_core_argument,preserves_meaning,,"{'pr_sent': 23.869155883789062, 'pr_phrase': 7..."
4,"- [AGENT: a person], [VERB: training] [PATIENT...","- a person , training for competition","{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'AGENT', 'start': 1, 'end': 3, 'pred'...","[-, a, person, ,, training, for, competition]",4,shorten_core_argument,preserves_meaning,,"{'pr_sent': 5.577220916748047, 'pr_phrase': 7...."
5,- [AGENT: A person]'s [VERB: training] [PATIEN...,- A person 's training for a competition,"{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'AGENT', 'start': 1, 'end': 3, 'pred'...","[-, A, person, 's, training, for, a, competition]",4,shorten_core_argument,preserves_meaning,,"{'pr_sent': 7.308444976806641, 'pr_phrase': 4...."
6,- [AGENT: a person] [MODAL: will]'be [VERB: tr...,- a person will ' be training his his horse --...,"{'match': '<re.Match object; span=(0, 27), mat...","[{'tag': 'AGENT', 'start': 1, 'end': 3, 'pred'...","[-, a, person, will, ', be, training, his, his...",6,change_tense,preserves_meaning,,"{'pr_sent': 65.06066131591797, 'pr_phrase': 7...."
7,[AGENT: a person]'s [VERB: training] [PATIENT;...,a person 's training for .. tr .,"{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'AGENT', 'start': 0, 'end': 2, 'pred'...","[a, person, 's, training, for, .., tr, .]",3,replace_core_with_subsequence,changes_meaning,,"{'pr_sent': 22.215484619140625, 'pr_phrase': 2..."
8,[AGENT: a person]'s [VERB: training] [PATIENT ...,a person 's training for sass,"{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'AGENT', 'start': 0, 'end': 2, 'pred'...","[a, person, 's, training, for, sass]",3,replace_core_with_subsequence,changes_meaning,,"{'pr_sent': 8.735004425048828, 'pr_phrase': 2...."
9,[AGENT: a person] [VERB: trains] - [PATIENT; f...,"a person trains - for , say .. :","{'match': '<re.Match object; span=(0, 28), mat...","[{'tag': 'AGENT', 'start': 0, 'end': 2, 'pred'...","[a, person, trains, -, for, ,, say, .., :]",2,replace_core_with_subsequence,changes_meaning,,"{'pr_sent': 27.191421508789062, 'pr_phrase': 2..."
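
Some of the generations above are degenerate, and the most garbled row also has by far the highest `pr_sent`. One possible way to screen such rows, sketched here on plain dictionaries, is a simple perplexity cutoff. The threshold of 30 and the helper `keep_fluent` are illustrative assumptions, not Tailor defaults; only the `perplexities` and `pr_sent` keys and the sample values come from the output above.

```python
def keep_fluent(rows, max_perplexity=30.0):
    """Keep rows whose sentence-level perplexity is at or below the cutoff."""
    return [
        row for row in rows
        if row["perplexities"]["pr_sent"] <= max_perplexity
    ]


# Plain-dict stand-ins for three rows of the table above.
rows = [
    {"sentence": "- his horse 's trained by a person | for . os .",
     "perplexities": {"pr_sent": 60.997093200683594}},
    {"sentence": "- a person , training for competition",
     "perplexities": {"pr_sent": 5.577220916748047}},
    {"sentence": "- A person 's training for a competition",
     "perplexities": {"pr_sent": 7.308444976806641}},
]

for row in keep_fluent(rows):
    print(row["sentence"])
```

A fixed cutoff is crude: a low perplexity does not guarantee that the perturbation did what was intended, so it is best combined with a manual spot check.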


## Data augmentation with Tailor

Tailor also provides steps that support data augmentation.

In [20]:
from tailor.tasks.nli.augment import AugmentNLI

new_data = AugmentNLI().run(
    # the data and the target field we chose at the beginning.
    dataset=data_dict, 
    perturbed_field=key_to_perturb, 
    # the "changes_meaning" and "preserves_meaning" descriptions let us auto-label the new examples.
    generated_prompt_dicts=generations, 
    max_augment_per_instance=2)
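
The description attached to each prompt is what makes auto-labeling possible. As a hedged sketch of the idea only: a perturbation tagged `preserves_meaning` can inherit the original label, while one tagged `changes_meaning` needs a task-specific rule that this notebook does not spell out, so the hypothetical helper `propagate_label` below returns `None` for it. This is not the `AugmentNLI` implementation.

```python
from typing import Optional


def propagate_label(original_label: int, description: str) -> Optional[int]:
    """Derive a label for a perturbed example from its description (illustrative only)."""
    if description == "preserves_meaning":
        # Same meaning, so the relation to the premise is assumed unchanged.
        return original_label
    if description == "changes_meaning":
        # The new label depends on a task-specific rule not covered here.
        return None
    raise ValueError(f"unknown description: {description!r}")


print(propagate_label(1, "preserves_meaning"))  # 1
print(propagate_label(2, "changes_meaning"))    # None
```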

In [21]:
# we can convert them into the Hugging Face dataset format.

from tailor.steps.convert_dataset_to_dict import ConvertDictToDataset
new_dataset = ConvertDictToDataset().run(new_data)

new_dataset

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 6
})

In [22]:
new_dataset[:]

{'premise': ['A person on a horse jumps over a broken down airplane.',
  'A person on a horse jumps over a broken down airplane.',
  'A person on a horse jumps over a broken down airplane.',
  'Children smiling and waving at camera',
  'Children smiling and waving at camera',
  'Children smiling and waving at camera'],
 'hypothesis': ["A person 's his horse will be training competition in -",
  "a person is at , '' ordered - an omelette",
  "' A person orders - an omelette",
  'They will - uh . c.-RRB _ stymies_dear_children_doer_nsc_fcihi._hj -RSB be smiling at their parents',
  'They will - a. % uh ve fwddguiduidireit@hzd@t > v cbw rcf ncsbmn.rkniuwsi iliiwiaudhi fvghthruud tharhwnaihnrhgivtuniyahnauthi',
  "There 've been - children present"],
 'label': [1, 2, 2, 1, 1, 0]}