In [1]:
%load_ext autoreload
%autoreload 2
is_cuda = True

# Overview

This notebook is a more advanced demo of how to create perturbations for the Natural Language Inference (NLI) task, using the [SNLI dataset](https://www.google.com/search?hl=zh-CN&q=huggingface+snli). It walks through each step that Tailor runs, which also explains some of the implementation logic. If this is more detail than you are looking for, we recommend checking out the Basic Demo notebook instead.

## Load the data

In [2]:
import datasets
import pandas as pd

In [3]:
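def show_as_dataframe(list_of_namedtuples, keys=None):
    """Render namedtuples (or a list of lists of them) as a pandas DataFrame."""
    # nested input: build one frame per inner list and stack them under `keys`
    if isinstance(list_of_namedtuples[0], list):
        if not keys:
            keys = range(len(list_of_namedtuples))
        return pd.concat([pd.DataFrame(tup_list) for tup_list in list_of_namedtuples], keys=keys)
    # flat input: one row per namedtuple
    return pd.DataFrame(list_of_namedtuples)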

In [4]:
# load the data
snli = datasets.load_dataset("snli")
# SNLI consists of (premise, hypothesis) sentence pairs;
# here we choose the hypothesis sentence as the perturbation target.
key_to_perturb = "hypothesis"

Reusing dataset snli (/home/alexisr/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
# take a look at some sentences in the training dataset
data_dict = snli["train"][20:70]
sentences = data_dict[key_to_perturb]
sentences

['A man in a restaurant is waiting for his meal to arrive.',
 'A blond man getting a drink of water from a fountain in the park.',
 'A blond man wearing a brown shirt is reading a book on a bench in the park',
 'A blond man drinking water from a fountain.',
 'The friends scowl at each other over a full dinner table.',
 'There are two woman in this picture.',
 'The friends have just met for the first time in 20 years, and have had a great time catching up.',
 'The two sisters saw each other across the crowded diner and shared a hug, both clutching their doggie bags.',
 'Two groups of rival gang members flipped each other off.',
 'Two women hug each other.',
 'A team is trying to score the games winning out.',
 'A team is trying to tag a runner out.',
 'A team is playing baseball on Saturn.',
 'A school hosts a basketball game.',
 'A high school is hosting an event.',
 'A school is hosting an event.',
 'The women do not care what clothes they wear.',
 'Women are waiting by a tram.',
 'Th

## Preprocess the data

Tailor perturbs text by manipulating the semantic roles in each sentence. We therefore need some preprocessing to annotate the texts with semantic features. Tailor provides:
- a built-in spaCy processor, and
- an SRL tagger ([default model from AllenNLP](https://demo.allennlp.org/semantic-role-labeling)).

In [6]:
from tailor.steps.process_with_spacy import GetSpacyModel, ProcessWithSpacy

# two further parameters have default values; change them if you need to:
# spacy_model_name="en_core_web_sm"
# use_white_space_tokenizer=False
spacy_model = GetSpacyModel().run(parse=True)
spacy_outputs = ProcessWithSpacy().run(sentences=sentences, spacy_model=spacy_model)
spacy_outputs

[A man in a restaurant is waiting for his meal to arrive.,
 A blond man getting a drink of water from a fountain in the park.,
 A blond man wearing a brown shirt is reading a book on a bench in the park,
 A blond man drinking water from a fountain.,
 The friends scowl at each other over a full dinner table.,
 There are two woman in this picture.,
 The friends have just met for the first time in 20 years, and have had a great time catching up.,
 The two sisters saw each other across the crowded diner and shared a hug, both clutching their doggie bags.,
 Two groups of rival gang members flipped each other off.,
 Two women hug each other.,
 A team is trying to score the games winning out.,
 A team is trying to tag a runner out.,
 A team is playing baseball on Saturn.,
 A school hosts a basketball game.,
 A high school is hosting an event.,
 A school is hosting an event.,
 The women do not care what clothes they wear.,
 Women are waiting by a tram.,
 The women enjoy having a good fashion s

In [7]:
import logging
logging.getLogger("allennlp").setLevel("ERROR")

from tailor.steps.get_srl_tags import GetSRLTags
# pass the spaCy-processed docs to get the SRL labels
processed_sentences = GetSRLTags().run(spacy_outputs=spacy_outputs)


Take a look at one result. The result has the following structure:

```
- sentence [str]: The original text.
- spacy_doc [spaCy Doc]: The processed spaCy doc.
- verbs [List of dict]: The predicates and their corresponding arguments, one dict per predicate:
    {
        verb [str]: The verb.
            'training'
        description [str]: The sentence annotated with semantic roles.
            '[ARG0: A person] is [V: training] [ARG2: his horse] [ARG1: for a competition] .'
        tags [List of str]: Semantic roles in BIO format.
    }[]
```
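
To make the BIO tags easier to read, the following minimal sketch groups them back into labeled spans. `bio_to_spans` is a hypothetical helper written for this illustration and is not part of Tailor; the tokens and tags are copied from the `waiting` predicate in the output of `processed_sentences[0]`.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (label, text) spans. Hypothetical helper, not a Tailor API."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # close the open span unless this tag continues it
        if label is not None and tag != f"I-{label}":
            spans.append((label, " ".join(tokens[start:i])))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        elif tag.startswith("I-") and label is None:
            # tolerate an I- tag that has no preceding B- tag
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, " ".join(tokens[start:])))
    return spans


tokens = "A man in a restaurant is waiting for his meal to arrive .".split()
tags = ["B-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "O", "B-V",
        "B-ARG2", "I-ARG2", "I-ARG2", "I-ARG2", "I-ARG2", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
# [('ARG1', 'A man in a restaurant'), ('V', 'waiting'), ('ARG2', 'for his meal to arrive')]
```

The recovered spans match the bracketed groups in the `description` string of the same predicate.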

In [8]:
processed_sentences[0]

ProcessedSentence(sentence='A man in a restaurant is waiting for his meal to arrive .', spacy_doc=A man in a restaurant is waiting for his meal to arrive., verbs=[{'verb': 'is', 'description': 'A man in a restaurant [V: is] waiting for his meal to arrive .', 'tags': ['O', 'O', 'O', 'O', 'O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}, {'verb': 'waiting', 'description': '[ARG1: A man in a restaurant] is [V: waiting] [ARG2: for his meal to arrive] .', 'tags': ['B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'O', 'B-V', 'B-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'O']}, {'verb': 'arrive', 'description': 'A man in a restaurant is waiting for [ARG1: his meal] to [V: arrive] .', 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ARG1', 'I-ARG1', 'O', 'B-V', 'O']}])

In [9]:
# See the results in a more structured format
show_as_dataframe(processed_sentences)

Unnamed: 0,sentence,spacy_doc,verbs
0,A man in a restaurant is waiting for his meal ...,"(A, man, in, a, restaurant, is, waiting, for, ...","[{'verb': 'is', 'description': 'A man in a res..."
1,A blond man getting a drink of water from a fo...,"(A, blond, man, getting, a, drink, of, water, ...","[{'verb': 'getting', 'description': '[ARG0: A ..."
2,A blond man wearing a brown shirt is reading a...,"(A, blond, man, wearing, a, brown, shirt, is, ...","[{'verb': 'wearing', 'description': '[ARG0: A ..."
3,A blond man drinking water from a fountain .,"(A, blond, man, drinking, water, from, a, foun...","[{'verb': 'drinking', 'description': '[ARG0: A..."
4,The friends scowl at each other over a full di...,"(The, friends, scowl, at, each, other, over, a...","[{'verb': 'scowl', 'description': '[ARG0: The ..."
5,There are two woman in this picture .,"(There, are, two, woman, in, this, picture, .)","[{'verb': 'are', 'description': 'There [V: are..."
6,The friends have just met for the first time i...,"(The, friends, have, just, met, for, the, firs...","[{'verb': 'have', 'description': 'The friends ..."
7,The two sisters saw each other across the crow...,"(The, two, sisters, saw, each, other, across, ...","[{'verb': 'saw', 'description': '[ARG0: The tw..."
8,Two groups of rival gang members flipped each ...,"(Two, groups, of, rival, gang, members, flippe...","[{'verb': 'flipped', 'description': '[ARG0: Tw..."
9,Two women hug each other .,"(Two, women, hug, each, other, .)","[{'verb': 'hug', 'description': '[ARG0: Two wo..."


## Tailor perturbations on sentences

Now we can apply perturbations to the processed sentences.

In [10]:
from tailor.common.perturb_function import ChangeVoice, SwapCoreWithContext, SwapCoreWithoutContext, ShortenCoreArgument, ChangeTense
from tailor.steps.perturb_prompt import PerturbPromptWithString, PerturbPromptWithFunction
from tailor.steps.combine_all_prompts import CombineAllPrompts

In [12]:
sentence_prompts = []
perturbs = {
    ChangeVoice: "preserves_meaning",
    ShortenCoreArgument: "preserves_meaning",
    SwapCoreWithContext: "changes_meaning",
    SwapCoreWithoutContext: "changes_meaning",
    ChangeTense: "preserves_meaning"
}
for perturb_fn in [ChangeVoice, SwapCoreWithContext, SwapCoreWithoutContext, ShortenCoreArgument, ChangeTense]:
    perturbations = PerturbPromptWithString().run(
        # the processed sentences
        processed_sentences=processed_sentences, 
        # one of the pre-implemented perturbation functions
        perturb_str_func=perturb_fn(), 
        # an arbitrary description; here it records the expected effect on meaning
        description=perturbs[perturb_fn])
    sentence_prompts.append(perturbations)


No prompts created; unique_tlemmas: set()


In [13]:
# the available pre-defined functions: 
from tailor.common.perturb_function import PerturbStringFunction
PerturbStringFunction.list_available()

['change_voice',
 'change_tense',
 'change_lemma',
 'delete_text',
 'delete_punctuation',
 'swap_core_with_context',
 'swap_core_without_context',
 'shorten_core_argument']
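
The names in this list look like snake_case forms of the class names imported earlier (for example `ChangeVoice` and `change_voice`). Assuming that correspondence holds, which is an observation from this notebook rather than a documented guarantee, a small sketch like the one below can translate the class-keyed descriptions into a name-keyed lookup. `camel_to_snake` and `labels_by_name` are hypothetical names, and class names are written as plain strings so that the snippet runs without Tailor installed.

```python
import re


def camel_to_snake(name):
    """Convert CamelCase to snake_case (hypothetical helper, not a Tailor API)."""
    # insert "_" before every capital letter except the first, then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# same descriptions as the `perturbs` dict, keyed by class name as a string
labels_by_class = {
    "ChangeVoice": "preserves_meaning",
    "ShortenCoreArgument": "preserves_meaning",
    "SwapCoreWithContext": "changes_meaning",
    "SwapCoreWithoutContext": "changes_meaning",
    "ChangeTense": "preserves_meaning",
}
labels_by_name = {camel_to_snake(k): v for k, v in labels_by_class.items()}
print(labels_by_name["swap_core_with_context"])
```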

In [14]:
# Tailor also lets you define your own perturbation functions.
# We will not go into details here, but feel free to check the implementation.

from tailor.tasks.nli.perturbations import ReplaceCoreWithSubsequence

In [15]:
replace_core_with_subs = PerturbPromptWithFunction().run(
    processed_sentences=processed_sentences, 
    perturb_fn=ReplaceCoreWithSubsequence(), 
    description="changes_meaning")

<generator object at 0x7fc029792d70>
his meal
[his meal]
<generator object at 0x7fc029792d70>
A man
a restaurant
[A man, a restaurant]
<generator object at 0x7fc029792d70>
his meal
[his meal]
<generator object at 0x7fc029792d70>
A man
a restaurant
[A man, a restaurant]
<generator object at 0x7fc029792eb0>
his meal
[his meal]
<generator object at 0x7fc029792eb0>
his meal
[his meal]
<generator object at 0x7fc029792cd0>
a fountain
the park
[a fountain, the park]
<generator object at 0x7fc029792cd0>
a drink
water
[a drink, water]
<generator object at 0x7fc029792cd0>
A blond man
[A blond man]
<generator object at 0x7fc029792eb0>
a fountain
the park
[a fountain, the park]
<generator object at 0x7fc029792eb0>
a drink
water
[a drink, water]
<generator object at 0x7fc029792eb0>
A blond man
[A blond man]
<generator object at 0x7fc1cb7c3190>
a brown shirt
[a brown shirt]
<generator object at 0x7fc1cb7c3190>
A blond man
[A blond man]
<generator object at 0x7fc1cb7c3190>
a brown shirt
[a brown shir

<generator object at 0x7fc1cb7c3550>
The people
[The people]
<generator object at 0x7fc1cb7c3550>
the train
[the train]
<generator object at 0x7fc1cb7c3550>
The people
[The people]
<generator object at 0x7fc1cb7c3550>
the train
[the train]
<generator object at 0x7fc1cb7c3190>
The people
the train
[The people, the train]
<generator object at 0x7fc1cb7c3190>
The people
the train
[The people, the train]
<generator object at 0x7fc1cb7c35f0>
people
a train
[people, a train]
<generator object at 0x7fc1cb7c35f0>
people
a train
[people, a train]
<generator object at 0x7fc1cb7c3550>
a train
[a train]
<generator object at 0x7fc1cb7c3550>
people
[people]
<generator object at 0x7fc1cb7c3550>
[]
<generator object at 0x7fc1cb7c35f0>
a train
[a train]
<generator object at 0x7fc1cb7c35f0>
people
[people]
<generator object at 0x7fc1cb7c35f0>
[]
<generator object at 0x7fc1cb7c35f0>
people
a train
[people, a train]
<generator object at 0x7fc1cb7c35f0>
people
a train
[people, a train]
<generator object at

In [16]:
sentence_prompts.append(replace_core_with_subs)

In [17]:
sentence_prompts = CombineAllPrompts().run(sentence_prompts)

In [18]:
# Here we generate the actual inputs to Tailor (the prompt column). 
# Note that the name and description are preserved for future reference.
show_as_dataframe(sentence_prompts)

Unnamed: 0,Unnamed: 1,prompt,answer,meta,name,description
0,0,[VERB+passive+present: be] <extra_id_0> <extr...,A man in a restaurant [VERB: is] waiting for h...,"{'noncore_args': [], 'core_args': [], 'blank_i...",change_voice,preserves_meaning
0,1,[VERB+passive+present: wait | PATIENT+complete...,[PATIENT: A man in a restaurant] is [VERB: wa...,{'noncore_args': [{'tlemma': 'for his meal to ...,change_voice,preserves_meaning
0,2,[VERB+passive+present: arrive | PATIENT+comple...,A man in a restaurant is waiting for [PATIENT:...,"{'noncore_args': [], 'core_args': [{'tlemma': ...",change_voice,preserves_meaning
0,3,[VERB+active+past: be] A man in a restaurant <...,A man in a restaurant [VERB: is] waiting for h...,"{'noncore_args': [], 'core_args': [], 'blank_i...",change_tense,preserves_meaning
0,4,[VERB+active+past: wait | PATIENT+complete: a ...,[PATIENT: A man in a restaurant] is [VERB: wai...,{'noncore_args': [{'tlemma': 'for his meal to ...,change_tense,preserves_meaning
...,...,...,...,...,...,...
49,4,[VERB+active+present: have | AGENT+partial: A ...,The man is sitting down while [PATIENT: he] [V...,"{'noncore_args': [], 'core_args': [{'tlemma': ...",swap_core_without_context,changes_meaning
49,5,[VERB+active+present: have | AGENT+complete: H...,The man is sitting down while [AGENT: he] [VER...,"{'noncore_args': [], 'core_args': [{'tlemma': ...",shorten_core_argument,preserves_meaning
49,6,[VERB+active+future: be | MODAL: *] The man <e...,The man [VERB: is] sitting down while he has a...,"{'noncore_args': [{'tlemma': '*', 'tlemma_type...",change_tense,preserves_meaning
49,7,[VERB+active+future: sit | PATIENT+complete: t...,[PATIENT: The man] is [VERB: sitting] down [TE...,{'noncore_args': [{'tlemma': 'while he has a s...,change_tense,preserves_meaning
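
Each prompt begins with a bracketed header of control codes, followed by the text with `<extra_id_N>` blanks. The sketch below splits such a header into its parts, assuming the layout seen in the table: codes separated by ` | `, each written as `ROLE+attribute+...: lemma`. `parse_prompt_header` is a hypothetical helper for reading prompts and is not part of Tailor; the two input strings reuse headers that appear in full in this notebook.

```python
import re


def parse_prompt_header(prompt):
    """Split the leading [...] header into control codes (hypothetical helper)."""
    match = re.match(r"\[(.*?)\]", prompt)
    if match is None:
        return []
    controls = []
    for part in match.group(1).split(" | "):
        # "VERB+passive+present: be" -> code "VERB+passive+present", lemma "be"
        code, _, lemma = part.partition(": ")
        role, *attributes = code.split("+")
        controls.append({"role": role, "attributes": attributes, "lemma": lemma})
    return controls


first = parse_prompt_header("[VERB+passive+present: be] <extra_id_0> <extra_id_1>")
second = parse_prompt_header("[VERB+active+future: be | MODAL: *]")
print(first)
print(second)
```

Reading the header this way makes it easier to see which voice, tense, or argument constraint a given perturbation requested.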


In [19]:
# finally, we can generate the perturbed sentences from the prompts.

from tailor.steps.generate_from_prompts import GenerateFromPrompts

generations = GenerateFromPrompts().run(
    processed_sentences=processed_sentences,
    prompts=sentence_prompts,
    spacy_model=spacy_model,
    compute_perplexity=True,
)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [02:39<00:00, 53.25s/it]
  0%|                                                                                                                                                                                                         | 0/261 [00:00<?, ?it/s]


BadGenerationError: Bad generation: [VERB+passive+present: be]  <extra_id_0> <extra_id_1> <extra_id_2> <extra_id_3> <extra_id_4> <extra_id_5> <extra_id_6> <extra_id_7>
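
The prompt in this error message contains only the header and blanks, with no words from the original sentence. Whether that is what triggers the error is an assumption here, not something this notebook confirms. Under that assumption, the sketch below shows one way to spot such prompts before generation. `has_only_blanks` is a hypothetical helper and not part of Tailor; the second test string is illustrative.

```python
import re

SENTINEL = re.compile(r"<extra_id_\d+>")


def has_only_blanks(prompt):
    """Return True if nothing but <extra_id_N> blanks follows the [...] header."""
    # drop the leading control-code header, then every blank token
    body = re.sub(r"^\[.*?\]", "", prompt)
    return SENTINEL.sub("", body).strip() == ""


failing = ("[VERB+passive+present: be]  <extra_id_0> <extra_id_1> <extra_id_2> "
           "<extra_id_3> <extra_id_4> <extra_id_5> <extra_id_6> <extra_id_7>")
illustrative = "[VERB+active+past: be] A man in a restaurant <extra_id_0>"
print(has_only_blanks(failing), has_only_blanks(illustrative))
```

A check like this only flags candidate prompts for inspection; it does not establish the cause of the failure.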

In [None]:
show_as_dataframe(generations[0])

## Data augmentation with Tailor

Tailor also provides steps that support data augmentation.

In [None]:
from tailor.tasks.nli.augment import AugmentNLI

new_data = AugmentNLI().run(
    # the data slice and the perturbed field that we set up at the beginning.
    dataset=data_dict, 
    perturbed_field=key_to_perturb, 
    # the descriptions set earlier ("changes_meaning" / "preserves_meaning") are used to auto-label the new examples.
    generated_prompt_dicts=generations, 
    max_augment_per_instance=2)

In [None]:
# we can convert them into the Hugging Face dataset format.

from tailor.steps.convert_dataset_to_dict import ConvertDictToDataset
new_dataset = ConvertDictToDataset().run(new_data)

new_dataset

In [None]:
new_dataset[:]

In [None]:
new_dataset[26]

In [None]:
new_dataset[28]