# DPML | Data Provenance for Machine Learning

In this notebook, we investigate the provenance capabilities of TextAttack so that we can potentially modify it for our purposes. 

## Load Dependencies

In [1]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /home/coraline/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
import textattack
from sibyl import ChangeSynonym, ExpandContractions

  from .autonotebook import tqdm as notebook_tqdm
2022-06-28 18:31:42.452460: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


## Create Datasets

Examples of 3 primary kinds of datasets in NLP

In [3]:
# Classification | 1 Text Input | 1 Int Output | Sentiment Analysis
single_text_data = [("I enjoyed the movie a lot!", 1), ("Absolutely horrible film.", 0), ("Our family had a fun time!", 1)]
single_text_dataset = textattack.datasets.Dataset(single_text_data)

# Classification | 2 Text Inputs | 1 Int Output | Natural Language Inference
multi_text_data = [(("A man inspects the uniform of a figure in some East Asian country.", "The man is sleeping"), 1)]
multi_text_dataset = textattack.datasets.Dataset(multi_text_data, input_columns=("premise", "hypothesis"))

# Seq2Seq | 1 Text Input | 1 Text Output
seq2seq_data = [("J'aime le film.", "I love the movie.")]
seq2seq_dataset = textattack.datasets.Dataset(seq2seq_data)

## `AttackedText` Class

This class tracks changes made to python strings and links back to previous version of the data. 

In [4]:
from textattack.shared import AttackedText

### Single-Text

In [5]:
single_text = single_text_dataset[0][0]['text']
attacked_single_text = AttackedText(single_text)

In [6]:
attacked_single_text.printable_text()

'I enjoyed the movie a lot!'

### Multi-Text

In [7]:
multi_text = multi_text_dataset[0][0]
attacked_multi_text = AttackedText(multi_text)

In [8]:
attacked_multi_text.printable_text()

'Premise: A man inspects the uniform of a figure in some East Asian country.\nHypothesis: The man is sleeping'

### Seq2Seq

In [9]:
seq2seq_text = seq2seq_dataset[0][0]
attacked_seq2seq_text = AttackedText(seq2seq_text)

In [10]:
attacked_seq2seq_text.printable_text()

"J'aime le film."

## `LeText` | Lineage-enabled Text

A class that represents a string that can be transformed (or attacked), tracking the changes made to text and label components. 

#### Demo `diff_text` functionality for different granularities

In [11]:
import numpy as np

import difflib
import nltk

In [12]:
def diff_text(a, b, granularity="word"): 
    """
    Intakes two text documents and optionally parses then for a desired
    granularity in ['paragraph', 'sentence', 'word', 'character'].

    Returns a the optionally parsed documents as well as a list of 
    difflib.SequenceMatcher.opcodes where tags reflect the type of 
    opertion and the indices reflect the desired granularities.

    opcode tags
      - 'replace' | a[i1:i2] should be replaced by b[j1:j2].
      - 'delete'  | a[i1:i2] should be deleted. 
      - 'insert'  | b[j1:j2] should be inserted at a[i1:i1]. 
      - 'equal'   | a[i1:i2] == b[j1:j2] (the sub-sequences are equal).
    """

    # parse texts to desired granularity
    if granularity == "paragraph":
        parsed_a = a.split('\n')
        parsed_b = b.split('\n')
    elif granularity == "sentence":
        sent_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
        parsed_a = sent_tokenizer.tokenize(a)
        parsed_b = sent_tokenizer.tokenize(b)
    elif granularity == "word":
        parsed_a = a.split()
        parsed_b = b.split()
    elif granularity == "character":
        # no change necessary, difflib is character-level by default
        parsed_a = a
        parsed_b = b

    seq = difflib.SequenceMatcher(None, parsed_a, parsed_b)

    return parsed_a, parsed_b, seq.get_opcodes()

In [13]:
a = """This is sentence one in paragraph one. This is sentence two in paragraph one.
This is sentence one in paragraph two. This is sentence two in paragraph two."""
b = """This be sentence one in paragraph one. This iz sentence two in paragraph one.
This is sentence one in paragraph two. This is sentence two in paragraph two."""

for granularity in ['paragraph', 'sentence', 'word', 'character']:
    print('\ngranularity:', granularity)
    parsed_a, parsed_b, changes = diff_text(a, b, granularity)
    for (tag, i1, i2, j1, j2) in changes:
        print ("%7s parsed_a[%d:%d] (%s) parsed_b[%d:%d] (%s)" %
              (tag, i1, i2, parsed_a[i1:i2], j1, j2, parsed_b[j1:j2]))


granularity: paragraph
replace parsed_a[0:1] (['This is sentence one in paragraph one. This is sentence two in paragraph one.']) parsed_b[0:1] (['This be sentence one in paragraph one. This iz sentence two in paragraph one.'])
  equal parsed_a[1:2] (['This is sentence one in paragraph two. This is sentence two in paragraph two.']) parsed_b[1:2] (['This is sentence one in paragraph two. This is sentence two in paragraph two.'])

granularity: sentence
replace parsed_a[0:2] (['This is sentence one in paragraph one.', 'This is sentence two in paragraph one.']) parsed_b[0:2] (['This be sentence one in paragraph one.', 'This iz sentence two in paragraph one.'])
  equal parsed_a[2:4] (['This is sentence one in paragraph two.', 'This is sentence two in paragraph two.']) parsed_b[2:4] (['This is sentence one in paragraph two.', 'This is sentence two in paragraph two.'])

granularity: word
  equal parsed_a[0:1] (['This']) parsed_b[0:1] (['This'])
replace parsed_a[1:2] (['is']) parsed_b[1:2] (['

#### `LeText` Class for **L**ineage-**e**nabled **Text**

In [14]:
class LeText:
    """
    A helper class that represents a string that can be transformed, 
    tracking the transformations made to it.

    Modifying ``LeText`` instances results in the generation of new ``LeText`` 
    instances with a reference pointer (``le_attrs["previous"]``), so that 
    the full chain of transforms might be reconstructed by using this key to 
    form a linked list.

    Args:
       text (str or dict): The string or dict that this ``LeText`` represents
       granularity (str): Specifies the default level at which 
            lineage should be tracked. Value must be in:
                ['paragraph', 'sentence', 'word', 'character']
       le_attrs (dict): Dictionary of various attributes stored while 
            transforming the underlying text. 
    """

    def __init__(self, text_input, granularity="word", le_attrs=None):
        # Read in ``text_input`` as a string .
        if isinstance(text_input, str):
            self._text_input = {"text": text_input}
        elif isinstance(text_input, dict):
            self._text_input = text_input
        else:
            raise TypeError(
                f"Invalid text_input type {type(text_input)} (requires str or dict)"
            )

        if granularity in ['paragraph', 'sentence', 'word', 'character']:
          
            self.granularity = granularity
        else:
            raise TypeError(
                f"Invalid granularity {granularity} (must be one of the \
                following: ['paragraph', 'sentence', 'word', 'character'])"
            )
        
        # Process input lazily.
        self._chars = None
        self._words = None
        self._sents = None
        self._paras = None

        # create le_attrs if none exists
        if le_attrs is None:
            self.le_attrs = dict()
        elif isinstance(le_attrs, dict):
            self.le_attrs = le_attrs
        else:
            raise TypeError(f"Non-dict provided for le_attrs: {type(le_attrs)}.")

        # parsers
        self.sent_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

        # Lineage Attributes
        self.le_attrs.setdefault("granularity", self.granularity)

    def __eq__(self, other):
        """Compares two LeText instances, making sure that they also share
        the same lineage attributes.

        Since some elements stored in ``self.le_attrs`` may be numpy
        arrays, we have to take special care when comparing them.
        """
        if not (self.text == other.text):
            return False
        if len(self.le_attrs) != len(other.le_attrs):
            return False
        for key in self.le_attrs:
            if key not in other.le_attrs:
                return False
            elif isinstance(self.le_attrs[key], np.ndarray):
                if not (self.le_attrs[key].shape == other.le_attrs[key].shape):
                    return False
                elif not (self.le_attrs[key] == other.le_attrs[key]).all():
                    return False
            else:
                if not self.le_attrs[key] == other.le_attrs[key]:
                    return False
        return True

    def __hash__(self):
        return hash(self.text)

    def apply(self, fn, granularity="word"):
        """
        Applies fn(self.text), tracking the transformation info and output as
        a new LeText instance with a reference back to the source LeText.
        """

        # apply the provided function to the text stored in LeText
        output_text = fn(self.text)

        # find changes between self.text and output_text
        parsed_a, parsed_b, changes = diff_text(self.text, output_text, granularity)

        new_le_attrs = {
            "granularity": self.granularity,
            "changes": changes,
            "transformation": fn,
            "previous": self # current LeText
        }
        
        output_LeText = LeText(output_text, 
                               granularity=granularity,
                               le_attrs=new_le_attrs)
        
        return output_LeText
        
    def parse_text(self):
        if self.granularity == "paragraph":
            parsed_text = self.text.split('\n')
        elif self.granularity == "sentence":
            parsed_text = self.sent_tokenizer.tokenize(self.text)
        elif self.granularity == "word":
            parsed_text = self.text.split()
        elif self.granularity == "character":
            parsed_text = list(self.text)
        return parsed_text

    @property
    def column_labels(self):
        """Returns the labels for this text's columns.

        For single-sequence inputs, this simply returns ['text'].
        """
        return list(self._text_input.keys())

    @property
    def chars(self):
        if not self._chars:
            self._chars = list(self.text)
        return self._chars

    @property
    def words(self):
        if not self._words:
            self._words = self.text.split()
        return self._words

    @property
    def sents(self):
        if not self._sents:
            self._sents = self.sent_tokenizer(self.text)
        return self._sents

    @property
    def paras(self):
        if not self._paras:
            self._paras = self.text.split("\n")
        return self._paras

    @property
    def num_chars(self):
        """Returns the number of characters in the text."""
        return len(self.chars)

    @property
    def num_words(self):
        """Returns the number of words in the text."""
        return len(self.words)

    @property
    def num_sents(self):
        """Returns the number of sentences in the text."""
        return len(self.sents)

    @property
    def num_paras(self):
        """Returns the number of paragraphs in the text."""
        return len(self.paras)

    @property
    def num_units(self):
        """Returns the number of "units" of text given the default granularity."""
        if self.granularity == "paragraph":
            return self.num_paras
        elif self.granularity == "sentence":
            return self.num_sents
        elif self.granularity == "word":
            return self.num_words
        elif self.granularity == "character":
            return self.num_chars

    @property
    def text(self):
        """Represents full text input.

        Multiple inputs are joined with a line break.
        """
        return "\n".join(self._text_input.values())

    def __repr__(self):
        return f'<LeText "{self.text}">'

In [15]:
text1 = LeText("Don't you just love pizza?")
print(text1)
text1.le_attrs

<LeText "Don't you just love pizza?">


{'granularity': 'word'}

In [16]:
text2 = text1.apply(ChangeSynonym())
print(text2)
text2.le_attrs

<LeText "Don't you just bed pizza?">


{'granularity': 'word',
 'changes': [('equal', 0, 3, 0, 3),
  ('replace', 3, 4, 3, 4),
  ('equal', 4, 5, 4, 5)],
 'transformation': <sibyl.transformations.text.word_swap.change_synse.ChangeSynonym at 0x7f95020b28d0>,
 'previous': <LeText "Don't you just love pizza?">}

In [17]:
text3 = text2.apply(ExpandContractions())
print(text3)
text3.le_attrs

<LeText "Do not you just bed pizza?">


{'granularity': 'word',
 'changes': [('replace', 0, 1, 0, 2), ('equal', 1, 5, 2, 6)],
 'transformation': <sibyl.transformations.text.contraction.expand_contractions.ExpandContractions at 0x7f95020b2c50>,
 'previous': <LeText "Don't you just bed pizza?">}

In [18]:
class LeTarget:
    """
    A helper class that represents target data (e.g. most likely a label in the 
    case of classification, but could also be text or some other arbitary
    data structure for other tasks). This target can be transformed, as with 
    sibylvariant augmentations, and this class tracks changes to the target.

    Modifying ``LeTarget`` instances results in the generation of new ``LeTarget`` 
    instances with a reference pointer (``le_attrs["previous"]``), so that 
    the full chain of transforms might be reconstructed by using this key to 
    form a linked list.

    Args:
       target (any): The data which may be transformed
       le_attrs (dict): Dictionary of various attributes stored while 
            transforming the underlying data. 

    TODO:
        - type check target:
            - if isinstance(target, int):   e.g. classification
            - if isinstance(target, float): e.g. regression
            - if ifinstance(target, str):   e.g. seq2seq (save as LeText?)
            - if ifinstance(target, list):  e.g. multi-label classification, soft-label
            - if ifinstance(target, dict):  e.g. pose detection
        - determine how to do diff-ing for other data types
    """

    def __init__(self, target, le_attrs=None):

        self.target = target
        self.target_type = type(target)
        
        # create le_attrs if none exists
        if le_attrs is None:
            self.le_attrs = dict()
        elif isinstance(le_attrs, dict):
            self.le_attrs = le_attrs
        else:
            raise TypeError(f"Non-dict provided for le_attrs: {type(le_attrs)}.")

    def __eq__(self, other):
        """Compares two LeTarget instances, making sure that they also share
        the same lineage attributes.

        Since some elements stored in ``self.le_attrs`` may be numpy
        arrays, we have to take special care when comparing them.
        """
        if not (self.target == other.target):
            return False
        if len(self.le_attrs) != len(other.le_attrs):
            return False
        for key in self.le_attrs:
            if key not in other.le_attrs:
                return False
            elif isinstance(self.le_attrs[key], np.ndarray):
                if not (self.le_attrs[key].shape == other.le_attrs[key].shape):
                    return False
                elif not (self.le_attrs[key] == other.le_attrs[key]).all():
                    return False
            else:
                if not self.le_attrs[key] == other.le_attrs[key]:
                    return False
        return True

    def __hash__(self):
        return hash(self.target)

    def apply(self, fn):
        """
        Applies fn(self.target), tracking the transformation info and output as
        a new LeTarget instance with a reference back to the source LeTarget.
        """

        # apply the provided function to the target stored in LeTarget
        output_target = fn(self.target)

        # diff 
        # changes = ???

        new_le_attrs = {
            "transformation": fn,
            # "changes": changes, 
            "previous": self # current LeTarget
        }
        
        output_LeTarget = LeTarget(output_target, 
                               le_attrs=new_le_attrs)
        
        return output_LeTarget

    def __repr__(self):
        return f'<LeTarget "{self.target}">'

In [19]:
from sibyl import invert_label

In [20]:
target1 = LeTarget(0)
print(target1)
target1.le_attrs

<LeTarget "0">


{}

In [21]:
target2 = target1.apply(invert_label)
print(target2)
target2.le_attrs

<LeTarget "1">


{'transformation': <function sibyl.transformations.utils.invert_label(y, soften=False, num_classes=None)>,
 'previous': <LeTarget "0">}

In [22]:
from collections.abc import Iterable

class LeRecord:
    """
    A class that represents a data record that can be transformed (or attacked), 
    tracking the changes made to input (text) and target components. 

    Models that take multiple documents as input separate them via 
    ``SPLIT_TOKEN`` by default. 
    Transformations may be be applied in one of the following ways:
      - The entire input, joined into one string, without ``SPLIT_TOKEN``.
      - Specific documents associated with the data record without affecting
        other documents. 

    ``LeRecord`` instances that were transformed from other ``LeRecord``
    objects contain a pointer to the previous instance via an attributes
    dictionary (i.e. ``le_attrs["previous_LeRecord"]``), so that the full chain of
    transformations might be reconstructed by using this key to form a linked
    list.

    Args:
       record (dict): Dictionary of inputs, labels, and potentially other data
           associated with dataset record. 
       granularity (str): Specifies the default level at which 
            lineage should be tracked. Value must be in:
                ['paragraph', 'sentence', 'word', 'character']
       text_key (str or List[str]): key value(s) for the text input(s)
       target_key (str or List[str]): key value(s) for the target(s)
       le_attrs (dict): Dictionary of various attributes stored
           while transforming the underlying data record.
    """

    SPLIT_TOKEN = "<SPLIT>"

    def __init__(self, record, 
                 granularity="word", 
                 text_keys=None,
                 target_keys=None,
                 le_attrs=None):
        if not isinstance(record, dict):
            raise TypeError(
                f"Invalid text_input type {type(record)} (required dict)"
            )
        
        self.record = record
        self.granularity = granularity
        self.text_keys = text_keys
        self.target_keys = target_keys

        # if no text keys are given, guess 'em
        if not self.text_keys:
            self.text_keys = self.infer_text_keys()

        # if no target keys are given, guess 'em
        if not self.target_keys:
            self.target_keys = self.infer_target_keys()

        # parse record into input and target components
        self.extract_texts_and_targets()

        self.add_lineage()

    def infer_text_keys(self):
        """
        Traverses self.record and guesses which keys contain the text inputs
        """
        text_keys = []
        for k, v in self.record.items():
            if isinstance(v, LeText) or k not in ["label", "target"]:
                text_keys.append(k)
        return text_keys

    def infer_target_keys(self):
        """
        Traverses self.record and guesses which keys contain the target
        """
        target_keys = []
        for k, v in self.record.items():
            if isinstance(v, LeTarget) or k in ["label", "target"]:
                target_keys.append(k)
        return target_keys

    def extract_texts_and_targets(self):
        self.texts, self.targets = [], []
        for key in self.text_keys:
            text = self.record[key]
            if isinstance(text, LeText):
                text = text.text
            self.texts.append(text)
        for key in self.target_keys:
            target = self.record[key]
            if isinstance(target, LeTarget):
                target = target.target
            self.targets.append(target)
            
    def backgenerate_LeText(self, old_text, new_text, fn, granularity=None):
        if not granularity:
            granularity = self.granularity
        
        if not isinstance(old_text, LeText):
            old_text = LeText(old_text)
            
        parsed_a, parsed_b, changes = diff_text(old_text.text, 
                                                new_text, 
                                                granularity)
        le_attrs = {
            "granularity": granularity,
            "changes": changes,
            "transformation": fn,
            "previous": old_text
        }
        new_LeText = LeText(new_text, le_attrs=le_attrs)
        return new_LeText

    def backgenerate_LeTarget(self, old_target, new_target, fn):     
        if not isinstance(old_target, LeTarget):
            old_target = LeTarget(old_target)

        le_attrs = {
            # "changes": changes,
            "transformation": fn,
            "previous": old_target
        }
        new_LeTarget = LeTarget(new_target, le_attrs=le_attrs)
        return new_LeTarget

    def add_lineage(self):
        for k in self.text_keys:
            v = self.record[k]
            if not isinstance(v, LeText):
               self.record[k] = LeText(v)
        for k in self.target_keys:
            v = self.record[k]
            if not isinstance(v, LeTarget):
               self.record[k] = LeTarget(v)
        
    def remove_lineage(self):
        for k, v in record.items():
            if isinstance(v, LeText):
                v = v.text
            if isinstance(v, LeTarget):
                v = v.target
            self.record[k] = v

    def __hash__(self):
        return hash(self.record)

    def apply(self, fn, *args, **kwargs):
        """
        Applies fn(self.record), tracking the transformation info 
        and outputs as a new LeRecord instance with a reference back to the 
        source LeRecord. Text and target indices are transformed separately. 
        Args:
          fn (function): function which intakes a dict and returns a dict
        """

        # apply transform to self.record
        self.remove_lineage()
        output_record = fn(self.record, *args, **kwargs)

        output_record = LeRecord(output_record)
        new_texts = output_record.texts
        new_targets = output_record.targets

        # recreate record from text components
        _new_texts = {}
        for key, old_text, new_text in zip(self.text_keys, self.texts, new_texts):
            _new_texts[key] = self.backgenerate_LeText(old_text, new_text, fn)

        # recreate record from target components
        _new_targets = {}
        for key, old_target, new_target in zip(self.target_keys, self.targets, new_targets):
            _new_targets[key] = self.backgenerate_LeTarget(old_target, new_target, fn)

        # join text and target components together
        # output_record = _new_texts | _new_targets # only python 3.9+
        output_record = {**_new_texts, **_new_targets}

        # diff 
        # changes = ???
        
        new_le_attrs = {
            "transformation": fn,
            # "changes": changes, 
            "previous": self # current LeRecord
        }
        
        output_LeRecord = LeRecord(output_record, 
                                   le_attrs=new_le_attrs)
        
        return output_LeRecord

    def apply_to_components(self, fn, *args, **kwargs):
        """
        Applies fn(self.record), tracking the transformation info 
        and outputs as a new LeRecord instance with a reference back to the 
        source LeRecord. Text and target indices are transformed separately. 
        Args:
          fn (function): transformation function which intakes text and target 
                         separately and returns them in a similar way
        """

        # apply transform to self.texts, self.targets
        new_texts, new_targets = fn(self.texts, self.targets, *args, **kwargs)

        # recreate record from text components
        if not isinstance(new_texts, list):
            new_texts = [new_texts]

        _new_texts = {}
        for key, old_text, new_text in zip(self.text_keys, self.texts, new_texts):
            _new_texts[key] = self.backgenerate_LeText(old_text, new_text, fn)

        # recreate record from target components
        if not isinstance(new_targets, Iterable):
            new_targets = [new_targets]

        _new_targets = {}
        for key, old_target, new_target in zip(self.target_keys, self.targets, new_targets):
            _new_targets[key] = self.backgenerate_LeTarget(old_target, new_target, fn)

        # join text and target components together
        # output_record = _new_texts | _new_targets # only python 3.9+
        output_record = {**_new_texts, **_new_targets}

        # diff 
        # changes = ???

        new_le_attrs = {
            "transformation": fn,
            # "changes": changes, 
            "previous": self # current LeRecord
        }
        
        output_LeRecord = LeRecord(output_record, 
                                   le_attrs=new_le_attrs)
        
        return output_LeRecord

    def __repr__(self):
        return f'<LeRecord "{self.record}">'

In [23]:
record1 = {
    "text": "This is a single input record.",
    "label": 0
}

record2 = {
    "text1": "This is a double input record - 1.",
    "text2": "This is a double input record - 2.",
    "label": 0
}

#### `LeRecord` | Toy Examples 

In [24]:
# 1 text + 1 target record + apply
print("1 text + 1 target record + apply\n")

def record_transformer(record):
    text = record['text']
    label = record['label']

    new_text = text + " This is a test."
    new_label = invert_label(label)
    return {"text": new_text, "label": new_label}

record = record1

le_record1 = LeRecord(record)
print(le_record1)
# print(le_record1.texts, le_record1.targets)

le_record2 = le_record1.apply(record_transformer)
print(le_record2)
# print(le_record2.texts, le_record2.targets)

le_record3 = le_record2.apply(record_transformer)
print(le_record3)
# print(le_record3.texts, le_record3.targets)

# 2 text + 1 target record + apply
print("\n2 text + 1 target record + apply\n")

def record_transformer2(record):
    text1 = record['text1']
    text2 = record['text2']
    label = record['label']

    new_text1 = text1 + " Yo."
    new_text2 = text2 + " Whoa!!!"
    new_label = invert_label(label)
    return {"text1": new_text1, "text2": new_text2, "label": new_label}

record = record2

le_record1 = LeRecord(record)
print(le_record1)
# print(le_record1.texts, le_record1.targets)

le_record2 = le_record1.apply(record_transformer2)
print(le_record2)
# print(le_record2.texts, le_record2.targets)

le_record3 = le_record2.apply(record_transformer2)
print(le_record3)
# print(le_record3.texts, le_record3.targets)

# 1 text + 1 target record + apply_to_components
print("\n1 text + 1 target record + apply_to_components\n")

def record_components_transformer(text, label):
    new_text = text[0] + " This is a test."
    new_label = invert_label(label[0])
    return new_text, new_label

record = record1

le_record1 = LeRecord(record)
print(le_record1)
# print(le_record1.texts, le_record1.targets)

le_record2 = le_record1.apply_to_components(record_components_transformer)
print(le_record2)
# print(le_record2.texts, le_record2.targets)

le_record3 = le_record2.apply_to_components(record_components_transformer)
print(le_record3)
# print(le_record3.texts, le_record3.targets)

# 2 text + 1 target record + apply_to_components
print("\n2 text + 1 target record + apply_to_components\n")

def record_components_transformer2(texts, label):
    new_text = [texts[0] + " Yo.", texts[1] + "Whoa!!!"]
    new_label = invert_label(label[0])
    return new_text, new_label

record = record2

le_record1 = LeRecord(record)
print(le_record1)
# print(le_record1.texts, le_record1.targets)

le_record2 = le_record1.apply_to_components(record_components_transformer2)
print(le_record2)
# print(le_record2.texts, le_record2.targets)

le_record3 = le_record2.apply_to_components(record_components_transformer2)
print(le_record3)
# print(le_record3.texts, le_record3.targets)

1 text + 1 target record + apply

<LeRecord "{'text': <LeText "This is a single input record.">, 'label': <LeTarget "0">}">
<LeRecord "{'text': <LeText "This is a single input record. This is a test.">, 'label': <LeTarget "1">}">
<LeRecord "{'text': <LeText "This is a single input record. This is a test.">, 'label': <LeTarget "1">}">

2 text + 1 target record + apply

<LeRecord "{'text1': <LeText "This is a double input record - 1.">, 'text2': <LeText "This is a double input record - 2.">, 'label': <LeTarget "0">}">
<LeRecord "{'text1': <LeText "This is a double input record - 1. Yo.">, 'text2': <LeText "This is a double input record - 2. Whoa!!!">, 'label': <LeTarget "1">}">
<LeRecord "{'text1': <LeText "This is a double input record - 1. Yo.">, 'text2': <LeText "This is a double input record - 2. Whoa!!!">, 'label': <LeTarget "1">}">

1 text + 1 target record + apply_to_components

<LeRecord "{'text': <LeText "This is a single input record.">, 'label': <LeTarget "0">}">
<LeRecord "{'

#### `LeRecord` | Sibyl Example

In [25]:
from sibyl import ChangeAntonym, InsertNegativePhrase, TextMix

In [27]:
task_config = {
    "input_idx": [1],
    "task_name": "sentiment",
    "tran_type": "SIB",
    "label_type": "hard"
}

def trans_fn(X, y, task_config):
    if isinstance(y, list):
        y = y[0]
    new_X, new_y = ChangeAntonym().transform_Xy(X, y, task_config)
    return new_X, new_y

##### `ChangeAntonym`

In [28]:
record1 = {
    "text": "This is a nasty input record.",
    "label": 0
}

In [29]:
print(le_record3.record['text1'])
print(le_record3.record['text1'].le_attrs['previous'])

<LeText "This is a double input record - 1. Yo. Yo.">
<LeText "This is a double input record - 1. Yo.">


In [30]:
record = record1

le_record1 = LeRecord(record)
print(le_record1)
# print(le_record1.texts, le_record1.targets)

le_record2 = le_record1.apply_to_components(trans_fn, task_config=task_config)
print(le_record2)
# print(le_record2.texts, le_record2.targets)

le_record3 = le_record2.apply_to_components(trans_fn, task_config=task_config)
print(le_record3)
# print(le_record3.texts, le_record3.targets)

<LeRecord "{'text': <LeText "This is a nasty input record.">, 'label': <LeTarget "0">}">
<LeRecord "{'text': <LeText "This is a nice input record.">, 'label': <LeTarget "1">}">
<LeRecord "{'text': <LeText "This is a awful input record.">, 'label': <LeTarget "0">}">


In [31]:
record1 = {
    "text": "This is a nasty input record.",
    "label": 0
}

record2 = {
    "text": "This is a beautiful input record.",
    "label": 1
}

##### `TextMix`

When you think about how the data is actually processed, it's almost always done separately (e.g. we have a `inputs` variable and a `labels` variable. Work is done essentially by column rather than record and therefore we should be tracking lineage on by text and target rather than record. LeRecord is likely to be an unnecessary abstraction or maybe just a convenience class that helps quickly enable lineage on a data row. 

Let's experiment with changing tracking the whole dataset.

In [32]:
from datasets import load_dataset

In [33]:
dataset = load_dataset("glue", "sst2", split="train")

Reusing dataset glue (/home/coraline/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [34]:
def enable_dataset_lineage(batch):
    """
    To be used with datasets.Dataset.map()
    Example:
    `
    updated_dataset = dataset.map(enable_dataset_lineage, 
                                  batched=True, 
                                  batch_size=batch_size)
    `
    NOTE: This doesn't work because LeText and LeTarget
          are not permissible Arrow data types.
    """
    data = [LeText(t) for t in batch['sentence']]
    target = [LeTarget(t) for t in batch['label']]
    return {'sentence': data, 'label': batch['label']}

In [35]:
def enable_dataset_lineage(dataset, text_keys=None, target_keys=None):

    # guess keys if none provided
    if not text_keys or not target_keys:
        dataset_features = dataset.features
        if not text_keys:
            text_keys = []
            for k in dataset_features.keys():
                if (k in ["text", "sentence"] 
                    or dataset_features[k].dtype == "string"):
                    text_keys.append(k)
        if not target_keys:
            target_keys = []
            for k in dataset_features.keys():
                if k in ["label", "target"]:
                    target_keys.append(k)

    # enable text lineage
    if len(text_keys) == 1:
        le_texts = [LeText(t) for t in dataset[text_keys[0]]]
    else:
        le_texts = []
        for k in text_keys:
            le_texts.append([LeText(t) for t in dataset[k]])

    # enable target lineage
    if len(target_keys) == 1:
        le_targets = [LeTarget(t) for t in dataset[target_keys[0]]]
    else:
        le_targets = []
        for k in target_keys:
            le_targets.append([LeTarget(t) for t in dataset[k]])

    return le_texts, le_targets

In [36]:
texts, labels = enable_dataset_lineage(dataset)

In [37]:
# # create a batch to process
# batch_size = 12
# batch = (texts[:batch_size], labels[:batch_size])

# # define transform function
# trans_fn = TextMix()

# # apply transform
# trans_fn(batch)

In [38]:
# class Provenance
# class FeatureProvenance(Provenance)
# class TransformationProvenance(Provenance)

# class LeText:
#    self.prov = prov 

# look at https://github.com/maligulzar/OptDebug and replicate basic structure into python

In [39]:
class LeContext:

    def __init__(self, original_batch):
        self.original_batch = original_batch
        orig_texts, orig_targets = self.original_batch

        # extract text and targets for transformation
        self.texts = [t.text if isinstance(t, LeText) else t for t in orig_texts]
        self.targets = [t.target if isinstance(t, LeTarget) else t for t in orig_targets]

        # enable lineage if not already
        self.le_texts = [LeText(t) if not isinstance(t, LeText) else t for t in orig_texts]
        self.le_targets = [LeTarget(t) if not isinstance(t, LeTarget) else t for t in orig_targets]

        # variables to extract from LeContext after processing
        self.linked_texts = None
        self.linked_targets = None
    
    def apply(self, fn, *args, **kwargs):
        self.fn = fn
        new_texts, new_targets = fn((self.texts, self.targets), *args, **kwargs)
        self.new_texts = new_texts
        self.new_targets = new_targets

        linked_texts, linked_targets = [], []
        for text1, text2, target1, target2 in zip(self.le_texts,
                                                  self.new_texts,
                                                  self.le_targets,
                                                  self.new_targets):
          
            # track changes to texts
            parsed_a, parsed_b, changes = diff_text(text1.text, 
                                                    text2, 
                                                    granularity=text1.granularity)
            
            text_attrs = {
                "granularity": text1.granularity,
                "changes": changes,
                "transformation": self.fn,
                "previous": text1
            }
            text2 = LeText(text2, text1.granularity, text_attrs)
            linked_texts.append(text2)

            # track changes to targets
            target_attrs = {
                "transformation": self.fn,
                "previous": target1
            }
            target2 = LeTarget(target2, target_attrs)
            linked_targets.append(target2)
        
        self.linked_texts = linked_texts
        self.linked_targets = linked_targets
        return self.linked_texts, self.linked_targets

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_tb):
        if isinstance(exc_value, Exception):
            print(f"An exception occurred in your with block: {exc_type}")
            print(f"Exception message: {exc_value}")
            print(f"Traceback info: {exc_tb}")

## Test new LeContext

In [1]:
from lineage import LeContext
from sibyl import ChangeAntonym, InsertNegativePhrase, TextMix

  from .autonotebook import tqdm as notebook_tqdm
2022-06-28 22:03:40.374269: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


In [2]:
transform = TextMix()

in_text = [
    "The characters are unlikeable and the script is awful. It's a waste of the talents of Deneuve and Auteuil.", 
    "Unwatchable. You can't even make it past the first three minutes. And this is coming from a huge Adam Sandler fan!!1",
    "An unfunny, unworthy picture which is an undeserving end to Peter Sellers' career. It is a pity this movie was ever made.",
    "I think it's one of the greatest movies which are ever made, and I've seen many... The book is better, but it's still a very good movie!",
    "The only thing serious about this movie is the humor. Well worth the rental price. I'll bet you watch it twice. It's obvious that Sutherland enjoyed his role.",
    "Touching; Well directed autobiography of a talented young director/producer. A love story with Rabin's assassination in the background. Worth seeing"
]

in_target = [0, 0, 0, 1, 1, 1] # (imdb dataset 0=negative, 1=positive)

batch = (in_text, in_target)



In [3]:
with LeContext(batch) as le:
    new_records = le.apply(transform, num_classes=2)

In [4]:
new_batch = new_records

with LeContext(new_batch) as le:
    new_records2 = le.apply(transform, num_classes=2)

In [5]:
new_records[1].text

'b"Unwatchable. You can\'t even make it past the first three minutes. And this is coming from a huge Adam Sandler fan!!1 Unwatchable. You can\'t even make it past the first three minutes. And this is coming from a huge Adam Sandler fan!!1"'

In [6]:
new_records2[1]

<LeRecord:
	 text="b'b"Unwatchable. You can\'t even make it past the first three minutes. And this is coming from a huge Adam Sandler fan!!1 Unwatchable. You can\'t even make it past the first three minutes. And this is coming from a huge Adam Sandler fan!!1" b"An unfunny, unworthy picture which is an undeserving end to Peter Sellers\' career. It is a pity this movie was ever made. The only thing serious about this movie is the humor. Well worth the rental price. I\'ll bet you watch it twice. It\'s obvious that Sutherland enjoyed his role."'",
	 target="[0.8884123728427289, 0.11158762715727105]",
	 le_attrs={'transformation_provenance': <TransformationProvenance: {(0, "{'class': 'TextMix', 'return_metadata': False}"), (1, "{'class': 'TextMix', 'return_metadata': False}")}>, 'feature_provenance': <FeatureProvenance[edit_seq] {(0, (41, 42), 'replace: [20,21]-[41,42]'), (1, (42, 92), 'insert: [42,42]-[42,92]')}>, 'granularity': 'word'}>