refactor: improve Multilabel design #3658

anakin87 · 2022-12-01T19:22:53Z

Related Issues

fixes Improve MultiLabel serialization #3038

Proposed Changes:

Just a first draft to run the CI

How did you test it?

Introduced the test proposed by @tstadel in #3037

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

anakin87 · 2022-12-01T22:55:34Z

I took the opportunity of this PR to learn more about dataclasses. Sorry for the mess...

@ZanSara feel free to jump in and guide me a bit.

Current solution

For the moment, I've dropped the attributes that are not required in __init__.
The result is that they are created but don't automatically show up in __repr__ and to_dict.
The CI is green but probably it's not the best idea...

The most natural solution would have been to leave these attributes and mark them as field(init=False).
But it doesn't work because std dataclasses and pydantic dataclasses don't collaborate well.

Other proposal

I also tested another solution that preserves the current __repr__ and to_dict,
so it can be considered better from a practical point of view.
Something like this:

@dataclass
class MultiLabel:
    labels: List[Label]
    query: str = Field(default=None)
    answers: List[str] = Field(default=None)
    ...

   # no manually defined __init__

    def __post_init__(self, drop_negative_labels=False, drop_no_answers=False):
        # drop duplicate labels and remove negative labels if needed.
        labels = list(dict.fromkeys(self.labels))
        ...
      # __post_init__ does the work of the current __init__

Rereading it, perhaps this last solution is not bad.
At first, it seemed hard to read and understand for future programmers (or for ourselves in the near future 😄).

WDYT? Any better/simpler ideas?

ZanSara · 2022-12-05T10:00:49Z

After reading this PR, your options and the related issues, I think I have an even more radical opinion about what MultiLabel should be 😄

I can't see what benefits MultiLabel gets from being a dataclass. It has a lot of logic and almost no own attributes. It's more of an aggregator over existing dataclasses and is not written into document stores. We could try to remove the decorator to see if it has any side effects.
Passing anything to this class except labels, drop_negative_labels and drop_no_answers is meaningless. The others are all computed in the __init__, makes no sense to set by hand, and they should update in the case the value of labels change (which is not done, currently). This screams to me: "they are read-only @properties". Can we change them into read-only properties? Something on the lines of:

class MultiLabel:

  def __init__(labels, drop_negative_labels, drop_no_answers):
    """ 
    ...  docstring  ... 
    """
    labels = list(dict.fromkeys(labels))
        if drop_negative_labels:
            labels = [l for l in labels if is_positive_label(l)]
        if drop_no_answers:
            labels = [l for l in labels if l.no_answer == False]
    self.labels = labels
    self.id = hashlib.md5((self.query + json.dumps(self.filters, sort_keys=True)).encode()).hexdigest()

  @property
  def labels(self):
    return self._labels

  @labels.setter
  def labels(self):
    self._labels = labels
    self._query = self._aggregate_labels(key="query", must_be_single_value=True)[0]
    self._filters = self._aggregate_labels(key="filters", must_be_single_value=True)[0]
    
    # Currently no_answer is only true if all labels are "no_answers", we could later introduce a param here to let
    # users decided which aggregation logic they want
    self._no_answer = False not in [l.no_answer for l in self.labels]
    # Answer strings and offsets cleaned for no_answers:
    # If there are only no_answers, offsets are empty and answers will be a single empty string
    # which equals the no_answers representation of reader nodes.
    if self._no_answer:
        self.answers = [""]
        self.offsets_in_documents: List[dict] = []
        self.offsets_in_contexts: List[dict] = []
    else:
        answered = [l.answer for l in self.labels if not l.no_answer and l.answer is not None]
        self.answers = [answer.answer for answer in answered]
        self.offsets_in_documents = []
        self.offsets_in_contexts = []
        for answer in answered:
            if answer.offsets_in_document is not None:
                for span in answer.offsets_in_document:
                    self.offsets_in_documents.append({"start": span.start, "end": span.end})
            if answer.offsets_in_context is not None:
                for span in answer.offsets_in_context:
                    self.offsets_in_contexts.append({"start": span.start, "end": span.end})

    # There are two options here to represent document_ids:
    # taking the id from the document of each label or taking the document_id of each label's answer.
    # We take the former as labels without answers are allowed.
    # For no_answer cases document_store.add_eval_data() currently adds all documents coming from the SQuAD paragraph's context
    # as separate no_answer labels, and thus with document.id but without answer.document_id.
    # If we do not exclude them from document_ids this would be problematic for retriever evaluation as they do not contain the answer.
    # Hence, we exclude them here as well.
    self._document_ids = [l.document.id for l in self.labels if not l.no_answer]
    self._contexts = [str(l.document.content) for l in self.labels if not l.no_answer]

  @property
  def query(self):
    return self._query

  @property
  def filters(self):
    return self._filters

  @property
  def document_ids(self):
    return self._document_ids

  @property
  def contexts(self):
    return self._contexts

  @property
  def no_answers(self):
    return self._no_answers

  @property
  def answers(self):
    return self._answers

  @property
  def offsets_in_documents(self):
    return self._offsets_in_documents

  @property
  def offsets_in_contexts(self):
    return self._offsets_in_contexts

... rest of the class ...

Warning: UNTESTED

If we want the values of these properties in the output of to_dict, or in repr, then let's be clear about it. to_dict should dump those values in the output and from_dict should ignore them. No magic involved 😁

Let me know if you have any concerns about this idea!

anakin87 · 2022-12-06T21:07:37Z

@ZanSara I implemented the approach proposed by you.

I don't see any particular drawbacks.

I had to reimplement/change to_dict, from_dict, to_json and from_json.
I also had to introduce a simple custom __eq__ method.

ZanSara

I'm sorry in advance to make you work twice 😅 but looking at this code again I had a realization.

labels is a list. That means that our nice setter does not always work as intended: it will surely work when the whole list is re-assigned, but it won't trigger on append(), for example. Which is not great 😅

I thought a bit more about this and I have two solutions in mind:

Re-compute all composite values in their respective getters. This is the only one that works for sure, but the performance penalty might be quite high and might strongly slow down evaluation.
Move the content of the labels setter back to the __init__, and remove the setter. By keeping only the getter we make labels unsettable, which is good because it discourages people from trying to modify labels, but such solution is partial, because it still won't prevent people from appending.

I'd go for the second option, which is the fastest one and should not impact evaluation speed. If tests fail, let's see why and decide accordingly. What do you think?

haystack/schema.py

ZanSara

Amazing, thank you! 😊

* first try and new test * fix test * fix unused import * remove comments * no more dataclass * add __eq__ and extend test * better design from review * Update schema.py * fix black * fix openapi * fix openapi 2 * new try to fix openapi * remove newline from openapi json

anakin87 added 4 commits December 1, 2022 20:12

first try and new test

4b39ff8

fix test

898e8b5

fix unused import

262191a

remove comments

dbeff0e

anakin87 marked this pull request as ready for review December 1, 2022 20:03

anakin87 requested a review from a team as a code owner December 1, 2022 20:03

anakin87 requested review from bogdankostic and removed request for a team December 1, 2022 20:03

anakin87 marked this pull request as draft December 1, 2022 21:08

anakin87 marked this pull request as ready for review December 1, 2022 22:55

anakin87 added 2 commits December 6, 2022 20:16

no more dataclass

5b9d92a

add __eq__ and extend test

aab0011

anakin87 changed the title ~~refactor: improve Multilabel serialization~~ refactor: improve Multilabel design Dec 6, 2022

ZanSara added topic:labelling type:refactor Not necessarily visible to the users labels Dec 7, 2022

ZanSara reviewed Dec 7, 2022

View reviewed changes

better design from review

6de5d09

anakin87 marked this pull request as draft December 8, 2022 15:41

anakin87 marked this pull request as ready for review December 8, 2022 15:41

anakin87 requested review from ZanSara and removed request for bogdankostic December 8, 2022 15:42

anakin87 added 6 commits December 12, 2022 11:33

Update schema.py

f5994e5

fix black

7ad89b4

Merge branch 'main' into improve_multilabel_serialization

1888000

fix openapi

1bda9d1

fix openapi 2

66aeba2

new try to fix openapi

06a5722

remove newline from openapi json

c2a7ef3

ZanSara approved these changes Dec 13, 2022

View reviewed changes

ZanSara merged commit e1401f7 into deepset-ai:main Dec 13, 2022

anakin87 deleted the improve_multilabel_serialization branch December 13, 2022 09:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: improve Multilabel design #3658

refactor: improve Multilabel design #3658

anakin87 commented Dec 1, 2022 •

edited

anakin87 commented Dec 1, 2022

ZanSara commented Dec 5, 2022

anakin87 commented Dec 6, 2022

ZanSara left a comment

ZanSara left a comment

refactor: improve Multilabel design #3658

refactor: improve Multilabel design #3658

Conversation

anakin87 commented Dec 1, 2022 • edited

Related Issues

Proposed Changes:

How did you test it?

Checklist

anakin87 commented Dec 1, 2022

Current solution

Other proposal

ZanSara commented Dec 5, 2022

anakin87 commented Dec 6, 2022

ZanSara left a comment

Choose a reason for hiding this comment

ZanSara left a comment

Choose a reason for hiding this comment

anakin87 commented Dec 1, 2022 •

edited