Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: improve Multilabel design #3658

Merged
merged 14 commits into from
Dec 13, 2022
Merged

refactor: improve Multilabel design #3658

merged 14 commits into from
Dec 13, 2022

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Dec 1, 2022

Related Issues

Proposed Changes:

Just a first draft to run the CI

How did you test it?

Introduced the test proposed by @tstadel in #3037

Checklist

@anakin87 anakin87 marked this pull request as ready for review December 1, 2022 20:03
@anakin87 anakin87 requested a review from a team as a code owner December 1, 2022 20:03
@anakin87 anakin87 requested review from bogdankostic and removed request for a team December 1, 2022 20:03
@anakin87 anakin87 marked this pull request as draft December 1, 2022 21:08
@anakin87
Copy link
Member Author

anakin87 commented Dec 1, 2022

I took the opportunity of this PR to learn more about dataclasses. Sorry for the mess...

@ZanSara feel free to jump in and guide me a bit.

Current solution

For the moment, I've dropped the attributes that are not required in __init__.
The result is that they are created but don't automatically show up in __repr__ and to_dict.
The CI is green but probably it's not the best idea...

The most natural solution would have been to leave these attributes and mark them as field(init=False).
But it doesn't work because std dataclasses and pydantic dataclasses don't collaborate well.

Other proposal

I also tested another solution that preserves the current __repr__ and to_dict,
so it can be considered better from a practical point of view.
Something like this:

@dataclass
class MultiLabel:
    labels: List[Label]
    query: str = Field(default=None)
    answers: List[str] = Field(default=None)
    ...

   # no manually defined __init__

    def __post_init__(self, drop_negative_labels=False, drop_no_answers=False):
        # drop duplicate labels and remove negative labels if needed.
        labels = list(dict.fromkeys(self.labels))
        ...
      # __post_init__ does the work of the current __init__

Rereading it, perhaps this last solution is not bad.
At first, it seemed hard to read and understand for future programmers (or for ourselves in the near future 😄).

WDYT? Any better/simpler ideas?

@anakin87 anakin87 marked this pull request as ready for review December 1, 2022 22:55
@ZanSara
Copy link
Contributor

ZanSara commented Dec 5, 2022

After reading this PR, your options and the related issues, I think I have an even more radical opinion about what MultiLabel should be 😄

  1. I can't see what benefits MultiLabel gets from being a dataclass. It has a lot of logic and almost no own attributes. It's more of an aggregator over existing dataclasses and is not written into document stores. We could try to remove the decorator to see if it has any side effects.

  2. Passing anything to this class except labels, drop_negative_labels and drop_no_answers is meaningless. The others are all computed in the __init__, makes no sense to set by hand, and they should update in the case the value of labels change (which is not done, currently). This screams to me: "they are read-only @properties". Can we change them into read-only properties? Something on the lines of:

class MultiLabel:

  def __init__(labels, drop_negative_labels, drop_no_answers):
    """ 
    ...  docstring  ... 
    """
    labels = list(dict.fromkeys(labels))
        if drop_negative_labels:
            labels = [l for l in labels if is_positive_label(l)]
        if drop_no_answers:
            labels = [l for l in labels if l.no_answer == False]
    self.labels = labels
    self.id = hashlib.md5((self.query + json.dumps(self.filters, sort_keys=True)).encode()).hexdigest()

  @property
  def labels(self):
    return self._labels

  @labels.setter
  def labels(self):
    self._labels = labels
    self._query = self._aggregate_labels(key="query", must_be_single_value=True)[0]
    self._filters = self._aggregate_labels(key="filters", must_be_single_value=True)[0]
    
    # Currently no_answer is only true if all labels are "no_answers", we could later introduce a param here to let
    # users decided which aggregation logic they want
    self._no_answer = False not in [l.no_answer for l in self.labels]
    # Answer strings and offsets cleaned for no_answers:
    # If there are only no_answers, offsets are empty and answers will be a single empty string
    # which equals the no_answers representation of reader nodes.
    if self._no_answer:
        self.answers = [""]
        self.offsets_in_documents: List[dict] = []
        self.offsets_in_contexts: List[dict] = []
    else:
        answered = [l.answer for l in self.labels if not l.no_answer and l.answer is not None]
        self.answers = [answer.answer for answer in answered]
        self.offsets_in_documents = []
        self.offsets_in_contexts = []
        for answer in answered:
            if answer.offsets_in_document is not None:
                for span in answer.offsets_in_document:
                    self.offsets_in_documents.append({"start": span.start, "end": span.end})
            if answer.offsets_in_context is not None:
                for span in answer.offsets_in_context:
                    self.offsets_in_contexts.append({"start": span.start, "end": span.end})

    # There are two options here to represent document_ids:
    # taking the id from the document of each label or taking the document_id of each label's answer.
    # We take the former as labels without answers are allowed.
    # For no_answer cases document_store.add_eval_data() currently adds all documents coming from the SQuAD paragraph's context
    # as separate no_answer labels, and thus with document.id but without answer.document_id.
    # If we do not exclude them from document_ids this would be problematic for retriever evaluation as they do not contain the answer.
    # Hence, we exclude them here as well.
    self._document_ids = [l.document.id for l in self.labels if not l.no_answer]
    self._contexts = [str(l.document.content) for l in self.labels if not l.no_answer]

  @property
  def query(self):
    return self._query

  @property
  def filters(self):
    return self._filters

  @property
  def document_ids(self):
    return self._document_ids

  @property
  def contexts(self):
    return self._contexts

  @property
  def no_answers(self):
    return self._no_answers

  @property
  def answers(self):
    return self._answers

  @property
  def offsets_in_documents(self):
    return self._offsets_in_documents

  @property
  def offsets_in_contexts(self):
    return self._offsets_in_contexts

... rest of the class ...

Warning: UNTESTED

  1. If we want the values of these properties in the output of to_dict, or in repr, then let's be clear about it. to_dict should dump those values in the output and from_dict should ignore them. No magic involved 😁

Let me know if you have any concerns about this idea!

@anakin87 anakin87 changed the title refactor: improve Multilabel serialization refactor: improve Multilabel design Dec 6, 2022
@anakin87
Copy link
Member Author

anakin87 commented Dec 6, 2022

@ZanSara I implemented the approach proposed by you.

I don't see any particular drawbacks.

I had to reimplement/change to_dict, from_dict, to_json and from_json.
I also had to introduce a simple custom __eq__ method.

@ZanSara ZanSara added topic:labelling type:refactor Not necessarily visible to the users labels Dec 7, 2022
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry in advance to make you work twice 😅 but looking at this code again I had a realization.

labels is a list. That means that our nice setter does not always work as intended: it will surely work when the whole list is re-assigned, but it won't trigger on append(), for example. Which is not great 😅

I thought a bit more about this and I have two solutions in mind:

  • Re-compute all composite values in their respective getters. This is the only one that works for sure, but the performance penalty might be quite high and might strongly slow down evaluation.

  • Move the content of the labels setter back to the __init__, and remove the setter. By keeping only the getter we make labels unsettable, which is good because it discourages people from trying to modify labels, but such solution is partial, because it still won't prevent people from appending.

I'd go for the second option, which is the fastest one and should not impact evaluation speed. If tests fail, let's see why and decide accordingly. What do you think?

haystack/schema.py Outdated Show resolved Hide resolved
haystack/schema.py Show resolved Hide resolved
haystack/schema.py Show resolved Hide resolved
haystack/schema.py Show resolved Hide resolved
haystack/schema.py Outdated Show resolved Hide resolved
haystack/schema.py Outdated Show resolved Hide resolved
@anakin87 anakin87 marked this pull request as draft December 8, 2022 15:41
@anakin87 anakin87 marked this pull request as ready for review December 8, 2022 15:41
@anakin87 anakin87 requested review from ZanSara and removed request for bogdankostic December 8, 2022 15:42
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thank you! 😊

@ZanSara ZanSara merged commit e1401f7 into deepset-ai:main Dec 13, 2022
@anakin87 anakin87 deleted the improve_multilabel_serialization branch December 13, 2022 09:48
sjrl pushed a commit that referenced this pull request Dec 13, 2022
* first try and new test

* fix test

* fix unused import

* remove comments

* no more dataclass

* add __eq__ and extend test

* better design from review

* Update schema.py

* fix black

* fix openapi

* fix openapi 2

* new try to fix openapi

* remove newline from openapi json
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:labelling type:refactor Not necessarily visible to the users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve MultiLabel serialization
2 participants