refactor: improve Multilabel design #3658
I took the opportunity of this PR to learn more about dataclasses. Sorry for the mess... @ZanSara feel free to jump in and guide me a bit.
Current solution
For the moment, I've dropped the attributes that are not required in
The most natural solution would have been to leave these attributes and mark them as
Other proposal
I also tested another solution that preserves the current
@dataclass
class MultiLabel:
    labels: List[Label]
    query: str = Field(default=None)
    answers: List[str] = Field(default=None)
    ...
    # no manually defined __init__
    def __post_init__(self, drop_negative_labels=False, drop_no_answers=False):
        # drop duplicate labels and remove negative labels if needed.
        labels = list(dict.fromkeys(self.labels))
        ...
        # __post_init__ does the work of the current __init__
Rereading it, perhaps this last solution is not bad. WDYT? Any better/simpler ideas?
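For reference, here is a minimal sketch of how a `__post_init__` variant could be wired with the standard-library `dataclasses` module. It assumes stdlib dataclasses (not whatever `Field` refers to in the snippet above), and the `Label` class is a hypothetical, heavily simplified stand-in rather than Haystack's real `Label`. The point it illustrates is that init-only parameters reach `__post_init__` only when they are declared as `InitVar` pseudo-fields.

```python
from dataclasses import InitVar, dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Label:
    # Hypothetical stand-in for the real Label; frozen so instances are hashable.
    query: str
    answer: Optional[str] = None
    no_answer: bool = False


@dataclass
class MultiLabel:
    labels: List[Label]
    # InitVar pseudo-fields are accepted by the generated __init__ and handed
    # to __post_init__, but they do not become dataclass fields.
    drop_negative_labels: InitVar[bool] = False
    drop_no_answers: InitVar[bool] = False
    # Derived attributes: excluded from __init__, filled in by __post_init__.
    query: Optional[str] = field(default=None, init=False)
    answers: List[str] = field(default_factory=list, init=False)

    def __post_init__(self, drop_negative_labels: bool, drop_no_answers: bool):
        # Drop duplicate labels while preserving their order.
        labels = list(dict.fromkeys(self.labels))
        # Negative-label filtering is omitted: the stand-in Label has no such notion.
        if drop_no_answers:
            labels = [l for l in labels if not l.no_answer]
        self.labels = labels
        self.query = labels[0].query if labels else None
        self.answers = [l.answer for l in labels if l.answer is not None]
```

With this layout, `MultiLabel(labels=[...], drop_no_answers=True)` works without a hand-written `__init__`, while `query` and `answers` cannot be passed by the caller.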
After reading this PR, your options and the related issues, I think I have an even more radical opinion about what
import hashlib
import json
from typing import List


class MultiLabel:
    def __init__(self, labels, drop_negative_labels=False, drop_no_answers=False):
        """
        ... docstring ...
        """
        # Drop duplicate labels (order-preserving) and filter them if requested.
        labels = list(dict.fromkeys(labels))
        if drop_negative_labels:
            labels = [l for l in labels if is_positive_label(l)]
        if drop_no_answers:
            labels = [l for l in labels if l.no_answer == False]
        # Assigning through the setter computes all the aggregated attributes.
        self.labels = labels
        self.id = hashlib.md5((self.query + json.dumps(self.filters, sort_keys=True)).encode()).hexdigest()

    @property
    def labels(self):
        return self._labels

    @labels.setter
    def labels(self, labels):
        self._labels = labels
        self._query = self._aggregate_labels(key="query", must_be_single_value=True)[0]
        self._filters = self._aggregate_labels(key="filters", must_be_single_value=True)[0]
        # Currently no_answer is only true if all labels are "no_answers", we could later introduce a param here to let
        # users decide which aggregation logic they want
        self._no_answer = False not in [l.no_answer for l in self.labels]
        # Answer strings and offsets cleaned for no_answers:
        # If there are only no_answers, offsets are empty and answers will be a single empty string
        # which equals the no_answers representation of reader nodes.
        # The private attributes are written directly because the public properties have no setter.
        if self._no_answer:
            self._answers = [""]
            self._offsets_in_documents: List[dict] = []
            self._offsets_in_contexts: List[dict] = []
        else:
            answered = [l.answer for l in self.labels if not l.no_answer and l.answer is not None]
            self._answers = [answer.answer for answer in answered]
            self._offsets_in_documents = []
            self._offsets_in_contexts = []
            for answer in answered:
                if answer.offsets_in_document is not None:
                    for span in answer.offsets_in_document:
                        self._offsets_in_documents.append({"start": span.start, "end": span.end})
                if answer.offsets_in_context is not None:
                    for span in answer.offsets_in_context:
                        self._offsets_in_contexts.append({"start": span.start, "end": span.end})
        # There are two options here to represent document_ids:
        # taking the id from the document of each label or taking the document_id of each label's answer.
        # We take the former as labels without answers are allowed.
        # For no_answer cases document_store.add_eval_data() currently adds all documents coming from the SQuAD paragraph's context
        # as separate no_answer labels, and thus with document.id but without answer.document_id.
        # If we do not exclude them from document_ids this would be problematic for retriever evaluation as they do not contain the answer.
        # Hence, we exclude them here as well.
        self._document_ids = [l.document.id for l in self.labels if not l.no_answer]
        self._contexts = [str(l.document.content) for l in self.labels if not l.no_answer]

    @property
    def query(self):
        return self._query

    @property
    def filters(self):
        return self._filters

    @property
    def document_ids(self):
        return self._document_ids

    @property
    def contexts(self):
        return self._contexts

    @property
    def no_answer(self):
        return self._no_answer

    @property
    def answers(self):
        return self._answers

    @property
    def offsets_in_documents(self):
        return self._offsets_in_documents

    @property
    def offsets_in_contexts(self):
        return self._offsets_in_contexts

    ... rest of the class ...
Warning: UNTESTED
Let me know if you have any concerns about this idea!
@ZanSara I implemented the approach proposed by you. I don't see any particular drawbacks. I had to reimplement/change
I'm sorry in advance to make you work twice 😅 but looking at this code again I had a realization.
labels is a list. That means that our nice setter does not always work as intended: it will surely work when the whole list is re-assigned, but it won't trigger on append(), for example. Which is not great 😅
I thought a bit more about this and I have two solutions in mind:
- Re-compute all composite values in their respective getters. This is the only one that works for sure, but the performance penalty might be quite high and might strongly slow down evaluation.
- Move the content of the labels setter back to the __init__, and remove the setter. By keeping only the getter we make labels unsettable, which is good because it discourages people from trying to modify labels, but such a solution is partial, because it still won't prevent people from appending.
I'd go for the second option, which is the fastest one and should not impact evaluation speed. If tests fail, let's see why and decide accordingly. What do you think?
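The limitation and the two options discussed above can be reproduced in a few lines. This is a minimal sketch using three hypothetical toy classes (not the PR's `MultiLabel`), where a simple `count` stands in for the aggregated attributes: a property setter runs on reassignment but not on in-place mutation, a getter-only property blocks reassignment while still allowing `append()`, and recomputing in the getter always stays in sync.

```python
class WithSetter:
    """Derived value is recomputed only inside the setter."""

    def __init__(self, labels):
        self.labels = labels  # goes through the setter

    @property
    def labels(self):
        return self._labels

    @labels.setter
    def labels(self, value):
        self._labels = value
        self._count = len(value)  # stands in for the aggregated attributes

    @property
    def count(self):
        return self._count


class GetterOnly:
    """Option 2: derived value computed once in __init__, no setter."""

    def __init__(self, labels):
        self._labels = labels
        self._count = len(labels)

    @property
    def labels(self):
        return self._labels

    @property
    def count(self):
        return self._count


class RecomputeInGetter:
    """Option 1: derived value recomputed on every access."""

    def __init__(self, labels):
        self.labels = labels

    @property
    def count(self):
        return len(self.labels)


stale = WithSetter(["a"])
stale.labels.append("b")   # in-place mutation: the setter is NOT called
stale_count = stale.count  # still reflects the original one-element list
stale.labels = ["a", "b"]  # reassignment: the setter IS called
fresh_count = stale.count

fixed = GetterOnly(["a"])
try:
    fixed.labels = ["x"]   # no setter defined, so this raises
    reassign_blocked = False
except AttributeError:
    reassign_blocked = True
fixed.labels.append("b")   # still possible; count is now out of date

live = RecomputeInGetter(["a"])
live.labels.append("b")    # count follows because it is computed on access
```

Option 1 stays correct at the cost of recomputing on each access; option 2 is cheap but relies on callers not mutating the list in place.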
Amazing, thank you! 😊
* first try and new test
* fix test
* fix unused import
* remove comments
* no more dataclass
* add __eq__ and extend test
* better design from review
* Update schema.py
* fix black
* fix openapi
* fix openapi 2
* new try to fix openapi
* remove newline from openapi json
Related Issues
MultiLabel serialization #3038
Proposed Changes:
Just a first draft to run the CI
How did you test it?
Introduced the test proposed by @tstadel in #3037
Checklist