-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Let SquadData support data from Annotation Tool #2329
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me so far. One concern that I have is about documents sometimes receiving the same, empty id. That could become confusing especially because later on, within DocumentStores, we handle documents as duplicates if they have the same id.
haystack/utils/squad_data.py
Outdated
for paragraph in document["paragraphs"]: | ||
document_id = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would result in documents without a document_id
in paragraph receiving the same, empty document_id
. Is that really what we would like to achieve or could that later on lead to problems? What about using the document title to generate an id as a hash of the title, similar to what we do here:
Line 117 in ac5617e
def _get_id(self, id_hash_keys: Optional[List[str]] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, and please don't forget to add labels to the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good idea, I have tried to generate document_id by hashing the actual text content since sometimes we don't have a title (e.g. data from annotation tool). Labels have also been added now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @julian-risch about the default empty ID causing problems. For the rest, only a small style comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Proposed changes:
Data from our annotation has no title field but has a document_id field. This PR gives the SquadData object the ability to support this kind of data