Let SquadData support data from Annotation Tool #2329

Merged: 6 commits, Mar 22, 2022

Conversation

brandenchan
Contributor

Proposed changes:
Data from our annotation tool has no title field but does have a document_id field. This PR gives the SquadData object the ability to support this kind of data.

Member

@julian-risch julian-risch left a comment

Looks good to me so far. One concern I have is that documents sometimes receive the same, empty id. That could become confusing, especially because later on, within DocumentStores, we treat documents with the same id as duplicates.
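To illustrate the concern, here is a minimal sketch of how id-based duplicate handling can silently drop documents that share an empty id. The `write_documents` helper and the dict-backed store are hypothetical stand-ins for illustration only, not the actual DocumentStore API.

```python
from typing import Dict, List


def write_documents(store: Dict[str, dict], documents: List[dict]) -> int:
    """Hypothetical toy store keyed by id: a document whose id already exists is skipped."""
    skipped = 0
    for doc in documents:
        if doc["id"] in store:
            # Same id means "duplicate", even if the content differs.
            skipped += 1
            continue
        store[doc["id"]] = doc
    return skipped


# Two different paragraphs that both fell back to the empty default id.
docs = [
    {"id": "", "content": "Berlin is the capital of Germany."},
    {"id": "", "content": "Paris is the capital of France."},
]

store: Dict[str, dict] = {}
skipped = write_documents(store, docs)
print(len(store), skipped)
```

In this sketch only the first paragraph is stored; the second is treated as a duplicate purely because of the shared empty id.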

for paragraph in document["paragraphs"]:
    document_id = ""
Member

That would result in documents without a document_id in paragraph receiving the same, empty document_id. Is that really what we would like to achieve or could that later on lead to problems? What about using the document title to generate an id as a hash of the title, similar to what we do here:

def _get_id(self, id_hash_keys: Optional[List[str]] = None):

Member

Ah, and please don't forget to add labels to the PR.

Contributor Author

Yes, good idea. I have tried generating document_id by hashing the actual text content, since sometimes we don't have a title (e.g. data from the annotation tool). Labels have also been added now!
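A rough sketch of the idea described in this reply: keep an explicit document_id when the data provides one, and otherwise derive one from the paragraph text. The helper name `make_document_id` and the choice of MD5 are assumptions made for illustration; the merged code in haystack/utils/squad_data.py is not shown in this thread and may differ.

```python
import hashlib
from typing import Optional


def make_document_id(context: str, document_id: Optional[str] = None) -> str:
    """Return the given id if it is non-empty, else a hash of the text content.

    Hypothetical helper: the hash function (MD5 here) is an assumption,
    used only to get a stable, content-dependent identifier.
    """
    if document_id:
        return str(document_id)
    return hashlib.md5(context.encode("utf-8")).hexdigest()


# Identical text maps to the same id; different text maps to a different id.
id_a = make_document_id("Berlin is the capital of Germany.")
id_b = make_document_id("Paris is the capital of France.")
print(id_a != id_b)
```

Hashing the content rather than the title works even when no title exists, at the cost that two documents with identical text would still share an id.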

Contributor

@ZanSara ZanSara left a comment

I agree with @julian-risch about the default empty ID causing problems. For the rest, only a small style comment.

haystack/utils/squad_data.py (outdated, resolved)
Member

@julian-risch julian-risch left a comment

LGTM 👍

Member

@julian-risch julian-risch left a comment

LGTM 👍

@brandenchan brandenchan merged commit 6233dfc into master Mar 22, 2022
@brandenchan brandenchan deleted the squad_data branch March 22, 2022 09:17
3 participants