GH-33: remove empty sentences #795

Merged on Jun 13, 2019 (2 commits)

alanakbik
Collaborator

Users sometimes run into problems with datasets that contain empty sentences (see #33). We currently issue a warning when an empty sentence is constructed, but leave it to users to filter out empty sentences themselves.

This PR adds a new method to the Corpus object, filter_empty_sentences(). Calling it removes all empty sentences from the corpus.

Example:

from flair.data import Sentence
from flair.datasets import IMDB, SentenceDataset

# load dataset (IMDB dataset has no empty sentences)
corpus = IMDB().downsample(0.001)
print(corpus)

# add an empty sentence to the training split
corpus._train += SentenceDataset(Sentence(''))
print(corpus)

# call .filter_empty_sentences() to remove empty sentences
corpus.filter_empty_sentences()
print(corpus)
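For readers without the library installed, the idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the code merged in this PR: `ToySentence` and `ToyCorpus` are hypothetical stand-ins, and "empty" is assumed to mean "contains no tokens after whitespace splitting".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySentence:
    """Hypothetical stand-in for a sentence: just a list of token strings."""
    tokens: List[str] = field(default_factory=list)

    @classmethod
    def from_text(cls, text: str) -> "ToySentence":
        # Whitespace tokenization; an empty or blank string yields no tokens.
        return cls(text.split())

    def __len__(self) -> int:
        return len(self.tokens)


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for a corpus with train/dev/test splits."""
    train: List[ToySentence]
    dev: List[ToySentence]
    test: List[ToySentence]

    def filter_empty_sentences(self) -> None:
        # Keep only sentences with at least one token, in every split.
        self.train = [s for s in self.train if len(s) > 0]
        self.dev = [s for s in self.dev if len(s) > 0]
        self.test = [s for s in self.test if len(s) > 0]


corpus = ToyCorpus(
    train=[ToySentence.from_text("a fine movie"), ToySentence.from_text("")],
    dev=[ToySentence.from_text("   ")],
    test=[ToySentence.from_text("not bad")],
)
corpus.filter_empty_sentences()
sizes = (len(corpus.train), len(corpus.dev), len(corpus.test))
print(sizes)  # (1, 0, 1)
```

The sketch filters each split in place, so callers do not need to rebuild the corpus; whether the real method touches all splits or handles lazily loaded datasets differently is not stated in this PR description.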

@kashif
Contributor

kashif commented Jun 13, 2019

👍

@alanakbik
Collaborator Author

👍

@alanakbik alanakbik merged commit 3ec40b5 into master Jun 13, 2019
@alanakbik alanakbik deleted the GH-33-remove-empty-sentences branch June 13, 2019 08:52