
Assuming independent cases in current version #6

Open
erwanm opened this issue Nov 7, 2017 · 1 comment

erwanm commented Nov 7, 2017

Issue migrated from the original private GitLab repo.

Random splitting between the train set and the test set can lead to biased results if related cases (e.g. the same documents reused, or the same author) end up distributed across the two sets.
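One way to avoid this kind of leakage would be to split by group rather than by individual case, so that all cases sharing an author or a source document land on the same side. A minimal sketch (not what the current code does; the group labels are just an assumed input):

```python
# Sketch only: assumes each case carries a group id (e.g. author id or source
# document id). GroupShuffleSplit keeps whole groups on one side of the split.
from sklearn.model_selection import GroupShuffleSplit

def split_by_group(cases, groups, test_size=0.3, seed=42):
    """Return (train_cases, test_cases) with every group entirely in one set."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(cases, groups=groups))
    return [cases[i] for i in train_idx], [cases[i] for i in test_idx]

# Usage: cases could be verification cases, groups the corresponding author ids.
cases = ["case1", "case2", "case3", "case4", "case5", "case6"]
groups = ["authorA", "authorA", "authorB", "authorC", "authorC", "authorD"]
train, test = split_by_group(cases, groups)
```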

erwanm commented Nov 7, 2017

Possible option: before splitting, apply some kind of document-to-document comparison (as for the impostors method) to detect possible duplicates or semi-duplicates. But even in that case it is not always possible to find a good way to split.
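A rough sketch of that idea (not the project's actual impostors code; the character n-gram TF-IDF similarity and the 0.8 threshold are assumptions): compare documents pairwise, link any pair above the threshold, and treat each connected component as one group for a group-based split like the one above.

```python
# Sketch: detect near-duplicate documents before splitting. The similarity
# measure and threshold here are illustrative assumptions, not the tool's
# actual impostors implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def duplicate_groups(docs, threshold=0.8):
    """Return a group id per document; near-duplicates share the same id."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
    sims = cosine_similarity(vec.fit_transform(docs))

    # Union-find over documents linked by a similarity above the threshold.
    parent = list(range(len(docs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(docs))]
```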
