
Better Tokenization #1

Closed
4 of 6 tasks
abuchmueller opened this issue Apr 9, 2021 · 1 comment
Labels: enhancement (New feature or request)

Comments

abuchmueller commented Apr 9, 2021

abuchmueller added the enhancement (New feature or request) and high priority (This has priority) labels Apr 9, 2021
abuchmueller added a commit that referenced this issue Apr 14, 2021
- fixes for quanteda 3.0 update close #11
- now returns subset of original data
abuchmueller added a commit that referenced this issue Apr 20, 2021
abuchmueller added a commit that referenced this issue Apr 27, 2021
abuchmueller (Owner, Author) commented:

There are two possible scenarios for implementing support for external tokenizers:

  1. A copy of pool_tweets() where, instead of arguments being passed to the quanteda tokenizer, an external tokenizer function is passed as an argument. This will be difficult and tedious to test, since there are many tokenizers.
  2. A copy of pool_tweets() that returns only the text/full corpus of the final document pool instead of a finished document-term matrix. This is easier to implement and test, but requires more effort from the user to work with (see the sketch after this list).
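
To make scenario 2 concrete, here is a minimal, hypothetical sketch. The function name pool_tweets_text and the pooled_docs input are illustrative assumptions, not the package's actual API: the idea is just that the variant stops before tokenization and hands the pooled documents back to the user.

```r
# Hypothetical sketch of scenario 2 (names are assumptions, not the real API):
# return the pooled documents as a corpus instead of a finished DFM, so any
# external tokenizer can be applied afterwards.

# 'pooled_docs' is assumed to be a named character vector with one element
# per hashtag pool, assembled the same way pool_tweets() does internally.
pool_tweets_text <- function(pooled_docs) {
  quanteda::corpus(pooled_docs)
}

# The user then tokenizes however they like, e.g.:
# corp <- pool_tweets_text(pooled_docs)
# toks <- tokenizers::tokenize_words(as.character(corp))  # external tokenizer
# mat  <- quanteda::dfm(quanteda::tokens(corp))           # or quanteda's own
```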
