
Better Tokenization #1

Closed
4 of 6 tasks
abuchmueller opened this issue Apr 9, 2021 · 1 comment
Labels: enhancement (New feature or request)

Comments

abuchmueller commented Apr 9, 2021

abuchmueller added the enhancement (New feature or request) and high priority (This has priority) labels Apr 9, 2021
abuchmueller added a commit that referenced this issue Apr 14, 2021
- fixes for quanteda 3.0 update close #11
- now returns subset of original data
abuchmueller added a commit that referenced this issue Apr 20, 2021
abuchmueller added a commit that referenced this issue Apr 27, 2021
abuchmueller (Owner, Author) commented:

There are two possible scenarios for implementing support for external tokenizers:

  1. A copy of pool_tweets() where, instead of arguments being passed to the quanteda tokenizer, an external tokenizer function is passed as an argument. This will be difficult and tedious to test, since there are many tokenizers.
  2. A copy of pool_tweets() that returns only the text/full corpus of the final document pool instead of a finished document-term matrix. This is easier to implement and test, but requires more effort from the user to work with (see the sketch after this list).
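
To make scenario 2 concrete, here is a minimal, hypothetical sketch. The function name pool_tweets_text and the pooled_docs input are illustrative assumptions, not the package's actual API: the idea is just that the variant stops before tokenization and hands the pooled documents back to the user.

```r
# Hypothetical sketch of scenario 2 (names are assumptions, not the real API):
# return the pooled documents as a corpus instead of a finished DFM, so any
# external tokenizer can be applied afterwards.

# 'pooled_docs' is assumed to be a named character vector with one element
# per hashtag pool, assembled the same way pool_tweets() does internally.
pool_tweets_text <- function(pooled_docs) {
  quanteda::corpus(pooled_docs)
}

# The user then tokenizes however they like, e.g.:
# corp <- pool_tweets_text(pooled_docs)
# toks <- tokenizers::tokenize_words(as.character(corp))  # external tokenizer
# mat  <- quanteda::dfm(quanteda::tokens(corp))           # or quanteda's own
```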
