Hi Dmitry,
My question is not purely a text2vec question, but since other packages are slower and less flexible, I would like to stay within the text2vec workflow. Before asking on SO, I wanted to ask here.
My question is: what would be the most efficient way to include a dictionary in the workflow when the dictionary maps a term to more than one synonym/category (this is different from issues #11 and #158)?
I have provided a short example of my approach below. It is based on data.table, with some help from SO (link) for merging columns. Maybe you have a better idea or a good hint.
library(text2vec)
library(data.table)
library(Matrix)
docs <- c("coffee and tea", "coffee", "tea", "anyterm")
dictionary <- data.table(category = c("energizer", "beverage", "nottea", "beverage")
,word = c("coffee", "coffee", "coffee", "tea"))
tokens <- docs %>% word_tokenizer
it <- itoken(tokens, progressbar = FALSE)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
dtm
# 4 x 4 sparse Matrix of class "dgCMatrix"
#   and anyterm coffee tea
# 1   1       .      1   1
# 2   .       .      1   .
# 3   .       .      .   1
# 4   .       1      .   .
apply_dictionary <- function(dtm, dict) {
  # map each DTM column index to its word
  words_idxs <- data.table(colidx = 1:ncol(dtm), word = colnames(dtm))
  # join words to categories; words without a category keep their own name
  dict <- dict[words_idxs, on = "word"][
    , colname := ifelse(is.na(category), word, category)]
  # repeat the column of a word once per category it belongs to
  dtm <- dtm[, dict[, colidx]]
  colnames(dtm) <- dict[, colname]
  # rowsum with grouping does not seem to work for sparse matrices
  # the following workaround is from SO
  Matrix(t(rowsum(t(as.matrix(dtm)), group = colnames(dtm))), sparse = TRUE)
}
apply_dictionary(dtm, dictionary)
# 4 x 5 sparse Matrix of class "dgCMatrix"
#   and anyterm beverage energizer nottea
# 1   1       .        2         1      1
# 2   .       .        1         1      1
# 3   .       .        1         .      .
# 4   .       1        .         .      .
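The `as.matrix()` call in the workaround above converts the whole DTM to a dense matrix, which may become expensive for large vocabularies. A minimal sketch of one possible alternative is to multiply the DTM by a sparse term-by-category indicator matrix, so everything stays sparse. The helper name `apply_dictionary_sparse` and the variable `mapping` are hypothetical, and the toy DTM is built by hand here (rather than with text2vec) only to keep the sketch self-contained. It assumes the dictionary contains no duplicated rows.

```r
library(Matrix)

# Toy DTM mirroring the example above, built by hand for illustration
terms <- c("and", "anyterm", "coffee", "tea")
ii <- c(1, 1, 1, 2, 3, 4)
jj <- c(1, 3, 4, 3, 4, 2)
dtm <- sparseMatrix(i = ii, j = jj, x = rep(1, length(ii)),
                    dims = c(4, 4),
                    dimnames = list(as.character(1:4), terms))

dictionary <- data.frame(
  category = c("energizer", "beverage", "nottea", "beverage"),
  word     = c("coffee", "coffee", "coffee", "tea"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of text2vec): aggregate term counts into
# category counts via a sparse matrix product.
apply_dictionary_sparse <- function(dtm, dict) {
  dict <- as.data.frame(dict)  # also accepts a data.table
  terms <- colnames(dtm)
  # keep only dictionary rows whose word occurs in the DTM
  dict <- dict[dict$word %in% terms, , drop = FALSE]
  # terms without a dictionary entry are kept under their own name
  unmapped <- setdiff(terms, dict$word)
  from <- c(as.character(dict$word), unmapped)
  to   <- c(as.character(dict$category), unmapped)
  out_names <- sort(unique(to))
  # term-by-category indicator matrix: 1 where a term belongs to a category
  mapping <- sparseMatrix(i = match(from, terms),
                          j = match(to, out_names),
                          x = rep(1, length(from)),
                          dims = c(length(terms), length(out_names)),
                          dimnames = list(terms, out_names))
  dtm %*% mapping
}

res <- apply_dictionary_sparse(dtm, dictionary)
```

On this toy input the counts agree with the `apply_dictionary()` result shown above; whether the approach is actually faster on real data has not been measured here.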
Hi. Not easy, but possible. You need to know some R tricks, namely how to use R's environments as hash maps.
library(text2vec)
library(data.table)
library(Matrix)
docs = c("coffee and tea", "coffee", "tea", "anyterm")
dictionary = data.table(category = c("energizer", "beverage", "nottea", "beverage"),
                        word = c("coffee", "coffee", "coffee", "tea"))
# hashed environment can be used as fast hash-map
dict = new.env(hash = TRUE, parent = emptyenv())
dictionary_group = dictionary[, .(category = list(category)), by = word]
for (i in seq_len(nrow(dictionary_group)))
  dict[[ dictionary_group$word[[i]] ]] = dictionary_group$category[[i]]
syn_tokenizer = function(x, dict_env) {
  # just normal tokenization
  tokenized_docs = word_tokenizer(x)
  # now we want to substitute words with categories
  for (i in seq_along(tokenized_docs)) {
    # doc is character vector now
    doc = tokenized_docs[[i]]
    # extract synonyms
    doc_syn = lapply(doc, function(token) {
      syn = dict_env[[token]]
      # if synonyms found - return them
      # if not - return token itself
      if (is.null(syn)) token else syn
    })
    # since doc_syn is a list (it was possible to match one token to multiple synonyms)
    # we need to unlist it back to a character vector
    tokenized_docs[[i]] = unlist(doc_syn, recursive = FALSE, use.names = FALSE)
  }
  tokenized_docs
}
tokens = docs %>% syn_tokenizer(dict)
# [[1]]
# [1] "energizer" "beverage"  "nottea"    "and"       "beverage"
# [[2]]
# [1] "energizer" "beverage"  "nottea"
# [[3]]
# [1] "beverage"
# [[4]]
# [1] "anyterm"
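To get from these tokens back to a document-term matrix, the same `itoken()` / `create_vocabulary()` / `create_dtm()` steps used in the question should apply. The sketch below hard-codes the token list printed above so it can be run on its own; in the real workflow `tokens` would come straight from `syn_tokenizer()`. Column order in the resulting DTM depends on how the vocabulary is ordered, so columns are best addressed by name.

```r
library(text2vec)

# Tokens as printed above, hard-coded here only for illustration
tokens <- list(
  c("energizer", "beverage", "nottea", "and", "beverage"),
  c("energizer", "beverage", "nottea"),
  "beverage",
  "anyterm"
)

# Same pipeline as in the question, now fed with category tokens
it <- itoken(tokens, progressbar = FALSE)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))

# "coffee" and "tea" both map to "beverage", so document 1 counts it twice
beverage_doc1 <- dtm[1, "beverage"]
```

Because a word with several categories is expanded into several tokens, document lengths change; that may matter if the DTM is later normalized by document length.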