Dictionary of synonyms with more than one synonym per term (different from #158 and #11) #218

Closed
manuelbickel opened this issue Nov 15, 2017 · 2 comments

manuelbickel commented Nov 15, 2017

Hi Dmitry,

This is not purely a text2vec question, but since other packages are slower and less flexible, I would like to stay within the text2vec workflow. I wanted to ask here before asking on SO.

My question is: what is the most efficient way to apply a dictionary in the workflow when a term can map to more than one synonym/category? (This is different from issues #11 and #158.)

Below is a short example of my approach, which is based on data.table and on some help from SO (link) for merging columns. Maybe you have a better idea or a good hint.

library(text2vec)
library(data.table)
library(Matrix)

docs <- c("coffee and tea", "coffee", "tea", "anyterm")
dictionary <- data.table(category = c("energizer", "beverage", "nottea", "beverage")
                      ,word = c("coffee", "coffee", "coffee", "tea"))

tokens <- docs %>% word_tokenizer
it <- itoken(tokens, progressbar = FALSE)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
dtm
# 4 x 4 sparse Matrix of class "dgCMatrix"
#   and anyterm coffee tea
# 1   1       .      1   1
# 2   .       .      1   .
# 3   .       .      .   1
# 4   .       1      .   .

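apply_dictionary <- function(dtm, dict) {

  # map each DTM column index to its term
  words_idxs <- data.table(colidx = seq_len(ncol(dtm)), word = colnames(dtm))

  # join: one row per (term, category) pair; terms without a dictionary entry keep their own name
  dict <- dict[words_idxs, on = "word"][
           , colname := ifelse(is.na(category), word, category)]

  # repeat the column of each term once per category, then rename columns to the categories
  dtm <- dtm[, dict[, colidx]]
  colnames(dtm) <- dict[, colname]
  # rowsum() with grouping does not seem to work for sparse matrices;
  # the following workaround (via a dense matrix) is from SO
  Matrix(t(rowsum(t(as.matrix(dtm)), group = colnames(dtm))), sparse = TRUE)

}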

apply_dictionary(dtm, dictionary)
# 4 x 5 sparse Matrix of class "dgCMatrix"
#     and anyterm beverage energizer nottea
# 1   1       .        2         1      1
# 2   .       .        1         1      1
# 3   .       .        1         .      .
# 4   .       1        .         .      .
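
The dense `as.matrix()` round trip in the workaround above could probably be avoided by expressing the grouping as a sparse matrix product. The following is only a sketch under assumptions: `apply_dictionary_sparse` is a hypothetical helper (not part of text2vec), the DTM is built by hand with `Matrix::sparseMatrix()` so the snippet is self-contained, and the dictionary is a plain data.frame instead of a data.table.

```r
library(Matrix)

# Toy DTM with the same counts as the example above (built by hand, assumption)
i <- c(1, 1, 1, 2, 3, 4)
j <- c(1, 3, 4, 3, 4, 2)
dtm <- sparseMatrix(
  i = i, j = j, x = rep(1, length(i)),
  dims = c(4, 4),
  dimnames = list(as.character(1:4), c("and", "anyterm", "coffee", "tea"))
)
dictionary <- data.frame(
  category = c("energizer", "beverage", "nottea", "beverage"),
  word     = c("coffee", "coffee", "coffee", "tea"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: multiply the DTM by a sparse term-to-category indicator matrix
apply_dictionary_sparse <- function(dtm, dict) {
  terms <- colnames(dtm)
  # terms without a dictionary entry map to themselves
  unmapped <- setdiff(terms, dict$word)
  map <- rbind(
    dict[dict$word %in% terms, c("word", "category")],
    data.frame(word = unmapped, category = unmapped, stringsAsFactors = FALSE)
  )
  # drop repeated (word, category) pairs so they are not counted twice
  map <- unique(map)
  new_terms <- sort(unique(map$category))
  # rows = original terms, columns = categories or kept terms
  m <- sparseMatrix(
    i = match(map$word, terms),
    j = match(map$category, new_terms),
    x = rep(1, nrow(map)),
    dims = c(length(terms), length(new_terms)),
    dimnames = list(terms, new_terms)
  )
  # the product stays sparse and sums counts of all terms sharing a category
  dtm %*% m
}

res <- apply_dictionary_sparse(dtm, dictionary)
```

On the toy data this should reproduce the counts printed above, for example 2 for `beverage` in the first document. Whether it is actually faster than the dense workaround on a large vocabulary has not been measured here.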

dselivanov commented Nov 17, 2017

Hi. It is not easy, but it is possible. You need to know one R trick: how to use R's environments as hash maps.

library(text2vec)
library(data.table)
library(Matrix)


docs = c("coffee and tea", "coffee", "tea", "anyterm")
dictionary = data.table(category = c("energizer", "beverage", "nottea", "beverage"),
                        word = c("coffee", "coffee", "coffee", "tea"))

# hashed environment can be used as fast hash-map
dict = new.env(hash = TRUE, parent = emptyenv())
dictionary_group = dictionary[, .(category = list(category)), by = word]
for(i in seq_len(nrow(dictionary_group)))
  dict[[ dictionary_group$word[[i]] ]] = dictionary_group$category[[i]]

syn_tokenizer = function(x, dict_env) {
  # just normal tokenization
  tokenized_docs = word_tokenizer(x)
  # now we want to substitute words with categories
  for(i in seq_along(tokenized_docs)) {
    # doc is character vector now
    doc = tokenized_docs[[i]]
    # extract synonyms
    doc_syn = lapply(doc, function(token) {
      syn = dict_env[[token]]
      # if synonyms found - return them
      # if not - return token itself
      if(is.null(syn)) token else syn
    })
    # doc_syn is a list, because one token can match multiple synonyms,
    # so we need to unlist it back to a character vector
    tokenized_docs[[i]] = unlist(doc_syn, recursive = FALSE, use.names = FALSE)
  }
  tokenized_docs
}
tokens = docs %>% syn_tokenizer(dict)
#[[1]]
#[1] "energizer" "beverage"  "nottea"    "and"       "beverage" 

#[[2]]
#[1] "energizer" "beverage"  "nottea"   

#[[3]]
#[1] "beverage"

#[[4]]
#[1] "anyterm"
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(v))
dtm
#4 x 5 sparse Matrix of class "dgCMatrix"
#  and anyterm nottea energizer beverage
#1   1       .      1         1        2
#2   .       .      1         1        1
#3   .       .      .         .        1
#4   .       1      .         .        .

I believe the numbers in your new dtm were incorrect?
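
For readers who want to try the environment-as-hash-map trick in isolation, here is a minimal base-R sketch that needs neither text2vec nor data.table. The helper names `lookup` and `expand_tokens` are hypothetical and exist only for illustration; building the environment with `split()` and `list2env()` is an alternative to the `for` loop above, not the approach used in this thread.

```r
dictionary <- data.frame(
  category = c("energizer", "beverage", "nottea", "beverage"),
  word     = c("coffee", "coffee", "coffee", "tea"),
  stringsAsFactors = FALSE
)

# split() groups the categories by word: a named list with one character vector per word
by_word <- split(dictionary$category, dictionary$word)

# list2env() copies the named list into a hashed environment with no parent
dict_env <- list2env(by_word, envir = new.env(hash = TRUE, parent = emptyenv()))

# Hypothetical helper: return all categories for a token, or the token itself if unknown
lookup <- function(token, env) {
  syn <- env[[token]]
  if (is.null(syn)) token else syn
}

# Hypothetical helper: expand a character vector of tokens into categories
expand_tokens <- function(tokens, env) {
  unlist(lapply(tokens, lookup, env = env), use.names = FALSE)
}

expanded <- expand_tokens(c("coffee", "and", "tea"), dict_env)
```

Here `expanded` should match the first tokenized document shown above: the three categories of "coffee", then "and" unchanged, then "beverage" for "tea".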

@dselivanov dselivanov self-assigned this Nov 17, 2017
@dselivanov

@manuelbickel I'm going to close this. I hope this answers your question. If not, please post a question on SO (I follow the text2vec tag).
