Dictionary of synonyms with more than one synonym per term (different from #158 and #11) #218

manuelbickel · 2017-11-15T15:10:52Z

Hi Dmitry,

My question is not purely a text2vec question, but since other packages are slower and less flexible, I would like to stay within the text2vec workflow. Before asking on SO, I wanted to ask here.

What would be the most efficient way to include a dictionary in the workflow when that dictionary maps a term to more than one synonym/category (this is different from issues #11 and #158)?

Below is a short example of my approach, which is based on data.table and on some help from SO for merging columns. Maybe you have a better idea or a good hint.

library(text2vec)
library(data.table)
library(Matrix)

docs <- c("coffee and tea", "coffee", "tea", "anyterm")
dictionary <- data.table(category = c("energizer", "beverage", "nottea", "beverage")
                      ,word = c("coffee", "coffee", "coffee", "tea"))

tokens <- docs %>% word_tokenizer
it <- itoken(tokens, progressbar = FALSE)
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
dtm
# 4 x 4 sparse Matrix of class "dgCMatrix"
#   and anyterm coffee tea
# 1   1       .      1   1
# 2   .       .      1   .
# 3   .       .      .   1
# 4   .       1      .   .

apply_dictionary <- function(dtm, dict) {
 
  words_idxs <- data.table(colidx = 1:ncol(dtm), word = colnames(dtm))
  
  dict <- dict[words_idxs, on= "word"][
           , colname := (ifelse((is.na(category)), word, category))]
  
  dtm <- dtm[,dict[,colidx]]
  colnames(dtm) <- dict[,colname]
  # rowsum() with grouping does not seem to work for sparse matrices;
  # the following workaround (via a dense matrix) is from SO
  Matrix(t(rowsum(t(as.matrix(dtm)), group = colnames(dtm))), sparse = TRUE)
  
}

apply_dictionary(dtm, dictionary)
# 4 x 5 sparse Matrix of class "dgCMatrix"
#     and anyterm beverage energizer nottea
# 1   1       .        2         1      1
# 2   .       .        1         1      1
# 3   .       .        1         .      .
# 4   .       1        .         .      .
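
The dense `as.matrix()` round trip above can be avoided. One possible alternative, offered as a sketch rather than as part of text2vec, is to build a sparse term-by-category indicator matrix and multiply the dtm by it. The helper name `apply_dictionary_sparse` is hypothetical, and the dtm is rebuilt by hand here so the block runs with only the Matrix package.

```r
library(Matrix)

# Toy document-term matrix matching the example above (4 docs x 4 terms).
dtm <- sparseMatrix(
  i = c(1, 1, 1, 2, 3, 4),
  j = c(1, 3, 4, 3, 4, 2),
  x = 1,
  dims = c(4, 4),
  dimnames = list(as.character(1:4), c("and", "anyterm", "coffee", "tea"))
)

dictionary <- data.frame(
  category = c("energizer", "beverage", "nottea", "beverage"),
  word     = c("coffee", "coffee", "coffee", "tea"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a text2vec function): maps terms to categories
# with a single sparse matrix product.
apply_dictionary_sparse <- function(dtm, dict) {
  terms <- colnames(dtm)
  # Terms without a dictionary entry map to themselves.
  unmapped <- setdiff(terms, dict$word)
  map <- rbind(
    unique(dict[dict$word %in% terms, c("word", "category")]),
    data.frame(word = unmapped, category = unmapped, stringsAsFactors = FALSE)
  )
  new_terms <- sort(unique(map$category))
  # Indicator matrix: entry (term, category) is 1 if the term belongs to it.
  m <- sparseMatrix(
    i = match(map$word, terms),
    j = match(map$category, new_terms),
    x = 1,
    dims = c(length(terms), length(new_terms)),
    dimnames = list(terms, new_terms)
  )
  # (docs x terms) %*% (terms x categories) = docs x categories, still sparse.
  dtm %*% m
}

res <- apply_dictionary_sparse(dtm, dictionary)
```

On this toy data, `res` has the columns `and`, `anyterm`, `beverage`, `energizer`, `nottea`, and the first document gets a count of 2 for `beverage` (one from "coffee", one from "tea"), the same counts as the `rowsum()` version. The product never materializes a dense matrix, which matters for a large vocabulary.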


dselivanov · 2017-11-17T12:12:49Z

Hi. It is not easy, but it is possible. You need to know an R trick: how to use R's environments as hash maps.

library(text2vec)
library(data.table)
library(Matrix)


docs = c("coffee and tea", "coffee", "tea", "anyterm")
dictionary = data.table(category = c("energizer", "beverage", "nottea", "beverage"),
                        word = c("coffee", "coffee", "coffee", "tea"))

# hashed environment can be used as fast hash-map
dict = new.env(hash = TRUE, parent = emptyenv())
dictionary_group = dictionary[, .(category = list(category)), by = word]
for(i in seq_len(nrow(dictionary_group)))
  dict[[ dictionary_group$word[[i]] ]] = dictionary_group$category[[i]]

syn_tokenizer = function(x, dict_env) {
  # just normal tokenization
  tokenized_docs = word_tokenizer(x)
  # now we want to substitute words with categories
  for(i in seq_along(tokenized_docs)) {
    # doc is character vector now
    doc = tokenized_docs[[i]]
    # extract synonyms
    doc_syn = lapply(doc, function(token) {
      syn = dict_env[[token]]
      # if synonyms found - return them
      # if not - return token itself
      if(is.null(syn)) token else syn
    })
    # doc_syn is a list (one token may map to multiple synonyms),
    # so we need to unlist it back to a character vector
    tokenized_docs[[i]] = unlist(doc_syn, recursive = FALSE, use.names = FALSE)
  }
  tokenized_docs
}
tokens = docs %>% syn_tokenizer(dict)
#[[1]]
#[1] "energizer" "beverage"  "nottea"    "and"       "beverage" 

#[[2]]
#[1] "energizer" "beverage"  "nottea"   

#[[3]]
#[1] "beverage"

#[[4]]
#[1] "anyterm"

it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(v))
dtm
#4 x 5 sparse Matrix of class "dgCMatrix"
#  and anyterm nottea energizer beverage
#1   1       .      1         1        2
#2   .       .      1         1        1
#3   .       .      .         .        1
#4   .       1      .         .        .

I believe the numbers in your new dtm were incorrect?
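
For readers unfamiliar with the environment-as-hash-map trick used above, here is a minimal base-R sketch of the operations involved. The helper name `lookup_or_self` is hypothetical and only illustrates the fallback rule from `syn_tokenizer`.

```r
# A hashed environment with an empty parent, so lookups never fall
# through to variables in enclosing scopes.
dict <- new.env(hash = TRUE, parent = emptyenv())

# Insert: keys are strings, values can be any R object.
dict[["coffee"]] <- c("energizer", "beverage", "nottea")
assign("tea", "beverage", envir = dict)

# Lookup: a missing key yields NULL instead of an error.
coffee_syn  <- dict[["coffee"]]
missing_syn <- dict[["anyterm"]]

# Membership test that does not search parent environments.
has_tea <- exists("tea", envir = dict, inherits = FALSE)

# Hypothetical helper: return the synonyms if present, else the token itself.
lookup_or_self <- function(token, env) {
  syn <- env[[token]]
  if (is.null(syn)) token else syn
}

expanded <- unlist(
  lapply(c("coffee", "and", "tea"), lookup_or_self, env = dict),
  use.names = FALSE
)
```

Because a missing key returns `NULL`, the `is.null()` check is enough to decide between substituting categories and keeping the original token, and `unlist()` flattens the one-to-many matches back into a single character vector.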

dselivanov · 2017-11-19T07:57:56Z

@manuelbickel I'm going to close this. I hope I've answered your question. If not, please post the question on SO (I follow the text2vec tag).

dselivanov self-assigned this Nov 17, 2017

dselivanov added the question label Nov 17, 2017

dselivanov closed this as completed Nov 19, 2017

manuelbickel mentioned this issue Dec 10, 2017

How to train/modify collocation model with existing (ngram) dictionary? (question) #224

Open
