
Option of weighting words #35

Open
guivivi opened this issue Jan 26, 2021 · 3 comments
guivivi commented Jan 26, 2021

Dear Jan,

Many thanks for this outstanding package.

I am learning the second example of the help file for ?embed_sentencespace and I have the following question:

When obtaining the sentence similarities, I am wondering if there is a way to weight the words that make up the sentence. For example, in sentence <- "Wat zijn de cijfers qua doorstroming van 2016?", let's say that I would like to emphasize that the most important word for finding similar sentences is 'cijfers'.

Is it possible to assign a weight that tells the algorithm to orient toward sentences that contain 'cijfers'?

Looking at the package manual, I see that there are some arguments related to weighting, namely wordWeight and useWeight, but I do not know how they should be used.

Any help would be very much appreciated.

Kind regards,

Guillermo


jwijffels commented Jan 26, 2021

The package always starts by building a model from a file.
If you can construct a file which looks like this (see the Starspace README https://github.com/facebookresearch/StarSpace/blob/master/README.md), you can build a model with useWeight = TRUE:

word_1:wt_1 word_2:wt_2 ... word_k:wt_k __label__1:lwt_1 ... __label__r:lwt_r
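As a rough base-R illustration of assembling one line in that format (the tokens, weights, and label below are made-up values, not taken from the dekamer data):

```r
# Made-up tokens, per-token weights, and a label, purely for illustration
tokens        <- c("Wat", "zijn", "de", "cijfers")
weights       <- c(0.1, 0.1, 0.1, 1.0)
labels        <- "__label__cijfers"
label_weights <- 1.0

# Join each token/label to its weight with ':' and collapse with spaces,
# matching the "word_1:wt_1 ... __label__r:lwt_r" input format
line <- paste(c(paste0(tokens, ":", weights),
                paste0(labels, ":", label_weights)),
              collapse = " ")
line
# "Wat:0.1 zijn:0.1 de:0.1 cijfers:1 __label__cijfers:1"
```

Writing such lines to a text file gives you a training file that the weighted input format expects.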

It might also be that you are looking for something called word mover's distance (http://proceedings.mlr.press/v37/kusnerb15.pdf). While I was working on the R package doc2vec (https://www.bnosac.be/index.php/blog/103-doc2vec-in-r and https://github.com/bnosac/doc2vec), its C++ backend also allowed providing weights for certain words, but I removed that functionality last week in order to comply with CRAN policies.
The R package text2vec from @dselivanov has a function called RelaxedWordMoversDistance, into which you can plug the embeddings coming from the R packages ruimtehol, text2vec, word2vec or doc2vec.

And nothing stops you from calculating a different embedding for each document by using whichever linear combination of the word vectors comes out of these different packages.
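For instance, a minimal base-R sketch of such a linear combination, with toy 3-dimensional vectors standing in for real embeddings and weights that emphasise 'cijfers':

```r
# Toy word vectors; in practice these rows would come from ruimtehol,
# text2vec, word2vec or doc2vec
emb <- rbind(
  Kunt     = c(0.2, 0.1, 0.0),
  u        = c(0.0, 0.3, 0.1),
  cijfers  = c(0.9, 0.4, 0.5),
  meedelen = c(0.1, 0.2, 0.3)
)

# Non-negative per-word weights, emphasising 'cijfers'
w <- c(Kunt = 1, u = 1, cijfers = 5, meedelen = 1)
w <- w / sum(w)  # normalise so the result is a weighted average

# Sentence embedding as a weighted linear combination of its word vectors
sentence_vec <- colSums(emb * w[rownames(emb)])
sentence_vec
# equal to c(0.6, 0.325, 0.3625)
```

The resulting sentence vectors can then be compared with any similarity measure, e.g. cosine similarity.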


guivivi commented Jan 27, 2021

Hi Jan, many thanks for the insights.

Regarding creating the file with weights, I think I have managed to do it. Following the second example of embed_sentencespace, the idea is to paste on an additional column with the weights. This illustrates the case where I want to highlight the importance of the word 'cijfers':

library(udpipe)
library(dplyr)
data(dekamer, package = "ruimtehol")
x <- udpipe(dekamer$question, "dutch", tagger = "none", parser = "none", trace = 100)
x <- x[, c("doc_id", "sentence_id", "sentence", "token")]

x <- x %>% 
  filter(doc_id == "doc115", sentence_id == 7) %>%
  mutate(weight = ifelse(token == "cijfers", 1, 0))
x
  doc_id sentence_id                 sentence    token weight
  doc115           7 Kunt u cijfers meedelen?     Kunt      0
  doc115           7 Kunt u cijfers meedelen?        u      0
  doc115           7 Kunt u cijfers meedelen?  cijfers      1
  doc115           7 Kunt u cijfers meedelen? meedelen      0
  doc115           7 Kunt u cijfers meedelen?        ?      0

x <- split(x, f = x$doc_id)
x <- sapply(x, FUN = function(tokens) {
  sentences <- split(tokens, tokens$sentence_id)
  sentences <- sapply(sentences, FUN = function(x) paste(x$token, ":", x$weight, sep = "", 
                                                         collapse = " "))
  paste(sentences, collapse = "\t")
})  
x
"Kunt:0 u:0 cijfers:1 meedelen:0 ?:0"

For anyone interested, the extended function is available at:
https://www.uv.es/vivigui/docs/embed_sentencespace_weighted.R

Basically, I have added the paste(x$token, ":", x$weight, sep = "", collapse = " ") shown above and the condition
stopifnot(all(c("doc_id", "sentence_id", "token", "weight") %in% colnames(x)))

I have run a couple of tests with embed_sentencespace_weighted(..., useWeight = TRUE) and it does indeed seem to take the added weights into account.

Please correct me if I am wrong in my procedure.

I am now going to read up on word mover's distance, a concept that is new to me.

@jwijffels

Looks correct to me
