Support for more Semantic Similarity measurements

Currently, the only semantic measurements we compute are based on spaCy's pre-defined word vectors, and similarity between those vectors is the only signal we use to infer coherence. However, there are plenty of open-source models available that could be used depending on the context. There are also other features we can extract from word vectors, for example:

  • Semantic givenness, as a measure of new information introduced in the text.
  • Other semantic measurements based on word embeddings. For example, we could compute the average distance of each word vector to the centroid of the vector cluster, replicate Getti's G index, and explore other interesting indices (see the sketch after this list).
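
As a rough illustration of the two measurements above, here is a minimal sketch using spaCy word vectors. The function names, the stop-word/punctuation filter, and the choice of cosine distance are assumptions for illustration, not settled design decisions:

```python
import numpy as np
import spacy

# Assumes a spaCy model that ships word vectors, e.g. en_core_web_md.
nlp = spacy.load("en_core_web_md")


def content_vectors(doc):
    """Collect vectors for content tokens that actually have one."""
    return np.array(
        [t.vector for t in doc if t.has_vector and not (t.is_stop or t.is_punct)]
    )


def avg_centroid_distance(text):
    """Average cosine distance of each word vector to the cluster centroid.

    Lower values suggest a more semantically homogeneous text.
    """
    vecs = content_vectors(nlp(text))
    if len(vecs) == 0:
        return 0.0
    centroid = vecs.mean(axis=0)
    norms = np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid)
    cosines = vecs @ centroid / np.where(norms == 0, 1.0, norms)
    return float(np.mean(1.0 - cosines))


def semantic_givenness(text):
    """Per-sentence similarity to the preceding context.

    Each sentence is compared against the centroid of all previous
    sentences; a low score marks sentences that add mostly new information.
    """
    doc = nlp(text)
    sents = [s.vector for s in doc.sents if s.has_vector]
    scores = []
    for i in range(1, len(sents)):
        context = np.mean(sents[:i], axis=0)
        denom = np.linalg.norm(sents[i]) * np.linalg.norm(context)
        scores.append(float(sents[i] @ context / denom) if denom else 0.0)
    return scores
```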

For this, we would need to think about providing support for:

  • Vector space models (e.g. any word embedding scheme, LSA). I think we can offer two options: generating the word vectors ourselves, or loading pre-trained models. We will need to design a good interface that supports both use cases (a rough sketch follows this list).
  • Pre-trained models, for example: https://huggingface.co/models
  • We would still need to fix the known issue with model loading and refactor the associated code.
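
As a rough sketch of what that interface could look like: all names here are hypothetical, and the gensim-based wrapper is just one example of loading a pre-trained or self-trained model.

```python
from typing import Optional, Protocol

import numpy as np


class VectorModel(Protocol):
    """Hypothetical common interface over self-trained and pre-trained vectors."""

    def vector(self, word: str) -> Optional[np.ndarray]: ...
    def similarity(self, a: str, b: str) -> float: ...


class SpacyVectors:
    """Wraps the word vectors of an already-loaded spaCy pipeline."""

    def __init__(self, nlp):
        self.vocab = nlp.vocab

    def vector(self, word):
        lex = self.vocab[word]
        return lex.vector if lex.has_vector else None

    def similarity(self, a, b):
        return float(self.vocab[a].similarity(self.vocab[b]))


class KeyedVectorsModel:
    """Wraps a gensim KeyedVectors file, whether trained by us
    or converted from a downloaded pre-trained model."""

    def __init__(self, path):
        from gensim.models import KeyedVectors

        self.kv = KeyedVectors.load(path)

    def vector(self, word):
        return self.kv[word] if word in self.kv else None

    def similarity(self, a, b):
        return float(self.kv.similarity(a, b))
```

Coherence measurements could then be written against `VectorModel` and stay agnostic about where the vectors come from.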