Replace Cortical fingerprints with non-proprietary word vectors #9

ahirner · 2016-02-21T02:14:12Z

The essence of the Cortical API is mapping words to fixed-length sparse bit vectors. The same functionality is available with dense vectors. The best-known algorithms are implemented in, for example, Gensim, which then allows for:

clustering to discover common types of documents
(approximate) nearest neighbor search to form recommendations for similar tables
semantic search (e.g. "doctors +
The main "secret sauce" is efficient matrix decomposition of term frequencies around the word in focus (original paper by Mikolov et al. 2013, good explanation on Quora).
Many pre-trained word vectors learned on different corpora exist (Wikipedia, news articles, etc.). Thus, it's feasible to load such a dictionary once, host it on a server, and avoid the dependency on Cortical. This includes basic operations such as averaging the vectors of a bag of words.
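To make the bag-of-words averaging concrete, here is a minimal sketch in plain Python. The tiny `VECTORS` dictionary, its values, and the helper names (`embed`, `cosine`, `most_similar`) are all made up for illustration. With real pre-trained vectors the dictionary would presumably be loaded once at server start, e.g. via Gensim's `KeyedVectors.load_word2vec_format`.

```python
import math

# Toy 3-dimensional vectors standing in for a pre-trained dictionary.
# The words and values are invented for illustration only.
VECTORS = {
    "doctor":   [0.9, 0.1, 0.0],
    "nurse":    [0.8, 0.2, 0.1],
    "hospital": [0.7, 0.3, 0.0],
    "train":    [0.0, 0.2, 0.9],
    "station":  [0.1, 0.1, 0.8],
}


def embed(words, vectors=VECTORS):
    """Average the vectors of all known words; return None if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_words, tables, vectors=VECTORS):
    """Rank tables (name -> list of words) by similarity to the query words."""
    query = embed(query_words, vectors)
    if query is None:
        return []
    scored = []
    for name, words in tables.items():
        table_vec = embed(words, vectors)
        if table_vec is not None:  # skip tables with no known words
            scored.append((name, cosine(query, table_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    tables = {
        "health": ["doctor", "nurse", "hospital"],
        "transport": ["train", "station"],
    }
    for name, score in most_similar(["doctor"], tables):
        print(name, round(score, 3))
```

An exhaustive scan like this is only reasonable for a small number of tables; for larger collections an approximate nearest neighbor index would likely replace the loop in `most_similar`.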

A ready-made server implementation is available from 3Top: https://github.com/3Top/word2vec-api
If we need more sophisticated NLP with syntactic parsing, e.g. to disambiguate words depending on their context, we will extend the API with this library.

ahirner added the backend label Feb 21, 2016

ahirner self-assigned this Feb 21, 2016

ahirner assigned ahirner and unassigned ahirner Mar 16, 2016

ahirner added this to the Architecture Freeze milestone Mar 16, 2016

This was referenced Mar 16, 2016

Query Engine #14

Open

Calculate the similarities from a given table to other tables #5

Closed
