Create generic array similarity functions #20

Open
aih opened this issue May 3, 2022 · 0 comments

The functions proposed here build on the existing functions in this repository and on the Go functions in aih/bills.

Assumptions:

  • Each 'document' consists of an array of strings. The document has a unique id and each item in the array is also uniquely identified (either by an id or its ordinal position in the array).
  • The length of each document array may vary
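
The assumptions above could be modeled roughly as follows. This is only a sketch: the `Document` class, its field names, and the `item_keys` helper are hypothetical and do not come from this repository or from aih/bills.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Document:
    """Hypothetical container for one document (names are assumptions)."""
    doc_id: str                           # unique id of the document
    items: List[str]                      # the array of strings; length may vary
    item_ids: Optional[List[str]] = None  # explicit item ids, if available

    def item_keys(self) -> List[Tuple[str, Union[str, int]]]:
        """Return one unique key per item: (doc_id, item id or ordinal position)."""
        if self.item_ids is not None:
            if len(self.item_ids) != len(self.items):
                raise ValueError("item_ids must have one entry per item")
            return [(self.doc_id, item_id) for item_id in self.item_ids]
        # Fall back to the ordinal position when no explicit ids are given.
        return [(self.doc_id, pos) for pos in range(len(self.items))]
```

Keying every item by `(doc_id, item id or position)` keeps items addressable across the whole corpus, which the item-to-item comparison needs.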

The generic similarity functions would:

  1. Calculate a vocabulary of n-grams from the total corpus of documents (an array of documents).
  2. Vectorize the documents so that each document can be stored as a (sparse) array of the length of the vocabulary
  3. Store the vectorized matrix of all documents (MOD, the matrix of all documents) in a pickle file, or eventually in PostgreSQL
  4. Calculate the similarity between each item of each array and all other items in the MOD
  5. Apply an item threshold to find similar items for each item in a document
  6. Apply a document threshold to find similar documents
  7. Return the results of steps 5 and 6 in a model form that can be stored in a database (item-to-item and document-to-document similarity)
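
The seven steps could be sketched in plain Python as shown below. This is a minimal illustration under several assumptions, not the implementation in this repository: all function names are hypothetical, items are tokenized on whitespace into word bigrams, item similarity is cosine similarity over n-gram counts, and the document score (the share of items in two documents that take part in a cross-document match) is one possible choice, since the issue does not define one. The corpus is assumed to be a dict mapping each document id to its array of strings.

```python
import math
import pickle
from collections import Counter
from itertools import combinations


def ngrams(text, n=2):
    """Word n-grams of one item (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(corpus, n=2):
    """Step 1: map every n-gram in the corpus to a column index."""
    grams = sorted({g for items in corpus.values()
                    for item in items for g in ngrams(item, n)})
    return {g: i for i, g in enumerate(grams)}


def vectorize(corpus, vocab, n=2):
    """Step 2: one sparse count vector per item, keyed by (doc_id, position)."""
    return {
        (doc_id, pos): dict(Counter(vocab[g] for g in ngrams(item, n) if g in vocab))
        for doc_id, items in corpus.items()
        for pos, item in enumerate(items)
    }


def save_mod(mod, path):
    """Step 3: store the matrix of all documents (MOD) in a pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(mod, fh)


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {column: count}."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_items(mod, item_threshold):
    """Steps 4-5: compare every pair of items and keep pairs above the threshold."""
    rows = []
    for a, b in combinations(sorted(mod), 2):
        score = cosine(mod[a], mod[b])
        if score >= item_threshold:
            rows.append({"item_a": a, "item_b": b, "score": score})
    return rows


def similar_documents(item_rows, corpus, doc_threshold):
    """Step 6: score document pairs by the share of items with a cross-document match."""
    matched = {}
    for row in item_rows:
        a, b = row["item_a"], row["item_b"]
        if a[0] != b[0]:  # only matches between different documents count
            matched.setdefault((a[0], b[0]), set()).update([a, b])
    rows = []
    for (doc_a, doc_b), items in sorted(matched.items()):
        score = len(items) / (len(corpus[doc_a]) + len(corpus[doc_b]))
        if score >= doc_threshold:
            rows.append({"doc_a": doc_a, "doc_b": doc_b, "score": score})
    return rows


# Step 7: both results are lists of flat rows that could be written to a database.
corpus = {
    "doc1": ["the quick brown fox", "jumps over the lazy dog"],
    "doc2": ["the quick brown fox", "something else entirely here"],
}
vocab = build_vocabulary(corpus)
mod = vectorize(corpus, vocab)
item_rows = similar_items(mod, item_threshold=0.5)
doc_rows = similar_documents(item_rows, corpus, doc_threshold=0.4)
```

The all-pairs loop in `similar_items` is quadratic in the number of items, so a real corpus would likely need a sparse matrix library or an index instead; the sketch only shows how the item and document thresholds relate to each other.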