Legal Text Repository and KNN based CID search engine #923

Closed
endomorphosis opened this issue Aug 29, 2022 · 4 comments

@endomorphosis

endomorphosis commented Aug 29, 2022

Open Grant Proposal: Legal Text Repository and KNN based CID search engine

Name of Project:

Proposal Category: research Learn what these categories are here: https://github.com/filecoin-project/devgrants/tree/master/open-grants#readme

Proposer: Endomorphosis

Do you agree to open source all work you do on behalf of this RFP and dual-license under MIT, APACHE2, or GPL licenses?: "Yes"

Project Description

I plan to upload the PACER documents held in the RECAP archive (https://www.courtlistener.com/recap/) to Filecoin.

I also want to vectorize the documents so that each content ID reflects the semantic content of its document. Instead of relying on a search engine, a person could then generate a vector representation of their search query and search through the IPFS documents using K Nearest Neighbors (KNN).
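
A minimal sketch of the search step described above, assuming the documents have already been embedded. The CID strings, the 3-dimensional vectors, and the function names are all hypothetical placeholders; a real index would use model-generated embeddings and most likely an approximate nearest-neighbor library rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vec, index, k=2):
    """Return the k (cid, score) pairs most similar to query_vec."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical index: placeholder CID strings mapped to toy embeddings.
index = {
    "cid-eviction-order": [0.9, 0.1, 0.0],
    "cid-patent-claim": [0.0, 0.2, 0.9],
    "cid-lease-dispute": [0.8, 0.3, 0.1],
}

# Toy embedding of a search query; in practice this would come from
# the same model that embedded the documents.
query = [1.0, 0.2, 0.0]
results = knn_search(query, index, k=2)
for cid, score in results:
    print(cid, round(score, 3))
```

The brute-force scan is linear in the number of documents, which is fine for illustration but is one of the things the benchmarking would need to replace at archive scale.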

The document database will then be used for a retrieval-based large language model for legal text, to assist public defense attorneys in Oregon, who face a crisis of unrepresented clients because the state has only 31% of the attorneys it requires.

https://www.opb.org/article/2022/01/20/american-bar-association-finds-oregon-has-just-13-of-needed-public-defenders/
https://www.youtube.com/watch?v=J-ENDjlvgLs

Value

This would be valuable to Filecoin because vector-based content IDs can be used to locate content without a centralized search engine. It also lets machine learning models search for content, either to train on or to use in retrieval-assisted inference.

https://www.deepmind.com/publications/improving-language-models-by-retrieving-from-trillions-of-tokens
https://arxiv.org/pdf/2208.03299.pdf

The benefit of getting this right is that, instead of sitting in a monolithic database, large datasets can be atomized, which should reduce some of the traffic overhead of distributed training and inference.

The other benefit is that it will help the Oregon Public Defense Commission avoid having people unlawfully held in jail without an attorney, and help prevent people from being unjustly convicted because of attorney overwork.

The risk is wasted time and money.

I don't know enough about Filecoin's CID method to say whether it allows enough space to represent the vectors precisely enough for accurate search. If it does not, the KNN index would have to be kept separate from the actual content IDs.
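
For a rough sense of scale, here is a back-of-the-envelope sketch. It assumes the CID wraps a SHA-256 digest (the common default in IPFS, as far as I understand) and a hypothetical embedding of 384 float32 values; the real dimension depends on the embedding model chosen.

```python
import hashlib
import struct

document = b"Example court filing text."  # placeholder content

# A cryptographic hash changes completely when the content changes slightly,
# so the digest itself carries no notion of semantic closeness.
digest_a = hashlib.sha256(document).digest()
digest_b = hashlib.sha256(document + b" ").digest()

# Assumed embedding size: 384 float32 values (hypothetical, model-dependent).
EMBEDDING_DIM = 384
embedding_bytes = EMBEDDING_DIM * struct.calcsize("<f")

print("digest bytes:", len(digest_a))
print("embedding bytes:", embedding_bytes)
print("digests differ:", digest_a != digest_b)

# One way to keep the two concerns apart: a sidecar index that maps each
# hash-based CID to its embedding, stored separately from the content.
sidecar_index = {digest_a.hex(): [0.0] * EMBEDDING_DIM}
```

Under these assumptions the digest is far smaller than an uncompressed embedding, which points toward the separate-index option unless the vectors are heavily quantized.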

Deliverables

The final deliverables are an upload of the entire RECAP archive to Filecoin, an index of the documents for a search engine interface, and a web GUI for searching that index.

Development Roadmap

I don't have a roadmap at this point, just an idea.

Total Budget Requested

The budget will depend on the compute needed to generate the indexes and the cost of maintaining the files on the IPFS network. I can only estimate these once I have done benchmarking and refined the methods.

Maintenance and Upgrade Plans

I do not plan on maintenance beyond a script that automatically adds new documents as they are uploaded to RECAP.
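
A minimal sketch of what such a script could look like, assuming each document in a listing has a stable `id`. The function names, the listing format, and the `upload` callback are hypothetical; the real script would fetch the listing from RECAP and push to Filecoin.

```python
def find_new_documents(listing, known_ids):
    """Return documents in the listing whose IDs have not been uploaded yet."""
    return [doc for doc in listing if doc["id"] not in known_ids]

def sync(listing, known_ids, upload):
    """Upload each new document and record its ID so reruns skip it."""
    added = []
    for doc in find_new_documents(listing, known_ids):
        upload(doc)  # hypothetical callback: store and index the document
        known_ids.add(doc["id"])
        added.append(doc["id"])
    return added

# Toy run: one document is already known, two are new.
known = {"doc-1"}
listing = [{"id": "doc-1"}, {"id": "doc-2"}, {"id": "doc-3"}]
uploaded = []
first_run = sync(listing, known, uploaded.append)
second_run = sync(listing, known, uploaded.append)
print(first_run, second_run)
```

Recording the uploaded IDs makes the script safe to rerun on a schedule, since a second pass over the same listing uploads nothing.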

Team

Team Members

Endomorphosis

Team Member LinkedIn Profiles

Team Website

NA

Relevant Experience

Web Development 2008 - 2015
Cloud infrastructure engineer 2015-2016
Prison 2016-2021
Self educating in ML 2021-2022

Team code repositories

Additional Information

Richard Blythman informed me of the grants program

starworks5@gmail.com is my email address

Yannic Kilcher can vouch for my sincerity towards the purpose of the project.

@mishmosh
Contributor

Hello @endomorphosis and thanks for your submission. @anshu93 is familiar with Filecoin and CIDs, and has indicated willingness to be the technical sponsor for this grant proposal. We will review this in the next 2 weeks and get back to you with any questions.

@ErinOCon
Collaborator

Thank you for your proposal! In order to consider a project at this scale, the following is needed:

  • A development roadmap that details the specific outputs of the project.
  • A fully formed team with the capacity to address the broad scope of work.

Once a team is fully formed and the details for a full development roadmap are available, please feel welcome to reapply!

@endomorphosis
Author

I would like to request that you revisit this project in light of my having won an award for an IPFS data hack. Could you confer with @jenks-guo-filecoin?

I built a proof of concept of the type of KNN index that I want to make, but I also have a lot of work to do on this business startup, where my two cofounders and I are trying to build the entire stack.

Could you ask whether anyone needs a KNN search engine, put some maintainers on a joint project to flesh out the IPFS database interface, and give us some disk space for a side-by-side analysis against S3 on CoreWeave, along with enough OpenAI / Hugging Face credits or API endpoint credentials to generate the embeddings?

@web3jenks

Hi @endomorphosis, thanks for building on Filecoin. Congratulations on winning the .storage prize at Open Data Hack. The grants team will assess your application in due course, and I will certainly provide my input into their evaluation.
