Legal Text Repository and KNN based CID search engine #923

endomorphosis · 2022-08-29T22:00:17Z

Open Grant Proposal: `Legal Text Repository and KNN based CID search engine`

Name of Project:

Proposal Category: research (the categories are described here: https://github.com/filecoin-project/devgrants/tree/master/open-grants#readme)

Proposer: Endomorphosis

Do you agree to open source all work you do on behalf of this RFP and dual-license under MIT, APACHE2, or GPL licenses?: "Yes"

Project Description

I plan to upload the PACER documents held in the RECAP archive (https://www.courtlistener.com/recap/) to Filecoin.

However, I want to vectorize the documents so that the content ID reflects the semantic content of each document. Instead of relying on a search engine, a person could then generate a vector representation of their search query and search through the IPFS documents using K Nearest Neighbors (KNN).
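A minimal sketch of that lookup, assuming each document has already been turned into a fixed-length vector by some embedding model (not shown). The vectors, the `cid-A` style identifiers, and the function names are hypothetical placeholders, and a real index over the full archive would likely need an approximate nearest-neighbor structure rather than this brute-force scan.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_search(
    query: List[float],
    index: List[Tuple[str, List[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (identifier, similarity) pairs closest to the query."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index: identifier -> embedding (3 dimensions for readability).
index = [
    ("cid-A", [0.9, 0.1, 0.0]),
    ("cid-B", [0.0, 1.0, 0.0]),
    ("cid-C", [0.8, 0.2, 0.1]),
]

query = [1.0, 0.0, 0.0]  # stands in for the embedded search query
results = knn_search(query, index, k=2)
for cid, score in results:
    print(cid, round(score, 3))
```

The identifiers returned by the search would then be used to fetch the matching documents from IPFS.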

The document database will then be used for a retrieval-based large language model for legal text, to assist public defense attorneys in Oregon, where there is a crisis of unrepresented clients because the state has only 31% of the required number of attorneys.

https://www.opb.org/article/2022/01/20/american-bar-association-finds-oregon-has-just-13-of-needed-public-defenders/
https://www.youtube.com/watch?v=J-ENDjlvgLs

Value

This would be valuable to Filecoin because vector-based content IDs can be used to locate content without a centralized search engine. It also allows machine learning models to search for content, either to train on or to use for retrieval-assisted inference.

https://www.deepmind.com/publications/improving-language-models-by-retrieving-from-trillions-of-tokens
https://arxiv.org/pdf/2208.03299.pdf

The benefit of getting this right is that, instead of keeping large datasets in a monolithic database, datasets can be atomized, which should reduce some traffic overhead for distributed training and inference.

The other benefit is that it will help the Oregon Public Defense Commission avoid having people unlawfully held in jail without an attorney, and help prevent people from being unjustly convicted because their attorneys are overworked.

The risk is wasted time and money.

I don't know enough about Filecoin's CID method to say whether it allows enough space to represent the vectors so that the search can be accurate. If it does not, the KNN index would have to be kept separate from the actual content IDs.
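A rough illustration of this concern, using only Python's `hashlib`. A CID is derived from a cryptographic hash of the content (commonly a 32-byte SHA-256 digest), so near-identical documents get unrelated digests, and 32 bytes is much smaller than a typical embedding. The 384-dimension float32 embedding size below is an assumed example, not a figure for any model chosen for this project.

```python
import hashlib

# Two documents that differ by a single character.
doc_a = b"The defendant was denied counsel."
doc_b = b"The defendant was denied counsel!"

digest_a = hashlib.sha256(doc_a).digest()
digest_b = hashlib.sha256(doc_b).digest()

# Count how many bits differ between the two digests. A cryptographic hash
# is designed so that this does not track how similar the inputs are.
differing_bits = sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))

digest_bytes = len(digest_a)          # size of a SHA-256 digest in bytes
embedding_dims = 384                  # assumed embedding size, for illustration
embedding_bytes = embedding_dims * 4  # float32 uses 4 bytes per dimension

print("digest size (bytes):", digest_bytes)
print("embedding size (bytes):", embedding_bytes)
print("bits differing between digests:", differing_bits)
```

Under these assumptions the embedding is many times larger than the digest, which points toward the fallback of a separate KNN index that maps vectors to ordinary CIDs.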

Deliverables

The final deliverables are an upload of the entire RECAP archive to Filecoin, an index of the documents for a search engine interface, and a web GUI for searching that index.

Development Roadmap

I don't have a roadmap at this point, just an idea.

Total Budget Requested

The budget will depend on the compute needed to generate the indexes and on the cost of maintaining the files on the IPFS network, which I can only estimate once I have done benchmarking and refined the methods.

Maintenance and Upgrade Plans

I do not plan on maintenance other than a script that automatically adds new documents as they are uploaded to RECAP.
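A minimal sketch of the core of such a script, assuming the archive can be listed as records that each carry a stable `id` field. The record layout and the function name are hypothetical; fetching the listing, embedding, and uploading are left out.

```python
from typing import Dict, Iterable, List, Set


def select_new_documents(
    listing: Iterable[Dict[str, str]],
    seen_ids: Set[str],
) -> List[Dict[str, str]]:
    """Return records not yet indexed and mark them as seen."""
    new_docs = []
    for doc in listing:
        if doc["id"] not in seen_ids:
            new_docs.append(doc)
            seen_ids.add(doc["id"])
    return new_docs


# Hypothetical state: one document was indexed on an earlier run.
seen = {"doc-1"}
listing = [
    {"id": "doc-1", "title": "Order granting motion"},
    {"id": "doc-2", "title": "Notice of appeal"},
]

first_pass = select_new_documents(listing, seen)   # picks up only doc-2
second_pass = select_new_documents(listing, seen)  # nothing new on a rerun
print([d["id"] for d in first_pass], second_pass)
```

Persisting `seen_ids` between runs (for example in a small file) would let the script be rerun on a schedule without re-uploading documents.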

Team

Team Members

Endomorphosis

Team Member LinkedIn Profiles

Team Website

NA

Relevant Experience

Web Development 2008 - 2015
Cloud infrastructure engineer 2015-2016
Prison 2016-2021
Self educating in ML 2021-2022

Team code repositories

Additional Information

Richard Blythman informed me of the grants program

starworks5@gmail.com is my email address

Yannic Kilcher can vouch for my sincerity towards the purpose of the project.


mishmosh · 2022-08-31T00:23:06Z

Hello @endomorphosis and thanks for your submission. @anshu93 is familiar with Filecoin and CIDs, and has indicated willingness to be the technical sponsor for this grant proposal. We will review this in the next 2 weeks and get back to you with any questions.

ErinOCon · 2022-09-30T16:35:01Z

Thank you for your proposal! In order to consider a project at this scale, the following is needed:

- A development roadmap that details the specific outputs of the project.
- A fully formed team with the capacity to address the broad scope of work.

Once a team is fully formed and the details for a full development roadmap are available, please feel welcome to reapply!

endomorphosis · 2023-10-14T07:48:22Z

I would like to request that you revisit this project in light of my having won an award at an IPFS data hack. Can you confer with @jenks-guo-filecoin?

I built a proof of concept of the type of KNN index that I want to make, but I also have a lot of work to do on this business startup, where my two cofounders and I are trying to build the entire stack.

Could you ask whether anyone needs a KNN search engine and put some maintainers on a joint project to flesh out the IPFS database interface? We would also need some disk space to do a side-by-side analysis against S3 on CoreWeave, and enough OpenAI / Hugging Face credits or API endpoint credentials to generate the embeddings.

web3jenks · 2023-10-16T00:47:47Z

Hi @endomorphosis, thanks for building on Filecoin. Congratulations on winning the .storage prize at Open Data Hack. The grants team will be assessing your application in time. I will certainly provide my input into their evaluation.

endomorphosis added the Open Grant label Aug 29, 2022

endomorphosis assigned realChainLife Aug 29, 2022

ErinOCon closed this as completed Sep 30, 2022

endomorphosis mentioned this issue Oct 14, 2023

Legal Text Repository and KNN based CID search engine #1662

Closed
