FR reduce memory load using efficient embedding storage and access strategy #120

Open
dvl00 opened this issue Mar 29, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@dvl00

dvl00 commented Mar 29, 2023

Hi there! First of all, thank you for this amazing extension. My vault has about 18k notes, and its embedding file is 471 MB. Now every time I start Obsidian with that embedding file in place, it crashes to a black screen. Help! I would really like to use this extension, especially as my vault keeps growing. Thank youuuu

@brianpetro
Owner

Hi @dvl00 and thanks for the report!

The entire 471 MB of embeddings is currently loaded into memory, which is likely what's causing your issue. Anything that increases your available RAM should help prevent the crash.

I do have some strategies/plans to improve this process so that all 471 MB isn't constantly in memory.

The good news is that I have other ideas that will increase the number of embeddings per note, so this issue is likely to become more common, even among users with smaller vaults. That raises the priority of the strategies/plans above: the more people affected, the sooner they should be implemented.

I hope that clears things up for you.

Thanks again for contributing this report!

Brian 🌴

@dvl00
Author

dvl00 commented Mar 29, 2023

Thank you @brianpetro !

Really appreciate the prompt response.

I will remove the extension for now but keep my embedding file, because it was very expensive to produce! I will anxiously wait for the updated version. I got to try the extension for a little while as my embedding file was being created and, oh my goodness, it was very useful!! Thank you for your attention and diligence on this. For many of us Obsidian has become part of our day-to-day life, and people like you, who develop these sorts of innovative plugins, are very much appreciated!

Thanks again~~

@dvl00
Author

dvl00 commented Mar 29, 2023

@brianpetro Sorry to bother you again. Do you have a rough timeline for these releases? I just want to make sure I install the extension once it's suited to my needs.

Thank you!

@brianpetro brianpetro changed the title Obsidian crashes when loading embeding file. FR reduce memory load using efficient embedding storage and access strategy Mar 29, 2023
@brianpetro brianpetro added the enhancement New feature or request label Mar 29, 2023
@brianpetro
Owner

@dvl00 no ETA as of yet.

@yekingyan

I have some suggestions for optimizing storage.

Have you considered using CSV instead of JSON? Each file's information is a row of similar data, but JSON stores it as key-value objects, which repeats the same keys for every entry. Converting the objects to arrays could reduce storage and memory use, especially for the "vec" object in the embeddings-2.json file.

Furthermore, I have added the embeddings file to my Git repository. With a CSV format, it would be clear which files have new or updated embeddings, which is easier to track than changes to the entire embeddings-2.json file.
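
To illustrate the diff-tracking argument, here is a minimal sketch of a one-row-per-note format. The `NoteEmbedding` shape and the `toRows` and `changedPaths` helpers are hypothetical and not taken from the plugin. Each row is a JSON array rather than strict CSV, which sidesteps quoting rules for paths that contain commas.

```typescript
// Hypothetical record shape; the plugin's real schema may differ.
interface NoteEmbedding {
  path: string;  // note path inside the vault
  mtime: number; // last-modified time, used to spot stale embeddings
  vec: number[]; // embedding vector
}

// Serialize each note as one line: a JSON array [path, mtime, v0, v1, ...].
// Rows are sorted by path so the line order is stable between runs.
function toRows(notes: NoteEmbedding[]): string[] {
  return [...notes]
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
    .map((n) => JSON.stringify([n.path, n.mtime, ...n.vec]));
}

// Return the paths whose row was added or changed between two snapshots,
// which is roughly what a line-based diff would highlight.
function changedPaths(before: string[], after: string[]): string[] {
  const seen: Record<string, boolean> = {};
  for (const row of before) {
    seen[row] = true;
  }
  return after
    .filter((row) => !seen[row])
    .map((row) => (JSON.parse(row) as [string, ...number[]])[0]);
}

const snapshot1: NoteEmbedding[] = [
  { path: "a.md", mtime: 1, vec: [0.1, 0.2] },
  { path: "b.md", mtime: 1, vec: [0.3, 0.4] },
];
const snapshot2: NoteEmbedding[] = [
  { path: "a.md", mtime: 1, vec: [0.1, 0.2] },
  { path: "b.md", mtime: 2, vec: [0.5, 0.6] }, // edited note
  { path: "c.md", mtime: 2, vec: [0.7, 0.8] }, // new note
];

const changed = changedPaths(toRows(snapshot1), toRows(snapshot2));
console.log(changed); // paths of notes whose rows differ
```

With a layout like this, a commit that re-embeds one note touches one line, whereas a single large JSON document may show up as one sweeping change, depending on how it is written out.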

I want to express my sincere gratitude for your great work on this project. This plugin has become a vital part of my life, and I appreciate the effort and dedication you put into it.

@dvl00
Author

dvl00 commented Apr 1, 2023

I think the CSV idea might be good, but it might not be enough for mobile devices. They have limited storage and processing power, so we might need a more robust solution. Maybe we could use a cloud service to store and access the data more efficiently. But this is just my opinion: brianpetro is the genius behind this plugin and he knows best what works for his project.

@brianpetro
Owner

@yekingyan I have thought about CSV. Even though the keys are redundant, they make up <1% of the file, because each embedding contains a vector (an array of decimals) that is >15,000 characters long. On top of that, JSON makes the stored data both easier to work with and more flexible. For those reasons I decided to stick with JSON rather than CSV.
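
The key-overhead estimate can be checked with a rough sketch like the one below. The record layout, the `path` and `mtime` fields, and the 1536-dimension vector are assumptions made for illustration; the thread only states that each vector serializes to more than 15,000 characters.

```typescript
// Assumption: 1536-dimensional vectors. The real dimension depends on the
// embedding model in use.
const DIMENSIONS = 1536;

// Deterministic stand-in values; real embeddings come from the model.
const vec: number[] = [];
for (let i = 0; i < DIMENSIONS; i++) {
  vec.push(Math.sin(i + 1));
}

// Hypothetical record with named keys, as a JSON object would store it.
const asObject = JSON.stringify({
  path: "notes/example.md",
  mtime: 1680048000000,
  vec: vec,
});

// The same data as a positional array, which is what a CSV-like row holds.
const asArray = JSON.stringify(["notes/example.md", 1680048000000, vec]);

// Characters spent on key names plus their quotes and colons.
const keyOverhead = asObject.length - asArray.length;
const overheadFraction = keyOverhead / asObject.length;

console.log(`object form: ${asObject.length} characters`);
console.log(`key overhead: ${keyOverhead} characters`);
console.log(`share of record: ${(overheadFraction * 100).toFixed(3)}%`);
```

Under these assumptions the keys cost a few dozen characters per record while the vector costs thousands, so dropping the keys alone would not shrink the file meaningfully.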

@dvl00 I tend to want to stay away from adding additional cloud services, but I do see eventual integration as likely, because it would allow people to reuse their embeddings in other applications. In the meantime, I think the biggest performance gains will come from strategically splitting the embeddings.json file into more parts. That way, only some of the embeddings need to be loaded into memory at a time, and syncing won't require re-downloading all of the embeddings every time.
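
One way the splitting idea could look, as a minimal sketch under stated assumptions rather than the plugin's actual design: records are assigned to shards by a hash of the note path, and a lookup parses only the shard that can contain the note. The `EmbeddingRecord` shape and the `shardIndex`, `buildShards`, and `lookup` helpers are hypothetical, and each serialized string stands in for one file on disk.

```typescript
// Hypothetical record shape; the plugin's real schema may differ.
interface EmbeddingRecord {
  path: string;
  vec: number[];
}

// Stable string hash (djb2 variant) so a note always maps to the same shard.
function shardIndex(path: string, shardCount: number): number {
  let hash = 5381;
  for (let i = 0; i < path.length; i++) {
    hash = ((hash * 33) ^ path.charCodeAt(i)) >>> 0;
  }
  return hash % shardCount;
}

// Group records into shardCount buckets and serialize each bucket separately.
// Each returned string stands in for one shard file.
function buildShards(records: EmbeddingRecord[], shardCount: number): string[] {
  const buckets: EmbeddingRecord[][] = [];
  for (let i = 0; i < shardCount; i++) {
    buckets.push([]);
  }
  for (const record of records) {
    buckets[shardIndex(record.path, shardCount)].push(record);
  }
  return buckets.map((bucket) => JSON.stringify(bucket));
}

// Parse only the shard that can contain the requested note.
function lookup(shards: string[], path: string): EmbeddingRecord | undefined {
  const raw = shards[shardIndex(path, shards.length)];
  const shard = JSON.parse(raw) as EmbeddingRecord[];
  for (const record of shard) {
    if (record.path === path) {
      return record;
    }
  }
  return undefined;
}

const records: EmbeddingRecord[] = [
  { path: "daily/2023-03-29.md", vec: [0.1, 0.2] },
  { path: "projects/plugin.md", vec: [0.3, 0.4] },
  { path: "inbox/idea.md", vec: [0.5, 0.6] },
];

const shards = buildShards(records, 4);
const hit = lookup(shards, "projects/plugin.md");
console.log(hit ? hit.vec : "not found");
```

A similarity search across the whole vault would still need to visit every shard, but it could do so one shard at a time, so peak memory is bounded by the largest shard rather than the whole file. Re-embedding a note also rewrites only that note's shard, which is the part a sync tool would need to transfer.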

Thanks both of you for your thoughts on this,
Brian🌴

3 participants