High token usage due to embeddings file failing to save (Empty embeddings-2.json) #32

Closed
vratclarkson opened this issue Feb 4, 2023 · 41 comments

Comments

@vratclarkson

Thanks for this plugin, but it is not economically feasible to use if it consumes this many tokens.
Any suggestions?

@brianpetro
Owner

brianpetro commented Feb 4, 2023

How many notes are in your vault? Depending on your vault and usage, the first run should account for 90-99% of the cost. And that first run could, theoretically, take a few days, depending on the size of your vault and rate limits.

I spend a lot of energy doing everything possible to ensure that tokens aren't wasted. So once your vault is fully embedded, your cost should be reduced to something like pennies per day.

Of course, an edge case could always exist based on how you operate your vault. So if your vault size seems too small for the cost, I'd be happy to help you explore what might be happening. Particularly if another plugin constantly alters the contents of notes and thus triggers re-embedding.

But if the number of notes in your vault is >10K or you have a significant number of really large notes with lots of headings, then the usage cost should drop sharply in the coming days, once the initial embedding run has finished.

Please let me know what you think about that and if it applies to you.

And thanks for the feedback @vratclarkson

Edit: Here's a comment referencing expected typical cost #31 (comment)

@pinuke
Contributor

pinuke commented Feb 5, 2023

$10 is a bit excessive. Of course I don't have an encyclopedia stored in my vault.

@brianpetro
Owner

For reference, it cost me ~$0.75 to re-embed 1,500 notes.

@pinuke
Contributor

pinuke commented Feb 5, 2023

Though, it may be worth implementing an in-house rate limiter. The rate limiters on OpenAI aren't exactly intuitive.
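
One reading of the in-house rate limiter idea is a client-side limiter that spaces out embedding requests before they reach the API. The sketch below is written in Python for illustration only (the plugin itself is not Python); the class name, the injected `clock` and `sleep` parameters, and any limit you pass in are assumptions, not OpenAI's actual limits or the plugin's code.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls still inside the window

    def acquire(self):
        """Block until a call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))
```

Calling `limiter.acquire()` before each request would keep the request rate under the chosen ceiling regardless of how the remote limiter behaves.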

@vratclarkson
Author

I have about 1.5k notes.

The problem could be that I reindexed the notes twice, yesterday and today, costing me ~$4 each time.

I have disabled a few plugins based on your suggestion.

I will let you know if I get charged more.

Thank you for your time. Really appreciate it.

@pinuke
Contributor

pinuke commented Feb 6, 2023

I think I may have discovered a possible cause. The embeddings file doesn't sync with Obsidian Sync. Not sure what the cause is, but I just noticed on my dev Chromebook that Obsidian isn't syncing my embeddings file every time I cycle it through a powerwash.

In all fairness, Obsidian Sync has been throwing away random files when syncing for god knows what reasons.

@brianpetro
Owner

@smartguy1196 good catch, I haven't done much thinking re:syncing because I only have one desktop device, and the Smart Connections plugin is desktop-only.

Keep me posted if you have any thoughts on addressing this.

Additionally, for anyone who doesn't think the syncing is the problem, there are settings to log more information to the console log, including which files were embedded each passthrough. This might help narrow down the cause in the case of continuous embedding.

Thanks for the update!

@pinuke
Contributor

pinuke commented Feb 6, 2023

The version that should be pulling from sync but isn't is the Linux one on the Chromebook.

I tend not to use plug-ins on the Android version on the Chromebook.

@brianpetro
Owner

@smartguy1196 this issue #20 (comment) also referred to file problems with Linux.

I've considered using IndexedDB instead of a file for cold storage of the embeddings. If this continues to be a problem, I might have to prioritize that. However, I don't know if IndexedDB data is synced between instances, so I will have to investigate further.

And thanks for keeping me updated with that additional information!

@brianpetro mentioned this issue on Feb 7, 2023

@nigelthomp

> How many notes are in your vault? Depending on your vault and usage, the first run should account for 90-99% of the cost. And that first run could, theoretically, take a few days, depending on the size of your vault and rate limits.
>
> I spend a lot of energy doing everything possible to ensure that tokens aren't wasted. So once your vault is fully embedded, your cost should be reduced to something like pennies per day.
>
> Of course, an edge case could always exist based on how you operate your vault. So if your vault size seems too small for the cost, I'd be happy to help you explore what might be happening. Particularly if another plugin constantly alters the contents of notes and thus triggers re-embedding.
>
> But if the number of notes in your vault is >10K or you have a significant number of really large notes with lots of headings, then the usage cost should drop sharply in the coming days, once the initial embedding run has finished.
>
> Please let me know what you think about that and if it applies to you.
>
> And thanks for the feedback @vratclarkson
>
> Edit: Here's a comment referencing expected typical cost #31 (comment)

Yes, Brian, I am having the same problem! It is very costly to run this plug-in, so I am disabling it for the time being. It works great, but the cost... phew!

@brianpetro
Owner

Hey @nigelthomp

Thanks for following up on this issue. I know it's frustrating when the software isn't working as expected. I appreciate you going out of your way to provide feedback that might help solve the issue.

And clearly, we're encountering some bugs here because, besides the initial embedding of your entire vault, which is a factor of vault size, the recurring cost should be almost negligible.

When you check your .smart-connections folder, is there anything inside the embeddings-2.json file? There are some edge cases where this file isn't saving or is being overwritten. If this is the case, anything you can tell me about your vault setup may help me track down the issue.
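
For anyone who wants to check this from outside Obsidian, here is a rough Python sketch that reports the file's size and how many top-level entries it holds. The function name is made up for illustration, and the assumption that the file is a JSON object or array is mine; only the `.smart-connections/embeddings-2.json` location comes from this thread.

```python
import json
from pathlib import Path


def inspect_embeddings(path):
    """Return (size_in_bytes, entry_count) for a JSON embeddings file.

    entry_count is None when the file is missing or is not valid JSON.
    """
    p = Path(path)
    if not p.is_file():
        return (0, None)
    size = p.stat().st_size
    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return (size, None)
    # Count top-level entries only when the document is a container.
    count = len(data) if isinstance(data, (dict, list)) else None
    return (size, count)
```

Run from the vault root, `inspect_embeddings(".smart-connections/embeddings-2.json")` returning a size of a few bytes and a count of 0 would suggest the embeddings are not being saved.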

Do you have any plugins that may continually update large amounts of notes? Unfortunately, this may also be triggering unnecessary re-embeddings. And depending on the use of those files, the remedy may be as easy as excluding them via the Smart Connections settings.

Additionally, in the settings, you can toggle on additional console logs, including details on the number of tokens being used and which notes are being processed. Keeping an eye on this may hint at which notes are eating up the tokens.

Thanks for your help in solving this!

@nigelthomp

Hi Brian,

Thanks for getting back to me with your suggestions. I will look at this and report back. I have a lot of plugins, so I might start by turning off the ones I don't use a lot to see if things improve. Unfortunately, my OpenAI bill this month is high, so I will have to wait for the time being. So it would be great to get feedback from other users in the meantime. BTW, I think your plugin is very exciting, and I can see lots of potential in this area of AI and note-taking. Thank you so much for all your work!!

@pinuke
Contributor

pinuke commented Feb 10, 2023

FTR, I still suspect problems with sync.

@brianpetro what is the thought-process behind not storing the embeddings files inside of the plugins folder?

@pinuke
Contributor

pinuke commented Feb 10, 2023

You know what I think it could be? Perhaps the load order from sync. If smart-connections is somehow loading before the embeddings file gets synced over, the plugin might think there isn't one and might start automatically creating a new embeddings file despite a pending sync.

A fix for this might be to check if there's a pending sync for the embeddings file and wait for it to download prior to starting the plugin.
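
Nothing in this thread shows how to ask Obsidian Sync whether a download is pending, so the sketch below approximates the idea differently: it polls the file and treats it as "arrived" once its size stops changing. It is Python for illustration only, and the function name, parameters, and default values are all assumptions.

```python
import os
import time


def wait_for_stable_file(path, checks=3, interval=1.0, timeout=60.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Return True once `path` exists and its size has matched the previous
    poll `checks` times in a row; return False if `timeout` elapses first."""
    deadline = clock() + timeout
    last_size = None
    stable = 0
    while clock() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = None  # file missing or unreadable right now
        if size is not None and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0
        last_size = size
        sleep(interval)
    return False
```

A plugin following this approach would only start a fresh embedding run after the wait returns False, which trades a slower first start for fewer accidental full re-embeds.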

@pinuke
Contributor

pinuke commented Feb 10, 2023

That actually makes a lot of sense, because if the embeddings file is larger than the plugin, it would take longer to download it than the plugin.

@brianpetro
Owner

@smartguy1196 thanks for the thought about sync order. I mentioned it in #36 (comment), which focuses on the syncing issue.

Folder choice was pretty arbitrary since this was my first Obsidian plugin and I'm still unsure whether it makes sense to store hundreds of megabytes to gigabytes worth of Embeddings in the plugins folder. Other than that, I did want people to know they had access to their Embeddings. Good question though. Do you know of any plugins with similar storage requirements that use the plugins folder to store it? I am exploring storage options right now so it would be interesting to investigate.

@pinuke
Contributor

pinuke commented Feb 10, 2023

> Do you know of any plugins with similar storage requirements that use the plugins folder to store it? I am exploring storage options right now so it would be interesting to investigate.

TBH, no. I'm working on getting my first plugin working as well (obsidian-selenium).

You may have to unpack Obsidian's asar and look at the source code for the internal sync plugin to find out how it works.

@harpreetchima

This is an amazing extension. I ran it on a fairly large vault that cost $25, and it worked brilliantly.

However, I ran into the same issue some of the other folks have. The Embeddings and -2 files are 1KB. I'm syncing Obsidian between a Windows 10 and MacOS machine using Obsidian Sync, if that helps.

@brianpetro
Owner

@harpreetchima in order to minimize your OpenAI API costs, it might be best to disable syncing and maintain two separate Embeddings files. Let me know if that enables your embeddings-2.json file to be written correctly, as it shouldn't be only 1KB.

If the embeddings aren't being saved to that file, then your usage costs will quickly add up.

Thanks for your feedback and help solving this issue!

@pinuke
Contributor

pinuke commented Feb 20, 2023

Hmmm... I wonder if you could solve the sync issues with git?

Perhaps set up the smart-connections folder as a git repository and anytime the embeddings file gets inexplicably deleted have the plugin run a git-revert?

Perhaps integrate git-reversion into the same part of the script that detects no embeddings file?

@pinuke
Contributor

pinuke commented Feb 20, 2023

The only issue I see is if obsidian-sync inexplicably deletes the git files.

@pinuke
Contributor

pinuke commented Feb 20, 2023

On a side note, I may have found the issue: I think that there may be a bug in how Obsidian Sync handles deleted/non-existent files.

Specifically, I think there might be a bug that occurs when a vault connects to the sync after opening. I've noticed that most of the files that get deleted by Obsidian Sync happen when I open the vault on a machine without certain files.

For some reason, Obsidian Sync thinks that the just-opened vault is the most up-to-date one, and deletes from the sync's vault the embeddings file and anything else that is absent from the local vault.

@pinuke
Contributor

pinuke commented Feb 20, 2023

I would open a bug report, but I haven't recorded it happening.

One of the two of us may have to write a test to expose the bug.

I'm thinking maybe write a unit test that does this:

  • create a test vault and clone it (recommend git) with a bunch of test notes (doesn't necessarily need to have the embeddings)
  • close the cloned vault.
  • create a new file in the original vault, and wait for it to sync.
    • I think the bug happens when changes are made in the cloned vault after it is closed.
      • (to further test, we may have to play with editing, creating, and deleting files in the new and old vaults until we find the bug)
  • reopen the cloned vault. wait for the sync.
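
To record what happens between the "close" and "reopen" steps above, one option is to snapshot the vault's files before and after and compare the two. The Python sketch below does only that comparison; it does not drive Obsidian or Obsidian Sync, and the function names are made up for illustration.

```python
import os


def snapshot(vault_dir):
    """Map each file's vault-relative path to its size in bytes."""
    sizes = {}
    # os.walk also descends into hidden folders such as .smart-connections.
    for dirpath, _dirnames, filenames in os.walk(vault_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, vault_dir).replace(os.sep, "/")
            sizes[rel] = os.path.getsize(full)
    return sizes


def diff_snapshots(before, after):
    """Return (deleted, shrunk) lists of paths between two snapshots."""
    deleted = sorted(path for path in before if path not in after)
    shrunk = sorted(
        path for path in before
        if path in after and after[path] < before[path]
    )
    return deleted, shrunk
```

Taking one snapshot before closing the cloned vault and another after the sync settles would show which files were deleted or truncated, which is the evidence a bug report would need.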

My hands are full with the Obsidian-Selenium plugin and School (EDIT: and Miraclecast and Chromebrew and Work - I have too much to do...) at the moment.

ANOTHER EDIT: feel free to use any of my source code :)

@brianpetro
Owner

@smartguy1196 GitHub push starts having issues when the file reaches 100 MB, so I'm not sure if that could be used for syncing, unless you're suggesting to only use a local git instance and still rely on Obsidian Sync.

Has anyone with this bug confirmed whether the file gets written at all? Modified time might not be reliable if it is a sync issue. But maybe someone could observe the file being present, with a size greater than 1KB, then at some point reverting back to 1KB.

If we can confirm the file is being written but at some point is overwritten, then we can follow up with Obsidian Sync to see if this is a bug they may be able to fix on their end.

@pinuke
Contributor

pinuke commented Feb 21, 2023

> @smartguy1196 GitHub push starts having issues when the file reaches 100 MB, so I'm not sure if that could be used for syncing, unless you're suggesting to only use a local git instance and still rely on Obsidian Sync.

I was thinking of using a local(-ish) instance of git that gets stored in the vault (therefore on Obsidian-Sync) as a checksum with the ability to restore the missing file.

If Obsidian successfully syncs the git data, but omits the embeddings file, git can detect this.
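
The detection half of this idea does not strictly need git. The Python sketch below swaps in a plain SHA-256 manifest instead of a git repository: it can tell that the embeddings file is missing or altered, but unlike git it cannot restore it. The function names and manifest layout are hypothetical.

```python
import hashlib
import json
import os


def file_sha256(path):
    """Hash a file in chunks so it is never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_path, manifest_path):
    """Record the data file's size and hash in a small JSON manifest."""
    entry = {"size": os.path.getsize(data_path), "sha256": file_sha256(data_path)}
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(entry, handle)


def check_against_manifest(data_path, manifest_path):
    """Return 'ok', 'missing', or 'changed' for the data file."""
    with open(manifest_path, encoding="utf-8") as handle:
        entry = json.load(handle)
    if not os.path.exists(data_path):
        return "missing"
    if (os.path.getsize(data_path) != entry["size"]
            or file_sha256(data_path) != entry["sha256"]):
        return "changed"
    return "ok"
```

If the small manifest syncs but the large embeddings file does not, `check_against_manifest` returns "missing" or "changed", which is the signal to wait or restore rather than re-embed.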

> Has anyone with this bug confirmed whether the file gets written at all? Modified time might not be reliable if it is a sync issue. But maybe someone could observe the file being present, with a size greater than 1kb, then at some point reverting back to 1kb.
>
> If we can confirm the file is being written but at some point is overwritten, then we can follow up with Obsidian Sync to see if this is a bug they may be able to fix on their end.

If I have free time between all of my projects, I can help write a test.

@RobinLandy

Same problem here:

  1. SmartConnections keeps on re-generating embeddings, making thousands of requests to OpenAI API (I have ~1000 notes)
  2. embeddings-2.json is 2 bytes
  3. I disabled sync for the .smart-connections folder but the problem persists

@brianpetro
Owner

Hey @RobinLandy

I'm still trying to narrow down the cause of this. Which OS are you using?

And thanks for your report!

@RobinLandy

Hey @brianpetro Thanks for working on this.

I'm using MacOS 13.2.

Let me know if I can provide any other details that might help.

@vratclarkson
Author

If it helps, I'm in the same boat.

  1. MacOS 13.2.1
  2. I was using Obsidian Sync, once I turned it off, my usage isn't that much.

@brianpetro
Owner

@vratclarkson thanks for the info!

Another question, were you syncing between multiple desktop devices?

@RobinLandy

> Another question, were you syncing between multiple desktop devices?

No. One desktop device only.

@brianpetro changed the title from "$10 for 3 days, can you do something about this?" to "High token usage due to embeddings file failing to save" on Mar 1, 2023

@brianpetro
Owner

@RobinLandy, thanks for the update. cc: @vratclarkson

Please check out the latest version 1.2.1 where I added a "write file" test to the settings. If the writing fails, then it should return an error that will help debug this.
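
The thread does not show how the plugin's "write file" test works internally. As a generic illustration of the idea, the Python sketch below writes a JSON payload, reads it back, and returns either the size written or the error message. The function name and return shape are assumptions.

```python
import json
import os


def write_file_test(path, payload):
    """Write `payload` as JSON, read it back, and report the outcome.

    Returns (True, bytes_written) on success or (False, error_message).
    """
    try:
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        with open(path, encoding="utf-8") as handle:
            if json.load(handle) != payload:
                return (False, "read-back content differs from what was written")
        return (True, os.path.getsize(path))
    except (OSError, TypeError, ValueError) as exc:
        # OSError: filesystem problems; TypeError/ValueError: JSON problems.
        return (False, f"{type(exc).__name__}: {exc}")
```

A passing write test shows only that the folder is writable; it does not prove that the real save path is ever reached.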

#45 (comment)

Thanks for all your help!

@brianpetro changed the title from "High token usage due to embeddings file failing to save" to "High token usage due to embeddings file failing to save (Empty embeddings-2.json)" on Mar 1, 2023

@vratclarkson
Author

> @vratclarkson thanks for the info!
>
> Another question, were you syncing between multiple desktop devices?

Hi @brianpetro
I used to sync between my MacBook and iPhone.

And thank you for this plugin.

@RobinLandy

> Please check out the latest version 1.2.1 where I added a "write file" test to the settings. If the writing fails, then it should return an error that will help debug this.

Done. It successfully wrote an 8.8MB file named embeddings-test.json

[Image: Screenshot 2023-03-01 at 15 11 02]

@RobinLandy

In 1.2.1 the "Making smart connections" counter went over 500, but embeddings-2.json is still 2 bytes and the date modified is still 27 Feb.

@brianpetro
Owner

@vratclarkson thanks!

@RobinLandy interesting! If you rename the test file to embeddings-2.json, that should temporarily save your embeddings' current state.

And thanks for letting me know about the counter. It's not indicative of anything specific at this time.

@RobinLandy

Over the past 20 minutes....

[Image: Screenshot 2023-03-01 at 15 21 39]

@brianpetro
Owner

@RobinLandy, as long as the embeddings aren't being saved, the plugin will re-embed your entire vault every time the plugin or Obsidian is restarted.
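
To see why an empty file leads to a full re-embed, consider a simplified model in which a note is sent to the API only if it has no saved entry or has changed since it was saved. The Python sketch below is that model, not the plugin's actual logic; the function name and the `mtime` and `vec` keys are hypothetical.

```python
def notes_needing_embedding(note_mtimes, saved):
    """Return the notes that must be sent to the embedding API.

    note_mtimes: {note_path: last_modified_time} for the vault.
    saved: {note_path: {"mtime": ..., "vec": [...]}} loaded from disk,
           or an empty dict when nothing could be loaded.
    """
    return sorted(
        path for path, mtime in note_mtimes.items()
        if path not in saved or saved[path]["mtime"] < mtime
    )
```

With an empty `saved`, every note is returned on every start. At the roughly $0.75 per 1,500 notes reported earlier in this thread, that would be on the order of $0.50 per restart for a 1,000-note vault, if the same per-note average applied.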

@RobinLandy

> @RobinLandy, as long as the embeddings aren't being saved, the plugin will re-embed your entire vault every time the plugin or Obsidian is restarted.

Makes sense. I've disabled the plugin, and will await the next update.

@brianpetro
Owner

@RobinLandy, I added a "Manual Save" button in the settings in version 1.2.2.

This will try to write to the embeddings-2 file and should return an error if there is any issue.

@brianpetro
Owner

@smartguy1196 @harpreetchima @nigelthomp @vratclarkson

I believe this is now fixed as of version 1.2.4!

There was a logical error, so, unfortunately, it probably should have been fixed sooner 🤦‍♂️

Thank you to everyone who helped me get this figured out!
