
Atomic index #5

Merged
merged 3 commits on Sep 5, 2018

Conversation

avigoldman
Contributor

Fixes #1

@Haroenv
Contributor

Haroenv commented Jun 23, 2018

Cool, thanks! I think this should work, but the main reason I wanted atomic operations is to avoid doing a complete reindex on every build. You're right that this will be more correct (i.e. no downtime and no deleted data in the index), but it will still do as many operations as there are objects in the index.

It would be cool to save the hashes of all the objects in a second index, compare those to the hashes of the objects that should be pushed, and then delete or update only the records whose hashes differ.

Does that make sense?

Since this is such a simple file, you could definitely already use this solution in your own app for now; I was just wondering if you'd be interested in exploring it further.

@avigoldman
Contributor Author

Ah, yes. I think I follow now.

So just to outline the steps:

  1. Pull the current index
  2. Hash the new index and the pulled index
  3. Find the diffs
  4. Delete, update, or add the differences

Sound right?
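
Roughly, in code (an untested sketch; assumes the algoliasearch v3 JS client and records that carry a stable objectID, with a naive JSON-based hash):

```js
const algoliasearch = require('algoliasearch');
const crypto = require('crypto');

const client = algoliasearch('APP_ID', 'ADMIN_API_KEY');
const index = client.initIndex('posts'); // placeholder index name

// Content hash of a record (illustrative; assumes stable key order)
const hashOf = obj =>
  crypto.createHash('md5').update(JSON.stringify(obj)).digest('hex');

// 1. Pull the current index (browseAll pages through every record)
const fetchAllObjects = idx =>
  new Promise((resolve, reject) => {
    const hits = [];
    const browser = idx.browseAll();
    browser.on('result', content => hits.push(...content.hits));
    browser.on('end', () => resolve(hits));
    browser.on('error', reject);
  });

async function syncIndex(newObjects) {
  const existing = await fetchAllObjects(index);

  // 2. Hash the new records and the pulled records, keyed by objectID
  const oldHashes = new Map(existing.map(o => [o.objectID, hashOf(o)]));
  const newHashes = new Map(newObjects.map(o => [o.objectID, hashOf(o)]));

  // 3. Find the diffs
  const toSave = newObjects.filter(
    o => oldHashes.get(o.objectID) !== newHashes.get(o.objectID)
  );
  const toDelete = existing
    .map(o => o.objectID)
    .filter(id => !newHashes.has(id));

  // 4. Delete, update, or add only the differences
  //    (saveObjects both adds new records and replaces changed ones)
  if (toSave.length) await index.saveObjects(toSave);
  if (toDelete.length) await index.deleteObjects(toDelete);
}
```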

@Haroenv
Contributor

Haroenv commented Jun 23, 2018

Note that we have the hashes already, since every node in Gatsby carries a hash (its content digest). Storing these in a second index would indeed be the preferred way, I think.

So step 1 would be: get the hashes from the hashes index (each objectID needs to have a hash). Then calculate the hash table for the "to push" index, diff the two sets of hashes, and push/delete accordingly. This last step can probably happen directly in the prod index, since Algolia treats batch operations as atomic (they're applied in the order they arrive at the index).

This should be a good way of handling it.
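
A rough sketch of that flow, reusing the fetchAllObjects and hashOf helpers from the sketch above (the posts_hashes index name is made up; in Gatsby the node's content digest could stand in for hashOf):

```js
// Keep { objectID, hash } pairs in a lightweight second index, so a build
// only has to browse the hashes, never the full prod index.
async function atomicSync(client, newObjects) {
  const prodIndex = client.initIndex('posts');
  const hashIndex = client.initIndex('posts_hashes'); // made-up name

  // 1. Get the stored hashes (one per objectID) from the hashes index
  const stored = await fetchAllObjects(hashIndex);
  const storedHashes = new Map(stored.map(o => [o.objectID, o.hash]));

  // 2. Calculate the hash table for the "to push" records
  const fresh = newObjects.map(o => ({ objectID: o.objectID, hash: hashOf(o) }));
  const freshHashes = new Map(fresh.map(o => [o.objectID, o.hash]));

  // 3. Diff and write straight to the prod index; Algolia applies batched
  //    writes in arrival order, so readers never see a half-updated index
  const changed = newObjects.filter(
    o => storedHashes.get(o.objectID) !== freshHashes.get(o.objectID)
  );
  const removed = [...storedHashes.keys()].filter(id => !freshHashes.has(id));

  if (changed.length) await prodIndex.saveObjects(changed);
  if (removed.length) await prodIndex.deleteObjects(removed);

  // Mirror the diff into the hashes index for the next build
  const changedHashes = fresh.filter(o => storedHashes.get(o.objectID) !== o.hash);
  if (changedHashes.length) await hashIndex.saveObjects(changedHashes);
  if (removed.length) await hashIndex.deleteObjects(removed);
}
```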

Thanks again for picking this up (and sorry for not being able to clone/contribute right now, I'm only on my phone over the weekend).

@Haroenv Haroenv mentioned this pull request Jul 10, 2018
@coreyward

coreyward commented Sep 4, 2018

Weighing in here in hopes of getting some additional attention on this issue. I'm scoping out the stack for a client project now and, on account of this issue, I'll be using Lunr.js instead of Algolia. I believe this is the third separate project where I've had to make that call. Perhaps it doesn't seem like a big deal, but given how development works in an organization (many people running local instances, lots of restarts to pick up new data, test builds, etc.) and the way Algolia prices per indexing operation, this ends up crazy expensive.

For example: with 500 blog posts on a website under active development where Gatsby gets booted 20 times per day on average (fairly conservative), you end up with 300,000 records in Algolia within 30 days (500 × 20 × 30), costing over $300 a month on the Essential plan. And that number keeps growing as Gatsby gets rebooted.

By comparison, I can build this into a Lunr.js index, ship it to the client compressed at about 200 KB, and have a reasonable search for free. I'd rather use Algolia for the additional features, but again, the cost rules it out, and that tracks right back to this specific issue.

Hopefully Algolia can dedicate some resources to this issue, or otherwise make it possible to use this library, by the time my next client project with search begins.

@Haroenv
Contributor

Haroenv commented Sep 4, 2018

Hey @coreyward, I'm aware that this is definitely something to work on, but since I'm working on lots of other things at the moment, I haven't yet had time to fit this in.

Note that this PR has already been tested by @avigoldman, who said it worked; I was looking for a solution that does even fewer operations. The plan I had in mind is:

  1. add proper tests
  2. merge this PR
  3. make a real atomic indexing solution

@Haroenv
Contributor

Haroenv commented Sep 5, 2018

@coreyward, are you using unique objectIDs or not?

@Haroenv Haroenv merged commit f19c332 into algolia:master Sep 5, 2018
@Haroenv
Contributor

Haroenv commented Sep 5, 2018

Thanks @avigoldman and sorry for the delay here.

@coreyward you probably just need to make sure you use unique objectIDs. Whether you update 1000 times per day or once per day then doesn't change how many records you have. However, this will still cause as many operations as there are items on every build; that's a separate issue we can fix another time.
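
For reference, a hedged sketch of what unique objectIDs look like in a gatsby-plugin-algolia config (the GraphQL fields and index name here are placeholders; adapt them to your own schema):

```js
// gatsby-config.js (excerpt)
const query = `{
  allMarkdownRemark {
    edges {
      node {
        fields { slug }
        frontmatter { title }
        excerpt
      }
    }
  }
}`;

module.exports = {
  plugins: [
    {
      resolve: 'gatsby-plugin-algolia',
      options: {
        appId: process.env.ALGOLIA_APP_ID,
        apiKey: process.env.ALGOLIA_ADMIN_KEY,
        queries: [
          {
            query,
            indexName: 'posts', // placeholder
            transformer: ({ data }) =>
              data.allMarkdownRemark.edges.map(({ node }) => ({
                // Stable across builds: re-pushing the same slug updates
                // the record in place instead of creating a duplicate
                objectID: node.fields.slug,
                title: node.frontmatter.title,
                excerpt: node.excerpt,
              })),
          },
        ],
      },
    },
  ],
};
```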

@Haroenv Haroenv mentioned this pull request Sep 5, 2018
@avigoldman avigoldman deleted the atomic-index branch September 17, 2018 15:20
@avigoldman avigoldman restored the atomic-index branch September 17, 2018 15:20
@avigoldman avigoldman deleted the atomic-index branch September 17, 2018 15:20