Large files cause a problem with indexing... #102

Closed
docwhat opened this Issue Mar 11, 2013 · 5 comments


@docwhat
Contributor
docwhat commented Mar 11, 2013

How to reproduce:

  1. Create a new geminabox instance.
  2. Put several large (~20 MB) gems in it.
  3. Run /reindex
  4. Wait for a long time (30-40 seconds)

What I expect:

It should take less than a second.

Notes:

I can reproduce with some proprietary indexes. Running gem generate_index is nearly instantaneous. But /reindex is painfully slow and causes problems with uploading and deleting gems.

I suspect that geminabox is opening the gems and reading through the whole file for some reason. I notice in the logs that the output from generate_index doesn't show up until very late, so I suspect geminabox is doing something before running generate_index that is causing the slowdown.

This is with geminabox 0.10.1 and rubygems 1.8.24

@docwhat
Contributor
docwhat commented Mar 11, 2013

Oh-ho! I think I just reproduced the problem....

I just cd'd into the gems directory and ran /usr/bin/time gem generate_index --directory ..:

$ /usr/bin/time gem generate_index --directory ..
Generating Marshal quick index gemspecs for 54 gems
......................................................
Complete
Generated Marshal quick index gemspecs: 0.040s
Generating Marshal master index
Generated Marshal master index: 0.008s
Generating specs index
Generated specs index: 0.003s
Generating latest specs index
Generated latest specs index: 0.001s
Generating prerelease specs index
Generated prerelease specs index: 0.000s
Compressing indicies
Compressed indicies: 0.002s
65.96user 2.44system 1:12.42elapsed 94%CPU (0avgtext+0avgdata 177792maxresident)k
1922328inputs+560outputs (1major+19683minor)pagefaults 0swaps

Meanwhile, with the same gems:

$ mkdir -p q && /usr/bin/time gem generate_index --directory q
Generating Marshal quick index gemspecs for 0 gems

Complete
Generated Marshal quick index gemspecs: 0.000s
Generating Marshal master index
Generated Marshal master index: 0.000s
Generating specs index
Generated specs index: 0.000s
Generating latest specs index
Generated latest specs index: 0.001s
Generating prerelease specs index
Generated prerelease specs index: 0.000s
Compressing indicies
Compressed indicies: 0.003s
0.30user 0.03system 0:00.37elapsed 89%CPU (0avgtext+0avgdata 45776maxresident)k
1616inputs+64outputs (16major+3087minor)pagefaults 0swaps

Just to try a sibling directory:

$ mkdir -p ../sibling && /usr/bin/time gem generate_index --directory ../sibling
Generating Marshal quick index gemspecs for 0 gems

Complete
Generated Marshal quick index gemspecs: 0.000s
Generating Marshal master index
Generated Marshal master index: 0.000s
Generating specs index
Generated specs index: 0.000s
Generating latest specs index
Generated latest specs index: 0.000s
Generating prerelease specs index
Generated prerelease specs index: 0.000s
Compressing indicies
Compressed indicies: 0.001s
0.26user 0.02system 0:00.29elapsed 98%CPU (0avgtext+0avgdata 45760maxresident)k
0inputs+64outputs (0major+3103minor)pagefaults 0swaps

I suspect that the rubygems generate_index is traversing into the gems directory and opening up all the files for some reason...

The quick fix is to change where geminabox stores the metadata: put it in a sibling directory or a subdirectory of the gems directory.

The longer term fix is to fix rubygems.

@docwhat
Contributor
docwhat commented Mar 11, 2013

I opened up Issue #509 with Rubygems for this.

I still think we should fix this in geminabox, because not everyone will run whatever version of rubygems fixes it... unless they just say "don't do that".

@docwhat
Contributor
docwhat commented Mar 12, 2013

I'm apparently daft and can't read documentation. The problem is that rubygems is opening and reading the whole .gem file.

@docwhat
Contributor
docwhat commented Mar 12, 2013

Okay. I dug through the rubygems code a bit more. What's going on is that when a gem is loaded to read its "spec", rubygems does a checksum check (it ungzips the contents to verify that the gzip checksum says the file is intact).

Unless rubygems adds a no_verify option to the indexer, this is going to remain slow.

How about adding a simple fork-and-lock mechanism for re-indexing?

The steps would be something like:

  1. fork reindexer
  2. reindexer waits to acquire a lock.
  3. Once the lock is acquired, then reindex.
  4. reindexer quits
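The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `reindex_in_background` and the lock file path are hypothetical names, not geminabox APIs, and the block passed in stands in for whatever actually performs the reindex. It relies on `fork` and `File#flock`, so it assumes a Unix-like platform.

```ruby
# Hypothetical helper: fork a child that waits for an exclusive lock on
# lock_path, runs the given block while holding it, then exits.
# Returns the watcher thread from Process.detach (join it to wait).
def reindex_in_background(lock_path)
  pid = fork do
    File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
      lock.flock(File::LOCK_EX) # step 2: block until the lock is acquired
      yield                     # step 3: reindex while holding the lock
    end                         # closing the file releases the lock
    exit!(0)                    # step 4: quit, skipping the parent's at_exit hooks
  end
  # Reap the child in the background so it does not linger as a zombie.
  Process.detach(pid)
end
```

A caller would pass the real reindex work as the block, for example `reindex_in_background("/tmp/geminabox-reindex.lock") { run_the_reindex }`, where both the path and `run_the_reindex` are placeholders. Because every reindexer takes the same exclusive lock, concurrent requests queue up instead of running on top of each other.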

Pros:

  • simple
  • not likely to fail

Cons:

  • web UI will be out-of-date until reindex finishes
  • If a reindexer is waiting and everything shuts down, the reindex never runs, so a reindex may still be needed afterwards.

We could address the web UI being out-of-date by doing a refresh every 5 seconds until the lock is free (no reindexers running). Not perfect, but better than waiting 60+ seconds for a reindex.
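One way the "is the lock free?" check behind such a refresh could look is sketched below. `reindex_running?` is a hypothetical helper, not an existing geminabox method, and it assumes reindexers hold an exclusive `flock` on the same lock file while they work. It probes with a non-blocking shared lock so that several probes never block each other.

```ruby
# Hypothetical helper: true if something currently holds an exclusive
# lock on lock_path (i.e. a reindexer is running), false otherwise.
def reindex_running?(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    # With LOCK_NB, flock returns false instead of blocking when an
    # exclusive lock is held elsewhere; on success it returns 0.
    acquired = lock.flock(File::LOCK_SH | File::LOCK_NB)
    lock.flock(File::LOCK_UN) if acquired
    !acquired
  end
end
```

The web UI could call this on each refresh and stop refreshing once it returns false. A reindexer that is still waiting for the lock is not distinguished from one that is running, since in both cases some process holds the lock.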

Ciao!

@NathanZook

While slow, limiting the server to one update at a time works. While our (Whitepages) application does not like the idea of blocking reads as well, that is certainly something which could be made configurable. If I get a pull request in, I'll try to remember this.

There are related issues concerning rubygems and bundler. The workaround in rubygems causes reindexing to take 3x as long as it should.

A full reindex should not be required for deleting a gem. If a partial were done, that would pretty much cover these problems.

@tomlea tomlea added the Stale label Feb 15, 2016
@tomlea tomlea closed this Feb 15, 2016