Indexing routines take a while to finish #12

Closed
alexander-bauer opened this issue Oct 27, 2012 · 0 comments

This is caused primarily by the fact that every page Distru indexes requires its own HTTP request. These requests currently run sequentially: each page must wait for the previous page to finish, and each site must wait for the previous site. Most of the delay is network latency; Distru itself indexes very quickly, but the target sites are slow to respond, and that slows down our indexing.

This can be fixed by allowing Distru to send many HTTP requests simultaneously, mainly to separate sites, so that those sites can be indexed in parallel. A more complicated option is to handle requests to different pages simultaneously as well. That would add some overhead, to keep track of pages already being requested, but it is definitely doable.
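As a rough illustration of the per-site approach, here is a minimal sketch, not Distru's actual code. It assumes a hypothetical `fetchFunc` that stands in for whatever performs the HTTP requests and indexing for one site, and it runs one goroutine per site so that a slow site does not block the others.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical stand-in for the code that requests and
// indexes a single site; it returns the number of pages indexed.
type fetchFunc func(site string) (pages int, err error)

// indexSites indexes every site in its own goroutine and returns the
// page count per site. Sites whose fetch fails are left out of the result.
func indexSites(sites []string, fetch fetchFunc) map[string]int {
	var (
		mu      sync.Mutex // guards results
		wg      sync.WaitGroup
		results = make(map[string]int, len(sites))
	)
	for _, site := range sites {
		wg.Add(1)
		go func(site string) {
			defer wg.Done()
			pages, err := fetch(site)
			if err != nil {
				return
			}
			mu.Lock()
			results[site] = pages
			mu.Unlock()
		}(site)
	}
	wg.Wait() // block until every site has been handled
	return results
}

func main() {
	// A fake fetcher, so the sketch runs without network access.
	fake := func(site string) (int, error) { return len(site), nil }
	res := indexSites([]string{"a.example", "bb.example"}, fake)
	fmt.Println("sites indexed:", len(res))
}
```

One goroutine per site is the simplest form; with a long site list, a bounded number of workers would likely be kinder to the network.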

alexander-bauer added a commit that referenced this issue Oct 27, 2012
Distru used to invoke NewIndex() on startup to initialize the variable
Idx. This meant that the function had to finish before any other piece
of the software could initialize. I moved this initialization to a new
goroutine, which is started inside Serve(). This allows the server to
start first, and the index to build while it is running.

The server will serve only an empty index until the constructor finishes.

ref #12
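
The pattern this commit describes could look roughly like the sketch below. It is an assumption-laden illustration, not the code from the commit: the `server` type, its `start` method, and the `build` callback are hypothetical stand-ins for Serve(), Idx, and NewIndex(). The index starts empty and is swapped for the finished one once the background goroutine completes.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a hypothetical stand-in for the state owned by Serve().
type server struct {
	mu  sync.RWMutex
	idx map[string]int // site -> page count; empty until the build finishes
}

// index returns whichever index is currently being served.
func (s *server) index() map[string]int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.idx
}

// start launches the index build in a new goroutine and returns at once,
// so the caller can begin serving. The returned channel is closed when
// the finished index has been swapped in.
func (s *server) start(sites []string, build func([]string) map[string]int) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		built := build(sites) // the slow part, off the startup path
		s.mu.Lock()
		s.idx = built
		s.mu.Unlock()
	}()
	return done
}

func main() {
	s := &server{idx: map[string]int{}}
	done := s.start([]string{"a.example"}, func(sites []string) map[string]int {
		out := make(map[string]int)
		for _, site := range sites {
			out[site] = 1 // pretend each site has one page
		}
		return out
	})
	<-done
	fmt.Println("entries after build:", len(s.index()))
}
```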
alexander-bauer added a commit that referenced this issue Oct 27, 2012
Previously, Indexes were static structures with a single constructor,
which would index a list of sites in series (a very slow process)
using only one thread.

This commit changes that process entirely, so that indexes are instead
maintained by a number of constantly running processes, Indexers. These
read from a common channel, owned by Serve(), that controls which sites
should be indexed. Once a site is put into the channel, it is picked up
by the first free Indexer, which immediately removes it from the channel
and begins to index it. When the Indexer finishes, it looks for any new
items in the channel, and so on. If the channel is closed with close(c),
the Indexers tied to that channel are terminated.

The number of Indexers to run is passed as an argument to MaintainIndex(),
so the number of sites that can be indexed concurrently is configurable.
Furthermore, target URLs may be added at any time, by any process which
knows about the queue channel. This may in the future be tied to a user
interface.

ref #12