Indexing routines take a while to finish #12

alexander-bauer · 2012-10-27T00:22:39Z

This is caused primarily by the fact that Distru makes a separate HTTP request for every page it indexes. These requests currently run serially: each page must wait for the previous one to finish, and each site must wait for the previous site. Most of the delay is network latency; Distru itself indexes very quickly, but the target sites are slow to respond, and that slows down our indexing.

This can be fixed by allowing Distru to send many HTTP requests concurrently, mainly to separate sites, so that those sites can be indexed at the same time. A more involved option is to request different pages concurrently as well. This would require a bit more overhead to keep track of pages already being requested, but it is definitely doable.
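As a rough illustration of the idea above, here is a minimal sketch of issuing requests concurrently while remembering which URLs have already been requested. The names `fetchAll` and `httpSize` are hypothetical and are not taken from Distru; the local `httptest` server only stands in for real target sites.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// fetchAll calls fetch once per distinct URL, running all calls concurrently,
// and returns the reported sizes keyed by URL. Hypothetical helper.
func fetchAll(urls []string, fetch func(url string) (int, error)) map[string]int {
	var (
		mu    sync.Mutex // guards sizes, which the goroutines write to
		wg    sync.WaitGroup
		seen  = make(map[string]bool) // URLs already requested; only touched here
		sizes = make(map[string]int)
	)
	for _, u := range urls {
		if seen[u] {
			continue // already being requested, skip the duplicate
		}
		seen[u] = true
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			n, err := fetch(u)
			if err != nil {
				return // a failed page is simply left out of the result
			}
			mu.Lock()
			sizes[u] = n
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return sizes
}

// httpSize fetches a page over HTTP and returns the length of its body.
func httpSize(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return len(body), err
}

func main() {
	// A local test server, so the sketch runs without touching the network.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "page %s", r.URL.Path)
	}))
	defer srv.Close()

	sizes := fetchAll([]string{srv.URL + "/a", srv.URL + "/b", srv.URL + "/a"}, httpSize)
	fmt.Println("distinct pages fetched:", len(sizes))
}
```

This version starts one goroutine per URL, which is fine for a short list; a real indexer would probably want to cap the number of requests in flight.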

Distru used to invoke NewIndex() on startup to initialize the variable Idx. This meant that the function had to finish before any other part of the software could initialize. I just moved this initialization to a new goroutine, which is started inside Serve(). This allows the server to start first and the index to build while it is running. The server will serve only an empty index until the constructor finishes. ref #12
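The change described in this commit message could look roughly like the sketch below. It is not Distru's actual code: the `Index` fields, the `store` wrapper, and `buildInBackground` are all assumptions made for illustration, and the lock is one possible way to let request handlers read the index safely while the builder replaces it.

```go
package main

import (
	"fmt"
	"sync"
)

// Index is a stand-in for Distru's index type; the real fields are not
// shown in the source, so this one is an assumption.
type Index struct {
	Sites []string
}

// store guards the shared index so readers and the builder do not race.
type store struct {
	mu  sync.RWMutex
	idx *Index
}

func (s *store) get() *Index {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.idx
}

func (s *store) set(i *Index) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.idx = i
}

// buildInBackground runs build in its own goroutine and swaps the result in
// when it finishes. The returned channel is closed once the swap is done.
func buildInBackground(s *store, build func() *Index) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		s.set(build())
	}()
	return done
}

func main() {
	s := &store{idx: &Index{}} // start with an empty index
	release := make(chan struct{})

	done := buildInBackground(s, func() *Index {
		<-release // simulates a slow build; a real one would crawl sites
		return &Index{Sites: []string{"example.org"}}
	})

	// The "server" can already answer, but only from the empty index.
	fmt.Println("sites while building:", len(s.get().Sites))

	close(release)
	<-done
	fmt.Println("sites after building:", len(s.get().Sites))
}
```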

Previously, Indexes were static structures with a single constructor, which would index a list of sites in series (a very slow process) using only one thread. This commit changes that process entirely, so that indexes are instead maintained by a number of constantly running processes, Indexers. These read from a common channel, owned by Serve(), that controls which sites should be indexed. Once a site is put into the channel, it is picked up by the first free Indexer, which immediately removes it from the channel and begins to index it. When the Indexer finishes, it looks for any new items in the channel, and so on. If the channel is closed with close(c), the Indexers tied to that channel terminate. The number of Indexers to run is passed as an argument to MaintainIndex(), so the number of sites that can be indexed concurrently is configurable. Furthermore, target URLs may be added at any time by any process that knows about the queue channel. This may in the future be tied to a user interface. ref #12
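The Indexer arrangement described here is essentially a worker pool fed by a channel. The sketch below shows that pattern under stated assumptions: `startIndexers` and `indexSite` are hypothetical names, and the real signature of MaintainIndex() is not shown in this thread, so nothing here should be read as Distru's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// startIndexers launches n workers that each take sites from queue and pass
// them to indexSite. Workers exit when queue is closed and drained. The
// returned channel is closed after every worker has exited.
func startIndexers(queue <-chan string, n int, indexSite func(site string)) <-chan struct{} {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// range ends once the channel is closed and empty,
			// which is what terminates the worker.
			for site := range queue {
				indexSite(site)
			}
		}()
	}
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	return done
}

func main() {
	queue := make(chan string)

	var mu sync.Mutex
	indexed := make(map[string]bool)

	// Three workers, so up to three sites are indexed concurrently.
	done := startIndexers(queue, 3, func(site string) {
		mu.Lock()
		indexed[site] = true // placeholder for the real indexing work
		mu.Unlock()
	})

	// Any goroutine holding the queue can add targets at any time.
	for _, site := range []string{"a.example", "b.example", "c.example", "d.example"} {
		queue <- site
	}

	close(queue) // tells the workers to stop once the queue is drained
	<-done
	fmt.Println("sites indexed:", len(indexed))
}
```

Because the channel is unbuffered, a send blocks until some worker is free to take the site, which matches the "first free Indexer" behaviour described above.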

alexander-bauer mentioned this issue Oct 29, 2012

Multiple HTTP GET requests #13

Closed

alexander-bauer closed this as completed in 53d8457 Nov 6, 2012
