Support cached responses for any URL within a crawled network #23

Open
premasagar opened this issue Aug 15, 2012 · 7 comments
@premasagar
Member

Aspirational feature, i.e. not too important right now.

Summary: support serving a cached response for a URL's network when the URL is already known as a secondary URL in a previously cached response.

The crawling required to discover a rel=me network of URLs is intensive and takes some time, so we want to avoid excessive processing as much as possible.

Currently, we cache the network based only on the originally requested URL, e.g. example.com. However, we should also index the data by every other URL in the network - e.g. if a request is later made to foo.com, and if foo.com is already known to be part of example.com's network, then the previously cached data should be served (although, this time, foo.com will be considered the original URL and so foo.com will not be included in the URL list, but example.com will be).
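
To make that concrete, an illustrative example (placeholder domains only):

// One cached network, crawled once via a request for example.com:
var cachedNetwork = ['example.com', 'foo.com', 'bar.com'];

// A later request for example.com is served from this cache as:
//   ['foo.com', 'bar.com']
// A later request for foo.com is served from the same cached network, with
// foo.com now treated as the original URL:
//   ['example.com', 'bar.com']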

@premasagar
Member Author

Or is this a bad idea, since it could mean serving, for example, a domain that is 10 crawl steps away from the original request's domain? (See the issue on crawl depth, #24.)

If the secondary domain had been crawled itself, it would have had 10 more crawl steps in which to find other domains that were not in the previously cached network list.

@premasagar
Member Author

The issue raised in my previous comment is not actually a problem: the second URL, which was 10 crawl steps away from the first, would reset the counter to zero when found, and it would then become the starting point for a new crawl. Hence, the resulting network will always be the same, no matter which URL in the network was originally requested.

So, the whole network should indeed be cached, with a request for any of its URLs able to retrieve the cached data.

@chrisnewtn
Member

Sorry Prem, I realize my lack of documentation is a major problem here.

At the moment, the cacher works by taking the place of the scraper whenever it contains data pertaining to a URL. Here's the code in question in the Page class:

// Serve the page from the cache if we already have data for this URL;
// otherwise fall back to scraping it.
if (cache.has(self.url)) {
  cache.fetch(self.url, populate);
} else {
  scraper.scrape(self.url, populate);
}

I should stress that graphs are not cached; pages which contain links are. That is, Node Socialgraph caches not the graph generated by the requested domain, but the individual pages which constitute the graph as generated by the requested domain.

The cache in its current incarnation is purely concerned with preventing the creation of JSDOM instances; that's it. The graph is still built from scratch on every request, but it is constructed using cached data. To save on this processing I could cache the graph itself, as well as its constituent pages. I gather this is how you thought it worked anyway?

@premasagar
Member Author

Ok, well in the interests of faster response times and reduced CPU usage
(at the cost of some server memory to store the additional content), I
think we should definitely be caching the network lists themselves.

I can imagine caching working like this (initially just in memory, although
in future this may be on Redis or similar; see the rough sketch after this list):

  • When a network list is first assembled, assign it a unique id (e.g.
    from an auto-incremented counter, a hash of its contents, or
    something more exotic)
  • Store the lists in a key-value object, where the key is the id of
    the network and the value is an array of all URLs in the list
    (including the URL given in the original request)
  • Store a separate key-value object that has a key for each URL in
    the list, and where the value is the network id
  • When a request comes in, look up the URL and serve the network list.
  • The default behaviour could be to return the entire network list,
    including the URL in the request query.
  • An optional flag could be passed, say omit_request, which would
    have the server remove the requested URL from the list, making the
    handler code in the client a bit simpler.

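A minimal sketch of the scheme above, just to make it concrete (the names
networksById, networkIdByUrl, cacheNetwork and lookupNetwork are illustrative
assumptions, not part of the existing code):

// In-memory stores: one keyed by network id, one keyed by URL.
var networksById = {};   // network id -> array of every URL in the network
var networkIdByUrl = {}; // URL -> network id
var nextId = 0;

function cacheNetwork(urls) {
  var id = nextId++; // auto-incremented counter; a content hash would also work
  networksById[id] = urls;
  urls.forEach(function (url) {
    networkIdByUrl[url] = id;
  });
  return id;
}

function lookupNetwork(requestedUrl, omitRequest) {
  var id = networkIdByUrl[requestedUrl];
  if (id === undefined) {
    return null; // not cached: fall back to a fresh crawl
  }
  var urls = networksById[id];
  if (omitRequest) {
    // Optional omit_request behaviour: strip the requested URL from the list.
    urls = urls.filter(function (url) { return url !== requestedUrl; });
  }
  return urls;
}
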
How does that sound?

@chrisnewtn
Member

I think I get what you're saying. I'll try and figure out the best way of doing this within the application's architecture.

chrisnewtn added a commit that referenced this issue Sep 8, 2012
Add link cache timer. Add graph caching. #23
@chrisnewtn
Member

Ok, I've built in a caching timer: anything older than an hour is now discarded. It's probably not perfect, but it's simple and effective.

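A rough illustration of that kind of hour-long expiry, assuming each cached
entry records when it was stored (the names here are placeholders, not the
cache module's actual API):

var ONE_HOUR = 60 * 60 * 1000;
var store = {}; // key -> { data: ..., cachedAt: timestamp in milliseconds }

function set(key, data) {
  store[key] = { data: data, cachedAt: Date.now() };
}

function get(key) {
  var entry = store[key];
  if (!entry) return null;
  if (Date.now() - entry.cachedAt > ONE_HOUR) {
    delete store[key]; // older than an hour: discard and treat as a miss
    return null;
  }
  return entry.data;
}
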
I've also mostly rewritten the server code to support caching of whole graphs, using the same module as the link caching.

I've also renamed the q URL parameter to url, which I think is a more intuitive name. On the plus side, the server no longer crashes if the url parameter is missing!

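For example (query string only; everything else about the request is unchanged):

?q=example.com    // before the rename
?url=example.com  // after the rename
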
The changes have been merged into the master branch and as soon as I've figured out how to get Jitsu onto my new Ubuntu laptop, I'll put them live there too.

Now's probably a good time for me to document this thing eh?

@premasagar
Member Author

Great!

jitsu: [sudo] npm install jitsu -g
