Add a list of implementation limitations
carlmw committed May 26, 2015
1 parent 8433ee0 commit 4b7f115
Showing 1 changed file with 15 additions and 3 deletions.
18 changes: 15 additions & 3 deletions README.md
@@ -39,6 +39,7 @@ When the module is installed globally, Cartographer should be added to your path
# TODO

- Better error handling
  - I've taken the happy path throughout
- Better asset matching
  - Would be great to find a library that does this far better than my RegExps
- Extract the recursion and remove state
@@ -60,12 +61,23 @@ Spelunking through trumpet's `package.json` I found [html-tokenize](https://gith

For making HTTP requests I went with [request](https://github.com/request/request) because I just didn't want to be messing with different schemes and request gives me a simple stream I can plug straight into html-tokenize.

Internally, Node's default HTTP agent caps concurrent sockets at 5 per host (the `maxSockets` default in Node 0.10; newer releases default to `Infinity`) and does a pretty good job of reusing connections from its pool. This could be tuned, but for now it means I don't have to worry about managing connections myself.
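
If the limit ever needs tuning, one option is an explicit agent. This is only a sketch using Node's core `http.Agent`; the `get` helper is hypothetical and not part of Cartographer, and the value 5 simply mirrors the limit described above.

```javascript
const http = require('http');

// Assumption: a dedicated agent instead of the global one,
// so the connection limit is visible and easy to change.
const agent = new http.Agent({
  keepAlive: true, // keep idle sockets open so later requests can reuse them
  maxSockets: 5,   // at most 5 concurrent sockets per host
});

// Hypothetical helper: issue a GET through the shared agent.
function get(url, callback) {
  return http.get(url, { agent }, callback);
}
```

Requests beyond the limit are queued by the agent until a socket frees up, so callers need no extra bookkeeping.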

My approach involves hitting the first page to find anchors, scripts and stylesheets. When an anchor resolves to a page on the same host I then recurse into it collecting its assets and those of its children.
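
The recursion can be illustrated without touching the network. The sketch below is not Cartographer's actual code: the `site` map is a hypothetical stand-in for fetching and tokenizing pages, and `crawl` is an illustrative name. It assumes "same host" means an exact match on `URL.host` and that fragments should be ignored.

```javascript
// Hypothetical in-memory site: page URL -> hrefs found in its anchors.
const site = {
  'http://example.com/': ['/about', 'http://other.example/', '/'],
  'http://example.com/about': ['/', '/team#staff'],
  'http://example.com/team': [],
};

// Visit pageUrl, then recurse into every anchor on the same host.
// The visited set is passed along so each page is crawled only once.
function crawl(pageUrl, visited = new Set()) {
  if (visited.has(pageUrl)) return visited;
  visited.add(pageUrl);

  const rootHost = new URL(pageUrl).host;
  for (const href of site[pageUrl] || []) {
    const next = new URL(href, pageUrl); // resolve relative to the current page
    next.hash = '';                      // fragments point at the same page
    if (next.host === rootHost) crawl(next.href, visited);
  }
  return visited;
}

const pages = [...crawl('http://example.com/')];
```

Here `pages` holds the three `example.com` URLs, and `http://other.example/` is never followed.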

When an asset is found its [type, path, source page] are output on a firehose.

My intention would be to pipe this into any number of formatters. This would be great for creating an asset dependency tree or site hierarchy.
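
As a rough idea of what a formatter could look like, here is a sketch built on Node's core `stream` module. The `records` array stands in for the real firehose, and `toTsv` and `tsvFormatter` are hypothetical names, not part of Cartographer.

```javascript
const { Readable, Transform } = require('stream');

// Hypothetical records in the [type, path, source page] shape.
const records = [
  ['script', '/js/app.js', 'http://example.com/'],
  ['stylesheet', '/css/site.css', 'http://example.com/about'],
];

// Format one record as a tab-separated line.
function toTsv(record) {
  return record.join('\t') + '\n';
}

// Formatter stream: takes record arrays in, emits text out.
function tsvFormatter() {
  return new Transform({
    writableObjectMode: true,
    transform(record, _encoding, done) {
      done(null, toTsv(record));
    },
  });
}

// Stand-in firehose piped through the formatter to stdout.
Readable.from(records).pipe(tsvFormatter()).pipe(process.stdout);
```

Because each formatter is just a stream, the same firehose could feed a dependency-tree builder or a JSON writer in the same way.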




# Caveats, limitations and considerations

* The current implementation ignores images altogether
  * It would be trivial to add a matcher for images
  * Perhaps it would need to handle `srcset` and `<picture>`?
  * It could also parse CSS and find images referenced therein
* Should the scraper respect `robots.txt` and/or `rel="nofollow"`?
  * Perhaps it should respect `rel="canonical"` to slightly mitigate the likelihood of finding the same page at multiple URLs?
* Currently when assets are output on the firehose they are not resolved relative to the rootUrl
  * This could make matching the same asset across multiple pages a little difficult.
* It would be great to respect `Cache-Control` and `Expires` headers.
  * We could use this to cache results on disk to speed up future runs
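
For the unresolved-path caveat, a possible fix is sketched below. It assumes each path should be resolved against the page it was found on (rather than the root URL), which is how browsers treat relative references; `resolveAsset` is a hypothetical helper, not existing Cartographer code.

```javascript
// Hypothetical helper: rewrite a [type, path, source page] record
// so that the path becomes an absolute URL.
function resolveAsset([type, path, page]) {
  const absolute = new URL(path, page); // handles relative and protocol-relative paths
  absolute.hash = '';                   // a fragment does not change the asset
  return [type, absolute.href, page];
}

const fromPost = resolveAsset(['script', '../js/app.js', 'http://example.com/blog/post']);
const fromHome = resolveAsset(['script', 'js/app.js', 'http://example.com/']);
```

Both records end up with the path `http://example.com/js/app.js`, so the same asset found on different pages can be matched with a plain string comparison.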
