Add a list of implementation limitations

carlmw · May 26, 2015 · commit 4b7f115
1 parent 8433ee0
Showing 1 changed file with 15 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -39,6 +39,7 @@ When the module is installed globally, Cartographer should be added to your path
 # TODO
 
 - Better error handling
+  - I've taken the happy path throughout
 - Better asset matching
   - Would be great to find a library that does this far better than my RegExps
 - Extract the recursion and remove state
@@ -60,12 +61,23 @@ Spelunking through trumpet's `package.json` I found [html-tokenize](https://gith
 
 For making HTTP requests I went with [request](https://github.com/request/request) because I just didn't want to be messing with different schemes and request gives me a simple stream I can plug straight into html-tokenize.
 
+Internally, Node's default HTTP agent caps concurrent sockets at 5 per host (the `maxSockets` default in Node 0.10; newer releases default to unlimited) and does a pretty good job of reusing connections from its pool. This could be tuned via `maxSockets`, but for now it means I don't really have to worry about managing connections myself.
+
 My approach involves hitting the first page to find anchors, scripts and stylesheets. When an anchor resolves to a page on the same host I then recurse into it collecting its assets and those of its children.
 
 When an asset is found its [type, path, source page] are output on a firehose.
 
 My intention would be to pipe this into any number of formatters. This would be great for creating an asset dependency tree or site hierarchy.
 
-
-
-
+# Caveats, limitations and considerations
+
+* The current implementation ignores images altogether
+  * It would be trivial to add a matcher for images
+  * Perhaps it would need to handle `srcset` and `<picture>`?
+  * It could also parse CSS and find images referenced therein
+* Should the scraper respect `robots.txt` and/or `rel="nofollow"`?
+* Perhaps it should respect `rel="canonical"` to slightly mitigate the likelihood of finding the same page at multiple URLs?
+* Currently, when assets are output on the firehose, they are not resolved relative to the `rootUrl`
+  * This could make matching the same asset across multiple pages a little difficult.
+* It would be great to respect `Cache-Control` and `Expires` headers.
+  * We could use this to cache results on disk to speed up future runs
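
The unresolved-path caveat above could be addressed along these lines. This is only a minimal sketch, not what Cartographer does today: `resolveAsset` and the example URLs are hypothetical, and it assumes a Node version that ships the global WHATWG `URL` class.

```javascript
// Hypothetical helper, not part of Cartographer: turn an asset path found on a
// page into an absolute URL so the same asset compares equal across pages.
function resolveAsset(rootUrl, pageUrl, assetPath) {
  // Resolve the page against the root first, then the asset against the page.
  const page = new URL(pageUrl, rootUrl);
  const resolved = new URL(assetPath, page);
  // Fragments never reach the server, so drop them before comparing.
  resolved.hash = '';
  return resolved.href;
}

// The same stylesheet referenced with different relative paths on two pages.
const fromPost = resolveAsset('http://example.com/', '/blog/post/', '../css/site.css');
const fromAbout = resolveAsset('http://example.com/', '/about', 'blog/css/site.css');

console.log(fromPost);
console.log(fromPost === fromAbout);
```

Emitting the resolved URL on the firehose instead of the raw path would let formatters deduplicate assets with a plain string comparison.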