
Lazy crawler+bundler implementation #44

Closed
wants to merge 15 commits

Conversation

Gozala (Contributor) commented Apr 12, 2019

The proposed changes introduce the concept of a Resource, which on instantiation is just a pointer to the corresponding sub-resource link (where the link is what it used to be). Each resource is lazily freeze-dried on demand:

  • The original resource content can be obtained via the resource.download():Promise<Response> method, which uses options.fetchResource for IO and caches the response, so the next call returns the same promise.
  • Immediate sub-resources can be obtained via the resource.resources():Promise<Iterable<Resource>> method, which internally uses this.download() to get the original resource, extracts its links, and wraps them into new Resource instances (note that the returned subresources will not have fetched anything yet).
  • A resource can be freeze-dried via the resource.text() / resource.blob() methods, which do all of the above and replace links with URLs returned by options.resolveURL(resource):Promise<string>. Since options.resolveURL is async, it can recurse and inline blobToDataURL(await resource.blob()), or return some URL without having to fetch the actual resource.

The primary API has also changed from freezeDry(opts):Promise<string> to freezeDry(opts):Resource; therefore, to accomplish what the old API did, the user will need to await freezeDry(opts).text() instead. This change also means the user could do await freezeDry(opts).blob() or, in a more advanced use case, choose to hand-roll an alternative serialization:

const root = freezeDry(opts);
const resources = await root.resources()
for (const resource of resources) {
  switch (resource.type) {
    case "image":
    case "audio":
    case "video":
    case "font": {
      const content = await resource.download()
      const blob = await content.blob()
      const url = await blobToDataURL(blob)
      resource.replaceURL(url)
      break
    }
    case "style": {
      const subresources = await resource.resources()
      for (const subresource of subresources) {
        // .... probably recurse here
      }
      break
    }
    case "document": {
      const subresources = await resource.resources()
      for (const subresource of subresources) {
        // ....
      }
      break
    }
  }
}
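
For the basic case, the old behaviour stays a one-liner on top of the new API (same opts as above):

const html = await freezeDry(opts).text()
// or, to get the result as a Blob rather than a string:
const blob = await freezeDry(opts).blob()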

Consult @Treora on whether it was intended to add the CSP & memento
data only on the top document. The implementation used to
generate data URLs for all resources, so it was not that
useful there; however, if resources are saved in separate files
it might make sense to add the metadata everywhere.
At the moment we just do it on the top document to match
the assumptions in the tests.

Gozala commented Apr 12, 2019

Hmm... So it seems that the current implementation of freeze-dry actually carries on crawling / inlining resources; it's just that it will also start serializing the DOM, so I would guess that on timeout you will wind up with some resources inlined and others left intact. There are also a few TODOs to abort tasks on timeout and to do something about resources that have not been processed.

I think it might be worth discussing what a good solution would be here. I don't think replicating the current behavior would be ideal, although it is possible; it would just introduce some messiness into otherwise (IMO) well-structured code.

src/archive.js Outdated
getDocInFrame?: (HTMLIFrameElement) => ?Document;
fetch?: (Resource) => Promise<Response>;
fetchResource?:Fetch;
resolveURL?: (Resource) => Promise<string>;

This function takes a resource and creates a URL for it.

src/archiver.js Outdated
}
static async links(resource/*:DocumentResource*/) {
const document = await resource.captureDocument()
return extractLinksFromDom(document, {docUrl:resource.url}).map(link => makeLinkAbsolute(link, resource))

Does turning the links into absolute ones make sense here?

src/archiver.js Outdated
try {
const css = postcss.parse(source)
const url = response.url || resource.link.absoluteTarget
const links = extractLinksFromCss(css, url)

Need to makeLinkAbsolute here.


Gozala commented Apr 13, 2019

(Mostly noting this so I don't forget.) I think I have an idea for a way of dealing with timeouts. We could use an AbortController whose signal is passed to each fetch issued by freezeDry and used to abort all of them as necessary.
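
A rough sketch of the wiring I have in mind (illustrative only, not the actual implementation; timeout stands for the existing timeout option):

// One controller shared by every fetch issued during freeze-drying.
const controller = new AbortController()

// Wrap the fetch used for subresources so each request can be aborted.
const fetchResource = (url) => fetch(url, { signal: controller.signal })

// On timeout (or for any other reason), abort all in-flight fetches at once.
setTimeout(() => controller.abort(), timeout)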


Gozala commented Apr 15, 2019

@Treora I'm having trouble upgrading jest-fetch-mock to the latest version, which should have support for AbortSignal; it seems that everything breaks after the upgrade.

I did notice that after the update, await (await fetch(docUrl)).text() is [object Object] instead of markup, which I suspect is the reason. I also noticed that fetch.Response and the Response used in mockFetch aren't equal, which makes me think it's related.

I'm also not sure where the Response and Blob globals are coming from in the first place; would you mind helping with that?


Gozala commented Apr 15, 2019

I have figured out the source of the issue. It's related to node-fetch/node-fetch#392 (comment):

The new jest-fetch-mock uses cross-fetch, which in turn uses node-fetch, which has its own Blob implementation that is incompatible with the one jsdom expects; jsdom basically ends up calling blob.toString(), and that's where [object Object] comes from.

Unfortunately I can't see any way to resolve this without mass-forking the whole dependency chain.


jimmywarting commented Apr 15, 2019

jsdom was unfortunately unwilling to use the new public Blob API to read the content of the blob with promises.

Maybe you have to use jsdom's own fetch implementation instead of node-fetch (to get their internal blob)?
The alternative, if you can, is to avoid creating a res.blob() and get the content as an ArrayBuffer using res.arrayBuffer() directly, instead of calling something like fr.readAsArrayBuffer(blob).

The other way could be to get it as an ArrayBuffer and then construct a blob using jsdom, if you really need a blob:

res.arrayBuffer().then(ab => {
  new Blob([ab])
})

Or try to convince him to use FileReader to allow reading jsdom-like objects with the public API; I have already tried and done my part.


Gozala commented Apr 15, 2019

I ended up resorting to a horrible hack to get the jest-fetch-mock > cross-fetch > node-fetch chain to interoperate with jsdom. I wish there were a better way, but there isn't; the suggestion above assumes we control the full stack, and we don't.


Treora commented Apr 15, 2019

I'm also not sure where the Response and Blob globals are coming from in the first place; would you mind helping with that?

As you probably found out already, these are set by jest-fetch-mock.

I have not fully grasped the issue you are dealing with (and would be very glad if I do not need to); but if you found some changes are required in the test setup etcetera, of course feel free to propose. I'm quite indifferent about what test frameworks and such we use; as long as not too much effort is wasted maintaining them. I hope the hacks can mostly be abstracted away from sight, and added in such a way that we can easily remove them if the problematic libraries will have solved issues underneath..

We could use an AbortController whose signal is passed to each fetch issued by freezeDry and used to abort all of them as necessary.

Sounds good. Also, a signal-based approach like AbortController seems like a good API we could offer instead of the timeout option (a user could just use setTimeout themselves to still get the timeout behaviour).
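
For instance, if freezeDry grew a hypothetical signal option, a caller could recreate the timeout behaviour along these lines (sketch only; this option does not exist yet):

// Caller-side timeout via an AbortController and a hypothetical `signal` option.
const controller = new AbortController()
setTimeout(() => controller.abort(), 5000) // whatever timeout the caller wants

const html = await freezeDry({ ...opts, signal: controller.signal }).text()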


Gozala commented Apr 15, 2019

I have not fully grasped the issue you are dealing with (and would be very glad if I do not need to); but if you found some changes are required in the test setup etcetera, of course feel free to propose. I'm quite indifferent about what test frameworks and such we use; as long as not too much effort is wasted maintaining them. I hope the hacks can mostly be abstracted away from sight, and added in such a way that we can easily remove them if the problematic libraries will have solved issues underneath..

The problem is that the version of jest-fetch-mock did not have support for AbortController. The new version does, but it also switches the underlying implementation to cross-fetch, which uses node-fetch in node. After updating jest-fetch-mock all the tests were breaking, because global.Request / global.Response are from node-fetch while global.Blob is from jsdom; worse yet, the Response constructor does not understand jsdom.Blob and just serializes it to [object String]; furthermore, the result of Response(...).blob() is a Blob instance from node-fetch that JSDOM does not understand.

This means that even if we managed to make this.Blob be the one from node-fetch (which we can't, because the maintainers of that library don't want to expose it: node-fetch/node-fetch#392), we would still have an issue passing Responses back to the JSDOM world, because whatever it gets from response.blob() will not be a JSDOM.Blob but a NodeFetch.Blob instead.

Ideally the differences here would be resolved across jsdom and node-fetch by eliminating the competing Blob implementations and creating a common one that both would use (jsdom/jsdom#2555 (comment)). Until that happens, I ended up making a hack that sets JSDOM.Blob.prototype.__proto__ = NodeFetch.Blob.prototype and adds methods from NodeFetch.Blob.prototype that call the corresponding methods on JSDOM.Blob.prototype. Which is horrible, but it is the only way to make blob instanceof Blob work regardless of whether the Blob is from jsdom or node-fetch.

Sadly the monkeypatching can't "be out of sight"; that is because NodeFetch.Blob isn't really available and can only be obtained asynchronously ((await Response(...).blob()).constructor).
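
Roughly, the hack amounts to something like this (a sketch of the approach described above, not the exact code in this PR; it assumes the test environment described earlier, where the global Response comes from node-fetch and the global Blob from jsdom, and it runs inside async test setup):

// node-fetch does not export its Blob, so grab it from a Response instance.
const NodeFetchBlob = (await new Response('').blob()).constructor

// As described above: make jsdom's Blob prototype inherit from node-fetch's
// Blob prototype, so jsdom blobs also satisfy instanceof checks against it.
Object.setPrototypeOf(Blob.prototype, NodeFetchBlob.prototype)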

Gozala changed the title from "[In progress] lazy crawler+bundler implementation" to "Lazy crawler+bundler implementation" on Apr 15, 2019

Gozala commented Apr 16, 2019

@Treora I discovered one peculiar behavior as I was trying to change subresources from being an Iterable<Resource> to a Map<string, Resource>: the current implementation does not recognize duplicates, so a link to the same image from both the document and its styles will end up fetching it twice.

I would like to address this with an alternate design, where the resource-to-link mapping isn't 1:1 but 1:n. This pull is huge already & it might make more sense to address this in a follow-up instead.


Treora commented Apr 16, 2019

the current implementation does not recognize duplicates, so a link to the same image from both the document and its styles will end up fetching it twice.

Yes, thus far it seemed best to keep things simple and leave deduplication to the fetch function: the user-supplied fetchResource callback, or even the browser's cache. This could be changed, but would need to be done carefully as a URL does not necessarily always return the same resource: request headers, e.g. the Accept'ed content types, could matter.
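
For illustration, that kind of deduplication could live entirely in a user-supplied callback; a minimal sketch, assuming a fetch-like fetchResource signature and deliberately ignoring the request-header caveat above by keying on URL alone:

// Memoise fetches by URL and hand out clones so each caller gets a readable body.
const cache = new Map()
const fetchResource = (url, init) => {
  if (!cache.has(url)) {
    cache.set(url, fetch(url, init))
  }
  return cache.get(url).then(response => response.clone())
}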


Treora commented Apr 16, 2019

This pull is huge already & it might make more sense to address this in a follow-up instead.

Indeed; this PR is already so large that I suppose the most workable plan will be, once you feel you have got to a somewhat stable solution you feel confident about, to try to collaboratively review and incorporate various aspects of it in a few iterations. It is very helpful when any orthogonal changes are put into separate commits, or possibly in separate branches.


Treora commented Apr 16, 2019

The problem is that the version of jest-fetch-mock did not have support for AbortController. ......

Thanks for the clear explanation, that helps!


Gozala commented Apr 16, 2019

Indeed; this PR is already so large that I suppose the most workable plan will be, once you feel you have got to a somewhat stable solution you feel confident about, to try to collaboratively review and incorporate various aspects of it in a few iterations. It is very helpful when any orthogonal changes are put into separate commits, or possibly in separate branches.

I already feel like it's done & I can't really make any meaningful progress without a review. So I suggest we do that.


Gozala commented Dec 12, 2019

Hi @Treora,

Any chance we could figure out a way to get some of this work into the tree? It looks like I'll be needing this in my future work and I would very much prefer not to carry on with my own fork.


Treora commented Dec 13, 2019 via email


Gozala commented Dec 16, 2019

As I think I said before, I will probably not merge this PR as-is, but
rather take the architecture and API design you demonstrated here as a
guide and make changes in a few steps; while trying to keep the code a
bit more modular than this PR.

Is there a way we could collaborate? I do want to use freeze-dry for the work & that would require things like the changes in this pull request & more. However, I am not sure at this point what the best strategy would be to go about it without diverging further.

Also, I am still eager to make the code typed (see issue #28), but might
do that afterwards.

For what it's worth, I would advise going about it the other way round. The type system poses API constraints that could otherwise be overlooked. Additionally, the type checker is very helpful in a large refactoring, as it will identify places you've missed.


Gozala commented Dec 16, 2019

Assuming you're in favor of the public API, & at the risk of sounding biased, I would suggest you consider landing this without rewriting and then following up with changing the internals to match the desired modularity. The benefits of this approach would be:

  1. We all get the desired functionality sooner.
  2. A big chunk of code will already be typed, making follow-up changes easier.
  3. It will enable us to collaborate: my fork won't need to diverge further, and you will be able to land incremental changes to the internal modules.


Treora commented Dec 19, 2019

Is there a way we could collaborate? I do want to use freeze-dry for the work & that would require things like the changes in this pull request & more. However, I am not sure at this point what the best strategy would be to go about it without diverging further.

I do not know either what may be the best strategy. I suppose it helps if we exchange more details about our plans; and if we try to explain changes we make/need so they could be redone (‘rebased’, conceptually) if needed.

I am curious, what kind of changes do you expect the “& more” will contain?

Assuming you're in favor of the public API, & at the risk of sounding biased, I would suggest you consider landing this without rewriting and then following up with changing the internals to match the desired modularity.

I am considering this indeed, but need to look into the code again and fully understand it. As I intend to do a substantial rewrite, and I am already having trouble wrapping my head around some aspects, I suspect that for me it would be much easier and thus quicker to rewrite code I am most familiar with.

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one; I’d like to make the code more ‘graph-oriented’ (perhaps passing subgraphs of subresources&links around), and provide the ‘callback-oriented’ API you created only as a convenience function on top of that. Not sure that makes sense to you, nor if it will keep making sense to me as I dig deeper. :)


Treora commented Dec 19, 2019

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one

I may be inaccurate here, as you do leave the possibility to interact with the resource graph without using the resolveURL callback (as you exemplify above). I will look into this again and see how much that already matches what I had in mind.


Treora commented Dec 19, 2019

Also, I am still eager to make the code typed (see issue #28), but might
do that afterwards.

For what it's worth, I would advise going about it the other way round. The type system poses API constraints that could otherwise be overlooked. Additionally, the type checker is very helpful in a large refactoring, as it will identify places you've missed.

Good points. I was thinking that it may slow down rewriting, especially as I am not fluent with typed javascript. Could work out either way, I guess.

As for which type system/language to use; I think you once said you use flow for its more powerful templating system, but that typescript may be helpful because of its wider uptake, community, etc. How is your opinion on this now? I have been a little biased towards adopting typescript, but if you recommend flow I am fine with following your lead on this.


Gozala commented Dec 19, 2019

As for which type system/language to use; I think you once said you use flow for its more powerful templating system, but that typescript may be helpful because of its wider uptake, community, etc. How is your opinion on this now? I have been a little biased towards adopting typescript, but if you recommend flow I am fine with following your lead on this.

I still think flow has a better & more flexible type system; however, typescript is probably a better choice given its popularity.

The only argument I would consider is that with typescript you'll need to do builds etc. In that regard I think flow has an advantage, as you can add types as comments (like in this pull) and the code would load natively in runtimes that support ES modules.
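
For reference, the comment-based type syntax used in this pull looks roughly like this (the names are illustrative); the file stays plain JavaScript and needs no build step:

// Flow types written as comments: a type checker can verify them,
// yet the file runs unchanged in any runtime with ES module support.
export async function links(resource /*: DocumentResource */) /*: Promise<Link[]> */ {
  return [] // placeholder body, just to show the syntax
}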


Gozala commented Dec 19, 2019

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one
I may be inaccurate here, as you do leave the possibility to interact with the resource graph without using the resolveURL callback (as you exemplify above). I will look into this again and see how much that already matches what I had in mind.

Indeed you can traverse it as a graph, and resolveURL is just a convenience, kind of like array.reduce, but you don't have to use it.

That being said, I find it a fairly convenient way to avoid multiple passes over the graph.

I think what you're disliking about the API is a side effect of it being designed to traverse and bundle the graph in a single pass, which in turn requires pipelining the fetch, parse, remap, and save operations such that it does all of them as you descend.

I don't know if there's a better way to go about it, but my experience so far has been that it allows me to customize behavior very simply.


Gozala commented Dec 19, 2019

An alternative API could follow the visitor pattern, which accomplishes more or less the same goals but is more OOP-style.

The current solution allows you to use a push-based interface via the resolveURL hook (which is your visitor pattern, I suppose), but it also allows you to use a pull-based interface, as shown in the example in the description.
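
To make the push-based side concrete: the resolveURL hook from the description could be used along these lines (a sketch reusing the blobToDataURL idea from the example above, inlining every subresource as a data: URL):

// Push-based usage: freeze-dry asks resolveURL for the replacement URL of each
// resource, and here we answer by inlining its content.
const html = await freezeDry({
  ...opts,
  resolveURL: async (resource) => blobToDataURL(await resource.blob()),
}).text()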


Gozala commented Dec 19, 2019

I do not know either what may be the best strategy. I suppose it helps if we exchange more details about our plans; and if we try to explain changes we make/need so they could be redone (‘rebased’, conceptually) if needed.

I am curious, what kind of changes do you expect the “& more” will contain?

I do not know exactly yet, but here is a bit of an overview of the context in which I'm using freeze-dry right now: https://github.com/gozala/artifacts

The readme does not quite talk about it yet, but streamable web bundle formats are also on the todo list, meaning I'll need to start streaming the bundle as the document is being parsed and bundled, & ideally such that parallel requests to sub-resources could be streamed in parallel (not forcing them to queue up).

Another thing that I intend to change about the current architecture is the mutation of the source DOM tree that currently occurs during freeze-drying.

Finally, I also want to add a way to plug in things like CSS parsers, which are currently baked in.

Hope this gives enough context of what's on my list.

Treora mentioned this pull request on Dec 20, 2019

Treora commented Apr 12, 2020

Happy anniversary to this PR. 🎂 (just kidding, not proud of this..)

As discussed, I started with converting the project to typescript; took a while, but it’s pretty much done, see #52! (any comments there are welcome, it’s my first typescript project)

Next up is getting this functionality in, using this PR as reference but obviously writing everything anew.

Treora added a commit that referenced this pull request Jul 28, 2021
Combining iterators with recursion was never going to work, at least not
without ruining the elegance it was supposed to provide. Instead, use a
callback-based approach, where the callback is invoked for each link to
a subresource and is expected to recurse to handle sub-subresources etc.

The approach is similar to (and inspired by) gozala’s approach in PR #44;
however, it differs in several ways:
- It does not create ‘stub’ resources, because the Link is the item that
  is passed to the callback function.
- Fetching, recursing and updating the link target are explicitly
  performed by the callback function, which may thus require a few more
  lines, but less magic is happening. Also the Resource classes need not
  know anything about these activities, becoming more passive objects.
- Capturing the DOM state is still a separate step doing its own
  recursion, in order to make that happen synchronously. Might change in
  subsequent edits.
Treora closed this in #59 on Jul 15, 2022