
Lazy crawler+bundler implementation #44

Closed
wants to merge 15 commits

Conversation

Gozala (Contributor) commented Apr 12, 2019

The proposed changes introduce the concept of a Resource, which on instantiation is just a pointer to the corresponding sub-resource link (where the link is what it used to be). Each resource is lazily freeze-dried on demand:

  • The original resource content can be obtained via the resource.download():Promise<Response> method, which uses options.fetchResource for IO and caches the response, so the next call returns the same promise.
  • Immediate sub-resources can be obtained via the resource.resources():Promise<Iterable<Resource>> method, which internally uses this.download() to get the original resource, extracts its links, and wraps them into new Resource instances (note that the returned subresources will not have fetched anything yet).
  • A resource can be freeze-dried via the resource.text() / resource.blob() methods, which do all of the above and replace links with URLs returned by options.resolveURL(resource):Promise<string>. Since options.resolveURL is async, it can recurse and inline blobToDataURL(await resource.blob()), or return some URL without having to fetch the actual resource.

The primary API has also changed from freezeDry(opts):Promise<string> to freezeDry(opts):Resource; therefore, to accomplish what the old API did, the user will need to await freezeDry(opts).text() instead. This change also means the user could do await freezeDry(opts).blob() or, in a more advanced use case, choose to hand-roll an alternative serialization:

const root = freezeDry(opts);
const resources = await root.resources()
for (const resource of resources) {
  switch (resource.type) {
    case "image":
    case "audio":
    case "video":
    case "font": {
      const content = await resource.download()
      const blob = await content.blob()
      const url = await blobToDataURL(blob)
      resource.replaceURL(url)
      break
    }
    case "style": {
      const subresources = await resource.resources()
      for (const subresource of subresources) {
        // .... probably recurse here
      }
      break
    }
    case "document": {
      const subresources = await resource.resources()
      for (const subresource of subresources) {
        // ....
      }
      break
    }
  }
}
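
For the basic case, the old behaviour stays a one-liner on top of the new API (same opts as above):

const html = await freezeDry(opts).text()
// or, to get the result as a Blob rather than a string:
const blob = await freezeDry(opts).blob()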

Consult @Treora on whether it was intended to add the CSP & memento
data only on the top document. The implementation used to
generate data URLs for all resources, so it was not that
useful there; however, if resources are saved in separate files
it might make sense to add the metadata everywhere.
At the moment we just do it on the top document to match
the assumptions in the tests.

Gozala commented Apr 12, 2019

Hmm... So it seems that the current implementation of freeze-dry actually carries on crawling / inlining resources; it's just that it will also start serializing the DOM, so I would guess that on timeout you will wind up with some resources inlined and others left intact. There are also a few TODOs to abort tasks on timeout and to do something about resources that have not been processed.

I think it might be worth discussing what a good solution would be here. I don't think replicating the current behavior would be ideal, although it is possible; it would just introduce some messiness into otherwise (IMO) well-structured code.

src/archive.js Outdated
getDocInFrame?: (HTMLIFrameElement) => ?Document;
fetch?: (Resource) => Promise<Response>;
fetchResource?:Fetch;
resolveURL?: (Resource) => Promise<string>;

This function takes a resource and creates a URL for it.

src/archiver.js Outdated
}
static async links(resource/*:DocumentResource*/) {
const document = await resource.captureDocument()
return extractLinksFromDom(document, {docUrl:resource.url}).map(link => makeLinkAbsolute(link, resource))

Does turning the links into absolute ones make sense here?

src/archiver.js Outdated
try {
const css = postcss.parse(source)
const url = response.url || resource.link.absoluteTarget
const links = extractLinksFromCss(css, url)

Need to makeLinkAbsolute here.


Gozala commented Apr 13, 2019

(Mostly noting this so I don't forget.) I think I have an idea for a way of dealing with timeouts. We could use an AbortController whose signal is passed to each fetch issued by freezeDry and used to abort all of them as necessary.
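
A rough sketch of the wiring I have in mind (illustrative only, not the actual implementation; timeout stands for the existing timeout option):

// One controller shared by every fetch issued during freeze-drying.
const controller = new AbortController()

// Wrap the fetch used for subresources so each request can be aborted.
const fetchResource = (url) => fetch(url, { signal: controller.signal })

// On timeout (or for any other reason), abort all in-flight fetches at once.
setTimeout(() => controller.abort(), timeout)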


Gozala commented Apr 15, 2019

@Treora I'm having trouble upgrading jest-fetch-mock to the latest version, which should have support for AbortSignal; it seems that everything breaks after the upgrade.

I did notice that after the update, await (await fetch(docUrl)).text() is [object Object] instead of markup, which I suspect is the reason. I also noticed that fetch.Response and the Response used in mockFetch aren't equal, which makes me think it's related.

I'm also not sure where the Response and Blob globals are coming from in the first place; would you mind helping with that?


Gozala commented Apr 15, 2019

I have figured out the source of the issue. It's related to node-fetch/node-fetch#392 (comment):

The new jest-fetch-mock uses cross-fetch, which in turn uses node-fetch, which has its own Blob implementation that is incompatible with the one jsdom expects; jsdom basically ends up calling blob.toString(), and that's where [object Object] comes from.

Unfortunately I can't see any way to resolve this without mass-forking the whole dependency chain.


jimmywarting commented Apr 15, 2019

jsdom was unfortunately unwilling to use the new public Blob API to read the content of the blob with promises.

Maybe you have to use jsdom's own fetch implementation instead of node-fetch (to get their internal blob)?
The alternative, if you can, is to avoid creating a res.blob() and get the content as an ArrayBuffer using res.arrayBuffer() directly, instead of calling something like fr.readAsArrayBuffer(blob).

The other way could be to get it as an ArrayBuffer and then construct a blob using jsdom, if you really need a blob:

res.arrayBuffer().then(ab => {
  new Blob([ab])
})

Or try to convince him to use FileReader to allow reading jsdom-like objects with the public API; I have already tried and done my part.


Gozala commented Apr 15, 2019

I ended up resorting to a horrible hack to get the jest-fetch-mock > cross-fetch > node-fetch chain to interoperate with jsdom. I wish there were a better way, but there isn't; the suggestion above assumes we control the full stack, and we don't.


Treora commented Apr 15, 2019

I'm also not sure where the Response and Blob globals are coming from in the first place; would you mind helping with that?

As you probably found out already, these are set by jest-fetch-mock.

I have not fully grasped the issue you are dealing with (and would be very glad if I do not need to); but if you found some changes are required in the test setup etcetera, of course feel free to propose. I'm quite indifferent about what test frameworks and such we use; as long as not too much effort is wasted maintaining them. I hope the hacks can mostly be abstracted away from sight, and added in such a way that we can easily remove them if the problematic libraries will have solved issues underneath..

We could use an AbortController whose signal is passed to each fetch issued by freezeDry and used to abort all of them as necessary.

Sounds good. Also, a signal-based approach like AbortController seems like a good API we could offer instead of the timeout option (a user could just use setTimeout themselves to still get the timeout behaviour).
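
For instance, if freezeDry grew a hypothetical signal option, a caller could recreate the timeout behaviour along these lines (sketch only; this option does not exist yet):

// Caller-side timeout via an AbortController and a hypothetical `signal` option.
const controller = new AbortController()
setTimeout(() => controller.abort(), 5000) // whatever timeout the caller wants

const html = await freezeDry({ ...opts, signal: controller.signal }).text()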


Gozala commented Apr 15, 2019

I have not fully grasped the issue you are dealing with (and would be very glad if I do not need to); but if you found some changes are required in the test setup etcetera, of course feel free to propose. I'm quite indifferent about what test frameworks and such we use; as long as not too much effort is wasted maintaining them. I hope the hacks can mostly be abstracted away from sight, and added in such a way that we can easily remove them if the problematic libraries will have solved issues underneath..

The problem is that the version of jest-fetch-mock did not have support for AbortController. The new version does, but it also switches the underlying implementation to cross-fetch, which uses node-fetch in node. After updating jest-fetch-mock all the tests were breaking, because global.Request / global.Response are from node-fetch while global.Blob is from jsdom; worse yet, the Response constructor does not understand jsdom.Blob and just serializes it to [object String]; furthermore, the result of Response(...).blob() is a Blob instance from node-fetch that JSDOM does not understand.

This means that even if we managed to make this.Blob be the one from node-fetch (which we can't, because the maintainers of that library don't want to expose it: node-fetch/node-fetch#392), we would still have an issue passing Responses back to the JSDOM world, because whatever it gets from response.blob() will not be a JSDOM.Blob but a NodeFetch.Blob instead.

Ideally the differences here would be resolved across jsdom and node-fetch by eliminating the competing Blob implementations and creating a common one that both would use (jsdom/jsdom#2555 (comment)). Until that happens, I ended up making a hack that sets JSDOM.Blob.prototype.__proto__ = NodeFetch.Blob.prototype and adds methods from NodeFetch.Blob.prototype that call the corresponding methods on JSDOM.Blob.prototype. Which is horrible, but it is the only way to make blob instanceof Blob work regardless of whether the Blob is from jsdom or node-fetch.

Sadly the monkeypatching can't "be out of sight"; that is because NodeFetch.Blob isn't really available and can only be obtained asynchronously ((await Response(...).blob()).constructor).
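
Roughly, the hack amounts to something like this (a sketch of the approach described above, not the exact code in this PR; it assumes the test environment described earlier, where the global Response comes from node-fetch and the global Blob from jsdom, and it runs inside async test setup):

// node-fetch does not export its Blob, so grab it from a Response instance.
const NodeFetchBlob = (await new Response('').blob()).constructor

// As described above: make jsdom's Blob prototype inherit from node-fetch's
// Blob prototype, so jsdom blobs also satisfy instanceof checks against it.
Object.setPrototypeOf(Blob.prototype, NodeFetchBlob.prototype)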

Gozala changed the title from "[In progress] lazy crawler+bundler implementation" to "Lazy crawler+bundler implementation" on Apr 15, 2019

Gozala commented Apr 16, 2019

@Treora I discovered one peculiar behavior as I was trying to change subresources from being an Iterable<Resource> to a Map<string, Resource>: the current implementation does not recognize duplicates, so a link to the same image from both the document and its styles will end up fetching it twice.

I would like to address this with an alternate design, where the resource-to-link mapping isn't 1:1 but 1:n. This pull is huge already & it might make more sense to address this in a follow-up instead.


Treora commented Apr 16, 2019

the current implementation does not recognize duplicates, so a link to the same image from both the document and its styles will end up fetching it twice.

Yes, thus far it seemed best to keep things simple and leave deduplication to the fetch function: the user-supplied fetchResource callback, or even the browser's cache. This could be changed, but would need to be done carefully as a URL does not necessarily always return the same resource: request headers, e.g. the Accept'ed content types, could matter.
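
For illustration, that kind of deduplication could live entirely in a user-supplied callback; a minimal sketch, assuming a fetch-like fetchResource signature and deliberately ignoring the request-header caveat above by keying on URL alone:

// Memoise fetches by URL and hand out clones so each caller gets a readable body.
const cache = new Map()
const fetchResource = (url, init) => {
  if (!cache.has(url)) {
    cache.set(url, fetch(url, init))
  }
  return cache.get(url).then(response => response.clone())
}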


Treora commented Apr 16, 2019

This pull is huge already & it might make more sense to address this in a follow-up instead.

Indeed; this PR is already so large that I suppose the most workable plan will be, once you feel you have got to a somewhat stable solution you feel confident about, to try to collaboratively review and incorporate various aspects of it in a few iterations. It is very helpful when any orthogonal changes are put into separate commits, or possibly in separate branches.


Treora commented Apr 16, 2019

The problem is that the version of jest-fetch-mock did not have support for AbortController. ......

Thanks for the clear explanation, that helps!


Gozala commented Apr 16, 2019

Indeed; this PR is already so large that I suppose the most workable plan will be, once you feel you have got to a somewhat stable solution you feel confident about, to try to collaboratively review and incorporate various aspects of it in a few iterations. It is very helpful when any orthogonal changes are put into separate commits, or possibly in separate branches.

I already feel like it's done & I can't really make any meaningful progress without a review. So I suggest we do that.


Gozala commented Dec 12, 2019

Hi @Treora,

Any chance we could figure out a way to get some of this work into the tree? It looks like I'll be needing this in my future work and I would very much prefer not to carry on with my own fork.


Treora commented Dec 13, 2019 via email


Gozala commented Dec 16, 2019

As I think I said before, I will probably not merge this PR as-is, but
rather take the architecture and API design you demonstrated here as a
guide and make changes in a few steps; while trying to keep the code a
bit more modular than this PR.

Is there a way we could collaborate? I do want to use freeze-dry for the work & that would require things like the changes in this pull request & more. However, I am not sure at this point what the best strategy would be to go about it without diverging further.

Also, I am still eager to make the code typed (see issue #28), but might
do that afterwards.

For what it's worth, I would advise going about it the other way round. The type system poses API constraints that could otherwise be overlooked. Additionally, the type checker is very helpful in a large refactoring, as it will identify places you've missed.


Gozala commented Dec 16, 2019

Assuming you're in favor of the public API, & at the risk of sounding biased, I would suggest you consider landing this without rewriting and then following up with changing the internals to match the desired modularity. The benefits of this approach would be:

  1. We all get the desired functionality sooner.
  2. A big chunk of code will already be typed, making follow-up changes easier.
  3. It will enable us to collaborate: my fork won't need to diverge further, and you will be able to land incremental changes to the internal modules.


Treora commented Dec 19, 2019

Is there a way we could collaborate? I do want to use freeze-dry for the work & that would require things like the changes in this pull request & more. However, I am not sure at this point what the best strategy would be to go about it without diverging further.

I do not know either what may be the best strategy. I suppose it helps if we exchange more details about our plans; and if we try to explain changes we make/need so they could be redone (‘rebased’, conceptually) if needed.

I am curious, what kind of changes do you expect the “& more” will contain?

Assuming you're in favor of the public API, & at the risk of sounding biased, I would suggest you consider landing this without rewriting and then following up with changing the internals to match the desired modularity.

I am considering this indeed, but need to look into the code again and fully understand it. As I intend to do a substantial rewrite, and I am already having trouble wrapping my head around some aspects, I suspect that for me it would be much easier and thus quicker to rewrite code I am most familiar with.

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one; I’d like to make the code more ‘graph-oriented’ (perhaps passing subgraphs of subresources&links around), and provide the ‘callback-oriented’ API you created only as a convenience function on top of that. Not sure that makes sense to you, nor if it will keep making sense to me as I dig deeper. :)


Treora commented Dec 19, 2019

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one

I may be inaccurate here, as you do leave the possibility to interact with the resource graph without using the resolveURL callback (as you exemplify above). I will look into this again and see how much that already matches what I had in mind.


Treora commented Dec 19, 2019

Also, I am still eager to make the code typed (see issue #28), but might
do that afterwards.

For what it's worth, I would advise going about it the other way round. The type system poses API constraints that could otherwise be overlooked. Additionally, the type checker is very helpful in a large refactoring, as it will identify places you've missed.

Good points. I was thinking that it may slow down rewriting, especially as I am not fluent with typed javascript. Could work out either way, I guess.

As for which type system/language to use; I think you once said you use flow for its more powerful templating system, but that typescript may be helpful because of its wider uptake, community, etc. How is your opinion on this now? I have been a little biased towards adopting typescript, but if you recommend flow I am fine with following your lead on this.


Gozala commented Dec 19, 2019

As for which type system/language to use; I think you once said you use flow for its more powerful templating system, but that typescript may be helpful because of its wider uptake, community, etc. How is your opinion on this now? I have been a little biased towards adopting typescript, but if you recommend flow I am fine with following your lead on this.

I still think flow has a better & more flexible type system; however, typescript is probably a better choice given its popularity.

The only argument I would consider is that with typescript you'll need to do builds etc. In that regard I think flow has an advantage, as you can add types as comments (like in this pull) and the code would load natively in runtimes that support ES modules.
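
For reference, the comment-based type syntax used in this pull looks roughly like this (the names are illustrative); the file stays plain JavaScript and needs no build step:

// Flow types written as comments: a type checker can verify them,
// yet the file runs unchanged in any runtime with ES module support.
export async function links(resource /*: DocumentResource */) /*: Promise<Link[]> */ {
  return [] // placeholder body, just to show the syntax
}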


Gozala commented Dec 19, 2019

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one
I may be inaccurate here, as you do leave the possibility to interact with the resource graph without using the resolveURL callback (as you exemplify above). I will look into this again and see how much that already matches what I had in mind.

Indeed you can traverse it as a graph, and resolveURL is just a convenience, kind of like array.reduce, but you don't have to use it.

That being said, I find it a fairly convenient way to avoid multiple passes over the graph.

I think what you're disliking about the API is a side effect of it being designed to traverse and bundle the graph in a single pass, which in turn requires pipelining the fetch, parse, remap, and save operations such that it does all of them as you descend.

I don't know if there's a better way to go about it, but my experience so far has been that it allows me to customize behavior very simply.


Gozala commented Dec 19, 2019

An alternative API could follow the visitor pattern, which accomplishes more or less the same goals but is more OOP-style.

The current solution allows you to use a push-based interface via the resolveURL hook (which is your visitor pattern, I suppose), but it also allows you to use a pull-based interface, as shown in the example in the description.
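
To make the push-based side concrete: the resolveURL hook from the description could be used along these lines (a sketch reusing the blobToDataURL idea from the example above, inlining every subresource as a data: URL):

// Push-based usage: freeze-dry asks resolveURL for the replacement URL of each
// resource, and here we answer by inlining its content.
const html = await freezeDry({
  ...opts,
  resolveURL: async (resource) => blobToDataURL(await resource.blob()),
}).text()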


Gozala commented Dec 19, 2019

I do not know either what may be the best strategy. I suppose it helps if we exchange more details about our plans; and if we try to explain changes we make/need so they could be redone (‘rebased’, conceptually) if needed.

I am curious, what kind of changes do you expect the “& more” will contain?

I do not know exactly yet, but here is a bit of an overview of the context in which I'm using freeze-dry right now: https://github.com/gozala/artifacts

The readme does not quite talk about it yet, but streamable web bundle formats are also on the todo list, meaning I'll need to start streaming the bundle as the document is being parsed and bundled, & ideally such that parallel requests to sub-resources could be streamed in parallel (not forcing them to queue up).

Another thing that I intend to change about the current architecture is the mutation of the source DOM tree that currently occurs during freeze-drying.

Finally, I also want to add a way to plug in things like CSS parsers, which are currently baked in.

Hope this gives enough context of what's on my list.

Treora mentioned this pull request on Dec 20, 2019

Treora commented Apr 12, 2020

Happy anniversary to this PR. 🎂 (just kidding, not proud of this..)

As discussed, I started with converting the project to typescript; took a while, but it’s pretty much done, see #52! (any comments there are welcome, it’s my first typescript project)

Next up is getting this functionality in, using this PR as reference but obviously writing everything anew.

Treora added a commit that referenced this pull request Jul 28, 2021
Combining iterators with recursion was never going to work, at least not
without ruining the elegance it was supposed to provide. Instead, use a
callback-based approach, where the callback is invoked for each link to
a subresource and is expected to recurse to handle sub-subresources etc.

The approach is similar to (and inspired by) gozala’s approach in PR #44;
however, it differs in several ways:
- It does not create ‘stub’ resources, because the Link is the item that
  is passed to the callback function.
- Fetching, recursing and updating the link target are explicitly
  performed by the callback function, which may thus require a few more
  lines, but less magic is happening. Also the Resource classes need not
  know anything about these activities, becoming more passive objects.
- Capturing the DOM state is still a separate step doing its own
  recursion, in order to make that happen synchronously. Might change in
  subsequent edits.
Treora closed this in #59 on Jul 15, 2022