-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lazy crawler+bundler implementation #44
Conversation
Consult @Treora if it was intended to add CSP & momento data only on the top document. Implementation used to generate data URLs for all resources so it was not that useful there, however if resources are saved in separate files it might make sense to add metadata everywhere. At the moment we just do it on the top document to match assumbtions in tests.
Hmm... So it seems that current implementation of freeze-dry actually carries on crawling / inlining resources it's just it will also start serializing DOM so I would guess on timeout you will wind up with some resources inlined and others stay intact. There are also few TODOs to abort tasks on timeout and to do something about resources that have not being processed. I think it might be worth discussing what would be a good solution here, I don't think replicating current behavior would be ideal, although it is possible it would introduce some messiness to otherwise (IMO) well structured code. |
src/archive.js
Outdated
getDocInFrame?: (HTMLIFrameElement) => ?Document; | ||
fetch?: (Resource) => Promise<Response>; | ||
fetchResource?:Fetch; | ||
resolveURL?: (Resource) => Promise<string>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function takes resource and creates a URL for it
src/archiver.js
Outdated
} | ||
static async links(resource/*:DocumentResource*/) { | ||
const document = await resource.captureDocument() | ||
return extractLinksFromDom(document, {docUrl:resource.url}).map(link => makeLinkAbsolute(link, resource)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does links turning it into absolute make sense here ?
src/archiver.js
Outdated
try { | ||
const css = postcss.parse(source) | ||
const url = response.url || resource.link.absoluteTarget | ||
const links = extractLinksFromCss(css, url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to makeLinkAbsolute
here.
(Mostly noting this so I don't forget) I think I have an idea for a way of dealing with timeouts. We could use |
@Treora I'm having trouble upgrading I did notice that after update I'm also not sure where |
I have figured the source of the issue. It's related to node-fetch/node-fetch#392 (comment): New Unfortunately I can't see any way to resolve this without mass-forking the whole dependency chain in attempt to resolve this. |
jsdom was unfortunately unwilling to use the new Public blob API to read the content of the blob with promises. Maybe you have to use jsdom's own fetch implementation instead of node-fetch (to get there internal blob)? the other way maybe could be to get it as a arraybuffer and then construct a blob using jsdom if you really need a blob res.arrayBuffer().then(ab => {
new Blob([ab])
}) Or try to convince him to use the FileReader to allow reading jsdom like object with the public api - i have already tried and done my part |
I end up resorting to a horrible hack to get |
As you probably found out already, these are set by I have not fully grasped the issue you are dealing with (and would be very glad if I do not need to); but if you found some changes are required in the test setup etcetera, of course feel free to propose. I'm quite indifferent about what test frameworks and such we use; as long as not too much effort is wasted maintaining them. I hope the hacks can mostly be abstracted away from sight, and added in such a way that we can easily remove them if the problematic libraries will have solved issues underneath..
Sounds good. Also, something signal-based approach like AbortController seems a good API we could offer instead of the |
Problem is version of This means even if we managed to make Ideally differences here would be resolved across Sadly they monkeypatching can't "be out of sight" that is because |
@Treora I discovered one peculiar behavior as I was trying to change I would like to address this with an alternate design, where |
Yes, thus far it seemed best to keep things simple and leave deduplication to the fetch function: the user-supplied |
Indeed; this already PR is already so large that I suppose the most workable plan will be, when you feel you got to a somewhat stable solution you feel confident about, to try collaboratively review and incorporate various aspects of it in a few iterations. It is very helpful when any orthogonal changes are put into separate commits, or possibly in separate branches. |
Thanks for the clear explanation, that helps! |
I already feel like it's done & I can't really do any meaningful progress without a review. So I suggest we do that. |
Hi @Treora, Any chance we could figure out a way to get some of this work into the tree ? It looks like I'll be needing this in my future work and I would very much prefer not to go on with my own fork. |
I am sorry I ended up doing exactly what I intended to avoid:
post-poning this work for several months. However, getting back to work
on freeze-dry has been slowly rising up on my to-do list. Besides
getting this bundling customisability incorporated I figured it may be
worth calling it v1.0 and making some better documentation and publicity
for it (just a web-page, no road-side billboards). Your reminder may
just get me started on this, thanks for poking.
As I think I said before, I will probably not merge this PR as-is, but
rather take the architecture and API design you demonstrated here as a
guide and make changes in a few steps; while trying to keep the code a
bit more modular than this PR.
Also, I am still eager to make the code typed (see issue #28), but might
do that afterwards.
|
Is there way we could collaborate ? I do want to use freeze-dry for the work & that would require things like changes in this pull request & more. However I am not sure at this point what would be a best strategy to go about it without diverging further.
For what it's worth I would advice going about it other way round. Type system pose API constraints that otherwise could be overlooked. Additionally type checker is very helpful in large refactoring as it will identify places you've missed. |
Assuming you're in favor of public API & risk of sounding biased, I would suggest you to consider landing this without rewriting and then followup witch changing internals to match desired modularity. Benefits of this approach would be:
|
I do not know either what may be the best strategy. I suppose it helps if we exchange more details about our plans; and if we try to explain changes we make/need so they could be redone (‘rebased’, conceptually) if needed. I am curious, what kind of changes do you expect the “& more” will contain?
I am considering this indeed, but need to look into the code again and fully understand it. As I intend to do a substantial rewrite, and I am already having trouble wrapping my head around some aspects, I suspect that for me it would be much easier and thus quicker to rewrite code I am most familiar with. As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one; I’d like to make the code more ‘graph-oriented’ (perhaps passing subgraphs of subresources&links around), and provide the ‘callback-oriented’ API you created only as a convenience function on top of that. Not sure that makes sense to you, nor if it will keep making sense to me as I dig deeper. :) |
I may be inaccurate here, as you do leave the possibility to interact with the resource graph without using the |
Good points. I was thinking that it may slow down rewriting, especially as I am not fluent with typed javascript. Could work out either way, I guess. As for which type system/language to use; I think you once said you use flow for its more powerful templating system, but that typescript may be helpful because of its wider uptake, community, etc. How is your opinion on this now? I have been a little biased towards adopting typescript, but if you recommend flow I am fine with following your lead on this. |
I still think flow has a better & more flexible type system, however typescript is probably a better choice given it’s popularity. Only argument I would consider is that with typescript you’ll need to do builds etc... In that regard I think flow has an advantage as you can add types as comments (like in this pull) and code would load natively in runtimes that support ES modules. |
Indeed you can traverse it as a graph, and That being said I find it fairly convenient way to avoid multiple passes over the graph. I think what you’re disliking about the API is side effect of it being designed to traverse and bundle graph in a single pass, which in turn requires pipelining fetch, parse, remap, save operations such that as you descend it does all of them. I don’t know if there’s a better way to go about it, but my experience so far had being it allows me to customize behavior very simply |
Alternative API could follow visitor pattern which accomplishes more or less same goals, but is more OOP style. Current solution allows you to either use pushed base interface via |
I do not exactly know yet, but here is a bit of overview of the context I'm using freeze-dry right now https://github.com/gozala/artifacts Readme does not quite talks about it yet, but streamingable web bundle formats are also on todo, meaning I'll need to start stream bundle as document is being parsed and bundled & ideally such that parallel requests to sub-resources could be streamed in parallel (not forcing them to queue up). Another thing that I intend to change about current architecture is the source DOM tree mutation, that currently occurs during freeze-drying. Finally I want to also add a way to plug things like CSS parsers which currently are baked in. Hope this gives enough context of what's on my list. |
Happy anniversary to this PR. 🎂 (just kidding, not proud of this..) As discussed, I started with converting the project to typescript; took a while, but it’s pretty much done, see #52! (any comments there are welcome, it’s my first typescript project) Next up is getting this functionality in, using this PR as reference but obviously writing everything anew. |
Combining iterators with recursion was never going to work, at least not without ruining the elegance it was supposed to provide. Instead, use a callback-based approach, where the callback is invoked for each link to a subresource and is expected to recurse to handle sub-subresources etc. The approach is similar to (and inspired by) gozala’s approach in PR #44; however, it differs in several ways: - It does not create ‘stub’ resources, because the Link is the item that is passed to the callback function. - Fetching, recursing and updating the link target are explicitly performed by the callback function; which may thus require a few more lines, but less magic is happening. Also the Resource classes need not know anything about these activities, becoming a more passive object. - Capturing the DOM state is still a separate step doing its own recursion, in order to make that happen synchronously. Might change in subsequent edits.
Proposed changes introduce concept of
Resource
which on instantiation is just a pointer to the corresponding sub-resourcelink
(wherelink
is what it used to be). Each resource is a lazily freeze-dried on demand:resource.dowload():Promise<Response>
method that will useoptions.fetchResource
for IO and will cache response so next call will return same promise.resource.resources():Promise<Iterable<Resource>>
method that will internally usethis.download()
to get original resource and extract links wrapping them into newResource
instances (note that subresources returned will not have fetched anything yet).resource.text()
/resource.blob()
methods that would do all of the above and replace links with URLs returned byoptions.resolveURL(resource):Promise<string>
. Sinceoptions.resolveURL
is async it can recurse and inlineblobToDataURL(await resource.blob())
or return some URL without having to fetch actual resource.Primary API has also changed from
freezeDry(opts):Promise<string>
tofreezeDry(opts):Resource
there for to accomplish what old API did user will need toawait freezeDry(opts).text()
instead. This change also means user could doawait freezeDry(opts).blob()
or in more advanced use case choose to hand roll alternative serialization: