Rake task will download entire file looking for webmention endpoint #12

Closed
0xdevalias opened this Issue Apr 17, 2016 · 6 comments

0xdevalias (Contributor) commented Apr 17, 2016

While processing URLs, the rake task will download the entire file while looking for the webmention endpoint. On most HTML pages this is fine, but it will also download arbitrary binary files (e.g. ZIP files, PDFs, ISOs) that are linked from posts.
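For context, the failure mode looks roughly like this (a minimal Ruby sketch with a hypothetical URL, not the actual rake task code): a plain GET buffers the whole response before any endpoint lookup can happen.

```ruby
require "net/http"

# Hypothetical binary link found in a post.
uri = URI("https://example.com/huge-download.iso")

# Net::HTTP.get buffers the entire body in memory before returning,
# so the full ISO is transferred just to search for an endpoint.
body = Net::HTTP.get(uri)
puts body.bytesize
```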

aarongustafson (Owner) commented Jul 14, 2016

Interesting. Any thoughts on how to avoid this? I suppose we could look at the MIME type in the response headers, but the file would still be downloaded. I could do a HEAD request first, but then I would have to request the HTML file separately as well, adding time to the process. Another option would be to parse the URL and try to guess, but that’s also fraught with potential issues (assumptions, you know).

I’d love your thoughts on addressing this.
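A minimal sketch of the HEAD-first idea (hypothetical helper name, not the plugin's actual code): ask for headers only, and skip discovery unless the resource looks like HTML. It costs one extra round trip for HTML pages but never transfers a binary body.

```ruby
require "net/http"

# Hypothetical helper: HEAD the URL and check the Content-Type response
# header before committing to a full GET.
def html_resource?(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    response = http.head(uri.request_uri)
    response["Content-Type"].to_s.start_with?("text/html")
  end
end

html_resource?("https://example.com/post.html")  # => true for an HTML page
html_resource?("https://example.com/big.iso")    # => false, skip discovery
```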

0xdevalias (Contributor) commented Jul 27, 2016

My original thought was the HEAD approach, but the extra requests could be a pain.

The other idea I had was an arbitrary cutoff point: since the webmention endpoint should be in the head tags (from memory), it's unlikely the head would extend beyond 1/10/100 KB, for example. So while not an ideal solution, it would prevent trying to pull down hundreds of MBs of ZIPs/ISOs/etc.

While it's somewhat more annoying, I think the best solution might actually combine all three: a size cutoff point (configurable, with a good default), a whitelist approach to file extensions (not ideal, but it could catch some cases; it might become obsolete with the file size limit though?), and then a HEAD request for anything that doesn't match the whitelist. Or something to that effect.
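A sketch of how those three pieces could combine (all names and the 100 KB default are hypothetical, and the extension check is written as a skip-list of obvious binaries, which is one reading of the whitelist idea):

```ruby
require "net/http"

SKIP_EXTENSIONS = %w[.zip .iso .pdf .dmg .gz].freeze # hypothetical skip-list
MAX_BYTES       = 100 * 1024                         # hypothetical 100 KB default

class SizeCutoff < StandardError; end

# Skip obvious binaries by extension, stream everything else, and abort
# the transfer once the cutoff is hit. Since the endpoint should live in
# <head>, the first chunk of the document is normally enough.
def fetch_capped(url, limit: MAX_BYTES)
  uri = URI(url)
  return nil if SKIP_EXTENSIONS.include?(File.extname(uri.path).downcase)

  body = String.new
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      response.read_body do |chunk|
        body << chunk
        raise SizeCutoff if body.bytesize >= limit
      end
    end
  end
  body
rescue SizeCutoff
  body # raising out of the session closes the socket, ending the download
end
```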

aarongustafson (Owner) commented Jun 21, 2017

0xdevalias (Contributor) commented Jun 21, 2017

That sounds awesome. I'm all for using standardised libraries where possible.

> This is accomplished by doing a HEAD request and looking at the headers for supported servers; if none are found, it then searches the body of the page.

I'd have to check what they do for large files, but if it's still an issue it's probably better fixed as a PR there anyway, to help more people :)
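For reference, the discovery order described there looks roughly like this (a sketch of the spec'd algorithm, not the gem's internals; using Nokogiri for the HTML fallback is an assumption, and redirects and relative endpoint URLs are omitted):

```ruby
require "net/http"
require "nokogiri" # assumption: Nokogiri for parsing the HTML fallback

# Try the Link response header first, then fall back to parsing the body
# for a rel="webmention" <link> or <a> element.
def discover_endpoint(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    # 1. Link: <https://example.com/webmention>; rel="webmention"
    head = http.head(uri.request_uri)
    if (link = head["Link"]) && link =~ /<([^>]+)>\s*;\s*rel="?[^"]*\bwebmention\b/
      return Regexp.last_match(1)
    end

    # 2. <link rel="webmention" href="..."> or <a rel="webmention" href="...">
    doc = Nokogiri::HTML(http.get(uri.request_uri).body)
    node = doc.at_css('link[rel~="webmention"], a[rel~="webmention"]')
    node && node["href"]
  end
end
```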

aarongustafson (Owner) commented Jul 5, 2017

I’ve implemented the webmention gem lookup in 2.3. It goes for the response headers first, then grabs the document and selects the appropriate link element.

0xdevalias (Contributor) commented Jul 5, 2017

Sweet :)
