Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve Relative URIs #104

Closed
jangernert opened this issue Mar 21, 2021 · 12 comments · Fixed by #107
Closed

Resolve Relative URIs #104

jangernert opened this issue Mar 21, 2021 · 12 comments · Fixed by #107

Comments

@jangernert
Copy link
Contributor

Some feeds provide relative URIs for things like enclosures. I recently implemented some basic code to resolve these URIs in my Feed Reader based on feed-rs. The bug report I got was refreshingly detailed and helpful. I got linked a step by step description how a big python feed parser handles this problem:

https://pythonhosted.org/feedparser/resolving-relative-links.html#how-relative-uris-are-resolved

But since the first few steps rely on knowledge of the feeds XML they probably should be implemented in feed-rs. Later steps which use fields of the HTTP header however can't be implemented in feed-rs and need to be handled in the library/program that makes use of it.

What is your opinion on the issue?

If URIs get resolved but some of them fail because there is not enough information in the XML itself should the resulting Link struct indicate that it is a partial URL? Or should the calling library just watch for url::ParseError::RelativeUrlWithoutBase?

@markpritchard
Copy link
Contributor

Hi @jangernert , that is an interesting use case.

One of the principles I wanted to follow with feed-rs is to perform standard transformations/processing on behalf of users. e.g. all formats should be represented by the same data model. It also extends to implementing aspects of specifications that everyone would also have to implement to be "correct" e.g. propagating top-level MediaRSS properties down through the tree.

So this functionality definitely falls within feed-rs scope. Certainly the base functionality, but I also think parsing it relative to a source URL could be interesting too (e.g. another parse() that takes the base URL as well). Do you have a link to the bug report agaonst your feed reader? Would be great to get an example of the feed in question etc.

@markpritchard
Copy link
Contributor

Unrelated to this specific issue, I'm slowly chipping away at iTunes support which does change the way feed-rs will populate its model, so I'll release that as 0.6. Would be good to include something like this in 0.6 as well given its a change to the way URLs will be constructed.

@jangernert
Copy link
Contributor Author

The provided feed is just a minimal example, so I'm not sure how useful it is for you: https://gitlab.com/news-flash/news_flash_gtk/-/issues/206

@jangernert
Copy link
Contributor Author

jangernert commented Mar 22, 2021

At least this one should not be handled by feed-rs

If no xml:base is specified, the feed has a default base URI defined in the Content-Location HTTP header.

as it would require another HTTP request.

edit: it could make sense to pass the content-location as an option if that is what you meant.

@jangernert
Copy link
Contributor Author

In addition to enclosures xml:base is also used for links in html content: https://gitlab.com/news-flash/news_flash_gtk/-/issues/212 (example feed: https://insanity.industries/index.xml)

This seems a bit more tricky. I don't see a way around parsing the html and then parsing all the links looking for url::ParseError::RelativeUrlWithoutBase.

@markpritchard
Copy link
Contributor

That is unfortunate.

I'm comfortable switching all the URLs in the RSS/Atom content to absolute by applying xml:base but I wouldn't want to add an HTML parser to feed-rs by default. Might be worth playing around with as a feature (I've never done that in Rust ... might be interesting to learn).

@jangernert
Copy link
Contributor Author

jangernert commented Mar 30, 2021

Alternatively you could provide the base URL as part of the Content model. WebkitGtk can load HTML together with a base URL and will complete relative URLs. That would be good enough for me.

Regarding HTML parsing in rust: I was indirectly using html5ever. But at least for my use case (html2text) it was quite slow compared to good old libxml2.

@jangernert
Copy link
Contributor Author

Awesome!

Could you provide a short summary of what went into #107 ? The description speak of content:encoded but so far I haven't found the part of the code that handles it. And is it behind a feature gate as you intended? (also haven't found anything related to it)

I'm assuming content-location header as base is still up to the user of feed-rs?

@markpritchard
Copy link
Contributor

Hah, you beat me to it.

There is a new method, parse_with_uri that takes an optional uri .
If that uri is provided it is used as the default base for all links, but is overridden by the xml:base spec

e.g. if you had a feed retrieved from http://example.com/feed/rss.xml you'd pass this to parse_with_uri. It will be attached to any Content items, used for Links etc. Should any of the entries in that feed also include xml:base attributes they will be used in preference. The parser maintains a stack of base URIs as it walks through the tree.

There are a couple of test cases that might make it more clear in feed-rs/src/xml/tests.rs

@markpritchard
Copy link
Contributor

markpritchard commented Apr 10, 2021

Oh, and the test cases you provided in the original NewsFlash issue are in the RSS2 test suite as well.

@markpritchard
Copy link
Contributor

markpritchard commented Apr 10, 2021

And is it behind a feature gate as you intended? (also haven't found anything related to it)

I didn't worry about it as the additional dependency on the url crate wasn't too invasive.

I'm assuming content-location header as base is still up to the user of feed-rs?

Yes, correct - you'd pass that as the uri param to parse_with_uri. If you don't have that, you could pass the URL for the feed itself.

@jangernert
Copy link
Contributor Author

Thanks! I'll make use of the new version as soon as possible :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants