Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QUESTION: Should the lib stop caring about encoding/cleaning/fixing content.data? #49

Open
pedrosanta opened this issue Mar 7, 2018 · 4 comments
Labels

Comments

@pedrosanta
Copy link
Collaborator

Hello,

Following up on @jamesporter #38, I had this question for a while: should it really be responsibility of the lib to be concerned with the data the lib users add on each content, namely on cleaning up/encoding content.data?

In the past there was an effort to make the lib produce valid version 2 and 3 EPUBs, but I feel that concern should only be present in the parts directly to do with the lib, and should be the responsibility of the lib user to provide valid EPUB 2/3 XHTML for their added contents.

This way the issue @jamesporter raised on #38 would not occur too.

So, I would vote to remove this verification/cleanup/encoding from the lib all together. I can still admit the warning about unsupported tags, but even so, I think it should warn but not alter the content user added to its EPUB.

@cyrilis @jamesporter thoughts?

@cyrilis
Copy link
Owner

cyrilis commented Mar 8, 2018

Hi, @pedrosanta
I agree with you on your opinion that lib user should be responsible for the content validation and we shouldn't alter the content just for making less warnings. I always considered it as a tool instead of a lib, which is not quite accurate, and sorry for recently change caused by this.

So I vote to remove verification/cleanup/encoding, too. lib user should make sure they passed "correct" content, which makes this lib more pure.

I guess we can even remove the cheerio dependency, right?

@pedrosanta
Copy link
Collaborator Author

pedrosanta commented Mar 8, 2018

Thanks for the reply @cyrilis, glad to see you agree. 🙂 Just merged #51 which removes the entities encoding of content.data (because it was wrong and causing all sorts of issues) which goes along this line.

I want to think a bit more about your comment and reply a bit more in detail soon. But yhea, I would say we could deepen/simplify how we're processing the content.data, like, namely on that allowed tags part and all that, I think that could be rethought – although I feel warning for possible validation issues to be a plus, I would say we shouldn't alter/try to guess what the correct contents would be and I also feel we could do it in a more robust way if that's so, probably using a more complete library or so to avoid situations where we miss certain tags/etc like #34.

@pedrosanta
Copy link
Collaborator Author

@cyrilis:

I agree with you on your opinion that lib user should be responsible for the content validation and we shouldn't alter the content just for making less warnings.

Yes, even the allowed attributes and allowed tags mechanism cleanup, I wonder if isn't better to simplify and remove that.

As in terms of validation, I think the lib should at most warn about invalid contents, but like we said above, not alter the contents. But in that case, I think it could be better to look out for another lib that gives us a more robust/complete validation instead of keeping a manual array of allowed tags and attributes. 🙂

I always considered it as a tool instead of a lib, which is not quite accurate, and sorry for recently change caused by this.

It's cool, we're all here to help/contribute. #praiseOSS 😎 🙌

Anyway, I don't think those two are like, mutually exclusive. Like, with a bit of an effort we could pick the idea suggested on #13 and add a CLI tool to this as well, while allowing it to work as a lib too. Just a thought.

So I vote to remove verification/cleanup/encoding, too. lib user should make sure they passed "correct" content, which makes this lib more pure.

Ok, so, I already removed the entities encoding part on #51 but there's still some code to check and clean the contents (namely the allowed attributes/tags part, etc), but yes, I agree we could simplify this – making the lib leaner/pure like you said.

I'll draft a PR removing/simplifying this as soon as I can. 🙂

I guess we can even remove the cheerio dependency, right?

We can... 🤔 (and rely on Regexes, or so)... but, it does come handy to parse all the image URLs/src's. IDK, perhaps we can keep it for now – basically for reason mentioned before – and reassess its need later on, WDYT?

Cheers, thanks for the comments @cyrilis! 🙌

@cyrilis
Copy link
Owner

cyrilis commented Mar 26, 2018

@pedrosanta
Thanks for your reply, and thanks for your contribution. cheerio will still be in our dependency list.
Haha, we still need it to parse the HTMLs. And I think CLI tool is also a good idea, which could makes it more easy to use.

zbrongyaozb pushed a commit to zbrongyaozb/epub-gen that referenced this issue Aug 2, 2022
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants