Add xhtml inline png image support #35

Open
wants to merge 14 commits into
base: master

@kippr kippr commented Nov 6, 2015

Images are already present in the Document model when read from RTF.
This change converts any PNG images found into img tags with inline
base64-encoded data URIs. Other images (for example WMF alternatives
and JPEGs) are ignored.

Previous behaviour was to write out the hex-encoded image string.
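The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `hex_png_to_img_tag` is a hypothetical helper, and it assumes the image arrives as the hex string that RTF stores in its picture group.

```python
import base64
import binascii

# Every PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def hex_png_to_img_tag(hex_data):
    """Return an <img> tag with an inline data URI, or None if not a PNG.

    hex_data is assumed to be the hex-encoded image string from the RTF
    source; embedded whitespace and line breaks are tolerated.
    """
    raw = binascii.unhexlify("".join(hex_data.split()))
    if not raw.startswith(PNG_SIGNATURE):
        # Other formats (WMF, JPEG, ...) are ignored, as described above.
        return None
    encoded = base64.b64encode(raw).decode("ascii")
    return '<img src="data:image/png;base64,%s" />' % encoded
```

Checking the signature rather than trusting a declared type is a design choice of this sketch only; the actual change may rely on the RTF picture type keyword instead.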

I needed this behaviour and hope it might be useful to others as well. Inline data images are widely supported now (http://caniuse.com/#feat=datauri), so I think this is a reasonable way to handle RTF to HTML conversion.

Thanks

Kris Powell and others added 14 commits November 6, 2015 12:40
The prior commit took images read from RTF and wrote them to XHTML as inline PNG images.
This completes the reverse: inline PNG images in XHTML are read into the Document model,
and writing those Documents to RTF now includes the inline image.

Width and height attributes are also converted, assuming a standard conversion of 15 twips
per pixel.

Only PNG images are supported.
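The width and height conversion mentioned above amounts to a fixed scale factor. A minimal sketch, assuming the factor of 15 comes from 1440 twips per inch at 96 pixels per inch; the function names are hypothetical and not taken from this pull request:

```python
# 1440 twips per inch / 96 pixels per inch = 15 twips per pixel (assumed).
TWIPS_PER_PIXEL = 15


def pixels_to_twips(pixels):
    """Convert an HTML width/height in pixels to RTF twips."""
    return int(round(pixels * TWIPS_PER_PIXEL))


def twips_to_pixels(twips):
    """Convert an RTF dimension in twips back to whole pixels."""
    return int(round(twips / TWIPS_PER_PIXEL))
```

For example, `pixels_to_twips(100)` gives 1500, and `twips_to_pixels(1500)` gives 100 again.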
The comment regarding ~, - and _ in commit c72d457 suggests that dropping them was the
intended behavior, but instead they are included as text output.

Spec at http://www.biblioscape.com/rtf15_spec.htm:
\~: Nonbreaking space.
\-: Optional hyphen.
\_: Nonbreaking hyphen.

A future extension might be to extend Document to represent these, and then
let writers decide whether they want to include them or not (e.g. as &nbsp;
in XHTML).
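One way the future extension above could look is a lookup from each control symbol to a Unicode character, with the writer choosing whether to keep it. This is only a sketch of the idea: `CONTROL_SYMBOLS` and `control_symbol_text` are hypothetical names, and the Unicode equivalents are my own choice, not something the commit defines.

```python
# Hypothetical mapping from RTF control symbols to Unicode equivalents.
CONTROL_SYMBOLS = {
    "~": "\u00a0",  # \~ nonbreaking space  -> NO-BREAK SPACE
    "-": "\u00ad",  # \- optional hyphen    -> SOFT HYPHEN
    "_": "\u2011",  # \_ nonbreaking hyphen -> NON-BREAKING HYPHEN
}


def control_symbol_text(symbol, keep=True):
    """Return the text for a control symbol, or "" when it is dropped.

    keep=False models a writer that discards these symbols entirely;
    unknown symbols also yield the empty string.
    """
    if not keep:
        return ""
    return CONTROL_SYMBOLS.get(symbol, "")
```

An XHTML writer could then emit the no-break space as `&nbsp;`, while a plain-text writer might pass `keep=False`.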
Previously, in RTF documents containing nested lists that 'ended' on a nested
item, the outermost item would be added to the list above it, but that list
would never be added to the list or document above it, so it would get dropped.
Also confirmed round trip from XHTML to RTF with tests:
 - Checking that RTF reads underlining markup into Document
 - Checking that RTF writes underline formatting
 - Checking that XHTML reads u tags or css underline styling into Document
 - Checking that XHTML writes u tags
For now it's better to parse HTML ordered lists as unordered lists rather than creating invalid document structures that crash
parsing (ListItems directly under Paras because ol is ignored).
Found plenty of examples of these in the wild. This fix adds a para up front but doesn't add it to the list stack,
so we also hold on the last pop of the list stack when unwinding lists, because there is no final holding paragraph.
Previously, sublists were always added to their own li element, but this renders as double bullets in HTML:

 * Top level
 * - Sub list item

Now we add the nested ul directly to the prior non-list flow item (the Top level para in the example above), which gives
the expected single-bullet nesting:

 * Top level
   - Sub list item
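The single-bullet nesting shown above can be illustrated with a small renderer that places the nested ul inside the parent item's li instead of in an li of its own. This is a sketch only: `render_list` and the `(text, children)` tuple shape are hypothetical and are not the library's writer API.

```python
from html import escape


def render_list(items):
    """Render nested (text, children) tuples as HTML unordered lists.

    The nested <ul> is emitted inside the same <li> as its parent text,
    so the sub list does not get a bullet of its own.
    """
    parts = ["<ul>"]
    for text, children in items:
        parts.append("<li>" + escape(text))
        if children:
            parts.append(render_list(children))
        parts.append("</li>")
    parts.append("</ul>")
    return "".join(parts)
```

The double-bullet output described earlier corresponds to closing the parent li first and then wrapping the nested ul in a fresh li.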
Currently these characters get written out verbatim to the RTF stream, rendering the result invalid.
Instead they should be escaped with a leading backslash.
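The commit text does not name the characters. The sketch below assumes they are the three characters with special meaning in RTF (backslash and the two braces); `escape_rtf` is a hypothetical helper, not the function changed in this pull request.

```python
import re

# Assumption: the characters in question are \ { and }, which RTF uses
# for control words and group delimiters.
_RTF_SPECIAL = re.compile(r"([\\{}])")


def escape_rtf(text):
    """Prefix each RTF special character with a backslash."""
    return _RTF_SPECIAL.sub(r"\\\1", text)
```

With this, the text `a{b}` would be written to the stream as `a\{b\}`.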
If HTML entities were escaped when converting from HTML to whatever other format,
don't escape the ampersands in them again on the way out from whatever format back
to HTML.
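A common way to avoid the double escaping described above is to escape only ampersands that do not already start an entity. The following is a minimal sketch under that assumption; `escape_ampersands` is a hypothetical name, and the actual commit may handle this differently.

```python
import re

# Match an ampersand that is NOT the start of a named (&amp;), decimal
# (&#169;) or hexadecimal (&#xA9;) character reference.
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_ampersands(text):
    """Escape bare ampersands while leaving existing entities untouched."""
    return _BARE_AMP.sub("&amp;", text)
```

Note that this treats anything shaped like an entity as already escaped, so literal text such as `&foo;` would pass through unchanged.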