Add xhtml inline png image support #35

kippr · 2015-11-06T12:53:00Z

Images are already present in the Document model when read from RTF.
This change converts any PNG images found into <img> tags with inline
base64-encoded data URIs. Other images (for example WMF alternatives
and JPEGs) are ignored.

Previous behaviour was to write out the hex-encoded image string.
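
To illustrate the conversion described above, here is a minimal sketch of turning a hex-encoded PNG string (as RTF stores picture data) into an <img> tag with a base64 data URI. The function name `hex_png_to_img_tag` and its parameters are hypothetical and are not taken from this pull request; the actual writer code may be structured differently.

```python
import base64
import binascii

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def hex_png_to_img_tag(hex_data, width=None, height=None):
    """Return an <img> tag with an inline data URI, or None if not a PNG.

    hex_data is the hex-encoded picture payload; whitespace is tolerated.
    (Hypothetical helper, for illustration only.)
    """
    raw = binascii.unhexlify("".join(hex_data.split()))
    if not raw.startswith(PNG_SIGNATURE):
        # Non-PNG images (WMF, JPEG, ...) are ignored.
        return None
    encoded = base64.b64encode(raw).decode("ascii")
    attrs = ['src="data:image/png;base64,%s"' % encoded]
    if width is not None:
        attrs.append('width="%d"' % width)
    if height is not None:
        attrs.append('height="%d"' % height)
    return "<img %s />" % " ".join(attrs)
```

Returning None for anything that does not carry the PNG signature mirrors the "other images are ignored" behaviour; a caller would simply skip such images.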

I needed this behaviour and hope it might be useful to others as well. Inline data images are widely supported now (http://caniuse.com/#feat=datauri), so I think this is a reasonable way to handle RTF to HTML conversion.

Thanks

The prior commit took images read from RTF and wrote them to XHTML as inline PNG images. This completes the reverse: inline PNG images in XHTML are read into the Document model, and writing those Documents to RTF now includes the inline image. Width/height attributes are also converted, assuming a standard conversion of 15 twips per pixel. Only PNG images are supported.
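
The 15 twips per pixel figure follows from 1440 twips per inch at an assumed 96 pixels per inch. A small sketch of the conversion, with hypothetical helper names that are not taken from the commit:

```python
# 1440 twips per inch / 96 pixels per inch = 15 twips per pixel.
# The 96 dpi display resolution is an assumption behind the "standard" factor.
TWIPS_PER_PIXEL = 15


def pixels_to_twips(pixels):
    """Convert an XHTML width/height in pixels to RTF twips."""
    return int(round(pixels * TWIPS_PER_PIXEL))


def twips_to_pixels(twips):
    """Convert an RTF dimension in twips back to whole pixels."""
    return int(round(twips / float(TWIPS_PER_PIXEL)))
```

For example, a 100 pixel wide image would be written as 1500 twips.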

Comment regarding ~, - and _ in commit c72d457 suggests dropping them was intended behavior, but instead they are included as text output. Spec at http://www.biblioscape.com/rtf15_spec.htm:
\~: Nonbreaking space.
\-: Optional hyphen.
\_: Nonbreaking hyphen.
A future extension might be to extend the Document model to represent these, and then let writers decide whether to include them (e.g. as &nbsp; in XHTML).
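
As a rough sketch of what that future extension could look like, the three control symbols map naturally onto Unicode characters. The names below are hypothetical and nothing like this exists in the code touched by this pull request:

```python
# Hypothetical mapping from RTF control symbols to Unicode equivalents.
RTF_CONTROL_SYMBOLS = {
    "~": u"\u00a0",  # \~ nonbreaking space
    "-": u"\u00ad",  # \- optional (soft) hyphen
    "_": u"\u2011",  # \_ nonbreaking hyphen
}


def control_symbol_to_text(symbol, keep=True):
    """Return the text for an RTF control symbol such as '~'.

    With keep=False the symbol is dropped (empty string), so a writer
    can choose whether to include these characters at all.
    """
    if not keep:
        return u""
    return RTF_CONTROL_SYMBOLS.get(symbol, u"")
```

An XHTML writer could then emit U+00A0 as &nbsp;, while a plaintext writer might call the helper with keep=False.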

Also confirmed round trip from XHTML to RTF with tests:
- Checking that RTF reads underlining markup into Document
- Checking that RTF writes underline formatting
- Checking that XHTML reads u tags or css underline styling into Document
- Checking that XHTML writes u tags

Previously, sublists were always added to their own li element, but this renders as double bullets in HTML:
* Top level
* - Sub list item

Now we add the nested ul directly to the prior non-list flow item (Top level para in example above), which gives expected single-bullet nesting:
* Top level
  - Sub list item

Kris Powell and others added 14 commits November 6, 2015 12:40

Fix nested lists rtf15.reader bug

6ab8375

Previously, in RTF documents containing nested lists that 'ended' on a nested item, the outermost item would be added to the list above it, but that list would never be added to the lists/doc above it, so it would get dropped.

Add forgotten rtf example for test

ae8c813

Use sub/ super xhtml tags for super/subscript text

5db24d5

As per http://www.w3.org/TR/xhtml1/dtds.html#dtdentry_xhtml1-strict.dtd_sub this is the recommended way.

Treat xhtml ol as ul

4171d29

For now it's better to parse HTML ordered lists as unordered lists than to create invalid document structures that crash parsing (ListItems directly under Paras because ol is ignored).

Fix for RTF documents that open with a list

4966573

Found plenty of examples of these in the wild. This fix adds a para up front but doesn't add it to the list stack, so we also hold back the last pop of the list stack when unwinding lists, because there is no final holding paragraph.

Workaround for non-utf chars embedded in charBuffer

8116549

Encode control chars {} and \ when writing RTF

5abf0e1

Currently these characters get written out verbatim to the RTF stream, rendering the result invalid. Instead they should be escaped with a leading backslash.
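
A minimal sketch of the escaping described above, assuming a standalone helper (the name `escape_rtf_text` is hypothetical; the actual writer may do this inline):

```python
def escape_rtf_text(text):
    """Prefix RTF control characters with a backslash.

    '{' and '}' delimit groups and a backslash starts a control word,
    so all three must be escaped when they occur in plain text.
    """
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)
```

Handling all three characters in a single pass avoids escaping the backslashes that the escaping itself introduces.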

Don't 'double escape' html entities when writing

fd95cbf

If HTML entities were escaped when converting from HTML to whatever other format, don't escape the ampersands in them again on the way out from whatever format back to HTML.
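
One way to avoid the double escaping is to escape only ampersands that do not already start an entity. The sketch below is an assumption about the approach, not the code from the commit, and `escape_ampersands` is a hypothetical name:

```python
import re

# Matches '&' unless it begins a named (&amp;), decimal (&#169;)
# or hexadecimal (&#xA9;) entity reference.
_BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_ampersands(text):
    """Escape bare ampersands, leaving existing entities untouched."""
    return _BARE_AMPERSAND.sub("&amp;", text)
```

The trade-off is that literal text which merely looks like an entity (for example "AT&T;") is left unescaped.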

Ignore images when writing to plaintext

79d0dd1
