Wrong encoding in HTML output of references #919

Closed
asmaier opened this Issue Jan 6, 2018 · 5 comments


asmaier commented Jan 6, 2018

I use the following test documents,
test.tex:

\documentclass[a4paper,12pt,twoside,openright]{book}
\usepackage[english]{babel}		
\usepackage[utf8]{inputenc} 
\usepackage{natbib}
\begin{document}
\tableofcontents
\chapter{Präfatßio}
Löräm ipßum dolor ßit amet, conßäctetur adipisiki älit, sed äüsmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exärcitation ullamco laboris nisi ut aliquid ex ea commodi conßequat. Quis aüte iüre reprähenderit in voluptate velit eße cillüm dolore äu fugiat nulla pariatur. Exceptäür sint öbcäcat cupiditat nön proident, ßünt in kulpa kwi offitßia deserunt mollit anim id äßt laborum \cite{Wuersst1887}.
\nocite{*}
\bibliographystyle{apalike}
\bibliography{test}
\end{document}

and test.bib:

@BOOK{Wuersst1887,
  title = {{The german umlauts - A pain in the ähß}},
  publisher = {Springer, Berlin},
  year = {1887},
  author = {Hantß Würßt},
  note = {Deutsche Ausgabe: Würßt, Die deutschen Umlaute ÄÖÜß, Springer, Berlin 1871},
  owner = {asmaier},
}

On Mac OS X with BasicTeX and LaTeXML 0.8.2, I do the following:

$ latexml test.tex > test.xml
$ iconv -f iso-8859-1 -t utf-8 test.xml > testutf8.xml   # Necessary, because of problem #917
$ latexmlpost  --dest=test.html testutf8.xml

When I open test.html in Chrome I see the following:
[Screenshot: latexmlumlauts]
Note the encoding issues in the reference to the book and also in the author and title of the bibliography entry.
The bibliography file test.bib is stored as utf-8:

$ file test.bib 
test.bib: BibTeX text file, UTF-8 Unicode text

The corresponding PDF output of latex and bibtex doesn't show these issues, so I guess this could be an encoding issue in the BibTeX processing of latexmlpost.

Owner

brucemiller commented Jan 6, 2018

Essentially, this is the same error as #918.

dginev added this to the LaTeXML-0.8.4 milestone Jan 6, 2018

Owner

brucemiller commented Jan 6, 2018

Oh, maybe not a dup, after all; somehow the encoding of the bib file is not being seen correctly.

Collaborator

dginev commented Jan 7, 2018

I did some digging today, and this bug is a good old friend - we are looking at bibliography fields which have undergone two UTF-8 encoding passes, rather than one, hence the garbled content.

From as early as getting read in by Mouth::file, the bibliography is encoded as UTF-8, and the derivative fields and strings remain so encoded until the very end of the processing, where I believe the final XML serialization performs a second encoding pass before writing to the file.
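To make the double-encoding pattern concrete, here is a minimal Python sketch. It is not LaTeXML code (LaTeXML is written in Perl); the function names `double_encode` and `repair` are made up for illustration, and the sketch assumes that the second pass treats each UTF-8 byte as one Latin-1 character before encoding again.

```python
def double_encode(text: str) -> bytes:
    """Apply two UTF-8 encoding passes, as described in the comment above."""
    once = text.encode("utf-8")        # first pass: characters -> UTF-8 bytes
    # Assumed bug pattern: the bytes are mistaken for one-character-per-byte
    # text (Latin-1) instead of being decoded as UTF-8 ...
    misread = once.decode("latin-1")
    return misread.encode("utf-8")     # ... and are then encoded a second time


def repair(garbled: bytes) -> str:
    """Undo exactly one superfluous encoding pass."""
    return garbled.decode("utf-8").encode("latin-1").decode("utf-8")


# Byte-level view of a single umlaut, before and after the second pass.
print("ü".encode("utf-8"))   # b'\xc3\xbc'
print(double_encode("ü"))    # b'\xc3\x83\xc2\xbc'
```

A viewer that decodes the doubly encoded bytes as UTF-8 shows two characters ("Ã¼") in place of "ü", which matches the kind of garbling reported for the author and title fields.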

You can reproduce the bad encoding by simply running:

latexml test.bib

which prints doubly encoded unicode. Likewise, you can confirm that the input encoding is not handled correctly, since the alternative invocation:

latexml test.bib --inputencoding=utf8

creates the correct XML output, as the bibliography strings are first decoded to native Perl strings. I wonder if it may be worth it to try and auto-detect unicode inputs and decode them by default, as this could also lead to subtle bugs in regex matching down the road.

All that said, the converter used in MakeBibliography isn't connected to the --inputencoding parameter in any of the executables, so that warrants a separate upgrade, after we decide on a solution for the basic latexml call.

Collaborator

dginev commented Jan 7, 2018

latexml test.bib
which prints doubly encoded unicode.

Ah, good to clarify - that's on my #920 branch. On master that command will output the correct unicode, because we never encode the output for STDOUT. So the bug in #918 is distinct and in fact concealed the double-encoding bug here - which can be seen if you instead write to a file on the master branch via:

latexml test.bib --dest=test.xml

which is double-encoded, while the explicit encoding fixes things:

latexml test.bib --dest=test.xml --inputencoding=utf8

Owner

brucemiller commented Feb 13, 2018

So, really what's going on is that BibTeX just passes through whatever bytes it got, other than the parts it specifically recognizes. Its output is expected to be processed within a LaTeX document that has loaded whatever packages are needed, in particular inputenc (usually bib files are written without many expectations). Since we're processing the bibliography for a particular document, which has the loaded packages recorded in the processing instructions, all we have to do is fetch that list and preload them. I've only done that for inputenc, atm. Of course, when converting a bibliography standalone, you'll have to preload whatever is necessary. But for the case at hand, this seems to work just fine.
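The idea of fetching the package list from the document's processing instructions can be sketched as follows. This is an illustrative Python sketch, not the actual fix (which lives in LaTeXML's Perl code). The sample XML and the `package="..." options="..."` layout of the `latexml` processing instructions are assumptions made for the example, and `loaded_packages` and `inputenc_options` are hypothetical helper names.

```python
import re
from xml.dom import minidom

# Assumed shape of the processing instructions in the document's XML.
SAMPLE = """<?latexml class="book" options="a4paper,12pt,twoside,openright"?>
<?latexml package="babel" options="english"?>
<?latexml package="inputenc" options="utf8"?>
<?latexml package="natbib"?>
<document/>
"""

# Processing-instruction data is free text, so pseudo-attributes are parsed by hand.
PSEUDO_ATTR = re.compile(r'(\w+)="([^"]*)"')


def loaded_packages(xml_text: str) -> list:
    """Return (package, options) pairs from top-level latexml instructions."""
    doc = minidom.parseString(xml_text)
    packages = []
    for node in doc.childNodes:
        if node.nodeType != node.PROCESSING_INSTRUCTION_NODE:
            continue
        if node.target != "latexml":
            continue
        attrs = dict(PSEUDO_ATTR.findall(node.data))
        if "package" in attrs:
            packages.append((attrs["package"], attrs.get("options", "")))
    return packages


def inputenc_options(xml_text: str):
    """Return the options recorded for inputenc, or None if it was not loaded."""
    for package, options in loaded_packages(xml_text):
        if package == "inputenc":
            return options
    return None


print(inputenc_options(SAMPLE))
```

With the recorded options in hand, a bibliography converter could preload inputenc with the same encoding the main document used, which is the approach the comment above describes.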

Thanks for the report!

dginev modified the milestones: LaTeXML-0.8.4, LaTeXML-0.8.3 Feb 25, 2018
