brucemiller / LaTeXML

Wrong encoding in HTML output of references #919

Closed
opened this issue Jan 6, 2018 · 5 comments
asmaier commented Jan 6, 2018

I use the following test documents, test.tex:

    \documentclass[a4paper,12pt,twoside,openright]{book}
    \usepackage[english]{babel}
    \usepackage[utf8]{inputenc}
    \usepackage{natbib}

    \begin{document}
    \tableofcontents
    \chapter{Präfatßio}
    Löräm ipßum dolor ßit amet, conßäctetur adipisiki älit, sed äüsmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exärcitation ullamco laboris nisi ut aliquid ex ea commodi conßequat. Quis aüte iüre reprähenderit in voluptate velit eße cillüm dolore äu fugiat nulla pariatur. Exceptäür sint öbcäcat cupiditat nön proident, ßünt in kulpa kwi offitßia deserunt mollit anim id äßt laborum \cite{Wuersst1887}.
    \nocite{*}
    \bibliographystyle{apalike}
    \bibliography{test}
    \end{document}

and test.bib:

    @BOOK{Wuersst1887,
      title = {{The german umlauts - A pain in the ähß}},
      publisher = {Springer, Berlin},
      year = {1887},
      author = {Hantß Würßt},
      note = {Deutsche Ausgabe: Würßt, Die deutschen Umlaute ÄÖÜß, Springer, Berlin 1871},
      owner = {asmaier},
    }

On Mac OS X with basictex and latexml 0.8.2 I do the following:

    $ latexml test.tex > test.xml
    $ iconv -f iso-8859-1 -t utf-8 test.xml > testutf8.xml  # Necessary, because of problem #917
    $ latexmlpost --dest=test.html testutf8.xml

When I open test.html in Chrome, I see encoding issues in the reference to the book and also in the author and title of the bibliography entry.

The bibliography file test.bib is stored as utf-8:

    $ file test.bib
    test.bib: BibTeX text file, UTF-8 Unicode text

The corresponding PDF output of latex and bibtex doesn't show these issues. So I guess this could be an encoding issue in the bibtex processing of latexmlpost.

brucemiller commented Jan 6, 2018

Essentially, this is the same error as #918.
added this to the LaTeXML-0.8.4 milestone Jan 6, 2018

brucemiller commented Jan 6, 2018

 Oh, maybe not a dup, after all; somehow the encoding of the bib file is not being seen correctly.

dginev commented Jan 7, 2018

I did some digging today, and this bug is a good old friend: we are looking at bibliography fields that have undergone two UTF-8 encoding passes rather than one, hence the garbled content. From as early as being read in by Mouth::file, the bibliography is encoded as UTF-8, and the derivative fields and strings remain so encoded until the very end of processing, where I believe the final XML serialization performs a second encoding pass before writing to the file.

You can reproduce the bad encoding by simply running:

    latexml test.bib

which prints doubly encoded unicode. Likewise, you can validate that the input encoding is not handled correctly, since the alternative invocation:

    latexml test.bib --inputencoding=utf8

creates the correct XML output, as the bibliography strings are first decoded to native Perl strings.

I wonder if it may be worth trying to auto-detect unicode inputs and decode them by default, since undecoded input could also lead to subtle bugs in regex matching down the road.

All that said, the converter used in MakeBibliography isn't connected to the --inputencoding parameter in any of the executables, so that warrants a separate upgrade after we decide on a solution for the basic latexml call.
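The double-encoding mechanism described in this comment can be reproduced outside LaTeXML. The following is a minimal Python sketch, not LaTeXML's actual Perl code. It assumes that undecoded UTF-8 bytes get treated as one character per byte (Latin-1 code points), which is how Perl handles byte strings; the function names are hypothetical.

```python
# Illustration only: mimics the effect of two UTF-8 encoding passes.
# Assumption: undecoded bytes are interpreted one character per byte
# (Latin-1), as happens to byte strings in Perl.

def serialize_without_decoding(raw: bytes) -> bytes:
    """Treat each input byte as a character, then encode as UTF-8 (the bug)."""
    chars = raw.decode("latin-1")   # bytes mistaken for characters
    return chars.encode("utf-8")    # second encoding pass


def serialize_with_decoding(raw: bytes) -> bytes:
    """Decode the input as UTF-8 first, then encode once on output."""
    chars = raw.decode("utf-8")     # proper decoding to a character string
    return chars.encode("utf-8")    # single encoding pass


author = "Hantß Würßt".encode("utf-8")  # bytes as stored in test.bib

garbled = serialize_without_decoding(author)
correct = serialize_with_decoding(author)

# ascii() escapes non-ASCII characters so the doubling is visible on any terminal.
print(ascii(garbled.decode("utf-8")))  # 'Hant\xc3\x9f W\xc3\xbcr\xc3\x9ft'
print(ascii(correct.decode("utf-8")))  # 'Hant\xdf W\xfcr\xdft'
```

Each non-ASCII character ends up as two characters in the garbled result, which matches the kind of corruption visible in the HTML bibliography entry.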

dginev commented Jan 7, 2018 • edited

> latexml test.bib
> which prints doubly encoded unicode.

Ah, good to clarify: that's on my #920 branch. On master that command will output the correct unicode, because we never encode the output for STDOUT. So the bug in #918 is distinct and in fact concealed the double-encoding bug here, which can be seen if you instead write to a file on the master branch via:

    latexml test.bib --dest=test.xml

which is double-encoded, while the explicit encoding fixes things:

    latexml test.bib --dest=test.xml --inputencoding=utf8
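For a file that has already been written double-encoded, the extra pass can in principle be reversed after the fact. Here is a hedged Python sketch under the same Latin-1 assumption (each original byte was re-encoded as a single character); `undo_double_encoding` is a hypothetical helper, not part of LaTeXML, and re-running with --inputencoding=utf8 remains the cleaner fix.

```python
def undo_double_encoding(data: bytes) -> bytes:
    """Reverse one extra UTF-8 pass; return the input unchanged if it does not apply.

    Caveat: this is a heuristic. Correctly encoded text that happens to look
    double-encoded (for example, a literal "Ã¼") would be altered as well.
    """
    try:
        text = data.decode("utf-8")      # undo the outer encoding pass
        inner = text.encode("latin-1")   # map characters back to the original bytes
        inner.decode("utf-8")            # sanity check: result must be valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data                      # not double-encoded, or not repairable
    return inner


good = "Würßt".encode("utf-8")
doubled = good.decode("latin-1").encode("utf-8")  # simulate the buggy output

assert undo_double_encoding(doubled) == good      # repaired
assert undo_double_encoding(good) == good         # already correct: left alone
```

Returning the input unchanged on failure keeps the helper safe to apply to files that were written correctly.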

brucemiller commented Feb 13, 2018

So, really what's going on is that BibTeX just passes through whatever bytes it got, other than the parts it specifically recognizes. Its output is expected to be processed within a LaTeX document that has loaded whatever packages are needed, in particular inputenc (usually bib files are written without many expectations). Since we're processing the bibliography for a particular document, which has its loaded packages recorded in the processing instructions, all we have to do is fetch that list and preload them. I've only done that for inputenc, atm. Of course, when converting a bibliography standalone, you'll have to preload whatever is necessary. But for the case at hand, this seems to work just fine. Thanks for the report!
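To illustrate the idea of fetching the package list from the document's processing instructions, here is a rough Python sketch. It is not the LaTeXML implementation (which is in Perl). The exact shape of the instructions, `<?latexml package="..." options="..."?>`, is an assumption made for this example, and the function names are hypothetical.

```python
import re
from xml.dom import minidom

# Assumed shape of the processing instructions; real LaTeXML output may differ.
SAMPLE = """<?latexml class="book" options="a4paper,12pt,twoside,openright"?>
<?latexml package="babel" options="english"?>
<?latexml package="inputenc" options="utf8"?>
<?latexml package="natbib"?>
<document/>
"""

PSEUDO_ATTR = re.compile(r'(\w+)="([^"]*)"')


def loaded_packages(xml_text):
    """Return (package, options) pairs found in top-level latexml instructions."""
    doc = minidom.parseString(xml_text)
    found = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "latexml"):
            attrs = dict(PSEUDO_ATTR.findall(node.data))
            if "package" in attrs:
                found.append((attrs["package"], attrs.get("options", "")))
    return found


def packages_to_preload(xml_text, wanted=("inputenc",)):
    """Keep only the packages that matter for reading the bibliography."""
    return [(pkg, opts) for pkg, opts in loaded_packages(xml_text) if pkg in wanted]


print(packages_to_preload(SAMPLE))  # [('inputenc', 'utf8')]
```

Restricting the result to inputenc mirrors the comment above, where only that package is preloaded for now.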
modified the milestones: LaTeXML-0.8.4, LaTeXML-0.8.3 Feb 25, 2018