All file reading and writing explicitly uses UTF-8 #20

laszlopandy · 2015-06-12T04:59:05Z

Changes two things:

1. All reads and writes go through utility functions which explicitly force UTF-8 instead of using the system's default encoding.
2. Catches the decoding exception raised when a file isn't valid UTF-8 and rewrites its message:
- from: hGetContents: invalid argument (invalid byte sequence)
- to: Bad encoding; the file must be valid UTF-8
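
The two changes described above might look roughly like the following. This is only a minimal sketch and not the code from this branch: the helper names writeUtf8, readUtf8 and rewriteDecodingError are hypothetical, and it assumes GHC reports a failed decode as an IOException whose error type is InvalidArgument, which is what the original "invalid argument" message suggests.

```haskell
import Control.Exception (catch, evaluate, throwIO)
import GHC.IO.Exception (IOErrorType (InvalidArgument), IOException (..))
import System.IO

-- Hypothetical helper: write a String to disk, always encoded as UTF-8,
-- regardless of the system locale.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path content =
  withFile path WriteMode $ \handle -> do
    hSetEncoding handle utf8
    hPutStr handle content

-- Hypothetical helper: read a whole file strictly, always decoding as UTF-8.
readUtf8 :: FilePath -> IO String
readUtf8 path =
  readStrict `catch` rewriteDecodingError
  where
    readStrict =
      withFile path ReadMode $ \handle -> do
        hSetEncoding handle utf8
        content <- hGetContents handle
        -- Force the entire string while the handle is still open, so any
        -- decoding error surfaces here and not later in lazy code.
        _ <- evaluate (length content)
        return content

-- Assumption: a decoding failure has error type InvalidArgument. Match on
-- the constructor (not on the message text) and replace the message.
rewriteDecodingError :: IOException -> IO a
rewriteDecodingError err =
  case ioe_type err of
    InvalidArgument ->
      throwIO (userError "Bad encoding; the file must be valid UTF-8")
    _ ->
      throwIO err
```

With helpers like these, a call such as readUtf8 "Main.elm" would either return the decoded contents or fail with the friendlier message, whatever locale the machine is configured with.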

People share Elm files on GitHub and elsewhere, so even if you have a strange system using a strange locale, we should still use UTF-8 everywhere to avoid compatibility issues.

I believe this will fix elm/compiler#914; however, I could not reproduce the bug on my machine, so I cannot say for sure. More testing is needed with a custom build of this branch.

laszlopandy · 2015-06-12T05:09:56Z

@rtfeldman If you could build this branch on your VM and test whether it fixes the issue, that would be very helpful!

evancz · 2015-06-12T05:29:36Z

Nice, I like this a lot! I have two main questions:

1. How did you find all the read/write locations? How confident are you that this covers everything?
2. Do we pay any performance penalty to read things in strictly? I guess that'd happen one way or another, so maybe it's no big deal?

I ask the second question because I believe reading and writing to disk is one of the slowest parts of building a project right now. I don't have good numbers on this, but I am vaguely concerned that a small change could mess with compile times. Separate from this issue, it may be time to start setting up some proper profiling so we can assess this.

laszlopandy · 2015-06-12T08:45:26Z

1. I found the read/write locations with a regex, readFile|writeFile|withFile|openFile|ReadMode|WriteMode, in elm-make and elm-compiler.
2. Haskell lazy IO by default buffers blocks (i.e. reads 4k at a time). So for small files, there should be no difference at all. In elm-core 85% of the files are less than 8k (two blocks). But at this level we should also take into consideration the syscall overhead of calling the kernel N times vs reading all the data at once, and also how aggressively the kernel pre-caches blocks in RAM. In my anecdotal experience any file less than 1 MB is not worth breaking up into buffers if you plan on reading to the end.

My hypothesis is that the difference will be unmeasurable.
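
A file-size figure like the 85% above could be re-checked against any checkout with something along these lines. This is only a sketch using base: the names fileSize, fractionBelow and shareOfSmallFiles are hypothetical, the list of paths is assumed to be supplied by the caller, and the 4k block size is taken from the comment above, not measured.

```haskell
import System.IO (IOMode (ReadMode), hFileSize, withFile)

-- Size of one file in bytes.
fileSize :: FilePath -> IO Integer
fileSize path = withFile path ReadMode hFileSize

-- Fraction of the given sizes that are strictly below the limit.
-- Returns 0 for an empty list to avoid dividing by zero.
fractionBelow :: Integer -> [Integer] -> Double
fractionBelow limit sizes
  | null sizes = 0
  | otherwise =
      fromIntegral (length (filter (< limit) sizes))
        / fromIntegral (length sizes)

-- Hypothetical entry point: share of the given files smaller than two
-- 4k blocks (the block size is an assumption carried over from the text).
shareOfSmallFiles :: [FilePath] -> IO Double
shareOfSmallFiles paths = do
  sizes <- mapM fileSize paths
  return (fractionBelow (2 * 4096) sizes)
```

Passing the paths of a package's source files to shareOfSmallFiles gives the fraction that fits in two blocks. It says nothing about read time by itself, so it complements proper profiling and does not replace it.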

laszlopandy · 2015-06-12T08:49:11Z

And keep in mind that I've only made the reading of *.elm files strict. The reading of object files, and more importantly the writing of the final output, are still lazy. The lazy output might be important if you are compiling a giant Elm program, where the output .js is several megabytes. But other than that I don't think it matters either way.

rtfeldman · 2015-06-12T16:40:00Z

Unfortunately my personal bandwidth is extremely limited between now and the end of mloc... I realistically won't have time to test this out until after the conference. Sorry about that!

evancz · 2015-06-13T00:39:38Z

LGTM, thank you!

rtfeldman · 2015-07-13T19:46:03Z

For future reference, confirmed that we ran the new version successfully on our CI server without setting environment variables. Thanks @laszlopandy!

evancz · 2015-07-13T19:56:02Z

Whoo, thanks @laszlopandy, and thanks @rtfeldman for confirmation!

laszlopandy added 2 commits June 12, 2015 06:48

All file reading and writing explicitly uses UTF-8

b7d7380

Match on constructor instead of string.

516a010

evancz merged commit 516a010 into elm-lang:master Jun 13, 2015

donutcho mentioned this pull request Jun 16, 2015

Specifying Source Character Encoding elm/compiler#914

Closed

rtfeldman mentioned this pull request Jul 13, 2015

Revert setting env variables rtfeldman/node-elm-compiler#6

Closed

rtfeldman mentioned this pull request Jul 13, 2015

Default encoding is not always UTF-8 #33

Closed
