Skip to content
This repository has been archived by the owner on Aug 23, 2018. It is now read-only.

All file reading and writing explicitly uses UTF-8 #20

Merged
merged 2 commits into from
Jun 13, 2015

Conversation

laszlopandy
Copy link
Member

Changes two things:

  • All reads and writes go through utility functions which explicitly force UTF-8 instead of using the system's default encoding.
  • Catches the decoding exception if someone isn't using UTF-8:
    • hGetContents: invalid argument (invalid byte sequence),
    • and changes it to read: Bad encoding; the file must be valid UTF-8

People share Elm files on GitHub, and elsewhere. So even if you have a strange system using a strange locale, we should still use UTF-8 everywhere to avoid compatibility issues.

I believe this will fix elm/compiler#914, however I could not reproduce the bug on my machine so I cannot say for sure. More testing is needed with a custom build of this branch.

@laszlopandy
Copy link
Member Author

@rtfeldman If you could build this branch on your VM and test if it fixes the issue that would be very helpful!

@evancz
Copy link
Contributor

evancz commented Jun 12, 2015

Nice, I like this a lot! I have two main questions:

  1. How did you find all the read/write locations? How confident are you that this covers everything?
  2. Do we pay any performance penalty to read things in strictly? I guess that'd happen one way or another, so maybe it's no big deal?

I ask 2 because I believe reading and writing to disk is one of the slowest parts of building a project right now. I don't have good numbers on this, but I am vaguely concerned that a small change could mess with compile times. Separate from this issue, it may be time to start setting up some proper profiling so we can assess this.

@laszlopandy
Copy link
Member Author

  1. I found the read/write locations with a regex readFile|writeFile|withFile|openFile|ReadMode|WriteMode in elm-make and elm-compiler.
  2. Haskell lazy IO by default buffers blocks (ie. reads 4k at a time). So for small files, there should be no difference at all. In elm-core 85% of the files are less than 8k (two blocks). But at this level we should also take into consideration the syscall overhead of calling the kernel N times vs reading all the data at once, and also how aggressively the kernel pre-caches blocks in RAM. In my anecdotal experience any file less than 1mb is not worth breaking up into buffers if you plan on reading to the end.

My hypothesis is that the difference will be unmeasurable.

@laszlopandy
Copy link
Member Author

And keep in mind that I've only made the reading of *.elm files strict. The reading of object files, and more importantly the writing of the final output are still lazy. The lazy output might be important if you are compiling a giant Elm program, where the output .js is several megabytes. But other than that I don't think it matter either way.

@rtfeldman
Copy link

Unfortunately my personal bandwidth is extremely limited between now and the end of mloc...I realistically won't have time to test this out until after the conference. Sorry about that!

@evancz evancz merged commit 516a010 into elm-lang:master Jun 13, 2015
@evancz
Copy link
Contributor

evancz commented Jun 13, 2015

LGTM, thank you!

@rtfeldman
Copy link

For future reference, confirmed that we ran the new version successfully on our CI server without setting environment variables. Thanks @laszlopandy!

@evancz
Copy link
Contributor

evancz commented Jul 13, 2015

Whoo, thanks @laszlopandy, and thanks @rtfeldman for confirmation!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specifying Source Character Encoding
3 participants