locale.gen mangling insufficiently parses format #940

hsitter · 2018-04-23T12:36:54Z

Submission type

Bug report
Feature Request

localecfg has the following logic to enable (i.e. uncomment) locales in /etc/locale.gen

            for line in text:
                # always enable en_US
                if line.startswith("#" + en_us_locale):
                    # uncomment line
                    line = line[1:].lstrip()

                for locale_value in locale_values:
                    if line.startswith("#" + locale_value):
                        # uncomment line
                        line = line[1:].lstrip()

Lines are insufficiently parsed here. They'd fail when

⇥# the hash is preceded by any whitespace character such as a tab
␣# or a space
⇥␣# or both
#␣or the hash is followed by any whitespace character

what it should do is something along the line of (pseudo code)

foreach line
  if not line.strip_left_whitespaces.starts_with('#')
     continue # not a comment
  if not line.split('#', 2)[1].strip_left_whitespaces.starts_with(locale_value)
     continue # not a commented version of locale_value
  uncomment

The text was updated successfully, but these errors were encountered:

kkofler · 2018-04-23T12:43:55Z

We had a somewhat more flexible parsing there (but still without accepting whitespace before comments), but it was explicitly changed because some distros want a distinction between #comment and # comment (they use only the former for commented-out languages and the latter for free-form comments). See commit 382c193 and commit 58952b6.

I objected to this distinction which is clearly not specified by the file format, but it was merged anyway.

Since the distro I work with does not use locale.gen anyway, I will let you and @AlmAck (who did the 2 commits I referenced) sort that out.

hsitter · 2018-04-23T12:56:25Z

Could be a bool flag in localecfg.conf stripLines: true to support both. But I agree that this distinction is fairly fishy. Mind you, even with the distinction I find it a bit unlikely for stripped' matching to go wrong as the comment would have to start with a full locale identifier.

Alternatively, the lines could be parsed entirely instead of doing a startswith approximation.^\s?#\s?([^\s]+)\s+([^\s]+)\s?$ should be a good start

adriaandegroot · 2018-04-26T14:27:29Z

The whole file is fishy, and I don't really like messing around with it one way or another:

# So the file starts with a bunch of comments
# And here is an example locale:
#   en_US UTF-8
# But now the actual list of locales starts
# en_US UTF-8

Good luck making any sense out of that automatically. The original change was requested, I believe, because the locales listed as examples were enabled twice. If anything changes in this module, I'd move to not uncomment anything, but to copy the entire file and at the end list those locales that Calamares would have uncommented (once each) that appear in the input as comments following the RE ^\w*#\w*[a-zA-Z_.0-9-]+\w . Doing so preserves the comments, prevents duplicates, .. of course, then people won't understand that the commented-out locales may be listed later, and someone will ship a distro where one of the example locales isn't actually available but is enabled in Calamares.

adriaandegroot · 2018-04-26T14:36:59Z

If you can find a meaningful file-format description for locale.gen, I'd be grateful. The locale-gen manpage from Ubuntu says:

The main configuration file, which has a simple format: every line that
is not empty and does not begin with a # is treated as a locale
definition that is to be built.

So that might be interpreted at the least as disallowing lines with a space before the comment character. Whether locale-gen actually follows that interpretation .. well, no.

hsitter · 2018-04-27T07:55:37Z

So, I've had a look at ubuntu, arch, debian and opensuse, and omg this is a mess. locale-gen isn't part of glibc, it's something distros add to manage locales easier than what localedef does.

Ubuntu's locale-gen has editing of locale.gen built-in in addition to the actual generating (which is actually breaking locale.gen apart and calling localedef (that's the tool that is actually POSIX) as necessary
Arch's locale-gen only supports the actual localedef call but nothing else
Debian's is very close to Arch's but has some extra validation (seems it checks if perl likes the locale)
Opensuse seems to have no locale-gen at all and instead the manual tells you to run localdef manually (supposedly they'd also have it configured somewhere in their yast configs though)

The scripts are somewhat different even.

Ubuntu uniqifies (sort -u) the config, then while read locale charset, skips everything starting with #, skips everything that is "", performs entry validation (calling locale), then runs the entry through seds to split it into the arguments localedef needs, and runs it.
Arch does much of the same except less of it. Format overall is the same, it's not uniquifying though, and has less entry validation going on.
Debian is somewhere inbetween the two (doesn't uniqify though)

The seds look the same and I expect the scripts have some common ancestry somewhere, but given there is no definition or standard or anything behind this, it probably matters not.

So, you can pretty much throw any format expectation out the window. For all we know Floppy Linux is using json as config format.

example: https://git.archlinux.org/svntogit/packages.git/tree/trunk/locale-gen?h=packages/glibc

Knowing what we know, I still think 100% whitespace handling would be the way to go as the likely least wrong option as far as the format is concerned.

But! If we appreciate that this format is basically undefined we probably should let distros handle this i.e. instead of editing the format in cala it'd call distro provided scripts to do it. That way some can have arbitrary meaning of spaces, others can use json as config format, or no config at all.

localeConfigHelper: /bin/locale-helper --add # called foreach POSIX locale as argument
localeGenHelper: /bin/locale-gen

Splitting it like that means ubuntu systems can simply defer to already existing tech in the locale gen tech and set locale-gen --keep-existing as localeConfigHelper and use no localeGenHelper. And to keep calamares backwards compatible we can simply use a default localeConfigHelper which does what we have right now and call locale-gen as localeGenHelper. That way everything would be fully backwards compatible.

This could obviously also be a single script which is called with all locales in the arguments, leaving it to the script to foreach the locales (slightly less lovely from an ubuntu pov though, but ultimately makes no difference).

FIXES #940

adriaandegroot added this to the v3.2.1 milestone Jun 20, 2018

adriaandegroot closed this as completed in a906bd7 Jun 20, 2018

adriaandegroot added a commit that referenced this issue Jun 20, 2018

Merge remote-tracking branch 'origin/fix-localegen'

47b130c

FIXES #940

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

locale.gen mangling insufficiently parses format #940

locale.gen mangling insufficiently parses format #940

hsitter commented Apr 23, 2018

kkofler commented Apr 23, 2018 •

edited

Loading

hsitter commented Apr 23, 2018

adriaandegroot commented Apr 26, 2018

adriaandegroot commented Apr 26, 2018

hsitter commented Apr 27, 2018

locale.gen mangling insufficiently parses format #940

locale.gen mangling insufficiently parses format #940

Comments

hsitter commented Apr 23, 2018

Submission type

kkofler commented Apr 23, 2018 • edited Loading

hsitter commented Apr 23, 2018

adriaandegroot commented Apr 26, 2018

adriaandegroot commented Apr 26, 2018

hsitter commented Apr 27, 2018

kkofler commented Apr 23, 2018 •

edited

Loading