Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

locale.gen mangling insufficiently parses format #940

Closed
1 of 2 tasks
hsitter opened this issue Apr 23, 2018 · 5 comments
Closed
1 of 2 tasks

locale.gen mangling insufficiently parses format #940

hsitter opened this issue Apr 23, 2018 · 5 comments
Milestone

Comments

@hsitter
Copy link
Contributor

hsitter commented Apr 23, 2018

Submission type

  • Bug report
  • Feature Request

localecfg has the following logic to enable (i.e. uncomment) locales in /etc/locale.gen

            for line in text:
                # always enable en_US
                if line.startswith("#" + en_us_locale):
                    # uncomment line
                    line = line[1:].lstrip()

                for locale_value in locale_values:
                    if line.startswith("#" + locale_value):
                        # uncomment line
                        line = line[1:].lstrip()

Lines are insufficiently parsed here. They'd fail when

⇥# the hash is preceded by any whitespace character such as a tab
␣# or a space
⇥␣# or both
#␣or the hash is followed by any whitespace character

what it should do is something along the line of (pseudo code)

foreach line
  if not line.strip_left_whitespaces.starts_with('#')
     continue # not a comment
  if not line.split('#', 2)[1].strip_left_whitespaces.starts_with(locale_value)
     continue # not a commented version of locale_value
  uncomment
@kkofler
Copy link
Contributor

kkofler commented Apr 23, 2018

We had a somewhat more flexible parsing there (but still without accepting whitespace before comments), but it was explicitly changed because some distros want a distinction between #comment and # comment (they use only the former for commented-out languages and the latter for free-form comments). See commit 382c193 and commit 58952b6.

I objected to this distinction which is clearly not specified by the file format, but it was merged anyway.

Since the distro I work with does not use locale.gen anyway, I will let you and @AlmAck (who did the 2 commits I referenced) sort that out.

@hsitter
Copy link
Contributor Author

hsitter commented Apr 23, 2018

Could be a bool flag in localecfg.conf stripLines: true to support both. But I agree that this distinction is fairly fishy. Mind you, even with the distinction I find it a bit unlikely for stripped' matching to go wrong as the comment would have to start with a full locale identifier.

Alternatively, the lines could be parsed entirely instead of doing a startswith approximation.^\s?#\s?([^\s]+)\s+([^\s]+)\s?$ should be a good start

@adriaandegroot
Copy link
Contributor

The whole file is fishy, and I don't really like messing around with it one way or another:

# So the file starts with a bunch of comments
# And here is an example locale:
#   en_US UTF-8
# But now the actual list of locales starts
# en_US UTF-8

Good luck making any sense out of that automatically. The original change was requested, I believe, because the locales listed as examples were enabled twice. If anything changes in this module, I'd move to not uncomment anything, but to copy the entire file and at the end list those locales that Calamares would have uncommented (once each) that appear in the input as comments following the RE ^\w*#\w*[a-zA-Z_.0-9-]+\w . Doing so preserves the comments, prevents duplicates, .. of course, then people won't understand that the commented-out locales may be listed later, and someone will ship a distro where one of the example locales isn't actually available but is enabled in Calamares.

@adriaandegroot
Copy link
Contributor

If you can find a meaningful file-format description for locale.gen, I'd be grateful. The locale-gen manpage from Ubuntu says:

The main configuration file, which has a simple format: every line that
is not empty and does not begin with a # is treated as a locale
definition that is to be built.

So that might be interpreted at the least as disallowing lines with a space before the comment character. Whether locale-gen actually follows that interpretation .. well, no.

@hsitter
Copy link
Contributor Author

hsitter commented Apr 27, 2018

So, I've had a look at ubuntu, arch, debian and opensuse, and omg this is a mess. locale-gen isn't part of glibc, it's something distros add to manage locales easier than what localedef does.

  • Ubuntu's locale-gen has editing of locale.gen built-in in addition to the actual generating (which is actually breaking locale.gen apart and calling localedef (that's the tool that is actually POSIX) as necessary
  • Arch's locale-gen only supports the actual localedef call but nothing else
  • Debian's is very close to Arch's but has some extra validation (seems it checks if perl likes the locale)
  • Opensuse seems to have no locale-gen at all and instead the manual tells you to run localdef manually (supposedly they'd also have it configured somewhere in their yast configs though)

The scripts are somewhat different even.

  • Ubuntu uniqifies (sort -u) the config, then while read locale charset, skips everything starting with #, skips everything that is "", performs entry validation (calling locale), then runs the entry through seds to split it into the arguments localedef needs, and runs it.
  • Arch does much of the same except less of it. Format overall is the same, it's not uniquifying though, and has less entry validation going on.
  • Debian is somewhere inbetween the two (doesn't uniqify though)

The seds look the same and I expect the scripts have some common ancestry somewhere, but given there is no definition or standard or anything behind this, it probably matters not.

So, you can pretty much throw any format expectation out the window. For all we know Floppy Linux is using json as config format.

example: https://git.archlinux.org/svntogit/packages.git/tree/trunk/locale-gen?h=packages/glibc

Knowing what we know, I still think 100% whitespace handling would be the way to go as the likely least wrong option as far as the format is concerned.

But! If we appreciate that this format is basically undefined we probably should let distros handle this i.e. instead of editing the format in cala it'd call distro provided scripts to do it. That way some can have arbitrary meaning of spaces, others can use json as config format, or no config at all.

localeConfigHelper: /bin/locale-helper --add # called foreach POSIX locale as argument
localeGenHelper: /bin/locale-gen

Splitting it like that means ubuntu systems can simply defer to already existing tech in the locale gen tech and set locale-gen --keep-existing as localeConfigHelper and use no localeGenHelper. And to keep calamares backwards compatible we can simply use a default localeConfigHelper which does what we have right now and call locale-gen as localeGenHelper. That way everything would be fully backwards compatible.

This could obviously also be a single script which is called with all locales in the arguments, leaving it to the script to foreach the locales (slightly less lovely from an ubuntu pov though, but ultimately makes no difference).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants