feat: xtask gettext#12676
Conversation
| .args(["--no-wrap", "--no-obsolete", "-o"]) | ||
| .arg(msgattrib_output_file.path()) | ||
| .arg(file_to_update.as_ref()) | ||
| .get_stdout_or_error()?; |
There was a problem hiding this comment.
The output of get_stdout_or_error() is never used, so we probably shouldn't add it as-is.
It also seems weird to call get_stdout_or_error() while ignoring stdout.
I guess the difference to run_or_bail() is that
get_stdout_or_error()captures stdout and stderrrun_or_bail()lets the child inherit stdout/stderr
but get_stdout_or_error() displays stderr of failure, so I'm not sure which one to use.
AFAICT, run_or_bail() semantics are better (like for slow child processes, or processes that print warnings to stderr but still succeed).
Maybe merge run_or_bail and get_stdout_or_error into one function while there's no significant difference.
If we capture stdout/stderr, we could bail or warn when the exit status is 0 but they are non-empty.
Here's how we propagate sphinx-build warnings in this case:
if !out.stderr.is_empty() {
rsconf::warn!("sphinx-build: {}", String::from_utf8_lossy(&out.stderr));
}
Maybe we can put such warnings/assertions in a shared function, perhaps give it an optional input to stdin so that it can support msguniq.
There was a problem hiding this comment.
Yes, these two functions used to have meaningful differences, but after switching error handling to anyhow, these are no longer present. Only the run_or_bail() functionality is needed at the moment, so I replaced the calls to get_stdout_or_error() and renamed run_or_bail() to just run() as you suggested.
The function for providing custom stdin and capturing stdout is useful, but it could be expanded, to remove duplication at the call sites, so that's what I did.
Right now, stderr isn't captured by any of the functions. I don't think we currently have a reason to check whether stderr is empty, the exit status should be enough to decide whether to error out. One thing we might want to consider is interleaving of the stderr output of concurrent threads.
| Some(dir) => Template::new(dir)?, | ||
| None => { | ||
| let temp_dir = fish_tempfile::new_dir().context("Failed to create temp file")?; | ||
| Command::new(env!("CARGO")) |
There was a problem hiding this comment.
I guess we use env!("CARGO") instead of hardcoded cargo because the first is slightly more correct in edge cases. Not sure why exactly though.
Now that env!("CARGO") is used twice, maybe we should extract it as const CARGO (or const HOST_CARGO or const NATIVE_CARGO).
There was a problem hiding this comment.
I added CARGO. For our xtasks, I don't think it matters much which cargo is used (particularly host vs native).
| } | ||
| let mut error = None; | ||
| for handle in thread_handles { | ||
| // SAFETY: `handle.join()` only returns `Err` if the thread panicked. |
There was a problem hiding this comment.
Not sure if SAFETY is the correct word here because crashing via unwrap() should be safe (as in, not UB).
But I don't have a better suggestion.
| .read(true) | ||
| .open(&entry_path) | ||
| .with_context(|| format!("Failed to open file {entry_path:?}"))?; | ||
| file.read_to_end(&mut concatenated_content) |
There was a problem hiding this comment.
now it seems a bit odd that messages extracted from Rust are in PO format, while messages from fish script are extracted on the fly,
which means we have logic to create msgfmt entries in both gettext-extraction and xtask gettext.
We could get rid of the one in gettext-extraction now, but we'd need to find a new file format that supports a list of (potentially multiline) messages.
There was a problem hiding this comment.
Yes, we could change that, but I don't think it's a priority, since the plan is to rip out that code once we have fully switched to Fluent for Rust messages.
| } | ||
| // The Header entry needs to be removed again, | ||
| // because it is added outside of this function. | ||
| let expected_prefix = MINIMAL_HEADER.as_bytes(); |
There was a problem hiding this comment.
re the comment: then why are we adding MINIMAL_HEADER in the first place?
msguniq --no-wrap --sort-output shouldn't need it, no?
Asserting that the empty msgid sorts before anything doesn't seem important either.
There was a problem hiding this comment.
The problem there is that gettext tools don't like non-ASCII msgids without an encoding specified, which was the reason why the header was introduced. That didn't change with the Rust rewrite, just the method for inserting and removing the header. Asserting that it comes first is important to ensure that we actually remove the header again, instead of some other stuff.
Rewrite the PO file handling logic in Rust and make it available via an
xtask. Replaces the
`build_tools/{update_translations,fish_xgettext}.fish` scripts.
Main benefits:
- Better ergonomics
- Better error handling
- Eliminates the need for a fish executable for updating PO files,
which is particularly useful in CI
- Improved performance, mainly due to concurrent threads working on the
PO files in parallel
The behavior is mostly unchanged, with the minor exception that section
headers for empty sections are now omitted in PO files.
The interface for invoking the tooling is quite different. Instead of
working with flags, `cargo xtask gettext` has 3 subcommands:
- `update` modifies the PO files to match the current sources
- `check` is like update, but instead of modifying the PO files, it
shows diffs between the current version of the PO files and what they
would look like after updating. When there is a difference, the xtask
exits non-zero, making it useful for checks to detect outdated PO
files.
- `new` creates a new PO file for the given language.
Both the `update` and `check` command take any number of file paths to
specify the PO files to consider. If none are specified, all files in
`localization/po/` are considered.
Extracting gettext messages from Rust still requires compiling with the
`gettext-extract` feature active. In situations where compilation is
needed for other purposes as well, it can make sense to only build once
and then tell the gettext xtask about the directory into which the
messages have been extracted. This can be done via the
`--rust-extraction-dir` flag. If we stop having gettext messages in
Rust, this logic can be removed.
Closes fish-shell#12676
9fe02a7 to
fba0b9d
Compare
| .arg(template.as_ref()) | ||
| .run()?; | ||
| let msgattrib_output_file = fish_tempfile::new_file().context("Failed to create temp file")?; | ||
| Command::new("msgattrib") | ||
| .args(["--no-wrap", "--no-obsolete", "-o"]) | ||
| .arg(msgattrib_output_file.path()) |
There was a problem hiding this comment.
If I had known about the trick of repeatedly calling args()/arg(),
I probably woudn't have written as_os_strs.
I guess that one might still be a little bit nicer because it allows us to write all arguments in a single array (assuming there's no other array to add to the arguments).
If we use args/args, then I think it might make sense to put the -o in a separate .arg() call, so it's not too disconnected from its option argument.
There was a problem hiding this comment.
Yes, these are two different ways of achieving the same thing. I think I slightly favor repeated .arg()/.args() calls in this case, mainly to avoid having to deal with degraded functionality of rust-analyzer in macros, but in other cases I think as_os_strs is the better option, e.g. for the sphinx-build invocation for the HTML docs stask. If we keep using as_os_strs, I think we should improve it via #12688.
| .with_context(|| format!("Failed to write to temp file {:?}", template_file.path()))?; | ||
| template_file | ||
| .get_mut() | ||
| .flush() |
There was a problem hiding this comment.
Ok I guess flush is enough to guarantee that our child processes see what we wrote.
So I guess we don't need to actually close the file here.
There was a problem hiding this comment.
Yes, closing would be inconvenient here, since the tempfile API has custom Drop logic to delete the file on drop, so keeping template_file in scope and open until we no longer need the backing file is useful.
| let tmp_copy = | ||
| fish_tempfile::new_file().context("Failed to create temp file")?; | ||
| crate::copy_file(&path, tmp_copy.path())?; | ||
| update_po_file(tmp_copy.path(), template_path_buf)?; |
There was a problem hiding this comment.
Threads are a bit of a heavy-weight way to parallelize things here
because all the heavy-lifting is already done by external processes
(msgmerge, msgattrib and diff).
The class way would be the Unix shell one:
- let job1proc1 = spawn "msgmerge de.po"
- let job2proc1 = spawn "msgmerge fr.po"
- let r = loop { if let Some(r) = job1proc1.try_wait().or(job2proc1.try_wait()) { break r; }; sleep 1 }
- assume, without loss of generality: r = job1proc1
- job1proc2 = spawn "msgattrib ..."
- ...
There's no strong need to address this comment
(especially if we later replace "diff" with a Rust-native, in-process solution),
but as written now, I think we should be able to get rid of threads with few changes by adding async/await to update_po_file() and other code without changing the structure.
Haven't tried it though.
There was a problem hiding this comment.
I agree that spawning threads is a bit heavy here, but it does have benefits. If we have single-threaded Rust code spawning processes as you suggest, it becomes much more annoying to write this code, especially if we don't want to lose out on parallelism compared to the threaded approach. Your syntax works for a small, fixed number of languages known when writing the code. I had a quick look how this approach would be implemented in idiomatic Rust, and it doesn't look like this approach is really used. Instead, threads or async is used. We could use async as well, but that would require introducing an async runtime dependency (tokio or some alternative), which I think is overkill if all we want from it is the fairly low overhead of spawning a few OS threads. I also think that async code would be harder to understand than the current threaded approach.
| /// Create a gettext template. | ||
| /// `rust_extraction_dir` must be the path to a directory which contains the messages | ||
| /// extracted from the Rust sources. | ||
| pub fn new<P: AsRef<Path>>(rust_extraction_dir: P) -> Result<Self> { |
There was a problem hiding this comment.
usually we define constructors before other methods like serialize but I can see why you'd put the shorter one first
| path: P, | ||
| ) -> Result<Option<LocalizationTier>> { | ||
| static L10N_ANNOTATION: LazyLock<Regex> = | ||
| LazyLock::new(|| Regex::new(r"(?:^|\n)# localization: (?<tier>.*)\n").unwrap()); |
There was a problem hiding this comment.
I guess technically <tier> should be called annotation_value.
Or just annotation if we rename annotation to m or r#match.
There was a problem hiding this comment.
Good point, I'll rename it to annotation_value.
|
|
||
| fn format_msgid_for_po(msgid: &str) -> String { | ||
| let escaped_msgid = msgid.replace("\\", "\\\\").replace("\"", "\\\""); | ||
| format!("msgid \"{escaped_msgid}\"\nmsgstr \"\"\n\n") |
There was a problem hiding this comment.
I like using the "indoc" crate for multiline strings, but actually native multiline string literals are nice to read too, we only need to add \n
```rust
format!(
"\
msgid \"{escaped_msgid}\"\n\
msgstr \"\"\n\
\n\
\n\
"
)
```
but what you wrote is fine too
There was a problem hiding this comment.
I agree that using multi-line literals are better for this use case. Introducing a dependency for it seems overkill.
|
no requirement to add to changelog since devs can check diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 775007d147..8f00cae4f9 100644
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -25,6 +25,7 @@
For distributors and developers
-------------------------------
- When the default global config directory (``$PREFIX/etc/fish``) exists but has been overridden via ``-DCMAKE_INSTALL_SYSCONFDIR``, fish will now respect that override (:issue:`10748`).
+- ``build_tools/update_translations.fish`` has been replaced by ``cargo xtask gettext {new,update}`` (:issue:`12676`).
Regression fixes:
----------------- |
| for path in all_paths { | ||
| for dir_entry in WalkDir::new(path.as_ref()) { | ||
| let dir_entry = dir_entry | ||
| .with_context(|| format!("Failed to check paths at {:?}", path.as_ref()))?; |
There was a problem hiding this comment.
oh actually it looks like we don't need to specify the path,
because anyhow displays them already(?):
```shell
$ env -u RUST_BACKTRACE cargo xtask gettext update localization/po/fo.po
Finished `dev` profile [unoptimized + debuginfo] target(s) in 2.26s
Running `target/debug/xtask gettext update localization/po/fo.po`
Compiling fish-gettext-extraction v0.0.0 (/home/johannes/git/fish-shell/crates/gettext-extraction)
Checking fish v4.6.0 (/home/johannes/git/fish-shell)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 9.49s
Error: Failed to check paths at "localization/po/fo.po"
Caused by:
0: IO error for operation on localization/po/fo.po: No such file or directory (os error 2)
1: No such file or directory (os error 2)
```
so in general we can be lazy about the "hard facts" and focus on saying what part went wrong.
There was a problem hiding this comment.
Hm, that's a good observation. What's missing is information about what the code tried to do. Especially in cases where more than one path is relevant, like copying or moving files, it might still be valuable to be explicit about what was the source and what was the target. What's your preference regarding this function? Should we leave it as is, return to plain std::fs::copy, or modify the error message somehow?
There was a problem hiding this comment.
actually the part after "Caused by" does not come from anyhow but it's how walkdir's impl fmt::Display for walkdir::error::Error.
So in cases like std::fs::copy we need to keep paths
Rewrite the PO file handling logic in Rust and make it available via an
xtask. Replaces the
`build_tools/{update_translations,fish_xgettext}.fish` scripts.
Main benefits:
- Better ergonomics
- Better error handling
- Eliminates the need for a fish executable for updating PO files,
which is particularly useful in CI
- Improved performance, mainly due to concurrent threads working on the
PO files in parallel
The behavior is mostly unchanged, with the minor exception that section
headers for empty sections are now omitted in PO files.
The interface for invoking the tooling is quite different. Instead of
working with flags, `cargo xtask gettext` has 3 subcommands:
- `update` modifies the PO files to match the current sources
- `check` is like update, but instead of modifying the PO files, it
shows diffs between the current version of the PO files and what they
would look like after updating. When there is a difference, the xtask
exits non-zero, making it useful for checks to detect outdated PO
files.
- `new` creates a new PO file for the given language.
Both the `update` and `check` command take any number of file paths to
specify the PO files to consider. If none are specified, all files in
`localization/po/` are considered.
Extracting gettext messages from Rust still requires compiling with the
`gettext-extract` feature active. In situations where compilation is
needed for other purposes as well, it can make sense to only build once
and then tell the gettext xtask about the directory into which the
messages have been extracted. This can be done via the
`--rust-extraction-dir` flag. If we stop having gettext messages in
Rust, this logic can be removed.
Closes fish-shell#12676
fba0b9d to
e37bfb1
Compare
| /// Format files or check if they are correctly formatted. | ||
| Format(FormatArgs), | ||
| /// Update gettext files | ||
| Gettext(GettextArgs), |
There was a problem hiding this comment.
Wouldn't it be better to call it translate? gettext is very implementation dependent (will that still be valid with fluent?) and not very telling to people who don't know what it is.
There was a problem hiding this comment.
gettext is implementation-dependent because it does implementation-dependent things. The same goes for fluent. We don't have plans for moving the fish script message localization away from gettext, so both will coexist. For some workflows, the implementation details are not relevant. There it makes sense to have a shared xtask (maybe called localization) offering combined functionality (check, update, maybe new). But there is still a reason to have the implementation-specific xtasks, both to improve performance and to allow for subcommands that only make sense with one implementation like rename or show-missing for fluent.
There was a problem hiding this comment.
from gettext the only thing that's needed in everyday scenarios is cargo xtask gettext update.
Wouldn't hurt to have cargo xtask fix someday, which would run this and other code-gen things and maybe formatting.
I think most users won't care about implementation details and about whether it takes a bit longer.
But for now this is fine, we can rename it later.
There was a problem hiding this comment.
gettext is implementation-dependent because it does implementation-dependent things
To be clear, I was only suggesting to rename the xtask name (enum), not the rest of the code. main.rs could still call gettext.rs directly, or via a very thin localization.rs shim layer if you absolutely want an .rs file matching the task name.
Basically, I was suggesting to start with xtask localization, and only add the implementation CLI API (whatever that maybe, an xtask, a parameter, ...) once there are implementation-specific operations available.
As-is, it means that contributors will have to learn to call xtask gettext instead of update_translations.fish, only to then have to learn xtask localization later. Might as well start with the latter, and let them learn the former only if/when needed for the specific ops later.
There was a problem hiding this comment.
I was only suggesting to rename the xtask name
Yes, that's how I understood it.
start with
xtask localization
My point is that the current operations are gettext-specific. For Fluent, we'll have a separate xtask fluent, which offers different functionality with a different CLI. This is still being designed, so it's too early to settle on a common xtask localization, because we don't know yet what that CLI would have to look like to support both Fluent and gettext. This might also depend on whether we introduce it before or after porting all Rust messages to Fluent.
Generally, I agree that we probably want some xtask localization CLI, but I think it's too early to build that. For now, I think the xtask gettext is fine, especially because the only subcommand that's really relevant for most contributors is update.
|
I think this looks good, feel free to merge if you don't need feedback, else I can look at the range-diff also |
Rewrite the PO file handling logic in Rust and make it available via an
xtask. Replaces the
`build_tools/{update_translations,fish_xgettext}.fish` scripts.
Main benefits:
- Better ergonomics
- Better error handling
- Eliminates the need for a fish executable for updating PO files,
which is particularly useful in CI
- Improved performance, mainly due to concurrent threads working on the
PO files in parallel
The behavior is mostly unchanged, with the minor exception that section
headers for empty sections are now omitted in PO files.
The interface for invoking the tooling is quite different. Instead of
working with flags, `cargo xtask gettext` has 3 subcommands:
- `update` modifies the PO files to match the current sources
- `check` is like update, but instead of modifying the PO files, it
shows diffs between the current version of the PO files and what they
would look like after updating. When there is a difference, the xtask
exits non-zero, making it useful for checks to detect outdated PO
files.
- `new` creates a new PO file for the given language.
Both the `update` and `check` command take any number of file paths to
specify the PO files to consider. If none are specified, all files in
`localization/po/` are considered.
Extracting gettext messages from Rust still requires compiling with the
`gettext-extract` feature active. In situations where compilation is
needed for other purposes as well, it can make sense to only build once
and then tell the gettext xtask about the directory into which the
messages have been extracted. This can be done via the
`--rust-extraction-dir` flag. If we stop having gettext messages in
Rust, this logic can be removed.
Closes fish-shell#12676
e37bfb1 to
96ee8a2
Compare
Rewrite the PO file handling logic in Rust and make it available via an
xtask. Replaces the
build_tools/{update_translations,fish_xgettext}.fishscripts.Main benefits:
which is particularly useful in CI
PO files in parallel
The behavior is mostly unchanged, with the minor exception that section
headers for empty sections are now omitted in PO files.
The interface for invoking the tooling is quite different. Instead of
working with flags,
cargo xtask gettexthas 3 subcommands:updatemodifies the PO files to match the current sourcescheckis like update, but instead of modifying the PO files, itshows diffs between the current version of the PO files and what they
would look like after updating. When there is a difference, the xtask
exits non-zero, making it useful for checks to detect outdated PO
files.
newcreates a new PO file for the given language.Both the
updateandcheckcommand take any number of file paths tospecify the PO files to consider. If none are specified, all files in
localization/po/are considered.Extracting gettext messages from Rust still requires compiling with the
gettext-extractfeature active. In situations where compilation isneeded for other purposes as well, it can make sense to only build once
and then tell the gettext xtask about the directory into which the
messages have been extracted. This can be done via the
--rust-extraction-dirflag. If we stop having gettext messages inRust, this logic can be removed.
Stacked on #12674, #12675
TODOs: