Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore using icu4 #1227

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Explore using icu4 #1227

wants to merge 3 commits into from

Conversation

bim9262
Copy link

@bim9262 bim9262 commented Aug 26, 2023

Relates to #854, but doesn't replace datetime code as a mentioned possible approach.

Uses icu4 for:

  • short_months
  • long_months
  • short_weekdays
  • long_weekdays
  • am_pm
  • decimal_point

Still uses pure_rust_locales for (I couldn't figure out how to get this from icu4 yet...):

  • d_fmt
  • d_t_fmt
  • t_fmt
  • t_fmt_ampm

Thanks for contributing to chrono!

If your feature is semver-compatible, please target the 0.4.x branch;
the main branch will be used for 0.5.0 development going forward.

Please consider adding a test to ensure your bug fix/feature will not break in the future.

Relates to chronotope#854, but doesn't replace datetime code as a mentioned
possible approach.

Uses icu4 for:

* short_months
* long_months
* short_weekdays
* long_weekdays
* am_pm
* decimal_point

Still uses pure_rust_locales for (I couldn't figure out how to get this
from icu4 yet...):

* d_fmt
* d_t_fmt
* t_fmt
* t_fmt_ampm
@codecov
Copy link

codecov bot commented Aug 26, 2023

Codecov Report

Merging #1227 (c83af79) into 0.4.x (7eb6db1) will decrease coverage by 0.11%.
The diff coverage is 86.69%.

@@            Coverage Diff             @@
##            0.4.x    #1227      +/-   ##
==========================================
- Coverage   86.14%   86.03%   -0.11%     
==========================================
  Files          37       37              
  Lines       13394    13663     +269     
==========================================
+ Hits        11538    11755     +217     
- Misses       1856     1908      +52     
Files Changed Coverage Δ
src/format/locales.rs 82.31% <81.36%> (-17.69%) ⬇️
src/format/formatting.rs 94.78% <96.80%> (+0.05%) ⬆️
src/datetime/tests.rs 96.44% <100.00%> (ø)
src/format/strftime.rs 98.94% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@djc
Copy link
Member

djc commented Aug 26, 2023

What's the benefit here? How does it impact compile sizes and binary sizes?

@bim9262
Copy link
Author

bim9262 commented Aug 26, 2023

My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales). In the tests you can see that there's some discrepencies in the localization data between the two.

For the compile size are you referring to the size of libchrono.rlib?

I made a little test wrapper that depends on chrono ( chrono = { version = "0.4.26", features = ["unstable-locales"] }) and compared the size with and without the icu addition.

without icu the bin is 4.6M
with icu the bin is 7.5M.

@djc
Copy link
Member

djc commented Aug 28, 2023

So a 3M binary size penalty seems like a big downside...

@pitdicker
Copy link
Collaborator

My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales).

Good that you look at localization overall and not just adding "nd", "th" suffixes 👍. That looks to English-focused to me, as there are similar but different shortcomings when formatting in other languages.

But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?

Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem.
And when you want more advanced formatting, ICU4X would be the way to go.

@pitdicker
Copy link
Collaborator

pitdicker commented Aug 28, 2023

So a 3M binary size penalty seems like a big downside...

We already don't include timezone data but leave it in chrono-tz because it would add 2+ Mb to the binary size (and there is the problem it can get outdated...). I would argue that that data is a lot closer to the core functionality of chrono.

assert_eq!(dt.format_localized("%T%.3f", ar_SY).to_string(), "18:58:00.123");
assert_eq!(dt.format_localized("%T%.6f", ar_SY).to_string(), "18:58:00.123456");
assert_eq!(dt.format_localized("%T%.9f", ar_SY).to_string(), "18:58:00.123456780");
assert_eq!(dt.format_localized("%T%.f", ar_SY).to_string(), "18:58:00٫123456780");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An Arabic comma (or what is it called) in roman numerals doesn't seem right.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, some locales have outdigits defined as part of LC_CTYPE, but pure-rust-locales doesn't support these. I did check though and there's nothing defined for ar_SY. I beleive that ICU has the numbers, though I'm still researching.

let nano = t.nanosecond() % 1_000_000_000;
if nano == 0 {
let ret =
match *spec {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rustfmt is at it again 😆. #1082 (comment)

@pitdicker
Copy link
Collaborator

Some links for a bit of context/history of our localization support:

Some of the goals where to not generate the data at build time, to not increase the binary size with locale data that is not used, and to work without allocating.

@bim9262
Copy link
Author

bim9262 commented Aug 28, 2023

I'll take a look at generating rust files from the icu data that only has the data we'd need. For example there was a lot of extra information included about non-gregorian calendars. Producing data in a format like pure-rust-locales should allow us to have &'static strs, and not need to alloc. This should also cut down on the size of the additional allocation information. 🤞

@bim9262
Copy link
Author

bim9262 commented Aug 29, 2023

Upon further testing by dumping out some mods from the ICU data I see results like:

#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
    pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const AM_PM: &[&str] = &["AM", "PM"];
    pub const DECIMAL_POINT: &str = ".";
    pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
}

Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)

The data for decimal and digits looks promising (the digits are displayed 9-0 on github, even though they are 0-9 in my terminal. Probably due to left to right vs right to left. As a side note, although arabic is read right to left, the numbers are constructed left to right):

pub mod ar_SY {
    [...]
    pub const DECIMAL_POINT: &str = "٫";
    pub const DIGITS: &[char] = &['٠', '١', '٢', '٣', '٤', '٥', '٦', '٧', '٨', '٩'];
}

Still need to investigate the ordinal numbers...

@bim9262
Copy link
Author

bim9262 commented Aug 29, 2023

This shows a comparison between the glibc and icu for am_pm and the decimal point

https://gist.github.com/bim9262/920d29bd6155e207239bf11f42fe5da6

There are a lot of interesting cases where glibc has something more specific than am/pm or icu will be more specific.

Some examples to check out:

  • hsb_DE (glibc blank, icu localized)
  • ig_NG (glibc am/pm, icu localized)
  • hif_FJ (glibc localized, icu am/pm)
  • ks_IN (both localized, but very different)

I'd be tempted to take use glibc unless it's blank or am/pm then try icu as the fallback.

@autarch
Copy link

autarch commented Feb 11, 2024

But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

ICU has a set of patterns for date & time components. See https://cldr.unicode.org/translation/date-time/date-time-patterns

It covers a lot more cases than the strftime formatting specifiers. You can use the ICU data with the strftime specifiers as well. You just have to make some decisions about the mapping. I did this with my Perl module DateTime. I could enumerate the choices I made there if there's interest in this.

@autarch
Copy link

autarch commented Feb 11, 2024

Upon further testing by dumping out some mods from the ICU data I see results like:

#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
    pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const AM_PM: &[&str] = &["AM", "PM"];
    pub const DECIMAL_POINT: &str = ".";
    pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
}

Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)

This sort of thing happens when the ICU project's data for a locale is incomplete. You can see the underlying data for that locale at https://github.com/unicode-org/cldr/blob/main/common/main/aa_DJ.xml. It's nearly empty. Compare it to a more complete locale at https://github.com/unicode-org/cldr/blob/main/common/main/am.xml.

There's a whole system for locale "inheritance". In the case of aa-DJ it should fall back to aa, which does have data (at least in the CLDR GitHub repo). But the output above looks like it's falling back to the root locale. I'm not sure how you dumped that data. Or maybe it's using an older version of the data where the aa locale was empty.

@autarch
Copy link

autarch commented Feb 11, 2024

One more thing re: CLDR formatting. If you want to support this I think it's best to come up with new functions that accept these formats, like format_cldr or something along those lines.

@pitdicker
Copy link
Collaborator

So a 3M binary size penalty seems like a big downside...

I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?

Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem. And when you want more advanced formatting, ICU4X would be the way to go

In my opinion chrono should focus mostly on being a good date-and-time library. ICU can be better at localization. I propose to close this PR and instead make it easy to convert to/from icu::calendar::{Date, DateTime}. Maybe behind an icu feature?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants