Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add String.ToLower() and String.ToUpper() #143

Merged
merged 1 commit into from
Feb 19, 2024

Conversation

caesay
Copy link
Contributor

@caesay caesay commented Feb 16, 2024

I think there might be something wonky with the C implementation, so you might want to check that before merging. I can't get the tests to run properly on my machine as of yet.

Copy link
Collaborator

@pfusik pfusik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!
My main concern is that the C and C++ implementations are not Unicode-aware. For C, there are GLib functions. C++ probably needs some external dependency.
Also, please run make before committing (libfut.js is missing) and document in doc/reference.md.

GenC.fu Outdated Show resolved Hide resolved
obj.Accept(this, FuPriority.Argument);
Write("; ");
Write("std::transform(data.begin(), data.end(), data.begin(), ");
Write("[](unsigned char c) { return std::tolower(c); }); ");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not Unicode-aware.

test/StringLower.fu Outdated Show resolved Hide resolved
test/StringLower.fu Outdated Show resolved Hide resolved
test/StringLower.fu Outdated Show resolved Hide resolved
@pfusik pfusik added the enhancement New feature or request label Feb 17, 2024
test/StringUpper.fu Outdated Show resolved Hide resolved
@caesay
Copy link
Contributor Author

caesay commented Feb 17, 2024

I have incorporated your suggestions, and rebased to 8edec9b. The only thing I couldn't do is fix the unicode C++ support. I would suggest that we include https://github.com/sheredom/utf8.h in the hpp output (similar to how I suggested outputting subprocess.h and json.h) and use utf8lwr etc, but there's no precedent for how to do this in fut at the moment. I guess we need to store it as a string somewhere and include it if necessary.

Copy link

codecov bot commented Feb 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8edec9b) 96.66% compared to head (f28c9ee) 96.67%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #143   +/-   ##
=======================================
  Coverage   96.66%   96.67%           
=======================================
  Files           2        2           
  Lines       17256    17306   +50     
=======================================
+ Hits        16680    16730   +50     
  Misses        494      494           
  Partials       82       82           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@pfusik pfusik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo C++ which I'll handle myself. Thanks!

@pfusik pfusik merged commit 0c75701 into fusionlanguage:master Feb 19, 2024
2 of 3 checks passed
@pfusik
Copy link
Collaborator

pfusik commented Feb 19, 2024

I would suggest that we include https://github.com/sheredom/utf8.h

I'm no expert in Unicode, but https://github.com/sheredom/utf8.h/blob/2aa5709fe39c66d2868c0d52d42788899b90dc92/utf8.h#L1338 looks too simple to handle https://www.unicode.org/Public/15.1.0/ucd/CaseFolding.txt
I have more trust in ICU.

pfusik added a commit that referenced this pull request Feb 19, 2024
@caesay
Copy link
Contributor Author

caesay commented Feb 19, 2024

I tried utf8.h yesterday and agree, ran into some issues with it. ICU looks like a very big / complex dependency to try and include for a library I feel. You have to link in it's static libs as well as bring in it's header files. A lot of the appeal of fusion for me comes from the fact that the output is very standalone and can just be dropped into another project as a source file. Also at the moment afaik, vckpg only supports source-only libraries, so if making a library using fusion and distributing it via vcpkg will only be possible without ICU.

Headers like json.h or subprocess.h are great because they can be prepended to the header that is produced by fusion, I saw on a StackOverflow post that boost's unicode string library is header only, but it is just an abstraction, and relies on a configurable backend.

If we included boost.locale header, by default we could use the winapi backend on windows, and posix on non-windows. Since boost also supports ICU as a backend, we could add support for the consumer of the library to optionally rely on ICU instead. This keeps our fusion output fairly portable but also allows for "proper" unicode support at the consumers discretion.

@caesay
Copy link
Contributor Author

caesay commented Feb 19, 2024

@pfusik
Copy link
Collaborator

pfusik commented Feb 19, 2024

I have no experience with Boost, ICU or vcpkg. I made a decision based on https://stackoverflow.com/a/24063783/2032514
Unicode case manipulation is no trivial topic and I don't want to reinvent the wheel by maintaining my own implementation.

I got ICU on Windows with:

pacman -S mingw-w64-x86_64-icu

I suppose it's not harder on Linux or macOS. Doesn't vcpkg provide the ICU package?

@caesay
Copy link
Contributor Author

caesay commented Feb 19, 2024

Speaking of proper Unicode support - I looked at that issue you linked, which also mentioned that std::string->substr is unaware of utf-8 multi-byte sequences, so could end up cutting a character in half. This is not consistent with the C# behavior, which uses UTF-16 for example, so Substring always acts on characters, not bytes.

@pfusik
Copy link
Collaborator

pfusik commented Feb 19, 2024

In Fusion it's not "characters", but code units. Just as in UTF-8 one code point can be stored in several bytes, in UTF-16 one code point can be stored as two values (surrogate pair).

pfusik added a commit that referenced this pull request Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants