Proposal for ascii #11
Comments
One of the simple points to discuss (connected to the style guide #3) is naming conventions. For example for the function that checks whether a character is a letter some of the possible names are:
Personally, I like the underscored version (it matches the general verbosity of Fortran). I have also noticed that in C++, they have a separate set of these functions for "wide" strings (e.g. |
@ivan-pi let's move the naming conventions discussion into #3 (comment), where I just commented. |
To address character kinds, this is my comment from the ascii proto-type repo:
Well, not exactly, because, as you noted, they're not guaranteed to exist, and when they do exist, "DEFAULT" is often/usually the same kind as "ASCII", so you can't create overloaded functions with arguments that are "ascii" and "default". (In that case you'd have a duplicate interface.) That's one nice thing about jin2for: It doesn't assume anything and interrogates the numeric kinds from the compiler to then generate the code. So if only one character kind is supported then your code will only have that one kind. I'll cross post this on the new issue you made. |
I see what you mean. If "default" and "ascii" weren't the same kind, then I could just create generic interfaces for all three character kinds (with conditional compilation if the compiler supports UCS using CMake). How likely are we to meet a processor where "ascii" is not equal to "default"? Since the module is supposed to work for ascii characters, I think it is possible to write all the functions in such a way that they work with either "default" (non-ascii) or "ascii" straight out of the box (as long as the processor supports ascii) using the For the ascii subset of unicode characters it should be possible to create overloaded functions because they have a different kind (more bits). Since there are not so many functions in this module I could just do it manually. We can always bring in jin2for later. |
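One way to make a classification routine independent of the processor's default collating sequence is to go through the intrinsic `iachar`, which returns the position of a character in the ASCII sequence. The following is only a minimal sketch under that assumption; the module and function names (`ascii_portable_sketch`, `is_alpha_portable`) are hypothetical and not part of the proposed module.

```fortran
module ascii_portable_sketch
  implicit none
  private
  public :: is_alpha_portable

contains

  !> Hypothetical helper: true when c is an ASCII letter.
  !> iachar maps c to its ASCII code, so the comparison does not
  !> depend on how the default character kind orders its characters.
  pure logical function is_alpha_portable(c)
    character(len=1), intent(in) :: c
    integer :: ic
    ic = iachar(c)
    is_alpha_portable = (ic >= 65 .and. ic <= 90) .or. &   ! 'A'..'Z'
                        (ic >= 97 .and. ic <= 122)         ! 'a'..'z'
  end function is_alpha_portable

end module ascii_portable_sketch
```

Comparing integer codes rather than characters sidesteps the question of whether `'a' <= c` means the same thing on a non-ASCII default kind, at the cost of one extra intrinsic call.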
For the current use case, I don't think it makes sense to use Jin2For either. Let's start with default character kinds while we consider if there's anything clever that we can do. We can always do some introspection with CMake and then use |
Is it possible to have the operating system somehow simulate "EBCDIC" defaults, or maybe it is possible to set up a Docker image? I have no means of testing non-"ascii" defaults at the moment. |
I think the way the procedures are written now (https://github.com/ivan-pi/fortran-ascii/blob/master/stdlib_ascii.f90) would work with both "default" and "ascii" characters. |
Before I submit a pull request, there are a few more issues worth discussing:
|
Thanks @ivan-pi for submitting a PR with this! Much appreciated. Let's discuss the API. |
Thanks @ivan-pi for this nice implementation. |
Perhaps I have not understood your question correctly. The long tests generally loop through all characters (hence "long") and are meant only as unit tests to verify the correctness of the functions. Since the short/long adjectives might lead to some confusion, I would not mind if we rename them. The last test - Since the functions follow the same pattern, what might be worth including in the module is an abstract interface:
    abstract interface
      pure logical function validation_func_interface(c)
        character(len=1), intent(in) :: c
      end function
    end interface
although, apart from my test case, which uses an array of procedure pointers to loop through the character validation routines, I don't really know if there would be any use cases. |
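To illustrate the "array of procedure pointers" use case mentioned above: Fortran has no direct arrays of procedure pointers, so a common workaround is an array of a derived type holding one pointer component. The sketch below assumes that workaround; `validator`, `count_matches`, and the two `*_demo` functions are hypothetical names used only for illustration.

```fortran
module validation_pointer_sketch
  implicit none

  abstract interface
    pure logical function validation_func_interface(c)
      character(len=1), intent(in) :: c
    end function
  end interface

  !> Hypothetical wrapper type: lets us build an array of validators.
  type :: validator
    procedure(validation_func_interface), pointer, nopass :: check => null()
  end type validator

contains

  pure logical function is_digit_demo(c)
    character(len=1), intent(in) :: c
    is_digit_demo = (c >= '0') .and. (c <= '9')
  end function is_digit_demo

  pure logical function is_upper_demo(c)
    character(len=1), intent(in) :: c
    is_upper_demo = (c >= 'A') .and. (c <= 'Z')
  end function is_upper_demo

  !> Count the characters of str accepted by the validator v.
  integer function count_matches(v, str)
    type(validator), intent(in) :: v
    character(len=*), intent(in) :: str
    integer :: i
    count_matches = 0
    do i = 1, len(str)
      if (v%check(str(i:i))) count_matches = count_matches + 1
    end do
  end function count_matches

end module validation_pointer_sketch
```

A test driver could declare `type(validator) :: v(2)`, point `v(1)%check => is_digit_demo` and `v(2)%check => is_upper_demo`, and then loop over `v` with the same input string.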
@ivan-pi I think they are fine.
we can discuss and see later if it could be useful or not. |
This would indeed be useful, and I would not mind preparing some examples for users. The way I see this done with other languages/libraries is to add usage examples in the documentation string. We should work this out under issue #4. These could be integrated into a documentation website, kind of like with Sympy or D (see here). I am sure @milancurcic or @certik have some ideas how to publish the API and documentation as a website in the future. |
@ivan-pi ideally in the future we could write doctests just like in Python, as I just commented at #4 (comment). |
@ivan-pi Thank you for your work and sorry to be late to this thread, I missed reading it all the way through.
Overall, if a I just checked and confirmed that they can all be made |
One downside of elemental procedures is that they cannot be used as procedure pointers (https://stackoverflow.com/questions/15225007/elemental-functions-cannot-be-pointed-to-by-procedure-pointers). I am not sure whether this matters in practical usage cases or not. |
Ah, okay, I didn't know. That seems like an important restriction to take into account. We may just have to consider In this case, I don't know the answer as I don't have much experience working with text in Fortran (this may change now that we have ascii module :)). I can imagine it being useful to feed an array of characters to |
We can start with elemental, and since we are still in "experimental", we can remove elemental if we discover issues. |
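For readers unfamiliar with the trade-off, here is a minimal sketch of what "elemental" buys: the same function accepts a scalar or an array of single characters. The names `elemental_sketch` and `is_digit_elem` are hypothetical and only stand in for the real routines.

```fortran
module elemental_sketch
  implicit none

contains

  !> Hypothetical elemental variant of a digit test.
  !> Elemental implies pure; the function is applied element by
  !> element when the actual argument is an array.
  elemental logical function is_digit_elem(c)
    character(len=1), intent(in) :: c
    is_digit_elem = (c >= '0') .and. (c <= '9')
  end function is_digit_elem

end module elemental_sketch
```

With this, `is_digit_elem(['a', '1', 'b', '2'])` returns a logical array of size 4. Note that a `character(len=n)` scalar is not an array of `len=1` characters, so a string has to be split first, for example with `[(s(i:i), i = 1, len(s))]`. As pointed out earlier in the thread, the price is that an elemental procedure cannot be the target of a procedure pointer.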
Question regarding control characters. They are currently defined as public parameters:
    ! All control characters in the ASCII table (see www.asciitable.com).
    character(len=1), public, parameter :: NUL = achar(z'00') !! Null
    character(len=1), public, parameter :: SOH = achar(z'01') !! Start of heading
    ... ! 30 more parameters omitted
    character(len=1), public, parameter :: DEL = achar(z'7F') !! Delete
First, consider that a user may just do However, consider a user that wants to work specifically with control characters. Their only options are:
Alternatives would be to wrap these in a private derived type So, are we happy with the current API of control characters or is this a concern? |
I agree this is a concern. Your suggestion is in line with one of my previous comments:
Personally, I think we should go with your suggestion. In fact in D, they have something similar (https://dlang.org/phobos/std_ascii.html#ControlChar). I can prepare a new pull request with your suggestion, modify the procedures to be elemental, and also replace the |
Today I was leafing through the book "Migrating to Fortran 90" by James F. Kerrigan, and on page 195 the author uses a derived type to encapsulate a set of cursor control strings useful for communication with an ANSI 3.64 compliant video display terminal. Given this prior art, I think we can go ahead with the solution @milancurcic has suggested. |
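A rough sketch of what wrapping the control characters in a derived type could look like is given below. This is an assumption about the shape of the suggestion, not the agreed API: the type name `control_char_set`, the instance name `ascii_control_char`, and the choice of components shown are all hypothetical, and only a handful of the control characters are listed.

```fortran
module control_char_sketch
  implicit none
  private
  public :: ascii_control_char

  !> Hypothetical private type; users only see the single public
  !> instance below, so the names NUL, SOH, ... do not pollute the
  !> namespace of whoever uses the module.
  type :: control_char_set
    character(len=1) :: NUL = achar(0)    !! Null
    character(len=1) :: SOH = achar(1)    !! Start of heading
    character(len=1) :: TAB = achar(9)    !! Horizontal tab
    character(len=1) :: LF  = achar(10)   !! Line feed
    character(len=1) :: DEL = achar(127)  !! Delete
    ! the remaining control characters would follow the same pattern
  end type control_char_set

  !> Constant instance built from the default component values.
  type(control_char_set), parameter :: ascii_control_char = control_char_set()

end module control_char_sketch
```

User code would then write `ascii_control_char%TAB` instead of importing a bare `TAB`, which is similar in spirit to the `ControlChar` enum in D's `std.ascii`.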
Thanks @ivan-pi. Another minor nit-pick. Should this:
    pure logical function is_alphanum(c)
      character(len=1), intent(in) :: c !! The character to test.
      is_alphanum = (c >= '0' .and. c <= '9') .or. (c >= 'a' .and. c <= 'z') &
        .or. (c >= 'A' .and. c <= 'Z')
    end function
be written as this?
    pure logical function is_alphanum(c)
      character(len=1), intent(in) :: c !! The character to test.
      is_alphanum = is_digit(c) .or. is_alpha(c)
    end function |
I agree the second version is cleaner and saves a few characters. It would be interesting to compare the differences at assembly level with regard to different optimization flags. |
Should we think here about getting the most optimized code? Certainly some optimizations are needed. But, as a user, I would first focus on the ease of use. If I needed performance, I would most likely implement what I need myself. Also, as a developer, I prefer the second solution of @milancurcic. If something must be changed, modifying only |
I tried to create a benchmark to test the two versions above and it is more difficult than I expected 😦. With no optimization flags and a low number of runs it seems like the second version is a bit slower, as it invokes two function calls. With Let's go for the second version then. I will make the changes in the next iteration of my PR after we agree what to do with the character constants in #49. |
Since we are already discussing implementation details I was wondering how the character classification functions are defined in the C library. They use a different approach, quoting from Wikipedia:
This can be done in Fortran by setting up a constant array of 128 integers (say 16-bit), one for each ascii character (this could be done with a list of binary literals). The bit values are then used to indicate the different properties of a character (alphabetical, digit, punctuation, control, etc.). For example, if the first bit is used to represent whether the character is alphabetical or not, the is_alpha function could look like:
    elemental logical function is_alpha(c)
      character, intent(in) :: c
      integer :: ic
      ic = iachar(c)
      is_alpha = btest(table(ic),0) ! access ascii character table
    end function
I am not sure though what is the behavior of Such a table can be easily generated using the "current" functions:
program gen_ascii_table
use stdlib_experimental_ascii
implicit none
integer :: ascii_table(0:127)
integer :: i
character(len=1) :: c
! initialize all bits to zero
ascii_table = 0
do i = 0, 127
c = achar(i)
if (is_alpha(c)) ascii_table(i) = ibset(ascii_table(i),0)
if (is_digit(c)) ascii_table(i) = ibset(ascii_table(i),1)
if (is_alphanum(c)) ascii_table(i) = ibset(ascii_table(i),2)
if (is_punctuation(c)) ascii_table(i) = ibset(ascii_table(i),3)
if (is_control(c)) ascii_table(i) = ibset(ascii_table(i),4)
if (is_graphical(c)) ascii_table(i) = ibset(ascii_table(i),5)
if (is_printable(c)) ascii_table(i) = ibset(ascii_table(i),6)
if (is_white(c)) ascii_table(i) = ibset(ascii_table(i),7)
if (is_blank(c)) ascii_table(i) = ibset(ascii_table(i),8)
if (is_lower(c)) ascii_table(i) = ibset(ascii_table(i),9)
if (is_upper(c)) ascii_table(i) = ibset(ascii_table(i),10)
if (is_octal_digit(c)) ascii_table(i) = ibset(ascii_table(i),11)
if (is_hex_digit(c)) ascii_table(i) = ibset(ascii_table(i),12)
end do
write(*,'(A,128(I0,:,","))',advance='no') "[",(ascii_table(i),i=0,127)
write(*,'(a1)') "]"
end program
The table of integers is then:
As you can see, different integers correspond to different character properties (e.g. 613 are lowercase letters which are not hex digits, 104 are punctuation characters, 6246 are octal digits...). Is there any reason we would prefer to go down this bit route? |
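As a compact illustration of the bit-table idea, the sketch below builds a two-property table at compile time and guards the lookup against codes outside 0..127, which is one possible answer to the question of what happens for non-ascii input. It is a sketch under stated assumptions, not the proposed implementation: only the "alpha" and "digit" bits are encoded, and the names `ascii_table_sketch`, `is_alpha_tbl`, `is_digit_tbl`, and `has_bit` are hypothetical.

```fortran
module ascii_table_sketch
  implicit none
  private
  public :: is_alpha_tbl, is_digit_tbl

  integer, parameter :: bit_alpha = 0   ! bit 0: alphabetical
  integer, parameter :: bit_digit = 1   ! bit 1: decimal digit

  integer :: i   ! implied-do index for the constructor below

  ! One integer per ASCII code; each bit flags one property.
  integer, parameter :: table(0:127) = [( &
      merge(1, 0, (i >= 65 .and. i <= 90) .or. (i >= 97 .and. i <= 122)) + &
      merge(2, 0, i >= 48 .and. i <= 57), i = 0, 127)]

contains

  elemental logical function is_alpha_tbl(c)
    character(len=1), intent(in) :: c
    is_alpha_tbl = has_bit(c, bit_alpha)
  end function is_alpha_tbl

  elemental logical function is_digit_tbl(c)
    character(len=1), intent(in) :: c
    is_digit_tbl = has_bit(c, bit_digit)
  end function is_digit_tbl

  !> Look up one property bit; codes outside the table yield .false.
  elemental logical function has_bit(c, pos)
    character(len=1), intent(in) :: c
    integer, intent(in) :: pos
    integer :: ic
    ic = iachar(c)
    has_bit = .false.
    if (ic >= 0 .and. ic <= 127) has_bit = btest(table(ic), pos)
  end function has_bit

end module ascii_table_sketch
```

The range check costs one extra comparison per call; dropping it would make the lookup faster but leaves out-of-range input as an out-of-bounds array access.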
Trying to compile the program
I get the following errors with gfortran:
However for the |
On my phone so I’ll be brief/terse and can’t easily look at the code.
Also, FYI, IIRC: compilers may not be required to handle non-ascii characters by the standard. I seem to recall this to be true from when I implemented Unicode support in JSON-Fortran. |
If I make the procedures pure instead of elemental, I am still "allowed" to pass non-ascii characters to the non-intrinsic
In that case is the best we can do to simply state in the documentation the behavior is undefined for non-ascii symbols? |
Sorry, I wasn't entirely clear in my previous comment. My recollection is that compilers are not required to handle non-ascii characters in program source code. But I may be mistaken here. I seem to recall having to use the backslash notation with GFortran:
https://gcc.gnu.org/onlinedocs/gfortran/Fortran-Dialect-Options.html |
Also, file encoding issues may be in play here, as you noted. |
@dev-zero wrote in #32 (comment):
Thanks for your suggestion! We could also just directly call the C routines. Over at https://github.com/ivan-pi/fortran-ascii/tree/master I've actually prepared 4 different versions of the character validation routines (three in Fortran, and one directly calling the C routines). I've done some micro-benchmarking and the differences can be up to a factor of 4. C++ still comes out a tiny bit faster for some reason. I'm writing a blog post about it (hopefully I finish it this weekend). |
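For reference, calling the C routines directly could be sketched as below using `iso_c_binding`. This is not the code from the linked repository: the wrapper name `is_alpha_c` and the interface name `c_isalpha` are hypothetical, and declaring the C function `pure` is an assumption (its result depends on the C locale, but it has no side effects).

```fortran
module c_ctype_sketch
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
  private
  public :: is_alpha_c

  interface
    ! Binding to int isalpha(int) from the C standard library.
    ! Marked pure by assumption so it can be used in elemental code.
    pure function c_isalpha(ch) bind(C, name='isalpha') result(res)
      import :: c_int
      integer(c_int), intent(in), value :: ch
      integer(c_int) :: res
    end function c_isalpha
  end interface

contains

  !> Hypothetical wrapper converting the C int result to a logical.
  elemental logical function is_alpha_c(c)
    character(len=1), intent(in) :: c
    is_alpha_c = c_isalpha(int(iachar(c), c_int)) /= 0
  end function is_alpha_c

end module c_ctype_sketch
```

The wrapper adds a type conversion on the way in and a comparison on the way out, which may account for part of any timing difference relative to a pure Fortran version.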
I use the following simple routines in C++ to encode and decode unicode strings: https://github.com/certik/terminal/blob/69ee07e5aee2fe4c4bff4fa164364ec049c66069/terminal.h#L430 Here is how to use the I wrote the |
I've done some testing of different implementation approaches of the character validation routines:
If anyone is interested, the results are available here (scroll up for a description). There is no clear winner (besides C++). The results do change around 5 % if I switch compiler flags, or even if I change the order of comparison in certain relational operations. I did not check what's going on at assembly level, and I think even the timing routines might skew the results somehow (e.g. the Fortran timings improved when I switched from If anyone has some suggestions, how to improve the routines or make the measurements more accurate, please open an issue at my repository: https://github.com/ivan-pi/fortran-ascii |
@ivan-pi can this issue be closed or is the ascii module still an open-ended project? |
The specifications of the character validation routines are still missing. I will try to get them done soon. |
Hey @ivan-pi, |
This module should include functions for character classification and conversion (lower, upper). I have prepared a basic implementation at https://github.com/ivan-pi/fortran-ascii.
The plan is to cover the same functionality as found in the C, C++, and D libraries:
@zbeekman has already opened an issue (see ivan-pi/fortran-ascii#1) on dealing with different character kinds. The problem is that the ascii and iso_10646 character sets need not be supported by the compilers. Even if they are supported their bitwise representation might be different from the default kind.
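The character-kind question can be probed at compile time with the intrinsic `selected_char_kind`, which returns -1 for a character set the processor does not support. The sketch below only collects the kind numbers; the module and constant names are hypothetical.

```fortran
module char_kind_sketch
  implicit none

  ! Kind numbers reported by the processor (-1 means unsupported).
  integer, parameter :: default_kind = selected_char_kind('DEFAULT')
  integer, parameter :: ascii_kind   = selected_char_kind('ASCII')
  integer, parameter :: ucs4_kind    = selected_char_kind('ISO_10646')

  ! If this is .true., procedures for "ascii" and "default" arguments
  ! would have identical interfaces and cannot both be put into one
  ! generic interface.
  logical, parameter :: ascii_is_default = (ascii_kind == default_kind)

  ! Whether an ISO 10646 (UCS-4) kind is available at all.
  logical, parameter :: has_ucs4 = (ucs4_kind >= 0)

end module char_kind_sketch
```

Because `character(kind=ucs4_kind)` declarations fail to compile when `ucs4_kind` is -1, code for that kind still needs conditional compilation or code generation, as discussed in the comments.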
I realized while creating these functions, that agreeing upon a style guide #3 and documentation #4 early on would be helpful to improve future pull requests. Some agreement upon unit testing will also be necessary.
cc: @jacobwilliams