Update ascii #233
Correct the is_lower function - I overlooked that one.
Solve a glitch in the function is_printable
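The to_lower and to_upper functions assume that the input character is in ASCII. The EBCDIC conversion is trickier and not compatible with the ASCII table. |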
I can think of a number of solutions:
- The lazy one: forget that there are other encodings than ASCII because
they are too rare to be of practical interest
- Look up the character in an alphabet string (k = index( 'ABCDE..', c);
to_lower = lowercase(k:k)).
- Use a look-up table, constructed at run time (the first invocation)
- Use a static look-up table, constructed at build time
I started this, of course, because I think that the library should be as
general as practically possible, therefore that we should care about
non-ASCII encodings - to avoid puzzling or even nasty surprises. Which, in
my opinion, rules out possibility 1 ;).
Possibility 2 is the easiest to implement, but might be "slow" due to the
searching in a string.
Possibility 3 requires care with multithreaded environments, as the
initialisation works with static data.
Possibility 4 requires support from the build environment - I imagine a
small, separate Fortran program that writes the tables to an include file.
CMake is perfectly capable of arranging that sort of thing.
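To make possibility 4 a bit more concrete, here is a minimal sketch of such a generator program. It is only an illustration under stated assumptions: the file name lower_table.inc, the constant name lower_table and the assumption that native character codes fit in 0..255 are all hypothetical and not taken from stdlib. It uses ichar, so the table follows whatever encoding the build machine uses.

```fortran
! make_lower_table.f90 -- hypothetical generator, meant to run once at build time.
program make_lower_table
    implicit none
    character(len=26), parameter :: lower_case = 'abcdefghijklmnopqrstuvwxyz'
    character(len=26), parameter :: upper_case = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    integer :: table(0:255)   ! assumption: native character codes fit in 0..255
    integer :: i, k, row, lun

    ! Start from the identity mapping, then redirect each uppercase letter
    ! to the code of its lowercase counterpart in the native encoding.
    table = [(i, i = 0, 255)]
    do k = 1, 26
        table(ichar(upper_case(k:k))) = ichar(lower_case(k:k))
    end do

    ! Write the table as a named constant, 16 values per source line,
    ! so that the statement stays well within the continuation-line limit.
    open(newunit=lun, file='lower_table.inc', status='replace', action='write')
    write(lun, '(a)') 'integer, parameter :: lower_table(0:255) = [ &'
    do row = 0, 15
        if (row < 15) then
            write(lun, '(4x,15(i0,", "),i0,a)') table(16*row : 16*row + 15), ', &'
        else
            write(lun, '(4x,15(i0,", "),i0,a)') table(16*row : 16*row + 15), ' ]'
        end if
    end do
    close(lun)
end program make_lower_table
```

A library routine could then pull the file in with include 'lower_table.inc' and convert with char(lower_table(ichar(c))), so no searching or run-time initialisation is needed.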
Regards,
Arjen
On Wed 16 Sep 2020 at 18:13, Ian Giestas Pauli <
notifications@github.com> wrote:
… The to_lower and to_upper functions assume that the input character is in
ASCII. The EBCDIC <https://en.wikipedia.org/wiki/EBCDIC> conversion is
trickier and not compatible with the ASCII table
<https://en.wikipedia.org/wiki/EBCDIC#Compatibility_with_ASCII>.
|
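Thank you @arjenmarkus for fixing this. I was aware of these issues since posting about the character validation functions at the Fortran Discourse (see https://fortran-lang.discourse.group/t/character-validation-functions/131) but hadn't taken the time to fix them. In fact at some point I would like to see all of these routines replaced by static lookup tables (https://github.com/ivan-pi/fortran-ascii/blob/master/fortran_ascii_bit.f90), as this solution appears to be the most performant one. For the functions to_upper and to_lower, I've seen a dozen different possibilities on comp.lang.fortran. The solution I used here was adapted from @certik's library: https://github.com/certik/fortran-utils/blob/master/src/utils.f90#L18 Are you worried about non-ascii characters or other encodings? +1 to merge |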
Personally, I do not like to add extra parentheses around such expressions,
I just left them when I found them. While I think it is a matter of style,
I would indeed opt for removing the superfluous ones, unless they increase
readability, such as when you have .and. and .or. combined.
I am all for explicit comments like the proposed one to explain the use of
~.
I will make these changes.
On Fri 18 Sep 2020 at 01:50, Ian Giestas Pauli <
notifications@github.com> wrote:
… ***@***.**** commented on this pull request.
------------------------------
In src/stdlib_ascii.f90
<#233 (comment)>:
> end function
!> Checks whether `c` is a lowercase ASCII letter (a .. z).
pure logical function is_lower(c)
character(len=1), intent(in) :: c !! The character to test.
- is_lower = (c >= 'a') .and. (c <= 'z')
+ integer :: ic
+ ic = iachar(c)
+ is_lower = (ic >= iachar('a')) .and. (ic <= iachar('z'))
Are these parentheses really needed here?
⬇️ Suggested change
- is_lower = (ic >= iachar('a')) .and. (ic <= iachar('z'))
+ is_lower = ic >= iachar('a') .and. ic <= iachar('z')
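For reference, relational operators such as >= and <= bind more tightly than .and. in Fortran, so the two spellings in the suggested change parse identically. A small stand-alone check, written only as an illustration (the program name is made up), could look like this:

```fortran
program precedence_demo
    implicit none
    integer :: ic
    logical :: with_parens, without_parens

    ! Compare both spellings of the range test for every 7-bit character code.
    do ic = 0, 127
        with_parens    = (ic >= iachar('a')) .and. (ic <= iachar('z'))
        without_parens = ic >= iachar('a') .and. ic <= iachar('z')
        ! Stop with an error if the two forms ever disagree.
        if (with_parens .neqv. without_parens) error stop 'results differ'
    end do
    print *, 'both forms agree for all codes 0..127'
end program precedence_demo
```

So the choice between the two is purely a matter of style and readability.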
------------------------------
In src/stdlib_ascii.f90
<#233 (comment)>:
> @@ -145,13 +145,15 @@ pure logical function is_printable(c)
character(len=1), intent(in) :: c !! The character to test.
integer :: ic
ic = iachar(c) ! '~'
z'7E' = '~'
I suggest either adding some spaces so the character is aligned again,
or removing it and adding an explicit comment on a new line showing how we
are checking it.
⬇️ Suggested change
- ic = iachar(c) ! '~'
+ ic = iachar(c)
+ !The character is printable if it's between ' ' and '~' in the ASCII table
What do you think?
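Put together, the function with the explicit comment might read roughly as follows. This is a sketch rather than the actual stdlib source: the module name is made up, and the upper bound is written as iachar('~') here, which is an assumption about how the check could be spelled.

```fortran
module printable_sketch
    implicit none
contains
    !> Checks whether `c` is a printable ASCII character (space .. tilde).
    pure logical function is_printable(c)
        character(len=1), intent(in) :: c !! The character to test.
        integer :: ic
        ic = iachar(c)
        ! The character is printable if it's between ' ' and '~' in the ASCII table
        is_printable = ic >= iachar(' ') .and. ic <= iachar('~')
    end function is_printable
end module printable_sketch

program printable_demo
    use printable_sketch
    implicit none
    ! Letters and the two boundary characters are printable ...
    if (.not. is_printable('A')) error stop 'A should be printable'
    if (.not. is_printable(' ')) error stop 'space should be printable'
    if (.not. is_printable('~')) error stop 'tilde should be printable'
    ! ... while TAB (code 9) and DEL (code 127) are not.
    if (is_printable(achar(9)))   error stop 'TAB should not be printable'
    if (is_printable(achar(127))) error stop 'DEL should not be printable'
    print *, 'all checks passed'
end program printable_demo
```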
------------------------------
In src/stdlib_ascii.f90
<#233 (comment)>:
> end function
!> Checks whether `c` is a lowercase ASCII letter (a .. z).
pure logical function is_lower(c)
character(len=1), intent(in) :: c !! The character to test.
- is_lower = (c >= 'a') .and. (c <= 'z')
+ integer :: ic
+ ic = iachar(c)
+ is_lower = (ic >= iachar('a')) .and. (ic <= iachar('z'))
I noticed the entire code has those extra parentheses except the change
above, so I think we should either remove all of them or add extra
parentheses in the function is_printable above so it matches the style of
the other ones.
|
Well, I was curious about the performance, so I wrote a small program to
test it. I had to use a size of 100 million characters to see any
significant CPU time, and both implementations give roughly the same CPU
time, on the order of 0.06 seconds. The differences - if different values are
reported - are not consistent: either of the implementations may appear as
the fastest. Of course, performance tests are very difficult to get right, but
I do think this indicates that performance is not an issue.
So, a simple, robust and general implementation should be enough, certainly
for a first version.
On Thu 17 Sep 2020 at 12:43, Ivan <notifications@github.com> wrote:
Thank you @arjenmarkus <https://github.com/arjenmarkus> for fixing this.
I was aware of these issues since posting about the character validation
functions at the Fortran Discourse (see
https://fortran-lang.discourse.group/t/character-validation-functions/131)
but hadn't taken the time to fix them. In fact at some point I would like
to see all of these routines replaced by static lookup tables (
https://github.com/ivan-pi/fortran-ascii/blob/master/fortran_ascii_bit.f90),
as this solution appears to be the most performant one.
For the functions to_upper and to_lower, I've seen a dozen different
possibilities on comp.lang.fortran. The solution I used here was adapted
from @certik <https://github.com/certik>'s library:
https://github.com/certik/fortran-utils/blob/master/src/utils.f90#L18
Are you worried about non-ascii characters or other encodings?
+1 to merge
! test_lower.f90 --
!     Test the performance of two implementations of to_lower.
!
module ascii
    implicit none
contains

    !> Checks whether `c` is an uppercase ASCII letter (A .. Z).
    pure logical function is_upper(c)
        character(len=1), intent(in) :: c !! The character to test.
        integer :: ic
        ic = iachar(c)
        is_upper = (ic >= iachar('A')) .and. (ic <= iachar('Z'))
    end function

    pure function to_lower(c) result(t)
        character(len=1), intent(in) :: c !! A character.
        character(len=1) :: t
        integer :: diff
        diff = iachar('A')-iachar('a')
        t = c
        ! if uppercase, make lowercase
        if (is_upper(t)) t = achar(iachar(t) - diff)
    end function

    pure function to_lower_2(c) result(t)
        character(len=1), intent(in) :: c !! A character.
        character(len=1) :: t
        character(len=26), parameter :: lower_case = 'abcdefghijklmnopqrstuvwxyz'
        character(len=26), parameter :: upper_case = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
        integer :: k
        k = index( upper_case, c )
        if ( k > 0 ) then
            t = lower_case(k:k)
        else
            t = c
        endif
    end function

end module ascii

program test_lower
    use ascii
    implicit none
    real :: r
    ! Store the results, to avoid false results from aggressive optimisation
    character(len=1), dimension(10000000) :: c, t
    integer :: ic, i
    real :: t1, t2, t_method1, t_method2

    do i = 1,size(c)
        call random_number( r )
        ic = int( 256 * r )
        c(i) = achar(ic)
    enddo

    !
    ! Method 1
    !
    call cpu_time( t1 )
    do i = 1,size(c)
        t(i) = to_lower(c(i))
    enddo
    call cpu_time( t2 )
    t_method1 = t2 - t1

    !
    ! Method 2
    !
    call cpu_time( t1 )
    do i = 1,size(c)
        t(i) = to_lower(c(i))
    enddo
    call cpu_time( t2 )
    t_method2 = t2 - t1

    !
    ! Print the results
    !
    write(*,*) 'Method 1 (shift): ', t_method1
    write(*,*) 'Method 2 (look-up): ', t_method2
    do i = 1,100
        write(*,*) c(i), t(i)
    enddo
end program test_lower
|
Hi @arjenmarkus, I noticed in the benchmark code you posted that you appear to be testing the same implementation twice as opposed to two different implementations - is this a typo? On my machine I see the lookup implementation as being almost two orders of magnitude slower than the shift implementation. I am aware performance is not the primary concern for this PR or the stdlib reference implementation but I thought I would point it out. |
I agree. |
Hi Laurence,
Oh dear, that is a stupid mistake. No wonder things looked so much the same!
Yes, after correcting the program I get a clear difference:
0.05 seconds for the "shift" implementation and 0.2 for the "look-up" one.
This is roughly the same with Intel Fortran and gfortran (on Windows and
Cygwin).
Optimisation with -O2 does not give substantial changes to the second one,
but I need to do something with the result array, as otherwise both loops
give zero CPU time with gfortran.
Still, 0.2 seconds for 100 million characters is not that bad :).
On Fri 18 Sep 2020 at 10:36, Laurence Kedward <
notifications@github.com> wrote:
… Hi @arjenmarkus <https://github.com/arjenmarkus>, I noticed in the
benchmark code you posted that you appear to be testing the same
implementation twice as opposed to two different implementations - is this
a typo? On my machine I see the lookup implementation as being almost two
orders of magnitude slower than the shift implementation.
As an aside, the lookup approach can be implemented much more efficiently
by using the character code directly as an index into a static table
instead of using index.
I am aware performance is not the primary concern for this PR or the
stdlib reference implementation but I thought I would point it out.
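As a rough illustration of the aside about indexing a static table directly with the character code, a minimal sketch could look like the following. The module, table and function names are hypothetical, and the table covers only the 7-bit ASCII range; anything outside that range is returned unchanged.

```fortran
module table_lookup_sketch
    implicit none
    private
    public :: to_lower_table

    integer :: i   ! implied-do index, used only in the constant expression below

    ! Entry k holds the lowercase form of the character with ASCII code k:
    ! codes 65..90 ('A'..'Z') map to 97..122 ('a'..'z'), all others map to themselves.
    character(len=1), parameter :: lower_table(0:127) = &
        [ (achar(i), i = 0, 64), (achar(i + 32), i = 65, 90), (achar(i), i = 91, 127) ]

contains

    pure function to_lower_table(c) result(t)
        character(len=1), intent(in) :: c !! A character.
        character(len=1) :: t
        integer :: k
        k = iachar(c)
        if (k >= 0 .and. k <= 127) then
            t = lower_table(k)   ! one array access, no searching
        else
            t = c                ! outside the ASCII range: leave unchanged
        end if
    end function to_lower_table

end module table_lookup_sketch

program table_lookup_demo
    use table_lookup_sketch
    implicit none
    ! Uppercase letters are converted, everything else is passed through.
    if (to_lower_table('A') /= 'a') error stop 'A should map to a'
    if (to_lower_table('Z') /= 'z') error stop 'Z should map to z'
    if (to_lower_table('q') /= 'q') error stop 'q should stay q'
    if (to_lower_table('[') /= '[') error stop '[ should stay ['
    print *, 'all checks passed'
end program table_lookup_demo
```

Unlike the index-based look-up in to_lower_2, the cost here does not depend on where the letter sits in the alphabet.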
|
It makes sense since in the lookup method in the worst case scenario (letter 'z') has to check 26 chars, on the other side the |
👍 from me, thank you Arjen. I'm also in favor of removing extra sets of parentheses throughout, but don't feel strongly about it.
Co-authored-by: Ian Giestas Pauli <iangiestaspauli@gmail.com>
Closing this after committing the suggestions. I will make a few more edits (especially remove the extra parentheses to make the code a bit more consistent) and I will also implement the to_lower and to_upper functions differently. |
@arjenmarkus Did you mean to merge it? You only closed the PR without merging it. |
Hi @milan Curcic <caomaco@gmail.com>, yes, good heavens, again the wrong
action. How do I properly merge it?
|
Hm, I see a failure from CI, but I do not see what went wrong - the code I
committed works fine on my machine. Or at least I am convinced it does. How
can I see the reason?
|
Right, the failure is on macOS: Error:
/Users/runner/work/stdlib/stdlib/build is not a directory
This happens for all three versions (7, 8 and 9). Any suggestions as to how
to solve this?
On Thu 24 Sep 2020 at 21:38, Jeremie Vandenplas <
notifications@github.com> wrote:
You can go to Actions to find the details. Here is the link:
<https://github.com/fortran-lang/stdlib/runs/1162227548>
|
A problem already appeared earlier:
git -c http.extraheader="AUTHORIZATION: basic ***" fetch --tags --prune --progress --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/pull/233/merge:refs/remotes/pull/233/merge
Error: fatal: couldn't find remote ref refs/pull/233/merge
Warning: Git fetch failed with exit code 128, back off 6.955 seconds before retry.
git -c http.extraheader="AUTHORIZATION: basic ***" fetch --tags --prune --progress --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/pull/233/merge:refs/remotes/pull/233/merge
Error: fatal: couldn't find remote ref refs/pull/233/merge
Warning: Git fetch failed with exit code 128, back off 4.998 seconds before retry.
No idea what it could be.
|
Okay, thanks anyway. At the moment there is not much we can do about it
then.
|
Well, I am learning the rules of GitHub the hard way, I guess. Hopefully I
will turn out to be a quick learner.
So: once you have made a pull request for a particular branch, you have to
create a new branch to continue work. Seems reasonably simple.
On Thu 24 Sep 2020 at 21:58, Jeremie Vandenplas <
notifications@github.com> wrote:
… Well, I see that you opened a new PR #235 from the same branch
update_ascii, and that the CI for PR #235 is fine. So, I would
say that the problem has solved itself (at least temporarily) ;)
Furthermore, this PR cannot be reopened because #235 exists.
|
I am not sure what you mean and what were your intentions. However, for me, #235 seems to be #233 with some additional commits that were pushed to |
The changes I made allow the code to work properly on a system that uses a different encoding than ASCII, for instance systems that use EBCDIC.
Note: the conversion to lower case and upper case should be regarded with care! I am uncertain that they would work as intended.