Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement non-fancy functional string type #320

Merged
merged 9 commits into from Mar 11, 2021

Conversation

awvwgk
Copy link
Member

@awvwgk awvwgk commented Feb 13, 2021

This patch proposes a non-fancy string type, which basically provides the same functionality as a deferred length character but can be used in an elemental rather than a pure way. The idea was to have a scaffold for the string type in stdlib which can be extended later but already provides everything we are used to have from the deferred length character without the rough edges, like not being able to read into an unallocated deferred length character variable or being unable to have arrays.

Related issues

Goals

  • add a non-extendible derived type representing a character sequence (string_type)
  • support all intrinsic operators for character variables (llt, lgt, lle, lge, len, len_trim, trim, adjustl, adjustr, index, scan, verify, repeat, ichar, iachar)
  • index and slice operations are implemented with the char function, which also allows to return a string_type as fixed length character variable
  • support user defined derived type IO (formatted and unformatted)
  • implement assignment from character values
  • implement constructing string from character values

Non-goals (for this patch)

  • add new functionality beyond the intrinsic functionality of deferred length character variables
  • define the internal representation of the represented character sequence

FPM project

To ease review I created an fpm project at https://github.com/awvwgk/stdlib_string, which I will keep in sync with this patch. It also provides the API documentation and the specs at https://awvwgk.github.io/stdlib_string.

@awvwgk awvwgk added the topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... label Feb 13, 2021
- support all intrinsic operators for character variables
  (llt, lgt, lle, lge, len, len_trim, trim, adjustl, adjustr,
  index, scan, verify, repeat, char, ichar, iachar)
- support derived type IO (formatted and unformatted)
- implement assignment from character values
- implement constructing string from character values
@milancurcic
Copy link
Member

Thank you, I think this will be a very nice addition.

I learned something new. If you define and import, say, operator(.gt.), you get operator(>) for free. Do you know that this is standard behavior? I was surprised to see it.

Why did you implement the operators as module procedures instead of type bound methods? Currently, you need to explicitly import the operators from the module if you want to use them:

    use stdlib_string_type, only : string_type, assignment(=), len, &
        operator(.gt.), operator(.lt.), operator(.ge.), operator(.le.), &
        operator(.ne.), operator(.eq.), operator(//)

But if they were type-bound, they'd come with the type. Another advantage is that they would work with types that extend string_type (assuming the dummy arguments are declared as class(string_type)). There may be some disadvantage to the type-bound operator approach that I'm not aware of.

I also suggest adding an option to assign a string_type to an allocatable character variable:

    !> Assign a character sequence to a string and vice versa.
    interface assignment(=)
        module procedure :: assign_string_char
        module procedure :: assign_char_string
    end interface assignment(=)
...
    !> Assign a string to a character sequence.
    elemental subroutine assign_char_string(lhs, rhs)
        character(len=*), intent(inout) :: lhs
        type(string_type), intent(in) :: rhs
        lhs = rhs%raw
    end subroutine assign_char_string

If you agree, let me know if you'd like me to add the implementation and tests for it to this PR.

One thing that may be a non-goal for this PR, but I think useful, is a function (type-bound or otherwise) that returns a raw Fortran character string without having to assign it. This will allow string_type instance to be easily passed to procedure that expect an ordinary character variable. This would allow a "progressive" (incremental) incorporation of string_type into larger code that works with ordinary strings.

@awvwgk
Copy link
Member Author

awvwgk commented Feb 13, 2021

Why did you implement the operators as module procedures instead of type bound methods? Currently, you need to explicitly import the operators from the module if you want to use them:

None of the operators is defined for a polymorphic object so far, therefore they cannot be type-bound. Also the string_type is a non-extendible derived type and can therefore not have type-bound procedures, it would be cool to have a non-extendible derived type carry generic interfaces but we don't have those in Fortran. Also, it's a design choice aiming for an functional rather than for an object oriented design, for two reasons:

  • non of the existing (intrinsic) string routines are type-bound, they all operate in a functional way rather than on the character sequence itself, defining type bound procedures for some (the operators) is inconsistent in my opinion
  • type bound procedures are resolved at runtime, a procedure pointer is usually carried in the object instance, which makes the string object larger than it has to be

I did want to explore a string class at some point (maybe we can call it string_class) which allows to work on the character sequence it is representing in place. It comes which much more additional headache I wanted to avoid on the first pass. ftlString is such an example, even there not all operator interfaces can be type-bound so it won't be a complete package in any case.


I also suggest adding an option to assign a string_type to an allocatable character variable:

There is the usual tie between fixed length characters and deferred length characters, you can only define one interface but not both.

You can either have a pure assignment to a deferred length character, but this disallows the assignment to a fixed length character, because it doesn't have the allocatable attribute

pure subroutine assign_char_string(lhs, rhs)
  character(len=:), allocatable, intent(inout) :: lhs
  type(string_type), intent(in) :: rhs
end subroutine assign_char_string

Or you can have an elemental assignment to a fixed length character, which works nicely for allocated deferred length characters, but segfaults for unallocated deferred length characters (iso_varying_string has such an interface and it annoys me every time I have to move a varying_string to a deferred length character):

elemental subroutine assign_char_string(lhs, rhs)
  character(len=*), intent(inout) :: lhs
  type(string_type), intent(in) :: rhs
end subroutine assign_char_string

The best option in my opinion was to drop this case entirely and use the automatic left-hand-side reallocating by explicitly returning the character sequences from the string_type with the char method to cover both deferred length characters and fixed length characters. It's still not as elegant as it wish it were, but with the current standard this is how far we can get. Also the char method could have been so much nicer if GCC 8 had support for my original implementation.

@awvwgk
Copy link
Member Author

awvwgk commented Feb 13, 2021

One thing that may be a non-goal for this PR, but I think useful, is a function (type-bound or otherwise) that returns a raw Fortran character string without having to assign it.

Have a look at the char procedure, it implements exactly this behaviour together with the index access and slice functionality.

@milancurcic
Copy link
Member

Excellent, I didn't put two and two together and realize that this also implements char which works for both use cases (assign and pass as argument).

@milancurcic
Copy link
Member

Also the string_type is a non-extendible derived type and can therefore not have type-bound procedures, it would be cool to have a non-extendible derived type carry generic interfaces but we don't have those in Fortran.

Is sequence the only thing making it non-extendable? I thought sequence was deprecated a while ago. Modern Fortran Explained says so in A.2. But I can't find a similar statement in the standard so I'm confused about its status.

it's a design choice aiming for an functional rather than for an object oriented design, for two reasons:

  • non of the existing (intrinsic) string routines are type-bound, they all operate in a functional way rather than on the character sequence itself, defining type bound procedures for some (the operators) is inconsistent in my opinion
  • type bound procedures are resolved at runtime, a procedure pointer is usually carried in the object instance, which makes the string object larger than it has to be

Well, operators being type-bound is an implementation detail. The UI would still be functional.

The 2nd argument (run-time resolution) is valid, IMO. So it's important to note that this choice would be a compromise between performance (compile-time resolution of operators) and UI (having to import operators explicitly).

As long as this module exports only its type and the procedure interfaces that work on it, I'd be happy to recommend users just doing use stdlib_string_type and be done with it.

@awvwgk
Copy link
Member Author

awvwgk commented Feb 15, 2021

Is sequence the only thing making it non-extendable? I thought sequence was deprecated a while ago. Modern Fortran Explained says so in A.2. But I can't find a similar statement in the standard so I'm confused about its status.

See https://j3-fortran.org/doc/year/18/18-007r1.pdf#subsubsection.573, no mention about any deprecation here or elsewhere in this document. To my knowledge, sequence and bind(c) are the only ways to make a type non-extendible. That said, the current specification would allow a C struct and implementation and only provide iso_c_binding interfaces in stdlib_string_type.

@ivan-pi
Copy link
Member

ivan-pi commented Feb 15, 2021

As long as this module exports only its type and the procedure interfaces that work on it, I'd be happy to recommend users just doing use stdlib_string_type and be done with it.

This is a good general rule we can adopt in a future Fortran style guide. Particularly in the case when the procedures overload intrinsic functions and provide equivalent functionality.

@epagone
Copy link

epagone commented Feb 15, 2021

As long as this module exports only its type and the procedure interfaces that work on it, I'd be happy to recommend users just doing use stdlib_string_type and be done with it.

This is a good general rule we can adopt in a future Fortran style guide. Particularly in the case when the procedures overload intrinsic functions and provide equivalent functionality.

I agree on the fact that it should be a general rule, but I don't know if we have a majority on this. Maybe @milancurcic wants to consider this case-by-case, considering his pre-clause above (as long as...).

He might not be alone. For example, I did raise a related issue while reviewing the string list PR (together with other observations) where the namespace was polluted with a list_end variable and I suggested to encapsulated it within the derived type. Well, although we got to string list "new", well, let's say that my comment did not get much traction 😃

@milancurcic
Copy link
Member

See https://j3-fortran.org/doc/year/18/18-007r1.pdf#subsubsection.573, no mention about any deprecation here or elsewhere in this document. To my knowledge, sequence and bind(c) are the only ways to make a type non-extendible.

I just asked about this on Discourse.

But I'm still unclear what are you trying to do with sequence. Do you use it only to forbid extending the type? From my reading of sequence, it's meant only for enforcing the storage order of derived type components (please correct me if I'm wrong). But string_type has only one scalar component. So does sequence do anything useful here?

@awvwgk
Copy link
Member Author

awvwgk commented Feb 15, 2021

Do you use it only to forbid extending the type?

It's the only surefire method I know to make a type non-extendible, that's the reason I'm applying it here. But maybe there is a better way.

The problem is that extending another class from the string_type wouldn't work as expected given the current implementation, the new class would basically become useless unless one implements all functionality from stdlib_string_type again. For this reason I decided to it should be non-extendible.

Designing an abstract base class that can support the full functionality of the intrinsic deferred length character is not easy, also it is not a lightweight object anymore. I will upload a stdlib_string_class.f90 to awvwgk/stdlib_string once I'm done writing the ABC for an extendible string object.

@milancurcic
Copy link
Member

@awvwgk okay, all good, I understand now. I agree with this design choice. 👍

@milancurcic
Copy link
Member

An idea: What do you think about calling this str_type? It's more concise. If this type is successful, it will be used often. str is also ubiquitous to mean "string" in programming. But string_type spells out exactly what it is, which I also think is good. I like them both, but str_type slightly better.

@awvwgk
Copy link
Member Author

awvwgk commented Feb 17, 2021

An idea: What do you think about calling this str_type? It's more concise. If this type is successful, it will be used often. str is also ubiquitous to mean "string" in programming. But string_type spells out exactly what it is, which I also think is good. I like them both, but str_type slightly better.

IMO, being explicit rather than implicit should be preferred, similar like we prefer _type over _t I would prefer string_type over str_type here. Either way it is already a huge gain over the deferred length character in length and expressiveness:

character(len=:), allocatable :: str
type(string_type) :: str
type(str_type) :: str

Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Thank you @awvwgk .
For the specs, I suggest to explicitely mention program demo_.... and end program demo_....
While it is correct like that, it might be confusing for beginners who would read these specs.

doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
src/stdlib_string_type.f90 Show resolved Hide resolved
doc/specs/stdlib_string_type.md Show resolved Hide resolved
Copy link
Member

@ivan-pi ivan-pi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't managed to go through the implementation yet (not that I expect any wrongdoing), but I left a few comments in the specs.

doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
Comment on lines +128 to +129
The module defines an assignment operations, `=`, to create a string type
from a character scalar.

This comment was marked as resolved.

doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
doc/specs/stdlib_string_type.md Outdated Show resolved Hide resolved
awvwgk and others added 3 commits March 2, 2021 18:11
Co-authored-by: Ivan Pribec <ivan.pribec@gmail.com>
Co-authored-by: Jeremie Vandenplas <jeremie.vandenplas@gmail.com>
Copy link
Member

@milancurcic milancurcic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I'm happy with this PR!

Copy link
Member

@ivan-pi ivan-pi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to merge.

@awvwgk
Copy link
Member Author

awvwgk commented Mar 10, 2021

Ready to go from my side.

@ivan-pi
Copy link
Member

ivan-pi commented Mar 11, 2021

@jvdp1 do you have any unresolved comments? If not I suggest we go ahead and merge this later today.

Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. It is ok from my side. It can be merged. Thank you @awvwgk for this work!

@ivan-pi ivan-pi merged commit b522bbb into fortran-lang:master Mar 11, 2021
@awvwgk awvwgk deleted the string-type branch March 11, 2021 16:36
@awvwgk
Copy link
Member Author

awvwgk commented Mar 11, 2021

Thanks a lot your comments and the review. I already pushed the follow up for the non-fancy string type: an abstract base class for string objects (see #333), which basically provides the same functionality as the string type but in an object-oriented rather than a functional way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants