Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linked list #68

Open
milancurcic opened this issue Jan 2, 2020 · 91 comments · May be fixed by #491
Open

Linked list #68

milancurcic opened this issue Jan 2, 2020 · 91 comments · May be fixed by #491
Labels
topic: container (Abstract) data structures and containers topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...

Comments

@milancurcic
Copy link
Member

milancurcic commented Jan 2, 2020

Problem

Linked list is one of the essential data structures beside an array. It allows you to add, insert, or remove elements in constant time, without re-allocating the whole structure.

Fortran doesn't have a linked list. There are 3rd party libraries, but no obvious go-to solution. Fortran stdlib should have a linked list. I would use it.

Examples

What kind of data can the linked list hold?

There's various levels of capability we could pursue:

  1. Single type: Basically just like an array, but allows insertion in constant time;
  2. Elements can be of any intrinsic type in a single list;
  3. Can take intrinsic type and user-defined derived types (is this even possible in current Fortran?)

API

I don't know, something like this?

use stdlib_experimental_collections, only :: List
type(List) :: a = List()

call a % append(42)
call a % append(3.141)
call a % append('text')
print *, a % get(2) ! prints 3.141
call a % remove(3) ! a is now List([42, 3.141])
call a % insert(2, 'hello') ! a is now List([42, 'hello', 3.141])

a = List([1, 2, 3]) ! instantiate a list from an array
@certik
Copy link
Member

certik commented Jan 2, 2020

C++ has std::list for this. (I added Petaca to your Examples above.)

I would mention that I personally have never had a need for a std::list in C++, nor any linked list implementation in Fortran, because linked list is very slow (to create, traverse, destrogy, ...) compared to just a regular array or std::vector. The only operation that might be faster is insertion or deletion of individual items in the middle. In my use cases, I typically need to add elements to the end, in which case array works great.

But since there are at least 6 different people who reimplemented this already in Fortran and given that C++ has it too in their standard library, I would say that this would be a good candidate to include in stdlib, so that if people want to use it, they can. So +1 from me.

@milancurcic
Copy link
Member Author

In my use cases, I typically need to add elements to the end, in which case array works great.

Me too, but how do you do it? I thought that appending to an array always re-allocates on heap, e.g.:

integer :: i
integer, allocatable :: a(:)
a = [integer ::]
do i = 1, 100000000
  a = [a, i] ! re-allocates a on every append
end do

It's okay, for small-to-moderate arrays, but for very large ones, isn't it crippling?

@certik
Copy link
Member

certik commented Jan 2, 2020

The canonical way is to pre-allocate the array and then append to it, like this:

integer :: i
integer, allocatable :: a(:)
allocate(a(100000000))
do i = 1, 1000
    a(i) = i
end do

Then you use your actual application to figure out what the maximum size of the array is (100000000 in this example), and then you can either keep a as is (only use the first 1000 elements, as in this example), or you can copy it to a smaller array. A real world example is e.g. here: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L111, the Bj_ array is pre-allocated to the maximum size first (determined from the sparse arrays), and then downsized before returning to the user: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/sparse.f90#L127. This is typically still much faster than a linked list implementation. If you don't know the size ahead of time, then you can set some maximum at compile time and fail the program if you go over it (real world example: https://github.com/certik/hfsolver/blob/b4c50c1979fb7e468b1852b144ba756f5a51788d/src/basis.f90#L230) --- many times this is fine, as you can recompile the code easily. But sometimes that's not appropriate, so then you can also do what std::vector does --- it doubles the allocation every time you reach it, and copies the data. Here is a fast implementation of that that I use in LFortran (that's in C++, but one can do something similar in Fortran also): https://gitlab.com/lfortran/lfortran/blob/57d3b8077d884f0ff3945ad3a86b2da920e4b6b3/src/lfortran/parser/parser_stype.h#L22. All of these are fast options.

But as I said, it's good to have linked list in stdlib, if people prefer that, so that they do not need to reimplement it.

@rweed
Copy link

rweed commented Jan 2, 2020

First I think we need to define which types of linked list we need. I prefer a circular double-linked list as the basic type since its the type I use most in FEM codes etc. I also think we would need a single-link list to implement stacks and queues. Also do we need some form of reference counting. As to @milancurcic question as to current Fortran support list that can contain both intrinsic and user defined types, yes it can. I've implemented both a circular list class and a single link class using unlimited polymorphic variables. They works but are not pretty and will probably have poor perfomance when compared to a type specific list generated by pre-processing/templating methods ala the
Fortran Template Library approach.

@everythingfunctional
Copy link
Member

Generic linked-list, or really any generic data structure, is really cumbersome with the current Fortran capabilities. They work, but you end up having to use a select type block every time you want to access the data. So for convenience you'd end up with some wrapper class or library, at which point you might as well have re-implemented for your specific use case. Until we get fully parameterized types or template capabilities I don't think these are a great idea.

@victorsndvg
Copy link

victorsndvg commented Jan 3, 2020

I think the supported data types should be wrapped with containers in order to be extendible.
In the main issue (#1) containers are mentioned, but I don't see any other specific issue.

I think FPL (https://github.com/victorsndvg/FPL) contains a smart implementation strategy for supporting native data types and allow to extend to other user defined data types. It contains lists, hash tables, etc. All of them depend on containers (aka wrappers) in order to manage different data types.

I agree that with this kind of data types you don't get performance, but amazing flexibility. This kind of data types (usually) are not for computation purposes.

Edit:

  • Note 1: FPL data types naming is probably not good
  • Note 2: Multiple dimensions management is another usual issue also treated by FPL

@ivan-pi
Copy link
Member

ivan-pi commented Jan 3, 2020

I think many of the projects in the list of popular projects contain linked list implementations. Perhaps it would be good to do a grep over all of those repositories to get a feeling for linked list usage in production codes (e.g. whether they use generic lists supporting multiple kinds or only specific ones for the intrinsic kinds and potentially derived types).

@nshaffer
Copy link
Contributor

nshaffer commented Jan 3, 2020

I agree with @everythingfunctional on this issue. There's a ton of up-front labor in implementing fully polymorphic containers, and I'm not convinced that they're that much more useful than having generic (but homogeneous) containers. That is, I don't think it's worthwhile to support, say, linked lists where each element is of arbitrary type.

The more common use case I find is to need a linked list of int32, say. Or a binary tree of class(my_derived_t). These can be implemented without select type all over the place. There's still the labor of implementing all the intrinsic types, but this can be templated.

Letting users make containers of derived types is tricker. The common solution is to provide an abstract base class that users need to extend in order to have containers of derived types. I think that solution kind of sucks, but I have an alternate idea... Just ship source code templates that implement each container for class(__T__) and then let users run sed s/__T__/mytype/g on it to produce derived type containers on demand. (This will be slightly more involved for, e.g., mapping types, but just slightly).

I confess I have not thought through if there is some great pitfall to this approach besides being slightly "icky" from a distribution p.o.v.

@nncarlson
Copy link
Member

I'm in agreement with @nshaffer here. I've done the linked-list-of class(*) variables in my own library which I use as a backend for some very specific things where such generality is needed. Otherwise it is incredibly clunky to use with all the select type and isn't an acceptable for general use, imo.

Someone else seemed to suggest that perhaps performance shouldn't be a concern here. I think it would be a big mistake to ignore performance. Linked lists come with their intrinsic performance overhead that most would be aware of, but any implementation that significantly added to that I would find unacceptable to include in a standard library.

I think the best solution beyond intrinsic types, which could all have very performant implementations, would be, as @nshaffer suggested, to provide a literal template that a user could adapt for their particular case. In fact that's more or less what I do myself.

@zbeekman
Copy link
Member

zbeekman commented Jan 3, 2020

A note on performance:

  1. Yes linked lists have some overhead compared with arrays
  2. They perform well for sorted data, and in instances where you're always manipulating one end of the list or the other, e.g., stacks, but, in general are NOT constant time lookup for random access read or insertion unless you're always operating on data "nearby"
  3. They are a building block component for hash tables which are in general constant time insertion and lookup.
  4. In some cases where storage needs vary greatly and dynamically in complex ways pre-allocating a huge array may not be feasible and you may want/need to use a linked list

I think there is merit to providing classic data structures and algorithms. I would add hash tables to this list as well as binary-trees, octrees, K-D trees, and a number of others. Obviously they are not useful to all users and applications but having a decent implementation is worthwhile.

I agree that right now the select type combinatorial explosion makes unlimited polymorphics nearly useless, and very awkward. In my opinion better generic programming should be the highest priority for the next major standard revision.

@certik
Copy link
Member

certik commented Jan 3, 2020

@zbeekman Generic programming will not make it to the next standard revision -- simply because there is no proposal that is ready. I think the latest most developed idea is pursued at j3-fortran/fortran_proposals#125, and we need everybody's help to help transform the idea into a solid proposal. Once we have a proposal that is community backed, I'll be happy to bring it to the committee and try to get it into the next standard.

@zbeekman
Copy link
Member

zbeekman commented Jan 3, 2020

I know @rouson is working with Magne who leads the Bergen Language Design Lab and also @tclune on generics. They have something here but I don't know how up to date it is with their current efforts. Hopefully they can combine efforts and we can get something in, we'll see.

@certik
Copy link
Member

certik commented Jan 3, 2020

Yes, the issue j3-fortran/fortran_proposals#125 is the latest based on our discussion with Magne at the last meeting. Anyway, let's move the discussion about this there, I just wanted to point this out, that we need help.

@tclune
Copy link

tclune commented Jan 3, 2020

In the mean time I have a project https://github.com/Goddard-Fortran-Ecosystem/gFTL which provides (by far less elegant means) a generic container system. Currently it supports Vector and Map (ala C++ STL), but also has Set which is used under the hood.

gFTL uses the C preprocessor and requires explicit instantiation, but is still a real game changer for doing some common operations within Fortran. I have a separate project gFTL-shared that provides common instantiations.

But I do look forward to the day that this could be done much more elegantly through a proper generic facility. (And yes, I realize that other preprocessors could do what I have done more elegantly than the C preprocessor, but ... cpp is already integrated into the build systems for the other projects I work with.

@gronki
Copy link

gronki commented Jan 3, 2020

I agree here with @zbeekman that linked lists are essential and I think the approach to preallocate array is very ineffective (cause then you have to check for overflow and re-allocate it etc). I also sadly agree that this is undoable in the current Fortran. Gotta wait for generics (or hopefully an intrinsic highly-optimized types for lists and dicts).

@LadaF
Copy link

LadaF commented Jan 4, 2020 via email

@certik
Copy link
Member

certik commented Jan 4, 2020

@LadaF nice to meet you! Small world. You should put your name and photo at your GitHub profile if you can.

@jvdp1 jvdp1 added the topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... label Jan 18, 2020
@ngs333
Copy link

ngs333 commented Mar 3, 2021

The FTL ( https://github.com/SCM-NV/ftl Fortran Template Library) is a substantial library that is worth looking at.

@rouson
Copy link

rouson commented Mar 3, 2021

Linked lists are a prime example of an anti-pattern, i.e., "a common response to a recurring problem that is usually ineffective and risks being highly counterproductive." In order of preference, I recommend

  1. Document clearly why linked lists are not supported for reasons related to code performance, correctness, complexity, and generality. Refer people to C++ inventor Bjarne Stroustrup's explanation of why to avoid linked lists.
  2. Use recursive allocatable components, a Fortran 2008 feature designed to support singly-linked lists that allow only unidirectional traversal. This at least resolves the correctness and complexity issues. Resolving the generality issue will require generic programming in Fortran 202Y. There's probably not much that can resolve the performance issues except warning the user about operations to avoid in performance-critical sections of code.
  3. If bidirectional traversal is required, consider whether arrays and indirect addressing could be used instead of pointers.
  4. Use pointers only as a last resort and only if the pointers are encapsulated in a way that precludes memory leaks and dangling pointers. Doing so likely requires reference counting, which is tricky to automate and get right.

If 1 is untenable, start with 2 to reduce development time and maintenance hassles.

@ngs333
Copy link

ngs333 commented Mar 3, 2021

Every data structure (linked list, array, binary tree, heap, priority queue, kd-trees, etc ) has a set of operations along with its computational complexity for each data operation (Cormen et all, Introduction to algorithms). Each has its place in larger algorithms (e.g. ray queries, computational geomety, etc) and programs. Many of them have a place in larger patterns. Having said that, there are some that others are build on (e.g. linked list; heap; priority queue). On top of that, there are some that are so simple that if you can't easily implement it in a language, then the language has serious problems and don't bother even trying the more sophisticated ones. I think the liked list is a good if not natural place to start.

@ivan-pi
Copy link
Member

ivan-pi commented Mar 3, 2021

Great advice @rouson! It would fit well into a Fortran book similar to the Effective C++ series by Scott Meyers.

I never realized one can make a forward list using allocatable components. Recently, I was writing some code for operating with polygons. I decided to implement a polygon as a linked list of edges, half-knowing I might come to regret it, but still thought it will be a fun challenge. Two-weeks later I already regret it. The pointers make it difficult to write getter functions which are pure, the copy behavior is convoluted, and so is finalization.

Instead I might try to implement the polygon with recursive allocatable components:

type :: edge
  real, dimension(2) :: start, end
  type(edge), allocatable :: next
end type

type :: polygon
  integer :: nedges
  type(edge) :: first_edge
end type

(I am aware that this could also be solved with a simple array of vertex coordinates, and the solution above duplicates some information.)

Btw, your link in the 4th bullet point is broken.

@rouson
Copy link

rouson commented Mar 3, 2021

@ivan-pi thanks! I just fixed the link in my 4th bullet point.

@arjenmarkus
Copy link
Member

arjenmarkus commented Mar 4, 2021 via email

@ChetanKarwa
Copy link

ChetanKarwa commented Mar 20, 2021

Hello everyone,
I am looking forward to solving this issue. I wish to contribute under GSOC 2021.
I wish to start my work from now forth but I don't see a final conclusion on what kind of linked list is required here.

Mentors assigned for this project in GSOC are @arjenmarkus and @milancurcic. So, could any one of you finalize a clear picture of what kind of linked list is required?

A homogenous generic linked list (a linked list that contains only a single type of data i.e. decided by the user, this is similar to the list given in standard library in C++) [1,2,3,4,5,20]

or A heterogeneous generic linked list (A linked list that can contain any kind of data at its node be it an integer, character array, double or any other data type. this is similar to the list used in python) [1,'Hello',3.14,20]

@arjenmarkus
Copy link
Member

Welcome to the project :).

The questions you raise are actually part of the project. The design questions need to be elaborated:

  • The required capabilities (see the initial description) - different levels of flexibility can be defined
  • Performance aspects of different implementations (runtime and memory usage)
  • Possible uses (user stories if you like) for application in actual programs

@ChetanKarwa
Copy link

Ok.
The following are my views regarding the issue here.

  • A list in the most use case is a group of similar objects and not random objects. (If there is massive scale use of a list with random data types enlighten me)

  • According to my knowledge in Fortran, it is not possible to design a heterogeneous generic list (eg. [1,3.14, "hello",20]) with the current data-type definition available in Fortran because there won't be a fixed return type of list_name%get(index) and if the user decides to store the value in some variable it is not possible for the user to know what will be the return type of the function.

  • According to my knowledge, this is only achievable if Fortran supports type inference. (Eg. "auto" keyword in C++ and "var" in js).

@LadaF
Copy link

LadaF commented Mar 21, 2021

It is certainly possible to have a linked list or an array of container types that contains a class(*) item. The user of that list then will be responsible to know what types there might be inside and setup the appropriate select type type guards.

@ChetanKarwa
Copy link

I know that it is possible for one to make such a linked list but how can we expect our user to remember or to know what type there might be inside the node. User will have to maintain different data that describes the type carried by that node.

@LadaF
Copy link

LadaF commented Mar 21, 2021 via email

@tclune
Copy link

tclune commented Nov 15, 2021

Not sure about "compiler can optimize". The algorithm is baked into the implementation of STL vectors. The vector interface does not really give the compiler any hint of how the container will be used.

Also note that there are other mechanisms for growing a vector. One can specify a minimum size even after the container is partially filled. This might be useful if you know you are memory constrained and don't want to risk the default doubling.

@certik
Copy link
Member

certik commented Nov 15, 2021

Not sure about "compiler can optimize". The algorithm is baked into the implementation of STL vectors. The vector interface does not really give the compiler any hint of how the container will be used.

If push_back is an stdlib routine, then compilers can understand the meaning of it, which is they can optimize the use of it by reallocating the array.

@KoldMen
Copy link

KoldMen commented Apr 24, 2022

I have been programming in Fortran since 1981. So I guess I know a thing or two about Fortran. I do a lot of high performance computing. I use Fortran for computations and write my GUI in VB, C#, Java, and had used Pascal and QBasic in their hey days for the front end.

I use a lot of graphs and data structures in these languages. Lists are the more commonly used than arrays. One can keep adding to a list without predefining its length. This is important for system programmers. I have never missed lists in my computational engines, except when trying to use stacks. I have my own implementation of a stack where I create a new array and copy the old one when the stack size increases beyond the allocated size. I know all about inefficiency. But then try replacing a stack with an array in any algorithm.

Should Fortran have lists, the answer is YES. Should it be super efficient ? The answer is NO.

Allow the end users like me make the choice. Not all programmers would be writing weather forecasting routines. Those who write will never use a list.

List is a basic feature of modern languages. If you don't want to implement, then you are essentially limiting Fortran to a special use, namely Computations. I am okay with this. But then don't cry that the new generation is not using Fortran.

Add modern features like lists or die.

The whole arguments above seem to miss the point that we stopped teaching Fortran formally in my University and elsewhere more than 20 years ago. We teach Python to freshmen, and they love it. An avid programmer friend from NIST advised me to shift to Python since Python has so many libraries. And here you are, arguing about implementing a simple list. So funny.

Anyway, thanks for trying to build the stdlib, even if it is 30 years too late.

@certik
Copy link
Member

certik commented May 5, 2022

@KoldMen I agree with your post. One nuance is that above we are arguing about the API of a list data structure in stdlib built using Fortran. Notice that in Python the list is part of the language itself, it is not implemented in Python. In LFortran we have actually recently added List as a first class feature into the intermediate representation, and it will get very efficient implementation in the backends. We have not exposed it in the frontend (syntax) yet, but one option is to simply recognize a list from stdlib, and transform it into the List node in the intermediate representation (instead of using the Fortran implementation from stdlib), and thus it would get excellent performance with LFortran. Other compiles could either do the same, or they could just use the actual implementation from stdlib (slower, but it would work).

We also have dictionaries, tuples and sets in LFortran. So once stdlib has an implementation of those, we can hook them in in the frontend.

@14NGiestas 14NGiestas linked a pull request Jul 5, 2022 that will close this issue
@awvwgk
Copy link
Member

awvwgk commented Jul 30, 2022

Now that we have hash maps in stdlib, it would be nice to also have lists for arbitrary types. This would allow IO libraries to directly export the data they are reading in a stdlib compatible format.

@beddalumia
Copy link

beddalumia commented Sep 6, 2022

@milancurcic

Me too, but how do you do it? I thought that appending to an array always re-allocates on heap, e.g.:
integer :: i
integer, allocatable :: a(:)
a = [integer ::]
do i = 1, 100000000
a = [a, i] ! re-allocates a on every append
end do
It's okay, for small-to-moderate arrays, but for very large ones, isn't it crippling?

and

...
allocate(s(0))
call system_clock(c0, cr)
do i = 1, N
s = [s, real(i, dp)]
end do
call system_clock(c1)
...
which is very slow. Of course, you can alleviate by using a vector, but at that point you're introducing a new structure.

Sorry for going slightly off topic (although I see I'm not the only one pushing for arrays instead of lists, for the case of homogeneous collections), but I parsed (admittedly fast) the whole thread and surprisingly did not find a mention to F2003 intrinsic move_alloc which makes for a way faster alternative to the badly performant "automatic reallocation" you mentioned in these two messages (which remarkably stay respectively on top and quite on the tail of the discussion).

I actually did not know about automatic allocations with array constructors, but have implemented in the past a compressed-sparse-row grower using move_alloc and found it quite efficient. So now, reading here, I got surprised about your viewpoint and went on the discourse to find a nice post which compares systematically the two approaches (and something more in the comments), so you might want to check it out.
Incidentally I think move_alloc should probably be the core functionality to implement whatever variation of the std::vector::push_back() strategy that @certik is proposing (one could go more flexible than just always doubling the size...). The discourse thread also briefly mentions this and points to an interesting implementation in FLIBS.

Aside of all of this, I also agree that the user should have a choice and adding linked lists in stdlib may be beneficial for many people, just wanted to go a little off-track and point out this alternative to the readers (like me 🙃). But maybe I would prefer an inhomogeneous one, following @awvwgk's idea of simplifying IO operations (very much like Matlab does with its cell arrays, that are returned whenever an IO intrinsic operates on inhomogeneous files). The idea would then be quite "classic": homogenous collections go into (dynamic) arrays, inhomogeneous ones into lists.

@gronki
Copy link

gronki commented Oct 11, 2022 via email

@certik
Copy link
Member

certik commented Oct 17, 2022

Thank you @gronki!

The LPython compiler (which shares the middle end and backends with LFortran) has lists and on our initial preliminary benchmarks it seems faster than Clang/GCC's std::vector (with all optimizations on) for most people:

The same with dictionaries against std::unordered_map:

So LFortran already has this capability; all that is needed is to expose it via some syntax.

Our experience with LPython is to use regular Python syntax for our fast features and provide a CPython implementation. That seems to work really well. In the same way, as indicated above, we can expose these nice LFortran features via regular (existing) Fortran syntax and provide a Fortran implementation. LFortran would recognize it and use List.

That still leaves the door open to also create an extension of the Fortran language, we can still do that.

@milancurcic, @everythingfunctional if you want to move this forward, let's get a usable List implementation into stdlib, and then we can teach LFortran about it. Everything is ready from my side, we just need the "syntax" in stdlib.

I recommend not to use linked list, but rather store the length, capacity and the data, just like Python, or std::vector does it. LFortran also does it that way.

@everythingfunctional
Copy link
Member

That's a neat way to do it. I can implement a list for stdlib. It shouldn't be hard. However, what should the API look like? I'd lean towards something like

type :: list
  private
  ! store values and whatever else makes sense
contains
  procedure :: item
end type

interface list
  pure module function from_array(values)
    real, intent(in) :: values(:)
    type(list) :: from_array
  end function

  pure module function of_size(size, init, default)
    integer, intent(in) :: size
    real, intent(in), optional :: init !! fill all elements with this value if provided
    real, intent(in), optional :: default !! expand size of list and provide this value if asked for non-existent element
    type(list) :: of_size
  end function
end interface

interface
  module function item(self, position)
    class(list), intent(inout) :: self !! allows for expansion of size if necessary
    integer, intent(in) :: position
    real, pointer :: item !! pointer allows to appear on lhs of assignment
  end function
end interface

@milancurcic
Copy link
Member Author

There's an implementation in #491. It needs some work (mainly docs and tests) to get it through the finish line. I suggest discussing and adjusting (if needed) the API there.

@everythingfunctional
Copy link
Member

There's an implementation in #491. It needs some work (mainly docs and tests) to get it through the finish line. I suggest discussing and adjusting (if needed) the API there.

That is I suppose a place to start the conversation, but I think it has issues, such as:

  • It's a linked list, and the interface seems (to me) to imply that. And @certik indicated we probably shouldn't do that.
  • It's unlimited polymorphic. My suspicion is that we would rather have lists of a specific type to avoid the overhead and clunky usage of class(*) things.

@certik , what is the API for the list that LFortran uses?

@arjenmarkus
Copy link
Member

arjenmarkus commented Oct 18, 2022 via email

@milancurcic
Copy link
Member Author

milancurcic commented Oct 18, 2022

I think there are a few different conversations here (lists vs. vectors and API vs. implementation). Some comments:

  • This is a thread for linked lists. I think we should start a new thread for a vector to keep efforts more focused.
  • There seems to be more appetite (around Fortran-lang) for vectors than for lists. For me as well, vectors are more useful than lists.
  • There has been almost no interest so far in what an API for a list would look like. For example, see APIs for linked list module. #463 and there were several Discourse posts by Chetan with little to no feedback. Maybe there's more interest now.
  • I also prefer a single intrinsic typed list over class(*). It makes working with a list much easier, no select type stuff as already pointed out by Brad, and I have to date not had a need in Fortran to mix types in a list (and very rarely in Python).

@certik
Copy link
Member

certik commented Oct 21, 2022

Let's keep the discussion here to have it in one place, it's all related.

  • In LPython we don't allow mixing types in a list, all items have to be of the same type. Consequently also in LFortran.
  • We can allow mixing types in stdlib, and then LFortran would only optimize lists of the same type, and if you mix types, it would not be optimized; I would lean to not allow mixing of types.
  • the term "list" in my mind visualizes Python list like a = [1, 2, 3] and the operations that you can do on it, like "append" and "extent" and concatenation a = [1, 2, 3] + [10, 11]. However, Python underneath does not use linked list, but rather implements the list using a size, capacity and a pointer. So I would recommend using the term "list" for this datastructure. If we want to specify a particular implementation, we can have both List and LinkedList. Using the term "vector" for this I think is confusing, because vector in this sense is the same as an array and Fortran already has arrays. So I would stick with Python's terminology, which is "list" (cheap appending) and "array" (NumPy array, for our purposes equivalent to Fortran arrays; expensive appending). (In my mind, the C++ implementation of the list is "std::vector". There is also std::list which I think is a linked list and I don't think I've ever used it, as it is very slow. I recommend using Python, not C++ terminology.)
  • We have to figure out the API of a List, that is the main task.
  • We can implement any API into LFortran's frontend; the intermediate representation uses the following API: it has statement nodes and expression nodes and a type List which has a type of the element, such as List[f64] (using LPython's syntax), in ASR notation (here is the full ASR so that you can see the context of the nodes below: https://github.com/lfortran/lfortran/blob/383831d7efb9b5b1bc00c74e928c8997b7470a95/src/libasr/ASR.asdl):
stmt
    | ListAppend(expr a, expr ele)
    | ListInsert(expr a, expr pos, expr ele)
    | ListRemove(expr a, expr ele)
    | ListClear(expr a)

expr
    | ListConstant(expr* args, ttype type)
    | ListLen(expr arg, ttype type, expr? value)
    | ListConcat(expr left, expr right, ttype type, expr? value)
    | ListItem(expr a, expr pos, ttype type, expr? value)
    | ListSection(expr a, array_index section, ttype type, expr? value)
    | ListPop(expr a, expr? index, ttype type, expr? value)

ttype
    | List(ttype type)

@milancurcic
Copy link
Member Author

OK, your terminology is confusing to me, but we can pick any that most people here agree on. For this conversation, we can just call them "linked list" and "list", as you suggest.

Based on your description, as I understand it, your list is the same as a dynamic array. All elements of the same type, cheap indexing and prepending/appending, slow inserting.

The difference is not just an implementation detail. They're different data structures. I initially opened this thread to ask whether there's a desire for a structure with fast insertion (a linked list). I was motivated by the fact that there are many home-cooked implementations on the internet of Fortran linked lists, so it's possible that the community would benefit from a stdlib implementation. For me personally, your list is a more useful structure.

Considering the option of having both structures (linked list and list), do you think it would be a good design if they had the same (to the extent possible) API?

@awvwgk
Copy link
Member

awvwgk commented Oct 21, 2022

There is also need for a list holding class(*) values, for example to implement support for #671 using stdlib. If only one type is allowed user will need to manually instantiate a list using:

type :: wrapped_class
  class(*), allocatable :: raw
end type wrapped_class

And deal with the class dispatch themselves. In this case stdlib could directly provide such a wrapper class.

@tclune
Copy link

tclune commented Oct 21, 2022

Related note: In the initial release of generics in Fortran, one will not be able to write CLASS(T) where T is a deferred type. So one cannot directly use a single template that works for both TYPE(T) and CLASS(T). Well, a user could implement their own wrapper and then either swear up and down in packing/unpacking at each interface, or write their own wrappers.

And you cannot even write an "outer" template to provide the polymorhic case, because you cannot even define the wrapper type within the template. Again - there is no way to write CLASS(T).

And no, I'm not happy about this, as well over half of my uses of containers are polymorphic. The majority of the subgroup prefers to wait for the initial set of features to be implemented so that we can more accurately plan subsequent extensions. I.e., so that we worry about the things that matter most after experience is gained.

@certik
Copy link
Member

certik commented Oct 21, 2022

We can discuss the terminology at our next Fortran meeting. :)

@tclune, we have our former GSoC student (@ansharlubis) working on the generics in LFortran. If you know about anybody else who can help, please let me know. While we are at this: there does not seem to be many good examples of the syntax. Here is our issue for this: lfortran/lfortran#929 where I linked all examples that I was able to find so far. If you create more examples, that would be a huge help.

@tclune
Copy link

tclune commented Oct 21, 2022

@certik The only examples that properly use the syntax as it appears in the formal syntax paper are: linear_algebra and fundamental_requirements. (These are in the last link you provide in the reference above.)

Until now, the syntax has been a (slowly) moving target, and maintaining the examples takes time. If/when the syntax paper is approved, we can start producing "final" examples and deleting the ones with obsolete syntax.

@certik
Copy link
Member

certik commented Oct 21, 2022

I see, thanks @tclune!

@beddalumia
Copy link

About terminology

It may be worth noting that Wikipedia explicitly states that a list can be implemented either as a linked list or a dynamic array (plus a generic statement about the former being more frequent for lisp dialects, where the terms "list" and "linked list" are often conflated.)

https://en.m.wikipedia.org/wiki/List_(abstract_data_type)

To be fair the first picture you see is a depiction of a linked list, so I totally get that it can be misleading. I've also being taught in bachelor that a "list" is a "linked list".

In general, I wouldn't want to give too much importance to Wikipedia per sé, but I like that is more or less language agnostic, and we do not necessarily want to follow conventions from either the C++, or python community or anything else, rather use terms that would be familiar for people with different backgrounds and, even more importantly, would not result confusing after googling them. The C++ case gives an important lesson in my opinion, as

  • to me it violates the principle of least surprise with std::vector (no one that is not aware of C++ specific lore would ever imagine what it is, from the name... if googling works now it's just for how much C++ containers are pervasive at this point)

  • given the diffusion and extensive usage nowadays, it's basically impossible to amend the name, so yes, it would be important to avoid choosing a misleading/debatable name from the start.

Relevant quote from stack overflow:

It's called a vector because Alex Stepanov, the designer of the Standard Template Library, was looking for a name to distinguish it from built-in arrays. He admits now that he made a mistake, because mathematics already uses the term 'vector' for a fixed-length sequence of numbers. C++11 compounds this mistake by introducing a class 'array' that behaves similarly to a mathematical vector. Alex's lesson: be very careful every time you name something.

original link with reference and discussion

@beddalumia
Copy link

beddalumia commented Oct 21, 2022

Considering the option of having both structures (linked list and list), do you think it would be a good design if they had the same (to the extent possible) API?

I think this is exactly what you'll want to do, it would make easy for client code to switch from one to the other, according to different needs / requirements changes / benchmarks.

I was motivated by the fact that there are many home-cooked implementations on the internet of Fortran linked lists, so it's possible that the community would benefit from a stdlib implementation. For me personally, your list is a more useful structure.

I may just comment that I know that in the codebases that I'm working on during the PhD there is currently a homebrew implementation of a dynamic array (for sparse rows...) Older people told me that that was initially implemented as a linked list and used in many performance critical parts of the code. As they eventually scaled up the requirements, crafted a massively parallel algorithm for the iterative solver and started profiling the MPI performance they immediately spotted these linked lists as the main bottleneck and changed to dynamic arrays to give the very same "list functionality" but with acceptable performance.

@beddalumia
Copy link

As for class(*) containers, what about keeping it separated from lists, also in naming convention?

They would be inherently inferior for performance, but provide a whole new level of flexibility... this should suffice to qualify them as a totally separated thing, targeting very different use cases. Matlab calls such things1 "cells". How do you feel about such a name? Alternatives?

Footnotes

  1. not really sure about the underlying implementation, but the API is somewhat the same: a collection of heterogeneous things, generally slower than a builtin array (remember that in matlab all arrays are dynamic, so satisfy Ondrej's definition of list), and with quite limited support to be passed around. E.g. you cannot do plot(x, y) if x and y are cells, nor you can do linear algebra, etc. They provide a suitable cell2mat method to convert homogenous cells to arrays, when needed.

@milancurcic
Copy link
Member Author

Since the terminology varies between communities. I suggest being explicit and descriptive such that the name reflects the properties of the structure. Perhaps those would be:

  • Linked List (type linked_list)
  • Dynamic Array (type dynamic_array)

@ivan-pi
Copy link
Member

ivan-pi commented Oct 22, 2022

@milancurcic, is the difference that linked_list can hold items of different types, while dynamic_array contains items of the same type? Fortran allocatable arrays are dynamically allocated in the sense they can be resized. The dynamic_array on the other hand allows inserting/appending/removing (Wiki - Dynamic Array)?

I know a few algorithms which require a dynamic_array, and I implement them currently with an allocatable array and a smart reallocation procedure (#598). Recently, there was a query about a dynamic array at Fortran Discourse: https://fortran-lang.discourse.group/t/what-package-would-you-recommend-for-typed-lists/4359. In that case, the user only needed the append operation.

@certik
Copy link
Member

certik commented Oct 22, 2022

For what is worth, the first time I heard the term dynamic array is from Milan above, and I wouldn't know what it means without this thread, since Fortran arrays are also "dynamic" (in my mind). But I don't have a better term for it right now (besides just List), so we can use DynamicArray, to mean the "size+capacity+pointer" implementation (like std::vector).

Thanks @bellomia for the comments! Yes, in LFortran's ASR (intermediate representation) we use the term List, and indeed it is independent of how it is actually implemented. So in order to be able to represent both linked list and size+capacity+pointer implementation, we could just add a flag to the type, such as "implementation=LinkedList | DynamicArray". So the API could indeed be exactly the same for both.

We can use "linked_list" and "dynamic_array" for implementations and then somehow alias "list" to mean "dynamic_array", as I think "list" is easier for people to understand what it means. Just "list" and "array".

@milancurcic
Copy link
Member Author

milancurcic commented Oct 22, 2022

Great, yes, I'm not so concerned about the final terminology in stdlib because we can arrive there through consensus in issues and PRs; but more that we are on the same page in this thread when we discuss these structures so that we know what each name refers to.

@milancurcic
Copy link
Member Author

milancurcic commented Oct 22, 2022

@ivan-pi As I understand it, the time complexity (fast insertion vs. fast indexing) and the type of elements are orthogonal.

A linked list has fast insertion and growing and shrinking but slow indexing. A list/vector has fast indexing and fast (amortized) growing and shrinking but slow insertion.

A linked list with class(*) elements allows you to mix types of elements and insert user-derived type instances as elements. But consuming data from the list is awkward because you need all the select type boilerplate. This is the kind of list that @ChetanKarwa worked on during his GSoC project.

But a regular list/vector can, I think, also be implemented with class(*) to allow the same flexibility with types.

And either structure with fixed (intrinsic) types is easy to consume elements from (no select type boilerplate needed) because the types are known at compile-time. And it's probably easier for a compiler to optimize because all elements have the same size in memory.

So, I think that choosing the specific structures to implement requires deciding along both dimensions, time complexity of operations, and type-flexibility vs. ease of use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: container (Abstract) data structures and containers topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ...
Projects
None yet
Development

Successfully merging a pull request may close this issue.