Radical changes proposal #3

NickSeagull · 2017-08-02T12:13:05Z

I've been researching different possible implementations of a dataframe, and it looks like a Vector (Vector a) is considerably slower than other representations. Given that Haskell compiles to machine code, ideally it would be as fast as a C++ implementation, or at least somewhere close.
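To make the comparison concrete, here is a minimal sketch of the two layouts, assuming the `vector` package. The names `BoxedTable`, `ColumnTable`, `sumColumnBoxed` and `sumColumnUnboxed` are hypothetical and not part of this project's API. It only illustrates why a boxed vector of boxed rows involves more pointer chasing than one unboxed vector per column; actual speed differences would need benchmarks.

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U

-- Row-of-rows layout: the outer vector holds pointers to rows,
-- and each row holds pointers to boxed cells.
type BoxedTable = V.Vector (V.Vector Double)

-- Hypothetical column-oriented layout: each column is a single
-- unboxed vector, so its values sit contiguously in memory.
type ColumnTable = V.Vector (U.Vector Double)

-- Sum column j in the boxed layout: one row lookup and one cell
-- lookup per row.
sumColumnBoxed :: Int -> BoxedTable -> Double
sumColumnBoxed j = V.foldl' (\acc row -> acc + row V.! j) 0

-- Sum column j in the column layout: one lookup, then a pass over
-- a contiguous unboxed vector.
sumColumnUnboxed :: Int -> ColumnTable -> Double
sumColumnUnboxed j cols = U.sum (cols V.! j)
```

Both functions are pure, so this layout change alone does not require mutable vectors.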

Haskell Data is immutable by default. This is one of the reasons it is such a nice language to program in; it removes a whole class of errors that can occur when multiple threads are modifying a data structure. It also allows the compiler to apply more optimisations.

However the disadvantage is that every time some data is “modified”, a new object (or at least part of it) must be allocated on the heap, and the old object must be garbage collected.

( Source )
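As a small illustration of the quoted point, assuming the `vector` package: an immutable "update" with `(//)` returns a new vector and leaves the original untouched. The names `original` and `updated` are only for this example.

```haskell
import qualified Data.Vector.Unboxed as U

original :: U.Vector Int
original = U.fromList [10, 20, 30]

-- (//) applies the listed (index, value) updates and returns a new
-- vector; 'original' is still valid and unchanged afterwards.
updated :: U.Vector Int
updated = original U.// [(1, 99)]
```

Here `U.toList original` is still `[10,20,30]`, while `U.toList updated` is `[10,99,30]`.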

It would be great if we could migrate to MVector or a representation such as MutableArrayArray to make this much faster, and take advantage of libraries like vector-algorithms to make sorting easier. Also, having a stateful representation of our dataframe makes much more sense to me for people coming from R, Python and even F#.
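For reference, a sketch of how vector-algorithms can be used for sorting, assuming the `vector` and `vector-algorithms` packages. `sortColumn` is a hypothetical helper, not something this project defines. The sort itself runs on a mutable vector, but `U.modify` wraps it so the caller still gets a pure function.

```haskell
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Unboxed as U

-- Sort a column with introsort. U.modify runs the in-place sort on a
-- mutable copy of the input and returns the result as an immutable
-- vector.
sortColumn :: U.Vector Double -> U.Vector Double
sortColumn = U.modify Intro.sort
```

Note that this works with the immutable `Vector` type at the API boundary, so using vector-algorithms does not by itself require storing the dataframe as an MVector.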

PS: I would really love to rename the RFrame type to something like DataFrame or DataTable or Table or ..., right now it gives the impression that we are trying to clone R in some way 😄

Pinging @ejconlon


ejconlon · 2017-08-02T14:41:34Z

I agree, it is not optimized for anything right now. I am not tied to any naming or representation, things were just set up to figure out the API. Should we start benchmarks first?

NickSeagull · 2017-08-02T14:43:37Z

Yeah, definitely, we are figuring out API design in dataHaskell :)

ejconlon · 2017-08-02T15:44:23Z

You are also welcome to fork into the DataHaskell GitHub org, and I can give you Hackage permissions if you help me figure that out :) . It'll be a busy week for me and I don't want to be a blocker.

NickSeagull · 2017-08-02T16:00:31Z

Great, AFAIK there's an admin interface on every package page where you can also add maintainers. 😁

Shimuuar · 2017-10-20T07:09:56Z

I'm not sure switching to MVector will achieve anything.

It has the same representation as an immutable vector, except that it uses mutable arrays.
It forces you to work in the IO/ST monad.
It makes it difficult or impossible to interoperate with existing Haskell libraries, since they expect immutable data.
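One common way to soften the last two points is to keep mutation local: thaw once, mutate inside `runST`, and freeze before returning, so callers only ever see immutable vectors. The sketch below assumes the `vector` package; `cumulativeSum` is a hypothetical example function, and this is not a claim about how the project should be structured.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

-- Running total of a column. The mutation happens only inside runST;
-- the function itself has a pure type.
cumulativeSum :: U.Vector Double -> U.Vector Double
cumulativeSum v = runST $ do
  mv <- U.thaw v                       -- one copy into a mutable vector
  forM_ [1 .. MU.length mv - 1] $ \i -> do
    prev <- MU.read mv (i - 1)
    cur  <- MU.read mv i
    MU.write mv i (prev + cur)         -- in-place write, no new vector
  U.unsafeFreeze mv                    -- fine here: mv is not used again
```

With this pattern the stored representation stays an immutable `Vector`, which keeps it usable with libraries that expect immutable data.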

ejconlon · 2017-10-20T17:35:46Z

Arrow might be a more worthwhile target: https://arrow.apache.org/

dmvianna · 2017-11-13T17:55:57Z

Two cents:

1. I love the API. This is very tidy and does not make a wholesale copy of pandas or R functionality. This looks like Haskell.
2. I like RFrames as a name. These are Record frames, right? Why copy a less descriptive name from a less defined implementation?
3. I don’t want to rush your work, especially because being careful already led you to a good API. But the selling point for Haskell as a platform for data science (at least for me) is to be able to do data exploration and munging on a stream of data. I would be more interested in figuring out the right API for that before I started optimising for small data that fits in memory.

I’m a beginner Haskeller. Yesterday I went through a productive day at #haskell-beginners (functional programming) Slack and figured out Frames had enough of my wishlist to let me do some of my work. But I think this project looks tidier, and I would be willing to test it with my real world data. I work for the state government and have plenty of ugly data to play with.

NickSeagull added the enhancement and question labels on Aug 2, 2017

Magalame mentioned this issue May 7, 2019

analyze : evaluate streaming for RFrame DataHaskell/dh-core#35

Closed
