Radical changes proposal #3

Open
NickSeagull opened this issue Aug 2, 2017 · 7 comments
Comments

@NickSeagull
Collaborator

I've been researching different possible implementations of a dataframe, and it looks like a Vector (Vector a) is considerably slower than other representations. Given that Haskell is compiled to machine code, it would be ideal to be as fast as a C++ implementation, or at least somewhere close.

Haskell Data is immutable by default. This is one of the reasons it is such a nice language to program in; it removes a whole class of errors that can occur when multiple threads are modifying a data structure. It also allows the compiler to apply more optimisations.

However the disadvantage is that every time some data is “modified”, a new object (or at least part of it) must be allocated on the heap, and the old object must be garbage collected.

( Source )

It would be great if we could migrate to MVector, or to a representation such as MutableArrayArray, to make this much faster and to take advantage of libraries like vector-algorithms to make sorting easier. Also, a stateful representation of our dataframe makes much more sense to me for people coming from R, Python and even F#.
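For illustration, here is a minimal sketch of what sorting a single column through a mutable vector could look like. It assumes a column is stored as an unboxed `Vector Double`, which is not how the library currently represents data; `sortColumn` is a hypothetical name. It relies on the `vector` and `vector-algorithms` packages.

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Unboxed as VU

-- | Sort one column (hypothetical helper, assuming an unboxed column).
-- The input is copied into a mutable vector, sorted in place, and frozen.
sortColumn :: VU.Vector Double -> VU.Vector Double
sortColumn col = runST $ do
  mcol <- VU.thaw col      -- O(n) copy into an MVector; 'col' is untouched
  Intro.sort mcol          -- in-place introsort from vector-algorithms
  VU.unsafeFreeze mcol     -- O(1); safe because 'mcol' is not written again
```

Calling `sortColumn (VU.fromList [3.5, 1.0, 2.25])` returns a new sorted vector and leaves the argument unchanged, so the mutation stays an implementation detail.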

PS: I would really love to rename the RFrame type to something like DataFrame or DataTable or Table or ...; right now it gives the impression that we are trying to clone R in some way 😄

Pinging @ejconlon

@ejconlon
Owner

ejconlon commented Aug 2, 2017

I agree, it is not optimized for anything right now. I am not tied to any naming or representation, things were just set up to figure out the API. Should we start benchmarks first?

@NickSeagull
Collaborator Author

Yeah, definitely, we are figuring out API design in dataHaskell :)

@ejconlon
Owner

ejconlon commented Aug 2, 2017

You are also welcome to fork this into the dataHaskell GitHub org, and I can give you Hackage permissions if you help me figure that out :) . It'll be a busy week for me and I don't want to be a blocker.

@NickSeagull
Collaborator Author

Great, AFAIK there's an admin interface on every package page where you can also add maintainers. 😁

@Shimuuar

I'm not sure switching to MVector will achieve anything.

  1. It has the same representation as an immutable vector, except that it uses mutable arrays.
  2. It forces you to work in the IO/ST monad. That makes it difficult or impossible to interoperate with existing Haskell libraries, since they expect immutable data.
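As a rough sketch of the trade-off described in these two points: the `vector` package lets a function mutate a private copy inside `ST` while still exposing an immutable `Vector` to callers. The example assumes a column of `Double` values with NaN marking missing entries; `fillMissing` is a hypothetical helper, not part of this library.

```haskell
import Control.Monad (forM_, when)
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VUM

-- | Replace NaN entries with a default value (hypothetical helper).
-- 'VU.modify' runs the ST action on a mutable copy, or in place when
-- that is safe, and hands back an ordinary immutable vector.
fillMissing :: Double -> VU.Vector Double -> VU.Vector Double
fillMissing def col =
  VU.modify
    (\mv -> forM_ [0 .. VUM.length mv - 1] $ \i -> do
        x <- VUM.read mv i
        when (isNaN x) (VUM.write mv i def))
    col
```

Because the result is a plain `Vector`, it can still be passed to libraries that expect immutable data; the mutable view never escapes the function.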

@ejconlon
Owner

Arrow might be a more worthwhile target: https://arrow.apache.org/

@dmvianna

dmvianna commented Nov 13, 2017

Two cents:

  1. I love the API. This is very tidy and not making a wholesale copy of pandas or R functionality. This looks like Haskell.

  2. I like RFrames as a name. These are Record frames, right? Why copy a less descriptive name from a less defined implementation?

  3. I don’t want to rush your work, especially because being careful already led you to a good API. But the selling point for Haskell as a platform for data science (at least for me) is to be able to do data exploration and munging on a stream of data. I would be more interested in figuring out the right API for that before I started optimising for small data that fits in memory.

I’m a beginner Haskeller. Yesterday I had a productive day on the #haskell-beginners (functional programming) Slack and figured out that Frames had enough of my wishlist to let me do some of my work. But I think this project looks tidier, and I would be willing to test it with my real-world data. I work for the state government and have plenty of ugly data to play with.
