ARROW-1990: [JS] C++ Refactor, Add DataFrame #1482

Closed
wants to merge 75 commits into from


TheNeuralBit
Member

@TheNeuralBit TheNeuralBit commented Jan 16, 2018

This PR moves the Table class out of the Vector hierarchy and adds optimized dataframe operations to it. It currently implements an optimized scan() method, filter(predicate), count(), and countBy(column_name); countBy only works on dictionary-encoded columns.
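
To make the filter(predicate) and count() combination more concrete, here is a minimal sketch of how a lazily applied column predicate can be counted without materializing rows. It is an illustration under stated assumptions, not the code in this PR: `SketchTable`, `Columns`, and `Predicate` are hypothetical names, columns are plain `Float64Array`s rather than Arrow vectors, and only `col(...).gteq(...)` mirrors a call that appears in this description.

```typescript
// Hypothetical column store: each column is a plain typed array (not an Arrow vector).
type Columns = Record<string, Float64Array>;

// A predicate is bound to the columns once, then evaluated per row index.
interface Predicate {
  bind(columns: Columns): (row: number) => boolean;
}

// col(name).gteq(value) only describes the comparison; no data is read yet.
function col(name: string) {
  return {
    gteq(value: number): Predicate {
      return {
        bind(columns: Columns) {
          const data = columns[name];
          if (data === undefined) throw new Error(`no such column: ${name}`);
          return (row: number) => data[row] >= value;
        },
      };
    },
  };
}

class SketchTable {
  constructor(
    private readonly columns: Columns,
    private readonly length: number,
    private readonly predicate?: Predicate,
  ) {}

  // filter() is lazy: it records the predicate and returns a new view.
  // (In this sketch a second filter() replaces the first rather than combining.)
  filter(predicate: Predicate): SketchTable {
    return new SketchTable(this.columns, this.length, predicate);
  }

  // count() scans row indices and never builds row objects.
  count(): number {
    if (!this.predicate) return this.length;
    const test = this.predicate.bind(this.columns);
    let n = 0;
    for (let row = 0; row < this.length; row++) {
      if (test(row)) n++;
    }
    return n;
  }
}

const sketch = new SketchTable(
  {
    lat: Float64Array.of(-12.5, 0, 38.03, 47.6),
    lng: Float64Array.of(-78.5, 10, -122.4, 0),
  },
  4,
);
console.log(sketch.count()); // 4
console.log(sketch.filter(col("lat").gteq(0)).count()); // 3
```

Binding the predicate once per scan keeps the column lookup out of the inner loop, which is one plausible reason to separate building a predicate from evaluating it.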

Some usage examples, based on the file generated by js/test/data/tables/generate.py:

> let table = Table.from(...);
> table.count()
1000000
> table.filter(col('lat').gteq(0)).count()
499718
> table.countBy('origin').toJSON()
{ Charlottesville: 166839,
  'New York': 166251,
  'San Francisco': 166642,
  Seattle: 166659,
  'Terre Haute': 166756,
  'Washington, DC': 166853 }
> table.filter(col('lng').gteq(0)).countBy('origin').toJSON()
{ Charlottesville: 83109,
  'New York': 83221,
  'San Francisco': 83515,
  Seattle: 83362,
  'Terre Haute': 83314,
  'Washington, DC': 83479 }
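
The restriction of countBy to dictionary-encoded columns is easier to see with a small sketch. In a dictionary-encoded column each distinct value is stored once and every row holds an integer index into that dictionary, so counting reduces to incrementing one slot per row. The code below is an assumption-laden illustration, not the PR's implementation: `DictionaryColumn`, this standalone `countBy`, and the optional `keep` callback are hypothetical.

```typescript
// Hypothetical dictionary-encoded column: distinct values stored once,
// plus one integer index per row pointing into that dictionary.
interface DictionaryColumn {
  dictionary: string[];
  indices: Int32Array;
}

function countBy(
  column: DictionaryColumn,
  keep?: (row: number) => boolean, // optional row filter (an assumption of this sketch)
): Record<string, number> {
  // One counter slot per dictionary entry.
  const counts = new Uint32Array(column.dictionary.length);
  for (let row = 0; row < column.indices.length; row++) {
    if (keep && !keep(row)) continue;
    counts[column.indices[row]]++;
  }
  // Strings are touched only once per distinct value, at the very end.
  const result: Record<string, number> = {};
  column.dictionary.forEach((key, i) => {
    result[key] = counts[i];
  });
  return result;
}

const originColumn: DictionaryColumn = {
  dictionary: ["Seattle", "New York"],
  indices: Int32Array.of(0, 1, 1, 0, 1),
};
const lngColumn = Float64Array.of(-122.3, -74.0, 5.0, 12.0, -74.0);

console.log(JSON.stringify(countBy(originColumn)));
// {"Seattle":2,"New York":3}
console.log(JSON.stringify(countBy(originColumn, (row) => lngColumn[row] >= 0)));
// {"Seattle":1,"New York":1}
```

A column without a dictionary would need a hash map keyed by value instead of a fixed-size counter array, which is a different and slower code path.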

There are performance tests for the dataframe operations. To run them, first generate the test data by running npm run create:perfdata.

The PR also includes @trxcllnt's refactor of the JS implementation to make it more closely resemble the C++ implementation. This refactor resolves multiple JIRAs: ARROW-1903, ARROW-1898, ARROW-1502, ARROW-1952 (partially), and ARROW-1985.

@TheNeuralBit
Member Author

@trxcllnt, @wesm: We may want to keep this open until after @trxcllnt's refactor is complete, but in the meantime I'm interested in what you think about the API and implementation.

js/src/table.ts Outdated
super({batches: [[values, counts]]});
}

asJSON(): Object {
Contributor

@TheNeuralBit what about renaming this to toJSON(), so it also becomes the default serialization behavior if an instance is serialized via JSON.stringify()?

Member Author

ah nice I didn't know about toJSON, I'll make that change
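
For readers who have not used the hook discussed in this thread: JSON.stringify() checks each value for a method named exactly toJSON and, if present, serializes that method's return value in place of the object. A method named asJSON gets no such treatment. A minimal sketch, where `CountResult` is a hypothetical stand-in and not the class in js/src/table.ts:

```typescript
// Hypothetical stand-in for a counts result; not the class from this PR.
class CountResult {
  constructor(
    private readonly keys: string[],
    private readonly counts: number[],
  ) {}

  // JSON.stringify() calls this automatically and serializes what it returns.
  toJSON(): Record<string, number> {
    const out: Record<string, number> = {};
    this.keys.forEach((key, i) => {
      out[key] = this.counts[i];
    });
    return out;
  }
}

const byOrigin = new CountResult(["Seattle", "New York"], [2, 3]);

console.log(JSON.stringify(byOrigin));
// {"Seattle":2,"New York":3}

// The hook also applies when the instance is nested inside another value.
console.log(JSON.stringify({ byOrigin }));
// {"byOrigin":{"Seattle":2,"New York":3}}
```

Without toJSON(), the same call would serialize the instance's own fields (`keys` and `counts`) instead of the key-to-count mapping.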

@TheNeuralBit
Member Author

@trxcllnt I don't have anything else major I want to add to this. What do you think?

@trxcllnt
Contributor

@TheNeuralBit I would like to add more tests tonight to validate everything before it goes into master. We're hovering around 60% coverage with unit + integration tests, but not covering some of the cases like dateVec.asInt32() etc.

@wesm
Member

wesm commented Jan 26, 2018

This is really cool. It's a big patch -- is there anything I can help review?

@trxcllnt
Contributor

@wesm it may be worthwhile discussing the JS refactor with respect to the C++ implementation -- I tried to implement the Array/ArrayData stuff as faithfully as I could, but may have misunderstood a few things. I'm in meetings today, but will be available tonight

@TheNeuralBit
Member Author

@trxcllnt I'll look at adding some more tests for table.ts, hopefully that'll bump up the coverage for the vectors a bit as well.

@trxcllnt
Contributor

trxcllnt commented Feb 1, 2018

@wesm I believe this is ready -- we would like to add more tests of the new capabilities, but I wouldn't block the PR over it. I can follow up with the tests/enhancements I'd like in another PR.

@wesm
Member

wesm commented Feb 1, 2018

Cool, this patch is so big I'm just going to merge it. Hopefully the JS patch size can decrease in future PRs to make it easier to review =)

@wesm wesm closed this in e327747 Feb 1, 2018
@wesm
Member

wesm commented Feb 1, 2018

Thanks @trxcllnt and @TheNeuralBit for all your hard work on this! Would love to see a blog post or some other document to show off the new fancy stuff

@wesm wesm deleted the table-scan-perf branch February 1, 2018 22:24
@wesm
Member

wesm commented Feb 1, 2018

Would you mind updating all the JIRAs that are resolved from this?

@trxcllnt
Contributor

trxcllnt commented Feb 1, 2018

@wesm sure thing!

@TheNeuralBit
Member Author

Thanks @wesm! I'm travelling at the moment but I'm hoping to write up a blog post about the DataFrame ops when I get back next week.
