ks.read_spark_io / DataFrame.to_spark_io #447
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #447      +/-   ##
==========================================
+ Coverage   93.06%   93.07%   +0.01%
==========================================
  Files          27       27
  Lines        3344     3349       +5
==========================================
+ Hits         3112     3117       +5
  Misses        232      232
Continue to review full report at Codecov.
@floscha want to take a look at this?
This is some really useful functionality to have since, in practice, I read/write most data from/to HDFS rather than a local file system.
I left some comments regarding the documentation. Since the implementation itself is pretty straightforward, I don't have any complaints there 😉
databricks/koalas/frame.py
Outdated
path : string, optional
    Path to the data source.
format : string, optional
    Name of the data source in Spark.
"Name" sounds more like a file name if you ask me. Why not instead go for "Specifies the input data source format." like the Spark docs describe it?
Also, like you did for mode, it would be great to provide a list of supported formats, namely: CSV, JDBC, JSON, ORC, and Parquet.
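The list of supported formats the reviewer mentions could also be checked programmatically. A minimal sketch of such a check (a hypothetical helper, not part of this PR or the koalas API):

```python
# Hypothetical helper (not in this PR): validate the `format` argument
# against the data sources the reviewer lists as supported in Spark.
SUPPORTED_FORMATS = {"csv", "jdbc", "json", "orc", "parquet"}


def validate_format(fmt):
    """Raise ValueError if `fmt` is not a recognized data source format.

    A value of None is allowed, since `format` is an optional parameter.
    """
    if fmt is not None and fmt.lower() not in SUPPORTED_FORMATS:
        raise ValueError(
            "Unsupported format %r; expected one of: %s"
            % (fmt, ", ".join(sorted(SUPPORTED_FORMATS)))
        )


validate_format("parquet")  # accepted
validate_format("JSON")     # comparison is case-insensitive, also accepted
```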
Thanks, good point. Will make the changes.
databricks/koalas/namespace.py
Outdated
path : string, optional
    Path to the data source.
format : string, optional
    Name of the data source in Spark.
See my comment on format above.
# Write out partitioned by one column
expected.to_spark_io(tmp, format='json', mode='overwrite', partition_cols='i32')
# reset column order, as once the data is written out, Spark rearranges partition
This line and line 107 could start with a capital letter and end with a period, but that's rather cosmetic 😉
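The comment in the test snippet above refers to the fact that when a dataset is written out partitioned by a column, Spark returns the partition columns after the data columns on read-back, so the test has to restore the original column order before comparing. A minimal pure-Python sketch of that reordering (a hypothetical helper with made-up column names, not koalas code):

```python
# Hypothetical sketch (not from this PR): mimic the column order Spark
# produces after a partitioned write/read-back cycle, where partition
# columns are appended after the remaining data columns.
def spark_column_order(columns, partition_cols):
    """Return data columns in their original order, then partition columns."""
    data_cols = [c for c in columns if c not in partition_cols]
    return data_cols + list(partition_cols)


original = ["i32", "i64", "f", "s"]          # example column names
reordered = spark_column_order(original, ["i32"])
# reordered == ["i64", "f", "s", "i32"]
```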
Resolves #446