[SPARK-10380] - Confusing examples in pyspark SQL docs #8647
Conversation
There are several points mentioned in the JIRA; this is just one of them.

Pull request modified: added both documentation changes.
python/pyspark/sql/column.py
Outdated
I think @marmbrus also suggested that cast docs be updated to note the alias the other way -- there may be two more changes to make.
I built and checked the HTML page. The note appears in both places. It's an alias, and the inline documentation is repeated in both places.
The problem I was referring to is exactly that the documentation is repeated instead of being adapted for the function being described. The code examples for astype should use astype, not cast.
Got it. I was skeptical about the code change. I will make it now.
The recent commit adds all 4 methods (cast, astype, dropDuplicates, and drop_duplicates) along with the doc changes.
python/pyspark/sql/column.py
Outdated
remove this
I am not sure I understand what should be removed. Do you want me to remove the entire astype function that was added? Also, where should I use self.cast(dataType)? If it's within the astype definition, then the whole point of demonstrating the usage of astype is lost. Am I missing something here? I am sorry if this is something trivial that I am not seeing.
@GayathriMurali I meant we should keep the docstring, but we don't need to copy the implementation.
Removed the astype implementation and replaced it with a return self.cast(dataType). Made similar changes to groupby as well.
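As a concrete reference for the pattern settled on above, here is a minimal sketch assuming a simplified Column class; the real code lives in python/pyspark/sql/column.py, and this is an illustration rather than the exact implementation:

```python
# Hedged sketch of the pattern discussed above: astype keeps its own
# docstring (whose examples use astype), but delegates to cast instead of
# duplicating the implementation.
class Column(object):
    def cast(self, dataType):
        """Convert the column into type ``dataType``.

        .. note:: :func:`astype` is an alias for :func:`cast`.

        >>> df.select(df.age.cast("string").alias('ages')).collect()
        [Row(ages=u'2'), Row(ages=u'5')]
        """
        ...  # the real conversion logic lives here

    def astype(self, dataType):
        """:func:`astype` is an alias for :func:`cast`.

        >>> df.select(df.age.astype("string").alias('ages')).collect()
        [Row(ages=u'2'), Row(ages=u'5')]
        """
        return self.cast(dataType)
```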
Test build #1828 has finished for PR 8647 at commit
I've encountered the same problem @GayathriMurali has been looking at here, and separately reported and fixed it in [SPARK-10782]. There, the specific issue is that drop_duplicates doesn't exist as an independent method definition: around lines 1280-1285 of dataframe.py you'll find "drop_duplicates = dropDuplicates" as the method definition, so there is no separate documentation string for drop_duplicates. I "fixed" that problem by adding to the dropDuplicates documentation a note that drop_duplicates is an alias for dropDuplicates, on the presumption that the reader can substitute drop_duplicates for dropDuplicates in the code example on their own. That works for a few instances, but I think it points to a larger issue that may warrant a different approach.

In the case of dropDuplicates and drop_duplicates, there is also the method distinct. All three do the same thing, but distinct is NOT implemented as an alias of dropDuplicates. So we have 2 implementations for 3 methods, all intended to do 1 thing (remove duplicate rows from a dataframe). That doesn't sound healthy to me.

Another issue with the aliasing approach currently in use is that the `@since` property is attached to the single method definition. So if we were to alias distinct to dropDuplicates for 1.6.0, the documentation would indicate distinct has existed since 1.4.0 (rather than either 1.3.0, when it was actually created, or 1.6.0, when it was aliased to dropDuplicates).

End result: this is a bit of documentation and code structure that I'm interested in helping out with, but I lack the skills or background to carry this myself. @GayathriMurali, I am interested in assisting if you are still working on this. I have some thoughts about the larger design intent that could guide individual fixes for different methods - what is the best forum for documenting and discussing these?
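To make the aliasing pattern and the `@since` concern concrete, here is a hedged, simplified sketch (not the exact dataframe.py code; the decorator wiring is illustrative):

```python
# Simplified illustration of the alias pattern described above: the alias is
# a plain attribute assignment, so it shares one function object -- and hence
# one docstring and one @since version -- with the original method.
from pyspark import since  # pyspark's "versionadded" decorator

class DataFrame(object):
    @since(1.4)
    def dropDuplicates(self, subset=None):
        """Return a new :class:`DataFrame` with duplicate rows removed.

        .. note:: :func:`drop_duplicates` is an alias for :func:`dropDuplicates`.
        """
        ...  # the real implementation lives here

    # Any alias defined this way renders with "since 1.4" in the docs,
    # regardless of when the alias itself was introduced.
    drop_duplicates = dropDuplicates
```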
python/pyspark/sql/dataframe.py
Outdated
indentation is off?
@davies @GayathriMurali any updates on this? Is this patch still active? |
Adds the benchmark results as comments. The codegen version is slower than the interpreted version for the `simple` case because of 3 reasons:

1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
2. The codegen version writes the hash value to a row first and then reads it out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about a 60% speedup for the `simple` case; is it worth it?
3. The row in the `simple` case has only one int field, so the runtime reflection may be removed by branch prediction, which makes the interpreted version faster.

The `array` case is also slow for similar reasons, e.g. the array elements are of the same type, so the interpreted version can probably get rid of runtime reflection via branch prediction.

Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#10917 from cloud-fan/hash-benchmark.
…r ShuffleBlockFetcherIterator Call shuffleMetrics's incRemoteBytesRead and incRemoteBlocksFetched when polling FetchResult from `results`, so as to always use shuffleMetrics in one thread. Also fix a race condition that could cause a memory leak. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11138 from zsxwing/SPARK-13245.
This PR improves the lookup of BytesToBytesMap by: 1. Generating code to calculate the hash code of the grouping keys. 2. Not using MemoryLocation; instead fetching the baseObject and offset for key and value directly (removing the indirection). Author: Davies Liu <davies@databricks.com> Closes apache#11010 from davies/gen_map.
JIRA: https://issues.apache.org/jira/browse/SPARK-10524 Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#8734 from viirya/dt-soft-centroids.
…ng unnecessary Spark Filter
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
Current plan:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
+- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
== Physical Plan ==
+- Filter (col0#0 = xxx)
+- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```
This patch enables the plan below:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
+- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
== Physical Plan ==
Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes apache#10427 from maropu/RemoveFilterInJdbcScan.
`FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers the files in each batch and stores them in metadata files so that they can be recovered on restart. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically. This is based on the initial work from marmbrus. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11034 from zsxwing/stream-df-file-source.
Author: Gábor Lipták <gliptak@gmail.com> Closes apache#9532 from gliptak/SPARK-11565.
Author: Jon Maurer <tritab@gmail.com> Author: Jonathan Maurer <jmaurer@Jonathans-MacBook-Pro.local> Closes apache#10789 from tritab/cmd_updates.
…Buffer andrewor14 Please take a look Author: tedyu <yuzhihong@gmail.com> Closes apache#11134 from tedyu/master.
Make Logging private[spark]. Pretty much all there is to it. Author: Sean Owen <sowen@cloudera.com> Closes apache#11103 from srowen/SPARK-9307.
…rse grained mesos mode. This is the next iteration of tnachen's previous PR: apache#4027. In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` so that it is consistent with YARN and Standalone. This PR implements that resolution.

This PR implements two high-level features. These two features are co-dependent, so they're both implemented here:
- Mesos support for spark.executor.cores
- Multiple executors per slave

We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite (https://github.com/typesafehub/mesos-spark-integration-tests), which passes for this PR. The contribution is my original work and I license the work to the project under the project's open source license.

Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes apache#10993 from mgummelt/executor_sizing.
The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / apache#7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.
This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy.
In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.
/cc davies and marmbrus for review.
Author: Josh Rosen <joshrosen@databricks.com>
Closes apache#11145 from JoshRosen/take-ordered-and-project-fix.
…ot getting set correctly. The column width for the new DataTables now adjusts to the current page rather than being hard-coded for the entire table's data. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes apache#11057 from ajbozarth/spark13163.
The right margin of the history page is a little bit off. A simple fix for that issue. Author: zhuol <zhuol@yahoo-inc.com> Closes apache#11029 from zhuoliu/13126.
JIRA: https://issues.apache.org/jira/browse/SPARK-13537

## What changes were proposed in this pull request?

In readBytes of VectorizedPlainValuesReader, we use buffer[offset] to access bytes in the buffer. This is incorrect because offset has Platform.BYTE_ARRAY_OFFSET added to it at initialization. We should fix it.

## How was this patch tested?

`ParquetHadoopFsRelationSuite` sometimes (depending on the randomly generated data) will [fail](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52136/consoleFull) because of this bug. After applying this patch, the test passes. I added a test to `ParquetHadoopFsRelationSuite` with data that fails without this patch. The error exception:

[info] ParquetHadoopFsRelationSuite:
[info] - test all data types - StringType (440 milliseconds)
[info] - test all data types - BinaryType (434 milliseconds)
[info] - test all data types - BooleanType (406 milliseconds)
20:59:38.618 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2597.0 (TID 67966)
java.lang.ArrayIndexOutOfBoundsException: 46
at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBytes(VectorizedPlainValuesReader.java:88)

Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes apache#11418 from viirya/fix-readbytes.
Fix type inference issue for sparse CSV data - https://issues.apache.org/jira/browse/SPARK-13309 Author: Rahul Tanwani <rahul@Rahuls-MacBook-Pro.local> Closes apache#11194 from tanwanirahul/master.
…s default parameters consistent in Scala and Python ## What changes were proposed in this pull request? * The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala, which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.) * BTW, if we use a known updater (L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` will have no effect if we fall into that route. * Made a pass over all parameters of ```LogisticRegressionWithLBFGS```; the others are set properly. cc mengxr dbtsai ## How was this patch tested? No new tests; it should pass all current tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#11424 from yanboliang/spark-13545.
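For illustration only, a hedged pyspark call showing the parameter in question (training_rdd is a placeholder for an RDD of LabeledPoints):

```python
# Hypothetical sketch: training with an explicit regParam; after this change
# the default shown here (0.0) matches the Scala side.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

model = LogisticRegressionWithLBFGS.train(training_rdd, regParam=0.0)
```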
…anager in local mode Author: Jeff Zhang <zjffdu@apache.org> Closes apache#10914 from zjffdu/SPARK-12994.
…sistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the regression module. Also, updated 2 params in classification to read as `Supported values:` to be consistent. closes apache#10600 Author: vijaykiran <mail@vijaykiran.com> Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#11404 from BryanCutler/param-desc-consistent-regression-SPARK-12633.
## What changes were proposed in this pull request? By default, the table currently sorts in ascending order of appId. We might prefer to display in descending order by default, which shows the latest application at the top. ## How was this patch tested? Manually tested. See the screenshot below:  Author: zhuol <zhuol@yahoo-inc.com> Closes apache#11357 from zhuoliu/13481.
…ociationRulesSuite JIRA: https://issues.apache.org/jira/browse/SPARK-13506 ## What changes were proposed in this pull request? Just change the R snippet comment in AssociationRulesSuite. ## How was this patch tested? Unit tests passed. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes apache#11387 from zhengruifeng/ars.
… as Dataset element type ## What changes were proposed in this pull request? Nested classes defined within Scala objects are translated into Java static nested classes. Unlike inner classes, they don't need outer scopes. But the analyzer still thinks that an outer scope is required. This PR fixes this issue simply by checking whether a nested class is static before looking up its outer scope. ## How was this patch tested? A test case is added to `DatasetSuite`. It checks contents of a Dataset whose element type is a nested class declared in a Scala object. Author: Cheng Lian <lian@databricks.com> Closes apache#11421 from liancheng/spark-13540-object-as-outer-scope.
… function call https://issues.apache.org/jira/browse/SPARK-13507 https://issues.apache.org/jira/browse/SPARK-13509 ## What changes were proposed in this pull request? This PR adds support for writing CSV data directly by a single call to the given path. Several unit tests were added for each piece of functionality. ## How was this patch tested? This was tested with unit tests and with `dev/run_tests` for coding style. Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes apache#11389 from HyukjinKwon/SPARK-13507-13509.
…gate #### What changes were proposed in this pull request? After analysis by the Analyzer, two operators could have aliases: `Project` and `Aggregate`. So far, we only rewrite and propagate constraints if an `Alias` is defined in `Project`. This PR resolves this issue for `Aggregate`. #### How was this patch tested? Added a test case for `Aggregate` in `ConstraintPropagationSuite`. marmbrus sameeragarwal Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11422 from gatorsmile/validConstraintsInUnaryNodes.
…eartbeat to driver more than N times ## What changes were proposed in this pull request? Sometimes the network disconnection event won't be triggered, due to other potential race conditions that we may not have thought of; the executor will then keep sending heartbeats to the driver and won't exit. This PR adds a new configuration `spark.executor.heartbeat.maxFailures` to kill the executor when it is unable to heartbeat to the driver more than `spark.executor.heartbeat.maxFailures` times. ## How was this patch tested? Unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11401 from zsxwing/SPARK-13522.
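As a usage illustration only (the config key is taken from the commit message above; the value 60 is an arbitrary example), the new setting would be applied like any other Spark conf:

```python
# Hypothetical usage sketch of the config described above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("heartbeat-demo")
        .set("spark.executor.heartbeat.maxFailures", "60"))
sc = SparkContext(conf=conf)
```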
## What changes were proposed in this pull request? Just fixed the log place introduced by apache#11401 ## How was this patch tested? unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11432 from zsxwing/SPARK-13522-follow-up.
## What changes were proposed in this pull request? This PR adds support for implementing whole-stage codegen for sort. It builds heavily on nongli's PR apache#11008 (which actually implements the feature) and adds the following changes on top:
- [x] Generated code updates peak execution memory metrics
- [x] Unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`

## How was this patch tested? New unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`. Further, all existing sort tests should pass. Author: Sameer Agarwal <sameer@databricks.com> Author: Nong Li <nong@databricks.com> Closes apache#11359 from sameeragarwal/sort-codegen.
The Hive client library is not smart enough to notice that the current user is a proxy user; so when using a proxy user, it fails to fetch delegation tokens from the metastore because of a missing kerberos TGT for the current user. To fix it, just run the code that fetches the delegation token as the real logged in user. Tested on a kerberos cluster both submitting normally and with a proxy user; Hive and HBase tokens are retrieved correctly in both cases. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#11358 from vanzin/SPARK-13478.
…llib.JavaBisectingKMeansExample JIRA: https://issues.apache.org/jira/browse/SPARK-13551 ## What changes were proposed in this pull request? Fix a wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample. ## How was this patch tested? Manual test. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes apache#11429 from zhengruifeng/mllib_bkm_je.
JIRA: https://issues.apache.org/jira/browse/SPARK-13550 ## What changes were proposed in this pull request? Just add a Java example for ml.clustering.BisectingKMeans. ## How was this patch tested? Manual tests were done. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes apache#11428 from zhengruifeng/ml_bkm_je.
## What changes were proposed in this pull request? This patch fixes the problem that pyspark fails on Windows because pyspark can't find ```spark-submit2.cmd```. ## How was this patch tested? Manual tests: I ran ```bin\pyspark.cmd``` and checked that pyspark launched correctly after this patch was applied. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes apache#11442 from tsudukim/feature/SPARK-13592.
JIRA: https://issues.apache.org/jira/browse/SPARK-13511

## What changes were proposed in this pull request?

The current limit operator doesn't support whole-stage codegen; this PR adds support for it. In the `doConsume` of `GlobalLimit` and `LocalLimit`, we use a count term to count the processed rows. Once the row count reaches the limit, we set the variable `stopEarly` of `BufferedRowIterator` (newly added in this PR) to `true`, indicating that we want to stop processing the remaining rows. Then, when the whole-stage codegen framework checks `shouldStop()`, it stops processing the row iterator.

Before this, the executed plan for the query `sqlContext.range(N).limit(100).groupBy().sum()` is:

TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Final,isDistinct=false)], output=[sum(id)#6L])
+- TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Partial,isDistinct=false)], output=[sum#9L])
+- GlobalLimit 100
+- Exchange SinglePartition, None
+- LocalLimit 100
+- Range 0, 1, 1, 524288000, [id#5L]

After adding whole-stage codegen support:

WholeStageCodegen
: +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Final,isDistinct=false)], output=[sum(id)#41L])
: +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Partial,isDistinct=false)], output=[sum#44L])
: +- GlobalLimit 100
: +- INPUT
+- Exchange SinglePartition, None
+- WholeStageCodegen
: +- LocalLimit 100
: +- Range 0, 1, 1, 524288000, [id#40L]

## How was this patch tested?

A test is added to BenchmarkWholeStageCodegen.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes apache#11391 from viirya/wholestage-limit.
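A hedged pyspark snippet to reproduce the plans quoted above, assuming a shell where sqlContext is available:

```python
# Runs the query from the commit message and prints its physical plan,
# which is how you can check whether limit participates in codegen.
N = 524288000  # size taken from the Range node in the quoted plan
df = sqlContext.range(N).limit(100).groupBy().sum()
df.explain()
```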
An estimator for Generalized Linear Models (GLMs), which will be solved by IRLS. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#11136 from yanboliang/spark-12811.
Closes apache#10940 Closes apache#11302 Closes apache#11430 Closes apache#10912
## What changes were proposed in this pull request? This PR defers the resolution from an id in the dictionary to a value until the column is actually accessed (inside getInt/getLong); this is very useful for columns and rows that are filtered out. It's also useful for the binary type, since we will not need to copy all the byte arrays. This PR also changes the underlying type for small decimals that fit within an Int, in order to use getInt() to look up the value from an IntDictionary. ## How was this patch tested? Manually tested TPCDS Q7 with scale factor 10; saw about 30% improvement (after PR apache#11274). Author: Davies Liu <davies@databricks.com> Closes apache#11437 from davies/decode_dict.
## What changes were proposed in this pull request? This patch moves tags and unsafe modules into common directory to remove 2 top level non-user-facing directories. ## How was this patch tested? Jenkins should suffice. Author: Reynold Xin <rxin@databricks.com> Closes apache#11426 from rxin/SPARK-13548.
## What changes were proposed in this pull request? Broadcast left semi join without joining keys is already supported in BroadcastNestedLoopJoin; it has the same implementation as LeftSemiJoinBNL, so we should remove the latter. ## How was this patch tested? Updated unit tests. Author: Davies Liu <davies@databricks.com> Closes apache#11448 from davies/remove_bnl.
… when reading from JDBC datasources. Rows with null values in the partition column are not included in the results because none of the partition WHERE clauses specifies an IS NULL predicate on the partition column. This fix adds an IS NULL predicate on the partition column to the first JDBC partition's WHERE clause. Example: JDBCPartition(THEID < 1 or THEID is null, 0), JDBCPartition(THEID >= 1 AND THEID < 2, 1), JDBCPartition(THEID >= 2, 2). Author: sureshthalamati <suresh.thalamati@gmail.com> Closes apache#11063 from sureshthalamati/nullable_jdbc_part_col_spark-13167.
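For context, a hedged sketch of the kind of partitioned JDBC read that generates these per-partition WHERE clauses (the url and table values are taken from the plan output quoted earlier and stand in for real connection details):

```python
# Hypothetical illustration: with numPartitions=3 over THEID, Spark generates
# per-partition WHERE clauses like those in the example above; this fix adds
# "or THEID is null" to the first partition so null rows are not dropped.
df = sqlContext.read.jdbc(
    url="jdbc:postgresql:postgres",
    table="testRel",
    column="THEID",   # partition column
    lowerBound=0,
    upperBound=3,
    numPartitions=3,
)
```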
@andrewor14 Closing this PR as it is long overdue. Opening a new PR with all the changes made.
Made documentation changes to the PySpark SQL module.