
[SPARK-15689][SQL] data source v2 read path #19136

Closed
wants to merge 4 commits from cloud-fan:data-source-v2

Conversation

cloud-fan (Contributor) commented Sep 5, 2017

What changes were proposed in this pull request?

This PR adds the infrastructure for data source v2 and implements the features Spark already has in data source v1, i.e. column pruning, filter push-down, catalyst expression filter push-down, InternalRow scan, schema inference, and data size reporting. The write path is excluded to avoid making this PR too big; it will be added in a follow-up PR.

How was this patch tested?

new tests

cloud-fan (Author) commented Sep 5, 2017

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java Outdated
* 2. propagate information upward to Spark, e.g., report statistics, report ordering, etc.
* Spark first applies all operator push down optimizations which this data source supports. Then
* Spark collects information this data source provides for further optimizations. Finally Spark
* issues the scan request and does the actual data reading.

cloud-fan (Author) commented Sep 5, 2017

TODO: this is not true now, as we push down operators at the planning phase. We need to do some refactoring and move it to the optimization phase.

RussellSpitzer (Contributor) commented Sep 6, 2017

This would be really nice imho.

SparkQA commented Sep 5, 2017

Test build #81412 has finished for PR 19136 at commit 543a40b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DataSourceV2Options
  • public abstract class DataSourceV2Reader
  • class RowToUnsafeRowReadTask implements ReadTask<UnsafeRow>
  • class RowToUnsafeDataReader implements DataReader<UnsafeRow>
  • class DataSourceRDDPartition[T : ClassTag](val index: Int, val readTask: ReadTask[T])
  • class DataSourceRDD[T: ClassTag](
  • case class DataSourceV2Relation(
  • case class DataSourceV2ScanExec(
SparkQA commented Sep 6, 2017

Test build #81441 has finished for PR 19136 at commit a824d44.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DataSourceV2Options
  • public abstract class DataSourceV2Reader
  • class RowToUnsafeRowReadTask implements ReadTask<UnsafeRow>
  • class RowToUnsafeDataReader implements DataReader<UnsafeRow>
  • class DataSourceRDDPartition(val index: Int, val readTask: ReadTask[UnsafeRow])
  • class DataSourceRDD(
  • case class DataSourceV2Relation(
  • case class DataSourceV2ScanExec(
cloud-fan (Author) commented Sep 6, 2017

retest this please

SparkQA commented Sep 6, 2017

Test build #81466 has finished for PR 19136 at commit a824d44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DataSourceV2Options
  • public abstract class DataSourceV2Reader
  • class RowToUnsafeRowReadTask implements ReadTask<UnsafeRow>
  • class RowToUnsafeDataReader implements DataReader<UnsafeRow>
  • class DataSourceRDDPartition(val index: Int, val readTask: ReadTask[UnsafeRow])
  • class DataSourceRDD(
  • case class DataSourceV2Relation(
  • case class DataSourceV2ScanExec(
rdblue (Contributor) commented Sep 6, 2017

Thanks for pinging me. I left comments on the older PR, since other discussion was already there. If you'd prefer comments here, just let me know.

cloud-fan force-pushed the cloud-fan:data-source-v2 branch on Sep 7, 2017
sql/core/src/main/java/org/apache/spark/sql/sources/v2/SchemaRequiredDataSourceV2.java Outdated
/**
* A variant of `DataSourceV2` which requires users to provide a schema when reading data. A data
* source can inherit both `DataSourceV2` and `SchemaRequiredDataSourceV2` if it supports both schema
* inference and user-specified schemas.

cloud-fan (Author) commented Sep 7, 2017

cc @rdblue for the new API of schema reference.

SparkQA commented Sep 7, 2017

Test build #81490 has finished for PR 19136 at commit 89cbfb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DataSourceV2Options
  • public abstract class DataSourceV2Reader
  • class RowToUnsafeRowReadTask implements ReadTask<UnsafeRow>
  • class RowToUnsafeDataReader implements DataReader<UnsafeRow>
  • class DataSourceRDDPartition(val index: Int, val readTask: ReadTask[UnsafeRow])
  • class DataSourceRDD(
  • case class DataSourceV2Relation(
  • case class DataSourceV2ScanExec(
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java Outdated
*/
@Experimental
@InterfaceStability.Unstable
public List<ReadTask<UnsafeRow>> createUnsafeRowReadTasks() {

sureshthalamati (Contributor) commented Sep 7, 2017

I really like the new API's flexibility to implement the different types of support. Considering that UnsafeRow is unstable, would it be possible to move createUnsafeRowReadTasks to a different interface? That would let a data source provide two implementations, one with Row and another with UnsafeRow, and make it easily configurable based on the Spark version.

sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java Outdated
* Adds one more entry to the options.
* This should only be called by Spark, not data source implementations.
*/
public void addOption(String key, String value) {

sureshthalamati (Contributor) commented Sep 7, 2017

The check added for addOption prevents modifying the options passed to the data source, but a data source can still add new options by accident. I think it might be safer to pass a DataSourceV2Options that is unmodifiable by the data source.

cloud-fan (Author) commented Sep 7, 2017

good point, I'll make it immutable.
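For illustration, a minimal sketch of what a read-only options holder could look like, assuming a one-time defensive copy with case-insensitive keys; this is not the PR's final DataSourceV2Options code:

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: copy the caller's map once, lower-casing keys, and
// expose only read methods so a data source cannot add or change options.
public class ImmutableOptionsSketch {
    private final Map<String, String> keyLowerCasedMap;

    public ImmutableOptionsSketch(Map<String, String> originalMap) {
        keyLowerCasedMap = new HashMap<>(originalMap.size());
        for (Map.Entry<String, String> entry : originalMap.entrySet()) {
            keyLowerCasedMap.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
        }
    }

    // Case-insensitive lookup; there is deliberately no addOption/put.
    public String get(String key) {
        return keyLowerCasedMap.get(key.toLowerCase(Locale.ROOT));
    }
}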

cloud-fan force-pushed the cloud-fan:data-source-v2 branch 2 times, most recently on Sep 7, 2017
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/scan/UnsafeRowScan.java Outdated
*/
@Experimental
@InterfaceStability.Unstable
public interface UnsafeRowScan extends DataSourceV2Reader {

cloud-fan (Author) commented Sep 7, 2017

cc @j-baker for the new unsafe row scan API. Programmatically, unsafe row scan should be in the base class and normal row scan should be in the child class. However, conceptually for a developer, normal row scan is the basic interface and should be in the base class; unsafe row scan is more of an add-on and should be in the child class.
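To make the trade-off concrete, here is a hedged sketch of the layout described above, with the plain row scan as the base interface and the unsafe row scan as an add-on; Row, UnsafeRow and ReadTask are minimal placeholders for the Spark classes of the same names, and the exact method names are assumptions:

import java.util.List;

// Placeholder types standing in for Spark's Row, UnsafeRow and ReadTask.
final class Row {}
final class UnsafeRow {}
interface ReadTask<T> {}

// The "basic" reader interface: every data source implements the Row-based scan.
interface RowScanReaderSketch {
    List<ReadTask<Row>> createReadTasks();
}

// The add-on mix-in: readers that can emit UnsafeRow directly implement this,
// and Spark prefers it over the Row path when present.
interface UnsafeRowScanSketch extends RowScanReaderSketch {
    List<ReadTask<UnsafeRow>> createUnsafeRowReadTasks();
}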

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java Outdated
* task will always run on these locations. Implementations should make sure that it can
* be run on any location.
*/
default String[] preferredLocations() {

ash211 (Contributor) commented Sep 7, 2017

What format are these strings expected to be in? If Spark will be placing this ReadTask onto an executor that is a preferred location, the format will need to be a documented part of the API.

Are there levels of preference, or only a binary choice? I'm thinking node vs rack vs datacenter for on-prem clusters, or instance vs AZ vs region for cloud clusters.

RussellSpitzer (Contributor) commented Sep 7, 2017

These have previously only been IPs/hostnames. To match the RDD definition, I think we would have to continue with that.

cloud-fan (Author) commented Sep 7, 2017

This API matches RDD.preferredLocations directly; I'll add more documentation here.
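For illustration, a hedged sketch of a read task reporting hostnames, in the same spirit as RDD.preferredLocations; the interface shape and host values here are illustrative, not the PR's exact API:

// Hypothetical sketch: the returned strings are hostnames/IPs, as with
// RDD.preferredLocations, and Spark treats them as a best-effort hint only,
// so the task must still be runnable on any executor.
interface LocalityAwareReadTaskSketch {
    default String[] preferredLocations() {
        return new String[0];  // default: no locality preference
    }
}

class HdfsBlockReadTaskSketch implements LocalityAwareReadTaskSketch {
    @Override
    public String[] preferredLocations() {
        return new String[] {"worker-1.example.com", "worker-2.example.com"};
    }
}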

j-baker commented Sep 12, 2017

Can we have a Host class which represents this? It just makes the API clearer.

cloud-fan (Author) commented Sep 13, 2017

hmmm, do you mean create a Host class which only has a string field?

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/upward/StatisticsSupport.java Outdated
* A mix in interface for `DataSourceV2Reader`. Users can implement this interface to report
* statistics to Spark.
*/
public interface StatisticsSupport {

ash211 (Contributor) commented Sep 7, 2017

Some data sources have per-column statistics, like how many bytes a column has or its min/max (e.g. things required for CBO).

Should that be a separate interface from this one?

cloud-fan (Author) commented Sep 8, 2017

I'd like to put column stats in a separate interface, because we already separate basic stats and column stats in ANALYZE TABLE.
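As a rough illustration of that separation, a hedged sketch of two independent mix-ins, one for basic stats and one for per-column stats; all names and fields here are hypothetical, not part of this PR:

import java.util.Map;
import java.util.OptionalLong;

// Hypothetical basic-stats mix-in, in the spirit of StatisticsSupport above.
interface BasicStatisticsSketch {
    long sizeInBytes();
}

// Hypothetical per-column stats, kept in a separate mix-in, mirroring how
// ANALYZE TABLE separates table-level stats from column-level stats.
interface ColumnStatisticsSketch {
    Map<String, ColumnStatSketch> columnStats();  // keyed by column name
}

final class ColumnStatSketch {
    OptionalLong distinctCount = OptionalLong.empty();
    OptionalLong nullCount = OptionalLong.empty();
}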

SparkQA commented Sep 7, 2017

Test build #81506 has finished for PR 19136 at commit ee5faf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DataSourceV2Options
  • class DataSourceRDDPartition(val index: Int, val readTask: ReadTask[UnsafeRow])
  • class DataSourceRDD(
  • case class DataSourceV2Relation(
  • case class DataSourceV2ScanExec(
  • class RowToUnsafeRowReadTask(rowReadTask: ReadTask[Row], schema: StructType)
  • class RowToUnsafeDataReader(rowReader: DataReader[Row], encoder: ExpressionEncoder[Row])
SparkQA commented Sep 7, 2017

Test build #81510 has finished for PR 19136 at commit 182b89d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DataSourceV2Options
  • class DataSourceRDDPartition(val index: Int, val readTask: ReadTask[UnsafeRow])
  • class DataSourceRDD(
  • case class DataSourceV2Relation(
  • case class DataSourceV2ScanExec(
  • class RowToUnsafeRowReadTask(rowReadTask: ReadTask[Row], schema: StructType)
  • class RowToUnsafeDataReader(rowReader: DataReader[Row], encoder: ExpressionEncoder[Row])
sql/core/src/main/java/org/apache/spark/sql/sources/v2/SchemaRequiredDataSourceV2.java Outdated
*
* @param schema the full schema of this data source reader. Full schema usually maps to the
* physical schema of the underlying storage of this data source reader, e.g.
* parquet files, JDBC tables, etc, while this reader may not read data with full

rdblue (Contributor) commented Sep 7, 2017

Maybe update the doc here, since JDBC sources and Parquet files probably shouldn't implement this. CSV and JSON are the examples that come to mind for sources that require a schema.


object DataSourceV2Relation {
def apply(reader: DataSourceV2Reader): DataSourceV2Relation = {
new DataSourceV2Relation(reader.readSchema().toAttributes, reader)

rdblue (Contributor) commented Sep 7, 2017

Is this the right schema? The docs for readSchema say it is the result of pushdown and projection, which doesn't seem appropriate for a Relation. Does relation represent a table that can be filtered and projected, or does it represent a single read? At least in the Hive read path, it's a table.

On the other hand, it is much better to compute stats on a relation that's already filtered.

cloud-fan (Author) commented Sep 7, 2017

I left it as a TODO, as it needs some refactoring of the optimizer. For now DataSourceV2Relation represents a data source without any optimizations applied: we do these optimizations during planning. This is also a problem for data source v1, and that's why we implement partition pruning as an optimizer rule instead of inside the data source: we need to update the stats.

rdblue (Contributor) commented Sep 7, 2017

Are you saying that partition pruning isn't delegated to the data source in this interface?

I was just looking into how the data source should provide partition data, or at least fields that are the same for all rows in a ReadTask. It would be nice to have a way to pass those up instead of materializing them in each UnsafeRow.

cloud-fan (Author) commented Sep 8, 2017

In data source V2, we will delegate partition pruning to the data source, although we need to do some refactoring to make it happen.

I was just looking into how the data source should provide partition data, or at least fields that are the same for all rows in a ReadTask. It would be nice to have a way to pass those up instead of materializing them in each UnsafeRow.

This can be achieved by the columnar reader. Consider a data source with a data column i and a partition column j: the returned columnar batch has 2 column vectors, for i and j. Column vector i is a normal one that contains all the values of column i within this batch, while column vector j is a constant vector that only contains a single value.
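A hedged, self-contained sketch of that idea: the partition column is backed by a constant vector that stores its single value once instead of repeating it per row; the classes below are illustrative and do not reproduce Spark's ColumnVector API:

// Illustrative per-batch column abstraction with two implementations.
interface IntColumnVectorSketch {
    int getInt(int rowId);
}

// Normal vector for data column i: one slot per row in the batch.
final class OnHeapIntVectorSketch implements IntColumnVectorSketch {
    private final int[] values;
    OnHeapIntVectorSketch(int[] values) { this.values = values; }
    public int getInt(int rowId) { return values[rowId]; }
}

// Constant vector for partition column j: the value is stored once, so it is
// never materialized into the individual rows of the batch.
final class ConstantIntVectorSketch implements IntColumnVectorSketch {
    private final int value;
    ConstantIntVectorSketch(int value) { this.value = value; }
    public int getInt(int rowId) { return value; }
}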

rdblue (Contributor) commented Sep 8, 2017

I think we should add a way to provide partition values outside of the columnar reader. It wouldn't be too difficult to add a method on ReadTask that returns them, then create a joined row in the scan exec. Otherwise, this requires a lot of wasted memory for a scan.

cloud-fan (Author) commented Sep 12, 2017

I think users can write a special ReadTask to do it, but we can't save memory by doing this. When an operator (the scan operator) transfers data to another operator, the data must be UnsafeRows. So even if users return a joined row in the ReadTask, Spark needs to convert it to UnsafeRow.

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala Outdated
case r: CatalystFilterPushDownSupport =>
r.pushCatalystFilters(filters.toArray)

case r: FilterPushDownSupport =>

viirya (Member) commented Sep 10, 2017

Looks like CatalystFilterPushDownSupport and FilterPushDownSupport are exclusive? But we can't prevent users from implementing them both?

cloud-fan (Author) commented Sep 12, 2017

Yea, we can't prevent users from implementing them both, and we will pick CatalystFilterPushDownSupport over FilterPushDownSupport. Let me document it.

j-baker commented Sep 12, 2017

Can FilterPushDownSupport be an interface which extends CatalystFilterPushDownSupport and provides a default impl of pushing down the catalyst filters? Like, this code can just go there as a method:

interface FilterPushDownSupport extends CatalystFilterPushDownSupport {
    List<Filter> pushFilters(List<Filter> filters);

    default List<Expression> pushCatalystFilters(List<Expression> filters) {
        Map<Filter, Expression> translatedMap = new HashMap<>();
        List<Expression> nonconvertiblePredicates = new ArrayList<>();

        for (Expression catalystFilter : filters) {
            Optional<Filter> translatedFilter = DataSourceStrategy.translateFilter(catalystFilter);
            if (translatedFilter.isPresent()) {
                translatedMap.put(translatedFilter.get(), catalystFilter);
            } else {
                nonconvertiblePredicates.add(catalystFilter);
            }
        }

        List<Filter> unhandledFilters = pushFilters(new ArrayList<>(translatedMap.values()));
        return Stream.concat(
            nonconvertiblePredicates.stream(),
            unhandledFilters.stream().map(translatedMap::get))
           .collect(toList());
    }
}

and we can trivially avoid the interface confusion (it's truly confusing if you can implement two interfaces)

j-baker commented Sep 12, 2017

like, we might as well not document it if the code can document it

cloud-fan (Author) commented Sep 12, 2017

good idea!

viirya (Member) commented Sep 12, 2017

By doing so, do we still need to match both CatalystFilterPushDownSupport and FilterPushDownSupport here?

cloud-fan (Author) commented Sep 13, 2017

After some attempts, I went back to 2 individual interfaces. The reasons are: a) CatalystFilterPushDownSupport is an unstable interface, and it looks weird to let a stable interface extend an unstable one; b) the logic that converts expressions to public filters belongs to Spark internals, and we may change it in the future, so we should not put that code in a public interface, or we risk breaking compatibility for this interface.
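For reference, a hedged sketch of the two independent mix-ins as settled on here, assuming array-based signatures like the pushCatalystFilters one quoted later in this conversation; the Filter and Expression placeholders stand in for the Spark classes:

// Placeholders for org.apache.spark.sql.sources.Filter and catalyst's Expression.
final class FilterSketch {}
final class ExpressionSketch {}

// Stable mix-in: push-down expressed with the public Filter API; the data
// source returns the filters it could not handle.
interface FilterPushDownSupportSketch {
    FilterSketch[] pushFilters(FilterSketch[] filters);
}

// Experimental/unstable mix-in: push-down expressed with catalyst Expressions.
// It stays a sibling rather than a parent of the stable interface, so the
// Expression-to-Filter translation remains inside Spark instead of living in a
// public default method.
interface CatalystFilterPushDownSupportSketch {
    ExpressionSketch[] pushCatalystFilters(ExpressionSketch[] filters);
}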

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala Outdated

// Match original case of attributes.
// TODO: nested fields pruning
val requiredColumns = (projectSet ++ filterSet).toSeq.map(attrMap)

viirya (Member) commented Sep 10, 2017

Do we need to request columns that are only referenced by pushed filters?

rdblue (Contributor) commented Sep 11, 2017

It seems reasonable to only request the ones that will be used, or that have residuals after pushing filters.

cloud-fan changed the title from [DO NOT MERGE][SPARK-15689][SQL] data source v2 to [SPARK-15689][SQL] data source v2 on Sep 12, 2017
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReader.java Outdated
import java.io.Closeable;

/**
* A data reader returned by a read task and is responsible for outputting data for an RDD

gatorsmile (Member) commented Sep 12, 2017

Nit: an -> a

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java Outdated
* 1. push operators downward to the data source, e.g., column pruning, filter push down, etc.
* 2. propagate information upward to Spark, e.g., report statistics, report ordering, etc.
* 3. special scans like columnar scan, unsafe row scan, etc. Note that a data source reader can
* at most implement one special scan.

gatorsmile (Member) commented Sep 12, 2017

at most implement one -> implement at most one

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java Outdated
* 3. special scans like columnar scan, unsafe row scan, etc. Note that a data source reader can
* at most implement one special scan.
*
* Spark first applies all operator push down optimizations which this data source supports. Then

gatorsmile (Member) commented Sep 12, 2017

push down -> push-down

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java Outdated
StructType readSchema();

/**
* Returns a list of read tasks, each task is responsible for outputting data for one RDD

gatorsmile (Member) commented Sep 12, 2017

, each -> . Each

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java Outdated

/**
* Returns a list of read tasks, each task is responsible for outputting data for one RDD
* partition, which means the number of tasks returned here is same as the number of RDD

gatorsmile (Member) commented Sep 12, 2017

, which means -> That means

...main/java/org/apache/spark/sql/sources/v2/reader/downward/CatalystFilterPushDownSupport.java Outdated

/**
* A mix-in interface for `DataSourceV2Reader`. Users can implement this interface to push down
* arbitrary expressions as predicates to the data source.

gatorsmile (Member) commented Sep 12, 2017

Note that, this is an experimental and unstable interface

...core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/ColumnPruningSupport.java Outdated

/**
* A mix-in interface for `DataSourceV2Reader`. Users can implement this interface to only read
* required columns/nested fields during scan.

gatorsmile (Member) commented Sep 12, 2017

-> the required

...core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/ColumnPruningSupport.java Outdated
/**
* Apply column pruning w.r.t. the given requiredSchema.
*
* Implementation should try its best to prune unnecessary columns/nested fields, but it's also

gatorsmile (Member) commented Sep 12, 2017

the unnecessary

...ore/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/FilterPushDownSupport.java Outdated
public interface FilterPushDownSupport {

/**
* Push down filters, returns unsupported filters.

gatorsmile (Member) commented Sep 12, 2017

Pushes down filters, and returns unsupported filters.

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/upward/StatisticsSupport.java Outdated
* statistics to Spark.
*/
public interface StatisticsSupport {
Statistics getStatistics();

gatorsmile (Member) commented Sep 12, 2017

Will the returned stats be adjusted by the data sources based on the operator push-down?

cloud-fan (Author) commented Sep 12, 2017

It should, but we need some refactoring of the optimizer; see #19136 (comment)

...main/java/org/apache/spark/sql/sources/v2/reader/downward/CatalystFilterPushDownSupport.java Outdated
/**
* Push down filters, returns unsupported filters.
*/
Expression[] pushCatalystFilters(Expression[] filters);

j-baker commented Sep 12, 2017

Any chance this could accept Java lists? They're just more idiomatic in a Java interface.

cloud-fan (Author) commented Sep 12, 2017

Java lists are not friendly to Scala implementations :)

sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/upward/Statistics.java Outdated
* An interface to represent statistics for a data source.
*/
public interface Statistics {
long sizeInBytes();

j-baker commented Sep 12, 2017

OptionalLong for sizeInBytes? It's not obvious that sizeInBytes is well defined for e.g. JDBC datasources, but row count can generally be easily estimated from the query plan.

j-baker commented Sep 12, 2017

like, I get that it's non-optional at the moment, but it's odd that we have a method that the normal implementor will have to replace with

public long sizeInBytes() {
    return Long.MAX_VALUE;
}

j-baker commented Sep 12, 2017

and now is a good time to fix it :)
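A minimal sketch of the OptionalLong shape being suggested, where an absent value means the source cannot estimate it; the JDBC figures below are made up for illustration:

import java.util.OptionalLong;

// Sketch of the suggested interface: empty means "unknown", so implementors
// no longer need a Long.MAX_VALUE sentinel.
interface StatisticsSketch {
    OptionalLong sizeInBytes();
    OptionalLong numRows();
}

final class JdbcStatisticsSketch implements StatisticsSketch {
    public OptionalLong sizeInBytes() { return OptionalLong.empty(); }      // hard to estimate over JDBC
    public OptionalLong numRows() { return OptionalLong.of(1_000_000L); }   // e.g. from a COUNT(*) estimate
}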

sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java Outdated

/**
* Returns the option value to which the specified key is mapped, case-insensitively,
* or {@code null} if there is no mapping for the key.

j-baker commented Sep 12, 2017

Can we return Optional&lt;String&gt; here? The JDK maintainers wish they could return Optional on Map.

sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java Outdated
* Returns the option value to which the specified key is mapped, case-insensitively,
* or {@code defaultValue} if there is no mapping for the key.
*/
public String getOrDefault(String key, String defaultValue) {

j-baker commented Sep 12, 2017

if the above returns Optional, you probably don't need this method.
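To illustrate, if get returns Optional, the default falls out at the call site and a separate getOrDefault is not needed; the option key and default value below are made up:

import java.util.Optional;

// Hypothetical usage sketch: Optional.orElse plays the role of getOrDefault.
final class OptionsUsageSketch {
    static String pathOrDefault(Optional<String> pathOption) {
        return pathOption.orElse("/tmp/output");  // illustrative default value
    }
}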

yueawang commented Sep 13, 2017

@cloud-fan, last week when I saw this PR, I remember you had also implemented some new push-downs like sort or limit. Are they removed in the latest commit? Any concern? Thanks!

cloud-fan changed the title from [SPARK-15689][SQL] data source v2 to [SPARK-15689][SQL] data source v2 read path on Sep 13, 2017
cloud-fan force-pushed the cloud-fan:data-source-v2 branch to abcc606 on Sep 13, 2017
cloud-fan (Author) commented Sep 13, 2017

@yueawang these new push-downs are in my prototype. This PR is the first version of data source v2, so I'd like to cut down the patch size and only implement features that we already have in data source v1.

SparkQA commented Sep 13, 2017

Test build #81717 has finished for PR 19136 at commit 4ff1b18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DataSourceV2Options
  • class DataSourceRDDPartition(val index: Int, val readTask: ReadTask[UnsafeRow])
  • class DataSourceRDD(
  • case class DataSourceV2Relation(
  • case class DataSourceV2ScanExec(
  • class RowToUnsafeRowReadTask(rowReadTask: ReadTask[Row], schema: StructType)
  • class RowToUnsafeDataReader(rowReader: DataReader[Row], encoder: ExpressionEncoder[Row])
SparkQA commented Sep 13, 2017

Test build #81722 has finished for PR 19136 at commit 1e86d5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Sep 13, 2017

Test build #81723 has finished for PR 19136 at commit abcc606.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
import java.util.OptionalLong;

/**
* An interface to represent statistics for a data source.

rxin (Contributor) commented Sep 13, 2017

link back to SupportsReportStatistics

SparkQA commented Sep 14, 2017

Test build #81773 has finished for PR 19136 at commit a1301f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
cloud-fan (Author) commented Sep 14, 2017

Any more comments? Is it ready to go?

* constructor.
*
* Note that this is an empty interface, data source implementations should mix-in at least one of
* the plug-in interfaces like `ReadSupport`. Otherwise it's just a dummy data source which is

rxin (Contributor) commented Sep 14, 2017

use an actual link ...

import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;

/**
* A mix-in interface for `DataSourceV2`. Users can implement this interface to provide data reading

rxin (Contributor) commented Sep 14, 2017

Users -> data source implementers

Actually a better one is

"Data sources can implement"

* source can implement both `ReadSupport` and `ReadSupportWithSchema` if it supports both schema
* inference and user-specified schema.
*/
public interface ReadSupportWithSchema {

rxin (Contributor) commented Sep 14, 2017

I still find ReadSupport vs ReadSupportWithSchema pretty confusing. But let's address that separately.


/**
* An interface to represent statistics for a data source, which is returned by
* `SupportsReportStatistics`.

rxin (Contributor) commented Sep 14, 2017

also use @link


class DataSourceRDD(
sc: SparkContext,
@transient private val generators: java.util.List[ReadTask[UnsafeRow]])

rxin (Contributor) commented Sep 14, 2017

Why is this called generators?

rxin (Contributor) commented Sep 14, 2017

LGTM.

Still some feedback that can be addressed later. We should also document all the APIs as Evolving.

SparkQA commented Sep 15, 2017

Test build #81809 has finished for PR 19136 at commit d2c86f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
cloud-fan (Author) commented Sep 15, 2017

thank you all for the review, merging to master!

asfgit closed this in c7307ac on Sep 15, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 12, 2017
## What changes were proposed in this pull request?

As we discussed in apache#19136 (comment), we should push down operators to the data source before planning, so that the data source can report statistics more accurately.

This PR also includes some cleanup for the read path.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19424 from cloud-fan/follow.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19136 from cloud-fan/data-source-v2.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19424 from cloud-fan/follow.