Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: make dfschema wrap schemaref #9595

Merged
merged 74 commits into from
Apr 1, 2024

Conversation

haohuaijin
Copy link
Contributor

@haohuaijin haohuaijin commented Mar 13, 2024

Which issue does this PR close?

Closes #4680
follow on #8905

Rationale for this change

see related issue #4680, #7944 and #8905

See also mailing list discussion: https://lists.apache.org/thread/79rpwd22bncdxbkwlq842rwr485kwwyh

What changes are included in this PR?

change the DFSchema and remove DFField struct
from

pub struct DFSchema {
    /// Fields
    fields: Vec<DFField>,
    /// Additional metadata in form of key value pairs
    metadata: HashMap<String, String>,
    /// Stores functional dependencies in the schema.
    functional_dependencies: FunctionalDependencies,
}

to

pub struct DFSchema {
    /// Inner Arrow schema reference.
    inner: SchemaRef,
    /// Optional qualifiers for each column in this schema. In the same order as
    /// the `self.inner.fields()`
    field_qualifiers: Vec<Option<OwnedTableReference>>,
    /// Stores functional dependencies in the schema.
    functional_dependencies: FunctionalDependencies,
}

and update the related API

Are these changes tested?

test with existing tests

Are there any user-facing changes?

New API for create DFSchema

  • pub fn from_unqualifed_fields(fields: Fields, metadata: HashMap<String, String>)->Result<Self>
  • pub fn try_from_qualified_schema<'a>(qualifier: impl Into<TableReference<'a>>,schema: &Schema,)->Result<Self>
  • pub fn new_with_metadata(qualified_fields: Vec<(Option<OwnedTableReference>, Arc<Field>)>,metadata: HashMap<String, String>) -> Result<Self>
  • pub fn from_field_specific_qualified_schema<'a>(qualifiers: Vec<Option<impl Into<TableReference<'a>>>>, schema: &SchemaRef) -> Result<Self>

New API to find qualified Field, the old API only return Field

  • pub fn qualified_field(&self, i: usize) -> (Option<&OwnedTableReference>, &Field): returns an immutable reference of a specific Field and its qualifier by an offset
  • pub fn qualified_field_with_name(&self, qualifier: Option<&TableReference>, name: &str) -> Result<(Option<&OwnedTableReference>, &Field)>: find the qualified field with the given name
  • pub fn qualified_fields_with_unqualified_name(&self, name: &str) -> Vec<(Option<&OwnedTableReference>, &Field)>: Find all fields that match the given name and return them with their qualifier
  • pub fn qualified_field_with_unqualified_name(&self,name: &str) -> Result<(Option<&OwnedTableReference>, &Field)>: Find the qualified field with the given unqualified name
  • pub fn qualified_field_from_column(&self,column: &Column) -> Result<(Option<&OwnedTableReference>, &Field)>: Find the field with the given qualified column
  • pub fn iter(&self) -> impl Iterator<Item = (Option<&OwnedTableReference>, &FieldRef)>: Iterate over the qualifiers and fields in the DFSchema

Other new API

  • pub fn columns(&self) -> Vec<Column>: return all Columns for the schema
  • pub fn columns_with_unqualified_name(&self, name: &str) -> Vec<Column>: find all fields that match the given name and convert to column

@haohuaijin haohuaijin marked this pull request as draft March 13, 2024 11:07
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Mar 13, 2024
@github-actions github-actions bot added the physical-expr Physical Expressions label Mar 14, 2024
@matthewmturner
Copy link
Contributor

Great job @haohuaijin, glad to see this near the finish line.

@alamb
Copy link
Contributor

alamb commented Mar 22, 2024

I plan to merge this tomorrow unless anyone else would like time to review. Thanks again everyone

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 24, 2024

I found it difficult to cleanup the tuple into DFField struct because there 4 kinds of combination I need to construct. I thought we only need DFFieldRef and DFField for owned.

Normal case

pub struct DFFieldRef<'a> {
    /// Optional qualifier (usually a table or relation name)
    qualifier: Option<&'a OwnedTableReference>,
    /// Arrow field definition
    field: &'a Field,
}

Sometimes we need FieldRef to avoid the cost of clone for field.

pub struct DFFieldRefWithArc<'a> {
    /// Optional qualifier (usually a table or relation name)
    qualifier: Option<&'a OwnedTableReference>,
    /// Arrow field definition
    field: &'a FieldRef,
}

It is usually easier to deal with owned dffield sometimes without additional cost, because we can't avoid clone for Schema::new_with_metadata

pub struct DFField {
    /// Optional qualifier (usually a table or relation name)
    pub qualifier: Option<OwnedTableReference>,
    /// Arrow field definition
    pub field: FieldRef,
}

And of course, we can have field not fieldRef, it seems better not to wrap it with Arc if possible.

pub struct DFField {
    /// Optional qualifier (usually a table or relation name)
    pub qualifier: Option<OwnedTableReference>,
    /// Arrow field definition
    pub field: Field,
}

@alamb
Copy link
Contributor

alamb commented Mar 24, 2024

I suggest we want until the imminent 37.0.0 release -- #9697 -- so we have additional time to test this change before releasing it to the larger community

@alamb
Copy link
Contributor

alamb commented Apr 1, 2024

Since Arrow 37 is now basically released I think this PR is ready to merge.

I took the liberty of merging up from main to resolve some conflicts, and once CI is complete I intend to merge this PR

Thanks again @haohuaijin

@matthewmturner
Copy link
Contributor

Great job getting this over the line @haohuaijin

@alamb alamb merged commit cc1db8a into apache:main Apr 1, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 1, 2024

🚀

@haohuaijin haohuaijin deleted the feat/make-dfschema-wrap-schemaref branch April 2, 2024 00:26
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 13, 2024
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 13, 2024
The previous implementation relied on `DFField` which was removed upstream.

Ref: apache/datafusion#9595
andygrove pushed a commit to apache/datafusion-python that referenced this pull request May 14, 2024
* chore: upgrade datafusion Deps

Ref #690

* update concat and concat_ws to use datafusion_functions

Moved in apache/datafusion#10089

* feat: upgrade functions.rs

Upstream is continuing it's migration to UDFs.

Ref apache/datafusion#10098
Ref apache/datafusion#10372

* fix ScalarUDF import

* feat: remove deprecated suppors_filter_pushdown and impl supports_filters_pushdown

Deprecated function removed in apache/datafusion#9923

* use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option`

* remove ScalarFunction wrappers

These relied on upstream BuiltinScalarFunction, which are now removed.

Ref apache/datafusion#10098

* update dataframe `test_describe`

`null_count` was fixed upstream.

Ref apache/datafusion#10260

* remove PyDFField and related methods

DFField was removed upstream.

Ref: apache/datafusion#9595

* bump `datafusion-python` package version to 38.0.0

* re-implement `PyExpr::column_name`

The previous implementation relied on `DFField` which was removed upstream.

Ref: apache/datafusion#9595
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core datafusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql sqllogictest substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make DfSchema wrap SchemaRef
5 participants