File-based CDK: As a user, I want to be able to rename the columns of the CSV file #29850

maxi297 · 2023-08-25T18:04:44Z

We have six workspaces using pyarrow config column_names:

configapi=> select workspace_id, auto_propagation_status, max(jobs.created_at) from jobs join connection on jobs.scope = connection.id::text join actor on connection.source_id = actor.id join schema_management on connection.id = schema_management.connection_id where actor.actor_definition_id = '69589781-7828-43c5-9f63-8925b1c1ccc2' and configuration::text like '%column_names%' group by workspace_id, auto_propagation_status;

One hasn't been synced since May but the other we're synced in the last month so this is still useful.

What pyarrow does to determine column names:

Given column_names is provided, assume there are no headers and use those columns (number needs to match with the number of columns in the CSV)
Else if column_names not provided, fallback on autogenerate_column_names and if true, assume there are no headers and generate headers
Else if not autogenerate_column_names, assume there is a header row located after skip_rows

New solution
Using the rejected solution below, we would get two levels of "Optional fields" as the frontend handles them separately. Hence, we decided to go with a oneOf using this interface:


class CsvHeaderDefinitionType(Enum):
    FROM_CSV = "From CSV"
    AUTOGENERATED = "Autogenerated"
    USER_PROVIDED = "User Provided"


class CsvHeaderFromCsv(BaseModel):
    class Config:
        title = "From CSV"

    header_definition_type: Literal[CsvHeaderDefinitionType.FROM_CSV.value] = CsvHeaderDefinitionType.FROM_CSV.value  # type: ignore


class CsvHeaderAutogenerated(BaseModel):
    class Config:
        title = "Autogenerated"

    header_definition_type: Literal[CsvHeaderDefinitionType.AUTOGENERATED.value] = CsvHeaderDefinitionType.AUTOGENERATED.value  # type: ignore


class CsvHeaderUserProvided(BaseModel):
    class Config:
        title = "User Provided"

    header_definition_type: Literal[CsvHeaderDefinitionType.USER_PROVIDED.value] = CsvHeaderDefinitionType.USER_PROVIDED.value  # type: ignore
    column_names: List[str] = Field(
        title="Column Names",
        description="The column names that will be used while emitting the CSV records",
    )

Rejected
In terms of config interface:

class CsvHeaderDefinitionType(Enum):
    USER_PROVIDED = "User Provided"
    AUTOGENERATED = "Autogenerated"
    FROM_CSV = "From CSV"


class CsvHeaderDefinition(BaseModel):
    definition_type: CsvHeaderDefinitionType = Field(
        title="CSV Header Definition Type",
        description="How to infer column names.",
    )
    column_names: List[str] = Field(
        title="Column Names",
        description="Given type is user-provided, column names will be provided using this field",
        default=None
    )

We will need to update the LegacyConfigTransformer to align with this new definition. This might affect #29435 as well

The text was updated successfully, but these errors were encountered:

sherifnada · 2023-08-25T18:25:49Z

my "grooming feedback": let's make sure there are no "undefined behaviors" here e.g:

what is the output format if the user selects autogenerate? e.g is it _f0, _f1, ..., _fn where the indices are counting the column index?
in the user provided scenario what happens if the user specifies fewer (or more) column names than there are columns in the file?

both are very tractable but I'm just calling out that we should define these behaviors clearly in the docstring and probably the connector docs too

maxi297 · 2023-08-25T18:30:15Z

what is the output format if the user selects autogenerate? e.g is it _f0, _f1, ..., _fn where the indices are counting the column index?

It is f"f{i}" where i is the index starting from 0

in the user provided scenario what happens if the user specifies fewer (or more) column names than there are columns in the file?

The current behavior for S3 is that an exception is raised as such: pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: id,name,valid. V4 would do the same but with a non pyarrow exception

I'll document both behaviors in the config docstring and update the file-based CDK README.md

maxi297 added the team/extensibility label Aug 25, 2023

maxi297 self-assigned this Aug 25, 2023

maxi297 mentioned this issue Aug 25, 2023

File-based CDK: allow user to provided column names #29868

Merged

maxi297 closed this as completed Aug 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File-based CDK: As a user, I want to be able to rename the columns of the CSV file #29850

File-based CDK: As a user, I want to be able to rename the columns of the CSV file #29850

maxi297 commented Aug 25, 2023 •

edited

Loading

sherifnada commented Aug 25, 2023

maxi297 commented Aug 25, 2023 •

edited

Loading

File-based CDK: As a user, I want to be able to rename the columns of the CSV file #29850

File-based CDK: As a user, I want to be able to rename the columns of the CSV file #29850

Comments

maxi297 commented Aug 25, 2023 • edited Loading

sherifnada commented Aug 25, 2023

maxi297 commented Aug 25, 2023 • edited Loading

maxi297 commented Aug 25, 2023 •

edited

Loading

maxi297 commented Aug 25, 2023 •

edited

Loading