Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File-based CDK: As a user, I want to be able to rename the columns of the CSV file #29850

Closed
maxi297 opened this issue Aug 25, 2023 · 2 comments
Assignees

Comments

@maxi297
Copy link
Contributor

maxi297 commented Aug 25, 2023

We have six workspaces using pyarrow config column_names:

configapi=> select workspace_id, auto_propagation_status, max(jobs.created_at) from jobs join connection on jobs.scope = connection.id::text join actor on connection.source_id = actor.id join schema_management on connection.id = schema_management.connection_id where actor.actor_definition_id = '69589781-7828-43c5-9f63-8925b1c1ccc2' and configuration::text like '%column_names%' group by workspace_id, auto_propagation_status;

One hasn't been synced since May but the other we're synced in the last month so this is still useful.

What pyarrow does to determine column names:

  • Given column_names is provided, assume there are no headers and use those columns (number needs to match with the number of columns in the CSV)
  • Else if column_names not provided, fallback on autogenerate_column_names and if true, assume there are no headers and generate headers
  • Else if not autogenerate_column_names, assume there is a header row located after skip_rows

New solution
Using the rejected solution below, we would get two levels of "Optional fields" as the frontend handles them separately. Hence, we decided to go with a oneOf using this interface:


class CsvHeaderDefinitionType(Enum):
    FROM_CSV = "From CSV"
    AUTOGENERATED = "Autogenerated"
    USER_PROVIDED = "User Provided"


class CsvHeaderFromCsv(BaseModel):
    class Config:
        title = "From CSV"

    header_definition_type: Literal[CsvHeaderDefinitionType.FROM_CSV.value] = CsvHeaderDefinitionType.FROM_CSV.value  # type: ignore


class CsvHeaderAutogenerated(BaseModel):
    class Config:
        title = "Autogenerated"

    header_definition_type: Literal[CsvHeaderDefinitionType.AUTOGENERATED.value] = CsvHeaderDefinitionType.AUTOGENERATED.value  # type: ignore


class CsvHeaderUserProvided(BaseModel):
    class Config:
        title = "User Provided"

    header_definition_type: Literal[CsvHeaderDefinitionType.USER_PROVIDED.value] = CsvHeaderDefinitionType.USER_PROVIDED.value  # type: ignore
    column_names: List[str] = Field(
        title="Column Names",
        description="The column names that will be used while emitting the CSV records",
    )

Rejected
In terms of config interface:

class CsvHeaderDefinitionType(Enum):
    USER_PROVIDED = "User Provided"
    AUTOGENERATED = "Autogenerated"
    FROM_CSV = "From CSV"


class CsvHeaderDefinition(BaseModel):
    definition_type: CsvHeaderDefinitionType = Field(
        title="CSV Header Definition Type",
        description="How to infer column names.",
    )
    column_names: List[str] = Field(
        title="Column Names",
        description="Given type is user-provided, column names will be provided using this field",
        default=None
    )

We will need to update the LegacyConfigTransformer to align with this new definition. This might affect #29435 as well

@maxi297 maxi297 self-assigned this Aug 25, 2023
@sherifnada
Copy link
Contributor

my "grooming feedback": let's make sure there are no "undefined behaviors" here e.g:

  • what is the output format if the user selects autogenerate? e.g is it _f0, _f1, ..., _fn where the indices are counting the column index?
  • in the user provided scenario what happens if the user specifies fewer (or more) column names than there are columns in the file?

both are very tractable but I'm just calling out that we should define these behaviors clearly in the docstring and probably the connector docs too

@maxi297
Copy link
Contributor Author

maxi297 commented Aug 25, 2023

what is the output format if the user selects autogenerate? e.g is it _f0, _f1, ..., _fn where the indices are counting the column index?

It is f"f{i}" where i is the index starting from 0

in the user provided scenario what happens if the user specifies fewer (or more) column names than there are columns in the file?

The current behavior for S3 is that an exception is raised as such: pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: id,name,valid. V4 would do the same but with a non pyarrow exception

I'll document both behaviors in the config docstring and update the file-based CDK README.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants