You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We have six workspaces using pyarrow config column_names:
configapi=> select workspace_id, auto_propagation_status, max(jobs.created_at) from jobs join connection on jobs.scope = connection.id::text join actor on connection.source_id = actor.id join schema_management on connection.id = schema_management.connection_id where actor.actor_definition_id = '69589781-7828-43c5-9f63-8925b1c1ccc2' and configuration::text like '%column_names%' group by workspace_id, auto_propagation_status;
One hasn't been synced since May but the other we're synced in the last month so this is still useful.
What pyarrow does to determine column names:
Given column_names is provided, assume there are no headers and use those columns (number needs to match with the number of columns in the CSV)
Else if column_names not provided, fallback on autogenerate_column_names and if true, assume there are no headers and generate headers
Else if not autogenerate_column_names, assume there is a header row located after skip_rows
New solution
Using the rejected solution below, we would get two levels of "Optional fields" as the frontend handles them separately. Hence, we decided to go with a oneOf using this interface:
class CsvHeaderDefinitionType(Enum):
FROM_CSV = "From CSV"
AUTOGENERATED = "Autogenerated"
USER_PROVIDED = "User Provided"
class CsvHeaderFromCsv(BaseModel):
class Config:
title = "From CSV"
header_definition_type: Literal[CsvHeaderDefinitionType.FROM_CSV.value] = CsvHeaderDefinitionType.FROM_CSV.value # type: ignore
class CsvHeaderAutogenerated(BaseModel):
class Config:
title = "Autogenerated"
header_definition_type: Literal[CsvHeaderDefinitionType.AUTOGENERATED.value] = CsvHeaderDefinitionType.AUTOGENERATED.value # type: ignore
class CsvHeaderUserProvided(BaseModel):
class Config:
title = "User Provided"
header_definition_type: Literal[CsvHeaderDefinitionType.USER_PROVIDED.value] = CsvHeaderDefinitionType.USER_PROVIDED.value # type: ignore
column_names: List[str] = Field(
title="Column Names",
description="The column names that will be used while emitting the CSV records",
)
Rejected
In terms of config interface:
class CsvHeaderDefinitionType(Enum):
USER_PROVIDED = "User Provided"
AUTOGENERATED = "Autogenerated"
FROM_CSV = "From CSV"
class CsvHeaderDefinition(BaseModel):
definition_type: CsvHeaderDefinitionType = Field(
title="CSV Header Definition Type",
description="How to infer column names.",
)
column_names: List[str] = Field(
title="Column Names",
description="Given type is user-provided, column names will be provided using this field",
default=None
)
We will need to update the LegacyConfigTransformer to align with this new definition. This might affect #29435 as well
The text was updated successfully, but these errors were encountered:
what is the output format if the user selects autogenerate? e.g is it _f0, _f1, ..., _fn where the indices are counting the column index?
It is f"f{i}" where i is the index starting from 0
in the user provided scenario what happens if the user specifies fewer (or more) column names than there are columns in the file?
The current behavior for S3 is that an exception is raised as such: pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: id,name,valid. V4 would do the same but with a non pyarrow exception
I'll document both behaviors in the config docstring and update the file-based CDK README.md
We have six workspaces using pyarrow config
column_names
:configapi=> select workspace_id, auto_propagation_status, max(jobs.created_at) from jobs join connection on jobs.scope = connection.id::text join actor on connection.source_id = actor.id join schema_management on connection.id = schema_management.connection_id where actor.actor_definition_id = '69589781-7828-43c5-9f63-8925b1c1ccc2' and configuration::text like '%column_names%' group by workspace_id, auto_propagation_status;
One hasn't been synced since May but the other we're synced in the last month so this is still useful.
What pyarrow does to determine column names:
autogenerate_column_names
and if true, assume there are no headers and generate headersautogenerate_column_names
, assume there is a header row located afterskip_rows
New solution
Using the rejected solution below, we would get two levels of "Optional fields" as the frontend handles them separately. Hence, we decided to go with a oneOf using this interface:
Rejected
In terms of config interface:
We will need to update the
LegacyConfigTransformer
to align with this new definition. This might affect #29435 as wellThe text was updated successfully, but these errors were encountered: