Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column / update schema in an existing table without delete data #232

Closed
mbiemann opened this issue May 12, 2020 · 5 comments
Closed
Assignees
Labels
feature minor release Will be addressed in the next minor release question Further information is requested
Milestone

Comments

@mbiemann
Copy link

Hi @igorborgest ,

In a recent issue #198 was add the combination of catalog_versioning and mode overwrite in s3.to_parquet to update the table, however, I'm looking for some method that don't delete the old data and update the existing table adding the new columns.

Let me exemplify... This is the actual "insert":

wr.s3.to_parquet(
    df=pd.DataFrame([{
        'event_date':     datetime.now().strftime('%Y-%m-%d'),
        'event_datetime': datetime.now().isoformat(),
        'event_name':     f'Event {chr(random.randint(65,90))}',
        'event_value':    random.randint(10000,20000)
    }]),
    dataset=True,
    mode='append',
    database=databaseName,
    table=tableName,
    partition_cols=['event_date'],
    path=f's3://{bucket}/{databaseName}/{tableName}/',
    compression='snappy'
)

If I add a new column to DataFrame, the data is actual write in parquet, but I can't select via Athena.

wr.s3.to_parquet(
    df=pd.DataFrame([{
        'event_date':     datetime.now().strftime('%Y-%m-%d'),
        'event_datetime': datetime.now().isoformat(),
        'event_name':     f'Event {chr(random.randint(65,90))}',
        'event_value':    random.randint(10000,20000),
        'new_column1':    random.randint(1,10)
    }]),
    dataset=True,
    mode='append',
    database=databaseName,
    table=tableName,
    partition_cols=['event_date'],
    path=f's3://{bucket}/{databaseName}/{tableName}/',
    compression='snappy'
)

Note: Editing the schema via Console to add the new column on the table makes it possible to select it via Athena.

I would you like to know what do you thing about the solution:

Around catalog.py line 995, if catalog_versioning is true, compare the count of actual columns (need to get existing table definition) with the count of incoming columns (using table_input.columns_types).
If the counts is difference, update the table like like 993

I would like to contribute to this change if you agree with the solution or have another one in mind.

Best Regards,
Marcell Biemann

@mbiemann mbiemann added the question Further information is requested label May 12, 2020
@mbiemann mbiemann changed the title Add column / update schema in an existing table Add column / update schema in an existing table without delete data May 12, 2020
@igorborgest
Copy link
Contributor

Hi @mbiemann, thanks for reaching out!

Yep, it makes sense and a PR would be great!

We just should be cautious about some details related to invalid schema evolution:

  • The columns order will be relevant? We will need to sort it back to the original order?
  • We will only accept new columns in the tail of the table? Or anywhere?
  • What about type mismatch?

It will be a really cool feature, but schema evolution is not a trivial subject.

@mbiemann
Copy link
Author

Thanks @igorborgest . I’m studying the best solution for Schema Evolution with Parquet.

@igorborgest igorborgest self-assigned this May 18, 2020
@igorborgest igorborgest added minor release Will be addressed in the next minor release WIP Work in progress feature labels May 18, 2020
@igorborgest igorborgest added this to the 1.2.0 milestone May 18, 2020
@igorborgest
Copy link
Contributor

Hi @mbiemann!

I've just submitted the PR above with a first propose about schema evolution on Parquet Datasets. What do you think?

To test it you can run:

pip install git+https://github.com/awslabs/aws-data-wrangler.git@schema-evolution

@igorborgest
Copy link
Contributor

New tutorial about the subject added!

@igorborgest
Copy link
Contributor

Released on version 1.2.0

@igorborgest igorborgest removed the WIP Work in progress label May 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature minor release Will be addressed in the next minor release question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants