Convert df to pyspark DataFrame if it is pandas before writing #301

Merged
merged 3 commits into from
Sep 19, 2022

Conversation

chamini2
Contributor

@chamini2 chamini2 commented Sep 8, 2022

resolves #312

Description

This adds a check for the df returned from the model function to convert into pyspark DataFrame if it's pandas.
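For readers skimming the thread, the check described above might look roughly like the sketch below. This is an illustration only, not the code merged in this PR: the helper name `ensure_spark_dataframe` and its `spark` parameter are hypothetical, and the only library call it relies on is `SparkSession.createDataFrame`, which accepts a pandas DataFrame.

```python
def ensure_spark_dataframe(df, spark):
    """Return df unchanged unless it is a pandas DataFrame, in which case
    convert it with the given Spark session.

    Hypothetical helper for illustration; `spark` is assumed to be a
    SparkSession (or anything exposing createDataFrame).
    """
    try:
        # pandas is optional: only needed when a model actually returns pandas.
        import pandas
    except ImportError:
        pandas = None

    if pandas is not None and isinstance(df, pandas.DataFrame):
        # SparkSession.createDataFrame accepts a pandas DataFrame directly.
        return spark.createDataFrame(df)

    # Anything else (for example an existing PySpark DataFrame) passes through.
    return df
```

With a helper like this, the object returned from `model(dbt, session)` would be passed through `ensure_spark_dataframe` once, just before the write step, so models that already return PySpark DataFrames are unaffected.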

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have run changie new to create a changelog entry

@cla-bot

cla-bot bot commented Sep 8, 2022

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @chamini2

@cla-bot cla-bot bot added the cla:yes label Sep 8, 2022
@jtcohen6 jtcohen6 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Sep 16, 2022
@jtcohen6
Contributor

Original community Slack thread

@lostmygithubaccount I believe this change would resolve the bug that @b-per ran into during yesterday's hackathon, by checking to see if the user is returning a Pandas dataframe and converting it back to a PySpark dataframe before writing it back to the database.

We should make the same change over in dbt-spark as well. (Another argument for finding a place to store PySpark-specific code, so it doesn't need to be copy-pasted between these two.)

@chamini2
Contributor Author

chamini2 commented Sep 16, 2022

Should I open a ticket @jtcohen6 ?

I see someone did! Let me know if you need anything else for the PR to be good to go!

@lostmygithubaccount

@chamini2 nothing on your end, appreciate the contribution! we're doing some manual testing before final review/merge

@dbeatty10
Contributor

I manually verified that the following didn't work before. Confirmed that it works using @chamini2's fix 👍

import pandas as pd


def model(dbt, session):
    # Materialize as a table and list pandas as a required package.
    dbt.config(
        materialized="table",
        packages=["pandas"],
    )

    # Build a plain pandas DataFrame (not a PySpark one).
    df = pd.DataFrame(
        {
            'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
            'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
            'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
            'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86],
        }
    )

    # Returning pandas here is the case that failed before this fix.
    return df

@ChenyuLInx ChenyuLInx merged commit 9699a48 into dbt-labs:main Sep 19, 2022
@ChenyuLInx
Contributor

@chamini2 Thanks for contributing this!!!

@dbeatty10 thanks for confirming it! I will add this as a basic test in core!

Development

Successfully merging this pull request may close these issues.

[CT-1199] [Feature] Support python model return a pandas dataframe
5 participants