Remove pandas-gbq from testing #31

Merged
merged 3 commits into from
Dec 16, 2021
ncclementi
Contributor

@ncclementi ncclementi commented Dec 14, 2021

Replace the use of pandas-gbq with pure BigQuery.

@ncclementi ncclementi marked this pull request as ready for review December 14, 2021 22:01
@@ -19,6 +20,7 @@ def df():
{
"name": random.choice(["fred", "wilma", "barney", "betty"]),
"number": random.randint(0, 100),
"timestamp": datetime.now(timezone.utc) - timedelta(days=i % 2),
Contributor Author

Before adding timezone.utc I was getting this assertion error:

E           AssertionError: Attributes of DataFrame.iloc[:, 2] (column name="timestamp") are different
E           
E           Attribute "dtype" are different
E           [left]:  datetime64[ns, UTC]
E           [right]: datetime64[ns]

It seems that when the data is read back from BigQuery, the column is automatically converted to UTC unless otherwise specified, which causes the error.
@tswast can you confirm this is the case? Any comments?
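
The dtype mismatch described above can be reproduced with pandas alone, without touching BigQuery. This is a minimal sketch: the fixed `base` datetime and the frame names `naive` and `aware` are illustrative assumptions, not names from the test suite.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Fixed base time so the example is deterministic (the real fixture uses datetime.now).
base = datetime(2021, 12, 14, 12, 0, 0)

# Naive datetimes produce a timezone-naive datetime column.
naive = pd.DataFrame(
    {"timestamp": [base - timedelta(days=i % 2) for i in range(4)]}
)

# Datetimes carrying timezone.utc produce a timezone-aware (UTC) column.
aware = pd.DataFrame(
    {
        "timestamp": [
            base.replace(tzinfo=timezone.utc) - timedelta(days=i % 2)
            for i in range(4)
        ]
    }
)

naive_is_tz_aware = isinstance(naive["timestamp"].dtype, pd.DatetimeTZDtype)
aware_is_tz_aware = isinstance(aware["timestamp"].dtype, pd.DatetimeTZDtype)

# assert_frame_equal checks dtypes by default, so the two frames do not compare equal.
try:
    pd.testing.assert_frame_equal(aware, naive)
    dtypes_match = True
except AssertionError:
    dtypes_match = False

print(naive["timestamp"].dtype, aware["timestamp"].dtype, dtypes_match)
```

The comparison fails on the dtype attribute alone, which matches the shape of the assertion error quoted above.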

TIMESTAMP columns are intended to come back as datetime64[ns, UTC], yes.

DATETIME should come back as datetime64[ns].

See my answer here on the difference between the two: https://stackoverflow.com/a/47724366/101923

Also note: both will come back as object dtype if there's a date outside of the pandas representable range, e.g. 0001-01-01 or 9999-12-31.

I'm actually working on making the pandas-gbq dtypes consistent with google-cloud-bigquery as we speak in googleapis/python-bigquery-pandas#444

Contributor Author

It seems that if I don't provide a schema, BigQuery infers that the dataframe column named "timestamp" is a TIMESTAMP column, so it comes back as datetime64[ns, UTC]. That being said, to keep the test simple, I think we can make the local dataframe timezone-aware and test that it comes back as it should.

cc: @jrbourbeau Does this convince you? If so this PR is ready for review.
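
For comparison, an alternative to building the fixture with `timezone.utc` is to localize the expected frame just before the assertion. The sketch below is only an illustration: `localize_to_utc` is a hypothetical helper, and `read_back` is a hand-built stand-in for what a TIMESTAMP column looks like after a round trip, not an actual BigQuery result.

```python
import pandas as pd


def localize_to_utc(df, column):
    """Return a copy of df with a naive datetime column marked as UTC (hypothetical helper)."""
    out = df.copy()
    out[column] = out[column].dt.tz_localize("UTC")
    return out


# Local frame with naive datetimes, as a fixture without timezone.utc would produce.
local = pd.DataFrame(
    {
        "name": ["fred", "wilma"],
        "timestamp": pd.to_datetime(["2021-12-14 10:00:00", "2021-12-13 10:00:00"]),
    }
)

# Stand-in for the frame read back from a TIMESTAMP column (timezone-aware, UTC).
read_back = pd.DataFrame(
    {
        "name": ["fred", "wilma"],
        "timestamp": pd.to_datetime(
            ["2021-12-14 10:00:00", "2021-12-13 10:00:00"], utc=True
        ),
    }
)

expected = localize_to_utc(local, "timestamp")

# Passes: both timestamp columns are now timezone-aware UTC with the same values.
pd.testing.assert_frame_equal(expected, read_back)
```

Making the fixture itself timezone-aware, as proposed in the comment above, avoids the extra helper and keeps the test closer to the data that is actually uploaded.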

Member

Sounds good 👍

@ncclementi
Contributor Author

It looks like on macOS it can't find pyarrow and the tests are failing. I've never seen anything like this. @jrbourbeau Do you know what could be happening?

Member

@jrbourbeau jrbourbeau left a comment

Thanks for fixing @ncclementi and reviewing @tswast! This is in

@jrbourbeau
Member

Forgot to mention the macOS CI failure. This looks like a totally unrelated packaging issue (we're seeing similar things over in Dask's CI). I'm not currently able to reproduce it locally, so let me rerun CI to see if the issue has already been resolved.

@jrbourbeau
Member

Hmm, unfortunately the macOS environment issue is still around. I'm highly confident this is unrelated to the changes in this PR (see similar reports in distributed: dask/distributed#5601). Let's go ahead and merge this in for now, and we can address any follow-ups in subsequent PRs. FWIW, I've opened conda-forge/conda-forge.github.io#1574 with the conda-forge maintainers.

@jrbourbeau jrbourbeau merged commit c372694 into main Dec 16, 2021
@jrbourbeau jrbourbeau deleted the remove_pd_gbq branch December 16, 2021 02:45
Successfully merging this pull request may close these issues.

pandas-gbq 0.16 release broke dask-bigquery CI
3 participants