[SPARK-46728][PYTHON] Check Pandas installation properly #44745
Nice. Ya, I hit the same issue on those deleted packages.
Could you do the same thing for other packages such as PyArrow, @itholic?
Sure. I just confirmed that other packages (e.g. PyArrow) work as expected without any changes, unlike Pandas:
>>> import pyspark.pandas
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, it was not found.
Do you happen to have any other packages that reproduce the same issue as Pandas?
@itholic can you actually check why this happens only in pandas though?
My concern is that this is sort of a hacky band-aid fix. It is a bit weird to do this only for pandas without knowing exactly what's going on.
I roughly suspect that this happens because the same package names appear in several places in our project (such as The reason I suspect this is that the path
I know this one because tests sometimes fail in an IDE for that reason.
It can also happen in other packages. If that's the case, we should address the same thing in other packages too, e.g., pandas UDF and Spark Connect. It'd be great if we could at least google it and confirm that it only happens in pandas before merging this.
Yeah, I googled when submitting this PR, but unfortunately couldn't find any clue. Let me investigate some more today.
It seems that if extension packages use parts of the package we're trying to remove, the package is not removed completely. In our case, the related extension is pandas-stubs.
So, to completely remove Pandas, the extension has to be uninstalled as well:
$ pip uninstall pandas-stubs
$ pip uninstall pandas
$ ls /path/to/python/site-packages/pandas
ls: /path/to/python/site-packages/pandas: No such file or directory
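For readers wondering why a leftover directory matters, here is a minimal sketch of one plausible mechanism. It is an assumption, not something confirmed in this thread: a directory that remains on the import path can still be imported as an empty namespace package, so a check based only on `import` succeeding would pass. The package name `leftover_pkg_demo` is hypothetical and exists only for this demo.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical package name used only for this demo; it is not a real package.
DEMO_NAME = "leftover_pkg_demo"


def import_leftover_directory():
    """Create an empty directory named like a package and try to import it."""
    with tempfile.TemporaryDirectory() as site_dir:
        # Simulate a directory that remains after an incomplete uninstall.
        os.mkdir(os.path.join(site_dir, DEMO_NAME))
        sys.path.insert(0, site_dir)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(DEMO_NAME)
            return {
                "imported": True,
                "has_version": hasattr(module, "__version__"),
                "file": getattr(module, "__file__", None),
            }
        finally:
            # Clean up so the demo does not affect later imports.
            sys.path.remove(site_dir)
            sys.modules.pop(DEMO_NAME, None)


result = import_leftover_directory()
print(result)
```

The import succeeds even though the directory contains no code, which is why checking for a version attribute (or removing the directory, as in the commands above) is more reliable than the import alone.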
Updated PR description and comment accordingly.
On second thought, this issue seems like a corner case according to #44745 (comment). Both |
Let me close this PR for now, but please feel free to ping me if there are any other opinions!
Hmm... actually, I just noticed that this hurts dev testability in some cases, such as #44778, so I think we had better at least band-aid this. WDYT @HyukjinKwon @dongjoon-hyun?
Okay, let's go ahead.
Merged to master.
What changes were proposed in this pull request?
This PR proposes to check the Pandas installation properly.
Why are the changes needed?
The Pandas installation check does not work correctly: it raises an improper exception when Pandas is not installed.
This issue occurs because Pandas is not actually removed completely on uninstall when a related extension (e.g. pandas-stubs) is installed.
Does this PR introduce any user-facing change?
No API change, but the user-facing error message now properly guides the user:
Before
After
How was this patch tested?
Manually tested
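Since the patch was tested manually, a small self-contained sketch may help reproduce the relevant cases without touching a real environment. This is not the PR's implementation: `require_package`, `PackageNotInstalledError`, and the `demo_*` module names are hypothetical, and treating a missing `__version__` as "not installed" is an assumption made for illustration. The sketch registers fake modules in `sys.modules` to imitate a complete install, a leftover package with no version, and a missing package.

```python
import importlib
import sys
import types


class PackageNotInstalledError(ImportError):
    """Hypothetical error type for this sketch; not PySpark's own class."""


def require_package(name, minimum_version):
    """Return the package version, or raise a clear error if it is unusable.

    `minimum_version` is only used in the message here; comparing versions
    is left out to keep the sketch short.
    """
    message = f"{name} >= {minimum_version} must be installed; however, it was not found."
    try:
        module = importlib.import_module(name)
    except ImportError as error:
        raise PackageNotInstalledError(message) from error
    version = getattr(module, "__version__", None)
    if version is None:
        # Assumption: an importable module without a version is a leftover.
        raise PackageNotInstalledError(message)
    return version


def check_with_fake_module(name, module):
    """Temporarily register `module` under `name` and run the check."""
    sentinel = object()
    saved = sys.modules.get(name, sentinel)
    sys.modules[name] = module
    try:
        return require_package(name, "1.0.0")
    except PackageNotInstalledError as error:
        return f"error: {error}"
    finally:
        # Restore sys.modules to its previous state.
        if saved is sentinel:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = saved


# Hypothetical module names, used only for the demo.
complete = types.ModuleType("demo_complete")
complete.__version__ = "2.0.0"
leftover = types.ModuleType("demo_leftover")  # no __version__ attribute

print(check_with_fake_module("demo_complete", complete))
print(check_with_fake_module("demo_leftover", leftover))
# A None entry in sys.modules makes the import itself fail.
print(check_with_fake_module("demo_missing", None))
```

Only the first call returns a version; the other two report the same "must be installed" message, which is the behavior a user would want to see in both the leftover and the missing case.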
Was this patch authored or co-authored using generative AI tooling?
No.