Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46728][PYTHON] Check Pandas installation properly #44745

Closed
wants to merge 5 commits into from

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Jan 16, 2024

What changes were proposed in this pull request?

This PR proposes to check Pandas installation properly

Why are the changes needed?

Checking Pandas installation is not working correctly, but raising improper exception when Pandas is not installed.

This issue occurs because the deleted Pandas was not actually deleted completely when related extension is installed (e.g. pandas-stubs).

Does this PR introduce any user-facing change?

No API change, but user-facing error message is now showing proper error message to guide:

Before

>>> import pyspark.pandas
AttributeError: module 'pandas' has no attribute '__version__'

After

>>> import pyspark.pandas
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] Pandas >= 1.4.4 must be installed; however, it was not found.

How was this patch tested?

Manually tested

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Ya, I hit the same issue on those deleted packages.

Could you do the same things for the other packages like PyArrow, @itholic ?

@itholic
Copy link
Contributor Author

itholic commented Jan 16, 2024

Could you do the same things for the other packages like PyArrow, @itholic ?

Sure. I just confirmed that other packages work as expected without any changes unlike Pandas (e.g. PyArrow)

>>> import pyspark.pandas
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, it was not found.

Do you happen to have any other packages reproduce the same issue such as Pandas?

@HyukjinKwon
Copy link
Member

@itholic can you actually check why this happens only in pandas though?

@HyukjinKwon
Copy link
Member

My concern is that, this is sort of a hacky bandaid fix. It is a bit weird that we do this only for pandas without knowing what's exactly going on.

@itholic
Copy link
Contributor Author

itholic commented Jan 16, 2024

I roughly suspect that this happened due to the same package names in our project here and there (such as pyspark.pandas, pyspark.sql.pandas), so the namespace conflicts issue occur for some reason only in pandas, but could not figure out the actual root cause right now.

The reason why I suspect in this way is that because the path /../site-packages/pandas is only not deleted clearly with PySpark dev env when uninstalling pandas.

@HyukjinKwon
Copy link
Member

I roughly suspect that this happened due to the same package names in our project here and there (such as pyspark.pandas, pyspark.sql.pandas), so the namespace conflicts issue occur for some reason only in pandas, but could not figure out the actual root cause right now.

This one I know because the test fails sometimes with IDE for the reason.

The reason why I suspect in this way is that because the path /../site-packages/pandas is only not deleted clearly with PySpark dev env when uninstalling pandas.

This one can also happen in other packages as well. If that's the case, we should also address the same thing in other packages, e.g., pandas udf and spark connect. It'd be great if we can at least googling and it only happens in pandas before merging this.

@itholic
Copy link
Contributor Author

itholic commented Jan 16, 2024

It'd be great if we can at least googling and it only happens in pandas before merging this.

Yeah, I googled when I submitting this PR, but unfortunately couldn't figure out any clue. Let me have some more investigation today.

@itholic
Copy link
Contributor Author

itholic commented Jan 17, 2024

It seems like if there are extension packages that use parts of the package we're trying to remove, pip uninstall adds those dependencies to the Would not remove list and doesn't actually remove them.

In our case, pip uninstall prints out Would not remove list and doesn't actually remove them when pandas-stubs is still installed, and that's why pandas is not completely removed:

(pyspark-dev-env) haejoon.lee@NQ679Q495F spark % pip uninstall pandas
Found existing installation: pandas 2.1.4
Uninstalling pandas-2.1.4:
  Would remove:
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas-2.1.4.dist-info/*
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas/*
  Would not remove (might be manually added):
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas/__init__.pyi
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas/_config/__init__.pyi
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas/_config/config.pyi
    ...
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas/util/_tester.pyi
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas/util/_validators.pyi
    /Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/pandas/util/testing.pyi

So, to completely remove pandas, we need to uninstall pandas-stubs first and then uninstall pandas.

$ pip uninstall pandas-stubs
$ pip uninstall pandas
$ ls /path/to/python/site-packages/pandas
ls: /path/to/python/site-packages/pandas: No such file or directory

@itholic
Copy link
Contributor Author

itholic commented Jan 17, 2024

Updated PR description and comment accordingly.

@itholic
Copy link
Contributor Author

itholic commented Jan 17, 2024

On second thought, this issue seems like a corner case according to #44745 (comment). Both pandas and pandas-stubs are specified as required packages, so I think it would be a good idea not to deal with problems caused by deleting required packages.

@itholic
Copy link
Contributor Author

itholic commented Jan 17, 2024

Let me close this PR for now, but please feel free to ping me if there is any other opinions!

@itholic
Copy link
Contributor Author

itholic commented Jan 24, 2024

Hmm... actually I just noticed that this harms a bit of dev testability for some case such as #44778, so I think maybe we better at least bandaid this?? WDYT @HyukjinKwon @dongjoon-hyun ?

@HyukjinKwon
Copy link
Member

okay, let's go ahead.

@HyukjinKwon HyukjinKwon reopened this Jan 24, 2024
@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants