[Python] Race condition in _pandas_api#is_data_frame #39313

Closed
TomTJarosz opened this issue Dec 20, 2023 · 1 comment · Fixed by #39314

Comments

@TomTJarosz
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

Concurrent invocation of _pandas_api#is_data_frame can return an incorrect result (False when given a DataFrame). This can cause failures in higher-level public Arrow APIs that rely on it (such as write_feather).

I have authored a pytest attached below which reproduces the issue:

import pandas as pd
from pyarrow.pandas_compat import _pandas_api
from threading import Thread

def test_is_data_frame_race_condition():
    # Spin flag: every worker busy-waits on it so that all of them
    # call is_data_frame() at nearly the same moment.
    wait = True
    num_threads = 10
    df = pd.DataFrame()
    results = []
    def rc():
        while wait:
            pass
        results.append(_pandas_api.is_data_frame(df))

    threads = [Thread(target=rc) for _ in range(num_threads)]
    for t in threads:
        t.start()

    # Release all workers at once.
    wait = False

    for t in threads:
        t.join()

    assert len(results) == num_threads
    assert all(results), "is_data_frame() returned false when given a dataframe"
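
The busy-wait in the test above is one way to release all threads at roughly the same time. As an alternative sketch, `threading.Barrier` from the standard library can do the same without spinning. The helper name `run_simultaneously` is hypothetical and not part of pyarrow or the committed test.

```python
import threading


def run_simultaneously(fn, num_threads=10):
    """Call fn() from num_threads threads released together; return the results.

    Hypothetical helper for illustration only.
    """
    barrier = threading.Barrier(num_threads)
    results = []

    def worker():
        barrier.wait()        # block until every thread has arrived
        results.append(fn())  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With this helper, the reproduction would reduce to asserting `all(run_simultaneously(lambda: _pandas_api.is_data_frame(df)))`, assuming the race has not already been resolved by an earlier call in the same process.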

Component(s)

Python

@TomTJarosz
Contributor Author

take

pitrou added a commit that referenced this issue Jan 9, 2024
…39314)

### Rationale for this change

See:
```
    cdef inline bint _have_pandas_internal(self):
        if not self._tried_importing_pandas:
            self._check_import(raise_=False)
        return self._have_pandas
```

The method `_check_import`:
1) sets `_tried_importing_pandas` to true
2) does some things which take time...
3) sets `_have_pandas` to true (if we indeed do have pandas)

Suppose thread 1 calls `_have_pandas_internal` and is at step 2 when thread 2 calls `_have_pandas_internal`. Thread 2 may incorrectly get False, because thread 1 has already set `_tried_importing_pandas` to true but has not yet set `_have_pandas` to true (though it will). Thread 1 itself gets True.

After my fix, `_have_pandas_internal` will not return an incorrect value in the scenario described above. It would instead result in a redundant, but (I believe) harmless, invocation of `_check_import`.
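
The interleaving described above can be modeled in pure Python. This is only a sketch of the two orderings, not the actual Cython code in pyarrow: the class `ImportStateModel`, the `flag_first` switch, and the two `threading.Event` objects are assumptions introduced here to force thread 2 to query while thread 1 is still in the slow step.

```python
import threading


class ImportStateModel:
    """Hypothetical stand-in for the cached pandas-import state."""

    def __init__(self, flag_first):
        self.flag_first = flag_first        # True models the original ordering
        self.tried_importing = False
        self.have_pandas = False
        self.in_import = threading.Event()  # set once a thread reaches the slow step
        self.release = threading.Event()    # lets the slow step finish

    def _check_import(self):
        if self.flag_first:
            self.tried_importing = True     # original: flag recorded before the work
        self.in_import.set()
        self.release.wait(timeout=0.2)      # stand-in for the slow import
        self.have_pandas = True
        self.tried_importing = True         # reordered: flag recorded after the result

    def have_pandas_internal(self):
        if not self.tried_importing:
            self._check_import()
        return self.have_pandas


def query_during_import(flag_first):
    """Return what a second caller sees while a first caller is mid-import."""
    state = ImportStateModel(flag_first)
    first = threading.Thread(target=state.have_pandas_internal)
    first.start()
    state.in_import.wait()                  # first thread is now inside the slow step
    seen_by_second = state.have_pandas_internal()
    state.release.set()
    first.join()
    return seen_by_second


print(query_during_import(flag_first=True))   # False: the stale answer
print(query_during_import(flag_first=False))  # True: redundant import, correct answer
```

In the reordered case the second caller runs `_check_import` itself, which is the redundant invocation mentioned above; the model makes no claim about whether that is harmless in the real implementation.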

### What changes are included in this PR?

Changes ordering of "trying to import pandas" and "recording that pandas import has been tried"

### Are these changes tested?
Yes, see the committed test.

### Are there any user-facing changes?

This PR resolves a user-facing race condition #39313
* Closes: #39313

Lead-authored-by: Thomas Jarosz <thomas.jarosz@c3.ai>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 16.0.0 milestone Jan 9, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…ort (apache#39314)

@pitrou pitrou modified the milestones: 16.0.0, 15.0.1 Feb 14, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ort (apache#39314)

raulcd pushed a commit that referenced this issue Feb 20, 2024
…39314)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…ort (apache#39314)
