Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][Compute] Make pyarrow.compute.strptime %Y handling consistent with time.strptime #38528

Closed
no23reason opened this issue Oct 31, 2023 · 11 comments · Fixed by #38665
Closed

Comments

@no23reason
Copy link
Contributor

Describe the enhancement requested

The pyarrow.compute.strptime handles the %Y format part (i.e. 4-digit year) differently from the built-in time.strptime:
When the year part of the input has only two digits, time.strptime fails to parse it, while pyarrow.compute.strptime parses it with no error yielding a result with a year in the 1st century.

For example

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import time
>>> input = "01-01-23"
>>> format = "%m-%d-%Y"
>>> time.strptime(input, format)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/no23reason/.pyenv/versions/3.11.4/lib/python3.11/_strptime.py", line 562, in _strptime_time
    tt = _strptime(data_string, format)[0]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/no23reason/.pyenv/versions/3.11.4/lib/python3.11/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '01-01-23' does not match format '%m-%d-%Y'
>>> pc.strptime(pa.array([input]), format=format, unit="s")
<pyarrow.lib.TimestampArray object at 0x152e40c40>
[
  0023-01-01 00:00:00
]

I believe the pyarrow.compute.strptime should also fail in this case as it most likely means that the format is wrong.

Component(s)

Python

@no23reason no23reason changed the title Make pyarrow.compute.strptime %Y handling consistent with time.strptime [Python] Make pyarrow.compute.strptime %Y handling consistent with time.strptime Oct 31, 2023
@no23reason no23reason changed the title [Python] Make pyarrow.compute.strptime %Y handling consistent with time.strptime [Python][Compute] Make pyarrow.compute.strptime %Y handling consistent with time.strptime Oct 31, 2023
@rok
Copy link
Member

rok commented Oct 31, 2023

Thanks for reporting this @no23reason! Pyarrow uses system's strptime and hence doesn't really follow Python's strptime conventions.
This falls under #31324 umbrella.

@no23reason
Copy link
Contributor Author

Thank you @rok or the clarification and the umbrella link. I played some more with the C version of strptime (https://onlinegdb.com/Et55gUIpp) and indeed, it parses the years with less than 4 digits 🤯

@rok
Copy link
Member

rok commented Oct 31, 2023

Unfortunately dialects can't be reconciled. We could add a dialect parameter to control behavior but I suspect it would be a significant amount of work and we're not seeing strong demand for it.

@no23reason
Copy link
Contributor Author

Makes sense. Maybe it would be enough to mention it in the docs of the pyarrow compute, that the function follows C++ semantics, not Python ones 🤔

@rok
Copy link
Member

rok commented Nov 1, 2023

@no23reason that would be good! Want to open a PR?

@no23reason
Copy link
Contributor Author

Sure, I'll try to get to it later today :)

@no23reason
Copy link
Contributor Author

Sorry for radio silence, I was able to extend the documentation locally, but was struggling with the python build so that I could preview the changes in the built docs. Will give it some more time when I'm able.

@rok
Copy link
Member

rok commented Nov 8, 2023

No worries!
For simple changes you can also sidestep building docs locally by just opening a PR and use github actions to build a preview - see here.

no23reason added a commit to no23reason/arrow that referenced this issue Nov 10, 2023
To prevent possible confusion with the compute strptime function, we now explicitly mention that the C/C++ semantics are followed.
@no23reason
Copy link
Contributor Author

Hm, seems that as a first time contributor, I cannot run the preview: #38665 can you please take a look? :)

@rok
Copy link
Member

rok commented Nov 10, 2023

It seems so! I started the CI and the docs build.

@no23reason
Copy link
Contributor Author

no23reason commented Nov 12, 2023

Hm, it seems something is wrong with the CI :( I tried running the archery docker run -v "${PWD}/docs:/build/docs" ubuntu-docs locally, but it also failed with errors like

CMake Error at cmake_modules/FindAzure.cmake:32 (find_package):
  Could not find a package configuration file provided by "azure-core-cpp"
  with any of the following names:

    azure-core-cppConfig.cmake
    azure-core-cpp-config.cmake

let's hope the CI comes around :)

no23reason added a commit to no23reason/arrow that referenced this issue Nov 13, 2023
Splits a long sentence, makes the language more directive.

Co-authored-by: Rok Mihevc <rok@mihevc.org>
rok added a commit that referenced this issue Nov 13, 2023
### Rationale for this change

To prevent possible confusion with the compute strptime function, we now explicitly mention that the C/C++ semantics are followed.

### What changes are included in this PR?

The documentation of the `format` parameter of the `strptime` function is expanded.

### Are these changes tested?

N/A documentation change only.

### Are there any user-facing changes?

Just the documentation.
* Closes: #38528

Lead-authored-by: Dan Homola <dan.homola@hotmail.cz>
Co-authored-by: Dan Homola <dan.homola@gooddata.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
@rok rok added this to the 15.0.0 milestone Nov 13, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…pache#38665)

### Rationale for this change

To prevent possible confusion with the compute strptime function, we now explicitly mention that the C/C++ semantics are followed.

### What changes are included in this PR?

The documentation of the `format` parameter of the `strptime` function is expanded.

### Are these changes tested?

N/A documentation change only.

### Are there any user-facing changes?

Just the documentation.
* Closes: apache#38528

Lead-authored-by: Dan Homola <dan.homola@hotmail.cz>
Co-authored-by: Dan Homola <dan.homola@gooddata.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…pache#38665)

### Rationale for this change

To prevent possible confusion with the compute strptime function, we now explicitly mention that the C/C++ semantics are followed.

### What changes are included in this PR?

The documentation of the `format` parameter of the `strptime` function is expanded.

### Are these changes tested?

N/A documentation change only.

### Are there any user-facing changes?

Just the documentation.
* Closes: apache#38528

Lead-authored-by: Dan Homola <dan.homola@hotmail.cz>
Co-authored-by: Dan Homola <dan.homola@gooddata.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants