Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix huggingface validator #330

Merged
merged 8 commits into from
Jun 19, 2024
Merged

Conversation

Taniya-Das
Copy link
Collaborator

Bugs in the validator implementation. It is now replaced with the huggingface_hub.utils.validate_repo_id as suggested in the docs.


REPO_ID_ILLEGAL_CHARACTERS = re.compile(r"[^0-9a-zA-Z-_./]+")
MSG_PREFIX = "The platform_resource_identifier for HuggingFace should be a valid repo_id. "
from huggingface_hub.utils import validate_repo_id


def throw_error_on_invalid_identifier(platform_resource_identifier: str):
Copy link
Member

@jsmatias jsmatias Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of outside the scope of this PR, but it would be nice to have this docstring updated to avoid misunderstanding.

[a-zA-Z0-9] or-”, ”_”, ”.”
”—” and ”..” are forbidden

Notice that "-" in the first line has a different encoding than the second one. As I understand, this validation comes from HF, and the second dash is a double dash instead "--" that was probably replaced by the code editor.

In [1]: "-".encode()
Out [1]: b'-'

In [2]: "—".encode()
Out[2]: b'\xe2\x80\x94'

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this would be better. And maybe even to wrap the error hugging face throws since it has similar hard to read glyphs and a confusing sentence structure (https://github.com/huggingface/huggingface_hub/blob/4c7aa33bac0b4fb59796e76fcc9cd7f5ff0fd426/src/huggingface_hub/utils/_validators.py#L161).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a PR to replace the character. It simply does the replacement, but I agree that wrapping it would be better to avoid confusion. #331

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I agree. I am wrapping the error HF throws.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging this PR now. I will raise a new PR for better handling error messages,

Copy link
Member

@jsmatias jsmatias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor comment to update the docstring of a function.

jsmatias and others added 2 commits June 19, 2024 09:57
…ator_docstring

Correct docstring of Huggingface validator
Copy link
Collaborator

@andrejridzik andrejridzik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to fix the validator, but we are still getting issues with some datasets when running the connector. I have identified at least one so far. Do we want to address them in a separate PR?

@PGijsbers
Copy link
Collaborator

Is it a distinct error? Then yes, I think we should tackle it in a separate PR.
Could you provide some more details on the error how to recreate it in a new issue? That would help a lot!

@andrejridzik
Copy link
Collaborator

Is it a distinct error? Then yes, I think we should tackle it in a separate PR. Could you provide some more details on the error how to recreate it in a new issue? That would help a lot!

I can create an issue and also an initial PR with the changes. We might identify and fix more issues in the following hours :) .

@Taniya-Das Taniya-Das merged commit be7a0c5 into develop Jun 19, 2024
1 check passed
@Taniya-Das Taniya-Das deleted the bugfix/huggingface_validator branch June 19, 2024 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants