Encoding name normalization does not remove year suffix #156

Closed
rossj opened this issue Jul 3, 2017 · 6 comments

@rossj

rossj commented Jul 3, 2017

This line is supposed to normalize encoding names by removing non-alphanumeric characters and stripping an appended year. The current regex does not strip the year, so encoding names that carry a year suffix fail to match.

Example:
Input: iso_8859-5:1988
Output: iso885951988
Expected output: iso88595

I think the current regex fails because the colon is matched, and consumed, by the first part of the pattern as a non-alphanumeric character, so the year part that follows can no longer match.
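To make that explanation concrete, here is a minimal sketch that lists what each match consumes. The original pattern is not quoted in this thread, so `assumedOriginal` is a reconstruction (generic alternative first, year alternative second), and `traceMatches` is a hypothetical helper used only for illustration.

```javascript
// ASSUMPTION: reconstruction of the original pattern, which is not quoted
// in this thread. Generic alternative first, year alternative second.
const assumedOriginal = /[^0-9a-z]|:\d{4}$/g;

// Hypothetical helper: collect the text consumed by each match.
function traceMatches(pattern, name) {
  return Array.from(name.toLowerCase().matchAll(pattern), (m) => m[0]);
}

// The colon is taken on its own by the first alternative, so ":1988"
// is never seen as a unit by the second alternative.
const matches = traceMatches(assumedOriginal, "iso_8859-5:1988");
console.log(matches); // matches are "_", "-", ":"

const normalized = "iso_8859-5:1988".toLowerCase().replace(assumedOriginal, "");
console.log(normalized); // iso885951988
```

Regex alternation tries alternatives left to right at each position, which is why the order of the two parts matters here.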

@erikkemperman

Just noticed the same thing. Suggested fix:

.toLowerCase().replace(/:\d{4}$/, "").replace(/[^0-9a-z]/g, "");
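As a quick check of the suggestion above, the chained calls can be wrapped in a function and run against the input from the report. The name `normalizeEncodingName` is hypothetical; only the two `replace` calls come from this thread.

```javascript
// Hypothetical wrapper; the replace chain is the one suggested above.
function normalizeEncodingName(name) {
  return String(name)
    .toLowerCase()
    .replace(/:\d{4}$/, "")      // drop a trailing ":YYYY" first
    .replace(/[^0-9a-z]/g, "");  // then drop every other non-alphanumeric
}

console.log(normalizeEncodingName("iso_8859-5:1988")); // iso88595
console.log(normalizeEncodingName("UTF-8"));           // utf8
```

Removing the year before the general cleanup means the colon is still present when the year pattern runs.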

@ashtuchkin
Owner

ashtuchkin commented Jul 5, 2017 via email

@erikkemperman

Actually, maybe it should be

.toLowerCase().replace(/:\d{4}[^0-9a-z]*$/, "").replace(/[^0-9a-z]/g, "");

@rossj
Author

rossj commented Jul 5, 2017

FWIW I just swapped the ORs in my local copy

/:\d{4}$|[^0-9a-z]/g
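For comparison, a small sketch that runs the swapped single-pass regex next to the earlier two-step suggestion on a few sample names. The function names and the sample list are illustrative assumptions, and agreement on these samples is not a proof that the two forms are equivalent for every input.

```javascript
// Single pass: year alternative first, so ":YYYY" at the end wins over
// the generic non-alphanumeric alternative.
const singlePass = (name) =>
  String(name).toLowerCase().replace(/:\d{4}$|[^0-9a-z]/g, "");

// Two steps: strip the year, then strip the remaining non-alphanumerics.
const twoStep = (name) =>
  String(name).toLowerCase().replace(/:\d{4}$/, "").replace(/[^0-9a-z]/g, "");

// Illustrative inputs only.
const samples = ["iso_8859-5:1988", "UTF-8", "latin1"];
const results = samples.map((name) => [name, singlePass(name), twoStep(name)]);

for (const [name, a, b] of results) {
  console.log(name, "->", a, b, a === b ? "(same)" : "(differ)");
}
```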

@erikkemperman

I was thinking you might want to get rid of trailing (after the year) non-alphanumerics as well?
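A short sketch of the difference this makes, using a made-up input with a trailing space after the year. `strictYear` and `lenientYear` are hypothetical names for the two variants proposed earlier in this thread.

```javascript
// Variant 1: the year must be the very last thing in the string.
const strictYear = (name) =>
  String(name).toLowerCase().replace(/:\d{4}$/, "").replace(/[^0-9a-z]/g, "");

// Variant 2: the year may be followed by trailing non-alphanumerics.
const lenientYear = (name) =>
  String(name).toLowerCase().replace(/:\d{4}[^0-9a-z]*$/, "").replace(/[^0-9a-z]/g, "");

// Hypothetical input: note the trailing space after the year.
const input = "iso_8859-5:1988 ";

console.log(strictYear(input));  // iso885951988 (year survives)
console.log(lenientYear(input)); // iso88595
```

Whether such inputs occur in practice is not established in this thread; the sketch only shows where the two variants diverge.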

@erikkemperman

Thanks!
