Enhancement request: Support '... Ph. D.' instead of only '... Ph.D.' #43

Closed
rolfhnelson opened this issue Mar 14, 2016 · 4 comments

Comments

@rolfhnelson

In 0.3.12:

HumanName('John Smith, Ph.D.') works as expected, but the common misspelling HumanName('John Smith, Ph. D.'), which incorrectly has a space between Ph. and D., now yields 'Ph. D. John Smith'. Personally I would prefer to go back to 0.3.11's behavior, where it left the misspelled title at the end.

@derek73
Owner

derek73 commented Mar 14, 2016

Interesting. Are you sure that it used to work that way? I have tried locally with every version back to v0.3.5 and the result from "John Smith, Ph. D." is always the same as v0.3.12. Am I doing something wrong with my local env or maybe you are mistaken about the previous behavior?

In general I try to avoid having the parser correct mistakes in the input, just because there are so many potential mistakes and correcting one frequently causes other valid input to not work. It's more important that it work correctly for input with no mistakes. But it would be nice if the parser could be a useful tool for that, because the reality is that these mistakes sometimes exist in the input.

One approach would be to use the preprocess() method to do some regex replacing on the whole string before it is parsed to correct the mistake: whenever you find some variation of "ph. d.", replace it with "ph.d." or something. That would be fairly simple. You could do it pretty easily by subclassing HumanName, with little understanding of the class's inner workings. But that would actually change the string, so what you got back would not equal what you input. Some people don't like that.
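A minimal sketch of the regex-replacement idea, written as a standalone function so it does not depend on the parser's internals. The function name `normalize_degree_spacing` and the choice to cover only "Ph" and "Ed" are assumptions for illustration. How it would be wired into a HumanName subclass is left open here.

```python
import re

# Matches "Ph. D.", "Ph D", "Ed. D" and similar, in any letter case.
# The leading \b keeps names such as "Joseph D." from matching, and the
# \b after the "D" keeps "Ph. Dawson" from matching.
_SPLIT_DEGREE = re.compile(r"\b(Ph|Ed)\.?\s+(D)\b\.?", re.IGNORECASE)


def normalize_degree_spacing(text):
    """Collapse a spaced-out degree such as 'Ph. D.' into 'Ph.D.'.

    Hypothetical helper: the original letter case is preserved and
    every other part of the string is returned unchanged.
    """
    return _SPLIT_DEGREE.sub(r"\1.\2.", text)


if __name__ == "__main__":
    print(normalize_degree_spacing("John Smith, Ph. D."))
    print(normalize_degree_spacing("Harrison, David, Ph. D"))
```

As noted above, the cleaned string is no longer identical to the input, so a caller who needs the original text would have to keep a copy of it.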

Another approach would be to make the parser recognize "Ph d" as a suffix. This would be somewhat difficult because at the moment the first thing the parser does is break up the string on spaces, so "ph" and "d" are in different pieces. Maybe you could do something like what is done with the conjunctions: whenever you find a "ph" by itself, connect it to the following piece, I guess only if it's a "d". But it's hard to imagine an agnostic solution that would be helpful for more than just "ph d". Can you think of other similar examples?
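The piece-joining idea could look roughly like the sketch below. It works on a plain list of space-separated pieces and is not the parser's actual code. The name `join_split_degrees` and the `DEGREE_PREFIXES` set are hypothetical.

```python
# Hypothetical set of prefixes that form a degree when followed by "D".
DEGREE_PREFIXES = {"ph", "ed"}


def join_split_degrees(pieces):
    """Merge a lone 'Ph' or 'Ed' piece with a following 'D' piece.

    `pieces` is the name already split on spaces. Trailing periods are
    ignored when comparing, and the merged piece is written as 'Ph.D.'.
    """
    joined = []
    i = 0
    while i < len(pieces):
        current = pieces[i]
        nxt = pieces[i + 1] if i + 1 < len(pieces) else None
        if (
            nxt is not None
            and current.rstrip(".").lower() in DEGREE_PREFIXES
            and nxt.rstrip(".").lower() == "d"
        ):
            # Rebuild the degree with a period after each part.
            joined.append(current.rstrip(".") + "." + nxt.rstrip(".") + ".")
            i += 2  # Skip the piece that was just absorbed.
        else:
            joined.append(current)
            i += 1
    return joined


if __name__ == "__main__":
    print(join_split_degrees("John Smith, Ph. D.".split()))
```

One limitation of this sketch: a name like "Ed D. Smith", where "D." is a middle initial, would be merged as well. That is the kind of valid input a correction like this can break.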

I feel like ideally I'd like to have the parser do something to make it easy for each developer to handle correcting the input for their particular use case. Not sure the best way to do that though, partly because I know so little about how people actually use this parser. Suggestions welcome.

@rolfhnelson
Author

> Are you sure that it used to work that way?

No. Something about suffix handling changed in the last release but I may have mis-remembered which test case it was that originally caught my attention.

> In general I try to avoid having the parser correct mistakes in the input, just because there are so many potential mistakes and correcting one frequently causes other valid input to not work.

That sounds wise.

> But it's hard to imagine an agnostic solution that would be helpful for more than just "ph d". Can you think of other similar examples?

Not really, I think Ed.D. is the only other real example. These errors come up occasionally in older book author data. For example, http://clas.caltech.edu/record/418307?ln=en lists a "Harrison, David, Ph. D" (sic).

@derek73
Owner

derek73 commented Mar 15, 2016

The change in the last release with suffix handling is here: fcd7652

It does pertain to the handling of suffixes after a comma. Now the parser will only consider the name to be in the "Firstname Lastname, Suffix" format if the part before the first comma has more than one piece when split on spaces, the assumption being that "Lastname, Suffix" is not an expected/supported format. Does that break something in your data?
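The rule described above can be sketched as a small standalone check. This illustrates the stated condition only and is not the code from the commit. The function name `is_suffix_comma_format` and the tiny `KNOWN_SUFFIXES` set are assumptions made for the example.

```python
# Hypothetical, deliberately small suffix list for illustration.
KNOWN_SUFFIXES = {"phd", "edd", "md", "jr", "sr"}


def is_suffix_comma_format(full_name):
    """Return True if the name looks like 'Firstname Lastname, Suffix'.

    Two conditions must hold: everything after the first comma is a
    known suffix, and the part before the comma has more than one
    piece when split on spaces.
    """
    before, comma, after = full_name.partition(",")
    if not comma:
        return False
    after_pieces = after.split()
    all_suffixes = bool(after_pieces) and all(
        piece.replace(".", "").lower() in KNOWN_SUFFIXES
        for piece in after_pieces
    )
    return all_suffixes and len(before.split()) > 1


if __name__ == "__main__":
    for name in ("John Smith, Ph.D.", "Smith, Ph.D.", "John Smith, Ph. D."):
        print(name, is_suffix_comma_format(name))
```

In this sketch "Ph. D." splits into two pieces, neither of which is a known suffix, so the misspelled form fails the check while "Ph.D." passes.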

@rolfhnelson
Author

> Does that break something in your data?

No.

@derek73 derek73 added this to the v1.0 milestone Aug 31, 2018
@derek73 derek73 self-assigned this Aug 31, 2018

2 participants