Smithsonian NMNH improvements related to creator and description fields #474
As a result of this fix, the creator and description information for certain other museums seems to have improved as well. For example, the percentage of images with no creator fell from 75.40% to 67.74% for smithsonian_american_indian_museum (NMAI), and the percentage of images with no description fell from 98.18% to 0% for si_libraries (SIL).
Looks great. I found a couple of possible changes and pointed them out.
creators_list = [c['content'] for c in ordered_freetext_creator_objects
                 if priority == creator_types[c['label'].lower()]]

creator = '; '.join(creators_list[:-1]) + ' and ' + creators_list[-1] \
Did you mean to add a comma instead of a semicolon?
creator = '; '.join(creators_list[:-1]) + ' and ' + creators_list[-1] \
creator = ', '.join(creators_list[:-1]) + ' and ' + creators_list[-1] \
Thanks heaps for the comments @allen505! In fact, I changed the delimiter from a comma to a semicolon because a single creator value sometimes contains commas within the string. Using a semicolon is not ideal, but I'm unsure how else we can distinguish between the separate creator values and the comma separation within a single creator string.
That's interesting. Do we know why a single creator value has commas within it?
If it's because of multiple creators, then do we need to make a distinction between the separate creator values and the comma separation within a single creator string?
It's not necessarily multiple creators. For example, creators appearing for the SIL unit code may look like 'Greenawald, John L', where the value before the comma is the surname.
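To make the delimiter discussion concrete, here is a minimal sketch of how creators sharing the most preferred label could be selected and joined with semicolons. It is not the code from this PR: the `CREATOR_TYPES` entries, the sample records, and the helper names `join_creators` and `pick_creators` are all assumptions made for illustration.

```python
# Hypothetical preference table: a lower number means a more preferred label.
# The real CREATOR_TYPES dictionary in the PR may use other keys and values.
CREATOR_TYPES = {"artist": 0, "photographer": 1, "collector": 2}


def join_creators(creators):
    """Join creator strings with '; ' and put ' and ' before the last one."""
    if not creators:
        return None
    if len(creators) == 1:
        return creators[0]
    return "; ".join(creators[:-1]) + " and " + creators[-1]


def pick_creators(creator_objects, creator_types=CREATOR_TYPES):
    """Return all creators that share the most preferred label present."""
    # Keep only objects whose label appears in the preference table.
    known = [
        c for c in creator_objects
        if c.get("label", "").lower() in creator_types
    ]
    if not known:
        return None
    # The smallest number among the labels present is the best priority.
    best = min(creator_types[c["label"].lower()] for c in known)
    return join_creators([
        c["content"] for c in known
        if creator_types[c["label"].lower()] == best
    ])


# Assumed sample data in 'Surname, Given name' form, as seen for SIL.
sample = [
    {"label": "Collector", "content": "Greenawald, John L"},
    {"label": "Collector", "content": "Smith, Jane"},
]
print(pick_creators(sample))
```

With a comma delimiter the two sample values would run together as a list of four names, whereas the semicolon keeps each 'Surname, Given name' pair readable as one creator.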
indexed_structured_creator_generator = (
    i['content'] for i in _check_type(indexed_structured.get('name'), list)
    if type(i) == dict
    and _check_type(i.get('type'), str).lower() == 'personal_main'
    and _check_type(i.get('content'), str)
)

creator = next(freetext_creator_generator, None)
if ordered_freetext_creator_objects:
    c = ordered_freetext_creator_objects[0]
I think this variable could be given a more apt name, since there is another variable named c in the for loop on line 289, which could cause some confusion.
Thanks @allen505 ! Made the requested change
…into si_nmnh_improvements
Looks great, thank you!
Fixes
Fixes #470 by @ChariniNana
Description
With this fix, the percentage of missing creators in NMNH data drops from 100.00% to 3.34%, and the percentage of missing descriptions drops from 99.60% to 96.96%.
Technical details
The creator value is retrieved from the `freetext -> name -> Collector` field in the JSON response, and the description is taken from the `freetext -> notes -> Notes` field. Furthermore, we now concatenate all creator values with the same level of preference (as specified in the `CREATOR_TYPES` dictionary), whereas earlier only one of the values identified as most preferred was taken. Currently, the lowest preference is given to the creator value retrieved from the `Collector` field, since better fields exist for images corresponding to other unit codes.
Checklist
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
Developer Certificate of Origin