This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Fix-It Ticket for Smithsonian Institution Integration #397

Closed
annatuma opened this issue May 18, 2020 · 5 comments
Labels
🙅 status: discontinued Not suitable for work as repo is in maintenance

Comments

@annatuma
Contributor

We know that museums within the Smithsonian Institution (SI) use inconsistent field names; for example, some museums use "summary" and others use "description" for the text that describes the result.

To improve the results for SI objects, we need to go through each museum within SI and look for missing metadata mappings and other potential improvements.

@annatuma annatuma created this issue from a note in Backlog (Q2 2020) May 18, 2020
@annatuma
Contributor Author

Blocked by #318

@mathemancer mathemancer changed the title from "Fix-It Ticket for SI Integration" to "Fix-It Ticket for Smithsonian Institution Integration" May 26, 2020
@annatuma annatuma moved this from Q2 2020 to Q3 2020 in Backlog Jun 12, 2020
@ChariniNana ChariniNana self-assigned this Jul 20, 2020
@ChariniNana
Contributor

ChariniNana commented Jul 23, 2020

@annatuma @mathemancer

An initial analysis of the missing metadata mappings is as follows:

The numbers and percentages of missing creators:

                         Sub provider | No Creator | Total Images | Missing (%)
si_national_museum_of_natural_history |    3019894 |      3019894 |      100.00
                         si_libraries |          1 |           55 |        1.82
                           si_gardens |        669 |          689 |       97.10
                  si_portrait_gallery |       7080 |        11661 |       60.72
           si_american_history_museum |        196 |         2167 |        9.04
              si_cooper_hewitt_museum |      36105 |        65632 |       55.01
   si_african_american_history_museum |       3252 |         7519 |       43.25
               si_american_art_museum |         22 |        11561 |        0.19
                  si_anacostia_museum |        322 |          571 |       56.39
                     si_postal_museum |       2900 |         2951 |       98.27
              si_freer_gallery_of_art |       2929 |         3875 |       75.59
              si_air_and_space_museum |        238 |         2516 |        9.46
                  si_hirshhorn_museum |          2 |          477 |        0.42
                si_african_art_museum |          3 |          136 |        2.21
            si_american_indian_museum |        187 |          248 |       75.40

The numbers and percentages of missing descriptions in the meta data field:

                         Sub provider | No Description | Total Images | Missing (%)
si_national_museum_of_natural_history |        3007887 |      3019894 |       99.60
                         si_libraries |             54 |           55 |       98.18
                           si_gardens |              0 |          689 |        0.00
                  si_portrait_gallery |          11661 |        11661 |      100.00
           si_american_history_museum |            963 |         2167 |       44.44
              si_cooper_hewitt_museum |           4186 |        65632 |        6.38
   si_african_american_history_museum |              0 |         7519 |        0.00
               si_american_art_museum |          11561 |        11561 |      100.00
                  si_anacostia_museum |            501 |          571 |       87.74
                     si_postal_museum |              2 |         2951 |        0.07
              si_freer_gallery_of_art |           3875 |         3875 |      100.00
              si_air_and_space_museum |            319 |         2516 |       12.68
                  si_hirshhorn_museum |            477 |          477 |      100.00
                si_african_art_museum |              1 |          136 |        0.74
            si_american_indian_museum |            248 |          248 |      100.00
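
For reference, each percentage above is the number of images missing the field divided by the total number of images for that sub provider, multiplied by 100. A minimal sketch of that tally, assuming each image is available as a dict with the hypothetical keys `sub_provider`, `creator`, and `description` (illustrative names, not the actual table schema):

```python
from collections import defaultdict


def missing_stats(images, field):
    """Return {sub_provider: (missing, total, percentage)} for one field.

    `images` is an iterable of dicts; the key names are illustrative
    assumptions, not the real schema.
    """
    missing = defaultdict(int)
    total = defaultdict(int)
    for image in images:
        provider = image["sub_provider"]
        total[provider] += 1
        # Treat None and empty strings as missing.
        if not image.get(field):
            missing[provider] += 1
    return {
        provider: (missing[provider], count, 100.0 * missing[provider] / count)
        for provider, count in total.items()
    }


# Made-up sample rows, only to show the call.
sample = [
    {"sub_provider": "si_gardens", "creator": None, "description": "A fern"},
    {"sub_provider": "si_gardens", "creator": "J. Doe", "description": "A rose"},
]
print(missing_stats(sample, "creator"))
```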

The creator value is missing because the field that holds it is not yet included in the CREATOR_TYPES dictionary. Likewise, the description is missing because its field is not yet covered by DESCRIPTION_TYPES. Both dictionaries are defined in the Smithsonian script.
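
To illustrate why an unlisted label leaves the value empty, here is a minimal sketch. It assumes the entries arrive as a list of `{"label": ..., "content": ...}` dicts and that the dictionary maps a lower-cased label to a priority. The dictionary contents and the `pick_by_label` helper are hypothetical and are not copied from the Smithsonian script.

```python
# Hypothetical subset: lower-cased label -> priority (lower wins).
CREATOR_TYPES = {"artist": 0, "maker": 1}


def pick_by_label(entries, label_priorities):
    """Return the content of the best-ranked entry whose label is listed."""
    known = [
        entry
        for entry in entries
        if isinstance(entry, dict)
        and str(entry.get("label", "")).lower() in label_priorities
        and entry.get("content")
    ]
    if not known:
        return None  # no listed label, so the value stays empty
    best = min(known, key=lambda e: label_priorities[str(e["label"]).lower()])
    return best["content"]


# Made-up entry with a label that the hypothetical dictionary does not list.
names = [{"label": "Collector", "content": "J. Doe"}]
print(pick_by_label(names, CREATOR_TYPES))
# The same entry once the label has been added to the dictionary.
print(pick_by_label(names, {**CREATOR_TYPES, "collector": 2}))
```

The same lookup would apply to descriptions by passing a DESCRIPTION_TYPES-style dictionary instead.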

Other findings:
We lose the following museums entirely, because the mandatory foreign_landing_url value is unavailable and/or because we cannot tell whether their images are CC0 licensed:

  1. SIA (smithsonian_institution_archives) - Both the record_link and guid fields from which we get the foreign_landing_url are missing.
  2. NZP (smithsonian_zoo_and_conservation) - Both the record_link and guid fields from which we get the foreign_landing_url are missing.
  3. FBR (smithsonian_field_book_project) - Both the record_link and guid fields from which we get the foreign_landing_url are missing. The usage -> access fields from which we determine whether images are CC0 licensed are also missing.
  4. NAA (smithsonian_anthropological_archives) - Both the record_link and guid fields from which we get the foreign_landing_url are missing. The usage -> access fields from which we determine whether images are CC0 licensed are also missing.
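
The drop behaviour described in the list above can be sketched as follows. This is an illustration only: it assumes that record_link and guid sit under content -> descriptiveNonRepeating in the JSON response and that record_link is preferred over guid. `get_foreign_landing_url` and `keep_row` are hypothetical helpers, not necessarily what the Smithsonian script does.

```python
def get_foreign_landing_url(row):
    """Return record_link, falling back to guid, or None if both are absent.

    The location of both fields and the order of preference are assumptions.
    """
    content = row.get("content") or {}
    dnr = content.get("descriptiveNonRepeating") or {}
    return dnr.get("record_link") or dnr.get("guid") or None


def keep_row(row):
    """A row without the mandatory landing URL cannot be stored."""
    return get_foreign_landing_url(row) is not None


# Made-up rows: one with a guid only, one with neither field.
with_guid = {"content": {"descriptiveNonRepeating": {"guid": "http://example.org/id/1"}}}
without_either = {"content": {"descriptiveNonRepeating": {}}}
print(keep_row(with_guid), keep_row(without_either))
```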

@mathemancer mathemancer pinned this issue Jul 24, 2020
@ChariniNana
Contributor

ChariniNana commented Jul 26, 2020

As per the initial research conducted on the NMNH data, it was realised that the creator field may be retrieved from the freetext -> name -> Collector value which appears for some of the images in NMNH. Further discussion is necessary to determine whether this is an appropriate field from which to obtain the creator.

For populating the description information, it was noted that the freetext -> notes -> Notes field would be appropriate for NMNH.

@ChariniNana
Contributor

ChariniNana commented Jul 30, 2020

For the four museums with a missing foreign_landing_url (SIA, NZP, FBR, NAA), no alternative field from which to retrieve the URL could be identified in the JSON responses.
For FBR, the content.descriptiveNonRepeating.online_media.media.usage.access path is in fact available, so obtaining the license type is possible. However, most FBR objects have no image list, which the Smithsonian script reads from the path content.descriptiveNonRepeating.online_media.media. For NAA, none of the objects have an image list.
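
A rough sketch of walking those two paths. It assumes that media is a list of dicts, each carrying its own usage -> access entry, and that the CC0 marker is the literal string "CC0". `dig` and `get_cc0_images` are hypothetical helpers.

```python
def dig(data, *keys):
    """Follow nested dict keys, returning None as soon as one is missing."""
    for key in keys:
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def get_cc0_images(row):
    """Return media entries whose usage -> access value marks them as CC0."""
    media = dig(row, "content", "descriptiveNonRepeating", "online_media", "media")
    if not isinstance(media, list):
        return []  # no image list, so there is nothing to ingest
    # The literal "CC0" is an assumed marker value.
    return [m for m in media if dig(m, "usage", "access") == "CC0"]


# Made-up rows: one with an image list, one without online_media at all.
with_media = {
    "content": {
        "descriptiveNonRepeating": {
            "online_media": {
                "media": [
                    {"content": "http://example.org/a.jpg", "usage": {"access": "CC0"}},
                    {"content": "http://example.org/b.jpg", "usage": {"access": "Other"}},
                ]
            }
        }
    }
}
without_media = {"content": {"descriptiveNonRepeating": {}}}
print(len(get_cc0_images(with_media)), len(get_cc0_images(without_media)))
```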

@ChariniNana
Contributor

For si_postal_museum (NPM), where 98% of the creators are missing, we might be able to use the freetext -> name -> Presentor value as the creator, and some images have freetext -> name -> Associated Organization. Both fields seem to contain names of places, which could be where the image is presented or a place it is otherwise associated with. For certain images, freetext -> name -> Associated Person is available.
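
Before settling on a field, it may help to count how often each freetext -> name label occurs across the NPM records. A minimal sketch, assuming each record exposes the name entries as a list of `{"label": ..., "content": ...}` dicts under content -> freetext -> name (an assumption about the response shape). `count_name_labels` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def count_name_labels(rows):
    """Count, per label, how many rows carry at least one such name entry."""
    counts = Counter()
    for row in rows:
        freetext = (row.get("content") or {}).get("freetext") or {}
        names = freetext.get("name") or []
        # Use a set so a label repeated within one row is counted once.
        labels = {
            n.get("label") for n in names if isinstance(n, dict) and n.get("label")
        }
        counts.update(labels)
    return counts


rows = [
    {"content": {"freetext": {"name": [{"label": "Presentor", "content": "Some Hall"}]}}},
    {
        "content": {
            "freetext": {
                "name": [
                    {"label": "Presentor", "content": "Some Hall"},
                    {"label": "Associated Person", "content": "A. Person"},
                ]
            }
        }
    },
    {"content": {"freetext": {}}},
]
for label, count in count_name_labels(rows).most_common():
    print(label, count)
```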

@mathemancer mathemancer unpinned this issue Aug 10, 2020
@kgodey kgodey moved this from Q3 2020 to tmp in Backlog Aug 13, 2020
@kgodey kgodey moved this from Q3 2020 to Internships 2020 in Backlog Aug 13, 2020
@kgodey kgodey moved this from Internships 2020 to Q4 2020 in Backlog Sep 18, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey moved this from Q4 2020 to CC Search in Backlog Dec 2, 2020
@kgodey kgodey added this to [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey removed this from [TEMPORARY] Deprioritize in Active Sprint Dec 2, 2020
@kgodey kgodey added 🙅 status: discontinued Not suitable for work as repo is in maintenance and removed 🏷 status: label work required Needs proper labelling before it can be worked on labels Dec 16, 2020
@kgodey kgodey closed this as completed Dec 16, 2020
@kgodey kgodey moved this from CC Search to Done in Backlog Dec 16, 2020

4 participants