Fix-It Ticket for Smithsonian Institution Integration (original #397) #1784

Open
obulat opened this issue Apr 21, 2021 · 0 comments
Labels
🟩 priority: low (Low priority and doesn't need to be rushed)
🧱 stack: catalog (Related to the catalog and Airflow DAGs)
obulat commented Apr 21, 2021

This issue has been migrated from the CC Search Frontend repository

Author: annatuma
Date: Mon May 18 2020
Labels: 🙅 status: discontinued

We know that field names differ between museums in SI; for example, some museums use "summary" and others use "description" for the same information.

To improve the results of SI objects, we need to go through each museum within SI to look for any missing metadata mapping/potential improvements.

Original Comments:

annatuma commented on Mon May 18 2020:

Blocked by cc-archive/cccatalog#318
source

ChariniNana commented on Fri Jul 24 2020:

@annatuma @mathemancer

An initial analysis on missing metadata mapping is as follows:

Numbers and percentages of images missing a creator:

                         Sub provider | No Creator | Total Images | Missing Percentage
si_national_museum_of_natural_history |    3019894 |      3019894 |              100.0
                         si_libraries |          1 |           55 | 1.8181818181818181
                           si_gardens |        669 |          689 |  97.09724238026125
                  si_portrait_gallery |       7080 |        11661 | 60.715204527913556
           si_american_history_museum |        196 |         2167 |   9.04476234425473
              si_cooper_hewitt_museum |      36105 |        65632 |  55.01127498781082
   si_african_american_history_museum |       3252 |         7519 | 43.250432238329566
               si_american_art_museum |         22 |        11561 | 0.19029495718363462
                  si_anacostia_museum |        322 |          571 |   56.3922942206655
                     si_postal_museum |       2900 |         2951 |  98.27177228058285
              si_freer_gallery_of_art |       2929 |         3875 |  75.58709677419355
              si_air_and_space_museum |        238 |         2516 |   9.45945945945946
                  si_hirshhorn_museum |          2 |          477 | 0.4192872117400419
                si_african_art_museum |          3 |          136 | 2.2058823529411766
            si_american_indian_museum |        187 |          248 |  75.40322580645162

Numbers and percentages of images missing a description in the meta data field:

                         Sub provider | No Description | Total Images | Missing Percentage
si_national_museum_of_natural_history |        3007887 |      3019894 |   99.6024032631609
                         si_libraries |             54 |           55 |  98.18181818181819
                           si_gardens |              0 |          689 |                0.0
                  si_portrait_gallery |          11661 |        11661 |              100.0
           si_american_history_museum |            963 |         2167 |  44.43931702814952
              si_cooper_hewitt_museum |           4186 |        65632 | 6.3779863481228665
   si_african_american_history_museum |              0 |         7519 |                0.0
               si_american_art_museum |          11561 |        11561 |              100.0
                  si_anacostia_museum |            501 |          571 |  87.74080560420315
                     si_postal_museum |              2 |         2951 | 0.06777363605557438
              si_freer_gallery_of_art |           3875 |         3875 |              100.0
              si_air_and_space_museum |            319 |         2516 |  12.67885532591415
                  si_hirshhorn_museum |            477 |          477 |              100.0
                si_african_art_museum |              1 |          136 | 0.7352941176470589
            si_american_indian_museum |            248 |          248 |              100.0
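
The Missing Percentage column in both tables is the missing count divided by the total image count, times 100. A minimal Python sketch of that arithmetic, using three rows copied from the creator table above; the function name `missing_percentage` and the zero-total handling are illustrative assumptions, not taken from the Smithsonian script:

```python
def missing_percentage(missing: int, total: int) -> float:
    """Return the share of images lacking a field, as a percentage."""
    if total == 0:
        return 0.0  # assumption: treat an empty sub-provider as 0% missing
    return missing / total * 100


# (no creator, total images) copied from the creator table above
rows = {
    "si_libraries": (1, 55),
    "si_american_history_museum": (196, 2167),
    "si_postal_museum": (2900, 2951),
}

for sub_provider, (missing, total) in rows.items():
    print(f"{sub_provider}: {missing_percentage(missing, total):.2f}%")
```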

The creator value is missing because the field that holds it is not yet included in the CREATOR_TYPES dictionary. The description is missing because the field that holds it is not yet covered by DESCRIPTION_TYPES, as defined in the Smithsonian script.
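
To illustrate why a label that is absent from CREATOR_TYPES leaves the creator empty, here is a minimal sketch. It assumes CREATOR_TYPES maps lower-cased labels to a priority and that each name entry is a dict with "label" and "content" keys. The dictionary contents, the function name `pick_creator` and the sample entry are hypothetical and not copied from the Smithsonian script.

```python
# Hypothetical subset; the real CREATOR_TYPES in the script may differ.
CREATOR_TYPES = {"artist": 0, "photographer": 1}


def pick_creator(name_entries, creator_types):
    """Return the content of the best-ranked known label, or None."""
    best = None
    for entry in name_entries:
        label = entry.get("label", "").lower()
        if label not in creator_types:
            continue  # unknown labels are skipped, so the creator stays empty
        rank = creator_types[label]
        if best is None or rank < best[0]:
            best = (rank, entry.get("content"))
    return best[1] if best else None


# Hypothetical name entry whose label is not in the dictionary.
names = [{"label": "Collector", "content": "A. Example"}]
print(pick_creator(names, CREATOR_TYPES))  # None

# Adding the label to the dictionary is enough to recover the value.
extended = {**CREATOR_TYPES, "collector": 2}
print(pick_creator(names, extended))  # A. Example
```

The same reasoning would apply to DESCRIPTION_TYPES: a label that is not listed is never considered.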

Other findings:
We lose the following museums entirely because the mandatory foreign_landing_url value is unavailable and/or because we cannot tell whether their images are CC0 licensed:

  1. SIA (smithsonian_institution_archives) - Both the record_link and guid fields from which we get the foreign_landing_url are missing.
  2. NZP (smithsonian_zoo_and_conservation) - Both the record_link and guid fields from which we get the foreign_landing_url are missing.
  3. FBR (smithsonian_field_book_project) - Both the record_link and guid fields from which we get the foreign_landing_url are missing. The usage -> access fields from which we determine whether images are CC0 licensed are also missing.
  4. NAA (smithsonian_anthropological_archives) - Both the record_link and guid fields from which we get the foreign_landing_url are missing. The usage -> access fields from which we determine whether images are CC0 licensed are also missing.
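
The drop rule described in the list above can be sketched as follows. This assumes record_link is tried first and guid second, and that a record with neither is skipped. The function names and the flat dict input are illustrative; the actual nesting of these fields in the API response is not shown here.

```python
from typing import Optional


def get_foreign_landing_url(descriptive: dict) -> Optional[str]:
    """Return record_link, else guid, else None (assumed lookup order)."""
    for key in ("record_link", "guid"):
        value = descriptive.get(key)
        if value:
            return value
    return None


def should_keep(descriptive: dict) -> bool:
    """A record without the mandatory foreign_landing_url is dropped."""
    return get_foreign_landing_url(descriptive) is not None


# Hypothetical records: one with only a guid, one with neither field.
print(should_keep({"guid": "http://example.org/object/1"}))  # True
print(should_keep({"title": "no link fields"}))  # False
```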

source

ChariniNana commented on Mon Jul 27 2020:

As per the initial research conducted on the NMNH data, it was realised that the creator field may be retrieved from the freetext -> name -> Collector value which appears for some of the images in NMNH. Further discussion is necessary to determine whether this is an appropriate field from which to obtain the creator.

For populating the description information, it was noted that the freetext -> notes -> Notes field would be appropriate for NMNH.
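
A minimal sketch of reading the two NMNH fields named above. It assumes freetext sits under the top-level content key and that each freetext section is a list of dicts with "label" and "content" keys; the helper name `get_freetext_values` and the sample row are hypothetical.

```python
def get_freetext_values(row: dict, section: str, label: str) -> list:
    """Collect content strings under freetext -> section with a given label."""
    entries = row.get("content", {}).get("freetext", {}).get(section, [])
    return [
        e["content"]
        for e in entries
        if isinstance(e, dict) and e.get("label") == label and e.get("content")
    ]


# Hypothetical row shaped like the assumed API response.
row = {
    "content": {
        "freetext": {
            "name": [{"label": "Collector", "content": "A. Example"}],
            "notes": [{"label": "Notes", "content": "Collected near a river."}],
        }
    }
}

print(get_freetext_values(row, "name", "Collector"))  # ['A. Example']
print(get_freetext_values(row, "notes", "Notes"))  # ['Collected near a river.']
```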
source

ChariniNana commented on Fri Jul 31 2020:

For the four museums with a missing foreign_landing_url (SIA, NZP, FBR, NAA), no alternative field from which to retrieve the URL could be identified in the JSON responses.
For FBR, the content.descriptiveNonRepeating.online_media.media.usage.access path is in fact available, so obtaining the license type is possible. However, most FBR objects have no image list, which the Smithsonian script reads from the path content.descriptiveNonRepeating.online_media.media. For NAA, no object has an image list.
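
The two paths mentioned above can be walked as in the sketch below. It assumes the media value is a list of dicts and that a CC0 image carries the string "CC0" under usage -> access; both the function name `get_cc0_images` and that literal value are assumptions, not confirmed by the Smithsonian script.

```python
def get_cc0_images(row: dict) -> list:
    """Return media entries whose usage.access equals "CC0" (assumed value)."""
    media = (
        row.get("content", {})
        .get("descriptiveNonRepeating", {})
        .get("online_media", {})
        .get("media")
    )
    if not isinstance(media, list):
        return []  # no image list, so nothing can be ingested for this object
    return [
        m
        for m in media
        if isinstance(m, dict) and (m.get("usage") or {}).get("access") == "CC0"
    ]


# Hypothetical object with one CC0 image and one image without usage data.
row = {
    "content": {
        "descriptiveNonRepeating": {
            "online_media": {
                "media": [
                    {"content": "img-1", "usage": {"access": "CC0"}},
                    {"content": "img-2"},
                ]
            }
        }
    }
}

print(len(get_cc0_images(row)))  # 1
print(get_cc0_images({"content": {"descriptiveNonRepeating": {}}}))  # []
```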
source

ChariniNana commented on Fri Jul 31 2020:

For si_postal_museum (NPM), where 98% of the creators are missing, we might be able to use the freetext -> name -> Presentor value as the creator; some images have freetext -> name -> Associated Organization instead. Both fields seem to contain names of places, which could be where the image is presented or a place it is otherwise associated with. For certain images, freetext -> name -> Associated Person is available.
source

@obulat added the 🧱 stack: catalog (Related to the catalog and Airflow DAGs) label Feb 24, 2023
@obulat added the 🟩 priority: low (Low priority and doesn't need to be rushed) label Mar 8, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023