map DigitalResources #13

Closed
VladimirAlexiev opened this issue May 18, 2017 · 8 comments

@VladimirAlexiev
Member

VladimirAlexiev commented May 18, 2017

@tobiashreiter: "The Item_DigitalResources file was updated on May 15th to use our publicly facing URL".
That file has very rich info and I think should be mapped. Columns:

  • DIGITALRESOURCEID: image ID
  • FKITEMID: object ID
  • PARTTYPE: page, cover, diary entry...
  • PARTNUM: eg page number
  • FULLSIZE: image URL (checked several, work ok). Only 1 audio file
  • ORDER: order in which the image should be presented

Analysis of PARTTYPE:

>csvcut -t -c PARTTYPE Item_DigitalResources.txt|sort|uniq -c
      1 PARTTYPE
    523 cover
    341 cover back
     89 cover verso
     10 drawing
   5723 entry  <<< diary entry
    137 envelope
     89 envelope verso
     71 front
    144 inside
  11785 page
  11214 pages  <<< same as "page"
     41 postcard back
    659 sketch
   1462 sketchbook page
     10 title page
      1 track  <<< only one is an audio file not image
   5601 verso
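The same tally can be sketched in Python, which also makes it easy to fold "pages" into "page" as noted above. This is a minimal sketch, assuming the file is tab-separated with a header row containing a PARTTYPE column; the function name `count_part_types`, the `merge` parameter, and the sample rows are hypothetical and not real data from the file.

```python
import csv
import io
from collections import Counter

def count_part_types(tsv_file, merge=None):
    """Count PARTTYPE values in a tab-separated file object with a header row.

    `merge` optionally maps a variant label to its canonical form.
    """
    merge = merge or {}
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        part_type = row["PARTTYPE"].strip()
        counts[merge.get(part_type, part_type)] += 1
    return counts

# Made-up sample rows (assumption: same column names as the real file).
sample = io.StringIO(
    "DIGITALRESOURCEID\tFKITEMID\tPARTTYPE\tPARTNUM\tFULLSIZE\tORDER\n"
    "1\t10\tpage\t1\thttp://example.org/1.jpg\t1\n"
    "2\t10\tpages\t2\thttp://example.org/2.jpg\t2\n"
    "3\t10\tcover\t0\thttp://example.org/3.jpg\t0\n"
)
counts = count_part_types(sample, merge={"pages": "page"})
```

For the real file, the function would be called with something like `open("Item_DigitalResources.txt", newline="", encoding="utf-8")`; the encoding is a guess.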

Comparing PARTNUM vs ORDER (ignoring PARTNUM=0):

>perl -ne "@a=split/\t/; print if $a[3] && $a[3]!=$a[5]" Item_DigitalResources.txt > Item_DigitalResources-mismatch.txt

> wc -l *.txt
   1271 Item_DigitalResources-mismatch.txt
  37901 Item_DigitalResources.txt

So they differ in 1271 of the 37900 data rows (37901 lines minus the header), or about 3.4%.
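A rough Python equivalent of the perl comparison, for anyone who wants to rerun it. It is only a sketch: it assumes PARTNUM and ORDER are either empty or plain integers (perl silently treats non-numeric values as 0, whereas `int()` would raise), and the function name and sample rows are made up.

```python
import csv
import io

def find_mismatches(tsv_file):
    """Return (rows where PARTNUM is non-zero and differs from ORDER, total data rows)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    mismatches = []
    total = 0
    for row in reader:
        total += 1
        part_num = int(row["PARTNUM"] or 0)
        order = int(row["ORDER"] or 0)
        # Ignore PARTNUM=0, as in the perl one-liner.
        if part_num and part_num != order:
            mismatches.append(row)
    return mismatches, total

# Made-up sample rows, not taken from Item_DigitalResources.txt.
sample = io.StringIO(
    "DIGITALRESOURCEID\tFKITEMID\tPARTTYPE\tPARTNUM\tFULLSIZE\tORDER\n"
    "1\t10\tpage\t1\thttp://example.org/1.jpg\t1\n"
    "2\t10\tcover\t0\thttp://example.org/2.jpg\t5\n"
    "3\t10\tpage\t3\thttp://example.org/3.jpg\t4\n"
)
mismatches, total = find_mismatches(sample)
share = len(mismatches) / total
```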

Analysis of PARTTYPE for those that have PARTNUM:

>perl -ne "@a=split/\t/; print qq{$a[2]\n} if $a[3]" Item_DigitalResources.txt|sort|uniq -c
      1 PARTTYPE
    153 cover
    307 cover back
     70 cover verso
      6 entry
     70 envelope
     71 envelope verso
     37 front
    120 inside
  11658 page
  11208 pages
     26 postcard back
    653 sketch
   1454 sketchbook page
      6 title page
      1 track
   3238 verso

I think we should map both PARTNUM and ORDER:

  • PARTNUM shows the original data (eg page number)
  • ORDER is important for parts that don't have numbers (eg "cover" should be shown before "cover back")

Every image is used only once:

>csvcut -t -c FULLSIZE Item_DigitalResources.txt|sort|uniq -d

So we can use FULLSIZE rather than DIGITALRESOURCEID for the URL.
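The uniqueness check can also be sketched in Python. An empty result corresponds to the empty output of `uniq -d` above. The helper name `duplicated_values` and the sample URLs are hypothetical.

```python
import csv
import io
from collections import Counter

def duplicated_values(tsv_file, column):
    """Return the sorted values of `column` that occur more than once."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    return sorted(value for value, n in counts.items() if n > 1)

header = "DIGITALRESOURCEID\tFKITEMID\tPARTTYPE\tPARTNUM\tFULLSIZE\tORDER\n"

# Made-up sample in which every FULLSIZE is distinct.
unique_sample = io.StringIO(
    header
    + "1\t10\tpage\t1\thttp://example.org/a.jpg\t1\n"
    + "2\t10\tpage\t2\thttp://example.org/b.jpg\t2\n"
)
# Made-up sample in which one URL is reused.
reused_sample = io.StringIO(
    header
    + "1\t10\tpage\t1\thttp://example.org/a.jpg\t1\n"
    + "2\t11\tpage\t1\thttp://example.org/a.jpg\t1\n"
)
no_dups = duplicated_values(unique_sample, "FULLSIZE")
dups = duplicated_values(reused_sample, "FULLSIZE")
```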

@tobiashreiter
Collaborator

This is a great analysis. Ideally, the part number is for presentation only, while the order number preserves sequence where no part number should be displayed (e.g. for a cover). This accounts for the number mismatch. Nonetheless, a part number was sometimes assigned incorrectly to a part type that shouldn't have a displayed number (e.g. cover). I can try to scrub these out and deliver a new file with just the correct part numbers.

I will check the records to find the audio URL -- that is a mistake and should be corrected, somehow.

@VladimirAlexiev
Member Author

VladimirAlexiev commented May 18, 2017

(I assumed "track" is an audio file).

@workergnome @azaroth42 here's a proposed mapping, please comment.
Mapping1

<aaa/object/(FKITEMID)> crm:P65_shows_visual_item <(FULLSIZE)>.

<(FULLSIZE)> a crm:E38_Image;
  crm:P2_has_type <aaa/thesaurus/part/urlify(PARTTYPE)>;
  crm:P1_is_identified_by <aaa/image/id/(DIGITALRESOURCEID)>;
  crmx:number "(PARTNUM)"^^xsd:integer; # skip 0
  crmx:sort_order "(ORDER)"^^xsd:integer.

<aaa/image/id/(DIGITALRESOURCEID)> a crm:E42_Identifier;
  rdf:value "(DIGITALRESOURCEID)".
  
<aaa/thesaurus/part/urlify(PARTTYPE)> a skos:Concept;
  skos:prefLabel "(PARTTYPE)";
  skos:inScheme <aaa/thesaurus/part/>.

<aaa/thesaurus/part/> a skos:ConceptScheme;
  skos:prefLabel "Object parts (page, cover, etc)".
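To make Mapping1 concrete, here is a hedged Python sketch that renders one row as Turtle text. The mapping does not define `urlify`, so the version below (lowercase, runs of non-alphanumerics replaced by a hyphen) is only an assumption, as are the function name `mapping1_turtle` and the sample row. The sort property is written as `crmx:sort_order`, prefix declarations and the ConceptScheme triples are omitted, and string escaping is not handled.

```python
import re

def urlify(label):
    """Hypothetical slug helper: lowercase, non-alphanumeric runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", label.strip().lower()).strip("-")

def mapping1_turtle(row):
    """Render one Item_DigitalResources row as Turtle following Mapping1."""
    res_id = row["DIGITALRESOURCEID"]
    label = row["PARTTYPE"]
    part_num = int(row["PARTNUM"] or 0)
    order = int(row["ORDER"])  # assumes ORDER is always present
    obj = f"<aaa/object/{row['FKITEMID']}>"
    img = f"<{row['FULLSIZE']}>"
    ident = f"<aaa/image/id/{res_id}>"
    part = f"<aaa/thesaurus/part/{urlify(label)}>"

    props = [
        "a crm:E38_Image",
        f"crm:P2_has_type {part}",
        f"crm:P1_is_identified_by {ident}",
    ]
    if part_num != 0:  # skip 0, as the mapping comment says
        props.append(f'crmx:number "{part_num}"^^xsd:integer')
    props.append(f'crmx:sort_order "{order}"^^xsd:integer')

    blocks = [
        f"{obj} crm:P65_shows_visual_item {img}.",
        f"{img} " + ";\n  ".join(props) + ".",
        f'{ident} a crm:E42_Identifier;\n  rdf:value "{res_id}".',
        f'{part} a skos:Concept;\n  skos:prefLabel "{label}";\n'
        "  skos:inScheme <aaa/thesaurus/part/>.",
    ]
    return "\n\n".join(blocks)

# Made-up row for illustration only.
row = {
    "DIGITALRESOURCEID": "42",
    "FKITEMID": "7",
    "PARTTYPE": "cover back",
    "PARTNUM": "0",
    "FULLSIZE": "http://example.org/img/42.jpg",
    "ORDER": "3",
}
turtle = mapping1_turtle(row)
```

Building strings by hand keeps the sketch dependency-free; a real conversion would more likely go through an RDF library or the project's own mapping tool.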

To be more CRM compliant (but slightly more wordy), we could say the object has several features, and each feature carries one image. Here P2 matches much better (a feature can be a Cover, but an image cannot be a Cover: it's just a depiction of the cover).
Mapping2

<aaa/object/(FKITEMID)> crm:P56_bears_feature <aaa/feature/(DIGITALRESOURCEID)>.

<aaa/feature/(DIGITALRESOURCEID)> a crm:E25_Man-Made_Feature;
  crm:P2_has_type <aaa/thesaurus/part/urlify(PARTTYPE)>;
  crm:P1_is_identified_by <aaa/feature/(DIGITALRESOURCEID)/id>;
  crmx:number "(PARTNUM)"^^xsd:integer; # skip 0
  crm:P65_shows_visual_item <(FULLSIZE)>.

<(FULLSIZE)> a crm:E38_Image;
  crmx:sort_order "(ORDER)"^^xsd:integer.

<aaa/feature/(DIGITALRESOURCEID)/id> a crm:E42_Identifier;
  rdf:value "(DIGITALRESOURCEID)".
  
<aaa/thesaurus/part/urlify(PARTTYPE)> a skos:Concept;
  skos:prefLabel "(PARTTYPE)";
  skos:inScheme <aaa/thesaurus/part/>.

<aaa/thesaurus/part/> a skos:ConceptScheme;
  skos:prefLabel "Object parts (page, cover, etc)".

@azaroth42

Not for the current work.

@tobiashreiter
Collaborator

@VladimirAlexiev: Just to be clear, how are you accounting for multiple images per item? It looks like you're planning on using p65 to link to the image (using the FULLSIZE URI as the resource identifier), but we might have multiple images per Item. Maybe I'm not reading the mapping correctly, but it looks like it assumes a one-to-one relationship for items to images.

We do have a representative image for each item that is stored in the has_representation field in Items that corresponds to an entry in the Item_DigitalResources spreadsheet, but many items will have more than one matching image (and these would link based on the fkItemID).

If there's a part of the mapping that I'm not understanding, I'd appreciate getting a better sense of how multiple records are being handled.

Thanks!

@tobiashreiter
Collaborator

@VladimirAlexiev : The Item_DigitalResources file is now updated to reflect changes to our part numbers (scrubbing them for part types that shouldn't have them).

@VladimirAlexiev
Member Author

@tobiashreiter Like any RDF property, p65 can (and in this case will) be used multiple times, once per image of the object.
If I understand @azaroth42 correctly, he says mapping multiple images is not for this version 1 of the project. In that case we could map just the one has_representation image.
@workergnome ?
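To illustrate the one-to-many point: grouping rows by FKITEMID yields one P65 statement per image for the same object, and ORDER gives the presentation sequence. This is only a sketch with made-up rows; the function name `p65_triples` and the tuple representation of triples are hypothetical.

```python
from collections import defaultdict

def p65_triples(rows):
    """Emit one (subject, predicate, object) triple per image, ordered by ORDER within each object."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["FKITEMID"]].append(row)
    triples = []
    for item_id, item_rows in grouped.items():
        for row in sorted(item_rows, key=lambda r: int(r["ORDER"])):
            triples.append(
                (f"aaa/object/{item_id}", "crm:P65_shows_visual_item", row["FULLSIZE"])
            )
    return triples

# Made-up rows: object 10 has two images, object 11 has one.
rows = [
    {"FKITEMID": "10", "FULLSIZE": "http://example.org/b.jpg", "ORDER": "2"},
    {"FKITEMID": "10", "FULLSIZE": "http://example.org/a.jpg", "ORDER": "1"},
    {"FKITEMID": "11", "FULLSIZE": "http://example.org/c.jpg", "ORDER": "1"},
]
triples = p65_triples(rows)
```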

@tobiashreiter
Collaborator

If so, can you please use has_representation and assign a "classified by" property of AAT:300404670 (preferred terms), so it's clear that this representative image is the manually curated representation for the item?

@tobiashreiter
Collaborator

I ended up mapping this slightly differently (I was in a rush, and not always looking at the GitHub issues when doing my mapping). Right now, I have a has_representation that links to the primary image for the object, and then links out (also using has_representation) to the digital resource records. I'm happy to drop the additional records if we really can't handle that now.


3 participants