NPGDims #54

Closed
VladimirAlexiev opened this Issue Aug 9, 2016 · 23 comments

@VladimirAlexiev

VladimirAlexiev commented Aug 9, 2016

Some analysis of NPGDimsParsedUpdate2May.xlsx:
image

Questions to @si-npg:

  • Are the linear dimensions in Centimeters?
  • There's only one Weight=3.63, is that Kilograms?

Observations

  • NPGObjects.Dimensions is the human-readable string and should be emitted as a Dimension with just p3_has_note
  • DimItemElemXrefID depends functionally on (ObjectID, Element).
  • it's crucially important that DimItemElemXrefID groups measurements of the same (object,element). Emitting the dimensions without this grouping would be useless (same as in JPGM).
  • Element is sometimes an object part, but not always. Other uses:
    • qualifier about the manner of measurement (Sight)
    • qualifier about the mode of measurement (Case Open, Case Closed, With Base, With Socle, Without Base)
    • dimension type (Duration)
    • placeholder (Unspecified element, Other)
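
A minimal sketch of how the functional dependency noted above could be checked, using the six rows for object 23 that are quoted further down in this comment. The helper name `violations` and the tuple layout are illustrative assumptions, not part of any TMS export.

```python
# Rows: (ObjectID, DimItemElemXrefID, Element, DimensionType, Dimension)
ROWS = [
    (23, 168874, "Mat", "Height", 71.12014224),
    (23, 168874, "Mat", "Width", 55.88011176),
    (23, 105668, "Image", "Width", 36.2),
    (23, 105668, "Image", "Height", 44.1),
    (23, 105667, "Sheet", "Width", 37.7),
    (23, 105667, "Sheet", "Height", 50.2),
]


def violations(rows):
    """Return the (ObjectID, Element) keys that map to more than one DimItemElemXrefID."""
    seen = {}
    for object_id, xref_id, element, _dim_type, _value in rows:
        seen.setdefault((object_id, element), set()).add(xref_id)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# An empty dict means the dependency holds for these rows.
print(violations(ROWS))
```

Running the same check over the full spreadsheet would show whether the dependency holds everywhere or only for most objects.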

We use the following for JPGM (who also have TMS):
image

Consider the dimension data about one object:

  • display dimension: Image: 44.1 x 36.2cm (17 3/8 x 14 1/4") Sheet: 50.2 x 37.7cm (19 3/4 x 14 13/16") Mat: 71.1 x 55.9cm (28 x 22")
  • structured dimensions:
ObjectID  DimensionID  DimItemElemXrefID  Element  DimensionType  ElementRank  DimRank  Dimension
23        349187       168874             Mat      Height         3            1        71.12014224
23        349188       168874             Mat      Width          3            2        55.88011176
23        213081       105668             Image    Width          1            2        36.2
23        213082       105668             Image    Height         1            1        44.1
23        213079       105667             Sheet    Width          2            2        37.7
23        213080       105667             Sheet    Height         2            1        50.2

I propose to map it to this RDF Turtle (first two rows are shown):

@base <http://americanartcollaborative.org/>.
@prefix crm:  <http://www.cidoc-crm.org/cidoc-crm/>.
@prefix crmx: <http://americanartcollaborative.org/crm-ext/>.
@prefix aat:  <http://vocab.getty.edu/aat/>.

<npg/object/23>
  crm:P43_has_dimension <npg/object/23/dimension>;
  crm:P39i_was_measured_by
    <npg/object/23/measurement/168874>, <npg/object/23/measurement/105668>, <npg/object/23/measurement/105667>.

<npg/object/23/dimension> a crm:E54_Dimension;
  crm:P3_has_note
    """Image: 44.1 x 36.2cm (17 3/8 x 14 1/4")  Sheet: 50.2 x 37.7cm (19 3/4 x 14 13/16")  Mat: 71.1 x 55.9cm (28 x 22")""".

<npg/object/23/measurement/168874> a crm:E16_Measurement; 
  crmx:P2_extent aat:300236006; # mat (framing and mounting equipment)
  crmx:sort_order 3; # ElemRank
  crm:P40_observed_dimension <npg/object/23/dimension/349187>, <npg/object/23/dimension/349188>.

<npg/object/23/dimension/349187> a crm:E54_Dimension;
  crm:P2_has_type aat:300055644; # height
  crm:P91_has_unit aat:300379098; # centimeters
  crm:P90_has_value 71.12014224;
  crmx:sort_order 1. # DimRank

<npg/object/23/dimension/349188> a crm:E54_Dimension;
  crm:P2_has_type aat:300055647; # width
  crm:P91_has_unit aat:300379098; # centimeters
  crm:P90_has_value 55.88011176;
  crmx:sort_order 2. # DimRank

Notes:

  • crmx:P2_extent indicates what is being measured. We could easily replace it with the standard crm:P2_has_type, since there's only one "type" in this case. But in CONA/JPGM the same idea of "extent" is also used for Subject and Agent Contribution, which is why we thought it a good idea to make a sub-property.
  • crmx:sort_order would be necessary to reconstruct the display dimension. Since we're emitting the display dimension as a separate node, we can skip it.

@kateblanch @edgartdata @steads @azaroth42 What do you think?
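
A rough sketch of how the proposed mapping could be generated from the rows, grouping by DimItemElemXrefID so that each group becomes one E16_Measurement. It covers only the two Mat rows, because those are the only ones for which AAT identifiers are given above. The function name `to_turtle` and the tuple layout are assumptions, and the `@base`/`@prefix` header is left out for brevity.

```python
from collections import defaultdict

# (ObjectID, DimensionID, DimItemElemXrefID, Element, DimensionType,
#  ElementRank, DimRank, Dimension); values kept as strings to preserve digits
ROWS = [
    (23, 349187, 168874, "Mat", "Height", 3, 1, "71.12014224"),
    (23, 349188, 168874, "Mat", "Width", 3, 2, "55.88011176"),
]

# AAT identifiers exactly as quoted in the Turtle above
DIM_TYPE_AAT = {"Height": "aat:300055644", "Width": "aat:300055647"}
ELEMENT_AAT = {"Mat": "aat:300236006"}
CENTIMETERS = "aat:300379098"


def to_turtle(rows):
    """Emit one E16_Measurement per (ObjectID, DimItemElemXrefID) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[2])].append(row)

    blocks = []
    for (obj, xref), members in groups.items():
        members.sort(key=lambda r: r[6])  # order by DimRank
        element, elem_rank = members[0][3], members[0][5]
        dims = ", ".join(f"<npg/object/{obj}/dimension/{r[1]}>" for r in members)
        blocks.append(
            f"<npg/object/{obj}/measurement/{xref}> a crm:E16_Measurement;\n"
            f"  crmx:P2_extent {ELEMENT_AAT[element]};\n"
            f"  crmx:sort_order {elem_rank};\n"
            f"  crm:P40_observed_dimension {dims}."
        )
        for r in members:
            blocks.append(
                f"<npg/object/{obj}/dimension/{r[1]}> a crm:E54_Dimension;\n"
                f"  crm:P2_has_type {DIM_TYPE_AAT[r[4]]};\n"
                f"  crm:P91_has_unit {CENTIMETERS};\n"
                f"  crm:P90_has_value {r[7]};\n"
                f"  crmx:sort_order {r[6]}."
            )
    return "\n\n".join(blocks)


print(to_turtle(ROWS))
```

Elements without an entry in `ELEMENT_AAT` (Image, Sheet) would raise a `KeyError` here; a real mapping would need the remaining AAT terms or a fallback.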

@steads

steads commented Aug 9, 2016

The E54 Dimension instance you suggest for attaching the display string to does not exist and should not be instantiated. The string is related to the object as a whole and should be attached directly to the instance of E22 Man-Made Object with P3 and P3.1. This can be instantiated using the CRMpc extension, which gives a robust RDF deployment method for the .1 properties. I would prefer that the instances of E54 Dimension that are created for the width and height of the Mat had labels of the form "Object 23 width of Mat".

I am unclear if the metric dimensions are genuine measurements or just mathematical conversions (the number of decimal places suggests conversion). If they are mathematical conversions then they should probably be dropped. There is definitely more than a single instance of E16 Measurement; there should be one for each dimension actually present. So in this case there are 6 or 12 instances of E16 Measurement, depending on whether the metric measurements are measurements or simply conversions.

If they are conversions and you wish to have them represented in the data, rather than just creating them on the fly in the user interface, then you would have 2 instances of E54 Dimension connected by 2 instances of P40 to the same instance of E16 Measurement, and add a P2 has type [conversion] to the converted value (in this case the metric value). I would probably amend the label to "Object 23 width of Mat (converted from Imperial)" as well.

@workergnome

workergnome commented Aug 9, 2016

I am very curious about this CRMpc extension—is there any documentation on it that I can read? Google is failing me, as are both the search box at the new CIDOC site and the search at http://www.ics.forth.gr.

@steads

steads commented Aug 9, 2016

Try http://new.cidoc-crm.org/technical_papers and then "modelling properties of properties". There is a presentation and the RDF.
HTH

@workergnome

workergnome commented Aug 9, 2016

Would it make sense to model it as a linguistic object, not just as a note?

:thing P129i_is_subject_of :dimension_string.
:dimension_string a E33_Linguistic_Object;
        p3_has_note [TEXT];
        P2_has_type aac:dimension_string.

It would also allow us to attach aboutness to the object.

I would also recommend looking at the http://qudt.org ontology for our units—the AAT definitions are good as terms, but they don't relate to anything that you'd need if you need to use the dimensions in any mathematical way, like unit conversion.
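
To illustrate what using the dimensions "in any mathematical way" needs from a unit vocabulary, here is a minimal sketch in which each unit carries a quantity kind and a multiplier to an SI base unit. The table, its keys and the function `convert` are hypothetical stand-ins; they are not QUDT or AAT identifiers.

```python
# Hypothetical unit table: unit -> (quantity kind, multiplier to the SI base unit)
UNITS = {
    "cm": ("length", 0.01),        # metres per centimetre
    "in": ("length", 0.0254),      # metres per inch (exact by definition)
    "kg": ("mass", 1.0),           # kilogram is the SI base unit of mass
    "lb": ("mass", 0.45359237),    # kilograms per pound (exact by definition)
}


def convert(value, source, target):
    """Convert value between two units of the same quantity kind."""
    source_kind, source_mult = UNITS[source]
    target_kind, target_mult = UNITS[target]
    if source_kind != target_kind:
        raise ValueError(f"cannot convert {source_kind} to {target_kind}")
    return value * source_mult / target_mult


print(round(convert(28, "in", "cm"), 2))  # 71.12
```

A plain term list can name "centimeters" but cannot tell a program that 28 inches and 71.12 cm are the same length; the multiplier and the quantity kind are the extra pieces a conversion needs.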

@si-npg

Collaborator

si-npg commented Aug 9, 2016

Yes, linear dimensions are in Centimeters. Yes, the one Weight=3.63 is Kilograms.

@steads

steads commented Aug 9, 2016

If you use P3 has note it captures the required sense. You would only instantiate an instance of E33 Linguistic Object if the text is documented in its own right as a subject in the domain of interest. So no, I am afraid, it does not make sense.

@workergnome

workergnome commented Aug 9, 2016

So the suggestion for the textual representation of dimension is:

:thing crmpc:P01i_is_domain_of :note_property.
:note_property a crmpc:PC3_has_note;
    crmp:P03_has_range_literal "Image: 44.1 x 36.2cm...";
    crmp:P3.1_has_type aac:dimension_string.

That look right to people?

Will we use this mapping for all notes, or only specific notes, and how will we determine when to use this pattern or when to use the straight P3_has_note mapping?

@azaroth42

azaroth42 commented Aug 9, 2016

At least in the Provenance Index, and I anticipate in the Museum, we're going to simplify to:

_:Object a E22_Man_Made_Object ;
  schema:height [
    a E54_Dimension ;
    p90_value 71.12 ;
    p91_has_unit qudt:cm ] ;
  schema:width [
    a E54_Dimension ;
    p90_value 55.88 ;
    p91_has_unit qudt:cm ] .

For different parts of the object, I like the proposal that David made to model it as different parts of the object :) Much simpler, just as expressive, provides better hooks for future work.

@VladimirAlexiev

VladimirAlexiev commented Aug 10, 2016

Rob>model it as different parts

As you see in the pivot, not all Elements are Parts. Some express qualifier or mode.

schema:height

This modeling doesn't say what was measured. As I wrote "it's crucially important that DimItemElemXrefID groups measurements of the same (object,element). Emitting the dimensions without this grouping would be useless (same as in JPGM)". See https://share.getty.edu/display/JPGLODM/JPGM+Dimensions for how the data looks in TMS, and why it is necessary to group the dimensions.

And does schema have props for all dimensions required across museums?

  • the BM dimension thesaurus includes: height, thickness, width, diameter, length, depth, weight, circumference, volume, curvature, percentage, currency, die-axis.
  • CCO p110: Examples of types of measurement include height, width, depth, length, circumference, diameter, volume, weight, area, and running time.
  • JPGM has Depth, Diameter, Section width, Length, Circumference, Height, Width, Weight. All are mapped to AAT, same mapping can be used for NPG

Steve>unclear if the metric dimensions are genuine measurements or just mathematical conversions

The metric values are the only values we got in the database.

David> crmpc:P01i_is_domain_of crmpc:PC3_has_note crmp:P03_has_range_literal

Did you make up these terms? Or is there an RDF definition of "CRMpc"?

Also see this comment in #20: "it is one of the most expensive ways, since it doubles the number of classes and triples the number of property types. There are better ways to attach type to a relation."

@VladimirAlexiev

VladimirAlexiev commented Aug 10, 2016

not all Elements are Parts

To strengthen this: @azaroth42, can you please model "Without Base" as an object part ;-)

@steads

steads commented Aug 10, 2016

See previous comment for link to RDF of CRMpc.
The metric values are pretty obviously conversions and not real measurements (71.12014224!). The real measurements are then probably the Imperial measurements embedded in the text. How about parsing those?
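
A minimal sketch of what parsing the Imperial values out of the display string might look like, using the string for object 23 quoted earlier in the thread. The regular expression only handles this exact "Element: a x bcm (c x d")" shape, and the names `parse_display`, `parse_inches` and `inches_to_cm` are assumptions; real display strings will be messier.

```python
import re
from fractions import Fraction

DISPLAY = ('Image: 44.1 x 36.2cm (17 3/8 x 14 1/4") '
           'Sheet: 50.2 x 37.7cm (19 3/4 x 14 13/16") '
           'Mat: 71.1 x 55.9cm (28 x 22")')

# One element: label, metric pair, then the Imperial pair in parentheses
ELEMENT_RE = re.compile(
    r'(?P<element>\w+):\s*'
    r'(?P<cm1>[\d.]+)\s*x\s*(?P<cm2>[\d.]+)\s*cm\s*'
    r'\((?P<in1>[\d /]+?)\s*x\s*(?P<in2>[\d /]+?)"\)'
)


def parse_inches(text):
    """Turn '17 3/8' into Fraction(139, 8) and '28' into Fraction(28)."""
    return sum(Fraction(part) for part in text.split())


def inches_to_cm(inches):
    """Convert exactly, using 1 inch = 2.54 cm."""
    return float(inches * Fraction("2.54"))


def parse_display(display):
    """Map each element label to its metric and Imperial value pairs."""
    parsed = {}
    for m in ELEMENT_RE.finditer(display):
        parsed[m["element"]] = {
            "cm": (float(m["cm1"]), float(m["cm2"])),
            "in": (parse_inches(m["in1"]), parse_inches(m["in2"])),
        }
    return parsed


for element, values in parse_display(DISPLAY).items():
    print(element, values["in"], [inches_to_cm(v) for v in values["in"]])
```

For the Mat, 28 in and 22 in convert to 71.12 cm and 55.88 cm, which agree with the stored 71.12014224 and 55.88011176 to within a thousandth of a centimetre.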

@steads

steads commented Aug 10, 2016

@workergnome

workergnome commented Aug 10, 2016

@VladimirAlexiev: I'm not advocating for this technique specifically—just that we mutually agree on a technique for modeling the Pn.1 properties.

@steads: It makes me nervous to be modeling this using a vocabulary that is almost completely undocumented and still under development.

I'm also curious if there is a best-practices document, @steads, that describes how the CRM should be used. Your comment about Linguistic Objects makes sense, but is more restrictive than what is in the documentation for the CRM:

You would only instantiate an instance of E33 Linguistic Object if the text is documented in its own right as a subject in the domain of interest.

versus the documentation description:

This class comprises identifiable expressions in natural language or languages.

Is this your opinion as to what best practices should be, or is this a restriction on the use of the CRM that is formally documented somewhere?

@azaroth42

azaroth42 commented Aug 10, 2016

Clearly "Other" and "Unspecified" can't be modeled as parts, but are just as meaningless in any other structure too. The state of case open/case closed would be hard to do as parts, I agree, but is an outlier. The rest of them seem to be parts (but happy to be corrected if they're not)

Without X, means there is a part that is X and a part that is the rest of the object without X. So I would model that as:

_:Object has_part _:X, _:WithoutX .
_:WithoutX width _:WidthForObject ; height _:HeightForObject .

The question that I don't think can be answered from the very useful pivot table is how many objects have more than one set of dimensions. If that number is low, then the majority of objects can simply have dimensions associated with them directly.
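
The count asked for here could be computed from the raw rows along these lines. This is only a sketch: the function name is an assumption, and object 24 in the sample is a made-up row added so that the two cases differ.

```python
from collections import defaultdict


def objects_with_multiple_sets(pairs):
    """pairs: iterable of (ObjectID, DimItemElemXrefID).

    Returns (objects with more than one dimension set, total objects).
    """
    sets_per_object = defaultdict(set)
    for object_id, xref_id in pairs:
        sets_per_object[object_id].add(xref_id)
    multiple = [o for o, ids in sets_per_object.items() if len(ids) > 1]
    return len(multiple), len(sets_per_object)


SAMPLE = [
    (23, 168874), (23, 168874),  # Mat height and width
    (23, 105668), (23, 105668),  # Image
    (23, 105667), (23, 105667),  # Sheet
    (24, 200001), (24, 200001),  # hypothetical object with a single set
]

print(objects_with_multiple_sets(SAMPLE))  # (1, 2)
```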

@VladimirAlexiev

VladimirAlexiev commented Aug 10, 2016

@steads: Suggestion to rename "has_domain" to "subject", has_range to "object", "has_range_literal" to "value". Reason: Domain and Range are the types of the subject and object in a triple, not these resources themselves.

@workergnome> curious if there is a best-practices document that describes how the CRM should be used

What Steve said is just common sense.

What are "Elements"

@azaroth42> "Other" and "Unspecified"

Yes, these values should just be skipped. But if you have two DimItemElemXrefID, say Image and Other, you still need to emit them as two Measurements.

The rest of them seem to be parts

Please read more carefully what I wrote. How about "Sight"?
"Image/Sight"?
How about "combination parts" like "Image/Sheet/Mount"? (we don't even know whether that is AND or OR)

is an outlier

Everything that doesn't conform to a theory is an outlier ;-)

Please read about CONA dimensions: https://share.getty.edu/pages/viewpage.action?spaceKey=ITSLODV&title=CONA+Dimensions:

  • dimensions_extent: what part or feature was measured,
    • e.g. overall, general, diameter, stories (floors), pattern, repeat, chain lines
    • CCO examples: overall, general, diameter; platemark, sheet (for a print); secondary support, mat, mount, frame, pattern repeat; lid, base, body (for a vase); footprint, tessera, laid lines, with base; center back (for a jacket); stories (floors), rooms, grounds (for a building)
    • As you see, these are not always physical parts.
  • dimensions_qualifier: how it was measured (how the measurement was taken)
    • e.g. approximate, sight, maximum
    • CCO examples: approximate, sight, maximum, assembled, before restoration, largest, variable, at corners, rounded, and framed

I think that before modeling, you need to study the data more carefully. So I'm telling you from experience, people: these are NOT parts.

@azaroth42

azaroth42 commented Aug 10, 2016

Image, Sheet and Mount are all parts (right?), so I would model that as:

_:Object hasPart _:ISM .
_:ISM hasPart _:Image, _:Sheet, _:Mount ;
          height _:HeightForISM .

Or if the Image is part of the Sheet, the appropriate nesting of those two.

Can you point me to a definition of "Sight"? Is it that the measurements were estimated by sight, rather than taken with a tool? Then yes, that requires a Measurement to express how the measurement was done, rather than what the measurement is of.

Re has_domain ... why not just use RDF reification? That seems to be what has been reinvented.

@VladimirAlexiev

VladimirAlexiev commented Aug 10, 2016

We don't even know whether "Image/Sheet/Mount" means these together (AND) or some of them (OR).

Yes, Sight means "by eye".

Re has_domain: yes, in BM & CONA we used reification, but the "CRM reification" kind, which is E13_Attribute_Assignment.

@azaroth42

azaroth42 commented Aug 10, 2016

If you can't tell what it means, you can't model it. Unless you intend to ask Patricia to add aat:ImageAndOrSheetAndOrMount and just move the problem to someone else. We should find out what it means from the people who can answer the question.

How the information was obtained always requires reification of the relationship, so for the qualifiers I agree we need another node. That shouldn't complicate the general case however, otherwise we end up reifying everything for every object.

@workergnome

workergnome commented Aug 10, 2016

I would prefer a single method, even if the complications are rarely used, rather than a general case and a special case. Mostly because when using the data I will either have to look for both options, or I'll end up ignoring all the special cases.

Maybe that's unavoidable, but it certainly makes the data harder to use.

@azaroth42

azaroth42 commented Aug 10, 2016

I would also prefer a single method! The end result is that you end up reifying everything into millions of E13_Attribute_Assignments (or preferably, just, rdf:Statements) so you can record who said it, when, and why. That makes the data unusably complicated and no one looks for anything, ignoring all the cases not just the special ones :(

The million dollar (or hopefully not quite) question is how special is the special case? Thankfully we have data: (294+11=) 395 / 32959, or 1.2%. So you make the lives of everyone more complicated in 99% of the cases, for that final 1%. The cost of doing the 1% differently outweighs the cost of doing 99% consistently with it, in my opinion.

In terms of usage, I like the notion of "Ask forgiveness, not permission". In other words, try the 99% way, and only if that fails try the 1% way(s). That gives you scalability: simple applications can do something useful and add the special cases later, rather than only very sophisticated applications being able to do anything at all.
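
A quick check of the percentage quoted above. The sum as written (294+11) comes to 305 rather than 395, so both candidates are computed here; which of the two the pivot table supports cannot be recovered from this thread. The function name `share` is an assumption.

```python
TOTAL_ROWS = 32959  # denominator quoted above


def share(count, total=TOTAL_ROWS):
    """Fraction of rows that fall into the special case."""
    return count / total


for count in (294 + 11, 395):
    print(count, f"{share(count):.2%}")
# 305 -> 0.93%, 395 -> 1.20%
```

Either way the special case stays at roughly 1% of the rows.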

@workergnome

workergnome commented Aug 10, 2016

I agree entirely with all of that. I think what I'm trying to figure out is at what point in the pipeline we apply that simplification, necessarily throwing out information. I see the pipeline we're talking about as:

Raw information -> Data Model -> API -> Application -> End User

or, more concretely:


Institutional Data Dump
which is transformed by Karma into
CIDOC-CRM RDF Files
which are loaded into a Triplestore, then SPARQLed into
JSON-LD Entity documents
which are read by the Browse application code, and transformed into
AAC Browse Website
which is read by
All Y'all.


If the simplification happens at the information -> data model step, the following steps can be much simpler, but it means that the whole pipeline is used for that one process. I'm OK with that, but I think that the goals of the AAC are bigger than just the use case of the browse application.

If the simplification is happening at the API -> application level, it's a pain in the butt for developers to work with the complexity of the model, and nothing ever gets built.

If we add a level of indirection between the data model and the application and simplify the data at the Data Model -> API, we can provide a nice, concise access point to information, and still preserve the ability to extend the API as needs occur. It's overkill for any one project, but it's probably a "good practice" for the project as a whole.
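
As a rough illustration of simplifying at the Data Model -> API step, the sketch below flattens grouped measurements into a concise JSON document while keeping the element grouping. The document shape, the key names and the function `api_document` are all hypothetical; nothing here is an agreed AAC format.

```python
import json


def api_document(object_id, display, measurements):
    """Build a concise API document.

    measurements: {element: {dimension type: value in cm}}
    """
    return {
        "id": f"npg/object/{object_id}",
        "dimension_statement": display,
        "dimensions": [
            {
                "element": element,
                **{f"{kind.lower()}_cm": round(value, 2) for kind, value in dims.items()},
            }
            for element, dims in measurements.items()
        ],
    }


doc = api_document(
    23,
    'Mat: 71.1 x 55.9cm (28 x 22")',
    {"Mat": {"Height": 71.12014224, "Width": 55.88011176}},
)
print(json.dumps(doc, indent=2))
```

The full CRM graph stays in the triplestore; the API document is one view of it that an application can read without knowing about E16_Measurement.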

@azaroth42

azaroth42 commented Aug 10, 2016

We're vastly off topic now but ...

My preference is that the internal data (e.g. in the TripleStore) and the published data (via the API) are as close as possible. Preferably the API is "all the information the client needs to use this resource in JSON-LD". If so, then exactly how the system maintains the information is irrelevant if the API is just a particular graph boundary.

Otherwise we need profile-based content negotiation and to pick a default representation -- in other words, the client needs to say whether it wants the data post or pre transformation into the API structure. I would anticipate that the default would be the API structure, as by definition it's more useful. And then I would anticipate no one really ever using the non-API data ... so having the API be LOD would be good... and hence having the two be as close as possible.

@VladimirAlexiev

VladimirAlexiev commented Aug 11, 2016

Elements

@azaroth42 Your calculation "1.2%" is flawed, since you ignored some rows of my pivot, haven't seen other AAC museum data, and ignored the examples I gave from CONA and CCO. In cultural data, the exception is usually the rule.

I propose to model TMS "Element" as an extension prop crmx:P2_extent because from my experience with museum data, that cannot be modeled cleanly as "part". "Extent" is defined in CONA and LIDO, and is also used as a target for: material/technique/implement, contribution, subject.

How would you model this real example from CONA: "the St. Peter's Basilica has height of dome above street level = 138m"? Ignore "dome" (drop data) or model it as a part (wrong)? The dome is a part, but the measurement is of drum+dome, not dome alone.

What will you say about "pattern repeat", "laid lines" or "center back"? These are CRM Features not Parts, the first two are repeating features (CRM has no such concept), and there's no way to tell Features apart from other Elements... unless you include tons of specific coding in the mapping.

Dimension Types

Schema does not have all the weird dimensions you'll find in museums, eg "die axis" and "o'clock" for coins, circumference vs diameter, etc. That's why you need CRM's Dimension.

Dimension Units

AAT <size/dimensions by unit> includes about 25 units.

QUDT includes about 800 units, including conversion rules. But it's focused on science/engineering and doesn't include all AAT units. I introduced QUDT to BM and it's used there: but again, museums record weird and wonderful things, and you won't find them all in QUDT.

Examples (some of these are not in AAT either, but we can add them through Patricia):

  • Carats. There are two different kinds: purity of gold vs weight of jewels.
  • Various dimensionless (yet different!) units: count, pieces, pairs, pixels.
  • Unusual time units: centuries, millennia.
  • Obsolete currencies: Holland Florins (Rembrandt's Badende Susanna was sold at auction for 120 HFL if memory serves right).

Of course, we can tie up these extra units into the QUDT framework (eg to state that Pixels is dimensionless). But so far I haven't seen a use case for calculations with dimensions.
Of course, QUDT is not the end-all of scientific dimensions. Eg see http://ci.emse.fr/multidimensional-quantity/

@bsnikhila bsnikhila closed this Apr 30, 2017
