xml2kvp: split multivalued values to multiple fields #222

Open
ghukill opened this issue Jun 26, 2018 · 1 comment

ghukill commented Jun 26, 2018

Concatenating multivalued fields into a single value has already been addressed; this issue proposes instead to create multiple fields from a multivalued value, splitting it with numeric notation.

e.g. mods_subject_topic : ['horse','goober','tronic'], would convert to three fields:

'mods_subject_topic_0' : 'horse',
'mods_subject_topic_1' : 'goober',
'mods_subject_topic_2' : 'tronic',

The same could perhaps be done for a single string value, by splitting it on a delimiter.
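A minimal sketch of the numeric split described above, in plain Python. The function name `split_multivalued` and its `fields` and `sep` parameters are hypothetical, made up for illustration; none of this is existing XML2kvp API.

```python
def split_multivalued(kvp, fields=None, sep="_"):
    """Hypothetical helper (not part of XML2kvp): expand list/tuple values
    into numbered fields, e.g. 'topic' -> 'topic_0', 'topic_1', ...

    fields=None splits every multivalued field; otherwise only the
    field names listed in `fields` are split.
    """
    out = {}
    for key, value in kvp.items():
        should_split = fields is None or key in fields
        if should_split and isinstance(value, (list, tuple)):
            # One new field per value, suffixed with its position
            for i, item in enumerate(value):
                out[f"{key}{sep}{i}"] = item
        else:
            # Single values (and fields not selected) pass through unchanged
            out[key] = value
    return out


record = {
    "mods_subject_topic": ["horse", "goober", "tronic"],
    "mods_title": "Example",  # made-up single-valued field for contrast
}
split_record = split_multivalued(record)
```

Passing `fields=["mods_subject_topic"]` would limit the split to that one field, which is one way the "boolean for all, or an array of fields" choice could be exposed.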


ghukill commented Jul 3, 2018

This touches on a closed issue (as mentioned there as well): #44.

The problem there was "blocks" of elements that were related based on their nesting, e.g.:

<mods:subject>
    <mods:topic>Labor unions</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Saginaw</mods:geographic>
    <mods:temporal>1800-1810</mods:temporal>
</mods:subject>
<mods:subject>
    <mods:topic>Strikes</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Hillsdale</mods:geographic>
    <mods:temporal>1930-1940</mods:temporal>
</mods:subject>

In this example, it is the relationship of all siblings under a <mods:subject> that would be helpful to maintain.

A default XML2kvp parse, removing namespace prefixes, would result in:

In [2]: XML2kvp.xml_to_kvp(test_xml, remove_ns_prefix=True)
Out[2]: 
{'root_subject_geographic': ('Michigan', 'Saginaw', 'Hillsdale'),
 'root_subject_temporal': ('1800-1810', '1930-1940'),
 'root_subject_topic': ('Labor unions', 'Strikes')}

The problem here is that the elements grouped under each <mods:subject> are scattered across separate fields, with little ability to relate them at a glance. Ideally, we could generate a new field, concatenating values from the others, that would look something like:

{'root_subject':['Labor unions--Michigan--Saginaw--1800-1810', 'Strikes--Michigan--Hillsdale--1930-1940']} 

We lose the knowledge that Saginaw is geographic, or that 1800-1810 is temporal, but that particular string has value in other contexts, and we could keep those other, further parsed fields as well.
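As a rough sketch of that concatenation, here is one way to build the per-block strings straight from the XML with the standard library, outside of XML2kvp. The wrapping `<root>` element, the `concat_blocks` name, and the `--` delimiter default are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# The example from above, wrapped in an assumed <root> element
test_xml = f"""<root xmlns:mods="{MODS_NS}">
<mods:subject>
    <mods:topic>Labor unions</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Saginaw</mods:geographic>
    <mods:temporal>1800-1810</mods:temporal>
</mods:subject>
<mods:subject>
    <mods:topic>Strikes</mods:topic>
    <mods:geographic>Michigan</mods:geographic>
    <mods:geographic>Hillsdale</mods:geographic>
    <mods:temporal>1930-1940</mods:temporal>
</mods:subject>
</root>"""


def concat_blocks(xml_string, block_tag, delimiter="--"):
    """Hypothetical helper: one concatenated string per block element,
    joining the text of its direct children in document order."""
    root = ET.fromstring(xml_string)
    values = []
    for block in root.iter(f"{{{MODS_NS}}}{block_tag}"):
        parts = [
            child.text.strip()
            for child in block
            if child.text and child.text.strip()
        ]
        values.append(delimiter.join(parts))
    return values


result = {"root_subject": concat_blocks(test_xml, "subject")}
```

This only joins direct children of each block; deeper nesting would need a decision about whether to flatten or recurse.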

One thought has been to offer splitting of a field when it is multivalued (#222). If this were a boolean applying to all fields, or an array of fields to split, you might get something like:

{'root_subject0_topic0': ('Labor unions'),
 'root_subject1_topic1': ('Strikes'),
 'root_subject0_geographic0': ('Michigan'),
 'root_subject1_geographic1': ('Michigan'),
 'root_subject0_geographic2': ('Saginaw'),
 'root_subject1_geographic3': ('Hillsdale'),
 'root_subject0_temporal0': ('1800-1810'),
 'root_subject1_temporal1': ('1930-1940')}

The trick would be to "collapse" these field names with indexes into something useful...
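One possible way to do that collapsing, sketched under assumptions: field names follow the `<parent><n>_<child><m>` shape shown above, and the dict preserves document order within each block. The `collapse_indexed` name and the regex are hypothetical, not XML2kvp behavior.

```python
import re
from collections import defaultdict

# Assumed key shape: <parent><parent index>_<child><child index>
KEY_PATTERN = re.compile(
    r"^(?P<parent>.+?)(?P<pidx>\d+)_(?P<child>.+?)(?P<cidx>\d+)$"
)


def collapse_indexed(kvp, delimiter="--"):
    """Hypothetical helper: group indexed fields by parent index and join
    each group's values into one string per parent block."""
    groups = defaultdict(list)
    for key, value in kvp.items():
        match = KEY_PATTERN.match(key)
        if match is None:
            continue  # leave non-indexed fields out of the collapse
        groups[(match["parent"], int(match["pidx"]))].append(value)

    collapsed = defaultdict(list)
    # Sort by (parent name, parent index) so blocks come out in order
    for (parent, _idx), values in sorted(groups.items()):
        collapsed[parent].append(delimiter.join(values))
    return dict(collapsed)


indexed = {
    "root_subject0_topic0": "Labor unions",
    "root_subject1_topic1": "Strikes",
    "root_subject0_geographic0": "Michigan",
    "root_subject1_geographic1": "Michigan",
    "root_subject0_geographic2": "Saginaw",
    "root_subject1_geographic3": "Hillsdale",
    "root_subject0_temporal0": "1800-1810",
    "root_subject1_temporal1": "1930-1940",
}
collapsed = collapse_indexed(indexed)
```

The parent index carries the grouping here; the child index is parsed but unused, which suggests the split might only need to number the parent element.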
