Various metadata-related fixes and improvements. #42

Closed
wants to merge 6 commits into from
57 changes: 56 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,62 @@
# AUTOTYP (in progress)
# AUTOTYP 1.0.1

This is a bugfix release that focuses on JSON output and improving metadata for variables of type
value list. Notable changes:

- improved JSON output
- improved and corrected the metadata for multiple variables of the type value list
- improved the bibliography data, added Glottolog language and reference IDs (many thanks to
Robert Forkel for doing this work)
- minor data fixes (duplicate entries in datasets `Alienability`, `Gender` and `NumeralClassifiers`)

Many thanks to Robert Forkel for reporting most of these issues and cleaning up the bibliography
files!

Detailed changes:

- Fixed the DOI badge (now points to last released version 1.0.0)
- Added data type `logical` to the list of valid variable types
- Clarified that `value-list` is not actually a list
- Fixed an issue with JSON export where missing values were silently dropped
by the serializer; they are now exported as `null`
- If a value list variable has no values (all missing), the JSON value list metadata
is now serialized as an empty dictionary `{}` for consistency
- `NPStructurePresence` is no longer classified as a `PerLanguageSummaries` dataset
- `LID` field was sometimes serialized as string, fixed
- Missing glottocodes were sometimes serialized as explicit "NA" string, fixed
- Removed duplicate data entries from `Alienability`
- Removed duplicate data entries from `Gender`
- Removed duplicate data entries from `NumeralClassifiers`
- Added maps illustrating the geographical breakdown (by continent and area)
- Improved the bibliography data, added Glottolog language and reference IDs (many thanks to
Robert Forkel for doing this work)
- Multiple metadata fixes:
- Added value list descriptions for `PhonologicalFusion::FusionBinned6` and all variables that
rely on it (such as `GrammaticalMarkers::MarkerFusionBinned6`)
- Added value list descriptions for `PositionalBehavior::MarkerBehaviorBinned4` and all variables
that rely on it (such as `GrammaticalMarkers::MarkerBehaviorBinned4`)
- Value list description for `LocusOfMarking::LocusOfMarkingBinned5` was missing the value
'FloatingorClitic', fixed (this also fixes all the variables that rely on it, such as
`GrammaticalMarkers::LocusOfMarkingBinned5`)
- Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned4`
- Fixed value list description for `GrammaticalMarkers::MarkerPositionBinned5`
- Fixed data type of `GrammaticalMarkers::MarkerExpressesMultipleCategories` to be `logical`
- Added value list descriptions for `ClauseLinkage::IntuitiveClassification`, value "?" is now
recoded as NA (missing)
- Added value list descriptions for multiple fields in `ClauseLinkage` where they were missing.
The fields are: `AnticipatoryArgumentMarking`, `CataphoraConstraints`, `CategoricalSymmetry`,
`ClauseLayer`, `ClausePosition`, `Embedding`, `ExtractionConstraints`, `FinitenessSimplified`,
`FocusMarkingInDependent`, `FocusMarking`, `IllocutionaryMarking`, `IllocutionaryScope`,
`InterpropositionalSemanticRelation`, `ReferenceTrackingSystem`, `TenseMarking` and
`TenseScope`
- Fixed the value list description for `ClauseWordOrder::WordOrderAPLex`
- Fixed the value list description for `SemanticClass::SemanticClassBinned`
- Removed invalid values from `GrammaticalRelationsRaw::SelectedArguments::SemanticCondition`
- Fixed the value list description for `Register::OriginContinent`
- Computed variables in `GrammaticalMarkersPerLanguage` now have correct value list metadata
- Computed variables in `LocusOfMarkingPerLanguage` now have correct value list metadata
- Computed variables `MorphologyPerLanguage::HasAny*` are now correctly annotated as logical
- Computed variables `NPStructurePerLanguage::NPHas*` are now correctly annotated as logical
- `NPStructurePerLanguage::NPStructureID` is now correctly annotated as integer
- Computed variables in `VerbInflection*` summary datasets now have correct value list metadata

39 changes: 23 additions & 16 deletions aggregation-scripts/Alignment.R
@@ -984,7 +984,17 @@ GR_roles <- GR_roles %>%
filter(!SelectorID %in% no_agreement_ID) %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
select(LID, Glottocode, Language, everything())
select(LID, Glottocode, Language, everything()) %>%
# drop unused factor levels
mutate(
ReferentialCondition = fct_drop(ReferentialCondition),
CoargumentAtr = fct_drop(CoargumentAtr),
CoargumentP = fct_drop(CoargumentP),
ClauseRankCondition = fct_drop(ClauseRankCondition),
CategoryCondition = fct_drop(CategoryCondition),
SyntacticDomainCondition = fct_drop(SyntacticDomainCondition),
PolarityCondition = fct_drop(PolarityCondition)
)
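As a rough illustration of what "drop unused factor levels" means here: subsetting or filtering a factor keeps its full level set, so levels with no remaining observations linger unless they are dropped explicitly. The sketch below uses base R's `droplevels()` as a stand-in for `forcats::fct_drop()`; the factor values are made up for the example and are not taken from the dataset.

```r
# Toy factor; the values are illustrative only.
condition <- factor(c("any", "main", "dependent", "main"))

# Filtering rows keeps the original level set, including levels
# that no longer occur in the data.
kept <- condition[condition != "dependent"]
levels_before <- levels(kept)

# droplevels() removes levels without observations, which is what
# forcats::fct_drop() does for a single factor.
cleaned <- droplevels(kept)
levels_after <- levels(cleaned)

print(levels_before)
print(levels_after)
```

Dropping levels this way changes only the level set; the observed values themselves are untouched.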


alignments <- alignments %>%
@@ -1000,27 +1010,24 @@ alignments <- alignments %>%
) %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
select(LID, Glottocode, Language, everything())
select(LID, Glottocode, Language, everything()) %>%
# drop unused factor levels
mutate(
ReferentialCondition = fct_drop(ReferentialCondition),
CoargumentAtr = fct_drop(CoargumentAtr),
CoargumentP = fct_drop(CoargumentP),
ClauseRankCondition = fct_drop(ClauseRankCondition),
CategoryCondition = fct_drop(CategoryCondition),
SyntacticDomainCondition = fct_drop(SyntacticDomainCondition),
PolarityCondition = fct_drop(PolarityCondition)
)


AlignmentForDefaultPredicatesPerLanguage <- AlignmentForDefaultPredicatesPerLanguage %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
select(LID, Glottocode, Language, everything())


fix_metadata_levels <- function(desc, values) {
values <- as.character(unique(unlist(values)))
values <- values[!is.na(values)]
if(!all(values %in% desc$levels$level)) {
unknown_levels <- setdiff(values, desc$levels$level)
arg <- caller_arg(values)
cli::cli_abort("unknown values in {arg}: {unknown_levels}")
}
desc$levels <- filter(desc$levels, level %in% values)
desc
}
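`fix_metadata_levels` prunes a descriptor's level table to the values that actually occur in the data and aborts when it meets a value the descriptor does not know. A minimal base-R sketch of the same idea follows, assuming a descriptor is a plain list with a `levels` data frame. The name `prune_levels` and the toy descriptor are hypothetical, and `stop()` stands in for `cli::cli_abort()`.

```r
# Hypothetical base-R counterpart of fix_metadata_levels().
prune_levels <- function(desc, values) {
  # Collect the distinct non-missing values as character strings.
  values <- as.character(unique(unlist(values)))
  values <- values[!is.na(values)]

  # Refuse values that the descriptor does not document.
  unknown <- setdiff(values, desc$levels$level)
  if (length(unknown) > 0) {
    stop("unknown values: ", paste(unknown, collapse = ", "))
  }

  # Keep only the documented levels that are actually used.
  desc$levels <- desc$levels[desc$levels$level %in% values, , drop = FALSE]
  desc
}

# Toy descriptor with three documented levels.
desc <- list(
  levels = data.frame(
    level = c("a", "b", "c"),
    description = c("level A", "level B", "level C"),
    stringsAsFactors = FALSE
  )
)

# Only "a" and "c" occur; the missing value is ignored.
pruned <- prune_levels(desc, factor(c("a", "c", NA)))
print(pruned$levels$level)

# An undocumented value raises an error instead of being dropped silently.
failed <- inherits(
  tryCatch(prune_levels(desc, "z"), error = function(e) e),
  "error"
)
print(failed)
```

Failing loudly on unknown values means a mismatch between data and metadata surfaces at build time rather than in the exported files.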


descriptor <- describe_data(
ptype = tibble(),
description = "
17 changes: 17 additions & 0 deletions aggregation-scripts/GrammaticalMarkers.R
@@ -97,6 +97,7 @@ GrammaticalMarkersPerLanguage <- GrammaticalMarkers %>%




descriptor <- describe_data(
ptype = tibble(),
description = "
@@ -111,12 +112,28 @@ descriptor <- describe_data(
new_variables %>%
rowwise() %>%
group_map(~ {
# build the descriptor
descriptor <- .metadata$GrammaticalMarkers$fields[[.$Variable]]
descriptor$description <- format_inline(
"Value of `GrammaticalMarkers::{.$Variable}` for exemplar {.q {.$MarkerExemplar}}"
)
descriptor$computed <- "GrammaticalMarkers.R"
descriptor


# fix factors
if(is.factor(descriptor$ptype)) {
descriptor <- fix_metadata_levels(
descriptor,
GrammaticalMarkersPerLanguage[[.$NewVariable]]
)
GrammaticalMarkersPerLanguage[[.$NewVariable]] <<- factor(
as.character(GrammaticalMarkersPerLanguage[[.$NewVariable]]),
levels = levels(descriptor$ptype)
)
}

descriptor
}) %>% set_names(new_variables$NewVariable)
)
)
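The "fix factors" step rebuilds each column with `factor(as.character(x), levels = ...)`. The sketch below shows what that idiom does in general, using made-up values and a hypothetical `target_levels` vector; it makes no claim about what `levels(descriptor$ptype)` contains in the script.

```r
# A factor carrying more levels than it uses, in an arbitrary order.
x <- factor(c("pre", "post", "pre"),
            levels = c("post", "pre", "in", "unused"))

# Hypothetical target level set, e.g. taken from cleaned metadata.
target_levels <- c("pre", "post")

# Round-tripping through character resets the level set and its order
# while leaving the observed values unchanged.
y <- factor(as.character(x), levels = target_levels)
print(levels(y))

# Caveat: any value that is absent from the target levels becomes NA.
z <- factor(as.character(x), levels = "pre")
print(sum(is.na(z)))
```

Because out-of-set values turn into `NA` without a warning, it helps to validate the data against the level set first, as `fix_metadata_levels` does.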
29 changes: 26 additions & 3 deletions aggregation-scripts/LocusOfMarking.R
@@ -159,7 +159,10 @@ MarkingPerMicrorelation <- LocusOfMarkingPerMicrorelation %>%
names_from=RoleCatLabel,
values_from=c(LocusOfMarking, LocusOfMarkingBinned5, LocusOfMarkingBinned6),
names_glue = "{.value}For{RoleCatLabel}",
values_fn = function(x) str_flatten(unique(x), "/"),
values_fn = function(x) {
x <- unique(x)
if(length(x) > 1) "multiple" else x
},
values_fill = NA
)
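The `values_fn` change replaces slash-joined combinations such as `"H/D"` with a single `"multiple"` value, which keeps the set of possible levels closed. A small base-R sketch of the two behaviours follows. The function names `collapse_old` and `collapse_new` and the example values are hypothetical, and `paste(..., collapse = "/")` is used as an approximation of `str_flatten()` for non-missing input.

```r
# Old behaviour (approximated): join the distinct values with "/".
collapse_old <- function(x) paste(unique(x), collapse = "/")

# New behaviour: a single distinct value passes through unchanged,
# anything else collapses to the fixed label "multiple".
collapse_new <- function(x) {
  x <- unique(x)
  if (length(x) > 1) "multiple" else x
}

same  <- c("H", "H")
mixed <- c("H", "D")

print(collapse_old(mixed))
print(collapse_new(same))
print(collapse_new(mixed))

# Note: as written, NA counts as a distinct value in unique(),
# so a mix of a value and NA also collapses to "multiple".
print(collapse_new(c("H", NA)))
```

With the old approach every new combination of values created a new level; with the fixed label, the level description only needs one extra entry.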

@@ -175,7 +178,6 @@ LocusOfMarkingPerLanguage <- inner_join(
arrange(LID, Language)



# TODO: improve this
descriptor <- describe_data(
ptype = tibble(),
@@ -184,11 +186,32 @@ descriptor <- describe_data(
fields = c(
.metadata$Register$fields[c("LID", "Language", "Glottocode")],
map(setdiff(names(LocusOfMarkingPerLanguage), c("LID", "Language", "Glottocode")), ~ {
describe_data(
descriptor <- describe_data(
ptype = if(is.logical(LocusOfMarkingPerLanguage[[.]])) logical() else factor(),
computed = "LocusOfMarking.R",
description = "<pending>"
)

# fix factors
if(is.factor(descriptor$ptype)) {
# variable name
var <- gsub("For.+$", "", .)

dd <- .metadata$LocusOfMarkingPerMicrorelation$fields$LocusOfMarking$element$fields[[var]]
# abort() does not interpolate "{var}", so build the message explicitly
!is_null(dd) || abort(paste0("Unknown variable ", var))

descriptor$levels <- add_row(dd$levels,
level = "multiple", description = "multiple different loci"
)
descriptor <- fix_metadata_levels(descriptor, LocusOfMarkingPerLanguage[[.]])

LocusOfMarkingPerLanguage[[.]] <<- factor(
as.character(LocusOfMarkingPerLanguage[[.]]),
levels = levels(descriptor$ptype)
)
}

descriptor
}) %>% set_names(setdiff(names(LocusOfMarkingPerLanguage), c("LID", "Language", "Glottocode")))
)
)
18 changes: 9 additions & 9 deletions aggregation-scripts/MorphologyPerLanguage.R
@@ -393,56 +393,56 @@ descriptor <- describe_data(
"
),
HasAnyPrefixes = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are prefixes (restricted preposed formatives) present in the language"
),
HasAnySuffixes = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are suffixes (restricted postposed formatives) present in the language"
),
HasAnyInfixes = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are infixes (restricted interposed formatives) present in the language"
),
HasAnyProclitics = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "
Are proclitics (unrestricted or semirestricted preposed formatives) present
in the language
"
),
HasAnyEnclitics = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "
Are enclitics (unrestricted or semirestricted postposed formatives) present
in the language
"
),
HasAnyEndoclitics = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "
Are endoclitics (unrestricted or semirestricted interposed formatives) present
in the language
"
),
HasAnyPreposedFormatives = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are any preposed formatives present in the language"
),
HasAnyPostposedFormatives = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are any postposed formatives present in the language"
),
HasAnyInterposedFormatives = describe_data(
ptype = integer(),
ptype = logical(),
computed = "MorphologyPerLanguage.R",
description = "Are any interposed formatives present in the language"
),
20 changes: 10 additions & 10 deletions aggregation-scripts/NPStructurePerLanguage.R
@@ -31,6 +31,8 @@ to_camel_case <- function(x) {
}




# ███████╗██╗ ██╗███╗ ███╗███╗ ███╗ █████╗ ██████╗ ██╗ ██╗
# ██╔════╝██║ ██║████╗ ████║████╗ ████║██╔══██╗██╔══██╗╚██╗ ██╔╝
# ███████╗██║ ██║██╔████╔██║██╔████╔██║███████║██████╔╝ ╚████╔╝
@@ -354,9 +356,8 @@ NPStructurePresence <- NPStructure %>%
left_join(head_macrosem_constraints_presence, by = c("LID", "NPStructureID")) %>%
# add glottocodes
left_join(select(Register, LID, Glottocode), by = "LID") %>%
select(LID, Glottocode, Language, everything()) %>%
arrange(LID, Language)

select(LID, Glottocode, Language, NPStructureID, everything()) %>%
arrange(LID, Language, NPStructureID)


descriptor <- describe_data(
@@ -387,7 +388,7 @@ descriptor <- describe_data(
"
),
NPHasGovernment = describe_data(
ptype = integer(),
ptype = logical(),
computed = "NPStructurePerLanguage.R",
description = "
NPs with some kind of marker which is governed/assigned by the head
@@ -421,7 +422,7 @@ descriptor <- describe_data(
"
),
NPHasAdjGovernment = describe_data(
ptype = integer(),
ptype = logical(),
computed = "NPStructurePerLanguage.R",
description = "
Adjective attribution with some kind of marker which is governed/assigned
@@ -438,22 +439,21 @@ descriptor <- describe_data(

export_dataset("NPStructurePerLanguage", NPStructurePerLanguage, descriptor, c("PerLanguageSummaries", "NP"))



descriptor <- describe_data(
ptype = tibble(),
description = "Per-language presence of NP properties",
computed = "NPStructurePerLanguage.R",
fields = c(
.metadata$Register$fields[c("LID", "Language", "Glottocode")],
map(names(NPStructurePresence)[-(1:3)], ~ {
list(NPStructureID = .metadata$NPStructure$fields$NPStructureID),
map(names(NPStructurePresence)[-(1:4)], ~ {
describe_data(
ptype = logical(),
computed = "NPStructurePerLanguage.R",
description = "<pending>"
)
}) %>% set_names(names(NPStructurePresence)[-(1:3)])
}) %>% set_names(names(NPStructurePresence)[-(1:4)])
)
)

export_dataset("NPStructurePresence", NPStructurePresence, descriptor, c("PerLanguageSummaries", "NP"))
export_dataset("NPStructurePresence", NPStructurePresence, descriptor, "NP")