Open
Labels
enhancement: New feature or request
Description
Use profile data statements to select good examples. This requires the profile metadata.
Use some of the features identified in the paper "GDEX: Automatically finding good dictionary examples in a corpus":
- Sentence length: a sentence between 10 and 25 words long was preferred, with longer and shorter ones penalized.
- Word frequencies: a sentence was penalized for each word that was not amongst the commonest 17,000 words in the language, with a further penalty applied for rare words.
- Sentences containing pronouns and anaphors like ‘this’, ‘that’, ‘it’ or ‘one’ often fail to present a self-contained piece of language which makes sense without further context, so sentences containing these words were penalized.
- Sentences where the target collocation is in the main clause were preferred (using heuristics to guess where the main clause begins and ends, as we do not yet use a parser).
- Whole sentences, identified as beginning with a capital letter and ending with a full stop or exclamation mark, were preferred.
- Sentences with ‘third collocates’, that is, words that occurred with high salience in sentences containing the node and primary collocate, were preferred.
- We note that good examples often first introduce a context, and then contain the collocation which, to speak figuratively, fits into the space that the context has created for it: this is helpful as a user who is unsure of the meaning of the collocation will be able to make inferences about what it must be from the context in which it appears. In sentences having this structure, the collocation is likely to be towards the end of the sentence. Sentences with the target collocation towards the end were given credit.
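
To make the discussion concrete, here is a minimal sketch of how a few of the GDEX features listed above (sentence length, word frequency, anaphors, whole-sentence check, collocation position) might be combined into one score. It is not the GDEX implementation. The function names `tokenize` and `gdex_score`, the one-point weights, and the `common_words` argument (standing in for the 17,000-word frequency list) are all assumptions made for illustration. The main-clause heuristic, third collocates, and the extra rare-word penalty are left out.

```python
import re

# Words the GDEX description singles out as context-dependent.
ANAPHORS = {"this", "that", "it", "one"}


def tokenize(sentence):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", sentence.lower())


def gdex_score(sentence, collocation, common_words, min_len=10, max_len=25):
    """Score a candidate example; higher is better.

    All weights are illustrative assumptions, not values from GDEX.
    `collocation` is a sequence of lowercase words (node and collocate).
    """
    tokens = tokenize(sentence)
    n = len(tokens)
    score = 0.0

    # Sentence length: one point off per word outside the preferred range.
    if n < min_len:
        score -= min_len - n
    elif n > max_len:
        score -= n - max_len

    # Word frequency: one point off per word missing from the common list.
    score -= sum(1 for t in tokens if t not in common_words)

    # Anaphors: one point off per occurrence.
    score -= sum(1 for t in tokens if t in ANAPHORS)

    # Whole sentence: starts with a capital, ends with "." or "!".
    text = sentence.strip()
    if text[:1].isupper() and text.endswith((".", "!")):
        score += 1

    # Position: credit in [0, 1], growing as the collocation nears the end.
    positions = [tokens.index(w) for w in collocation if w in tokens]
    if collocation and len(positions) == len(collocation) and n > 1:
        score += max(positions) / (n - 1)

    return score
```

Candidates could then be ranked with `sorted(sentences, key=lambda s: gdex_score(s, ("heavy", "rain"), common_words), reverse=True)`. The additive form keeps each feature easy to tune or switch off independently.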
Possibly also use parse time, ambiguity or score as a measure of complexity and prefer easy examples.
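
A rough sketch of the "prefer easy examples" idea, assuming each candidate already carries a parse time and a count of analyses taken from the profile. The `ParsedExample` class and its field names are hypothetical; the actual profile schema is not shown in this issue.

```python
from dataclasses import dataclass


@dataclass
class ParsedExample:
    text: str
    parse_time: float  # seconds; hypothetical field name
    readings: int      # number of analyses, used as an ambiguity measure


def easiest_first(examples):
    """Order examples so that less ambiguous, faster-parsing ones come first."""
    return sorted(examples, key=lambda e: (e.readings, e.parse_time))
```

Sorting on the tuple `(readings, parse_time)` treats ambiguity as the primary signal and parse time as a tie-breaker; the opposite order, or a weighted sum, would be equally easy to try.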