Select better examples #41

@fcbond

Description

Use profile data statements to select good examples -- requires the profile metadata

Use some of the features identified in "GDEX: Automatically finding good dictionary examples in a corpus":

  • Sentence length: a sentence between 10 and 25 words long was
    preferred, with longer and shorter ones penalized.
  • Word frequencies: a sentence was penalized for each word that was not
    amongst the commonest 17,000 words in the language, with a further penalty
    applied for rare words.
  • Sentences containing pronouns and anaphors like "this", "that", "it" or "one"
    often fail to present a self-contained piece of language which makes sense
    without further context, so sentences containing these words were penalized.
  • Sentences where the target collocation is in the main clause were
    preferred (using heuristics to guess where the main clause begins and ends, as
    we do not yet use a parser)
  • Whole sentences, identified as beginning with a capital letter and
    ending with a full stop, exclamation mark, or question mark, were preferred.
  • Sentences with ‘third collocates’, that is, words that occurred with high
    salience in sentences containing the node and primary collocate, were
    preferred.
  • We note that good examples often first introduce a context, and then
    contain the collocation which, to speak figuratively, fits into the space that the
    context has created for it: this is helpful as a user who is unsure of the meaning
    of the collocation will be able to make inferences about what it must be from
    the context in which it appears. In sentences having this structure, the
    collocation is likely to be towards the end of the sentence. Sentences with the
    target collocation towards the end were given credit.
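
As a rough illustration of how a few of these features could be combined, here is a minimal sketch in Python. It covers only sentence length, word frequency, anaphors, the whole-sentence check and target position. The function name `gdex_score`, the `ANAPHORS` list, the regex tokenizer and all the weights are assumptions made for this sketch; they are not taken from GDEX or from this project.

```python
import re

# Assumption: a small closed list standing in for "pronouns and anaphors".
ANAPHORS = {"this", "that", "it", "one"}


def tokenize(sentence):
    """Split into lowercase word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def gdex_score(sentence, target, common_words, min_len=10, max_len=25):
    """Return a heuristic score in [0, 1]; higher means a better example.

    All weights below are illustrative guesses, not values from GDEX.
    """
    words = tokenize(sentence)
    if not words:
        return 0.0
    score = 1.0

    # Sentence length: penalize each word outside the preferred range.
    n = len(words)
    if n < min_len:
        score -= 0.05 * (min_len - n)
    elif n > max_len:
        score -= 0.05 * (n - max_len)

    # Word frequency: penalize each word missing from the common-word list.
    score -= 0.1 * sum(1 for w in words if w not in common_words)

    # Anaphors: penalize words that usually need outside context.
    score -= 0.1 * sum(1 for w in words if w in ANAPHORS)

    # Whole sentence: initial capital and final ., ! or ?
    stripped = sentence.strip()
    if not (stripped[:1].isupper() and stripped[-1:] in ".!?"):
        score -= 0.3

    # Target position: small credit if the target is in the second half.
    target = target.lower()
    if target in words and words.index(target) / n >= 0.5:
        score += 0.1

    # Clamp to the [0, 1] range.
    return max(0.0, min(1.0, score))
```

Candidates could then be ranked with `sorted(sentences, key=lambda s: gdex_score(s, target, common_words), reverse=True)`. The main-clause and third-collocate features are left out because they need syntactic or collocation data that this sketch does not assume.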

Possibly also use parse time, ambiguity or score as a measure of complexity and prefer easy examples.
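
One way this could work, as a sketch only: sort the parsed items by ambiguity first and parse time second, and keep the easiest few. The dictionary keys `readings` and `parse_time` and the function names are hypothetical; the real profile fields may be named differently.

```python
def complexity_key(item):
    """Order by ambiguity (number of readings), then by parse time."""
    return (item["readings"], item["parse_time"])


def easiest_examples(items, limit=3):
    """Return up to `limit` parsed items, least complex first.

    Items with zero readings did not parse, so they are skipped.
    """
    parsed = [item for item in items if item["readings"] > 0]
    return sorted(parsed, key=complexity_key)[:limit]


# Hypothetical profile rows, for illustration only.
items = [
    {"text": "Dogs bark.", "readings": 1, "parse_time": 0.02},
    {"text": "Time flies like an arrow.", "readings": 7, "parse_time": 0.31},
    {"text": "Kim sleeps.", "readings": 1, "parse_time": 0.01},
    {"text": "Colorless green ideas", "readings": 0, "parse_time": 0.05},
]
best = easiest_examples(items, limit=2)
```

A parse ranking score could be added as a third element of the key if it turns out to be a useful signal.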

Labels: enhancement (New feature or request)