Overview Issue: Refactor keyword-based assessments to accommodate morphology #1558
Comments
@jdevalk @omarreiss Could you confirm that this is the way to go?
I agree with a lot. I see a couple of things @jdevalk needs to decide about.
1. Title
In the proposal it no longer seems to matter if the keyword is at the beginning of the title. Do you agree with this?
2. First paragraph
I think this is going to be a bit hard to explain. I'd rather not differentiate here and always require all content words to be matched in at least one sentence for a GOOD.
3. Subheadings
I would prefer a definition where a subheading needs to include all of the content words in order to reflect the topic.
4. URLs
This needs an SEO's perspective.
5. Keyphrase length
This would make it much less strict. From a content words perspective the current calibration is probably already GOOD: 1-3, OK: 4-6, BAD: otherwise.
1. No. Beginning of title matters.
2. Agree with @omarreiss
3. That's undoable for larger keyphrases. I think I'm fine with @nataliashitova's suggestion but this needs to be tested on real copy.
4. I'm fine with this.
5. I'm fine with less strict for this. So let's go with @nataliashitova's suggestion.
Update: This issue description was updated following the feedback of Omar and Joost.
Current versions of keyword-based assessments rely on exact matches between the keyword and the text. However, with morphological support, exact matches make less sense.
Below I outline my suggestions for possible adjustments to the existing assessments.
For languages that have morphological support (for now, English), by word I understand the exact word from the keyphrase and all its forms generated internally. For languages without morphological support, a word is understood as an exact match with the word in the keyphrase.
By function words I understand prepositions (e.g., for), articles (e.g., the), auxiliaries (e.g., were) and words of diminished or absent semantics (e.g., thing), i.e., all words that currently are listed under function words for prominent words analysis. By content words I understand all words which are not function words. E.g., content words in "The boy has eaten an apple" would be "boy", "eaten", "apple".
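The split into function words and content words can be sketched as follows. This is only an illustration in Python: the `FUNCTION_WORDS` set is a tiny hypothetical stand-in for the real function word list used for prominent words analysis, and `content_words` is an assumed helper name.

```python
import re

# Hypothetical, deliberately tiny function-word list; the real list is the one
# already used for prominent words analysis and is much longer.
FUNCTION_WORDS = {"the", "a", "an", "for", "of", "has", "have", "was", "were", "thing"}


def content_words(phrase: str) -> list[str]:
    """Return the lowercased words of `phrase` that are not function words."""
    words = re.findall(r"[a-z']+", phrase.lower())
    return [word for word in words if word not in FUNCTION_WORDS]


print(content_words("The boy has eaten an apple"))
```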
Group 1: one-word matches
TextImages
Alt-attributes of images should have keyword.
Current: If < 5 images, GOOD if at least 1 alt-tag with the keyword. If 5 images, GOOD if 2-4 alt-tags with the keyword. If >5 images, GOOD if the number of alt-tags with keyword is within 30-75% range. BAD if there are no images. OK otherwise.
Proposal: Consider any content word in the keyphrase, same otherwise.
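A minimal sketch of the proposed TextImages rule, assuming the content words of the keyphrase are already known. It matches exact word forms only (no morphology), and the names `alt_matches` and `text_images_score` are hypothetical.

```python
import re


def alt_matches(alt: str, keyphrase_content_words: set[str]) -> bool:
    """True if the alt text contains at least one content word of the keyphrase."""
    words = set(re.findall(r"[a-z']+", alt.lower()))
    return bool(words & keyphrase_content_words)


def text_images_score(alts: list[str], keyphrase_content_words: set[str]) -> str:
    """Score a list of alt texts (one per image) with the thresholds from the issue."""
    total = len(alts)
    if total == 0:
        return "BAD"  # no images at all
    matched = sum(alt_matches(alt, keyphrase_content_words) for alt in alts)
    if total < 5:
        good = matched >= 1
    elif total == 5:
        good = 2 <= matched <= 4
    else:
        good = 0.30 <= matched / total <= 0.75
    return "GOOD" if good else "OK"
```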
Group 2: some- and all-word matches
TitleKeyword
The title should reflect the topic of the copy.
Current: GOOD if the keyword is at the beginning of the title, OK if it is in the title but not at the beginning, BAD otherwise.
Proposal: GOOD if an exact match of the keyword is found at the beginning of the title, OK if all content words from the keyphrase are in the title, BAD otherwise.
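The proposed TitleKeyword rule could look roughly like this. The sketch reads "beginning of the title" as "the title starts with the keyphrase, word for word", which is an assumption. It also uses a small hypothetical function-word list and exact word forms instead of generated morphological forms.

```python
import re

# Hypothetical stand-in for the real function word list.
FUNCTION_WORDS = {"the", "a", "an", "for", "of", "and", "to", "in"}


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def title_keyword_score(title: str, keyphrase: str) -> str:
    title_tokens = tokenize(title)
    key_tokens = tokenize(keyphrase)
    if not key_tokens:
        return "BAD"
    # GOOD: exact match of the whole keyphrase at the start of the title.
    if title_tokens[:len(key_tokens)] == key_tokens:
        return "GOOD"
    # OK: every content word of the keyphrase occurs somewhere in the title.
    needed = [word for word in key_tokens if word not in FUNCTION_WORDS]
    if needed and all(word in title_tokens for word in needed):
        return "OK"
    return "BAD"
```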
IntroductionHasKeyword
The topic of the copy should be clear immediately.
Current: GOOD if the keyword is in the first paragraph of the copy, BAD otherwise.
Proposal: GOOD if all content words from the keyphrase are matched within one sentence in the introduction, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.
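A sketch of the three-way IntroductionHasKeyword rule. It assumes the morphological forms are supplied by the caller as one set of accepted forms per content word (`forms_per_word`, a hypothetical input), and it uses a naive sentence splitter.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))


def introduction_score(first_paragraph: str, forms_per_word: list[set[str]]) -> str:
    if not forms_per_word:
        return "BAD"
    sentences = [tokenize(s) for s in split_sentences(first_paragraph)]
    # GOOD: one sentence contains a form of every content word.
    if any(all(forms & words for forms in forms_per_word) for words in sentences):
        return "GOOD"
    # OK: every content word appears in the paragraph, but not in one sentence.
    paragraph_words = set().union(*sentences)
    if all(forms & paragraph_words for forms in forms_per_word):
        return "OK"
    return "BAD"
```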
MetaDescriptionKeyword
The keyword should be in the metadescription, but not too much.
Current: GOOD if the keyword occurs once or twice, BAD otherwise.
Proposal: Same as IntroductionHasKeyword, but count the number of matches to be 1 or 2.
SubheadingsKeyword
The topic should be clear from subheadings, but overuse is penalized.
Current: GOOD if the keyword is in 30-75% of subheadings, BAD otherwise.
Proposal: A subheading is considered to reflect the topic if more than half of the content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.
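The subheading rule can be sketched as below. Again, `forms_per_word` (one set of accepted forms per content word) is a hypothetical input standing in for the morphology module. The issue does not say what should happen when there are no subheadings, so this sketch simply rejects that case.

```python
import re


def reflects_topic(subheading: str, forms_per_word: list[set[str]]) -> bool:
    """True if more than half of the keyphrase's content words occur in the subheading."""
    words = set(re.findall(r"[a-z']+", subheading.lower()))
    matched = sum(1 for forms in forms_per_word if forms & words)
    return matched > len(forms_per_word) / 2


def subheadings_score(subheadings: list[str], forms_per_word: list[set[str]]) -> str:
    if not subheadings or not forms_per_word:
        # Not covered by the proposal; rejecting is an assumption of this sketch.
        raise ValueError("need at least one subheading and one content word")
    reflecting = sum(reflects_topic(h, forms_per_word) for h in subheadings)
    share = reflecting / len(subheadings)
    return "GOOD" if 0.30 <= share <= 0.75 else "BAD"
```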
UrlKeyword
The URL should reflect the topic of the copy.
Current: GOOD if the keyword is in the URL, OK otherwise.
Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.
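A sketch of the proposed URL rule, assuming the URL slug is split on non-alphanumeric characters and compared with exact word forms. The function name and the handling of an empty keyphrase are assumptions.

```python
import re


def url_keyword_score(url_slug: str, keyphrase_content_words: list[str]) -> str:
    slug_words = set(re.findall(r"[a-z0-9]+", url_slug.lower()))
    total = len(keyphrase_content_words)
    if total == 0:
        return "OK"  # nothing to match; assumption of this sketch
    matched = sum(1 for word in keyphrase_content_words if word in slug_words)
    if total <= 2:
        good = matched == total      # 1 or 2 content words: all must be present
    else:
        good = matched > total / 2   # more than 2: more than half must be present
    return "GOOD" if good else "OK"
```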
PreviouslyUsedKeyword
No more than two articles should try to rank for the same keyword.
Current: GOOD if the keyword was never used, OK if it was used once, BAD otherwise.
Proposal: Two ways. First, keep the current implementation with exact matches. Second, match based on base forms of every content word in the keyphrase.
Group 3: Density and distribution
KeywordDensity and TopicDensity
The keyphrase should occur in the text often enough, but not too much.
Current: GOOD if the keyphrase constitutes 0.5-3% of words (for 1-word keyphrase without synonyms) or 1.5-4% (for 1-word keyphrase with synonyms), BAD otherwise. The percentages are normalized by the length of the keyphrase.
Proposal: A match is when all words from the keyphrase are found within one sentence. If all words from the keyphrase are matched multiple times, these are considered as multiple matches and are fed into keywordDensity accordingly.
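One possible reading of this matching rule, as a sketch: within each sentence, the number of matches is the number of times the least frequent keyphrase word occurs. The sketch counts content words only (in the issue's examples the keyphrase "books for boys" matches sentences that do not contain "for"), which is an interpretation rather than something the proposal states. `forms_per_word` is a hypothetical input holding the accepted forms of each content word.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def count_matches(text: str, forms_per_word: list[set[str]]) -> int:
    """Count keyphrase matches, sentence by sentence."""
    if not forms_per_word:
        return 0
    total = 0
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Occurrences of each content word (in any accepted form) in this sentence.
        counts = [sum(1 for t in tokens if t in forms) for forms in forms_per_word]
        # A full match needs every word once, so the rarest word limits the count.
        total += min(counts)
    return total


forms = [{"book", "books"}, {"boy", "boys"}]
print(count_matches("A boy was reading a book, which was a book for boys.", forms))
```

The resulting count is what would be fed into the density calculation.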
Examples:
Text: "A boy was eating an apple and reading a book." Keyphrase: "books for boys". Matches found: 1.
Text: "A boy was eating an apple. He was reading a book." Keyphrase: "books for boys". Matches found: 0.
Text: "A boy was reading a book, which was a book for boys." Keyphrase: "books for boys". Matches found: 2.
KeywordDistribution
The keyword should be evenly distributed over the copy.
Current: GOOD if the minimal distance between keyword occurrences is <40% of the text length, OK if it is between 40-50%, BAD if >50%.
Proposal:
(1) For every sentence: if keywordLength < 4, GOOD if all content words from the keyphrase are in the sentence, OK if some but not all are, BAD if none are. If keywordLength >= 4, GOOD if all content words from the keyphrase are in the sentence, or if at least 3 content words from the keyphrase are in the sentence and the rest are in the neighbouring sentences; OK if only some of the content words are found in the sentence; BAD if none are.
(2) Step function: start with the first third of the text (based on the total number of sentences) and calculate an average score over all sentences in this set. Move down by one sentence, calculate an average score again. Continue until the end of the text is reached.
(3) For every step, calculate a punishment for not using all content words from the keyphrase within the step: 0 if all content words were used, 0.5 otherwise. The punishments are then averaged over all steps.
(4) Compute a Gini coefficient over all steps. The Gini coefficient shows how uniform the distribution is, ranging from 0 for a perfectly uniform distribution to 1 for a highly non-uniform one. Add the averaged punishment (a value from 0 to 0.5) to get the final score: GOOD if the final score is <= 0.4, OK if it is between 0.4 and 0.6, BAD otherwise.
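Steps (2) to (4) can be sketched as follows. The sketch takes the per-sentence results of step (1) as input: a numeric score per sentence (the issue does not fix a numeric scale for GOOD/OK/BAD, so any non-negative scale is an assumption) and, per sentence, the set of indices of the content words found in it. The first threshold is read as `<= 0.4`, and all names are hypothetical.

```python
import math


def gini(values: list[float]) -> float:
    """Gini coefficient of non-negative values; 0 means perfectly uniform."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # all-zero input is treated as uniform (assumption)
    ordered = sorted(values)
    weighted = sum((i + 1) * v for i, v in enumerate(ordered))
    return (2 * weighted) / (n * total) - (n + 1) / n


def distribution_score(sentence_scores: list[float],
                       sentence_matches: list[set[int]],
                       n_content_words: int) -> tuple[float, str]:
    n = len(sentence_scores)
    if n == 0 or n != len(sentence_matches):
        raise ValueError("need one score and one match set per sentence")
    size = max(1, math.ceil(n / 3))  # window: the first third of the text
    step_averages, punishments = [], []
    for start in range(n - size + 1):  # move down one sentence at a time
        window = range(start, start + size)
        step_averages.append(sum(sentence_scores[i] for i in window) / size)
        used = set().union(*(sentence_matches[i] for i in window))
        punishments.append(0.0 if len(used) == n_content_words else 0.5)
    final = gini(step_averages) + sum(punishments) / len(punishments)
    if final <= 0.4:
        return final, "GOOD"
    if final <= 0.6:
        return final, "OK"
    return final, "BAD"
```

With six sentences that all contain both content words, every step has the same average, so the Gini coefficient and the punishment are both 0 and the result is GOOD. If only the first two sentences contain the keyphrase, the later steps score zero and are punished, and the result is BAD.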
Group 4: Other
KeyphraseLength
The keyphrase should be present and should not be too long.
Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise.
Proposal: Same but count only content words.
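The proposed KeyphraseLength rule only changes what is counted. A minimal sketch, with a small hypothetical function-word list standing in for the real one:

```python
import re

# Hypothetical stand-in for the real function word list.
FUNCTION_WORDS = {"the", "a", "an", "for", "of", "and", "to", "in"}


def keyphrase_length_score(keyphrase: str) -> str:
    words = re.findall(r"[a-z']+", keyphrase.lower())
    length = sum(1 for word in words if word not in FUNCTION_WORDS)
    if 1 <= length <= 4:
        return "GOOD"
    if 5 <= length <= 8:
        return "OK"
    return "BAD"  # empty keyphrase or more than 8 content words
```

For example, "the best adventure books for young boys and girls" has nine words but only six content words under this list, so it scores OK rather than BAD.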
PLAN
Import refactored assessments from feature/recalibration
Implement morphological researchers
Implement Premium morphology interface
Refactor assessments to implement morphological support
Final checklist for 9.0
Stretch goals
For an overview of how morphology works and how to adjust existing assessments to include morphology, consult this wiki article.