This repository has been archived by the owner on Oct 4, 2022. It is now read-only.

Overview Issue: Refactor keyword-based assessments to accommodate morphology #1558

Open
27 of 30 tasks
nataliashitova opened this issue Jun 25, 2018 · 3 comments
Labels
backlog, enhancement, morpho-syno (issue that is related to providing morphological analysis for keywords and synonyms), needs-decision, text analysis

Comments

@nataliashitova
Contributor

nataliashitova commented Jun 25, 2018

Update: This issue description was updated following the feedback of Omar and Joost.

Current versions of keyword-based assessments rely on exact matches between the keyword and the text. However, once morphological support is available, requiring exact matches makes less sense.
Below I outline my suggestions for adjusting the existing assessments.

For languages that have morphological support (for now, English), by word I understand the exact word from the keyphrase and all its forms generated internally. For languages without morphological support, word is understood as exact match with the keyphrase.

By function words I understand prepositions (e.g., for), articles (e.g., the), auxiliaries (e.g., were) and words of diminished or absent semantics (e.g., thing), i.e., all words that currently are listed under function words for prominent words analysis. By content words I understand all words which are not function words. E.g., content words in "The boy has eaten an apple" would be "boy", "eaten", "apple".
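A minimal sketch of the function word / content word split described above. The `FUNCTION_WORDS` set here is a tiny hypothetical stand-in; the real list would be the one already used for prominent words analysis.

```python
import re

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "for", "has", "have", "was", "were", "thing"}


def content_words(text, function_words=FUNCTION_WORDS):
    """Return the words of `text` that are not function words, in order."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in function_words]


print(content_words("The boy has eaten an apple"))  # ['boy', 'eaten', 'apple']
```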

Group 1: one-word matches

TextImages

Alt attributes of images should contain the keyword.
Current: If < 5 images, GOOD if at least 1 alt-tag with the keyword. If 5 images, GOOD if 2-4 alt-tags with the keyword. If >5 images, GOOD if the number of alt-tags with keyword is within 30-75% range. BAD if there are no images. OK otherwise.
Proposal: Consider any content word in the keyphrase, same otherwise.
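The thresholds above combined with the proposed "any content word" match could look roughly like this. It is only a sketch: `forms` is a hypothetical mapping from each content word of the keyphrase to the set of word forms the morphology module would generate, and treating the 30-75% range as inclusive is an assumption.

```python
import re


def alt_matches(alt_text, forms):
    """True if the alt text contains any accepted form of any content word."""
    tokens = set(re.findall(r"[a-z0-9']+", alt_text.lower()))
    return any(tokens & word_forms for word_forms in forms.values())


def score_text_images(alt_texts, forms):
    """Rate a list of image alt texts (one entry per image) as GOOD, OK or BAD."""
    image_count = len(alt_texts)
    matching = sum(1 for alt in alt_texts if alt_matches(alt, forms))
    if image_count == 0:
        return "BAD"
    if image_count < 5:
        return "GOOD" if matching >= 1 else "OK"
    if image_count == 5:
        return "GOOD" if 2 <= matching <= 4 else "OK"
    share = matching / image_count
    # Assumption: both ends of the 30-75% range count as GOOD.
    return "GOOD" if 0.30 <= share <= 0.75 else "OK"
```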

Group 2: some- and all-word matches

TitleKeyword

The title should reflect the topic of the copy.
Current: GOOD if the keyword is at the beginning of the title, OK if it is in the title, but not at the beginning, BAD otherwise.
Proposal: GOOD if an exact match of the keyword is found at the beginning of the title, OK if all content words from the keyphrase are in the title, BAD otherwise.
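A rough sketch of the proposed title check. `forms` is again a hypothetical mapping from each content word to its generated forms, and the plain prefix comparison is a simplification (it does not check word boundaries).

```python
import re


def score_title(title, keyphrase, forms):
    """GOOD: exact keyphrase at the start; OK: every content word present; else BAD."""
    if title.lower().startswith(keyphrase.lower()):
        return "GOOD"
    title_words = set(re.findall(r"[a-z0-9']+", title.lower()))
    # Every content word must be matched by at least one of its forms.
    if all(title_words & word_forms for word_forms in forms.values()):
        return "OK"
    return "BAD"
```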

IntroductionHasKeyword

The topic of the copy should be clear immediately.
Current: GOOD if the keyword is in the first paragraph of the copy, BAD otherwise.
Proposal: GOOD if all content words from the keyphrase are matched within one sentence in the introduction, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.
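The sentence-level versus paragraph-level distinction can be sketched as follows. The regex sentence splitter and the `forms` mapping (content word to set of generated forms) are simplifying assumptions, not the existing researchers.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def covers_all(words, forms):
    """True if every content word has at least one of its forms in `words`."""
    return all(words & word_forms for word_forms in forms.values())


def score_introduction(first_paragraph, forms):
    per_sentence = [
        set(re.findall(r"[a-z0-9']+", s.lower()))
        for s in split_sentences(first_paragraph)
    ]
    if any(covers_all(words, forms) for words in per_sentence):
        return "GOOD"  # all content words within one sentence
    if covers_all(set().union(*per_sentence), forms):
        return "OK"  # all content words present, but spread over sentences
    return "BAD"
```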

MetaDescriptionKeyword

The keyword should be in the metadescription, but not too much.
Current: GOOD if the keyword occurs once or twice, BAD otherwise.
Proposal: Same as IntroductionHasKeyword, but count the number of matches to be 1 or 2.

SubheadingsKeyword

The topic should be clear from subheadings, but overuse is penalized.
Current: GOOD if the keyword is in 30-75% subheadings, BAD otherwise.
Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.
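A sketch of the "more than half of the content words" rule for subheadings. `forms` is a hypothetical mapping from content word to generated forms; returning BAD when there are no subheadings at all is an assumption, since the proposal does not cover that case.

```python
import re


def reflects_topic(subheading, forms):
    """True if more than half of the content words occur in the subheading."""
    words = set(re.findall(r"[a-z0-9']+", subheading.lower()))
    matched = sum(1 for word_forms in forms.values() if words & word_forms)
    return matched > len(forms) / 2


def score_subheadings(subheadings, forms):
    if not subheadings:
        return "BAD"  # assumption: the proposal does not define this case
    reflecting = sum(1 for s in subheadings if reflects_topic(s, forms))
    share = reflecting / len(subheadings)
    return "GOOD" if 0.30 <= share <= 0.75 else "BAD"
```

Note that for a two-word keyphrase "more than half" already means both words, so the relaxation only takes effect from three content words upwards.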

UrlKeyword

The URL should reflect the topic of the copy.
Current: GOOD if the keyword is in the URL, OK otherwise
Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.
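The two-tier URL rule, sketched under the same assumptions as before (`forms` maps each content word to its hypothetical set of generated forms; the URL is tokenized on anything that is not a letter or digit).

```python
import re


def score_url(url, forms):
    """GOOD/OK rating for a URL or slug, depending on keyphrase length."""
    url_words = set(re.findall(r"[a-z0-9]+", url.lower()))
    matched = sum(1 for word_forms in forms.values() if url_words & word_forms)
    total = len(forms)
    if total <= 2:
        # Short keyphrases: every content word must be in the URL.
        return "GOOD" if matched == total else "OK"
    # Longer keyphrases: more than half of the content words is enough.
    return "GOOD" if matched > total / 2 else "OK"
```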

PreviouslyUsedKeyword

More than two articles should not try to rank for the same keyword.
Current: GOOD if the keyword was never used, OK if it was used once, BAD otherwise.
Proposal: Two ways. First, keep the current implementation with exact matches. Second, match based on base forms of every content word in the keyphrase.

Group 3: Density and distribution

KeywordDensity and TopicDensity

The keyphrase should occur in the text often enough, but not too much.
Current: GOOD if the keyphrase constitutes 0.5-3% of words (for 1-word keyphrase without synonyms) or 1.5-4% (for 1-word keyphrase with synonyms), BAD otherwise. The percentages are normalized by the length of the keyphrase.
Proposal: A match is when all words from the keyphrase are found within one sentence. If all words from the keyphrase are matched multiple times, these are considered as multiple matches and are fed into keywordDensity accordingly.
Examples:
Text: "A boy was eating an apple and reading a book." Keyphrase: "books for boys". Matches found: 1.
Text: "A boy was eating an apple. He was reading a book." Keyphrase: "books for boys". Matches found: 0.
Text: "A boy was reading a book, which was a book for boys" Keyphrase: "books for boys". Matches found: 2.
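One way to reproduce the three examples above is to count, per sentence, how often each content word occurs and take the minimum. That counting rule, the regex sentence splitter, the `forms` mapping, and ignoring the function word "for" are all assumptions made for this sketch.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_matches(text, forms):
    """Count keyphrase matches; a match needs every content word in one sentence."""
    if not forms:
        return 0
    total = 0
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Occurrences of each content word (in any of its forms) in this sentence.
        per_word = [sum(1 for t in tokens if t in word_forms)
                    for word_forms in forms.values()]
        # The scarcest content word limits the number of complete matches.
        total += min(per_word)
    return total


forms = {"books": {"book", "books"}, "boys": {"boy", "boys"}}
print(count_matches("A boy was eating an apple and reading a book.", forms))        # 1
print(count_matches("A boy was eating an apple. He was reading a book.", forms))    # 0
print(count_matches("A boy was reading a book, which was a book for boys", forms))  # 2
```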

KeywordDistribution

The keyword should be evenly distributed over the copy.
Current: GOOD if the minimal distance between keyword occurrences is <40% of the text length, OK if it is between 40-50%, BAD if >50%.
Proposal:
(1) For every sentence: if keywordLength < 4, GOOD if all content words from the keyphrase are in the sentence, OK if some but not all are, BAD if none; if keywordLength >= 4, GOOD if all content words from the keyphrase are in the sentence, or at least 3 content words from the keyphrase are in the sentence and the rest are in the neighbour sentences, OK if only some content words from the keyphrase are found in the sentence but not all, BAD if none.
(2) Step function: start with the first third of the text (based on the total number of sentences) and calculate an average score over all sentences in this set. Move down by one sentence, calculate an average score again. Continue until the end of the text is reached.
(3) For every step calculate an eventual punishment if not all content words from the keyphrase were used in the step. The punishment is either 0 (if all content words were used) or 0.5 if not all keyphrase words were used. The punishments are then averaged over all steps.
(4) Compute a Gini coefficient over all steps. The Gini coefficient shows how uniform the distribution is, ranging from 0 for a perfectly uniform distribution to 1 for a highly non-uniform distribution. Add the averaged punishment (which is a value from 0 to 0.5). Calculate the scores: GOOD if the final score <= 0.4, OK if between 0.4 and 0.6, BAD otherwise.
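Steps (1) to (4) can be sketched as below. Several things are assumptions made only for illustration: the numeric values 9/6/3 for GOOD/OK/BAD sentences (the proposal names only the labels), the input format (`matched` is a list with one set per sentence, holding the keyphrase content words found in that sentence), the window size of one third rounded down, and BAD for an empty text.

```python
GOOD, OK, BAD = 9, 6, 3  # assumed numeric sentence scores


def sentence_score(i, matched, n_words):
    """Step (1): score sentence i given the content words matched per sentence."""
    here = matched[i]
    if len(here) == n_words:
        return GOOD
    if n_words >= 4 and len(here) >= 3:
        neighbours = set()
        if i > 0:
            neighbours |= matched[i - 1]
        if i + 1 < len(matched):
            neighbours |= matched[i + 1]
        if len(here | neighbours) == n_words:
            return GOOD
    return OK if here else BAD


def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly uniform."""
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * v for rank, v in enumerate(sorted(values)))
    return 2 * weighted / (n * total) - (n + 1) / n


def distribution_rating(matched, n_words):
    if not matched:
        return "BAD"  # assumption: an empty text is not covered by the proposal
    scores = [sentence_score(i, matched, n_words) for i in range(len(matched))]
    window = max(1, len(scores) // 3)  # step (2): a third of the sentences
    step_means, punishments = [], []
    for start in range(len(scores) - window + 1):
        step_means.append(sum(scores[start:start + window]) / window)
        used = set().union(*matched[start:start + window])
        # Step (3): punish steps that do not use every content word.
        punishments.append(0.0 if len(used) == n_words else 0.5)
    # Step (4): non-uniformity plus the averaged punishment.
    final = gini(step_means) + sum(punishments) / len(punishments)
    if final <= 0.4:
        return "GOOD"
    return "OK" if final <= 0.6 else "BAD"
```

With these assumed sentence scores, a text that never uses the keyphrase has perfectly uniform step averages (Gini 0) and only collects the 0.5 punishment, which would make it OK rather than BAD; the real numeric scale would need to be calibrated with that in mind.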

Group 4: Other

KeyphraseLength

The keyphrase should be present and should not be too long.
Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise.
Proposal: Same but count only content words.
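Counting only content words changes the rating for keyphrases that contain many function words, as this sketch shows. The `FUNCTION_WORDS` set is a small hypothetical stand-in for the real list.

```python
import re

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "for", "of", "to", "in", "and", "how"}


def score_keyphrase_length(keyphrase, function_words=FUNCTION_WORDS):
    """Rate the keyphrase by its number of content words."""
    words = re.findall(r"[a-z0-9']+", keyphrase.lower())
    count = sum(1 for word in words if word not in function_words)
    if 1 <= count <= 4:
        return "GOOD"
    if 5 <= count <= 8:
        return "OK"
    return "BAD"  # empty, function words only, or more than 8 content words
```

For example, "how to choose the best books for boys and girls" has ten words but only five content words under this list, so it would be rated OK instead of BAD.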

PLAN

Import refactored assessments from feature/recalibration

Implement morphological researchers

Implement Premium morphology interface

Refactor assessments to implement morphological support

Final checklist for 9.0

Stretch goals

  • Refactor PreviouslyUsedKeyword assessment to include morphology #1752
  • Remove topicCount if it's not needed anymore
  • Guess the base form of every word and list it as the first one in the array of forms. Merge topic words if they have the same base form (only needed if someone is using a content word in the keyphrase twice)

For an overview of how morphology works and how to adjust existing assessments to include morphology, consult this wiki article.

@nataliashitova
Contributor Author

@jdevalk @omarreiss Could you confirm that this is the way to go?

@omarreiss
Contributor

I agree with a lot of this. I see a couple of things @jdevalk needs to decide about.

1. Title

Proposal: Good if an exact match of the keyword is found, OK if all content words from the keyphrase are in the title, BAD otherwise.

In the proposal it no longer seems to matter if the keyword is at the beginning of the title. Do you agree with this?

2. First paragraph

If the keyphrase has 4 or more content words: GOOD if 3 content words from the keyphrase are matched within one sentence in the introduction, while the rest are found in the neighbour sentences, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.

I think this is going to be a bit hard to explain. I'd rather not differentiate here and always require all content words to be matched in at least one sentence for a GOOD.

3. Subheadings

Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.

I would prefer a definition where a subheading needs to include all of the content words in order to reflect the topic.

4. Urls

Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.

This needs an SEO's perspective.

5. Keyphrase length

Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise.
Proposal: Same but count only content words.

This would make it much less strict. From a content words perspective the current calibration is probably already Good: 1-3, OK: 4-6, BAD: otherwise.

@nataliashitova changed the title from "Refactor keyword-based assessments to accommodate morphology" to "Overview Issue: Refactor keyword-based assessments to accommodate morphology" on Jul 11, 2018
@jdevalk

jdevalk commented Jul 17, 2018

  1. Title

Proposal: Good if an exact match of the keyword is found, OK if all content words from the keyphrase are in the title, BAD otherwise.

In the proposal it no longer seems to matter if the keyword is at the beginning of the title. Do you agree with this?

No. Beginning of title matters.

  2. First paragraph

If the keyphrase has 4 or more content words: GOOD if 3 content words from the keyphrase are matched within one sentence in the introduction, while the rest are found in the neighbour sentences, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.

I think this is going to be a bit hard to explain. I'd rather not differentiate here and always require all content words to be matched in at least one sentence for a GOOD.

Agree with @omarreiss

  3. Subheadings

Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.

I would prefer a definition where a subheading needs to include all of the content words in order to reflect the topic.

That's undoable for larger keyphrases. I think I'm fine with @nataliashitova's suggestion but this needs to be tested on real copy.

  4. Urls

Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.

This needs an SEO's perspective.

I'm fine with this.

  5. Keyphrase length

Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise.
Proposal: Same but count only content words.

This would make it much less strict. From a content words perspective the current calibration is probably already Good: 1-3, OK: 4-6, BAD: otherwise.

I'm fine with less strict for this. So let's go with @nataliashitova's suggestion.


3 participants