Overview Issue: Refactor keyword-based assessments to accommodate morphology #1558

nataliashitova · 2018-06-25T09:14:47Z

Update: This issue description was updated following the feedback of Omar and Joost.

Current versions of keyword-based assessments rely on exact matches between the keyword and the text. However, with morphological support, exact matches make less sense.
Below I outline my suggestions for adjusting the existing assessments.

For languages that have morphological support (for now, English), by "word" I mean the exact word from the keyphrase and all of its forms generated internally. For languages without morphological support, "word" means an exact match with the word in the keyphrase.

By function words I understand prepositions (e.g., for), articles (e.g., the), auxiliaries (e.g., were) and words of diminished or absent semantics (e.g., thing), i.e., all words that currently are listed under function words for prominent words analysis. By content words I understand all words which are not function words. E.g., content words in "The boy has eaten an apple" would be "boy", "eaten", "apple".
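The function/content distinction above can be sketched in a few lines of Python. This is only an illustration: `FUNCTION_WORDS` is a tiny hypothetical stand-in for the real function-word list used by the prominent-words analysis, and the regex tokenizer is an assumption as well.

```python
import re

# Hypothetical, heavily truncated list; the real function-word list is the
# one used for the prominent-words analysis and is much longer.
FUNCTION_WORDS = {"the", "a", "an", "for", "has", "have", "was", "were", "thing"}

def content_words(phrase):
    """Return the words of `phrase` that are not function words, in order."""
    words = re.findall(r"[a-z']+", phrase.lower())
    return [word for word in words if word not in FUNCTION_WORDS]

print(content_words("The boy has eaten an apple"))  # ['boy', 'eaten', 'apple']
```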

Group 1: one-word matches

TextImages

Alt-attributes of images should have keyword.
Current: If < 5 images, GOOD if at least 1 alt-tag with the keyword. If 5 images, GOOD if 2-4 alt-tags with the keyword. If >5 images, GOOD if the number of alt-tags with keyword is within 30-75% range. BAD if there are no images. OK otherwise.
Proposal: Consider any content word in the keyphrase, same otherwise.
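A minimal sketch of the proposed TextImages check, under a few assumptions: `content_forms` is a hypothetical structure holding one set of accepted word forms per content word of the keyphrase, the 30-75% range is treated as inclusive, and the tokenizer is a plain regex. The function names are made up for this example.

```python
import re

def alt_matches(alt_text, content_forms):
    """True if the alt text contains a form of at least one content word."""
    words = set(re.findall(r"[a-z']+", alt_text.lower()))
    return any(forms & words for forms in content_forms)

def text_images_rating(alt_texts, content_forms):
    """Rate a list of image alt texts using the thresholds quoted above."""
    total = len(alt_texts)
    if total == 0:
        return "BAD"
    matched = sum(alt_matches(alt, content_forms) for alt in alt_texts)
    if total < 5:
        return "GOOD" if matched >= 1 else "OK"
    if total == 5:
        return "GOOD" if 2 <= matched <= 4 else "OK"
    # More than 5 images: the share of matching alt texts must be 30-75%.
    return "GOOD" if 0.30 <= matched / total <= 0.75 else "OK"
```

For the keyphrase "books for boys", `content_forms` could look like `[{"book", "books"}, {"boy", "boys"}]`.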

Group 2: some- and all-word matches

TitleKeyword

The title should reflect the topic of the copy.
Current: GOOD if the keyword is in the beginning of the title, OK if it is in the title, but not in the beginning, BAD otherwise.
Proposal: GOOD if an exact match of the keyword is found in the beginning of the title, OK if all content words from the keyphrase are in the title, BAD otherwise.
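The TitleKeyword proposal could be sketched roughly as follows. The function name and the `content_forms` argument (one set of accepted forms per content word) are assumptions made for the example, and the keyphrase is assumed to be non-empty.

```python
import re

def title_keyword_rating(title, keyphrase, content_forms):
    """GOOD: exact keyphrase at the start; OK: all content words present."""
    title_lc = title.lower().strip()
    if title_lc.startswith(keyphrase.lower().strip()):
        return "GOOD"
    title_words = set(re.findall(r"[a-z']+", title_lc))
    # Every content word must be represented by at least one of its forms.
    if all(forms & title_words for forms in content_forms):
        return "OK"
    return "BAD"
```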

IntroductionHasKeyword

The topic of the copy should be clear immediately.
Current: GOOD if the keyword is in the first paragraph of the copy, BAD otherwise.
Proposal: GOOD if all content words from the keyphrase are matched within one sentence in the introduction, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.
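As an illustration of the sentence-level versus paragraph-level distinction in this proposal, here is a small sketch. The sentence splitter is deliberately naive, and `content_forms` (one set of accepted forms per content word) is a hypothetical input, not an existing structure.

```python
import re

def introduction_rating(first_paragraph, content_forms):
    """GOOD: all content words in one sentence; OK: all in the paragraph."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", first_paragraph.strip())
    word_sets = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    if any(all(forms & words for forms in content_forms) for words in word_sets):
        return "GOOD"
    paragraph_words = set().union(*word_sets)
    if all(forms & paragraph_words for forms in content_forms):
        return "OK"
    return "BAD"
```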

MetaDescriptionKeyword

The keyword should be in the metadescription, but not too much.
Current: GOOD if the keyword occurs once or twice, BAD otherwise.
Proposal: Same as IntroductionHasKeyword, but count the number of matches to be 1 or 2.

SubheadingsKeyword

The topic should be clear from subheadings, but overuse is penalized.
Current: GOOD if the keyword is in 30-75% subheadings, BAD otherwise.
Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.
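The "more than half of the content words" rule can be sketched like this. The names are hypothetical, the 30-75% range is treated as inclusive, and the case of a text without subheadings is left open because the proposal does not cover it.

```python
import re

def reflects_topic(subheading, content_forms):
    """True if more than half of the content words occur in the subheading."""
    words = set(re.findall(r"[a-z']+", subheading.lower()))
    matched = sum(1 for forms in content_forms if forms & words)
    return matched > len(content_forms) / 2

def subheadings_rating(subheadings, content_forms):
    if not subheadings:
        return None  # not specified in the proposal
    hits = sum(reflects_topic(s, content_forms) for s in subheadings)
    return "GOOD" if 0.30 <= hits / len(subheadings) <= 0.75 else "BAD"
```

With three content words, a subheading needs at least two of them to count; with two content words it needs both.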

UrlKeyword

The URL should reflect the topic of the copy.
Current: GOOD if the keyword is in the URL, OK otherwise
Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.
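A rough sketch of the two-tier URL rule, assuming the slug is split on non-alphanumeric characters and that `content_forms` (hypothetical) holds one set of accepted forms per content word of a non-empty keyphrase:

```python
import re

def url_keyword_rating(slug, content_forms):
    """GOOD/OK rating for a URL slug, following the proposal above."""
    slug_words = set(re.findall(r"[a-z0-9']+", slug.lower()))
    matched = sum(1 for forms in content_forms if forms & slug_words)
    total = len(content_forms)
    if total <= 2:
        # Short keyphrases: every content word has to be in the URL.
        return "GOOD" if matched == total else "OK"
    # Longer keyphrases: more than half of the content words suffice.
    return "GOOD" if matched > total / 2 else "OK"
```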

PreviouslyUsedKeyword

More than two articles should not try to rank for the same keyword.
Current: GOOD if the keyword was never used, OK if it was used once, BAD otherwise.
Proposal: Two ways. First, keep the current implementation with exact matches. Second, match based on base forms of every content word in the keyphrase.

Group 3: Density and distribution

KeywordDensity and TopicDensity

The keyphrase should occur in the text often enough, but not too much.
Current: GOOD if the keyphrase constitutes 0.5-3% of words (for 1-word keyphrase without synonyms) or 1.5-4% (for 1-word keyphrase with synonyms), BAD otherwise. The percentages are normalized by the length of the keyphrase.
Proposal: A match is when all words from the keyphrase are found within one sentence. If all words from the keyphrase are matched multiple times, these are considered as multiple matches and are fed into keywordDensity accordingly.
Examples:
Text: "A boy was eating an apple and reading a book." Keyphrase: "books for boys". Matches found: 1.
Text: "A boy was eating an apple. He was reading a book." Keyphrase: "books for boys". Matches found: 0.
Text: "A boy was reading a book, which was a book for boys" Keyphrase: "books for boys". Matches found: 2.
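The three examples above can be reproduced with a short sketch. It rests on assumptions: the examples ignore "for", so only content words are matched here; the number of matches in a sentence is taken to be the count of the least frequent content word; and the sentence splitter is naive. `content_forms` is a hypothetical structure with one set of accepted forms per content word.

```python
import re

def count_matches(text, content_forms):
    """Count keyphrase matches, where a match needs all words in one sentence."""
    total = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Occurrences of each content word (in any of its forms).
        per_word = [sum(w in forms for w in words) for forms in content_forms]
        total += min(per_word) if per_word else 0
    return total

forms = [{"book", "books"}, {"boy", "boys"}]
print(count_matches("A boy was eating an apple and reading a book.", forms))        # 1
print(count_matches("A boy was eating an apple. He was reading a book.", forms))    # 0
print(count_matches("A boy was reading a book, which was a book for boys", forms))  # 2
```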

KeywordDistribution

The keyword should be evenly distributed over the copy.
Current: GOOD if the minimal distance between keyword occurrences is <40% of the text length, okay if it is between 40-50%, bad if >50%.
Proposal:
(1) For every sentence: if keywordLength < 4, GOOD if all content words from the keyphrase are in the sentence, OK if some but not all are, BAD if none; if keywordLength >= 4, GOOD if all content words from the keyphrase are in the sentence, or if at least 3 content words from the keyphrase are in the sentence and the rest are in the neighbouring sentences, OK if only some content words from the keyphrase are found in the sentence, BAD if none.
(2) Step function: start with the first third of the text (based on the total number of sentences) and calculate an average score over all sentences in this set. Move down by one sentence, calculate an average score again. Continue until the end of the text is reached.
(3) For every step, calculate a punishment that applies if not all content words from the keyphrase were used in the step. The punishment is either 0 (if all content words were used) or 0.5 (if not all content words were used). The punishments are then averaged over all steps.
(4) Compute a Gini coefficient over all steps. The Gini coefficient shows how uniform the distribution is, ranging from 0 for a perfectly uniform distribution to 1 for a highly non-uniform distribution. Add the averaged punishment (which is a value from 0 to 0.5). Calculate the scores: GOOD if the final score <= 0.4, OK if between 0.4 and 0.6, BAD otherwise.
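Steps (2) to (4) can be sketched as below. Several details are assumptions, since the proposal leaves them open: the per-sentence ratings from step (1) are passed in as numbers chosen by the caller, the window size is a third of the sentences rounded up, the Gini coefficient uses the mean-absolute-difference formula, and `sentence_matches` (hypothetical) holds, per sentence, the set of indices of content words found in it.

```python
import math
from statistics import mean

def gini(values):
    """Gini coefficient: 0 for equal values, larger for unequal ones."""
    avg = mean(values)
    if avg == 0:
        return 0.0
    n = len(values)
    return sum(abs(x - y) for x in values for y in values) / (2 * n * n * avg)

def distribution_rating(sentence_scores, sentence_matches, n_content_words):
    n = len(sentence_scores)
    if n == 0:
        return None  # empty text: not covered by the proposal
    size = max(1, math.ceil(n / 3))  # window = first third of the sentences
    step_scores, punishments = [], []
    for start in range(n - size + 1):  # slide the window one sentence at a time
        window = range(start, start + size)
        step_scores.append(mean(sentence_scores[i] for i in window))
        used = set().union(*(sentence_matches[i] for i in window))
        punishments.append(0.0 if len(used) == n_content_words else 0.5)
    final = gini(step_scores) + mean(punishments)
    if final <= 0.4:
        return final, "GOOD"
    return final, ("OK" if final <= 0.6 else "BAD")
```

For example, with hypothetical sentence scores `[9, 9, 3, 3, 3, 3]` and both content words present only in the first two sentences, the step averages are 9, 6, 3, 3, 3, the Gini coefficient is 0.25, the averaged punishment is 0.3, and the final score of 0.55 is rated OK.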

Group 4: Other

KeyphraseLength

The keyphrase should be present and should not be too long.
Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise.
Proposal: Same but count only content words.
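A minimal sketch of the proposed length check, counting only content words. The function-word set is passed in by the caller and is hypothetical here; an empty keyphrase is rated BAD on the reading that the keyphrase "should be present".

```python
import re

def keyphrase_length_rating(keyphrase, function_words):
    """Rate the keyphrase by its number of content words."""
    words = re.findall(r"[a-z']+", keyphrase.lower())
    length = sum(1 for word in words if word not in function_words)
    if 1 <= length <= 4:
        return "GOOD"
    if 5 <= length <= 8:
        return "OK"
    return "BAD"  # no content words at all, or more than 8
```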

PLAN

Import refactored assessments from feature/recalibration

Migrate SEO assessments from feature/recalibration. Part 1. #1591
Migrate SEO assessments from feature/recalibration. Part 2. #1592
Update SEOAssessorSpec (including cornerstone) to follow the new format #1599
Change taxonomy assessor in wordpress-seo repository (calls to classes, new file names): Change taxonomy assessor after assessment refactor wordpress-seo#10400

Implement morphological researchers

Implement research that generates Keyword+Synonyms structure including morphology #1587
Implement research that checks how many keyphrase words are present within the string (Implement research that checks if keyphrase/synonym word forms are present in the string #1634)
Implement research that calls the previous research only for keyphrase or for keyphrase and synonyms (Implement research that searches for keyphrase or keyphrase/synonyms words #1635)

Implement Premium morphology interface

Implement morphological data imports from the server #1641
Memoize the keyphrase/synonymsForms structure to speed up assessments #1750
Remove regex and exception files which were moved to a single JSON #1751
Transition the regex/exception JSON file to Yoast/YoastSEO.js-premium-configuration #1758
Create authenticated downloads in MyYoast. https://github.com/Yoast/my-yoast/issues/1918
Pass YoastSEO.js premium config to YoastSEO.js (Load morphology data from URL #1809)
Make sure that Free doesn't have access to the morphology functionality

Refactor assessments to implement morphological support

Final checklist for 9.0

All assessments return new feedback strings
The morphology data is removed from YoastSEO.js
The license is decided upon https://github.com/Yoast/YoastSEO.js-premium-configuration/issues/3
The issues from Morpho-Syno milestone are all merged

Stretch goals

Refactor PreviouslyUsedKeyword assessment Refactor PreviouslyUsedKeyword assessment to include morphology #1752
Remove topicCount if it's not needed anymore
Guess the base form of every word and list it as the first one in the array of forms. Merge topic words if they have the same base form (only needed if someone uses a content word in the keyphrase twice)

For an overview of how morphology works and how to adjust existing assessments to include morphology, consult this wiki article.


nataliashitova · 2018-07-04T09:30:44Z

@jdevalk @omarreiss Could you confirm that this is the way to go?

omarreiss · 2018-07-06T14:11:27Z

I agree with a lot of this. I see a couple of things @jdevalk needs to decide about.

1. Title

Proposal: Good if an exact match of the keyword is found, OK if all content words from the keyphrase are in the title, BAD otherwise.

In the proposal it no longer seems to matter if the keyword is at the beginning of the title. Do you agree with this?

2. First paragraph

If the keyphrase has 4 or more content words: GOOD if 3 content words from the keyphrase are matched within one sentence in the introduction, while the rest are found in the neighbour sentences, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.

I think this is going to be a bit hard to explain. I'd rather not differentiate here and always require all content words to be matched in at least one sentence for a GOOD.

3. Subheadings

Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.

I would prefer a definition where a subheading needs to include all of the content words in order to reflect the topic.

4. Urls

Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.

This needs an SEO's perspective.

5. Keyphrase length

Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise.
Proposal: Same but count only content words.

This would make it much less strict. From a content words perspective the current calibration is probably already Good: 1-3, OK: 4-6, BAD: otherwise.

jdevalk · 2018-07-17T08:21:05Z

Title

Proposal: Good if an exact match of the keyword is found, OK if all content words from the keyphrase are in the title, BAD otherwise.

In the proposal it no longer seems to matter if the keyword is at the beginning of the title. Do you agree with this?

No. Beginning of title matters.

First paragraph

If the keyphrase has 4 or more content words: GOOD if 3 content words from the keyphrase are matched within one sentence in the introduction, while the rest are found in the neighbour sentences, OK if all content words are present in the first paragraph at all (but not in the same sentence), BAD otherwise.

I think this is going to be a bit hard to explain. I'd rather not differentiate here and always require all content words to be matched in at least one sentence for a GOOD.

Agree with @omarreiss

Subheadings

Proposal: A subheading is considered to reflect the topic if > half of content words from the keyphrase are used in it. Then, GOOD if 30-75% of subheadings reflect the topic, BAD otherwise.

I would prefer a definition where a subheading needs to include all of the content words in order to reflect the topic.

That's undoable for larger keyphrases. I think I'm fine with @nataliashitova's suggestion but this needs to be tested on real copy.

Urls

Proposal: If the keyword has 1 or 2 content words: GOOD if all content words are in the URL, OK otherwise. If the keyword has >2 content words: GOOD if > half of content words are in the URL, OK otherwise.

This needs an SEO's perspective.

I'm fine with this.

Keyphrase length

Current: GOOD if the keyphrase length is between 1-4, OK if between 5-8, BAD otherwise.
Proposal: Same but count only content words.

This would make it much less strict. From a content words perspective the current calibration is probably already Good: 1-3, OK: 4-6, BAD: otherwise.

I'm fine with less strict for this. So let's go with @nataliashitova's suggestion.

nataliashitova added enhancement development lingo text analysis labels Jun 25, 2018

nataliashitova self-assigned this Jun 25, 2018

nataliashitova mentioned this issue Jun 25, 2018

Add morphological analysis for YoastSEO.js assessments #1500

Open

10 tasks

nataliashitova added needs-changes lingo development lingo and removed development lingo needs-changes lingo labels Jun 25, 2018

nataliashitova added the morpho-syno Issue that is related to providing morphological analysis for keywords and synonyms. label Jul 4, 2018

nataliashitova added needs-decision needs-changes lingo and removed development lingo labels Jul 4, 2018

nataliashitova mentioned this issue Jul 9, 2018

Implement research that generates Keyword+Synonyms structure including morphology #1587

Closed

nataliashitova added backlog and removed needs-changes lingo labels Jul 11, 2018

nataliashitova changed the title ~~Refactor keyword-based assessments to accommodate morphology~~ Overview Issue: Refactor keyword-based assessments to accommodate morphology Jul 11, 2018

This was referenced Jul 17, 2018

Punctuation marks used in the keyword get stripped in slug #1608

Closed

Plugin has troubles recognizing Focus Keywords with special characters #1193

Closed

nataliashitova added bug and removed bug labels Jul 17, 2018

nataliashitova mentioned this issue Jul 26, 2018

Refactor TitleKeyword assessment to include morphology #1638

Closed

nataliashitova mentioned this issue Aug 7, 2018

Process the text in the worker: removeHtmlBlocks #1673

Closed

nataliashitova mentioned this issue Oct 12, 2018

Keyphrase in subheading check is broken #1859

Closed

Pcosta88 mentioned this issue Apr 24, 2019

[Feature request] Detect if Keyword has a punctuation mark in it and output a different notification for the slug #2222

Open

5 tasks
