Comprehension/fix plagiarism nil error #7976
Merged
WHAT
Refactor the logic that identifies which text to highlight when the plagiarism algorithm detects that an entry plagiarizes from the passage so that it gracefully handles cases where there are multiple spaces between words or non-letter/number characters isolated by spaces.
WHY
The logic that is used to determine whether plagiarism is present or not normalizes sentences into arrays of individual words. It does this by stripping extraneous characters (like punctuation), downcasing the string, and then splitting the string on spaces. Ruby's split behavior will treat any number of spaces as a single split token, so once we have our arrays of words we no longer have any knowledge about how many spaces might have been between each word.
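As a minimal sketch of the normalization described above: the helper name `normalize_to_words` and the exact character class being stripped are assumptions for illustration, not the code in this PR. It shows how `split(' ')` collapses any run of spaces, so the resulting word array carries no record of the original spacing.

```ruby
# Hypothetical helper, not taken from the PR itself.
# Assumption: "extraneous characters" means anything that is not a
# letter, digit, or whitespace.
def normalize_to_words(text)
  text
    .gsub(/[^a-zA-Z0-9\s]/, '') # strip punctuation and other symbols
    .downcase                   # make the comparison case-insensitive
    .split(' ')                 # a single-space pattern splits on whitespace runs
end

spaced  = normalize_to_words('We went , as you all know , to the beach.')
compact = normalize_to_words('We went, as you all know, to the beach.')

# Both inputs produce the same word array, so the spacing is lost.
puts spaced.inspect
puts spaced == compact
```

Because both inputs normalize to the same array, positions in the word array cannot be mapped straight back to character positions in the raw input.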
This works perfectly for identifying whether a user has plagiarized or not, at least for our purposes. However, it complicates the code that we use to determine what literal text needs to be highlighted to the user. The code that is being replaced here could not handle cases where there were multiple spaces between words (which could also happen if someone had punctuation isolated by spaces, such as `We went , as you all know , to the beach.`). This code is more robust about understanding where multiple spaces exist and accounting for them in highlights.
HOW
We now determine the index position of each instance of a non-single space in the string and save it. Then, using those indexes as reference, we adjust the `start_index` and `end_index` that are then applied to the original raw user input to determine what full string to highlight.
(This was actually a pretty complicated process to figure out, and I could go into a lot more detail, but I'm not sure if that's appropriate in a PR? Let me know if you think it might be.)
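To make the goal concrete, here is a rough sketch of one way to recover a highlight from the raw input. It is not the approach in this PR: instead of saving the positions of extra spaces and adjusting indexes afterwards, it records the character offsets of every word while scanning the raw string. All method names (`normalize`, `words_with_offsets`, `highlight_for`) are hypothetical, and the normalization rule is an assumption.

```ruby
# Hypothetical helpers for illustration only; none of these names come from the PR.

# Assumption: normalization strips everything except letters and digits.
def normalize(token)
  token.gsub(/[^a-zA-Z0-9]/, '').downcase
end

# Returns [word, start_offset, end_offset] for every whitespace-delimited
# token that still has content after normalization. Isolated punctuation
# such as " , " normalizes to an empty string and is skipped.
def words_with_offsets(raw)
  result = []
  raw.scan(/\S+/) do
    match = Regexp.last_match
    word = normalize(match[0])
    next if word.empty?
    result << [word, match.begin(0), match.end(0)]
  end
  result
end

# Finds the first run of matched_words in the raw input and returns the
# literal substring to highlight, including any extra spaces or isolated
# punctuation that sit between the matched words.
def highlight_for(raw, matched_words)
  return nil if matched_words.empty?

  tokens = words_with_offsets(raw)
  words = tokens.map(&:first)
  start_word = (0..words.length - matched_words.length).find do |i|
    words[i, matched_words.length] == matched_words
  end
  return nil if start_word.nil?

  start_index = tokens[start_word][1]
  end_index = tokens[start_word + matched_words.length - 1][2]
  raw[start_index...end_index]
end

raw = 'We went , as you all know , to the beach.'
puts highlight_for(raw, %w[went as you all know to])
```

The trade-off in this sketch is that offsets are tracked for every word up front, which avoids index arithmetic later at the cost of an extra pass over the raw input.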
Notion Card Links
https://www.notion.so/quill/Sentry-Error-NoMethodError-Comprehension-FeedbackController-plagiarism-60dd0c0160df466ebe31839a5942470c