Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comprehension/fix plagiarism nil error #7976

Merged
merged 2 commits into from Jul 28, 2021

Conversation

anathomical
Copy link
Contributor

WHAT

Refactor the logic that identifies which text to highlight when the plagiarism algorithm detects that an entry plagiarizes from the passage so that it gracefully handles cases where there are multiple spaces between words or non-letter/number characters isolated by spaces.

WHY

The logic that is used to determine whether plagiarism is present or not normalizes sentences into arrays of individual words. It does this by stripping extraneous characters (like punctuation), downcasing the string, and then splitting the string on spaces. Ruby's split behavior will treat any number of spaces as a single split token, so once we have our arrays of words we no longer have any knowledge about how many spaces might have been between each word.

This works perfectly for identifying whether a user has plagiarized or not, at least for our purposes. However, it complicates the code that we use to determine what literal text needs to be highlighted to the user. The code that is being replaced here could not handle cases where there were multiple spaces between words (which could also happen if someone had punctuation isolated by spaces such as We went , as you all know , to the beach.. This code is more robust about understanding where multiple spaces exist and accounting for them in highlights.

HOW

We now determine the index position of each instance of a non-single space in the string and save it. Then, using those indexes as reference, we adjust the start_index and end_index that are then applied to the original raw user input to determine what full string to highlight.

(This was actually a pretty complicated process to figure out, and I could go into a lot more detail, but I'm not sure if that's appropriate in a PR? Let me know if you think it might be.)

Notion Card Links

https://www.notion.so/quill/Sentry-Error-NoMethodError-Comprehension-FeedbackController-plagiarism-60dd0c0160df466ebe31839a5942470c

PR Checklist Your Answer
Have you added and/or updated tests? Yes
Have you deployed to Staging? Not yet - deploying now!
Self-Review: Have you done an initial self-review of the code below on Github? Yes
Design Review: If applicable, have you compared the coded design to the mockups? N/A

Because of the way that we calculate plagiarism, the code that identifies the matched string for highlighting removes any cases of multiple spaces in a row.  We need to account for that edge case in case a student entry actually has multiple spaces.

it 'should successfully highlight even when the passage has multiple consecutive spaces but the user entry does not' do
entry = "This phrase plagiarises from the passage even though it has a ton of spaces in it."
passage = "From the passage even though it has a ton of spaces."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's better to remove extra whitespace in passages at publish time, so that downstream algorithms like this can assume clean passages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I also don't think there's any way to guarantee this without taking away some control from Curriculum. It's possible, though I can't imagine off-hand why, that they might intentionally want something like this in some edge case. It doesn't seem reasonable to try to take that away from them without good reason.

@anathomical anathomical merged commit cabbd69 into develop Jul 28, 2021
@delete-merged-branch delete-merged-branch bot deleted the comprehension/fix-plagiarism-nil-error branch July 28, 2021 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants