In match_#.html, distro code is sometimes highlighted as an Exact match and not filtered out #72

Open
dmalan opened this issue Jul 22, 2020 · 3 comments

@dmalan
Member

dmalan commented Jul 22, 2020

No description provided.

@dmalan dmalan added the bug label Jul 22, 2020
@Jelleas
Contributor

Jelleas commented Jul 22, 2020

Tricky one. In short, the current noise threshold is too high for function definitions. These definitions contain relatively few tokens (especially in speller) and, by compare50's current standards, are too small to be relevant. We could lower the noise threshold and see how that performs here, or let the distribution-code comparison use a lower noise threshold than the actual comparison.

Full story: to exclude distro code from student code, compare50 uses the same method it uses for comparison. Namely, it breaks each file into k-grams, short sequences of tokens. If any such sequence from the distro code matches a sequence in the student's file, that sequence is removed/ignored. The problem lies in the "noise" threshold, which is effectively the length of those sequences. If the length is too short, almost everything will match; if it is too long, almost nothing will. Through experimentation we landed on the "magic number" of 25 tokens: `comparator = comparators.Winnowing(k=25, t=35)`.
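To illustrate why a short function definition slips past a 25-token threshold, here is a minimal sketch of k-gram based distro filtering. It is not compare50's implementation (which uses winnowing over fingerprints); the tokenizer and the names `tokenize`, `kgrams`, and `distro_token_mask` are hypothetical and exist only for this example.

```python
import re

def tokenize(source):
    # Crude illustrative tokenizer: identifiers, numbers, or single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def kgrams(tokens, k):
    # Every contiguous run of k tokens; empty if there are fewer than k tokens.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def distro_token_mask(student_tokens, distro_tokens, k):
    # True at each student token covered by a k-gram that also occurs in the distro code.
    distro = kgrams(distro_tokens, k)
    mask = [False] * len(student_tokens)
    for i in range(len(student_tokens) - k + 1):
        if tuple(student_tokens[i:i + k]) in distro:
            for j in range(i, i + k):
                mask[j] = True
    return mask

distro_tokens = tokenize(
    "bool check(const char *word)\n{\n    return false;\n}\n"
)
student_tokens = tokenize(
    "bool check(const char *word)\n{\n"
    "    node *cursor = table[hash(word)];\n"
    "    return cursor != NULL;\n}\n"
)

# The distro stub has only 13 tokens, so it yields no 25-grams at all
# and nothing in the student's file is recognized as distro code.
print(sum(distro_token_mask(student_tokens, distro_tokens, k=25)))  # 0

# With a shorter k, the shared signature and opening brace are masked out.
print(sum(distro_token_mask(student_tokens, distro_tokens, k=8)))   # 9
```

In this toy case, a lower k for the distro pass alone would remove the function signature without changing the k used to compare student submissions with each other.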

@dmalan
Member Author

dmalan commented Jul 22, 2020

Hm, here too, could we do more thorough comparisons after the initial filtration? That is, re-examine all ~50 matches, diff out distro code, then exact-match the remaining lines before sending to the GUI? That's probably pretty fast for just 50 pairs.
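A rough sketch of what such a second pass could look like for a single pair, assuming a purely line-based approach: drop every line that also appears in the distro code, then report lines that are identical in both submissions. The helper names `strip_distro_lines` and `exact_line_matches` are hypothetical and not part of compare50.

```python
def strip_distro_lines(source, distro):
    # Keep (line_number, text) for non-blank lines that do not appear in the distro code.
    distro_lines = {line.strip() for line in distro.splitlines()}
    kept = []
    for number, line in enumerate(source.splitlines(), start=1):
        text = line.strip()
        if text and text not in distro_lines:
            kept.append((number, text))
    return kept

def exact_line_matches(a, b, distro):
    # For each remaining line in a, list the line numbers in b with identical text.
    b_by_text = {}
    for number, text in strip_distro_lines(b, distro):
        b_by_text.setdefault(text, []).append(number)
    return [(number, b_by_text[text])
            for number, text in strip_distro_lines(a, distro)
            if text in b_by_text]

distro = "bool check(const char *word)\n{\n    return false;\n}\n"
a = ("bool check(const char *word)\n{\n"
     "    node *cursor = table[hash(word)];\n"
     "    return cursor != NULL;\n}\n")
b = ("bool check(const char *word)\n{\n"
     "    node *cursor = table[hash(word)];\n"
     "    return false;\n}\n")

# Only line 3 survives the distro diff in both files and matches exactly.
print(exact_line_matches(a, b, distro))  # [(3, [3])]
```

Each pair costs roughly one pass over both files, so running this over about 50 pairs should be cheap.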

@Jelleas
Contributor

Jelleas commented Jul 23, 2020

That would essentially be a new method, exact-by-line. It would be interesting to try out, given that for text/exact, lines are somewhat of a logical unit of information. There may be small gotchas with this technique, though, since there are always uninteresting lines in code:

{
}
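One way to sidestep such lines, sketched here under the assumption that a line with no letters or digits carries no useful signal, is to skip them before any line-by-line matching. The names `is_trivial` and `interesting_lines` are hypothetical.

```python
def is_trivial(line):
    # Assumption for this sketch: a line without any letter or digit
    # (blank, "{", "}", "};") is not worth matching on.
    return not any(ch.isalnum() for ch in line)

def interesting_lines(source):
    # (line_number, stripped_text) for every line worth comparing.
    return [(number, line.strip())
            for number, line in enumerate(source.splitlines(), start=1)
            if not is_trivial(line)]

print(interesting_lines("int main(void)\n{\n    return 0;\n}\n"))
# [(1, 'int main(void)'), (3, 'return 0;')]
```

Short but common lines such as `else` or `return 0;` would still pass this filter, so a real heuristic would likely need something stricter.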
