In match_#.html, distro code is sometimes highlighted as an Exact match and not filtered out #72

dmalan · 2020-07-22T17:07:05Z

No description provided.

Jelleas · 2020-07-22T21:12:30Z

Tricky one. In short, the current noise threshold is too high for function definitions. These definitions contain relatively few tokens (especially in speller) and are, by compare50's current standards, too small to be relevant. We could lower the noise threshold and see how that performs here, or let the distribution code comparison use a lower noise threshold than the actual comparison.

Full story: to exclude distro code from student code, compare50 uses the same method it uses for comparison. Namely, it breaks the file up into k-grams, short sequences of tokens. If any such sequence from the distro code matches a sequence in the student's file, that sequence gets removed/ignored. The problem here lies in the "noise" threshold, which is effectively the length of those sequences. If this length is too short, almost everything will match, but if it's too long, almost nothing will match. Through experimentation we landed on the "magic number" 25 (tokens); see compare50/compare50/passes.py, line 33 in 3727b46:

comparator = comparators.Winnowing(k=25, t=35)
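To make the effect of the noise threshold concrete, here is a minimal sketch of k-gram based distro filtering. It is not compare50's actual implementation: the regex tokenizer and the helper names (tokenize, kgrams, ignored_positions) are made up for illustration, and it compares every k-gram directly instead of hashing and winnowing them.

```python
import re


def tokenize(source):
    """Rough stand-in tokenizer (an assumption, not compare50's lexer):
    identifiers, integers, and single non-space characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def kgrams(tokens, k):
    """Return the set of all windows of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def ignored_positions(student_tokens, distro_tokens, k):
    """Indices of student tokens covered by a k-gram that also occurs in the distro."""
    distro = kgrams(distro_tokens, k)
    covered = set()
    for i in range(len(student_tokens) - k + 1):
        if tuple(student_tokens[i:i + k]) in distro:
            covered.update(range(i, i + k))
    return covered


# Hypothetical distro stub and student solution sharing only a function header.
distro = tokenize(
    "bool check(const char *word)\n{\n    // TODO\n    return false;\n}"
)
student = tokenize(
    "bool check(const char *word)\n{\n"
    "    node *n = table[hash(word)];\n    return n != NULL;\n}"
)

# The distro stub has fewer than 25 tokens, so it yields no 25-grams at all
# and nothing in the student file is filtered out.
filtered_k25 = ignored_positions(student, distro, 25)

# With a much shorter window, the shared header (through the opening brace)
# is recognized as distro code.
filtered_k5 = ignored_positions(student, distro, 5)
```

In this toy case the shared header is only 9 tokens long, so no window of 25 tokens can ever cover it, which is the situation described above for short function definitions.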

dmalan · 2020-07-22T22:53:31Z

Hm, here too could we do more thorough comparisons after the initial filtration, such that we re-examine all ~50 matches, diff out distro code, then exact-match other lines before sending to the GUI? Probably pretty fast for just 50 pairs?

Jelleas · 2020-07-23T13:23:17Z

That would essentially be a new method: exact-by-line. It would be interesting to try out, though, given that for text/exact, lines are somewhat of a logical unit of information. One small gotcha with this technique: there are always uninteresting lines in code, such as:

{
}
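A rough sketch of what such an exact-by-line pass could look like, with distro lines diffed out and uninteresting lines skipped. Everything here is hypothetical: the function name shared_lines, the whitespace normalization, and the TRIVIAL set are assumptions for illustration and do not exist in compare50.

```python
def normalize(line):
    """Collapse runs of whitespace so indentation differences do not matter."""
    return " ".join(line.split())


# Hypothetical list of lines considered too uninteresting to report.
TRIVIAL = {"", "{", "}", "};", "else", "else {", "} else {"}


def shared_lines(a, b, distro):
    """Map each interesting line found in both a and b to its line numbers in each.

    Lines that also appear in the distro code, or that are trivial, are ignored.
    """
    distro_lines = {normalize(line) for line in distro.splitlines()}

    def interesting(source):
        found = {}
        for number, line in enumerate(source.splitlines(), start=1):
            text = normalize(line)
            if text in TRIVIAL or text in distro_lines:
                continue
            found.setdefault(text, []).append(number)
        return found

    lines_a, lines_b = interesting(a), interesting(b)
    return {
        text: (lines_a[text], lines_b[text])
        for text in lines_a.keys() & lines_b.keys()
    }


# Hypothetical example files.
distro_src = "bool check(const char *word)\n{\n    return false;\n}\n"
sub_a = (
    "bool check(const char *word)\n{\n"
    "    node *n = table[hash(word)];\n    return n != NULL;\n}\n"
)
sub_b = (
    "bool check(const char *word)\n{\n"
    "  node *n = table[hash(word)];\n  return false;\n}\n"
)

matches = shared_lines(sub_a, sub_b, distro_src)
```

Since this would only run on the ~50 pairs that survive the initial pass, a plain dictionary lookup per line should be cheap. How large the trivial-line list needs to be in practice is an open question.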

dmalan added the bug label Jul 22, 2020
