Compare with MOSS results #27
Comments
At the time of commit 9661fc5 the highlights mostly match, except for two notable differences:
Moss result. The |
At the time of commit 7a6fd96 all the Dolos flags match the Moss flags, so from now on both applications are run with the same flags. |
This compares the -d flag.
Where everything except |
Both programs were run with |
Can you post an updated runtime with the new code? Be sure to time only the run itself and not the tsc compile step. |
Yeah sure, at the time of commit e7c8b14 running Dolos with the same files as in my previous comment takes roughly 2s. |
It seems that our algorithm can't handle the following code segment very well:
let minuten = {
    "00": "HET IS",
    "05": "HET IS VIJF OVER",
    "10": "HET IS TIEN OVER",
    "15": "HET IS KWART OVER",
    "20": "HET IS TIEN VOOR HALF",
    "25": "HET IS VIJF VOOR HALF",
    "30": "HET IS HALF",
    "35": "HET IS VIJF OVER HALF",
    "40": "HET IS TIEN OVER HALF",
    "45": "HET IS KWART VOOR",
    "50": "HET IS TIEN VOOR",
    "55": "HET IS VIJF VOOR"
};
let uren = {
    "01": "EEN",
    "02": "TWEE",
    "03": "DRIE",
    "04": "VIER",
    "05": "VIJF",
    "06": "ZES",
    "07": "ZEVEN",
    "08": "ACHT",
    "09": "NEGEN",
    "10": "TIEN",
    "11": "ELF",
    "12": "TWAALF",
    "13": "EEN",
    "14": "TWEE",
    "15": "DRIE",
    "16": "VIER",
    "17": "VIJF",
    "18": "ZES",
    "19": "ZEVEN",
    "20": "ACHT",
    "21": "NEGEN",
    "22": "TIEN",
    "23": "ELF",
    "00": "TWAALF"
};
Dolos gets confused and splits it up into many small fragments, while Moss seems to handle it mostly fine. |
You should try to find out why. Start by looking at the raw results instead of the summary and see if we have matches for each of the lines. If not, try to play with the kmer length and filter strength. |
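To make the effect of the k-mer length concrete, here is a minimal sketch that counts how many distinct k-mers two token streams share for a few values of k. The `kmers` and `sharedKmers` helpers and the toy token streams are assumptions made for illustration; they are not the Dolos tokenizer or its hashing code.

```typescript
// Hypothetical helper: slide a window of k tokens over a token stream.
function kmers(tokens: string[], k: number): string[] {
  const out: string[] = [];
  for (let i = 0; i + k <= tokens.length; i++) {
    out.push(tokens.slice(i, i + k).join(" "));
  }
  return out;
}

// Count the distinct k-mers that occur in both streams.
function sharedKmers(a: string[], b: string[], k: number): number {
  const inA = new Set(kmers(a, k));
  let shared = 0;
  new Set(kmers(b, k)).forEach((kmer) => {
    if (inA.has(kmer)) {
      shared += 1;
    }
  });
  return shared;
}

// Toy streams: an object literal with two entries versus one entry.
const leftTokens = ["let", "id", "=", "{", "str", ":", "str", ",", "str", ":", "str", "}"];
const rightTokens = ["let", "id", "=", "{", "str", ":", "str", "}"];

for (const k of [2, 4, 8]) {
  console.log(`k=${k}: ${sharedKmers(leftTokens, rightTokens, k)} shared k-mers`);
}
```

Short k-mers match almost everywhere in repetitive code, while long ones stop matching as soon as a single token differs, which is why the parameter is worth tuning against the raw results.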
Would it make sense to report only the ordering with the highest score (e.g. max(score(a,b), score(b,a)))? That way we would get rid of the duplicates and the results would become symmetric (which was an issue for the clustering?) |
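A minimal sketch of that proposal, assuming each directed comparison yields a numeric score. The `DirectedScore` shape, the `pairKey` helper and the file names are hypothetical and only serve the illustration.

```typescript
// Hypothetical shape of one directed comparison result.
interface DirectedScore {
  from: string;
  to: string;
  score: number;
}

// Build an order-independent key for a pair of files.
function pairKey(a: string, b: string): string {
  return a < b ? JSON.stringify([a, b]) : JSON.stringify([b, a]);
}

// Keep a single entry per unordered pair: max(score(a,b), score(b,a)).
function collapseToMax(results: DirectedScore[]): Map<string, number> {
  const best = new Map<string, number>();
  for (const { from, to, score } of results) {
    const key = pairKey(from, to);
    const previous = best.get(key);
    best.set(key, previous === undefined ? score : Math.max(previous, score));
  }
  return best;
}

const collapsed = collapseToMax([
  { from: "a.js", to: "b.js", score: 0.4 },
  { from: "b.js", to: "a.js", score: 0.7 },
  { from: "a.js", to: "c.js", score: 0.1 },
]);
console.log(collapsed);
```

The result is symmetric by construction, at the cost of discarding whatever the lower-scoring direction contained.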
I don't think that is the best option here. As far as I can see the results are perfectly symmetrical, in which case we can simply filter out the duplicates. If they aren't symmetrical, it would make more sense to join both sets of matches into one, so that no information from either direction is lost. I'll test whether they are symmetrical and reply back here. |
The results seem to be symmetric, as far as I can tell from the test results. Test code:
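Joining both directions could look roughly like the sketch below. `LineRange` and `RangesTuple` are local stand-ins for the types in `./lib/range` and `./lib/summary`, and `mergeMirrored` is a hypothetical helper, not existing Dolos code.

```typescript
// Local stand-ins for the project's range types (assumed shape).
interface LineRange {
  from: number;
  to: number;
}
type RangesTuple = [LineRange, LineRange];

// Serialise a tuple so identical matches can be de-duplicated.
function tupleKey([left, right]: RangesTuple): string {
  return `${left.from}-${left.to}|${right.from}-${right.to}`;
}

// Union of the a->b matches and the flipped b->a matches.
function mergeMirrored(ab: RangesTuple[], ba: RangesTuple[]): RangesTuple[] {
  const merged = new Map<string, RangesTuple>();
  for (const tuple of ab) {
    merged.set(tupleKey(tuple), tuple);
  }
  for (const [r1, r2] of ba) {
    // Flip the tuple so both lists use the same (a, b) orientation.
    const flipped: RangesTuple = [r2, r1];
    merged.set(tupleKey(flipped), flipped);
  }
  return Array.from(merged.values());
}

const mergedMatches = mergeMirrored(
  [[{ from: 0, to: 3 }, { from: 10, to: 13 }]],
  [
    [{ from: 10, to: 13 }, { from: 0, to: 3 }],
    [{ from: 20, to: 22 }, { from: 5, to: 7 }],
  ],
);
console.log(mergedMatches);
```

Matches present in both directions collapse into one entry, and a match found in only one direction is kept.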
import fs from "fs";
import path from "path";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { RangesTuple } from "./lib/summary";
import { Range } from "./lib/range";

(async () => {
    const jsonResults: JSONSummaryFormat = JSON.parse(
        fs.readFileSync(path.resolve("temp.json"), "utf8"),
        JSONFormatter.JSONReviverFunction,
    );
    let reversed: number = 0;
    let normal: number = 0;
    const resultsMap: Map<string, RangesTuple[]> = new Map();
    for (const group of jsonResults.results) {
        for (let [file1, file2, matches] of group) {
            // Normalise the pair order so (a, b) and (b, a) share one key.
            if (file1 < file2) {
                [file1, file2] = [file2, file1];
            }
            if (!resultsMap.has(file1 + file2)) {
                normal += 1;
                resultsMap.set(file1 + file2, matches);
            } else {
                reversed += 1;
                const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
                if (matches.length !== otherMatches.length || !areMatchesEqual(matches, otherMatches)) {
                    console.log(`${file1}'s and ${file2}'s results aren't symmetrical`);
                }
            }
        }
    }
    console.log(normal, reversed);
})();

// True if every match has a mirrored counterpart in mirroredMatches.
function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
    for (const match of matches.values()) {
        if (mirroredMatches.findIndex((potentialMatch) => areRangesTuplesMirroredEqual(match, potentialMatch)) === -1) {
            console.log(match);
            return false;
        }
    }
    return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
    return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
    return range1.from === range2.from && range1.to === range2.to;
} |
After looking at the raw results it seems that each entry in the array is matched against every other entry, causing them to exceed the maximum hash count or maximum hash percentage. Results:
[0, 5] [2, 33] [3, 8] [4, 63] [6, 20] [8, 15] [9, 51] [10, 5] [12, 39] [13, 40] [14, 8] [15, 43] [16, 26] [21, 16] [22, 2] [23, 8] [24, 124] [26, 55] [29, 38] [30, 79] [31, 2] [33, 9] [34, 6] [35, 22] [36, 22] [37, 4] [38, 162] [39, 31] [40, 57] [41, 25] [42, 67] [46, 22] [47, 4] [48, 162] [49, 37] [50, 5] [51, 59] [52, 70] [53, 36] [57, 3] [59, 3] [62, 15] [63, 8] [65, 30] [66, 30] [67, 30] [68, 30] [69, 30] [70, 30] [71, 30] [72, 30] [73, 30] [74, 2] [75, 3] [78, 8] [80, 30] [81, 30] [82, 30] [83, 30] [84, 30] [85, 30] [86, 30] [87, 30] [88, 30] [89, 30] [90, 30] [91, 30] [92, 30] [93, 30] [94, 30] [95, 30] [96, 30] [97, 30] [98, 30] [99, 30] [100, 30] [101, 2] [102, 2] [104, 2] [105, 6] [106, 126] [107, 4] [108, 550] [109, 4] [110, 159] [111, 4] [112, 256] [113, 4] [114, 254] [115, 1] [117, 133] [118, 144] [119, 12] [120, 11] [123, 121] [125, 101] [126, 19] [128, 54] [130, 126] [131, 101] [133, 105] [135, 14] [136, 28] [137, 2] [138, 26] [142, 17] [143, 10] [144, 12] [145, 73] [146, 75] [147, 77] [148, 60] [150, 6] [151, 6] [152, 4] [153, 3] [154, 16] [155, 45] [156, 78] [157, 30] [158, 55] [159, 53] [160, 1] [161, 58] [162, 39] [163, 78] [164, 30] [165, 55] [166, 52] [167, 1] [172, 8] [173, 8] [174, 4] [175, 5] [176, 37] [177, 2] [179, 35] [180, 24] [181, 53] [182, 66] [183, 5] [184, 6] [185, 54] [186, 68] [187, 7] [188, 81] [189, 14] [190, 14] [191, 34] [192, 29] [193, 13] [194, 11] [195, 9] [199, 30] [200, 67] [201, 2] [203, 29] [204, 13] [205, 11] [206, 12] [208, 7] [209, 81] [210, 14] [211, 14] [212, 34] [213, 29] [214, 13] [215, 11] [216, 10] [220, 53] |
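The repetition described above can be reproduced on a toy token stream. This is only a sketch: `countKmers` and `keepRare` are hypothetical helpers, the token names are made up, and the threshold merely mimics the idea of a maximum hash count rather than the actual Dolos filter.

```typescript
// Count how often each k-mer occurs in a token stream.
function countKmers(tokens: string[], k: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + k <= tokens.length; i++) {
    const kmer = tokens.slice(i, i + k).join(" ");
    counts.set(kmer, (counts.get(kmer) || 0) + 1);
  }
  return counts;
}

// Keep only k-mers that occur at most maxCount times.
function keepRare(counts: Map<string, number>, maxCount: number): string[] {
  const kept: string[] = [];
  counts.forEach((count, kmer) => {
    if (count <= maxCount) {
      kept.push(kmer);
    }
  });
  return kept;
}

// An object literal with 12 entries tokenizes to the same four tokens repeated.
const entryTokens = ["str", ":", "str", ","];
const literalTokens: string[] = [];
for (let i = 0; i < 12; i++) {
  literalTokens.push(...entryTokens);
}

const kmerCounts = countKmers(literalTokens, 4);
console.log(kmerCounts.size, "distinct k-mers");
console.log(keepRare(kmerCounts, 10).length, "k-mers survive a limit of 10");
```

Because every entry produces the same handful of k-mers, each of them occurs many times, and a frequency limit then removes all of them from the region at once.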
After some more testing I found that the results are not at all symmetrical. My previous test ran on filtered output, from which the asymmetrical results had already been removed. That said, most of the asymmetrical results seem to be bad matches. New code:
import fs from "fs";
import path from "path";
import { CodeTokenizer } from "./lib/codeTokenizer";
import { Comparison, Matches } from "./lib/comparison";
import { JSONFormatter, JSONSummaryFormat } from "./lib/jsonFormatter";
import { Range } from "./lib/range";
import { FilterOptions, RangesTuple, Summary } from "./lib/summary";

(async () => {
    const mapLocation: string = path.resolve("./samples/js/qlocktwo/");
    const locations: string[] = fs
        .readdirSync(mapLocation, "utf8")
        .map(location => `${mapLocation}/${location}`);
    const tokenizer = new CodeTokenizer("javascript");
    const comparison = new Comparison(tokenizer, {
        filterHashByPercentage: undefined,
        maxHash: 200,
    });
    comparison.addFiles(locations);
    const matchesPerFile: Map<string, Matches<number>> = await comparison.compareFiles(locations);
    const filterOptions: FilterOptions = {
        minimumFragmentLength: 1,
    };
    const summary = new Summary(matchesPerFile, 0, filterOptions, 0);
    // No-op reference that keeps `summary` from being reported as unused.
    () => summary;
    const jsonResults: JSONSummaryFormat = JSON.parse(
        fs.readFileSync(path.resolve("temp.json"), "utf8"),
        JSONFormatter.JSONReviverFunction,
    );
    testRawResults(matchesPerFile);
    // Not executed yet; wrapped in an arrow function on purpose.
    () => testSummaryResults(jsonResults.results); //TODO
})();

function testRawResults(results: Map<string, Matches<number>>): void {
    let reversed: number = 0;
    let normal: number = 0;
    const resultsMap: Map<string, Array<[number, number]>> = new Map();
    for (const [fileA, matches] of results.entries()) {
        for (const [fileB, matchingLines] of matches.entries()) {
            // Normalise the pair order into fresh bindings. Swapping the outer
            // loop variable itself would leak into later inner iterations.
            const [file1, file2] = fileA < fileB ? [fileB, fileA] : [fileA, fileB];
            if (!resultsMap.has(file1 + file2)) {
                normal += 1;
                resultsMap.set(file1 + file2, matchingLines);
            } else {
                reversed += 1;
                const otherMatches = resultsMap.get(file1 + file2) as Array<[number, number]>;
                if (!areMatchingLinesEqual(matchingLines, otherMatches)) {
                    console.log(
                        `${file1}'s (${matchingLines.length}) and ${file2}'s (${otherMatches.length}) results aren't symmetrical`,
                    );
                }
            }
        }
    }
    // Report the totals once, after all pairs have been visited.
    console.log(normal, reversed);
}

// True if every [a, b] in lines1 has a mirrored [b, a] in lines2.
function areMatchingLinesEqual(
    lines1: Array<[number, number]>,
    lines2: Array<[number, number]>,
): boolean {
    for (const [l11, l12] of lines1.values()) {
        if (lines2.findIndex(([l21, l22]) => l11 === l22 && l12 === l21) === -1) {
            console.log(lines1);
            return false;
        }
    }
    return true;
}

function testSummaryResults(results: Array<Array<[string, string, RangesTuple[]]>>): void {
    let reversed: number = 0;
    let normal: number = 0;
    const resultsMap: Map<string, RangesTuple[]> = new Map();
    for (const group of results) {
        for (let [file1, file2, matches] of group) {
            if (file1 < file2) {
                [file1, file2] = [file2, file1];
            }
            if (!resultsMap.has(file1 + file2)) {
                normal += 1;
                resultsMap.set(file1 + file2, matches);
            } else {
                reversed += 1;
                const otherMatches: RangesTuple[] = resultsMap.get(file1 + file2) as RangesTuple[];
                if (!areMatchesEqual(matches, otherMatches)) {
                    console.log(
                        `${file1}'s (${matches.length}) and ${file2}'s (${otherMatches.length}) results aren't symmetrical`,
                    );
                }
            }
        }
    }
    console.log(normal, reversed);
}

// True if every match has a mirrored counterpart in mirroredMatches.
function areMatchesEqual(matches: RangesTuple[], mirroredMatches: RangesTuple[]): boolean {
    for (const match of matches.values()) {
        if (
            mirroredMatches.findIndex(potentialMatch =>
                areRangesTuplesMirroredEqual(match, potentialMatch),
            ) === -1
        ) {
            console.log(match);
            return false;
        }
    }
    return true;
}

function areRangesTuplesMirroredEqual([r11, r12]: RangesTuple, [r21, r22]: RangesTuple): boolean {
    return areRangesEqual(r11, r22) && areRangesEqual(r12, r21);
}

function areRangesEqual(range1: Range, range2: Range): boolean {
    return range1.from === range2.from && range1.to === range2.to;
} |
This has been done in the upcoming publication. |