
Fix out of memory in hash computation #550

Merged · 1 commit into github:main from fix-oom on Jun 7, 2021

Conversation

edoardopirovano (Contributor)

Closes #528 by making two changes:

  1. As @adityasharad points out, we probably shouldn't try hashing files for results that don't have a line number, since those are probably not source code files. This will reduce the number of files we need to hash and avoid attempting to hash large files like tarballs.
  2. As @RA80533 points out, we could avoid the OOM even if we do encounter a large file by computing the hash in a streaming manner so that we never need to store the whole thing in memory.

Either change individually would fix the issue above, but having both seems the most robust solution.
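
To illustrate the second change, here is a minimal sketch of hashing a file in a streaming manner. The hashFileStreaming helper is hypothetical, and it uses Node's built-in crypto module for brevity; the action computes its own rolling fingerprint rather than a standard digest:

import * as crypto from "crypto";
import * as fs from "fs";

// Hypothetical helper: hashes a file chunk by chunk, so only one chunk
// is ever held in memory regardless of how large the file is.
async function hashFileStreaming(filepath: string): Promise<string> {
  const hash = crypto.createHash("sha256");
  for await (const chunk of fs.createReadStream(filepath, "utf8")) {
    hash.update(chunk); // incremental update; no full-file buffer
  }
  return hash.digest("hex");
}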

Merge / deployment checklist

  • Confirm this change is backwards compatible with existing workflows.
  • Confirm the readme has been updated if necessary.
  • Confirm the changelog has been updated if necessary.

@aeisenberg (Contributor) left a comment

Looks great. One more test would be good and then 🚢.

"f1b89f20de0d133:1",
"c129715d7a2bc9a3:1",
]
);
Contributor

I think this needs another test that includes a sarif file without line numbers.

Contributor Author

The fingerprinting2.input.sarif test file was already missing line numbers, so it shows the new behaviour of omitting the fingerprints (see the change in the expected output). I'm not sure whether that file was testing any other behaviour that is now being missed; it didn't seem to be doing anything particularly different from the first test case to me.

Contributor

Ah... yes. I was reviewing on my phone and didn't see the whole file.

const readStream = fs.createReadStream(filepath, "utf8");
readStream.on("close", fulfill);
readStream.on("end", () => {
processCharacter(65535);
Contributor

What is 65535?

Contributor Author

It was being appended at the end of each file when computing its fingerprint in the existing code. I don't know quite what its purpose is, but in the interest of keeping the fingerprints the same I think it makes sense to keep it there.

Contributor

Yeah, it doesn't seem like a reasonable thing to change, given that it would affect the hashes. In that case, it might be best to move it into a constant so it doesn't remain a magic number.
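
A minimal sketch of that suggestion (the constant name is hypothetical; the value must stay 65535 so existing fingerprints are unchanged):

// Hypothetical name for the sentinel appended at end-of-file.
const EOF_SENTINEL = 65535;

readStream.on("end", () => {
  processCharacter(EOF_SENTINEL);
});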

@edoardopirovano merged commit 02e8dcf into github:main on Jun 7, 2021
@edoardopirovano deleted the fix-oom branch on June 7, 2021 at 14:12
Comment on lines +117 to +127
await new Promise((fulfill) => {
const readStream = fs.createReadStream(filepath, "utf8");
readStream.on("close", fulfill);
readStream.on("end", () => {
processCharacter(65535);
});
readStream.on("data", (data) => {
for (let i = 0; i < data.length; ++i)
processCharacter(data.charCodeAt(i));
});
});
@RA80533 (Contributor) · Jun 7, 2021

Be careful about mixing async-await and callback patterns. Error handling gets tricky because thrown errors don't necessarily correlate to promise rejections.

In this case, this could be rewritten entirely in async-await like so:

const readStream = fs.createReadStream(filepath, "utf8");

for await (const chunk of readStream) {
  // Handle the chunk
}

processCharacter(65535);

Any potential errors would be caught as normal because synchronous functions would throw and asynchronous functions would be rejected (because of the await).
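
For example, a single try/catch would then cover both failure modes. A self-contained sketch, with an inline chunk check standing in for the real per-chunk work:

import * as fs from "fs";

// Illustrative only: with for await, an 'error' event on the stream
// rejects the async iterator, so it surfaces in the same try/catch as
// any synchronous throw from the loop body.
async function consume(filepath: string): Promise<void> {
  try {
    for await (const chunk of fs.createReadStream(filepath, "utf8")) {
      if (typeof chunk !== "string") throw new Error("unexpected chunk");
      // ... per-chunk work goes here ...
    }
  } catch (err) {
    console.error(err); // e.g. ENOENT from the stream lands here too
  }
}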

Contributor

Since this appears to be used for iterating over lines, the Readline API could be useful for consuming the stream:

const reader = readline.createInterface({
  input: fs.createReadStream(filepath, "utf8")
});

for await (const line of reader) {
  // Handle the line
}

processCharacter(65535);

Contributor Author

Thanks for the suggestions! I've applied your first one in #550; it is indeed nicer than what I wrote. The second one doesn't quite fit our use case because we actually want the newlines in the hash.

Contributor

[…] we actually want the newlines in the hash.

Good catch. I figured it would contain the line endings since you're able to redirect the stream directly to {stdout|stderr}.write but apparently it doesn't. 🤔
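
(For context, a quick sketch of why the Readline suggestion above would change the hash input: readline yields each line with its end-of-line characters removed, so the newlines would never reach processCharacter.)

import * as fs from "fs";
import * as readline from "readline";

// Illustrative only: every line printed below has no trailing "\n",
// demonstrating that readline strips the EOL from each line it yields.
async function printLines(filepath: string): Promise<void> {
  const reader = readline.createInterface({
    input: fs.createReadStream(filepath, "utf8"),
  });
  for await (const line of reader) {
    console.log(JSON.stringify(line)); // e.g. "foo", never "foo\n"
  }
}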

Linked issue: codeql-action/upload-sarif - JavaScript heap out of memory