Implement approximate file matching #5 #6

Merged: 8 commits, Apr 16, 2024

JonoYang
Contributor

No description provided.

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Compute resource fingerprints in fingerprint_codebase

Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang JonoYang force-pushed the 5-approximate-file-matching branch from 9293abc to 287f8ec Compare April 15, 2024 21:58
    * Use open instead of codecs.open

Signed-off-by: Jono Yang <jyang@nexb.com>
file_fingerprint = BitAverageHaloHash(ngs_bytes) if ngs_bytes else None

return dict(
    halo1=file_fingerprint.hexdigest().decode('utf-8') if file_fingerprint else ''
Contributor Author

@JonoYang JonoYang Apr 15, 2024

@pombredanne What should the file fingerprint be called? Currently it is halo1 but I don't think that's a great label. Also, if a fingerprint could not be generated for a file, should we still populate that resource's extra_data field with {'halo1': ''} or just avoid returning anything at all?

Collaborator

> What should the file fingerprint be called? Currently it is halo1 but I don't think that's a great label.

This is OK for now IMHO.

> Also, if a fingerprint could not be generated for a file, should we still populate that resource's extra_data field with {'halo1': ''} or just avoid returning anything at all?

IMHO, always provide a value, null or empty (the same as we do for other hashes).
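The "always provide a value" convention can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `StandInHash` and `get_fingerprint_data` are hypothetical names, the stand-in uses SHA-256 rather than a halo hash, and `ngs_bytes` is assumed to be a list of byte strings.

```python
import binascii
import hashlib


class StandInHash:
    """Hypothetical stand-in for the fingerprint class (assumption).

    Like the snippet under review, its hexdigest() returns bytes.
    """

    def __init__(self, chunks):
        digest = hashlib.sha256()
        for chunk in chunks:
            digest.update(chunk)
        self._digest = digest.digest()

    def hexdigest(self):
        return binascii.hexlify(self._digest)


def get_fingerprint_data(ngs_bytes, hash_class=StandInHash):
    """Return a mapping that always has a 'halo1' key.

    The value is a hex string, or '' when there was nothing to hash.
    """
    fingerprint = hash_class(ngs_bytes) if ngs_bytes else None
    return dict(
        halo1=fingerprint.hexdigest().decode("utf-8") if fingerprint else "",
    )


# An empty input still yields the key, with an empty value.
print(get_fingerprint_data([]))
print(get_fingerprint_data([b"some words here"]))
```

Keeping the key present in both cases means consumers of extra_data can read `halo1` without first checking whether it exists.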

Contributor Author

@JonoYang JonoYang left a comment

@pombredanne I have a question regarding the file fingerprint name

@pombredanne pombredanne changed the title 5 approximate file matching Implement approximate file matching #5 Apr 16, 2024
Collaborator

@pombredanne pombredanne left a comment

Looks fine, but I wonder whether the tests are correct or sufficient. In particular, using words repeated a power-of-two number of times may be a special corner case.

@JonoYang
Contributor Author

@pombredanne What additional tests do you suggest?

@pombredanne
Collaborator

@JonoYang What about a test that uses two real code files, where the second one has been slightly modified?
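A rough sketch of what such a test could compare, under stated assumptions: `toy_fingerprint` is a hypothetical bit-averaging fingerprint written for this illustration (it is not BitAverageHaloHash), the inline strings stand in for real code files, and any distance threshold would need to be tuned against real data.

```python
import hashlib

FINGERPRINT_BITS = 128


def word_ngrams(text, n=3):
    """Return the word n-grams of `text` as UTF-8 byte strings."""
    words = text.split()
    if not words:
        return []
    if len(words) < n:
        return [" ".join(words).encode("utf-8")]
    return [
        " ".join(words[i:i + n]).encode("utf-8")
        for i in range(len(words) - n + 1)
    ]


def toy_fingerprint(text):
    """Hypothetical bit-averaging fingerprint (assumption, not the project's).

    Each n-gram is hashed to 128 bits; an output bit is set when more than
    half of the n-gram hashes have that bit set. Returns an int, or None
    when the text contains no words.
    """
    ngrams = word_ngrams(text)
    if not ngrams:
        return None
    counts = [0] * FINGERPRINT_BITS
    for ngram in ngrams:
        digest = hashlib.blake2b(ngram, digest_size=16).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(FINGERPRINT_BITS):
            counts[bit] += (value >> bit) & 1
    half = len(ngrams) / 2
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > half:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(fp1, fp2):
    """Count the bits that differ between two integer fingerprints."""
    return bin(fp1 ^ fp2).count("1")


# Stand-ins for two real code files; the second differs by one small edit.
ORIGINAL = """
def mean(values):
    if not values:
        raise ValueError("values must not be empty")
    total = 0
    for value in values:
        total += value
    count = len(values)
    return total / count
"""
MODIFIED = ORIGINAL.replace("return total / count", "return float(total) / count")

print("identical:", hamming_distance(toy_fingerprint(ORIGINAL), toy_fingerprint(ORIGINAL)))
print("modified:", hamming_distance(toy_fingerprint(ORIGINAL), toy_fingerprint(MODIFIED)))
```

A real test would load the two files from the test data directory and assert that the distance for the modified file stays below a chosen threshold, while an unrelated file lands well above it.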

Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang JonoYang merged commit 842777f into main Apr 16, 2024
6 checks passed
2 participants