Delete duplicate files #637
There are a large number of files in this repo that have duplicated filenames if they're treated as case-insensitive. One effect of this is that if you use nix to bring this repo into your project, the checksum for the project will differ between, say, Linux and macOS. I also suspect there's undefined behavior as far as which module gets built.

I wrote a script that takes the output of `git ls-files` and:

1. For duplicate filenames with exactly equivalent content, it keeps the oldest file and deletes the newer ones.
2. For duplicate filenames with differing content, it keeps the newest file and deletes the older ones.

```python
import sys
import hashlib
import subprocess


def md5(fname):
    # Hash the file in 4 KiB chunks so large files are not read into memory at once.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def file_git_timestamp(fname):
    # Committer timestamp (Unix seconds) of the last commit that touched fname.
    # The "--" keeps git from interpreting the path as a revision or option.
    return int(
        subprocess.check_output(["git", "log", "-1", "--pretty=%ct", "--", fname])
        .decode("utf8")
        .strip()
    )


# Group the paths read from stdin by their lowercased form.
files = {}
for file in map(str.rstrip, sys.stdin):
    normalized = file.lower()
    if normalized not in files:
        files[normalized] = []
    files[normalized].append(file)

# Split the colliding groups into identical-content and differing-content sets.
exacts = []
diffs = []
for k, v in files.items():
    if len(v) >= 2:
        hashes = [md5(f) for f in v]
        if all(hashes[0] == x for x in hashes):
            exacts.append(v)
        else:
            diffs.append(v)

# Identical content: keep the oldest file, remove the rest.
for exact in exacts:
    oldest_to_newest = sorted(exact, key=file_git_timestamp)
    print(f"Keep {oldest_to_newest[0]}")
    for kill in oldest_to_newest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

# Differing content: keep the newest file, remove the rest.
for diff in diffs:
    newest_to_oldest = sorted(diff, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

print(f"{len(exacts)} exact matches")
print(f"{len(diffs)} differing matches")
```
LGTM
@brendanhay I'm currently referencing this branch because some of my colleagues were affected by the macOS filesystem. What do you think has to be done to get this merged? Is there something preventing the merge that I can help with? Staying on a commit that is not merged increases the chance that it will never be merged and that we develop against a non-future-proof edge of the tree, with more update friction. From my limited perspective this PR seems good as is, but if there is something I can do to get it merged, let me know. Happy to do the legwork.
@mbj either @MichaelXavier or myself can merge it.
Thanks, @brendanhay!
The same issue from brendanhay#637 has recurred: files with the same name but different case cause problems on Mac due to the case-insensitive filesystem. Previously, my script would prefer newer files where there was a conflict *and* differing file content, but it would prefer *older* files when the files were exactly the same. In retrospect that was a mistake: if the conflicting files are being automatically generated, the generator is just going to regenerate the deleted files every time.

Below is the updated script I used to resolve this issue. I've spot-checked the files in the diff, and the ones being removed are indeed the older ones.

```python
import sys
import hashlib
import subprocess


def md5(fname):
    # Hash the file in 4 KiB chunks so large files are not read into memory at once.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def file_git_timestamp(fname):
    # Committer timestamp (Unix seconds) of the last commit that touched fname.
    # The "--" keeps git from interpreting the path as a revision or option.
    return int(
        subprocess.check_output(["git", "log", "-1", "--pretty=%ct", "--", fname])
        .decode("utf8")
        .strip()
    )


# Group the paths read from stdin by their lowercased form.
files = {}
for file in map(str.rstrip, sys.stdin):
    normalized = file.lower()
    if normalized not in files:
        files[normalized] = []
    files[normalized].append(file)

# Split the colliding groups into identical-content and differing-content sets.
exacts = []
diffs = []
for k, v in files.items():
    if len(v) >= 2:
        hashes = [md5(f) for f in v]
        if all(hashes[0] == x for x in hashes):
            exacts.append(v)
        else:
            diffs.append(v)

# Identical content: keep the newest file, remove the rest.
for exact in exacts:
    newest_to_oldest = sorted(exact, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

# Differing content: keep the newest file, remove the rest.
for diff in diffs:
    newest_to_oldest = sorted(diff, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

print(f"{len(exacts)} exact matches")
print(f"{len(diffs)} differing matches")
```
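Since the collision has now come back once, a read-only check might help catch it before it lands again. The following is only a sketch, not something that exists in this repo: `find_case_collisions`, `tracked_files`, and `main` are hypothetical names, and it assumes the same lowercase comparison the cleanup script uses. It deletes nothing and only reports groups of tracked paths that differ by case.

```python
import subprocess
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups (sorted lists) of paths that are equal when lowercased."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.lower()].append(path)
    return [sorted(group) for group in groups.values() if len(group) >= 2]


def tracked_files():
    """List tracked files via `git ls-files`; must run inside the repository."""
    # -z separates paths with NUL bytes, so git does not quote unusual filenames.
    out = subprocess.check_output(["git", "ls-files", "-z"])
    return [p for p in out.decode("utf8").split("\0") if p]


def main():
    """Print every collision group; return 1 if any were found, else 0."""
    collisions = find_case_collisions(tracked_files())
    for group in collisions:
        print("case collision: " + ", ".join(group))
    return 1 if collisions else 0
```

To use it as a script, one option is to end the file with `raise SystemExit(main())`, so a CI step fails whenever a collision is present. Whether that belongs in this repo's CI is a separate question for the maintainers.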
FWIW I have introduced this package into a large (>100KLOC) project and everything seems to build and work.