
Delete duplicate files #637

Merged 1 commit on Sep 23, 2021
Conversation

@MichaelXavier (Collaborator)

There are a large number of files in this repo that have duplicated
filenames when treated as case-insensitive. One effect of this is
that if you use Nix to bring this repo into your project, the checksum
for the project will differ between Linux and macOS. I also suspect
there's undefined behavior as to which module gets built.

I wrote a script that takes the output of `git ls-files` and:

  1. For duplicate filenames with exactly equivalent content, it keeps the
     oldest file and deletes the newer ones.
  2. For duplicate filenames with differing content, it keeps the newest
     file and deletes the older ones.

FWIW, I have introduced this package into a large (>100 KLOC) project and everything seems to build and work.

```python
import sys
import hashlib
import subprocess

def md5(fname):
    # Hash the file in 4 KiB chunks so large files don't need to fit in memory.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def file_git_timestamp(fname):
    # Committer timestamp (%ct) of the most recent commit touching the file.
    return int(
        subprocess.check_output(["git", "log", "-1", "--pretty=%ct", fname])
        .decode("utf8")
        .strip()
    )

# Group paths read from stdin by their lowercased (case-insensitive) name.
files = {}
for file in map(str.rstrip, sys.stdin):
    files.setdefault(file.lower(), []).append(file)

# Split collision groups into byte-identical and differing content.
exacts = []
diffs = []
for v in files.values():
    if len(v) >= 2:
        hashes = [md5(f) for f in v]
        if all(hashes[0] == x for x in hashes):
            exacts.append(v)
        else:
            diffs.append(v)

# Identical content: keep the oldest copy, delete the newer ones.
for exact in exacts:
    oldest_to_newest = sorted(exact, key=file_git_timestamp)
    print(f"Keep {oldest_to_newest[0]}")
    for kill in oldest_to_newest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

# Differing content: keep the newest copy, delete the older ones.
for diff in diffs:
    newest_to_oldest = sorted(diff, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

print(f"{len(exacts)} exact matches")
print(f"{len(diffs)} differing matches")
```
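The case-insensitive grouping at the heart of the script can be sketched in isolation; this is a minimal illustration with made-up paths standing in for `git ls-files` output, not part of the PR itself:

```python
from collections import defaultdict

# Hypothetical paths standing in for `git ls-files` output.
paths = ["gen/Types.hs", "gen/types.hs", "gen/Waiters.hs"]

groups = defaultdict(list)
for p in paths:
    groups[p.lower()].append(p)  # lowercased name = case-insensitive key

collisions = [v for v in groups.values() if len(v) >= 2]
print(collisions)  # [['gen/Types.hs', 'gen/types.hs']]
```

Any group with two or more members is exactly the kind of collision that trips up a case-insensitive filesystem like macOS's default APFS configuration.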

@brendanhay (Owner) left a comment:
LGTM

@mbj commented Sep 23, 2021

@brendanhay I'm currently referencing this branch, as some of my colleagues were affected by the macOS filesystem.

What do you think has to be done to get this merged? Is there something preventing the merge that I can help with?

Staying on an unmerged commit increases the chance it will never be merged, and we end up developing against a non-future-proof edge of the tree, with more update friction.

From my limited perspective this PR seems good as is, but if there is something I can do to get it merged, let me know. Happy to do the legwork.

@brendanhay (Owner)

@mbj either @MichaelXavier or myself can merge it.

@brendanhay brendanhay merged commit c28b015 into brendanhay:develop Sep 23, 2021
@MichaelXavier (Collaborator, Author)

Thanks, @brendanhay!

@MichaelXavier MichaelXavier deleted the fix-duplicate-files branch September 28, 2021 23:45
MichaelXavier added a commit to Soostone/amazonka that referenced this pull request Sep 30, 2021
The same issue from brendanhay#637 has recurred: files with the same name but
different case cause problems on macOS due to its case-insensitive
filesystem. Previously, my script would prefer newer files where there
was a conflict *and* differing file content, but it would prefer *older*
files when the files were the exact same. In retrospect that was a
mistake because if the conflicting files are being automatically
generated, that script is just going to regenerate the files that were
deleted every time. Below is the updated script I used to resolve this
issue. I've spot checked the files in the diff and the ones being
removed are indeed the older ones.
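The keep-newest rule described above boils down to sorting each collision group by commit timestamp and deleting everything after the first entry; a minimal sketch, with hypothetical filenames and timestamps in place of real `git log -1 --pretty=%ct` output:

```python
# Hypothetical committer timestamps (what `git log -1 --pretty=%ct` returns).
timestamps = {"gen/Types.hs": 1600000000, "gen/types.hs": 1632000000}

group = ["gen/Types.hs", "gen/types.hs"]
newest_to_oldest = sorted(group, key=timestamps.get, reverse=True)

keep, to_delete = newest_to_oldest[0], newest_to_oldest[1:]
print(keep)       # gen/types.hs
print(to_delete)  # ['gen/Types.hs']
```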

```python

import sys
import hashlib
import subprocess

def md5(fname):
    # Hash the file in 4 KiB chunks so large files don't need to fit in memory.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def file_git_timestamp(fname):
    # Committer timestamp (%ct) of the most recent commit touching the file.
    return int(
        subprocess.check_output(["git", "log", "-1", "--pretty=%ct", fname])
        .decode("utf8")
        .strip()
    )

# Group paths read from stdin by their lowercased (case-insensitive) name.
files = {}
for file in map(str.rstrip, sys.stdin):
    files.setdefault(file.lower(), []).append(file)

# Split collision groups into byte-identical and differing content.
exacts = []
diffs = []
for v in files.values():
    if len(v) >= 2:
        hashes = [md5(f) for f in v]
        if all(hashes[0] == x for x in hashes):
            exacts.append(v)
        else:
            diffs.append(v)

# Keep the newest copy in both cases now: preferring older files just caused
# the generator to recreate the deleted newer ones.
for exact in exacts:
    newest_to_oldest = sorted(exact, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

for diff in diffs:
    newest_to_oldest = sorted(diff, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

print(f"{len(exacts)} exact matches")
print(f"{len(diffs)} differing matches")

```
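The exact/differing split the script performs hinges on comparing content hashes within each collision group. A self-contained sketch of that check, using in-memory bytes rather than files (the module contents here are invented for illustration):

```python
import hashlib

def digest(data: bytes) -> str:
    # Same MD5 comparison the script applies to file contents.
    return hashlib.md5(data).hexdigest()

# Two collision groups: one byte-identical, one with differing content.
identical = [b"module Foo where\n", b"module Foo where\n"]
differing = [b"module Foo where\n", b"module Foo where\n-- extra\n"]

def is_exact(contents):
    hashes = [digest(c) for c in contents]
    return all(hashes[0] == h for h in hashes)

print(is_exact(identical))  # True
print(is_exact(differing))  # False
```

With the updated script, the distinction only affects reporting: both branches now keep the newest file either way.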