
Delete duplicate files #637

Merged 1 commit on Sep 23, 2021
Conversation

@MichaelXavier (Collaborator)

There are a large number of files in this repo that have duplicated
filenames when treated as case-insensitive. One effect of this is
that if you use Nix to bring this repo into your project, the checksum
for the project will differ between Linux and macOS. I also suspect
there's undefined behavior as to which module gets built.

I wrote a script that takes the output of `git ls-files` and:

  1. For duplicate filenames with exactly equivalent content, it keeps the
     oldest file and deletes the newer ones.
  2. For duplicate filenames with differing content, it keeps the newest
     file and deletes the older ones.

FWIW, I have introduced this package into a large (>100 KLOC) project and everything seems to build and work.

```python
import sys
import hashlib
import subprocess

def md5(fname):
    # Hash the file in 4 KiB chunks so large files don't need to fit in memory.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def file_git_timestamp(fname):
    # Committer timestamp (%ct) of the most recent commit touching the file.
    return int(
        subprocess.check_output(["git", "log", "-1", "--pretty=%ct", fname])
        .decode("utf8")
        .strip()
    )

# Group paths read from stdin by their lowercased (case-insensitive) name.
files = {}
for file in map(str.rstrip, sys.stdin):
    files.setdefault(file.lower(), []).append(file)

# Split collision groups into byte-identical and differing content.
exacts = []
diffs = []
for v in files.values():
    if len(v) >= 2:
        hashes = [md5(f) for f in v]
        if all(hashes[0] == x for x in hashes):
            exacts.append(v)
        else:
            diffs.append(v)

# Identical content: keep the oldest copy, delete the newer ones.
for exact in exacts:
    oldest_to_newest = sorted(exact, key=file_git_timestamp)
    print(f"Keep {oldest_to_newest[0]}")
    for kill in oldest_to_newest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

# Differing content: keep the newest copy, delete the older ones.
for diff in diffs:
    newest_to_oldest = sorted(diff, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

print(f"{len(exacts)} exact matches")
print(f"{len(diffs)} differing matches")
```
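The case-insensitive grouping at the heart of the script can be sketched in isolation; this is a minimal illustration with made-up paths standing in for `git ls-files` output, not part of the PR itself:

```python
from collections import defaultdict

# Hypothetical paths standing in for `git ls-files` output.
paths = ["gen/Types.hs", "gen/types.hs", "gen/Waiters.hs"]

groups = defaultdict(list)
for p in paths:
    groups[p.lower()].append(p)  # lowercased name = case-insensitive key

collisions = [v for v in groups.values() if len(v) >= 2]
print(collisions)  # [['gen/Types.hs', 'gen/types.hs']]
```

Any group with two or more members is exactly the kind of collision that trips up a case-insensitive filesystem like macOS's default APFS configuration.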

@brendanhay (Owner) left a comment:
LGTM

@mbj commented Sep 23, 2021

@brendanhay I'm currently referencing this branch, as some of my colleagues were affected by the macOS filesystem.

What do you think has to be done to get this merged? Is there something preventing the merge that I can help with?

Staying on an unmerged commit increases the chance it will never be merged, and we end up developing against a non-future-proof edge of the tree, with more update friction.

From my limited perspective this PR seems good as is, but if there is something I can do to get it merged, let me know. Happy to do the legwork.

@brendanhay (Owner)

@mbj either @MichaelXavier or myself can merge it.

@brendanhay brendanhay merged commit c28b015 into brendanhay:develop Sep 23, 2021
@MichaelXavier (Collaborator, Author)

Thanks, @brendanhay!

@MichaelXavier MichaelXavier deleted the fix-duplicate-files branch September 28, 2021 23:45
MichaelXavier added a commit to Soostone/amazonka that referenced this pull request Sep 30, 2021
The same issue from brendanhay#637 has recurred: files with the same name but
different case cause problems on macOS due to its case-insensitive
filesystem. Previously, my script would prefer newer files where there
was a conflict *and* differing file content, but it would prefer *older*
files when the files were the exact same. In retrospect that was a
mistake because if the conflicting files are being automatically
generated, that script is just going to regenerate the files that were
deleted every time. Below is the updated script I used to resolve this
issue. I've spot checked the files in the diff and the ones being
removed are indeed the older ones.
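The keep-newest rule described above boils down to sorting each collision group by commit timestamp and deleting everything after the first entry; a minimal sketch, with hypothetical filenames and timestamps in place of real `git log -1 --pretty=%ct` output:

```python
# Hypothetical committer timestamps (what `git log -1 --pretty=%ct` returns).
timestamps = {"gen/Types.hs": 1600000000, "gen/types.hs": 1632000000}

group = ["gen/Types.hs", "gen/types.hs"]
newest_to_oldest = sorted(group, key=timestamps.get, reverse=True)

keep, to_delete = newest_to_oldest[0], newest_to_oldest[1:]
print(keep)       # gen/types.hs
print(to_delete)  # ['gen/Types.hs']
```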

```python

import sys
import hashlib
import subprocess

def md5(fname):
    # Hash the file in 4 KiB chunks so large files don't need to fit in memory.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def file_git_timestamp(fname):
    # Committer timestamp (%ct) of the most recent commit touching the file.
    return int(
        subprocess.check_output(["git", "log", "-1", "--pretty=%ct", fname])
        .decode("utf8")
        .strip()
    )

# Group paths read from stdin by their lowercased (case-insensitive) name.
files = {}
for file in map(str.rstrip, sys.stdin):
    files.setdefault(file.lower(), []).append(file)

# Split collision groups into byte-identical and differing content.
exacts = []
diffs = []
for v in files.values():
    if len(v) >= 2:
        hashes = [md5(f) for f in v]
        if all(hashes[0] == x for x in hashes):
            exacts.append(v)
        else:
            diffs.append(v)

# Keep the newest copy in both cases now: preferring older files just caused
# the generator to recreate the deleted newer ones.
for exact in exacts:
    newest_to_oldest = sorted(exact, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

for diff in diffs:
    newest_to_oldest = sorted(diff, key=file_git_timestamp, reverse=True)
    print(f"Keep {newest_to_oldest[0]}")
    for kill in newest_to_oldest[1:]:
        print(f"git rm {kill}")
        subprocess.check_call(["git", "rm", kill])

print(f"{len(exacts)} exact matches")
print(f"{len(diffs)} differing matches")

```
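The exact/differing split the script performs hinges on comparing content hashes within each collision group. A self-contained sketch of that check, using in-memory bytes rather than files (the module contents here are invented for illustration):

```python
import hashlib

def digest(data: bytes) -> str:
    # Same MD5 comparison the script applies to file contents.
    return hashlib.md5(data).hexdigest()

# Two collision groups: one byte-identical, one with differing content.
identical = [b"module Foo where\n", b"module Foo where\n"]
differing = [b"module Foo where\n", b"module Foo where\n-- extra\n"]

def is_exact(contents):
    hashes = [digest(c) for c in contents]
    return all(hashes[0] == h for h in hashes)

print(is_exact(identical))  # True
print(is_exact(differing))  # False
```

With the updated script, the distinction only affects reporting: both branches now keep the newest file either way.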