Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring - Optional Merge #45

Closed
wants to merge 6 commits into from

Conversation

qtc-de
Copy link

@qtc-de qtc-de commented Dec 3, 2022

Hi Elise 馃憢

first of all, cool idea! I recently needed to compare a large chunks of images and your approach for comparing them worked pretty well 馃憤

That being said, in the current implementation it is rather slow. Comparing larger chunks of images (15000+) takes a while. Moreover, you use a lot of different dependencies where some of them are quite large (e.g. opencv). This makes it difficult to install the tool in specific environments like within a Docker container.

Since I probably need to compare images in future again, I thought of improving these issues. This pull request provides the results. Before talking about the changes, let me apologize for the huge pull request. I actually do not like larger pull requests for my own repos and prevent from doing them to other persons as well. However, the dependency changes and especially the multiprocessing required a larger restructuring of your tool. Therefore, I totally understand if you do not want to merge the changes. In this case, I'm fine with maintaining a fork of your repository that provides an alternative implementation. Just decide as you like :)

Here is a brief summary of the changes I made:

  1. Make a clearer cut between CLI and library. The CLI script is now contained in /bin/difpy, while the code in /difPy/difPy.py only contains the library implementation.
  2. Reduce dependencies. The whole technique you describe can be implemented using numpy and Pillow. This makes it possible to create a Docker container running difPy that has only 161MB. Before, with opencv, we were around 1.2GB.
  3. Add multiprocessing. Work can now be distributed between different cores, which should speed up the operation quite a bit for larger image sets.
  4. Add a fast compare option. When image A is similar to image B, one probably does not want to compare B to other images, but is fine with only comparing A with others from here. Sure, this may misses some edge case duplicates, but in most situations it should be fine and provides a huge speedup for the operation.
  5. Change the command line layout. Feels now more intuitive (at least to me :D)
  6. Change the output format. The output format is still JSON based, but does not include much statistic information now. The regular end user is probably not that interested on when a comparison took place, but more on the actual comparison result. The new reduced output format should be easier to read / parse.
  7. Add a Dockerfile for building a container running difpy.

As I said, many changes. Just think about whether you want to merge or whether we keep these changes in a separate fork. I'm fine with both approaches 馃槈

Best
Tobias

Started some bigger refactoring. Still work in progress.
Completed the refactoring.
Added a Dockerfile for building a self contained difPy docker container.
Container size is currently around 160MB.
The option combination --copy-uniq and --move-duplicates did not work
correctly. Should be fixed now.
@elisemercury
Copy link
Owner

Hi Tobias!

Thank you so much for your feedback and your effort in making difPy better and more efficient! I really appreciate the effort you invested in this. I had a quick look at your changes and they seem very clean at first sight. I also appreciate a lot that you adjusted the README to your changes. Let me go through the code more in detail and test the changes in the next few days. After that, I'll decide whether we can merge and deploy them.

Again, thank you very much for this!

All the best,
Elise

@elisemercury elisemercury self-assigned this Dec 9, 2022
@elisemercury elisemercury added the feature : new New feature for difPy. label Dec 9, 2022
Accidently confused ENTRYPOINT with CMD. ENTRYPOINT is the correct
directive, as difPy should be the entrypoint and user supplied arguments
should be passed to it. The Dockerfile should work now as expected.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature : new New feature for difPy.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants