Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMAGING-345] Unit test : make tests in memory instead of writing hundred of tmp files #271

Closed
wants to merge 1 commit into from

Conversation

kpouer
Copy link

@kpouer kpouer commented Jan 19, 2023

An attempt to perform unit tests in memory using byte array when possible instead of writing temporary files.
Tests pass much faster and do not make useless write on my SSD.
I know that having the tmp files might be useful in some case, if you think it it necessary I could update the patch to write tmp file in case of failed test.

@codecov-commenter
Copy link

codecov-commenter commented Jan 19, 2023

Codecov Report

Merging #271 (00b6560) into master (ab2a97a) will increase coverage by 0.04%.
The diff coverage is n/a.

❗ Current head 00b6560 differs from pull request most recent head 94aba6c. Consider uploading reports for the commit 94aba6c to get more accurate results

@@             Coverage Diff              @@
##             master     #271      +/-   ##
============================================
+ Coverage     70.76%   70.81%   +0.04%     
- Complexity     3368     3373       +5     
============================================
  Files           332      332              
  Lines         16936    16936              
  Branches       2652     2652              
============================================
+ Hits          11985    11993       +8     
+ Misses         3904     3898       -6     
+ Partials       1047     1045       -2     
Impacted Files Coverage Δ
.../main/java/org/apache/commons/imaging/Imaging.java 73.56% <0.00%> (+2.29%) ⬆️
...aging/common/bytesource/ByteSourceInputStream.java 78.76% <0.00%> (+3.53%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm busy with $work these days, but

An attempt to perform unit tests in memory using byte array when possible instead of writing temporary files.
Tests pass much faster and do not make useless write on my SSD.
I know that having the tmp files might be useful in some case, if you think it it necessary I could update the patch to write tmp file in case of failed test.

I think it could be useful in some cases, like if you want to debug/troubleshoot it, but I think speeding up the tests and reducing IO sound like a good idea. In case a test fails and we need to troubleshoot, it shouldn't be too hard to write it to disk to investigate (assuming that's needed at all). So I'm a big +1 for this, @kpouer .

I had a quick look and looks like you are adding some public methods too, in non-test files. Were these necessary for the change in this PR, @kpouer ?

Final note, it'd be necessary to squash the commits later. If you are still sorting out how to organize the change, feel free to keep it as is until you have it ready. Otherwise, squash it so that it's ready to be merged, please.

Thanks heaps!

@garydgregory
Copy link
Member

Keep in mind that tests must be able to run in containers that may not provide much memory.

@kpouer
Copy link
Author

kpouer commented Jan 19, 2023

Hi @kinow,
About the new methods, it is utility that get various information about the image. Some format are not detected only from their data and the file extension is necessary. It was of course working with tmp files, but not with byte arrays.
I took inspiration of the equivalent method taking an InputStream and a filename, for example
There was a

public static ICC_Profile getICCProfile(final InputStream is, final String fileName)

Then I created
public static ICC_Profile getICCProfile(final String fileName, final byte[] bytes) (the parameter order is not the same but it follows the constructor parameter orders of the ByteSourceArray class (however when rereading I think that Imaging class should always have the same parameter orders, it would be easier to understand.
I could have used getICCProfile(ByteSource) but the method was protected then I should have changed it's visibility.

About the troubleshooting I'll update the branch in order to save tmp files for failed test.

@garydgregory I don't think it will take a lot of memory as images are processed one by one, I think it should store something like 2 or 3 time the same image during some tests, then the memory is released to the GC to process the next image. And test images do not seem to be huge. However I am interested but not sure about what you mean by tested in containers. Do I have a way to test it to validate ?

@garydgregory
Copy link
Member

I was referring to GitHub Actions builds being resource constrained (to what extent I don't know exactly).

@kpouer
Copy link
Author

kpouer commented Jan 21, 2023

Hey, I was able to do the same test without changing public api so I reverted the change (even if I think those functions might be useful, it could be done in another PR if necessary.
I added code to save tmp file in case of error during test (catched throwable, I don't like it but had no idea how to do it in another way). For memory I tried to isolate ByteArrayOutputStream in blocks to allow GC to collect them as soon as possible, I hope it helps.
when you think it is ok I can resubmit the same patch in only one commit (or update it if you think there are more changes to do

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kpouer can you squash your commits? And would you be able to create a JIRA for this? Even though it's a change in the tests I'd like to review (will try to compare number of IO ops, and memory on my laptop at least) and include it in the changelog for the next release. Thanks!

@kpouer
Copy link
Author

kpouer commented Jan 21, 2023

Hey, I was not sure how I could squash commits without deleting my branch and pushing it again so I created a new branch and a new pull request (same code but squashed into one commit)
#272
I would be happy to create a JIRA ticket but I don't have an account on it. It says I could send an email to private@commons-imaging.apache.org if that email is setup. Is it ?

@kinow
Copy link
Member

kinow commented Jan 22, 2023

Hi @kpouer

Hey, I was not sure how I could squash commits without deleting my branch and pushing it again so I created a new branch and a new pull request (same code but squashed into one commit) #272

It's probably useful to get used to squashing commits, both for Apache Commons code contributions, but for other OSS that follow similar approach. The issue with opening new PR's is that a) you may have to add more commits again, and then squash again, and b) it gets more complicated to keep track of discussions and reviews (there's also c) where users/devs subscribed to the repo may get spammed by similar changes).

If you'd like you can try on this PR, as it's still open: https://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html

I squash rebasing interactively. Basically what I'd do here would be something similar to:

git fetch --all
git rebase upstream/master # upstream for me is apache/commons-imaging, kinow for me is the kinow/commons-imaging
gitk # or git log -n N, but I like to take a look at the GUI too, then count the number of commits since the top of master
# you have 17 commits here, ATOW
git rebase -i HEAD~17  # -i is interactive, i.e. I will interact with git to decide how to rebase
# each commit is prefixed by an action code, I mark the first line as r-eword
# and for each new commit line in the VIM/editor, I mark as f-ixup

image

When I save my editor, git rebase proceeds, opening the editor again as I asked to reword the fix commit. Now I write a good commit line. If I already have the JIRA number I prefix the commit message with [IMAGING-NNNN] Perform tests in memory instead of disk, making them faster".

When I save this file again, git will use the text as the commit message of the reworded commit. The other commits have been all "fixed-up". So now everything has been "squashed" into a single commit.

At this point I gitk again, or git log -n 1 or -n 2 to look at the commit history again.

image

At this point your local branch contains the squashed commits for your change. You can browse the code, run mvn, etc., until you are sure that everything is in order. Then you are ready to push -f kinow/branch-name (where kinow will be your personal remote, and branch-name must be unit_test_tmp_fil...., I think).

And the pull request will have been updated with your forced (-f) commit, showing a single commit for reviewers 👍

If that works you can close your other PR. Or if you prefer to try this later, feel free to close this issue 👍 up to you.

I would be happy to create a JIRA ticket but I don't have an account on it. It says I could send an email to private@commons-imaging.apache.org if that email is setup. Is it ?

I am sorry, due to bots SPAMing our JIRA server we had to lock it for new users 😥

Send the email, please, to the commons private mail (it's the email address you mentioned above, minus the "-imaging" part) with the info on https://infra.apache.org/jira-guidelines.html#account, and I or another committer will create your account.

Really sorry for the inconvenience 🙇

Thanks @kpouer !

@kpouer kpouer force-pushed the unit_test_tmp_file_optimization branch from eddd5bf to 1947c86 Compare January 22, 2023 10:27
@kinow
Copy link
Member

kinow commented Jan 22, 2023

kpouer force-pushed the unit_test_tmp_file_optimization branch from eddd5bf to 1947c86

👏 great job! Much easier to review, and thanks for closing the other PR. Bravo @kpouer 👍

@kpouer
Copy link
Author

kpouer commented Jan 22, 2023

Yes I was able to squash but was missing the "force" option for the push

@kinow kinow self-requested a review January 22, 2023 12:36
@kpouer kpouer force-pushed the unit_test_tmp_file_optimization branch from 1947c86 to 23208d9 Compare January 23, 2023 20:00
@kpouer
Copy link
Author

kpouer commented Jan 23, 2023

Hi, I just created a JIRA ticket for this patch and updated the commit description
https://issues.apache.org/jira/browse/IMAGING-345

@kinow
Copy link
Member

kinow commented Feb 5, 2023

Sorry for the delay, next on my list! 🤓

final ImagingParameters params = Util.getImageParser(formatInfo.format).getDefaultParameters();
try (FileOutputStream fos = new FileOutputStream(temp1)) {
imageParser.writeImage(testImage, fos, params);
try (ByteArrayOutputStream bos = new ByteArrayOutputStream(1000000)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leave it the default value and let it handle expanding the internal buffer? That could be helpful if we load many smaller images, I think (although I have no idea the distribution of sizes of our images & their metadata too).

@kpouer kpouer force-pushed the unit_test_tmp_file_optimization branch from 00b6560 to 94aba6c Compare February 6, 2023 21:10
@kinow
Copy link
Member

kinow commented Feb 6, 2023

Used Docker to run the build from this PR limiting the allowed memory for the container. The build crashed at 64mb, 128mb, 256mb. And it passed for 512mb and higher.

Switched to master, and 512mb passes, and 256mb fails as well. Command line used:

docker run -it --rm --name my-maven-project -v "$(pwd)":/usr/src/mymaven -w /usr/src/mymaven --memory=256m maven:3.8-openjdk-17 mvn clean install

I think the overhead of doing everything in memory is not noticeable, and also the operations to read/parse & write might end up loading things to memory anyway before writing to disk. Still, might be a good idea to implement this change somehow that's not hard to undo it if needed (maybe encapsulate it into a single class/function reused?)

@kpouer kpouer force-pushed the unit_test_tmp_file_optimization branch from 94aba6c to a49411b Compare February 7, 2023 21:28
@kpouer
Copy link
Author

kpouer commented Feb 7, 2023

I updated the branch, first I wanted to comment out code to create tmp files for troubleshooting but finally I just did enclose it in a if (false) {} block, this way no problem with imports and it is even easiest to set it to true for troubleshooting.
I'm exploring buiding with docker just to see if something can be done more efficiently

@kpouer kpouer changed the title Unit test : make tests in memory instead of writing hundred of tmp files [IMAGING-345] Unit test : make tests in memory instead of writing hundred of tmp files Feb 7, 2023
Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kpouer I'd be OK removing the if (false) block, or leaving it commented out in one test class, or adding a README.md in the test directory, etc.

As more tests are added, I guess other devs may ignore copying that if (false) block, and it's not very hard to change the code to write to the disk in case a test is failing.

Furthermore, it's hard to be consistent with this code as it's not used (see comments where in some blocks there's a call to LOGGER.severe, while others are missing it.

What do you think?

@kpouer
Copy link
Author

kpouer commented Feb 22, 2023

Hi @kinow I think you are right, those blocks adds complexity to the test and I am not sure they will be so useful. In the rare case someone really needs to see the file it is finally easy to add a few lines of code to write the file. I will remove them.

@kinow
Copy link
Member

kinow commented Feb 22, 2023

Hi @kinow I think you are right, those blocks adds complexity to the test and I am not sure they will be so useful. In the rare case someone really needs to see the file it is finally easy to add a few lines of code to write the file. I will remove them.

Excellent @kpouer . Feel free to ignore/resolve my last review comments then. I believe most/all are about the missing LOGGER in the if(false) block, and so they shouldn't be valid any longer. Thank you!

@kpouer kpouer force-pushed the unit_test_tmp_file_optimization branch from a49411b to 598e5e4 Compare February 22, 2023 14:05
@kpouer kpouer force-pushed the unit_test_tmp_file_optimization branch from 598e5e4 to 95405b1 Compare March 10, 2023 22:40
Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment about the JUnit method, but I think this is my last review feedback. It should be ready for a final pass of 👀 and merge. Thanks @kpouer !!!

@kpouer
Copy link
Author

kpouer commented Mar 30, 2023

Hey I missed the notification, yes a simple assertArrayEquals would be more efficient, do you want that I remove it ?

@kinow
Copy link
Member

kinow commented Mar 30, 2023

Hey I missed the notification, yes a simple assertArrayEquals would be more efficient, do you want that I remove it ?

Yes, if possible. Rebase and fix the conflicts, and then we can get it ready to be merged. Thanks @kpouer !

@kpouer kpouer force-pushed the unit_test_tmp_file_optimization branch from 95405b1 to 0cc46af Compare March 31, 2023 19:03
@kpouer
Copy link
Author

kpouer commented Mar 31, 2023

done

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, @kpouer ! I will add a changelog and merge it. Thank you for your contribution, and for the patience!

@kinow kinow closed this in 933584e Mar 31, 2023
@kinow
Copy link
Member

kinow commented Mar 31, 2023

Merged!

@kpouer kpouer deleted the unit_test_tmp_file_optimization branch April 1, 2023 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants