Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maven-like repository for JavaCPP resources #120

Closed
wants to merge 3 commits into from

Conversation

cypof
Copy link
Contributor

@cypof cypof commented Aug 9, 2016

I have a need for #43 again, so I resumed working on it. To help merge it, I decided to split the old PR into two. This first part is the repo itself, without the include files jar or symlinks support. While we iterate on this, I will rebase and work on the rest of the old PR.

@saudet
Copy link
Member

saudet commented Aug 11, 2016

Sorry for taking some time to reply. I was worried this wouldn't work on some platforms, so I've been testing this out on all platforms, and (apart from an obvious bug) it seems to be working surprisingly well everywhere. Good job!

Thinking about this some more though, it looks more like a cache than a repo, no? If we consider it as a cache, we wouldn't need any coordinates. Do we need "clean" paths?

@saudet
Copy link
Member

saudet commented Aug 11, 2016

Since this new code is basically getting used only when loading files found in a JAR archive, what if we simply used the name of the archive? So that files found in opencv-3.0.0-1.0-windows-x86_64.jar get extracted to ~/.cache/javacpp/opencv-3.0.0-1.0-windows-x86_64/, for example.

In any case, there's at least one issue that we need to resolve. We'll need to make sure that the files in the cache/repo are the same as the ones in resources, and if not overwrite them. Just a call to Files.exists() isn't enough...

@saudet
Copy link
Member

saudet commented Aug 12, 2016

To check for modified files, maybe we could just do like rsync? "Rsync finds files that need to be transferred using a lqquick checkrq algorithm (by default) that looks for files that have changed in size or in last-modified time."

@cypof
Copy link
Contributor Author

cypof commented Aug 13, 2016

As you know once the second PR is ready, you won't need to install anything on your machine before you can build native code. I think your project will enable all kinds of apps to add little bits of native code, as it will be really convenient to do. The only way I think to make sure we don't have name conflicts and proper versioning across a large number of apps is to closely stick to Maven coordinates. They provide global unique names, versions etc. and Maven files cannot be changed. Checking for content should only be necessary for *-SNAPSHOT jars. Maybe we can leverage their checksum mechanism too, or do what most Maven tools seem to do for SNAPSHOT versions which is to always replace them?

@saudet
Copy link
Member

saudet commented Aug 14, 2016

Extracting files in the tmpdir already allows to have a large number of applications running without conflicts, so it's not a good reason to change that behavior. There are issues with temp files that can be dealt with using a cache, without adding a lot complexity, because it's basically transparent, so that's good I think.

But what you propose isn't transparent. We would have cases when what we have in Java resources doesn't match what we have in JavaCPP resources, and that would require some tools to deal with. Both Maven and Nexus do allow redeploying artifacts with non-SNAPSHOT versions, for debugging, recovering from failed operations etc. I personally wouldn't be ready to deal with the added complexity of an additional repo, unless the tools are there and there's a clear advantage over something like a cache.

BTW, simply overwriting files wouldn't work well when they are already opened by another process. Applications could fail for obscure reasons. For a cache, a hash based on the content of the JAR file like you propose is good, but we can use nicer names based on the name, date, and size of the JAR, which should avoid conflicts, but would require a max cache size option and some appropriate cleanup a la ccache.

@cypof
Copy link
Contributor Author

cypof commented Aug 15, 2016

I like the simplicity of a pure cache story, but it might be limiting when using it to build against it. E.g. if you are playing with custom builds, or runtime code generation etc, it's handy to have a fixed path to include. Each library version has a single path it can be found at, and that can be relied upon to be there.

I think it's worth the small added complexity of fixed location, plus checking if the files have changed. I looked about this a little, Maven does let you republish files, but it seems really exceptional and something devs only would play with in practice. It seems maven does an additional check for the timestamp, in addition to testing for the file existence. We could replicate this by looking at the timestamp in .m2, and re-expanding if changed?

@cypof
Copy link
Contributor Author

cypof commented Aug 18, 2016

I added support for detecting Maven re-deployments. If the timestamp in .m2 is higher than ours, the jar is expanded again. Would that work? Do you agree with my argument about stable paths to native libraries being very valuable? Thanks

@saudet
Copy link
Member

saudet commented Aug 20, 2016

Something like that would work, yes, but it's not enough. rsync checks for both the size of files and their last modified dates. So, in this case as well, let's make sure that a file already extracted has the same last modified date and file size than one found in the JAR file. Then we can reasonably assume the content is the same, and I would be fine with that.

@saudet
Copy link
Member

saudet commented Aug 20, 2016

Another important thing that we'll need to fix is Android support. Android doesn't support the FileSystem API from Java 7, but it has its own buggy ways of dealing with JNI. Anyway it doesn't matter. We just need to make sure things like ClassNotFoundException, NoSuchMethodError, and such are handled properly, as is already done on these lines:
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L148
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L157
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L258
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/Loader.java#L276

@saudet
Copy link
Member

saudet commented Aug 20, 2016

Instead of some cryptic hash, what about if we use the following mapping for resources without coordinates?
group = name of JAR file
id = size of JAR file
version = last modified date of JAR file

@cypof
Copy link
Contributor Author

cypof commented Aug 21, 2016

Yes that sounds good, I added a commit to remove the hash. About android support, I'm not sure what to do. If I add try-catch blocks, I still need to fail further along, as the files would not be deployed.

And about the file size check, I don't think it is necessary. There is no way the maven repo can be updated without the timestamp changing. Also I don't see a clean way to implement it, as I need to compare the size of a jar to a expanded folder. Would the timestamp only work? thks

@saudet
Copy link
Member

saudet commented Aug 28, 2016

Thanks! Could you also make sure the extracted files have the same size and same last modified date as well? Just checking that files in JAR files are more recent isn't enough...

As for Android, we don't need to extract the files for Android. They get extracted at install time by the Android installer, and in theory, we only need to have the fallback System.loadLibrary() called (but Android devices are buggy and we actually need to do a lot more, but don't worry about that, as long as it gets to System.loadLibrary()). Adding the catch blocks isn't hard, but we need to run this in the Android emulator to make sure it works. Not sure when I'll have the time to do it myself, so if you could check it out it would help :)

@saudet
Copy link
Member

saudet commented Aug 28, 2016

BTW, I'm not talking about the size and the date of the JAR files, but of the files in the JAR files. From what I understand, we would be adding the check here:
https://github.com/cypof/javacpp/blob/repo_only/src/main/java/org/bytedeco/javacpp/tools/Repository.java#L129

If, for example, the .m2 repository gets corrupted, the last modified dates jump into the future, but then restored from a backup, of course the last modified dates can go backwards. Like I said previously, if we are to consider this a cache, we need to make sure that what's in it isn't out of sync with the actual repository, with reasonable certainty. What rsync does by default appears to work for most people so let's just do the same!

On the other hand, if we consider this a repository, we are going to need more tools to deal with these inconsistencies that will arise in various situations that are going to happen in production (Murphy's law). For example, how could we "deploy" the files without running an application that uses JavaCPP? These kinds of questions don't matter if this is just a cache.

@cypof
Copy link
Contributor Author

cypof commented Aug 28, 2016

I also don't want this to be a full repo, definitely more like a cache. The only thing is that I would like deterministic paths to content, so that we can build code against it. Initially I thought having its own root in home would make it easier to do that, but maybe not. Moving it to .cache/javacpp as you suggest should be OK, I don't see people clearing their cache while an application is running or messing with the content.

For content validation, I still don't see the value for the additional complexity that would incur, and the fact that every launch will require opening and listing the full directory of each jar in the dependency tree even if one file is actually used etc. I can write it, but it will take time and I don't see the point. You mention the scenario where someone would restore their .m2 from backup but not the rest of their home folder. That's already a bit of a stretch, but even there it can work with time stamps. When extracting the jar, we set the modified time on the folder to the one of the jar. If it doesn't match for any reason we re-extract, whether it is in the future or in the past.

I will add the Android try/catch blocks and look at the emulator. Do you have scripts for that already to save time, or not yet? Thanks

@saudet
Copy link
Member

saudet commented Sep 3, 2016

The more I think about this, the more it seems like we're trying to do something that's way more complicated than it needs to be. Native code cannot easily access Java resources found in JAR files, and that's basically the only problem that needs to be solved, correct? When the resources are actually files on the native file system, we don't need to do anything other than to pass in the right path. We don't want to copy them, right? Only when they are in JAR files do we need to extract them somewhere, and IMO that would be the only thing that JavaCPP's cache needs to do.

To get consistent naming, we can just use the name of the JAR file they were extracted from. Since we assume we are using Maven, there are basically two scenarios:

  1. JAR files from the ~/.m2/repository get used, so we could extract libraries from opencv-3.1.0-1.2-linux-x86_64.jar into say ~/.javacpp/jarcache/opencv-3.1.0-1.2-linux-x86_64/org/bytedeco/javacpp/linux-x86_64/, and we have our "coordinates" in there, without having to do anything special.
  2. An application gets executed from an_uber.jar file, in which case library files would get extracted to ~/.javacpp/jarcache/an_uber/org/bytedeco/javacpp/linux-x86_64/ (and similarly for include files or other resources). We might get duplicate files in the cache that way, but there's not much we can do. Users can put whatever they want in the new JAR file, and they do want to use those files. They do not want to have JavaCPP use some coordinates to fetch stuff from a repository somewhere. They put files in an uber JAR for a reason. Besides, it makes more sense that way for header files. If we create the artifacts with the "include" classifier by copying the header files into org/bytedeco/javacpp/include/ then we just need to extract that from the uber JAR as well. It will contain header files from all the "include" artifacts, and that's it. No need to worry about clashing paths, which we would have to worry about if we were trying to satisfy a set of coordinates.

Still, it's conceivable that multiple JAR files with the same name but with different content get used at the same time by different processes, but if that becomes a problem, we can easily extend the simple scheme above by appending say the size and the last modified date of the JAR file to the subdirectory name in jarcahe, and just create a symbolic link to the latest one. We can get back deterministic names that way. But I don't think we will need to do that, so let's not worry about that for now. It's something we can add later if needed.

In any case, since we'd be caching on a per JAR file basis, I guess we could validate only with the last modified date on the subdirectory, something that wouldn't be possible if we tried to map uber JARs to some coordinates, in which case we would have to check each file one by one.

More importantly, we shouldn't use the FileSystem API. This way, we could use JavaCPP on Android also to extract resources in a portable way. If you think that what I describe above would meet your needs, that's something I'd be ready to code from scratch myself, but let me know. If it does not though, could you make your requirements more explicit and why they are not satisfied?

(Not so sure anymore about using ".cache/javacpp". The actual cache directory depends on the OS and sounds like it would make things confusing, http://wiki.netbeans.org/FaqWhatIsUserdir , for example, and then there is Android's own scheme. Besides, some tools like ccache still don't use it.)

Scripts for the Android emulator? I'm afraid it's a lot of manual installation of everything. Google does not make Android amenable to automation. :(

@cypof
Copy link
Contributor Author

cypof commented Sep 3, 2016

Supporting uber jars might be useful. My use case is covered with just expending jars with coordinate attributes, so either way for me. If you feel like re-writing it, as long as it has include file support and stable paths I'm happy. I probably have already invested an unreasonable amount of time in this feature. I just like the long term potential of being able to compile native code without having to install anything, or even at runtime. I want to have it, and I don't really care about the specifics of how it works. What would be your time estimate for re-writing it? thks

@saudet
Copy link
Member

saudet commented Sep 5, 2016

It's not about supporting uber JAR or not supporting uber JAR, it's about not breaking existing functionality...

@saudet
Copy link
Member

saudet commented Sep 26, 2016

I've finally been able to spend a bit of time on this. Please check the cache branch and let me know what you think: https://github.com/bytedeco/javacpp/tree/cache

It currently doesn't support caching directories, yet. When the "resource" is a directory, we would need to add special support, and using the FileSystem API would be fine. In any case, Builder could then call something like Loader.cacheResource(opencv_core.class, "include") to get the header files extracted and the directory in the cache would be returned as a File object.

BTW, on Windows, we would also need to bundle the *.lib files, but that can easily be done alongside the *.dll ones I believe.

@cypof
Copy link
Contributor Author

cypof commented Sep 26, 2016

OK that sounds good, I like the class + "include" API. As I mentioned before we need to make sure the paths will be predictable, so that we could hard-code if needed, e.g. in a build script outside of the JVM. Probably only needed for development but I'm sure other uses cases will popup as we go.

@saudet
Copy link
Member

saudet commented Sep 26, 2016

As long as the JAR files in the class path are taken from the local Maven repository, they will be predictable... Will that always be the case for what you are anticipating to do?

@cypof
Copy link
Contributor Author

cypof commented Sep 26, 2016

Yes, that should be fine.

@saudet
Copy link
Member

saudet commented May 7, 2017

We now have Loader.cacheResources() (works on Android as well) that basically does what we need here, as used in the Builder to start an arbitrary build command, for example:
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/tools/Builder.java#L721
If it doesn't meet your needs in some way, please let me know! Thanks

@saudet saudet closed this May 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants