Makefile instead of the build_dependencies.sh script #210

Merged
merged 9 commits into from
Oct 10, 2015

ws233
Collaborator

@ws233 ws233 commented Aug 11, 2015

  1. libtiff update
  2. No more image libraries in lib folder. Instead those files are taken directly from the libtiff-ios folder, where they are built.
  3. No more leptonica & tesseract libraries in the lib folder of the repository. Instead, those binaries are built automatically when the Run button in Xcode is pressed.
  4. No more build_dependencies script. Makefile instead to build Tesseract and all dependent libraries.
  5. An extra build phase has been added to the Tesseract OCR iOS framework. This phase simply runs the Makefile and builds all the dependencies if necessary.
  6. Readme files have been updated to mention above changes.

@ws233
Collaborator Author

ws233 commented Aug 11, 2015

@kevincon, I've replaced the build script with a Makefile.
Advantages of this change:

  1. GNU make utility figures out automatically which files it needs to update, based on which source files have changed. It also automatically determines the proper order for updating files, in case one non-source file depends on another non-source file.
    As a result, if you change a few source files and then run Make, it does not need to recompile all of your program. It updates only those non-source files that depend directly or indirectly on the source files that you changed. So there is no need to rebuild everything if only one dependent library, or even only one file in a library, has been updated. You can check this behavior by pressing the Run button in Xcode, since I've added this Makefile as a pre-build phase of the Framework target. The very first build will take about half an hour, but every subsequent build will be very fast, since the unchanged dependencies no longer need to be rebuilt.
  2. As a side effect of this, we do not have to add compiled binaries to the repo anymore. Every user gets them compiled the first time they try to run the program. So we may purge all the binaries and the leptonica and tesseract sources from the repo and significantly decrease its size.
  3. I've also added the -O2 compiler option to the leptonica and tesseract build phase, so they should run faster now.
  4. The Tesseract, Leptonica, and image libs build scripts are under Travis control now. So we can easily change build scripts and confirm that they are still working right after commit ^.^
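The timestamp rule from point 1 can be sketched in shell. This is a simplified illustration of the idea, not GNU make's actual implementation, and `needs_rebuild` is a hypothetical helper name: a target counts as stale when it is missing or when any prerequisite is newer than it.

```bash
#!/usr/bin/env bash
# Simplified sketch of make's staleness check (hypothetical helper, not part of this PR).
# Usage: needs_rebuild <target> <prerequisite>...
# Returns 0 (true) when the target must be rebuilt, 1 otherwise.
needs_rebuild() {
  local target="$1"
  shift
  # A missing target always has to be built.
  [ -e "$target" ] || return 0
  local dep
  for dep in "$@"; do
    # -nt: true when $dep has a newer modification time than $target.
    if [ "$dep" -nt "$target" ]; then
      return 0
    fi
  done
  return 1
}
```

With prebuilt binaries that are newer than their sources, a check like this reports nothing to do, which is why repeated builds are fast.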

Please try this and let me know what you think of the whole idea.

@ws233
Collaborator Author

ws233 commented Aug 11, 2015

@kevincon, the build has failed because libtiff-ios is configured to be built with iOS SDK 8.4, but in our tests we use versions 7.1 and 8.1. I'll fork libtiff-ios and modify it to support any SDK version very soon, but for now you can try the PR with SDK 8.4.

@ws233
Collaborator Author

ws233 commented Aug 11, 2015

You may also try common clean targets as follows:

     make mostlyclean  # removes the intermediate files and keeps only the results of the build
     make clean        # removes all object files but keeps the Makefiles generated by GNU configure, so there is no need to reconfigure
     make distclean    # removes all object files and Makefiles, returning the distribution to its initial state

@ws233 ws233 force-pushed the make branch 3 times, most recently from 291f572 to f7dd0cb Compare August 12, 2015 16:33
@kevincon
Collaborator

I like the idea of converting the build script to a Makefile for the reasons you mentioned, but I think you need to update the podspec as well since the libs wouldn't be part of the repo anymore.

But before you do that, wouldn't making the Makefile run as part of the TesseractOCR target cause users of the CocoaPod to have to build the libs every time they install the pod in a new project (which takes a long time, as you mention in the README)? If that's the case, I don't think we should make this change because I think that's unreasonable for every user to have to wait that long when they add the Pod to a project.

Instead, would it be possible to add a brand new target to the project that represents building the libs using the Makefile so that it's easier for users to do that if they want to (and we still get the other Makefile benefits), but then also still have the libs included in the repo so that CocoaPod users don't have to build them when they add the Pod to a project? And maybe the existing TesseractOCR target would be smart enough to know that, since the libs are included in the repo, they don't have to be rebuilt if the user doesn't need them to be rebuilt?

@ws233
Collaborator Author

ws233 commented Aug 16, 2015

I like the idea of converting the build script to a Makefile for the reasons you mentioned, but I think you need to update the podspec as well since the libs wouldn't be part of the repo anymore.

What update do you mean? Sorry, I don't understand what is necessary here.

wouldn't making the Makefile run as part of the TesseractOCR target cause users of the CocoaPod to have to build the libs every time they install the pod in a new project (which takes a long time, as you mention in the README)?

Yes, the libs are compiled from scratch every time the pod is installed in a new project. But how often do you set up new projects? It's a tradeoff between first compile time and repo size. The repo is already about 350 MB and it will grow significantly, since Xcode 7 introduced bitcode, which significantly increases the size of every architecture slice (it almost doubles it, if I'm not mistaken). Anyway, in my humble opinion 350 MB for a wrapper is quite big; we do not have that much code in our repo ^.^

Is it possible to run a survey on GitHub to ask our users about this tradeoff? What would they choose? Do they consider the build time or the repo size more important?

but then also still have the libs included in the repo so that CocoaPod users don't have to build them when they add the Pod to a project? And maybe the existing TesseractOCR target would be smart enough to know that, since the libs are included in the repo, they don't have to be rebuilt if the user doesn't need them to be rebuilt?

I thought about it. I can put the precompiled lib binaries back in the repo. In that case we do not even need a separate target. GNU Make will simply do nothing, since the target is already in place.

@ws233
Collaborator Author

ws233 commented Aug 16, 2015

@kevincon, just an example. In the bitcode branch I've uploaded libtesseract_all.a alone, which is 58.61 MB, and I even received a warning from GitHub:

remote: warning: GH001: Large files detected.
remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File TesseractOCR/lib/libtesseract_all.a is 58.61 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB

But besides libtesseract_all.a, there are also the leptonica and image binaries.
So it seems that every further binary upload will significantly increase the repo size. I think we should definitely address this issue very soon.
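A warning like GH001 could be caught locally before pushing. The following is a minimal sketch, assuming a plain `find`-based check is enough; `list_large_files` is a hypothetical helper, and treating 50 MB as 50,000,000 bytes is an assumption, since the warning does not say how GitHub counts megabytes.

```bash
#!/usr/bin/env bash
# Hypothetical pre-push check (not part of this PR): list files under a
# directory that exceed a size limit given in bytes.
# Usage: list_large_files <dir> <limit_bytes>
list_large_files() {
  local dir="$1" limit="$2"
  # find's "c" suffix counts bytes; "+N" means strictly more than N.
  find "$dir" -type f -size +"${limit}c"
}

# Example (commented out so the sketch has no side effects):
# list_large_files TesseractOCR/lib 50000000
```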

@ws233
Collaborator Author

ws233 commented Aug 16, 2015

@kevincon, how do you like the idea of building the dependencies right in the pod install phase? It will take the same time, but at least it would happen while installing the pod instead of when building the project for the first time.
Perhaps there are also other ways to decrease the repo size besides the one I've proposed in this PR?

@kevincon
Collaborator

I like the idea of converting the build script to a Makefile for the reasons you mentioned, but I think you need to update the podspec as well since the libs wouldn't be part of the repo anymore.

What update do you mean? Sorry, I don't understand what is necessary here.

The podspec file (TesseractOCRiOS.podspec in the root of the repo) is what CocoaPods uses to determine how to install the library in a user's project when they run pod install. Right now there's a line in that file that tells CocoaPods that it can find the static Tesseract libs in the repo so that CocoaPods can copy it into the user's project:

 s.ios.vendored_library    = 'TesseractOCR/lib/*.a'

So if we were to merge this PR right now, pod install would break for users since these libs are no longer included as part of your changes.
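The failure mode described here can be checked mechanically. A minimal sketch, assuming the only requirement is that at least one file matches the `TesseractOCR/lib/*.a` pattern from the podspec; `has_vendored_libs` is a hypothetical helper and not a CocoaPods feature.

```bash
#!/usr/bin/env bash
# Hypothetical sanity check (not part of this PR or of CocoaPods): verify that
# at least one static library matches the pattern the podspec relies on.
# Usage: has_vendored_libs <lib_dir>
has_vendored_libs() {
  local lib_dir="$1" f
  for f in "$lib_dir"/*.a; do
    # If the glob matches nothing, the pattern itself is returned,
    # so test for a real file rather than trusting the loop variable.
    [ -f "$f" ] && return 0
  done
  return 1
}
```

Running `has_vendored_libs TesseractOCR/lib` in CI would fail as soon as the binaries are removed while the podspec still points at them.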

Yes, the libs are compiled from scratch every time the pod is installed in a new project. But how often do you set up new projects?

I don't think the important point is how often users install new projects; I think the important point is the user's perception about the library based on their experience installing it.

@kevincon, how do you like the idea of building the dependencies right in the pod install phase? It will take the same time, but at least it would happen while installing the pod instead of when building the project for the first time.

I timed how long it takes to run pod install with just TesseractOCRiOS release version 4.0.0 as a dependency:

3.26s user 0.67s system 68% cpu 5.771 total

I also timed how long it takes to run make to build the libs with the changes from this PR:

1808.37s user 333.38s system 213% cpu 16:42.24 total

1808.37 seconds (30 minutes and 8 seconds) is ridiculously long for a user to wait to start using this library, regardless of whether that time is part of pod install or whether it's the part of building their project for the first time in Xcode.
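The conversion behind these figures can be reproduced in shell. One caveat, assuming the output above is in the zsh `time` format: the `user` value is CPU time summed over all cores, while the final `total` field is elapsed wall-clock time, so the run took 16:42.24 on the clock and used 1808.37 s of user CPU time. `fmt_duration` is a hypothetical helper written for this illustration.

```bash
#!/usr/bin/env bash
# Convert a number of seconds (fractional part truncated) to "<m>m <s>s".
# fmt_duration is a hypothetical helper, not part of this PR.
fmt_duration() {
  local whole="${1%.*}"          # "1808.37" -> "1808"
  printf '%dm %ds\n' "$((whole / 60))" "$((whole % 60))"
}

fmt_duration 1808.37   # user CPU time of the make run: prints "30m 8s"
fmt_duration 1002.24   # wall-clock 16:42.24 in seconds: prints "16m 42s"
```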

To be clear, I will not merge this pull request if it results in imposing that kind of time on users when they install this library.

It's a tradeoff between first compile time and repo size. The repo is already about 350 MB and it will grow significantly, since Xcode 7 introduced bitcode, which significantly increases the size of every architecture slice (it almost doubles it, if I'm not mistaken). Anyway, in my humble opinion 350 MB for a wrapper is quite big; we do not have that much code in our repo ^.^

It's an easy choice from my point of view to choose having a larger repo over the long compilation time that this PR introduces for users.

As yet another data point, I timed how long it takes to clone this repo (from HEAD in master):

21.70s user 6.44s system 37% cpu 1:15.12 total

I'm personally happy as long as that time stays under a minute, and I don't think users will complain that they're about to run out of hard drive space as part of using this library.

Is it possible to run a survey on GitHub to ask our users about this tradeoff? What would they choose? Do they consider the build time or the repo size more important?

I can't think of a great way to do this given our resources right now. We could start a mailing list for users interested in providing their opinions on these kinds of questions, but otherwise I think the best thing we can do is to create a new issue on the repo to ask users what they would prefer.

I thought about it. I can put the precompiled lib binaries back in the repo. In that case we do not even need a separate target. GNU Make will simply do nothing, since the target is already in place.

I think the best thing to do is to put the pre-compiled lib binaries back in the repo so that users don't have to build them when they install the library (or build it for the first time).

@kevincon, just an example. In the bitcode branch I've uploaded libtesseract_all.a alone, which is 58.61 MB, and I even received a warning from GitHub:

remote: warning: GH001: Large files detected.
remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File TesseractOCR/lib/libtesseract_all.a is 58.61 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB

But besides libtesseract_all.a, there are also the leptonica and image binaries.
So it seems that every further binary upload will significantly increase the repo size. I think we should definitely address this issue very soon.

You're right that GitHub recommends that individual files be less than 50 MB, but this is an arbitrary recommendation on their part, so I'm not concerned about it. In reality, the only true limit we face is the 1 GB limit that GitHub imposes on public repos.

I agree that we should address the issue of a growing repo size, but not at the expense of our users' experience installing the library.

Perhaps there are also other ways to decrease the repo size besides the one I've proposed in this PR?

I propose that we simply purge the repo of older versions of the libs anytime we update them. One trade-off I can think of is that if users check out a specific past commit of the repo (say commit X) then the libs they see at that commit will be the libs from HEAD as opposed to the actual libs that were part of the repo at the time of commit X. However, I think that this issue is somewhat mitigated by the fact that users can download a zip file from the Releases page that has a preserved copy of the libs from the time of each release.

There is a tool called BFG Repo-Cleaner that we can use to do this: https://rtyley.github.io/bfg-repo-cleaner. What do you think?
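For reference, the purge could look roughly like the steps below, which follow the usage documented on the BFG page. This is a dry-run sketch that only prints the commands; the repository URL and the path to the BFG jar are placeholders, and choosing `'*.a'` as the file pattern is an assumption about what we would want to delete. BFG leaves the latest commit untouched by default, so the libs currently at HEAD would stay.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the purge steps instead of executing them.
# repo_url and bfg_jar are placeholders supplied by the caller, not values
# taken from this thread.
# Usage: print_purge_plan <repo_url> <path_to_bfg_jar>
print_purge_plan() {
  local repo_url="$1" bfg_jar="$2"
  cat <<EOF
git clone --mirror ${repo_url} repo.git
java -jar ${bfg_jar} --delete-files '*.a' repo.git
cd repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push
EOF
}
```

Since this rewrites history, everyone with an existing clone would need to clone again afterwards.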

@ws233
Collaborator Author

ws233 commented Aug 17, 2015

OK. That makes sense. I'll add the libs back.

2. Removed internal tesseract and leptonica headers from the project.
@ws233
Collaborator Author

ws233 commented Aug 21, 2015

OK. I've added the library binaries back.
But it seems that the Travis build is no longer being triggered.

2. Image fat libs are copied in ./libs and their headers in ./include
@ws233
Collaborator Author

ws233 commented Aug 21, 2015

I've removed the pre-build make step from the run phase.
But it's still there during the Travis build. Just let me know if I should remove the dependency-building phase from the Travis build.

@kevincon
Collaborator

Thanks for putting the libs back, but it looks like you put back a folder for leptonica-1.71 while the Makefile is hard-coded to look for leptonica-1.72. This results in a bug with the Makefile; if you try to run make clean directly from this pull request, it goes into an infinite loop with the error below (I think because the leptonica-1.72 folder doesn't exist by default, but regardless it should just fail gracefully instead of looping infinitely). Can you make sure that the lib binaries and headers in the repo are all the most recent versions of leptonica, libtiff, etc., then can you fix make clean and double-check that all of the other Makefile commands work?

/bin/sh: line 0: cd: /Users/kcon/Sources/Tesseract-OCR-iOS/TesseractOCR/leptonica-1.72/arm-apple-darwin7/: No such file or directory
for folder in  /Users/kcon/Sources/Tesseract-OCR-iOS/TesseractOCR/leptonica-1.72/arm-apple-darwin7/  /Users/kcon/Sources/Tesseract-OCR-iOS/TesseractOCR/leptonica-1.72/arm-apple-darwin7s/  /Users/kcon/Sources/Tesseract-OCR-iOS/TesseractOCR/leptonica-1.72/arm-apple-darwin64/  /Users/kcon/Sources/Tesseract-OCR-iOS/TesseractOCR/leptonica-1.72/i386-apple-darwin/  /Users/kcon/Sources/Tesseract-OCR-iOS/TesseractOCR/leptonica-1.72/x86_64-apple-darwin/; do \
        cd $folder; \
        /Applications/Xcode.app/Contents/Developer/usr/bin/make clean; \
    done ; \

Can you also fix the two warnings on the TesseractOCR target? The way we resolved those before was by manually editing the affected Tesseract source files to add (int) casts (see #120 when we talked about it before).

Also I think it's okay for the libs to be built as part of the Travis build.

@ws233
Collaborator Author

ws233 commented Aug 24, 2015

Regarding #120. I've created a PR to the upstream repo.
So let's wait a bit, while they review it and merge.

Regarding make script. I'll double check it soon.

@ws233
Collaborator Author

ws233 commented Aug 24, 2015

Ok. I've fixed the issue.
It was due to a lack of the dependent lib sources.
The script tried to change the directory to the lib source folder and failed.
So I've added a condition to check if the sources exist.
So it should work now.
I've also checked other targets as well and they do work.
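The kind of guard described above might look like the sketch below. The actual Makefile rule is not shown in this thread, so this is an assumption about its shape; `clean_build_dirs` is a hypothetical name. A plausible reading of the earlier loop is that when `cd` failed, `make clean` ran in the current directory and re-entered the same rule, so the sketch skips missing folders and chains `cd` and `make` with `&&`.

```bash
#!/usr/bin/env bash
# Sketch of a guarded clean loop (hypothetical; the real rule is not shown here).
# MAKE can be overridden by the caller; it defaults to "make".
# Usage: clean_build_dirs <dir>...
clean_build_dirs() {
  local folder
  for folder in "$@"; do
    # Skip architectures whose build directory was never created.
    [ -d "$folder" ] || continue
    # The subshell keeps the cd local, and && prevents running
    # "make clean" in the wrong directory if cd fails.
    ( cd "$folder" && "${MAKE:-make}" clean )
  done
}
```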

@kevincon
Collaborator

Regarding #120. I've created a PR to the upstream repo.
So let's wait a bit, while they review it and merge.

No, please fix those files in include/ now as part of this PR. Upstream tesseract has 14 open PRs, so I doubt they review them often (if at all?). In the meantime, we shouldn't expose our users to any compilation warnings.

Related, I think we should specify specific tags for the submodules so that we have some indication that they're stable (while preserving the ability to easily upgrade them in the future). We can do this by modifying the .gitmodules file to point to the tag of a stable release. I created a new issue to do this (#214) and to update the libs accordingly (e.g. I noticed that upstream Tesseract finally released version 3.04 last month, so we should make the tesseract-ocr submodule point to the 3.04 release tag). Let me know if you'd like to take care of this issue; otherwise I'll do it when I have some time.

Ok. I've fixed the issue.
It was due to a lack of the dependent lib sources.
The script tried to change the directory to the lib source folder and failed.
So I've added a condition to check if the sources exist.
So it should work now.
I've also checked other targets as well and they do work.

Thanks for fixing it, it works for me now. Once you fix the compilation warnings, I'll merge this PR.

@ws233
Collaborator Author

ws233 commented Aug 25, 2015

No, please fix those files in include/ now as part of this PR. Upstream tesseract has 14 open PRs, so I doubt they review them often (if at all?). In the meantime, we shouldn't expose our users to any compilation warnings.

Yes, they do review PRs. But ours is low priority, I guess.
Anyway, I'll fix it.

so we should make the tesseract-ocr submodule point to the 3.04 release tag

I was going to do this as soon as we commit this PR.

Let me know if you'd like to take care of this issue; otherwise I'll do it when I have some time.

I would appreciate it if you could help me with this and take care of the upgrade. But be aware that Tesseract 3.04 has significantly changed the function that produces PDF output. So let me know if you need help upgrading our wrapper for this function or rewriting its tests.

Thx!

@ws233
Collaborator Author

ws233 commented Aug 25, 2015

Also, libtiff-ios has merged the Makefile I created for them. So let me spend a little time removing the duplicated code from our Makefile and simply using the libtiff-ios Makefile to build the image libs. I expect to finish this week, if you don't mind.

@ws233 ws233 force-pushed the make branch 22 times, most recently from d2ab8eb to a800676 Compare September 13, 2015 11:24
@ws233
Collaborator Author

ws233 commented Sep 13, 2015

That's ready to be merged as well.
In the last commit I've modified the Makefile to respect the ARCHS environment variable, so it can build only the specified architectures. That's the same environment variable Xcode uses, so it's simple to add this Makefile as a pre-build step to the Xcode build so that only the active architecture is built. It's also possible to build only the device targets (ARM) for App Store submission.
As a result, the Travis CI build time for the unit-tested target has dropped to 10 minutes.
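To illustrate the idea, here is a minimal sketch of how ARCHS values could select which per-architecture folders get built. The folder names are the ones visible in the make clean log earlier in this thread; the mapping from architecture to folder and the `archs_to_folders` helper are assumptions, not the Makefile's actual logic.

```bash
#!/usr/bin/env bash
# Sketch: translate Xcode-style ARCHS values into per-architecture build folders.
# The arch-to-folder mapping is an assumption made for this illustration.
# Usage: archs_to_folders "<space-separated archs>"
archs_to_folders() {
  local archs="${1:-armv7 armv7s arm64 i386 x86_64}"  # default: build everything
  local arch out=""
  for arch in $archs; do
    case "$arch" in
      armv7)  out="$out arm-apple-darwin7" ;;
      armv7s) out="$out arm-apple-darwin7s" ;;
      arm64)  out="$out arm-apple-darwin64" ;;
      i386)   out="$out i386-apple-darwin" ;;
      x86_64) out="$out x86_64-apple-darwin" ;;
      *)      echo "unknown architecture: $arch" >&2; return 1 ;;
    esac
  done
  echo "${out# }"   # drop the leading space
}
```

Called as `archs_to_folders "$ARCHS"`, it falls back to all five folders when ARCHS is empty or unset.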

I've also modified the Travis CI build matrix, so every target is now built both with and without rebuilding the dependent libraries. That helps us ensure that 1. the binaries uploaded to the repo are correct and 2. the Makefile that builds the dependent libraries is not broken.

…S environment variable only.

2. Makefile updated to support ARCHS environment variable only for building specified architectures only.
3. README_howto_compile_libraries.md updated to explain above changes.
4. Prebuilt script changed to build only active architectures of dependent libraries in Travis CI build environment.
ws233 added a commit that referenced this pull request Oct 10, 2015
Add Makefile instead of the build_dependencies.sh script
@ws233 ws233 merged commit 9d94a87 into gali8:master Oct 10, 2015
@ws233 ws233 deleted the make branch October 10, 2015 19:41