Any plans to speed up recognition? #99

Closed
schrodog opened this issue Nov 19, 2018 · 46 comments
Labels: enhancement
@schrodog

Right now, using howdy with certainty = 5, it usually takes 2-3 seconds or more to authenticate, which is considerably slower than Windows Hello (< 2s on average). Any plans to further improve the algorithm?

@boltgolt
Owner

Not right now, no. I'm not experienced enough with machine learning to create my own model, and if I did, it would probably be slower than the one used right now. @dmig has made some general optimizations, but those are more about making the code readable and won't speed things up that much.

You can set end_report to true with sudo howdy config to get a more detailed time report after authentication. For me this returns:

Time spend
  Starting up: 135ms
  Opening the camera: 260ms
  Importing face_recognition: 1426ms
  Searching for known face: 301ms

So by far the most time is spent importing the face_recognition module, which is where we could definitely save a lot. If anyone has ideas to speed this up, or other modules to use, let me know.
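For reference, this setting lives in Howdy's config file, which sudo howdy config opens in an editor. A minimal sketch, assuming the section is named [debug] (it may differ between versions):

    [debug]
    # print a timing report like the one above after each authentication
    end_report = true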

@boltgolt added the enhancement label Nov 19, 2018
@dmig
Contributor

dmig commented Nov 19, 2018

For me, a noticeable speedup was gained by rebuilding dlib with full optimisations (see its manual). Using CUDA should also give a performance boost.
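For anyone wanting to try this, a rebuild of the python bindings could look like the sketch below. This is a sketch under the assumption that your dlib version's setup.py accepts --set to forward CMake options (recent versions do); DLIB_USE_CUDA only helps with a CUDA toolchain installed:

    # build dlib's python bindings with SIMD (and optionally CUDA) enabled
    git clone https://github.com/davisking/dlib.git
    cd dlib
    sudo python3 setup.py install \
        --set USE_AVX_INSTRUCTIONS=1 \
        --set USE_SSE4_INSTRUCTIONS=1 \
        --set DLIB_USE_CUDA=1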

@dmig
Contributor

dmig commented Nov 19, 2018

Oh, btw: using a normal cam (not the IR one) slows down recognition, but gives a more reliable result.

@dmig
Contributor

dmig commented Nov 19, 2018

My results:

Time spend
  Starting up: 92ms
  Opening the camera: 435ms
  Importing face_recognition: 1070ms
  Searching for known face: 91ms

Resolution
  Native: 374x340
  Used: 374x340

Frames searched: 1 (11.01 fps)
Dark frames ignored: 0
Certainty of winning frame: 2.38
Winning model: 0 ("Initial model")

This is probably the best time; it can be up to 2 seconds sometimes. Notice the resolution: I set max_height = 374 in the config to avoid unnecessary scaling.
I'll try to find out whether the face_recognition load time can be reduced.
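For reference, that setting as it would appear in the config file; the [video] section name is an assumption and may vary per Howdy version:

    [video]
    # cap the frame height used for recognition; matching the camera's
    # native height avoids a rescaling pass on every frame
    max_height = 374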

@dmig
Contributor

dmig commented Nov 19, 2018

... rebuilding dlib with full optimizations...

Looking at how debian/postinst builds dlib, I realized that I had already gone one step further: I had installed one of dlib's optional dependencies, OpenBLAS, which is pretty well optimized and gave me better recognition speed.
But there are more options:

All of them except the first are available as packages in the standard repositories. If dlib finds none of them, it issues a warning at the configure stage and lets you proceed, probably using some fallback code.

I'll run some tests and later rewrite the installation script to pick the best available libraries. I have a good feeling about this.
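In the meantime, the accidental OpenBLAS speedup can be reproduced by hand; a sketch for Ubuntu/Debian, assuming dlib is rebuilt from its source tree so the configure step can detect the library:

    # install OpenBLAS from the standard repos ...
    sudo apt install libopenblas-dev
    # ... then rebuild dlib so its configure step picks it up
    cd dlib && sudo python3 setup.py install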

@boltgolt
Owner

Nice work! Does it speed up the per-frame recognition times or the face_recognition import? How much does installing OpenBLAS add to the installation time?

@dmig
Contributor

dmig commented Nov 19, 2018

I can't tell how much, since I didn't measure. I'll do that later. The libraries are installed as packages from the Ubuntu repos.

@sapjunior

Maybe we should build howdy as a background service, since most of the time is spent loading the library. The service would wake up the camera upon an authentication request.

@dmig
Contributor

dmig commented Nov 20, 2018

That's possible, but I don't like the approach. I'll try to reduce the amount of data loaded. It's also possible to rely on the filesystem cache to keep the libraries in memory most of the time.

@boltgolt
Owner

I've also been thinking about offloading the actual recognition to an (optional) daemon. However, compare.py uses about 217MB of memory right now, which I think is a bit much to keep reserved. If we could do it through a cache, it would be the best of both worlds.

There's also the possibility of multithreading. Right now Howdy uses just 1 core and pushes it to maximum load, but doesn't touch any of the other 7 on my machine because everything runs in a single thread. I think splitting the recognition into a separate process would also allow us to start multiple instances, increasing the number of frames we can process.
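As a rough illustration of that idea (with stand-in helpers, not Howdy's actual code), multiprocessing would let each frame be searched on its own core:

    import multiprocessing as mp

    def scan_frame(frame):
        # stand-in for the real per-frame work (face detection plus
        # encoding comparison); returns a fake certainty value
        return sum(frame) % 7

    if __name__ == "__main__":
        frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # stand-ins for camera frames
        # one worker per core: several frames get searched at once
        with mp.Pool() as pool:
            certainties = pool.map(scan_frame, frames)
        print(min(certainties))  # in Howdy, a lower certainty is a better match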

@dmig
Contributor

dmig commented Nov 21, 2018

@boltgolt that's what I'm addressing right now. It's possible to reduce the memory footprint, though probably not by much; I need to test.
Can you tell me why you're using 2 versions of Python at the same time? I see no reason for that.

@boltgolt
Owner

Howdy depends on pam_python, which handles the conversation with the central authentication system. It's a bit outdated though and can only run python2 scripts, which is why pam.py runs in a different Python version than the rest of the script.

I'm looking at alternatives in #9. In my opinion a C++ PAM module that starts compare.py would be the most efficient solution.

@dmig
Contributor

dmig commented Nov 21, 2018

Ok, let me ask another way: why do you use python3? Python2 would be sufficient.

@boltgolt
Owner

boltgolt commented Nov 21, 2018

Python2 will be dropped in 2020, about a year from now. I don't think it's very future-proof, and all python packages should be moving to 3 relatively soon.

@dmig
Contributor

dmig commented Nov 21, 2018

Until pam_python gets updated to python3 in the system repo, we can stick to 2.7. No need to waste time/RAM spawning a python3 interpreter.

@boltgolt
Owner

boltgolt commented Nov 21, 2018

I don't particularly like the idea of rewriting everything except pam.py in python2, which is slowly fading away. A more structural move would be to ditch the pam_python part and go directly from C++ to python3.

@dmig
Contributor

dmig commented Nov 21, 2018

You mean pam_exec? It's installed by default, so that's one dependency fewer.

@boltgolt
Owner

No, I mean writing a custom PAM module for Howdy alone. The major upside is that it should also be able to ask for password input simultaneously while Howdy is running.

@dmig
Contributor

dmig commented Nov 21, 2018

So why stick to python code then?
Rewritten completely in C, it would provide the best possible performance.

@boltgolt
Owner

boltgolt commented Nov 21, 2018

The honest answer is that I started this as a small script just for myself, and I'm much more familiar with python. It's the same reason we're still depending on pam_python to this day: I'm not comfortable at all with C++. That doesn't mean we shouldn't switch, but my contributions would be much smaller in C++.

@sapjunior

sapjunior commented Nov 21, 2018

I found a fork of pam_python which supports python3 here: https://github.com/minrk/pamela (used in JupyterHub).

@sapjunior

sapjunior commented Nov 22, 2018

@dmig @boltgolt
Please take a look at my repository https://github.com/sapjunior/dlibFaceRecognition. I wrote a stripped-down version of face_recognition that loads only the necessary dlib models; the original ageitgey/face_recognition also loads unused ones (the CNN detector and the 68-point face landmark predictor). This stripped-down version reduces the total process time (loading library and model + opening the camera + recognition) to only ~1.4s on my notebook. I also applied some optimizations, using numpy broadcasting instead of a simple for-loop to compare faces (sketched after the timings below).

Original
Time spend
Starting up: 297ms
Opening the camera: 357ms
Importing face_recognition: 1028ms
Searching for known face: 2191ms

Resolution
Native: 340x340
Used: 340x340

Improved version
Load Library and Model in 0.7182595729827881 s
Open Camera in 0.36762571334838867 s
Winning Model 0 Distance 0.22640754844688699
Face Recognition in 0.29424214363098145 s
Total Time 1.3801851272583008 s
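The broadcasting optimization mentioned above boils down to computing the distance to every known encoding in one vectorized call instead of a python-level loop; a minimal sketch with random stand-in data:

    import numpy as np

    # stand-ins: 5 stored 128-d face encodings plus one freshly captured encoding
    known_encodings = np.random.rand(5, 128)
    new_encoding = np.random.rand(128)

    # broadcasting subtracts new_encoding from every row at once; the
    # per-row euclidean norm then yields all face distances in one call
    distances = np.linalg.norm(known_encodings - new_encoding, axis=1)
    best = int(np.argmin(distances))
    print(best, distances[best])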

@dmig
Contributor

dmig commented Nov 22, 2018

@sapjunior great! I'm looking into ditching the face_recognition packages completely, because they are just dlib examples wrapped in a package.

@dmig
Contributor

dmig commented Nov 28, 2018

Ok, tests finished. Here is a small report.

Test HW: Core i7-8550U, 8GB, SSD (SATA, 6 Gbit/s), no GPU acceleration used (unfortunately, dlib supports neither OpenCL itself nor any OpenCL-backed libraries).
Test software:

  • Ubuntu 18.10 x64 with kernel 4.18.19
  • 6 virtualenvs (python2 and python3) with different dlib builds and otherwise identical dependencies
  • 200 sample images from an IR webcam (face recognition fails on ~60 of them), no dark frames
  • face models from a Howdy installation
  • every test was repeated 10 times

I built dlib with the following configurations:

Results:
[image: timing graphs, one per test]

Each graph represents the numbers for a single test. Bars denote median values; smaller is better.
All numbers are in milliseconds unless noted otherwise. Blue bars are Python2, red are Python3.

Conclusion:
I was surprised by the atlas results and would recommend adding it as a package dependency or installing it at build time. I wouldn't recommend the Intel MKL libs because of their size and some packaging problems.

More tests needed:

  • on CUDA enabled hardware
  • possibly with other dlib build flags on older CPUs without AVX/SSE4 support -- a different configuration may win there

If someone wants to repeat the tests, I'll write some instructions and publish all the scripts.

@dmig
Contributor

dmig commented Nov 28, 2018

Next step: get rid of the face_recognition packages and reduce load time.

@dmig
Contributor

dmig commented Nov 29, 2018

More test results. I've rewritten the test the same way as @sapjunior's example.
[image: timing graphs]

Face recognition was measured differently here, so face search + face recognition together still take the same time.

Most notable change: memory consumption is reduced by ~70MB.

@boltgolt
Owner

Amazing work man! Very nice and readable, with great graphs.

I agree that atlas is the obvious choice looking at the data. MKL also looks quite good in a lot of tests, but I trust your judgement.

Would adding atlas be as simple (at least for Ubuntu) as adding libatlas3-base to the dependencies and a -DDLIB_USE_BLAS flag to the dlib build command?

@sapjunior

sapjunior commented Dec 1, 2018

I think ATLAS is the best choice, because MKL is quite large and difficult to install. By the way, I'm trying to port MobileFaceNet from https://github.com/deepinsight/insightface using the OpenCV 4.0 ONNX dnn module. This network is quite fast (~40ms inference time) and looks promising in terms of accuracy as well.

@dmig
Contributor

dmig commented Dec 1, 2018

@boltgolt all these libraries are mutually exclusive. There is no flag to tell dlib to use a particular one; there is just a simple routine that detects which libraries are present. Moreover, these libraries use the alternatives mechanism to provide libblas.so and liblapack.so, so installing another one may break existing software.
IMO a better way would be to add a line to debian/control:

Recommends: libatlas-base-dev | libopenblas-dev | liblapack-dev
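For the curious, the alternatives mechanism can be inspected and switched by hand; a sketch in which the alternative name (libblas.so.3-x86_64-linux-gnu) is how it appears on recent Ubuntu releases and varies by release and architecture:

    # list the registered BLAS implementations
    update-alternatives --list libblas.so.3-x86_64-linux-gnu
    # pick one interactively, e.g. ATLAS or OpenBLAS
    sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu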

@dmig
Contributor

dmig commented Dec 1, 2018

@sapjunior MKL is not really difficult to install -- it is available from the system repositories, but one of its dependencies is packaged the wrong way, so one extra package must be installed manually.
The overall size of all the libmkl-dev dependencies is huge, and this may increase load time on an HDD.

@dmig
Contributor

dmig commented Dec 1, 2018

BTW, this may help HDD owners:

  • https://github.com/kokoko3k/gopreload -- a nice and simple preloader; add the installed dlib and face_recognition_models (especially!) to its preload list
  • the old preload daemon -- I didn't find a way to point it at dlib/face_recognition_models, so let's hope it picks them up automatically

@boltgolt added this to the Release 2.5.0 milestone Dec 2, 2018
@boltgolt
Owner

boltgolt commented Dec 2, 2018

I'll add them as recommended packages for Debian in the next release, thanks!

I'd say we add the optimized face_recognition @sapjunior made directly into the code here. Then we either keep howdy in memory as a whole, or just the .dat files.

@dmig
Contributor

dmig commented Dec 7, 2018

I'd like to hear some opinions about the face_recognition_models package, which contains 4 .dat files. It's simple to install, but we only ever need 2 of the 4 files at a time: 1 common file, plus a different file depending on CNN or HOG recognition.

I see 2 options: leave it as a dependency, or ditch it and download the required .dat files from https://github.com/davisking/dlib-models (their origin)?

Right now I lean towards the second, because we already download dlib itself.
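If the second option wins, fetching the models could look like the sketch below; the repo publishes bz2-compressed .dat files, but the URL pattern and the exact set of files needed for HOG vs CNN mode are assumptions to verify:

    base=https://github.com/davisking/dlib-models/raw/master
    # common file: the 128-d face encoder used by both modes
    wget $base/dlib_face_recognition_resnet_model_v1.dat.bz2
    # landmark predictor for the HOG path (the CNN path would need
    # mmod_human_face_detector.dat.bz2 instead)
    wget $base/shape_predictor_5_face_landmarks.dat.bz2
    bunzip2 *.dat.bz2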

@boltgolt
Owner

boltgolt commented Dec 7, 2018

If we ditch face_recognition completely, we could implement the code from @sapjunior I mentioned in my last comment and just ship the +/- 30MB of data files in the package itself. That would eliminate about 2/3 of the required space and keep everything easy to install.

@dmig
Contributor

dmig commented Dec 8, 2018

Ok, great. I've made the modifications to the code and am testing them right now.

@dmig
Contributor

dmig commented Dec 8, 2018

Also, I added an option to use CNN recognition. The test runs 12-15 times slower on my hardware and consumes 200-600MB of RAM (as opposed to 120MB for HOG), but maybe someone with CUDA enabled will find it useful.
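For context, switching between the two detector flavours in dlib is a one-line difference; a minimal sketch, assuming the mmod_human_face_detector.dat model file from dlib-models is present locally:

    import dlib

    # HOG detector: CPU-friendly, bundled inside dlib itself
    hog_detector = dlib.get_frontal_face_detector()

    # CNN detector: more robust, but an order of magnitude slower on CPU;
    # mainly worthwhile with a CUDA-enabled dlib build
    cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

    # both are invoked the same way on an image array:
    #   faces = hog_detector(frame, 1)
    #   faces = cnn_detector(frame, 1)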

@dmig
Contributor

dmig commented Dec 9, 2018

Ok, the new code is in my repository: https://github.com/dmig/howdy. It's still not thoroughly tested, so I'm not making a PR yet.
Authentication is now almost as fast as Windows Hello.

I want to test a pure python2 version of the pam module: this should reduce RAM usage by 8-16MB and slightly reduce startup time. The subprocess.call() in pam.py is probably a blocker for #9.

@boltgolt
Owner

boltgolt commented Dec 9, 2018

Changes look very nice! The speedup would be very welcome, and completely dropping face_recognition is a huge improvement. Open a PR whenever you're ready.

I'm still against moving to python2 though. It's a step backwards in tech that 8 to 16 MB of memory doesn't compensate for. Still working on a PAM module in C++ that should probably speed up startup time in a similar fashion.

@dmig
Contributor

dmig commented Dec 9, 2018

I'm still against moving to python2 though. It's a step backwards in tech that 8 to 16 MB of memory doesn't compensate for.

Python2 would not be a step backward, but a removal of ad-hoc code. Also, the python code could be modified to run on both versions.

Still working on a PAM module in C++ that should probably speed up startup time in a similar fashion.

That would be great to have, and it would give the same memory/startup improvements.

@dmig
Contributor

dmig commented Dec 11, 2018

I've managed to reduce auth time to less than 1 second:

$ sudo ls
Time spent
  Starting up: 99ms
  Open cam + load libs: 690ms
    Opening the camera: 690ms
    Importing recognition libs: 548ms
  Searching for known face: 161ms

Resolution
  Native: 374x340
  Used: 374x340

Frames searched: 2 (12.45 fps)
Dark frames ignored: 0 
Certainty of winning frame: 3.431
Winning model: 3 ("Model #4")
Identified face as dmig
...

A small gain compared to the previous result, around 250ms:

Time spent
  Starting up: 101ms
  Opening the camera: 434ms
  Importing libs: 535ms
  Searching for known face: 1447ms

Some more tests and I'll make a PR.

@boltgolt
Owner

boltgolt commented Dec 11, 2018

That's almost exactly as long as Windows Hello (on my machine), very nice work!

@dmig
Contributor

dmig commented Dec 11, 2018

That wasn't the final result. This probably is:

$ sudo ls
Time spent
  Starting up: 95ms
  Open cam + load libs: 434ms
    Opening the camera: 434ms
    Importing recognition libs: 392ms
  Searching for known face: 63ms

Resolution
  Native: 374x340
  Used: 374x340

Frames searched: 2 (31.61 fps)
Dark frames ignored: 1 
Certainty of winning frame: 2.641
Winning model: 4 ("Model #5")
Identified face as dmig
autocomplete  debian  LICENSE  README.md  src

@timwelch

I installed your branch to see what might need to change for compatibility with the ffmpeg work I've been doing. It looks like you call a video_capture.grab() function instead of .read(). In my ffmpeg_reader.py class, we'll need to add that function as a sort of redirect to .read():

	def grab(self):
		""" Redirect grab() to read() for compatibility; grab() should
		return a success flag, so pass read()'s flag through. """
		ret, _ = self.read()  # assumes read() returns (ret, frame)
		return ret

Other than that it seems to work, and although the ffmpeg methods are slower than OpenCV, the new face recognition saves around 500ms on average off the total time on my machine.

Kudos for the hard work!

@timwelch

timwelch commented Dec 16, 2018

In the spirit of speeding things up... I found a way to ditch ffmpeg, which is horrendously slow, by using a pure-python v4l2 implementation instead (for my own HP IR camera). :-)

I ran a handful of tests. :-)

  • Note 1: I haven't compiled dlib with any tweaks, if you did that along the way; I only pulled down your git repo and started using it. So if there are more speedups to be had from that, this could be even faster!

  • Note 2: Starting a new thread to load the new face recognition does make things faster overall, but for some reason it slows down the rest of the code. Initially the pyv4l2 loading took 226ms, but running in parallel with the face recognition imports (I'm imagining) nearly doubles that to 475ms. Not quite sure what to make of that. If I remove the threading and run the code as before, loading the face recognition after the pyv4l2 camera open, the total time is ~890ms. (The arrangement is sketched right after these notes.)

  • Note 3: Taking the threading idea further, I ran a test pulling the actual video_capture initialization into a thread as well... That slowed things down more. So there is definitely a cost to threading, with diminishing returns.
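The arrangement from Note 2 looks roughly like the sketch below, with stand-in sleeps in place of the real imports and camera setup:

    import threading
    import time

    libs = {}

    def import_recognition_libs():
        # stand-in for the heavy dlib / model imports
        time.sleep(0.5)
        libs["ready"] = True

    def open_camera():
        # stand-in for the pyv4l2 camera initialisation
        time.sleep(0.2)

    loader = threading.Thread(target=import_recognition_libs)
    loader.start()   # heavy imports proceed in the background ...
    open_camera()    # ... while the camera opens in the main thread
    loader.join()    # wait for the imports before searching frames
    assert libs["ready"]

The slowdown of the camera open in Note 2 may simply be GIL contention: a real import executes a lot of python bytecode, which competes with the main thread for the interpreter lock.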

Original code with FFMPEG class (2210 ms):

-> sudo echo original-with-ffmpeg
Time spent
  Starting up: 118ms
  Open cam + load libs: 1007ms
    Opening the camera: 1007ms
    Importing recognition libs: 988ms
  Searching for known face: 97ms
  Total time: 2210ms

Resolution
  Native: 352x352
  Used: 320x320

Frames searched: 2 (20.60 fps)
Dark frames ignored: 1 
Certainty of winning frame: 2.583
Winning model: 0 ("Initial model")
Identified face as tim
original-with-ffmpeg

Original code with new pyv4l2 class (1565 ms):

-> sudo echo original-with-pyv4l2
Time spent
  Starting up: 117ms
  Open cam + load libs: 1124ms
    Opening the camera: 226ms
    Importing recognition libs: 1124ms
  Searching for known face: 98ms
  Total time: 1565ms

Resolution
  Native: 352x352
  Used: 320x320

Frames searched: 1 (10.22 fps)
Dark frames ignored: 0 
Certainty of winning frame: 3.228
Winning model: 0 ("Initial model")
Identified face as tim
original-with-pyv4l2

New recog code with new pyv4l2 class WITHOUT threading (885 ms):

-> sudo echo new-recog-with-pyv4l2
Time spent
  Starting up: 159ms
  Open cam + load libs: 382ms
    Opening the camera: 257ms
    Importing recognition libs: 382ms
  Searching for known face: 87ms
  Total time: 885ms

Resolution
  Native: 352x352
  Used: 320x320

Frames searched: 2 (22.98 fps)
Dark frames ignored: 1 
Certainty of winning frame: 2.450
Winning model: 1 ("Model #2")
Identified face as tim
new-recog-with-pyv4l2

New recog code with new pyv4l2 class WITH THREADING (748 ms):

-> sudo echo new-recog-with-pyv4l2
Time spent
  Starting up: 139ms
  Open cam + load libs: 522ms
    Opening the camera: 475ms
    Importing recognition libs: 522ms
  Searching for known face: 87ms
  Total time: 748ms

Resolution
  Native: 352x352
  Used: 320x320

Frames searched: 2 (22.98 fps)
Dark frames ignored: 1 
Certainty of winning frame: 2.542
Winning model: 1 ("Model #2")
Identified face as tim
new-recog-with-pyv4l2

@boltgolt
Owner

Looks even better than the FFmpeg option! Again, very nice work!

The performance improvement is definitely worth it, so I'd add it as a new recorder, just like FFmpeg. I think it's best to keep the FFmpeg option available as it is now on the dev branch, for compatibility with non-v4l devices among other things.

@boltgolt
Owner

boltgolt commented Jan 2, 2019

Sorry for the delay, but the code written by @dmig has now been merged into the dev branch. There were a few issues (a deprecated GitHub API endpoint, for instance), but those are fixed now. The speed upgrade is incredible: most of my tests succeeded in less than 0.3 seconds, which is just amazing.

Because this thread was mostly about those changes and the author's goal (<2 seconds) has been met, I'm closing this issue.
Thanks again all for your work!
