
mobilenet-gpu not working with double-take #691

Closed
bigbangus opened this issue Dec 22, 2021 · 15 comments

Comments

@bigbangus

bigbangus commented Dec 22, 2021

Describe the bug
Initially, the mobilenet-gpu version appears to work in the GUI, and I can successfully test my recognition application there using a stock photo. However, once I try to connect double-take to it using the URL + key, it stalls and I get errors in the CompreFace logs.

Notes:
- same behavior if I use the Unraid single-container version
- the regular CompreFace version works fine with double-take in both the single-container and docker-compose versions
- same behavior with internal or external DB
- same behavior with the arcface-gpu version

Hardware/OS
Unraid 6.10-rc2 (docker-compose plugin)
Nvidia Driver: 495.46 (patched)
GTX 1050Ti
Ryzen 9 3900X w/64GB RAM

initial GUI test works and subsequent tests work
[screenshots]

Then the double-take API connection fails:
[screenshots]

nvidia-smi:
[screenshot]

docker logs:
compreface-db.txt
compreface-ui.txt
compreface-admin.txt
compreface-api.txt
compreface-core.txt

.env:
registry=exadel/
postgres_username=postgres
postgres_password=postgres
postgres_db=frs
postgres_domain=compreface-postgres-db
postgres_port=5432
email_host=smtp.gmail.com
email_username=
email_from=
email_password=
enable_email_server=false
save_images_to_db=true
compreface_api_java_options=-Xmx8g
compreface_admin_java_options=-Xmx8g
ADMIN_VERSION=0.6.1
API_VERSION=0.6.1
FE_VERSION=0.6.1
CORE_VERSION=0.6.1-mobilenet-gpu

docker-compose:
docker-compose.zip

@pospielov
Collaborator

According to the logs, compreface-api makes a request to compreface-core, which takes too long to answer; that is why it returns a timeout error.
compreface-core is the module that runs the neural network. I don't see errors in compreface-core, but I do see that the uWSGI listen queue is full.
I also see in nvidia-smi that the GPU is overloaded.
Is it possible that you are making too many requests?

@bigbangus
Author

I also see in nvidia-smi that the GPU is overloaded. Is it possible that you are making too many requests?

I think so. This is what nvidia-smi looks like when I'm uploading images to the web GUI. Only 17%.
[screenshot]

I just don't understand why double-take would flood compreface with requests when trying to connect. And why this only happens with the gpu versions. I think it uses a sample lenna.jpg to test the API and to show a green status on compreface. But for whatever reason it goes nuts with the gpu version.

@pospielov
Collaborator

How much time does it take to get a response from the GPU version when you test through the UI? Ideally, it should take less time than the CPU version.
Also, be aware that the first one or two requests will take more time, as the servers need to load models and initialize caches.

@jakowenko

I can shed a little insight on the Double Take side. The detectors' status is updated on load of the /config page and every 30 seconds after that. There shouldn't be any spamming of the CompreFace API unless you keep refreshing the /config page in the DT UI.

@bigbangus
Author

How much time does it take to get a response from the GPU version when you test through the UI? Ideally, it should take less time than the CPU version. Also, be aware that the first one or two requests will take more time, as the servers need to load models and initialize caches.

Yes, this matches my experience. Regular CompreFace takes several seconds to process each image; compreface-mobilenet-gpu takes under 1 second. Both have an initial delay, like you said.

I can shed a little insight on the Double Take side. The detectors' status is updated on load of the /config page and every 30 seconds after that. There shouldn't be any spamming of the CompreFace API unless you keep refreshing the /config page in the DT UI.

Understood. Again, it works fine with regular CompreFace; with the GPU versions it just gets flooded. Are the GPU versions too fast?

@bigbangus
Author

Today I tried again using an Ubuntu 20.04.3 LTS 64-bit virtual machine with a GTX 1660 Ti passed through on Unraid 6.10-rc2. I installed Docker, Docker Compose, and the NVIDIA Docker runtime. Same result for mobilenet-gpu: initially I passed it lenna.jpg and it works fine. All subsequent web GUI requests work fine and are super fast. No issues to this point.

As soon as I connect double-take using the url and key it blows up again.

Not sure what else I can do here to help solve the issue. I feel like double-take is flooding CompreFace with requests, but I'm not sure why. I would love for this to work because it's so much faster on the GPU.

Docker logs before double-take connects:
compreface-admin.log
compreface-api.log
compreface-core.log
compreface-postgres-db.log
compreface-ui.log

Docker logs after double-take tries to connect
compreface-admin.log
compreface-api.log
compreface-core.log
compreface-fe.log
compreface-postgres-db.log

nvidia-smi:
[screenshot]

docker ps
[screenshot]

@pospielov
Collaborator

pospielov commented Dec 30, 2021

Looks like I found the issue, here is an example of your request:
{"log":"10.129.0.250 - - [28/Dec/2021:15:45:02 +0000] \"POST /api/v1/recognition/recognize?face_plugins=undefined\u0026det_prob_threshold=0 HTTP/1.1\" 499 0 \"-\" \"axios/0.24.0\"\n","stream":"stdout","time":"2021-12-28T15:45:02.168531834Z"}
There is a param det_prob_threshold=0.
Here is how the algorithm works: first, it tries to find all faces in the image. Like every other ML algorithm, it can't say a flat "yes" or "no"; it reports that something is a face with a probability from 0 to 1, and this threshold tells the algorithm what it should treat as a face.
If the threshold is zero, it simply keeps all found "faces". I tried locally: with the default value it found one face; with 0 it found 7129 "faces". Then the algorithm runs facial recognition on all 7129 "faces"...
So double-take does not flood CompreFace with requests; it sends a single request that is far too heavy, because it tries to recognize thousands of "faces" in the image.
So why the difference when you change the CompreFace version?
It looks like double-take always sends det_prob_threshold=0, but FaceNet (the default build) behaves differently from InsightFace (all custom builds). With that same image and det_prob_threshold=0, FaceNet returned just one face.
Then I tried this image and by default, FaceNet found 14 faces:
[screenshot]
When I send the same request with det_prob_threshold=0, FaceNet returned 17 faces.
Then I tried the same image with InsightFace, and by default, it returned only 7 faces:
[screenshot]
If I set det_prob_threshold=0.1, it returned 14 faces.
If I set det_prob_threshold=0, it returned 4714 faces.
The default value for FaceNet is 0.85 and for InsightFace is 0.8; we didn't change them from the original libraries.

So, you need to change this value in double-take if it's possible. If not, we need to ask @jakowenko to implement this functionality :)
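The effect of det_prob_threshold described above can be illustrated with a small sketch. This is hypothetical code, not CompreFace's actual detector: the candidate boxes and probabilities are made up for illustration, but the gating logic is the same idea — the threshold decides which candidate detections survive before the (expensive) recognition step runs on each of them.

```javascript
// Sketch of how det_prob_threshold gates candidate detections before
// recognition runs. The candidate list is invented for illustration;
// it is not CompreFace's real detector output.
function filterByThreshold(candidates, detProbThreshold) {
  return candidates.filter((c) => c.probability >= detProbThreshold);
}

const candidates = [
  { box: [214, 193, 350, 395], probability: 0.99 }, // the real face
  { box: [10, 10, 20, 20], probability: 0.42 },     // spurious detection
  { box: [5, 80, 12, 95], probability: 0.03 },      // noise
];

// With a sane threshold, only the real face goes on to recognition.
console.log(filterByThreshold(candidates, 0.8).length); // 1

// With det_prob_threshold=0, every candidate survives, so recognition
// runs on all of them -- the "7129 faces" effect described above.
console.log(filterByThreshold(candidates, 0).length); // 3
```

With a threshold of 0 the recognition cost scales with the number of noise candidates, which is why a single request can saturate the GPU and fill the uWSGI listen queue.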

@bigbangus
Author

bigbangus commented Dec 31, 2021

@pospielov thank you for putting the time to find this and explain how det_prob_threshold works.

So, you need to change this value in double-take if it's possible. If not, we need to ask @jakowenko to implement this functionality :)

I can confirm that my double-take config has det_prob_threshold: 0.8 (see below).

@jakowenko can you confirm that det_prob_threshold is being correctly passed through to the compreface API?

double-take detector config:

# detector settings (default: shown below)
detectors:
  compreface:

    url: http://x.x.x.x:8000 #masked for privacy
    key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #masked for privacy
   
    # number of seconds before the request times out and is aborted
    timeout: 15
    # minimum required confidence that a recognized face is actually a face
    # value is between 0.0 and 1.0
    det_prob_threshold: 0.8
    # comma-separated slugs of face plugins
    # https://github.com/exadel-inc/CompreFace/blob/master/docs/Face-services-and-plugins.md)
    # face_plugins: mask,gender,age

@bigbangus
Author

Yep, pretty confident it's just the status check in double-take that is hardcoded to det_prob_threshold = 0.

@jakowenko if you can update the code to use 0.1, or to use the det_prob_threshold defined in the double-take config, for the lenna.jpg status check, that would probably solve the issue.

Thanks!

In the CompreFace Unraid log (regular version), you can see the lenna.jpg status check is sent with det_prob_threshold at 0.0 despite the config being set to 0.8.

[screenshot]

@pospielov
Collaborator

@bigbangus Probably you can create a pull request to double-take; it looks like this is the line where det_prob_threshold is defined:
https://github.com/jakowenko/double-take/blob/252cbce65f4a94cc20d2cc9e333b43b8887655bf/api/src/util/detectors/compreface.js#L26
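A minimal sketch of the kind of change needed: read the threshold from the user's detector config instead of hardcoding 0 for the status check. The `config` object shape below mirrors the YAML posted earlier in this thread, but the accessor itself is an assumption for illustration, not double-take's actual module.

```javascript
// Sketch of the fix: use the configured det_prob_threshold for the
// lenna.jpg status check instead of a hardcoded 0. The config shape
// is assumed from the YAML in this thread, not double-take's real code.
function statusCheckThreshold(config) {
  const configured = config?.detectors?.compreface?.det_prob_threshold;
  // Fall back to a sane non-zero default when nothing is configured.
  return configured !== undefined ? configured : 0.8;
}

console.log(statusCheckThreshold({
  detectors: { compreface: { det_prob_threshold: 0.8 } },
})); // 0.8

console.log(statusCheckThreshold({})); // falls back to 0.8
```

Any non-zero fallback avoids the pathological "thousands of faces" case on the InsightFace builds while keeping the status check meaningful.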

@bigbangus
Author

@bigbangus Probably you can create a pull request to double-take, looks like this is the line where det_prob_threshold is defined: https://github.com/jakowenko/double-take/blob/252cbce65f4a94cc20d2cc9e333b43b8887655bf/api/src/util/detectors/compreface.js#L26

Yes, I found that in the code as well, but I'm still new to programming, GitHub, Docker, and IT in general. I will read up on how to make a pull request and pursue that if the author doesn't have time to update it. Thanks!

@bigbangus
Author

OK, never mind. I watched this YouTube video and made a pull request using my Linux VM. So cool! Thanks for the tip. Love GitHub!

https://github.com/jakowenko/double-take/pull/185/files

@juan11perez

@bigbangus
I have very similar hardware (GTX 1050 Ti, Ryzen 9 3900X w/64GB RAM) with Unraid 6.9.2 and experienced exactly the same issue.
I modified the double-take container per your pull request and it now works with mobilenet-gpu.

Thank you

@bigbangus
Author

@juan11perez awesome. I modified compreface.js in the running container and restarted double-take through its own GUI, and it works now. So yes, it seems like this change would be great once @jakowenko has time to address it!

[screenshots]

{"severity": "DEBUG", "message": "Found: BoundingBoxDTO(x_min=214, y_min=193, x_max=350, y_max=395, probability=0.9944502115249634, _np_landmarks=array([[270.41437, 271.69827],\n [327.2149 , 273.5742 ],\n [308.94156, 315.77658],\n [270.0415 , 344.64978],\n [316.9687 , 346.79092]], dtype=float32))", "request": {"method": "POST", "path": "/find_faces", "filename": "lenna.jpg", "api_key": "", "remote_addr": "127.0.0.1"}, "logger": "src.services.facescan.plugins.insightface.insightface", "module": "insightface", "traceback": null, "build_version": "dev"}
{"severity": "INFO", "message": "200 OK", "request": {"method": "POST", "path": "/find_faces", "filename": "lenna.jpg", "api_key": "", "remote_addr": "127.0.0.1"}, "logger": "src.services.flask_.log_response", "module": "log_response", "traceback": null, "build_version": "dev"}
172.17.0.1 - - [17/Jan/2022:14:16:04 -0500] "POST /api/v1/recognition/recognize?face_plugins=undefined&det_prob_threshold=0.8 HTTP/1.1" 200 256 "-" "axios/0.24.0"

@bigbangus
Author

The PR is now merged, and the issue is confirmed resolved with v1.9.0 of double-take. Thank you @jakowenko @pospielov
https://github.com/jakowenko/double-take/releases/tag/v1.9.0
