
Scalability issue with authentication? #979

Closed · claustres opened this issue Sep 5, 2018 · 18 comments

@claustres (Contributor) commented Sep 5, 2018

Steps to reproduce

Create two chat apps using the CLI by following https://docs.feathersjs.com/guides/chat/readme.html, with NeDB as the datastore and socket.io as the transport: one with local authentication enabled and one without it.

Create a test user in each one. To track the number of concurrent connections, simply change the socketio configuration like this:

// Count concurrent socket.io connections
let nbConnections = 0
app.configure(socketio({}, (io) => {
  io.on('connection', socket => {
    nbConnections++
    console.log(`${nbConnections} concurrent sockets`)
    socket.on('disconnect', () => {
      nbConnections--
      console.log(`${nbConnections} concurrent sockets`)
    })
  })
}));

We also disabled channels.

Use this benchmark article to perform a workload test of the applications. The following scenario should be run for the authenticated app (a minimal client-side sketch follows these lists):

  • connect
  • authenticate
  • get user using ID from JWT payload
  • get messages from service
  • logout
  • disconnect

The following scenario should be run for the unauthenticated app:

  • connect
  • get messages from service
  • disconnect
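
For reference, here is a minimal sketch of what one authenticated client does in this scenario, assuming the Feathers v3 client over socket.io; the URL, credentials and the userId payload field are illustrative defaults, not taken from the actual benchmark code:

const io = require('socket.io-client')
const feathers = require('@feathersjs/client')

async function runClient () {
  // connect
  const socket = io('http://localhost:3030')
  const client = feathers()
    .configure(feathers.socketio(socket))
    .configure(feathers.authentication())

  // authenticate with the local strategy
  const { accessToken } = await client.authenticate({
    strategy: 'local', email: 'test@example.com', password: 'secret'
  })

  // get user using ID from JWT payload
  const payload = await client.passport.verifyJWT(accessToken)
  const user = await client.service('users').get(payload.userId)

  // get messages from service
  const messages = await client.service('messages').find()

  // logout and disconnect
  await client.logout()
  socket.disconnect()
}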

We run these scenarios with a ramp-up duration of 60 seconds and the following configurations:

  • ramp up to 10 concurrent clients
  • ramp up to 30 concurrent clients
  • ramp up to 60 concurrent clients

This means that in the worst case we are performing 1 client authentication per second while other clients access the service. In practice it is usually less than that, because after querying the service each client pauses for 5 s to simulate a user reading the content, so most clients are connected but inactive.

Actual behavior

Everything goes fine with the unauthenticated app, average service response is always under 10 ms.

Up to 30 concurrent clients on the authenticated app things go almost fine as well: average service response is around 150 ms and average authentication response is around 1 s.

With 60 concurrent clients on the authenticated app almost a quarter of them suffer this error: Error: Authentication timed out. Average service response is around 700 ms and average authentication response is around 8 s (for those that succeed in connecting).

I am not 100% sure about this, but when I wrote the benchmark article to perform a workload test of our production application with Feathers V2, I think we supported something like 500 concurrent connections on the same hardware. So this might be a V3-specific issue. When running the test again on our production app, now migrated to V3, we also noticed a behavior similar to issue #892, which may be related to timeouts.

What is strange is that increasing the ramp-up duration does not help; it seems there is a "processing barrier" at some point that prevents authentication from scaling.

Expected behavior

A basic Feathers app should support authenticating a large number of users per minute on a decent machine. Of course we could probably make this benchmark pass with multiple app instances, but it seems to me that 60 concurrent clients is not too much for a "good" machine.

A good test has already been done on websockets, but it did not include authentication; it might be interesting to try to reproduce this issue with it.

System configuration

Hardware: Core i7 7700HQ 2.8 GHz (4 cores), 16GB RAM

Module versions:

  • "@feathersjs/feathers": "^3.1.7",
  • "@feathersjs/socketio": "^3.2.2",
  • "@feathersjs/client": "^3.5.3"

NodeJS version: 8.9

Operating System: Windows

@claustres (Contributor, Author)

Please note that I increased the authentication client timeout to 20 s, while the default configuration is 5 s; with the default Feathers setup the tests break even before 60 concurrent clients.
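
For context, a minimal sketch of that client-side change, assuming the timeout option of @feathersjs/authentication-client (in milliseconds):

const feathers = require('@feathersjs/feathers')
const auth = require('@feathersjs/authentication-client')

const client = feathers()
// The transport (feathers.rest(...) or feathers.socketio(...)) is configured first,
// then the authentication client with a 20 s timeout instead of the 5 s default
client.configure(auth({ timeout: 20000 }))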

@daffl (Member) commented Sep 5, 2018

This has already been discussed in feathersjs-ecosystem/authentication-local#70. From the BCrypt.js documentation:

For bcrypt to be effective, it needs to be THAT much slower, since it’s designed to raise the cost of password cracking. At 100ms, that means at least 10 passwords per second or faster for the attacker per machine; at that rate, a single machine w/o optimizations (GPU, etc.) can crack a 6-character non-complex password in an average time of 6 months. Realistically, at 100x speed, that means a 7-character password will fall in 45 days

If you require better performance you have to implement a different password hashing strategy.
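
For illustration only, a rough sketch of what that could look like in Feathers v3 by extending the local strategy's Verifier; the PBKDF2 scheme and the entity.salt field are assumptions (stored hashes would of course have to be produced with the same scheme), not part of the default setup:

const local = require('@feathersjs/authentication-local')
const crypto = require('crypto')

// Replaces the default bcrypt comparison with a PBKDF2 check (tune the iterations to taste)
class CustomVerifier extends local.Verifier {
  _comparePassword (entity, password) {
    const hash = crypto.pbkdf2Sync(password, entity.salt, 100000, 64, 'sha512').toString('hex')
    return hash === entity.password
      ? Promise.resolve(entity)
      : Promise.reject(new Error('Invalid password'))
  }
}

// Registered in the authentication setup, where app is the Feathers application
module.exports = function (app) {
  app.configure(local({ Verifier: CustomVerifier }))
}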

@claustres (Contributor, Author)

I knew that password hashing is a slow process, but here I simply perform authentication with an existing user/password. Does password comparison also suffer from this scalability issue, as far as you know?

@daffl (Member) commented Sep 5, 2018

Yes. The plain text password has to go through the same hashing mechanism in order to compare it.
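
A quick illustrative sketch with the bcryptjs package: compare re-runs the hash using the salt and cost factor embedded in the stored hash, so it costs roughly as much as the original hashing (timings depend on hardware):

const bcrypt = require('bcryptjs')

async function demo () {
  console.time('hash')
  const storedHash = await bcrypt.hash('secret', 12) // hash at cost factor 12
  console.timeEnd('hash')

  console.time('compare')
  const valid = await bcrypt.compare('secret', storedHash) // roughly the same cost
  console.timeEnd('compare')

  console.log(valid) // true
}

demo()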

@claustres (Contributor, Author) commented Sep 5, 2018

OK, this is a really good explanation, thanks! However, don't you think these numbers are much too high? I am gathering some performance benchmarks about bcrypt, and with a work factor of 12 hashing takes on the order of hundreds of milliseconds on a laptop, far from the 20 s of my timeout setup; e.g. this one from 2012, this one from 2018, etc.
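
For anyone wanting to reproduce these numbers locally, a small sketch (using bcryptjs and its synchronous API purely for timing) showing how hashing time grows with the work factor; the actual figures obviously depend on hardware:

const bcrypt = require('bcryptjs')

// Each +1 on the work factor roughly doubles the hashing time
for (const cost of [10, 12, 14, 16]) {
  const start = Date.now()
  bcrypt.hashSync('secret', cost)
  console.log(`work factor ${cost}: ${Date.now() - start} ms`)
}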

Maybe you can explain a little bit the logic behind https://github.com/feathersjs/feathers/blob/master/packages/authentication-local/lib/utils/hash.js, which seems to increase the work factor depending on the date, to anticipate the increase in computing power I guess. Do you know which work factor is used by default in Feathers now?

It seems this changed at some point, because we used the local authentication module v0.4.3 for a long time and its bcrypt setup was less complex, see https://github.com/feathersjs/authentication-local/blob/v0.4.3/src/utils/hash.js. This might explain why we didn't notice it previously.

Maybe the problem is also related to the main event loop being blocked by bcrypt processing, so that requests stack up.
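
A crude way to check that hypothesis during a benchmark run would be a small event-loop lag probe dropped into the server, something like this sketch:

// Logs how late a 1 s timer fires; large values mean the event loop is
// blocked (e.g. by synchronous hashing) and incoming requests stack up
let last = Date.now()
setInterval(() => {
  const lag = Date.now() - last - 1000
  if (lag > 50) console.log(`event loop lag ~${lag} ms`)
  last = Date.now()
}, 1000)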

I will try to perform authentication with an existing JWT token instead of a full login to test whether things go better, but I think they will ;-)
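
A minimal sketch of that variant with the Feathers v3 client (client configured as in the scenario sketch earlier; the stored token variable is an assumption):

// Re-authenticate with an existing token instead of a full local login
client.authenticate({
  strategy: 'jwt',
  accessToken: previouslyIssuedToken // assumed to have been kept from a prior login
}).then(({ accessToken }) => {
  console.log('Authenticated with JWT', accessToken)
})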

@claustres (Contributor, Author)

Just tested https://www.dailycred.com/article/bcrypt-calculator and the result is almost instant with a factor of 12.

@daffl (Member) commented Sep 5, 2018

feathersjs-ecosystem/authentication-local#30 is the pull request and discussion for that.

@claustres (Contributor, Author) commented Sep 5, 2018

So as far as I understand, a cost factor of 12 is still used today, which keeps my observations relevant IMHO.

@daffl (Member) commented Sep 5, 2018

It looks like it can be random between 12 and 19. You can try removing that code to see if it makes a difference (or implement your own hashing system). Besides bcrypt hashing I am not aware of anything computationally intensive happening anywhere else in Feathers, so any delays probably come from there.

I never had an application where 30+ people logged in at the exact same time so I'll let you make a call on this by putting up a PR.

@claustres (Contributor, Author) commented Sep 5, 2018

Maybe you did not read my article completely, but in my tests the number of logins is far lower than that. I have a ramp-up phase (e.g. during one minute) where clients progressively connect until we have, let's say, 60 concurrent clients. Then each client calls a service and pauses for, let's say, 5 s. After a given number of calls a client logs out/disconnects and is replaced by a new one (so we have a login here) in order to maintain the concurrency level during the whole test duration.

With my numbers this means that in the worst case we are performing 1 login per second while some other clients might be accessing the service. But it is usually far less than that, because each client pauses between service calls to simulate a user reading the content, so most clients are connected but inactive. I will try to track the number of logins per second and give you the results back in case I am wrong.

I also updated my code to authenticate directly with a JWT token strategy instead of performing a password-based login. Things are better: I am now getting timeouts starting at 200 concurrent clients instead of 60, but it does not seem to be such a great improvement :-( By the way, on the client I now get Error: Logout timed out and on the server error: NotAuthenticated: No auth token.

@claustres (Contributor, Author)

Good news: with JWT I increased the ramp-up duration so that there is no more than 1 login per second, and I successfully jumped without problem to 1000 concurrent clients on my hardware.

It seems there is a barrier around 1-2 logins per second with JWT and less using local login; not sure if this is normal, but I have at least improved my knowledge today. I also updated my article to add the JWT-based authentication strategy.

I would appreciate it if someone could provide a benchmark including authentication to see if it looks similar.

@daffl (Member) commented Sep 6, 2018

I just ran a REST benchmark including authentication on my performance comparison repo. The stats without authentication were:

Running 10000 requests test @ http://localhost:3030/messages/test
100 connections

Stat         Avg     Stdev  Max
Latency (ms) 32.23   26.31  449.28
Req/Sec      2500.25 952    3829
Bytes/Sec    631 kB  241 kB 957 kB

Same request with JWT authentication (using a dummy user service):

Running 10000 requests test @ http://localhost:3030/messages/test
100 connections

Stat         Avg     Stdev  Max
Latency (ms) 67.52   28.04  480.98
Req/Sec      1428.58 232.45 1639
Bytes/Sec    358 kB  58 kB  410 kB

It's definitely slower but not unreasonably so. So I think the problem is either:

  • Issues with Socket authentication - very likely related to the ones I also mentioned in the Crow roadmap post I just published.
  • Intentionally slowed down comparison by BCrypt.js for username/password login. Not sure if anything can be done about that other than reverting back to the default settings or changing the password hashing method (which you can already do manually).

@claustres (Contributor, Author)

Thanks for the feedback. I also updated my benchmark to work with REST and experienced no problems with it. I extended my article with an example of our staging infrastructure at the end. So far with sockets:

  • login with bcrypt is slow: on decent hardware we cannot reach more than 1 login per second, which seems a little low to me given that a cost factor of 12 is still the norm in 2018
  • login with JWT is faster but we can hardly reach 2 logins per second, which also seems a little low to me since no computationally intensive task has to be done

I wonder if there is some particular state causing sockets to be slower (e.g. an array of connections you have to look through on each new connection, etc.)? This might also be the cost of the underlying socket library; by the way, we use socket.io.

@daffl (Member) commented Sep 6, 2018

Are you using the latest Feathers with channels? How is the performance of logins (local authentication) via REST?

@claustres (Contributor, Author)

Yes, I use Feathers V3 with channels but deactivated them in the test apps to avoid any additional workload. I can share my numbers for a ramp-up duration of 60 s up to 60 concurrent clients, i.e. one local login per second on average. What I observe is a really strange behavior with authentication: although REST does not raise any timeout, it actually performs a lot worse!

REST with local authentication:

info: Total test time = 188.82 (s)
info: Total connect time = 0.05 (s)
info: Average connect time = 0.00 (s)
info: Total authenticate time = 3982.26 (s)
info: Average authenticate time = 33.19 (s)
info: Total messages time = 364.78 (s)
info: Average messages time = 3.04 (s)
info: Total disconnect time = 0.01 (s)
info: Average disconnect time = 0.00 (s)
info: Error ratio = 0.00 %

Websockets with local authentication:

info: Total test time = 106.69 (s)
info: Total connect time = 0.20 (s)
info: Average connect time = 0.00 (s)
info: Total authenticate time = 638.07 (s)
info: Average authenticate time = 5.32 (s)
info: Total messages time = 25.29 (s)
info: Average messages time = 0.21 (s)
info: Total disconnect time = 14.68 (s)
info: Average disconnect time = 0.12 (s)
info: Error ratio = 64.17 %

With the app without authentication, the numbers for REST and Websockets are pretty similar.

REST without authentication:

info: Total test time = 99.80 (s)
info: Total connect time = 0.08 (s)
info: Average connect time = 0.00 (s)
info: Total messages time = 0.83 (s)
info: Average messages time = 0.01 (s)
info: Total disconnect time = 0.01 (s)
info: Average disconnect time = 0.00 (s)
info: Error ratio = 0.00 %

Websockets without authentication:

info: Total test time = 103.85 (s)
info: Total connect time = 0.38 (s)
info: Average connect time = 0.00 (s)
info: Total messages time = 0.75 (s)
info: Average messages time = 0.01 (s)
info: Total disconnect time = 0.09 (s)
info: Average disconnect time = 0.00 (s)
info: Error ratio = 0.00 %

In saying that the local authentication problem does not affect REST, maybe I was misled by the fact that the timeout option of the authentication module is not actually taken into account with REST? So I tried to manage it myself and, bingo, changing the REST client configuration like this: client.configure(feathers.rest(url).fetch((url, options) => fetch(url, Object.assign({ timeout: 20000 }, options)))) made timeouts appear with REST as well. We then get pretty similar results to websockets, as you can see below.
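
For readability, the same workaround as a block, assuming node-fetch (whose timeout option aborts requests that take too long):

const fetch = require('node-fetch')

// Wrap fetch so every REST call, including authentication, fails after 20 s
client.configure(
  feathers.rest(url).fetch((url, options) =>
    fetch(url, Object.assign({ timeout: 20000 }, options))
  )
)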

REST with local authentication (corrected):

info: Total test time = 98.92 (s)
info: Total connect time = 0.04 (s)
info: Average connect time = 0.00 (s)
info: Total authenticate time = 500.68 (s)
info: Average authenticate time = 4.17 (s)
info: Total messages time = 133.67 (s)
info: Average messages time = 1.11 (s)
info: Total disconnect time = 0.00 (s)
info: Average disconnect time = 0.00 (s)
info: Error ratio = 64.17 %

In any case I also share the numbers with JWT authentication.

REST with JWT authentication:

info: Total test time = 105.17 (s)
info: Total connect time = 0.08 (s)
info: Average connect time = 0.00 (s)
info: Total authenticate time = 28.01 (s)
info: Average authenticate time = 0.23 (s)
info: Total messages time = 1.29 (s)
info: Average messages time = 0.01 (s)
info: Total disconnect time = 0.02 (s)
info: Average disconnect time = 0.00 (s)
info: Error ratio = 0.00 %

Websockets with JWT authentication:

info: Total test time = 104.12 (s)
info: Total connect time = 0.29 (s)
info: Average connect time = 0.00 (s)
info: Total authenticate time = 22.13 (s)
info: Average authenticate time = 0.18 (s)
info: Total messages time = 3.50 (s)
info: Average messages time = 0.03 (s)
info: Total disconnect time = 0.93 (s)
info: Average disconnect time = 0.01 (s)
info: Error ratio = 0.00 %

So in short:

  • the differences between REST/Websockets with local authentication are not significant
  • there is something like a 30x factor between local and JWT authentication

@claustres (Contributor, Author)

What I don't understand is why raw performance is so much better in your tests. I am not saying my benchmark is free of bugs, but I am logging the number of connections and it seems consistent. There are two main differences between your benchmark and mine. First, I am using the Feathers client. Second, although the total number of concurrent connections is fixed, on my side they are not initiated once and then kept open: during the ramp-up phase a new client connects every second while others have already started working on services, and in the steady phase one client connects/disconnects every second to simulate users constantly entering/leaving the app.
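
To make the difference concrete, a rough sketch of the ramp-up/steady-state pattern described above (runClient stands for the per-client scenario sketched earlier in this issue; durations are illustrative):

const RAMP_UP_S = 60          // one new client per second during ramp-up
const STEADY_DURATION_S = 300 // steady phase with constant churn

// Ramp-up phase: clients join progressively while earlier ones are already working
for (let i = 0; i < RAMP_UP_S; i++) {
  setTimeout(runClient, i * 1000)
}

// Steady phase: roughly one client leaves and a new one logs in every second,
// keeping the concurrency level constant
const churn = setInterval(runClient, 1000)
setTimeout(() => clearInterval(churn), (RAMP_UP_S + STEADY_DURATION_S) * 1000)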

@daffl (Member) commented Sep 4, 2019

It would be great to see if those specific issues still persist in the latest version (v4).

@daffl (Member) commented Jan 9, 2020

Going to close this since, even if it is still the case, it is part of how bcrypt is designed. The new authentication system is flexible enough to let you customize the hashing algorithm by customizing the strategy.
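
For reference, a hedged sketch of that kind of customization against the v4 API, extending LocalStrategy; argon2 is just an example of an alternative algorithm, not something recommended in this thread, and the password field name is assumed:

const { LocalStrategy } = require('@feathersjs/authentication-local')
const { NotAuthenticated } = require('@feathersjs/errors')
const argon2 = require('argon2')

class Argon2Strategy extends LocalStrategy {
  // How passwords are hashed when stored
  async hashPassword (password) {
    return argon2.hash(password)
  }

  // How the plain-text password is verified on login
  async comparePassword (entity, password) {
    if (await argon2.verify(entity.password, password)) {
      return entity
    }
    throw new NotAuthenticated('Invalid login')
  }
}

// Registered on the AuthenticationService in place of the default strategy:
// authentication.register('local', new Argon2Strategy())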

@daffl closed this as completed on Jan 9, 2020