Oom custom trust manager #928

Contributor

@uplogix-mmcclain uplogix-mmcclain commented Jan 9, 2023

The following is similar in nature to CVE-2021-4213, but requires a
non-standard configuration.

We are running Tomcat 9 with JSS 4.9.3 and Tomcat-JSS 7.7.1 (OpenJDK 8),
but the underlying code is unchanged in the master branch on GitHub.

If a Tomcat endpoint is configured to use JSS with a custom TrustManager
(anything other than JSSNativeTrustManager) and requires client
certificates, then the server will rapidly run out of memory.

If a client establishes a TLS connection with a client cert that passes
validation of the trust manager, the data associated with the connection
will be "leaked": the 2 large buffers (4096 to 18713 bytes each), plus
JNI references (SSLFD, etc), the JSSSession, JSSEngineReferenceImpl, the
CertValidationTask, etc.

The data is not leaked if a client tries to establish a TLS connection,
but never presents a cert or presents a cert that fails the
X509TrustManager.checkClientTrusted() check. In either case, an
exception is internally created and handled, closeInbound() and
closeOutbound() are called followed by tryCleanup(). This releases the
JSSSession and later allows for the JSSEngineReferenceImpl to be garbage
collected.

The attached patches work for our use cases. A more conservative change
could be made to CertValidationTask to store a WeakReference to the
JSSEngineReferenceImpl, but then the code would need to handle the case
where the JSSEngineReferenceImpl is garbage collected before the
CertValidationTask (the check would need to return a non-zero value). This
more conservative change could still support X509ExtendedTrustManager.
Since this is part of a library, I don't know whether anyone relies on
such support.
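
As a rough illustration of the WeakReference alternative described above (not code from the attached patches), a task could hold its engine weakly and report a non-zero result once the engine is gone. `EngineStandIn`, `WeakCertValidationTask`, the `ENGINE_GONE` value, and the `simulateEngineCollected()` hook are all hypothetical names invented for this sketch; the real JSS classes and NSS error codes differ.

```java
import java.lang.ref.WeakReference;

/** Hypothetical stand-in for JSSEngineReferenceImpl; not the real JSS class. */
final class EngineStandIn {
    final String name;

    EngineStandIn(String name) {
        this.name = name;
    }
}

/** Hypothetical validation task that holds its engine only weakly. */
final class WeakCertValidationTask {
    static final int OK = 0;
    static final int ENGINE_GONE = -1; // placeholder for a real non-zero NSS error code

    private final WeakReference<EngineStandIn> engineRef;

    WeakCertValidationTask(EngineStandIn engine) {
        this.engineRef = new WeakReference<>(engine);
    }

    /** Returns OK while the engine is still reachable, a non-zero value otherwise. */
    int check() {
        EngineStandIn engine = engineRef.get(); // strong reference only for this call
        if (engine == null) {
            return ENGINE_GONE;
        }
        // Real code would run the TrustManager checks against the engine here.
        return OK;
    }

    /** Demo hook: simulates the collector clearing the weak reference. */
    void simulateEngineCollected() {
        engineRef.clear();
    }
}
```

The trade-off is the one noted above: the task no longer pins the engine, but every use of the engine has to tolerate `get()` returning null.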

Red Hat Security asked if this could be fixed in the custom
TrustManager.

Unfortunately, this can't be fixed in the custom TrustManager.

The issue doesn't happen when JSSNativeTrustManager is used because the
JSSEngineReferenceImpl never instantiates a CertValidationTask in that
code path.

The patch provided changes CertValidationTask to a static inner class so
it no longer holds a reference to JSSEngineReferenceImpl. The
CertValidationTask is held by SSLFDProxy and SSLFDProxy is held by
JSSEngineReferenceImpl completing a cycle.

Without the patch: JSSEngineReferenceImpl -> SSLFDProxy ->
CertValidationTask -> JSSEngineReferenceImpl...
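
To illustrate why making the class static removes the back-reference, here is a minimal sketch that is independent of JSS. `OuterEngine`, `InnerTask`, and `StaticTask` are hypothetical names; the point is only that a non-static inner class instance carries a hidden field pointing at its enclosing instance, while a static nested class does not.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for an engine class; not the real JSSEngineReferenceImpl. */
final class OuterEngine {
    private final String label;

    OuterEngine(String label) {
        this.label = label;
    }

    /** Non-static inner class: each instance keeps a hidden reference to its OuterEngine. */
    final class InnerTask implements Runnable {
        @Override
        public void run() {
            // Reading outer state goes through the hidden outer reference.
            System.out.println("inner task sees " + label);
        }
    }

    /** Static nested class: it only references what is passed in explicitly. */
    static final class StaticTask implements Runnable {
        private final String label;

        StaticTask(String label) {
            this.label = label;
        }

        @Override
        public void run() {
            System.out.println("static task sees " + label);
        }
    }

    /** True if the class declares any field whose type is OuterEngine. */
    static boolean holdsOuterReference(Class<?> cls) {
        for (Field f : cls.getDeclaredFields()) {
            if (f.getType() == OuterEngine.class) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(holdsOuterReference(InnerTask.class));  // true
        System.out.println(holdsOuterReference(StaticTask.class)); // false
    }
}
```

In this sketch the inner task deliberately reads a field of the enclosing instance, so the compiler has to keep the hidden reference.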

If the TrustManager rejects the certificate, exception handling causes
cleanup() to be called which frees the SSLFDProxy object. When the
TrustManager accepts the certificate, cleanup() doesn't get called until
the JSSEngineReferenceImpl is garbage collected which can't happen
because of the object cycle when a custom TrustManager is used.

In an earlier patch I tried to release the reference to the
CertValidationTask in SSLFDProxy after
SSLFDProxy.invokeCertAuthHandler() had run. This still ended up
leaking memory, but it didn't happen on every request.

Below is the code path JSSEngineReferenceImpl takes when
JSSNativeTrustManager is used, which bypasses the problem.

in https://github.com/dogtagpki/jss/blob/master/base/src/main/java/org/mozilla/jss/ssl/javax/JSSEngineReferenceImpl.java#L549-L569

if (trust_managers.length == 1 && trust_managers[0] instanceof JSSNativeTrustManager) {
    // This is a dummy TrustManager. It signifies that we should call
    // SSL.ConfigJSSDefaultCertAuthCallback(...) on this SSL
    // PRFileDesc pointer, letting us utilize the same certificate
    // validation logic that SSLSocket had.
    debug("JSSEngine: applyTrustManagers() - adding Native TrustManager");
    if (SSL.ConfigJSSDefaultCertAuthCallback(ssl_fd) == SSL.SECFailure) {
        throw new SSLException("Unable to configure JSSNativeTrustManager on this JSSengine: " + errorText(PR.GetError()));
    }
    return;
}

if (as_server) {
    // We need to manually invoke the async cert auth handler. However,
    // SSLFDProxy makes this easy for us: our CertAuthHandler derives
    // from Runnable, so we can reuse it here as well. We can create
    // it ahead of time though. In this case, checkNeedCertValidation()
    // is never called.
    ssl_fd.certAuthHandler = new CertValidationTask(ssl_fd);

    if (SSL.ConfigSyncTrustManagerCertAuthCallback(ssl_fd) == SSL.SECFailure) {

Contributor

@edewata edewata left a comment

I have some minor comments, but I don't see anything objectionable. I hope others can take a look at this PR too since I'm not too familiar with the code.

int nss_code = Cert.MatchExceptionToNSSError(excpt);

if (seen_exception) {
if (logger.isDebugEnabled() == false) {
Contributor

I think this can be written as:

if (!logger.isDebugEnabled()) {

Contributor Author

Yep. It's a habit I formed a long time ago. I'll change it.

Note. Sonar doesn't like that this method returns nss_code in 2 places. It's saying it's a "Blocker".
https://sonarcloud.io/project/issues?pullRequest=928&issues=AYWX5DBvciWndDK-ESS5&open=AYWX5DBvciWndDK-ESS5&id=dogtagpki_jss

I could change it to:

if (logger.isDebugEnabled()) {
  String msg = ...
  logger.debug(msg, excpt);
}
return nss_code;

I wouldn't ordinarily write code like that though, and the code wasn't originally ordered like that either.

Contributor Author

Sonar isn't marking it as a blocker any more...

Comment on lines -1831 to 1832
-    private int assignException(Exception excpt, PK11Cert[] chain) {
+    private int logException(Exception excpt, PK11Cert[] chain) {
         int nss_code = Cert.MatchExceptionToNSSError(excpt);
Contributor

This method does assign an NSS code to the exception (instead of just logging the exception), so probably the original method name is more appropriate.

Contributor Author

"Assign" isn't really the verb I'd use as nothing is being set on the Exception or any other object.
Exceptions are being mapped to codes.

Break the cycle so the garbage collector doesn't need to figure out it can
release these together and then let the JSSEngineReferenceImpl.finalize() run.
This breaks the cycle of SSLFD -> CertValidationTask -> JSSEngineReferenceImpl.
@sonarcloud

sonarcloud bot commented Jan 10, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 8 (rating A)

Coverage: 0.0%
Duplication: 0.0%

@fmarco76
Member

With the patch: JSSEngineReferenceImpl -> SSLFDProxy ->
CertValidationTask -> JSSEngineReferenceImpl...

I am reviewing this patch, but the above sentence is not clear to me. If I understand correctly, with the patch there are no loops, so
the last reference is not present, is it?

@cipherboy
Member

I'm not sure I have much meaningful to contribute to this discussion in a reasonable timeframe, but I seem to recall there was some reason why we wanted to facilitate ExtendedTMs in the first place, that talked to the SSLEngine.

What that was, I do not recall any more. Might be worth double checking it isn't used in Dogtag, and if so, feel free to remove it.

@uplogix-mmcclain
Contributor Author

With the patch: JSSEngineReferenceImpl -> SSLFDProxy ->
CertValidationTask -> JSSEngineReferenceImpl...

I am reviewing this patch, but the above sentence is not clear to me. If I understand correctly, with the patch there are no loops, so the last reference is not present, is it?

Yes, I meant to say "Without the patch". I updated the PR comment.

@fmarco76
Member

What that was, I do not recall any more. Might be worth double checking it isn't used in Dogtag, and if so, feel free to remove it.

I have looked at Dogtag and it seems X509ExtendedTrustManager is not used. However, if we remove it, some configurations might not work because X509ExtendedTrustManager is required (see the X509ExtendedTrustManager documentation), such as TLS 1.2 with a custom trust manager.
@edewata do you know if such configurations are used in real cases?

@edewata
Contributor

edewata commented Jan 16, 2023

@fmarco76 wrote:

I have looked at Dogtag and it seems X509ExtendedTrustManager is not used. However, if we remove it, some configurations might not work because X509ExtendedTrustManager is required (see the X509ExtendedTrustManager documentation), such as TLS 1.2 with a custom trust manager.
@edewata do you know if such configurations are used in real cases?

I'm not really familiar with this. According to the linked doc, it's highly recommended to use X509ExtendedTrustManager for TLS 1.2 and later, but I don't know the consequence if we don't use it. Do we have any CI tests for TLS 1.2 and later? Maybe as long as the tests pass it should be OK.

@jmagne @ladycfu What do you think?

@fmarco76
Member

fmarco76 commented Mar 6, 2023

@uplogix-mmcclain I would like to test this patch, but the custom trust manager does not get loaded correctly. Have you modified tomcatjss, or am I missing some configuration? Could you provide some hints on how to replicate your setup? Thanks

@uplogix-mmcclain
Contributor Author

In server.xml for the Connector, I have:
sslImplementationName="MySslImplementation"

In that I have an implementation of this:

public SSLUtil getSSLUtil(SSLHostConfigCertificate cert) {
    return new MyJssSslUtil(cert);
}

MyJssSslUtil overrides JSSUtil with this:

public TrustManager[] getTrustManagers() throws Exception {
    return new TrustManager[] { trustManagerClass.newInstance() };
}

I can probably get you a more complete example later this week.

@fmarco76
Member

fmarco76 commented Mar 7, 2023

@uplogix-mmcclain Thanks! This is the information I was looking for. From your first comment I assumed you had not extended tomcatjss, so I was trying, without success, to set the trust manager through configuration (which is possible for JSSE).

@fmarco76
Member

fmarco76 commented Mar 9, 2023

@uplogix-mmcclain before analysing the proposed changes in more detail, I tried to replicate the problem highlighted in this PR.

I have tested using a Fedora 37 container with Tomcat and JSS as follows:

[root@tomcat jss]# rpm -qa|grep jss
jss-debugsource-5.4.0-0.1.alpha1.fc37.x86_64
dogtag-jss-debuginfo-5.4.0-0.1.alpha1.fc37.x86_64
dogtag-jss-javadoc-5.4.0-0.1.alpha1.fc37.x86_64
dogtag-jss-5.4.0-0.1.alpha1.fc37.x86_64
dogtag-tomcatjss-8.2.0-1.fc37.2.noarch

[root@tomcat jss]# rpm -qa|grep tomcat
tomcat-servlet-4.0-api-9.0.71-1.fc37.noarch
tomcat-el-3.0-api-9.0.71-1.fc37.noarch
tomcat-jsp-2.3-api-9.0.71-1.fc37.noarch
tomcat-native-1.2.35-1.fc37.x86_64
tomcat-lib-9.0.71-1.fc37.noarch
tomcat-9.0.71-1.fc37.noarch
dogtag-tomcatjss-8.2.0-1.fc37.2.noarch
tomcat-webapps-9.0.71-1.fc37.noarch

I have created a dummy X509TrustManager (it accepts the certificate without actually verifying it) and extended tomcatjss org.dogtagpki.tomcat.JSSImplementation and org.dogtagpki.tomcat.JSSUtil to load the new trust manager as explained above.
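
For readers who want to reproduce this, a dummy trust manager of the kind described could look roughly like the sketch below. It uses only the standard `javax.net.ssl.X509TrustManager` interface; the class name `AcceptAllTrustManager` is hypothetical and this is not the code actually used in the test. It accepts every chain, so it is suitable only for leak reproduction, never for production.

```java
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

/**
 * Test-only trust manager that accepts every certificate chain.
 * The class name is hypothetical; never use anything like this in production.
 */
final class AcceptAllTrustManager implements X509TrustManager {
    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
        // Intentionally empty: returning without an exception means "trusted".
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
        // Intentionally empty: returning without an exception means "trusted".
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        // No specific issuers are advertised to clients.
        return new X509Certificate[0];
    }
}
```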

I initially started in debug mode and followed the execution steps to verify that the new trust manager is used. Then I checked whether the cleanup() method is invoked after the communication when a valid certificate is used, since this is the condition that generates the problem. It is invoked as expected and the object is removed correctly.

Then, thinking the problem could be related to a race condition, I ran >130000 curl requests (the requests were to the root page, but that is not relevant for this problem) using GNU parallel. I used 64 threads and generated a heap dump before and after the requests. Analysing the heap dumps with Eclipse Memory Analyzer, I did not see much difference between before and after the requests. The only JSSEngineReferenceImpl objects in the heap were referenced by the connections in the waiting threads.

Is there something I am missing to replicate the problem?

An additional consideration: I have thought about the cycle problem, which is solved by making the nested class of JSSEngineReferenceImpl static so there is no loop. However, from the GC's point of view loops are not a problem, because the GC is based on tracing rather than reference counting and should remove all the objects in a cycle once they become unreachable. Therefore, I was wondering if the problem could be in the reference to CertValidationTask stored in ssl_fd, which is then provided to the native code. In the heap analysis I noticed that the SSLFDProxy instances marked for removal had JNI references.
Have you tried the current version of JSS to see if you still get OoM errors?

Your solution could solve the problem in any case because it removes a reference, but it might not be necessary.
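
The point about tracing collectors can be checked with a small, JSS-independent sketch: two plain Java objects that reference each other are reclaimed once nothing else reaches them. `CycleGcSketch` and `Node` are hypothetical names, and `System.gc()` is only a request, so the result is what a typical HotSpot JVM is expected to report rather than a guarantee. The sketch also says nothing about objects pinned from native code through JNI references, which is the alternative explanation raised above.

```java
import java.lang.ref.WeakReference;

final class CycleGcSketch {
    /** Minimal node type used to build a two-object reference cycle. */
    static final class Node {
        Node other;
    }

    /** Builds a -> b -> a and returns only a weak reference, so no strong reference escapes. */
    static WeakReference<Node> buildUnreachableCycle() {
        Node a = new Node();
        Node b = new Node();
        a.other = b;
        b.other = a;
        return new WeakReference<>(a);
    }

    /** Requests GC a bounded number of times and reports whether the cycle was reclaimed. */
    static boolean cycleWasCollected() {
        WeakReference<Node> ref = buildUnreachableCycle();
        for (int i = 0; i < 50 && ref.get() != null; i++) {
            System.gc(); // only a hint; the JVM may ignore it
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        System.out.println("cycle collected: " + cycleWasCollected());
    }
}
```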

@edewata I confirm this scenario is never present in dogtag, the JSSNativeTrustManager is always used.

@uplogix-mmcclain
Contributor Author

My application is currently limited to Java 8, so it can't run JSS 5.0+. I just set up a Fedora 37 environment with a minimal configuration. I should be able to investigate Monday.

@uplogix-mmcclain
Contributor Author

I know with the other issue that didn't require a client cert I used ab (Apache benchmark) to test with. I don't remember if I tested this with that.

@fmarco76
Member

I know with the other issue that didn't require a client cert I used ab (Apache benchmark) to test with. I don't remember if I tested this with that.

I did not know ab. In the past I have used Apache JMeter, but for simple tests this looks easier.

@uplogix-mmcclain
Contributor Author

If I use ab, I see every connection leak.

ab -E client-test.pem -n 10000 https://example.com:8443/test.jsp

I have 10k JSSEngineReferenceImpl$CertValidationTask, 20k PK11InternalTokenCert, and 40k of class ObjectIdentifier.

If I use curl, I see very few instances left after the run; these probably just haven't been garbage collected yet.

33 JSSEngineReferenceImpl$CertValidationTask
130 PK11InternalTokenCert
421 class ObjectIdentifier

seq 1 10000 | parallel -j 24 ./test-jsp.sh

test-jsp.sh:

curl -k \
     --cert-type PEM \
     --cert client-test.pem \
     -s \
     -H "Accept: text/xml" \
     -H "Accept-Encoding: $ENCODING" \
     -H "Content-Type: text/xml; charset=utf-8" \
     -H "User-Agent: $0" \
     https://example.com:8443/test.jsp

@cipherboy
Member

@uplogix-mmcclain Just curious, does this leak go away on JDK9? I seem to recall there was some difference in our memory handling in JDK9 vs JDK8 that made freeing native references easier...

@uplogix-mmcclain
Contributor Author

In the Fedora 37 environment I just created I was testing with JDK17 (java-17-openjdk-headless-17.0.6.0.10-1.fc37.x86_64).
All previous testing I did was with JDK 8 on AlmaLinux 8.

@fmarco76
Member

@uplogix-mmcclain at least with curl we get the same results. I have also debugged a single connection to see what happens, and it is OK.

It is not clear to me what happens when ab is used. I can't imagine any reason for this divergent behaviour, but I'll do some more investigation. Do you or @cipherboy have any idea what it could be?

@cipherboy
Member

cipherboy commented Mar 13, 2023

My best guess, if you don't see this with your own custom trust manager, is that there's something different about this custom trust manager that causes it to hold and persist references afterwards. The TM itself should be nicely scoped to just the session validation; I don't really see, in general, what would cause it to persist the reference afterwards, since after validation the task should be released as it isn't necessary any more.

That said, using static, or perhaps better, weak_ref, would be fine by me.

But it's been forever since I've looked at this code :-)

@uplogix-mmcclain
Contributor Author

This seems to trigger it too, so I don't think it's a race condition.

for i in $(seq 1000); do ab -E client-test.pem -n 1 https://example.com:8443/test.jsp; sleep 0.1; done

I don't know why cleanup() isn't getting called. Since I couldn't trust it to be called, I looked into what was pinning the memory.

@fmarco76
Member

@uplogix-mmcclain yes, it is not a race condition. After several tests I noticed that closeInbound() is invoked for curl requests but not for ab. It turns out that if curl includes the header -H 'Connection: close', it behaves the same as ab and the cleanup stops working. It is not related to the trust manager (tested with custom and default trust managers). The problem is with the close_notify sent by the client, which is ignored when Connection: close is used.

In your tests, using the default JSS trust manager, do all the JSSEngine objects get properly removed?

From the Java side of JSS I cannot find the reason, so I need to debug the C side.

@cipherboy
Member

cipherboy commented Mar 15, 2023

@uplogix-mmcclain @fmarco76 I think, from your discovery, Marco, Connection: close is the key: maybe @csutherl can confirm that Tomcat will persist references to connections which aren't marked Connection: close -- until some keep alive is reached? If so, it seems more like a Tomcat bug than anything else: memory expensive SSLEngine implementations probably can't be kept around as long (without more traffic) as "cheaper" Java-native engines.

(i.e., it could be that Tomcat doesn't even finalize/close_notify/... the SSLEngine on the off chance the tunnel is reused for another request without renegotiating TLS -- with/without session resumption -- which because the resumption never happens and it isn't closed in time, memory balloons).

@uplogix-mmcclain
Contributor Author

Yes, all JSSEngine objects are removed if JSSNativeTrustManager is used. I see 2 JSSEngineReferenceImpls in a heap dump after running ab with 10k connections. The largest number of instances of any Mozilla class is SSLFDProxy at 250.

The issue doesn't happen when JSSNativeTrustManager is used because the JSSEngineReferenceImpl never instantiates a CertValidationTask in that code path.

@fmarco76
Member

@uplogix-mmcclain I am back on this PR and doing some tests. I have created this draft PR to verify whether the problem is really with close_notify. Could you perform a test to see if it works? You just have to update JSS with the code in #970.

In my tests the JSSEngineReferenceImpl objects are removed by the GC.

@cipherboy this is not a final solution because several tests failed. If Connection: close is present, Tomcat invokes only the closeOutbound() method, so there is no cleanup(). It is not yet clear to me where/how closeInbound() should be invoked.
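
The behaviour described in this thread can be pictured with a toy model, assuming (as the discussion suggests, but without quoting the actual JSS source) that cleanup runs only once both directions of the connection have been closed. `CloseTrackingEngine` is a hypothetical class written for illustration; it is not JSSEngineReferenceImpl.

```java
/** Toy model of an engine whose resources are freed only after both directions close. */
final class CloseTrackingEngine {
    private boolean inboundClosed;
    private boolean outboundClosed;
    private boolean cleanedUp;

    void closeInbound() {
        inboundClosed = true;
        tryCleanup();
    }

    void closeOutbound() {
        outboundClosed = true;
        tryCleanup();
    }

    /** Assumption for this sketch: cleanup is gated on both directions being closed. */
    private void tryCleanup() {
        if (inboundClosed && outboundClosed && !cleanedUp) {
            cleanedUp = true; // real code would release buffers and native handles here
        }
    }

    boolean isCleanedUp() {
        return cleanedUp;
    }
}
```

Under that assumption, a caller that only ever invokes `closeOutbound()` leaves the engine in the not-cleaned-up state, which matches the Connection: close observation above.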

@fmarco76
Member

Since the leaks fixed by this PR were caused by the problem solved by PR #970, I am closing this.

@uplogix-mmcclain if not all the leaks are solved and this PR is still relevant we can reopen and evaluate this with the new changes. Thanks!

@fmarco76 fmarco76 closed this May 30, 2023