Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor stream parsing to ensure safety. #436

Merged
merged 2 commits into from May 19, 2016

Conversation

thomasvandoren
Copy link
Contributor

@thomasvandoren thomasvandoren commented May 4, 2016

Use ByteBuffers and ByteStreams.readFully() to ensure buffers are correctly
filled. Previously, it was possible to read a partial header or partial message, because
IOUtils.read only "guarantees that it will read as many bytes as possible before giving
up". This means it could read all 8 bytes requested for the header or all N bytes
requested for message, but it could also read fewer bytes in which case it would end up
with a partial header or message.

readStreamFrame() is inspired by and based on:

https://gist.github.com/dflemstr/9541193d141729b6c84d/e02c64c67826a2a4291eb1e950bef1e71999b8d3#file-dockerlogstream-java-L90-L115

Motivation

Using this configuration in pom.xml:

            <plugin>
                <groupId>org.jolokia</groupId>
                <artifactId>docker-maven-plugin</artifactId>
                <version>0.13.2</version>
                <configuration>
                    <verbose>true</verbose>
                    <images>
                        <image>
                            <name>${docker.imageName}:${project.version}-${buildNumber}</name>
                            <alias>db-tag</alias>
                            <run>
                                <log>
                                    <prefix>DB-TAG</prefix>
                                </log>
                                <links>
                                    <link>postgres:postgres</link>
                                </links>
                                <env>
                                    <RDS_USERNAME>${docker-maven-plugin.POSTGRES_USER}
                                    </RDS_USERNAME>
                                    <RDS_PASSWORD>${docker-maven-plugin.POSTGRES_PASSWORD}
                                    </RDS_PASSWORD>
                                    <RDS_DB_NAME>${docker-maven-plugin.POSTGRES_DB}
                                    </RDS_DB_NAME>
                                </env>
                                <!-- Should match contents of .ebextentions/10_liquibase.config -->
                                <cmd>
                                    db tag
                                    --migrations /liquibase/db.changelog.yml
                                    /docker-it.yml
                                    ${project.artifactId}-${project.version}-${buildNumber}
                                </cmd>
                                <wait>
                                    <log>liquibase: Successfully released change log lock</log>
                                    <time>300000</time>
                                </wait>
                            </run>
                        </image>
                    </images>
                </configuration>
                <executions>
                    <execution>
                        <id>start</id>
                        <phase>pre-integration-test</phase>
                        <goals>
                            <goal>build</goal>
                            <goal>start</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>stop</id>
                        <phase>post-integration-test</phase>
                        <goals>
                            <goal>stop</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

About 50% of the time, the expected output from docker container would show up,
but the plugin would not recognize it and eventually timeout. An example log
snippet is:

/liquibase/db.changelog.initial.yml::15::thomasvandoren: ChangeSet /liquibase/db.changelog.initial.yml::15::thomasvandoren ran successfully in 3ms
DB-TAG> INFO  [2015-08-04 23:14:17,573] liquibase: Successfully released change log lock
Exception in thread "Thread-17" java.lang.IllegalArgumentException: Invalid format: "04T23:14:17.483Z" is malformed at "T23:14:17.483Z"
          at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:873)
          at org.jolokia.docker.maven.util.Timestamp.<init>(Timestamp.java:53)
          at org.jolokia.docker.maven.access.log.LogRequestor.callLogCallback(LogRequestor.java:142)
          at org.jolokia.docker.maven.access.log.LogRequestor.parseResponse(LogRequestor.java:121)
          at org.jolokia.docker.maven.access.log.LogRequestor.run(LogRequestor.java:97)
[ERROR] DOCKER> [thomasvandoren/test-repo:0.2-SNAPSHOT-1402d5319ddf30d2] "db-tag": Timeout after 300000 ms while waiting on on log out 'liquibase: Successfully released change log lock'

Other invalid format error messages that I observed include:

Exception in thread "Thread-13" java.lang.IllegalArgumentException: Invalid format: "?2015-08-05T20:33:12.280Z"
Exception in thread "Thread-16" java.lang.IllegalArgumentException: Invalid format: "02T02:04:28.607Z" is malformed at "T02:04:28.607Z"
Exception in thread "Thread-16" java.lang.IllegalArgumentException: Invalid format: ".000��"

There were also cases where no stack trace was printed, the expected line did
show up, but the plugin still timed out "waiting" for it. I'm not sure what failed in
those cases.

This all seems like potential corrupt byte buffer behavior to me, which led to
this change.

Signed-off-by: Thomas Van Doren thomas.vandoren@gmail.com

Related to: #259

Use ByteBuffers and ByteStreams.readFully() to ensure buffers are correctly
filled. Previously, it was possible to read a partial header.

readStreamFrame() is inspired by and based on:

https://gist.github.com/dflemstr/9541193d141729b6c84d/e02c64c67826a2a4291eb1e950bef1e71999b8d3#file-dockerlogstream-java-L90-L115

\#### Motivation

Using this configuration in pom.xml:

```maven-pom
            <plugin>
                <groupId>org.jolokia</groupId>
                <artifactId>docker-maven-plugin</artifactId>
                <version>0.13.2</version>
                <configuration>
                    <verbose>true</verbose>
                    <images>
                        <image>
                            <name>${docker.imageName}:${project.version}-${buildNumber}</name>
                            <alias>db-tag</alias>
                            <run>
                                <log>
                                    <prefix>DB-TAG</prefix>
                                </log>
                                <links>
                                    <link>postgres:postgres</link>
                                </links>
                                <env>
                                    <RDS_USERNAME>${docker-maven-plugin.POSTGRES_USER}
                                    </RDS_USERNAME>
                                    <RDS_PASSWORD>${docker-maven-plugin.POSTGRES_PASSWORD}
                                    </RDS_PASSWORD>
                                    <RDS_DB_NAME>${docker-maven-plugin.POSTGRES_DB}
                                    </RDS_DB_NAME>
                                </env>
                                <!-- Should match contents of .ebextentions/10_liquibase.config -->
                                <cmd>
                                    db tag
                                    --migrations /liquibase/db.changelog.yml
                                    /docker-it.yml
                                    ${project.artifactId}-${project.version}-${buildNumber}
                                </cmd>
                                <wait>
                                    <log>liquibase: Successfully released change log lock</log>
                                    <time>300000</time>
                                </wait>
                            </run>
                        </image>
                    </images>
                </configuration>
                <executions>
                    <execution>
                        <id>start</id>
                        <phase>pre-integration-test</phase>
                        <goals>
                            <goal>build</goal>
                            <goal>start</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>stop</id>
                        <phase>post-integration-test</phase>
                        <goals>
                            <goal>stop</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
```

About 50% of the time, the expected output from docker container would show up,
but the plugin would not recognize it and eventually timeout. An example log
snippet is:

```
/liquibase/db.changelog.initial.yml::15::thomasvandoren: ChangeSet /liquibase/db.changelog.initial.yml::15::thomasvandoren ran successfully in 3ms
DB-TAG> INFO  [2015-08-04 23:14:17,573] liquibase: Successfully released change log lock
Exception in thread "Thread-17" java.lang.IllegalArgumentException: Invalid format: "04T23:14:17.483Z" is malformed at "T23:14:17.483Z"
          at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:873)
          at org.jolokia.docker.maven.util.Timestamp.<init>(Timestamp.java:53)
          at org.jolokia.docker.maven.access.log.LogRequestor.callLogCallback(LogRequestor.java:142)
          at org.jolokia.docker.maven.access.log.LogRequestor.parseResponse(LogRequestor.java:121)
          at org.jolokia.docker.maven.access.log.LogRequestor.run(LogRequestor.java:97)
[ERROR] DOCKER> [thomasvandoren/test-repo:0.2-SNAPSHOT-1402d5319ddf30d2] "db-tag": Timeout after 300000 ms while waiting on on log out 'liquibase: Successfully released change log lock'
```

Other invalid format error messages that I observed include:

```
Exception in thread "Thread-13" java.lang.IllegalArgumentException: Invalid format: "?2015-08-05T20:33:12.280Z"
Exception in thread "Thread-16" java.lang.IllegalArgumentException: Invalid format: "02T02:04:28.607Z" is malformed at "T02:04:28.607Z"
Exception in thread "Thread-16" java.lang.IllegalArgumentException: Invalid format: ".000��"
```

There were also cases where no stack trace was printed, the expected line did
show up, but the plugin still timed out "waiting" for it. I'm not sure what failed in
those cases.

This all seems like potential corrupt byte buffer behavior to me, which led to
this change.

Signed-off-by: Thomas Van Doren <thomas.vandoren@gmail.com>
@rhuss
Copy link
Collaborator

rhuss commented May 4, 2016

Thanks a lot ;-) LGTM. With which version of d-m-p did you have the issue ?

I ask because I ran also in concurrency issues where are connection was potentially reused in two different threads. That was fixed in 0.15.1

But you are right, one should at least check whether the number of bytes read are indeed the same as the declared length.

rhuss added a commit that referenced this pull request May 4, 2016
Adapt changelog, cleaned up imports.
@thomasvandoren
Copy link
Contributor Author

We have observed the invalid format issue with d-m-p version 0.13.2 through 0.15.1. The most recent, with 0.15.1, was:

[INFO] DOCKER> [rafastealth/kb-service:0.1-SNAPSHOT-7689be20c71b2621] "kb-db-migrate": Start container 96eab642370d
KB-POSTGRES> ERROR:  relation "public.databasechangeloglock" does not exist at character 22
KB-POSTGRES> STATEMENT:  select count(*) from public.databasechangeloglock
KB-DB-MIGRATE> INFO  [2016-05-04 00:46:37,171] liquibase: /liquibase/db.changelog.yml: 3-slot::brian: Unique constraint added to slot(name, class_id)
Invalid log format for '������2016-05-04T00:46:37.259530918Z INFO  [2016-05-04 00:46:37,259] liquibase: /liquibase/db.changelog.yml: 3-probability::brian: Custom SQL executed
��2016-05-04T00:46:3' (expected: "<timestamp> <txt>") [0002 0000]
[INFO] DOCKER> [rafastealth/kb-service:0.1-SNAPSHOT-7689be20c71b2621] "kb-db-migrate": Waited on log out 'liquibase: Successfully released change log lock' 8007 ms

Let me know if you need any more info. Thanks!

@thomasvandoren
Copy link
Contributor Author

I started using this in our regular CI build today, and 2/5 builds failed with a different error (presumably where we would have gotten the invalid format error previously):

[ERROR] DOCKER> [rafastealth/kb-service:0.1-SNAPSHOT-3ebdb83c963d225c] "kb-db-migrate": Timeout after 30009 ms while waiting on log out 'liquibase: Successfully released change log lock'
[ERROR] DOCKER> Error occurred during container startup, shutting down...

I'll see if I can get some better information about what is going on.

@rhuss
Copy link
Collaborator

rhuss commented May 4, 2016

Do you run the CI with your PR applied or vanilla 0.15.1 ?

@thomasvandoren
Copy link
Contributor Author

With my PR applied. I built a 0.15.2-SNAPSHOT from this branch.

@rhuss
Copy link
Collaborator

rhuss commented May 4, 2016

Shit. 'hope we can sort this out soon. When you have more information, let me know. Also, I will try to reproduce it here with your pom posted.

@thomasvandoren
Copy link
Contributor Author

Here is the latest pom:

            <plugin>
                <groupId>io.fabric8</groupId>
                <artifactId>docker-maven-plugin</artifactId>
                <version>0.15.2-SNAPSHOT</version>
                <configuration>
                    <skip>${skipDocker}</skip>
                    <verbose>true</verbose>
                    <images>
                        <image>
                            <name>
                                postgres:9.5
                            </name>
                            <alias>postgres</alias>
                            <run>
                                <log>
                                    <prefix>KB-POSTGRES</prefix>
                                </log>
                                <env>
                                    <POSTGRES_USER>${docker-maven-plugin.POSTGRES_USER}</POSTGRES_USER>
                                    <POSTGRES_PASSWORD>${docker-maven-plugin.POSTGRES_PASSWORD}</POSTGRES_PASSWORD>
                                    <POSTGRES_DB>${docker-maven-plugin.POSTGRES_DB}</POSTGRES_DB>
                                </env>
                                <ports>
                                    <port>postgres.port:5432</port>
                                </ports>
                                <wait>
                                    <log>PostgreSQL init process complete; ready for start up</log>
                                    <time>300000</time>
                                </wait>
                            </run>
                        </image>
                        <image>
                            <name>${docker.imageName}:${project.version}-${buildNumber}</name>
                            <alias>kb-image</alias>
                            <build>
                                <from>quay.io/aptible/java:oracle-java8</from>
                                <maintainer>rafa</maintainer>
                                <ports>
                                    <port>8080</port>
                                    <port>8081</port>
                                </ports>
                                <entryPoint>
                                    <exec>
                                        <args>java</args>
                                        <args>-jar</args>
                                        <args>/${project.build.finalName}.jar</args>
                                    </exec>
                                </entryPoint>
                                <cmd>
                                    <exec>
                                        <args>server</args>
                                        <args>/aptible.yml</args>
                                    </exec>
                                </cmd>
                                <assembly>
                                    <basedir>/</basedir>
                                    <inline>
                                        <fileSets>
                                            <fileSet>
                                                <outputDirectory>/</outputDirectory>
                                                <filtered>false</filtered>
                                                <directory>${project.build.outputDirectory}
                                                </directory>
                                                <includes>
                                                    <include>aptible.yml</include>
                                                    <include>docker-it.yml</include>
                                                    <include>liquibase/**</include>
                                                </includes>
                                            </fileSet>
                                            <fileSet>
                                                <outputDirectory>/</outputDirectory>
                                                <filtered>false</filtered>
                                                <directory>${project.build.directory}</directory>
                                                <includes>
                                                    <include>${project.build.finalName}.jar
                                                    </include>
                                                </includes>
                                            </fileSet>
                                        </fileSets>
                                    </inline>
                                </assembly>
                            </build>
                        </image>
                        <image>
                            <name>${docker.imageName}:${project.version}-${buildNumber}</name>
                            <alias>kb-db-migrate</alias>
                            <run>
                                <log>
                                    <prefix>KB-DB-MIGRATE</prefix>
                                </log>
                                <links>
                                    <link>postgres:postgres</link>
                                </links>
                                <env>
                                    <DB_USERNAME>${docker-maven-plugin.POSTGRES_USER}
                                    </DB_USERNAME>
                                    <DB_PASSWORD>${docker-maven-plugin.POSTGRES_PASSWORD}
                                    </DB_PASSWORD>
                                    <DB_NAME>${docker-maven-plugin.POSTGRES_DB}
                                    </DB_NAME>
                                </env>
                                <cmd>
                                    db migrate
                                    --migrations /liquibase/db.changelog.yml
                                    /docker-it.yml
                                </cmd>
                                <wait>
                                    <log>liquibase: Successfully released change log lock</log>
                                    <time>30000</time>
                                </wait>
                            </run>
                        </image>
                        <image>
                            <name>${docker.imageName}:${project.version}-${buildNumber}</name>
                            <alias>kb-db-tag</alias>
                            <run>
                                <log>
                                    <prefix>KB-DB-TAG</prefix>
                                </log>
                                <links>
                                    <link>postgres:postgres</link>
                                </links>
                                <env>
                                    <DB_USERNAME>${docker-maven-plugin.POSTGRES_USER}
                                    </DB_USERNAME>
                                    <DB_PASSWORD>${docker-maven-plugin.POSTGRES_PASSWORD}
                                    </DB_PASSWORD>
                                    <DB_NAME>${docker-maven-plugin.POSTGRES_DB}
                                    </DB_NAME>
                                </env>
                                <cmd>
                                    db tag
                                    --migrations /liquibase/db.changelog.yml
                                    /docker-it.yml
                                    ${project.artifactId}-${project.version}-${buildNumber}
                                </cmd>
                                <wait>
                                    <log>liquibase: Successfully released change log lock</log>
                                    <time>30000</time>
                                </wait>
                            </run>
                        </image>
                        <image>
                            <name>${docker.imageName}:${project.version}-${buildNumber}</name>
                            <alias>kb-server</alias>
                            <run>
                                <ports>
                                    <port>kb.port:8080</port>
                                </ports>
                                <log>
                                    <prefix>KB-DOCKER</prefix>
                                </log>
                                <links>
                                    <link>postgres:postgres</link>
                                </links>
                                <env>
                                    <DB_USERNAME>${docker-maven-plugin.POSTGRES_USER}
                                    </DB_USERNAME>
                                    <DB_PASSWORD>${docker-maven-plugin.POSTGRES_PASSWORD}
                                    </DB_PASSWORD>
                                    <DB_NAME>${docker-maven-plugin.POSTGRES_DB}
                                    </DB_NAME>
                                    <AWS_ACCESS_KEY_ID>${env.AWS_ACCESS_KEY_ID}</AWS_ACCESS_KEY_ID>
                                    <AWS_SECRET_ACCESS_KEY>${env.AWS_SECRET_ACCESS_KEY}</AWS_SECRET_ACCESS_KEY>
                                </env>
                                <cmd>
                                    server /docker-it.yml
                                </cmd>
                                <wait>
                                    <url>http://${docker.host.address}:${kb.port}/swagger.json</url>
                                    <time>300000</time>
                                </wait>
                            </run>
                        </image>
                    </images>
                </configuration>
                <executions>
                    <execution>
                        <id>start</id>
                        <phase>pre-integration-test</phase>
                        <goals>
                            <goal>build</goal>
                            <goal>start</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>stop</id>
                        <phase>post-integration-test</phase>
                        <goals>
                            <goal>stop</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

When zero bytes are read when trying to read the 8 byte header, assume the
stream is done and stop iterating. If 1-7 bytes are read when trying to read
the 8 byte header, throw an IOException indicating the error. If anything other
than the full N bytes are read for the message, i.e. if 0 to N-1 bytes are
read, also throw an IOException.

This ensures that unexpected log stream behavior is propaged as an error to the
user.

Signed-off-by: Thomas Van Doren <thomas.vandoren@gmail.com>
@thomasvandoren
Copy link
Contributor Author

I created a 0.15.2-SNAPSHOT jar from d4180c7 (the latest commit on this branch) and ran that 75 times in our CI system with the pom snippet above. 12 out of the 75 builds failed due to d-m-p issues. All 12 failed due to the wait timeout being exceeded, even though the log line it was searching for was present well before the timeout.

3 of those timeouts were accompanied with a reading partial message buffers IOException:

3 of those timeouts were accompanied with a Cannot process chunk response: java.nio.charset.MalformedInputException: Input length = 1 exception:

The other 6 had no additional exceptions or information. The log line can be seen in the log output well before the timeout, but d-m-p times out waiting to find the log line.

FWIW, I ran 10 builds with 2de9724 (the first commit on this branch) and 3 out of those failed due to d-m-p issues. Since the buffer read errors were not reported in this commit, I don't know if any of them were due to that issue.

These builds are running on ec2 instances in us-west-2c with these versions:

Ubuntu 14.04.4 LTS
Docker: version 1.9.1, Git commit a34a1d5
Apache Maven 3.3.3
Oracle Java version: 1.8.0_91
The docker daemon is listing on the default unix socket: unix:///var/run/docker.sock.

@rhuss
Copy link
Collaborator

rhuss commented May 6, 2016

One issue could be that the chunks read to compare the logfile against are split within the log message (so e.g.that one chunk contains "liquibase: Successfully" and the next "released change log lock", but the log check only matches on a single's chunk's content).

I will going to maintain a buffer for the whole line and match always on whole lines.

@rhuss
Copy link
Collaborator

rhuss commented May 6, 2016

btw, I merged your PR already into branch integration, lets continue from there.

@rhuss
Copy link
Collaborator

rhuss commented May 6, 2016

3 of those timeouts were accompanied with a reading partial message buffers IOException:

There must be something wrong with the length decoding since I don't think that there are 750 MB logs there to read (as indicated in the error message)

@rhuss
Copy link
Collaborator

rhuss commented May 6, 2016

Hmm, dont think its the parsing. Could you please do me a favor and rerun your example with the debugging switched on (-X) and from the latest commit 5268542 (head of branch integration) ? This should at least shed some light on the fact tha the logline is not matched even when its there.

@rhuss
Copy link
Collaborator

rhuss commented May 6, 2016

Any chance to get a runnable example so that I can reproduce it ? I think the pom with some liquibase migration scripts could be enough to provoke this behaviour ... My feeling is that we would need at least two images ..

@thomasvandoren
Copy link
Contributor Author

thomasvandoren commented May 6, 2016

@rhuss I shared a private repo with you that repros the issue. Shoot me an email ( thomas.vandoren@gmail.com ) if you have issues or questions.

@rhuss
Copy link
Collaborator

rhuss commented May 9, 2016

Thanks for sharing the repo, I could indeed reproduce the issue. From the logs (900 MB each ;-) I can see that there is an issue with concurrent reads on the socket and even when I'm not using a pooled connection librarary, I see strange behaviour of data appearing on wrong connections (we have several connections open in parallel, one is for reading the log for the wait condition, another is printing it out to the cosole).

I suspect that there is an issue with UNIX socket handling in https://github.com/jnr/jnr-unixsocket When using TCP sockets, everything seems fine to me.

Could you please try the following on your own:

  • Export a TCP socket for the Docker daemon in /etc/default/docker by adding a

    DOCKER_OPTS="-H tcp://0.0.0.0:2375"
    
  • Set environment DOCKER_HOST=tcp://localhost:2375 before running the Maven tests.

I will also check whether an update of jnr-unixsocket does help, too. Unfortunately I'm on the road this week, so I can probably only return to this issue next week.

Though still very curious whether it make a difference for you when using TCP ... ;-)

Thanks a lot for setting up the test case, much appreciated :)

@rhuss
Copy link
Collaborator

rhuss commented May 9, 2016

I just run 23 tests in a row with using TCP without any issue /wrt to the build. However, the Docker daemon broke completely, hanging somewhere in kernel mode. Now rebooting and the same with an updated unix socket lib ...

@thomasvandoren
Copy link
Contributor Author

Using TCP, I was able to run the build 78 times without any issues before I ran out of disk space due to /var/lib/docker/volumes getting too big. I think it's safe to say the TCP workaround works! We'll use that for now.

I was able to reproduce the timeout 1 out of 10 times using unix sockets and the upgraded jnr-unixsocket library. I built docker-maven-plugin from the integration branch at 9c6ac58.

Let me know if I can provide any additional information about the unix socket issue. Thanks for your help!

@rhuss rhuss merged commit d4180c7 into fabric8io:master May 19, 2016
@thomasvandoren thomasvandoren deleted the improve-stream-reader branch May 20, 2016 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants