Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x-pack/filebeat-cloud / TestInput – github.com/elastic/beats/v7/x-pack/filebeat/input/httpjson tests are flaky #34929

Closed
elasticmachine opened this issue Mar 26, 2023 · 11 comments
Assignees
Labels
automation build-failures Build failures in the CI. ci-reported Issues that have been automatically reported from the CI flaky-test Unstable or unreliable test cases.

Comments

@elasticmachine
Copy link
Collaborator

πŸ’” Tests Failed

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-03-26T02:40:04.715+0000

  • Duration: 58 min 45 sec

Test stats πŸ§ͺ

Test Results
Failed 3
Passed 1578
Skipped 58
Total 1639

Test errors 3

Expand to view the tests failures

Extended / x-pack/filebeat-cloud / TestInput/Test_first_event – github.com/elastic/beats/v7/x-pack/filebeat/input/httpjson
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestInput/Test_first_event
        input_test.go:1186: 
            	Error Trace:	/var/lib/jenkins/workspace/main-1361-95a2659a-58be-41f9-a28c-7a4b7b28efae/src/github.com/elastic/beats/x-pack/filebeat/input/httpjson/input_test.go:1186
            	Error:      	Not equal: 
            	            	expected: map[string]interface {}{"first":"a", "foo":"b"}
            	            	actual  : map[string]interface {}{"first":"none", "foo":"b"}
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,3 +1,3 @@
            	            	 (map[string]interface {}) (len=2) {
            	            	- (string) (len=5) "first": (string) (len=1) "a",
            	            	+ (string) (len=5) "first": (string) (len=4) "none",
            	            	  (string) (len=3) "foo": (string) (len=1) "b"
            	Test:       	TestInput/Test_first_event
        --- FAIL: TestInput/Test_first_event (2.01s)
     
    

Extended / x-pack/filebeat-cloud / TestInput – github.com/elastic/beats/v7/x-pack/filebeat/input/httpjson
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestInput
    --- FAIL: TestInput (17.45s)
     
    

Extended / x-pack/metricbeat-cloud / test_query – x-pack.metricbeat.module.sql.query.test_sql_oracle.Test
    Expand to view the error details

     failed on setup with "DeprecationWarning: The "warn" method is deprecated, use "warning" instead" 
    

    Expand to view the stacktrace

     self = <docker.api.client.APIClient object at 0x7ff6fed218b0>
    response = <Response [500]>
    
        def _raise_for_status(self, response):
            """Raises stored :class:`APIError`, if one occurred."""
            try:
    >           response.raise_for_status()
    
    ../../build/ve/docker/lib/python3.9/site-packages/docker/api/client.py:268: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <Response [500]>
    
        def raise_for_status(self):
            """Raises :class:`HTTPError`, if one occurred."""
        
            http_error_msg = ""
            if isinstance(self.reason, bytes):
                # We attempt to decode utf-8 first because some servers
                # choose to localize their reason strings. If the string
                # isn"t utf-8, we fall back to iso-8859-1 for all other
                # encodings. (See PR #3538)
                try:
                    reason = self.reason.decode("utf-8")
                except UnicodeDecodeError:
                    reason = self.reason.decode("iso-8859-1")
            else:
                reason = self.reason
        
            if 400 <= self.status_code < 500:
                http_error_msg = (
                    f"{self.status_code} Client Error: {reason} for url: {self.url}"
                )
        
            elif 500 <= self.status_code < 600:
                http_error_msg = (
                    f"{self.status_code} Server Error: {reason} for url: {self.url}"
                )
        
            if http_error_msg:
    >           raise HTTPError(http_error_msg, response=self)
    E           requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.42/containers/ae2dbe4df3f302a15bf3c04eba097fafb5807d07173cd52f2d77df7ec9781fc9/start
    
    ../../build/ve/docker/lib/python3.9/site-packages/requests/models.py:1021: HTTPError
    
    The above exception was the direct cause of the following exception:
    
    self = <Service: oracle>
    container = <Container: sql_7918a246f6f8_oracle_1 (ae2dbe)>
    use_network_aliases = True
    
        def start_container(self, container, use_network_aliases=True):
            self.connect_container_to_networks(container, use_network_aliases)
            try:
    >           container.start()
    
    ../../build/ve/docker/lib/python3.9/site-packages/compose/service.py:643: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <Container: sql_7918a246f6f8_oracle_1 (ae2dbe)>, options = {}
    
        def start(self, **options):
    >       return self.client.start(self.id, **options)
    
    ../../build/ve/docker/lib/python3.9/site-packages/compose/container.py:228: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <docker.api.client.APIClient object at 0x7ff6fed218b0>
    resource_id = "ae2dbe4df3f302a15bf3c04eba097fafb5807d07173cd52f2d77df7ec9781fc9"
    args = (), kwargs = {}
    
        @functools.wraps(f)
        def wrapped(self, resource_id=None, *args, **kwargs):
            if resource_id is None and kwargs.get(resource_name):
                resource_id = kwargs.pop(resource_name)
            if isinstance(resource_id, dict):
                resource_id = resource_id.get("Id", resource_id.get("ID"))
            if not resource_id:
                raise errors.NullResource(
                    "Resource ID was not provided"
                )
    >       return f(self, resource_id, *args, **kwargs)
    
    ../../build/ve/docker/lib/python3.9/site-packages/docker/utils/decorators.py:19: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <docker.api.client.APIClient object at 0x7ff6fed218b0>
    container = "ae2dbe4df3f302a15bf3c04eba097fafb5807d07173cd52f2d77df7ec9781fc9"
    args = (), kwargs = {}
    url = "http+docker://localhost/v1.42/containers/ae2dbe4df3f302a15bf3c04eba097fafb5807d07173cd52f2d77df7ec9781fc9/start"
    res = <Response [500]>
    
        @utils.check_resource("container")
        def start(self, container, *args, **kwargs):
            """
            Start a container. Similar to the ``docker start`` command, but
            doesn"t support attach options.
        
            **Deprecation warning:** Passing configuration options in ``start`` is
            no longer supported. Users are expected to provide host config options
            in the ``host_config`` parameter of
            :py:meth:`~ContainerApiMixin.create_container`.
        
        
            Args:
                container (str): The container to start
        
            Raises:
                :py:class:`docker.errors.APIError`
                    If the server returns an error.
                :py:class:`docker.errors.DeprecatedMethod`
                    If any argument besides ``container`` are provided.
        
            Example:
        
                >>> container = client.api.create_container(
                ...     image="busybox:latest",
                ...     command="/bin/sleep 30")
                >>> client.api.start(container=container.get("Id"))
            """
            if args or kwargs:
                raise errors.DeprecatedMethod(
                    "Providing configuration in the start() method is no longer "
                    "supported. Use the host_config param in create_container "
                    "instead."
                )
            url = self._url("/containers/{0}/start", container)
            res = self._post(url)
    >       self._raise_for_status(res)
    
    ../../build/ve/docker/lib/python3.9/site-packages/docker/api/container.py:1126: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <docker.api.client.APIClient object at 0x7ff6fed218b0>
    response = <Response [500]>
    
        def _raise_for_status(self, response):
            """Raises stored :class:`APIError`, if one occurred."""
            try:
                response.raise_for_status()
            except requests.exceptions.HTTPError as e:
    >           raise create_api_error_from_http_exception(e) from e
    
    ../../build/ve/docker/lib/python3.9/site-packages/docker/api/client.py:270: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    e = HTTPError("500 Server Error: Internal Server Error for url: http+docker://localhost/v1.42/containers/ae2dbe4df3f302a15bf3c04eba097fafb5807d07173cd52f2d77df7ec9781fc9/start")
    
        def create_api_error_from_http_exception(e):
            """
            Create a suitable APIError from requests.exceptions.HTTPError.
            """
            response = e.response
            try:
                explanation = response.json()["message"]
            except ValueError:
                explanation = (response.content or "").strip()
            cls = APIError
            if response.status_code == 404:
                explanation_msg = (explanation or "").lower()
                if any(fragment in explanation_msg
                       for fragment in _image_not_found_explanation_fragments):
                    cls = ImageNotFound
                else:
                    cls = NotFound
    >       raise cls(e, response=response, explanation=explanation) from e
    E       docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.42/containers/ae2dbe4df3f302a15bf3c04eba097fafb5807d07173cd52f2d77df7ec9781fc9/start: Internal Server Error ("driver failed programming external connectivity on endpoint sql_7918a246f6f8_oracle_1 (ebce262c681d985304658e54cd7088b9a989f9df6f0e32fef3601853683cb97b): Bind for 0.0.0.0:5500 failed: port is already allocated")
    
    ../../build/ve/docker/lib/python3.9/site-packages/docker/errors.py:39: APIError
    
    During handling of the above exception, another exception occurred:
    
    self = <class "test_sql_oracle.Test">
    
        @classmethod
        def setUpClass(self):
            self.beat_name = "metricbeat"
            self.beat_path = os.path.abspath(
                os.path.join(os.path.dirname(__file__), "../../"))
        
            self.template_paths = [
                os.path.abspath(os.path.join(self.beat_path, "../../metricbeat")),
                os.path.abspath(os.path.join(self.beat_path, "../../libbeat")),
            ]
        
    >       super(XPackTest, self).setUpClass()
    
    tests/system/xpack_metricbeat.py:19: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    ../../metricbeat/tests/system/metricbeat.py:42: in setUpClass
        super().setUpClass()
    ../../libbeat/tests/system/beat/beat.py:204: in setUpClass
        cls.compose_up_with_retries()
    ../../libbeat/tests/system/beat/beat.py:222: in compose_up_with_retries
        raise ex
    ../../libbeat/tests/system/beat/beat.py:218: in compose_up_with_retries
        cls.compose_up()
    ../../libbeat/tests/system/beat/compose.py:66: in compose_up
        project.up(
    ../../build/ve/docker/lib/python3.9/site-packages/compose/project.py:697: in up
        results, errors = parallel.parallel_execute(
    ../../build/ve/docker/lib/python3.9/site-packages/compose/parallel.py:108: in parallel_execute
        raise error_to_reraise
    ../../build/ve/docker/lib/python3.9/site-packages/compose/parallel.py:206: in producer
        result = func(obj)
    ../../build/ve/docker/lib/python3.9/site-packages/compose/project.py:679: in do
        return service.execute_convergence_plan(
    ../../build/ve/docker/lib/python3.9/site-packages/compose/service.py:559: in execute_convergence_plan
        return self._execute_convergence_create(
    ../../build/ve/docker/lib/python3.9/site-packages/compose/service.py:473: in _execute_convergence_create
        containers, errors = parallel_execute(
    ../../build/ve/docker/lib/python3.9/site-packages/compose/parallel.py:108: in parallel_execute
        raise error_to_reraise
    ../../build/ve/docker/lib/python3.9/site-packages/compose/parallel.py:206: in producer
        result = func(obj)
    ../../build/ve/docker/lib/python3.9/site-packages/compose/service.py:478: in <lambda>
        lambda service_name: create_and_start(self, service_name.number),
    ../../build/ve/docker/lib/python3.9/site-packages/compose/service.py:461: in create_and_start
        self.start_container(container)
    ../../build/ve/docker/lib/python3.9/site-packages/compose/service.py:647: in start_container
        log.warn("Host is already in use by another container")
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <Logger compose.service (WARNING)>
    msg = "Host is already in use by another container", args = (), kwargs = {}
    
        def warn(self, msg, *args, **kwargs):
    >       warnings.warn("The "warn" method is deprecated, "
                "use "warning" instead", DeprecationWarning, 2)
    E       DeprecationWarning: The "warn" method is deprecated, use "warning" instead
    
    /usr/lib/python3.9/logging/__init__.py:1457: DeprecationWarning 
    

Steps errors 3

Expand to view the steps failures

x-pack/filebeat-cloud - mage build test
  • Took 5 min 18 sec . View more details here
  • Description: mage build test
x-pack/metricbeat-cloud - mage build test
  • Took 30 min 30 sec . View more details here
  • Description: mage build test
Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error "hudson.AbortException: script returned exit code 1"

@elasticmachine elasticmachine added automation ci-reported Issues that have been automatically reported from the CI build-failures Build failures in the CI. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Mar 26, 2023
@elasticmachine
Copy link
Collaborator Author

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz cmacknz changed the title Build 1361 for main with status FAILURE x-pack/filebeat-cloud / TestInput – github.com/elastic/beats/v7/x-pack/filebeat/input/httpjson tests are flaky Apr 5, 2023
@cmacknz cmacknz removed the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Apr 5, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 5, 2023
@cmacknz cmacknz added Team:Security-External Integrations and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 5, 2023
@elasticmachine
Copy link
Collaborator Author

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@cmacknz cmacknz added the flaky-test Unstable or unreliable test cases. label Apr 5, 2023
@andrewkroh
Copy link
Member

andrewkroh commented Apr 5, 2023

Highlighting the failure and code location:

assert.JSONEq(t, tc.expected[receivedCount], val.(string))

name: "Test first event",
setupServer: func(t *testing.T, h http.HandlerFunc, config map[string]interface{}) {
registerPaginationTransforms()
registerResponseTransforms()
t.Cleanup(func() { registeredTransforms = newRegistry() })
server := httptest.NewServer(h)
config["request.url"] = server.URL
t.Cleanup(server.Close)
},
baseConfig: map[string]interface{}{
"interval": 1,
"request.method": http.MethodGet,
"response.split": map[string]interface{}{
"target": "body.items",
"transforms": []interface{}{
map[string]interface{}{
"set": map[string]interface{}{
"target": "body.first",
"value": "[[.cursor.first]]",
"default": "none",
},
},
},
},
"response.pagination": []interface{}{
map[string]interface{}{
"set": map[string]interface{}{
"target": "url.params.page",
"value": "[[.last_response.body.nextPageToken]]",
"fail_on_template_error": true,
},
},
},
"cursor": map[string]interface{}{
"first": map[string]interface{}{
"value": "[[.first_event.foo]]",
},
},
},
handler: paginationHandler(),
expected: []string{`{"first":"none", "foo":"a"}`, `{"first":"a", "foo":"b"}`, `{"first":"a", "foo":"c"}`, `{"first":"c", "foo":"d"}`},
},

 === RUN   TestInput/Test_first_event
    input_test.go:1186: 
        	Error Trace:	/var/lib/jenkins/workspace/main-1361-95a2659a-58be-41f9-a28c-7a4b7b28efae/src/github.com/elastic/beats/x-pack/filebeat/input/httpjson/input_test.go:1186
        	Error:      	Not equal: 
        	            	expected: map[string]interface {}{"first":"a", "foo":"b"}
        	            	actual  : map[string]interface {}{"first":"none", "foo":"b"}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,3 +1,3 @@
        	            	 (map[string]interface {}) (len=2) {
        	            	- (string) (len=5) "first": (string) (len=1) "a",
        	            	+ (string) (len=5) "first": (string) (len=4) "none",
        	            	  (string) (len=3) "foo": (string) (len=1) "b"
        	Test:       	TestInput/Test_first_event
    --- FAIL: TestInput/Test_first_event (2.01s)

@efd6

This comment was marked as outdated.

@efd6
Copy link
Contributor

efd6 commented Apr 20, 2023

Suggestive, but ultimately I think wrong. It looks like it comes down to the behaviour of *split.sendMessage which does reference-behaviour dependent mutations on the value it is passed. This appears to be depended on by the current implementation (ensuring that values are hermetic breaks TestSplit/A_nested_array_with_a_nested_map). It looks like this is in lieu of holding a stack during the split recursion.

My suspicion is that due to timing, it may be possible for an early value that is held up in publishing to be over-written by a later split operation. I think this is possible because of the concurrent state that is held in the processing via the *transformContext which is data-race protected, but does not look to be completely serialised; in processAndPublishEvents the transformed event is received and published and then there looks to be a logic race with the call to *transformContext.updateFirstEvent β€” an atomic update first event looks like it would go part way towards fixing this

func (ctx *transformContext) updateFirstEventIfUnset(e mapstr.M) {
	ctx.lock.Lock()
	if len(*ctx.firstEvent) == 0 {
		*ctx.firstEvent = e
	}
	ctx.lock.Unlock()
}

but even then I think that there is a race between obtaining the lock on the first event and the event producer overwriting the shared reference. To fix this an ack lock would be needed to be passed back to or shared withthe event producer, at which point we are simulating non-concurrent execution with goroutines and channels.

This hypothesis can be tested by delaying the first event's publication for each request's set of events like so

diff --git a/x-pack/filebeat/input/httpjson/request.go b/x-pack/filebeat/input/httpjson/request.go
index 9832275a32..186664f300 100644
--- a/x-pack/filebeat/input/httpjson/request.go
+++ b/x-pack/filebeat/input/httpjson/request.go
@@ -13,6 +13,7 @@ import (
        "net/http"
        "net/url"
        "strings"
+       "time"
 
        "github.com/PaesslerAG/jsonpath"
 
@@ -511,6 +512,8 @@ func processAndPublishEvents(trCtx *transformContext, events <-chan maybeMsg, pu
                        continue
                }
 
+               time.Sleep(time.Duration(1-n) * 10 * time.Millisecond)
+
                if publish {
                        event, err := makeEvent(maybeMsg.msg)
                        if err != nil {

Running test now always fails with the same outcome as the flakey test in the report.

--- FAIL: TestInput (2.03s)
    --- FAIL: TestInput/Test_first_event (2.03s)
        input_test.go:1186: 
            	Error Trace:	/Users/daniel/go/src/github.com/elastic/beats/x-pack/filebeat/input/httpjson/input_test.go:1186
            	Error:      	Not equal: 
            	            	expected: map[string]interface {}{"first":"a", "foo":"b"}
            	            	actual  : map[string]interface {}{"first":"none", "foo":"b"}
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,3 +1,3 @@
            	            	 (map[string]interface {}) (len=2) {
            	            	- (string) (len=5) "first": (string) (len=1) "a",
            	            	+ (string) (len=5) "first": (string) (len=4) "none",
            	            	  (string) (len=3) "foo": (string) (len=1) "b"
            	Test:       	TestInput/Test_first_event
FAIL
exit status 1

The error here is in using an unbuffered channel in place of a yield for a coroutine (they are close but not semantically identical) in conjunction with shared mutable state.

@narph
Copy link
Contributor

narph commented May 11, 2023

@andrewkroh , @efd6 can we go ahead skip this flaky test until we have a solution for this issue?
@efd6 , can you create a separate issue regarding the refactoring needed for the httpjson input ?

@efd6
Copy link
Contributor

efd6 commented May 11, 2023

@narph I'd be reluctant to disable the test; the frequency of the failure is low enough that it only very rarely results in a build failure, and it does protect against regressions the rest of the time.

@andrewkroh
Copy link
Member

I did a search for how many builds were affected by this test. It looks like about 18 over the last year.

Screenshot 2023-05-15 at 10 09 30

@pierrehilbert
Copy link
Collaborator

@efd6 / @narph What is the final call here? Did you disable this test? Did you fix it? Can we close this issue?

@efd6
Copy link
Contributor

efd6 commented Jul 10, 2023

It was disabled with intention of fixing during a major refactor. This can be closed.

@pierrehilbert
Copy link
Collaborator

Thx for the update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automation build-failures Build failures in the CI. ci-reported Issues that have been automatically reported from the CI flaky-test Unstable or unreliable test cases.
Projects
None yet
Development

No branches or pull requests

6 participants