out_es: ensure integrity of already recorded logs #2026

Closed
wants to merge 2 commits into from

Contributor

@thotypous thotypous commented Mar 17, 2020

The create_doc index privilege was introduced in Elasticsearch 7.5 to ensure a role can only add new logs, but never modify or delete previously recorded ones.

If the role has wider privileges than create_doc and the system running fluent-bit is compromised, one cannot ensure the integrity of logs previously stored in Elasticsearch, since an adversary could modify past logs after the breach.
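For illustration, a role limited in this way could be described with a request body like the one built below. This is a minimal sketch, assuming the body is sent to Elasticsearch's create-role security API; the helper name `append_only_role` and the index pattern `myindex-*` are hypothetical and not part of this PR.

```python
import json

def append_only_role(index_patterns):
    """Build a role body that grants only the create_doc index privilege.

    The helper name and the patterns passed in are illustrative assumptions.
    """
    return {
        "indices": [
            {
                "names": list(index_patterns),
                # create_doc allows adding new documents, not changing old ones.
                "privileges": ["create_doc"],
            }
        ]
    }

body = append_only_role(["myindex-*"])
print(json.dumps(body, indent=2))
```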

However, fluent-bit currently does not support running with the create_doc privilege, since it uses the index op_type, which has the semantics of changing a document if it already exists with the same _id. Therefore, any requests with the index op_type are denied for a role whose only privilege is create_doc, even if they would create a new document, e.g.:

{
   "took":464,
   "errors":true,
   "items":[
      {
         "index":{
            "_index":"myindex-test",
            "_type":"flb_type",
            "_id":"dOq6HAB5BvOnZv3fWiu0",
            "status":403,
            "error":{
               "type":"security_exception",
               "reason":"action [indices:data/write/index:op_type/index] is unauthorized for user [myuser]",
               "suppressed":[
                  {
                     "type":"security_exception",
                     "reason":"action [indices:data/write/index:op_type/index] is unauthorized for user [myuser]"
                  },
                  {
                     "type":"security_exception",
                     "reason":"action [indices:data/write/index:op_type/index] is unauthorized for user [myuser]"
                  }
               ]
            }
         }
      },
      {
         "index":{
            "_index":"myindex-test",
            "_type":"flb_type",
            "_id":"dOq6HAB5BvOnZv3fWiu1",
            "status":403,
            "error":{
               "type":"security_exception",
               "reason":"action [indices:data/write/index:op_type/index] is unauthorized for user [myuser]",
               "suppressed":[
                  {
                     "type":"security_exception",
                     "reason":"action [indices:data/write/index:op_type/index] is unauthorized for user [myuser]"
                  },
                  {
                     "type":"security_exception",
                     "reason":"action [indices:data/write/index:op_type/index] is unauthorized for user [myuser]"
                  }
               ]
            }
         }
      }
   ]
}

We solve this by replacing all index operations with the create operation, which is authorized for roles that only have the create_doc privilege.
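The difference on the wire is small: only the action name in each bulk action line changes. The sketch below is not the plugin's C code; it is a hypothetical Python helper (`bulk_body`) that builds a `_bulk` request body so the two variants can be compared. The `_type` field seen in the responses above is left out for brevity.

```python
import json

def bulk_body(index, records, action="create"):
    """Serialize records as NDJSON: one action line, then one source line each.

    `bulk_body` is an illustrative helper, not a Fluent Bit function.
    """
    lines = []
    for record in records:
        lines.append(json.dumps({action: {"_index": index}}))
        lines.append(json.dumps(record))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

payload = bulk_body("myindex-test", [{"log": "first"}, {"log": "second"}])
print(payload, end="")
```

Calling `bulk_body(..., action="index")` would produce the old form that a create_doc-only role rejects.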

Please note this change is backwards compatible even with very old versions of Elasticsearch.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

Documentation

  • [N/A] Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@thotypous
Contributor Author

thotypous commented Mar 17, 2020

[Edit: Solved below] I need some guidance regarding the following issue:

If Generate_ID is set to On in the es plugin config and a document with the same _id already exists on the Elasticsearch server (due to a retry of a previously successful operation that fluent-bit deemed failed), it will return something like this:

{
   "took":9,
   "errors":true,
   "items":[
      {
         "create":{
            "_index":"myindex-test",
            "_type":"flb_type",
            "_id":"dOq6HAB5BvOnZv3fWiu0",
            "status":409,
            "error":{
               "type":"version_conflict_engine_exception",
               "reason":"[dOq6HAB5BvOnZv3fWiu0]: version conflict, document already exists (current version [1])",
               "index_uuid":"MHbN2sP7T8mdTy55MHTlew",
               "shard":"0",
               "index":"myindex-test"
            }
         }
      },
      {
         "create":{
            "_index":"myindex-test",
            "_type":"flb_type",
            "_id":"dOq6HAB5BvOnZv3fWiu1",
            "status":409,
            "error":{
               "type":"version_conflict_engine_exception",
               "reason":"[dOq6HAB5BvOnZv3fWiu1]: version conflict, document already exists (current version [1])",
               "index_uuid":"MHbN2sP7T8mdTy55MHTlew",
               "shard":"0",
               "index":"myindex-test"
            }
         }
      }
   ]
}

This in turn will cause the es plugin to issue a retry. The retries would then repeat in a row until fluent-bit gives up.

Currently, this does not happen because the "index" op_type is interpreted as "replace the existing document".
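The contrast between the two op_types can be shown with a toy in-memory store. This is only a sketch of the semantics described above, not Elasticsearch itself; the class `ToyIndex` and the document contents are made up for illustration.

```python
class ToyIndex:
    """In-memory stand-in for an index, modelling only the two op_type semantics."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source):
        # "index" replaces an existing document that has the same _id.
        status = 200 if doc_id in self.docs else 201
        self.docs[doc_id] = source
        return status

    def create(self, doc_id, source):
        # "create" never touches an existing document; it reports a conflict.
        if doc_id in self.docs:
            return 409
        self.docs[doc_id] = source
        return 201

store = ToyIndex()
first = store.create("abc", {"log": "original"})
retry = store.create("abc", {"log": "original"})    # retried delivery
overwrite = store.index("abc", {"log": "tampered"})  # silently replaces
print(first, retry, overwrite)  # prints: 201 409 200
```

With "create", a retried delivery surfaces as a 409 instead of silently rewriting the stored document, which is why the retry handling needs attention.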

Do we need to handle this issue, or just let fluent-bit naturally give up retrying? It seems to be a waste of bandwidth, so I would like to handle this. Any suggestions? Should I modify elasticsearch_error_check to ignore this kind of error?

@thotypous thotypous changed the title es: ensure integrity of already recorded logs out_es: ensure integrity of already recorded logs Mar 17, 2020
@thotypous
Contributor Author

thotypous commented Mar 18, 2020

I just updated this pull request. It now changes the elasticsearch_error_check function to ignore errors with status 409, which addresses the concern raised in my previous comment.

Please review and share your opinion on this approach.
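The idea can be sketched as follows. This is not the patch's C code for elasticsearch_error_check; it is a hypothetical Python function (`bulk_needs_retry`) that applies the same rule to a bulk response: a 409 item is treated as already delivered, any other failed item triggers a retry.

```python
import json

def bulk_needs_retry(response_text):
    """Return True if the bulk response contains a failure other than a 409 conflict."""
    try:
        response = json.loads(response_text)
    except ValueError:
        return True  # unparsable body: be conservative and retry
    if not isinstance(response, dict):
        return True
    if not response.get("errors"):
        return False
    for item in response.get("items", []):
        for result in item.values():
            status = result.get("status")
            if status == 409:
                continue  # document already exists: a retry would not help
            if not isinstance(status, int) or status >= 400:
                return True
    return False

conflict = '{"took":3,"errors":true,"items":[{"create":{"_id":"a","status":409}}]}'
forbidden = '{"took":3,"errors":true,"items":[{"create":{"_id":"a","status":403}}]}'
print(bulk_needs_retry(conflict), bulk_needs_retry(forbidden))  # prints: False True
```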

@thotypous
Contributor Author

Example config:

[SERVICE]
    Flush            10
    Daemon           Off
    Log_Level        debug
    HTTP_Monitor     Off

[INPUT]
    Name    tail
    Path    /path/messages
    DB      /path/messages.db
    Parser  syslog-busybox

[OUTPUT]
    Name         es
    Host         mydomain.ufscar.br
    Port         443
    HTTP_User    myuser
    HTTP_Passwd  mypassword
    Index        myindex-test
    Generate_ID  On
    tls          On
    tls.verify   On

Valgrind and debug log, testing the version conflict behavior:

==9982== Memcheck, a memory error detector
==9982== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9982== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==9982== Command: /git/fluent-bit/build/bin/fluent-bit -v -s 24576 -c /path/fluent-bit.conf -R /path/parsers.conf
==9982==
Fluent Bit v1.4.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/03/19 09:29:14] [ info] Configuration:
[2020/03/19 09:29:14] [ info]  flush time     | 10.000000 seconds
[2020/03/19 09:29:14] [ info]  grace          | 5 seconds
[2020/03/19 09:29:14] [ info]  daemon         | 0
[2020/03/19 09:29:14] [ info] ___________
[2020/03/19 09:29:14] [ info]  inputs:
[2020/03/19 09:29:14] [ info]      tail
[2020/03/19 09:29:14] [ info] ___________
[2020/03/19 09:29:14] [ info]  filters:
[2020/03/19 09:29:14] [ info] ___________
[2020/03/19 09:29:14] [ info]  outputs:
[2020/03/19 09:29:14] [ info]      es.0
[2020/03/19 09:29:14] [ info] ___________
[2020/03/19 09:29:14] [ info]  collectors:
[2020/03/19 09:29:14] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/03/19 09:29:14] [ info] [storage] version=1.0.2, initializing...
[2020/03/19 09:29:14] [ info] [storage] in-memory
[2020/03/19 09:29:15] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/03/19 09:29:15] [ info] [engine] started (pid=9982)
[2020/03/19 09:29:15] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/03/19 09:29:15] [debug] [input:tail:tail.0] inotify watch fd=20
[2020/03/19 09:29:15] [debug] [input:tail:tail.0] scanning path /path/messages
[2020/03/19 09:29:15] [debug] [input:tail:tail.0] add to scan queue /path/messages, offset=0
[2020/03/19 09:29:17] [debug] [output:es:es.0] host=mydomain.ufscar.br port=443 uri=/_bulk index=myindex-test type=flb_type
[2020/03/19 09:29:17] [debug] [router] default match rule tail.0:es.0
[2020/03/19 09:29:17] [ info] [sp] stream processor started
[2020/03/19 09:29:17] [debug] [input:tail:tail.0] file=/path/messages read=53 lines=1
[2020/03/19 09:29:17] [debug] [input:tail:tail.0] file=/path/messages promote to TAIL_EVENT
==9982== Warning: client switching stacks?  SP change: 0x1fff0001f8 --> 0x6333240
==9982==          to suppress, use: --max-stackframe=137318158264 or greater
==9982== Warning: client switching stacks?  SP change: 0x63331b8 --> 0x1fff0001f8
==9982==          to suppress, use: --max-stackframe=137318158400 or greater
==9982== Warning: client switching stacks?  SP change: 0x1fff0001f8 --> 0x63331b8
==9982==          to suppress, use: --max-stackframe=137318158400 or greater
==9982==          further instances of this message will not be shown.
[2020/03/19 09:29:27] [debug] [task] created task=0x632ce20 id=0 OK
[2020/03/19 09:29:29] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2020/03/19 09:29:29] [debug] [output:es:es.0] Elasticsearch response
{"took":3,"errors":true,"items":[{"create":{"_index":"myindex-test","_type":"flb_type","_id":"b1104f3e-6de8-6f6b-5574-defe0d86449b","status":409,"error":{"type":"version_conflict_engine_exception","reason":"[b1104f3e-6de8-6f6b-5574-defe0d86449b]: version conflict, document already exists (current version [1])","index_uuid":"MHbN2sP7T8mdTy55MHTlew","shard":"0","index":"myindex-test"}}}]}
[2020/03/19 09:29:29] [debug] [task] destroy task=0x632ce20 (task_id=0)
^C[engine] caught signal (SIGINT)
[2020/03/19 09:29:50] [ info] [input] pausing tail.0
==9982==
==9982== HEAP SUMMARY:
==9982==     in use at exit: 5,156 bytes in 13 blocks
==9982==   total heap usage: 35,443 allocs, 35,430 frees, 13,460,535 bytes allocated
==9982==
==9982== LEAK SUMMARY:
==9982==    definitely lost: 0 bytes in 0 blocks
==9982==    indirectly lost: 0 bytes in 0 blocks
==9982==      possibly lost: 0 bytes in 0 blocks
==9982==    still reachable: 5,156 bytes in 13 blocks
==9982==         suppressed: 0 bytes in 0 blocks
==9982== Reachable blocks (those to which a pointer was found) are not shown.
==9982== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==9982==
==9982== For lists of detected and suppressed errors, rerun with: -s
==9982== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Valgrind and debug log, testing ordinary behavior:

==12456== Memcheck, a memory error detector
==12456== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12456== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==12456== Command: /git/fluent-bit/build/bin/fluent-bit -v -s 24576 -c /path/fluent-bit.conf -R /path/parsers.conf
==12456==
Fluent Bit v1.4.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/03/19 09:39:12] [ info] Configuration:
[2020/03/19 09:39:12] [ info]  flush time     | 10.000000 seconds
[2020/03/19 09:39:12] [ info]  grace          | 5 seconds
[2020/03/19 09:39:12] [ info]  daemon         | 0
[2020/03/19 09:39:12] [ info] ___________
[2020/03/19 09:39:12] [ info]  inputs:
[2020/03/19 09:39:12] [ info]      tail
[2020/03/19 09:39:12] [ info] ___________
[2020/03/19 09:39:12] [ info]  filters:
[2020/03/19 09:39:12] [ info] ___________
[2020/03/19 09:39:12] [ info]  outputs:
[2020/03/19 09:39:12] [ info]      es.0
[2020/03/19 09:39:12] [ info] ___________
[2020/03/19 09:39:12] [ info]  collectors:
[2020/03/19 09:39:12] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/03/19 09:39:12] [ info] [storage] version=1.0.2, initializing...
[2020/03/19 09:39:12] [ info] [storage] in-memory
[2020/03/19 09:39:12] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/03/19 09:39:12] [ info] [engine] started (pid=12456)
[2020/03/19 09:39:12] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/03/19 09:39:13] [debug] [input:tail:tail.0] inotify watch fd=20
[2020/03/19 09:39:13] [debug] [input:tail:tail.0] scanning path /path/messages
[2020/03/19 09:39:13] [debug] [input:tail:tail.0] add to scan queue /path/messages, offset=0
[2020/03/19 09:39:15] [debug] [output:es:es.0] host=elasticsearch.sin.ufscar.br port=443 uri=/_bulk index=myindex-test type=flb_type
[2020/03/19 09:39:15] [debug] [router] default match rule tail.0:es.0
[2020/03/19 09:39:15] [ info] [sp] stream processor started
[2020/03/19 09:39:15] [debug] [input:tail:tail.0] file=/path/messages read=53 lines=1
[2020/03/19 09:39:15] [debug] [input:tail:tail.0] file=/path/messages promote to TAIL_EVENT
==12456== Warning: client switching stacks?  SP change: 0x1fff0001f8 --> 0x6333240
==12456==          to suppress, use: --max-stackframe=137318158264 or greater
==12456== Warning: client switching stacks?  SP change: 0x63331b8 --> 0x1fff0001f8
==12456==          to suppress, use: --max-stackframe=137318158400 or greater
==12456== Warning: client switching stacks?  SP change: 0x1fff0001f8 --> 0x63331b8
==12456==          to suppress, use: --max-stackframe=137318158400 or greater
==12456==          further instances of this message will not be shown.
[2020/03/19 09:39:25] [debug] [task] created task=0x632ce20 id=0 OK
[2020/03/19 09:39:27] [debug] [output:es:es.0] HTTP Status=200 URI=/_bulk
[2020/03/19 09:39:27] [debug] [output:es:es.0] Elasticsearch response
{"took":8,"errors":false,"items":[{"create":{"_index":"myindex-test","_type":"flb_type","_id":"fb14eda3-e02b-ee57-047d-5d6aa4f3136f","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":35,"_primary_term":2,"status":201}}]}
[2020/03/19 09:39:27] [debug] [task] destroy task=0x632ce20 (task_id=0)
^C[engine] caught signal (SIGINT)
[2020/03/19 09:39:30] [ info] [input] pausing tail.0
==12456==
==12456== HEAP SUMMARY:
==12456==     in use at exit: 5,156 bytes in 13 blocks
==12456==   total heap usage: 35,587 allocs, 35,574 frees, 13,465,080 bytes allocated
==12456==
==12456== LEAK SUMMARY:
==12456==    definitely lost: 0 bytes in 0 blocks
==12456==    indirectly lost: 0 bytes in 0 blocks
==12456==      possibly lost: 0 bytes in 0 blocks
==12456==    still reachable: 5,156 bytes in 13 blocks
==12456==         suppressed: 0 bytes in 0 blocks
==12456== Reachable blocks (those to which a pointer was found) are not shown.
==12456== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==12456==
==12456== For lists of detected and suppressed errors, rerun with: -s
==12456== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@thotypous
Contributor Author

I will now force push exactly the same diff just to get the CI to run again. It timed out installing clang packages.

The "create_doc" index privilege, introduced in Elasticsearch 7.5,
ensures a role can only add new logs, but never modify or delete
previously recorded ones.

However, the "index" op_type has the semantics of changing a document if
it already exists with the same "_id". Therefore, any requests with the
"index" op_type are denied for a role whose only privilege is
"create_doc".

We solve this by replacing all "index" operations with the "create"
operation. However, this has the side effect of producing status 409
errors whenever a previously successful operation is retried and the
Generate_ID option is turned on. Therefore, we change the
"elasticsearch_error_check" function to ignore this kind of error.

Signed-off-by: Paulo Matias <matias@ufscar.br>
@thotypous
Contributor Author

thotypous commented Mar 19, 2020

Well, now the clang tests passed and the gcc tests timed out installing packages 😆. At least, now we know the change passes all tests 👍

@edsiper
Member

edsiper commented Jun 30, 2020

review deferred after v1.5 release

@kaay-it

kaay-it commented Nov 13, 2020

Hi @fujimotos
What about this PR?
We are really waiting for it, because it additionally fixes issue #2664

@kaay-it

kaay-it commented Nov 13, 2020

@edsiper, @PettitWesley - we need your review

fujimotos
fujimotos previously approved these changes Nov 13, 2020
@fujimotos
Member

@thotypous @AlekseyKalinin Sorry for the delay. I'm fine with this patch.

@kaay-it

kaay-it commented Nov 16, 2020

@edsiper, @PettitWesley - we need your review

@edsiper, @PettitWesley - kindly asking you to review

@farcop

farcop commented Nov 16, 2020

@edsiper @PettitWesley We need your reviews. Thanks in advance!

@big-dima66

@edsiper @PettitWesley We need your reviews. Thanks in advance!

@BlackAlphaS

@edsiper @PettitWesley can you please approve this pull-request? Thank you!

@PettitWesley
Contributor

PettitWesley commented Dec 17, 2020

This would fall under @edsiper's scope more than mine... I can take a look if needed...

However, I think normally fujimotos' approval should be sufficient for a merge....

Either way, Eduardo is the only one who manages releases and merges PRs in general. I am primarily the maintainer for the AWS plugins.

@marco-claudino

Hi @edsiper and @PettitWesley,

Is there any chance of working this out?
Data streams, ILM, and rollover are pretty stable now in Elasticsearch; it's really bad that we can't use these features.

Thank you

@PettitWesley
Contributor

@edsiper With Fujimotos approval, can we merge this?

@PettitWesley
Contributor

@thotypous @marco-claudino Looks like there are conflicts in the PR that need to be fixed.

@edsiper
Member

edsiper commented Apr 30, 2021

I trust @fujimotos' review. My only requirement is to rebase this PR on top of Git master so we can get full CI coverage (we recently moved to GitHub Actions).

@fujimotos
Member

With Fujimotos approval, can we merge this?

@edsiper @PettitWesley I noticed the discussion in this thread, so I decided to take
a couple of hours this morning to check this PR (again) to make sure.

I can confirm that it works. Using Elasticsearch v7.12.1, Fluent Bit can send records
without any issues. I also verified that the issue of out_es corrupting existing data is
resolved by this patch.

Attached is a screenshot from my testing. The test was done on the master HEAD
with this patch manually applied:

[Screenshot: Screen Shot 2021-05-03 at 10 51 00]

@fujimotos
Member

fujimotos commented May 2, 2021

I kicked GitHub CI to check this PR, and it seems all green now too.

@edsiper @PettitWesley I'm going to step forward and merge this PR.

I'm planning to do a rebase merge for a clean commit history (instead of
a plain merge). Here is a candidate branch created for a clean merge:

https://github.com/fluent/fluent-bit/commits/PR2026-for-merge

I'll push this PR to master tonight (around 7:00 EST) after I get home.
If you have any concerns about this, please just let me know.

@fujimotos fujimotos closed this May 3, 2021
@fujimotos
Member

Merged via 7f0db9e.

@thotypous thotypous deleted the es-integrity branch May 3, 2021 12:11
@paulden

paulden commented May 26, 2021

Hello @fujimotos, do you know when this commit will be released? We ran into the same issue as #2664 and the fix does not appear to be available in the latest Fluent Bit version (1.7.6).

Thanks a lot for your work!

@fujimotos
Member

@paulden This patch is on track to be included in the next major release (v1.8.0).

Ask Eduardo about the exact release date of Fluent Bit v1.8.
