
[core] Hash firehose stream name if it is too long #1191

Merged: 5 commits, Mar 17, 2020

Conversation

chunyong-lin
Contributor

to: @airbnb/streamalert-maintainers
related to: #1115 #1188
resolves: #1190

Background

See issue #1190

Changes

  • Update the firehose stream name in the classifier and pass the new stream name to the tf_kinesis_firehose_delivery_stream module at terraform generate time. The firehose stream name is based on the schema name.
    • A new stream name is only generated when it is longer than 64 characters. In other words, most firehose stream names will remain the same.
    • For stream names longer than 64 characters, the first 58 characters remain unchanged and the remaining characters are hashed to 8 characters with hashlib.md5(). hashlib.md5() returns a 32-character (128-bit) hex digest, and we take only the first 8 characters. Since few stream names approach the limit, hash collisions are not a practical concern.
  • Remove _streamalert_ (13 chars) from the firehose stream name. Firehose restricts stream names to at most 64 characters. We want schema names to be descriptive, which makes the limit easy to hit, especially once the prefix is added.
  • Use var.glue_catalog_table_name instead of var.log_name in the S3 file prefix, since the latter may be changed if it is too long.
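The truncate-and-hash scheme described above can be sketched as follows. The function name and the exact cut point here are illustrative; only the 64-character AWS limit and the 8-character md5 suffix come from this PR.

```python
import hashlib

MAX_LEN = 64   # AWS limit on Firehose delivery stream names
HASH_LEN = 8   # hex characters of the md5 digest kept as a suffix

def hashed_stream_name(name):
    """Return `name` unchanged if it fits; otherwise keep a leading
    portion and append the first 8 hex chars of md5(name), so the
    result is exactly 64 characters and deterministic per input."""
    if len(name) <= MAX_LEN:
        return name
    digest = hashlib.md5(name.encode()).hexdigest()  # 32 hex chars (128 bits)
    return name[:MAX_LEN - HASH_LEN] + digest[:HASH_LEN]
```

Short names like osquery_differential pass through untouched, so existing streams keep their names; only names past the limit get rewritten.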

Testing

  • Add unit test cases to cover the code change.
  • Tested in a staging environment with data matching both a short and an extremely long schema name, e.g. osquery_differential and cloudwatch:events.dots-test.crazy.long_name_yay_hahahahaha:
    • The classifier can process data and send it to both firehoses.
    • The rules engine can trigger alerts from both types of data.
    • The alert processor can send alerts to firehose correctly.
    • Athena partitioning works for both types of data.
    • Both types of data and alerts are searchable in Athena in parquet format.

@coveralls

coveralls commented Mar 14, 2020

Coverage Status

Coverage increased (+0.01%) to 95.434% when pulling 4011cff on firehose_log_stream_name_length into c4d2c12 on release-3-1-0.

@chunyong-lin chunyong-lin changed the title [core] Hash firehose stream name it is too long [core] Hash firehose stream name if it is too long Mar 14, 2020
@chunyong-lin chunyong-lin added this to the 3.1.0 milestone Mar 14, 2020
@@ -52,19 +53,32 @@ class FirehoseClient:
MAX_RECORD_SIZE = 1000 * 1000 - 2

# Default firehose name format, should be formatted with deployment prefix
- DEFAULT_FIREHOSE_FMT = '{}streamalert_data_{}'
+ DEFAULT_FIREHOSE_FMT = '{}data_{}'
Contributor

IMO I'd rather keep the streamalert portion of this and drop the data bit.. these are Firehose Data Delivery Streams and including data is pretty redundant. at least having streamalert in the name will help with things like filtering streams in console, etc

Contributor

I mentioned this before, but I recommend explicitly stating the firehose names in the conf/ directory, and explicitly populating them during the generate step. I imagine the workflow is like:

  • Look in conf directory for explicit firehose name
  • If no explicit name, use use_prefix and the log type
  • During manage.py generate, explicitly populate into Terraform json files
  • Instead of generating them using locals {$prefix}_data_{$hash}, just use the explicit name
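For illustration only, an optional explicit name in one of the conf/ files might look like this (the key names and file layout are hypothetical, not something this PR implements):

```json
{
  "osquery:differential": {
    "configuration": {
      "firehose_name": "acme_osquery_differential"
    }
  }
}
```

With a lookup like that, the generate step would fall back to the use_prefix + log type naming only when no explicit name is present.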

Contributor Author

@Ryxias quick question.

Look in conf directory for explicit firehose name

Potentially, we may have hundreds of firehoses running, since each log schema gets its own firehose. The firehose conf would become quite long, and reading all firehose names from conf/global.json adds complexity. IMO, I would take @ryandeivert's suggestion. Thoughts?

Contributor

I do like Derek's idea, I'm just unsure where this would live.. in global.json? how do we map from logs.json to this reliably... or does it live in logs.json?? any suggestions for how this would actually look?

Contributor

I also still insist we drop _data_, even with the hashing route. really not sure why we're hung up on keeping it...

Contributor

@Ryxias again I'm not opposed to your way, I just want to hear more about "how" this will live in the config. does it go in logs.json or the firehose config of global.json?

Contributor

@Ryxias Ryxias Mar 16, 2020

We can have it in conf/logs.json, or conf/schemas/*.json or conf/firehose.json. I don't really care either way.

I think the important one is that it's OPTIONAL. In most cases, you should not explicitly specify your firehoses. But they should be explicitly stated somewhere. I don't care much about the conf/****.json one, but I am very strongly in favor of explicitly generating them in the terraform/***.tf.json files, instead of dynamically creating them inside of the Terraform modules.

Contributor

@Ryxias Ryxias Mar 16, 2020

It would mean your manage.py generate step would just have a new function like:

def get_firehose_name(log_type):
    # Explicitly stated
    explicit_name = config['whatever'].get(log_type, {}).get('firehose_name')
    if explicit_name:
        return explicit_name

    # Generated
    use_prefix = config['global']['whatever']['use_prefix']

    firehose_name = '{}{}_{}'.format(
        '{}_'.format(prefix) if use_prefix else '',
        'streamalert_data',  # Or whatever
        log_schema
    )

    if len(firehose_name) > 64:
        # Truncate streamalert_data ...
        pass

    if len(firehose_name) > 64:
        # then do some hashing stuff...
        pass

    return firehose_name

And the above would set the real firehose name into the **.tf.json files

Contributor

I'm drawing attention to the where on this because I think we're forgetting how this is currently performed. data retention is optional and does not need to be enabled.. but enabling means toggling it in global.json. having some settings in global.json and others in logs.json is confusing.. but on the other hand, having the name for an optional feature embedded in logs.json and other settings in global.json is also confusing. we're sorta asking a lot of users here to mind-map all of this together

Contributor Author

Wrapping up this thread: I took Ryan's suggestion and addressed it in this commit. We will support custom firehose stream names later, and I have opened issue #1193 to track the future work.

@@ -298,6 +312,46 @@ def firehose_log_name(cls, log_name):
"""
return re.sub(cls.SPECIAL_CHAR_REGEX, cls.SPECIAL_CHAR_SUB, log_name)

@classmethod
def generate_firehose_stream_name(cls, use_prefix, prefix, log_stream_name):
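For context, the firehose_log_name method above sanitizes a log name via a class-level regex substitution. A minimal standalone sketch, assuming SPECIAL_CHAR_REGEX matches any non-word character and SPECIAL_CHAR_SUB is an underscore (the real constants live on FirehoseClient and may differ):

```python
import re

# Assumed stand-ins for the class constants referenced above
SPECIAL_CHAR_REGEX = re.compile(r'\W')  # anything outside [a-zA-Z0-9_]
SPECIAL_CHAR_SUB = '_'

def firehose_log_name(log_name):
    """Replace characters Firehose does not allow in stream names."""
    return re.sub(SPECIAL_CHAR_REGEX, SPECIAL_CHAR_SUB, log_name)
```

This is how a schema name like cloudwatch:events.dots-test becomes a legal stream-name component before any length handling kicks in.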
Contributor

This method's name seems to imply that it generates the full firehose name, but that's actually not true, right? It generates the prefix-less firehose name.

I suggest renaming the function to generate_firehose_suffix to reduce confusion.

Contributor Author

Good point! Yes, you are right, the returned result doesn't include the prefix. Will rename this method.

Contributor

@Ryxias Ryxias left a comment

Thank you for the changes. LGTM

@chunyong-lin chunyong-lin merged commit 666b4cd into release-3-1-0 Mar 17, 2020
@chunyong-lin chunyong-lin deleted the firehose_log_stream_name_length branch March 17, 2020 15:41