Max Log Length and S3 Object Path #89
-
Hi Substation team, I am using Substation to pipe and process CloudTrail logs stored in S3 buckets. Some of the log files are not newline delimited and can be quite large (the compressed JSON files can be up to 50 MB). It seems that Substation uses the default bufio scanner buffer size, which is 64 KB, and truncates logs that exceed it. Is there a configuration I can change to support long log lines? Also, the source S3 objects use a UUID as the key prefix. Do you support writing the processed logs under the same object key? It would be useful to have a direct 1:1 mapping so that we can compare them. Thanks in advance.
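For reference, a minimal sketch of the limit I think I'm hitting, assuming the files are read with a default bufio.Scanner: the 64 KB cap is bufio.MaxScanTokenSize, and a line longer than that stops the scan with ErrTooLong instead of being read.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// A single "line" larger than the default 64 KB token limit
	// (bufio.MaxScanTokenSize) stands in for a large, non-newline-delimited
	// CloudTrail object.
	long := strings.Repeat("a", 70*1024)

	s := bufio.NewScanner(strings.NewReader(long))
	for s.Scan() {
		fmt.Println("read", len(s.Bytes()), "bytes")
	}
	// With the default buffer the loop never runs; Err reports
	// "bufio.Scanner: token too long".
	fmt.Println("err:", s.Err())
}
```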
-
Hi @Bin-security 👋
That's correct, by default the project uses the same configuration as the bufio package. We use environment variables to control runtime settings; the variable you'll need to configure is SUBSTATION_SCAN_CAPACITY. Internally at Brex we use a scan capacity of 256000000 (256 MB) for CloudTrail logs. One thing to keep in mind is that this scan capacity affects how much memory the Lambda uses, so you'll need to increase that as well, otherwise you'll see out-of-memory errors.
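As a rough sketch of the mechanism (not Substation's actual implementation), this shows how a capacity read from SUBSTATION_SCAN_CAPACITY could be applied to a bufio.Scanner via Scanner.Buffer to allow tokens larger than the 64 KB default; the newScanner helper below is illustrative only.

```go
package main

import (
	"bufio"
	"io"
	"os"
	"strconv"
)

// newScanner is an illustrative helper, not Substation's actual code: it
// sizes a bufio.Scanner from the SUBSTATION_SCAN_CAPACITY environment
// variable, falling back to the bufio default (64 KB) when unset.
func newScanner(r io.Reader) *bufio.Scanner {
	capacity := bufio.MaxScanTokenSize // 64 * 1024
	if v := os.Getenv("SUBSTATION_SCAN_CAPACITY"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > capacity {
			capacity = n
		}
	}

	s := bufio.NewScanner(r)
	// Buffer raises the maximum token size; e.g. 256000000 allows lines up to
	// ~256 MB, at the cost of the Lambda needing enough memory to hold them.
	s.Buffer(make([]byte, 0, 64*1024), capacity)
	return s
}

func main() {
	s := newScanner(os.Stdin)
	for s.Scan() {
		_ = s.Bytes() // process each line
	}
}
```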
Right now the object path is dynamic but follows this convention: […]
-
See #91 for more details on configurable object paths for the AWS S3 sink.