Add routing rules for cloudfront logs, elb logs and s3access logs #7932

Merged
merged 6 commits into from Oct 10, 2023

Contributor

@kaiyan-sheng kaiyan-sheng commented Sep 22, 2023

What does this PR do?

CloudFront logs, ELB logs and S3 access logs all require a lambda function to send them from an S3 bucket to Firehose. This PR defines basic routing rules for these logs so that they are sent to the right data streams.

How to route these three log formats?

Combine a regex with a check on the number of fields to define the routing rules.
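
A minimal sketch of the field-count half of this approach, written in Python rather than the pipeline's Painless script so it can be run standalone. The name count_fields is hypothetical, and treating a double-quoted run as a single field is an assumption based on the insideQuotes flag in the diff excerpt later in this conversation.

```python
def count_fields(message: str) -> int:
    """Count whitespace-separated fields, keeping a double-quoted run as one field."""
    count = 0
    in_quotes = False   # True while scanning between a pair of double quotes
    in_token = False    # True while scanning the characters of a field
    for ch in message:
        if ch == '"':
            # A quote toggles quoted mode and always belongs to a field.
            in_quotes = not in_quotes
            in_token = True
        elif ch.isspace() and not in_quotes:
            # Whitespace outside quotes closes the current field, if any.
            if in_token:
                count += 1
                in_token = False
        else:
            in_token = True
    if in_token:
        count += 1  # the last field has no trailing whitespace
    return count


print(count_fields('a "b c" d'))  # prints 3
```

The count alone is ambiguous across formats, which is why each rule below pairs it with a pattern check.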

CloudFront logs

Define a regular expression that checks whether the log starts with a prefix such as 2019-12-04 21:02:31 LAX1 392 89.160.20.112 ...

if: >-        
    ctx.message != null && ctx.message =~ /^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s[a-zA-Z0-9-]+\s\d+\s(\d+\.\d+\.\d+\.\d+|[a-fA-F0-9:]+)/

A CloudFront log contains 33 fields; see Standard log file fields in the Amazon CloudFront documentation for more details.
Sample log:

2022-04-19 12:29:36 SEA19-C2 10157 81.2.69.143 POST d111111abcdef8.cloudfront.net /getApplications 200 https://test.com/global Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/100.0.4896.127%20Safari/537.36 source=global - Miss hrsHM5OM6sTIXUleC1G20YtDxMf5Cq0Jbz0pwhVpod2kgEn_W6akCQ== test.com https 1057 0.238 - TLSv1.3 TLS_AES_128_GCM_SHA256 Miss HTTP/2.0 - - 4203 0.238 Miss application/json;charset=UTF-8 - - -
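
For illustration, the CloudFront rule can be approximated in Python as below. This is a sketch, not the merged Painless script: the function name looks_like_cloudfront is hypothetical, the prefix pattern is copied from the condition above, and a plain whitespace split is assumed to be enough because the sample log has no quoted fields.

```python
import re

# Prefix pattern copied from the routing condition: date, time, edge
# location, a numeric field, then an IPv4 or IPv6 client address.
CLOUDFRONT_PREFIX = re.compile(
    r"^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s[a-zA-Z0-9-]+\s\d+\s"
    r"(\d+\.\d+\.\d+\.\d+|[a-fA-F0-9:]+)"
)
CLOUDFRONT_FIELD_COUNT = 33  # date and time count as two separate fields


def looks_like_cloudfront(message):
    """Return True when both the field count and the prefix pattern match."""
    if message is None:
        return False
    # Assumption: fields never contain whitespace, so split() is sufficient.
    if len(message.split()) != CLOUDFRONT_FIELD_COUNT:
        return False
    # search() with a leading ^ mirrors a "find at start" style match.
    return CLOUDFRONT_PREFIX.search(message) is not None
```

Calling looks_like_cloudfront on the sample log above returns True under these assumptions; a line with the right prefix but a different field count is rejected.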

ELB logs

Classic Load Balancer: timestamp elb client:port backend:port ...
Application Load Balancer: type timestamp elb client:port target:port ...
Network Load Balancer: type version timestamp elb listener client:port destination:port...
common part: "client:port destination:port"

if: >-
   (ctx.message != null && ctx.message =~ /.*\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5})|([0-9a-fA-F:.]+:\d{1,5})\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5})|([0-9a-fA-F:.]+:\d{1,5})\s-?\d+(\.\d+)?\s/)

Classic Load Balancer: 15 fields
Application Load Balancer: 29 fields
Network Load Balancer: 22 fields

For example, an Application Load Balancer log:

http 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 \"GET http://www.example.com:80/ HTTP/1.1\" \"curl/7.46.0\" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 \"Root=1-58337262-36d228ad5d99923122bbe354\" \"-\" \"-\" 0 2018-07-02T22:22:48.364000Z \"forward\" \"-\" \"-\" \"10.0.0.1:80\" \"200\" \"-\" \"-\"
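
A Python sketch of the same idea for ELB logs, under a few stated assumptions. Each address alternative is wrapped in its own group so the pattern reads as "client:port, then destination:port, then a number"; that grouping is an assumption about the intent, not a copy of the condition above. The names ADDR_PORT, ELB_PAIR and looks_like_elb are hypothetical, and the escaped quotes in the sample log are assumed to be JSON escaping, so the sketch expects plain double quotes.

```python
import re

# An IPv4 or IPv6-style address followed by :port (hypothetical helper pattern).
ADDR_PORT = r"(?:\d{1,3}(?:\.\d{1,3}){3}|[0-9a-fA-F:.]+):\d{1,5}"

# client:port, destination:port, then a (possibly negative) number.
ELB_PAIR = re.compile(
    r"\s" + ADDR_PORT + r"\s" + ADDR_PORT + r"\s-?\d+(?:\.\d+)?\s"
)

# Field counts quoted in this PR: classic, application, network.
ELB_FIELD_COUNTS = {15, 29, 22}

# A field is either a double-quoted run or a run of non-whitespace.
TOKEN = re.compile(r'"[^"]*"|\S+')


def looks_like_elb(message):
    """Return True when the field count and the address pair both match."""
    if message is None:
        return False
    if len(TOKEN.findall(message)) not in ELB_FIELD_COUNTS:
        return False
    return ELB_PAIR.search(message) is not None
```

With the Application Load Balancer sample above (quotes unescaped), the tokenizer yields 29 fields and the address pair matches on 192.168.131.39:2817 10.0.0.1:80.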

S3 access logs

An S3 access log always has 25 fields in total. For example:

36c1f05b76016b78528454e6e0c60e2b7ff7aa20c0a5e4c748276e5b0a2debd2 test-s3-ks [01/Aug/2019:00:24:41 +0000] 89.160.20.156 arn:aws:sts::123456:assumed-role/AWSServiceRoleForTrustedAdvisor/TrustedAdvisor_627959692251_784ab70b-8cc9-4d37-a2ec-2ff4d0c08af9 44EE8651683CB4DA REST.GET.LOCATION - "GET /test-s3-ks/?location&aws-account=627959692251 HTTP/1.1" 200 - 142 - 17 - "-" "AWS-Support-TrustedAdvisor, aws-internal/3 aws-sdk-java/1.11.590 Linux/4.9.137-0.1.ac.218.74.329.metal1.x86_64 OpenJDK_64-Bit_Server_VM/25.212-b03 java/1.8.0_212 vendor/Oracle_Corporation" - BsCfJedfuSnds2QFoxi+E/O7M6OEWzJnw4dUaes/2hyA363sONRJKzB7EOY+Bt9DTHYUn+HoHxI= SigV4 ECDHE-RSA-AES128-SHA AuthHeader s3.ap-southeast-1.amazonaws.com TLSv1.2

Field #24 is the host header, which is the endpoint used to connect to Amazon S3, for example s3.us-west-2.amazonaws.com. The endpoint always contains the keywords s3 and amazonaws.com.
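
The S3 check described here could look roughly like the following in Python. It is a sketch under assumptions: looks_like_s3_access is a hypothetical name, quoted runs count as one field, and the bracketed timestamp splits into two fields, which is what makes the sample total 25 with the host header at position 24.

```python
import re

# A field is either a double-quoted run or a run of non-whitespace.
TOKEN = re.compile(r'"[^"]*"|\S+')

S3_FIELD_COUNT = 25
HOST_HEADER_INDEX = 23  # field #24, zero-based


def looks_like_s3_access(message):
    """Return True for a 25-field line whose 24th field looks like an S3 endpoint."""
    if message is None:
        return False
    tokens = TOKEN.findall(message)
    if len(tokens) != S3_FIELD_COUNT:
        return False
    host = tokens[HOST_HEADER_INDEX]
    # Endpoints vary by region, so check for both keywords instead of
    # one fixed hostname.
    return "s3" in host and "amazonaws.com" in host
```

A line whose host header is logged as - is rejected by this sketch; whether such lines need a fallback rule is left open here.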

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

@elasticmachine

elasticmachine commented Sep 22, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-09-26T23:16:41.848+0000

  • Duration: 15 min 41 sec

Test stats 🧪

Test Results
Failed 0
Passed 12
Skipped 0
Total 12

@elasticmachine

elasticmachine commented Sep 22, 2023

🌐 Coverage report

Name         | Metrics % (covered/total) | Diff
Packages     | 100.0% (1/1)              | 💚
Files        | 100.0% (1/1)              | 💚
Classes      | 100.0% (1/1)              | 💚
Methods      | 60.0% (3/5)               | 👎 -32.593
Lines        | 100.0% (135/135)          | 💚 12.322
Conditionals | 100.0% (0/0)              | 💚

@kaiyan-sheng kaiyan-sheng marked this pull request as ready for review September 22, 2023 22:01
@kaiyan-sheng kaiyan-sheng requested a review from a team as a code owner September 22, 2023 22:01
@kaiyan-sheng kaiyan-sheng self-assigned this Sep 22, 2023
@kaiyan-sheng kaiyan-sheng changed the title [WIP] Add routing rules for cloudfront logs and elb logs Add routing rules for cloudfront logs and elb logs Sep 22, 2023
@kaiyan-sheng kaiyan-sheng changed the title Add routing rules for cloudfront logs and elb logs Add routing rules for cloudfront logs, elb logs and s3access logs Sep 22, 2023
Contributor

@zmoog zmoog left a comment

For cloudfront, s3 access logs and elb logs, we need to use a lambda function to send these logs from S3 bucket to Firehose. Maybe we can take advantage of the lambda function and add some information?!

Adding some routing keys in the lambda function seems a compelling option. Where can I learn more about this lambda function? Who's responsible for adding it, and do we have a reference implementation?

@kaiyan-sheng
Contributor Author

Adding some routing keys in the lambda function seems a compelling option. Where can I learn more about this lambda function? Who's responsible for adding it, and do we have a reference implementation?

Adding a lambda function is the next step for the Firehose integration. I have one written for testing, but it's not published in any documentation yet. The problem with adding info in the lambda function is that we would have to create the Firehose with that lambda function, and I'm not sure users are OK with that. Users can also write their own lambda, and if they use a customized one, we lose all the info. That's why in this PR I'm adding routing rules based purely on the log format.

@tommyers-elastic
Contributor

tommyers-elastic commented Sep 26, 2023

i'm not really comfortable with just a count of the number of fields. as more and more routing rules are added, it becomes ambiguous. a couple of further ideas:

  1. for cloudfront - i think we could get a reliable match looking at the types of a few fields in specific positions in the log. the types are well defined in the spec. we start with <date>\r<time>; the third field is a simple regex (^[a-zA-Z]{3}[0-9]+$); the 5th field is an IP address; the 6th an HTTP verb; the 9th is a 3-digit HTTP status code. i think if we combine a few of these with the count of 33 fields total, we are good to go.

  2. do ELB logs always contain the ARN? in which case isn't that sufficient to identify it?

  3. do s3 logs always contain the bucket URL? in which case is it always in a certain format (s3.amazonaws.com)?

in all three cases, combining some more simple checks on individual fields with the total number of fields probably adds a lot more weight to the confidence of the rule.
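
A rough Python sketch of the positional checks suggested for CloudFront, to make the idea concrete. The function and constant names are hypothetical, the verb list is an assumption, and the edge-location pattern is loosened to allow a suffix such as -C2 because the sample value in this PR (SEA19-C2) would not pass the stricter pattern quoted above.

```python
import ipaddress
import re

DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
TIME = re.compile(r"\d{2}:\d{2}:\d{2}")
# Assumption: three letters, digits, then an optional "-suffix".
EDGE_LOCATION = re.compile(r"[a-zA-Z]{3}[0-9]+(?:-[a-zA-Z0-9]+)?")
STATUS = re.compile(r"\d{3}")
# Assumption: the verbs considered valid for the 6th field.
HTTP_VERBS = {"GET", "HEAD", "POST", "PUT", "PATCH", "DELETE", "OPTIONS"}


def is_ip(value):
    """Return True for a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def cloudfront_field_checks(message):
    """Check the total field count plus the types of a few fixed positions."""
    fields = message.split()
    if len(fields) != 33:
        return False
    return all((
        DATE.fullmatch(fields[0]),           # 1st field: date
        TIME.fullmatch(fields[1]),           # 2nd field: time
        EDGE_LOCATION.fullmatch(fields[2]),  # 3rd field: edge location
        is_ip(fields[4]),                    # 5th field: client IP
        fields[5] in HTTP_VERBS,             # 6th field: HTTP verb
        STATUS.fullmatch(fields[8]),         # 9th field: 3-digit status
    ))
```

Checking typed positions this way trades one long regex for several short ones, which may be easier to debug when a rule misroutes a line.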

@tommyers-elastic
Contributor

i don't think we should rely on a custom lambda; many users may want to roll their own, including their own enrichment.

@kaiyan-sheng
Contributor Author

kaiyan-sheng commented Sep 26, 2023

  2. do ELB logs always contain the ARN? in which case isn't that sufficient to identify it?
    They do not, and ELB logs cover three kinds of logs: application load balancer, classic load balancer and network load balancer.
Classic Load Balancer: timestamp elb client:port backend:port ...
Application Load Balancer: type timestamp elb client:port target:port ...
Network Load Balancer: type version timestamp elb listener client:port destination:port...

From what I see, all three logs contain these fields: "client:port destination:port", so I was trying to add a regex to check for that.

  3. do s3 logs always contain the bucket URL? in which case is it always in a certain format (s3.amazonaws.com)?

For s3 access logs, any field can be set to - to indicate that the data was unknown or unavailable, or that the field was not applicable to this request. So there is no guarantee the host header field will exist. But we can definitely check that field to see if it contains amazonaws.com and s3 somewhere. It's not always s3.amazonaws.com because the s3 endpoint can differ: https://docs.aws.amazon.com/general/latest/gr/s3.html

@tommyers-elastic Thanks for the comment. I will add the regex back in and yes I agree to not rely on lambda.

insideQuotes = !insideQuotes;
}
}
if (tokenCount==33 && ctx.message =~ /^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s[a-zA-Z0-9-]+\s\d+\s(\d+\.\d+\.\d+\.\d+|[a-fA-F0-9:]+)/) {
Contributor

😭

it's a shame we can't make use of the builtin grok pattern matching for routing eh

Contributor Author

Yeah I was reading about this again and doesn't seem like this is possible. It's going to be hard to read/debug in the future. Hopefully, it doesn't get more complicated.

Contributor

@tommyers-elastic tommyers-elastic left a comment

+1 - but the overhead of these complex expressions is slightly worrying to me

we should do some benchmarking with and without routing

@kaiyan-sheng
Contributor Author

@tommyers-elastic I agree. I will merge this PR for now and I just created an issue about benchmarking to track that testing work. Thank you!

@kaiyan-sheng kaiyan-sheng merged commit 233c4b0 into elastic:main Oct 10, 2023
4 checks passed
@kaiyan-sheng kaiyan-sheng deleted the cloudfront_logs branch October 10, 2023 23:14
@elasticmachine

Package awsfirehose - 0.4.0 containing this change is available at https://epr.elastic.co/search?package=awsfirehose

4 participants