HADOOP-18257. Merging and Parsing S3A audit logs into Avro format for analysis. #6000
base: trunk
Conversation
💔 -1 overall
This message was automatically generated.
Looking good. Some minor comments.
Have you tested this on actual files? Specifically at scale: many files, with a total size in the GBs?
I don't see any memory issues as data gets written after being processed.
If there are many files, e.g. 1000, does it launch multiple mappers, with each mapper processing x files based on the splits?
private static final int INVALID_ARGUMENT = EXIT_COMMAND_ARGUMENT_ERROR;

private static final String USAGE =
    "bin/hadoop " + "Class" + " DestinationPath" + " SourcePath" + "\n" +
Give an example of a command. Why bin/hadoop twice?
This is essentially defining the command verbosely rather than via an example. This is what we'll see:
❯ bin/hadoop org.apache.hadoop.fs.s3a.audit.AuditTool
bin/hadoop Class DestinationPath SourcePath
bin/hadoop org.apache.hadoop.fs.s3a.audit.AuditTool s3a://BUCKET s3a://BUCKET
I'll simplify this; it may be confusing.
Okay, change one BUCKET to source_bucket and the other to destination_bucket.
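A simplified usage string along those lines might look like the following sketch; the class name, constant value, and bucket paths are illustrative, not the actual patch:

```java
// Hypothetical sketch of the simplified USAGE string suggested in review.
// AUDIT_TOOL value and the example bucket paths are assumptions.
public class UsageSketch {
  static final String AUDIT_TOOL = "org.apache.hadoop.fs.s3a.audit.AuditTool";
  static final String USAGE =
      "hadoop " + AUDIT_TOOL + " DestinationPath SourcePath\n"
          + "e.g. hadoop " + AUDIT_TOOL
          + " s3a://destination_bucket/AuditLogsDestination"
          + " s3a://source_bucket/AuditLogsSource";

  public static void main(String[] args) {
    // Print the usage text, verbose form followed by a concrete example.
    System.out.println(USAGE);
  }
}
```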
try {
  uri = new URI(s3Path);
} catch (URISyntaxException e) {
  throw invalidArgs("Not a valid fileystem path: %s", s3Path);
this is not an invalidArgs exception.
Hmm, interesting. I don't see this being used anywhere; maybe it is stale code from the last PR. I'll remove this.
Have tested this but not at scale. Will do that.
Since we're reading each file serially, line by line, I would assume this would be a lot slower in that scenario. Optimisation can be a follow-up patch.
Not currently. Is that something we would have to write the logic for? I'll have to check the code. Specifically for the number of mappers, maybe we could have a threshold on the number of files and then paginate them based on that.
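The threshold-and-paginate idea from this thread could be sketched roughly as follows; the class name, method, and per-mapper threshold are illustrative assumptions, not code from the patch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching idea discussed above: given N input log files and a
// per-mapper threshold, split the file list into batches, one per mapper.
public class FileBatcher {
  static List<List<String>> batch(List<String> files, int perMapper) {
    List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < files.size(); i += perMapper) {
      // Each batch holds at most perMapper files; the last one may be smaller.
      batches.add(new ArrayList<>(
          files.subList(i, Math.min(i + perMapper, files.size()))));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> files = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      files.add("s3a://bucket/logs/file-" + i + ".log");
    }
    // 10 files at 4 per mapper -> batches of 4, 4, and 2
    System.out.println(batch(files, 4).size()); // prints 3
  }
}
```

In the real tool the equivalent decision is normally made by the input format's split calculation rather than hand-rolled batching.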
💔 -1 overall
This message was automatically generated.
I'm going to propose having a small log file in test/resources so that the test can actually take a larger file and work through it, with a broader set of requests. A few KB of test run logs is enough.
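As a sketch of that test setup: a couple of sample audit log lines, kept small, that the test iterates over. The example writes the sample to a temp file to stay self-contained; in the real test the file would live under src/test/resources (the directory name and log line shape here are assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: a tiny sample audit log a test can work through line by line.
public class SampleLogResource {
  // Count the non-blank entries in a log file.
  static long countEntries(Path log) throws IOException {
    return Files.readAllLines(log).stream()
        .filter(line -> !line.isBlank())
        .count();
  }

  public static void main(String[] args) throws IOException {
    Path sample = Files.createTempFile("sample-audit", ".log");
    // Illustrative log lines only; real S3 access logs have many more fields.
    Files.write(sample, List.of(
        "owner bucket [12/Aug/2023:10:00:00 +0000] 1.2.3.4 REST.GET.OBJECT key",
        "owner bucket [12/Aug/2023:10:00:01 +0000] 1.2.3.4 REST.PUT.OBJECT key"));
    System.out.println(countEntries(sample)); // prints 2
    Files.delete(sample);
  }
}
```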
 * Its functionality is to parse the audit log files
 * and generate avro file.
 */
public class AuditTool extends Configured implements Tool, Closeable {
What about making this something the hadoop s3guard tool can invoke?
I think we initially went with that, but then changed it to a standalone audit log tool; not quite sure why we didn't go that route. Were there any plans to remove the s3guard tool in the future? We would have to separate it out then.
private static final int INVALID_ARGUMENT = EXIT_COMMAND_ARGUMENT_ERROR;

private static final String USAGE =
    "bin/hadoop " + AUDIT_TOOL +
Cut the "bin/"; that's only because we use it in our local builds. Real deployments have hadoop on the path.
public void testParseAuditLogEmptyAndNull() {
  Map<String, String> parseAuditLogResultEmpty =
      s3AAuditLogMergerAndParser.parseAuditLog("");
  assertTrue("the returned list should be empty for this test",
Use AssertJ asserts, as they will include the map in the failure message if an assertion fails.
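For illustration, the benefit of a map-aware assert is that the failure message carries the actual map contents; AssertJ's assertThat(map).isEmpty() does this out of the box. This plain-Java stand-in (names are illustrative) shows the equivalent behaviour without the third-party dependency:

```java
import java.util.Map;

// Demonstrates a failure message that includes the offending map contents,
// which is what AssertJ's map assertions provide automatically.
public class MapAssertDemo {
  static void assertEmpty(Map<String, String> map) {
    if (!map.isEmpty()) {
      // The map itself is embedded in the error, so a CI log shows
      // exactly which entries were unexpectedly parsed.
      throw new AssertionError("expected empty map but was: " + map);
    }
  }

  public static void main(String[] args) {
    assertEmpty(Map.of()); // passes silently
    try {
      assertEmpty(Map.of("key", "GET /bucket/object"));
    } catch (AssertionError e) {
      System.out.println(e.getMessage()); // message includes the map entries
    }
  }
}
```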
s3AAuditLogMergerAndParser.mergeAndParseAuditLogFiles(fileSystem,
    logsPath, destPath);
assertTrue("the result should be true", mergeAndParseResult);
}
Should explain what "the result" is; better: "the merge and parse failed".
- L250 and 251 are duplicates
- comment doesn't seem to match the assert
- propose: cut the comment, and state in assertEquals what you are checking: "count of header logs parsed across both files"
+1 on assert message.
Looks like this is completely serial now, but you can treat parallelism as a follow-up and maybe add support for it in the future once this gets used. Just create a JIRA for now.
💔 -1 overall
This message was automatically generated.
Addressed the review comments. CC: @mukund-thakur @steveloughran
Minor test tweak; the main blocker is getting rat to stop complaining about licenses.
You need to add an excludes entry in the apache-rat-plugin in the hadoop-main pom;
if you name the logs with a .log suffix then you can use a wildcard, hadoop-tools/hadoop-aws/src/test/resources/TestAuditLogs/*.log, for future expansion.
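The excludes entry might look roughly like this fragment; the surrounding apache-rat-plugin configuration is assumed, not copied from the hadoop-main pom:

```xml
<!-- Hypothetical sketch of the rat excludes entry; only the <exclude>
     pattern comes from the review comment above. -->
<plugin>
  <groupId>org.apache.rat</groupId>
  <artifactId>apache-rat-plugin</artifactId>
  <configuration>
    <excludes>
      <exclude>hadoop-tools/hadoop-aws/src/test/resources/TestAuditLogs/*.log</exclude>
    </excludes>
  </configuration>
</plugin>
```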
Looks good overall. Let's fix the Yetus issues and we are good to merge.
💔 -1 overall
This message was automatically generated.
/patch-unit-root.txt says the build was a success but Yetus still gave a -1. Going to put an empty patch up for Yetus again.
💔 -1 overall
This message was automatically generated.
Description of PR
This is a follow-up to PR #4383, with most of the code from that PR already in place.
Adds support for an audit tool to merge and parse audit logs into an Avro file.
How was this patch tested?
mvn clean verify -Dparallel-tests -DtestsThreadCount=4 -Dscale
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?