
Tags in Object Key #31

Closed
sgessa opened this issue Jul 11, 2013 · 21 comments

@sgessa

sgessa commented Jul 11, 2013

is it possible to have tags in the object key?

@repeatedly
Member

Currently no.
Do you want to specify %{tag} in s3_object_key_format?

@sgessa
Author

sgessa commented Jul 11, 2013

Yes, please! How can I achieve this? I could implement it locally, but I just started playing with fluentd and plugins.

@repeatedly
Member

How would this be implemented?

We can't assume a single tag in the s3 plugin because each event has its own tag.
For example:

<match foo.**>
  type s3
  # ...
</match>

In this case, events matched by the s3 plugin may have the tags foo.bar, foo.baz, and so on.

@sgessa
Author

sgessa commented Jul 11, 2013

Yep, I just want to add ${tag} to the object key like this:

<match foo.**>
  type s3
  s3_object_key_format %{time_slice}_${tag}_%{index}.%{file_extension}
  # ...
</match>

If an event has the tag foo.bar, for example, I expect to find it in the object key.
Also, if I want to keep only "bar", I should be able to set remove_tag_prefix foo.

Thanks

@repeatedly
Member

Hm.

In your approach, the S3 plugin stores multiple objects into S3 at the same time, right?

@sgessa
Author

sgessa commented Jul 11, 2013

Yes. I need %{tag} because I'm storing access logs grouped by domain, and I'm passing the domain name in the tag.

@repeatedly
Member

Okay. Could you send a pull request?
Maybe error handling and breaking idempotence are important factors.

@sgessa
Author

sgessa commented Jul 11, 2013

I don't know how to implement this; that's why I asked here :(
I started playing with fluentd yesterday :D

@repeatedly
Member

I see. We will need some time to implement it.

@dave7373

There is another plugin that adds this feature to the s3 plugin. Please check it out here:
https://github.com/campanja/fluent-output-router

@jsermeno

We just began using fluentd in production. Right now we're using the plugin that dave7373 mentioned to store logs for each event in a different folder. It's working, although I would like to explore whether there is a more efficient way. The fluent-output-router starts a new fluent-plugin-s3 for every event. This creates a lot of threads if you have a lot of events. Is it because fluentd has a single buffer queue structure that new outputs must be instantiated if you want separate chunks for each event?

In your approach, S3 plugin stores multiple objects into S3 at the same time, right?

In the approach you discussed above, did you mean that in the write method you would split the chunk into separate pieces based on tag and then write each sub-chunk to a different S3 file? The only problem I see here is that you may get very small S3 files if an event only occurs a few times within a chunk, whereas with an individual chunk for each event this would occur less often. Maybe that is not a problem and can be mitigated by making the chunk size larger? Perhaps it is also more efficient than creating a new output for each event. Are there downsides to making the chunk size larger? According to the documentation, the default chunk size is 8m.

I should also mention that I would love to work on the implementation once we agree on the best solution.

Thanks!
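For illustration, here is a minimal plain-Ruby sketch of the split-by-tag idea being discussed for the write method. This is not fluentd or fluent-plugin-s3 code; the event structure and the commented store/key_for helpers are made up for the example:

```ruby
require 'json'

# Hypothetical sketch: group a chunk's events by tag, then upload each
# group as a separate S3 object (one PUT per tag).
def split_by_tag(events) # events: array of [tag, record] pairs
  events.group_by { |tag, _record| tag }
end

events = [
  ['foo.bar', { 'code' => 200 }],
  ['foo.baz', { 'code' => 404 }],
  ['foo.bar', { 'code' => 500 }],
]

split_by_tag(events).each do |tag, group|
  body = group.map { |_tag, record| record.to_json }.join("\n")
  # A real plugin would do one S3 PUT per tag here, e.g.
  # store(key_for(tag), body) -- a partial failure across these
  # multiple PUTs is exactly what breaks idempotence, as noted above.
  puts "#{tag}: #{group.size} records"
end
```

The multiple-PUT loop is where the retry concern raised earlier in the thread comes in: retrying the whole chunk after one failed PUT would re-upload the tags that already succeeded.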

@repeatedly
Member

@jsermeno

We just began using fluentd in production.

Coool 👍

The fluent-output-router starts a new fluent-plugin-s3 for every event

The forest and router plugins create a new output when they receive a new tag, not for every event, so the number of outputs/threads doesn't explode in most cases.

The only problem I see here is that you may get very small S3 files if an event only occurs a few times within a chunk.

Hmm... my concern is error handling.
The S3 plugin and forest-based tag separation use fluentd's retry mechanism when an error occurs.

On the other hand, if we support tag separation in the S3 plugin, we would have to implement our own retry mechanism similar to fluentd's, because tag separation often executes multiple requests to S3. I already mentioned this point:

"Maybe error handling and breaking idempotence are important factors."

Maintaining a duplicated retry feature seems like a high cost for not many advantages, I think.

@jsermeno

The forest and router plugins create a new output when they receive a new tag, not for every event, so the number of outputs/threads doesn't explode in most cases.

Oops, sorry, I did mean a new tag.

Hmm... my concern is error handling.
The S3 plugin and forest-based tag separation use fluentd's retry mechanism when an error occurs.

I see. Do you believe this optimization would be better suited to become part of fluentd itself? Perhaps there could be a configuration option that limits the number of threads somehow. Scribe, for example, has a configuration option to prevent creating a new thread for each category/tag.

Maintaining a duplicated retry feature seems like a high cost for not many advantages, I think.

The cost does seem larger than I initially thought. There are many advantages, though. A number of use cases require a high number of tags, particularly when handling multiple applications. The number of tags in our case could easily exceed 1000 in the near future and could grow larger; we are already at several hundred. The main benefit I see in storing that many tags in separate folders is that if you want to perform analytics on a small subset of events, you do not have to open every file to search for them, which can potentially speed up queries by quite a bit.

@ryanc4

ryanc4 commented Nov 2, 2013

Can we follow the same approach as in this plugin?

https://github.com/fluent/fluent-plugin-mongo/blob/master/lib/fluent/plugin/out_mongo.rb#L93

@repeatedly
Member

Sorry for the late reply.

@jsermeno

Scribe, for example, has a configuration option to prevent creating a new thread for each category/tag.

This is interesting. I will check Scribe source code later.

@ryanc4

Currently no, because the S3 plugin already uses the same approach to separate records by event time.
For most users, forest plus the S3 plugin is enough.
But for jsermeno's case above, we need a better-performing option.

@ryanc4

ryanc4 commented Nov 6, 2013

@repeatedly I don't see the s3 plugin using emit to split by tag. I think allowing splitting by tag would let us do log analysis in S3 (with EMR) more quickly.

@repeatedly
Member

@ryanc4 The S3 plugin itself doesn't extend emit; TimeSlicedOutput, the superclass of the S3 plugin, sets the time-slice string as the key in emit. To support a tag-based key in the S3 plugin, we would need to extend TimeSlicedOutput#emit. The forest plugin is probably the better choice for now unless the user has a special reason.
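To make the keying idea concrete, here is a self-contained sketch in plain Ruby. It does not use fluentd internals; chunk_key and the event layout are invented for the example. It shows the extension being discussed: keying buffered chunks by time slice plus tag instead of time slice alone, so each (slice, tag) pair becomes its own chunk:

```ruby
require 'time'

# Build a buffer key from the event time and tag. TimeSlicedOutput keys
# chunks by the time slice alone; appending the tag gives one chunk per
# (time slice, tag) pair.
def chunk_key(unix_time, tag, slice_format = '%Y%m%d%H')
  "#{Time.at(unix_time).utc.strftime(slice_format)}_#{tag}"
end

buffers = Hash.new { |h, k| h[k] = [] }

events = [
  [Time.utc(2013, 7, 11, 10, 5).to_i,  'foo.bar', { 'msg' => 'a' }],
  [Time.utc(2013, 7, 11, 10, 30).to_i, 'foo.baz', { 'msg' => 'b' }],
  [Time.utc(2013, 7, 11, 11, 0).to_i,  'foo.bar', { 'msg' => 'c' }],
]

events.each do |time, tag, record|
  buffers[chunk_key(time, tag)] << record
end

# buffers now holds three separate chunks:
# "2013071110_foo.bar", "2013071110_foo.baz", "2013071111_foo.bar"
```

Each of those chunks would then map to its own S3 object, which is why flushing can require multiple S3 requests per time slice.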

@repeatedly
Member

@jsermeno I checked Scribe's newThreadPerCategory and now I understand Scribe's buffer and thread management. I will think about implementing the same feature on top of fluentd.

@dieend
Contributor

dieend commented Oct 21, 2014

Is there any update on using tags in the object key?

@repeatedly
Member

You can use fluent-plugin-forest to achieve this: https://github.com/tagomoris/fluent-plugin-forest
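A minimal sketch of such a configuration, based on the forest plugin's documented pattern of wrapping an output type in a <template> section where ${tag} is expanded per tag. The bucket name and path are placeholders, and the exact s3 parameter names should be checked against the s3 plugin version in use:

```
<match foo.**>
  type forest
  subtype s3
  <template>
    s3_bucket my-log-bucket
    path logs/${tag}/
    # ... other fluent-plugin-s3 options ...
  </template>
</match>
```

With this, forest instantiates one s3 output per distinct tag, so events tagged foo.bar and foo.baz end up under different paths without any change to the s3 plugin itself.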

@prtk-ngm

Please provide an example of how to integrate the forest plugin with the s3 plugin to get dynamic tag support in the path.
