Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] Enable pure-Java plugins #7986

Closed
jordansissel opened this issue Aug 11, 2017 · 19 comments
Closed

[DISCUSS] Enable pure-Java plugins #7986

jordansissel opened this issue Aug 11, 2017 · 19 comments

Comments

@jordansissel
Copy link
Contributor

jordansissel commented Aug 11, 2017

Goal: Design the API for pure Java plugins

Code Status - all code is work-in-progress and subject to commentary :)

Design/Discussion issues:


WORK IN PROGRESS


Background: Logstash plugins still require at least some Ruby component(s) in order to load. We have several plugins now (beats, dissect, date, geoip, etc) that are written mostly in Java but still have some ruby parts.

Greenfield Opportunities:

  • ✅ Remove filter plugin flush concept
  • ❌ Merge filter and output plugin concepts into a unifying processor type
  • ✅ Provide major categories of settings (SSLContext, HTTPClient) that can be shared
  • ❔ Replace filter plugin filter_matched with something that is less a burden on the plugin author.
  • ❔ Simplify decorate concept for inputs/codecs
  • ✅ Move to rely on Java features where possible.

Key Objectives:

  • Plugin contains no Ruby code
  • Plugin tests contain no Ruby code
  • Stretch goal: Plugin build tools contain no Ruby code

Areas for design:

  • Plugin discovery and loading.
  • Plugin API.

Overview of Plugin API proposed:

This section is relevant to plugin authors.

Plugin Configuration

Today, we have the config class method on ruby Plugin classes:

class MyPlugin < Input
  config :foo, :validate => :string, ...
end

Ideally I would like to keep a similar model. The goal of the config method above is to make it simple for the author to define settings and have Logstash do the hard part of parsing, validating, and providing feedback to the user.

Accepted implementation (so far):

Have plugin instances accept a configuration object to configure themselves with.

This is what Tealess and Elasticsearch do.

Rejected implementation options:

  • Copy what Ingest Node does (Processor.Factory) (I don't like this, all validation/setup would be hand-coded by the author)
  • Factory pattern like PluginLoader.newInput(name, configuration)

Inputs

public interface Input {
  public void run(Consumer<Batch> consumer) {
    // Builds a Batch and calls consumer.accept(batch) when events are ready
    // Terminating conditions:
    //    * when Thread.isInterrupted() is true
    //    * when InterruptedException is received
    //    * if the input has a normal terminating condition (like Elasticsearch query is done, etc)
  }
}

filters and outputs

public interface Processor {
  public process(Batch batch) {
    // must abort if the thread is interrupted.
  }
}

optional new concept: triggers

Watcher has triggers that start inputs. Currently there is only scheduled triggers.

  • It is common for our input plugins to support cron schedules (jdbc, http_poller, etc)
  • other activity could be a trigger, such as an output plugin could trigger a job on an input.

I don't want to go into much detail here. This is optional and best discussed in a separate issue, but if we do want to explore this, I want the new API to be open to it.


(other notes here)

Base plugins

I would like to remove this concept. Currently we rely somewhat on subclassing, and maybe we don't' need this anymore.

Existing key Ruby methods:

  • Plugin#close - optional method for a plugin author to use to perform any tasks during shutdown.
  • Plugin#reloadable? - indicator of a plugin that supports reloading.

Proposals:

  • I think we can get rid of reloadable? by requiring all plugins be reloadable.
  • I think we should try to get rid of close.

Input plugins

Replace stop with Thread.interrupt()

Inputs are basically java.lang.Runnable that, today, can be commanded to terminate if their stop method is invoked from another thread. This is done today with an AtomicBoolean. In Java, all threads already have such a concept which we may be able to use for this which is to have stop call Thread.interrupt() and should we stop? be Thread.isInterrupted().

remove methods

I would also like to remove the register method. I'd also like to see if we can relieve plugin authors from having to invoke decorate themselves.

Filter plugins

@jakelandis
Copy link
Contributor

Re: Plugin Discovery

Have you considered an annotation driven approach ?

For example, I really like the way JSR-330 is implemented. This is the specification for Dependency Injection that is the brain child of Guice and Spring founders. They cleanly separated the specification from the implementation by simply providing well documented annotations and single generic interface.

The entire code for JSR-330 is here:
https://github.com/javax-inject/javax-inject/tree/master/src/javax/inject

Implementations, such as Guice, Spring, Weld, etc. all use the annotations but provide their own implementation. I assume that all implementations are scanning the class path on start up looking for these annotations. (Scanning your entire class path for annonations using reflections at startup is not a scary as it sounds).

I think there is an analogy to be drawn here with Logstash Plugin's. We could have a Plugin interface with @Input @Output @Filter, etc. annotations, or with parameters such as @Input(name="tcp"). At startup (or pipeline reload), LS could re-scan for new plugins.

@jakelandis
Copy link
Contributor

Re: Plugin Configuration

Have plugin instances accept a configuration object to configure themselves with

+1

and have Logstash do the hard part of parsing, validating, and providing feedback to the user.

+1

The Builder pattern tends to lend itself nicely to extensible configuration options.

@jakelandis
Copy link
Contributor

Re: Inputs and Outputs

I think we should try to minimize the amount of thought that developer will need with respect to a thread terminating. Perhaps providing optional lifecycle hooks (start, stop, reload) from the API. If needed we could document some guarantees on how long we will wait for the reload/stop methods to execute before simply killing the thread it is running on.

I am not sure I understand the need for the run method for the Input, won't the input be responsible for obtaining the events off the wire/file/etc. ? It would seem that the contract would be the lifecycle events and what to do with the event after it is has accepted it.

If we combine the filters and output to a singular unit (a Processor) , does it still makes sense within that unit to separate transform and sink ?

@jordansissel
Copy link
Contributor Author

jordansissel commented Aug 12, 2017

I am not sure I understand the need for the run method for the Input

This is the way Logstash Inputs work today. We can certainly do differently, and I am open to discussion.

Today, every input plugin has a run method. The pipeline uses this by doing basically this:

plugin = LogStash::Input::Stdin.new(config)
Thread.new { plugin.run(queue) }

I think this mechanism (which pretty much emulates Java's Runnable with small modifications) has served us well, even for inputs that do their own work management (kafka, rabbitmq, etc).

Having a start+stop, to me, puts the burden of task management on the plugin author where, for most plugins, there is generally only one while (true) { /* do work */ }.

If we combine the filters and output to a singular unit (a Processor) , does it still makes sense within that unit to separate transform and sink ?

Yes. As we have today an elasticsearch filter and elasticsearch output which perform very different tasks, we'll still have this in whatever future we build. The reason for unification, for me, is that both filter and output have the same characteristics:

  • filter and output are both handed an event
  • currently they are implemented with the same basic function signature but just different names

simply killing the thread

One does not simply walk into Mordor .. err .. kill threads ;)

Thread termination in Java requires cooperation on the part of the thread being killed.

I'm open to exploring start/stop lifecycles, though.

BTW, "reloadable" is more for the pipeline's concept of "reload" which means to read the config file, build a new pipeline, stop the old pipeline, and start a new one. Fundamentally, "reloadable" (I think?) really means "will this plugin terminate if I ask it to do so"

@jordansissel
Copy link
Contributor Author

The entire code for JSR-330 is here:

I'm in favor of reusing an existing library or pattern. I read the docs on javax.inject and it's not clear to me the purpose of the @Inject annotation. I'll probably need to study it more to understand the purpose. Maybe we can chat next week and you can fill me in -- my confusion is mostly that the annotation alone seems to do nothing at all? Were you referring to the @Named annotation instead?

@jordansissel
Copy link
Contributor Author

Regarding JSR-330's Provider<T>, this same interface is available by a different name in Java 8 as Supplier<T>.

@jakelandis
Copy link
Contributor

jakelandis commented Aug 12, 2017

I may have confused matters by bringing up JSR-330. I only meant that as an example of an API implemented multiple times, driven almost soley by annotations. (though using formal DI may be useful, that wasn't my point)

Annonations by themselves don't carry any functionality, it just a development time decision to semantically mark classes/methods with the expectation that the framework (in this case LS) will read those annotations and do something useful with them.

Thread termination in Java requires cooperation on the part of the thread being killed.

Only for a clean shut down :)

"reloadable" is more for the pipeline's concept

I assume that some actions (such as reseting the metrics) may be desirable for a plugin at reload time.

I can put together some sample code to better illustrate discovery via annotations and reflections.

@jordansissel
Copy link
Contributor Author

jordansissel commented Aug 12, 2017 via email

@jordansissel
Copy link
Contributor Author

Something like:

@Name("stdin")
public class Stdin implements Input {
  // ...
}

?

@jordansissel
Copy link
Contributor Author

I added two more options for plugin discovery/registration (copy elasticsearch's model, or try to use java.util.ServiceLoader)

@guyboertje
Copy link
Contributor

I am working on a EventEmbellisher Java and JRuby ext to inject into a plugin that does the type, add_field, add_tag, remove_field and remove_tag functions. It has a simple API: a Constructor that takes all the data args expect type, isEmpty, withType(fluent) and apply(event). The Jruby ext exposes the same API (for use in Rubyland) with Ruby -> Java conversions as necessary and a Javaland getEmbellisher call to return the actual object to give to the plugin Java code.

For pure java plugins the EventEmbellisher instance can be accessed via the proposed Configuration class.

The JRuby extensions are still of use while we have the Ruby plugin base class and the input_base etc.

@jakelandis
Copy link
Contributor

I took a swipe at a POC with the following goals:

  • Expose as little as possible to plugins. For example, don't expose the whole queue.
  • Try to keep the API small and in domain familiar terms.
  • Try to keep as much of the threading/looping complexity hidden from the plugin.
  • Clearly define a life-cycle.
  • Custom isolated class loaders
    • Allow for a drop-in style plugin (or even http downloaded plugins)
    • Allow different versions of transitive dependencies
    • First step towards more secure sandboxed security managed plugins

The project is here: https://github.com/jakelandis/skunkstash

The project is FAR from production quality (and never will be)...but it has enough there to emulate the workflow of Logstash, with an Event generated by an input, passed to a queue, run through a filter, then sent to an output. This project can be used to help flush out ideas, in particular the API, isolated class loading, and threading model for plugins.

@jordansissel
Copy link
Contributor Author

jordansissel commented Aug 21, 2017 via email

@jakelandis
Copy link
Contributor

Threading model is probably a poor choice of words. I am not trying to propose a change to the Threading strategy.

I really meant the Threading implementation (Executor service, Thread Factory, Runnable, Callable etc.) and Threading contract with the plugin itself (lifecycle management, state to/from thread, interrupt expectations, etc.)

The the POC is surely not a perfect replication, but hopefully close enough to help flush out ideas.

@jordansissel
Copy link
Contributor Author

Current status:

I talked with @talevy and others a few weeks back about plugin construction. We both aren't a fan of how IngestNode does the plugin factory things. In our discussion @talevy pointed me at ConstructingObjectParser which is very similar to what I was hoping we could accomplish for plugins in Logstash.

Unfortunately the ConstructingObjectParser code is pretty deeply embedded with Elasticsearch, so we'll write our own. It'll look very similar with the exception that our inputs will be config values instead of XContent and XContentParser.

@jordansissel
Copy link
Contributor Author

I'm working in a personal branch right now as I figure some of these things out and experiment. Once I get to a good place, I intend to create a few separate PRs:

  1. defining the plugin interfaces and api
  2. for the ConstructingObjectParser and related code
  3. for plugin building, discovery, and loading.

@jordansissel
Copy link
Contributor Author

Been thinking about discovery, and I think Elasticsearch's model seems pretty good:

  1. plugins install to {ES_HOME}/plugins/<name>/...
  2. Any jars go in that directory.
  3. also includes a plugin-descriptor.properties file which is scanned by Elasticsearch in order to know about the plugin: the main Class entrypoint, the human name for the plugin, version, etc.

@guyboertje
Copy link
Contributor

I thought about this while on holiday.

Do we need a listener/observer interface for plugins that allows for plugin authors to "trigger" certain activities in a loosely coupled way. It would add flexibility to the plugin callback mechanisms. An example of this would be filter_matched. I know that these are usually void methods that would operate on the event arg but I'm not sure how to do this with an encode activity that should produce an encoded version of the event as a string.

These listener/observer activities would be fired during an API call from the pipeline into the plugin and plugins are not required to fire any if they are not needed.

It may also be a way for the plugin author to influence plugin support building during initialisation - say in an input if the author needs to preprocess some config options then fire a setup activity with a 'properties' arg. By "plugin support" I mean the building of any kind of Consumer chain that describes the exact set of event manipulation functions that need to be applied to an event for the active LS config.

@danhermann
Copy link
Contributor

Closing in favor of #9215

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants