[TINKERPOP 3107] Remove Groovy dependency from default server initialization#3384
[TINKERPOP 3107] Remove Groovy dependency from default server initialization#3384Cole-Greer wants to merge 18 commits intomasterfrom
Conversation
|
|
||
| === Upgrading for Users | ||
|
|
||
| ==== Gremlin Server Initialization Without Groovy |
There was a problem hiding this comment.
this is all "what" changed - it's virtually all Reference Documentation. there's nothing about "why". there's nothing about the "benefit". nothing that prepares the reader for what's coming in the "future" with a server without Groovy. get people thinking forward. like, "Auto-created TraversalSources" - like, doesn't that "simplify" the configuration wonderfully? what about the "challenges" - like, how to maintain provider flexibility without a full scripting engine? where are the words that make this major new change have some life?
There was a problem hiding this comment.
I've reworked the upgrade docs to tell more of the "why", and to better demonstrate the conveniences of the new declarative config system.
There was a problem hiding this comment.
How about "More Secure Gremlin Server" as a title?
The first sentence is rough with the semicolon:
Previous versions of Gremlin Server relied on the Groovy script engine for basic server initialization; binding
traversal sources, loading data, and running lifecycle hooks all required Groovy init scripts.
How about:
Previous versions of Gremlin Server relied on a Gremlin-flavored Groovy
ScriptEnginefor basic server initialization, which covered binding traversal sources, loading data, and running lifecycle hooks all setup by Groovy initialization scripts.
Change "script engine" to ScriptEngine consistently. If this is about security, I think that it would make sense to mention how an earlier version took a step toward security by making gremlin-language the default for script execution and this is the second step required to make the server more secure by default. you allude to that in the second paragraph without that earlier reference about sending remote scripts. I would further allude to the idea that a future version will wholly remove Groovy as an installed option in the server.
|
Can you add unit test to verify gremlin-groovy is disabled? something like |
cd8dce5 to
774548f
Compare
I've adapted this test and added it to the server integration tests. I've given it extra cases to additionally verify error responses if the driver explicitly asks for |
| Graphs with explicit `traversalSources` entries are excluded from auto-creation. | ||
|
|
||
| [[server-lifecycle-hooks]] | ||
| ==== LifeCycleHooks |
There was a problem hiding this comment.
now that Gremlin Server relies on compiled code, i think you have to explain where to place your jar for this stuff to work if you build a custom one. can they use the gremlin-server.sh -i command?
you don't have to do it for this PR if you don't want to, but i think an open item here for 4.0 final is to offer another general purpose LifeCycleHook. At least one I can think of that folks would use again and again is a basic DefaultDataLoader that can just take a file path to a TinkerPop supported format as input. seems like a fast add, but if you want to put it off, please create a JIRA.
There was a problem hiding this comment.
I had a similar thought when working on the TinkerFactoryDataLoader hook for this PR. I think it would be nice to elevate the TinkerFactory generate<Dataset>() methods from tinkergraph-gremlin to gremlin-core, along with a refactor to work with any arbitrary Graph. All of the small graphs are already simply based on the Structure API. There's no reason we couldn't make those directly available to arbitrary graphs. For airRoutes and grateful, I'd also like to make them more generally available (perhaps restricted to graphs which support reading gryo via the io() step, or perhaps via a long gremlin load script).
Taking this a step further and providing a prepackaged data loader where users just need to provide a path in the config is a nice convenience. I think with the Java lifecycle hooks, we can offer a lot more prebuilt configurable lifecycle hooks, instead of asking providers to write all their own scripts.
There was a problem hiding this comment.
Spinning out into a new JIRA: https://issues.apache.org/jira/browse/TINKERPOP-3245
There was a problem hiding this comment.
first one: GroovyLifeCycleHook
There was a problem hiding this comment.
I've added instructions for placing the new jars in the /ext directory in the server to include in the classpath.
| * | ||
| * @param config the key/value pairs from the {@code config} block in the YAML entry | ||
| */ | ||
| public default void init(final Map<String, Object> config) {} |
There was a problem hiding this comment.
this was never clear before i don't think but can a LifeCycleHook prevent the server from starting by throwing an exception? if so, i think it would be good to add some guidance here in the javadoc for that.
There was a problem hiding this comment.
I've updated these javadocs with descriptions of the exception handling behaviour.
| org.apache.tinkerpop.gremlin.groovy.jsr223.GroovyCompilerGremlinPlugin: {expectedCompilationTime: 30000}, | ||
| org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]}, | ||
| org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/generate-all.groovy]}}}} | ||
| traversalSources: { |
There was a problem hiding this comment.
Considering that this is TinkerPop 4, maybe it makes sense to reconsider the graph initialization?
for example can traversal sources and lifecycleHooks be part of graph configuration?
There was a problem hiding this comment.
that makes some sense actually. maybe the configuration makes more sense living with the graph rather than referencing the graph all over the yaml (i think that's what @vkagamlyk 's comment was about at least). that would lead to a nicer configuration:
graphs: {
graph: {
configuration: conf/tinkergraph-empty.properties,
traversalSources:
- g
- { name: "gg", gremlinExpression: "g.withStrategies(ReadOnlyStrategy)", language: "gremlin-lang" }
lifecycleHooks:
- { className: org.apache.tinkerpop.gremlin.server.util.TinkerFactoryDataLoader, config: {dataset: modern}}
- { className: com.software.company.GroovyInsanity, config: {script: "System.exit(0)"}}
}
}also, in the docs it's written that:
- A graph named
graphis implicitly bound tog- All others are bound to
g_<graph-name>(e.g.moderngetsg_modern)
but does that mean that the configuration below produces 2 traversal sources in this server (except for g: {graph: graph}, - like, gclassic and g_classic? a
There was a problem hiding this comment.
looking at the code, i guess the above is just a documentation problem - it's not quite clear that it's referring to graphs that have no explicitly configured TraversalSource that get that naming.
now i see...i was reading the upgrade docs - in that context i guess what they say makes sense and the reference docs are clear. i guess that answers my questions.
There was a problem hiding this comment.
Other possible options:
- To have all graph related config in single
.propertiesfile, something like
gremlin.graph=org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph
gremlin.graph.traversalSource=g
gremlin.graph.traversalSource={ name: "gg", gremlinExpression: "g.withStrategies(ReadOnlyStrategy)", language: "gremlin-lang" }
gremlin.graph.lifecycleHook={ className: org.apache.tinkerpop.gremlin.server.util.TinkerFactoryDataLoader, config: {dataset: modern}}
gremlin.graph.lifecycleHook={ className: com.software.company.GroovyInsanity, config: {script: "System.exit(0)"}}
gremlin.tinkergraph.vertexIdManager=LONG
or structured
gremlin.graph.traversalSource.1.name=gg
gremlin.graph.traversalSource.1.gremlinExpression=g.withStrategies(ReadOnlyStrategy)
gremlin.graph.traversalSource.1.language=gremlin-lang
- To get rid of
.propertiesfiles and have all config in single server yaml file
graphs: {
graph: {
graph: org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph,
vertexIdManager: LONG,
traversalSources:
- g
- { name: "gg", gremlinExpression: "g.withStrategies(ReadOnlyStrategy)", language: "gremlin-lang" }
lifecycleHooks:
- { className: org.apache.tinkerpop.gremlin.server.util.TinkerFactoryDataLoader, config: {dataset: modern}}
- { className: com.software.company.GroovyInsanity, config: {script: "System.exit(0)"}}
}
}
There was a problem hiding this comment.
i like your ideas to simplify....i think the properties file just has relevance as it is because you can give it to Graph.open(Configuration) which wouldn't have any relevance for those settings outside of Gremlin Server. But, maybe that's ok?? they would just be ignored and only Gremlin Server would concern itself with them? 🤔
There was a problem hiding this comment.
yes, properties file is most convenient way for providers for additional graph configuration, It probably makes sense to keep it...
traversalSources and lifecycleHooks are outside of graph, so should have separate config.
I think @spmallette proposal looks better than mine.
There was a problem hiding this comment.
I like the idea of pulling the traversalSources config into the graphs config. I'm less sure about lifecycleHooks. In practice, the lifecycleHooks are generally just used to load data into a graph or something like that, but in general, they are globally scoped in the server, they currently aren't bound to specific graph instances. I'll play around with it a bit and see how it feels.
There was a problem hiding this comment.
yeah, that's true. i suppose they aren't constrained to a specific graph.
There was a problem hiding this comment.
I've pushed changes which relocates traversalSources according to the suggestion but leaves lifecycleHooks alone. I couldn't find something there I liked that handled both globally-scoped hooks and graph-scoped hooks. One thing to note, is that the format as discussed (see below) is a breaking change as it moves the graph properties file into a new configuration key.
graphs: {
graph: {
configuration: conf/tinkergraph-empty.properties,
traversalSources:
- g
- { name: "gg", gremlinExpression: "g.withStrategies(ReadOnlyStrategy)", language: "gremlin-lang" }
}
}
I chose to workaround this by allowing that syntax alongside the old format of:
graphs: {
graph: conf/tinkergraph-empty.properties
}
The Settings get normalized to the new complex form upon parsing.
| final String language = args.containsKey(Tokens.ARGS_LANGUAGE) ? (String) args.get(Tokens.ARGS_LANGUAGE) : "gremlin-lang"; | ||
| final GremlinScriptEngine scriptEngine = gremlinExecutor.getScriptEngineManager().getEngineByName(language); | ||
|
|
||
| if (scriptEngine == null) { |
There was a problem hiding this comment.
This check can be done as soon as the RequestMessage is formed so it should be done before sendHttpResponse() so that the reponse from the server is an error rather than a 200 OK.
There was a problem hiding this comment.
That's a good call out, I've added an earlier check to validate the language field prior to the initial 200 OK response.
|
A lot of the test files do something like |
| case "sink": | ||
| TinkerFactory.generateKitchenSink(tinkerGraph); | ||
| break; | ||
| case "airroutes": |
There was a problem hiding this comment.
Does this take into account the min-jar build that won't come with the kryo data for air-routes?
There was a problem hiding this comment.
It's not explicitly handled here, but it implicitly behaves in the way I would expect. TinkerFactory.generateAirRoutes() already throws an IllegalStateException with a clear message about the min classifier if the file is not found. That error will propagate through the hook and ultimately cause the server to fail to start. I think this is the safest way of handling this case.
https://issues.apache.org/jira/browse/TINKERPOP-3107
Gremlin Server currently relies on Groovy init scripts for server initialization — binding traversal sources, running lifecycle hooks, and loading data. This PR introduces a Groovy-free initialization path so that Groovy can be disabled by default in all shipped server configs. Groovy remains available as an
opt-in for backward compatibility.
Changes
Three new YAML configuration mechanisms replace the Groovy init scripts:
Auto-created TraversalSources — After graphs are loaded, a default TraversalSource is automatically created for each graph that doesn't have an explicit traversalSources entry. A graph named graph gets g; others get g_. A minimal config with just a graphs section is now fully functional.
Declarative traversalSources — A new YAML section for explicit TraversalSource creation with optional strategy configuration via a Gremlin query:
yaml
traversalSources: {
g: {graph: graph},
gReadOnly: {graph: graph, query: "g.withStrategies(ReadOnlyStrategy)"}}
Java-based lifecycleHooks — A new YAML section for configuring LifeCycleHook implementations via reflection, replacing Groovy-based hook creation:
yaml
lifecycleHooks:
config: {graph: graph, dataset: modern}
Supporting changes
explicitly listed in
scriptEngines(plusgremlin-lang, which is always permitted).Requests targeting an engine not in the allowlist are rejected. This prevents clients from
invoking gremlin-groovy when it is removed from the default configuration.
Config migrations
All default configs under gremlin-server/conf/ updated to remove gremlin-groovy from scriptEngines. All Docker integration test configs (docker/gremlin-server/) and JVM test configs migrated to use traversalSources + lifecycleHooks instead of generate-all.groovy. The gremlin-console test infrastructure
similarly migrated.
Deleted files
Testing
VOTE +1