Create a new scripting language #13084

Closed
jdconrad opened this issue Aug 24, 2015 · 16 comments
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >feature Meta

Comments

@jdconrad
Contributor

Elasticsearch needs a scripting language that can be used dynamically while remaining secure. While Lucene Expressions covers those two points, it does not meet the needs of many scripts due to the following limitations:

  1. Expressions is designed to only work effectively with numerical values. Elasticsearch requires a language that can handle strings, dates, and possibly data structures such as map and list. To add these features to expressions would require a large architectural change within Lucene that doesn't really make sense for the purpose of that language.
  2. Expressions is designed to be a single mathematical equation using only one line of code. This does not lend itself well to using things such as a loop to go through multi-valued fields.
  3. Expressions is built to be extremely similar to JavaScript. This does not lend itself well to efficient translation into Java for types other than double. To translate a language like JavaScript into Java would require all variables to be Objects that also track what type they are. This is extremely inefficient.

One of the main goals of this language is that any somewhat experienced developer should be able to learn the entirety of it in about fifteen minutes. For this reason I'm going to keep the control flow simple by allowing only the equivalent of one linearly-run static Java function to be written in total for any given script. No multiple function/method scripts should be allowed, as at that point users should be writing custom code for their application instead of leaning on scripting.

For the new language I intend to initially have the following:

  1. Native types - boolean, byte, short, int, long, float, double, string, date, point, list, and map including the ability to cast when necessary
  2. Arithmetic operators - multiplication *, division /, addition +, subtraction -, precedence ( )
  3. Comparison operators - less than <, less than or equal to <=, greater than >, greater than or equal to >=, equal to ==, and not equal to !=
  4. Boolean operators - not !, and &&, or ||
  5. Bitwise operators - shift left <<, shift right >>, unsigned shift >>>, and &, or |, xor ^, not ~
  6. A way to call a set list of external functions to be defined at a later time (math functions, geo functions)
  7. API for strings (possibly a limited API for regular expressions)
  8. API for dates
  9. Assignment operations for native types (int x; x = 0;)
  10. Control flow - if, else if, else and for, while, do-while using brackets { } and semicolons ; to denote the end of operations/lines
  11. Bindings - single-valued and multi-valued field access as available variables, along with a way to find out the number of values in a multi-valued field and a way to access the existing MultiValueMode API
  12. Shortcuts for map and list access such as (double)map0.item1.0.item2.1 where map0 is the initial map, item1 is an element in the map of type list, 0 is the first element in the list of type map, item2 is an element in the map of type list, and finally 1 is the second element in the list of type double.
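The shortcut in item 12 can be pictured as a walk through nested maps and lists, one path segment at a time. The following is only a rough Java sketch of that idea, not the proposed implementation; the class name `PathAccessSketch` and the method `resolve` are hypothetical, and the sketch assumes that a segment applied to a list is always a numeric index.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating dotted-path access; not part of the proposal.
final class PathAccessSketch {

    // Walks a dotted path such as "item1.0.item2.1" starting from root.
    // A segment is used as a key when the current value is a Map and as a
    // numeric index when the current value is a List (an assumption).
    static Object resolve(Object root, String path) {
        Object current = root;
        for (String segment : path.split("\\.")) {
            if (current instanceof Map) {
                current = ((Map<?, ?>) current).get(segment);
            } else if (current instanceof List) {
                current = ((List<?>) current).get(Integer.parseInt(segment));
            } else {
                throw new IllegalArgumentException(
                    "cannot descend into '" + segment + "' from a non-container value");
            }
        }
        return current;
    }
}
```

With a structure shaped like the one described in item 12, `PathAccessSketch.resolve(map0, "item1.0.item2.1")` would return the second element of the innermost list, which the script would then cast to double.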

This list will be updated as the project moves forward. To ensure the language does not hang due to an infinite loop or extremely long operational set, the number of instructions will be counted, and an exception will be thrown if a specified limit is reached.
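As a rough illustration of the counting idea, here is a minimal Java sketch in which a counter is ticked once per loop iteration and throws once a limit is passed. The names `LoopCounterSketch`, `tick`, and `sumTo` are hypothetical, and counting per iteration rather than per bytecode instruction is a simplifying assumption; the real language would presumably emit the equivalent check into generated code.

```java
// Hypothetical sketch of a step limit; not the actual mechanism of the language.
final class LoopCounterSketch {
    private final int limit;
    private int count;

    LoopCounterSketch(int limit) {
        this.limit = limit;
    }

    // Called once per counted step; throws when the limit is exceeded.
    void tick() {
        if (++count > limit) {
            throw new IllegalStateException("script exceeded " + limit + " counted steps");
        }
    }

    // Example "script body": sums 1..n while ticking the counter on each iteration.
    static long sumTo(int n, int limit) {
        LoopCounterSketch counter = new LoopCounterSketch(limit);
        long total = 0;
        for (int i = 1; i <= n; ++i) {
            counter.tick();
            total += i;
        }
        return total;
    }
}
```

A loop that stays under the limit completes normally, while `LoopCounterSketch.sumTo(1000, 100)` would be aborted with an exception on the 101st iteration.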

I intend to build the language using ANTLR and ASM as the backbone. The following steps will be required for the language to be created.

  1. Create the ANTLR grammar.
  2. Write the code to build the ASM function from the provided script.
  3. Write tests.
  4. Integrate the language into the Elasticsearch code base.
  5. Write more tests.
  6. Refine the feature set.
  7. Write more tests.
  8. Repeat 6 and 7 until completed.
@jdconrad jdconrad added >feature :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache labels Aug 24, 2015
@jdconrad jdconrad self-assigned this Aug 24, 2015
@clintongormley

w00t

@uboness
Contributor

uboness commented Aug 24, 2015

w00t indeed

@uboness
Contributor

uboness commented Aug 24, 2015

  1. A way to call set list of external functions to be defined at a later time (math functions, geo functions)

This is a super important aspect. Beyond the basic native operations in the language, the only other functions that will be available are those that we pre-register with the language (registered in code, that is). So this mechanism needs to be generic and not hard-coded for the math/geo functions.
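One way to picture a generic registration mechanism is a name-to-function table that is filled in from code and consulted on every call. The Java sketch below is purely illustrative; `FunctionRegistrySketch`, `ScriptFunction`, `register`, and `call` are hypothetical names, and passing arguments as an untyped `Object...` array is an assumption made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry of pre-registered functions; not the real design.
final class FunctionRegistrySketch {

    // Generic function shape: any number of arguments in, one value out.
    interface ScriptFunction {
        Object apply(Object... args);
    }

    private final Map<String, ScriptFunction> functions = new HashMap<>();

    // Registration happens in code, before any script runs.
    void register(String name, ScriptFunction function) {
        functions.put(name, function);
    }

    // Scripts can only reach functions that were registered by name.
    Object call(String name, Object... args) {
        ScriptFunction function = functions.get(name);
        if (function == null) {
            throw new IllegalArgumentException("function not registered: " + name);
        }
        return function.apply(args);
    }
}
```

Because the registry knows nothing about math or geo, a math function such as `max` and a string function such as `upper` would be registered in the same way, and any name that was never registered is rejected.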

@jdconrad
Contributor Author

jdconrad commented Oct 7, 2015

Haven't commented on here in a while, so I thought I would give a quick update. The majority of the features are implemented as a first pass in a separate project.

What needs to happen before a PR can really be made at this point --

  1. A bit better string support. (goal: this week)
  2. Bindings for search fields. (goal: week after next -- needs item 6)
  3. Tests, lots and lots of tests. (goal: next week)
  4. Generally improved stability and bug fixes as the tests reveal them. (goal: next week)
  5. Documentation. (goal: week after next)
  6. Integration into a plugin. (goal: week after next)
  7. General clean up. (goal: week after next)
  8. Loop counting to prevent runaway code. (goal: week after next)
  9. A name for the language.

This timeline may be a bit ambitious (again), but I'll update in another couple weeks.

@clintongormley

Awesome

  1. A name for the language.

Well that's going to push the delivery date out to 2019 :)

@nik9000
Member

nik9000 commented Oct 7, 2015

I wonder if the language can continue to live outside of Elasticsearch? The reason we made it is because there isn't a good, safe, sandboxed scripting language in the JVM. So I imagine it'll be useful to other people. If it were easy to embed in other places that'd be good exposure for us and help the open source community.

@jdconrad
Contributor Author

jdconrad commented Oct 7, 2015

@nik9000 That's a cool thought, and maybe something to consider down the road, but it's beyond the scope of the initial project by quite a bit. I think the biggest limiter for doing something like that is that the language really is designed to only allow scripts that are the equivalent of a single static Java method running on one thread, so for it to be effective outside ES it would certainly need to have its feature set expanded significantly.

@jdconrad
Contributor Author

I wanted to give an update about some of the features and the current state of the language:

The prior list remains except for string features; however, after speaking with @rmuir and @rjernst, I would like to take a week to explore the possibility of using invokedynamic to offer a language that doesn't require casting. Currently the language has strict typing, making it very similar to Java. However, given the languages that people are used to, this may be an issue longer term. It's likely I won't solve this in a week, but I want to explore it as something to do after the initial release, and to make sure that it's at least possible with the current design.

As it stands the following features exist:

  1. Native types - boolean, byte, short, int, long, float, double, object, string, list, and map including the ability to cast when necessary
  2. Arithmetic operators - multiplication *, division /, addition +, subtraction -, precedence ( )
  3. Comparison operators - less than <, less than or equal to <=, greater than >, greater than or equal to >=, equal to ==, and not equal to !=
  4. Boolean operators - not !, and &&, or ||
  5. Bitwise operators - shift left <<, shift right >>, unsigned shift >>>, and &, or |, xor ^, not ~
  6. A language definition specified in a properties file that allows the ability to add more types (Java classes), and serves as a whitelist for all the things available to create and call.
  7. Assignment operations for native types (int x; x = 0;) including increment, decrement, +=, -=, etc.
  8. Control flow - if, else if, else and for, while, do-while using brackets { } and semicolons ; to denote the end of operations/lines
  9. Shortcuts for maps and lists using bracket notation such as (int x = (int)map["test0"][0]["test1"];) where "test0" is a key into a map, 0 is an index into a list, "test1" is a key into a map, and the value is cast from object to int.
  10. A string concatenation operator in the form of '..' instead of '+', because '+' leads to ambiguities such as "string" + 2 + 2, which in Java evaluates to the string "string22"; I believe this may be confusing.
  11. Promotion for numerics follows the Java style: values are upcast as necessary (int -> double, long -> float, etc.), and a cast is required when promotion cannot be done (long -> int, etc.).
  12. Auto-boxing -- since the language uses basic types and runs on the JVM, auto-boxing is necessary and will be done automatically where possible.
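For reference, the Java behavior that items 10 through 12 compare against can be shown in a few lines. This is plain Java rather than the new language, and the class and method names are made up for the illustration.

```java
// Plain Java illustration of the semantics mentioned in items 10-12.
final class JavaSemanticsSketch {

    // '+' is evaluated left to right, so the result depends on operand order.
    static String stringFirst() {
        return "string" + 2 + 2;   // concatenates twice: "string22"
    }

    static String numbersFirst() {
        return 2 + 2 + "string";   // adds first, then concatenates: "4string"
    }

    // Widening promotions happen implicitly.
    static double intToDouble(int value) {
        return value;
    }

    static float longToFloat(long value) {
        return value;
    }

    // Narrowing requires an explicit cast and may discard high-order bits.
    static int longToInt(long value) {
        return (int) value;
    }

    // Auto-boxing and unboxing between int and Integer via Object.
    static int boxAndUnbox(int value) {
        Object boxed = value;      // boxed to Integer
        return (Integer) boxed;    // cast and unboxed back to int
    }
}
```

The order dependence of `+` in the first two methods is the kind of ambiguity that a separate `..` operator would avoid.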

A small example of the language definition:
class.object = object java.lang.Object // define java class Object as the type object
class.string = string java.lang.String // define java class String as the type string
method.object.string = object string string toString() // define java class Object toString method as string for use on the object type
...
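To make the whitelist idea concrete, here is a hedged Java sketch that reads only the `class.` entries of a definition like the one above and refuses any type that was not listed. The exact file format is not specified in this thread, so the parsing rules (a `//` comment suffix, `class.<name> = <name> <java class>`) are assumptions inferred from the three example lines, and `WhitelistSketch`, `parse`, and `lookup` are hypothetical names. Method entries are skipped entirely.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for "class." entries; the real format may differ.
final class WhitelistSketch {
    private final Map<String, Class<?>> types = new HashMap<>();

    static WhitelistSketch parse(List<String> lines) {
        WhitelistSketch whitelist = new WhitelistSketch();
        for (String raw : lines) {
            // Drop a trailing "// ..." comment, then surrounding whitespace.
            int comment = raw.indexOf("//");
            String line = (comment >= 0 ? raw.substring(0, comment) : raw).trim();
            if (!line.startsWith("class.")) {
                continue; // method entries and anything else are ignored in this sketch
            }
            String[] keyValue = line.split("=", 2);
            if (keyValue.length != 2) {
                throw new IllegalArgumentException("malformed entry: " + line);
            }
            String[] parts = keyValue[1].trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("malformed entry: " + line);
            }
            try {
                // parts[0] is the script type name, parts[1] the Java class name.
                whitelist.types.put(parts[0], Class.forName(parts[1]));
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("unknown Java class: " + parts[1], e);
            }
        }
        return whitelist;
    }

    // Only types that appear in the definition can be resolved.
    Class<?> lookup(String typeName) {
        Class<?> type = types.get(typeName);
        if (type == null) {
            throw new IllegalArgumentException("type not whitelisted: " + typeName);
        }
        return type;
    }
}
```

Resolving `string` after parsing the example lines would yield `java.lang.String`, while a name that never appears in the definition is rejected rather than looked up on the classpath.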

A small example of what a script will look like:

list nums = input["inner"]["list"];
int size = nums.size();
double total = 0;
for (int count = 0; count < size; ++count) {
    total += (double)nums[count];
}
return total;

where the automatically generated signature for the script is Object execute(Map<String, Object> input);
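For comparison, a hand-written Java method with the same `Object execute(Map<String, Object> input)` shape might look roughly like the sketch below. It is not the code the compiler would generate; the class name is hypothetical, and the use of `Number.doubleValue()` in place of the script's `(double)` cast is an assumption made so that the sketch also accepts integer elements.

```java
import java.util.List;
import java.util.Map;

// Hand-written Java equivalent of the example script; illustrative only.
final class ScriptEquivalentSketch {

    static Object execute(Map<String, Object> input) {
        // list nums = input["inner"]["list"];
        Map<?, ?> inner = (Map<?, ?>) input.get("inner");
        List<?> nums = (List<?>) inner.get("list");

        // int size = nums.size();
        int size = nums.size();

        // double total = 0;
        double total = 0;

        // for (...) { total += (double)nums[count]; }
        for (int count = 0; count < size; ++count) {
            total += ((Number) nums.get(count)).doubleValue();
        }

        // return total;  (boxed to Double because the return type is Object)
        return total;
    }
}
```

The explicit casts on each container access show what the bracket shortcuts and the `(double)` cast in the script stand in for.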

Note that a script can be thought of as a single static Java method. There is no way to script new functions/methods; if that's necessary, scripting may not be the best choice for the work that needs to be done in most cases. It may also be possible to add the ability to execute other scripts from the original script to make up for the lack of method calls, but this will likely not be included in the initial release.

@jdconrad
Contributor Author

Quick update:

@rmuir has added ES plugin logic for the prototype language. This week will be about adding tests and fixing bugs as they arise. Not much else to add for now.

@jdconrad
Contributor Author

I have removed all shortcuts for now to reduce the amount of debugging necessary for a first iteration. Shortcuts add a huge amount of complication and ambiguity to the language at this point in time. A second iteration somewhere down the road will have the goal of adding shortcuts, dynamic method calls, and inferred casting. The main goal of this project at this time is a simple language that is safe enough to run dynamic scripts in ES.

@eskibars
Contributor

I see "+=", what about ".="?

@jdconrad
Contributor Author

@eskibars To be clear, is .= for string concatenation? If so, we have decided to use ..= because the (.) operator may end up overloaded with an alternative shortcut for reading through maps/lists at a later time and will be needed for that. We do not want to use += because that creates some possible ambiguities of its own and is a math operator. (While this works in Java, some of the assumptions that need to be made may not be for the best in all situations.)

@eskibars
Contributor

@jdconrad yes, I was referring to string concatenation

@damienalexandre
Contributor

Great stuff! One issue I have with scripting in Elasticsearch is that it is really hard to test and debug a script. We need, IMO:

  • a way to play a script against any indexed document (maybe an API index/type/123/_script?)
  • a debug mode, maybe integrated in ?explain to show:
    • the count of instructions (especially if you limit them)
    • all the available variables
    • the return value, un-edited
  • a way to log directly in Elasticsearch logs without playing with Java imports
  • better exceptions: if a script fails, we often get a QueryExec exception but the actual scripting error is hidden in the stack.

Maybe this new (un-named?) scripting language could fix or at least improve those points :)

Cheers from Elastic{ON} Paris! 🍻

@jdconrad
Contributor Author

jdconrad commented Nov 5, 2015

@damienalexandre Thanks for the feedback here. Points 1, 2, and 4 are all solid ideas, and hopefully somewhere down the road we will have time to spec and code some incarnation of them. For point 3, it's very unlikely that we will ever log anything from this language as we really don't want to write any files because it would mean that we have to open up security to allow this to happen.

@clintongormley

Closed by #15136


6 participants