Add user function definition support #2624

lenticularis39 · 2023-05-25T08:19:42Z

A new structure for defining subprograms is added on the same level as probes:

fn <function_name>(<type0> <arg0>, ...): <return_type> {
  <body>
}

The existing return statement is extended to also allow returning a specified value. SemanticAnalyser checks if the type is correct, while CodegenLLVM goes through the generated code to ensure all code flow paths return a value in a non-void function.

Only a limited number of builtins are allowed to be used inside functions due to current implementation restrictions; these can be addressed in later development.

This is the first step in implementing #2516 that can be either merged separately or together with the rest of the implementation. Note that this does not implement a way to call the functions nor a way to generate the correct BPF code for that.

(I'm opening this draft PR, since I believe there is quite a few non-trivial changes that it makes sense to review while I work on the rest.)

Checklist

Note: this checklist is only for the function definition part, excluding the BPF code generator and function calls.

Tests for parser
Tests for semantic analyser
Tests for codegen
Function argument support

src/ast/passes/semantic_analyser.cpp

src/ast/ast.cpp

viktormalik

Looks great, the implementation is pretty straightforward. Just a couple of comments.

I'd advocate for merging this before implementing support for function calls.

tests/parser.cpp

src/ast/passes/codegen_llvm.cpp

lenticularis39 · 2023-06-20T10:43:01Z

Resolved comments and rebased.

ajor

Love the tests!

The one main structural comment I've got is that something like checking all paths return a value should come under semantic analysis rather than codegen. Codegen should be able to assume that what it's been given is valid, and not worry about checking things itself.

Slightly off topic: We need to start caring more about the performance of the parser. Our approach previously was to not worry because it's not on the tracing hot-path, but the unit tests taking multiple seconds to run indicates that we've taken this too far. I did a quick profile and we're spending a lot of our time just copying strings around. I left a few comments on the new changes here, but we'll need to revisit the rest of ast.h at some point.

tests/parser.cpp

src/ast/ast.cpp

src/ast/ast.h

src/ast/visitors.cpp

ajor · 2023-07-03T14:21:18Z

src/lexer.l

+  \n                    buffer += '\n'; loc.lines(1); loc.step();
+}
+<STRUCT,BRACE>{
+  "*"|")"|","|{space}+"$" {


Why are the space changes needed in lexer.l (this line and also in the AFTER_COLON section)? Why is space handled differently in this section?

AFTER_COLON is an %x section, IIUC the usual rules do not apply there. Here the space handling should be a part of the logic ensuring struct test $x (for example) is parsed correctly.

I can take a closer look at it if that did not make much sense, that's what I remember from writing this code.

danobi · 2023-07-12T16:57:47Z

Going to set aside some time this weekend to take a look

lenticularis39 · 2023-07-13T19:13:22Z

Love the tests!

Thank you! I had to write thorough tests, the area of possibility of bugs here is almost unlimited :D

The one main structural comment I've got is that something like checking all paths return a value should come under semantic analysis rather than codegen. Codegen should be able to assume that what it's been given is valid, and not worry about checking things itself.

I also thought that initially, however, checking if all paths return a value is actually not semantic analysis, but control flow analysis, since semantic analysis operates on the AST, while this checks for a property of the CFG. Implementing it in semantic analyser would require to construct the control flow inside the AST, which would duplicate work that is already done in codegen.

Slightly off topic: We need to start caring more about the performance of the parser. Our approach previously was to not worry because it's not on the tracing hot-path, but the unit tests taking multiple seconds to run indicates that we've taken this too far. I did a quick profile and we're spending a lot of our time just copying strings around. I left a few comments on the new changes here, but we'll need to revisit the rest of ast.h at some point.

Good point.

ajor · 2023-07-14T14:09:44Z

I also thought that initially, however, checking if all paths return a value is actually not semantic analysis, but control flow analysis, since semantic analysis operates on the AST, while this checks for a property of the CFG. Implementing it in semantic analyser would require to construct the control flow inside the AST, which would duplicate work that is already done in codegen.

In languages with more complex control flow (e.g. goto statements) I think this makes sense, but is it necessary for bpftrace, which only has structured control flow?

I feel like something similar to this pseudo-code should be able to detect missing return statements:

fn all_paths_return(block):
  for stmt in block:
    if stmt == "return":
      return true
    if stmt == "break" or stmt == "continue":
      return false
    if stmt == "if":
      if all_paths_return(stmt.true_branch) && all_paths_return(stmt.false_branch):
        return true
    if stmt has children: (e.g. a loop)
      if all_paths_return(stmt.children):
        return true
  return false

src/ast/ast.h

src/ast/passes/codegen_llvm.h

src/ast/passes/codegen_llvm.cpp

danobi · 2023-07-16T19:52:18Z

tests/semantic_analyser.cpp

+
+TEST(semantic_analyser, subprog_arguments)
+{
+  test("fn f(int64 $a): int64 { return $a; }", 0);


Sorry if this was discussed elsewhere already, but the argument syntax feels a bit odd to me.

The function syntax is fn <idents and stuff>: <ret type>. It seems to me it would follow that argument syntax should also be <ident>: <type>, but here it's reversed.

IIRC we also added type annotations for local variables as $var: type = <blah> right?

I posted the syntax in the RFC some time ago (#2516 (comment)), but I'm not against changing it - actually, it will make the lexer simpler, since currently it has to check separately if we are inside arguments or after a colon when dealing with structs (definition vs use, related to your comment above at #2624 (comment)).

$var: type = <blah> syntax is not yet supported, #2560 did not add any new features; but it should be also handled by the AFTER_COLON lexer logic once it is added.

In that case I am in favor of $arg: type syntax

I wonder if this type syntax can be expanded to other cases, like:

kprobe:some_kernel_fn(a: int, b: int): int

Basically adding the option to specify function signatures in places (althoug with btf and dwarf that might be a bit late). In that case, i wonder if having -> like rust,python will make parsing easier

fbs · 2023-07-17T21:30:36Z

I also thought that initially, however, checking if all paths return a value is actually not semantic analysis, but control flow analysis, since semantic analysis operates on the AST, while this checks for a property of the CFG. Implementing it in semantic analyser would require to construct the control flow inside the AST, which would duplicate work that is already done in codegen.

In languages with more complex control flow (e.g. goto statements) I think this makes sense, but is it necessary for bpftrace, which only has structured control flow?

I feel like something similar to this pseudo-code should be able to detect missing return statements:
fn all_paths_return(block):
  for stmt in block:
    if stmt == "return":
      return true
    if stmt == "break" or stmt == "continue":
      return false
    if stmt == "if":
      if all_paths_return(stmt.true_branch) && all_paths_return(stmt.false_branch):
        return true
    if stmt has children: (e.g. a loop)
      if all_paths_return(stmt.children):
        return true
  return false

I'd put something like this in its own pass. Is not semantic but also shouldn't go in the codegen. Passes should be fairly easy to add as all the boilerplate is handled (see e.g. nodecount.h).

Might be a good usecase for the Dispatcher visitor as it allows for a return value which makes it a bit easier to write.

Might be good to do it in a separte PR to avoid some noise in this one

lenticularis39 · 2023-07-18T06:05:46Z

I'd put something like this in its own pass. Is not semantic but also shouldn't go in the codegen. Passes should be fairly easy to add as all the boilerplate is handled (see e.g. nodecount.h).

Might be a good usecase for the Dispatcher visitor as it allows for a return value which makes it a bit easier to write.

That sounds like the best idea to me, that will avoid the check being in codegen and having it returning bool, and also complicating the code of semantic analyzer.

Also, there is a direct benefit of doing it on the AST as @ajor and @danobi suggested: we can print the exact place where a return instruction should be and isn't, which will make it easier for the programmer.

Might be good to do it in a separte PR to avoid some noise in this one
There might be just an issue with the fact that without the check, bpftrace will generate invalid LLVM IR; however, I don't think that is that big of an issue, since the function feature is not officially supported yet. I can also make it a separate commit, that might be better.

fbs · 2023-07-18T06:57:42Z

The idea behind the passmanager and all the visitor stuff is that it should make new passes easy to add, to avoid increasing the complexity of existing passes, the semantic analyser is quite tricky in places.

If you do want to try the dispather visitor, it will be the first usecase and its a bit hacky so it might not work that well :(

Also, really nice work here :)

lenticularis39 · 2023-07-18T07:00:37Z

Thank you all for the helpful comments, I feel like it cleared a lot of things up and will increase the quality of the PR a lot.

lenticularis39 · 2023-09-12T16:16:27Z

Resolved (hopefully) all of the comments. There are some minor issues with the code like leftover declarations, I will clean it after rebasing to current master.

src/ast/passes/codegen_llvm.h

src/ast/passes/codegen_llvm.cpp

lenticularis39 · 2023-11-07T15:52:03Z

Just fixed a small readability issue (moving check for void return value to ReturnPathAnalyser::visit(Subprog &subprog)).

viktormalik

Thanks!

ajor

This is really looking pretty good now, thanks!

I just found some bugs that we should fix before merging.

Crash using disallowed builtins in functions:

$ sudo ./bpftrace -e 'fn foo(): int64 { return func; }' 
Segmentation fault

Crash writing to a map in a function:

$ sudo ./bpftrace -e 'fn bar(): void { @x = 1; }'
bpftrace: /home/ajor/src/bpftrace/src/ast/irbuilderbpf.cpp:444: void bpftrace::ast::IRBuilderBPF::CreateMapUpdateElem(llvm::Value*, bpftrace::ast::Map&, llvm::Value*, llvm::Value*, const bpftrace::location&): Assertion `ctx && ctx->getType() == getInt8PtrTy()' failed.
Aborted

Crash using map's value in a function:

$ sudo ./bpftrace -e 'fn bar(): uint64 { $res = @x + 1; return $res; } BEGIN { @x = 1; }'
bpftrace: /home/ajor/src/bpftrace/src/ast/irbuilderbpf.cpp:379: llvm::Value* bpftrace::ast::IRBuilderBPF::CreateMapLookupElem(llvm::Value*, bpftrace::ast::Map&, llvm::Value*, const bpftrace::location&): Assertion `ctx && ctx->getType() == getInt8PtrTy()' failed.
Aborted

Poor error message reading a map from a function (it'd be good if this could work):

$ sudo ./bpftrace -e 'fn bar(): int64 { return @x; } BEGIN { @x = 1; }'
stdin:1:20-29: ERROR: Function bar is of type int64, cannot return none
fn bar(): int64 { return @x; } BEGIN { @x = 1; }
                   ~~~~~~~~~

ajor · 2024-01-04T15:24:25Z

src/ast/passes/codegen_llvm.cpp

+  FunctionPassManager fpm;
+  FunctionAnalysisManager fam;
+  llvm::PassBuilder pb;
+  pb.registerFunctionAnalyses(fam);
+  fpm.addPass(UnreachableBlockElimPass());
+  fpm.run(*func, fam);


Why is this code needed for subprograms specifically and not top-level probes?

Unlike probes which have a default return value, subprograms might leave unreachable blocks that are not finished with a return instruction, these have to be removed for the IR to be correct.

Just been thinking about this some more and I realise that I don't actually understand. How can a subprogram generate an unreachable block without a return instruction?

src/ast/passes/printer.cpp

tests/parser.cpp

src/ast/passes/semantic_analyser.cpp

lenticularis39 · 2024-01-25T16:48:03Z

Rebased to solve merge conflicts.

I also noticed another bug: createRet does not return here:

  else if (inside_subprog_)
  {
    b_.CreateRetVoid();
  }

causing segfault for void functions. I'll fix that together with implementing the comments and expand codegen tests to cover such issues.

lenticularis39 · 2024-02-11T11:33:12Z

Fixed maps and casts in subprogs. Also added more tests to cover this functionality.

Subprog argument implementation in codegen was changed to include ctx as the first argument, which is needed for perf-event-based builtin functions (like print) to work properly. This was tested at runtime with the in-progress function call PR.

With this all feedback should be resolved.

danobi · 2024-02-25T19:36:08Z

Are there any outstanding work items left? Or are we missing a final round of reviews?

Side note, this behavior is a bit unexpected:

$ sudo ./build/src/bpftrace -e 'fn foo($a: int32): int32 { return $a + 4; }'
stdin:1:30-43: ERROR: Function foo is of type int32, cannot return int64
fn foo($a: int32): int32 { return $a + 4; }
                             ~~~~~~~~~~~~~

I suspect this is more related to how we represent all ints as int64 internally. Not a show stopper -- IMO fixing this relaxes some rules so this is ok to do later.

Also the error underline seems to be off.

danobi · 2024-02-25T19:39:12Z

Also another bit of fix-later feedback:

sudo ./build/src/bpftrace -e 'fn foo(): void { return; }'
No probes to attach

$ sudo ./build/src/bpftrace -e 'fn foo() { return; }'
stdin:1:10-11: ERROR: syntax error, unexpected {, expecting :
fn foo() { return; }

Going with bpftrace's philosophy of brevity, perhaps good to allow void return types to be omitted. But I think this is debatable

lenticularis39 · 2024-02-26T07:01:30Z

Are there any outstanding work items left? Or are we missing a final round of reviews?

No, just waiting for reviews.

Side note, this behavior is a bit unexpected:
$ sudo ./build/src/bpftrace -e 'fn foo($a: int32): int32 { return $a + 4; }'
stdin:1:30-43: ERROR: Function foo is of type int32, cannot return int64
fn foo($a: int32): int32 { return $a + 4; }
                             ~~~~~~~~~~~~~
I suspect this is more related to how we represent all ints as int64 internally. Not a show stopper -- IMO fixing this relaxes some rules so this is ok to do later.

Also the error underline seems to be off.

I know about this, it's because types were not explicitly given anywhere previously (besides casts). Not sure why the underlining is off.

ajor

Sorry for taking so long to review this. Really appreciate the work you've put it!

The code's looking good to me - the comments are just a bunch of minor things that can be fixed later. Let's just get this one through without more delay!

The only issue I found was that we're allowing multiple functions to be defined with the same name. Again, this can be fixed later:

fn aaa(): void { print(1); }
fn aaa(): void { print(2); }

ajor · 2024-02-22T17:17:23Z

src/ast/passes/resource_analyser.cpp

There are a couple of places in this file where we do: resources_.probes_using_usym.insert(probe_);

I don't think it'll cause any problems to put a null pointer in that set, but it looks a bit odd and might cause confusion in the future. Would probably be safer to add a check on probe_ for these too

ajor · 2024-02-22T17:56:54Z

src/ast/passes/semantic_analyser.cpp

@@ -164,11 +164,11 @@ void SemanticAnalyser::builtin_args_tracepoint(AttachPoint *attach_point,
  }
 }

-ProbeType SemanticAnalyser::single_provider_type(void)
+ProbeType SemanticAnalyser::single_provider_type(Probe *probe)


Suggested change

ProbeType SemanticAnalyser::single_provider_type(Probe *probe)

ProbeType SemanticAnalyser::single_provider_type(const Probe &probe)

Nit: If a null value isn't allowed then use references instead of pointers
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rf-ptr-ref

ajor · 2024-02-29T15:07:59Z

src/ast/passes/semantic_analyser.cpp

@@ -1332,6 +1372,24 @@ void SemanticAnalyser::check_stack_call(Call &call, bool kernel)
  call.type = CreateStack(kernel, stack_type);
 }

+Probe *SemanticAnalyser::get_probe_from_scope(Scope *scope,
+                                              const location &loc,
+                                              std::string name)


Suggested change

std::string name)

const std::string &name)

tests/semantic_analyser.cpp

ajor · 2024-02-29T15:23:24Z

src/ast/passes/semantic_analyser.cpp

+      ProbeType type = single_provider_type(probe);
+      cast.type.SetAS(find_addrspace(type));
+    } else {
+      // Assume kernel space for data in subprogs


This will need to be able to work with user space data in the future too

Yes, the issue is this probably requires a new language feature, since in the context of a subprog, there is no way to tell if you are accessing userspace or kernel space data based on pointers alone.

ajor · 2024-02-29T15:43:38Z

src/ast/passes/codegen_llvm.cpp

+                 [this](SubprogArg *arg) { return b_.GetType(arg->type); });
+  FunctionType *func_type = FunctionType::get(b_.GetType(subprog.return_type),
+                                              arg_types,
+                                              0);


Suggested change

0);

false);

This function takes a boolean: https://llvm.org/doxygen/classllvm_1_1FunctionType.html#af8be7844c269f201ebcee1e15048c378

Yeah I must have copied this from somewhere :D

ajor · 2024-02-29T15:54:34Z

src/ast/passes/codegen_llvm.cpp

@@ -879,7 +878,8 @@ void CodegenLLVM::visit(Call &call)
    expr_ = b_.CreateRegisterRead(ctx_, offset, call.func + "_" + reg_name);
  } else if (call.func == "printf") {
    // We overload printf call for iterator probe's seq_printf helper.
-    if (probetype(current_attach_point_->provider) == ProbeType::iter) {
+    if (!inside_subprog_ &&


If an iterator probe calls a subprogram, does it not still count as being in an iterator probe?

Currently subprogs do not track if they are called from an iterator probe so printf from subprog from iterator probe is treated like outside iterator probe. This should be probably fixed eventually, I will fix it in the follow-up PR for calls if it causes problems in tests.

ajor · 2024-02-29T16:03:00Z

src/ast/passes/codegen_llvm.cpp

 void CodegenLLVM::createRet(Value *value)
 {
  // If value is explicitly provided, use it
  if (value) {
    b_.CreateRet(value);
    return;
+  } else if (inside_subprog_) {
+    b_.CreateRetVoid();


I'm assuming this is just a temporary thing until your next PR?

This is because unlike with probes, there is not any default return value for subprogs. createRet() without argument is only used to end unreachable code that is then removed in subprogs.

A new structure for defining subprograms is added on the same level as probes: fn <function_name>(<arg0> : <type0>, ...): <return_type> { <body> } The existing return statement is extended to also allow returning a specified value. SemanticAnalyser checks if the type is correct, while a new pass called ReturnPathAnalyser checks if all paths return a value for non-void functions. Features requiring accessing the probe they are used in are not supported in subprograms; using them throws an error at compilation.

Includes parser, codegen, semantic analyser, field analyser, resource analyser, and return path analyser tests.

lenticularis39 · 2024-03-14T20:24:36Z

The PR should be now rebased and ready for merge.

ajor · 2024-03-15T08:42:02Z

Thanks @lenticularis39!

lenticularis39 force-pushed the user-defined-functions-new-new branch from 1b1c288 to 83f7c9b Compare May 25, 2023 09:05

github-advanced-security bot found potential problems May 25, 2023

View reviewed changes

src/ast/passes/semantic_analyser.cpp Fixed Show fixed Hide fixed

src/ast/ast.cpp Fixed Show fixed Hide fixed

src/ast/ast.cpp Fixed Show fixed Hide fixed

lenticularis39 force-pushed the user-defined-functions-new-new branch 3 times, most recently from 208972c to d5b7bad Compare May 31, 2023 12:50

lenticularis39 marked this pull request as ready for review May 31, 2023 16:50

lenticularis39 requested review from ajor, viktormalik, danobi and fbs as code owners May 31, 2023 16:50

viktormalik reviewed Jun 19, 2023

View reviewed changes

tests/parser.cpp Outdated Show resolved Hide resolved

src/ast/passes/codegen_llvm.cpp Outdated Show resolved Hide resolved

src/ast/passes/codegen_llvm.cpp Outdated Show resolved Hide resolved

lenticularis39 force-pushed the user-defined-functions-new-new branch 2 times, most recently from 94dd1d4 to b5b6c32 Compare June 20, 2023 09:37

ajor self-assigned this Jun 29, 2023

ajor requested changes Jul 3, 2023

View reviewed changes

ajor mentioned this pull request Jul 4, 2023

AST: Various cleanups #2664

Draft

danobi reviewed Jul 16, 2023

View reviewed changes

lenticularis39 force-pushed the user-defined-functions-new-new branch from b5b6c32 to f8db42a Compare September 12, 2023 16:10

lenticularis39 force-pushed the user-defined-functions-new-new branch from f8db42a to 66635ae Compare September 12, 2023 17:18

lenticularis39 force-pushed the user-defined-functions-new-new branch from 66635ae to 9ee2231 Compare September 28, 2023 08:21

lenticularis39 requested a review from ajor September 28, 2023 17:42

danobi reviewed Oct 26, 2023

View reviewed changes

src/ast/passes/codegen_llvm.h Show resolved Hide resolved

src/ast/passes/codegen_llvm.cpp Outdated Show resolved Hide resolved

src/ast/passes/codegen_llvm.cpp Outdated Show resolved Hide resolved

danobi approved these changes Nov 6, 2023

View reviewed changes

lenticularis39 force-pushed the user-defined-functions-new-new branch from 57b5954 to bbe9804 Compare November 7, 2023 15:51

viktormalik approved these changes Nov 8, 2023

View reviewed changes

danobi added the waiting-for-feedback label Nov 13, 2023

viktormalik added the priority: high label Dec 19, 2023

ajor requested changes Jan 5, 2024

View reviewed changes

viktormalik mentioned this pull request Jan 25, 2024

RFC: Declare Maps Up-Front #2954

Open

lenticularis39 force-pushed the user-defined-functions-new-new branch from bbe9804 to 237f002 Compare January 25, 2024 16:45

lenticularis39 force-pushed the user-defined-functions-new-new branch 4 times, most recently from f92b07f to 6abd22b Compare February 11, 2024 10:01

lenticularis39 requested a review from ajor February 11, 2024 11:33

ajor approved these changes Feb 29, 2024

View reviewed changes

lenticularis39 added 2 commits March 14, 2024 20:18

Tests for user function definition support

f6c0ba1

Includes parser, codegen, semantic analyser, field analyser, resource analyser, and return path analyser tests.

lenticularis39 force-pushed the user-defined-functions-new-new branch from 6abd22b to f6c0ba1 Compare March 14, 2024 20:04

ajor merged commit 26e310f into bpftrace:master Mar 15, 2024
19 checks passed

ajor mentioned this pull request Mar 15, 2024

Parser returns incorrect locations #3063

Open

lenticularis39 mentioned this pull request Mar 16, 2024

Add user function call support #3068

Open

3 tasks

ajor removed the waiting-for-feedback label Apr 9, 2024

	ProbeType SemanticAnalyser::single_provider_type(Probe *probe)
	ProbeType SemanticAnalyser::single_provider_type(const Probe &probe)

Add user function definition support #2624

Add user function definition support #2624

Conversation

lenticularis39 commented May 25, 2023 • edited

Checklist

viktormalik left a comment

Choose a reason for hiding this comment

lenticularis39 commented Jun 20, 2023

ajor left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danobi commented Jul 12, 2023

lenticularis39 commented Jul 13, 2023

ajor commented Jul 14, 2023

Choose a reason for hiding this comment

lenticularis39 Jul 17, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fbs commented Jul 17, 2023

lenticularis39 commented Jul 18, 2023

fbs commented Jul 18, 2023

lenticularis39 commented Jul 18, 2023

lenticularis39 commented Sep 12, 2023 • edited

lenticularis39 commented Nov 7, 2023

viktormalik left a comment • edited

Choose a reason for hiding this comment

ajor left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lenticularis39 commented Jan 25, 2024

lenticularis39 commented Feb 11, 2024

danobi commented Feb 25, 2024 • edited

danobi commented Feb 25, 2024

lenticularis39 commented Feb 26, 2024

ajor left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lenticularis39 Mar 8, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lenticularis39 commented Mar 14, 2024

ajor commented Mar 15, 2024

lenticularis39 commented May 25, 2023 •

edited

lenticularis39 Jul 17, 2023 •

edited

lenticularis39 commented Sep 12, 2023 •

edited

viktormalik left a comment •

edited

ajor left a comment •

edited

danobi commented Feb 25, 2024 •

edited

lenticularis39 Mar 8, 2024 •

edited