
Determine Array_Index or Function_Call in AST #246

Open
chaoqing opened this issue Dec 13, 2021 · 19 comments
Labels
component: sem Affects semantic analysis

Comments

@chaoqing

First, thanks for this awesome codebase, which saved me a lot of effort while creating a tool that converts MATLAB to Python3/NumPy code in this in-progress fork.

What kind of feature is this?
I noticed there is a ToDo item here. Determining whether a Reference is an Array_Index or a Function_Call is a critical step when converting MATLAB to NumPy-like code.

Describe the solution you'd like
I just want to check the progress on this item. I may implement my own version, most probably an ugly one, if there is no update from upstream soon.

Thanks again for this fantastic repo! I really enjoy reading the well-organized methods and code. And of course I will consider doing a pull request once my fork is ready.

@florianschanda
Owner

Hey, it is nice to see that you're trying to do this. But please beware that this is harder than you might think, due to MATLAB's... let's say casual relationship with common sense when it comes to language design.

I cannot do this until I get semantic analysis working, because I need to know what the functions are. The only way you can tell a function call apart from an array access is semantic analysis; there is nothing in the syntax of the language that distinguishes the two. I would consider this a serious mistake from a language design point of view, but whatever.

It's even worse. Consider this:

function x = potato(a)
   x = a(5);
end

Why is this hard? Consider this calling context:

potato(@(x) x + 1);

So yeah. Here this could be either an array access or a function pointer call; and this is sadly valid MATLAB.

I have not yet decided how to deal with this.

Do I treat any function as essentially generic, and give you multiple ASTs for each unique way of calling it? Or do I just make it illegal inside the MISS_HIT subset of MATLAB?

@florianschanda florianschanda added the component: sem Affects semantic analysis label Dec 14, 2021
@florianschanda
Owner

You could do something a bit different that might work without a working sem: you could create a run-time environment and use that. One of the things you could do is make everything a class; then if you have a function instance you call the index() method, and if you have an array instance you also call the index() method.

Not sure if I am explaining this in a way that makes sense; to paraphrase: generate code that is way more inefficient but deals with these problems dynamically. Similar to how Python has dynamic typing; you could build a class hierarchy that supports everything you need and then generate code using it.
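A minimal sketch of that dynamic-dispatch idea (all names here are invented for illustration; MatVar and the 1-based index shift are assumptions, not anything that exists in miss_hit):

```python
class MatVar:
    """Hypothetical wrapper: MATLAB's ambiguous a(5) always maps to
    __call__ here, which decides at run time whether to index or call."""

    def __init__(self, value):
        self.value = value

    def __call__(self, idx):
        if callable(self.value):
            return self.value(idx)   # function handle call
        return self.value[idx - 1]   # array access (1-based -> 0-based)


def potato(a):
    # x = a(5); -- translated without knowing what `a` is
    return MatVar(a)(5)


print(potato([10, 20, 30, 40, 50]))  # prints 50 (array access)
print(potato(lambda x: x + 1))       # prints 6 (function handle call)
```

Both calling contexts from the potato example above then go through the same generated code, at the cost of a run-time dispatch per reference.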

@chaoqing
Author

chaoqing commented Dec 15, 2021

Thanks for the heads up. I understand it is difficult to do the semantic analysis because of the complexity of the MATLAB language. So for now I came up with another way: providing a subclass of the numpy array that implements the __call__ magic method, in case I am not sure whether it is an array. Like I did here.
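That subclass idea could be sketched roughly like this (MatArray and the mat() helper are made-up names for illustration, not the actual code in the fork):

```python
import numpy as np


class MatArray(np.ndarray):
    """Hypothetical ndarray subclass: MATLAB-style a(i, j) call syntax
    falls through to 1-based indexing via __call__."""

    def __call__(self, *idx):
        # Shift each 1-based MATLAB subscript down to 0-based
        return self[tuple(i - 1 for i in idx)]


def mat(values):
    # View a plain array through the MatArray lens
    return np.asarray(values).view(MatArray)


a = mat([[1, 2], [3, 4]])
print(a(1, 2))  # MATLAB a(1,2) -> Python a[0,1] -> 2
print(a(2, 1))  # MATLAB a(2,1) -> Python a[1,0] -> 3
```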

Anyway, I will watch this todo item and hopefully we can do it in a clean way.

@florianschanda
Owner

providing a subclass of the numpy array that implements the __call__ magic method, in case I am not sure whether it is an array

Yes, that is what I had in mind. Since it's Python anyway, I think the performance impact of this is negligible.

Some additional - extremely superficial - comments when looking at your code so far:

  • try not to put any logic into m_ast or similar; put this into your main program. I have tried very hard to keep things separate.
  • you should write a new backend (mh_python or similar), look for inspiration on how to do this in the smaller tools I have (mh_lint or mh_copyright). The tree walk should go in there too.
  • you will need to update errors.py as well, to register your new tool

Have a look also at the make target m_ast_picture.pdf, it will give some clarity on the structure of the AST.

@chaoqing
Author

Sure, I can do as you suggested later; for now it is just a prototype. The hint about the types picture is really useful!

@florianschanda
Owner

Also, a few more things to help you out...

There is a tree-traversal construct that you might be able to use:

class AST_Visitor:

You can subclass this and overwrite visit and visit_end. visit is called when a node is visited for the first time, and visit_end is called after all the children have been visited. Both have 3 parameters:

  • node - the node visited
  • n_parent - the parent node
  • relation - a string describing how this node relates to the parent node. If you want to know what they are you can search the various visit functions, for example:

    def visit(self, parent, function, relation):
        self._visit(parent, function, relation)
        self.n_statements.visit(self, function, "Statements")
        self._visit_list(self.l_functions, function, "Functions")
        self._visit_end(parent, function, relation)
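The visit/visit_end protocol can be illustrated with a standalone miniature (the real AST_Visitor lives in miss_hit's m_ast module; Node and walk here are simplified stand-ins just to show the traversal order):

```python
class AST_Visitor:
    # Mirrors the interface described above: visit on first touch,
    # visit_end after all children have been processed.
    def visit(self, node, n_parent, relation):
        pass

    def visit_end(self, node, n_parent, relation):
        pass


class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def walk(self, visitor, n_parent=None, relation="Root"):
        visitor.visit(self, n_parent, relation)      # first touch
        for child in self.children:
            child.walk(visitor, self, "Child")
        visitor.visit_end(self, n_parent, relation)  # after all children


class Tracer(AST_Visitor):
    def __init__(self):
        self.events = []

    def visit(self, node, n_parent, relation):
        self.events.append("enter " + node.name)

    def visit_end(self, node, n_parent, relation):
        self.events.append("leave " + node.name)


tree = Node("Function", [Node("Statements"), Node("Expression")])
tracer = Tracer()
tree.walk(tracer)
print(tracer.events)
# ['enter Function', 'enter Statements', 'leave Statements',
#  'enter Expression', 'leave Expression', 'leave Function']
```

A code generator typically emits most of its output in visit_end, once all child fragments are available.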

If you want to know what kind of parents a node can have, you can check the set_parent method on the node, which will contain an assert. For example:

def set_parent(self, n_parent):
    assert isinstance(n_parent, Class_File)
    super().set_parent(n_parent)

This is set_parent of the Class_Definition node, and so what is enforced here is that a class definition can only have one possible parent node, a Class_File node.

All nodes have a parent, except the compilation unit node.

Finally, for code generation you likely have more than one file. I saw you extended mh_debug_parser; but this is a really bad starting point because it only ever processes one file; it exists for me to test the parser. You really want to create a proper back-end, which gives you proper command-line handling, access to the config mechanism, etc.

For this a good example is MH Lint, as it's the smallest. First you subclass the result class. This can be trivial:

class MH_Lint_Result(work_package.Result):
    def __init__(self, wp, sem=None):
        super().__init__(wp, True)
        self.sem = sem

Then you subclass command_line.MISS_HIT_Back_End, as seen here:

class MH_Lint(command_line.MISS_HIT_Back_End):

There are three functions you may want to overwrite:

  • process_wp - process a MATLAB file and return a result; this will be done in parallel
  • process_result - called on each result, in a linear non-multithreaded fashion
  • post_process - called at the very end on the backend-class, after everything has been processed

HTH

@chaoqing
Author

Yes, I indeed implemented the AST_Visitor visit_end method and can already generate syntactically correct Python code.

Translating

A=A+3+4i;
disp(' a b " ');
disp('x');
disp(A(:, 4, 2:end));

clear A;

[out(4), ~] = func_example(~A(1, end), A(end-1)', A(3:(end-1)).');

function [out, out2] = func_example(A, B, C)
    out = [];
    out = [1;2;3];
    out = [-.1 -0.1 .1; 1, func_example(A, B, C)+out+3, 3];
    out2 = [out; out] / B;
    out2 = out{A, B};

    if (C == 1) && (B > 3) || (A==4)
        out2 = out2 | 1;
    elseif C == 2
    else
        for i = 1:A
            continue
            out2 = addone(out2(1:5:2)./A, end) + out(i)\out(:);
            break
        end
        return
    end

end


function out = addone(A)
    out=A+1;
end

into

def func_example(A, B, C):
    out = mnp.array([])
    out = mnp.stack((1, 2, 3))
    out = mnp.stack(([-.1, -0.1, .1], [1, (func_example(A, B, C) + out) + 3, 3]))
    out2 = mnp.mrdivide(mnp.stack((out, out)), B)
    out2 = out[A, B]
    if ((C == 1) and (B > 3)) or (A == 4):
        out2 = out2 | 1
    elif C == 2:
        pass
    else:
        for i in range(1, A):
            continue
            out2 = addone[out2(range(1, 2, 5)) / A, end] + (mnp.mldivide(out(i), out[:]))
            break
        
        return
    return out, out2

def addone(A):
    out = A + 1
    return out

A = (A + 3) + 4j
disp(' a b " ')
disp('x')
disp(A[:, 4, range(2, end)])
clear('A')
out[4], _ = func_example(~(A[1, end]), (A[end - 1]).T, (A[range(3, end - 1)]).T)

The results seem promising, so I am now working on a new package to provide a wrapper around numpy for MATLAB-compatible behavior like 1-based indexing, the end keyword, etc.
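The 1-based/end translation could start from something like this tiny sketch (END and mslice are invented names, not from any existing package, and real MATLAB ranges with strides would need more cases):

```python
END = object()  # sentinel standing in for MATLAB's `end` keyword


def mslice(start, stop):
    """Map an inclusive, 1-based MATLAB range like 2:end onto a
    0-based, exclusive Python slice."""
    if stop is END:
        return slice(start - 1, None)
    return slice(start - 1, stop)  # inclusive stop: 1-based start-1 .. stop


a = [10, 20, 30, 40]
print(a[mslice(2, END)])  # MATLAB a(2:end) -> [20, 30, 40]
print(a[mslice(1, 2)])    # MATLAB a(1:2)   -> [10, 20]
```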

I did encounter a problem when it comes to multiple MATLAB files, especially the Naked_Expression_Statement, because it means sourcing a MATLAB script, which has no equivalent Python representation (eval does not work for complex statements). So my choice would be to insert the script directly into the generated Python code. That requires generating a big AST from multiple files. Do you have any suggestion on this?

@chaoqing
Author

Another idea in my mind is to insert the original MATLAB script into the generated Python code as comments at the corresponding positions. I did notice that MATLAB_Token has a Location, which can be useful to determine where in the MATLAB code the current Python line came from. But there is no unified way to extract all tokens from one Node, because different Node subclasses may have differently named/nested MATLAB_Tokens. Do you happen to have a solution for this, or do I need to implement it one Node subclass at a time?

@florianschanda
Owner

Let me answer the second comment first, as the first one will take time.

Good news, there is, but it's the other way around. Kinda.
So each token has an ast_link:

# A link back to the AST so that we can identify to which node
# tokens nominally belong.
self.ast_link = None

To go the other way is not possible right now, but you could go through the token stream. Normally the token stream is discarded when the AST is made, but there is a way to preserve it (mh_style uses it to pretty-print). You can use the same setup MH Style uses:

lexer = MATLAB_Lexer(wp.mh, content, wp.filename, wp.blockname)
if wp.cfg.octave:
    lexer.set_octave_mode()
if not wp.cfg.pragmas:
    lexer.process_pragmas = False

try:
    tbuf = Token_Buffer(lexer, wp.cfg)
except Error:
    # If there are lex errors, we can stop here
    return MH_Style_Result(wp)

# Create parse tree
try:
    parser = MATLAB_Parser(wp.mh, tbuf, wp.cfg)
    parse_tree = parser.parse_file()

Most AST nodes do remember the tokens that are important to them; e.g. t_op for a binary operation points to the token that is the + or similar.

class Binary_Operation(Expression):
    def __init__(self, precedence, t_op, n_lhs, n_rhs):
        super().__init__()
        assert 1 <= precedence <= 12
        assert isinstance(t_op, MATLAB_Token)
        assert t_op.kind == "OPERATOR"
        assert isinstance(n_lhs, Expression)
        assert isinstance(n_rhs, Expression)
        # Numeric precedence to determine if brackets are necessary or not.
        self.precedence = precedence
        # The token for the operator symbol
        self.t_op = t_op
        self.t_op.set_ast(self)

In general I have some conventions (you can see them enforced via asserts):

  • t_ a token
  • n_ a node of some kind
  • e_ (not yet, but later, an entity)
  • l_ a list of nodes
  • something without prefix is usually a python builtin type

So, long story short: right now this is kinda hard and messy, but you could create a MATLAB pretty-printer for the expressions you need and use that. But if you go through the effort of preserving the token stream, then the good news is that you can pull out comments as well. And that might be super nice for traceability.

@florianschanda
Owner

I did encounter a problem when it comes to multiple MATLAB files, especially the Naked_Expression_Statement, because it means sourcing a MATLAB script, which has no equivalent Python representation (eval does not work for complex statements). So my choice would be to insert the script directly into the generated Python code. That requires generating a big AST from multiple files.

I am not sure I understand everything in this question.

Naked_Expression_Statement can occur at any point where you get a statement, script or not. For example:

function x = potato(a)
   a;
   x = a;
   disp(x);

The first and last lines in the function are each a naked_expression_statement.

Secondly, script files should translate nearly one-to-one into Python files, so no issue there.

Finally, scripts using eval() are EVIL. You might want to just not support that. As far as I know, MATLAB Coder, Simulink Coder, and Embedded Coder all do not support eval. I suggest you do the same :)

@chaoqing
Author

chaoqing commented Dec 16, 2021

Thanks for the detailed explanation. I will try to follow the approach you described, but that may be a lower priority for now.


For the second question, let me explain with an example. Say we have two Script_Files called a.m and b.m that look like the following:

% a.m
disp('this is a.m');
disp('there might be complex operation in a.m');
j = 0;
for i = 1:x
    j = j+i;
end

% b.m
disp('this is b.m');
x = 5;
disp('the following is similar to Linux sourcing script')
a;
disp(['j:', num2str(j)]);

When executing b, you see that b.m sources the script a.m in the same stack, so the variables x and j exist at the same level. The only way I can think of for now is to embed the whole of a.m into b.m.

BTW, I have separated the translation logic into a separate backend ml_python as you suggested 😄.

@florianschanda
Owner

BTW, I have separated the translation logic into a separate backend ml_python as you suggested.

Great, it will make a lot of things much easier :)

When executing b, you see that b.m sources the script a.m in the same stack, so the variables x and j exist at the same level. The only way I can think of for now is to embed the whole of a.m into b.m.

I understand now, OK.

So what I would suggest is that you start keeping track of a little symbol table for yourself. The way I intend to deal with this eventually is:

  • variables used in a script are in a special global scope
  • the a; comes out as a function/script call, and scripts have access to this global scope
  • the only way this works is that b.m is registered as the entry point (see https://florianschanda.github.io/miss_hit/configuration.html section "Dealing with multiple programs and shared code") as a.m cannot be run on its own (since x is not declared)

Again, I have plans to do this, but I just haven't had the energy to start with this mammoth task.

You could first get a list of files that will be processed; then when you deal with things like a; you check first if the name called is a script and special-case it. But I am not sure how to deal with the global scope of variables. Inlining might work but could get really messy.

I would try to translate it as a function and pass the variables as parameters; or otherwise generate a Python module called global_workspace that keeps these variables there.
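For the a.m/b.m example earlier in the thread, the "translate scripts as functions over an explicit workspace" suggestion might look roughly like this (run_a, run_b, and the ws dict are illustrative names, not generated output from any tool):

```python
def run_a(ws):
    # a.m: j = 0; for i = 1:x, j = j + i; end
    ws["j"] = 0
    for i in range(1, ws["x"] + 1):
        ws["j"] += i


def run_b(ws):
    # b.m: x = 5; a; disp(['j:', num2str(j)]);
    ws["x"] = 5
    run_a(ws)  # the naked `a;` becomes an explicit call sharing the workspace
    print("j:" + str(ws["j"]))


workspace = {}
run_b(workspace)  # prints j:15
```

The workspace dict makes the MATLAB-style shared scope explicit, so a.m no longer needs to be textually inlined into b.m.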

This is going to be ugly. You could perhaps start by not supporting that construct, unless it is extremely common in the MATLAB code you're targeting.

@RobBW

RobBW commented Dec 21, 2021

I may be out of my depth here, but I notice that one or two of the problems you refer to above appear to have been solved by Victor Leikehman in smop (https://github.com/victorlei/smop). It might be worth looking at his code to see how he handled comments, for instance.

@RobBW

RobBW commented Dec 21, 2021

You may find the references in https://github.com/mbdevpl/transpyle/issues/7 helpful, in particular Bysiek's (Typed)Unparser: https://mbdevpl.github.io/posts/typed-astunparse/.
Looking at what you are doing alongside Bysiek's Transpyle, there are other interesting possible outcomes to follow up:

Python to Python AST

Python 3.6 with whole-line comments outside expressions is fully supported. Presence of end-of-line comments or comments in expressions might result in errors.

Python AST to Fortran

Currently, the Fortran unparser uses special attribute fortran_metadata attached to selected Python AST nodes, and therefore unparsing raw Python AST created directly from ordinary Python file might not work as expected.

The above behaviour will change in the future.

Bysiek's main interest is in HPC; he is using Fortran with typing to achieve speed, and Python to provide maintainability.

@chaoqing
Author

chaoqing commented Dec 21, 2021

Hi @RobBW, thanks for the info about Transpyle.

I am aware of smop, but somehow it failed to produce correct results on my first try, so I tended towards producing my own MATLAB-to-Python project. I also think smop does not support some key features of MATLAB, like the general end keyword. Actually, I followed smop's issue tracker to miss_hit, and I believe it provides a better, more modern AST parser than smop does.

I am working on a fully MATLAB-compatible package and will open-source it once I consider the key components done. The key idea is to translate MATLAB code as-is and run it without heavy modifications.

@florianschanda
Owner

@RobBW - random aside - a quick superficial read of the lexer in smop indicates to me that language support is probably quite partial. Specifically, I have doubts about the correctness of matrix lexing, and I also think command form is not dealt with. Also, classdef is not supported either: https://github.com/victorlei/smop/blob/bdad96b715d1dd75ce8ab4724f76b9b1bb1f61cd/smop/lexer.py#L139

There are tons of MATLAB parsers on GitHub; I reviewed about 10 of them before I decided that I needed to write my own, because none of them are complete. And once you get to something complete, the complications pile on and on :)

As to comments and parsing: the way they have done it is through "comment statements". See a discussion here: #239

This is indeed one of the ways I was thinking of dealing with comments, but it's not done yet. It adds further complications to the language definition, a kind of complication I already encountered with the pragmas.

@chaoqing
Author

Hi @florianschanda, I completed the first working prototype of translating MATLAB to Python. The project contains two main parts: a language translator mh_python, in a fork of miss_hit under the same AGPL license; and a separate runtime engine mat2py, under the MIT license, to provide a MATLAB accent in Python.

There are still a lot of things to do, but I am considering collaboration patterns now. For now my fork is identical to miss_hit except for two new scripts. I plan to deploy a web service for the new script mh_python only, because I don't have license permission for the other mh_* scripts. Or we merge the fork back into the main miss_hit and I can help deploy a full version of the miss_hit toolchain. Which way do you prefer?

@florianschanda
Owner

@chaoqing sorry for the long radio silence, I had some real-life stuff to deal with. I am pleasantly surprised how far you managed to take this concept without semantic analysis. It'll get a lot easier.

The licensing seems fine. AGPL for the translator, whatever else you want for the run-time support.

I am open to collaboration, but if we want to merge mh_python then I will need to do a very detailed review. I have fairly strict ideas as to when something is ready for release (a key one being that it operates under any reasonable circumstance). I also need to think about copyright: I will either ask for an assignment (similar to what the FSF/gcc does) or create a 3rdparty directory. I honestly haven't thought this far ahead.

@chaoqing
Author

chaoqing commented May 8, 2022

Hi @florianschanda, yes, I know merging my fork back into miss_hit would need a lot of validation work. So I released the utility in a separate package mh-python under AGPL, which contains only the single file I created. This is mainly for the purpose of the in-browser serverless MATLAB emulator I created.

BTW, for anyone interested in the emulator: it now supports basic MATLAB math operations and the plot command. If a NotImplementedError pops up, you are welcome to file a feature request, like was done here.
