Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some general questions/suggestions #1702

Closed
deadlocklogic opened this issue Jun 20, 2021 · 18 comments
Closed

Some general questions/suggestions #1702

deadlocklogic opened this issue Jun 20, 2021 · 18 comments

Comments

@deadlocklogic
Copy link

deadlocklogic commented Jun 20, 2021

Hi, great and interesting project, just few questions/suggestions:
1- Why you don't use the default python tokenizer module converted to Javascript because it is much simpler and battle tested, the current implementation is a bit bloated, plus there are many bugs which I can of course contribute/report (mainly related with error tokens and the start/end range).
2- Why not using ANTLR to generate the parse tree (ANTLR python grammar is already defined and battle tested and the Javascript or python runtimes are also available+ possibility to interface any python version like 2.x which is out the scope of this project + possibility to get inspired from other projects like Jython) and thus the whole project will consist of a visitor which will have all the transform methods migrated to.
3- Implementing 2 will give an easier out-of-the-box implementation of the ast module which is unimplemented currently.
4- Why not using a safer typed language like Typescript and divide the transpiler into smaller chunks using better OOP paradigms(an ultimate goal is the whole project written in python and which it can transpile itself to Javascript)

@deadlocklogic
Copy link
Author

Some simple issues/bugs to clarify my point:

Issue1:

var _tokenizer = __BRYTHON__.tokenizer("\rtest");

var a;
while (a = l_tokenizer.next()) {
    if(a.done){
        break;
    }
    console.log(a.value);
}

The 3rd token which is the name token prints:

{
  '0': 'NAME',
  '1': 'est',
  '2': [ 2, 0 ],
  '3': [ 2, 3 ],
  '4': 'st',
  type: 'NAME',
  string: 'est',
  start: [ 2, 0 ],
  end: [ 2, 3 ],
  line: 'st'
}

which is wrong, the name should be 'test'. (The fix is quite simple just remove this line

)

Issue2:

The line property is really broken as you can see for the previous example (after fixing the bug the line is still wrong, almost every time the first character of the line is omitted after the first token on a given line)

Issue3

Tons of variables are unused but still reside and are somehow confusing.

brython/www/src/py2js.js

Lines 10937 to 10955 in 048601d

try{
var token = tokenizer.next()
}catch(err){
if(err.type == 'IndentationError'){
context = context || new $NodeCtx(node)
$pos = line2pos[err.line_num]
$_SyntaxError(context, err.message, 'indent')
}else if(err instanceof SyntaxError){
if(braces_stack.length > 0){
var last_brace = $B.last(braces_stack),
start = last_brace.start
$pos = line2pos[start[0]] + start[1]
$_SyntaxError(context, [`'${last_brace.string} was ` +
'never closed'])
}
$_SyntaxError(context, err.message)
}
throw err
}

the tokenizer never throws a $_SyntaxError why even test against it then?

Issue4

py2js.js is 99.9% py2js_tokenizer.js with few debug functions. Why not implementing a logger or a simple flag and merge these 2 files into one? Or there is any purpose of doing this?

This is all just by quick review of the project, but overall interesting one.

@schmave
Copy link
Contributor

schmave commented Jun 21, 2021

From your original post, (1) directly translating the Python tokenizer module to JS, and/or (2) using ANTLR for parsing seem like reasonable ideas to me.

But these are big projects. Are you interested in implementing them?

@PierreQuentel may have more thoughts, since he just did an overhaul of the tokenization code.

@PierreQuentel
Copy link
Contributor

Interesting questions, thanks !

1 - there are 2 tokenizers in CPython, the pure-Python module tokenize and an implementation in C used by the interpreter.

My first idea was to convert the pure-Python module to Javascript using Brython and optimize it. Unfortunately, as often, it has dependencies : 34 other modules need to be imported to make it work... There are probably very few parts of these modules that are actually used, but idenfiying them and replacing the imported code by Javascript equivalents would have required at least as much work as writing a Javascript implementation from scratch.

Another option would have been to use the C implementation, but, even assuming that it's possible, I don't know C enough for that.

This is why I decided to write the Javascript version in python_tokenizer.js. It is certainly not perfect, but its results are similar enough to the CPython module that it successfully parses the modules in the standard library. And I hope that the code is not too difficult to understand.

To sum up, unless someone can come with a version derived from one of the CPython implementations - you are right, it would certainly be 100% compliant - that is at least as fast as this one, I prefer to keep it and fix the bugs that you or other developers would report. Many thanks for the ones you reported in your second post, I will fix them ASAP.

2 - and 3 - I didn't know ANTLR. It is mentioned in PEP 617 as an alternative for a new Python parser and it was not selected. It seems that it's an LL parser, and I wonder if it would be possible to use it for new Python constructs such as the "soft keywords" match and case that are going to be introduced in Python 3.10 : they require unlimited lookahead to detect that match is an identifier in match = 0 and a keyword in match expression:.

Anyway, I actually gave a try at implementing a PEG parser, based on the simplified Python grammar and the left recursion algorithm referenced in PEP 617. The result is in python_parser.js.

The next step would have been to generate an AST from the tree produced by this parser, based on the complete grammar. I must admit that the amount of work needed for that frightened me... and I suppose that this step would probably be necessary with ANTLR.

For AST, another option might be to use the tree built by the current implementation by dispatch_tokens(). I'm not sure it's possible and how much work it would require though.

4 - If I started the project now, the choice between Javascript and Typescript would not be obvious. But it started in 2012, I'm afraid it's too late to switch to another language, however close to JS as it might be.

@PierreQuentel
Copy link
Contributor

The commits referenced above fix most of the issues you have reported.

I didn't understand Issue 3. The tokenizer throws Javascript SyntaxErrors and errors with attribute .type set to "IndentationError", and this is what dispatch_tokens() tests.

@deadlocklogic
Copy link
Author

deadlocklogic commented Jun 21, 2021

The tokenizer in python_tokenizer.js throws 4 exceptions, 2 of type Error, 2 of type SyntaxError which are native JavaScript types, so as a quick review, the try block won't ever catch the first test. I guess maybe you meant throwing a BRYTHON.$_SyntaxError?

brython/www/src/py2js.js

Lines 10940 to 10943 in 048601d

if(err.type == 'IndentationError'){
context = context || new $NodeCtx(node)
$pos = line2pos[err.line_num]
$_SyntaxError(context, err.message, 'indent')

By the way I come from C/C++. C#, Java even Python, Lua etc... languages background, I know Javascript of course but I prefer Typescript or the new ECMAScript standards with better OOP because the dynamism of Javascript can be heaven or hell, nothing between. So for small webpages nothing is wrong with JS, but for larger projects I believe Typescript is better at least from structuring point of view, at least won't compile anything you throw at it.

The way I am finding such bugs is that I am analyzing the project as much as I can now because I need such functionality, not for the web necessarily. By the way, I am finding many other bugs or possible ones which I will report soon.

If somebody is willing to review PR's I can try a shot on providing a transpiration of the tokenize module.(maybe some experimental branch with CPython ANTLR stuff)
To be honest, I am seriously considering a manual/careful transpiration of the project to pure python at some point, therefore it can transpile itself virtually to any language given a solid battle tested ast, therefore transpiling to any possible language will be nothing more than a visitor implementation and some built-in function mapping.

PierreQuentel pushed a commit that referenced this issue Jun 21, 2021
@PierreQuentel
Copy link
Contributor

Issue3 is fixed - you were right, I probably removed the attribute .type at some point.

@deadlocklogic
Copy link
Author

deadlocklogic commented Jun 23, 2021

After some little digging/searches, I found this project: Skulpt which also tries to transpile Python 2/3 to JavaScript. It has most of libs unimplemented compared to Brython but it has a token/tokenize files compliant with Python standard:
https://github.com/skulpt/skulpt/blob/master/src/token.js
https://github.com/skulpt/skulpt/blob/master/src/tokenize.js
So basically transpiling them again would be like reinventing the wheel (even their license is MIT).
And I see they are using an asdl to generate the ast and its component just like CPython does, on the other hand Jython uses a mix of this and ANTLR, couldn't figure out why exactly they needed ANTLR.
Edit: Jython is using ANTLR as the main tokenizer, parser (the asdl is just for generating the ast module)

So basically each of these projects generate the ast module from an asdl file and implement a compiler as a node visitor, by doing that they avoid the hassle of maintaining handwritten code (the ast module they auto generate with a python script which parses the asdl).
Wonder if Brython could benefit from something like this.

@PierreQuentel
Copy link
Contributor

@deadlocklogic

Thanks for the pointers to the Skulpt scripts, I didn't think of looking for this.

What you suggest makes sense, obviously, but it means a very radical change in Brython code - probably an almost complete rewriting of py2js.js. I would suggest that you start a separate project, relying on the new PEG parser, with Javascript code to generate a similar AST to that of CPython 3.10. I and other contributors would follow the project and we could discuss its progress and a possible integration in Brython.

What do you think ?

@deadlocklogic
Copy link
Author

Yes exactly my thought. This needs to be done in a separate branch or a completely separate project and see how things can go from there. So I have no problem with starting a new project.
Just a question, the Javascript code generation phase is done completely with calling the transform function?
I was looking at CPython and evaluating the possibility of following their lexing/parsing scheme, the main concern is the compatibility between their ast and Brython's ast in case I want to use the code generation part.

@s-cork
Copy link

s-cork commented Jul 5, 2021

Skulpt is also currently experimenting with the pegen parser which is skulpt agnostic and so it may be useful for Brython.
You can take a look at the efforts here: https://github.com/skulpt/skulpt_parser

We're very much in experimental phase right now.
And focusing on making it work for 3.9.5 before we touch 3.10 and match/case statements.
But at the time of writing - after a pr for fstrings is committed - we are generating the correct AST for std library modules.

(Full disclosure - I'm a skulpt contributor and working on the parser)

@PierreQuentel
Copy link
Contributor

Just a question, the Javascript code generation phase is done completely with calling the transform function?

It's more complex than that :-(

In short :

  • dispatch_tokens() creates a tree similar to the Python AST
  • transform() adds nodes to / removes nodes from the tree for the Python statements that are translated into several Javascript statements
  • to_js() generates the Javascript code

@deadlocklogic
Copy link
Author

Yes I see now, by the way I am kind of transpiling the whole CPython parser/tokenizer and about to do some tests with it. When I am done I will create a repo and let you know, if Brython could make use of it somehow. (It is a straightforward port but compared with Brython's parser, it is bigger but with some obfuscation/minifying it could be reduced).

@PierreQuentel
Copy link
Contributor

For your information, the latest Brython version (3.10.4) includes the ast module, which produces the same results as the CPython module.

@deadlocklogic
Copy link
Author

deadlocklogic commented Jan 1, 2022

Hi, it is been a while. Actually I remember writing a big chunk of a parser using CPython as base code, but for many reasons I had to abandon the project before polishing it.
Yes I saw this PR which is a huge improvement.

So now the parsing is done internally using the CPython's compliant parser? Because it would be much easier to write a simple visitor around it and generate the JS code.
I had a bit or hard time understanding how the parsing is done in Brython due to lack of documentations.

All and all, great news for such a promising library.

@deadlocklogic
Copy link
Author

By the way, does Brython supports separating code into modules? Because I was using it for simple scripts. Never tried with a multiple file project.

@PierreQuentel
Copy link
Contributor

I am not too sure what you mean by "separating code into modules" but if you mean different Python modules that can import each other, possibly in a package, yes this is supported.

I am currently working on an engine that generates Javascript from the AST representation instead of the Brython-specific tree, in module ast_to_js.js. It is progressing very well, still a few weeks of work before it can replace the current implementation but I am very optimistic.

If it succeeds, it will make the Python-to-JS engine much easier to understand : since version 3.9.4 the tokenizer produces the same tokens as CPython, since 3.10.4 the AST tree is also the same, and both are documented in Python docs. The generation engine will also be much shorter (notably, no need for the .transform() methods, which are currently very obscure and poorly documented).

@deadlocklogic
Copy link
Author

Yeah I meant different Python modules that can import each other.

I am currently working on an engine that generates Javascript from the AST representation instead of the Brython-specific tree, in module ast_to_js.js. It is progressing very well, still a few weeks of work before it can replace the current implementation but I am very optimistic.

Exactly, what I think. It would be much easier to write a visitor which generates JS from a conforming CPython AST. Maybe a typescript rewrite, when everything is refactored would be easy and neat.
Overall, nice progress!

@PierreQuentel
Copy link
Contributor

Since release 3.10.6, the Javascript code is generated by the Python-compliant ATS. This is done in ast_to_js.js.

I have written a PEG parser based on the standard Python grammar to generate the AST, as described in PEP 617. It works, but it's much slower than the current Brython parser... More information in this post.

I think we can close this issue. If you prefer to leave it open please add a comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants