Some general questions/suggestions #1702

deadlocklogic · 2021-06-20T12:14:13Z

Hi, great and interesting project, just few questions/suggestions:
1- Why you don't use the default python tokenizer module converted to Javascript because it is much simpler and battle tested, the current implementation is a bit bloated, plus there are many bugs which I can of course contribute/report (mainly related with error tokens and the start/end range).
2- Why not using ANTLR to generate the parse tree (ANTLR python grammar is already defined and battle tested and the Javascript or python runtimes are also available+ possibility to interface any python version like 2.x which is out the scope of this project + possibility to get inspired from other projects like Jython) and thus the whole project will consist of a visitor which will have all the transform methods migrated to.
3- Implementing 2 will give an easier out-of-the-box implementation of the ast module which is unimplemented currently.
4- Why not using a safer typed language like Typescript and divide the transpiler into smaller chunks using better OOP paradigms(an ultimate goal is the whole project written in python and which it can transpile itself to Javascript)

The text was updated successfully, but these errors were encountered:

deadlocklogic · 2021-06-20T15:39:24Z

Some simple issues/bugs to clarify my point:

Issue1:

var _tokenizer = __BRYTHON__.tokenizer("\rtest");

var a;
while (a = l_tokenizer.next()) {
    if(a.done){
        break;
    }
    console.log(a.value);
}

The 3rd token which is the name token prints:

{
  '0': 'NAME',
  '1': 'est',
  '2': [ 2, 0 ],
  '3': [ 2, 3 ],
  '4': 'st',
  type: 'NAME',
  string: 'est',
  start: [ 2, 0 ],
  end: [ 2, 3 ],
  line: 'st'
}

which is wrong, the name should be 'test'. (The fix is quite simple just remove this line

brython/www/src/python_tokenizer.js

Line 128 in 048601d

pos++

)

Issue2:

The line property is really broken as you can see for the previous example (after fixing the bug the line is still wrong, almost every time the first character of the line is omitted after the first token on a given line)

Issue3

Tons of variables are unused but still reside and are somehow confusing.

brython/www/src/py2js.js

Lines 10937 to 10955 in 048601d

    
           try{ 
        
               var token = tokenizer.next() 
        
           }catch(err){ 
        
               if(err.type == 'IndentationError'){ 
        
                   context = context || new $NodeCtx(node) 
        
                   $pos = line2pos[err.line_num] 
        
                   $_SyntaxError(context, err.message, 'indent') 
        
               }else if(err instanceof SyntaxError){ 
        
                   if(braces_stack.length > 0){ 
        
                       var last_brace = $B.last(braces_stack), 
        
                           start = last_brace.start 
        
                       $pos = line2pos[start[0]] + start[1] 
        
                       $_SyntaxError(context, [`'${last_brace.string} was ` + 
        
                          'never closed']) 
        
                   } 
        
                   $_SyntaxError(context, err.message) 
        
               } 
        
               throw err 
        
           }

the tokenizer never throws a $_SyntaxError why even test against it then?

Issue4

py2js.js is 99.9% py2js_tokenizer.js with few debug functions. Why not implementing a logger or a simple flag and merge these 2 files into one? Or there is any purpose of doing this?

This is all just by quick review of the project, but overall interesting one.

schmave · 2021-06-21T14:02:09Z

From your original post, (1) directly translating the Python tokenizer module to JS, and/or (2) using ANTLR for parsing seem like reasonable ideas to me.

But these are big projects. Are you interested in implementing them?

@PierreQuentel may have more thoughts, since he just did an overhaul of the tokenization code.

PierreQuentel · 2021-06-21T14:23:18Z

Interesting questions, thanks !

1 - there are 2 tokenizers in CPython, the pure-Python module tokenize and an implementation in C used by the interpreter.

My first idea was to convert the pure-Python module to Javascript using Brython and optimize it. Unfortunately, as often, it has dependencies : 34 other modules need to be imported to make it work... There are probably very few parts of these modules that are actually used, but idenfiying them and replacing the imported code by Javascript equivalents would have required at least as much work as writing a Javascript implementation from scratch.

Another option would have been to use the C implementation, but, even assuming that it's possible, I don't know C enough for that.

This is why I decided to write the Javascript version in python_tokenizer.js. It is certainly not perfect, but its results are similar enough to the CPython module that it successfully parses the modules in the standard library. And I hope that the code is not too difficult to understand.

To sum up, unless someone can come with a version derived from one of the CPython implementations - you are right, it would certainly be 100% compliant - that is at least as fast as this one, I prefer to keep it and fix the bugs that you or other developers would report. Many thanks for the ones you reported in your second post, I will fix them ASAP.

2 - and 3 - I didn't know ANTLR. It is mentioned in PEP 617 as an alternative for a new Python parser and it was not selected. It seems that it's an LL parser, and I wonder if it would be possible to use it for new Python constructs such as the "soft keywords" match and case that are going to be introduced in Python 3.10 : they require unlimited lookahead to detect that match is an identifier in match = 0 and a keyword in match expression:.

Anyway, I actually gave a try at implementing a PEG parser, based on the simplified Python grammar and the left recursion algorithm referenced in PEP 617. The result is in python_parser.js.

The next step would have been to generate an AST from the tree produced by this parser, based on the complete grammar. I must admit that the amount of work needed for that frightened me... and I suppose that this step would probably be necessary with ANTLR.

For AST, another option might be to use the tree built by the current implementation by dispatch_tokens(). I'm not sure it's possible and how much work it would require though.

4 - If I started the project now, the choice between Javascript and Typescript would not be obvious. But it started in 2012, I'm afraid it's too late to switch to another language, however close to JS as it might be.

…n issue #1702)

PierreQuentel · 2021-06-21T19:20:25Z

The commits referenced above fix most of the issues you have reported.

I didn't understand Issue 3. The tokenizer throws Javascript SyntaxErrors and errors with attribute .type set to "IndentationError", and this is what dispatch_tokens() tests.

deadlocklogic · 2021-06-21T20:13:24Z

The tokenizer in python_tokenizer.js throws 4 exceptions, 2 of type Error, 2 of type SyntaxError which are native JavaScript types, so as a quick review, the try block won't ever catch the first test. I guess maybe you meant throwing a BRYTHON.$_SyntaxError?

brython/www/src/py2js.js

Lines 10940 to 10943 in 048601d

    
           if(err.type == 'IndentationError'){ 
        
               context = context || new $NodeCtx(node) 
        
               $pos = line2pos[err.line_num] 
        
               $_SyntaxError(context, err.message, 'indent')

By the way I come from C/C++. C#, Java even Python, Lua etc... languages background, I know Javascript of course but I prefer Typescript or the new ECMAScript standards with better OOP because the dynamism of Javascript can be heaven or hell, nothing between. So for small webpages nothing is wrong with JS, but for larger projects I believe Typescript is better at least from structuring point of view, at least won't compile anything you throw at it.

The way I am finding such bugs is that I am analyzing the project as much as I can now because I need such functionality, not for the web necessarily. By the way, I am finding many other bugs or possible ones which I will report soon.

If somebody is willing to review PR's I can try a shot on providing a transpiration of the tokenize module.(maybe some experimental branch with CPython ANTLR stuff)
To be honest, I am seriously considering a manual/careful transpiration of the project to pure python at some point, therefore it can transpile itself virtually to any language given a solid battle tested ast, therefore transpiling to any possible language will be nothing more than a visitor implementation and some built-in function mapping.

…wn). Related to issue #1702.

PierreQuentel · 2021-06-21T21:05:39Z

Issue3 is fixed - you were right, I probably removed the attribute .type at some point.

deadlocklogic · 2021-06-23T11:41:07Z

After some little digging/searches, I found this project: Skulpt which also tries to transpile Python 2/3 to JavaScript. It has most of libs unimplemented compared to Brython but it has a token/tokenize files compliant with Python standard:
https://github.com/skulpt/skulpt/blob/master/src/token.js
https://github.com/skulpt/skulpt/blob/master/src/tokenize.js
So basically transpiling them again would be like reinventing the wheel (even their license is MIT).
And I see they are using an asdl to generate the ast and its component just like CPython does, on the other hand Jython uses a mix of this and ANTLR, couldn't figure out why exactly they needed ANTLR.
Edit: Jython is using ANTLR as the main tokenizer, parser (the asdl is just for generating the ast module)

So basically each of these projects generate the ast module from an asdl file and implement a compiler as a node visitor, by doing that they avoid the hassle of maintaining handwritten code (the ast module they auto generate with a python script which parses the asdl).
Wonder if Brython could benefit from something like this.

PierreQuentel · 2021-06-25T11:32:25Z

@deadlocklogic

Thanks for the pointers to the Skulpt scripts, I didn't think of looking for this.

What you suggest makes sense, obviously, but it means a very radical change in Brython code - probably an almost complete rewriting of py2js.js. I would suggest that you start a separate project, relying on the new PEG parser, with Javascript code to generate a similar AST to that of CPython 3.10. I and other contributors would follow the project and we could discuss its progress and a possible integration in Brython.

What do you think ?

deadlocklogic · 2021-06-25T14:05:57Z

Yes exactly my thought. This needs to be done in a separate branch or a completely separate project and see how things can go from there. So I have no problem with starting a new project.
Just a question, the Javascript code generation phase is done completely with calling the transform function?
I was looking at CPython and evaluating the possibility of following their lexing/parsing scheme, the main concern is the compatibility between their ast and Brython's ast in case I want to use the code generation part.

s-cork · 2021-07-05T09:32:16Z

Skulpt is also currently experimenting with the pegen parser which is skulpt agnostic and so it may be useful for Brython.
You can take a look at the efforts here: https://github.com/skulpt/skulpt_parser

We're very much in experimental phase right now.
And focusing on making it work for 3.9.5 before we touch 3.10 and match/case statements.
But at the time of writing - after a pr for fstrings is committed - we are generating the correct AST for std library modules.

(Full disclosure - I'm a skulpt contributor and working on the parser)

PierreQuentel · 2021-07-05T10:15:47Z

Just a question, the Javascript code generation phase is done completely with calling the transform function?

It's more complex than that :-(

In short :

dispatch_tokens() creates a tree similar to the Python AST
transform() adds nodes to / removes nodes from the tree for the Python statements that are translated into several Javascript statements
to_js() generates the Javascript code

deadlocklogic · 2021-07-05T13:33:16Z

Yes I see now, by the way I am kind of transpiling the whole CPython parser/tokenizer and about to do some tests with it. When I am done I will create a repo and let you know, if Brython could make use of it somehow. (It is a straightforward port but compared with Brython's parser, it is bigger but with some obfuscation/minifying it could be reduced).

PierreQuentel · 2021-12-31T09:50:55Z

For your information, the latest Brython version (3.10.4) includes the ast module, which produces the same results as the CPython module.

deadlocklogic · 2022-01-01T08:50:54Z

Hi, it is been a while. Actually I remember writing a big chunk of a parser using CPython as base code, but for many reasons I had to abandon the project before polishing it.
Yes I saw this PR which is a huge improvement.

So now the parsing is done internally using the CPython's compliant parser? Because it would be much easier to write a simple visitor around it and generate the JS code.
I had a bit or hard time understanding how the parsing is done in Brython due to lack of documentations.

All and all, great news for such a promising library.

deadlocklogic · 2022-01-01T08:58:22Z

By the way, does Brython supports separating code into modules? Because I was using it for simple scripts. Never tried with a multiple file project.

PierreQuentel · 2022-01-13T13:41:16Z

I am not too sure what you mean by "separating code into modules" but if you mean different Python modules that can import each other, possibly in a package, yes this is supported.

I am currently working on an engine that generates Javascript from the AST representation instead of the Brython-specific tree, in module ast_to_js.js. It is progressing very well, still a few weeks of work before it can replace the current implementation but I am very optimistic.

If it succeeds, it will make the Python-to-JS engine much easier to understand : since version 3.9.4 the tokenizer produces the same tokens as CPython, since 3.10.4 the AST tree is also the same, and both are documented in Python docs. The generation engine will also be much shorter (notably, no need for the .transform() methods, which are currently very obscure and poorly documented).

deadlocklogic · 2022-01-15T18:34:19Z

Yeah I meant different Python modules that can import each other.

I am currently working on an engine that generates Javascript from the AST representation instead of the Brython-specific tree, in module ast_to_js.js. It is progressing very well, still a few weeks of work before it can replace the current implementation but I am very optimistic.

Exactly, what I think. It would be much easier to write a visitor which generates JS from a conforming CPython AST. Maybe a typescript rewrite, when everything is refactored would be easy and neat.
Overall, nice progress!

PierreQuentel · 2022-06-30T11:12:42Z

Since release 3.10.6, the Javascript code is generated by the Python-compliant ATS. This is done in ast_to_js.js.

I have written a PEG parser based on the standard Python grammar to generate the AST, as described in PEP 617. It works, but it's much slower than the current Brython parser... More information in this post.

I think we can close this issue. If you prefer to leave it open please add a comment.

PierreQuentel pushed a commit that referenced this issue Jun 21, 2021

Fix bug in python_tokenizer.js when a line starts with \r (reported i…

8bcd66c

…n issue #1702)

PierreQuentel pushed a commit that referenced this issue Jun 21, 2021

Remove py2js_tokenizer.js (reported in issue #1702)

57abf97

PierreQuentel pushed a commit that referenced this issue Jun 21, 2021

Fixes bugs in python_tokenizer.js. Related to issue #1702.

8ac6604

PierreQuentel pushed a commit that referenced this issue Jun 21, 2021

Fixes bug in python_tokenizer.js (IndentationError not correctly thro…

90b61b3

…wn). Related to issue #1702.

austin-schick mentioned this issue Oct 13, 2021

Merge brython-dev 10/12/22 cmu-cs-academy/brython#37

Merged

PierreQuentel closed this as completed Jun 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some general questions/suggestions #1702

Some general questions/suggestions #1702

deadlocklogic commented Jun 20, 2021 •

edited

deadlocklogic commented Jun 20, 2021

schmave commented Jun 21, 2021

PierreQuentel commented Jun 21, 2021

PierreQuentel commented Jun 21, 2021

deadlocklogic commented Jun 21, 2021 •

edited

PierreQuentel commented Jun 21, 2021

deadlocklogic commented Jun 23, 2021 •

edited

PierreQuentel commented Jun 25, 2021

deadlocklogic commented Jun 25, 2021

s-cork commented Jul 5, 2021

PierreQuentel commented Jul 5, 2021

deadlocklogic commented Jul 5, 2021

PierreQuentel commented Dec 31, 2021

deadlocklogic commented Jan 1, 2022 •

edited

deadlocklogic commented Jan 1, 2022

PierreQuentel commented Jan 13, 2022

deadlocklogic commented Jan 15, 2022

PierreQuentel commented Jun 30, 2022

Some general questions/suggestions #1702

Some general questions/suggestions #1702

Comments

deadlocklogic commented Jun 20, 2021 • edited

deadlocklogic commented Jun 20, 2021

Issue1:

Issue2:

Issue3

Issue4

schmave commented Jun 21, 2021

PierreQuentel commented Jun 21, 2021

PierreQuentel commented Jun 21, 2021

deadlocklogic commented Jun 21, 2021 • edited

PierreQuentel commented Jun 21, 2021

deadlocklogic commented Jun 23, 2021 • edited

PierreQuentel commented Jun 25, 2021

deadlocklogic commented Jun 25, 2021

s-cork commented Jul 5, 2021

PierreQuentel commented Jul 5, 2021

deadlocklogic commented Jul 5, 2021

PierreQuentel commented Dec 31, 2021

deadlocklogic commented Jan 1, 2022 • edited

deadlocklogic commented Jan 1, 2022

PierreQuentel commented Jan 13, 2022

deadlocklogic commented Jan 15, 2022

PierreQuentel commented Jun 30, 2022

deadlocklogic commented Jun 20, 2021 •

edited

deadlocklogic commented Jun 21, 2021 •

edited

deadlocklogic commented Jun 23, 2021 •

edited

deadlocklogic commented Jan 1, 2022 •

edited