How to add new languages #42

Open
hxue3 opened this issue Oct 20, 2021 · 8 comments
Labels
question Further information is requested

Comments

@hxue3

hxue3 commented Oct 20, 2021

I am wondering if I can add new languages for code translation. For example, I want to translate COBOL code to Python.

Do you have any tips? Could you briefly explain what I should do?

@baptisteroziere baptisteroziere added the question Further information is requested label Oct 21, 2021
@baptisteroziere
Contributor

I'll detail the steps you'd need to take to translate from COBOL to Python:

  • Get monolingual COBOL and Python data using Google BigQuery or something else.
  • Preprocess your data using our preprocess.py pipeline. You will need to add COBOL as a supported language and write a CobolProcessor class inheriting from LangProcessor, with at least the tokenize_code, detokenize_code and extract_functions methods. The tokenize and detokenize methods may not require any work if you inherit from TreeSitterLangProcessor and use a grammar such as https://github.com/Neppord/tree-sitter-cobol. Check JavaProcessor for an example of what to do. Note that your tokenize_code method must at least return something that fits on a single line for the pipeline to work; the rest is not strictly necessary. You can also try to translate at file level, in which case you won't need to be able to extract functions.
  • (Optional) Build valid/test datasets for COBOL to Python with a few examples of parallel functions and some test scripts in which you can insert a solution by replacing a #TOFILL comment and then run the script. If you don't create the test scripts, you can just set --eval_computation false.
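
To make the preprocessing step more concrete, here is a minimal sketch of what such a processor could look like. It is not the repository's code: the LangProcessor class below is a local stand-in so the example runs on its own, the regex tokenizer is a naive placeholder for a real tree-sitter grammar, and the NEW_LINE marker is a hypothetical name. The only property it tries to respect is the one stated above, namely that tokenized code can be written on a single line.

```python
import re
from abc import ABC, abstractmethod


class LangProcessor(ABC):
    """Local stand-in for the repository's base class (NOT the real one)."""

    @abstractmethod
    def tokenize_code(self, code): ...

    @abstractmethod
    def detokenize_code(self, tokens): ...

    @abstractmethod
    def extract_functions(self, tokens): ...


class CobolProcessor(LangProcessor):
    # Hypothetical marker that replaces line breaks in the token stream.
    NEW_LINE = "NEW_LINE"
    # Naive tokenizer: quoted strings, COBOL words/numbers (hyphens allowed),
    # then any other single non-space character such as the final period.
    _TOKEN_RE = re.compile(
        r'"[^"\n]*"|\'[^\'\n]*\'|[A-Za-z0-9][A-Za-z0-9-]*|\S'
    )

    def tokenize_code(self, code):
        """Return a flat token list; " ".join(tokens) is a single line."""
        tokens = []
        for line in code.splitlines():
            line_tokens = self._TOKEN_RE.findall(line)
            if line_tokens:  # skip blank lines
                tokens.extend(line_tokens)
                tokens.append(self.NEW_LINE)
        return tokens

    def detokenize_code(self, tokens):
        """Rebuild multi-line text from a token list (spacing is normalized)."""
        lines, current = [], []
        for tok in tokens:
            if tok == self.NEW_LINE:
                lines.append(" ".join(current))
                current = []
            else:
                current.append(tok)
        if current:
            lines.append(" ".join(current))
        return "\n".join(lines)

    def extract_functions(self, tokens):
        # Not needed when translating at file level, as noted above.
        raise NotImplementedError("file-level translation only in this sketch")


if __name__ == "__main__":
    processor = CobolProcessor()
    toks = processor.tokenize_code('DISPLAY "HELLO WORLD".\nSTOP RUN.')
    print(" ".join(toks))
    print(processor.detokenize_code(toks))
```

A real processor would more likely inherit from TreeSitterLangProcessor and follow JavaProcessor, as suggested above; the sketch only shows the shape of the three methods.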

Then train a model. At this point the steps are the same as for C++ to Python if you created the datasets properly.

  • Train an MLM model on COBOL and Python.
  • Train a TransCoder model on COBOL and Python functions: DAE and BT steps in both directions. If you have no parallel valid/test sets, you can create mock ones with just one non-empty line each, or deactivate the MT evaluation in the code.
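
If you go the mock-dataset route, a small script along these lines could generate the placeholder files. This is only a sketch: the file-name pattern, the default language names and the placeholder lines are all assumptions made for illustration, so check which names the data loader actually expects before using it.

```python
from pathlib import Path


def write_mock_eval_sets(out_dir, lang_pair=("cobol", "python"),
                         splits=("valid", "test")):
    """Write one single-line placeholder file per split and language.

    The naming scheme "<split>.<src>-<tgt>.<lang>.tok" is hypothetical and
    must be adapted to whatever the training code really loads.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One non-empty, already "tokenized" line per language (illustrative).
    placeholder = {"cobol": "STOP RUN .", "python": "pass"}
    src, tgt = lang_pair
    written = []
    for split in splits:
        for lang in lang_pair:
            path = out / f"{split}.{src}-{tgt}.{lang}.tok"
            path.write_text(placeholder.get(lang, "pass") + "\n",
                            encoding="utf-8")
            written.append(path)
    return written


if __name__ == "__main__":
    for p in write_mock_eval_sets("mock_eval"):
        print(p)
```

Scores computed on such files are meaningless; they only let the training loop run when no real parallel data exists.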

@raffian

raffian commented Dec 14, 2021

@brozi

I need a processor for C. Should I follow the same steps?

@baptisteroziere
Contributor

@raffian

raffian commented Dec 15, 2021

@brozi

Thanks for the quick reply.

Would you be willing to share the BigQuery queries used to extract monolingual cpp data from the GitHub public dataset? The whitepaper suggests I'll need to do that for C, but I can probably reuse the Java dataset under data/test_dataset as-is, given that Java is our target, correct?

@arpieb

arpieb commented Mar 20, 2023

@raffian @brozi did you ever suss out what the queries should look like? Hoping to leverage this to train a Delphi <-> Python3 transcoder model.

Thanks!

@raffian

raffian commented Mar 20, 2023

@arpieb

Unfortunately, no - I gave up on this approach.

I switched my focus to ANTLR4. It's mature, stable, and comes with grammars for parsing a very wide range of programming languages, though surprisingly Delphi is missing from the list.
https://github.com/antlr/grammars-v4

I found this one. It's unofficial, I guess, so your mileage may vary.
https://github.com/gotthardsen/Delphi-ANTRL4-Grammar/blob/master/Delphi.g4

If you go down the ANTLR4 path for code translation, take my advice and do the training at Udemy; it's worth it. Understanding ANTLR fundamentals is essential to using it effectively. Otherwise, you'll just get frustrated with it.
https://www.udemy.com/course/antlr-programming-masterclass-with-python

Good luck,
raffian

@raffian

raffian commented Mar 21, 2023

@arpieb

Best place to start with ANTLR: https://www.antlr.org/

@arpieb

arpieb commented Mar 21, 2023

@raffian thanks for the links, will check them out! Really hoping to find something that can intelligently perform the translation, or at least a large part of it.

4 participants