Preprocessing of datasets #2

Closed
maximzubkov opened this issue Mar 5, 2021 · 1 comment

maximzubkov commented Mar 5, 2021

Hello!

Thanks for your work on InferCode, it's awesome!
My name is Maksim Zubkov, and I am doing my bachelor thesis at JetBrains Research on self-supervised learning techniques for source code. I want to compare the pre-training scheme proposed in your paper with the one I am investigating in my research.

I tried to initialize CodeClassificationData to train the model on my data, but I could not find a script that creates the files with a .pkl extension. It now seems that I was finally able to run the preprocessing. To get there, I followed these steps:

  1. As suggested in the README, I executed: docker run --rm -v $(pwd):/data -w /data --entrypoint /usr/local/bin/subtree -it yijun/fast examples/raw_code examples/subtrees node_types.csv to create .ids.csv files in examples/subtrees
  2. Then I explored the yijun/fast docker image and found the binary /usr/local/bin/pkl. I ran docker with /usr/local/bin/pkl as the entry point, which produced several .pkl files.
  3. Then I made minor changes to your repo, namely adding some __init__.py files
  4. The next step was to deal with the fast_pb2.py file, which I simply copied from the graph-ast repo
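
The two Docker steps above can be sketched as a small Python helper that only builds the command lines, so they can be reviewed before running. This is a sketch under assumptions: `docker_cmd` is a hypothetical helper name, the subtree arguments are the ones quoted in step 1, and the arguments that /usr/local/bin/pkl expects are not stated in this thread, so they are left empty on purpose rather than guessed.

```python
import shlex
from pathlib import Path

IMAGE = "yijun/fast"  # image name as quoted in step 1


def docker_cmd(entrypoint, args, workdir=None):
    """Build (but do not run) a `docker run` argv that mounts workdir at /data.

    Hypothetical helper for illustration; it mirrors the flags of the
    README command quoted in step 1 and adds nothing else.
    """
    workdir = Path(workdir or Path.cwd()).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",
        "-w", "/data",
        "--entrypoint", entrypoint,
        "-it", IMAGE,
        *args,
    ]


# Step 1: the exact arguments quoted from the README.
subtree_cmd = docker_cmd(
    "/usr/local/bin/subtree",
    ["examples/raw_code", "examples/subtrees", "node_types.csv"],
)

# Step 2: same image, different entry point. The arguments for `pkl`
# are NOT given in this thread, so the list is deliberately empty.
pkl_cmd = docker_cmd("/usr/local/bin/pkl", [])

print(shlex.join(subtree_cmd))
print(shlex.join(pkl_cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; the printed lines can be pasted into a terminal once the pkl arguments are known.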

Finally, I succeeded in creating the trees object and running put_trees_into_bucket, but could you please answer a few questions:

  1. Is this the correct procedure for preparing data for your model? If so, I can create a pull request that adds all this information to the README. Or did I miss some important point?
  2. I didn't get the difference between /usr/local/bin/pkl and /usr/local/bin/pklpos. Could you please explain it?
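
Until the difference is documented, one way to compare what the two binaries produce is to summarize the top-level structure of a file from each. The sketch below assumes the .pkl files are standard Python pickles, which this thread does not confirm; `describe` and `describe_pickle` are hypothetical helper names, and unpickling may additionally require fast_pb2 to be importable.

```python
import pickle
from pathlib import Path


def describe(obj, depth=0, max_depth=2):
    """Return a short nested summary of an object's type and size."""
    name = type(obj).__name__
    if isinstance(obj, dict):
        summary = f"dict[{len(obj)}]"
        if depth < max_depth and obj:
            key = next(iter(obj))  # sample the first key only
            summary += f" e.g. {key!r}: {describe(obj[key], depth + 1, max_depth)}"
        return summary
    if isinstance(obj, (list, tuple)):
        summary = f"{name}[{len(obj)}]"
        if depth < max_depth and obj:
            summary += f" of {describe(obj[0], depth + 1, max_depth)}"
        return summary
    return name


def describe_pickle(path):
    """Load one pickle file and summarize it.

    Assumption: the file is a plain Python pickle. Only load files you
    generated yourself, because unpickling can execute arbitrary code.
    """
    with Path(path).open("rb") as f:
        return describe(pickle.load(f))
```

Calling `describe_pickle` on one output of pkl and one output of pklpos for the same source file, then comparing the two summaries, should show whether the second variant carries extra fields.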

If there is any way I can help with open-sourcing the InferCode code base, I would be glad to do so.

bdqnghi (Owner) commented Jun 25, 2021

Hi Maxim, we have updated the code. Could you have a look at the new instructions? Any suggestions for fixing the code are welcome.

bdqnghi closed this as completed Jul 16, 2021