Preprocessing of datasets #2

maximzubkov · 2021-03-05T22:01:05Z

Hello!

Thanks for your work on InferCode, it's awesome!
My name is Maksim Zubkov, and I am writing my bachelor's thesis at JetBrains Research on self-supervised learning techniques for source code. I want to compare the pre-training scheme proposed in your paper with the one I am investigating in my own research.

I tried to initialize CodeClassificationData to train the model on my data, but I could not find a script that creates the files with a .pkl extension. It now seems that I have finally managed to run the preprocessing. To do so, I took the following steps:

As suggested in the README, I executed docker run --rm -v $(pwd):/data -w /data --entrypoint /usr/local/bin/subtree -it yijun/fast examples/raw_code examples/subtrees node_types.csv to create the .ids.csv files in examples/subtrees.
Then I explored the yijun/fast Docker image and found the binary /usr/local/bin/pkl. Running the container with /usr/local/bin/pkl as the entry point produced several .pkl files.
Then I made minor changes to your repo, namely adding some __init__.py files.
The next step was to deal with the fast_pb2.py file, which I simply copied from the graph-ast repo.
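
For readers who want to script the steps above, here is a minimal Python sketch. It rests on several assumptions that the thread does not confirm: the helper names docker_command and add_init_files are hypothetical, the arguments expected by /usr/local/bin/pkl are not stated anywhere above and are therefore left to the caller, and which directories actually need an __init__.py is a guess (here, every directory that already contains a .py file). Only the subtree invocation is taken verbatim from the post.

```python
import os
import shlex
from pathlib import Path


def docker_command(entrypoint, args, image="yijun/fast", workdir="/data"):
    """Build the argv that runs one binary from the image on the current directory."""
    return [
        "docker", "run", "--rm",
        "-v", f"{os.getcwd()}:{workdir}",  # same mount as $(pwd):/data in the post
        "-w", workdir,
        "--entrypoint", entrypoint,
        "-it", image,
        *args,
    ]


def add_init_files(root):
    """Create an empty __init__.py in every directory under root that holds .py files.

    Assumption: the post does not say which directories needed the file.
    """
    created = []
    for directory in sorted({p.parent for p in Path(root).rglob("*.py")}):
        init_file = directory / "__init__.py"
        if not init_file.exists():
            init_file.touch()
            created.append(init_file)
    return created


# Step 1, with the arguments quoted from the README in the post.
subtree_cmd = docker_command(
    "/usr/local/bin/subtree",
    ["examples/raw_code", "examples/subtrees", "node_types.csv"],
)
# Print the command instead of running it, so the sketch works without Docker.
print(shlex.join(subtree_cmd))
```

The same docker_command helper could be reused for the second step by passing /usr/local/bin/pkl as the entry point, once the arguments that binary expects are known. To execute a command rather than print it, the list can be handed to subprocess.run.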

Finally, I succeeded in creating the trees object and running put_trees_into_bucket, but could you please answer a few questions:

Is this the correct procedure for preparing data for your model? If so, I can create a pull request that adds all of this information to the README. Or did I miss some important point?
I didn't understand the difference between /usr/local/bin/pkl and /usr/local/bin/pklpos. Could you please explain it?

If I can help in any way with open-sourcing the InferCode code base, I would be glad to do so.


bdqnghi · 2021-06-25T11:42:16Z

Hi Maxim, we have updated the code. Could you have a look at the new instructions? Any suggestions for fixing the code are welcome.

Ledenel mentioned this issue Mar 22, 2021

Where are processing_data.sh, CodeClassificationData, CorderModel, yijun/fast and requirements.txt? #3

Closed

bdqnghi closed this as completed Jul 16, 2021
