This repository has been archived by the owner on Apr 11, 2023. It is now read-only.

How to deconstruct code into tokens to extract functions and comments? #234

Closed
skye95git opened this issue Aug 6, 2021 · 2 comments

Comments

@skye95git

I want to build a code search corpus. I have collected a lot of GitHub repositories. Now I need to break the code into tokens to extract functions and comments. In the paper "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search" you write: "We then tokenize all Go, Java, JavaScript, Python, PHP and Ruby functions (or methods) using TreeSitter — GitHub’s universal parser — and, where available, their respective documentation text using a heuristic regular expression."

I can extract functions in Python, but the result does not include their comments. How do you extract functions together with their comments? Could you share your code?
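
For Python specifically, a minimal sketch of extracting functions together with their docstrings can rely on the standard-library ast module alone. This is not the CodeSearchNet pipeline (which, per the paper, uses TreeSitter and a heuristic regular expression); extract_functions and the sample source below are hypothetical names made up for illustration, and ast.get_source_segment requires Python 3.8 or later.

```python
import ast


def extract_functions(source):
    """Yield (qualified_name, docstring, function_source) for every function in `source`.

    Hypothetical helper for illustration; it only handles Python and only
    picks up docstrings, not `#` comments.
    """
    tree = ast.parse(source)

    def visit(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                name = prefix + child.name
                docstring = ast.get_docstring(child) or ""
                code = ast.get_source_segment(source, child)
                yield name, docstring, code
                # Nested functions are reported as "outer.inner".
                yield from visit(child, name + ".")
            elif isinstance(child, ast.ClassDef):
                # Methods are reported as "ClassName.method".
                yield from visit(child, prefix + child.name + ".")
            else:
                # Keep descending so functions inside if/try blocks are found.
                yield from visit(child, prefix)

    yield from visit(tree)


sample = '''
class Activation:
    """Applies an activation function."""

    def __init__(self, activation):
        """Store the activation."""
        self.activation = activation

def helper():
    return 1
'''

results = list(extract_functions(sample))
for name, docstring, code in results:
    print(name, repr(docstring))
```

Functions without a docstring come back with an empty string, which mirrors the empty 'docstring' field that shows up for undocumented functions.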

@mallamanis
Contributor

You can find all our parsing code here.

@skye95git
Author

You can find all our parsing code here.

Thank you for your reply! I have tried the function parser in the CodeSearchNet/function_parser/ folder, but I ran into some problems:

  1. What is the input? In the examples, the input is the library keras-team/keras. Is that https://github.com/keras-team/keras? That is a whole repository. Does each input correspond to one repository?

  2. What is the output? The output in the examples is

{
  'nwo': 'keras-team/keras',
  'sha': '0fc33feb5f4efe3bb823c57a8390f52932a966ab',
  'path': 'keras/layers/core.py',
  'language': 'python',
  'identifier': 'Activation.__init__',
  'parameters': '(self, activation, **kwargs)',
  'argument_list': '',
  'return_statement': '',
  'docstring': '',
  'function': 'def __init__(self, activation, **kwargs):\n super(Activation, self).__init__(**kwargs)\n self.supports_masking = True\n self.activation = activations.get(activation)',
  'url': 'https://github.com/keras-team/keras/blob/0fc33feb5f4efe3bb823c57a8390f52932a966ab/keras/layers/core.py#L294-L297'
}

The path is just a single file, core.py, in the library keras-team/keras. How do I choose which file to parse? By setting dependee = "keras-team/keras"?
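
To make the shape of that output concrete, here is a hedged sketch of how one such record could be assembled for a single function in a single file. It is inferred only from the example output above, not from the function_parser source: build_record and its parameters are hypothetical, and the 'parameters', 'argument_list' and 'return_statement' fields are left out. The one thing the example does show is how 'url' is composed from 'nwo', 'sha', 'path' and the function's line range.

```python
def build_record(nwo, sha, path, identifier, function, docstring,
                 start_line, end_line, language="python"):
    """Assemble one per-function record shaped like the example output.

    Hypothetical helper: the field names are copied from the example,
    everything else is an assumption made for illustration.
    """
    # The permalink pins the file to a commit and highlights the function's lines.
    url = f"https://github.com/{nwo}/blob/{sha}/{path}#L{start_line}-L{end_line}"
    return {
        "nwo": nwo,                # "name with owner", i.e. owner/repository
        "sha": sha,                # commit the file was read from
        "path": path,              # file path relative to the repository root
        "language": language,
        "identifier": identifier,  # e.g. "ClassName.method"
        "docstring": docstring,
        "function": function,
        "url": url,
    }


example = build_record(
    nwo="keras-team/keras",
    sha="0fc33feb5f4efe3bb823c57a8390f52932a966ab",
    path="keras/layers/core.py",
    identifier="Activation.__init__",
    function="def __init__(self, activation, **kwargs): ...",  # body elided here
    docstring="",
    start_line=294,
    end_line=297,
)
print(example["url"])
```

Under this reading, a repository yields many records, one per function per file, so 'path' naming a single file does not by itself mean that only that file was parsed.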
