Introduction

This research project proposes a new automatic comment generation approach, which mines comments from a large programming Question and Answer (Q&A) site. The tool generates code comments automatically by analyzing existing software repositories. It applies code clone detection techniques to discover similar code segments and use the comments from some code segments to describe the other similar code segments.

It is a hybrid technique that can generate code comments by mining (1) StackOverflow and (2) open source projects.

The paper is included in this repo as clocom-saner15.pdf.

Installation

Dependencies

sudo apt-get install openjdk-8-jdk

Place the following tars under the folder, lib2, which can be obtained from Stanford's website (https://stanfordnlp.github.io/CoreNLP/):

stanford-corenlp-3.8.0.jar stanford-corenlp-3.8.0-models.jar

lib and lib2 contains all the library dependencies that are required to run the tool.

Running the Tool

The configuration is done using an XML file. Configure "config.xml"'s database and project tag with the appropriate folder path. The folders should contain Java source code files. The 'database' can be a StackOverflow database and the 'project' can be any Java project's source code.

Code inside the project tag represents the target project that you want to generate comments for.

Code inside the database tag represents the database. CloCom will extract code comments from this folder for the code inside the project folder.

See the config folder for xml examples and modify the parameters to suit your needs.

Compilation

Type make to compile.

Execute

./cloneDigger.sh config.xml

The output is a list of clones from the database path and the respective generated comment.

Please note the XML parameter settings for config.xml:

databaseFormat
- 0 for standard Java source code, 1 for autocomment format
minNumLines
- number of statements a clone should have
matchAlgorithm
- 1 for gapped, 0 for non-gapped
matchMode
- 1 for between comparison (loads files in the project path into memory), 0 for full mesh
gapSize
- set this for gapped comparision
meshBlockSize
- number of source code files to load into memory for full mesh
numberThreads
- 0 to disable multithreading, else specify the number of cores
debug
- turn on debug statements
removeEmpty
- remove the display of clones that doesn't have a comment
forceRetokenization
- set true if we want to retokenize the database directory, else it will only tokenize as needed
loadDatabaseFilePaths
- use the existing cached list of files

Publication

Please see the research folder for PDF version of the publication.

Edmund Wong, Jinqiu Yang, and Lin Tan, AutoComment: Mining Question and Answer Sites for Automatic Comment Generation, International Conference on Automated Software Engineering (ASE) 2013
Edmund Wong, Taiyue Liu, Lin Tan, CloCom: Mining Existing Source Code for Automatic Comment Generation, International Conference on Software Analysis, Evolution, and Reengineering (SANER) 2015

Please use commit f89b780bac5ea7322633550c29fffd3c199e1f2e if you would like to reproduce the results of CloCom's SANER conference paper.

We no longer support the old SANER codebase because it had evolved greatly since 2013.

Data

We are thankful to StackOverflow for providing a data dump of their website's public data. Please refer to the Stack Overflow Creative Commons Data Dump for further details on how to obtain the data. We utilized the data dump that was published on Match 2017. You will have to download stackoverflow.com-Posts.7z to obtain Posts.xml if you would like to build a mapping database.

Mapping Database

We provide a code-comment mapping data for the Java and Android tag under the database folder javaAndroidDB.tar.gz.

It was generated using the newer Match 2017 StackOverflow database with the following criteria:

Must have a least 3 lines in the code segment
Must have the java or android tag
Code mappings are only extracted from the answer of the post
Question and answer must have a score larger than 0
No NLP parse tree editing other than basic sentence refinements. The reason that the NLP is not processed in this database is because we recently moved the NLP component into the code clone detection tool. You can run it yourself using the Java class (NLP.java).
Each mapping is named as [questionID]-[answerID]-[codeSegmentNumber].autocom

User Study

We had modified the source code and fixed many bugs since the SANER publication. We performed a new user study with 20 participants to evaluate technique on the codebase. The following results reflects on commit cf530f047afb227adbdf8c76d9994db4c3f4c1b8.

Automatically Generated Comments

Here is the format of the results from the CloCom tool:

Match number in this project (excluded matches that contain no comments)
- xx+yy is the grouping of the source code, together it represents a clone group
- xx represents the number of code from target project
- yy represents the number of code from Stack Overflow
One or more code from target project
- File name of the target project's source code and the matched line numbers (xx-xx)
- Length represents the number of matched statements
- Source code that had been matched
One or more code from Stack Overflow
- File name of the Stack Overflow database entry and the matched line numbers (xx-xx)
  - The code segment ID consists of two parts:
    1. StackOverflow Answer ID
    2. Index number of the code snippet within the answer (there can be more than one code snippet within an answer)
- Length represents the number of matched statements
- Text similarity terms that had been matched (local, global and variable)
- Source code that had been matched
List of comment candidates from all sources
- size x, where x represents the total number of candidates
Ranked output from all sources
- Ranking is based on the text similarity score

The output of AutoComment for the 16 evaluated Java projects are under the research folder, which are then used for the user study questionnaire.

Questionnaire

Below are two links that contains the full digital questionnaire that we presented to the users:

Google Form - Group 1

Google Form - Group 2

Raw CSV Dump of the results are located under the research/output folder.

There are two files group1.csv and group2.csv which corresponds to the two sets of questions. Evaulation was performed on 20 participants where each person received 15 questions (12 from AutoComment and 3 from previous work SumSlice).

Summarized Results: Google Sheet Link

Comparison Against Previous Work

AutoComment is compared against the work (SumSlice) from Breno Dantas Cruz, Paul "Will" McBurney, and Collin McMillan from TSE 2015. SumSlice's output and intermediate files on the evaluated Java project, NanoXML, are placed under the research/output folder.

Comments (nanoXML.txt)
SWUM (NanoXML.out)
PageRank (ND_PageRankFormatter.txt)
XML SumGen (nd_xmlsumgen.xml)

Study of Developer Written Comments

We also provide the list of the extracted comments for the study of developer comments under the research/output/developer folder. The original source code files are located on Jajuk's website under: /jajuk-src-1.10.3/src/main/java/org/qdwizard/

We studied the following randomly selected Java files. The generated comments and the original source code had been provided.

ActionsPanel
ClearPoint
Header
Langpack
Screen
ScreenState
Wizard

The classification of the comments (type one and type two) are under type1-comment.txt and type2-comment.txt.

Name		Name	Last commit message	Last commit date
Latest commit History 164 Commits
config		config
database		database
lib		lib
lib2		lib2
models		models
research		research
scripts		scripts
Analyze.java		Analyze.java
Apriori.java		Apriori.java
CloneDigger.java		CloneDigger.java
CommentMap.java		CommentMap.java
CommentParser.java		CommentParser.java
CommentVisitor.java		CommentVisitor.java
Compare.java		Compare.java
ConfigFile.java		ConfigFile.java
Database.java		Database.java
FrequencyMap.java		FrequencyMap.java
LICENSE.md		LICENSE.md
Makefile		Makefile
MatchGroup.java		MatchGroup.java
MatchInstance.java		MatchInstance.java
Method.java		Method.java
NLP.java		NLP.java
Output.java		Output.java
Parser.java		Parser.java
Query.java		Query.java
README.md		README.md
Result.java		Result.java
RunnableDemo.java		RunnableDemo.java
StanfordLemmatizer.java		StanfordLemmatizer.java
Statement.java		Statement.java
Stemmer.java		Stemmer.java
Text.java		Text.java
TextCompare.java		TextCompare.java
TokenType.java		TokenType.java
Tokenizer.java		Tokenizer.java
Utilities.java		Utilities.java
clocom-saner15.pdf		clocom-saner15.pdf
execute.sh		execute.sh
negative.txt		negative.txt
similarityBanList.txt		similarityBanList.txt

License

e32wong/CloCom

Folders and files

Latest commit

History

Repository files navigation