Skip to content
Go to file

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Doc2vec for Feature Location



Deep learning models are a class of neural networks. Relative to n-gram models, deep learning models can capture more complex statistical patterns based on smaller training corpora. In this paper we explore the use of a particular deep learning model, document vectors (DVs), for feature location. DVs seem well suited to use with source code, because they both capture the influence of context on each term in a corpus and map terms into a continuous semantic space that encodes semantic relationships such as synonymy. We present preliminary results that show that a feature location technique (FLT) based on DVs can outperform an analogous FLT based on latent Dirichlet allocation (LDA) and then suggest several directions for future work on the use of deep learning models to improve developer effectiveness in feature location.



Most things related to this project can be found in the GitHub repository.

Some files which you might find of immediate interest:


Our dataset is available as part of the GitHub repository (and release archive), but things such as the corpora go through extraction and pre-processing steps.

All project information is in projects.csv, and supplementary data is under the data/ directory.

data/ layout

Each subdirectory of data/ follows the following schema:

  • <project>/ -- the main project name (from projects.csv)
    • repos.txt -- the repository URLs for cloning
    • svn2git.csv -- if the repository was converted to git from subversion, this is the SVN revision -> Git sha mapping
    • <version>/ -- version specific files:
      • ids.txt -- line separated query ids that relate to the issue report ids or feature request ids used in the datasets.
      • queries/ -- contains the unpreprocessed queries:
        • ShortDescription<id>.txt -- the title or summary of the query
        • LongDescription<id>.txt -- the full description of the query
      • goldsets/ -- contains the goldset files:
        • class/ -- contains class-level goldsets
          • <id>.txt -- line separated class names related to the query id
        • method/ -- contains method-level goldsets
          • <id>.txt -- line separated method names related to the query id
      • <FLT>_<topic modeler>-<level>-ranks.csv -- the effectiveness measures for each of the various experiment setups, e.g., changeset_lda-class-ranks.csv is the ranks of the batch-mode LDA experiment at the class-level.

There may be other files in the directories which originate from the two original datasets.


We've done our best to include every possible bit of code written in order to complete this paper. Much of the main experiment is under src/, with a couple of corresponding tests in tests/, and some helper scripts in scripts that helped to convert a dataset to our format or generate tables for the paper.


Install everything using make:

$ make install

Or, if you use virtualenv, you can make init instead.

Now, you should be able to run commands:

$ dfl <project name>


$ dfl mucommander

This will run the experiment on the given project name.

To run the experiment on a certain version or level, use the --version and --level flags.

$ dfl  mucommander --version v0.8.5 --level class

See --help for additional usage.

$ dfl --help
Usage: dfl [OPTIONS] NAME

Changesets for Feature Location

--path TEXT     Set the directory to work within
--version TEXT  Version of project to run experiment on
--level TEXT    Granularity level of project to run experiment on
--help          Show this message and exit.


No description, website, or topics provided.




No releases published


No packages published
You can’t perform that action at this time.