Recodoc - Documentation Analyzer and Recommender
Recodoc is a set of tools to analyze developer documentation, mailing lists, and codebases and recommend improvements to the documentation. It is highly modularized and extensible so only parts of recodoc can be used (e.g., the mailing list crawler).
Recodoc is mainly used through the command line. Users can navigate the data model using the web admin interface. News formats of documentation and mailing lists are added by writing Python classes that extend the existing documentation and mailing list parsers and crawlers.
Because Recodoc is fairly involved (it is a mix of web crawler, parser, screen scraping, knowledge modelling, partial code compilation, natural language processing, analytics, recommendation system), the installation instructions are more complex than your typical developer tool. This document assumes that the users are familiar with Python and to a certain extent, Java.
- User Guide
- List of Implemented Parsers and Syncers
- Java parsing
- Parses an Eclipse Java project and creates a model of the code: packages, classes, methods, fields, parameters. Declaration and hierarchy relationships between these code elements are captured in the model.
- XML parsing
- Parses a DTD or a schema (.xsd) file and creates a corresponding model. The code is not ported yet to this new version.
- Given a URL, downloads all the documentation pages of an HTML document.
- Parses a document and creates a model: sections, pages, code-like terms, code snippets.
- Given a URL to a forum or a mailing list archive, downloads all the messages.
- Parses a support channel archive and creates a model: support threads, messages, code-like terms, code snippets.
- snippet classification
- Classifies code snippets into the following categories: Java code snippet, Java exception stack trace, XML snippet, log trace.
- snippet parsing
- Compiles Java code snippets to resolve and infer missing types. Parses snippets (Java and XML) to extract a list of code-like terms.
- code-like terms classification
- Classifies code-like terms into likely code element kind. For example, the
FooBaris classified as a
- Java model linking
- Links Java codebases with Java code-like terms found in documentation and support channels.
- XML model linking
- Links XML model with xml code-like terms found in documentation and support channels. The code is not ported yet to this new version.
This is currently being implemented.
This is a list of libraries and programs that need to be installed and configured prior to install Recodoc. All of the other dependencies are covered in the Installation section.
- Python 2.7 (Python 2.6 or 3.0+ will not work).
- PostgreSQL(>=8.4), MySQL (>=5.0.3), or Oracle. Note: sqlite will not work because Recodoc uses multiple processes to speed up the parsers and sqlite does not like that.
Please note that PostgreSQL is strongly recommended because it has less limitations than MySQL and the default configuration is better for Recodoc needs. Recodoc was developed with PostgreSQL, and lightly tested with MySQL.
These libraries and programs are required to analyze Java codebases. Analyzing Java codebases is a prerequisite for Java snippet parsing, model linking, and documentation improvements recommendation. If you plan to only use Support Channel or Documentation Analysis, you do not need these libraries and programs.
- Java 1.6
- Eclipse 3.6
These libraries and programs should be installed to improve the performance and the usage/maintenance of Recodoc:
- ipython (for interacting with the model through a Python shell)
This section assumes that you are familiar with Python and virtualenv. The following code snippets will walk you through the installation of Recodoc and of its dependencies. The steps assume a Linux environment under a bash shell.
We assume that you want to install the dependencies in a virtual environment. If you want to install the dependencies globally, skip this step.
cd $HOME mkdir .virtualenvs virtualenv --no-site-package --distribute .virtualenvs/recodoc2 # The following step will activate the virtual environment. # It is assumed that the next steps are performed while # the environment is activated. source .virtualenvs/recodoc2/bin/activate
pip install django==1.3.1 pip install lxml==2.3.4 pip install pyenchant==1.6.5 pip install Py4J==0.7 pip install chardet==1.0.1 pip install django-devserver==0.3.1 pip install pylibmc==1.2.0 pip install ipython==0.10 # For PostgreSQL (requires gcc. otherwise download the binary) pip install psycopg2==2.4.1
If you want to install pyscopg2 without compiling it (e.g., on windows), download the binary package.
These dependencies are only required if you want to analyze Java code. First,
install Py4J in Eclipse using the following update site:
http://py4j.sourceforge.net/py4j_eclipse. Note: the update site is no longer maintained. Please rebuild the site locally from this release: https://github.com/bartdag/py4j-eclipse/archive/62663131a0ba7df42ef9f038f77f2f8e1c655fc0.zip
Then, install PPA in Eclipse using the following update site:
Since PPA is updated frequently but not released often, it might be better to download it and build the update site locally. The source code is located on bitbucket.
If you want to contribute to Recodoc, install the following Python programs:
pip install gitli pip install coverage pip install django-test-coverage
First, clone the Recodoc git repository.
git clone -b develop email@example.com:bartdag/recodoc2.git
Then, copy and edit the localsettings file. The file is heavily commented and there are only a few steps to follow.
cd recodoc2/recodoc2 cp localsettings_template.py localsettings.py vim localsettings.py
Initialize the database by running the following command and creating an admin user (one index might fail to install if you use MySQL):
./manage.py syncdb # Alternatively, if manage.py does not have execution permission: python manage.py syncdb
Finally, run one of the following unit tests to ensure that everything was installed correctly. These tests do not require Eclipse/Java.
# Test Documentation Analysis. ./manage.py test doc # Test Support Channel Analysis. Can take 30 seconds. ./manage.py test channel # Test some utility functions ./manage.py test docutil
You should see these lines at the end:
Ran x tests in xs OK
If you use MySQL, you may see some error messages at the end of the unit tests: as long as the OK is printed, you should ignore these annoying error messages.
If you see these lines instead, there was an error and you should contact me:
Ran x tests in xs FAILED (failures=x, skipped=x)
This short user guide will show you how to analyze the codebases, documentation, and support channels of a project.
The guide assumes that you are located in the
recodoc2 directory containing
the manage.py script and that this script has the executable permission.
A list of all the available commands are available by issuing the help command:
./manage.py help # Print info about a specific command and its options: ./manage.py help createproject
Currently, it takes many small commands to analyze the artifacts of a project: this is done on purpose to ease troubleshooting. It is easier to help you if I know that an error occurred in a smaller command than if it occurred in a big command that does everything. Moreover, some of the operations can be lengthy, so it makes sense to break them in smaller steps.
This guide will assume that you want to analyze the HttpClient project. Steps for other projects should be easy to infer.
The various models generated by Recodoc can be seen, searched, and edited through a web interface. Just run the following command to start a webserver:
Then open your web browser to
http://localhost:8000/admin and enter the
username and password to you provided when you executed the
The following step will create a bunch of metadata in the database. It should complete quickly and without error: this is thus a good first step!
This command should only be issued once after running the
Create a project by issuing the following command.
./manage.py createproject --pname hclient --pfullname 'HttpClient Library' --url 'http://hc.apache.org/httpcomponents-client-ga/index.html' --local
Note that a project will be created in the database and a folder will be
created in the Recodoc data directory specified in the PROJECT_FS_ROOT
variable in the localsettings.py file. Commands that begin by "create" usually
create a model in the database. They can optionnally initialize the required
directory structure if the
--local flag is provided. Alternatively, there
is always the possibility to invoke the "createXlocal" command. The rationale
is that sometimes, it can be useful to transfer the local data, but not the
database from one machine to another.
If you want to analyze the code and the documentation of a project, you need to
./manage.py createrelease --pname hclient --release '4.0' --is_major ./manage.py createrelease --pname hclient --release '4.1'
To analyze a codebase, you will need to have Eclipse installed with Py4J. You can open Eclipse yourself or use this command:
Execute this command to create a codebase model and the appropriate directory structure:
./manage.py createcode --pname hclient --bname main --release '4.0' --local
Then, execute the following command to add the project to the Eclipse workspace. You will see that a project name htclientmain4.0 is created. It contains a src folder for the source code and a lib folder for the libraries (e.g., jar files). Once the project is added to Eclipse, add the source code and the dependencies to this project.
Note that the project is only "linked" in the Eclipse workspace. The actual source code and structure resides in the PROJECT_FS_ROOT/code directory.
./manage.py linkeclipse --pname hclient --bname main --release '4.0'
Once the project compiles under Eclipse, execute the following command to generate the codebase model. Recodoc will go through the code in the project and generate the appropriate code elements in the database (e.g., packages, classes, methods, fields, hierarchy relationships).
./manage.py parsecode --pname htclient --bname main --release '4.0' --parser java
If there is any problem while parsing the code (e.g., you notice a compilation error that you missed first or you want to add some packages), you can execute this command to delete the model (just rerun the parsecode command after):
./manage.py clearcode --pname htclient --bname main --release '4.0' --parser java
Finally, if you want to see the difference between two codebase releases, you can use the Recodoc codediff command:
./manage.py codediff --pname htclient --bname main --release1 4.0 --release2 4.1
The codebase diff report is available through the web interface (under Codediffs).
Execute the following command to create the appropriate model and directory structure:
./manage.py createdoc --pname htclient --release 4.0 --dname clienttut \ --parser doc.parser.common_parsers.NewDocBookParser \ --url "file:///local_path/httpcomponents-client-4.0.1/tutorial/html/index.html" \ --local
url parameter can be a local or remote (e.g., http) path. The
documentation will be downloaded starting from this URL.
parser parameter refers to the Python class responsible for generating
a model from the documents. There is also an optional
syncer parameter if
the documentation is not contained in a subdirectory (e.g., a wiki has a flat
structure when it comes to URL so if you use the default "syncer", all pages
in the wiki will be included, not just the ones that are related to the
To analyze a support channel, you will need to perform the following steps:
- Get a table of contents of all the threads or messages.
- Get the url of all threads and messages.
- Download all pages containing each thread or messages.
- Parse each page to generate a model of threads and messages and identify the code snippets and the code-like terms.
First, create a channel using the following command:
./manage.py createchannel --pname hclient --cfull_name usermail --cname usermail \ --syncer channel.syncer.common_syncers.ApacheMailSyncer \ --parser channel.parser.common_parsers.ApacheMailParser \ --url 'http://mail-archives.apache.org/mod_mbox/hc-httpclient-users/' --local
Note that the
parser parameters refer to the Python class
responsible for crawling the channel (syncer) and generating a model from it
After you have created the channel structure, you need to retrieve the table of contents. This should not take long.
./manage.py tocrefresh --pname hclient --cname usermail ./manage.py tocview --pname hclient --cname usermail
The next step is to download the sections in the table of contents. A section is a page listing messages or threads. For example, for a mailing list, a section is a page for a month (e.g., the page showing all messages for December 2010). For a forum, a section is a page in the forum index (Page 1 for threads 0 to 40, Page 2 for threads 41 to 80, etc.).
# This will download sections in increment of 20. This is recommended. ./manage.py tocdownload --pname hclient --cname usermail --start 0 --end 20 ./manage.py tocdownload --pname hclient --cname usermail --start 20 --end 40 ./manage.py tocdownload --pname hclient --cname usermail --start 40 --end -1 ./manage.py tocview --pname hclient --cname usermail # You can also download all sections in one go: ./manage.py tocdownload --pname hclient --cname usermail --start 0 --end -1 ./manage.py tocview --pname hclient --cname usermail
You can now download the individual messages or threads. Each message/thread is identified by an index. Indexes are incremented by 1000 for each table of contents sections. For example, the first (hypothetical) 50 messages in December 2010 are indexed from 0 to 49. The first 25 messages in January 2011 are indexed from 1000 to 1024 and so on.
./manage.py tocviewentries --pname hclient --cname usermail ./manage.py tocdownloadentries --pname hclient --cname usermail --start 0 --end 1000 ./manage.py tocdownloadentries --pname hclient --cname usermail --start 1000 --end 2000 ./manage.py tocviewentries --pname hclient --cname usermail # Continue until -1
You can see that the pages are downloaded in the
Finally, if you want to parse these messages and generate a model (channel/support threads/messages/code-like terms/code snippets), you can execute this command:
./manage.py parsechannel --pname hclient --cname usermail
If it ever happens that an error occurred while parsing or that you find a bug in your parser, you can delete the generated model from the db with this command:
./manage.py clearchannel --pname hclient --cname usermail
Once you have analyzed the documentation and the support channel, you need to further analyze the code snippets identified by Recodoc. In the following step, individual code-like terms will be extracted from the code snippets.
This step assumes that Eclipse is running. Run the command once to parse all snippets from the documentation, then from the support channel.
./manage.py parsesnippets --pname hclient --parser java --source d ./manage.py parsesnippets --pname hclient --parser java --source s
If there is a problem while parsing the code snippets (e.g., there is a thunderstorm and your computer crashes), you can delete all the code-like terms that were extracted from the code snippets with this command:
./manage.py clearsnippets --pname hclient --language j --source d
Once all code-like terms have been identified and classified, you can ask Recodoc to link the terms with specific code elements. Run these two commands to start the linking process:
# Link code elements from main 4.0 with terms in documentation from 4.0 ./manage.py linkall --pname hclient --bname main --release 4.0 --srelease 4.0 --source d # Link code elements from main 4.0 with terms in the support channel ./manage.py linkall --pname hclient --bname main --release 4.0 --source s
Note that it is possible to link different releases together: for example, you could try to link the release 4.1 of the codebase with the release 4.0 of the documentation.
Linking large support channels can take several days on modern hardware so it makes sense to divide the work in smaller chunks. Contact me if you want to learn how to do this.
If you want to remove all links and start again (e.g., because you found a bug in the linker...), execute these commands:
# To clear all the links. ./manage.py clearlinks --pname hclient --release 4.0 --source d ./manage.py clearlinks --pname hclient --release 4.0 --source s # Restore the original classification computed by the parser ./manage.py restorekinds --pname hclient --release 4.0 --source d ./manage.py restorekinds --pname hclient --release 4.0 --source s
Here we will generate recommendations for version 4.1 by analyzing the documents of both versions 4.0 and 4.1 of the client tutorial.
# To clear all the links. ./manage.py clearlinks --pname hclient --release 4.0 --source d # Link for version 4.0 to 4.0 ./manage.py linkall --pname hclient --bname main --release 4.0 --srelease 4.0 --source d # Link for version 4.1 to 4.1 ./manage.py linkall --pname hclient --bname main --release 4.1 --srelease 4.1 --source d # Link against previous version ./manage.py linkall --pname hclient --bname main --release 4.1 --srelease 4.0 --source d # Computes link differences ./manage.py doclinkdiff --pname hclient --bname main --release1 4.0 --release2 4.1 # Compute and compare coverage ./manage.py computefamilies --pname hclient --bname main --release 4.0 ./manage.py computefamilies --pname hclient --bname main --release 4.1 ./manage.py computedoccoverage --pname hclient --bname main --release 4.0 --dname clienttut --srelease 4.0 ./manage.py computedoccoverage --pname hclient --bname main --release 4.1 --dname clienttut --srelease 4.0 ./manage.py comparecoverage --pname hclient --bname main --release1 4.0 --release2 4.1 --source d --pk 1 # Compute and show addition recommendations ./manage.py computeaddrecs --pname hclient --bname main --release1 4.0 --release2 4.1 --source d --pk 1 ./manage.py showaddrecs --pname hclient --bname main --release1 4.0 --release2 4.1 --source d --pk 1 # Compute and show deletion recommendations ./manage.py computeremoverecs --pname hclient --bname main --release1 4.0 --release2 4.1 --source d --pk 1 ./manage.py showremoverecs --pname hclient --bname main --release1 4.0 --release2 4.1 --source d --pk 1
For now, please look in the following modules:
- doc.syncer.generic_syncer (SingleURLSyncer)
This software is licensed under the New BSD License. See the LICENSE file in the for the full license text.