ast2vec

Machine Learning models on top of Abstract Syntax Trees.

The following are currently implemented:

id2vec, source code identifier embeddings
docfreq, source code identifier document frequencies (part of TF-IDF)
nBOW, weighted bag of vectors, as in src-d/wmd-relax
topic modeling

This project can serve as a foundation for MLoSC (Machine Learning on Source Code) research and development. It abstracts feature extraction and model handling, so you can focus on higher-level tasks.

It is written in Python 3 and has been tested on Linux and macOS. ast2vec is tightly coupled with Babelfish and delegates all AST parsing to it.

Here is the list of projects which are built with ast2vec:

vecino - finding similar repositories
tmsc - topic modeling of repositories
role2vec - AST node embedding and correction
snippet-ranger - topic modeling of source code snippets

Installation

pip3 install ast2vec

You need to have libxml2 installed. For example, on Ubuntu: apt install libxml2-dev.

Usage

This project exposes two interfaces: a Python API and a command line. The command line entry point is

ast2vec --help

There is an example of using Python API here.

The command line exposes several tools to generate the models and set up the environment.

The API is divided into two domains: models and training. The first is about using models, while the second is about creating them.
Models: Id2Vec, DocumentFrequencies, NBOW, Cooccurrences.
Transformers (keras/sklearn style): Repo2nBOWTransformer, Repo2CooccTransformer, PreprocessTransformer, SwivelTransformer and PostprocessTransformer.
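To illustrate what "keras/sklearn style" means here, the following is a minimal sketch of chained objects that each expose a transform() method. All class names in it (Transformer, Lowercase, Tokenize, CountTokens, Pipeline) are hypothetical and exist only for this illustration; they are not part of the ast2vec API, whose actual signatures should be looked up in the source.

```python
from collections import Counter


class Transformer:
    """Hypothetical base class: one step that maps input data to output data."""

    def transform(self, data):
        raise NotImplementedError


class Lowercase(Transformer):
    def transform(self, data):
        return data.lower()


class Tokenize(Transformer):
    def transform(self, data):
        return data.split()


class CountTokens(Transformer):
    def transform(self, data):
        return Counter(data)


class Pipeline(Transformer):
    """Feeds the output of each step into the next one."""

    def __init__(self, *steps):
        self.steps = steps

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


pipeline = Pipeline(Lowercase(), Tokenize(), CountTokens())
result = pipeline.transform("Read file READ")
```

The point of the style is that every stage has the same interface, so stages can be swapped or reordered without touching the code that drives them.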

Docker image

docker build -t srcd/ast2vec .
BBLFSH_DRIVER_IMAGES="python=docker://bblfsh/python-driver:v0.8.2;java=docker://bblfsh/java-driver:v0.6.0" docker run -e BBLFSH_DRIVER_IMAGES -d --privileged -p 9432:9432 --name bblfsh bblfsh/server:v0.7.0 --log-level DEBUG
docker run -it --rm srcd/ast2vec --help

If the first command fails with

Cannot connect to the Docker daemon. Is the docker daemon running on this host?

and you are sure that the daemon is running, then you need to add your user to the docker group: refer to the documentation.
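Once the Babelfish container is started, it can be handy to confirm that something is listening on the published port before running ast2vec. The sketch below only checks that a TCP connection to port 9432 (the port published in the docker run command above) can be opened; it does not verify that the server is healthy. The function name endpoint_is_reachable is an assumption made for this example.

```python
import socket


def endpoint_is_reachable(host="localhost", port=9432, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False


if __name__ == "__main__":
    print("Babelfish port reachable:", endpoint_is_reachable())
```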

Algorithms

Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

1. Clone or read the repository from disk.
2. Classify files using enry.
3. Extract UAST from each supported file.
4. Split and stem all the identifiers in each tree.
5. Traverse UAST, collapse all non-identifier paths and record all identifiers on the same level as co-occurring. Additionally, connect them with their immediate parents.
6. Write the individual co-occurrence matrices.
7. Merge co-occurrence matrices from all repositories. Write the document frequencies model.
8. Train the embeddings using Swivel running on TensorFlow. Interactively view the intermediate results in TensorBoard using --logs.
9. Write the identifier embeddings model.
10. Publish the generated models to Google Cloud Storage.

Steps 1-6 are performed with the repo2coocc tool / Repo2CooccTransformer class, step 7 with id2vec_preproc / id_embedding.PreprocessTransformer, step 8 with id2vec_train / id_embedding.SwivelTransformer, step 9 with id2vec_postproc / id_embedding.PostprocessTransformer and step 10 with publish.
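The splitting, co-occurrence and merging steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the ast2vec implementation: the Node class is a hypothetical stand-in for a UAST node, stemming is omitted, and the traversal reflects one reading of the description above (identifiers found at the same level below a node co-occur with each other and with the nearest enclosing identifier).

```python
import re
from collections import Counter
from itertools import combinations


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase tokens."""
    tokens = []
    for part in re.split(r"[^A-Za-z]+", name):
        # Keep acronyms together: "parseHTTPResponse" -> parse, HTTP, Response.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [token.lower() for token in tokens]


class Node:
    """Hypothetical tree node; identifier is None for non-identifier nodes."""

    def __init__(self, identifier=None, children=()):
        self.identifier = identifier
        self.children = list(children)


def nearest_identifier_nodes(node):
    """Closest descendants carrying an identifier, skipping other nodes."""
    found = []
    for child in node.children:
        if child.identifier is not None:
            found.append(child)
        else:
            found.extend(nearest_identifier_nodes(child))
    return found


def cooccurrences(root):
    """Count unordered token pairs as a sparse symmetric matrix."""
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        level = nearest_identifier_nodes(node)
        level_tokens = [t for n in level for t in split_identifier(n.identifier)]
        # Identifiers on the same level co-occur with each other.
        for a, b in combinations(level_tokens, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
        # Each of them also co-occurs with its immediate identifier parent.
        if node.identifier is not None:
            for parent_token in split_identifier(node.identifier):
                for token in level_tokens:
                    if parent_token != token:
                        counts[tuple(sorted((parent_token, token)))] += 1
        stack.extend(level)
    return counts


def merge(matrices):
    """Sum per-repository co-occurrence counts into one matrix."""
    total = Counter()
    for matrix in matrices:
        total.update(matrix)
    return total


tree = Node(children=[
    Node("readFile", [Node(children=[Node("file_path"), Node("data")])]),
])
counts = cooccurrences(tree)
```

Storing only sorted pairs keeps the matrix symmetric by construction, which is a convenient shape for merging counts from many repositories.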

Weighted Bag of Vectors

We represent every repository as a weighted bag-of-vectors, provided that we have document frequencies ("docfreq") and identifier embeddings ("id2vec").

1. Clone or read the repository from disk.
2. Classify files using enry.
3. Extract UAST from each supported file.
4. Split and stem all the identifiers in each tree.
5. Leave only those identifiers which are present in "docfreq" and "id2vec".
6. Set the weight of each such identifier as TF-IDF.
7. Set the value of each such identifier as the corresponding embedding vector.
8. Write the nBOW model.
9. Publish it to the Google Cloud Storage.

Steps 1-8 are performed with the repo2nbow tool / Repo2nBOWTransformer class and step 9 with publish.
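The filtering, weighting and embedding lookup steps can be sketched as below. This is a minimal illustration, not the repo2nbow code: the function name repo_to_nbow and its parameters are hypothetical, the inputs are plain dictionaries rather than ast2vec model objects, and the TF-IDF variant used (raw count times log(N / df)) is an assumption; the exact weighting in ast2vec may differ.

```python
import math
from collections import Counter


def repo_to_nbow(tokens, docfreq, num_docs, embeddings):
    """Build a weighted bag of vectors for one repository.

    tokens: identifier tokens extracted from the repository
    docfreq: token -> number of repositories containing it
    num_docs: total number of repositories behind docfreq
    embeddings: token -> embedding vector
    """
    # Keep only identifiers known to both models.
    term_freq = Counter(t for t in tokens if t in docfreq and t in embeddings)
    bag = {}
    for token, count in term_freq.items():
        # Assumed TF-IDF variant: raw term count times inverse document frequency.
        weight = count * math.log(num_docs / docfreq[token])
        bag[token] = (weight, embeddings[token])
    return bag


bag = repo_to_nbow(
    tokens=["read", "file", "file", "tmp"],
    docfreq={"read": 10, "file": 100, "tmp": 5},
    num_docs=1000,
    embeddings={"read": [0.1, 0.2], "file": [0.3, 0.4]},
)
```

In this toy input, "tmp" is dropped because it has no embedding, which mirrors the filtering step in the list above.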

Topic modeling

See here.

Contributions

We use PEP8 with a line length of 99 and " as the string quote character. All the tests must pass:

python3 -m unittest discover /path/to/ast2vec

License

Apache 2.0.
