Explore our QA system: kbqa-pg.cs.upb.de
Within ./KBQA you will find the library of this project. It is filled with READMEs to guide you through the library, and the modules are thoroughly documented. We worked on two different approaches, called App A and App B.
App B enhances SPBERT by adding triples as additional knowledge. Its major parts are the summarizers and the transformer architectures. For more information on App B, visit /KBQA/appB.
To use most of our modules, install the library, e.g. with
pip install -e <path to cloned library>/KBQA-PG
In the following, we provide examples of how to use our library.
This example goes through the steps to use our bert-spbert-spbert model. This model consists of an encoder BERT model for questions and an encoder SPBERT model for triples. The output tokens of both models are concatenated and fed into the decoder, which is another SPBERT instance.
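The data flow described above can be sketched as follows. This is only an illustration with toy shapes and zero tensors, not the actual implementation; the hidden size and sequence lengths are assumptions:

```python
import numpy as np

# Two encoders produce token representations that are concatenated
# along the sequence axis before being handed to the decoder.
hidden = 768            # assumed BERT/SPBERT hidden size
q_len, t_len = 32, 64   # illustrative sequence lengths

question_tokens = np.zeros((q_len, hidden))  # BERT question-encoder output
triple_tokens = np.zeros((t_len, hidden))    # SPBERT triple-encoder output

# The decoder (another SPBERT instance) attends over the combined sequence.
encoder_output = np.concatenate([question_tokens, triple_tokens], axis=0)
print(encoder_output.shape)  # (96, 768)
```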
Training this model corresponds to the following steps:
- qtq-Dataset Generation
- qtq-Dataset Preprocessing
- Training the Model
We will go through them in detail next.
The preprocessors are designed to parse a qtq-dataset (question-triple-query dataset). Therefore, the first step is generating such a dataset with the data_generator. The following steps generate the qtq-dataset from the dataset qald-9-train-multilingual using the LaurenSummarizer:
- Navigate to the data_generator directory:
cd KBQA/appB/data_generator
- Run the command
python generator -d qald-9/updated/updated-qald-9-train-multilingual.json -s LaurenSummarizer --output qtq-updated-qald-9-train-multilingual.json
The generator will then generate the qtq-dataset and save it to the datasets directory. Note that for qald-8, qald-9 and lc-quad, all qtq-datasets have already been generated and can be found in this directory.
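Since qtq-datasets are JSON files, they can be inspected with the standard json module. The entry below is hypothetical; its field names are an assumption derived from the "question-triple-query" naming, so check a generated file in the datasets directory for the exact schema:

```python
import json

# Hypothetical qtq entry (field names assumed, not verified against
# the generator's actual output format).
qtq_entry = {
    "question": "Who developed Skype?",
    "triples": ["dbr:Skype dbo:developer dbr:Microsoft ."],
    "query": "SELECT ?dev WHERE { dbr:Skype dbo:developer ?dev }",
}

# Round-trip through JSON, as the generator stores datasets as .json files.
serialized = json.dumps({"questions": [qtq_entry]}, indent=2)
restored = json.loads(serialized)
print(restored["questions"][0]["question"])  # Who developed Skype?
```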
Every model has specific preprocessors for its input data. You will find the preprocessor corresponding to each model in the README of the model you want to use. The input data must be in qtq-format.
We want to use bert-spbert-spbert, which is located in KBQA/appB/transformer_architectures/bert_spbert_spbert. The README there explains how to preprocess. First, follow the installation instructions, then the preprocessing instructions.
This step is also explained in the model README. Follow the example in the Model Usage section. The trained model will be stored as /output/checkpoint-best-bleu/pytorch_model.bin by default. This binary can be passed as an argument to the prediction phase.
Prediction with the trained model consists of the following steps:
- qtq-Dataset Generation
- qtq-Dataset Preprocessing
- Prediction with Trained Model
- Decode Prediction
Steps 1 and 2 are the same as in the training example, except that the qtq-dataset will not contain any queries.
Step 3 is again explained in the README under Model Usage.
For step 4, follow the Postprocessing part.
All the models in KBQA/appB/transformer_architectures follow schemes similar to the App B: bert-spbert-spbert example. The corresponding READMEs should provide you with the information needed to use each model.
We use the workflow described by GitHub flow.
Follow the guidelines for commit messages.
A properly formed git commit subject line should always be able to complete the following sentence:
If applied, this commit will <your subject line here>
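For example, a hypothetical subject line (the feature name is borrowed from this project for illustration) could be:

```
Add LaurenSummarizer support to the qtq generator
```

which completes the sentence as "If applied, this commit will add LaurenSummarizer support to the qtq generator."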
We have two main branches: master for releases and develop for development builds. These branches always contain completed changes and executable, formatted code, and will be deployed to our server. Therefore, these branches are protected, and only reviewed pull requests (PRs) can be merged. For every feature/topic, a new branch based on develop has to be created. A PR should only contain one topic and can also be opened as a draft to get early feedback. When merging a PR, individual commits can be combined (squashed) if they describe a related change.
| Instance | Branch | Description |
|---|---|---|
| Release | master | Accepts merges from develop |
| Working | develop | Accepts merges from feature/issue branches and hotfixes |
| Features/Issues | feature/* | A branch for each feature/issue |
| Hotfix | hotfix/* | Always branches off develop |
The top directory contains only configuration files that refer to this repository. Everything else is in the KBQA folder:
The end-to-end system that is automatically deployed on the VM is located in the folder kbqa. Other topics that are not (yet) included in the end-to-end system should have their own folder.
We use the standard style guides.
For Python, this is PEP 8.
Type hints (PEP 484) should be used whenever possible. Static analysis can then ensure that variables and functions are used correctly.
For documenting the code we use docstrings (PEP 257). Every method and class has a docstring describing its function and arguments. We follow the numpy docstring format. Using consistent docstrings in the project, we automatically create a code documentation website.
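A minimal sketch of what this looks like in practice, combining type hints (PEP 484) with a numpy-style docstring — the function itself is just an illustrative example, not part of the library:

```python
def normalize(scores: list[float]) -> list[float]:
    """Scale a list of scores so that they sum to one.

    Parameters
    ----------
    scores : list of float
        Non-negative scores with a positive sum.

    Returns
    -------
    list of float
        The scores divided by their total.
    """
    total = sum(scores)
    return [score / total for score in scores]


print(normalize([1.0, 1.0, 2.0]))  # [0.25, 0.25, 0.5]
```

With consistent docstrings like this, documentation generators can render the Parameters and Returns sections directly on the documentation website.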
In order to include modules from different directories, you can install the project as a package. This way, the project can be split into different subdirectories/subprojects, which can import each other. The installation can be done by running the following command in this directory:
pip install -e .
After that, you can import all source files starting with the root directory KBQA
.
We use different linters to apply style rules and point out issues in the code. This greatly simplifies code reviews and helps to detect errors early on.
To automatically run the linters on every commit, we use pre-commit. To setup pre-commit once run the following commands:
pip install pre-commit
pre-commit install
Now, on every commit, the linters defined in the pre-commit config will run automatically.
If you are in a hurry, you can skip the linting with git commit --no-verify.
However, to merge into the develop branch, the pipeline has to pass.
The linters should not be applied to external code files (libraries), configs, and non-code folders as they do not have to meet our coding conventions. Therefore, these files or folders have to be excluded from the linting process. This can be done in the pre-commit config by adding the files or folders to the exclude option, which is a regular expression.
Example: exclude: ^documentation/
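An illustrative fragment of .pre-commit-config.yaml — the paths below are examples, so adapt them to the folders you actually want to skip:

```yaml
# The value is a single regular expression matched against each file path;
# any match is excluded from all hooks.
exclude: ^(documentation/|external_libs/)
```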
You can quickly generate docstrings for Python functions in the right format by typing triple quotes.