Knowledge Base Question Answering

develop branch: [Deployment, Testing and Gerbil status badges]

master branch: [Deployment, Testing and Gerbil status badges]

Explore our QA system: kbqa-pg.cs.upb.de

KBQA Library

Within ./KBQA you find the library of this project. It contains READMEs to guide you through the library, and the modules are thoroughly documented. We worked on two different approaches, called App A and App B.

App B enhances SPBERT by adding triples as additional knowledge. Its major parts are the summarizers and the transformer architectures. For more information on App B, visit /KBQA/appB.

To use most of our modules, install the library, e.g. with

pip install -e <path to cloned library>/KBQA-PG

In the following, we provide examples of how to use our library.

Examples

App B: bert-spbert-spbert

This example goes through the steps to use our bert-spbert-spbert model. The model consists of a BERT encoder for questions and an SPBERT encoder for triples. The output tokens of both encoders are concatenated and fed into the decoder, which is another SPBERT instance.
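As a rough sketch of this data flow (a minimal illustration with hypothetical names, assuming encoders that return hidden states of shape (batch, sequence, hidden); this is not the actual module code):

import torch

def encode_and_decode(bert_encoder, spbert_encoder, spbert_decoder,
                      question_ids, triple_ids):
    """Sketch of bert-spbert-spbert: encode question and triples
    separately, concatenate the encoder outputs along the sequence
    dimension, and let the SPBERT decoder attend to the result."""
    question_states = bert_encoder(question_ids)  # (batch, q_len, hidden)
    triple_states = spbert_encoder(triple_ids)    # (batch, t_len, hidden)
    memory = torch.cat([question_states, triple_states], dim=1)
    return spbert_decoder(memory)  # decoded SPARQL query tokens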

Training

Training this model corresponds to the following steps:

  1. qtq-Dataset Generation
  2. qtq-Dataset Preprocessing
  3. Training the Model

We will go through them in detail next.

1. qtq-Dataset Generation

The preprocessors are designed to parse a qtq-dataset (question-triple-query dataset). Therefore, the first step is to generate such a dataset with the data_generator. The following steps generate the qtq-dataset from the qald-9-train-multilingual dataset using the LaurenSummarizer:

  1. Navigate to the data_generator directory: cd KBQA/appB/data_generator
  2. Run the command python generator -d qald-9/updated/updated-qald-9-train-multilingual.json -s LaurenSummarizer --output qtq-updated-qald-9-train-multilingual.json

The generator will then generate the qtq-dataset and save it to the datasets directory. Note that for qald-8, qald-9 and lc-quad, all qtq-datasets have already been generated and can be found in this directory.
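For orientation, a single entry of a qtq-dataset pairs a question with supporting triples and the corresponding SPARQL query. The field names below are illustrative and may differ from the actual file format:

qtq_entry = {
    "question": "Who developed Skype?",
    "triples": ["dbr:Skype dbo:developer dbr:Microsoft ."],
    "query": "SELECT DISTINCT ?uri WHERE { dbr:Skype dbo:developer ?uri . }",
}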

2. qtq-Dataset Preprocessing

Every model has specific preprocessors for the input data. You find the preprocessor corresponding to a model in the README of the model you want to use. The input data must be in qtq-format.

We want to use bert-spbert-spbert, which is located in KBQA/appB/transformer_architectures/bert_spbert_spbert. The README there explains how to preprocess: first follow the installation instructions, then the preprocessing instructions.

3. Training the Model

This step is also explained in the model README. Follow the example in the Model Usage section. The trained model will be stored as /output/checkpoint-best-bleu/pytorch_model.bin by default. This binary can be passed as an argument in the prediction phase.

Prediction

Prediction with the trained model consists of the following steps:

  1. qtq-Dataset Generation
  2. qtq-Dataset Preprocessing
  3. Prediction with Trained Model
  4. Decode Prediction

Steps 1 and 2 are the same as in the training example, except that the qtq-dataset will not contain any queries.

Step 3 is again explained in the Model Usage section of the README.

For step 4, follow the Postprocessing section.

App B: Other Models

All the models in KBQA/appB/transformer_architectures follow a scheme similar to the App B: bert-spbert-spbert example. The corresponding READMEs provide the information needed to use each model.

Contributing

Workflow

We use the workflow described by GitHub flow.

Follow the guidelines for commit messages.

A properly formed Git commit subject line should always be able to complete the following sentence:

If applied, this commit will <your subject line here>
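For example, given the hypothetical subject line "Add output path option to the data generator", the sentence reads: If applied, this commit will Add output path option to the data generator.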

Branches

We have two main branches: master for releases and develop for development builds. These branches always contain completed changes and executable, formatted code, and they are deployed to our server. Therefore, these branches are protected, and only reviewed pull requests (PRs) can be merged. For every feature/topic, a new branch based on develop has to be created (see the example after the table below). A PR should only contain one topic and can also be opened as a draft to get early feedback. When merging a PR, individual commits can be combined (rebased) if they describe a related change.

Instance        | Branch    | Description
Release         | master    | Accepts merges from Develop
Working         | develop   | Accepts merges from Features/Issues and Hotfixes
Features/Issues | feature/* | A branch for each Feature/Issue
Hotfix          | hotfix/*  | Always branch off Develop
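For example, a new feature branch could be created like this (the branch name is hypothetical):

git checkout develop
git pull
git checkout -b feature/lauren-summarizer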

Folder structure

The top directory contains only configuration files that refer to this repository. Everything else is in the KBQA folder:

The end-to-end system that is automatically deployed on the VM is located in the folder kbqa. Other topics that are not (yet) included in the end-to-end system should have their own folder.

Code Style

We use the standard style guides of the respective languages.

Python conventions

For Python, this is PEP 8.

Type hints (PEP 484) should be used whenever possible. Static analysis can then ensure that variables and functions are used correctly.

For documenting the code, we use docstrings (PEP 257). Every method and class has a docstring describing its function and arguments. We follow the numpy docstring format. By using consistent docstrings throughout the project, we can automatically generate a code documentation website.
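A minimal (hypothetical) example of a function following these conventions:

from typing import List


def answer_question(question: str, limit: int = 10) -> List[str]:
    """Return candidate answers for a natural-language question.

    Parameters
    ----------
    question : str
        The natural-language question to answer.
    limit : int, optional
        The maximum number of answers to return, by default 10.

    Returns
    -------
    List[str]
        The candidate answers, best first.
    """
    return []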

Setup

Installation

In order to include modules from different directories, you can install the project as a package. This way the project can be split into different subdirectories/subprojects, which can import each other. The installation can be done by running the following command in this directory:

pip install -e .

After that, you can import all source files starting from the root directory KBQA.
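For example, the data generator used in the examples above could then be imported as follows (assuming the directories are Python packages; the exact module names may differ):

from KBQA.appB import data_generator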

Linters

We use different linters to apply style rules and point out issues in the code. This greatly simplifies code reviews and helps to detect errors early on.

To automatically run the linters on every commit, we use pre-commit. To set up pre-commit, run the following commands once:

pip install pre-commit
pre-commit install

Now, on every commit, the linters defined in the pre-commit config will run automatically.
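You can also run all linters manually on the entire code base with

pre-commit run --all-files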

If you are in a hurry, you can skip the linting with git commit --no-verify. However, the pipeline has to pass before you can merge into the develop branch.

Exclude external code files

The linters should not be applied to external code files (libraries), configs, and non-code folders, as they do not have to meet our coding conventions. Therefore, these files or folders have to be excluded from the linting process. This can be done in the pre-commit config by adding the files or folders to the exclude option, which is a regular expression. Example: exclude: ^documentation/

Recommended VS Code Extensions

Python Docstring Generator

Quickly generate docstrings for Python functions in the right format by typing triple quotes.

About

Project Group 2021/2022: Knowledge Base Question Answering
