GSoC 2019: Predicate Finder
Predicate Detection using Word Embeddings for Question Answering over Linked Data
Student: Yajing Bian
Mentors: Ram G Athreya, Rricha Jalota, Ricardo Usbeck
Proposal: Predicate Finder
Commit: Commit in dev branch
Question answering (QA) approaches allow users to query RDF datasets in natural language. The task can be completed in three steps: identifying named entities, detecting predicates, and generating SPARQL queries. This project targets the predicate detection step: it identifies the knowledge base (KB) relations that a natural language query refers to.
The main task is to build a mapping between natural language questions and knowledge base predicates. For instance, given the query "What was the university of the rugby player who coached the Stanford rugby teams during 1906-1917?", the expected results are "headcoach" and "university", because the query can be represented as the following two triples:
- < 1906–17 Stanford rugby teams, headcoach, ?x >
- < ?x, university, ?ans >
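To make the link between triples and the final SPARQL query concrete, here is a minimal sketch that joins triple patterns into one SELECT query. The function name `triples_to_sparql` and the bracketed identifiers are hypothetical placeholders, not verified DBpedia IRIs and not the project's own code.

```python
def triples_to_sparql(triples, answer_var="?ans"):
    """Join (subject, predicate, object) patterns into one SELECT query."""
    patterns = " .\n  ".join(f"{s} {p} {o}" for s, p, o in triples)
    return f"SELECT DISTINCT {answer_var} WHERE {{\n  {patterns} .\n}}"


# Placeholder identifiers for illustration only (assumption, not real IRIs).
triples = [
    ("<topic_entity>", "<headcoach>", "?x"),
    ("?x", "<university>", "?ans"),
]
query = triples_to_sparql(triples)
print(query)
```

The shared variable `?x` is what chains the two triples: the object of the first pattern becomes the subject of the second.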
One requirement of this project is that the system handles both simple and complex queries. The two kinds of queries are distinguished as follows:
- Simple queries can be answered using a SPARQL query with a single triple pattern, such as "What is the capital of France?"
- Complex queries involve multiple triples, COUNT/ASK keywords, or both, and exhibit large syntactic and structural variation. Nested relative clauses are also common. An example is "What was the university of the rugby player who coached the Stanford rugby teams during 1906-1917?"
The dataset specified for this project, LC-QuAD, contains 5000 questions, of which simple queries account for 18%. Reference: Trivedi, Priyansh, et al. "LC-QuAD: A corpus for complex question answering over knowledge graphs." International Semantic Web Conference. Springer, Cham, 2017.
Our predicate detecting system could be divided into the following sub-tasks:
- Entity linking: Execute entity linking to identify a topic entity in the question and retrieve top-k candidate entities from DBpedia.
- Build candidate predicate sets: Employ a predicate detector to predict the potential DBpedia relations that could exist between the entities in the question and the answer entities.
- Predicate detection: Find the gold predicate(s) of a given query from its candidate set built in the previous step.
The main contribution of this work is a predicate detection model that maps natural language questions to knowledge base predicates. In addition, a baseline is available for comparison.
As shown in the figure, the baseline process can be divided into the following steps:
- Entity linking: Using existing entity linking tools.
- Build candidate predicate sets: For each top-K candidate entity, collect all predicates occurring in DBpedia triples whose subject is that candidate entity.
- Remove predicate word(s) from the given query, and calculate both its 1-gram and 2-gram embeddings as the query representation (QR).
- For each candidate predicate, build its representation (PR) from the embedding of the predicate word and the embedding of its label. Calculate the similarity between PR and QR, which serves as the score S of that predicate.
- Multiply S by the idf score of the predicate, and use the result as its final score.
- If the query is simple, select the predicate with the highest final score. Otherwise, decompose the query and categorize it as one of two decomposable types: parallel or nested. When a complex question contains a nested relative clause, execute the sub-questions in order of syntactic dependency, so that the answer to one sub-question acts as the topic entity of the next. Repeat the similarity scoring and idf weighting steps until the predicates of all sub-questions are detected.
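The baseline scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the toy 3-dimensional embeddings, the idf values, averaging two word vectors to form a 2-gram vector, and taking the maximum similarity over all n-grams are all assumptions made for the example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def query_representation(tokens, emb):
    """1-gram vectors plus 2-gram vectors (assumed: mean of adjacent words)."""
    words = [emb[t] for t in tokens if t in emb]
    bigrams = [mean_vector([a, b]) for a, b in zip(words, words[1:])]
    return words + bigrams


def score_predicates(tokens, candidates, emb, idf):
    """Final score = best n-gram similarity S multiplied by the predicate's idf."""
    qr = query_representation(tokens, emb)
    scores = {}
    for pred, label_words in candidates.items():
        pr = mean_vector([emb[w] for w in label_words if w in emb])
        sim = max((cosine(q, pr) for q in qr), default=0.0)
        scores[pred] = sim * idf.get(pred, 1.0)
    return scores


# Toy data, invented for illustration only.
emb = {
    "what": [0.1, 0.1, 0.9],
    "capital": [1.0, 0.0, 0.0],
    "population": [0.0, 1.0, 0.0],
    "type": [0.5, 0.5, 0.5],
}
candidates = {
    "dbo:capital": ["capital"],
    "dbo:populationTotal": ["population"],
    "rdf:type": ["type"],
}
idf = {"dbo:capital": 2.0, "dbo:populationTotal": 2.0, "rdf:type": 0.1}

scores = score_predicates(["what", "capital"], candidates, emb, idf)
best = max(scores, key=scores.get)
print(best, scores)
```

The low idf of a ubiquitous predicate such as `rdf:type` keeps it from winning even when its raw similarity to the query is fairly high.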
How to Operate
Predicate Detection Model
- For entity linking and candidate predicate set construction, see the corresponding baseline steps.
Use a joint model to find a globally optimal entity-predicate assignment from local predictions. For a given entity-predicate pair, we extract the following features, which are passed to the XGBoost ranker as input:
- Question-Entity Similarity. We use the similarity score between the question and the entity returned by DBpedia Spotlight as a feature.
- Question-Predicate Similarity. We use the similarity score between the question and the predicate returned by the predicate detection module as a feature.
- Question-Relation Embedding Similarity. We use the similarity score between the question embedding and the relation embedding as a feature. The question embeddings are trained by our MGNN model, and the relation embeddings are pre-trained with the TransE model.
- Question-Predicate Overlap Number. We use the number of words shared between the question and the DBpedia predicate name as a feature.
- Question-Predicate Jaro-Winkler Distance. We use the Jaro-Winkler distance between the words in the question and the predicate name in DBpedia as a feature.
- Question-Predicted Answer Similarity. For each entity-predicate pair, we generate the corresponding query to retrieve the answers, and we use the semantic similarity score between the question and the answer entity description as a feature.
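Two of the features above are purely lexical and can be sketched without any external library. The following is a minimal illustration, not the project's code: the helper names are hypothetical, the tokenization is a simple regex, and taking the maximum Jaro-Winkler value over all question words is an assumption. Note that the code computes Jaro-Winkler similarity (1.0 for identical strings); the corresponding distance is 1 minus this value.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(question, predicate_label):
    """Number of distinct words shared by the question and the predicate label."""
    return len(set(tokenize(question)) & set(tokenize(predicate_label)))


def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def best_jaro_winkler(question, predicate_name):
    """Assumed aggregation: best match between any question word and the name."""
    name = predicate_name.lower()
    return max((jaro_winkler(w, name) for w in tokenize(question)), default=0.0)


question = "What is the capital of France?"
print(word_overlap(question, "capital"), best_jaro_winkler(question, "capital"))
```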
How to Operate
CUDA_VISIBLE_DEVICES=xx python train.py
CUDA_VISIBLE_DEVICES=xx python test.py
- Simple queries: 0.459
- Complex queries: 0.327
Predicate Detection Model
- Simple queries: 0.172
- Complex queries: 0.091
Tools and Technologies
- Python 3.7
- DBpedia Spotlight
This page will be updated as the project progresses.
Community Bonding Period (May 6th to May 26th)
- Have a meeting with mentors and understand the considerations.
- Introduce myself in Slack channel.
- Create GitHub repository.
- Set to work on Wiki page.
Week1 (May 27th to June 2nd)
- Obtain pre-trained word embeddings (both FastText and GloVe) using the PyTorch-NLP package.
- Filter queries using the number of parameters in SPARQL template as a first attempt.
- Implement the named entity recognition and entity linking steps using the queries obtained from the filtering step above.
- Create a data extraction demo that finds all the predicates of a given entity in DBpedia.
- Add .gitignore file and a script for dataset downloading.
Week2 (June 3rd to June 9th)
- Complete the semantic part of the query encoding step, which treats every query as a word sequence and obtains its embedding using an LSTM.
- Obtain syntactic tree of queries using Stanford Core NLP.
- Distinguish wh-words in given query using a pre-defined dataset.
- Convert syntactic tree of every query into a word sequence representing its syntactic structure, which considers wh-word and entity as its starting point and end point.
- Obtain the syntactic embedding using an LSTM.
Week3 (June 10th to June 16th)
- Inspect the LC-QuAD dataset and select the queries with templates 1, 2, 101, 151, and 152 as target data for the first period. Meanwhile, change the query filter to work on templates rather than parameter counts.
- Implement a structure to obtain the hierarchical feature embedding and the answer type embedding. The corresponding data acquisition is not complete yet.
- Complete a simple structure of MGNN, including the training and testing processes. Given a query, the program returns a series of <entity, predicate> pairs with their similarity scores. Because the SPARQL tool sometimes reports errors, we use a static predicate as test data.
Week4 (June 17th to June 23rd)
- Complete the hierarchical feature acquiring step.
- Remove feature "answer type".
- Add a module that enables GPU training.
- Add several exception-handling methods to improve robustness.
- Parse the given dataset LC-QuAD: Observe these five templates separately to extract golden pairs.
- Modify the model testing approach to read data automatically from files.
- Run the whole model and store result into a CSV file, which includes input query, the most likely pair and its similarity score.
Week5 (June 24th to June 30th)
- Compute the similarity between the query and each s-p pair using 5 different metrics, which serve as input features for XGBoost along with the MGNN result: Question-Entity Similarity, Question-Relation Embedding Similarity, Question-Predicate Overlap Number, Question-Predicate Jaro-Winkler Distance, and Question-Predicted Answer Similarity.
- Construct entity-predicate pairs with different scores for XGBoost training, such as (right_entity, right_predicate), (right_entity, wrong_predicate), and so on.
- Complete the training part of XGBoost model.
Week6 (July 1st to July 7th)
- Generate a set of candidate s-p pairs for each query.
- Calculate 6 kinds of similarities between each query and its s-p pairs, and write them out in LIBSVM format.
- Complete the testing part of XGBoost model and predict the best s-p pair for each query.
- Modify the approach to storing and loading the model.
- Modify the data extraction interface so that XGBoost features can be obtained directly.
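As an illustration of the LIBSVM output mentioned above, here is a minimal sketch that formats one candidate s-p pair as a line with a relevance label, a query id, and 1-based feature indices. The function name and the six feature values are hypothetical, and using a `qid:` field to group the candidates of one query is an assumption about how the ranking data is laid out.

```python
def to_libsvm_line(label, qid, features):
    """Format one candidate pair as 'label qid:Q 1:v1 2:v2 ...'."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats}"


# One correct and one incorrect candidate pair for the same query (toy values).
lines = [
    to_libsvm_line(1, 7, [0.9, 0.8, 0.7, 1, 0.96, 0.5]),
    to_libsvm_line(0, 7, [0.9, 0.2, 0.1, 0, 0.40, 0.1]),
]
print("\n".join(lines))
```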
Week7 (July 8th to July 14th)
- Adjust parameters of MGNN and XGBoost to improve performance.
- Make an attempt to split predicates, for instance, convert "isPrimaryTopicOf" into "is Primary Topic Of".
- Try to add some pre-defined rules to remove common predicates.
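The predicate splitting mentioned above can be sketched with a single regular expression. This is one possible approach, not necessarily the one used in the project; the function name `split_predicate` is hypothetical.

```python
import re


def split_predicate(name):
    """Insert a space wherever a lowercase letter or digit meets an uppercase letter."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)


print(split_predicate("isPrimaryTopicOf"))
```

Splitting lets each word of the predicate name be looked up in a word embedding vocabulary, which a fused camelCase token would typically miss.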
Week8 (July 15th to July 21st)
- Consider both subject and object when collecting candidate entities.
- Replace the hierarchical feature with the entity type, encoded as a one-hot vector.
- Replace word-level embedding with pre-trained embedding and compare their performance.
Week9 (July 22nd to July 28th)
- Implement a baseline method that predicts the answer using only the similarity between the candidate predicate embedding and the query embedding. Calculate the similarity between the candidate predicate and every word in the given query, and take the maximum similarity as the predicate's score.
- Add idf feature to avoid interference by common predicates.
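One way to obtain such an idf weight is to treat the predicate set of each entity as a document, so that predicates attached to almost every entity receive a weight near zero. The sketch below assumes this definition and uses invented toy data; the project may compute idf differently.

```python
import math


def predicate_idf(entity_predicates):
    """idf(p) = log(N / df(p)), where df(p) counts the entities that have p."""
    n = len(entity_predicates)
    df = {}
    for preds in entity_predicates.values():
        for p in set(preds):
            df[p] = df.get(p, 0) + 1
    return {p: math.log(n / count) for p, count in df.items()}


# Toy entity-to-predicates mapping, for illustration only.
entity_predicates = {
    "France": ["rdf:type", "dbo:capital"],
    "Paris": ["rdf:type", "dbo:birthPlace"],
    "Berlin": ["rdf:type", "dbo:birthPlace"],
}
idf = predicate_idf(entity_predicates)
print(idf)
```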
Week10 (July 29th to August 4th)
- Replace GloVe embedding with FastText.
- Entities returned by the entity-linking tool sometimes contain special characters such as ",+.)('. Add escape characters to solve this problem.
- Calculate 1-gram and 2-gram embedding simultaneously as query representation for a higher tolerance rate.
- If there exist several entities in a query, calculate similarity score using query words except entities.
Week11 (August 5th to August 11th)
- Attempt to use several approaches to scale the weight of idf feature.
- Attempt to represent predicate using both word embedding and label embedding, which would be used to calculate the similarity between query and given predicate.
- Modify the Maxpooling layer and predicate part of MGNN, as well as the methods to obtain the quota.
Week12 (August 12th to August 18th)
- Complete the part of decomposing and extracting s-p pair from complex queries.
- Adapt the model to complex queries by serially executing a set of simple triples.
- Calculate the prediction accuracy on complex queries.
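The serial execution of simple triples can be sketched with a toy in-memory knowledge base, where the answers to one sub-question become the topic entities of the next. The names `answer_chain`, `kb`, and the entities below are invented for illustration; in the real system each hop would involve predicate detection and a query against DBpedia.

```python
def answer_chain(topic_entity, predicates, kb):
    """Follow one predicate per sub-question, feeding each answer into the next hop."""
    current = {topic_entity}
    for pred in predicates:
        current = {obj for subj in current for obj in kb.get((subj, pred), ())}
    return current


# Toy knowledge base: (subject, predicate) -> list of objects.
kb = {
    ("TeamA", "headcoach"): ["CoachX"],
    ("CoachX", "university"): ["UniY"],
}
answers = answer_chain("TeamA", ["headcoach", "university"], kb)
print(answers)
```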