Skip to content

colab-nyuad/CrunchQA

Repository files navigation

CrunchQA: New Challenges in Question Answering over Knowledge Graphs

Image of Version Image of Dependencies

CrunchQA is a new dataset for question-answering on knowledge graphs (KGQA). The dataset was created to reflect the challenges we identified in real-world applications which are not covered by existing benchmarks, namely, multi-hop constraints, numeric and literal embeddings, ranking, reification, and hyper-relations.

The repository contains scripts for:

  • creating a Knowledge Graph from the Crunchbase database;
  • creating a Question Answering dataset based on multiple-hop templates and paraphrasing;
  • running experiments with state-of-the-art KGQA models on CrunchQA.

⚠️ IMPORTANT: Since the Crunchbase dataset is subject to licensing, the repository contains a script to process a dump and reconstruct KG. The dump provided by Cruchbase under the academic license contains all records till the current timestamp. To match the KG we used to generate questions, the script construct_kg.py processes data records until the given timestamp (31st January 2022 23:59:59 to match our KG).

🌟 CrunchQA can be downloaded from this link.

Quick start

# retrieve and install project in development mode
git clone https://github.com/colab-nyuad/CrunchQA

# set environment variables
source set_env.sh

Table of contents

  1. Data
  2. KG constraction
    1. KG construction from CSV
    2. KG construction from RDF
  3. QA templates
    1. Constraints Format
  4. QA dataset
  5. Evaluation
    1. Training Embeddings
    2. Running KGQA
  6. Results

Data

Crunchbase CSV export is updated each morning and includes separate files for companies, people, funding rounds, acquisitions, and IPOs. User key is required to obtain the export, which can be ontained under research access on the Crunchbase website. After downloading the export, please unzip it into the folder data.

Creating KG from crunchbase data

The Crunchbase data dump comprises 17 relational tables with primary and foreign keys to link tables together. To build the KG, we use a simple approach. We create new entities for each main entity type and use reification nodes to map the relationship between the base entity types and link additional information like start date, end date, title, etc. For the job titles we limit the range to a set of categories that can be found in jobs.json.

KG construction from CSV

The knowledge graph generated from the csv dump includes 3.2 million entities, 31 relations, and 17.6 million triples. Following is the structure of the created KG:

The command to generate KG:

python construct_kg.py

KG construction from RDF

In the paper A Linked Data Wrapper for CrunchBase, authors proposed a wraper around the CruchBase API that provides data in RDF format. The paper includes a link to the dump dated October, 2015. Since this dump is publicly available, we map the RDF data to the KG tirples fromat and provide a link for download. The mapped KG is smaller than the used KG for constracting the questions. This version is missing events and a set of atrributes for other entities but contains the product entities. This smaller version of KG and the scheme of it can be downloaded from link.

QA templates

The templates are classified into 3 categories:

  • 1-hop (inferential chain of the length 1 and at most 1 constraint)
  • 2-hop (inferential chain of the length 2 and at most 1 constraint)
  • advanced - the rest (Includes questions with inferential chain of length 1 to 3, with more than one constraints)

Following is an example of the advanced question template:

   {
    "main_chain": "org1-in_org-job1-has_job-person",
    "question": [
       "[org1] alumni who founded more than 5 companies",
       "founders of more than 5 companies who previously worked in [org1]",
       "list people who formerly worked in [org1] and founded more than 5 companies"
    ],
    "constraints": [
      {
       "entity_constraint": {
        "job1-is_current-job_current": ["False"]
       }
      },
      {
        "entity_constraint": {
          "person-has_job-job2-job_title-job_title": ["founder", "co-founder"]
         }
      },
      {
        "numeric_constraint": {
         "job2-in_org-org2": {
          "count_over":"org2",
          "group_by": ["person"],
          "numeric": ["", ">", 5]
         }
        }
       }
     ]
   }

Each template contains:

  • main_chain - a path in the KG leading form the head entity to the answer entity in the format entity1-relation1-entity2-relation2-.... relationn-entityn+1.
  • question - a language form of the question where '[]' indicates the head entity and '()' indicate contraints entitites. They will be replaced by the actual values of the entities when questions are generated from the templates.
  • constraints (entity constraint, temporal constraint, maximum constraint, numeric constraint)
  • type: temporal when a template requires temporal data, numeric when a template requires numeric data

The templates support multi-entity/relation type (format: entity1/entity2 or relation1/relation2) to cover questions, which can refer to multiple entities or relations, e.g., if we ask about investors, both companies and people can make investments, or if a question is about participating in an event without specifying a specific role, we should encounter all types of relations, i.e., sponsor, speaker, organizer, contestant and exhibitor. Subscripts (job1, job2) refer to the same column as job. The order is introduced for the convience of our implementation to simplify the join while attaching the constraints (for example, when "job" entity appears more than one time in a question template, we are able to indicate in our code which "job" entity to address to).

The following table shows the number of created templates.

1-hop 2-hop advanced total
FinQA 107 70 66 243

Constraints Format

In the following, in the description of each constraint type "constraint_chain" indicates a constraint inferential path, which can be up to 2-hop since some constraints involve the reification nodes. The begining of the chain is one of the entities from the main inferential chain (format: entitymain_chain-relation1-entity1-relation2-entity2).

Entity constraint requires the tail entity of the constraint_chain to be equal to a certain value (constraint_chain: [value1, value2, ..., valuen]. E.g., for queries asking about female employees it can be specified as gender = 'female':

    "entity_constraint": {
      "person-gender-gender": ["female"]
     }

Temporal constraint requires the date to be within a specified time range (constraint_chain: {} or {"before":year} or {"after":year} or {"between": [year1, year2]}). If the time range is not specified, we just sample with the indicated year or year+month depending on setting granularity. E.g, for queries asking about funding rounds announced between 2010 and 2020:

     "temporal_constraint": {
        "funding_round-announced_on-date": {
           "between": ["2010", "2020"]
        }
       }

Maximum constraint is introduced to reflect key words as "top", "at most", "the highest" and etc. Maximum is always computed within a group and works in two settings:

(1) grouping by the head entity and taking maximum by the column, or

(2) first counting over edges and then selecting the row with maximum count value.

  1. Grouping by the head entity is the default setting so we just need to specify by which column the maximum needs to be taken (constraint_chain: {"max": entity}). E.g, for the query asking which Software company raised the highest amount of money in its ipo, the question generator will group by "Software" (org_category) and select maximum among raised_price:
    "max_constraint": {
     "org-org_ipo-ipo-raised_amount_usd-raised_price": {
      "max": "raised_price"
     }
    }
  1. Another setting the maximum constraint supports is first counting over edges and then selecting maximum. For this setting we need to specify three fields:
    • "count_group_by" - the list of columns to group by
    • "count_over" - the column we count over while grouping and from wich we will select the maximum
    • "max_group_by": the entity around which the question is centered

E.g., for the query asking what types of events did Bill Gates mostly participate in, we need to group by Bill Gates and type of the events he participated in (person, event_type) and while grouping we count how many times (we count over the column events), after grouping we select maximun from the count for Bill Gates, in the notation of templates we select maximum from the count for central entity (person):

   "max_constraint": {
     "event-type_of-event_role": {
       "count_over": "event",
       "count_group_by": ["person", "event_type"],
       "max_group_by": ["person"],
     }
   }

Numeric constraint reflects the key words "more than", "less than", "at least". This constraint also supports two settings:

(1) filtering records by applying the specified condition on the columns values, or

(2) counting over edges and then applying the specified condition on this count.

For the first setting we just need to specify the condition in the format ["entity", ">|=|<",200]. For the second setting we need to specify:

  • "group_by" - the list of columns to group by
  • "count_over" - the column we count over while grouping and on which the condition will be applied
  • "numeric": condition in the format ["", ">|=|<", number]

E.g., for a query asking to list companies which acquired more than 50 companies, we need to group by organizations and count over organizations it acquired, then filter out the records where count is <= than 50:

    "numeric_constraint": {
      "org1-is_acquirer-acquisition-is_acquiree-org2": {
       "count_over":"org2",
       "group_by": ["org1"],
       "numeric": ["", ">", 50]
      }
   }

QA dataset

To generate the dataset from the templates:

python genrate_dataset.py --sample_size 200

Sample size indicates how many questions per template to genrate. We generetated 100 questions per 1, 2-hop templates and 200 per advanced templates. Command to shuffle and split generated questions into train, valid, test:

./qa_dataset/split_train_valid_test.bh

Evaluation

We focus our evaluation on EmbedKGQA, an approach that combines graph embeddings and reasoning using graph traversal. To learn KG embeddings, we use the PyKEEN library. We conduct experiments with three KG embedding models:

We first shuffle and split generated triples into train, valid, test:

./kg/split_train_valid_test.bh

Command to train kg embeddings using pykeen:

python train_embeddings.py --model DistMult --train_path kg/clustering/train.txt --valid_path kg/vanilla/valid.txt 
--test_path kg/vanilla/test.txt --dim 200 --results_folder embeddings/clustering_distmult

Link prediction results:

Model Setting MRR H@1 H@3
TransE Vanilla 0.165 0.119 0.181
ComplEx Vanilla 0.185 0.174 0.247
DistMult Vanilla 0.254 0.2 0.297
TransE Clustering 0.129 0.081 0.145
ComplEx Clustering 0.14 0.154 0.297
DistMult Clustering 0.226 0.175 0.265
ComplEx Literals 0.408 0.346 0.43
DistMult Literals 0.255 0.155 0.269

Command to run KGQA framework:

python run.py --model DistMult --embeddings_folder embeddings/clustering_distmult --freeze True

arguments:
  --model               [TransE, ComplEx, DistMult]
  --optimizer           [Adagrad,Adam]
  --max_epochs          Maximum number of epochs
  --patience            Number of epochs before early stopping for KG embeddings
  --valid_every         Number of epochs before validation for QA task
  --dim                 Embedding dimension
  --batch_size          Batch size for QA task 
  --learning_rate       Learning rate for QA task
  --freeze              Freeze weights of trained KG embeddings
  --use_cuda            Use gpu
  --gpu                 Which gpu to use
  --num_workers         Number of workers for parallel computing 
  --labels_smoothing    Labels smoothing
  --setting             ["vanilla", "clustering"]

Results

For more details regarding the different KGQA model settings please refer to the manuscript.

KGQA Model Vanilla Temporal Numeric
TransE 17% - -
ComplEx 12% - -
DistMult 12.7% - -
ComplExLiteral 13.86% - -
DistMultGatedLiteral 10.68% - -
TransE (clustering) 18.2% 8.09% 37.43%
ComplEx (clustering) 13.2% 8.89% 33.49%
DistMult (clustering) 11.1% 8.99% 8.86%

How to cite

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published