Code needed to compute the features used by ResearchLink, condensed into a single file for convenience.
The hypotheses must be in the file data/hypotheses.csv
with the following format: subject,predicate,object,label,scicheck_score
. Then, execute run.py
to generate a processed file with all features computed for each hypothesis, which can then be passed on to a classifier.
A number of precomputations are needed for different features. In the data/
folder, we provide all of them for the CSKG-600 dataset.
- /CSKG-600.csv contains our validation dataset, containing 600 manually verified research hypotheses.
- /data contains files needed to run ResearchLink, mostly containing metadata needed to compute features:
- /data/compute_embeddings.py is a script used to compute graph-embeddings of each entity in the graph.
- /data/embeddings_.npy contains the pre-computed graph embeddings of each entity in the graph.
- /data/entities.txt contains a list of entities in the graph including their degree (total, in-degree, out-degree) and type.
- /data/entities_graph_emb.tsv contains a list of entities with the same index as their corresponding embedding.
- /data/entity2count.json contains the number of apparition of each entity in research abstracts, per year.
- /data/hypotheses.csv contains the score assigned by SciCheck to each triple in the CSKG-600 dataset.
- /data/max_cos_sim.csv contains the maximum cosine similarity of each triple in the CSKG-600 dataset to any other triple in the graph.
- /data/pair2freq.pkl contains the number of times pairs of entities appear together in research abstracts, per year.
- /data/train.txt contains the entire knowledge graph that can be used for training of relevant techniques.
- /test_baselines.py is a script used to test different baseline models on the CSKG-600 dataset.
- /run.py is the script that computes and stores ResearchLink's features, storing them in hypotheses_processed.csv. This script heavily relies on the files in /data.
- First, install the required libraries using
pip install -r requirements.txt
. - Compute the desired graph embeddings by running the compute_embeddings.py script.
- Use the run.py script to compute and store ResearchLink's features.
- Train any desired model from the computed features.
- Knowledge Graph (KG)
- Number of top entities (N)
- SciCheck threshold (0.5)
- Classified triples as true or false hypotheses
-
Generate Candidate Triples
- Identify top N entities in KG.
- For each entity ( e_i ) in top N entities:
- Retrieve its type and possible relations.
- For each relation ( r ), find compatible entities in top N.
- Generate triples ( (e_i, r, t) ) not already in KG.
-
Initial Pruning with SciCheck
- Train SciCheck on KG to compute confidence scores ( c ) for candidate triples.
- Prune triples with ( c < 0.5 ) and keep the rest.
-
Pre-compute Information for Features
- Compute TransH embeddings, degrees, occurrences, and joint occurrences for entities in KG.
-
Compute Features
- For each labelled triple ( (s, p, o) ) and pruned triple ( (s, p, o) ):
- Compute features ( f_1, f_2, \ldots, f_8 ).
- For each labelled triple ( (s, p, o) ) and pruned triple ( (s, p, o) ):
-
Train Random Forest Model
- Train Random Forest model on labelled triples using computed features.
-
Apply Model for Inference
- For each pruned triple ( (s, p, o) ):
- Apply Random Forest model to compute classification.
- Label the triple as true or false hypothesis.
- For each pruned triple ( (s, p, o) ):