kgdata

KGData is a library for processing Wikipedia and Wikidata dumps. It can:

Clean up the dumps so the data is consistent (resolve redirects, remove dangling references).
Create embedded key-value databases for accessing entities from the dumps.
Extract the Wikidata ontology.
Extract Wikipedia tables and convert their hyperlinks to Wikidata entities.
Create Pyserini indices for searching Wikidata entities.
And more.
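
To make the first item concrete, here is a minimal sketch of what resolving redirects and removing dangling references can mean for a small in-memory graph. It is not KGData's API: the names `resolve_redirect` and `clean_entities`, the dictionary layout, and the sample ids are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def resolve_redirect(entity_id: str, redirects: Dict[str, str]) -> str:
    """Follow a redirect chain to its final id (hypothetical helper).

    The `seen` set guards against redirect cycles in malformed data.
    """
    seen = set()
    while entity_id in redirects and entity_id not in seen:
        seen.add(entity_id)
        entity_id = redirects[entity_id]
    return entity_id


def clean_entities(
    entities: Dict[str, List[str]], redirects: Dict[str, str]
) -> Dict[str, List[str]]:
    """Rewrite each reference to its redirect target, then drop references
    to ids that do not exist in `entities` (dangling references)."""
    cleaned = {}
    for entity_id, targets in entities.items():
        resolved = (resolve_redirect(t, redirects) for t in targets)
        cleaned[entity_id] = [t for t in resolved if t in entities]
    return cleaned


# Toy data: "Q3_old" redirects to "Q3"; "Q404" does not exist anywhere.
entities = {"Q1": ["Q2", "Q3_old", "Q404"], "Q2": [], "Q3": ["Q1"]}
redirects = {"Q3_old": "Q3"}
cleaned = clean_entities(entities, redirects)
print(cleaned)
```

In this toy input the reference to "Q3_old" is rewritten to "Q3" and the reference to "Q404" is dropped. Real dumps are far too large for in-memory dictionaries, which is why the library works with Spark and embedded databases instead.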

For full documentation, please see the website.

Installation

From PyPI (using pre-built binaries):

pip install kgdata[spark]   # omit [spark] to install Spark yourself if your cluster runs a different version
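
If you are unsure which versions ended up in your environment, a short check like the one below can help when comparing against your cluster. This is a sketch using only the Python standard library; `installed_version` is a hypothetical helper name, and it assumes the Spark bindings are distributed under the package name `pyspark`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what is installed so it can be compared with the cluster's Spark version.
for name in ("kgdata", "pyspark"):
    print(name, installed_version(name) or "not installed")
```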
