An intelligent reading agent that understands text and translates it into Wikidata statements.

StrepHit

StrepHit is a Natural Language Processing pipeline that understands human language, extracts facts from text and produces Wikidata statements with references.

StrepHit is an IEG project funded by the Wikimedia Foundation.

StrepHit will enhance the data quality of Wikidata by suggesting references to validate statements, and will help Wikidata become the gold-standard hub of the Open Data landscape.

Official Project Page

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

Documentation

https://www.mediawiki.org/wiki/StrepHit

Features

Pipeline

  1. Corpus Harvesting
  2. Corpus Analysis
  3. Sentence Extraction
  4. N-ary Relation Extraction
  5. Dataset Serialization

Get Ready

  • Install Python 2.7 and pip
  • Clone the repository and create the output folder:
$ git clone https://github.com/Wikidata/StrepHit.git
$ mkdir StrepHit/output
  • Install all the Python requirements (preferably in a virtualenv)
$ cd StrepHit
$ pip install -r requirements.txt
  • Configure the Dandelion API endpoint and your API token:
NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'
NEX_TOKEN = 'your API token here'
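The two settings above point at the Dandelion API. As a rough illustration of how such an endpoint and token are typically combined, here is a minimal sketch that only builds a request URL and does not contact the network. The `build_nex_request` helper is hypothetical, and the `text`, `lang` and `token` query parameters are an assumption based on Dandelion's public API. How StrepHit itself issues the request is not shown in this README.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'
NEX_TOKEN = 'your API token here'


def build_nex_request(text, lang='en'):
    """Return a GET URL for the endpoint (hypothetical helper).

    The parameter names are assumptions, not taken from StrepHit.
    """
    params = [('text', text), ('lang', lang), ('token', NEX_TOKEN)]
    return NEX_URL + '?' + urlencode(params)


print(build_nex_request('Ada Lovelace was born in London'))
```

Replace the placeholder token with your own before sending any real request.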

Optional dependency

If you want to extract sentences via syntactic parsing, you will need to download Stanford CoreNLP:

$ python -m strephit commons download stanford_corenlp

Command Line

You can run all the NLP pipeline components from the command line. Run a command without arguments, or with --help, to see the available options. Each command may have its own sub-commands, depending on its granularity.

$ python -m strephit
Usage: __main__.py [OPTIONS] COMMAND [ARGS]...

Options:
  --log-level <TEXT CHOICE>...
  --cache-dir DIRECTORY
  --help                        Show this message and exit.

Commands:
  annotation          Corpus annotation via crowdsourcing
  classification      Roles classification
  commons             Common utilities used by others
  corpus_analysis     Corpus analysis module
  extraction          Data extraction from the corpus
  rule_based          Unsupervised fact extraction
  side_projects       Side projects scripts
  web_sources_corpus  Corpus retrieval from the web

Get Started

  • Generate a dataset of Wikidata assertions (QuickStatements syntax) from semi-structured data in the corpus (this takes time and requires a good internet connection):
$ python -m strephit extraction process_semistructured -p 1 samples/corpus.jsonlines
  • Produce a ranking of meaningful verbs:
$ python -m strephit commons pos_tag samples/corpus.jsonlines bio en
$ python -m strephit corpus_analysis rank_verbs output/pos_tagged.jsonlines bio en
  • Extract sentences using the ranked verbs and run entity linking:
$ python -m strephit extraction extract_sentences samples/corpus.jsonlines output/verbs.json en
$ python -m strephit commons entity_linking -p 1 output/sentences.jsonlines en
  • Extract facts with the rule-based classifier:
$ python -m strephit rule_based classify output/entity_linked.jsonlines samples/lexical_db.json en
  • Train the supervised classifier and extract facts:
$ python -m strephit annotation parse_results samples/crowdflower_results.csv
$ python -m strephit classification train output/training_set.jsonlines en
$ python -m strephit classification classify output/entity_linked.jsonlines output/classifier_model.pkl en
  • Serialize the supervised classification results into a dataset of Wikidata assertions (QuickStatements):
$ python -m strephit commons serialize -p 1 output/supervised_classified.jsonlines samples/lexical_db.json en
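For readers unfamiliar with QuickStatements, the sketch below formats one tab-separated statement with a reference URL. It illustrates the general syntax only. The exact lines StrepHit emits are not shown in this README, and the `quickstatement` helper and the example URL are hypothetical.

```python
def quickstatement(subject, prop, value, reference_url=None):
    """Format one QuickStatements line (hypothetical helper).

    Fields are tab-separated; a reference is given with an
    S-prefixed property, here S854 (reference URL), and string
    values are wrapped in double quotes.
    """
    fields = [subject, prop, value]
    if reference_url is not None:
        fields += ['S854', '"%s"' % reference_url]
    return '\t'.join(fields)


# Douglas Adams (Q42), place of birth (P19), Cambridge (Q350)
line = quickstatement('Q42', 'P19', 'Q350', 'http://example.org/bio')
print(line)
```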

N.B.: you will find all the output files in the output folder.
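If you prefer to chain the rule-based steps above from a script, a minimal sketch could look like the following. It reuses the commands exactly as listed in this README. The `STEPS` list, `build_commands` and `run_pipeline` are hypothetical names, and the dry-run default only prints the commands instead of executing them.

```python
import subprocess
import sys

# Sub-commands copied from the steps listed in this README.
STEPS = [
    ['commons', 'pos_tag', 'samples/corpus.jsonlines', 'bio', 'en'],
    ['corpus_analysis', 'rank_verbs', 'output/pos_tagged.jsonlines', 'bio', 'en'],
    ['extraction', 'extract_sentences', 'samples/corpus.jsonlines',
     'output/verbs.json', 'en'],
    ['commons', 'entity_linking', '-p', '1', 'output/sentences.jsonlines', 'en'],
    ['rule_based', 'classify', 'output/entity_linked.jsonlines',
     'samples/lexical_db.json', 'en'],
]


def build_commands(python=sys.executable):
    """Prefix every step with '<python> -m strephit'."""
    return [[python, '-m', 'strephit'] + step for step in STEPS]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print(' '.join(command))
        if not dry_run:
            # Stops at the first failing step.
            subprocess.check_call(command)


run_pipeline(dry_run=True)
```

Call `run_pipeline(dry_run=False)` from the repository root to actually execute the steps in order.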

Note on Parallel Processing

By default, StrepHit uses one process per CPU core of the machine it runs on. Use the -p parameter to set a different number of processes.

Set -p 1 to disable parallel processing.
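The behavior described above can be pictured with a small sketch based on Python's `multiprocessing` module. This is not StrepHit's actual implementation: `resolve_processes`, `run` and the toy `square` task are hypothetical and only mirror the documented semantics (default to the CPU count, run sequentially when the value is 1).

```python
import multiprocessing


def resolve_processes(p=None):
    """Default to one process per CPU core, as the README describes."""
    if p is None:
        return multiprocessing.cpu_count()
    if p < 1:
        raise ValueError('the number of processes must be at least 1')
    return p


def square(x):
    """Toy task standing in for a real pipeline step."""
    return x * x


def run(items, p=None):
    """Map the task over items, in-process when only one process is asked."""
    processes = resolve_processes(p)
    if processes == 1:
        # Equivalent of -p 1: no worker pool is created.
        return [square(item) for item in items]
    pool = multiprocessing.Pool(processes)
    try:
        return pool.map(square, items)
    finally:
        pool.close()
        pool.join()


print(run([1, 2, 3], p=1))
```

Running without a pool is also handy for debugging, since tracebacks are not hidden inside worker processes.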

License

The source code is released under the terms of the GNU General Public License, version 3.