# Knowledge bases

<sup>This notebook is part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [slavko.zitnik@fri.uni-lj.si](mailto:slavko.zitnik@fri.uni-lj.si) with any comments.</sup>

## Open Information extraction (unsupervised approaches)

### Open information(/relation) extraction

Supervised information extraction tasks require classification into predefined classes. A purely unsupervised, end-to-end alternative is open information extraction: a method extracts tuples of the form (subject, predicate, object) without assigning any of their elements to a predefined class.

These methods mostly rely on bootstrapping or semi-supervised techniques to define lexical rules for extraction. Although the task may seem easy, it is hard to retrieve clean examples and to evaluate the results.
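A minimal sketch of what a single lexical rule might look like, assuming the sentence is already tokenized and POS-tagged. The tags are written by hand here, and the tag sets and the one rule (noun phrase, verb phrase, noun phrase) are illustrative assumptions, not the rules of any particular system.

```python
from typing import List, Optional, Set, Tuple

Tagged = List[Tuple[str, str]]  # (token, coarse POS tag), tagged by hand here

# Hypothetical tag groups chosen for this sketch.
NOUN_TAGS = {"DET", "ADJ", "NOUN", "PROPN", "NUM"}
PRED_TAGS = {"AUX", "VERB", "ADP"}


def take_while(tagged: Tagged, start: int, allowed: Set[str]) -> int:
    """Return the index of the first token at or after `start` whose tag is not allowed."""
    i = start
    while i < len(tagged) and tagged[i][1] in allowed:
        i += 1
    return i


def extract_triple(tagged: Tagged) -> Optional[Tuple[str, str, str]]:
    """Apply one rule: noun phrase, then verb phrase, then noun phrase."""
    subj_end = take_while(tagged, 0, NOUN_TAGS)
    pred_end = take_while(tagged, subj_end, PRED_TAGS)
    obj_end = take_while(tagged, pred_end, NOUN_TAGS)
    if subj_end == 0 or pred_end == subj_end or obj_end == pred_end:
        return None  # the rule did not match this sentence
    words = [token for token, _ in tagged]
    return (
        " ".join(words[:subj_end]),
        " ".join(words[subj_end:pred_end]),
        " ".join(words[pred_end:obj_end]),
    )


sentence = [
    ("Marie", "PROPN"), ("Curie", "PROPN"),
    ("discovered", "VERB"), ("polonium", "NOUN"),
]
print(extract_triple(sentence))  # ('Marie Curie', 'discovered', 'polonium')
```

Note that the predicate is whatever text the rule captures; it is never mapped to a predefined relation label, which is the defining property of the open setting.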

The area became popular with the introduction of the *TextRunner* system in 2007, which was followed by a series of new systems. The same research group released the latest version, called [Open IE 5.0](https://github.com/dair-iitd/OpenIE-standalone), at the beginning of 2017. The output of such systems is very useful for question answering (QA) systems, as you can test at [http://openie.allenai.org/](http://openie.allenai.org/).

## Information extraction (supervised approaches)

### Named entity recognition

Named entity recognition is a sequence labeling task. The goal of the algorithm is to assign a class to each token in a sequence (see previous lab sessions for examples).
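To illustrate what "a class for each token" means in practice, here is a small sketch that turns per-token labels into entity spans. It assumes the common BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity); the scheme, the function name and the example labels are assumptions made for illustration.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close the open entity (if any) and start a new one.
            # An I- tag without a matching B- is tolerated and treated as a start.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Ada", "Lovelace", "was", "born", "in", "London", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))  # [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```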

### Coreference resolution

Coreference resolution is the task of clustering mentions. It consists of two steps:

1. Mention identification.
2. Mention clustering. 

Mentions refer to underlying entities and are of named, nominal or pronominal type.

There are many supervised and unsupervised approaches to coreference resolution. One of the best-known approaches is the `Sieve-based coreference resolution` system by the Stanford NLP group, which achieves results comparable to the state of the art while employing only predefined lexical rules.
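The sieve idea can be sketched as a sequence of rules applied from most to least precise, where each rule may link a still-unresolved mention to an earlier one. The code below is a toy with two made-up sieves and hand-annotated mentions; it is not the Stanford system, and the `Mention` fields are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Mention:
    text: str
    kind: str    # "named", "nominal" or "pronominal"
    gender: str  # hand-annotated for this sketch: "f", "m" or "n"


Sieve = Callable[[List[Mention], int], Optional[int]]


def exact_match_sieve(mentions: List[Mention], i: int) -> Optional[int]:
    """Higher-precision sieve: link to an earlier mention with identical text."""
    if mentions[i].kind == "pronominal":
        return None
    for j in range(i - 1, -1, -1):
        if mentions[j].text.lower() == mentions[i].text.lower():
            return j
    return None


def pronoun_sieve(mentions: List[Mention], i: int) -> Optional[int]:
    """Lower-precision sieve: link a pronoun to the nearest earlier
    non-pronoun mention with the same gender."""
    if mentions[i].kind != "pronominal":
        return None
    for j in range(i - 1, -1, -1):
        if mentions[j].kind != "pronominal" and mentions[j].gender == mentions[i].gender:
            return j
    return None


def resolve(mentions: List[Mention], sieves: List[Sieve]) -> List[int]:
    """Run the sieves in order and return one cluster id per mention."""
    parent = list(range(len(mentions)))  # every mention starts as a singleton

    def find(i: int) -> int:
        while parent[i] != i:
            i = parent[i]
        return i

    for sieve in sieves:
        for i in range(len(mentions)):
            if parent[i] != i:
                continue  # already linked by an earlier sieve
            antecedent = sieve(mentions, i)
            if antecedent is not None:
                parent[i] = find(antecedent)
    return [find(i) for i in range(len(mentions))]


mentions = [
    Mention("Ada Lovelace", "named", "f"),
    Mention("she", "pronominal", "f"),
    Mention("Charles Babbage", "named", "m"),
    Mention("Ada Lovelace", "named", "f"),
    Mention("he", "pronominal", "m"),
]
print(resolve(mentions, [exact_match_sieve, pronoun_sieve]))  # [0, 0, 2, 0, 2]
```

Mentions with the same id belong to the same cluster, so the output groups both "Ada Lovelace" mentions with "she", and "Charles Babbage" with "he".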

Python bindings to a recent state-of-the-art system are available in a [public source code repository](https://github.com/huggingface/neuralcoref) along with a [web-based example](https://huggingface.co/coref/).

### Relation(ship) extraction

Relationship extraction, or relation extraction, is another information extraction task, in which the idea is to identify relationships between mentions or text spans. The full relationship extraction task consists of:

1. Subject and object identification.
2. Relationship identification and extraction.
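The two steps above can be sketched as follows, assuming entity spans and their types are already available (for example from a named entity recognizer). The relation labels and trigger words in `RULES` are hypothetical and only serve to make the example runnable; real systems learn this classification from tagged data.

```python
from itertools import permutations
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start token, end token exclusive, entity type)

# Hypothetical inventory: (subject type, object type, trigger word) -> relation label
RULES = {
    ("PER", "LOC", "born"): "born_in",
    ("PER", "ORG", "works"): "employed_by",
}


def extract_relations(tokens: List[str], entities: List[Entity]) -> List[Tuple[str, str, str]]:
    relations = []
    # Step 1: every ordered pair of entities is a candidate (subject, object).
    for subj, obj in permutations(entities, 2):
        if subj[1] > obj[0]:
            continue  # toy restriction: the subject must precede the object
        # Step 2: classify the pair using the words between the two spans.
        for word in tokens[subj[1]:obj[0]]:
            label = RULES.get((subj[2], obj[2], word.lower()))
            if label is not None:
                relations.append((
                    " ".join(tokens[subj[0]:subj[1]]),
                    label,
                    " ".join(tokens[obj[0]:obj[1]]),
                ))
                break
    return relations


tokens = ["Ada", "Lovelace", "was", "born", "in", "London"]
entities = [(0, 2, "PER"), (5, 6, "LOC")]
print(extract_relations(tokens, entities))  # [('Ada Lovelace', 'born_in', 'London')]
```

Unlike open information extraction, the predicate here is always one of a fixed set of labels.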

The area of relationship extraction is also very broad and is sometimes related to ontology extraction or ontology building. Although many different relationships may appear in text, only a few tagged datasets exist, and they contain only a few basic relationships. Apart from general relationship extraction, the task has become popular mostly because of biological relationship extraction (interactions between genes and proteins).

An example of a successful relationship extraction system that uses neural networks was presented at EMNLP 2017 and is accessible in a [public source code repository](https://github.com/UKPLab/emnlp2017-relation-extraction).

## Other approaches

Apart from fully supervised or unsupervised approaches, there exist semi-supervised approaches and manual knowledge base engineering. Some knowledge bases have a predefined schema with rules that follow Semantic Web principles. This enables graph-based data representation within Linked Open Data, easy interconnection of databases, and structured querying using SPARQL.
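To make the graph-based representation concrete, the sketch below stores a few (subject, predicate, object) triples in memory and matches a single triple pattern against them, which is roughly what one line of a SPARQL `WHERE` clause does. The `ex:` identifiers and the `match` helper are made up for illustration; a real setup would use an RDF store with a SPARQL engine.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mini knowledge graph; the "ex:" names are invented for this sketch.
GRAPH: Set[Triple] = {
    ("ex:Ljubljana", "ex:capitalOf", "ex:Slovenia"),
    ("ex:Ljubljana", "rdf:type", "ex:City"),
    ("ex:Maribor", "rdf:type", "ex:City"),
    ("ex:Slovenia", "rdf:type", "ex:Country"),
}


def match(
    graph: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None plays the role of a variable."""
    for triple in graph:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly: SELECT ?city WHERE { ?city rdf:type ex:City }
cities = sorted(subject for subject, _, _ in match(GRAPH, p="rdf:type", o="ex:City"))
print(cities)  # ['ex:Ljubljana', 'ex:Maribor']
```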

## Examples

Currently, ***COMMON-SENSE REASONING KNOWLEDGE BASES*** are popular ;). This is just another marketing term, similar to ***AI*** :). 

One of the recent knowledge bases, developed by the Allen Institute for AI, is called ATOMIC. It contains 877,000 textual descriptions of inferential knowledge covering over 300,000 events. The descriptions are given as IF-THEN rules for 9 types of relationships (see the figure below). 

![](ATOMIC.png)
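A small sketch of how such IF-THEN rules could be stored and queried. The nine relation names follow the ATOMIC paper; the three example rules, the tuple layout and the helper functions are written for this sketch and are not copied from the dataset or its tooling.

```python
from typing import List, Tuple

# The nine ATOMIC relation types: "x" relations concern the agent (PersonX),
# "o" relations concern the other participants.
RELATIONS = (
    "xIntent", "xNeed", "xAttr", "xEffect", "xReact", "xWant",
    "oEffect", "oReact", "oWant",
)

Rule = Tuple[str, str, str]  # (event, relation, inference)

# Illustrative entries written for this sketch, not taken from the dataset.
RULES: List[Rule] = [
    ("PersonX pays PersonY a compliment", "xIntent", "to be nice"),
    ("PersonX pays PersonY a compliment", "oReact", "flattered"),
    ("PersonX pays PersonY a compliment", "oWant", "to return the compliment"),
]


def inferences(event: str, relation: str) -> List[str]:
    """Return all stored inferences for an event along one relation type."""
    if relation not in RELATIONS:
        raise ValueError(f"unknown relation type: {relation}")
    return [inf for ev, rel, inf in RULES if ev == event and rel == relation]


def as_if_then(rule: Rule) -> str:
    """Render a rule in a simple IF-THEN form."""
    event, relation, inference = rule
    return f"IF {event} THEN {relation}: {inference}"


print(inferences("PersonX pays PersonY a compliment", "oReact"))  # ['flattered']
print(as_if_then(RULES[0]))
```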

On top of ATOMIC, a "reasoning" tool called COMeT (Bosselut et al., 2019) was developed using the GPT language model. It is available online via the [Mosaic web application](https://mosaickg.apps.allenai.org/comet_atomic).

Other examples of knowledge bases: 
* ConceptNet
* WebChild
* Wikidata
* WordNet
* Roget
* VerbNet
* FrameNet
* Visual Genome
* ImageNet


## Exercises

Think of possible uses of a knowledge base and search for existing applications of them. 