# WebNLG Corpus Documentation
This notebook explains the structure of the WebNLG corpus and reports some descriptive corpus statistics. Everything is illustrated using the `webnlg_corpus` PyPI package.

## Installing Corpus Access Library
The `webnlg_corpus` PyPI package provides an API for working with the WebNLG XML files: https://gitlab.com/shimorina/webnlg-dataset

In [15]:
!pip install webnlg_corpus==0.1.dev10

You are using pip version 19.0.3, however version 19.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [4]:
!pip show webnlg_corpus

Name: webnlg-corpus
Version: 0.1.dev10
Summary: WebNLG Corpus
Home-page: https://github.com/abevieiramota/webnlg_corpus
Author: Abelardo Vieira Mota
Author-email: abevieiramota@gmail.com
License: CC BY-NC-SA 4.0
Location: /home/anastasia/.local/lib/python3.6/site-packages
Requires: tinydb, pandas
Required-by: 


## Downloading Corpus Releases

### Currently available releases

In [1]:
from webnlg_corpus import config

config.RELEASES_URLS.keys()

dict_keys(['webnlg_challenge_2017', 'release_v2', 'release_v2_constrained'])

Currently there are three releases available. The release history is described [here](https://gitlab.com/shimorina/webnlg-dataset).

### Downloading the release used in the WebNLG challenge

In [2]:
from webnlg_corpus import downloader

# force=True overwrites any previously downloaded copy of the release
downloader.download('webnlg_challenge_2017', force=True)

In [3]:
# download the latest version (v2)
downloader.download('release_v2', force=True)

In [4]:
downloader.default_download_dir()

'/home/anastasia/webnlg_data'

## Explaining WebNLG Structure

### Loading the corpus into memory

In [5]:
from webnlg_corpus import webnlg

corpus = webnlg.load('release_v2')

### Looking at some examples

You can select a corpus instance (called _entry_) with `corpus.sample`.

In [13]:
# fix the random seed at 1 to get the same corpus instance on every run
# sample from the training data only; to sample from the other parts, use ['dev', 'test']
corpus.sample(seed=1, datasets=['train'])

# TODO: need to add odf to the sample method as well; triple info should be entry info

Triple info: Category=City eid=Id80 idx=train_City_4_Id80

	Modified Triples:

Antioch,_California | populationTotal | 102372
Antioch,_California | UTCOffset | "-8"
Antioch,_California | areaCode | 925
Antioch,_California | areaTotal | 75.324 (square kilometres)


	Lexicalizations:

Antioch in California has a population of 102,372 and covers an area of 75.324 square kilometres. The area code for Antioch is 925 and the UTC offset is -8.


The population of Antioch, California is 102372 and the UTC offset is -8. Antioch has the area code 925 and its total area is 75.324 square km.


### Understanding an entry

WebNLG consists of Data/Text pairs where the data is a set of RDF triples extracted from DBpedia and the text is a verbalisation of these triples in English.

Each entry contains:

* **Modified Triples**: the set of RDF triples to be verbalised. They are a modified version of the triples extracted from DBpedia; see http://webnlg.loria.fr/pages/docs.html for details.

* **Original Triples**: todo

* **Category**: a DBpedia type of entity (e.g., [Astronaut](http://dbpedia.org/ontology/Astronaut)), a.k.a. the main topic of the entry input data

* **Lexicalisations**: texts in English, verbalising the triples

* **eid**: an entry identifier in the XML file

* **idx**: an entry identifier in the release. Its structure is: {dataset part}\_{category}\_{number of triples}\_{eid}
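
The `idx` layout described above can be unpacked with plain string operations. The following is a minimal sketch, not part of the `webnlg_corpus` API: the names `EntryIdx` and `parse_idx` are hypothetical, and the code assumes that the dataset part contains no underscore and that the last two fields are always the number of triples and the `eid`.

```python
from typing import NamedTuple


class EntryIdx(NamedTuple):
    """Hypothetical container for the four parts of an entry idx."""
    dataset: str
    category: str
    ntriples: int
    eid: str


def parse_idx(idx: str) -> EntryIdx:
    # The dataset part comes first; the last two fields are the
    # number of triples and the eid. Whatever is left is the category.
    dataset, rest = idx.split("_", 1)
    category, ntriples, eid = rest.rsplit("_", 2)
    return EntryIdx(dataset, category, int(ntriples), eid)


print(parse_idx("train_City_4_Id80"))
# EntryIdx(dataset='train', category='City', ntriples=4, eid='Id80')
```

Splitting from both ends keeps the sketch working even if a category name were to contain an underscore.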

Each RDF triple has a (_subject, property, object_) structure.

For example, in the triple `Antioch,_California | populationTotal | 102372`:
* `Antioch,_California` is the subject,
* `populationTotal` is the property,
* `102372` is the object.
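
As an illustration of this structure, a triple printed in the `subject | property | object` form can be split into its three parts. This is a minimal sketch that only assumes the ` | ` separator seen in the sample output; `Triple` and `parse_triple` are hypothetical names, not library functions, and the field name `predicate` is borrowed from the column name used in the statistics cells of this notebook.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for one RDF triple."""
    subject: str
    predicate: str
    object: str


def parse_triple(line: str) -> Triple:
    # Split on the ' | ' separator used in the printed triples.
    parts = [part.strip() for part in line.split(" | ")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return Triple(*parts)


triple = parse_triple("Antioch,_California | populationTotal | 102372")
print(triple.subject)    # Antioch,_California
print(triple.predicate)  # populationTotal
print(triple.object)     # 102372
```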

The number of triples in the input data can vary from 1 to 7.

In [14]:
# sample only from entries with exactly 7 triples
corpus.sample(seed=1, ntriples=[7], datasets=['train'])

Triple info: Category=Astronaut eid=Id9 idx=train_Astronaut_7_Id9

	Modified Triples:

Alan_Bean | nationality | United_States
Alan_Bean | occupation | Test_pilot
Alan_Bean | birthDate | "1932-03-15"
Alan_Bean | was a crew member of | Apollo_12
Alan_Bean | birthPlace | Wheeler,_Texas
Alan_Bean | timeInSpace | "100305.0"(minutes)
Alan_Bean | status | "Retired"


	Lexicalizations:

Alan Bean is a US national who was born in Wheeler, Texas on the 15th March 1932. He was a test pilot and crew member of Apollo 12. He spent 100305 minutes in space.


Alan Bean is a United States test pilot. He was born on March 15, 1932 in Wheeler, Texas. Alan Bean retired after spending 100305 minutes in space, he was also a crew member of Apollo 12.


WebNLG covers 15 DBpedia categories:

In [None]:
# todo: add method to show all DBpedia categories covered (Monument, Food, etc...)

In each category, WebNLG covers 5 entities (Astronaut, Monument, University) or 20 entities (all other categories).

In [None]:
# todo: show entities (roots of trees)

## Calculating Some Corpus Statistics

In [24]:
import pandas as pd

train_dev = corpus.subset(datasets=['train', 'dev'])
test = corpus.subset(datasets=['test'])

In [29]:
# todo: would be nice to add all corpus (train+dev+test) to the calculations below as well

In [41]:
# number of unique triples

{
    'train_dev': train_dev.as_pandas.mdf.text.nunique(),
    'test': test.as_pandas.mdf.text.nunique()
}

{'train_dev': 3776, 'test': 2146}

In [42]:
# todo: we also need a number of unique data inputs (i.e. unique sets of triples)
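
One possible way to count unique data inputs is to group the triples by entry and compare the resulting sets. The sketch below uses plain Python on toy data; `count_unique_inputs` is a hypothetical helper. With the real corpus, the rows might be obtained with something like `zip(mdf.idx, mdf.text)`, but the presence of an `idx` column in the modified-triples frame is an assumption that is not confirmed above.

```python
from collections import defaultdict


def count_unique_inputs(rows):
    """Count distinct sets of triples, given (entry idx, triple text) rows."""
    triples_by_entry = defaultdict(set)
    for idx, triple_text in rows:
        triples_by_entry[idx].add(triple_text)
    # frozenset is hashable, so identical triple sets collapse into one.
    return len({frozenset(triples) for triples in triples_by_entry.values()})


# Toy rows (not real corpus data): the first two entries share the same
# two triples in a different order, the third has only one of them.
rows = [
    ("train_City_2_Id1", "A | p | B"),
    ("train_City_2_Id1", "A | q | C"),
    ("train_City_2_Id2", "A | q | C"),
    ("train_City_2_Id2", "A | p | B"),
    ("train_City_1_Id3", "A | p | B"),
]
print(count_unique_inputs(rows))  # 2
```

Using sets makes the comparison independent of the order in which triples are listed within an entry.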

In [43]:
# number of lexicalizations

{
    'train_dev': len(train_dev.as_pandas.ldf),
    'test': len(test.as_pandas.ldf)
}

{'train_dev': 38668, 'test': 4224}

In [44]:
# number of unique properties

{
    'train_dev': train_dev.as_pandas.mdf.predicate.nunique(),
    'test': test.as_pandas.mdf.predicate.nunique()
}

{'train_dev': 373, 'test': 291}

In [45]:
# number of unique subjects

{
    'train_dev': train_dev.as_pandas.mdf.subject.nunique(),
    'test': test.as_pandas.mdf.subject.nunique()
}

{'train_dev': 731, 'test': 558}

In [40]:
# number of unique objects

{
    'train_dev': train_dev.as_pandas.mdf.object.nunique(),
    'test': test.as_pandas.mdf.object.nunique()
}

{'train_dev': 2909, 'test': 1745}

### Per category

In [None]:
# distribution of above per category

In [31]:
# number of unique triples

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').text.nunique(),
    'test': test.as_pandas.mdf.groupby('category').text.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

category,test,train_dev
Airport,197,376
Artist,209,345
Astronaut,71,90
Athlete,177,352
Building,190,295
CelestialBody,104,211
City,161,302
ComicsCharacter,46,120
Food,225,340
MeanOfTransportation,195,372


In [32]:
# number of lexicalizations

pd.concat({
    'train_dev': train_dev.as_pandas.ldf.category.value_counts(),
    'test': test.as_pandas.ldf.category.value_counts()
}, axis=1, sort=True, copy=False).fillna(0)

category,test,train_dev
Airport,351,3157
Artist,429,3832
Astronaut,174,1718
Athlete,304,2773
Building,285,2677
CelestialBody,209,1908
City,367,3392
ComicsCharacter,88,849
Food,463,4075
MeanOfTransportation,387,3492


In [33]:
# number of unique properties

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').predicate.nunique(),
    'test': test.as_pandas.mdf.groupby('category').predicate.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

category,test,train_dev
Airport,34,52
Artist,26,37
Astronaut,27,38
Athlete,28,42
Building,38,46
CelestialBody,23,25
City,18,23
ComicsCharacter,16,19
Food,22,34
MeanOfTransportation,62,75


In [34]:
# number of unique subjects

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').subject.nunique(),
    'test': test.as_pandas.mdf.groupby('category').subject.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

category,test,train_dev
Airport,54,69
Artist,47,58
Astronaut,16,18
Athlete,57,89
Building,53,59
CelestialBody,20,23
City,46,66
ComicsCharacter,21,44
Food,51,60
MeanOfTransportation,48,71


In [35]:
# number of unique objects

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').object.nunique(),
    'test': test.as_pandas.mdf.groupby('category').object.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

category,test,train_dev
Airport,178,305
Artist,165,255
Astronaut,56,71
Athlete,164,305
Building,172,258
CelestialBody,94,193
City,112,206
ComicsCharacter,44,97
Food,187,276
MeanOfTransportation,172,304


# TODO

6. Which DBpedia entities are described, and how many are there per category (info from here)? By "described", do you mean appearing as the subject of a triple? So "Brazil" is described in `<Brazil, is, country>`, but not in `<Abelardo, lives_in, Brazil>`?

   Answer: those entities are not only subjects but, more importantly, roots of the trees in WebNLG.

Then explain and show that each input is a tree.

1. Number of tree types (sibling, mixed, chain), with examples. Is this the `shape_type` attribute added in the v2 release?

   Answer: yes, it is that attribute. Please use the final release for your statistics. Then you can make another notebook for the challenge data.

2. Number of tree shapes. Is this the `shape` attribute?

   Answer: yes, in the v2 release.
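
The tree view of an input discussed above can be sketched in a few lines. This is an illustration, not the corpus's own computation of `shape_type`. It assumes the following reading of the labels: "sibling" when triples share a subject, "chain" when the object of one triple is the subject of another, and "mixed" when both occur. Returning "NA" for single-triple inputs is also an assumption. `find_roots` and `shape_type` are hypothetical helper names.

```python
from collections import Counter


def find_roots(triples):
    """Return subjects that never occur as an object (candidate tree roots)."""
    objects = {obj for _, _, obj in triples}
    roots = []
    for subj, _, _ in triples:
        if subj not in objects and subj not in roots:
            roots.append(subj)
    return roots


def shape_type(triples):
    """Classify a list of (subject, property, object) triples."""
    if len(triples) < 2:
        return "NA"  # assumption: a single triple has no shape type
    subject_counts = Counter(subj for subj, _, _ in triples)
    has_sibling = any(n > 1 for n in subject_counts.values())
    has_chain = any(obj in subject_counts for _, _, obj in triples)
    if has_sibling and has_chain:
        return "mixed"
    return "chain" if has_chain else "sibling"


sibling = [("Alan_Bean", "nationality", "United_States"),
           ("Alan_Bean", "occupation", "Test_pilot")]
chain = [("A", "p", "B"), ("B", "q", "C")]  # toy triples
print(find_roots(sibling), shape_type(sibling))  # ['Alan_Bean'] sibling
print(find_roots(chain), shape_type(chain))      # ['A'] chain
```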



Vocabulary

1. Number of tokens
2. Number of types
3. Distribution of 1. and 2. per size.
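
For the vocabulary counts, a rough starting point could look like the sketch below. It assumes a deliberately crude tokenisation (lowercasing and splitting on runs of word characters), so numbers such as "102,372" are split in two; a proper tokeniser would give different figures. `vocabulary_stats` is a hypothetical helper, and the input texts are toy sentences rather than corpus data.

```python
import re
from collections import Counter


def vocabulary_stats(texts):
    """Return (number of tokens, number of types) for an iterable of texts."""
    counts = Counter()
    for text in texts:
        # Crude tokenisation: lowercase, then take runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return sum(counts.values()), len(counts)


n_tokens, n_types = vocabulary_stats(["The cat sat.", "the dog sat"])
print(n_tokens, n_types)  # 6 4
```

Applied to the lexicalisations of entries grouped by number of triples, the same function would give the per-size distribution mentioned in item 3.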

In [None]:
# todo: explain hierarchical structure