# WebNLG Corpus Documentation
This notebook explains the structure of the WebNLG corpus and reports some descriptive corpus statistics. Everything is illustrated with the `webnlg_corpus` PyPI package.

## Installing Corpus Access Library
The `webnlg_corpus` PyPI package provides an API for working with the corpus XML files: https://gitlab.com/shimorina/webnlg-dataset

In [15]:
!pip install webnlg_corpus==0.1.dev10



In [4]:
!pip show webnlg_corpus

Name: webnlg-corpus
Version: 0.1.dev10
Summary: WebNLG Corpus
Home-page: https://github.com/abevieiramota/webnlg_corpus
Author: Abelardo Vieira Mota
Author-email: abevieiramota@gmail.com
License: CC BY-NC-SA 4.0
Location: /home/anastasia/.local/lib/python3.6/site-packages
Requires: tinydb, pandas
Required-by: 


## Downloading Corpus Releases

### Currently available releases

In [5]:
from webnlg_corpus import config

config.RELEASES_URLS.keys()

dict_keys(['webnlg_challenge_2017', 'release_v2', 'release_v2_constrained'])

Currently there are three releases available. The release history is described [here](https://gitlab.com/shimorina/webnlg-dataset).

### Downloading the release used in the WebNLG challenge

In [6]:
from webnlg_corpus import downloader

# force=True overwrites any previously downloaded copy of the release
downloader.download('webnlg_challenge_2017', force=True)

In [None]:
# download v2
downloader.download('release_v2', force=True)

In [7]:
downloader.default_download_dir()

'/home/anastasia/webnlg_data'

## Explaining WebNLG

### Loading the corpus into memory

In [8]:
from webnlg_corpus import webnlg

corpus = webnlg.load('release_v2')

### Looking at some examples

You can select a corpus instance (called an _entry_) with `corpus.sample`.

In [15]:
# set random seed to 1 to get the same corpus instance across different runs
# look only in the training data
corpus.sample(seed=1, datasets=['train'])

# TODO: need to add odf to the sample method as well; triple info -> entry info

Triple info: Category=City eid=Id80 idx=train_City_4_Id80

	Modified Triples:

Antioch,_California | populationTotal | 102372
Antioch,_California | UTCOffset | "-8"
Antioch,_California | areaCode | 925
Antioch,_California | areaTotal | 75.324 (square kilometres)


	Lexicalizations:

Antioch in California has a population of 102,372 and covers an area of 75.324 square kilometres. The area code for Antioch is 925 and the UTC offset is -8.


The population of Antioch, California is 102372 and the UTC offset is -8. Antioch has the area code 925 and its total area is 75.324 square km.


### Understanding an entry

WebNLG consists of Data/Text pairs where the data is a set of RDF triples extracted from DBpedia and the text is a verbalisation of these triples.
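As a minimal sketch of the data side of such a pair, the snippet below splits one triple, written in the `subject | predicate | object` form shown in the sample output above, into its three parts. The `Triple` class and the `parse_triple` helper are hypothetical names introduced for illustration; they are not part of `webnlg_corpus`.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for one RDF triple."""
    subject: str
    predicate: str
    obj: str


def parse_triple(text: str) -> Triple:
    """Split a 'subject | predicate | object' string into its three parts."""
    # Assumption: the separator ' | ' does not occur inside the subject or predicate.
    subject, predicate, obj = (part.strip() for part in text.split(" | ", 2))
    return Triple(subject, predicate, obj)


triple = parse_triple("Antioch,_California | populationTotal | 102372")
print(triple.subject, triple.predicate, triple.obj)
```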

An entry in WebNLG contains the following information:

* **Original Triples**: todo

* **Modified Triples**: the entry input data, a set of lexicalized triples to be verbalized. They are a modified version of the triples extracted from DBpedia. More details: http://webnlg.loria.fr/pages/docs.html

* **Category**: the main topic of the entry input data (Modified Triples)

* **Lexicalizations**: the entry reference texts

* **eid**: the entry identifier in the XML file

* **idx**: the entry identifier in the release. Its structure is: {dataset part}\_{category}\_{number of triples}\_{eid}
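The `idx` structure can be unpacked with plain string operations. The sketch below is only an illustration: `parse_idx` is a hypothetical helper, not a function of `webnlg_corpus`, and it assumes that neither the dataset part nor the `eid` contains an underscore.

```python
def parse_idx(idx: str) -> dict:
    """Split an idx such as 'train_City_4_Id80' into its four components."""
    # The dataset part comes first; the number of triples and the eid come last.
    dataset, rest = idx.split("_", 1)
    category, ntriples, eid = rest.rsplit("_", 2)
    return {
        "dataset": dataset,
        "category": category,
        "ntriples": int(ntriples),
        "eid": eid,
    }


print(parse_idx("train_City_4_Id80"))
```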

Entries contain between 1 and 7 triples.

In [10]:
corpus.sample(seed=1, ntriples=[7], datasets=['train'])

Triple info: Category=University eid=Id35 idx=train_University_7_Id35

	Modified Triples:

Karnataka | has to its northeast | Telangana
Visvesvaraya_Technological_University | city | Belgaum
Acharya_Institute_of_Technology | sportsOffered | Tennis
Tennis | sportsGoverningBody | International_Tennis_Federation
Karnataka | has to its west | Arabian_Sea
Acharya_Institute_of_Technology | state | Karnataka
Acharya_Institute_of_Technology | affiliation | Visvesvaraya_Technological_University


	Lexicalizations:

Karnataka is located southwest of Telangana and has the Arabian Sea to the west. The state is the location of the Acharya Institute of Technology which is affiliated to the Visvesvaraya Technological University in the city of Belgaum. The Institute offers the sport of tennis which has the International tennis Federation as it's governing body.


The Acharya Institute of Technology is located in Karnatka and it is affiliated to the Visvesvaraya Technological University in Belgaum. Its

## Calculating Some Corpus Statistics

In [10]:
import pandas as pd

train_dev = corpus.subset(datasets=['train', 'dev'])
test = corpus.subset(datasets=['test'])

In [11]:
# number of unique triples

{
    'train_dev': train_dev.as_pandas.mdf.text.nunique(),
    'test': test.as_pandas.mdf.text.nunique()
}

{'train_dev': 2131, 'test': 2331}

In [12]:
# number of lexicalizations

{
    'train_dev': len(train_dev.as_pandas.ldf),
    'test': len(test.as_pandas.ldf)
}

{'train_dev': 20370, 'test': 7361}

In [13]:
# number of unique properties

{
    'train_dev': train_dev.as_pandas.mdf.predicate.nunique(),
    'test': test.as_pandas.mdf.predicate.nunique()
}

{'train_dev': 246, 'test': 300}

In [14]:
# number of unique subjects

{
    'train_dev': train_dev.as_pandas.mdf.subject.nunique(),
    'test': test.as_pandas.mdf.subject.nunique()
}

{'train_dev': 434, 'test': 575}

In [15]:
# number of unique objects

{
    'train_dev': train_dev.as_pandas.mdf.object.nunique(),
    'test': test.as_pandas.mdf.object.nunique()
}

{'train_dev': 1642, 'test': 1888}
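To make clear what these `nunique()` counts measure, here is a small stdlib-only sketch that computes the same four quantities on a toy list of triples taken from the two sample entries shown earlier. The toy list, including the deliberately repeated triple, is an assumption made for illustration and says nothing about the real corpus counts.

```python
# Toy data: (subject, predicate, object) tuples; one triple is repeated on purpose.
triples = [
    ("Antioch,_California", "populationTotal", "102372"),
    ("Antioch,_California", "areaCode", "925"),
    ("Acharya_Institute_of_Technology", "state", "Karnataka"),
    ("Acharya_Institute_of_Technology", "state", "Karnataka"),
    ("Karnataka", "has to its west", "Arabian_Sea"),
]

# Each statistic is the size of a set, so repeated values are counted once.
stats = {
    "triples": len(set(triples)),
    "predicates": len({predicate for _, predicate, _ in triples}),
    "subjects": len({subject for subject, _, _ in triples}),
    "objects": len({obj for _, _, obj in triples}),
}
print(stats)
```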

### Per category

In [16]:
# number of unique triples

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').text.nunique(),
    'test': test.as_pandas.mdf.groupby('category').text.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 196 | 376.0 |
| Artist | 252 | 0.0 |
| Astronaut | 71 | 90.0 |
| Athlete | 199 | 0.0 |
| Building | 190 | 295.0 |
| CelestialBody | 132 | 0.0 |
| City | 168 | 274.0 |
| ComicsCharacter | 46 | 120.0 |
| Food | 230 | 340.0 |
| MeanOfTransportation | 229 | 0.0 |
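The per-category tables follow one pattern: group rows by category, count distinct values in each group, then fill in 0 for categories that are missing from one split. A stdlib-only sketch of that pattern is given below. The `unique_per_category` helper and the toy rows (including the made-up `Artist` triple) are hypothetical and only stand in for the pandas `groupby(...).nunique()` and `fillna(0)` calls used in the notebook.

```python
from collections import defaultdict


def unique_per_category(rows):
    """Count distinct triple texts per category from (category, text) rows."""
    seen = defaultdict(set)
    for category, triple_text in rows:
        seen[category].add(triple_text)
    return {category: len(texts) for category, texts in seen.items()}


# Toy rows: the repeated City triple is counted only once.
train_dev_rows = [
    ("City", "Antioch,_California | areaCode | 925"),
    ("City", "Antioch,_California | areaCode | 925"),
    ("City", "Antioch,_California | populationTotal | 102372"),
]
test_rows = [
    ("City", "Antioch,_California | areaCode | 925"),
    ("Artist", "Hypothetical_Artist | genre | Hypothetical_Genre"),  # made-up triple
]

train_dev_counts = unique_per_category(train_dev_rows)
test_counts = unique_per_category(test_rows)

# A category absent from one split gets 0, mirroring fillna(0).
for category in sorted(set(train_dev_counts) | set(test_counts)):
    print(category, test_counts.get(category, 0), train_dev_counts.get(category, 0))
```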


In [17]:
# number of lexicalizations

pd.concat({
    'train_dev': train_dev.as_pandas.ldf.category.value_counts(),
    'test': test.as_pandas.ldf.category.value_counts()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 349 | 3174.0 |
| Artist | 1198 | 0.0 |
| Astronaut | 174 | 1718.0 |
| Athlete | 856 | 0.0 |
| Building | 285 | 2677.0 |
| CelestialBody | 598 | 0.0 |
| City | 368 | 681.0 |
| ComicsCharacter | 88 | 849.0 |
| Food | 463 | 4107.0 |
| MeanOfTransportation | 1096 | 0.0 |


In [18]:
# number of unique properties

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').predicate.nunique(),
    'test': test.as_pandas.mdf.groupby('category').predicate.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 34 | 52.0 |
| Artist | 28 | 0.0 |
| Astronaut | 27 | 38.0 |
| Athlete | 30 | 0.0 |
| Building | 38 | 46.0 |
| CelestialBody | 25 | 0.0 |
| City | 19 | 23.0 |
| ComicsCharacter | 16 | 19.0 |
| Food | 20 | 34.0 |
| MeanOfTransportation | 66 | 0.0 |


In [19]:
# number of unique subjects

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').subject.nunique(),
    'test': test.as_pandas.mdf.groupby('category').subject.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 54 | 69.0 |
| Artist | 50 | 0.0 |
| Astronaut | 16 | 18.0 |
| Athlete | 60 | 0.0 |
| Building | 53 | 59.0 |
| CelestialBody | 21 | 0.0 |
| City | 48 | 66.0 |
| ComicsCharacter | 21 | 44.0 |
| Food | 51 | 60.0 |
| MeanOfTransportation | 51 | 0.0 |


In [20]:
# number of unique objects

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').object.nunique(),
    'test': test.as_pandas.mdf.groupby('category').object.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 178 | 305.0 |
| Artist | 194 | 0.0 |
| Astronaut | 56 | 71.0 |
| Athlete | 183 | 0.0 |
| Building | 172 | 258.0 |
| CelestialBody | 119 | 0.0 |
| City | 116 | 186.0 |
| ComicsCharacter | 44 | 97.0 |
| Food | 192 | 276.0 |
| MeanOfTransportation | 203 | 0.0 |


# TODO

6. Which DBpedia entities are described, and their count per category (info from here). By "described", do you mean appearing as the subject of a triple? So "Brazil" is described in `<Brazil, is, country>`, but not in `<Abelardo, lives_in, Brazil>`?
   Those entities are not only subjects but, more importantly, roots of the trees in WebNLG.
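One way to read "root of the tree" is: an entity that appears as a subject but never as an object within the same entry. The sketch below applies that reading to the seven-triple University entry shown earlier. The `find_roots` helper is hypothetical, and the root definition is an assumption made here rather than something taken from the corpus tooling.

```python
def find_roots(triples):
    """Return entities that occur as a subject but never as an object."""
    subjects = {subject for subject, _, _ in triples}
    objects = {obj for _, _, obj in triples}
    return subjects - objects


# Triples of the sample entry train_University_7_Id35.
university_entry = [
    ("Karnataka", "has to its northeast", "Telangana"),
    ("Visvesvaraya_Technological_University", "city", "Belgaum"),
    ("Acharya_Institute_of_Technology", "sportsOffered", "Tennis"),
    ("Tennis", "sportsGoverningBody", "International_Tennis_Federation"),
    ("Karnataka", "has to its west", "Arabian_Sea"),
    ("Acharya_Institute_of_Technology", "state", "Karnataka"),
    ("Acharya_Institute_of_Technology", "affiliation", "Visvesvaraya_Technological_University"),
]

print(find_roots(university_entry))
```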


Then explain and show that each input is a tree.

1. Number of tree types (sibling, mixed, chain), with examples. Is this the `shape_type` attribute added in the v2 release?
   Yes, it is that attribute. Please use the final release for your statistics. Then you can make another notebook for the challenge data.

2. Number of tree shapes. Is this the `shape` attribute?
   Yes, in the v2 release.
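Until the statistics are computed from the release itself, the three tree types can be illustrated with a rough classifier. It rests on assumed definitions: "sibling" means at least two triples share a subject, "chain" means the object of one triple is the subject of another, and "mixed" means both hold. The `classify_shape` function is a hypothetical sketch and does not reproduce how the `shape_type` attribute was actually assigned.

```python
def classify_shape(triples):
    """Guess the tree type of an entry from its (subject, predicate, object) triples."""
    if len(triples) < 2:
        return "NA"  # assumption: a single triple has no tree type
    subjects = [subject for subject, _, _ in triples]
    objects = {obj for _, _, obj in triples}
    has_sibling = len(set(subjects)) < len(subjects)      # some subject is shared
    has_chain = any(subject in objects for subject in subjects)  # an object is reused as subject
    if has_sibling and has_chain:
        return "mixed"
    if has_chain:
        return "chain"
    if has_sibling:
        return "sibling"
    return "unknown"  # triples that are not connected under these definitions


city_entry = [
    ("Antioch,_California", "populationTotal", "102372"),
    ("Antioch,_California", "areaCode", "925"),
]
chain_entry = [
    ("Acharya_Institute_of_Technology", "sportsOffered", "Tennis"),
    ("Tennis", "sportsGoverningBody", "International_Tennis_Federation"),
]
mixed_entry = chain_entry + [("Acharya_Institute_of_Technology", "state", "Karnataka")]

for entry in (city_entry, chain_entry, mixed_entry):
    print(classify_shape(entry))
```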



Vocabulary

1. Number of tokens
2. Number of types
3. Distribution of 1. and 2. per size
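A possible starting point for these vocabulary counts is sketched below. It assumes lowercased whitespace tokenization and takes "size" to mean the number of triples of an entry; both are assumptions, and a real tokenizer would give different numbers. The helper names `vocabulary_stats` and `stats_by_size` are hypothetical.

```python
from collections import Counter, defaultdict


def vocabulary_stats(texts):
    """Count tokens (all occurrences) and types (distinct tokens) in a list of texts."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {"tokens": sum(counts.values()), "types": len(counts)}


def stats_by_size(entries):
    """Compute the same statistics separately for each entry size.

    `entries` is an iterable of (number_of_triples, lexicalization) pairs.
    """
    by_size = defaultdict(list)
    for ntriples, text in entries:
        by_size[ntriples].append(text)
    return {size: vocabulary_stats(texts) for size, texts in sorted(by_size.items())}


# Toy input; the first text is one of the lexicalizations shown earlier.
entries = [
    (4, "The population of Antioch, California is 102372 and the UTC offset is -8."),
    (1, "Antioch has the area code 925."),
]
print(stats_by_size(entries))
```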