# Installing the corpus access library

In [2]:
!pip install webnlg_corpus==0.1.dev10

Collecting webnlg_corpus==0.1.dev10
  Downloading https://files.pythonhosted.org/packages/ce/95/7c1d49a31be847e93cb39115c7447c9c9f24188a2e4c1aa3285de087862f/webnlg_corpus-0.1.dev10-py3-none-any.whl
Installing collected packages: webnlg-corpus
Successfully installed webnlg-corpus-0.1.dev10


In [3]:
!pip show webnlg_corpus

Name: webnlg-corpus
Version: 0.1.dev10
Summary: WebNLG Corpus
Home-page: https://github.com/abevieiramota/webnlg_corpus
Author: Abelardo Vieira Mota
Author-email: abevieiramota@gmail.com
License: CC BY-NC-SA 4.0
Location: c:\programdata\anaconda3\envs\webnlg\lib\site-packages
Requires: pandas, tinydb
Required-by: 


# Downloading one corpus release

### Currently available releases

In [4]:
from webnlg_corpus import config

config.RELEASES_URLS.keys()

dict_keys(['webnlg_challenge_2017', 'release_v2', 'release_v2_constrained'])

### Downloading the one used in the competition

In [5]:
from webnlg_corpus import downloader

# force=True overwrites a previously downloaded copy of the release
downloader.download('webnlg_challenge_2017', force=True)

In [6]:
downloader.default_download_dir()

'C:\\Users\\1513 MX5_7\\AppData\\Roaming\\webnlg_data'

# Explaining the corpus

### Reading one release

In [7]:
from webnlg_corpus import webnlg

corpus = webnlg.load('webnlg_challenge_2017')

### Looking at some examples

You can select a sample of a dataset with the `.sample` method:

In [8]:
corpus.sample(seed=1, datasets=['train'])

Triple info: Category=Food eid=Id151 idx=train_Food_1_Id151

	Modified Triples:

Bhajji | related | Pakora


	Lexicalizations:

The dish bhajji is related to pakora.


Bhajji is a snack that is similar to Pakora.


bhajji and pakora are related.


### Understanding an entry

An entry in a WebNLG release contains the following information:

* **Modified Triples**: the entry's input data, a set of lexicalized triples that should be verbalized. They are a modification of the triples extracted from DBpedia. More information: http://webnlg.loria.fr/pages/docs.html

* **Category**: the main topic of the entry's input data (Modified Triples)

* **Lexicalizations**: the entry's reference texts

* **eid**: the entry's identifier within the .XML file that contains it

* **idx**: the entry's identifier within the release. Its structure is: `{dataset}_{category}_{number of triples}_{eid}`
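The `idx` structure described above can be taken apart with plain string operations. The following is only a minimal sketch: `parse_idx` and `EntryIndex` are hypothetical names introduced for illustration and are not part of `webnlg_corpus`, and the sketch assumes that neither the dataset name nor the `eid` contains an underscore.

```python
from typing import NamedTuple


class EntryIndex(NamedTuple):
    """Hypothetical container for the four parts of an idx."""
    dataset: str
    category: str
    ntriples: int
    eid: str


def parse_idx(idx: str) -> EntryIndex:
    """Split an idx such as 'train_Food_1_Id151' into its four parts."""
    # The dataset is the first field; the last two fields are the number of
    # triples and the eid. Whatever remains in the middle is the category.
    dataset, rest = idx.split('_', 1)
    category, ntriples, eid = rest.rsplit('_', 2)
    return EntryIndex(dataset, category, int(ntriples), eid)


print(parse_idx('train_Food_1_Id151'))
print(parse_idx('train_Astronaut_7_Id35'))
```

Splitting from both ends keeps the category intact even if a category name were to contain an underscore.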

The number of triples per entry ranges from 1 to 7:

In [9]:
corpus.sample(seed=1, ntriples=[7], datasets=['train'])

Triple info: Category=Astronaut eid=Id35 idx=train_Astronaut_7_Id35

	Modified Triples:

Apollo_12 | backup pilot | Alfred_Worden
Alan_Bean | was a crew member of | Apollo_12
Apollo_12 | operator | NASA
Apollo_12 | commander | David_Scott
Alan_Bean | birthPlace | Wheeler,_Texas
Alan_Bean | status | "Retired"
Alan_Bean | almaMater | "UT Austin, B.S. 1955"


	Lexicalizations:

Alan Bean was born in Wheeler, Texas and graduated from UT Austin in 1955 with a B.S. He was a crew member of the NASA operated Apollo 12. He is now retired. The back up pilot of Apollo 12 was Alfred Worden, and the commander was David Scott.


Alan Bean, who has now retired, was born in Wheeler Texas and graduated from UT Austin in 1955 with a Bachelor of Science degree. He was a member of the NASA operated Apollo 12 along with commander David Scott and backup pilot Alfred Worden.


### Calculating some corpus statistics

In [10]:
import pandas as pd

train_dev = corpus.subset(datasets=['train', 'dev'])
test = corpus.subset(datasets=['test'])

In [11]:
# number of unique triples

{
    'train_dev': train_dev.as_pandas.mdf.text.nunique(),
    'test': test.as_pandas.mdf.text.nunique()
}

{'train_dev': 2131, 'test': 2331}

In [12]:
# number of lexicalizations

{
    'train_dev': len(train_dev.as_pandas.ldf),
    'test': len(test.as_pandas.ldf)
}

{'train_dev': 20370, 'test': 7361}

In [13]:
# number of unique properties

{
    'train_dev': train_dev.as_pandas.mdf.predicate.nunique(),
    'test': test.as_pandas.mdf.predicate.nunique()
}

{'train_dev': 246, 'test': 300}

In [14]:
# number of unique subjects

{
    'train_dev': train_dev.as_pandas.mdf.subject.nunique(),
    'test': test.as_pandas.mdf.subject.nunique()
}

{'train_dev': 434, 'test': 575}

In [15]:
# number of unique objects

{
    'train_dev': train_dev.as_pandas.mdf.object.nunique(),
    'test': test.as_pandas.mdf.object.nunique()
}

{'train_dev': 1642, 'test': 1888}

### Statistics per category

In [16]:
# number of unique triples

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').text.nunique(),
    'test': test.as_pandas.mdf.groupby('category').text.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 196 | 376.0 |
| Artist | 252 | 0.0 |
| Astronaut | 71 | 90.0 |
| Athlete | 199 | 0.0 |
| Building | 190 | 295.0 |
| CelestialBody | 132 | 0.0 |
| City | 168 | 274.0 |
| ComicsCharacter | 46 | 120.0 |
| Food | 230 | 340.0 |
| MeanOfTransportation | 229 | 0.0 |


In [17]:
# number of lexicalizations

pd.concat({
    'train_dev': train_dev.as_pandas.ldf.category.value_counts(),
    'test': test.as_pandas.ldf.category.value_counts()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 349 | 3174.0 |
| Artist | 1198 | 0.0 |
| Astronaut | 174 | 1718.0 |
| Athlete | 856 | 0.0 |
| Building | 285 | 2677.0 |
| CelestialBody | 598 | 0.0 |
| City | 368 | 681.0 |
| ComicsCharacter | 88 | 849.0 |
| Food | 463 | 4107.0 |
| MeanOfTransportation | 1096 | 0.0 |


In [18]:
# number of unique properties

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').predicate.nunique(),
    'test': test.as_pandas.mdf.groupby('category').predicate.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 34 | 52.0 |
| Artist | 28 | 0.0 |
| Astronaut | 27 | 38.0 |
| Athlete | 30 | 0.0 |
| Building | 38 | 46.0 |
| CelestialBody | 25 | 0.0 |
| City | 19 | 23.0 |
| ComicsCharacter | 16 | 19.0 |
| Food | 20 | 34.0 |
| MeanOfTransportation | 66 | 0.0 |


In [19]:
# number of unique subjects

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').subject.nunique(),
    'test': test.as_pandas.mdf.groupby('category').subject.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 54 | 69.0 |
| Artist | 50 | 0.0 |
| Astronaut | 16 | 18.0 |
| Athlete | 60 | 0.0 |
| Building | 53 | 59.0 |
| CelestialBody | 21 | 0.0 |
| City | 48 | 66.0 |
| ComicsCharacter | 21 | 44.0 |
| Food | 51 | 60.0 |
| MeanOfTransportation | 51 | 0.0 |


In [20]:
# number of unique objects

pd.concat({
    'train_dev': train_dev.as_pandas.mdf.groupby('category').object.nunique(),
    'test': test.as_pandas.mdf.groupby('category').object.nunique()
}, axis=1, sort=True, copy=False).fillna(0)

| category | test | train_dev |
|---|---|---|
| Airport | 178 | 305.0 |
| Artist | 194 | 0.0 |
| Astronaut | 56 | 71.0 |
| Athlete | 183 | 0.0 |
| Building | 172 | 258.0 |
| CelestialBody | 119 | 0.0 |
| City | 116 | 186.0 |
| ComicsCharacter | 44 | 97.0 |
| Food | 192 | 276.0 |
| MeanOfTransportation | 203 | 0.0 |


# TODO

6. Which DBpedia entities are described, and their count per category (info from here). Question: does "described" mean that the entity appears as the subject of a triple? So 'Brazil' is described in `<Brazil, is, country>`, but not in `<Abelardo, lives_in, Brazil>`?
   Answer: Those entities are not only subjects but, more importantly, roots of the trees in WebNLG.


Then explain and show that each input is a tree.

1. Number of tree types (sibling, mixed, chain), with examples. Question: is this the attribute `shape_type` added in the v2 release?
   Answer: Yes, it is that attribute. Please use the final release for your statistics. Then you can make another notebook for the challenge data.


2. Number of tree shapes. Question: and is this the attribute `shape`?
   Answer: Yes, in the v2 release.
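
As a starting point for the tree items above, the root of an entry can be sketched from its triples alone. This is only an illustration under a stated assumption: a root is taken to be an entity that occurs as a subject but never as an object within the same entry. `parse_triple` and `find_roots` are hypothetical helpers, not part of `webnlg_corpus`, and the triples are the ones printed for `train_Astronaut_7_Id35`.

```python
def parse_triple(line):
    """Split a 'subject | predicate | object' line into a 3-tuple."""
    # Assumes ' | ' does not occur inside the subject, predicate or object.
    subject, predicate, obj = (part.strip() for part in line.split(' | '))
    return subject, predicate, obj


def find_roots(triples):
    """Return the entities that are subjects but never objects (assumed roots)."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return sorted(subjects - objects)


lines = [
    'Apollo_12 | backup pilot | Alfred_Worden',
    'Alan_Bean | was a crew member of | Apollo_12',
    'Apollo_12 | operator | NASA',
    'Apollo_12 | commander | David_Scott',
    'Alan_Bean | birthPlace | Wheeler,_Texas',
    'Alan_Bean | status | "Retired"',
    'Alan_Bean | almaMater | "UT Austin, B.S. 1955"',
]

triples = [parse_triple(line) for line in lines]
print(find_roots(triples))
```

In this entry `Apollo_12` is both a subject and an object, so under the assumption above it is an inner node rather than a root.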



Vocabulary

1. Number of tokens
2. Number of types
3. Distribution of 1 and 2 per size.
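
A rough sketch of the token and type counts listed above, using only the standard library. The tokenization is an assumption made for illustration (lowercasing and splitting on runs of word characters); a real count would depend on the tokenizer chosen. The texts are the three lexicalizations of `train_Food_1_Id151` shown earlier.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())


lexicalizations = [
    'The dish bhajji is related to pakora.',
    'Bhajji is a snack that is similar to Pakora.',
    'bhajji and pakora are related.',
]

# Frequency of every token across all lexicalizations.
token_counts = Counter(
    token for text in lexicalizations for token in tokenize(text)
)

n_tokens = sum(token_counts.values())  # total number of running words
n_types = len(token_counts)            # number of distinct words

print(n_tokens, n_types)
```

Grouping the same counts by the number of triples in each entry would give the distribution per size.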