In [1]:
from webnlg_corpus import webnlg

corpus = webnlg.load('webnlg_challenge_2017')

train_dev = corpus.subset(datasets=['train', 'dev'])

In [3]:
train_dev.mdf.subject.sample(n=10, random_state=100)

20553                                        United_States
5611                                                 Bakso
6597     Abhandlungen_aus_dem_Mathematischen_Seminar_de...
5834                                           Ayam_penyet
7865                                   Akita_Museum_of_Art
7465                                            Elliot_See
1446                                               Bionico
20653                                        United_States
13581                                        Bandeja_paisa
3455                                     Bianca_Castafiore
Name: subject, dtype: object

So there are subjects (and probably predicates and objects too) that contain *underscores* and *parentheses*.

## How many have underscores or parentheses?

In [4]:
has_underline_parenthesis = train_dev.mdf.subject.str.match(r'.*[_\(\)].*')

train_dev.mdf.shape, has_underline_parenthesis.sum()

((23021, 7), 17505)
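The count above lumps both characters together. A minimal sketch of counting them separately follows. It runs on a small hand-copied list of subjects taken from the samples in this notebook (the list `subjects` is an illustrative stand-in, not the full corpus), so the numbers only describe that list.

```python
import re

# Hand-copied subjects from the notebook's samples (illustrative, not the corpus).
subjects = [
    'United_States',
    'Bakso',
    'Ayam_penyet',
    'Bionico',
    'Auron_(comicsCharacter)',
    'AFC_Ajax_(amateurs)',
]

# Count each kind of character on its own, then either of them.
n_underscore = sum('_' in s for s in subjects)
n_paren = sum(bool(re.search(r'[()]', s)) for s in subjects)
n_either = sum(bool(re.search(r'[_()]', s)) for s in subjects)

print(n_underscore, n_paren, n_either)
```

On the real data the same split could presumably be obtained with two separate `str.contains` masks on `train_dev.mdf.subject`.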

In [5]:
# some samples
train_dev.mdf[has_underline_parenthesis].sample(n=10, random_state=200)

Unnamed: 0,category,dataset,idx,object,predicate,subject,text
3022,Building,train,train_Building_1_Id87,Sri_Lanka,location,Adisham_Hall,Adisham_Hall | location | Sri_Lanka
8163,ComicsCharacter,train,train_ComicsCharacter_3_Id12,Walt_Simonson,creator,Auron_(comicsCharacter),Auron_(comicsCharacter) | creator | Walt_Simonson
10180,WrittenWork,train,train_WrittenWork_3_Id36,"""ACM Trans. Inf. Syst.""",abbreviation,ACM_Transactions_on_Information_Systems,ACM_Transactions_on_Information_Systems | abbr...
184,SportsTeam,dev,dev_SportsTeam_1_Id26,AZ_Alkmaar,club,John_van_den_Brom,John_van_den_Brom | club | AZ_Alkmaar
18380,Food,train,train_Food_5_Id23,"""Tomatoes, red chili, garlic, olive oil""",mainIngredients,Arrabbiata_sauce,"Arrabbiata_sauce | mainIngredients | ""Tomatoes..."
172,SportsTeam,dev,dev_SportsTeam_1_Id14,"""Joden , Godenzonen""",nickname,AFC_Ajax_(amateurs),"AFC_Ajax_(amateurs) | nickname | ""Joden , Gode..."
22142,Astronaut,train,train_Astronaut_7_Id45,Apollo_11,was a crew member of,Buzz_Aldrin,Buzz_Aldrin | was a crew member of | Apollo_11
2168,SportsTeam,dev,dev_SportsTeam_5_Id4,Estádio_Municipal_Coaracy_da_Mata_Fonseca,ground,Agremiação_Sportiva_Arapiraquense,Agremiação_Sportiva_Arapiraquense | ground | E...
1740,Airport,dev,dev_Airport_5_Id10,United_States_invasion_of_Panama,battles,United_States_Air_Force,United_States_Air_Force | battles | United_Sta...
17446,Building,train,train_Building_5_Id3,"""120 million (Australian dollars)""",cost,108_St_Georges_Terrace,"108_St_Georges_Terrace | cost | ""120 million (..."


In [6]:
# and subjects without parentheses or underscores
train_dev.mdf[~has_underline_parenthesis].sample(n=10, random_state=123)

Unnamed: 0,category,dataset,idx,object,predicate,subject,text
12948,Food,train,train_Food_4_Id3,Water,ingredient,Ajoblanco,Ajoblanco | ingredient | Water
20118,University,train,train_University_5_Id44,Telangana,has to its northeast,Karnataka,Karnataka | has to its northeast | Telangana
651,Building,dev,dev_Building_3_Id9,Republic_of_Ireland,country,Dublin,Dublin | country | Republic_of_Ireland
3426,ComicsCharacter,train,train_ComicsCharacter_1_Id12,"""Aurakles""",alternativeName,Aurakles,"Aurakles | alternativeName | ""Aurakles"""
18323,Food,train,train_Food_5_Id12,Indonesia,country,Arem-arem,Arem-arem | country | Indonesia
18734,Food,train,train_Food_5_Id94,Manuel_Valls,leaderName,France,France | leaderName | Manuel_Valls
13850,Food,train,train_Food_4_Id229,Sandesh_(confectionery),dishVariation,Dessert,Dessert | dishVariation | Sandesh_(confectionery)
21743,University,train,train_University_6_Id48,Klaus_Iohannis,leaderName,Romania,Romania | leaderName | Klaus_Iohannis
406,Food,dev,dev_Food_2_Id31,Sweet_potato,mainIngredients,Binignit,Binignit | mainIngredients | Sweet_potato
1019,WrittenWork,dev,dev_WrittenWork_3_Id24,45644811,OCLC_number,Aenir,Aenir | OCLC_number | 45644811


## How are underscores lexicalized?

In [9]:
train_dev.sample(idx='train_University_6_Id48')

Triple info: Category=University eid=Id48 idx=train_University_6_Id48

	Modified Triples:

Romania | ethnicGroup | Germans_of_Romania
Alba_Iulia | isPartOf | Alba_County
Romania | leaderName | Klaus_Iohannis
Romania | capital | Bucharest
1_Decembrie_1918_University | city | Alba_Iulia
1_Decembrie_1918_University | country | Romania


	Lexicalizations:

The 1 Decembrie 1918 University is located in Alba Iulia, Alba County, Romania. Romania's capital is Bucharest, its leader is Klaus Iohannis and its ethnic group is Germans of Romania.


The country of Romania is lead by Klaus Iohannis and the capital city is Bucharest. One of the ethnic groups in the country are the Germans of Romania. The country is the location of the 1 Decembrie 1918 University in Alba Iulia, Alba County.


Romania is governed by Klaus Iohannis and it's capital city is Bucharest. It is known for being home to the ethnic group the Germans of Romania and the 1 Decembrie 1918 University which is located in Alba Iulia in

### So, does an underscore translate to a space?

In [11]:
train_dev.mdf.subject.str.translate(str.maketrans('_', ' ')).sample(n=10, random_state=200)

14814                         John van den Brom
5983                                A.C. Cesena
5443                                 Asam pedas
8786                                      Bakso
21626           Acharya Institute of Technology
5834                                Ayam penyet
8165                                        BBC
3585                             Bacon sandwich
3580                             Bacon sandwich
19954    Accademia di Architettura di Mendrisio
Name: subject, dtype: object

Looks fine: replacing each underscore with a space gives readable names.
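To go one step beyond eyeballing, the hypothesis can be checked against a lexicalization. This is a minimal sketch that uses only the triples and the first lexicalization printed for `train_University_6_Id48`. The names `entities`, `text`, and `missing` are illustrative, and a single entry proves nothing about the rest of the corpus.

```python
# Entities copied from the modified triples of train_University_6_Id48.
entities = [
    'Romania',
    'Germans_of_Romania',
    'Alba_Iulia',
    'Alba_County',
    'Klaus_Iohannis',
    'Bucharest',
    '1_Decembrie_1918_University',
]

# First lexicalization of the same entry, copied verbatim.
text = (
    "The 1 Decembrie 1918 University is located in Alba Iulia, Alba County, "
    "Romania. Romania's capital is Bucharest, its leader is Klaus Iohannis "
    "and its ethnic group is Germans of Romania."
)

# An entity is "missing" if its underscore-to-space form is not a substring.
missing = [e for e in entities if e.replace('_', ' ') not in text]

print(missing)
```

An empty `missing` list means every entity appears in the text once its underscores are replaced by spaces.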

## How are parentheses lexicalized?

In [41]:
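def preprocess(s):
    # Move a parenthesized qualifier to the front, turn underscores into
    # spaces, split camelCase words, and strip surrounding quotes and spaces.
    # regex=True is explicit because newer pandas versions default to
    # literal (non-regex) replacement.
    return s.str.replace(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', regex=True)\
        .str.replace('_', ' ', regex=False)\
        .str.replace(r'([a-z])([A-Z])', r'\g<1> \g<2>', regex=True)\
        .str.strip('" ')

# rows whose subject contains an opening parenthesis
sample = train_dev.mdf[train_dev.mdf.subject.str.contains(r'\(')][['subject', 'object']].copy()
sample['subject_processed'] = preprocess(sample.subject)
sample['object_processed'] = preprocess(sample.object)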

sample.tail()

Unnamed: 0,subject,object,subject_processed,object_processed
22631,Atatürk_Monument_(İzmir),Pietro_Canonica,İzmir Atatürk Monument,Pietro Canonica
22634,Atatürk_Monument_(İzmir),"""1932-07-27""",İzmir Atatürk Monument,1932-07-27
22635,Atatürk_Monument_(İzmir),Turkey,İzmir Atatürk Monument,Turkey
22640,Atatürk_Monument_(İzmir),"""Bronze""",İzmir Atatürk Monument,Bronze
22642,Atatürk_Monument_(İzmir),Turkey,İzmir Atatürk Monument,Turkey
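The four chained steps are easier to follow on single strings. Below is a minimal sketch that mirrors them with the standard `re` module instead of pandas. The function name `preprocess_value` and the `examples` list are illustrative, and the inputs are copied from values shown earlier in this notebook.

```python
import re

def preprocess_value(value: str) -> str:
    """Plain-string version of the notebook's four cleanup steps."""
    # 1. Move a parenthesized qualifier in front of the text before it.
    value = re.sub(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', value)
    # 2. Replace underscores with spaces.
    value = value.replace('_', ' ')
    # 3. Insert a space at lowercase-to-uppercase (camelCase) boundaries.
    value = re.sub(r'([a-z])([A-Z])', r'\g<1> \g<2>', value)
    # 4. Strip surrounding double quotes and spaces.
    return value.strip('" ')

examples = [
    'Atatürk_Monument_(İzmir)',
    'Auron_(comicsCharacter)',
    '"Bronze"',
    '"120 million (Australian dollars)"',
]

results = {e: preprocess_value(e) for e in examples}
for original, processed in results.items():
    print(f'{original!r} -> {processed!r}')
```

The last example hints at a limitation of this heuristic: when a quoted literal contains parentheses, the opening quote is moved into the middle of the string, where `strip` no longer removes it. Whether moving the qualifier to the front matches how annotators actually wrote these entities would still need to be checked against the lexicalizations.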
