In [2]:
import sys

sys.path.insert(0, '../script')

from webnlg import WebNLGCorpus

In [3]:
train_dev = WebNLGCorpus.load(dataset=['train', 'dev'])

In [5]:
train_dev.mdf.m_subject.sample(n=10, random_state=100)

20553                          Amdavad_ni_Gufa
5611     Visvesvaraya_Technological_University
6597                             Bandeja_paisa
5834                Adams_County,_Pennsylvania
7865                                Eric_Flint
7465                              Sweet_potato
1446                                 A.S._Roma
20653                                Arem-arem
13581                                 Apollo_8
3455                            AIDS_(journal)
Name: m_subject, dtype: object

So there are subjects (and probably predicates and objects too) that contain *underscores* and *parentheses*.

In [11]:
train_dev.mdf.shape

(23021, 5)

## How many have underscores or parentheses?

In [10]:
has_underline_parenthesis = train_dev.mdf.m_subject.str.match(r'.*[_\(\)].*')

has_underline_parenthesis.sum()

17505
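To see how each character contributes to a total like this, the character class can be split into one test per character. Below is a minimal sketch using only the standard `re` module on the ten subjects sampled earlier; the helper `count_matching` and the list `subjects` are illustrative names, not part of `webnlg`.

```python
import re

# The ten subjects from the sample printed earlier in this notebook.
subjects = [
    'Amdavad_ni_Gufa',
    'Visvesvaraya_Technological_University',
    'Bandeja_paisa',
    'Adams_County,_Pennsylvania',
    'Eric_Flint',
    'Sweet_potato',
    'A.S._Roma',
    'Arem-arem',
    'Apollo_8',
    'AIDS_(journal)',
]


def count_matching(values, pattern):
    """Count the values in which `pattern` occurs at least once."""
    regex = re.compile(pattern)
    return sum(1 for value in values if regex.search(value))


with_underscore = count_matching(subjects, r'_')
with_parenthesis = count_matching(subjects, r'[()]')
with_either = count_matching(subjects, r'[_()]')

print(with_underscore, with_parenthesis, with_either)
```

On the full frame, the same split could presumably be done with `train_dev.mdf.m_subject.str.contains(r'_')` and `str.contains(r'[()]')`.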

In [22]:
# some samples
train_dev.mdf[has_underline_parenthesis].sample(n=10, random_state=200)

Unnamed: 0,idx,mtext,m_subject,m_predicate,m_object
3118,15_39,"A.D._Isidro_Metapán | ground | ""Metapán, El Sa...",A.D._Isidro_Metapán,ground,"""Metapán, El Salvador"""
8484,26_167,A_Glastonbury_Romance | precededBy | Wolf_Solent,A_Glastonbury_Romance,precededBy,Wolf_Solent
10099,29_203,United_States | leaderName | John_Roberts,United_States,leaderName,John_Roberts
146,0_146,Antwerp_International_Airport | owner | Flemis...,Antwerp_International_Airport,owner,Flemish_Region
18840,48_17,Balder_(comicsCharacter) | creator | Stan_Lee,Balder_(comicsCharacter),creator,Stan_Lee
133,0_133,Angola_International_Airport | 1st_runway_Leng...,Angola_International_Airport,1st_runway_LengthFeet,13123
22134,89_8,United_States_Air_Force | battles | Invasion_o...,United_States_Air_Force,battles,Invasion_of_Grenada
2054,10_133,Antwerp_International_Airport | operatingOrgan...,Antwerp_International_Airport,operatingOrganisation,Flemish_Government
1693,8_1,11th_Mississippi_Infantry_Monument | country |...,11th_Mississippi_Infantry_Monument,country,"""United States"""
18232,46_49,Elliot_See | deathPlace | St._Louis,Elliot_See,deathPlace,St._Louis


In [13]:
# and without parentheses or underscores
train_dev.mdf[~has_underline_parenthesis].sample(n=10, random_state=123)

Unnamed: 0,idx,mtext,m_subject,m_predicate,m_object
12218,35_42,Turkey | leaderName | Ahmet_Davutoğlu,Turkey,leaderName,Ahmet_Davutoğlu
20934,69_32,Bionico | dishVariation | Honey,Bionico,dishVariation,Honey
810,3_168,Dublin | leaderTitle | European_Parliament,Dublin,leaderTitle,European_Parliament
2956,13_268,Singapore | leaderName | Halimah_Yacob,Singapore,leaderName,Halimah_Yacob
16947,43_179,France | leaderName | Manuel_Valls,France,leaderName,Manuel_Valls
17122,43_238,Binignit | mainIngredients | Banana,Binignit,mainIngredients,Banana
13024,36_158,Bionico | course | Dessert,Bionico,course,Dessert
21735,83_15,Italy | language | Italian_language,Italy,language,Italian_language
575,2_31,Bananaman | starring | Graeme_Garden,Bananaman,starring,Graeme_Garden
1247,5_150,Bhajji | related | Pakora,Bhajji,related,Pakora


## How are underscores lexicalized?

In [24]:
train_dev.sample(idx='15_47')

Triple info: {'category': 'SportsTeam', 'eid': 'Id48', 'idx': '15_47', 'ntriples': 2}

	Modified triples:

A.D._Isidro_Metapán | manager | Jorge_Humberto_Rodríguez
Jorge_Humberto_Rodríguez | club | El_Salvador_national_football_team


	Lexicalizations:

Jorge Humberto Rodríguez manages the A.D. Isidro Metapan and plays for the El Salvador national football team.
Jorge Humberto Rodríguez, who is a member of the El Salvador National Football Team, also manages the team.
Once manager of A D Isidro Metapán, Jorge Humberto Rodriguez, plays for the El Salvador national football team.

**Hey, take a look at the second lexicalization! It never names the subject of the first triple: A.D. Isidro Metapán is reduced to '*also manages the team.*'**

### Well, does an underscore translate to a space?

In [21]:
train_dev.mdf.m_subject.str.translate(str.maketrans({'_': ' '})).sample(n=10, random_state=200)

14814                           Ampara Hospital
5983                                  Alan Bean
5443     Accademia di Architettura di Mendrisio
8786                                  A.S. Roma
21626                             United States
5834                 Adams County, Pennsylvania
8165              Acta Palaeontologica Polonica
3585                              United States
3580                             A Severed Wasp
19954                           A Long Long Way
Name: m_subject, dtype: object

Looks fine
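"Looks fine" can be checked a little more strictly against the lexicalizations of `15_47` shown above, where the team name appears as `A.D. Isidro Metapan`, as `A D Isidro Metapán`, or not at all. The sketch below is one possible check, not the notebook's method: it folds accents, case and punctuation before testing whether the subject occurs in each text. The helpers `fold` and `mentions` are hypothetical names.

```python
import re
import unicodedata


def fold(text):
    """Lowercase, strip accents, and turn every non-alphanumeric run into one space."""
    decomposed = unicodedata.normalize('NFKD', text.lower())
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r'[^a-z0-9]+', ' ', no_accents).strip()


def mentions(subject, lexicalization):
    """True if the folded subject occurs in the folded lexicalization."""
    return fold(subject) in fold(lexicalization)


subject = 'A.D._Isidro_Metapán'
lexicalizations = [
    'Jorge Humberto Rodríguez manages the A.D. Isidro Metapan and plays for '
    'the El Salvador national football team.',
    'Jorge Humberto Rodríguez, who is a member of the El Salvador National '
    'Football Team, also manages the team.',
    'Once manager of A D Isidro Metapán, Jorge Humberto Rodriguez, plays for '
    'the El Salvador national football team.',
]

found = [mentions(subject, text) for text in lexicalizations]
print(found)
```

A plain substring test is deliberately crude: it can match across word boundaries, so it is only a quick way to flag lexicalizations worth reading by hand.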

## How are parentheses lexicalized?

In [26]:
sample = train_dev.sample(idx='48_17')
sample

Triple info: {'category': 'ComicsCharacter', 'eid': 'Id18', 'idx': '48_17', 'ntriples': 3}

	Modified triples:

Balder_(comicsCharacter) | creator | Jack_Kirby
Jack_Kirby | nationality | Americans
Balder_(comicsCharacter) | creator | Stan_Lee


	Lexicalizations:

Stan Lee and American Jack Kirby created the comic character Balder.
Stan Lee and the American, Jack Kirby created the comic character of Balder.

So, is the content between parentheses some sort of qualifier of the main content?

Let's remove the parentheses and place their content before the main content

In [30]:
sample.mdf.m_subject.str.replace(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', regex=True)

18838    comicsCharacter Balder_
18839                 Jack_Kirby
18840    comicsCharacter Balder_
Name: m_subject, dtype: object

And then replace the underscores with spaces

In [31]:
sample.mdf.m_subject.str.replace(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', regex=True).str.replace('_', ' ')

18838    comicsCharacter Balder 
18839                 Jack Kirby
18840    comicsCharacter Balder 
Name: m_subject, dtype: object

Maybe trim?

In [33]:
sample.mdf.m_subject.str.replace(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', regex=True).str.replace('_', ' ').str.strip()

18838    comicsCharacter Balder
18839                Jack Kirby
18840    comicsCharacter Balder
Name: m_subject, dtype: object

Now, let's deal with these camelCase-like tokens

In [46]:
sample.mdf.m_subject\
    .str.replace(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', regex=True)\
    .str.replace('_', ' ')\
    .str.replace(r'([a-z])([A-Z])', r'\g<1> \g<2>', regex=True)\
    .str.strip()

18838    comics Character Balder
18839                 Jack Kirby
18840    comics Character Balder
Name: m_subject, dtype: object
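The four steps above (move the parenthesized qualifier to the front, turn underscores into spaces, split camelCase, trim) can be gathered into one plain function. This is a minimal sketch with the standard `re` module that mirrors the chained `str.replace` calls; the name `normalize_entity` is hypothetical and not part of `webnlg`.

```python
import re


def normalize_entity(value):
    """Apply the notebook's four cleaning steps to a single string."""
    # 1. 'Main_(qualifier)' -> 'qualifier Main_'
    text = re.sub(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', value)
    # 2. underscores become spaces
    text = text.replace('_', ' ')
    # 3. split a lowercase letter followed directly by an uppercase one
    text = re.sub(r'([a-z])([A-Z])', r'\g<1> \g<2>', text)
    # 4. drop the leading/trailing whitespace left by the earlier steps
    return text.strip()


for raw in ['Balder_(comicsCharacter)', 'Jack_Kirby', 'AIDS_(journal)', 'leaderName']:
    print(raw, '->', normalize_entity(raw))
```

With a function like this, the whole column could presumably be cleaned with `train_dev.mdf.m_subject.map(normalize_entity)`, and the same call would apply to `m_predicate` and `m_object`.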

# Let's inspect more values

Looking at m_subject, m_predicate and m_object together

In [36]:
train_dev.mdf.sample(n=10, random_state=1)

Unnamed: 0,idx,mtext,m_subject,m_predicate,m_object
3782,16_201,Wizards_at_War | publisher | Harcourt_(publisher),Wizards_at_War,publisher,Harcourt_(publisher)
18306,47_9,Jens_Härtel | club | SV_Babelsberg_03,Jens_Härtel,club,SV_Babelsberg_03
6709,25_46,Spain | leaderName | Felipe_VI_of_Spain,Spain,leaderName,Felipe_VI_of_Spain
11568,33_174,Adolfo_Suárez_Madrid–Barajas_Airport | locatio...,Adolfo_Suárez_Madrid–Barajas_Airport,location,San_Sebastián_de_los_Reyes
11408,33_134,"Greenville,_Wisconsin | isPartOf | Menasha_(to...","Greenville,_Wisconsin",isPartOf,"Menasha_(town),_Wisconsin"
195,0_195,Atlantic_City_International_Airport | operatin...,Atlantic_City_International_Airport,operatingOrganisation,Port_Authority_of_New_York_and_New_Jersey
4501,19_42,School of Business and Social Sciences at the ...,School of Business and Social Sciences at the ...,numberOfStudents,16000
837,3_195,Julia_Morgan | significantBuilding | Riverside...,Julia_Morgan,significantBuilding,Riverside_Art_Museum
13184,36_190,Indonesia | leaderName | Jusuf_Kalla,Indonesia,leaderName,Jusuf_Kalla
7366,25_211,Beef_kway_teow | region | Singapore,Beef_kway_teow,region,Singapore


And what about that comma?!

In [37]:
train_dev.sample(idx='33_134')

Triple info: {'category': 'Airport', 'eid': 'Id135', 'idx': '33_134', 'ntriples': 4}

	Modified triples:

Appleton_International_Airport | location | Greenville,_Wisconsin
Greenville,_Wisconsin | isPartOf | Ellington,_Wisconsin
Greenville,_Wisconsin | isPartOf | Menasha_(town),_Wisconsin
Appleton_International_Airport | cityServed | Appleton,_Wisconsin


	Lexicalizations:

Part of both Ellington, and the town of Menasha, Appleton, Wisconsin is a city which is served by Appleton International Airport.
Greenville, Wisconsin, a part of Ellington and the town of Menasha, is home to the Appleton International Airport. The city of Appleton, Wisconsin, is served by the Appleton International Airport.
Appleton International airport serves the city of Appleton in Greenville, Wisconsin. Both Ellington and Menasha (town) are parts of Greenville.

Well, it looks like it's lexicalized as it is... Let's see other examples

In [44]:
examples = train_dev.mdf[train_dev.mdf.m_subject.str.match('.*,.*')].sample(n=10, random_state=123).idx

train_dev.sample(idx=examples.iloc[0])

Triple info: {'category': 'Airport', 'eid': 'Id90', 'idx': '10_89', 'ntriples': 2}

	Modified triples:

Allama_Iqbal_International_Airport | location | Punjab,_Pakistan
Punjab,_Pakistan | leaderTitle | Provincial_Assembly_of_the_Punjab


	Lexicalizations:

Allama Iqbal International Airport is located in Punjab, Pakistan, which is led by the Provincial Assembly of the Punjab.
Punjab, Pakistan, led by the Provincial Assembly, is the location of Allama Iqbal International Airport.
Allama Iqbal International Airport is located in Punjab, Pakistan which is led by the Provincial Assembly of the Punjab.

Well, it seems it's lexicalized as it is
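If commas are kept as they are, the parenthesis step still interacts with them: in `Menasha_(town),_Wisconsin` the qualifier sits directly before the comma, so moving it to the front leaves a stray space before the comma. The sketch below adds one extra step for that, which is an assumption of this example rather than something the data confirms; `tidy_entity` is a hypothetical name.

```python
import re


def tidy_entity(value):
    """The notebook's cleaning steps plus removal of whitespace before commas."""
    text = re.sub(r'(.*?)\((.*?)\)', r'\g<2> \g<1>', value)  # qualifier first
    text = text.replace('_', ' ')                            # underscores -> spaces
    text = re.sub(r'([a-z])([A-Z])', r'\g<1> \g<2>', text)   # split camelCase
    text = re.sub(r'\s+,', ',', text)                        # assumed: no space before ','
    return text.strip()


for raw in ['Greenville,_Wisconsin', 'Punjab,_Pakistan', 'Menasha_(town),_Wisconsin']:
    print(repr(tidy_entity(raw)))
```

Whether `town Menasha, Wisconsin` is a good target is debatable: the lexicalizations of `33_134` use both "the town of Menasha" and "Menasha (town)".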

More examples

In [None]:
'Balder_(comicsCharacter)'

In [45]:
train_dev.mdf.sample(n=10, random_state=2)

Unnamed: 0,idx,mtext,m_subject,m_predicate,m_object
632,2_88,Roy_Thomas | award | Alley_Award,Roy_Thomas,award,Alley_Award
15587,40_93,"Punjab,_Pakistan | country | Pakistan","Punjab,_Pakistan",country,Pakistan
21855,85_2,United_Kingdom | leaderName | Elizabeth_II,United_Kingdom,leaderName,Elizabeth_II
17853,45_125,Mason_School_of_Business | country | United_St...,Mason_School_of_Business,country,United_States
18834,48_15,"Balder_(comicsCharacter) | alternativeName | ""...",Balder_(comicsCharacter),alternativeName,"""Balder Odinson"""
12195,35_37,14th_New_Jersey_Volunteer_Infantry_Monument | ...,14th_New_Jersey_Volunteer_Infantry_Monument,category,Historic_districts_in_the_United_States
16884,43_158,Bakso | ingredient | Celery,Bakso,ingredient,Celery
13387,36_230,Spain | ethnicGroup | Spaniards,Spain,ethnicGroup,Spaniards
4572,19_52,Acharya_Institute_of_Technology | numberOfPost...,Acharya_Institute_of_Technology,numberOfPostgraduateStudents,700
15455,40_67,United_States_Air_Force | transportAircraft | ...,United_States_Air_Force,transportAircraft,Boeing_C-17_Globemaster_III
