# Extracting GPE Entities from One Book

In [26]:
import pandas as pd
import os

In [27]:
pwd

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python Data Analytics Essentials\\Charles Dickens\\entities\\GPE'

In [28]:
data = pd.read_csv(r"Files\a_christmas_carol.entities.csv", delimiter=';')
data

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,0,1,1,PRON,PER,I
1,0,22,22,PRON,PER,my
2,-1,22,23,NOM,PER,my readers
3,-1,28,28,PRON,PER,themselves
4,-1,31,32,NOM,PER,each other
...,...,...,...,...,...,...
4771,23,37008,37008,PROP,PER,God
4772,-1,37010,37010,PRON,PER,Us
4773,86,37049,37050,PROP,PER,Charles Dickens
4774,26,37055,37055,PROP,PER,Scrooge


Let's start by looking at the rows in the "Location" category (`LOC`).

In [29]:
location = data[data["cat"] == "LOC"]

In [30]:
location

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
33,-1,231,232,NOM,LOC,the Country
294,-1,2172,2173,NOM,LOC,the world
688,-1,5155,5155,NOM,LOC,sea
854,-1,6613,6614,NOM,LOC,the earth
863,-1,6684,6685,NOM,LOC,the world
865,-1,6704,6704,NOM,LOC,earth
942,-1,7277,7278,NOM,LOC,this earth
1082,7,8439,8441,PROP,LOC,the Invisible World
1105,-1,8794,8795,NOM,LOC,the world
1232,-1,10199,10199,PRON,LOC,it


These entries are rather abstract (the world, the earth, sea). Let's try the category GPE (geopolitical entities) instead.

In [31]:
gpe = data[data["cat"] == "GPE"]
gpe

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
336,1,2424,2424,PROP,GPE,Bedlam
426,-1,2990,2990,NOM,GPE,there
537,3,3903,3904,PROP,GPE,Camden Town
574,-1,4165,4166,NOM,GPE,that place
578,-1,4185,4187,NOM,GPE,the City of
579,4,4188,4188,PROP,GPE,London
683,5,5132,5132,PROP,GPE,Sheba
773,4,5960,5960,PROP,GPE,London
900,-1,6939,6940,NOM,GPE,other regions
1108,-1,8833,8834,NOM,GPE,United States


That looks much better, so let's stick with it. Next, let's remove the rows whose `text` value repeats.

In [32]:
df = gpe.drop_duplicates(subset=['text'])

In [33]:
len(df)

27
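
One detail worth keeping in mind: `drop_duplicates` compares strings exactly, so spellings that differ only in case count as different entries. The sketch below uses hypothetical toy rows (not the book data) to show the difference, together with one possible case-insensitive variant.

```python
import pandas as pd

# Hypothetical toy rows; the column name "text" mirrors the entities file.
toy = pd.DataFrame({"text": ["London", "The city", "the city", "London"]})

# Exact comparison: "The city" and "the city" survive as separate rows.
exact = toy.drop_duplicates(subset=["text"])

# Case-insensitive variant: compare on a lowercased key,
# but keep the first original spelling of each entry.
key = toy["text"].str.lower()
relaxed = toy[~key.duplicated()]

print(len(exact), len(relaxed))  # prints: 3 2
```

Whether the stricter or the relaxed comparison is preferable depends on how much noise the later steps are expected to remove.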

In [34]:
df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
336,1,2424,2424,PROP,GPE,Bedlam
426,-1,2990,2990,NOM,GPE,there
537,3,3903,3904,PROP,GPE,Camden Town
574,-1,4165,4166,NOM,GPE,that place
578,-1,4185,4187,NOM,GPE,the City of
579,4,4188,4188,PROP,GPE,London
683,5,5132,5132,PROP,GPE,Sheba
900,-1,6939,6940,NOM,GPE,other regions
1108,-1,8833,8834,NOM,GPE,United States
1231,-1,10189,10190,NOM,GPE,The city


We can see that cities (London, Damascus) and specific places (Camden Town, or Bedlam, a London asylum) are tagged "PROP" in the `prop` column. Rows tagged "NOM" also include real geographic names (such as United States or Great Britain), but they contain a lot of noise as well (The city, the town...). So let's keep only the **PROP** rows, that is, proper nouns.

In [35]:
gpe_name = df[df["prop"] == "PROP"]

In [36]:
gpe_name

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
336,1,2424,2424,PROP,GPE,Bedlam
537,3,3903,3904,PROP,GPE,Camden Town
579,4,4188,4188,PROP,GPE,London
683,5,5132,5132,PROP,GPE,Sheba
1372,9,11229,11229,PROP,GPE,Damascus
1412,10,11431,11431,PROP,GPE,Halloa
1609,26,12735,12735,PROP,GPE,Ebenezer


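Finally, the same place can be mentioned many times under different names (for example London, that city, the town...). The "COREF" column gives the same ID to mentions that refer to the same entity, so dropping duplicate COREF values leaves one row per location. The length of the result is then the number of distinct named locations in the book.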

In [37]:
unique_gpe_name = gpe_name.drop_duplicates(subset=['COREF'])

In [38]:
unique_gpe_name

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
336,1,2424,2424,PROP,GPE,Bedlam
537,3,3903,3904,PROP,GPE,Camden Town
579,4,4188,4188,PROP,GPE,London
683,5,5132,5132,PROP,GPE,Sheba
1372,9,11229,11229,PROP,GPE,Damascus
1412,10,11431,11431,PROP,GPE,Halloa
1609,26,12735,12735,PROP,GPE,Ebenezer
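
A caveat about this step: in the tables above, `-1` appears to mark mentions without a coreference ID (this reading is an assumption, not something the file documents here). `drop_duplicates` would treat `-1` like any other ID and collapse all such rows into one. In the output above every PROP row has a non-negative COREF, so the result is not affected, but the sketch below shows the effect on hypothetical toy rows and one way to guard against it.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the entities file.
toy = pd.DataFrame({
    "COREF": [4, 4, -1, -1, 9],
    "prop":  ["PROP", "PROP", "PROP", "PROP", "PROP"],
    "text":  ["London", "London town", "Paris", "Rome", "Damascus"],
})

# Plain drop_duplicates treats -1 like a real ID, so Paris and Rome
# collapse into one row even though they are different places.
naive = toy.drop_duplicates(subset=["COREF"])

# Guarded variant: deduplicate only rows with a real ID
# and keep the -1 rows untouched.
resolved = toy[toy["COREF"] != -1].drop_duplicates(subset=["COREF"])
unresolved = toy[toy["COREF"] == -1]
careful = pd.concat([resolved, unresolved]).sort_index()

print(list(naive["text"]))    # ['London', 'Paris', 'Damascus']
print(list(careful["text"]))  # ['London', 'Paris', 'Rome', 'Damascus']
```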


In [39]:
mkdir output_files

A subdirectory or file output_files already exists.
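
The message above is harmless, but it appears every time the notebook is re-run. A possible alternative, sketched here with a hypothetical helper name `ensure_output_dir`, is to create the folder from Python with `os.makedirs`, which can be told to ignore an existing directory.

```python
import os

def ensure_output_dir(path):
    """Create `path` if it is missing; do nothing if it already exists."""
    # exist_ok=True suppresses the error for an existing directory.
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_output_dir("output_files")` before `to_csv` would then work on the first run and on every later run.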


In [40]:
unique_gpe_name.to_csv(r"output_files\a_christmas_carol_gpe.csv")

In [41]:
number_of_unique_gpe = len(unique_gpe_name)

In [42]:
number_of_unique_gpe

7

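So the book contains 7 distinct named locations by this count, although a few of them (Halloa, Ebenezer) look like tagging errors rather than places. If we compare the number of characters with the number of locations, the number of characters is much higher!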
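
To make this comparison reproducible, the same kind of count can be wrapped in a small function and applied to both categories. The sketch below is a simplified variant of the steps above: it counts distinct COREF IDs among PROP rows directly and skips the earlier `text` deduplication, so its numbers may differ slightly from the notebook's. The function name `count_unique_named` and the toy rows are hypothetical, and treating `-1` as "no coreference ID" is an assumption.

```python
import pandas as pd

def count_unique_named(entities, category):
    """Count distinct coreference IDs among proper-noun mentions of one category."""
    rows = entities[(entities["cat"] == category) & (entities["prop"] == "PROP")]
    rows = rows[rows["COREF"] != -1]  # assumption: -1 means "no coreference ID"
    return rows["COREF"].nunique()

# Hypothetical toy rows; in the notebook you would pass `data` instead.
toy = pd.DataFrame({
    "COREF": [26, 26, 86, 4, 4, -1],
    "prop":  ["PROP", "PROP", "PROP", "PROP", "PROP", "NOM"],
    "cat":   ["PER", "PER", "PER", "GPE", "GPE", "GPE"],
    "text":  ["Scrooge", "Ebenezer Scrooge", "Charles Dickens",
              "London", "London", "the town"],
})

print(count_unique_named(toy, "PER"), count_unique_named(toy, "GPE"))  # prints: 2 1
```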