# Extracting PERSON Entities in One Book

In [3]:
import pandas as pd
import os

In [4]:
pwd

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python Data Analytics Essentials\\Charles Dickens\\entities\\Person'

In [5]:
data = pd.read_csv(r"Files\a_christmas_carol.entities.csv", delimiter=';')
data

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,0,1,1,PRON,PER,I
1,0,22,22,PRON,PER,my
2,-1,22,23,NOM,PER,my readers
3,-1,28,28,PRON,PER,themselves
4,-1,31,32,NOM,PER,each other
...,...,...,...,...,...,...
4771,23,37008,37008,PROP,PER,God
4772,-1,37010,37010,PRON,PER,Us
4773,86,37049,37050,PROP,PER,Charles Dickens
4774,26,37055,37055,PROP,PER,Scrooge
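The `Unnamed: 0` column in the output above is what pandas shows when a CSV has a header-less first column, typically a saved row index. A minimal sketch, assuming the file starts with such an empty header field: passing `index_col=0` reads that column as the index instead. The two data rows are copied from the output above; the exact header layout of the file is an assumption.

```python
import io
import pandas as pd

# Two rows taken from the output above; the header line layout is assumed.
raw = io.StringIO(
    ";COREF;start_token;end_token;prop;cat;text\n"
    "0;0;1;1;PRON;PER;I\n"
    "1;0;22;22;PRON;PER;my\n"
)

# index_col=0 treats the unnamed first column as the row index.
excerpt = pd.read_csv(raw, sep=";", index_col=0)
print(list(excerpt.columns))
```

With the real file, the same two arguments would be passed to the existing `pd.read_csv` call.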


What we need is to extract the proper nouns of person entities (that is, concrete characters). So let's go to the column "cat" and first keep only the rows in the "PER" category (i.e. Person).

In [10]:
person = data[data["cat"] == "PER"]

In [11]:
person
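A boolean filter on a label that never occurs in the column returns an empty frame without raising an error, so it can help to check the labels first. The sketch below uses a small hypothetical frame (the rows and the `sample` name are illustrative, not taken from the file) to show the check with `value_counts`.

```python
import pandas as pd

# Hypothetical miniature of the entities table; values are illustrative only.
sample = pd.DataFrame({
    "COREF": [0, -1, 18, 26],
    "prop": ["PRON", "NOM", "PROP", "PROP"],
    "cat": ["PER", "PER", "PER", "LOC"],
    "text": ["I", "my readers", "Marley", "London"],
})

# Inspect which category labels actually occur before filtering on one of them.
label_counts = sample["cat"].value_counts()
print(label_counts)

# A label that does not occur yields an empty frame, with no error raised.
empty_result = sample[sample["cat"] == "PERSON"]
print(len(empty_result))

# The label that does occur yields the matching rows.
sample_person = sample[sample["cat"] == "PER"]
print(len(sample_person))
```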


Now let's remove duplicates to make sure that we don't have "I", "Me", "They" repeated many times.

In [23]:
df = person.drop_duplicates(subset=['text'])

In [24]:
len(df)

773
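Note that `drop_duplicates` compares strings exactly, so "me" and "Me" count as different values. If that matters, one option is to deduplicate on a lower-cased key. This is a sketch of that variant on hypothetical data, and an assumption about what may be wanted rather than part of the workflow above.

```python
import pandas as pd

# Hypothetical mentions, illustrative only.
mentions = pd.DataFrame({"text": ["I", "I", "me", "Me", "Scrooge", "Scrooge"]})

# Exact-match deduplication: "me" and "Me" both survive.
exact = mentions.drop_duplicates(subset=["text"])

# Variant: keep the first row for each lower-cased text value.
folded = mentions.loc[~mentions["text"].str.lower().duplicated()]

print(len(exact), len(folded))
```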

And now let's keep only personal names, that is, the rows whose "prop" value is "PROP" (proper noun).

In [25]:
personal_name = df[df["prop"] == "PROP"]

In [26]:
personal_name

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
12,18,81,83,PROP,PER,MARLEY 'S GHOST
13,18,85,85,PROP,PER,Marley
19,26,123,123,PROP,PER,Scrooge
23,20,147,148,PROP,PER,Old Marley
64,21,406,406,PROP,PER,Hamlet
...,...,...,...,...,...,...
4391,85,34703,34703,PROP,PER,Hallo
4439,78,35018,35019,PROP,PER,Joe Miller
4659,64,36332,36333,PROP,PER,No Bob
4764,42,36949,36949,PROP,PER,Spirits


And now, to make sure that we are not counting the same person many times (e.g. "Elena", "She" and "The Teacher" can all refer to the same person), let's keep only one row per "COREF" value. The number of remaining rows is then the number of distinct characters that the coreference annotation finds in the book.

In [27]:
unique_personal_name = personal_name.drop_duplicates(subset=['COREF'])

In [28]:
unique_personal_name

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
12,18,81,83,PROP,PER,MARLEY 'S GHOST
19,26,123,123,PROP,PER,Scrooge
23,20,147,148,PROP,PER,Old Marley
64,21,406,406,PROP,PER,Hamlet
69,22,461,462,PROP,PER,Saint Paul
...,...,...,...,...,...,...
3866,81,31300,31300,PROP,PER,Caroline
4178,83,33015,33015,PROP,PER,Spectre
4267,84,33696,33697,PROP,PER,Good Spirit
4391,85,34703,34703,PROP,PER,Hallo
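One caveat: the first rows of the table show a "COREF" value of -1 for several mentions. Assuming -1 marks a mention that was not linked to any coreference cluster (the source does not say so explicitly), deduplicating on "COREF" would collapse all such rows into one. The sketch below, on hypothetical data, compares the plain call with a variant that deduplicates only the linked rows.

```python
import pandas as pd

# Hypothetical names; "Name A" and "Name B" stand for unlinked mentions.
names = pd.DataFrame({
    "COREF": [18, 18, 26, -1, -1],
    "text": ["MARLEY 'S GHOST", "Marley", "Scrooge", "Name A", "Name B"],
})

# Plain deduplication: all -1 rows collapse into a single row.
plain = names.drop_duplicates(subset=["COREF"])

# Variant: deduplicate linked clusters only and keep every unlinked mention.
resolved = names[names["COREF"] != -1].drop_duplicates(subset=["COREF"])
unresolved = names[names["COREF"] == -1]
combined = pd.concat([resolved, unresolved]).sort_index()

print(len(plain), len(combined))
```

Whether the variant is needed depends on whether any "PROP" rows carry -1, which can be checked with `(personal_name["COREF"] == -1).sum()`.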


In [29]:
mkdir output_files

A subdirectory or file output_files already exists.


In [30]:
unique_personal_name.to_csv(r"output_files\a_christmas_carol_person.csv")
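The `mkdir` cell reports an error when the folder is already there, and the backslash path only works on Windows. A portable sketch using `os.makedirs` and `os.path.join` follows. The `result` frame is a hypothetical stand-in for `unique_personal_name`, and the temporary base directory is used only to keep the example free of side effects.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for unique_personal_name.
result = pd.DataFrame({"COREF": [18, 26], "text": ["MARLEY 'S GHOST", "Scrooge"]})

# Temporary base directory for the sketch; the notebook would use its working directory.
base_dir = tempfile.mkdtemp()
out_dir = os.path.join(base_dir, "output_files")

# exist_ok=True makes the call safe to repeat when the folder already exists.
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)

out_path = os.path.join(out_dir, "a_christmas_carol_person.csv")
# index=False omits the row index, so rereading yields no "Unnamed: 0" column.
result.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```

Dropping the index is a design choice: keep it (the default) if the original row numbers are needed to trace a name back to the full entities table.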

In [31]:
number_of_unique_characters = len(unique_personal_name)

In [32]:
number_of_unique_characters

68