# Personas Creation

Our approach follows the input preparation model proposed by David W. Embley ([2021](https://doi.org/10.1007/978-3-030-88358-4_6)). We structure our data around _personas_, defined as "each mention instance of a person in a document" (p. 66), as a foundational step toward probabilistic record linkage (PRL). Each _persona_ is created by ingesting the available metadata about an individual (such as first name, last name, and birth date) and associating that individual with a sacramental event (baptism, marriage, or burial). Relationships between personas are established by the roles they hold in the same event (e.g., father, mother, godfather, witness).

## Personas Data Structure

The `personas` data structure is a flat table with one row per persona and the following columns:

- event_idno: unique semantically meaningful identifier for the event
- original_identifier: identifier of the event record in the source register (e.g., APAucará-LM-L001_M490)
- persona_idno: unique semantically meaningful identifier for the persona
- persona_type: role of the persona in the event (e.g., baptized, father, mother, witness)
- name: first name of the persona
- lastname: last name of the persona
- birth_date: birth date of the persona
- birth_place: birth place of the persona
- death_date: death date of the persona
- death_place: death place of the persona
- resident_in: persona residence at the time of the event
- gender: inferred gender of the persona
- social_condition: harmonized social condition of the persona
- legitimacy_status: harmonized legitimacy status of the persona
- marital_status: harmonized marital status of the persona
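
To make the structure concrete, the sketch below builds a tiny, made-up `personas` table with a subset of these fields and derives within-event relationships by self-joining on `event_idno`. The rows and the `pair_personas` helper are illustrative assumptions, not project data or project code; the column is spelled `lastname`, as in the extracted dataframe.

```python
import pandas as pd

# Made-up rows for a single baptism event; values are illustrative only.
personas = pd.DataFrame(
    [
        {"event_idno": "bautismo-1", "persona_idno": "persona-1",
         "persona_type": "baptized", "name": "maria", "lastname": "quispe"},
        {"event_idno": "bautismo-1", "persona_idno": "persona-2",
         "persona_type": "father", "name": "mariano", "lastname": "quispe"},
        {"event_idno": "bautismo-1", "persona_idno": "persona-3",
         "persona_type": "godmother", "name": "juana", "lastname": "huaccari"},
    ]
)


def pair_personas(df: pd.DataFrame) -> pd.DataFrame:
    """Return every ordered pair of distinct personas that share an event."""
    cols = ["event_idno", "persona_idno", "persona_type"]
    pairs = df[cols].merge(df[cols], on="event_idno", suffixes=("_a", "_b"))
    # Drop self-pairs; each remaining row links persona A to persona B.
    pairs = pairs[pairs["persona_idno_a"] != pairs["persona_idno_b"]]
    return pairs.reset_index(drop=True)


pairs = pair_personas(personas)
print(pairs[["persona_type_a", "persona_type_b"]])
```

Each row of `pairs` reads as "persona A took part in the same event as persona B", and the two `persona_type` columns carry the roles from which a relationship (for example, father of the baptized) can be inferred.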


Personas are identified by passing one dataframe, or a list of dataframes, containing the clean data to the `PersonaExtractor` class in the `Persona` module. Results are stored in `data/interim/personas_extracted.csv` for testing and in `data/clean/personas.csv` for production.
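
As a rough illustration of what such an extraction involves, the sketch below reshapes a wide table (one row per event) into a long table (one row per persona). It is not the project's `PersonaExtractor`: the input column names (`baptized_name`, `father_lastname`, and so on), the `ROLES` list, and the `extract_personas_sketch` function are all hypothetical.

```python
import pandas as pd

# Hypothetical wide layout: one row per baptism, one column group per role.
bautismos_example = pd.DataFrame(
    {
        "event_idno": ["bautismo-1", "bautismo-2"],
        "baptized_name": ["maria", "jose"],
        "baptized_lastname": ["quispe", "polanco"],
        "father_name": ["mariano", None],
        "father_lastname": ["quispe", None],
        "mother_name": ["juana", "teofila"],
        "mother_lastname": ["huaccari", "polanco"],
    }
)

ROLES = ["baptized", "father", "mother"]  # assumed role prefixes


def extract_personas_sketch(events: pd.DataFrame) -> pd.DataFrame:
    """Reshape one-row-per-event data into one-row-per-persona data."""
    frames = []
    for role in ROLES:
        part = events[["event_idno", f"{role}_name", f"{role}_lastname"]].copy()
        part.columns = ["event_idno", "name", "lastname"]
        part["persona_type"] = role
        frames.append(part)
    personas = pd.concat(frames, ignore_index=True)
    # A role with neither a first name nor a last name is treated as not mentioned.
    personas = personas.dropna(subset=["name", "lastname"], how="all")
    personas = personas.sort_values(["event_idno", "persona_type"]).reset_index(drop=True)
    # Sequential identifiers, one per mention.
    personas["persona_idno"] = [f"persona-{i + 1}" for i in range(len(personas))]
    return personas


personas_sketch = extract_personas_sketch(bautismos_example)
print(personas_sketch)
```

Because every mention becomes its own row, the same individual appearing in two events yields two personas; merging them is left to the record linkage step.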

In [14]:
import pandas as pd
from actions.extractors import Persona

In [15]:
bautismos = pd.read_csv("../data/clean/bautismos_clean.csv")
entierros = pd.read_csv("../data/clean/entierros_clean.csv")
matrimonios = pd.read_csv("../data/clean/matrimonios_clean.csv")

In [16]:
extractor = Persona.PersonaExtractor([bautismos, matrimonios, entierros])
personas = extractor.extract_personas()

personas.describe(include='all')

| | event_idno | original_identifier | persona_type | name | birth_place | birth_date | legitimacy_status | lastname | persona_idno | social_condition | marital_status | resident_in | death_place | death_date | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 47072 | 47072 | 47072 | 46999 | 2298 | 8596 | 11866 | 46762 | 47072 | 9643 | 4275 | 395 | 761 | 2114 | 47072 |
| unique | 10180 | 10179 | 14 | 4286 | 41 | 7001 | 2 | 2616 | 47072 | 7 | 3 | 14 | 6 | 1813 | 6 |
| top | matrimonio-490 | APAucará-LM-L001_M490 | mother | mariano | pampamarca | 1901-09-04 | legitimo | quispe | persona-1 | indio | soltero | pampamarca | aucará | 1871-11-04 | male |
| freq | 12 | 12 | 7614 | 1556 | 1919 | 8 | 9104 | 2712 | 1 | 5654 | 2779 | 292 | 303 | 7 | 20150 |


In [17]:
personas.to_csv("../data/clean/personas.csv", index=False)

In [37]:
personas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47072 entries, 0 to 47071
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   event_idno           47072 non-null  object
 1   original_identifier  47072 non-null  object
 2   persona_type         47072 non-null  object
 3   name                 46999 non-null  object
 4   birth_place          2298 non-null   object
 5   birth_date           8596 non-null   object
 6   legitimacy_status    11866 non-null  object
 7   lastname             46762 non-null  object
 8   persona_idno         47072 non-null  object
 9   social_condition     9643 non-null   object
 10  marital_status       4275 non-null   object
 11  resident_in          395 non-null    object
 12  death_place          761 non-null    object
 13  death_date           2114 non-null   object
 14  gender               47072 non-null  object
dtypes: object(15)
memory usage: 5.4+ MB


In [38]:
personas['persona_type'].value_counts()

persona_type
mother               7614
father               7369
baptized             6340
witness              4249
godparent            3260
godmother            3251
godfather            3012
deceased             2120
wife                 2060
husband              2051
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
Name: count, dtype: int64

## Quality Assessment

### Completeness of Personas

#### Missing full names

In [19]:
name_completeness = personas.loc[(personas['name'].isna()) | (personas['lastname'].isna())]
name_completeness.info()

<class 'pandas.core.frame.DataFrame'>
Index: 383 entries, 62 to 46928
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   event_idno           383 non-null    object
 1   original_identifier  383 non-null    object
 2   persona_type         383 non-null    object
 3   name                 310 non-null    object
 4   birth_place          17 non-null     object
 5   birth_date           52 non-null     object
 6   legitimacy_status    41 non-null     object
 7   lastname             73 non-null     object
 8   persona_idno         383 non-null    object
 9   social_condition     55 non-null     object
 10  marital_status       20 non-null     object
 11  resident_in          0 non-null      object
 12  death_place          3 non-null      object
 13  death_date           53 non-null     object
 14  gender               383 non-null    object
dtypes: object(15)
memory usage: 47.9+ KB


In [20]:
name_completeness['persona_type'].value_counts()

persona_type
mother               83
deceased             53
father_of_husband    45
father_of_wife       45
father               44
godmother            27
godfather            25
godparent            13
wife                 13
witness              10
husband               9
mother_of_wife        7
mother_of_husband     5
baptized              4
Name: count, dtype: int64

In [21]:
# Percentage of missing names
total_personas = len(personas)
missing_names = len(name_completeness)
percentage_missing_names = (missing_names / total_personas) * 100
print(f"Percentage of personas with missing names: {percentage_missing_names:.2f}%")

Percentage of personas with missing names: 0.81%


In [22]:
missing_firstnames = personas.loc[personas['name'].isna()]
percentage_missing_firstnames = (len(missing_firstnames) / total_personas) * 100
print(f"Percentage of personas with missing firstnames: {percentage_missing_firstnames:.2f}%")

Percentage of personas with missing firstnames: 0.16%


In [23]:
missing_surnames = personas.loc[personas['lastname'].isna()]
percentage_missing_surnames = (len(missing_surnames) / total_personas) * 100
print(f"Percentage of personas with missing lastnames: {percentage_missing_surnames:.2f}%")

Percentage of personas with missing lastnames: 0.66%
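
The three percentages above can also be produced in one pass. The helper below is a minimal sketch, assuming only that the dataframe has the columns passed in; `missing_summary` and the example rows are made up for illustration, and with the real data one would pass `personas` instead.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Count and percentage of missing values for each requested column."""
    missing = df[columns].isna()
    return pd.DataFrame(
        {
            "missing": missing.sum(),
            "percent": (missing.mean() * 100).round(2),
        }
    )


# Made-up example rows; not taken from the parish registers.
example = pd.DataFrame(
    {
        "name": ["julian", None, "dionicia", "norberto"],
        "lastname": ["xavies", "osorio", None, None],
    }
)
summary = missing_summary(example, ["name", "lastname"])
print(summary)

# Share of rows missing either part of the full name.
either_missing = example[["name", "lastname"]].isna().any(axis=1).mean() * 100
print(f"Missing name or last name: {either_missing:.2f}%")
```

The mean of a boolean mask is the fraction of `True` values, which avoids keeping separate filtered dataframes for each column.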


#### Fathers and Mothers Completeness

In [24]:
legitimate_sons = personas.loc[personas['legitimacy_status'] == 'legitimo']
ilegitimate_sons = personas.loc[personas['legitimacy_status'] == 'ilegitimo']

sons_types = legitimate_sons['persona_type'].unique().tolist()

# Keep only illegitimate personas whose role is a child role (baptized, wife,
# husband, deceased), so that other illegitimate personas are excluded
ilegitimate_sons = ilegitimate_sons.loc[ilegitimate_sons['persona_type'].isin(sons_types)]

print("Legitimate Sons Persona Types and Counts:")
print(legitimate_sons['persona_type'].value_counts())
print("\nIllegitimate Sons Persona Types and Counts:")
print(ilegitimate_sons['persona_type'].value_counts())

Legitimate Sons Persona Types and Counts:
persona_type
baptized    5483
wife        1267
husband     1248
deceased    1106
Name: count, dtype: int64

Illegitimate Sons Persona Types and Counts:
persona_type
baptized    810
deceased    201
wife        170
husband     165
Name: count, dtype: int64


In [25]:
def check_parents_completeness(sons_df, personas_df, legitimacy='leg'):
    """Return the rows of sons_df whose events lack the expected parents.

    legitimacy='leg' expects both a father and a mother in the event;
    legitimacy='ileg' expects at least one of them.
    """
    # Get unique event_idno from sons
    event_ids = sons_df['event_idno'].unique()
    
    # Filter personas to only relevant events
    relevant_personas = personas_df[personas_df['event_idno'].isin(event_ids)]
    
    # Check for father and mother presence by event.
    # Caveat: substring matching also counts 'godfather'/'godmother' and the
    # other spouse's parents (e.g. 'father_of_wife' for a husband) as parents,
    # so the resulting counts of incomplete events are a lower bound.
    events_with_father = set(relevant_personas[relevant_personas['persona_type'].str.contains('father', na=False)]['event_idno'])
    events_with_mother = set(relevant_personas[relevant_personas['persona_type'].str.contains('mother', na=False)]['event_idno'])
    
    if legitimacy == 'leg':
        # For legitimate sons, both parents should be present
        # Incomplete if missing father OR missing mother
        events_missing_father = set(event_ids) - events_with_father
        events_missing_mother = set(event_ids) - events_with_mother
        incomplete_events = events_missing_father | events_missing_mother
    elif legitimacy == 'ileg':
        # For illegitimate sons, at least one parent should be present
        # Incomplete if missing BOTH father AND mother
        incomplete_events = set(event_ids) - events_with_father - events_with_mother
    else:
        raise ValueError("Legitimacy must be 'leg' or 'ileg'")
    
    # Get sons with incomplete parents
    incomplete_sons = sons_df[sons_df['event_idno'].isin(incomplete_events)]
    return incomplete_sons

incomplete_legit_parents = check_parents_completeness(legitimate_sons, personas)
incomplete_ilegit_parents = check_parents_completeness(ilegitimate_sons, personas, legitimacy='ileg')

print(f"Number of legitimate sons with incomplete parents: {len(incomplete_legit_parents)}")
print(f"Number of illegitimate sons with incomplete parents: {len(incomplete_ilegit_parents)}")


Number of legitimate sons with incomplete parents: 42
Number of illegitimate sons with incomplete parents: 9


Legitimacy breakdown by role, using the counts printed above (role total in parentheses):

- Baptized (6340): 5483 legitimate + 810 illegitimate = 6293
- Husbands (2051): 1248 legitimate + 165 illegitimate = 1413
- Wives (2060): 1267 legitimate + 170 illegitimate = 1437
- Deceased (2120): 1106 legitimate + 201 illegitimate = 1307

Total legitimate = 9104, of which 42 (0.46%) have incomplete parents.
Total illegitimate = 1346, of which 9 (0.67%) have incomplete parents.
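
The check above matches roles by substring, so a `godfather` counts as a `father` and, in marriages, one spouse's parents count for the other spouse. A stricter variant could map each child role to its own parent roles, as sketched below. The `PARENT_ROLES` mapping is an assumption about how roles pair up in this dataset (in particular, that parents in burial events are recorded as `father` and `mother`), and `parents_present` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

# Assumed mapping from each child role to the roles of its own parents.
PARENT_ROLES = {
    "baptized": ("father", "mother"),
    "deceased": ("father", "mother"),
    "husband": ("father_of_husband", "mother_of_husband"),
    "wife": ("father_of_wife", "mother_of_wife"),
}


def parents_present(children: pd.DataFrame, personas: pd.DataFrame) -> pd.DataFrame:
    """Flag, for each child persona, whether its own father and mother appear in its event."""
    # Set of roles recorded in each event, e.g. {"baptized", "mother", "godfather"}.
    roles_by_event = personas.groupby("event_idno")["persona_type"].apply(set)
    out = children[["event_idno", "persona_idno", "persona_type"]].copy()
    event_roles = out["event_idno"].map(roles_by_event)
    child_roles = out["persona_type"]
    # Exact role membership, so 'godfather' never satisfies 'father'.
    out["has_father"] = [PARENT_ROLES[c][0] in r for c, r in zip(child_roles, event_roles)]
    out["has_mother"] = [PARENT_ROLES[c][1] in r for c, r in zip(child_roles, event_roles)]
    return out


# Made-up events: a baptism with a godfather but no father, and a marriage
# where only the husband's father is recorded.
example = pd.DataFrame(
    {
        "event_idno": ["bautismo-1"] * 3 + ["matrimonio-1"] * 5,
        "persona_idno": [f"persona-{i}" for i in range(1, 9)],
        "persona_type": [
            "baptized", "mother", "godfather",
            "husband", "wife", "father_of_husband",
            "mother_of_husband", "mother_of_wife",
        ],
    }
)
children = example[example["persona_type"].isin(list(PARENT_ROLES))]
flags = parents_present(children, example)
print(flags)
```

With exact matching, the baptized persona and the wife in this example are both flagged as lacking a father, whereas the substring check would treat both events as complete. Counts produced this way would therefore be expected to be at least as high as the 42 and 9 reported above.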

#### Personas with missing birth and death dates

In [32]:
nobirthdate = personas.loc[personas['birth_date'].isna()]
personas_size = len(personas)
nobirthdate_size = len(nobirthdate)
percentage_nobirthdate = (nobirthdate_size / personas_size) * 100
print(f"Percentage of personas with missing birth dates: {percentage_nobirthdate:.2f}%")
nobirthdate['persona_type'].value_counts()

Percentage of personas with missing birth dates: 81.74%


persona_type
mother               7614
father               7369
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 1470
mother_of_wife       1459
husband              1441
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
baptized              978
deceased               86
Name: count, dtype: int64

In [36]:
deceased = nobirthdate.loc[nobirthdate['persona_type'] == 'deceased']
deceased

| | event_idno | original_identifier | persona_type | name | birth_place | birth_date | legitimacy_status | lastname | persona_idno | social_condition | marital_status | resident_in | death_place | death_date | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41677 | entierro-1 | APAucará-LD-L001_E001 | deceased | julian | | | | xavies | persona-41678 | | casado | | | 1846-10-06 | male |
| 41804 | entierro-60 | APAucará-LD-L001_E060 | deceased | dionicia | | | | osorio | persona-41805 | | | | | 1864-12-02 | unknown |
| 41805 | entierro-61 | APAucará-LD-L001_E061 | deceased | sebastian | | | | guillen | persona-41806 | | | | | 1864-07-05 | male |
| 41809 | entierro-63 | APAucará-LD-L001_E063 | deceased | jesús | | | legitimo | bendezú | persona-41810 | | | | | 1905-02-07 | mostly_male |
| 41812 | entierro-64 | APAucará-LD-L001_E064 | deceased | norberto | | | | bega | persona-41813 | | | | | 1865-01-10 | male |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46567 | entierro-2035 | APAucará-LE-L003_E691 | deceased | teófila | | | legitimo | polanco | persona-46568 | indio | | | pampamarca | 1919-07-25 | unknown |
| 46570 | entierro-2036 | APAucará-LE-L003_E692 | deceased | leonila | | | legitimo | polanco | persona-46571 | indio | | | pampamarca | 1919-07-25 | female |
| 46573 | entierro-2037 | APAucará-LE-L003_E693 | deceased | naceanseno | | | legitimo | polanco | persona-46574 | indio | | | pampamarca | 1919-07-25 | unknown |
| 46968 | entierro-2159 | APAucará-LE-L003_E815 | deceased | lorenzo | sacsamarca | | ilegitimo | huaccari | persona-46969 | indio | casado | | aucará | 1920-06-19 | male |


In [33]:
nodeathdate = personas.loc[personas['death_date'].isna()]
nodeathdate_size = len(nodeathdate)
percentage_nodeathdate = (nodeathdate_size / personas_size) * 100
print(f"Percentage of personas with missing death dates: {percentage_nodeathdate:.2f}%")
nodeathdate['persona_type'].value_counts()

Percentage of personas with missing death dates: 95.51%


persona_type
mother               7614
father               7369
baptized             6340
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 2060
husband              2051
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
deceased                6
Name: count, dtype: int64
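
Raw counts of missing values per role are easier to read next to each role's size. For instance, the 86 deceased personas without a birth date are about 4.06% of the 2120 deceased, and the 6 without a death date are about 0.28%, while roles such as mother or witness lack both dates in every row. The sketch below computes such per-role rates; `missing_rate_by_role` and the example rows are illustrative assumptions, not project code or data.

```python
import pandas as pd


def missing_rate_by_role(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Missing count, role size, and missing percentage of `column` per persona_type."""
    grouped = df[column].isna().groupby(df["persona_type"])
    out = pd.DataFrame({"missing": grouped.sum(), "total": grouped.size()})
    out["percent"] = (out["missing"] / out["total"] * 100).round(2)
    return out.sort_values("percent", ascending=False)


# Made-up rows: death dates are only ever recorded for the deceased.
example = pd.DataFrame(
    {
        "persona_type": ["deceased", "deceased", "deceased", "mother", "mother"],
        "death_date": ["1846-10-06", "1864-12-02", None, None, None],
    }
)
rates = missing_rate_by_role(example, "death_date")
print(rates)
```

Read this way, the overall missing percentages for dates mostly reflect which roles a field applies to rather than gaps in the records for the roles where it is expected.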

#### Birth and Death Places Completeness

In [39]:
nonbirthplace = personas.loc[personas['birth_place'].isna()]
nonbirthplace_size = len(nonbirthplace)
percentage_nonbirthplace = (nonbirthplace_size / personas_size) * 100
print(f"Percentage of personas with missing birth places: {percentage_nonbirthplace:.2f}%")
nonbirthplace['persona_type'].value_counts()

Percentage of personas with missing birth places: 95.12%


persona_type
mother               7614
father               7369
baptized             5502
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 1657
husband              1650
deceased             1464
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
Name: count, dtype: int64

In [40]:
nodeathplace = personas.loc[personas['death_place'].isna()]
nodeathplace_size = len(nodeathplace)
percentage_nodeathplace = (nodeathplace_size / personas_size) * 100
print(f"Percentage of personas with missing death places: {percentage_nodeathplace:.2f}%")
nodeathplace['persona_type'].value_counts()

Percentage of personas with missing death places: 98.38%


persona_type
mother               7614
father               7369
baptized             6340
witness              4249
godparent            3260
godmother            3251
godfather            3012
wife                 2060
husband              2051
mother_of_wife       1459
mother_of_husband    1439
father_of_wife       1438
father_of_husband    1410
deceased             1359
Name: count, dtype: int64