# Personas Creation

Our approach follows the input preparation model proposed by David W. Embley ([2021](https://doi.org/10.1007/978-3-030-88358-4_6)). We structure our data around _personas_, defined as "each mention instance of a person in a document" (p. 66), as a foundational step toward probabilistic record linkage (PRL). Each _persona_ is built from the individual metadata available in a record (such as first name, last name, and birth date) and is tied to a sacramental event (baptism, marriage, or burial). Relationships between personas are established by their roles in a shared event (e.g., father, mother, godfather, witness).

## Personas Data Structure

The `personas` data structure has one row per persona, with the following fields:

- event_idno: unique semantically meaningful identifier for the event
- persona_idno: unique semantically meaningful identifier for the persona
- persona_type: role of the persona in the event (e.g., baptized, father, mother, witness)
- name: first name of the persona
- last_name: last name of the persona (labelled `lastname` in the extracted dataframe)
- birth_date: birth date of the persona
- birth_place: birth place of the persona
- resident_in: persona residence at the time of the event
- gender: inferred gender of the persona
- social_condition: harmonized social condition of the persona
- legitimacy_status: harmonized legitimacy status of the persona
- marital_status: harmonized marital status of the persona
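
As a rough illustration of the field list above, a single persona could be modelled as a small record type. This is only a sketch: `PersonaRecord` is a hypothetical name that does not come from the project code, all optional fields are assumed to be strings, and the example values are invented.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class PersonaRecord:
    """Hypothetical container mirroring the fields listed above."""

    # Identifiers and role are assumed to be always present.
    event_idno: str
    persona_idno: str
    persona_type: str
    # Everything else may be missing in the source record.
    name: Optional[str] = None
    last_name: Optional[str] = None
    birth_date: Optional[str] = None
    birth_place: Optional[str] = None
    resident_in: Optional[str] = None
    gender: Optional[str] = None
    social_condition: Optional[str] = None
    legitimacy_status: Optional[str] = None
    marital_status: Optional[str] = None


# Invented example values, for illustration only.
example = PersonaRecord(
    event_idno="bautismo-1",
    persona_idno="persona-1",
    persona_type="mother",
    name="maria",
    last_name="quispe",
    gender="female",
)

field_names = [f.name for f in fields(PersonaRecord)]
record = asdict(example)  # plain dict, ready to become a dataframe row
print(record)
```

Fields that a record does not provide stay `None`, which matches the sparse columns one would expect when most events only mention a name and a role.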


Individuals are identified by passing a single dataframe, or a list of dataframes, of clean data to the `PersonaExtractor` class in the `Persona` module. Results are stored in `data/interim/personas_extracted.csv` for testing and in `data/clean/personas.csv` for production.
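
The internals of the extractor are not shown here, so the following is only a minimal sketch of the general idea: turning a wide table with one row per event into a long table with one row per persona. The input column names (`baptized_name`, `father_name`, `mother_name`), the `ROLES` list, and the function `extract_personas_sketch` are all assumptions made for illustration and are not the project's actual API.

```python
import pandas as pd

# Hypothetical wide event table: one row per event, one name column per role.
events = pd.DataFrame(
    {
        "event_idno": ["bautismo-1", "bautismo-2"],
        "baptized_name": ["juan", "rosa"],
        "father_name": ["pedro", None],
        "mother_name": ["maria", "ana"],
    }
)

# Assumed set of roles to look for in each event row.
ROLES = ["baptized", "father", "mother"]


def extract_personas_sketch(frames, roles):
    """Emit one row per person mention found in the given event frames."""
    rows = []
    for frame in frames:
        for _, event in frame.iterrows():
            for role in roles:
                name = event.get(f"{role}_name")
                if pd.isna(name):
                    continue  # this role is not mentioned in the event
                rows.append(
                    {
                        "event_idno": event["event_idno"],
                        "persona_type": role,
                        "name": name,
                    }
                )
    personas = pd.DataFrame(rows)
    # Sequential identifiers, one per mention.
    personas["persona_idno"] = [f"persona-{i}" for i in range(1, len(personas) + 1)]
    return personas


personas = extract_personas_sketch([events], ROLES)

# Personas that share an event_idno are related through that event.
by_event = personas.groupby("event_idno")["persona_type"].apply(list)
print(personas)
print(by_event)
```

Keeping `event_idno` on every row is what lets roles within the same event be read as relationships between personas.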

```python
import pandas as pd
from actions.extractors import Persona
```

```python
# Load the cleaned baptism, burial, and marriage records.
bautismos = pd.read_csv("../data/clean/bautismos_clean.csv")
entierros = pd.read_csv("../data/clean/entierros_clean.csv")
matrimonios = pd.read_csv("../data/clean/matrimonios_clean.csv")
```

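```python
# Extract one persona per person mention across all three event types.
extractor = Persona.PersonaExtractor([bautismos, matrimonios, entierros])
personas = extractor.extract_personas()

# Summarize every column, including non-numeric ones.
personas.describe(include='all')
```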

| | event_idno | persona_type | name | birth_place | birth_date | legitimacy_status | lastname | persona_idno | social_condition | marital_status | resident_in | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 39585 | 39585 | 39517 | 2329 | 8594 | 11862 | 39013 | 39585 | 9350 | 4256 | 399 | 39585 |
| unique | 10179 | 16 | 4103 | 85 | 6999 | 2 | 2328 | 39585 | 7 | 3 | 18 | 6 |
| top | matrimonio-465 | mother | maria | Pampamarca | 1901-09-04 | legitimo | quispe | persona-1 | indio | soltero | Pampamarca | male |
| freq | 7 | 7613 | 1056 | 1902 | 8 | 9104 | 2305 | 1 | 5652 | 2779 | 292 | 15425 |
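
The `count` row of the summary above shows how sparsely each field is populated. As a small worked example, the non-null counts can be converted into completeness percentages. The counts below are copied from that summary, while the variable names are only illustrative.

```python
import pandas as pd

# Non-null counts copied from the `count` row of the summary.
counts = pd.Series(
    {
        "event_idno": 39585,
        "persona_type": 39585,
        "name": 39517,
        "birth_place": 2329,
        "birth_date": 8594,
        "legitimacy_status": 11862,
        "lastname": 39013,
        "persona_idno": 39585,
        "social_condition": 9350,
        "marital_status": 4256,
        "resident_in": 399,
        "gender": 39585,
    }
)

# Every persona has an identifier, so its count is the total number of rows.
total = counts["persona_idno"]

# Share of personas with a value in each column, in percent.
completeness = (counts / total * 100).round(1).sort_values(ascending=False)
print(completeness)
```

On a live dataframe the same figures could be obtained with `personas.notna().mean() * 100`.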


```python
# Save the extracted personas to the production location.
personas.to_csv("../data/clean/personas.csv", index=False)
```