# Personas Creation

Our approach aligns closely with the input preparation model proposed by David W. Embley ([2021](https://doi.org/10.1007/978-3-030-88358-4_6)). We structure our data around _personas_, defined as "each mention instance of a person in a document" (p. 66), as a foundational step toward probabilistic record linkage (PRL). Each _persona_ is created by ingesting the available individual metadata (such as name, last name, and birth date), associating the person with a sacramental event (baptism, marriage, or burial), and establishing direct relationships (to parents, spouse, or godparents).
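To make the definition concrete, the sketch below models a persona as a small Python data structure. It is only an illustration: the class names `PersonaSketch` and `RelationshipSketch`, the relationship labels, and all sample values are assumptions, while the field names mirror columns of the extracted personas table shown at the end of this notebook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RelationshipSketch:
    """Hypothetical link from one persona to another within the same record."""
    relationship_type: str       # assumed labels, e.g. "father", "godparent"
    related_persona_idno: str


@dataclass
class PersonaSketch:
    """One mention of a person in one document (illustrative, not the project schema)."""
    persona_idno: str
    event_idno: str              # the sacramental event this mention belongs to
    persona_type: str            # role in the event, e.g. "baptized", "father"
    name: Optional[str] = None
    lastname: Optional[str] = None
    birth_date: Optional[str] = None
    relationships: List[RelationshipSketch] = field(default_factory=list)


# Made-up values: two mentions in the same (hypothetical) baptism record
child = PersonaSketch("persona-1", "bautismo-1", "baptized", name="maria", lastname="quispe")
father = PersonaSketch("persona-2", "bautismo-1", "father", name="juan", lastname="quispe")
child.relationships.append(RelationshipSketch("father", father.persona_idno))
```

Because a persona is a mention rather than an individual, the same real person appearing in two records would yield two separate personas; deciding whether they refer to one individual is left to the later record linkage step.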

In [1]:
import pandas as pd
from pathlib import Path

In [2]:
bautismos = pd.read_csv("../data/clean/bautismos_clean.csv")
entierros = pd.read_csv("../data/clean/entierros_clean.csv")
matrimonios = pd.read_csv("../data/clean/matrimonios_clean.csv")

## Identifying Prefixes

This simple step checks that column names follow consistent conventions across the three datasets. It also helps identify the entities for the entity-relationship (ER) model and associate a prefix with each entity.



In [3]:
import re

# Group 1 captures the leading entity prefix plus an optional single-digit
# index; group 2 captures the remainder of the column name.
prefix_pattern = re.compile(r"(^[A-Za-z]*_[\d]?_?)([A-Za-z]*_?[\w\d]*)")
digit_pattern = re.compile(r"\d")

prefixes = set()

for df in [bautismos, entierros, matrimonios]:
    for col in df.columns:
        match = prefix_pattern.match(col)
        if match:
            # Drop the numeric index and the surrounding underscores
            prefix = digit_pattern.sub("", match.group(1)).strip("_")
            prefixes.add(prefix)

prefixes

{'baptized',
 'bride',
 'burial',
 'deceased',
 'event',
 'father',
 'godfather',
 'godmother',
 'godparent',
 'groom',
 'husband',
 'mother',
 'parents',
 'wife',
 'witness'}
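As a worked example of how the pattern reduces a column name to its prefix, the sketch below applies the same regular expression to a handful of column names. The helper `extract_prefix` and the sample column names are hypothetical and chosen only to illustrate the three cases (plain prefix, numbered prefix, no prefix); they are not taken from the datasets.

```python
import re

# Same pattern as in the cell above
prefix_pattern = re.compile(r"(^[A-Za-z]*_[\d]?_?)([A-Za-z]*_?[\w\d]*)")


def extract_prefix(column):
    """Return the entity prefix of a column name, or None if there is none."""
    match = prefix_pattern.match(column)
    if match is None:
        return None
    # Remove the numeric index, then trim leftover underscores
    return re.sub(r"\d", "", match.group(1)).strip("_")


# Hypothetical column names, for illustration only
sample_columns = [
    "baptized_name",
    "father_lastname",
    "godparent_1_name",
    "godparent_2_name",
    "event_date",
    "identifier",
]

sample_prefixes = {col: extract_prefix(col) for col in sample_columns}
print(sample_prefixes)
```

Here `godparent_1_name` and `godparent_2_name` both reduce to `godparent`, while `identifier` contains no underscore and therefore yields `None`.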

## ER Model

With those prefixes in mind, and after basic data cleaning, we can create a more accurate version of the ER model ([see the original version](https://github.com/UCSB-AMPLab/sondondo/tree/ffc80515805dd7b00c9c127c3093cb65c7da8b23/database)). The model now has the following structure:

![ER Model](../database/db_diagram.png)



### Ingestion Order

```text
ConditionVocab
    |
    ▼
Event
    |
    ▼
OriginalTerms
    |
    ▼
Persona
    |
    ▼
PersonaCondition
    |
    ▼
PersonaRelationship
    |
    ▼
PersonaRoleInEvent
    |
    ▼
Place
    |
    ▼
Record
```

In [4]:
from actions.extractors import Persona

In [5]:
extractor = Persona.PersonaExtractor([bautismos, matrimonios, entierros])
personas = extractor.extract_personas()

personas.describe(include='all')

| | event_idno | persona_type | name | birth_place | birth_date | legitimacy_status | lastname | persona_idno | social_condition | marital_status | resident_in | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 39585 | 39585 | 39517 | 2329 | 8594 | 11862 | 39013 | 39585 | 9350 | 4256 | 399 | 39585 |
| unique | 10179 | 16 | 4103 | 85 | 6999 | 2 | 2328 | 39585 | 7 | 3 | 18 | 6 |
| top | matrimonio-465 | mother | maria | Pampamarca | 1901-09-04 | legitimo | quispe | persona-1 | indio | soltero | Pampamarca | male |
| freq | 7 | 7613 | 1056 | 1902 | 8 | 9104 | 2305 | 1 | 5652 | 2779 | 292 | 15425 |
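The extraction call above produces one row per persona from event records in which each participant occupies its own group of prefixed columns. The internals of `PersonaExtractor` are not shown in this notebook, so the following is only a rough sketch of that wide-to-long idea under assumed column names. The function `split_row_into_personas`, the `<prefix>_<attribute>` naming rule, and the sample row are all hypothetical.

```python
from collections import defaultdict


def split_row_into_personas(row, event_prefix="event"):
    """Group the prefixed columns of one event row into one dict per person.

    Assumes every column is named "<prefix>_<attribute>"; this rule and the
    function itself are illustrative, not the project's actual extractor.
    """
    grouped = defaultdict(dict)
    for column, value in row.items():
        prefix, _, attribute = column.partition("_")
        # Skip empty values and columns without a prefix
        if value is not None and attribute:
            grouped[prefix][attribute] = value

    # Event-level columns are shared by every persona in the row
    event = grouped.pop(event_prefix, {})
    personas = []
    for prefix, attributes in grouped.items():
        persona = {"event_idno": event.get("idno"), "persona_type": prefix}
        persona.update(attributes)
        personas.append(persona)
    return personas


# Made-up baptism row in which the mother is not mentioned
row = {
    "event_idno": "bautismo-1",
    "baptized_name": "maria",
    "father_name": "juan",
    "father_lastname": "quispe",
    "mother_name": None,
}

for persona in split_row_into_personas(row):
    print(persona)
```

In this sketch a participant with no recorded values produces no persona, so the sample row yields two personas (the baptized child and the father), both tied to the same event identifier.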


In [7]:
personas.to_csv("../data/interim/personas_extracted.csv", index=False)