# Names for People Reconciliation

The first idea is to get to know the database by reconciling some of the data entries (cf. [README](https://github.com/dfk-paris/DFKV-illustrations/blob/main/1_data_reconciliation/README.md#data-reconciliation)). This notebook eases that process by finding names that are potentially easier to reconcile.


We know that a majority of people are already linked to an authority file. Let's explore the database and find some ideas for the reconciliation.

In [1]:
# import pandas
import pandas as pd

First, we load the data. The dataframe contains the names of all the people who have not yet been reconciled with either Wikidata or ULAN.

In [2]:
people = pd.read_excel('data/people.xls')

Let's look at a random sample of the data:

In [8]:
people.sample(5)

Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
976,5850,105850,Faber du Faur,Faber du Faur,Faber du Faur,,,
2952,8085,108085,"Saucet, Jean",Jean,Saucet,,,
3370,6227,106227,Tieck,Tieck,Tieck,,,
3438,2093,102093,van de Velde,van de Velde,van de Velde,,,
2761,4294,104294,"Raymond, Pierre",Pierre,Raymond,,,


## Full names, but not in the right place

Here we find something interesting: in some rows of the dataframe, the first name and the last name both contain the full name of the person, as with "Faber du Faur" above. It might be easier to start the reconciliation with these rows, as they may have been missed by a previous automatic reconciliation in OpenRefine.

In [9]:
# look for the same entries for first and last name
double_name_pp = people[people.first_name == people.last_name]
print('Number of people with duplicated first and last name entries : ', len(double_name_pp.index))
double_name_pp.sample(10)

Number of people with duplicated first and last name entries :  1996


Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
213,5316,105316,Berndl,Berndl,Berndl,,,
3188,5743,105743,Simmler,Simmler,Simmler,,,
2552,4918,104918,Pellegrini,Pellegrini,Pellegrini,,,
3281,6080,106080,Stieler,Stieler,Stieler,,,
1027,4997,104997,Fierens-Gevaërt,Fierens-Gevaërt,Fierens-Gevaërt,,,
1055,1203,101203,Forain,Forain,Forain,,,
6,1043,101043,A. D.,A. D.,A. D.,,,
2272,980,100980,Mm.,Mm.,Mm.,,,
877,2974,102974,Dunoyer de Segonzac,Dunoyer de Segonzac,Dunoyer de Segonzac,,,
1763,4603,104603,Koepping,Koepping,Koepping,,,
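To make the filter above concrete, here is a minimal sketch on toy data. The names come from the samples shown in this notebook, but the small dataframe itself is an illustrative assumption. It also shows a side effect worth knowing: rows where both name columns are empty are not selected, because missing values never compare equal.

```python
import pandas as pd

# Toy frame (illustrative, not the real file); the last row has no names at all.
toy = pd.DataFrame({
    "display_name": ["Faber du Faur", "Saucet, Jean", "Tieck", float("nan")],
    "first_name": ["Faber du Faur", "Jean", "Tieck", float("nan")],
    "last_name": ["Faber du Faur", "Saucet", "Tieck", float("nan")],
})

# Element-wise comparison of the two columns; NaN == NaN evaluates to False,
# so the empty row is left out of the selection.
same_name = toy[toy["first_name"] == toy["last_name"]]
print(same_name["display_name"].tolist())
```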


Several cases arise when the two entries are the same. It might be that:
- We only have the initials of the person
- We only have one of the two names of the person
- We have the entire name but it was not broken down into first and last name

In the first case, it will not be possible to reconcile the data (there is no way to figure out who "M. S." is). We remove these rows from our subset, treating an entry as initials when at least one third of the characters of the display name are periods.

In [5]:
# share of the characters of the display name that are periods
period_ratio = double_name_pp['display_name'].str.count(r'\.') / double_name_pp['display_name'].str.len()
initials_pp = double_name_pp[period_ratio >= 1/3]
print('Number of people only with initials : ', len(initials_pp.index))
initials_pp.sample(5)

Number of people only with initials :  133


Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
975,1029,101029,F. R.,F. R.,F. R.,,,
3526,1033,101033,W. K. Z.,W. K. Z.,W. K. Z.,,,
3524,7295,107295,W. G.,W. G.,W. G.,,,
1353,249,100249,H.,H.,H.,,,
2083,8123,108123,M. S.,M. S.,M. S.,,,
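To see how the one-third rule behaves, we can apply it to a few display names taken from the samples above. This is only a sketch: the variable names `names`, `period_ratio` and `is_initials` are ours and do not appear elsewhere in the notebook.

```python
import pandas as pd

# A few display names from the samples above
names = pd.Series(["A. D.", "Mm.", "H.", "W. K. Z.", "Lankheit Dr.", "Forain"])

# Number of literal periods divided by the total number of characters
period_ratio = names.str.count(r"\.") / names.str.len()

# An entry is treated as initials when the ratio reaches one third
is_initials = period_ratio >= 1 / 3

print(pd.DataFrame({"name": names, "ratio": period_ratio.round(2), "initials": is_initials}))
```

Note that "Mm." sits exactly on the threshold (one period out of three characters), so it is flagged as initials only because the comparison is `>=` and not `>`. A trailing title such as "Dr." in a longer name stays well below the threshold.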


Now we try to find the people whose names have not been broken down, because they might be the easiest to reconcile. We assume that all of them have at least one blank space in their display name.

In [11]:
# removing rows which are the initials of people (matched on ID_2)
result = double_name_pp[~double_name_pp.ID_2.isin(initials_pp.ID_2)]

# finding people who have a blank space in their display name
full_name_pp = result[result['display_name'].str.contains(' ')]
print('Number of people with blank spaces in them : ', len(full_name_pp.index))
full_name_pp.sample(5)

Number of people with blank spaces in them :  159


Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
1932,3551,103551,Le Tellier,Le Tellier,Le Tellier,,,
3441,4417,104417,Van Dyck,Van Dyck,Van Dyck,,,
1887,9870,109870,Lankheit Dr.,Lankheit Dr.,Lankheit Dr.,,,
837,6414,106414,Dr Sargeant,Dr Sargeant,Dr Sargeant,,,
2543,8967,108967,Pauvert Jean-Jacques,Pauvert Jean-Jacques,Pauvert Jean-Jacques,,,
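The two steps above (dropping the initials, then keeping names with a blank space) can also be written as a single combination of boolean masks. The following is a sketch on toy data under that assumption, not a replacement for the cell above; the toy frame and the mask names are illustrative.

```python
import pandas as pd

# Toy frame built from names and ID_2 values shown in the samples above
toy = pd.DataFrame({
    "ID_2": [101043, 103551, 105316, 104417],
    "display_name": ["A. D.", "Le Tellier", "Berndl", "Van Dyck"],
})

names = toy["display_name"]

# Same one-third rule as before
is_initials = names.str.count(r"\.") / names.str.len() >= 1 / 3

# Plain substring search for a blank space; na=False treats missing names as "no space"
has_space = names.str.contains(" ", regex=False, na=False)

# Keep rows that are not initials and contain a space
candidates = toy[~is_initials & has_space]
print(candidates["display_name"].tolist())
```

Working with masks avoids building an intermediate dataframe, and each condition can be inspected on its own when a name ends up in the wrong group.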


We now have a smaller list of names that are good candidates for reconciliation. We go through the rows one by one, manually look for an authority file, and enter it into our database when we find one.
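The notebook does not say how the identifiers are written back, so the following is only one possible way to record a manual match, assuming it is stored in the same dataframe. The helper `record_match` is hypothetical, and `"Q_PLACEHOLDER"` stands in for a real Wikidata identifier found by hand.

```python
import pandas as pd

def record_match(df, id_2, wikidata=None, ulan=None):
    """Hypothetical helper: store authority IDs for the row identified by ID_2."""
    row = df["ID_2"] == id_2
    # Guard against typos in the ID and against duplicated IDs
    if row.sum() != 1:
        raise ValueError(f"expected exactly one row for ID_2={id_2}, found {row.sum()}")
    if wikidata is not None:
        df.loc[row, "Wikidata"] = wikidata
    if ulan is not None:
        df.loc[row, "ULAN"] = ulan
    return df

# Toy frame with empty authority columns (illustrative, not the real file)
toy = pd.DataFrame({
    "ID_2": [103551, 104417],
    "display_name": ["Le Tellier", "Van Dyck"],
    "Wikidata": pd.Series([None, None], dtype="object"),
    "ULAN": pd.Series([None, None], dtype="object"),
})

# "Q_PLACEHOLDER" is not a real identifier
toy = record_match(toy, 103551, wikidata="Q_PLACEHOLDER")
print(toy)
```

Checking that exactly one row matches keeps a mistyped `ID_2` from silently doing nothing.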