# Names for People Reconciliation

This notebook identifies the names that are easiest to reconcile, so that the reconciliation can continue with them.

In [3]:
import pandas as pd

First, we load the data. The dataframe contains the names of all the people who have not yet been reconciled with either Wikidata or ULAN.

In [12]:
people = pd.read_excel('data/people.xls')

In [16]:
people.head()

Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
0,304,100304,-u-,-u-,-u-,,,
1,305,100305,-u-g-e,-u-g-e,-u-g-e,,,
2,47,100047,-y-,-y-,-y-,,,
3,654,100654,A.,A.,A.,,,
4,7426,107426,A. B.,A. B.,A. B.,,,
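
A quick sanity check that the identifier columns really are empty could look like the sketch below. The small frame is rebuilt by hand from the rows shown above and only stands in for the real `people` dataframe; the name `sample` is an assumption made for illustration.

```python
import pandas as pd

# Hand-built stand-in for the first rows of `people`
# (values copied from the output above; not the real file).
sample = pd.DataFrame({
    "ID": [304, 305, 47, 654, 7426],
    "ID_2": [100304, 100305, 100047, 100654, 107426],
    "display_name": ["-u-", "-u-g-e", "-y-", "A.", "A. B."],
    "ULAN": [None] * 5,
    "Wikidata": [None] * 5,
})

# For each identifier column, True only if every row is missing a value.
unreconciled = sample[["ULAN", "Wikidata"]].isna().all()
print(unreconciled)
```

Running the same two lines on `people` itself would confirm whether every row lacks both identifiers.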


## Full names, but not in the right place

The first idea is that there exist rows in the dataframe where the first name and last name fields both contain the person's full name. It might be easier to start the reconciliation with these.

In [113]:
# look for the same entries for first and last name
double_name_pp = people[people.first_name == people.last_name]
print('Number of people with duplicated first and last name entries : ', len(double_name_pp.index))
double_name_pp.sample(10)

Number of people with duplicated first and last name entries :  1996


Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
3088,5586,105586,Schühlein,Schühlein,Schühlein,,,
1227,5336,105336,Gey,Gey,Gey,,,
3603,4686,104686,Willroider,Willroider,Willroider,,,
2204,4684,104684,Meisel,Meisel,Meisel,,,
658,6266,106266,Cuyp,Cuyp,Cuyp,,,
3256,4726,104726,Steffan,Steffan,Steffan,,,
3011,8722,108722,Schlaf,Schlaf,Schlaf,,,
3418,7424,107424,v. M.,v. M.,v. M.,,,
2439,9903,109903,Nurisso,Nurisso,Nurisso,,,
3462,8610,108610,Verkade,Verkade,Verkade,,,


Several cases arise when the two entries are identical. It may be that:
- We only have the initials of the person
- We only have one of the two names of the person
- We have the entire name but it was not broken down into first and last name

In the first case, the data cannot be reconciled. We treat an entry as initials when at least one third of the characters in the display name are periods.
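
To make the threshold concrete, here is a small worked sketch in plain Python. The helper `period_ratio` is a hypothetical name introduced for illustration; the display names are taken from the outputs in this notebook.

```python
def period_ratio(name: str) -> float:
    """Share of the characters in `name` that are periods."""
    return name.count(".") / len(name)

# Display names taken from the outputs in this notebook.
for name in ["J. C.", "v. M.", "M.G. Valbert", "Charles V."]:
    ratio = period_ratio(name)
    label = "initials" if ratio >= 1 / 3 else "kept"
    print(f"{name!r}: {ratio:.2f} -> {label}")
```

"J. C." has 2 periods out of 5 characters (0.40) and is treated as initials, whereas "M.G. Valbert" has 2 out of 12 (about 0.17) and is kept. Note that the helper assumes a non-empty string.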

In [73]:
# share of the characters in the display name that are periods
period_ratio = double_name_pp['display_name'].str.count(r'\.') / double_name_pp['display_name'].str.len()
initials_pp = double_name_pp[period_ratio >= 1/3]
print('Number of people only with initials : ', len(initials_pp.index))
initials_pp.sample(5)

Number of people only with initials :  133


Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
82,106,100106,b.,b.,b.,,,
1592,7408,107408,J. C.,J. C.,J. C.,,,
1844,7283,107283,L. V. H.,L. V. H.,L. V. H.,,,
2495,3694,103694,P. L.,P. L.,P. L.,,,
1364,7248,107248,H. P.,H. P.,H. P.,,,


Next, we look for the people whose names have not been broken down into first and last name. They all have at least one blank space in their display name.

In [105]:
# remove the rows that contain only initials
result = double_name_pp[~double_name_pp.ID_2.isin(initials_pp.ID_2)]

# finding people who have a blank space in their display name
full_name_pp = result[result['display_name'].str.contains(' ')]
print('Number of people with blank spaces in them : ', len(full_name_pp.index))
full_name_pp.sample(5)

Number of people with blank spaces in them :  159


Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
2086,3996,103996,M.G. Valbert,M.G. Valbert,M.G. Valbert,,,
2860,8668,108668,Romiti Fasce,Romiti Fasce,Romiti Fasce,,,
2090,8231,108231,Mac Laren,Mac Laren,Mac Laren,,,
2010,7014,107014,Lichten der J.,Lichten der J.,Lichten der J.,,,
546,2680,102680,Charles V.,Charles V.,Charles V.,,,
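
Excluding the initials-only rows amounts to an anti-join on `ID_2`. As an alternative sketch, `merge` with `indicator=True` makes that explicit. The frames `same_names` and `initials_only` below are hand-built stand-ins (hypothetical names), using values from the outputs above.

```python
import pandas as pd

# Stand-in for the rows where first and last name are identical.
same_names = pd.DataFrame({
    "ID_2": [107408, 108231, 106468],
    "display_name": ["J. C.", "Mac Laren", "Douanier Rousseau"],
})
# Stand-in for the subset already identified as initials.
initials_only = same_names[same_names["display_name"] == "J. C."]

# A left merge with indicator=True adds a `_merge` column; rows marked
# "left_only" have no match in `initials_only`.
merged = same_names.merge(
    initials_only[["ID_2"]], on="ID_2", how="left", indicator=True
)
without_initials = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(without_initials)
```

Unlike a boolean mask with `isin`, `merge` returns a fresh integer index, so the original row labels are not preserved.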


In [115]:
full_name_pp.iloc[30:40]

Unnamed: 0,ID,ID_2,display_name,first_name,last_name,ULAN,Wikidata,Column
817,6902,106902,Doesburg van,Doesburg van,Doesburg van,,,
833,6468,106468,Douanier Rousseau,Douanier Rousseau,Douanier Rousseau,,,
836,6413,106413,Dr Baroux,Dr Baroux,Dr Baroux,,,
837,6414,106414,Dr Sargeant,Dr Sargeant,Dr Sargeant,,,
851,6119,106119,du Gardier,du Gardier,du Gardier,,,
852,7438,107438,Du Quesne - van Gogh,Du Quesne - van Gogh,Du Quesne - van Gogh,,,
855,3723,103723,Dubosc de Pesquidoux,Dubosc de Pesquidoux,Dubosc de Pesquidoux,,,
860,9255,109255,Duc de Brunswick,Duc de Brunswick,Duc de Brunswick,,,
861,6418,106418,Duc de Saxe,Duc de Saxe,Duc de Saxe,,,
877,2974,102974,Dunoyer de Segonzac,Dunoyer de Segonzac,Dunoyer de Segonzac,,,


This list should contain the names that are easiest to reconcile.
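
The steps of this section can be combined into a single labelling function, sketched below under the same heuristics (period share of at least one third for initials, a blank space for unsplit full names). The function `classify`, the label strings, and the hand-built `sample` frame are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Hand-built sample; IDs and display names are taken from the outputs above.
sample = pd.DataFrame({
    "ID_2": [105586, 107408, 108231, 106468],
    "display_name": ["Schühlein", "J. C.", "Mac Laren", "Douanier Rousseau"],
})
sample["first_name"] = sample["display_name"]
sample["last_name"] = sample["display_name"]

def classify(name: str) -> str:
    """Label a non-empty display name using the two heuristics of this notebook."""
    if name.count(".") / len(name) >= 1 / 3:
        return "initials"
    if " " in name:
        return "unsplit full name"
    return "single name"

# Keep rows whose first and last name are identical, then label them.
same = sample[sample["first_name"] == sample["last_name"]].copy()
same["case"] = same["display_name"].map(classify)

# The candidates for reconciliation are the unsplit full names.
candidates = same[same["case"] == "unsplit full name"]
print(candidates[["ID_2", "display_name"]])
```

Checking for initials before checking for spaces matters here: "J. C." contains a space but should still be labelled as initials.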