## FamilySearch Community Reconstruction Tutorial
This notebook demonstrates how to run the community reconstruction process end to end. Community reconstruction adds every nuclear family member in a census to FamilySearch while minimizing overlap with data that already exists on the website. It does this through a series of requests to the FamilySearch API.

For families that are only partially covered on the tree, the process also returns tree-extending hint URLs. This way volunteers can go in and add the missing family members of people who already exist on the tree.

First you will need to import your census file. This can be any subset of a census so long as it has the required variables:

**ark_id | relationship to head | household id | first name | surname | sex | census date | residence place | birthdate | birth place**
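
Before cleaning, it can help to confirm that the file actually contains these variables. The sketch below assumes the column names used in this notebook's census file (`ark1900`, `pr_relationship_code`, and so on) correspond to the variables listed above; `missing_columns` is a hypothetical helper for illustration, not part of the pipeline.

```python
import pandas as pd

# Column names as they appear in this notebook's census file (assumed to
# correspond to the required variables listed above).
REQUIRED_COLUMNS = [
    'ark1900', 'pr_relationship_code', 'household_id', 'pr_name_gn',
    'pr_name_surn', 'pr_sex_code', 'event_date', 'event_place',
    'pr_birth_date', 'pr_birth_place',
]

def missing_columns(census, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the census."""
    return [col for col in required if col not in census.columns]

# Toy frame that lacks most of the columns, to show the check in action
toy = pd.DataFrame({'ark1900': ['A-1'], 'household_id': [1.0]})
missing = missing_columns(toy)
print(missing)
```

An empty list means the file has every column the later steps rely on.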


In [1]:
import pandas as pd

df=pd.read_csv('el_paso_census.csv')
df.head()

Unnamed: 0,ark1900,event_date,event_place,household_id,pr_birth_date,pr_birth_place,pr_name_gn,pr_name_surn,pr_relationship_code,pr_sex_code
0,MQMC-111,1900,Precinct 37 & 41 Colorado Springs city War...,468.0,Aug 1858,Pennsylvania,Vincin*,King,Head,Male
1,MQMC-112,1900,Precinct 37 & 41 Colorado Springs city War...,465.0,Oct 1863,Illinois,E A,Forbes,Head,Male
2,MQMC-113,1900,Precinct 37 & 41 Colorado Springs city War...,460.0,Oct 1874,Missouri,Geo,Williams,,Male
3,MQMC-114,1900,Precinct 37 & 41 Colorado Springs city War...,461.0,Feb 1884,Ohio,Beulah,White,Daughter,Female
4,MQMC-115,1900,Precinct 37 & 41 Colorado Springs city War...,466.0,Dec 1868,Illinois,L,Kirkpatrick,Head,Male


### 1. Clean Census Data

Notice that the given name of the first person listed contains an asterisk. Be sure to filter out households containing anyone with special characters in their given name or surname. The pipeline has steps to ensure these records don't get added to the tree, but it is best to filter them out of the master census file from the start. We should also take out people with any null values.

In [5]:
first_name_filt=df['pr_name_gn'].str.contains('[!@#$%^&*(),?":{}|<>]',regex=True).fillna(True)
last_name_filt=df['pr_name_surn'].str.contains('[!@#$%^&*(),?":{}|<>]',regex=True).fillna(True)

bad_households=df[(first_name_filt)|(last_name_filt)]['household_id']

# '~' is the logical NOT operator, so this can intuitively be read as .notin(bad_households)
df=df[~df['household_id'].isin(bad_households)]
df.dropna(inplace=True) # drop people with null values
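
To see why the filter works at the household level rather than the person level, here is a minimal sketch on made-up data. The toy names and household ids are assumptions for illustration; the pattern is the same one used in the cell above, and `na=True` plays the role of `.fillna(True)` by flagging missing names as bad.

```python
import pandas as pd

# Toy data: household 1 has a name with '*', household 3 has a missing name
toy = pd.DataFrame({
    'household_id': [1, 1, 2, 2, 3, 3],
    'pr_name_gn':   ['Vincin*', 'Mary', 'John', 'Ann', None, 'Geo'],
    'pr_name_surn': ['King', 'King', 'Forbes', 'Forbes', 'White', 'White'],
})

pattern = r'[!@#$%^&*(),?":{}|<>]'
# Flag any person whose given name or surname is bad (or missing)
bad_name = (toy['pr_name_gn'].str.contains(pattern, regex=True, na=True)
            | toy['pr_name_surn'].str.contains(pattern, regex=True, na=True))

# Drop every member of a household that contains a flagged person
bad_households = toy.loc[bad_name, 'household_id']
clean = toy[~toy['household_id'].isin(bad_households)]
kept = sorted(clean['household_id'].unique())
print(kept)
```

Only household 2 survives: one flagged person is enough to remove the whole household, including members whose own names are fine.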


It is also good practice to only keep households with more than one person, meaning we do not add singletons to the tree.

In [6]:
# household_ids that appear exactly once (keep=False drops every id that has a duplicate)
singletons=df['household_id'].drop_duplicates(keep=False)

# keep those not in singleton filter we created
df=df[~df['household_id'].isin(singletons)]
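
The `drop_duplicates(keep=False)` idiom can be easy to misread, so the sketch below compares it on made-up data with a more explicit alternative that counts members per household. The toy frame is an assumption for illustration; both approaches should select the same rows.

```python
import pandas as pd

# Toy data: household 11 is a singleton
toy = pd.DataFrame({
    'household_id': [10, 10, 11, 12, 12, 12],
    'pr_name_gn':   ['Ann', 'Bob', 'Cal', 'Dee', 'Eli', 'Fay'],
})

# Approach from the cell above: ids that occur exactly once
singletons = toy['household_id'].drop_duplicates(keep=False)
via_isin = toy[~toy['household_id'].isin(singletons)]

# Alternative: attach the household size to each row and keep sizes > 1
household_size = toy.groupby('household_id')['household_id'].transform('size')
via_size = toy[household_size > 1]

same_result = via_isin.equals(via_size)
print(same_result)
```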

Finally, the master census should only contain nuclear family members. In the context of community reconstruction, we only care about the head of household, the spouse of the head, and the children of the head.

In [7]:
df=df[df['pr_relationship_code'].isin(['Head','Wife','Son','Daughter'])]

### 2. Get Attached Pids

Now that we have a cleaned census file to work with, we will find out how much of this census is already attached to pids on FamilySearch. We do this by running the GetPidFromArk method on the arks in our master census file.

This takes any set of arks and returns a crosswalk between attached arks and pids. Any arks from our master census that are not in this crosswalk are not on FamilySearch yet.
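
The "not in the crosswalk" relationship is an anti-join. A minimal sketch with made-up arks and pids is shown here; the toy values are assumptions, and only the `ark` and `pid` column names follow the crosswalk format this notebook produces.

```python
import pandas as pd

# Toy census and crosswalk; all ids are made up
census = pd.DataFrame({'ark1900': ['A-1', 'A-2', 'A-3', 'A-4']})
crosswalk = pd.DataFrame({'ark': ['A-2', 'A-4'], 'pid': ['P-2', 'P-4']})

# Left merge with an indicator column marks arks that found no pid
merged = census.merge(crosswalk, left_on='ark1900', right_on='ark',
                      how='left', indicator=True)
not_on_tree = merged.loc[merged['_merge'] == 'left_only', 'ark1900'].tolist()
print(not_on_tree)
```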



In [8]:
import sys
import os
sys.path.append('R:/JoePriceResearch/Python/all_code')
from FamilySearch1 import FamilySearch
# The step below is technically not necessary. You could just use your census file as the in file (as long as it's cleaned).
df[['ark1900']].to_csv('in.csv',index=False)
# replace the placeholder username and password with your own FamilySearch credentials
fs=FamilySearch('your_username','your_password',os.getcwd(),'in.csv','out.csv',auth=True)
ark_pid_cw=fs.GetPidFromArk(ark_col=0)

ark_pid_cw.head()

1800 of 13266
Average Time:	        0.0062 Seconds
Hours Remaining:	0.02 Hours

3600 of 13266
Average Time:	        0.0062 Seconds
Hours Remaining:	0.02 Hours

5400 of 13266
Average Time:	        0.0074 Seconds
Hours Remaining:	0.03 Hours

7200 of 13266
Average Time:	        0.0071 Seconds
Hours Remaining:	0.03 Hours

9000 of 13266
Average Time:	        0.0071 Seconds
Hours Remaining:	0.03 Hours

10800 of 13266
Average Time:	        0.0073 Seconds
Hours Remaining:	0.03 Hours

12600 of 13266
Average Time:	        0.007 Seconds
Hours Remaining:	0.02 Hours


PIDs Collected


Unnamed: 0,ark,pid
0,MQMC-14C,MCLX-HTN
1,MQMC-149,LZZ6-3HK
2,MQMC-179,GS2M-J13
3,MQMC-14V,LCJW-VRB
4,MQMC-14W,L6L8-G1B


In [9]:
print('on_tree: \t'+str(len(ark_pid_cw)))
print('on_census: \t'+str(len(df)))

on_tree: 	5832
on_census: 	13266


Looks like we have a FamilySearch coverage rate of about 44% in this county (5,832 of 13,266 people). We should now save our ark-pid crosswalk as a CSV in our working directory.
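
The coverage rate can be computed directly from the two counts printed above. This is just the arithmetic, with the counts copied from that output.

```python
on_tree = 5832     # rows in the ark-pid crosswalk (printed above)
on_census = 13266  # rows in the cleaned census (printed above)

# Share of census records already attached to a pid
coverage = on_tree / on_census
print(f'coverage rate: {coverage:.1%}')  # prints "coverage rate: 44.0%"
```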

It is important to determine which households have people on the tree, so let's merge the ark-pid crosswalk back with our original census file to get a household_id column.

In [10]:
ark_pid_cw.rename(columns={'ark':'ark1900'},inplace=True)
ark_pid_cw=ark_pid_cw.merge(df[['ark1900','household_id']],on='ark1900')

ark_pid_cw.to_csv('el_paso_ark_pid_cw.csv',index=False)

Running GetPidFromArk on our master census file essentially gives us progress updates on how our coverage rates are improving. This step should be repeated throughout the community reconstruction process as more and more people are added to the tree.

### 3. Create Island Census

An island census is the subset of people in our target census whose households contain no nuclear family members on FamilySearch. These are the households that we can later add to FamilySearch using our automated source linker. 

To create the island census, filter out everyone whose household_id is associated with an ark that is attached to a pid.

In [11]:
island_census=df[~df['household_id'].isin(ark_pid_cw['household_id'])]
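
Island households are one end of a spectrum: a household can have no members, some members, or all members on the tree. The sketch below labels each household accordingly on made-up data. The labels `island`, `partial`, and `covered` and the helper `label` are hypothetical names for illustration, not part of the pipeline.

```python
import pandas as pd

# Toy census and crosswalk; all ids are made up
census = pd.DataFrame({
    'ark1900':      ['A-1', 'A-2', 'A-3', 'A-4', 'A-5', 'A-6'],
    'household_id': [1, 1, 2, 2, 3, 3],
})
crosswalk = pd.DataFrame({'ark1900': ['A-1', 'A-2', 'A-3'],
                          'pid':     ['P-1', 'P-2', 'P-3']})

# Share of each household's members that are attached to a pid
census['on_tree'] = census['ark1900'].isin(crosswalk['ark1900'])
share_on_tree = census.groupby('household_id')['on_tree'].mean()

def label(share):
    """Hypothetical labels: no members, all members, or some members on the tree."""
    if share == 0:
        return 'island'
    if share == 1:
        return 'covered'
    return 'partial'

household_status = share_on_tree.map(label)
island_census = census[census['household_id'].map(household_status) == 'island']
print(household_status)
```

The `island` rows match what the `isin` filter above selects, while the `partial` households are the ones where tree-extending hints are relevant.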

### 4. Run Hint Cleaner
We will now run the hint_cleaner code on the ark-pid crosswalk we created. hint_cleaner.py takes any dataset with columns of arks and pids and returns a dataframe of classified pid hints from this crosswalk, filtered down by household. See the hint_cleaner documentation for more information on how this code works. This step is much more important for sub-censuses that already have good FamilySearch coverage.

Status Classification Information:
	Output for hint_cleaner function is a dataframe in the format 'ark | pid | status | url'
	The status column contains the following classifications stored as strings:
		
		complete:	Ark-pid hint has already been matched in the source linker
		
		duplicate:	Hint ark is matched to pid other than hint pid given

		tree-ext:	Hint that adds a new person to FamilySearch
		
		normal:		Everything that isn't classified as one of the three above. Usually
				a pid hint that adds a census source to an already existing pid in FamilySearch.
				May include tree-extending hints that were not successfully classified as such
				or other edge cases.
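
Once the hints are classified, the status column makes it easy to pull out the subset you need. The sketch below uses a made-up stand-in for the hint_cleaner output in the 'ark | pid | status | url' format described above. Every value in the toy frame, including the URLs, is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for hint_cleaner output; all values are made up
hints = pd.DataFrame({
    'ark':    ['A-1', 'A-2', 'A-3', 'A-4'],
    'pid':    ['P-1', 'P-2', 'P-3', 'P-4'],
    'status': ['complete', 'tree-ext', 'normal', 'tree-ext'],
    'url':    ['https://example.org/1', 'https://example.org/2',
               'https://example.org/3', 'https://example.org/4'],
})

# How many hints fall into each classification
status_counts = hints['status'].value_counts()

# URLs for the tree-extending hints, e.g. to hand to volunteers
tree_ext_urls = hints.loc[hints['status'] == 'tree-ext', 'url'].tolist()
print(status_counts)
print(tree_ext_urls)
```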

In [26]:
sys.path.append(r'V:\Python\community_reconstruction\hint_cleaner')
from hint_cleaner import hint_cleaner

# replace the placeholder username and password with your own FamilySearch credentials
cleaned_hints=hint_cleaner(ark_pid_cw[['ark1900','pid']],username='your_username',password='your_password')
cleaned_hints.head()

TypeError: __init__() got an unexpected keyword argument 'auth'