## Objective: Construct a family tree for persons of interest using Wikipedia 
This code takes a CSV file of names (in this instance, the world leaders list from the Open Sanctions website) and scrapes Wikipedia to identify relatives. <br> 
It outputs a new CSV file, 'wikipedia_family_tree.csv', with the following columns: 
<ul> 
<li> key (person of interest) </li>
<li> source (link to source wiki page)</li>
<li> relationship (first letter of the infobox label: 'Spouse', 'Partner', 'Children', 'Parent', 'Relatives', 'Family'; note that 'P' covers both 'Partner' and 'Parent')</li> 
<li> name (of relative) </li> 
<li> wiki_link (for relative, if available) </li> 
<li> notes (any additional detail) </li>
</ul>
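
Because only the first letter of the infobox label is stored, it can help to see which labels each letter stands for. The sketch below groups the labels by initial; `codes_by_initial` is a hypothetical helper and not part of the original code, and the label list is copied from `fam_list`.

```python
from collections import defaultdict

# Infobox labels used by the scraper (copied from fam_list)
fam_list = ['Spouse', 'Partner', 'Children', 'Parent', 'Relatives', 'Family']

def codes_by_initial(labels):
    """Group labels by their first letter (hypothetical helper)."""
    groups = defaultdict(list)
    for label in labels:
        groups[label[0]].append(label)
    return dict(groups)

codes = codes_by_initial(fam_list)
print(codes['S'])  # ['Spouse']
print(codes['P'])  # ['Partner', 'Parent']
```

A letter that maps to more than one label cannot be decoded from the output file alone, so storing the full label may be preferable if the distinction matters.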

TO DO:
<ul> 
<li>More robustly extract family information (outside rule-based methods?) </li> 
<li>Extend to second and third-order relationships and manage duplicates </li> 
<li>Model family networks e.g. using visualisations </li> 
<li>This code works for prominent people (e.g. world leaders) - assess whether it extends to less prominent people (whose wiki pages may be differently structured) </li> 
</ul> 
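
As a possible starting point for the duplicate-handling and network items above, the output rows could be deduplicated and turned into an adjacency mapping. This is a minimal sketch using only the standard library; `build_family_graph` and the sample rows are illustrative assumptions, with rows following the column order key, source, relationship, name, wiki_link, notes.

```python
from collections import defaultdict

def build_family_graph(rows):
    """Map each key to a sorted list of unique (relationship, name) pairs.

    Hypothetical helper: rows follow the column order
    key, source, relationship, name, wiki_link, notes.
    """
    graph = defaultdict(set)
    for key, _source, relationship, name, _wiki_link, _notes in rows:
        # Skip placeholder rows written when no page or no family was found
        if name == "N/A":
            continue
        graph[key].add((relationship, name))
    return {key: sorted(edges) for key, edges in graph.items()}

# Illustrative rows in the same shape as the output; the second is a duplicate
sample_rows = [
    ["moonjaein", "https://en.wikipedia.org/wiki/Moon_Jae-in", "S", "Kim Jung-sook", "/wiki/Kim_Jung-sook", ""],
    ["moonjaein", "https://en.wikipedia.org/wiki/Moon_Jae-in", "S", "Kim Jung-sook", "/wiki/Kim_Jung-sook", ""],
    ["benjaminnetanyahu", "https://en.wikipedia.org/wiki/Benjamin_Netanyahu", "S", "Sara Netanyahu", "/wiki/Sara_Netanyahu", ""],
    ["somekey", "N/A", "N/A", "N/A", "N/A", "no_wiki"],
]

graph = build_family_graph(sample_rows)
print(graph)
```

A mapping like this could later be passed to a graph library for visualisation; that step is left out here.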

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import json
import re

In [2]:
#Data source
key_url = 'http://ia802807.s3dns.us.archive.org/opensanctions/worldpresidentsdb.csv'
key_df = pd.read_csv(key_url)

In [3]:
#Construct ids for each name (Barack Obama will become barackobama).
key_df["id"] = key_df["name"].apply(lambda x: ''.join(x.lower().split()))

In [4]:
#Family data labels in wiki infobox. 
fam_list = ['Spouse', 'Partner', 'Children', 'Parent', 'Relatives','Family'] #https://en.wikipedia.org/wiki/Template:Infobox_person

In [15]:
def fam_tree(key_orig):
    #Search for most relevant wiki link (passing params lets requests URL-encode the search term)
    url = 'https://en.wikipedia.org/w/api.php'
    params = {'action': 'opensearch', 'search': key_orig, 'limit': 1, 'namespace': 0, 'format': 'json'}
    r = requests.get(url, params=params)
    json_data = r.json()
    try: 
        #opensearch returns [query, titles, descriptions, urls]; take the first url
        new_url = json_data[-1][0]
        #Scrape wiki page using BeautifulSoup
        res = requests.get(new_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        fam_details = []
        for item in soup.find_all("tr"):
            #Assume relevant data appears before 'Website', 'Military service', 'Signature'
            if item.text.startswith(('Website', 'Military service', 'Signature')):
                break
            if item.text.startswith(tuple(fam_list)):
                #First letter of the infobox label, e.g. 'S' for Spouse
                item_id = item.text[0]
                links = item.find_all('a')
                #Relatives with a link: record the link title and href
                for x in links:
                    fam_details.append([key_orig, new_url, item_id, x.get('title'), x.get('href'), ""])
                #List entries without a link: record the plain text
                for x in item.find_all('li'):
                    if not x.find_all('a'):
                        fam_details.append([key_orig, new_url, item_id, x.text, "N/A", ""])
                #Rows without any link: record the whole cell text
                if not links:
                    fam_details.append([key_orig, new_url, item_id, item.find('td').text, "N/A", ""])
        if len(fam_details) == 0: 
            return [[key_orig, new_url, "N/A", "N/A", "N/A", "no_family"]]
        return fam_details
    except Exception: 
        #Any failure (no search result, page request or parsing error) is recorded as 'no_wiki'
        return [[key_orig, "N/A", "N/A", "N/A", "N/A", "no_wiki"]]
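
The opensearch request takes the name as a query parameter, which needs URL-encoding when it contains spaces or characters such as `&`. The sketch below shows the encoded URL for a given name using only the standard library; `build_search_url` is a hypothetical helper that constructs the URL without calling the API.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

def build_search_url(name):
    """Return an opensearch URL for name (hypothetical helper; no request is made)."""
    query = {
        'action': 'opensearch',
        'search': name,
        'limit': 1,
        'namespace': 0,
        'format': 'json',
    }
    # urlencode escapes spaces, '&' and other reserved characters
    return API_ENDPOINT + '?' + urlencode(query)

search_url = build_search_url('Moon Jae-in')
print(search_url)
# https://en.wikipedia.org/w/api.php?action=opensearch&search=Moon+Jae-in&limit=1&namespace=0&format=json
```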

In [16]:
final_fam = []
for name in key_df['id']: 
    final_fam.extend(fam_tree(name))

In [17]:
final_df = pd.DataFrame(final_fam, columns = ["key", "source", "relationship", "name", "wiki_link", "notes"])
final_df = final_df.astype(str)
final_df.head()

Unnamed: 0,key,source,relationship,name,wiki_link,notes
0,dorisleuthard,https://en.wikipedia.org/wiki/Doris_Leuthard,S,Roland Hausin,,
1,moonjaein,https://en.wikipedia.org/wiki/Moon_Jae-in,S,Kim Jung-sook,/wiki/Kim_Jung-sook,
2,moonjaein,https://en.wikipedia.org/wiki/Moon_Jae-in,C,2,,
3,benjaminnetanyahu,https://en.wikipedia.org/wiki/Benjamin_Netanyahu,S,Sara Netanyahu,/wiki/Sara_Netanyahu,
4,benjaminnetanyahu,https://en.wikipedia.org/wiki/Benjamin_Netanyahu,S,Miriam Weizmann(m. 1972; div. 1978),,


In [18]:
#Data cleaning 
##Move all information in brackets to 'notes' column
final_df["notes"] = final_df["name"].apply(lambda x: x.split("(")[1].replace(")", "") if len(x.split("("))>1 else '')
final_df["name"] = final_df["name"].apply(lambda x: x.split("(")[0])

##Replace numbers in 'name' column with 'Unknown'. Move numbers to 'notes'
## Children - 'name: 3' will now read 'name: Unknown, notes: 3' 
final_df.loc[final_df["name"].str.isdigit(), 'notes'] = final_df["name"] 
final_df.loc[final_df["name"].str.isdigit(), 'name'] = "Unknown"
final_df.head()

Unnamed: 0,key,source,relationship,name,wiki_link,notes
0,dorisleuthard,https://en.wikipedia.org/wiki/Doris_Leuthard,S,Roland Hausin,,
1,moonjaein,https://en.wikipedia.org/wiki/Moon_Jae-in,S,Kim Jung-sook,/wiki/Kim_Jung-sook,
2,moonjaein,https://en.wikipedia.org/wiki/Moon_Jae-in,C,Unknown,,2
3,benjaminnetanyahu,https://en.wikipedia.org/wiki/Benjamin_Netanyahu,S,Sara Netanyahu,/wiki/Sara_Netanyahu,
4,benjaminnetanyahu,https://en.wikipedia.org/wiki/Benjamin_Netanyahu,S,Miriam Weizmann,,m. 1972; div. 1978
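
The two cleaning rules can also be checked on single values without a DataFrame. The sketch below mirrors them in plain Python; `split_name_notes` is a hypothetical helper, and stripping surrounding whitespace is an added assumption rather than something the original code does.

```python
def split_name_notes(raw):
    """Split a scraped name into (name, notes) following the cleaning rules above.

    Hypothetical helper: text after the first '(' goes to notes, and a purely
    numeric name (a child count) becomes 'Unknown' with the number in notes.
    """
    parts = raw.split("(")
    name = parts[0].strip()  # assumption: trim whitespace left before the bracket
    notes = parts[1].replace(")", "").strip() if len(parts) > 1 else ""
    if name.isdigit():
        return "Unknown", name
    return name, notes

print(split_name_notes("Miriam Weizmann(m. 1972; div. 1978)"))  # ('Miriam Weizmann', 'm. 1972; div. 1978')
print(split_name_notes("2"))                                    # ('Unknown', '2')
print(split_name_notes("Sara Netanyahu"))                       # ('Sara Netanyahu', '')
```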


In [19]:
#Export data to csv 
final_df.to_csv("wikipedia_family_tree.csv", index = False)