Eva Sibinga
Assignment 6 - Open assignment 

My goal here is to create a CSV from genealogical data I have in a GEDCOM file, the export format available from Ancestry.com. GEDCOM is a plain-text format, but its nested, line-based records are meant for genealogy software and won't load into d3.js directly, so I need to get this data into another file format. Ideally I'd like the data to be a JSON file so that I can use it for hierarchical data viz, but I've had trouble converting directly from GEDCOM to JSON. This notebook will instead convert from GEDCOM to CSV using a GEDCOM parsing package (python-gedcom) and pandas. As long as I have "parent" and "child" fields, I can use d3's hierarchy tools to visualize the data.

In [27]:
pip install python-gedcom

Note: you may need to restart the kernel to use updated packages.


In [5]:
# import the gedcom classes needed (the package itself was installed above)
from gedcom.element.individual import IndividualElement
from gedcom.parser import Parser

In [6]:
# check my working directory, change if needed
import os
os.getcwd()

'/Users/evasibinga'

In [7]:
os.chdir('./IDV-exploratory-viz/')

In [8]:
# import pandas for dataframes
import pandas as pd

In [9]:
# we're in the right directory so ready to get started
os.getcwd()

'/Users/evasibinga/IDV-exploratory-viz'

In [28]:
# create empty dataframe 
## I didn't end up using this but it was a helpful starting place to envision the columns I do and don't need

# col_names = ['INDI', 'first', 'last', 'full_name', 'family', 'sex', 'DOB', 'DOD', 'birthplace', 'deathplace', 'generation', 'level', 'parent_id' ]
# my_df = pd.DataFrame(columns=col_names)

In [11]:
# my_df

Unnamed: 0,INDI,first,last,full_name,family,sex,DOB,DOD,birthplace,deathplace,generation,level,parent_id


In [10]:
#file path to my gedcom file
file_path = 'data/JGB_MHB.ged'
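
The relative path above only resolves after the os.chdir call, so running cells out of order can break it. A minimal sketch of an alternative, assuming the project folder sits in the home directory (as the getcwd() output suggests); `project_dir` and `gedcom_path` are illustrative names, not something the parser requires:

```python
from pathlib import Path

# Assumption: the project folder lives directly under the home directory
project_dir = Path.home() / "IDV-exploratory-viz"
gedcom_path = project_dir / "data" / "JGB_MHB.ged"

# Check before parsing so a wrong path shows up with a clear message
if gedcom_path.is_file():
    print(f"Found {gedcom_path}")
else:
    print(f"No GEDCOM file at {gedcom_path}")
```

Passing `str(gedcom_path)` to `parse_file` would then work regardless of the current working directory.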

In [11]:
#initialize gedcom parser
gedcom_parser = Parser()

In [12]:
gedcom_parser.parse_file(file_path)

In [13]:
root_child_elements = gedcom_parser.get_root_child_elements()

In [29]:
# check out the data generated by an unfamiliar package / parser
type(root_child_elements)

list

In [25]:
root_child_elements[0:5]

[<gedcom.element.element.Element at 0x7fbc216ea3a0>,
 <gedcom.element.individual.IndividualElement at 0x7fbc22c92c40>,
 <gedcom.element.individual.IndividualElement at 0x7fbc22cb6fa0>,
 <gedcom.element.individual.IndividualElement at 0x7fbc22cc8e20>,
 <gedcom.element.individual.IndividualElement at 0x7fbc22cd1f70>]

In [16]:
# the parser keeps gedcom-specific data types
type(root_child_elements[1])

gedcom.element.individual.IndividualElement

In [30]:
# make an empty list to collect data
data = []
# iterate through all root child elements (i.e. everyone in the family tree)
for element in root_child_elements:

    # Is the `element` an actual `IndividualElement`? (Allows usage of extra functions such as `surname_match` and `get_name`.)
    if isinstance(element, IndividualElement):
        if element.is_individual:
            # define the fields I want in my final CSV
            INDI = element.get_pointer() # INDI is a unique ID for each individual on the tree (helpful for repeat names)
            (first, last) = element.get_name()
            sex = element.get_gender()
            DOB = element.get_birth_data()  ## returns a tuple of (date, place, sources), so below I just take the first item for the birthdate
            DOD = element.get_death_data()  ## ditto: the first item is the death date
            parents = gedcom_parser.get_parents(element)
            # reset the parent fields for every person, so someone with no recorded parents doesn't inherit the previous person's values
            INDI_mother = first_1 = last_1 = None
            INDI_father = first_2 = last_2 = None
            for parent in parents:    # this is extremely heteronormative, but the most obvious way to split parents into 2 elements using the constraints of the GEDCOM file
                # use a separate loop variable so the outer `element` isn't overwritten
                if parent.get_gender() == 'F':
                    (first_1, last_1) = parent.get_name()
                    INDI_mother = parent.get_pointer()
                else: # any parent not marked 'F' is treated as the father
                    (first_2, last_2) = parent.get_name()
                    INDI_father = parent.get_pointer()

            # append all variables to my empty list to make it a big ol' list of data
            data.append([first, last, sex, DOB[0], DOD[0], INDI, parents, INDI_mother, first_1, last_1, INDI_father, first_2, last_2])
            
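            # FYI for hierarchy - get_level() returned 0 for everybody (confirmation of D3.js suspicion). That's the GEDCOM line level, and every individual record starts at level 0, so it isn't a generation number.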
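
The level-0 result makes more sense once you look at how GEDCOM lines are laid out: each line starts with a level number, and every top-level record (like an individual) starts at 0, with its details nested at 1, 2, and so on. Here is a rough sketch of splitting such lines by hand. `split_gedcom_line` is a hypothetical helper, not part of python-gedcom, and the sample lines are illustrative ones built from values in this notebook rather than copied from the actual file:

```python
def split_gedcom_line(line):
    """Split one GEDCOM line into (level, pointer, tag, value)."""
    tokens = line.strip().split(" ")
    level = int(tokens.pop(0))
    # A pointer such as @P1@ only appears on lines that start a record
    pointer = tokens.pop(0) if tokens[0].startswith("@") else None
    tag = tokens.pop(0)
    value = " ".join(tokens)
    return level, pointer, tag, value

# Illustrative record (assumed layout, not the real file contents)
sample = [
    "0 @P1@ INDI",
    "1 NAME James Garrett /Biddle/",
    "1 SEX M",
    "1 BIRT",
    "2 DATE 13 Oct 1868",
]

for line in sample:
    print(split_gedcom_line(line))
```

So the level describes nesting inside one record, not where a person sits in the family tree; the generations have to be worked out from the parent pointers.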

In [31]:
# check out the data
data[0:3]

[['James Garrett',
  'Biddle',
  'M',
  '13 Oct 1868',
  '21 Dec 1947',
  '@P1@',
  [<gedcom.element.individual.IndividualElement at 0x7fbc22cd1f70>,
   <gedcom.element.individual.IndividualElement at 0x7fbc22cc8e20>],
  '@P3@',
  'Mary',
  'Hewes',
  '@P4@',
  'John William',
  'Biddle'],
 ['Mary',
  'Hutton',
  'F',
  '11 Sep 1869',
  '17 Oct 1925',
  '@P2@',
  [<gedcom.element.individual.IndividualElement at 0x7fbc22d81820>,
   <gedcom.element.individual.IndividualElement at 0x7fbc22d6bdc0>],
  '@P15@',
  'Rebecca',
  'Savery',
  '@P16@',
  'Addison',
  'Hutton'],
 ['Mary',
  'Hewes',
  'F',
  '26 Oct 1842',
  '25 May 1874',
  '@P3@',
  [<gedcom.element.individual.IndividualElement at 0x7fbc22d95730>,
   <gedcom.element.individual.IndividualElement at 0x7fbc22d954c0>],
  '@P17@',
  'Sarah S',
  'Garrett',
  '@P18@',
  'Edward C',
  'Hewes']]
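
The birth and death dates come through as GEDCOM-style strings like '13 Oct 1868'. If sortable dates are needed for the viz later, something like the sketch below could convert them. `gedcom_date_to_iso` is a hypothetical helper, not part of python-gedcom, and it assumes full day-month-year dates with English month abbreviations. GEDCOM files can also hold partial or approximate dates (a bare year, for example), and this sketch returns None for those rather than guessing:

```python
from datetime import datetime

def gedcom_date_to_iso(text):
    """Convert a full 'DD Mon YYYY' date to 'YYYY-MM-DD'; return None otherwise."""
    try:
        parsed = datetime.strptime(text.strip(), "%d %b %Y")
    except ValueError:
        # Partial, approximate, or empty dates are left unconverted
        return None
    return parsed.date().isoformat()

for raw in ["13 Oct 1868", "21 Dec 1947", "1850", ""]:
    print(repr(raw), "->", gedcom_date_to_iso(raw))
```

Keeping the original string in its own column alongside the converted one would avoid losing the partial dates.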

In [22]:
# explicitly name columns and use pandas to turn the list into a dataframe
my_df = pd.DataFrame(data, columns=['first', 'last', 'sex', 'DOB', 'DOD', 'INDI_ID', 'parents', 'INDI_mother', 'first_1', 'last_1','INDI_father', 'first_2', 'last_2'])

In [23]:
# check out the dataframe
my_df[0:10]

Unnamed: 0,first,last,sex,DOB,DOD,INDI_ID,parents,INDI_mother,first_1,last_1,INDI_father,first_2,last_2
0,James Garrett,Biddle,M,13 Oct 1868,21 Dec 1947,@P1@,"[0 @P4@ INDI\r\n, 0 @P3@ INDI\r\n]",@P3@,Mary,Hewes,@P4@,John William,Biddle
1,Mary,Hutton,F,11 Sep 1869,17 Oct 1925,@P2@,"[0 @P16@ INDI\r\n, 0 @P15@ INDI\r\n]",@P15@,Rebecca,Savery,@P16@,Addison,Hutton
2,Mary,Hewes,F,26 Oct 1842,25 May 1874,@P3@,"[0 @P18@ INDI\r\n, 0 @P17@ INDI\r\n]",@P17@,Sarah S,Garrett,@P18@,Edward C,Hewes
3,John William,Biddle,M,02 Aug 1835,02 Jun 1916,@P4@,"[0 @P12@ INDI\r\n, 0 @P13@ INDI\r\n]",@P13@,Elizabeth Cesson,Biddle,@P12@,William,Biddle
4,Addison Hutton,Biddle,M,11 Dec 1903,03 Mar 1912,@P5@,"[0 @P1@ INDI\r\n, 0 @P2@ INDI\r\n]",@P2@,Mary,Hutton,@P1@,James Garrett,Biddle
5,Ruth,Biddle,F,06 Nov 1906,29 Jan 1969,@P6@,"[0 @P1@ INDI\r\n, 0 @P2@ INDI\r\n]",@P2@,Mary,Hutton,@P1@,James Garrett,Biddle
6,Rebecca Hutton,Biddle,F,08 Jun 1901,04 Mar 1991,@P7@,"[0 @P1@ INDI\r\n, 0 @P2@ INDI\r\n]",@P2@,Mary,Hutton,@P1@,James Garrett,Biddle
7,Elizabeth Rebecca,Biddle,F,22 Mar 1897,25 Mar 1975,@P8@,"[0 @P1@ INDI\r\n, 0 @P2@ INDI\r\n]",@P2@,Mary,Hutton,@P1@,James Garrett,Biddle
8,Dorothy,Biddle,F,25 Jan 1900,15 Feb 1985,@P9@,"[0 @P1@ INDI\r\n, 0 @P2@ INDI\r\n]",@P2@,Mary,Hutton,@P1@,James Garrett,Biddle
9,Mary Hewes,Biddle,F,10 Aug 1898,28 Dec 1963,@P10@,"[0 @P1@ INDI\r\n, 0 @P2@ INDI\r\n]",@P2@,Mary,Hutton,@P1@,James Garrett,Biddle
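
The dataframe is one row per person with mother and father side by side. For tools that expect plain "parent" and "child" fields, a long format with one row per parent-child link may be easier to work with. A minimal sketch using only the standard library; the three people are copied from the preview above, and `parent_child_links` is a hypothetical helper name:

```python
import csv
import io

# A few rows from the dataframe preview, reduced to the ID columns
people = [
    {"INDI_ID": "@P1@", "INDI_mother": "@P3@", "INDI_father": "@P4@"},
    {"INDI_ID": "@P2@", "INDI_mother": "@P15@", "INDI_father": "@P16@"},
    {"INDI_ID": "@P5@", "INDI_mother": "@P2@", "INDI_father": "@P1@"},
]

def parent_child_links(people):
    """Return one dict per known parent-child link."""
    links = []
    for person in people:
        for column, role in (("INDI_mother", "mother"), ("INDI_father", "father")):
            parent_id = person.get(column)
            if parent_id:  # skip parents that are missing from the record
                links.append({"child": person["INDI_ID"], "parent": parent_id, "role": role})
    return links

links = parent_child_links(people)

# Write to an in-memory buffer here; a real file handle would work the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["child", "parent", "role"])
writer.writeheader()
writer.writerows(links)
print(buffer.getvalue())
```

Each person shows up twice as a child here (once per parent), which is worth keeping in mind because strict tree layouts expect a single parent per node.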


In [342]:
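# it works! export as a csv file
# drop the 'parents' column (it only holds parser objects) and skip the row index
my_df.drop(columns=['parents']).to_csv('JGB_MHB.csv', index=False)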
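
Since the eventual goal is JSON for hierarchical viz, here is a rough sketch of how the parent IDs could be nested into an ancestor tree for one person. It assumes the nested nodes use a `children` key (the default that d3.hierarchy reads), with each person's parents listed as their "children" so the tree fans out backwards in time. `build_pedigree` is a hypothetical helper, the rows are copied from the dataframe preview, and the sketch assumes the data has no loops in it:

```python
import json

# (id, name, mother_id, father_id) rows taken from the dataframe preview
rows = [
    ("@P5@", "Addison Hutton Biddle", "@P2@", "@P1@"),
    ("@P1@", "James Garrett Biddle", "@P3@", "@P4@"),
    ("@P2@", "Mary Hutton", "@P15@", "@P16@"),
    ("@P3@", "Mary Hewes", "@P17@", "@P18@"),
    ("@P4@", "John William Biddle", "@P13@", "@P12@"),
]
people = {pid: {"name": name, "parents": [mother, father]}
          for pid, name, mother, father in rows}

def build_pedigree(person_id, people):
    """Return a nested dict of a person's known ancestors."""
    person = people.get(person_id)
    if person is None:
        # This ID is outside the sample: keep it as a leaf and stop recursing
        return {"id": person_id, "name": None, "children": []}
    return {
        "id": person_id,
        "name": person["name"],
        "children": [build_pedigree(p, people) for p in person["parents"] if p],
    }

pedigree = build_pedigree("@P5@", people)
print(json.dumps(pedigree, indent=2))
```

Going the other direction (descendants of one couple) would need a choice about which parent each child hangs under, since a strict tree allows only one.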