# Developing a graph representation of name relationships

On 23andme.com, you can get information about DNA relatives who have also signed up for 23andMe. 23andMe provides a CSV file listing all relatives above some threshold of genetic similarity. I have 838 "DNA relatives" with whom I share at least 0.14% of my DNA (at least according to 23andMe).


In [1]:
import pandas as pd

# Load the 23andMe relatives export and keep only rows that list surnames
my_rel = pd.read_csv('relative_finder_ME.csv')
rows = my_rel['Family Surnames'].dropna()
print(rows.head())

# My own family surnames; omitting smith because it is way too common
my_fam_names = ['horstman', 'nussbeck', 'lamont', 'boyd', 'harris']

0    Sahlstrom, Horstman, Horstmann, Astrom, Hass, ...
1    wolf, hoff, hartzell, spangler, hamilton, boyd...
2    Tucker, Yowell, Schroeder, Loudermilk, Elliott...
3    Barnes, Baker, Belknap, Bemis, Blakeslee, Carl...
5                                    york, duffy, York
Name: Family Surnames, dtype: object


People can voluntarily provide genealogical information, such as family surnames. Unfortunately, this data is quite messy, so I first clean it up a bit. For now I'll ignore complications such as spelling differences, actual family relationships, and the fact that two people with the same surname may be unrelated.
**I'll define two people (rows in the above df) as neighbors if they share a surname from their 'family surnames' list.**  This will define a graph.  In this project, I want to see who I am "connected" to in this graph.  The data is small enough that I can represent the graph with a 


# Data Cleanup

In [24]:
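import re

def _stripstr(s):
    """Lowercase s and replace punctuation and filler tokens with spaces.

    Filler tokens include the particles du/de/di, single-letter words,
    'paternal', 'maternal', 'father=' and 'jr'.
    """
    s2 = re.sub(r',|\(|\)|&|father=|/|-|[\s,]du\s|[\s,]de\s|[\s,]di\s|^\w\s|\s\w\s|paternal|maternal|\sjr[\s\.,]|jr$', r' ', s.lower())
    return s2

def _stripword(s):
    """Remove whitespace, punctuation and underscores from a word.

    For example, _stripword('Hi there!') == 'Hithere'
    """
    return re.sub(r'[\s\W_]+', '', s)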

## To do: check that this isn't mangling Swedish letters (e.g. å, ä, ö)
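
One way to run that check is sketched below. It assumes Python 3 (or unicode strings), where `\w` follows Unicode rules and treats letters such as å and ö as word characters; with Python 2 byte strings, `\W` would strip them. The helper name `strip_word_unicode` and the NFC normalization step are assumptions added for illustration, not part of the cleaning code above.

```python
import re
import unicodedata

def strip_word_unicode(s):
    """Hypothetical variant of _stripword that keeps accented letters."""
    # Compose decomposed characters (e.g. 'A' + combining ring) into single
    # letters; otherwise the combining mark is removed as a non-word character.
    s = unicodedata.normalize('NFC', s)
    # re.UNICODE makes \W follow Unicode rules (the default for str in Python 3).
    return re.sub(r'[\s\W_]+', '', s, flags=re.UNICODE)

# Sample names: precomposed, hyphenated, and decomposed forms
for name in [u'Åström', u'Sjöberg-Lind', u'A\u030astro\u0308m']:
    print(strip_word_unicode(name))
```

If the accented letters survive in the printed names, the word-level cleaning is safe for Swedish surnames under these assumptions.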

In [47]:
# test the text cleaning
s = '(Fall), Paternal & s  Miller-Rhodes, Rhode Jr, Seal Jr. (SeJrale), di Stefano Jr'
for word in _stripstr(s).split():
    print(_stripword(word))

fall
miller
rhodes
rhode
seal
sejrale
stefano


# Setting up the NameGraph
(Implemented as a dict of lists.)
Each shared family surname represents a possible edge between two "rows."
Edges are stored in a set so that each edge is recorded only once.
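
As a toy illustration of this idea, the sketch below builds the edge set for three made-up rows. The data and the surname-to-rows index are assumptions for the example; it is one possible way to find shared names, not necessarily the approach used in the next cell.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cleaned surname lists, keyed by row index (made-up data)
toy_rows = {
    0: ['york', 'duffy', 'york'],
    1: ['allen', 'york', 'clark'],
    2: ['clark', 'york'],
}

# Index: surname -> set of rows that list it
rows_by_name = defaultdict(set)
for idx, names in toy_rows.items():
    for name in names:
        rows_by_name[name].add(idx)

# Every pair of rows sharing a surname becomes an edge, stored in both
# directions; the set drops repeats (rows 1 and 2 share two surnames).
toy_edges = set()
for idx_set in rows_by_name.values():
    for a, b in combinations(sorted(idx_set), 2):
        toy_edges.add((a, b))
        toy_edges.add((b, a))

print(sorted(toy_edges))
```

Rows 1 and 2 share both 'york' and 'clark', yet the pair appears only once per direction in `toy_edges`, which is the point of using a set.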


In [43]:
from collections import defaultdict, Counter

namelist_dict = defaultdict(list)  # surname -> rows seen so far that list it
namecount_dict = Counter()         # surname -> number of occurrences
edges = set()

ncomp = 0  # number of surname tokens processed
for (idx, v) in rows.items():
    for word in _stripstr(v).split():
        ncomp += 1
        key = _stripword(word)
        namecount_dict[key] += 1
        if namecount_dict[key] > 1:
            # connect this row to every earlier row listing the same surname
            for otheridx in namelist_dict[key]:
                if otheridx != idx:
                    edges.add((otheridx, idx))
                    edges.add((idx, otheridx))
        namelist_dict[key].append(idx)

# should be of order V+E
print(ncomp)
print(len(edges) + len(rows))



2939
3216


In [28]:
NameGraph = defaultdict(list)
for (a,b) in edges:
    NameGraph[a].append(b)

In [48]:
print(NameGraph[5])
print(rows.loc[5])
print(rows.loc[533])


[533]
york, duffy, York
Allen, York, Bolstridge, Rand, Pelkey, Mahan, Clark, Burgess
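
To see which rows a given row is "connected" to, one option is a breadth-first search over the dict-of-lists graph. The sketch below assumes a graph shaped like `NameGraph`; the function name `connected_rows` and the toy graph are hypothetical and only for illustration.

```python
from collections import deque

def connected_rows(graph, start):
    """Return the set of nodes reachable from start (including start)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # .get avoids inserting new keys when graph is a defaultdict
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Made-up graph in the same shape as NameGraph: row index -> neighboring rows
toy_graph = {5: [533], 533: [5, 12], 12: [533], 7: [8], 8: [7]}
print(sorted(connected_rows(toy_graph, 5)))
```

With the real data, the same function could be called as `connected_rows(NameGraph, 5)`.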


2181