## SNAP Facebook graph data

### Untarring the tar file

This only needs to be done the first time you use this notebook, to extract the files from the archive downloaded from SNAP.

In [2]:
import tarfile
import os.path

def py_files(members,extension):
    for tarinfo in members:
        if os.path.splitext(tarinfo.name)[1] == extension:
            yield tarinfo

# If you get an IOError, it's because you haven't placed the
# facebook tar file in the same directory as the notebook.
tar = tarfile.open("facebook.tar.gz")
# To untar just one type of file
#tar.extractall(members=py_files(tar,extension=".edges"))
tar.extractall()
tar.close()
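
Since extraction only needs to happen once, the cell above can be made safe to re-run. The following is a minimal sketch, not part of the original notebook: `extract_once` and `members_with_extension` are hypothetical helper names, and the demo builds a small throwaway archive in a temporary directory rather than touching the real `facebook.tar.gz`.

```python
import os
import tarfile
import tempfile

def members_with_extension(tar, extension):
    """Yield only the archive members whose file name ends in `extension`."""
    for tarinfo in tar:
        if os.path.splitext(tarinfo.name)[1] == extension:
            yield tarinfo

def extract_once(archive_path, dest_dir, subdir, extension=None):
    """Extract the archive unless dest_dir/subdir already exists.

    Returns True if files were extracted, False if extraction was skipped.
    """
    if os.path.isdir(os.path.join(dest_dir, subdir)):
        return False
    with tarfile.open(archive_path) as tar:
        members = None  # None means "extract everything"
        if extension is not None:
            members = members_with_extension(tar, extension)
        tar.extractall(path=dest_dir, members=members)
    return True

# Demo on a throwaway archive (all file names here are invented).
work = tempfile.mkdtemp()
src = os.path.join(work, 'src', 'facebook')
os.makedirs(src)
for name in ('0.edges', '0.feat'):
    with open(os.path.join(src, name), 'w') as fh:
        fh.write('1 2\n')
archive = os.path.join(work, 'demo.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    tar.add(src, arcname='facebook')

out = os.path.join(work, 'out')
first = extract_once(archive, out, 'facebook', extension='.edges')
second = extract_once(archive, out, 'facebook', extension='.edges')
extracted = sorted(os.listdir(os.path.join(out, 'facebook')))
print('{0} {1} {2}'.format(first, second, extracted))
```

The second call returns `False` because the `facebook` directory already exists, so re-running the cell does no further work.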

### Reading in the edges of the ego network

From here on, the cells need to be re-executed each time you run the notebook. The first one builds the graph of ego's friends, known as the ego network.

In [3]:
import networkx as nx
# Read in edges of graph, treating node id as ints (otherwise they'd be strings)
# Change the value of egoid to look at a different ego.  There are 10 ego graphs
# in the data set.
egoid = 0
G = nx.read_edgelist(os.path.join('facebook','{0}.edges'.format(egoid)),nodetype=int)

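The edge list for this egoid mentions 333 of ego's friends, as the next cell shows. Friends with no connections to other friends do not appear in an edge list; they are added further below, when the node features are read. Note that there is never a node for ego in this graph. If there were, ego would just be a node connected to all the others.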

In [4]:
len(G.nodes())

333

### Adding data about ego and nodes

In [5]:

from collections import defaultdict

def read_featnames_file (ego_id):
    """
    Each feature index in the SNAP feature system represents a feature,value pair.
    For example, the feature index 24 might represent the value 'Harvard' for
    the 'education;school' feature.  For each node, the feature is either on
    or off.  In the ego graph for ego_id 0, Features 24-52 all represent possible values for
    the 'education;school' feature. For most individuals only one of the features in that range
    will be on. We're using numbers so we don't know which feature values
    represent which actual schools. Similarly features 77 and 78 represent the two values 
    for the gender feature, but we don't know which represents male and which female.
    Using integers **anonymizes** the feature values, so we can't use the cluster of features
    belonging to an individual in a network to identify them.
    
    Return a decoding list and a feature dict. The decoding list maps from a SNAP feature id
    to a feature name: `decoding_list[i]` is the name of the feature for which feature
    code `i` defines a value. So `decoding_list[77]` and `decoding_list[78]` are both 'gender'.
    The keys of the feature dict are feature names. For each feature name `f`,
    `feat_dict[f]` gives the list of feature ids that represent values for that feature,
    so for the ego network for egoid 0, `feat_dict['gender']` is [77,78].
    """
    global decoding_dict,feat_dict,feats0
    with open(os.path.join('facebook','{0:d}.featnames'.format(ego_id))) as fh:
        feats = fh.readlines()
    decoding_list,feat_dict = [],defaultdict(list)
    feats0 = [l.strip().split() for l in feats]
    decoding_list =  [';'.join(featname.split(';')[:-1])
                      for (local_index,featname,_,global_index) in feats0]
    for (index,featname) in enumerate(decoding_list):
        feat_dict[featname].append(index)
    return (decoding_list, feat_dict)

egoid = 0
(decoding_list,feat_dict) = read_featnames_file(egoid)
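
To make the two return values concrete, here is a small sketch of the same parsing step applied to in-memory lines instead of a file. The sample lines are invented to mimic the layout the function expects (index, `name;anonymized`, the word `feature`, a global index), and `parse_featnames` is a hypothetical stand-in for `read_featnames_file`.

```python
from collections import defaultdict

def parse_featnames(lines):
    """Build (decoding_list, feat_dict) from featnames-style lines."""
    decoding_list = []
    feat_dict = defaultdict(list)
    for index, line in enumerate(lines):
        fields = line.strip().split()
        # fields[1] looks like 'education;school;id;anonymized';
        # dropping the last ';'-separated piece leaves the feature name.
        name = ';'.join(fields[1].split(';')[:-1])
        decoding_list.append(name)
        feat_dict[name].append(index)
    return decoding_list, feat_dict

# Invented sample lines, for illustration only.
sample_lines = [
    "0 birthday;anonymized feature 0",
    "1 education;school;id;anonymized feature 50",
    "2 education;school;id;anonymized feature 51",
    "3 gender;anonymized feature 77",
    "4 gender;anonymized feature 78",
]
sample_decoding, sample_feats = parse_featnames(sample_lines)
print(sample_decoding)
print(dict(sample_feats))
```

Here `sample_decoding[1]` and `sample_decoding[2]` are both `'education;school;id'`, and `sample_feats['gender']` is `[3, 4]`: one list entry per possible value of the feature.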
    
    
    

Over 200 feature values were found. Each is there because a user decided to include a
particular kind of information (such as high school attended) in their profile.
Bear in mind that many Facebook users provide very little information about themselves,
so that most features have no value for most users.  For example, the graph for egoid 0
includes several individuals about whom we know nothing but their gender.

How many total feature values are there, combining the values from all features?

In [27]:
len(decoding_list)

224

How many features are there? What are they and how many values does each have?

In [28]:
sorted(feat_dict.keys())

['birthday',
 'education;classes;id',
 'education;concentration;id',
 'education;degree;id',
 'education;school;id',
 'education;type',
 'education;with;id',
 'education;year;id',
 'first_name',
 'gender',
 'hometown;id',
 'languages;id',
 'last_name',
 'locale',
 'location;id',
 'work;employer;id',
 'work;end_date',
 'work;location;id',
 'work;position;id',
 'work;start_date',
 'work;with;id']

In [6]:
from collections import Counter
value_ctr = Counter(decoding_list)
print 'There are {0} features'.format(len(feat_dict))
print
ctr = 0
for k,v in sorted(feat_dict.items()):
    print '{0:27s}  {1:>2d} values'.format(k+':',value_ctr[k])
    ctr += len(v)

print '-' * 45
print '{0:>{width}} values'.format(ctr,width=5 + max(len(f) for f in feat_dict.keys()))

There are 21 features

birthday:                     8 values
education;classes;id:         5 values
education;concentration;id:   7 values
education;degree;id:          4 values
education;school;id:         29 values
education;type:               3 values
education;with;id:            1 values
education;year;id:           16 values
first_name:                   4 values
gender:                       2 values
hometown;id:                 11 values
languages;id:                14 values
last_name:                   21 values
locale:                       3 values
location;id:                 12 values
work;employer;id:            20 values
work;end_date:               16 values
work;location;id:            12 values
work;position;id:            13 values
work;start_date:             22 values
work;with;id:                 1 values
---------------------------------------------
                            224 values


In [7]:
def add_node_properties_to_graph(G,ego_id,decoding_list):
    global featlist
    with open(os.path.join('facebook','{0:d}.feat'.format(ego_id))) as fh:
        featlist = [[int(x) for x in line.strip().split()] for line in fh.readlines()]
    for atts in featlist:
        # Each row is a node id followed by one 0/1 flag per feature value
        node,feats = atts[0],atts[1:]
        if node not in G:
            # Friends with no edges to other friends are missing from the
            # edge list, so add them here as unconnected nodes.
            G.add_node(node)
        add_feats_to_feat_dict(feats, G.node[node], decoding_list)

def add_feats_to_feat_dict (feats, feat_dict, decoding_list):
    """
    We do not assume features are single-valued, i.e., that each person has only
    one highest degree, one school attended, one gender.

    For example, the feature `languages;id` may have multiple values.
    """
    for (feat_index,val) in enumerate(feats):
        feat = decoding_list[feat_index]
        if val:
            if feat in feat_dict:
                feat_dict[feat] += (feat_index,)
            else:
                feat_dict[feat] = (feat_index,)
  
    
def read_ego_features(ego_id, decoding_list):
    """
    Return a feat dict for ego just like the feat_dicts found in G.node,
    except this one won't belong to a node in the graph.  Useful for comparing
    features of ego to features of ego's friends.
    """
    with open(os.path.join('facebook','{0:d}.egofeat'.format(ego_id))) as fh:
        featlist = [int(x) for x in fh.readline().strip().split()]
    ego_feat_dict = {}
    add_feats_to_feat_dict(featlist, ego_feat_dict, decoding_list)
    return ego_feat_dict

def add_circles_to_graph(G,ego_id,decoding_list):
    with open(os.path.join('facebook','{0:d}.circles'.format(ego_id))) as fh:
        circlelist0 = [line.strip().split() for line in fh.readlines()]
    #circles = [circ[0] for circ in circlelist]
    print '{0} circles found!'.format(len(circlelist0))
    # We treat the n circles found as the n possible values for a new feature named circles
    for i in range(len(circlelist0)):
        decoding_list.append('circles')
    circlelist = [[int(ind) for ind in circ[1:]] for circ in circlelist0]
    for (circid,members) in enumerate(circlelist):
        for m in members:
            if 'circles' not in G.node[m]:
                # Because we want to do set based comparison 
                # of circles, we want tuples
                G.node[m]['circles'] = (circid,)
            else:
                G.node[m]['circles'] += (circid,)
    return circlelist
    

Execute the cell above before executing the cell below. Click on the cell below and type `Esc-l` to toggle line numbers if they aren't already there.

In [8]:
egoid = 0
add_node_properties_to_graph(G,egoid,decoding_list) 
circlelist = add_circles_to_graph(G,egoid,decoding_list)
ego_feat_dict = read_ego_features(egoid, decoding_list)

24 circles found!


In line 1 we choose the ego id whose ego graph we are analyzing. In line 2 we read the known properties of the friends in that graph and store them on the graph (see below). In line 3 we compute the circles and return them in a list. Each circle is a list of ego's friends, so `circlelist` is a list of lists. For example, there might be one circle for family members, another for work, another for karate club members, and so on. Finally, in line 4 we compute the properties ego has made public on his/her profile page and store them in a dictionary.
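
As a rough illustration of how circle lines turn into per-friend circle memberships, the sketch below works on invented in-memory lines (a circle label followed by member ids). `circles_by_node` is a hypothetical helper, not a function from this notebook, and it returns a plain dictionary instead of writing to the graph.

```python
def circles_by_node(lines):
    """Map each node id to a tuple of the circle ids it belongs to."""
    membership = {}
    for circ_id, line in enumerate(lines):
        fields = line.strip().split()
        # fields[0] is the circle label; the rest are member node ids
        for node in (int(f) for f in fields[1:]):
            membership[node] = membership.get(node, ()) + (circ_id,)
    return membership

# Invented sample lines, for illustration only.
sample_circles = [
    "circle0\t54\t97\t298",
    "circle1\t173",
    "circle2\t54\t173",
]
membership = circles_by_node(sample_circles)
in_several = sorted(n for n, circs in membership.items() if len(circs) > 1)
print(membership[54])
print(in_several)
```

Tuples are used for the memberships, as in the notebook, so that they stay hashable and can take part in set-based comparisons later.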

Friend properties have been stored as dictionaries we'll call **feat_dicts**. The feat dicts of all ego's friends are stored in one big dictionary, `G.node`, keyed by node name. Ego is not part of the graph, nor is ego's feat dict; it is a separate feat dict that we computed in line 4 of the cell above.
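
The step from a row of 0/1 flags to a feat dict can be sketched in isolation. The decoding list and the feature row below are invented, and `decode_feature_row` is a hypothetical helper that mirrors what `add_feats_to_feat_dict` does for a single node.

```python
def decode_feature_row(row, decoding_list):
    """Turn [node_id, flag_0, flag_1, ...] into (node_id, feat_dict)."""
    node, flags = row[0], row[1:]
    feat_dict = {}
    for feat_index, flag in enumerate(flags):
        if flag:
            name = decoding_list[feat_index]
            # Collect every switched-on value index for this feature name
            feat_dict[name] = feat_dict.get(name, ()) + (feat_index,)
    return node, feat_dict

# Invented decoding list: two values each for three features.
toy_decoding = ['birthday', 'birthday',
                'languages;id', 'languages;id',
                'gender', 'gender']
toy_row = [int(x) for x in "7 0 1 1 1 0 1".split()]
toy_node, toy_feat_dict = decode_feature_row(toy_row, toy_decoding)
print('{0}: {1}'.format(toy_node, sorted(toy_feat_dict.items())))
```

Node 7 ends up with two values for `languages;id`, which is why the values in a feat dict are tuples rather than single integers.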

Here are ego's features.

In [42]:
ego_feat_dict

{'education;classes;id': (9,),
 'education;concentration;id': (14,),
 'education;school;id': (39, 50, 52),
 'education;type': (53, 54, 55),
 'education;year;id': (69,),
 'gender': (78,),
 'last_name': (104,),
 'locale': (127,),
 'location;id': (129,),
 'work;employer;id': (145, 147, 151, 156),
 'work;end_date': (160, 163, 166, 168),
 'work;location;id': (176,),
 'work;position;id': (192, 195),
 'work;start_date': (205, 206, 208, 210, 212, 219)}

Here's a sample of the kinds of features found among ego's friends.

In [43]:
G.node.items()[:10]

[(1, {'circles': (15,), 'gender': (77,), 'locale': (127,)}),
 (2,
  {'circles': (10,),
   'education;school;id': (35,),
   'education;type': (53, 55),
   'education;year;id': (57,),
   'gender': (78,),
   'languages;id': (92, 98),
   'last_name': (114,),
   'locale': (126,),
   'location;id': (135,)}),
 (3,
  {'birthday': (7,),
   'circles': (15,),
   'education;concentration;id': (14,),
   'education;school;id': (34, 50),
   'education;type': (53, 55),
   'education;year;id': (59, 65),
   'gender': (78,),
   'languages;id': (92,),
   'locale': (127,),
   'location;id': (138,),
   'work;end_date': (171, 173),
   'work;location;id': (185,),
   'work;start_date': (210, 217)}),
 (4,
  {'education;school;id': (50,),
   'education;type': (53, 55),
   'education;with;id': (56,),
   'gender': (78,),
   'locale': (127,)}),
 (5,
  {'circles': (16,),
   'education;school;id': (49, 50),
   'education;type': (53, 54),
   'education;year;id': (65,),
   'gender': (78,),
   'locale': (127,)}),
 (6,
 

Does anyone belong to more than one circle?

In [44]:
for friend in G.node:
    feat_dict = G.node[friend]
    if 'circles' in feat_dict:
        if len(feat_dict['circles']) > 1:
            print friend, feat_dict['circles']

9 (15, 16)
17 (6, 19)
20 (6, 19)
23 (5, 15)
36 (15, 16)
41 (6, 19)
52 (6, 17)
54 (0, 11)
55 (4, 15)
69 (4, 15)
93 (6, 19)
97 (0, 11)
105 (15, 17)
115 (6, 19)
122 (4, 15)
125 (4, 15)
127 (15, 16)
135 (15, 16)
137 (6, 19)
139 (15, 16)
146 (9, 15)
172 (15, 17)
173 (1, 16)
183 (0, 15)
197 (15, 16)
214 (6, 19)
236 (4, 15)
251 (15, 16)
258 (4, 16)
280 (4, 15)
281 (15, 16)
282 (8, 20)
294 (15, 17)
298 (0, 11)
308 (11, 15)
309 (15, 16)
312 (6, 19)
326 (6, 19)
343 (6, 19)


How many of ego's friends have the same last name as ego?

In [9]:
ctr = 0
for friend in G.node:
    if 'last_name' in G.node[friend]:
        if G.node[friend]['last_name'] == ego_feat_dict['last_name']:
            ctr += 1
ctr

5

What circle has the most people with the same last name as ego?  Perhaps a family circle?

In [12]:
circlelist[0]

[71,
 215,
 54,
 61,
 298,
 229,
 81,
 253,
 193,
 97,
 264,
 29,
 132,
 110,
 163,
 259,
 183,
 334,
 245,
 222]

In [13]:
from collections import Counter
shared_last = Counter()
ego_last = ego_feat_dict['last_name']

# Loop through all the circles
for (i,c) in enumerate(circlelist):
    # loop through the members of a given circle
    for m in c:
        # If this member has revealed his last name and 
        # it is the same as ego's
        if 'last_name' in G.node[m] and \
            G.node[m]['last_name'] == ego_last:
                # increment the count of how many last name sharers
                # there are in this circle
                shared_last[i] += 1

#What are top three circles as far as sharing last names with ego goes?
shared_last.most_common(3)            
    

[(14, 2), (12, 1), (13, 1)]

One circle, circle 14, has two members with the same last name as ego.  How many members does circle 14 have?

In [14]:
len(circlelist[14])

2

## Similarity of friends: Homophily

We speculated above that the two members of Circle 14 might be related to ego, because they have the same last name.

Let's try to **measure** how similar these two friends are, as well as how similar they are to ego.

We'll use a very famous similarity function known as the **Dice coefficient**, after its inventor, Lee Dice, who introduced it in the following work:

> Dice, Lee R. "Measures of the amount of ecologic association between species." Ecology 26.3 (1945): 297-302.

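The similarity we compute counts the number of shared properties two entities have and divides by the size of their combined set of properties. Strictly speaking, this intersection-over-union ratio is the closely related Jaccard index; the Dice coefficient proper is twice the number of shared properties divided by the sum of the two set sizes. If the first ratio is J, the second is 2J/(1+J), so the two always rank pairs in the same order. To facilitate comparison with ego's properties, we define the function on feat dicts.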
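
For comparison, here is a sketch of the shared-over-combined ratio next to the twice-shared-over-sum form that is usually given as the Dice coefficient. The two feat dicts are invented, and the function names `jaccard` and `dice` are chosen for this illustration only.

```python
def jaccard(a, b):
    """Shared items divided by the size of the combined set."""
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b))

def dice(a, b):
    """Twice the shared items divided by the sum of the two set sizes."""
    a, b = set(a), set(b)
    return 2.0 * len(a & b) / (len(a) + len(b))

# Invented feat dicts, for illustration only.
fd_a = {'gender': (78,), 'locale': (127,), 'last_name': (104,)}
fd_b = {'gender': (77,), 'locale': (127,), 'last_name': (104,),
        'birthday': (7,)}

j = jaccard(fd_a.items(), fd_b.items())  # 2 shared pairs, 5 distinct pairs
d = dice(fd_a.items(), fd_b.items())     # 2 * 2 shared pairs over 3 + 4
print('jaccard = {0:.2f}, dice = {1:.2f}'.format(j, d))
print(abs(d - 2 * j / (1 + j)) < 1e-12)
```

Both functions assume at least one of the two sets is non-empty; comparing two empty feat dicts would divide by zero.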

In [57]:
def dice_coefficient (fd1,fd2):
    """
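    Returns a number between 0 and 1 representing the similarity of
    feature set `fd1` to feature set `fd2`, which are dictionaries
    with hashable values (strings, ints, or tuples, no lists).
    Computed as the number of shared (feature, value) pairs divided by
    the number of distinct pairs overall (intersection over union), so a
    multi-valued feature counts as shared only if its whole tuple matches.
    """
    fd1_s,fd2_s = set(fd1.items()),set(fd2.items())
    return len(fd1_s & fd2_s)/float(len(fd1_s | fd2_s))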

# The two members of circle 14
mem1, mem2 = circlelist[14][0],circlelist[14][1]
# The feat dicts of those two members, taken from the graph.
mem1_feat_dict,mem2_feat_dict = G.node[mem1],G.node[mem2]
# The similarities of 3 pairs of individuals
mems12_sim = dice_coefficient(mem1_feat_dict, mem2_feat_dict)
mem1_ego_sim = dice_coefficient(mem1_feat_dict, ego_feat_dict)
mem2_ego_sim = dice_coefficient(mem2_feat_dict, ego_feat_dict)
print 'First mem',mem1,'Second mem', mem2,'{0:.2f}'.format(mems12_sim)
print 'First member',mem1, 'ego', '{0:.2f}'.format(mem1_ego_sim)
print 'Second member',mem2, 'ego', '{0:.2f}'.format(mem2_ego_sim)

First mem 175 Second mem 227 0.31
First member 175 ego 0.15
Second member 227 ego 0.08


In [50]:
mem1_feat_dict

{'birthday': (4,),
 'circles': (14,),
 'education;school;id': (45,),
 'education;type': (55,),
 'gender': (78,),
 'hometown;id': (88,),
 'last_name': (104,),
 'locale': (127,),
 'location;id': (137,)}

In [51]:
mem2_feat_dict

{'birthday': (7,),
 'circles': (14,),
 'education;school;id': (26, 45),
 'education;type': (53, 55),
 'education;year;id': (65, 69),
 'gender': (77,),
 'hometown;id': (88,),
 'last_name': (104,),
 'locale': (127,),
 'location;id': (137,),
 'work;location;id': (177,),
 'work;start_date': (215,)}
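
The 0.31 reported for the two circle members can be checked by hand from the two dictionaries above. The sketch below copies them in, lists the features whose whole value tuples agree, and recomputes the ratio. It also tries a variant, `value_similarity`, which compares individual (feature, value index) pairs instead of whole tuples; that variant and the helper `flatten` are assumptions added for illustration, not part of the notebook's method.

```python
mem1_feat_dict = {
    'birthday': (4,), 'circles': (14,), 'education;school;id': (45,),
    'education;type': (55,), 'gender': (78,), 'hometown;id': (88,),
    'last_name': (104,), 'locale': (127,), 'location;id': (137,)}
mem2_feat_dict = {
    'birthday': (7,), 'circles': (14,), 'education;school;id': (26, 45),
    'education;type': (53, 55), 'education;year;id': (65, 69),
    'gender': (77,), 'hometown;id': (88,), 'last_name': (104,),
    'locale': (127,), 'location;id': (137,), 'work;location;id': (177,),
    'work;start_date': (215,)}

def pair_similarity(fd1, fd2):
    """Shared (feature, tuple) pairs over all distinct pairs."""
    s1, s2 = set(fd1.items()), set(fd2.items())
    return len(s1 & s2) / float(len(s1 | s2))

def flatten(fd):
    """One (feature, value index) pair per switched-on value."""
    return set((name, v) for name, vals in fd.items() for v in vals)

def value_similarity(fd1, fd2):
    """Hypothetical variant: compare individual values, not whole tuples."""
    s1, s2 = flatten(fd1), flatten(fd2)
    return len(s1 & s2) / float(len(s1 | s2))

shared = sorted(k for k in mem1_feat_dict
                if mem2_feat_dict.get(k) == mem1_feat_dict[k])
whole = pair_similarity(mem1_feat_dict, mem2_feat_dict)
per_value = value_similarity(mem1_feat_dict, mem2_feat_dict)
print(shared)
print('{0:.2f} {1:.2f}'.format(whole, per_value))
```

Five features match exactly out of 16 distinct (feature, tuple) pairs, which gives the 0.31 above. Note that `circles` is one of the five, so membership in the same circle itself contributes to the score. Under the per-value variant, the shared school (45) and education type (55) also count, giving 7 shared pairs out of 17.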