#### Facebook Data Directory
    * nodeId.edges : The edges in the ego network for the node 'nodeId'. Edges are undirected for facebook, and directed (a follows b) for twitter and gplus. The 'ego' node does not appear, but it is assumed that they follow every node id that appears in this file.

    * nodeId.circles : The set of circles for the ego node. Each line contains one circle, consisting of a series of node ids. The first entry in each line is the name of the circle.

    * nodeId.feat : The features for each of the nodes that appears in the edge file.

    * nodeId.egofeat : The features for the ego user.

    * nodeId.featnames : The names of each of the feature dimensions. Features are '1' if the user has this property in their profile, and '0' otherwise. This file has been anonymized for facebook users, since the names of the features would reveal private data.
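As a quick illustration of the file layouts described above, here is a minimal sketch of how `.edges` and `.circles` content could be parsed. The helper names `parse_edges` and `parse_circles` are hypothetical, the sample lines are made up, and whitespace-separated fields are an assumption.

```python
def parse_edges(lines):
    """Return a set of undirected edges as (smaller_id, larger_id) tuples."""
    edges = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = int(parts[0]), int(parts[1])
        edges.add((min(a, b), max(a, b)))
    return edges


def parse_circles(lines):
    """Return {circle_name: [node ids]}; the first token on each line is the name."""
    circles = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        circles[parts[0]] = [int(p) for p in parts[1:]]
    return circles


# Hypothetical sample content, not taken from the real files
sample_edges = ["1 2\n", "2 1\n", "2 3\n"]
sample_circles = ["circle0\t1\t2\n", "circle1\t3\n"]

edges = parse_edges(sample_edges)
circles = parse_circles(sample_circles)
print(sorted(edges))
print(circles)
```

Storing each facebook edge as a sorted tuple collapses "1 2" and "2 1" into one undirected edge; for twitter and gplus, where edges are directed, the tuple order would have to be kept as is.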

In [1]:
import os
import sys
import glob
import matplotlib.pyplot as plt

%matplotlib inline
%precision 4

'%.4f'

In [15]:
from os import path
data_dir = path.join(os.getcwd(), "data")
file_names = glob.glob(path.join(data_dir, "facebook", "*"))

In [55]:
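# take the extension from the file name itself, so dots elsewhere in the path do not matter
file_tags = set([path.splitext(file_name)[1][1:] for file_name in file_names])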
file_tags = list(file_tags)
file_tags

['edges', 'featnames', 'circles', 'egofeat', 'feat']

"circles": describes the composition of each circle in terms of its members (denoted by id)
"egofeat": a single vector of 224 binary indicators (what does each indicator stand for?)
"edges": user_id pairs
"feat": user_id, followed by a feature vector of 224 binary indicators
"featnames": 224 lines, each structured as category, feature, value (e.g. work;start_date;anonymized feature 204)

In [57]:
# path.basename works with any path separator, not only "\\" on Windows
file_groups = set([path.basename(file_name).split(".")[0] for file_name in file_names])
print("Total of groups: {}".format(len(file_groups)))
file_groups = list(file_groups)
file_groups

Total of groups: 10


['348', '0', '698', '686', '3980', '1684', '414', '107', '3437', '1912']
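The split of each file name into an ego id and a file tag can also be sketched with `pathlib`, which avoids hard-coding a path separator. The file names below are hypothetical stand-ins for the `glob` result, and `split_name` is an illustrative helper, not part of the original notebook.

```python
from pathlib import PurePath

# Hypothetical file names standing in for the glob result
example_names = [
    "data/facebook/348.edges",
    "data/facebook/348.feat",
    "data/facebook/0.circles",
]


def split_name(file_name):
    """Return (ego id, tag), e.g. ('348', 'edges') for '.../348.edges'."""
    p = PurePath(file_name)
    return p.stem, p.suffix.lstrip(".")


pairs = [split_name(name) for name in example_names]
groups = sorted({group for group, _ in pairs}, key=int)
tags = sorted({tag for _, tag in pairs})
print(groups, tags)
```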

#### Data understanding:
* Do all featnames files include the same set of features?

In [51]:
featnames_files = [f+".featnames" for f in file_groups]
featnames_files

['348.featnames',
 '0.featnames',
 '698.featnames',
 '686.featnames',
 '3980.featnames',
 '1684.featnames',
 '414.featnames',
 '107.featnames',
 '3437.featnames',
 '1912.featnames']

In [121]:
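def feat_extractor(line, group_id):
    """Convert one line of a .featnames file into a dictionary."""
    # expected layout: "<position> <category>;<feature>;anonymized feature <value>"
    items = line.replace("anonymized feature", "").split(" ")
    position = int(items[0])
    feature = items[1][:-1].replace(";", "_")  # drop the trailing ";"
    value = int(items[2])
    return {"group": int(group_id), "position": position,
            "feature": feature, "value": value}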

In [122]:
# create a DataFrame that maps (circle_id, index in the binary vector)
# to features and values
rows = []

for i, file_group in enumerate(file_groups):
    print("{} th loop (ID: {})".format(i, file_group))
    file_name = path.join(data_dir, "facebook", file_group + ".featnames") 
    with open(file_name, mode='r', encoding='utf-8') as fs:
        for line in fs.readlines():
            rows.append(feat_extractor(line, group_id=file_group))

0 th loop (ID: 348)
1 th loop (ID: 0)
2 th loop (ID: 698)
3 th loop (ID: 686)
4 th loop (ID: 3980)
5 th loop (ID: 1684)
6 th loop (ID: 414)
7 th loop (ID: 107)
8 th loop (ID: 3437)
9 th loop (ID: 1912)
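The line parsing done by `feat_extractor` can also be sketched with a regular expression, which fails loudly on lines that do not match the expected layout. This is only an alternative sketch: `parse_featname` and `FEATNAME_RE` are hypothetical names, and the sample line is made up to follow the layout "<position> <category>;<feature>;anonymized feature <value>".

```python
import re

# position, then the ";"-separated feature path, then the anonymized value
FEATNAME_RE = re.compile(r"^(\d+)\s+(.+);anonymized feature (\d+)\s*$")


def parse_featname(line, group_id):
    """Parse one .featnames line; raise ValueError if the layout is unexpected."""
    match = FEATNAME_RE.match(line)
    if match is None:
        raise ValueError("unexpected featnames line: {!r}".format(line))
    position, feature, value = match.groups()
    return {
        "group": int(group_id),
        "position": int(position),
        "feature": feature.replace(";", "_"),
        "value": int(value),
    }


# Hypothetical sample line
row = parse_featname("12 work;start_date;anonymized feature 204\n", "348")
print(row)
```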


In [123]:
import pandas as pd
import numpy as np
feat_dict = pd.DataFrame(rows)
# .ix was removed from pandas; .iloc selects rows by integer position
feat_dict.iloc[np.random.choice(feat_dict.shape[0], 10), :]
feat_dict.to_csv(path.join(data_dir, "processed", "feature_dict.csv"), 
                 header=True, index=False, sep=",", encoding="utf-8")

In [206]:
circle_feat_dict = feat_dict.loc[feat_dict["group"]==0, :]
circle_feat_dict.head(5)

Unnamed: 0,feature,group,position,value
161,birthday,0,0,0
162,birthday,0,1,1
163,birthday,0,2,2
164,birthday,0,3,3
165,birthday,0,4,4


In [129]:
feat_dict["feature"].unique()

array(['birthday', 'education_concentration_id', 'education_degree_id',
       'education_school_id', 'education_type', 'education_with_id',
       'education_year_id', 'first_name', 'gender', 'hometown_id',
       'languages_id', 'last_name', 'locale', 'location_id',
       'work_employer_id', 'work_end_date', 'work_location_id',
       'work_position_id', 'work_start_date', 'education_classes_id',
       'work_with_id', 'name', 'political', 'religion', 'middle_name',
       'work_from_id', 'work_projects_id'], dtype=object)
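To check whether every featnames file covers the same set of features, one option is to collect the feature names per group and compare the sets. The sketch below uses a small made-up frame (`toy`) with the same `group` and `feature` columns as `feat_dict`; the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for feat_dict
toy = pd.DataFrame({
    "group": [0, 0, 0, 348, 348],
    "feature": ["birthday", "gender", "locale", "birthday", "gender"],
})

# one set of feature names per group
features_per_group = {group: set(sub["feature"]) for group, sub in toy.groupby("group")}

all_features = set.union(*features_per_group.values())
common = set.intersection(*features_per_group.values())

# features that exist somewhere in the data but are absent from a given group
missing = {group: all_features - feats for group, feats in features_per_group.items()}
same_everywhere = all(len(m) == 0 for m in missing.values())

print(common)
print(missing)
print(same_everywhere)
```

If `same_everywhere` is `False`, the `missing` dictionary shows which groups lack which features.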

In [238]:
def profile_extracttor(line, circle_id, feat_dict):
    """Convert text line (from .feat data) to dictionary"""
    if isinstance(circle_id, str):
        circle_id = int(circle_id)
    items = line.split(" ")
    user_idx, feature = items[0], list(map(int, items[1:]))
    nonzero_feat_idx = [i for i, val in enumerate(feature) if val == 1]
    # .ix was removed from pandas; .loc does the same label-based selection
    feat_dict = feat_dict.loc[feat_dict["group"]==circle_id,\
                              ["feature", "position", "value"]]
    feat_profile =  {row[0]: row[1] \
           for i, row in feat_dict.iterrows() \
           if row[1] in nonzero_feat_idx}
    user_id = "-".join((str(circle_id), str(user_idx)))
    return {"user_id": user_id, "profile":feat_profile}
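Filtering the DataFrame once per line is slow, and a dictionary keyed by feature name keeps only one entry per feature. A possible alternative, sketched under the assumption that the goal is to list the anonymized values of every active indicator, is to build a `(group, position)` lookup once and collect values in lists. `build_lookup`, `extract_profile`, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def build_lookup(featname_rows):
    """Map (group, position) -> (feature, value) for fast access."""
    return {(r["group"], r["position"]): (r["feature"], r["value"])
            for r in featname_rows}


def extract_profile(line, group_id, lookup):
    """Convert one .feat line into {'user_id': ..., 'profile': {feature: [values]}}."""
    group_id = int(group_id)
    items = line.split()
    user_idx, flags = items[0], items[1:]
    profile = defaultdict(list)
    for position, flag in enumerate(flags):
        if flag == "1":
            feature, value = lookup[(group_id, position)]
            profile[feature].append(value)
    return {"user_id": "{}-{}".format(group_id, user_idx), "profile": dict(profile)}


# Hypothetical featnames rows and .feat line
sample_rows = [
    {"group": 348, "position": 0, "feature": "birthday", "value": 5},
    {"group": 348, "position": 1, "feature": "gender", "value": 77},
    {"group": 348, "position": 2, "feature": "gender", "value": 78},
]
lookup = build_lookup(sample_rows)
result = extract_profile("349 0 1 1\n", "348", lookup)
print(result)
```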

In [246]:
user_profiles = []

for i, file_group in enumerate(file_groups):
    print("{} th loop (ID: {})".format(i, file_group))
    file_name = path.join(data_dir, "facebook", file_group + ".feat") 
    with open(file_name, mode='r', encoding='utf-8') as fs:
        for line in fs.readlines():
            res = profile_extracttor(line, file_group, feat_dict)
            user_profiles.append(res)

0 th loop (ID: 348)


  result = getattr(x, name)(y)


TypeError: invalid type comparison

In [247]:
line

'349 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0\n'

In [248]:
profile_extracttor(line, file_group, feat_dict)

  result = getattr(x, name)(y)


TypeError: invalid type comparison

In [249]:
file_group

'348'
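The traceback is truncated, so the actual cause is not visible here. One thing worth ruling out is a dtype mismatch between the `group` column and `circle_id`, for example if the column holds strings after a reload. The sketch below uses a made-up frame (`toy`) to show how the dtypes could be inspected and normalized before filtering; it is a diagnostic idea, not a confirmed fix.

```python
import pandas as pd

# Hypothetical frame in which "group" was read as strings
toy = pd.DataFrame({"group": ["348", "348", "0"], "position": [0, 1, 0]})
print(toy.dtypes)  # check what the comparison will actually see

# normalize the column so it can be compared with an integer id
toy["group"] = pd.to_numeric(toy["group"])

circle_id = int("348")
subset = toy.loc[toy["group"] == circle_id, ["position"]]
print(subset)
```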