## Data 620 - Project 1

Baron Curtin, Heither Geiger

## Introduction

For this project, we will be using data containing the networks of 10 different Facebook users (aka the "ego nodes"). This data will be downloaded from the Stanford Large Network Dataset Collection. The dataset is available here: https://snap.stanford.edu/data/ego-Facebook.html.

Here, each node ID represents a Facebook user.

For each of the 10 different networks (denoted by the ID of the user at the center of the network), we have the following files:

\*.edges - Each line will have two columns. The node IDs in column 1 and column 2 are connected to each other and to the ego node.

\*.circles - Each line will have an ID column (i.e. circle0, circle1, and so on), followed by anywhere from one to 100+ additional columns. The circles represent groups of users. Each node ID in every field of every line is connected to the ego node.

\*.featnames - Maps each feature column to a description. For example, "8 education;classes;id;anonymized feature 8" means that column 9 of the \*.feat file (with 0-based indexing, since column 0 holds the node ID) corresponds to an anonymized feature related to education (specifically, classes taken).

\*.egofeat and \*.feat - For each feature listed in featnames, indicate whether the feature is true or false for the ego node (\*.egofeat) and for every other node in the network (\*.feat).
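
As a small illustration of how a featnames entry lines up with the feature files, the sketch below splits the example line quoted above and computes the matching column positions. It assumes that each \*.feat row starts with a node ID (so features are shifted by one column) while an \*.egofeat row has no ID column, which is how the parsers later in this notebook treat them. The helper name `feature_columns` is hypothetical.

```python
def feature_columns(featnames_line):
    """Return (feature index, feature name, .feat column, .egofeat column)."""
    # the feature index is everything before the first space
    index_text, name = featnames_line.split(' ', 1)
    index = int(index_text)
    # assumption: .feat rows look like "<node id> <f0> <f1> ...", so shift by one
    feat_column = index + 1
    # assumption: .egofeat rows look like "<f0> <f1> ..." with no leading ID
    egofeat_column = index
    return index, name, feat_column, egofeat_column


example = "8 education;classes;id;anonymized feature 8"
print(feature_columns(example))
```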

In [10]:
# load libraries
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
from itertools import combinations

# additional jupyter setup
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

## Data Loading/Parsing

In the following code cells, we will load the data into Python and prepare it to be visualized with NetworkX.


In [11]:
fb_path = Path('../week3/facebook')
file_types = ('circles', 'edges', 'egofeat', 'feat', 'featnames')

# load all files into python dictionary by file type
# the leading dot keeps the 'feat' pattern from also matching '*.egofeat' files
files = {file_type: [(f.stem, f) for f in fb_path.glob(f'*.{file_type}')] for file_type in file_types}
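
The exact glob pattern matters here because the file type names overlap: `egofeat` ends in `feat`. The sketch below uses `fnmatch` from the standard library, which applies the same shell-style wildcards, to compare a pattern with and without the dot. The file names are hypothetical examples of the `<ego id>.<type>` naming scheme.

```python
from fnmatch import fnmatch

# hypothetical file names following the <ego id>.<type> naming scheme
names = ['0.feat', '0.egofeat', '0.featnames']

# without the dot, the 'feat' pattern also catches the egofeat file
loose = [n for n in names if fnmatch(n, '*feat')]
# with the dot, only the intended file type matches
strict = [n for n in names if fnmatch(n, '*.feat')]

print(loose)
print(strict)
```

If egofeat files end up under the `feat` key, `feat_parser` would read their first feature value as a node ID, so the stricter pattern is the safer choice.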

Now that all of the files are located, we need a function for each file type that parses its contents appropriately.


### File Parsing Functions

In [12]:
def circles_parser(lines='', **kwargs):
    # circles files are tab delimited
    info = [(int(kwargs['node']), int(n))
            for l in lines 
            for n in l.split('\t')[1:]
            if l != '']
    return info

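def edges_parser(lines='', **kwargs):
    # edges files are separated by spaces
    # cast IDs to int so they match the int nodes produced by circles_parser
    # each line yields three edges: (a, b), (a, ego) and (b, ego)
    info = [x for l in lines
            if l.strip() != ''
            for x in combinations([*map(int, l.split()), int(kwargs['node'])], 2)]
    return info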

def egofeat_parser(lines='', **kwargs):
    # egofeat is separated by spaces
    info = [x for l in lines
            for x in l.split(' ')
            if l != '']
    return info

def feat_parser(lines='', **kwargs):
    # feat files are separated by spaces
    info = [(l.split(' ')[0], l.split(' ')[1:]) 
            for l in lines
            if l != '']
    return info

def featnames_parser(lines='', **kwargs):
    # featnames files are delimited by ;
    # we will only use the first index and the last index of the featurename
    # as there are inconsistencies in length
    # skip empty lines (e.g. the trailing newline at the end of the file)
    info = [(l.split(';')[0], l.split(';')[-1])
            for l in lines
            if l != '']
    return info


# dictionary of file parsing functions for easy referencing later
# using a dictionary will serve as a 'case' function
file_parsers = {
    'circles': circles_parser,
    'edges': edges_parser,
    'egofeat': egofeat_parser,
    'feat': feat_parser,
    'featnames': featnames_parser
}
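
To make the edge expansion in `edges_parser` concrete: each line names two friends of the ego, and taking all pairs from those two IDs plus the ego ID yields three edges. A minimal sketch with made-up values (the IDs `1`, `2` and `0` are hypothetical, and they are cast to int here for readability):

```python
from itertools import combinations

# hypothetical IDs: one line of an edges file belonging to ego node 0
line = '1 2'
ego = 0

# the two friends on the line, plus the ego they are both connected to
members = [*map(int, line.split()), ego]

# every unordered pair among the three IDs becomes one edge
edges = list(combinations(members, 2))
print(edges)
```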

In [13]:
def parse_file(file_, node, file_type=''):
    # basic error handling to make sure all required arguments are passed
    if file_ is None or node == '' or file_type == '':
        raise ValueError('Unable to parse file without the required information')
    
    # get contents of the file
    content = file_.read_text()
    lines = content.split('\n')
    
    # parse based on file type
    info = file_parsers[file_type](lines, node=node)
    return info

In [14]:
# we can use our created functions to create a new dictionary of parsed files
parsed_files = {file_type: [parse_file(file_path, node, file_type) for node, file_path in list_of_files]
                for file_type, list_of_files in files.items()}

## Graph Construction

In [15]:
g = nx.Graph()

# add edges from all the circles files
g.add_edges_from([y for x in parsed_files['circles']
                  for y in x])

# add edges from all the edges files
g.add_edges_from([y for x in parsed_files['edges']
                  for y in x])
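
When edge lists from different parsers are merged into one graph, the node IDs need a consistent type. NetworkX stores nodes as dictionary keys, so the integer `1` and the string `'1'` would be two separate nodes. The sketch below uses a plain dict of sets as a stand-in for the graph (the `add_edge` helper is hypothetical, not a NetworkX call) to show the effect:

```python
from collections import defaultdict


def add_edge(adjacency, u, v):
    """Record an undirected edge in a dict-of-sets adjacency structure."""
    adjacency[u].add(v)
    adjacency[v].add(u)


# mixing types: the same two users end up as four distinct keys
mixed = defaultdict(set)
add_edge(mixed, 0, 1)      # int IDs, as circles_parser produces
add_edge(mixed, '0', '1')  # string IDs for the same two users

# casting to int first keeps one key per user
consistent = defaultdict(set)
add_edge(consistent, 0, 1)
add_edge(consistent, int('0'), int('1'))

print(len(mixed), len(consistent))
```

A quick check such as comparing `g.number_of_nodes()` against the number of distinct user IDs in the raw files would catch this kind of duplication.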

In [None]:
pos = nx.spring_layout(g)
deg_centrality = nx.degree_centrality(g)
node_color = [20000.0 * g.degree(v) for v in g]
node_size = [v * 10000.0 for v in deg_centrality.values()]

plt.figure(figsize=(20,20))
nx.draw_networkx(g, pos=pos, with_labels=True, node_color=node_color, node_size=node_size)
plt.axis('off')
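
For reference, the degree centrality used for the node sizes is a node's degree divided by `n - 1`, where `n` is the number of nodes, so values lie between 0 and 1 for a simple graph. A minimal pure-Python sketch on a made-up four-node edge list (the function name `degree_centrality_by_hand` and the toy edges are hypothetical):

```python
from collections import Counter


def degree_centrality_by_hand(edges):
    """Degree of each node divided by (number of nodes - 1).

    Assumes a simple undirected graph given as unique edges,
    with at least two nodes and no isolated nodes.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    return {node: d / (n - 1) for node, d in degree.items()}


# hypothetical toy graph: node 0 is connected to every other node
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
centrality = degree_centrality_by_hand(toy_edges)
print(centrality)
```

Because the values are fractions, the plotting cell multiplies them by a large constant to get visible marker sizes.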