# Six Degrees of Francis Bacon
[Six Degrees of Francis Bacon](http://sixdegreesoffrancisbacon.com) is a collaboratively produced historical network that traces the social relationships of the early modern English philosopher Francis Bacon. This notebook downloads the latest relationship data from the website and creates a GML file that can be used with PolyGraphs.

The notebook writes a dated file named `francisbacon-<date>.gml.gz`. Rename it and place it at: `~/polygraphs-cache/data/francisbacon/francisbacon.gml.gz`

In [105]:
from collections import defaultdict
import datetime
import numpy as np
import pandas as pd
import networkx as nx
import urllib.request

## Prepare download URLs

In [106]:
dt = datetime.datetime.now()
base = "http://sixdegreesoffrancisbacon.com/data/"
date = "_{y}_{m:02d}_{d:02d}.csv".format(y=dt.year, m=dt.month, d=dt.day)
people_url = base + "SDFB_people" + date
relations_url = base + "SDFB_relationships" + date
group_assignments_url = base + "SDFB_group_assignments" + date
group_names_url = base + "SDFB_groups" + date
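
Because each file name embeds today's date, the downloads presumably succeed only when the site has published an export under that date. As a minimal sketch, the hypothetical helper `build_urls` below (not part of the notebook) applies the same pattern to a fixed date, so the resulting file names can be inspected without touching the network:

```python
import datetime

BASE = "http://sixdegreesoffrancisbacon.com/data/"


def build_urls(day, base=BASE):
    """Return the four CSV URLs for the given date (hypothetical helper)."""
    # Same suffix pattern as the cell above: _YYYY_MM_DD.csv
    suffix = "_{y}_{m:02d}_{d:02d}.csv".format(y=day.year, m=day.month, d=day.day)
    stems = {
        "people": "SDFB_people",
        "relations": "SDFB_relationships",
        "group_assignments": "SDFB_group_assignments",
        "group_names": "SDFB_groups",
    }
    return {key: base + stem + suffix for key, stem in stems.items()}


# The date here is arbitrary and only illustrates the zero padding
urls = build_urls(datetime.date(2023, 1, 5))
print(urls["people"])
# http://sixdegreesoffrancisbacon.com/data/SDFB_people_2023_01_05.csv
```

Taking the date as a parameter also makes it easy to retry with an earlier date if a download for today fails.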

## Create graph from edges

In [96]:
# Read Relationships
df = pd.read_csv(relations_url)
src = df['person1_index']
dst = df['person2_index']

# Initialise NetworkX Graph
G = nx.Graph()
G.add_edges_from(zip(src, dst))

## Largest single component only
The network contains some disconnected components. Keep only the largest one and discard the rest.

In [97]:
# Generate connected components and select the largest
largest_component = max(nx.connected_components(G), key=len)

# Create a subgraph of G consisting only of this component;
# copy() returns an independent graph rather than a view of the original
G = G.subgraph(largest_component).copy()
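
To illustrate what selecting the largest connected component means, here is a plain-Python sketch on a made-up toy graph. It uses a breadth-first search instead of NetworkX, and the `connected_components` function below is a hypothetical stand-in for illustration, not the library's implementation:

```python
from collections import deque


def connected_components(adjacency):
    """Yield each connected component of an undirected graph as a set of nodes."""
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        yield component


# Made-up toy graph: a triangle {1, 2, 3} plus a separate pair {8, 9}
adjacency = {1: [2, 3], 2: [1, 3], 3: [1, 2], 8: [9], 9: [8]}

# Same selection rule as in the cell above: the component with the most nodes
largest = max(connected_components(adjacency), key=len)
print(sorted(largest))  # [1, 2, 3]
```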

## Load information about nodes

In [98]:
# Load the people csv for names
names_df = pd.read_csv(people_url)

# Build a dictionary that maps each person id to its display_name
names_dict = pd.Series(names_df.display_name.values, index=names_df.id).to_dict()

# Set the names dictionary as a node attribute
nx.set_node_attributes(G, names_dict, "name")

### Preserve original node ids

In [99]:
# Create a dict that maps each node to its own (original) id
original_id = {node: node for node in G.nodes()}

# Set the original ids as a node attribute
nx.set_node_attributes(G, original_id, "original_id")

## Normalise node ids on graph

In [100]:
# Create edge list
edges = list(G.edges())

# Lookup table that assigns the next free integer to each unseen node id
tbl = defaultdict(lambda: len(tbl))

# Normalise node identifiers (0 to N-1); these lookups also populate tbl
normalise_node_edges = [(tbl[u], tbl[v]) for u, v in edges]

# Relabel nodes using lookup table
G = nx.relabel_nodes(G, tbl)
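
The lookup table in the cell above relies on `defaultdict` calling its factory for every unseen key, before that key is stored, so each new node id receives the current size of the table. A minimal standalone sketch with made-up ids:

```python
from collections import defaultdict

# Illustrative edge list with arbitrary, non-contiguous ids (made-up values)
edges = [(5001, 5007), (5007, 5020), (5020, 5001)]

# Each unseen key is assigned the current size of the table: 0, 1, 2, ...
tbl = defaultdict(lambda: len(tbl))

# Looking up both endpoints of every edge fills the table as a side effect
normalised = [(tbl[a], tbl[b]) for a, b in edges]

print(dict(tbl))   # {5001: 0, 5007: 1, 5020: 2}
print(normalised)  # [(0, 1), (1, 2), (2, 0)]
```

Ids are handed out in order of first appearance, so the resulting numbering depends on the order of the edge list.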

## Add group information

In [101]:
# Dict for converting between normalised node ids and original ids from CSV
tbl = nx.get_node_attributes(G, 'original_id')

# Read the group assignments and store them inside the node attributes
group_df = pd.read_csv(group_assignments_url)

# Group by person_id and get a list of group assignments for each person
group_dict = group_df.groupby(['person_id'])['group_id'].apply(list).to_dict()

# Fetch the group assignments for each person in the graph, using the original_id as the lookup key
groups = {k: group_dict[v] for k, v in tbl.items() if v in group_dict}

# Set the groups dictionary as node attributes
nx.set_node_attributes(G, groups, "groups")
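
The lookup above goes from normalised node id to original id to list of groups, and skips people with no group assignment. The following plain-Python sketch shows the shape of the data at each step. All ids are made up, and a `defaultdict(list)` stands in for the pandas `groupby` call:

```python
from collections import defaultdict

# Made-up (person_id, group_id) rows standing in for the group assignments CSV
rows = [(101, 7), (101, 9), (103, 7)]

# Plain-Python stand-in for groupby('person_id')['group_id'].apply(list).to_dict()
group_dict = defaultdict(list)
for person_id, group_id in rows:
    group_dict[person_id].append(group_id)

# Normalised node id -> original id, as stored in the "original_id" attribute
original_ids = {0: 101, 1: 102, 2: 103}

# Keep only nodes whose original id has at least one group assignment
groups = {
    node: group_dict[orig]
    for node, orig in original_ids.items()
    if orig in group_dict
}
print(groups)  # {0: [7, 9], 2: [7]}
```

Node 1 is absent from the result, so it would simply have no `groups` attribute in the graph.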

In [102]:
# Test Graph
G.nodes[0]

{'name': 'Mildred Cecil', 'original_id': 10002248, 'groups': [58, 128]}

## Group Names
Download the group names CSV from the Six Degrees of Francis Bacon website.

In [None]:
urllib.request.urlretrieve(group_names_url, "SDFB_groups-{0}.csv".format(datetime.date.today()))

## Export GML File

In [104]:
nx.write_gml(G, "francisbacon-{0}.gml.gz".format(datetime.date.today()))