## IS620 - Project 1
## Centrality Measures and Categorical Comparison
### Brian Chu | Sep 27, 2015

**If graphs are not visible, please view <a href="http://nbviewer.ipython.org/github/bchugit/IS620_WebAnalytics/blob/master/Week%203%20-%20Graph%20Theory%20and%20Definitions/bchu_wk3_graphviz_full.ipynb">here</a>. The <a href="http://blog.jupyter.org/2015/05/07/rendering-notebooks-on-github/">Github-Jupyter</a> rendering doesn't seem to connect GraphLab Canvas properly.**

### 1) Identify and load a network dataset that has some categorical information available for each node.

**This dataset represents a small network of the Mexican presidential elite throughout the 20th century. Each member (node) is also classified by professional background (1 - military, 2 - civilian).**  
  
*Official Background:*    
  
*In Mexico, political power has been in the hands of a relatively small set of people who are connected by business relations, family ties, friendship, and membership of political institutions throughout most of the 20th century. A striking case in point is the succession of presidents, especially the nomination of the candidates for the presidential election. Since 1929, each new president was a secretary in the previous cabinet, which means that he worked closely together with the previous president. Moreover, the candidates always entertained close ties with former presidents and their closest collaborators. In this way, a political elite has maintained control over the country.*    
  
*The network contains the core of this political elite: the presidents and their closest collaborators. In this network, edges represent significant political, kinship, friendship, or business ties.*  
  
Source: https://sites.google.com/site/ucinetsoftware/datasets/mexicanpoliticalelite  

In [13]:
import networkx as nx
import matplotlib.pyplot as plt

#### The data was originally in Pajek format with a partition for the categorical information. I converted these to separate CSV files since I could not properly extract the partition using NetworkX. 

In [35]:
# Read nodes and edges
mex = nx.read_edgelist('mexican_edges.csv', delimiter=",", encoding='utf-8')

# Add professional attribute to each node
import csv
with open('mexican_military.csv', mode='r') as infile:
    reader = csv.reader(infile)
    military = {rows[0]:rows[1] for rows in reader}

nx.set_node_attributes(mex, 'professional', military)

# Check nodes are categorized correctly
{k: mex.node[k] for k in mex.node.keys()[:5]}

{u'Antonio Carrillo Flores': {'professional': '2'},
 u'Candido Aguilar': {'professional': '1'},
 u'Luis Echeverria Alvarez': {'professional': '2'},
 u'Miguel Aleman Velasco': {'professional': '2'},
 u'Venustiano Carranza': {'professional': '1'}}

> Looks like all the data is there
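As an alternative to the manual CSV conversion, the Pajek partition could in principle be parsed directly. The sketch below is written for Python 3 (unlike the Python 2 cells in this notebook) and uses only the standard library. It assumes the usual Pajek layout: a `.net` file whose `*Vertices` section lists one numbered, quoted label per line, and a `.clu` file with one class value per line in the same vertex order. The function name and the toy file contents are hypothetical, not taken from the actual dataset.

```python
def parse_pajek_partition(net_text, clu_text):
    """Map each vertex label in a Pajek .net file to its class in a .clu file."""
    labels = []
    in_vertices = False
    for line in net_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('*'):
            # Section headers such as *Vertices, *Edges, *Arcs
            in_vertices = line.lower().startswith('*vertices')
            continue
        if in_vertices:
            # Vertex lines look like: 1 "Some Name" [optional coordinates]
            if '"' in line:
                labels.append(line.split('"')[1])
            else:
                labels.append(line.split()[1])
    # The .clu file has a *Vertices header followed by one class per line
    classes = [ln.strip() for ln in clu_text.splitlines()
               if ln.strip() and not ln.strip().startswith('*')]
    return dict(zip(labels, classes))


# Hypothetical toy files, for illustration only
toy_net = '*Vertices 3\n1 "A B"\n2 "C"\n3 "D"\n*Edges\n1 2\n2 3\n'
toy_clu = '*Vertices 3\n1\n2\n2\n'
partition = parse_pajek_partition(toy_net, toy_clu)
```

The resulting dictionary has the same shape as the `military` dictionary built from the CSV file above, so it could be attached to the nodes in the same way.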

#### Get some basic metrics of the network

In [50]:
print "Nodes: %d" %len(mex)
print "Edges: %d" %nx.number_of_edges(mex)
print "Diameter: %d" %nx.diameter(mex)
print "Radius: %d" %nx.radius(mex)
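# Caution: in NetworkX 1.x mex.degree() is a dict, so min() compares the node
# names (keys); min(mex.degree().values()) would give the smallest degree instead
print "min degrees: %s" %min(mex.degree())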

Nodes: 35
Edges: 117
Diameter: 4
Radius: 2
min degrees: Adolfo Lopez Mateos


> Indeed, with a diameter of 4 and a radius of 2 across 35 nodes, the network looks fairly closed
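For reference, diameter and radius are the largest and smallest node eccentricity, where a node's eccentricity is its distance to the farthest other node. The following is a minimal Python 3 sketch of that definition using breadth-first search. It assumes a connected, unweighted graph stored as an adjacency dictionary; the toy path graph is hypothetical and this is not how NetworkX is implemented internally.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_radius(adj):
    """Return (diameter, radius) of a connected graph."""
    # Eccentricity of u = distance from u to the farthest node
    ecc = {u: max(bfs_distances(adj, u).values()) for u in adj}
    return max(ecc.values()), min(ecc.values())


# Hypothetical toy path graph: a - b - c - d
toy = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
diameter, radius = diameter_and_radius(toy)
```

On the toy path the end nodes have eccentricity 3 and the middle nodes 2, so the diameter is 3 and the radius is 2.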

#### Draw the network

In [None]:
import graphlab as gl
gl.canvas.set_target('ipynb')
gmex = gl.SFrame.read_csv("mexican_edges.csv", header=False)

In [159]:
g = gl.SGraph()
g = g.add_edges(gmex, 'X1', 'X2')
g.show(vlabel='id', h_offset=0.03, node_size=250, vlabel_hover=True)

### 2) For each of the nodes in the dataset, calculate degree centrality and eigenvector centrality.

#### Show top 10 based on degree centrality

In [91]:
# Sorting function from SNAS pg. 47
def sorted_map(map):
    ms = sorted(map.iteritems(), key=lambda (k,v): (-v,k))
    return ms

deg = nx.degree(mex)
deg = {k:round(v,1) for k, v in deg.items()}
deg_sort = sorted_map(deg)
deg_sort[:10]

[(u'Miguel Aleman Valdes', 17.0),
 (u'Adolfo Ruiz Cortines', 13.0),
 (u'Hugo B. Margain', 12.0),
 (u'Lazaro Cardenas', 12.0),
 (u'Antonio Carrillo Flores', 11.0),
 (u'Adolfo Lopez Mateos', 10.0),
 (u'Raul Salinas Lozano', 10.0),
 (u'Antonio Ortiz Mena', 8.0),
 (u'Emilio Portes Gil', 8.0),
 (u'Venustiano Carranza', 8.0)]
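Note that `nx.degree` returns raw neighbor counts. The normalized degree centrality (what `nx.degree_centrality` reports) divides each count by n - 1, so with 35 nodes a degree of 17 corresponds to 17 / 34 = 0.5. Below is a small Python 3 sketch of both quantities, computed from a plain edge list. The toy edges are hypothetical, and the sketch assumes every node appears in at least one edge.

```python
from collections import Counter


def degree_counts(edges):
    """Count how many edges touch each node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_centrality(edges):
    """Degree divided by the maximum possible degree, n - 1."""
    deg = degree_counts(edges)
    n = len(deg)
    return {node: d / (n - 1) for node, d in deg.items()}


# Hypothetical toy edge list
toy_edges = [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c')]
toy_degree = degree_counts(toy_edges)
toy_centrality = degree_centrality(toy_edges)
```

Because the normalization divides every node by the same constant, the ranking is identical for raw and normalized degree.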

#### For curiosity's sake, show top 10 based on closeness centrality

In [76]:
close = nx.closeness_centrality(mex)
close = {k:round(v,3) for k, v in close.items()}
close_sort = sorted_map(close)
close_sort[:10]

[(u'Miguel Aleman Valdes', 0.667),
 (u'Lazaro Cardenas', 0.586),
 (u'Antonio Carrillo Flores', 0.576),
 (u'Adolfo Ruiz Cortines', 0.567),
 (u'Hugo B. Margain', 0.548),
 (u'Raul Salinas Lozano', 0.548),
 (u'Ramon Beteta', 0.54),
 (u'Adolfo Lopez Mateos', 0.531),
 (u'Antonio Ortiz Mena', 0.523),
 (u'Manuel Avila Camacho', 0.523)]
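Closeness centrality is (n - 1) divided by the sum of a node's shortest-path distances to all other nodes. As a sanity check on the top score, 0.667 with 34 other nodes implies a total distance of about 51, which is consistent with 17 direct neighbors at distance 1 and the remaining 17 members at distance 2. The Python 3 sketch below illustrates the definition on a hypothetical toy graph, assuming the graph is connected and unweighted.

```python
from collections import deque


def closeness_centrality(adj):
    """(n - 1) / (sum of shortest-path distances) for each node."""
    n = len(adj)
    result = {}
    for source in adj:
        # Breadth-first search gives hop distances in an unweighted graph
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = (n - 1) / sum(dist.values())
    return result


# Hypothetical toy path graph: a - b - c
toy = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
toy_closeness = closeness_centrality(toy)
```

In the toy graph the middle node reaches both others in one hop (closeness 1.0), while each end node needs 1 + 2 = 3 hops in total (closeness 2/3).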

#### Also show top 10 based on betweenness centrality

In [78]:
btw = nx.betweenness_centrality(mex)
btw = {k:round(v,3) for k, v in btw.items()}
btw_sort = sorted_map(btw)
btw_sort[:10]

[(u'Miguel Aleman Valdes', 0.23),
 (u'Lazaro Cardenas', 0.157),
 (u'Adolfo Ruiz Cortines', 0.132),
 (u'Hugo B. Margain', 0.09),
 (u'Raul Salinas Lozano', 0.065),
 (u'Antonio Carrillo Flores', 0.06),
 (u'Ramon Beteta', 0.058),
 (u'Emilio Portes Gil', 0.039),
 (u'Adolfo Lopez Mateos', 0.035),
 (u'Venustiano Carranza', 0.031)]
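Betweenness centrality measures the share of shortest paths between other pairs of nodes that pass through a given node. The sketch below follows Brandes' algorithm for an unweighted, undirected graph in Python 3, with the normalization commonly used for undirected graphs (division by (n - 1)(n - 2) after each pair has been counted from both endpoints). It is an illustrative sketch on a hypothetical toy graph, not the NetworkX source.

```python
from collections import deque


def betweenness_centrality(adj):
    """Normalized betweenness for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate pair dependencies from the farthest nodes inward
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Every pair was counted from both endpoints, hence (n - 1)(n - 2)
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}


# Hypothetical toy star graph: hub connected to x, y and z
toy = {'hub': ['x', 'y', 'z'], 'x': ['hub'], 'y': ['hub'], 'z': ['hub']}
toy_betweenness = betweenness_centrality(toy)
```

In the star graph every shortest path between two leaves runs through the hub, so the hub scores 1 and the leaves score 0.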

#### Top 10 based on eigenvector centrality

In [161]:
eig = nx.eigenvector_centrality(mex)
eig = {k:round(v,3) for k, v in eig.items()}
eig_sort = sorted_map(eig)
eig_sort[:10]

[(u'Miguel Aleman Valdes', 0.371),
 (u'Antonio Carrillo Flores', 0.287),
 (u'Adolfo Ruiz Cortines', 0.273),
 (u'Hugo B. Margain', 0.273),
 (u'Adolfo Lopez Mateos', 0.25),
 (u'Antonio Ortiz Mena', 0.248),
 (u'Raul Salinas Lozano', 0.239),
 (u'Lazaro Cardenas', 0.214),
 (u'Gustavo Diaz Ordaz', 0.211),
 (u'Manuel Avila Camacho', 0.179)]
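Eigenvector centrality scores a node by the scores of its neighbors, so a node connected to well-connected nodes ranks higher than its degree alone would suggest. One common way to compute it is power iteration, sketched below in Python 3 on a hypothetical toy graph. The sketch assumes a connected, undirected graph, scales the scores to unit Euclidean length, and adds each node's own previous score at every step (iterating with A + I) to avoid oscillation; these choices are assumptions of the sketch and may differ in detail from the NetworkX version used in the notebook.

```python
import math


def eigenvector_centrality(adj, max_iter=1000, tol=1e-10):
    """Power iteration on A + I, normalized to unit Euclidean length."""
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        # New score = own previous score + sum of neighbor scores
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {v: val / norm for v, val in new.items()}
        # Stop once the scores no longer change between iterations
        if sum(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    raise RuntimeError('power iteration did not converge')


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to a
toy = {'a': ['b', 'c', 'd'], 'b': ['a', 'c'], 'c': ['a', 'b'], 'd': ['a']}
toy_eig = eigenvector_centrality(toy)
```

In the toy graph node `a` scores highest, `b` and `c` tie, and the pendant `d` scores lowest because its only neighbor is `a`.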

> We see many of the same names across the four measures. Other observations:  
* Miguel Aleman Valdes ranks first on all four measures
* Closeness scores are high across the board (every top-10 member scores 0.52 or above), as expected in a small, tightly connected network
* Closeness and betweenness rankings appear the most closely aligned (same top two, and Ramon Beteta is seventh in both)

#### Summarize results into one table (sorted by eigenvector)

In [170]:
# Reference: SNAS pg. 54
# Make a list of the elite group by merging top ten groups for 4 centrality metrics 
names1=[x[0] for x in deg_sort[:10]]
names2=[x[0] for x in close_sort[:10]]
names3=[x[0] for x in btw_sort[:10]]
names4=[x[0] for x in eig_sort[:10]]

# use Python sets to compute a union of the sets 
names = list(set(names1) | set(names2) | set(names3) | set(names4))

## Build a table with centralities 
table=[[name,deg[name],close[name],btw[name],eig[name]] for name in names]

import pandas as pd
headers = ['Name', 'Degrees', 'Closeness', 'Betweenness', 'Eigenvector']
df = pd.DataFrame(table, columns=headers)
df = df.sort(['Eigenvector', 'Degrees'], ascending=[0, 0])
df

| | Name | Degrees | Closeness | Betweenness | Eigenvector |
|---|---|---|---|---|---|
| 1 | Miguel Aleman Valdes | 17 | 0.667 | 0.23 | 0.371 |
| 0 | Antonio Carrillo Flores | 11 | 0.576 | 0.06 | 0.287 |
| 6 | Adolfo Ruiz Cortines | 13 | 0.567 | 0.132 | 0.273 |
| 4 | Hugo B. Margain | 12 | 0.548 | 0.09 | 0.273 |
| 9 | Adolfo Lopez Mateos | 10 | 0.531 | 0.035 | 0.25 |
| 8 | Antonio Ortiz Mena | 8 | 0.523 | 0.022 | 0.248 |
| 7 | Raul Salinas Lozano | 10 | 0.548 | 0.065 | 0.239 |
| 2 | Lazaro Cardenas | 12 | 0.586 | 0.157 | 0.214 |
| 11 | Gustavo Diaz Ordaz | 7 | 0.5 | 0.014 | 0.211 |
| 10 | Manuel Avila Camacho | 7 | 0.523 | 0.02 | 0.179 |


### 3) Compare your centrality measures across your categorical groups

#### First let's view the graph based on node attribute  
  
*Purple = Military nodes*     
*Green = Civilian nodes*

In [158]:
military = []
for k, v in mex.node.items():
    if v['professional'] == '1':
         military.append(k)
          
g.show(vlabel='id', h_offset=0.03, node_size=250, vlabel_hover=True, highlight=military)

> In such a closed network, my initial thought is that there are unlikely to be significant differences between the military and civilian groups. 

#### Add all measures to node attributes

In [92]:
nx.set_node_attributes(mex, 'degrees', deg)
nx.set_node_attributes(mex, 'closeness', close)
nx.set_node_attributes(mex, 'betweenness', btw)
nx.set_node_attributes(mex, 'eigenvector', eig)

# Sample node format
{k: mex.node[k] for k in mex.node.keys()[:3]}

{u'Antonio Carrillo Flores': {'betweenness': 0.06,
  'closeness': 0.576,
  'degrees': 11.0,
  'eigenvector': 0.287,
  'professional': '2'},
 u'Candido Aguilar': {'betweenness': 0.021,
  'closeness': 0.493,
  'degrees': 6.0,
  'eigenvector': 0.12,
  'professional': '1'},
 u'Luis Echeverria Alvarez': {'betweenness': 0.011,
  'closeness': 0.43,
  'degrees': 5.0,
  'eigenvector': 0.117,
  'professional': '2'}}

#### Group data measures by professional background (1 - military, 2 - civilian)

In [156]:
def create_groups(g, measure):
    military = []
    civilian = []
    for k, v in g.node.items():
        if v['professional'] == '1':
            military.append(v[measure])
        else:
            civilian.append(v[measure])
            
    return(military, civilian)

# Unpack both groups from a single call
military_deg, civilian_deg = create_groups(mex, 'degrees')

# Get group size
print "Military sample size: %d" %len(military_deg)
print "Civilian sample size: %d" %len(civilian_deg)

Military sample size: 12
Civilian sample size: 23


#### Compare average values in each group

In [108]:
def group_results(g, measure):
    dgroups = create_groups(g, measure)
    # Convert the count to float before dividing so integer inputs are not truncated in Python 2
    print "Military avg. %s: %.3f" %(measure, sum(dgroups[0])/float(len(dgroups[0])))
    print "Civilian avg. %s: %.3f\n" %(measure, sum(dgroups[1])/float(len(dgroups[1])))
        
group_results(mex, 'degrees')
group_results(mex, 'closeness')
group_results(mex, 'betweenness')
group_results(mex, 'eigenvector')

Military avg. degrees: 6.917
Civilian avg. degrees: 6.565

Military avg. closeness: 0.491
Civilian avg. closeness: 0.478

Military avg. betweenness: 0.045
Civilian avg. betweenness: 0.028

Military avg. eigenvector: 0.139
Civilian avg. eigenvector: 0.157



#### Are the groups different? At first glance at the graph and the numbers, it doesn't appear so. We can use a t-test to check for statistical significance

In [127]:
from scipy import stats

def calc_ttest(g, measure):
    military = create_groups(g, measure)[0]
    civilian = create_groups(g, measure)[1]
    t = stats.ttest_ind(military, civilian)
    print "%s: %.4f" %(measure, t[1])

print "T-test p-values:\n"
calc_ttest(mex, 'degrees')
calc_ttest(mex, 'closeness')
calc_ttest(mex, 'betweenness')
calc_ttest(mex, 'eigenvector')

T-test p-values:

degrees: 0.7708
closeness: 0.5635
betweenness: 0.3374
eigenvector: 0.5257
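By default `stats.ttest_ind` runs a Student's t-test with a pooled variance estimate, which for groups of 12 and 23 nodes gives 12 + 23 - 2 = 33 degrees of freedom. To make the statistic behind these p-values concrete, here is a minimal Python 3 sketch of the pooled t statistic. The toy samples are hypothetical, and the p-value itself (which needs the t distribution) is left to SciPy.

```python
import math


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance, plus its df."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    # Sums of squared deviations around each group mean
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = n_a + n_b - 2
    pooled_var = (ss_a + ss_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# Hypothetical toy samples
t_stat, df = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

For the toy samples the means differ by 1, the pooled variance is 1, and the standard error is sqrt(2/3), giving t of about -1.22 on 4 degrees of freedom.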


> Confirmed: none of the measures differ significantly between the groups at any conventional significance level (all p-values exceed 0.3).
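One caveat is that centrality scores of nodes in the same network are not independent observations, which the t-test assumes. A permutation test on the group labels is one possible cross-check that avoids distributional assumptions. The Python 3 sketch below is not part of the original analysis: the function name, the default of 10,000 shuffles, the fixed seed, and the toy data are all assumptions made for illustration.

```python
import random


def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: identical groups, then clearly separated groups
p_same = permutation_test([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], n_perm=200)
p_apart = permutation_test([0.0] * 6, [10.0] * 6, n_perm=2000)
```

With the notebook's data, the two lists returned by `create_groups(mex, measure)` could be passed in directly as `group_a` and `group_b`.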

### Conclusion

This is a small, closed network, so we would expect high centrality scores, particularly for closeness. By the same token, we found no significant difference in centrality between Mexican leaders with military and civilian professional backgrounds. Perhaps this is not overly surprising, given that the network essentially comes from one political party and the dataset itself aims to highlight the connectivity of a single 'Mexican Elite' network. While the centrality measures may not differ between the military and civilian groups, another look at the attribute-divided graph suggests there may be some clustering or other variability.
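One simple way to start probing the possible clustering would be to count how many ties fall within each group versus across groups. The Python 3 sketch below does this for a plain edge list and a node-to-group dictionary. The function name and toy data are hypothetical, and no such counts were computed in this analysis.

```python
from collections import Counter


def mixing_counts(edges, group):
    """Count undirected edges by the (sorted) pair of endpoint groups."""
    counts = Counter()
    for u, v in edges:
        pair = tuple(sorted((group[u], group[v])))
        counts[pair] += 1
    return counts


# Hypothetical toy data using the same '1' / '2' group coding
toy_edges = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'c')]
toy_group = {'a': '1', 'b': '1', 'c': '2', 'd': '2'}
toy_mixing = mixing_counts(toy_edges, toy_group)
```

A clear excess of within-group ties over cross-group ties, relative to what the group sizes alone would produce, would point toward clustering by professional background.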

