# Project 2: Two Mode Network  
IS620 Web Analytics  
Aaron Palumbo | Partho Banarjee  

In [154]:
import IPython.display as dis
import networkx as nx
import pandas as pd

# Additional setup for rendering plots inline
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 8)

Populating the interactive namespace from numpy and matplotlib


## Data

#### Norwegian Interlocking Directorate (August 2009)

Excerpt from the data definition: 

> This website is the 2-year continuation of a research project on board memberships and gender in Norway. The main foundation of the paper is a gender representation law that required all public limited companies to compose their boards with at least 40% of each gender by January 2008. The paper attempted to strike a balance between the urgency of studying the gender representation law and the amount of data available (the analysis relied on data from August 2009).

> We collected a list of all the 384 public limited companies in Norway (Allmennaksjeselskap or ASA) that were available online through the Norwegian Business Register on August 5, 2009. We chose these companies as they are the ones bound by the gender representation law.

For this project, we chose to work with the August 2011 snapshot of the data. The available files are listed on the project's data page:

In [155]:
dis.IFrame(src="http://www.boardsandgender.com/data.php",
           width=800, height=400)

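Specifically, we use the 1 August 2011 two-mode edge list (`net2m_2011-08-01.txt`), together with the company and people lookup files (`data_companies.txt` and `data_people.txt`).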

## Data Cleaning

First we need to clean this up a bit. Our data file contains two columns: the first identifies a board (company), and the second identifies a director who sits on that board. Let's take a look at what we have:

In [156]:
netDF = pd.read_csv("net2m_2011-08-01.txt", sep=" ", 
                    header=None, names=["companyID", "personalID"])
print(netDF.shape)
netDF.head()

(1746, 2)


| | companyID | personalID |
|---|---|---|
| 0 | 1 | 2149 |
| 1 | 1 | 2910 |
| 2 | 1 | 3684 |
| 3 | 1 | 3754 |
| 4 | 2 | 766 |


In [157]:
print("The range of values for company are: {} to {}".format(
    netDF.companyID.min(), netDF.companyID.max()))

The range of values for company are: 1 to 384


In [158]:
print("The range of values for person are: {} to {}".format(
    netDF.personalID.min(), netDF.personalID.max()))

The range of values for person are: 3 to 5766


In [159]:
# Quick check: number of companyID entries whose value also appears as a personalID
sum([i in netDF.personalID.values for i in netDF.companyID.values])

414

We can't add these to a networkx graph yet: networkx identifies nodes by their value, so a company and a person that share the same number would be treated as the same node.
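
The collision can be illustrated on a toy graph. This is a minimal sketch: the edge list `raw_edges` is hypothetical and not taken from the data file, and it only assumes that `nx.Graph` treats equal values as one node.

```python
import networkx as nx

# Hypothetical rows: company 1 has two directors, and person 1 sits on the board of company 7.
raw_edges = [(1, 2149), (1, 2910), (7, 1)]

g_raw = nx.Graph()
g_raw.add_edges_from(raw_edges)

# Five distinct entities were described, but the graph holds only four nodes:
# company 1 and person 1 have been merged.
print(g_raw.number_of_nodes())
# The merged node is linked to both directors of company 1 and to company 7.
print(sorted(g_raw.neighbors(1)))
```

In a two-mode network this would silently connect boards to each other through a node that is neither a company nor a person.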

In [160]:
companyID = pd.read_csv("data_companies.txt", delimiter='\t', 
                        header=None, names=["companyID", "orgNum", "name", "address"])
companyID.head()

| | companyID | orgNum | name | address |
|---|---|---|---|---|
| 0 | 1 | 879447992 | 24SEVENOFFICE ASA | 0667 OSLO |
| 1 | 2 | 990031479 | A-COM NORGE ASA | 0355 OSLO |
| 2 | 3 | 890687792 | ABERDEEN EIENDOMSFOND ASIA ASA | 0230 OSLO |
| 3 | 4 | 989761390 | ABERDEEN EIENDOMSFOND NORDEN/BALTIKUM ASA | 0255 OSLO |
| 4 | 5 | 988671258 | ABERDEEN EIENDOMSFOND NORGE II ASA | 0255 OSLO |


In [161]:
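# Assumed tab-delimited like data_companies.txt: names contain spaces,
# so splitting on a single space would break the rows
peopleID = pd.read_csv("data_people.txt", delimiter='\t')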
peopleID.columns = ["personalID", "name", "gender"]
peopleID.head()

| | personalID | name | gender |
|---|---|---|---|
| 0 | 1 | Aage Jakobsen | 1 |
| 1 | 2 | Aage Johan Remøy | 1 |
| 2 | 3 | Aage Rasmus Bjelland Figenschou | 1 |
| 3 | 4 | Aagot Irene Skjeldal | 2 |
| 4 | 5 | Aase Gundersen | 2 |


Once the graph is built, we should be able to attach this information to its nodes. To make sure the node IDs are unique, we prepend a 'p' to each person ID and a 'c' to each company ID:

In [162]:
netDF.companyID     = ['c' + str(i) for i in netDF.companyID]
netDF.personalID    = ['p' + str(i) for i in netDF.personalID]
companyID.companyID = ['c' + str(i) for i in companyID.companyID]
companyID.index = companyID.companyID
peopleID.personalID = ['p' + str(i) for i in peopleID.personalID]
peopleID.index = peopleID.personalID
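
The same prefixing can also be written with vectorized string operations, followed by a check that the two ID sets no longer overlap. This is a minimal sketch on a hypothetical miniature edge list (`edges` is not the real `netDF`):

```python
import pandas as pd

# Hypothetical miniature edge list in which the value 1 appears in both columns.
edges = pd.DataFrame({"companyID": [1, 1, 7], "personalID": [2149, 2910, 1]})

# Cast each ID column to string, then prepend the tag for its node type.
edges["companyID"] = "c" + edges["companyID"].astype(str)
edges["personalID"] = "p" + edges["personalID"].astype(str)

# After prefixing, no label should occur in both columns.
overlap = set(edges["companyID"]) & set(edges["personalID"])
print(edges)
print(overlap)
```

An empty `overlap` set indicates that company and person labels can now coexist in one graph without being merged.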

In [163]:
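# Build the graph from the edge list: one edge per (company, person) row.
# Note: newer networkx releases (2.x) call this function nx.from_pandas_edgelist.
g = nx.from_pandas_dataframe(netDF, 'companyID', 'personalID')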

In [164]:
# nx.draw(g)
# plt.show()

In [173]:
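# Populate node attributes
# (g.node is the networkx 1.x accessor; later releases use g.nodes instead)
for n in g.node.keys():
    if n[0] == 'c':
        # Company node: look up its name; gender does not apply
        g.node[n]['name'] = companyID.loc[n, 'name']
        g.node[n]['gender'] = 'NA'
    elif n[0] == 'p':
        # Person node: gender code 1 is mapped to 'male', anything else to 'female'
        g.node[n]['name'] = peopleID.loc[n, 'name']
        g.node[n]['gender'] = 'male' if peopleID.loc[n, 'gender'] == 1 else 'female'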
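
Attribute population can also be sketched with `add_node`, which updates the attributes of a node that already exists and does not depend on the `g.node` accessor. The lookup dictionaries and the board membership below are hypothetical stand-ins for `companyID`, `peopleID` and the real edge list. The names and gender codes are copied from the table previews above.

```python
import networkx as nx

# Hypothetical lookup tables standing in for companyID and peopleID.
company_names = {"c1": "24SEVENOFFICE ASA"}
people = {"p1": ("Aage Jakobsen", 1), "p4": ("Aagot Irene Skjeldal", 2)}

# Hypothetical board membership, used only to create the nodes.
g_demo = nx.Graph()
g_demo.add_edges_from([("c1", "p1"), ("c1", "p4")])

for n in list(g_demo.nodes()):
    if n.startswith("c"):
        # Company node: gender does not apply.
        g_demo.add_node(n, name=company_names[n], gender="NA")
    else:
        # Person node: code 1 is mapped to 'male', anything else to 'female'.
        name, code = people[n]
        g_demo.add_node(n, name=name, gender="male" if code == 1 else "female")

attrs = dict(g_demo.nodes(data=True))
print(attrs["p4"])
```

Using plain Python strings for the attribute values, as here, keeps them straightforward to serialize on export.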

In [175]:
nx.write_gexf(g, 'export.gexf')