<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M2_power_elites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install --upgrade scipy -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m34.5/34.5 MB[0m [31m26.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
# Basic packaging for network exploration
import pandas as pd
import networkx as nx
from community import community_louvain

import altair as alt

# Exploring the graph of Danish Power Elites

![](https://source.unsplash.com/GWe0dlVD9e0)

> Many people dream of being one of them, but only few make it all the way to the top. According to two CBS researchers, it takes more than just hard work to get to the top of the Danish hierarchy of power. [read more](https://www.cbs.dk/en/alumni/news/a-look-the-danish-power-elite)

In this project we are going to construct and explore a network of Danish power elites derived from boards of various organisations in th country.
We will construct an association network: Who is being in the same board? And first explore "basic" centrality indicators. Then identify communities and central persons within those. Finally we look at some "fancier" interactive network visualisation.

In this tutorial we will be using some more advanced Pandas techniques that may be new for you. Use the documentation if in doubt.

You can read some related research [here](https://research-api.cbs.dk/ws/portalfiles/portal/57663543/anton_grau_larsen_and_christoph_houmann_ellersgaard_who_listens_to_the_top_acceptedversion.pdf)



# Obtaining exploring data

## Loading the data

In [3]:
# Import data :-) and quick check
data = pd.read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/00_data/networks/elite_den17.csv')
data.head()

Unnamed: 0,NAME,AFFILIATION,ROLE,TAGS,POSITION_ID,ID,SECTOR,TYPE,DESCRIPTION,CREATED,ARCHIVED,LAST_CHECKED,CVR_PERSON,CVR_AFFILIATION,PERSON_ID,AFFILIATION_ID
0,Aage Almtoft,Middelfart Sparekasse,Member,"Corporation, FINA, Banks, Finance",1,95023,Corporations,,Automatisk CVR import at 2016-03-12 18:01:28: ...,2016-03-12T18:01:28Z,,2017-11-09T15:38:01Z,4003984000.0,24744817.0,1,3687
1,Aage B. Andersen,Foreningen Østifterne - Repræsentantskab (Medl...,Member,"Charity, Foundation, Insurance, Socialomraadet",4,67511,NGO,Organisation,Direktør,2016-02-05T14:45:10Z,,2016-02-12T14:41:09Z,,,3,2528
2,Aage Christensen,ÅRHUS SØMANDSHJEM,Chairman,"Foundation, Marine, Tourism",6,100903,Foundations,,Automatisk CVR import at 2016-03-12 18:08:31: ...,2016-03-12T18:08:31Z,,2017-11-09T15:50:09Z,4000054000.0,29094411.0,4,237
3,Aage Dam,"Brancheforeningen automatik, tryk & transmissi...",Chairman,"Business association, Interest group, Technology",8,69156,NGO,Organisation,"Formand, Adm. direktør, Bürkert Contromatic A/S",2016-02-10T15:18:47Z,,2016-02-10T14:19:20Z,,,5,469
4,Aage Dam,Dansk Erhverv (bestyrelse),Member,Employers association,9,72204,NGO,Stat,Adm. dir. Aage Dam- Bürkert-Contromatic A/S,2016-02-16T10:49:01Z,,2016-02-16T11:55:34Z,,43232010.0,5,1041


As we can see each person has different attributes, among others IDs and affiliation IDs. There are also data for sector and role tha we could use for filtering or EDA.

In [4]:
data['AFFILIATION'].value_counts(ascending=False).nlargest(20)

H.M. Dronningens 75-års fødselsdag                                                           803
Axcelfuture - konferencedeltagere                                                            332
Gallatafler ved statsbesøg (2016-17) I                                                       250
Uddannelses- og Forskningsministeriet (Kvalifikationsnævnet - Medlemmer)                     232
Miljø- og Fødevareministeriet (Natur- og Miljøklagenævnet - Den sagkyndige sammensætning)    214
Folketingets Presseloge (Institutioner under Folketinget) (Medlemmer)                        199
Reception på Kongeskibet Dannebrog (2017)                                                    196
Gallatafler ved statsbesøg (2016-17) II                                                      195
Landvind i en ny virkelighed (Konference)                                                    148
Nytårskur og -taffel (2016 – 2018)                                                           146
Venstre (Hovedbestyrelse)     

In [5]:
data['SECTOR'].value_counts(ascending=False)

NGO             17720
State           13601
Corporations     7989
Foundations      6987
VL_networks      3803
Events           1948
Parliament       1087
Commissions       795
Municipal         320
Family            207
Politics           37
Organisation        6
Name: SECTOR, dtype: int64

In [6]:
# this would be the way for you to subset for corporate affiliation (which also have CVR numbers)
data = data.query('SECTOR == "Corporations"')
data = data.dropna(subset = ['CVR_AFFILIATION'])

In [7]:
data['AFFILIATION'].value_counts(ascending=False).nlargest(20)

Kromann Reumert                     55
Bech-Bruun                          54
Gorrissen Federspiel                40
Plesner                             40
EnergiMidt                          31
Lett Law Firm                       27
Syd Energi (SE)                     24
TDC (note)                          24
Bruun & Hjejle                      23
Dansk Retursystem                   22
Alm. Brand                          22
Danske Bank                         21
SEAS-NVE                            20
Rønne & Lundgren                    20
Nykredit Realkredit (Bestyrelse)    20
Carlsberg                           19
Naturgas Fyn                        19
PensionDanmark                      19
Vandcenter Syd                      19
Novo Nordisk                        18
Name: AFFILIATION, dtype: int64

In [8]:
data['NAME'].value_counts(ascending=False).nlargest(20)

Karen Frøsig                  7
Gert Rinaldo Jonassen         7
Jeppe Christiansen            6
Michael Christiansen 25501    6
Henning Kruse Petersen        6
Jørgen Huno Rasmussen         5
Anders Christen Obel          5
Niels Thomas Heering          5
Preben Sunke                  5
Jens Bjerg Sørensen           5
Lars Nørby Johansen           5
Jørn Ankær Thomsen            5
Niels Jørgen Kornerup         5
John Christiansen 16895       5
Kim Simonsen                  5
Lasse Nyby                    5
Niels Jacobsen 27459          5
David Hellemann               5
Hans Henrik Kjølby 10930      5
John Bull Fisker              5
Name: NAME, dtype: int64

## EDA

In [9]:
toplot = data['AFFILIATION'].value_counts(ascending=False).nlargest(20).reset_index()
alt.Chart(toplot).mark_bar().encode(
    x='index:N',
    y='AFFILIATION:Q'
)

## Edgelist construction

Given that each person and affiliation have unique IDs, we have perfect input for network construction




In [10]:
# select name and IDs
data_select = data[['NAME', 'PERSON_ID', 'AFFILIATION_ID']]

We can create an edge dataframe utilising a "trick" where we merge the dataframe with itself using `AFFILIATION_ID` as key. The only thing that we then need to remove are self-links since a person can not really be in a board with itself.

The initial dataframe has ~60 rows. The new after the merger ~160k. That looks promising.

In [11]:
# create edge DF by merge with itself.
edges = pd.merge(data_select, data_select, on='AFFILIATION_ID')
edges.head()

Unnamed: 0,NAME_x,PERSON_ID_x,AFFILIATION_ID,NAME_y,PERSON_ID_y
0,Aage Almtoft,1,3687,Aage Almtoft,1
1,Aage Almtoft,1,3687,Allan Buch,311
2,Aage Almtoft,1,3687,Bo Skovby Rosendahl,4491
3,Aage Almtoft,1,3687,Bo Smith 4493,4493
4,Aage Almtoft,1,3687,Martin Nørholm Baltser,24816


In [12]:
# Filter out self-edges
edges = edges[edges.PERSON_ID_x != edges.PERSON_ID_y]

We are now in a situation whre people that sit in multiple boards together will have one row per board. This can be aggregated in the following way by grouping.



In [13]:
# grouping to aggregate multiple co-occurences and to generate a weight: 
# How many times did PesonX and PersonY sit in boards together
# reset_index makes everytging from a multi-index-series into a dataframe
edges = edges.groupby(['PERSON_ID_x', 'PERSON_ID_y']).size().reset_index()

In [14]:
# column "0" is now our weight
edges.head()

Unnamed: 0,PERSON_ID_x,PERSON_ID_y,0
0,1,311,1
1,1,4491,1
2,1,4493,1
3,1,24816,1
4,1,31093,1


In [15]:
edges[0].value_counts()

1    58824
2     1560
4      222
3       34
5       12
Name: 0, dtype: int64

In [16]:
# finally we rename the "0" column to weight
edges.rename({0:'weight'}, axis = 1, inplace=True)

In [17]:
len(edges)

60652

Most of the people co-occure only once. There are only 4 cases where 2 people meet each other in 15 boards. This is also the strongest weight.

## Creating the Graph object with NetworkX

Now we can create a network object from this edgelist. From here we will calculate various centrality measures and perform community detection. Think about the latter as UML (which it actually is).
This will allow us to investigate e.g.:

- Are there power clusters within different domains (education, agriculture...)?
- Who are the top people in these communities



In [18]:
# Create network object from pandas edgelist
G = nx.from_pandas_edgelist(edges, source='PERSON_ID_x', target='PERSON_ID_y', edge_attr='weight', create_using=nx.Graph())

Using a pandas edgelist as source is a source for the graph object allows us to instantiate it with the weight attribute included

In [19]:
# We can create a node-attribute dictionary directly from the dataframe (using pandas to_dict)
node_attributes = data_select[['PERSON_ID','NAME']].set_index('PERSON_ID').drop_duplicates().to_dict('index')

In [20]:
# We now can include the degree as node-attribute
nx.set_node_attributes(G, {G.degree(): 'degree'})

In [21]:
# and use the node_attribute object to include all that in the graph object
nx.set_node_attributes(G, node_attributes)

In [22]:
len(G.nodes())

6479

In [23]:
len(G.edges())

30326

Subsetting og Graph objects in NetworkX is a bit of a challenge sometimes - Well, you need to remember that NetworkX wants us to pass a list of Node-IDs for subsetting. The easiest here (probably also most elegant) is to use a list comprehension with a condition statemen `if d > 1`

In [24]:
# Subset the graph keeping only nodes with degree > 1
G = nx.subgraph(G, [n for n,d in G.degree() if d > 1])

In [25]:
# Here we can calculate different centrality indicators as well as partition (community detection)
centrality_dgr = nx.degree_centrality(G)
centrality_eig = nx.eigenvector_centrality_numpy(G, weight = 'weight')

In [26]:
partition = community_louvain.best_partition(G) #that will take some time...

In [27]:
# All these indicators can now be set as attribute of the Graph
nx.set_node_attributes(G, centrality_dgr, 'dgr')
nx.set_node_attributes(G, centrality_eig, 'eig')
nx.set_node_attributes(G, partition, 'partition')

## Bringing it back to pandas

Once all graph indicators are in place, we can bring them back to Pandas for easier further analysis 🧐. You can compare that step to inspecting individual clusters identified with e.g. K-means.


In [28]:
# This is how you turn a Graph object (NetworkX) to a Dataframe
nodes_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')

In [29]:
nodes_df.head()

Unnamed: 0,NAME,dgr,eig,partition
1,Aage Almtoft,0.001389,-3.817603e-19,0
311,Allan Buch,0.001389,-2.655743e-19,0
4491,Bo Skovby Rosendahl,0.001389,-8.701799e-19,0
4493,Bo Smith 4493,0.001389,-7.127816999999999e-19,0
24816,Martin Nørholm Baltser,0.001389,-3.1644569999999995e-19,0


Let's see who are the most central people in DK.
I guess, you'll not be surprised by the result

In [30]:
# For that we can e.g. sort the dataframe by eigenvector (only first 10 rows)
nodes_df.sort_values('eig', ascending=False)[:10]

Unnamed: 0,NAME,dgr,eig,partition
5040,Carsten Fode,0.012504,0.136354,29
24335,Marianne Philip,0.01096,0.135653,29
31548,Poul Viggo Bartels Petersen,0.009416,0.135155,29
17672,Jørgen Kjergaard Madsen,0.009416,0.135081,29
12685,Henrik Møgelmose,0.009416,0.135081,29
12891,Henrik Thal Jantzen,0.009262,0.135023,29
41537,Christina Bruun Geertsen,0.008336,0.134723,29
51939,Lau Normann,0.008336,0.134723,29
5879,Christian Lundgren,0.008336,0.134723,29
51962,Kolja Staunstrup,0.008336,0.134723,29


Let's see if we can find out about the largest communities and the most central people within these:

In [31]:
# How many communities are there (identified automatically)
nodes_df.partition.nunique()

623

In [32]:
# Let's look at how many people there are in the top 20
nodes_df.partition.value_counts()[:20]

11     260
52     218
61     196
15     168
72     161
12     136
13     126
44     116
7      115
45     104
8       99
24      90
55      88
14      82
25      81
73      80
17      78
16      71
100     71
29      67
Name: partition, dtype: int64

In [33]:
# Perhaps let's check out the first 10
top10_com = nodes_df.partition.value_counts()[:10].index

In [34]:
# Complicated approach using NetworkX
# Who are the people in these?
top10_com_nodes = nodes_df[nodes_df.partition.isin(top10_com)].index

# Let's make a subgraph with only these people (one could also do it in pandas)
g_sub = nx.subgraph(G, top10_com_nodes)

# From here to dataframe

In [35]:
# Simple approach using Pandas
# Now we will limit the resulting dataframe to the top10 communities
nodes_df_top10 = nodes_df[nodes_df.partition.isin(top10_com)]

In [36]:
nodes_df_top10

Unnamed: 0,NAME,dgr,eig,partition
136,Agnete Raaschou-Nielsen,0.004168,1.988567e-04,52
3631,Birgitte Nielsen 3631,0.003705,5.611508e-06,7
4863,Carl Erik Mathias Uhlén,0.001081,4.157223e-06,52
11466,Heine Jakob Heinsvig,0.001081,4.172544e-06,52
16150,Jesper Hessellund Arkil,0.001081,4.172544e-06,52
...,...,...,...,...
195511,Kimberly Lewis Clark,0.000463,4.777151e-10,72
195512,Rodney Ernest Carlson,0.000463,4.777151e-10,72
57664,John Henrik Madsen 57664,0.000463,9.118352e-09,45
57667,Alice Connie Madsen,0.000463,9.118352e-09,45


In [37]:
# Let's look at the "most important" people by grouping up and keeping the 5 observations
# with the highest eigenvector centrality
top_people = nodes_df_top10.groupby('partition')['eig'].nlargest(5).reset_index()

In [38]:
top_people

Unnamed: 0,partition,level_1,eig
0,7,21835,0.002968732
1,7,47682,0.002951876
2,7,46867,0.00294762
3,7,195697,0.00294762
4,7,5977,0.00294762
5,11,17035,0.002738571
6,11,6231,0.002725707
7,11,28861,0.002725707
8,11,36144,0.002725707
9,11,316,0.002725707


In [39]:
# After that we need to bring back ID's (rename) and Names (merge)
top_people.rename({'level_1':'PERSON_ID'}, axis=1, inplace=True)
top_people = pd.merge(top_people, data_select[['NAME','PERSON_ID']].drop_duplicates(), on='PERSON_ID', how='inner')

In [40]:
top_people

Unnamed: 0,partition,PERSON_ID,eig,NAME
0,7,21835,0.002968732,Lars Nørby Johansen
1,7,47682,0.002951876,Vivian Lund
2,7,46867,0.00294762,Maria Elisabeth Sandblom
3,7,195697,0.00294762,Scott Egan 195697
4,7,5977,0.00294762,Christian Sletten
5,11,17035,0.002738571,John Lesbo
6,11,6231,0.002725707,Claus Astrup-Larsen
7,11,28861,0.002725707,Otto Johannes Christensen
8,11,36144,0.002725707,Torben Christensen 36144
9,11,316,0.002725707,Allan Christensen 316


Now you can explore the names 😊 Happy stalking! 😏

## Fancier visualisatinos

Let's install some fancy visualisation infrastructure

In [41]:
!pip install -q holoviews==1.15.2
!pip install -q bokeh==2.4.0
!pip install -q datashader

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.3/4.3 MB[0m [31m36.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.1/17.1 MB[0m [31m40.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.5/18.5 MB[0m [31m21.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.4/18.4 MB[0m [31m49.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.2/18.2 MB[0m [31m76.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.5/76.5 KB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for datashape (setup.py) ... [?25l[?25hdone


In [42]:
# Import the libraries and link to the bokeh backend
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
from bokeh.plotting import show
kwargs = dict(width=800, height=800, xaxis=None, yaxis=None)
opts.defaults(opts.Nodes(**kwargs), opts.Graph(**kwargs))

In [43]:
# keeping only top nodes (extreme subsetting)
top_central_nodes = nodes_df[nodes_df.eig > nodes_df.eig.quantile(0.99)].index

In [44]:
# Create subset graph
g_sub = nx.subgraph(G, top_central_nodes)

In [45]:
# Create and save a layout.
g_layout = nx.layout.spring_layout(g_sub) 
g_plot = hv.Graph.from_networkx(g_sub, g_layout).opts(tools=['hover'], node_color='partition')
labels = hv.Labels(g_plot.nodes, ['x', 'y'], 'NAME')

In [46]:
# make the plot
from holoviews.operation.datashader import datashade, bundle_graph
bundled = bundle_graph(g_plot)

In [47]:
# show the plot
show(hv.render(bundled * labels.opts(text_font_size='6pt', text_color='white', bgcolor='gray')))