### This script outlines the steps used in SNA: creating edgelist data and centrality analysis

###### Import Libraries

In [2]:
import networkx as nx
import pandas as pd
import ast
import itertools
import numpy as np
from matplotlib import pyplot as plt

##### Prepare data and construct edgelist dataset

The first goal is to construct the employee network from the employee groups. To do this, a variable (employee groups) generated during the merge process is used. The network is modeled as an undirected, unweighted graph.

First, we remove employee groups of size 0 and 1, since a group with fewer than two members does not form any link.

In [4]:
def prepare_data(df):
    # Drop groups of size 0 or 1 without modifying the caller's DataFrame
    too_small = df['group_size'].isin([0, 1])
    df = df.loc[~too_small, ['hash_keys']].copy()
    # hash_keys holds each group as a string; parse it into a Python list
    df['hash_keys'] = df['hash_keys'].apply(ast.literal_eval)

    return df
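
To make the expected input concrete, here is a minimal sketch of the same filtering on a toy DataFrame. The column names `group_size` and `hash_keys` come from the function above; the values are hypothetical, and the sketch assumes `hash_keys` stores each group as a string such as `"[35, 32, 83]"`, which is why `ast.literal_eval` is needed.

```python
import ast
import pandas as pd

# Hypothetical input: each group is stored as the string form of a list
toy = pd.DataFrame({
    "group_size": [0, 1, 3],
    "hash_keys": ["[]", "[7]", "[35, 32, 83]"],
})

# Keep only groups that can form at least one link, then parse the strings
kept = toy.loc[~toy["group_size"].isin([0, 1]), ["hash_keys"]].copy()
kept["hash_keys"] = kept["hash_keys"].apply(ast.literal_eval)

print(kept["hash_keys"].tolist())  # [[35, 32, 83]]
```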

Next, we generate the edgelist by iterating over every group and creating all pairwise combinations of its members. For example, the employees with IDs 1, 2, and 3 in one group are nodes in the network, and each pair of them is linked, meaning they have worked together in a shift or during the repair of a machine.
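
As a quick illustration of the pairing step, the sketch below enumerates the links for the example group with IDs 1, 2, 3. The variable names are illustrative only.

```python
import itertools

# Example group of three employee IDs
group = [1, 2, 3]

# Every unordered pair within the group becomes one undirected edge
edges = list(itertools.combinations(group, 2))
print(edges)  # [(1, 2), (1, 3), (2, 3)]

# A group of k employees contributes k * (k - 1) / 2 edges
assert len(edges) == len(group) * (len(group) - 1) // 2
```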

In [6]:
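def generate_edgelist(df):
    combinations = []  # stays empty if df has no rows

    # Append mode: re-running this cell adds duplicate lines to edgelist.csv
    with open('edgelist.csv', 'a') as fp:
        for _, row in df.iterrows():
            # Every unordered pair of employees in a group becomes one edge
            combinations = list(itertools.combinations(row['hash_keys'], 2))
            for source, target in combinations:
                fp.write(f"{source} {target}\n")

    # Only the pairs of the last processed group are returned
    return combinations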

In [7]:
generate_edgelist(hash_keys_df)

[(35, 32), (35, 83), (32, 83)]

##### Social network analysis

With the edgelist in place, we can construct the employee network with the networkx library and analyze some of its features.

In [9]:
# Read the edgelist into an undirected graph (adjust the path as needed).
# Node labels are read as strings by default.
graph = nx.read_edgelist('edgelist.csv')
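
The outputs further down show node labels as strings (for example `'190'`). If integer labels are preferred, `nodetype=int` can be passed when reading. The sketch below uses `nx.parse_edgelist` on hypothetical in-memory lines so it runs without a file; it also shows that a repeated pair collapses into a single edge in an undirected `nx.Graph`.

```python
import networkx as nx

# Hypothetical lines in the same "source target" format as edgelist.csv
lines = ["35 32", "35 83", "32 83", "32 35"]

# nodetype=int converts the labels from strings to integers
toy_graph = nx.parse_edgelist(lines, nodetype=int)

print(toy_graph.number_of_nodes())  # 3
print(toy_graph.number_of_edges())  # 3, since "32 35" repeats "35 32"
```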

##### Network Statistics

Some conventional network statistics, e.g., the highest and lowest node degree, the diameter, and the connectivity of the network.

In [10]:
# Note: nx.info was removed in NetworkX 3.0; on newer versions, print(graph) gives a short summary
print(nx.info(graph))

# Density of the graph
density = nx.density(graph)
print("Network density:", density)

# %% network stats
print("Max node degree (in + out) is {}".format(max(nx.degree(graph), key=lambda x: x[1])[1]))
print("Min node degree (in + out) is {}".format(min(nx.degree(graph), key=lambda x: x[1])[1]))
print("Diameter of the network is {}".format(nx.diameter(graph)))
print("Radius of the network is {}".format(nx.radius(graph)))
print("Average distance in the network {}\n".format(nx.average_shortest_path_length(graph)))

print("Is bipartite? {}".format(nx.is_bipartite(graph)))
print("Is connected? {}".format(nx.is_connected(graph)))

# %% Node and edge removal
print("Number of node removals - {}, the node - {}".format(nx.node_connectivity(graph), nx.minimum_node_cut(graph)))
print("Number of edge removals - {}, the edge - {}".format(nx.edge_connectivity(graph), nx.minimum_edge_cut(graph)))

Name: 
Type: Graph
Number of nodes: 196
Number of edges: 6796
Average degree:  69.3469
Network density: 0.3556253270538985
Max node degree (in + out) is 153
Min node degree (in + out) is 15
Diameter of the network is 3
Radius of the network is 2
Average distance in the network 1.6680272108843537

Is bipartite? False
Is connected? True
Number of node removals - 15, the node - {'190', '182', '81', '34', '5', '17', '131', '43', '125', '62', '23', '104', '95', '39', '54'}
Number of edge removals - 15, the edge - {('79', '62'), ('79', '5'), ('79', '104'), ('79', '39'), ('79', '190'), ('79', '131'), ('79', '182'), ('79', '23'), ('79', '125'), ('79', '17'), ('79', '54'), ('79', '81'), ('79', '95'), ('79', '43'), ('79', '34')}
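
The reported average degree and density follow directly from the node and edge counts above. For an undirected graph with n nodes and m edges, the average degree is 2m / n and the density is 2m / (n(n - 1)). A short check:

```python
n_nodes = 196
n_edges = 6796

# Each undirected edge contributes to the degree of two nodes
average_degree = 2 * n_edges / n_nodes

# Share of the n * (n - 1) / 2 possible edges that are present
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print(round(average_degree, 4))  # 69.3469
print(round(density, 4))         # 0.3556
```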


###### Centrality measures

Betweenness, closeness, and degree centrality metrics are used to evaluate node importance in a network.
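
To make two of these definitions concrete, the following sketch computes degree and closeness centrality by hand on a hypothetical four-node path. It assumes a connected graph, where closeness reduces to (n - 1) divided by the sum of shortest-path distances; the adjacency dict and helper name are illustrative and not part of the analysis.

```python
from collections import deque

# Hypothetical toy network: a path 1 - 2 - 3 - 4 as an adjacency dict
adjacency = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
n = len(adjacency)

def bfs_distances(source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

# Degree centrality: fraction of the other n - 1 nodes a node is linked to
degree_centrality = {v: len(nbrs) / (n - 1) for v, nbrs in adjacency.items()}

# Closeness centrality (connected graph): n - 1 over the sum of distances
closeness_centrality = {
    v: (n - 1) / sum(bfs_distances(v).values()) for v in adjacency
}

print(closeness_centrality[1], closeness_centrality[2])  # 0.5 0.75
```

The inner nodes 2 and 3 score higher on both measures than the endpoints, which matches the intuition that they are better connected and closer to everyone else.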

In [13]:
# degree centrality
deg_centrality = nx.degree_centrality(graph)
size = [deg_centrality[n] * 5000 for n in graph.nodes()]

print("5 most important nodes as per Degree centrality\n {}\n\n".format(
    [n[0] for n in sorted(deg_centrality.items(), key=lambda x: x[1], reverse=True)[:5]]
))

# closeness centrality
closeness_centrality = nx.closeness_centrality(graph)
size = [-1000 * np.log((1 - closeness_centrality[n])) ** 3 for n in graph.nodes()]

print("5 most important nodes as per Closeness centrality\n {}\n\n".format(
    [n[0] for n in sorted(closeness_centrality.items(), key=lambda x: x[1], reverse=True)[:5]]
))

5 most important nodes as per Degree centrality
 ['108', '109', '166', '116', '63']


5 most important nodes as per Closeness centrality
 ['108', '109', '166', '116', '63']
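
Betweenness centrality is named above but not computed in the cell. A minimal sketch of how it could be ranked in the same way, shown here on a hypothetical three-node graph rather than the employee network:

```python
import networkx as nx

# Hypothetical toy network: node "b" bridges "a" and "c"
toy_graph = nx.Graph([("a", "b"), ("b", "c")])

# Fraction of shortest paths between other node pairs that pass through each node
betweenness = nx.betweenness_centrality(toy_graph)

top = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:1]
print([node for node, _ in top])  # ['b']
```

On the full employee network, the same call would take `graph` as its argument and the slice could be widened to `[:5]` to match the other rankings.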


