# Using network analysis to efficiently spread data awareness

I recently heard the wonderful story of how last year, Marks & Spencer launched a data academy to train staff across all parts of the business in data science - HR, Finance, IT, Sales and Operations, Logistics... After an intensive 18-month in-work data skills programme, these staff became data ambassadors for their business areas and local teams so that the effect could spread through the staff network. While it is unlikely that everyone at M&S will become a fully fledged data scientist overnight, the organisation as a whole is now far more data-oriented. For example, HR analyse data to identify staff who might be thinking of leaving; Sales, Operations and Logistics use data to work out how to reduce costs and streamline customer delivery. Importantly, staff ask questions in a data-oriented way and seek to support their decisions with evidence.

Inspired by this, one of my ambitions is to make my organisation data aware. I don't just mean having awesome data science teams developing cool new products and services (although I hope we will do this too), but rather having everyone in the organisation feel comfortable asking for data to make the best-informed decisions they can.

Being new to the organisation though, I was wondering how I might go about doing this. During my first month, my line manager introduced me to several key individuals who were invested in data in some way, either because they were trying to develop data-driven products and services, or because they were using data analyses to make decisions. Through these individuals, I got to know more and more people who had a love for data. I started to compile a list of these data fans and pondered how quickly we might be able to spread this awareness through the organisation.

I decided to get hold of the data behind the organisational chart with all staff's job titles, departments, line managers, functions, etc. so that I could get a feel for the organisation's structure and staff's interests. Then I ran some simple network analyses to help me spread data awareness across the organisation.





In [None]:
import plotly.graph_objs as go
import seaborn as sns

import networkx as nx
import pandas as pd
import numpy as np


from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)


In [None]:
fname = 'AnonymisedStaff.csv'
#fname = 'anon_staffdata.csv'
# Read every column used below as a string
colTypes = {'ID': str, 'Department': str, 'Manager': str, 'Location': str, 'DataFan': str}
#colTypes = {'Preferred Name': str, 'Business Unit': str, 'Line Manager': str, 'Location': str, 'DataFan': str}

allstaffdata = pd.read_csv(fname, encoding = "ISO-8859-1", dtype = colTypes)



In [None]:
# Only the last expression in a cell is displayed, so print these explicitly
print(allstaffdata.columns)
print(allstaffdata.head())
## For anonymised data:
department = 'Department'
identifier = 'ID' 
manager = 'Manager'
location = 'Location'

#department = 'Business Unit'
#identifier = 'Preferred Name'
#manager = 'Line Manager'
#location = 'Location'




## Where are the data staff?
* Which departments?
* Which company locations?
* Which teams (indicated by line managers)?

In [None]:
# Work on a copy so that the columns added below do not alter the raw data
staffdata = allstaffdata.copy()

datafans = staffdata[staffdata['DataFan']=='yes']
#List of departments covered by data fans
DFDepartments = np.unique(datafans[[department]])
Departments = np.unique(staffdata[[department]])
#List of locations covered by data fans
DFLocations = np.unique(datafans[[location]])
Locations = np.unique(staffdata[[location]])
#List of data fan line managers
DFManagers = np.unique(datafans[[manager]])
Managers = np.unique(staffdata[[manager]])

print('#Data fans: {}'.format(len(datafans)))

#Coverage of data fans:
print('#Departments covered by data fans: {} out of {}'.format(len(DFDepartments), len(Departments)))
print('#staff covered: {} out of {}\n'.format(len(staffdata[staffdata[department].isin(DFDepartments)]), len(staffdata)))
print(DFDepartments)
print('#Locations covered by data fans: {} out of {}'.format(len(DFLocations), len(Locations)))
print('#staff covered: {} out of {}\n'.format(len(staffdata[staffdata[location].isin(DFLocations)]), len(staffdata)))
print(DFLocations)
print('#Managers covered by data fans: {} out of {}'.format(len(DFManagers), len(Managers)))
print('#staff covered: {} out of {}\n'.format(len(staffdata[staffdata[manager].isin(DFManagers)]), len(staffdata)))
print(DFManagers)

#Identify more susceptible staff members in terms of sharing the same departments, locations or teams as existing fans.
staffdata['PotentialDepartmentDataFan'] = np.where(staffdata[department].isin(DFDepartments), 'yes', 'no')
staffdata['PotentialLocationDataFan'] = np.where(staffdata[location].isin(DFLocations), 'yes', 'no')
staffdata['PotentialManagerDataFan'] = np.where(staffdata[manager].isin(DFManagers), 'yes', 'no')

# Existing data fans count as covered on all three measures.
# Use .loc so the assignment is applied to staffdata itself rather than to a temporary copy.
isfan = staffdata[identifier].isin(datafans[identifier])
staffdata.loc[isfan, 'PotentialLocationDataFan'] = 'yes'
staffdata.loc[isfan, 'PotentialDepartmentDataFan'] = 'yes'
staffdata.loc[isfan, 'PotentialManagerDataFan'] = 'yes'

### Data fans in the staff network


In [None]:
import networkx as nx
#https://plot.ly/python/igraph-networkx-comparison/ (Comparison of two of the main Python network libraries.)

#The line management network
G = nx.convert_matrix.from_pandas_edgelist(staffdata, source=manager, target=identifier, 
                                           create_using=nx.DiGraph)
edges = pd.DataFrame({'target' : staffdata[identifier],
                      'source' : staffdata[manager]})

nodes = pd.DataFrame({'node' : staffdata[identifier],
                      'name' : staffdata[identifier],
                      'Department' : staffdata[department],
                      'Manager': staffdata[manager],
                      'Location' : staffdata[location],
                      'DataFan': staffdata['DataFan'],
                     'PotentialDepartmentDataFan': staffdata['PotentialDepartmentDataFan'],
                     'PotentialLocationDataFan': staffdata['PotentialLocationDataFan'],
                     'PotentialManagerDataFan': staffdata['PotentialManagerDataFan']})



In [None]:
#Get the team network - management as proxy for connections between individuals (team membership). 
#Two employees are connected if they share the same line manager.
#Get the list of managers (teams)
from itertools import permutations, chain

managers = staffdata[manager].unique()
manager_teams = {};
for mgr in managers:
    manager_teams[mgr] = staffdata[staffdata[manager]== mgr][identifier].tolist()

team_links = [list(permutations(team, 2)) for team in manager_teams.values()] 
team_links = list(chain(*team_links))
G.add_edges_from(team_links) 
  


In [None]:
# Some centrality measures
d = nx.degree(G)
c = nx.degree_centrality(G)
b = nx.betweenness_centrality(G)

In [None]:
## NB there are some managers who are in the graph but not in staffdata because they are externals, so
##there are a greater number of nodes in G than there are records in staffdata.

degreesdict = {name:degree for name, degree in d}

degrees = pd.DataFrame.from_dict(degreesdict, orient='index')
centralities = pd.DataFrame.from_dict(c, orient='index')
betweenness = pd.DataFrame.from_dict(b, orient='index')
degrees.columns = ['Degree']
centralities.columns = ['DegreeCentrality']
betweenness.columns = ['BetweennessCentrality']
print(degrees.sort_values(by='Degree', ascending=False).head(20))
print(centralities.sort_values(by='DegreeCentrality', ascending=False).head(20))
print(betweenness.sort_values(by='BetweennessCentrality', ascending=False).head(20))

datafannames = datafans[identifier].tolist()
fandegrees = degrees[degrees.index.isin(datafannames)]
fancentralities = centralities[centralities.index.isin(datafannames)]
fanbetweenness = betweenness[betweenness.index.isin(datafannames)]

print(fandegrees.sort_values(by='Degree', ascending=False).head())
print(fancentralities.sort_values(by='DegreeCentrality', ascending=False).head())
print(fanbetweenness.sort_values(by='BetweennessCentrality', ascending=False).head())


In [None]:
# Join degrees data to other data
# Each metric table is indexed by node name, so join each one on its own index
staffdata = staffdata.merge(degrees, left_on=identifier, right_index=True)
staffdata = staffdata.merge(centralities, left_on=identifier, right_index=True)

In [None]:
def plotMetricHist(df, colname='Degree'):
    
    data = [go.Histogram(x=df[colname])]
    layout = go.Layout(
        xaxis=dict(
            #type='log',
            autorange=True
        ),
        yaxis=dict(
            #type='log',
            autorange=True
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
    

In [None]:
#plotMetricHist(staffdata, 'Degree')
plotMetricHist(staffdata, 'DegreeCentrality')

In [None]:
staffdata.head()

In [None]:
def drawnetwork(staffdata=staffdata, field='DataFan', d=d, dthreshold=2, sizefield='Degree', 
                sizemultiplier=3, G=G):
    
    #Formatting settings
    #Set colour mappings to field    
    flist = staffdata[field].unique()
    numcolours = len(flist)
    colours = sns.color_palette("hls", numcolours)
    colourmappings = dict(zip(flist, colours))
    
    #Filtering by degree
    #Remove nodes with degree (d) below threshold (dthreshold)
    selected_nodes = [n for n in nodes.name if d[n] > dthreshold]
    plotgraph = G.subgraph(selected_nodes)
    
    # Set node positions
    pos = nx.spring_layout(plotgraph, seed=0)
    for node in plotgraph.nodes():
        plotgraph.nodes[node]['pos'] = pos[node]
        
    # Set other node attributes
    excluded = []
    xlist = []
    ylist = []
    textlist = []
    sizelist = []
    namelist = []
    colourlist = []
    
    


    for node in plotgraph.nodes():
        try:
            # Look the node up in the staffdata passed in, so that updated field
            # values (e.g. during networkevolution) are reflected in the colours
            record = staffdata[staffdata[identifier]==node]
            f = record[field].values[0]

            x, y = plotgraph.nodes[node]['pos']

            ## Node label for hover-over text
            text = node + ' <br>#connections: ' + str(d[node])

            ## Size the node depending on sizefield and sizemultiplier
            if sizefield=='':
                size = 1
            else:
                size = record[sizefield].values[0]

            ## Map the colour to the node depending on the field value
            ## (seaborn returns RGB values in 0-1; plotly's rgba() expects 0-255)
            red, green, blue = [int(255 * v) for v in colourmappings[f]]
            fcolour = 'rgba({}, {}, {}, {})'.format(red, green, blue, .8)
        except Exception:
            excluded.append(node)
            continue

        # Append only once every value is available, so the lists stay aligned
        xlist.append(x)
        ylist.append(y)
        textlist.append(text)
        sizelist.append(size * sizemultiplier)
        colourlist.append(fcolour)

    print('Number of nodes excluded because {} not given: {}\n'.format(field, len(excluded)))

   
    

    ## Create the visualisation
    xlistedge =[]
    ylistedge = []
    
    for edge in plotgraph.edges():
        x0, y0 = plotgraph.nodes[edge[0]]['pos']
        x1, y1 = plotgraph.nodes[edge[1]]['pos']
        xlistedge += [x0, x1, None]
        ylistedge += [y0, y1, None]        
        
    # Create edge trace:
    edge_trace = go.Scatter(x = xlistedge, y = ylistedge, text = textlist,
                    line = go.scatter.Line(width = 0.5, color = '#888'),
                    mode = 'lines', hoverinfo = 'none')
    
    # Create node trace:
    node_trace = go.Scatter(x = xlist, y = ylist, text = textlist, mode = 'markers',
                    hoverinfo='text',
                    marker = go.scatter.Marker(
                    color = colourlist,
                    size = sizelist,
                    line = dict(color='rgb(50,50,50)', width=0.5)))


    data=[node_trace, edge_trace]
    layout = go.Layout(title=field, 
                   showlegend=False, 
                   xaxis=dict(
                   autorange=True,
                   showgrid=False,
                   zeroline=False,
                   showline=False,
                   ticks='',
                   showticklabels=False),
            yaxis=dict(
                autorange=True,
                showgrid=False,
                zeroline=False,
                showline=False,
                ticks='',
                showticklabels=False
            )
        )

    fig = go.Figure(data=data, layout=layout)
    iplot(fig, filename='Staff network')
    

In [None]:
drawnetwork(sizefield='')
drawnetwork()

## Making marketing decisions based on network analysis

We can use very simple network analyses to decide how best to use our data fans to spread the message. For example, we can compare the networks that would result if they spread the message to staff in their departments vs. focusing on staff in their location vs. spreading to their team (as indicated by shared managers).  


In [None]:
## Show what might happen over time:

def networkevolution(df=staffdata, numsteps=3, transfield=department, infectfield='DataFan'):
    
    network = df.copy(deep=True)
    print('Network evolution over {} time steps based on shared {}s: '.format(numsteps, transfield))
    
    
    for i in range(0, numsteps):
        drawnetwork(staffdata=network, field=infectfield)
        
        infected = network[network[infectfield]=='yes']
        
        #List of field values covered by infected
        infectedVals = np.unique(infected[[transfield]])
        vals = np.unique(network[[transfield]])
        
        print('#{}s covered by infected: {} out of {}'.format(transfield, len(infectedVals), len(vals)))
        print('#staff infected: {} out of {} at t {}'.format(len(infected), len(network), i))
        
        print(infectedVals)
        
        network['next'] = np.where(network[transfield].isin(infectedVals), 'yes', 'no')
        print('#staff next infected: {} out of {}'.format(len(network[network['next']=='yes']), len(network)))
        
        #Staff who are already infected stay infected.
        #Use .loc so the assignment is applied to network itself rather than to a temporary copy.
        network.loc[network[identifier].isin(infected[identifier].values), 'next'] = 'yes'
        #If the field is also a member of staff (e.g. manager), then they also need to be infected.
        network.loc[network[identifier].isin(infectedVals), 'next'] = 'yes'
        
        network.loc[:, infectfield] = network['next'].values
        print('Check reset: {}\n'.format(network[infectfield].equals(network['next'])))
        
        

In [None]:
networkevolution(transfield=department)
networkevolution(transfield=manager)
networkevolution(transfield=location)

From these networks, we can see that spreading simply by team (as indicated by shared managers) results in far lower coverage (only 180 out of 3358) than spreading by location (2057 out of 3358) or by department (2892 out of 3358).

Of course, in practice this does not tell us much, since these networks only show what happens in the limit: if spreading only by team, our message would reach just 180 out of 3358 staff even with infinite time, while spreading by department could reach 2892 out of 3358.
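
To make the idea of "the limit" concrete, here is a minimal sketch on made-up data. The `toy` table and the `coverage_by_step` helper are my own illustrations and are not part of the analysis above. The helper counts how many staff have been reached at each step and stops as soon as the count no longer grows.

```python
import pandas as pd


def coverage_by_step(df, transfield, infectfield='DataFan', maxsteps=10):
    """Return the number of reached staff at each step until the count stops growing."""
    infected = df[infectfield].eq('yes')
    counts = [int(infected.sum())]
    for _ in range(maxsteps):
        # Values of transfield (e.g. departments) that contain at least one reached person
        covered = df.loc[infected, transfield].dropna().unique()
        # Everyone sharing one of those values is reached in the next step
        reached = infected | df[transfield].isin(covered)
        if reached.sum() == infected.sum():
            break  # nothing new was reached, so this is the limit
        infected = reached
        counts.append(int(infected.sum()))
    return counts


# Hypothetical five-person organisation with a single data fan in HR
toy = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e'],
    'Department': ['HR', 'HR', 'IT', 'IT', 'Sales'],
    'DataFan': ['yes', 'no', 'no', 'no', 'no'],
})

steps = coverage_by_step(toy, 'Department')
print(steps)
```

On this toy table the count stops growing after a single step: everyone who shares a department with the fan is reached immediately, and after that no new departments are touched.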

Perhaps more usefully, we should find a way to decide who should be targeted next for the message to spread most efficiently.


### Who should I target next?
There is no one correct answer to this, and it is a topic that is much discussed and contested in Marketing, Communications and Sales.

Some strategies are:
* Finding nodes in the network that already have neighbours or nodes in close proximity that are 'converted'. In this case, I might look for individuals who are in the same team (line-managed by the same people), same country, and/or same business unit.
* Finding nodes in the network that are most 'influential'. In this case, I might look for individuals who line manage a lot of people or who have line managers or subordinates who belong to a large number of business units or span many geographical locations.

I could also create a metric that combines both of the above and target the ones who score most highly. By doing this, I am choosing individuals who are both more likely to be receptive to data awareness and who are able to raise awareness in a greater number of other individuals.
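
As a rough sketch of what such a combined metric might look like, the snippet below scores each non-fan by multiplying two quantities: the share of their neighbours who are already data fans (a stand-in for receptiveness) and the number of their neighbours who are not yet fans (a stand-in for reach). Both the formula and the `target_scores` helper are assumptions made for illustration, and the toy graph is invented; many other weightings would be just as defensible.

```python
import networkx as nx


def target_scores(G, fans):
    """Rank non-fans by (share of neighbours who are fans) * (number of non-fan neighbours)."""
    fans = set(fans)
    U = G.to_undirected()  # ignore the direction of management links
    scores = {}
    for node in U.nodes():
        if node in fans:
            continue
        neighbours = set(U.neighbors(node))
        if not neighbours:
            continue
        receptiveness = len(neighbours & fans) / len(neighbours)
        reach = len(neighbours - fans)
        scores[node] = receptiveness * reach
    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical network: 'fan1' is a data fan, 'm' manages 'x' and 'y'
toy = nx.DiGraph([('fan1', 'm'), ('m', 'x'), ('m', 'y'), ('fan1', 'z')])
ranked = target_scores(toy, fans=['fan1'])
print(ranked[:3])
```

Here `m` comes out on top: one of their three neighbours is already a fan, and they can reach two people who are not. `z` is just as close to a fan but has nobody else to pass the message on to, so scores zero.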

In [None]:
#Get the top x most connected individuals
numtop = 20
sortdegrees = staffdata.sort_values(by='Degree', ascending=False)
print(sortdegrees.head(numtop))

### How many disconnected networks are there (in terms of the management structure)?
It is useful to know how many disconnected networks there are so that 'islands' can be targeted separately.

In [None]:
# nx.weakly_connected_component_subgraphs was removed in networkx 2.4,
# so build the subgraphs from the components, largest first
components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)
subgraphs = [G.subgraph(c).copy() for c in components]
print('#Distinct staff networks: {}\n'.format(len(subgraphs)))
subgraphsizes = [nx.number_of_nodes(sg) for sg in subgraphs]
print(subgraphsizes)

# Draw the two largest subgraphs
drawnetwork(field='DataFan', G=subgraphs[0])
drawnetwork(field='DataFan', G=subgraphs[1], dthreshold=0)




Although in terms of management structure there are 44 disconnected subnetworks in the staff network, this does not necessarily mean that these truly represent isolated groups of staff, only that, formally speaking, they are not connected to each other via management. The vast majority of staff belong to the largest component (3092 nodes, with the next largest containing only 22; see output).
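
To decide which islands need separate attention, it may help to count how many data fans each component already contains. The sketch below does this on an invented graph; `fans_per_component` is a hypothetical helper, not something used earlier in this notebook.

```python
import networkx as nx


def fans_per_component(G, fans):
    """Return (component size, number of fans) for each weakly connected component, largest first."""
    fans = set(fans)
    components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)
    return [(len(c), len(c & fans)) for c in components]


# Hypothetical management graph with two islands and a single data fan, 'a'
toy = nx.DiGraph([('m1', 'a'), ('m1', 'b'), ('m2', 'c')])
summary = fans_per_component(toy, fans=['a'])
print(summary)
```

Any component that reports zero fans cannot be reached through the management structure alone, so it would need its own first contact.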

## If only I had more data (and more hours in the day)...

Of course, what I've done above is only a very rough set of analyses, but it is still enough for me to get going on targeting individuals who are more likely to be able to help me in my quest to raise data awareness in the organisation. If there were more hours in the day and I had access to other data, I might be able to run even more targeted campaigns. Here are some examples.

* I could use Yammer data to craft better messages to individuals, e.g. if I knew a staff member was interested in Education, I might talk to them about adaptive learning and inferring learner cognitive models.
* I could use email traffic data or meeting participation data to get a more accurate idea of people's interactions with each other since line management and official team membership are not always the most representative measures of links between people.
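
As an illustration of the second idea, the sketch below builds a weighted network from a table of message counts. I do not have such data, so the `emails` table, its column names and its values are entirely made up; the point is only to show that the same networkx tools apply once the links carry weights.

```python
import networkx as nx
import pandas as pd

# Hypothetical message counts between three staff members
emails = pd.DataFrame({
    'sender': ['a', 'b', 'a'],
    'recipient': ['b', 'a', 'c'],
    'n_messages': [5, 3, 1],
})

# Directed graph whose edges carry the number of messages sent
W = nx.from_pandas_edgelist(emails, source='sender', target='recipient',
                            edge_attr='n_messages', create_using=nx.DiGraph)

# Weighted degree: total messages sent plus received by each person
strength = dict(W.degree(weight='n_messages'))
print(strength)
```

A person with a modest number of direct reports but a high weighted degree might be a better messenger than the organisational chart alone suggests.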

## Why I'm telling you about this - two messages to take home

* A dataset that is originally used for one purpose (in this case drawing the organisational chart) can be used for many others (in this case identifying promising individuals to target).
* You can squeeze a lot of insight out of a small amount of simple data.

## Some things to read
* A paper providing a glossary of network terms: https://jech.bmj.com/content/58/12/971
* A very comprehensive list of resources for those wishing to implement their own analyses: https://github.com/briatte/awesome-network-analysis
