# Part 2: Email Behaviour Data Analysis

---

### Install Python packages (pip only)

In [1]:
#e.g., %pip install some-package
%pip install networkx

Note: you may need to restart the kernel to use updated packages.


### Import Python packages

In [2]:
#e.g., import some-package
import networkx as nx
import numpy as np
import pandas as pd

---

### Task 1 of 2

Examine the file "emails_cmt224.edgelist", which represents email behaviour at an organisation. Each line contains two numbers, 𝑢 and 𝑣, separated by a blank space. Consider each number as an identifier for an individual in the organisation, with each line representing that individual 𝑢 sent at least one email to individual 𝑣 at some point. Model the data using an appropriate, directed network representation and answer the following questions:

##### Q1. Do the majority of individuals have a higher or lower ratio of mutual connections than average in the network?

In [3]:
#CODE:
#create a directed network
G = nx.read_edgelist("emails_cmt224.edgelist",create_using = nx.DiGraph)

#calculate local reciprocity for each node
local_reciprocity = list(nx.reciprocity(G,G.nodes()).values())

#find nodes that have higher local reciprocity than the average in the network
mean_local_reciprocity = np.mean(local_reciprocity)
higher_than = [i for i in local_reciprocity if i > mean_local_reciprocity]

#calculate the fraction of such nodes
answer_q1 = len(higher_than) / len(local_reciprocity)

print(round(answer_q1,2))

0.65


ANSWER: The majority of individuals (approximately 65%) have a higher ratio of mutual connections than the network average.
- Explanation:
    - Email behaviour is directed; thus, a directed graph was created.
    - Local reciprocity was calculated for each node.
    - Nodes with a local reciprocity higher than the network average were selected.
    - The fraction of such nodes was calculated.
- Justification:
    - "Ratio of mutual connections" points to the local reciprocity metric, which measures the share of a given node's edges that run in both directions (mutual).
    - One may choose to compare local reciprocity with overall reciprocity instead. However, the latter is not simply an average of the local values; it is a single measure computed over the entire network.
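
The difference between the mean of the local reciprocities and the overall reciprocity can be seen on a toy graph. The following is a minimal sketch in plain Python, assuming the usual definitions (local: twice the number of mutual neighbours divided by in-degree plus out-degree; overall: the share of edges whose reverse edge also exists). The four-node edge set and the helper name `local_reciprocity_of` are illustrative assumptions, not part of the email data.

```python
from statistics import mean

# Hypothetical toy edge set: only a<->b is mutual
edges = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")}
nodes = sorted({n for edge in edges for n in edge})

def local_reciprocity_of(node):
    """Twice the number of mutual neighbours over (in-degree + out-degree)."""
    succ = {v for u, v in edges if u == node}
    pred = {u for u, v in edges if v == node}
    return 2 * len(succ & pred) / (len(succ) + len(pred))

local = {n: local_reciprocity_of(n) for n in nodes}

# Baseline used in the cell above: the mean of the local values
mean_local = mean(local.values())

# Alternative baseline: share of edges that are reciprocated, network-wide
overall = sum((v, u) in edges for u, v in edges) / len(edges)

# Nodes above the mean-of-local baseline
above = [n for n, r in local.items() if r > mean_local]

print(local)
print(round(mean_local, 2), overall, above)
```

On this toy graph the mean of the local values is about 0.42 while the overall reciprocity is 0.5, so the two baselines are not interchangeable in general.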

##### Q2. Using the largest, strongly connected component (where at least one path exists between each individual and all others). Could the connectivity of the component be suggested to be reflective of a small world phenomenon in comparison to the typical connectivity of 10 comparative random networks?

In [4]:
#CODE:
#find the strongly connected components in the graph
scc_component = nx.strongly_connected_components(G)

#sort the generator result
scc_sort_by_no_of_nodes = sorted(scc_component, key=len, reverse=True)

#get the largest connected component
lscc_sort_by_no_of_nodes = scc_sort_by_no_of_nodes[0]

#create a subgraph from graph G
lscc = G.subgraph(lscc_sort_by_no_of_nodes).copy()

#helper for measuring connectivity: average shortest path length and average clustering of a random graph's LSCC
def generate_random_graph(no_of_nodes, no_of_edges):
    #generate a random directed graph with same number of edges and nodes as given arguments
    random_g = nx.gnm_random_graph(no_of_nodes, no_of_edges, directed=True)
    #find the strongly connected components in the random graph
    random_scc_component = nx.strongly_connected_components(random_g)
    #sort the generator result and get the largest
    lscc_random_g = sorted(random_scc_component, key=len, reverse=True)[0]
    #create a subgraph of the random graph from its largest strongly connected component
    random_lscc = random_g.subgraph(lscc_random_g).copy()
    return nx.average_shortest_path_length(random_lscc),nx.average_clustering(random_lscc)

#placeholders for the average shortest path length and clustering of the random graphs
random_aspl = []
random_ac = []

#generate 10 random graphs and record both metrics from the same graph each time
for i in range(10):
    aspl, ac = generate_random_graph(lscc.number_of_nodes(), lscc.number_of_edges())
    random_aspl.append(aspl)
    random_ac.append(ac)

print("Average shortest path length are {0}(LSCC of email network) and {2}(random networks), and average clustering are {1}(LSCC of email network) and {3}(random networks)"
    .format(round(nx.average_shortest_path_length(lscc),2),
            round(nx.average_clustering(lscc),2),
            round(np.mean(random_aspl),2),
            round(np.mean(random_ac),2)))

Average shortest path length are 2.56(LSCC of email network) and 2.32(random networks), and average clustering are 0.39(LSCC of email network) and 0.04(random networks)


ANSWER: Yes, the connectivity of the largest strongly connected component (LSCC) of the email network can be suggested to reflect a small world phenomenon: its average shortest path length is close to that of the random networks (2.56 vs 2.32), while its average clustering is far higher (0.39 vs 0.04).
- Explanation:
    - The LSCC of the email network was extracted.
    - A function was created that generates a random network with the same number of nodes and edges as the LSCC and returns the average shortest path length and average clustering of that random network's own LSCC.
    - The function was executed 10 times to obtain the two metrics for 10 comparative random networks, and the results were averaged.
- Justification:
    - Average shortest path length and average clustering are the metrics typically used to assess the small world phenomenon.
    - One may choose to use the eccentricity metric instead, but it is sensitive to "outliers" (long chains of edges skew the value).
    - There are other approaches, such as nx.sigma or nx.omega, but these are only available for undirected graphs.
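
The comparison behind this answer can be made explicit with two ratios. The following is a minimal sketch that only reuses the four averages printed above; the variable names and the combined ratio are illustrative assumptions. The combined ratio is an informal analogue of the sigma coefficient, computed by hand because nx.sigma does not accept directed graphs, and it should be read as a rough indication rather than a formal test.

```python
# Averages copied from the printed output above
lscc_aspl, random_aspl_mean = 2.56, 2.32
lscc_clustering, random_clustering_mean = 0.39, 0.04

# A small world network has a path length close to random (ratio near 1) ...
path_ratio = lscc_aspl / random_aspl_mean

# ... and clustering well above random (ratio much greater than 1)
clustering_ratio = lscc_clustering / random_clustering_mean

# Informal sigma-like summary: values well above 1 point towards a small world
small_world_ratio = clustering_ratio / path_ratio

print(f"path ratio: {path_ratio:.2f}")
print(f"clustering ratio: {clustering_ratio:.2f}")
print(f"combined ratio: {small_world_ratio:.2f}")
```

With the reported values, the path length is only about 10% longer than in the random networks, while the clustering is almost ten times higher.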

##### Q3. Are occurrences of induced, connected subgraphs of 3 individuals (triads) with only mutual connections more abundant in the largest, strongly connected component than those with a mixture of asymmetric and mutual connections? What does this suggest about how mutual connections are distributed in the component?

In [5]:
#CODE
#count how many of each of the 16 possible triad types are present in the directed graph
lscc_triadic_census = nx.triadic_census(lscc)

#triads(induced, connected) - only mutual connections => type 300 (the 16th type)
#triads(induced, connected) - with a mixture of asymmetric and mutual connections => type 120D/120U/120C/210
answer_q3 = {key:value for (key,value) in lscc_triadic_census.items() if key[0]=='3' or (key[0]!='0' and key[1]!='0' and key[2]=='0')}
print(answer_q3)

{'120D': 3549, '120U': 4096, '120C': 3954, '210': 17333, '300': 12982}


ANSWER: No. In the largest strongly connected component, triads of type 300 (12982 occurrences) are less abundant than triads of types 120D, 120U, 120C and 210 combined (28932 occurrences). This suggests that mutual connections are not concentrated in fully mutual groups of three; more often they occur alongside asymmetric connections.
- Explanation:
    - The nx.triadic_census function was used to count the occurrences of each of the 16 possible triad types.
    - The relevant types were extracted from the results.
- Justification:
    - "Induced, connected subgraphs" indicates that there must be a path between every pair of nodes within the subgraph. This excludes types that do not fulfil this condition.
    - The triad with only mutual connections is type 300.
    - Triads with a mixture of asymmetric and mutual connections are types 120D/120U/120C/210.
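
The selection of triad types follows from the naming convention, in which the first three digits of a type give its number of mutual, asymmetric and null dyads. The following is a minimal sketch of that reading in plain Python; the `classify` helper and its category labels are illustrative assumptions, and the counts are copied from the output above. It mirrors the filter used in the cell, which keeps only types without a null dyad.

```python
# The 16 triad type names used by the triadic census
TRIAD_TYPES = ("003", "012", "102", "021D", "021U", "021C", "111D", "111U",
               "030T", "030C", "201", "120D", "120U", "120C", "210", "300")

# Counts copied from the printed output above
census = {"120D": 3549, "120U": 4096, "120C": 3954, "210": 17333, "300": 12982}

def classify(triad_type):
    """Label a triad type from its mutual/asymmetric/null dyad counts."""
    mutual, asymmetric, null = (int(ch) for ch in triad_type[:3])
    if null > 0:
        return "contains a null dyad"
    if asymmetric == 0:
        return "mutual only"
    if mutual == 0:
        return "asymmetric only"
    return "mixed"

mixed_types = [t for t in TRIAD_TYPES if classify(t) == "mixed"]
mutual_only_types = [t for t in TRIAD_TYPES if classify(t) == "mutual only"]

mixed_total = sum(census[t] for t in mixed_types)
mutual_only_total = sum(census[t] for t in mutual_only_types)

print(mixed_types, mixed_total)
print(mutual_only_types, mutual_only_total)
```

The totals are 28932 mixed triads against 12982 fully mutual ones, which is the comparison the answer rests on.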

---
### Task 2 of 2

Examine the JSON file "emails_cmt224_departments.json" (departments file). Keys in the departments file represent individuals using the same ids as in the "emails_cmt224.edgelist" file in Part 2, Task 1 and the values represent a department id that the individual can be attributed to. Using the contents of the departments file in combination with the network in Part 2, Task 1, answer the following questions:

##### Q1. Using the connections that individuals have in the network, are they more likely to mix with others in their department or those with a similar number of outward connections?

In [6]:
#CODE:
#load the department data
df_depart = pd.read_json("emails_cmt224_departments.json",lines=True).T.reset_index()

#cast the index column to string so that it matches the node ids read from the edge list
df_depart['index'] = df_depart['index'].astype(str)

#rename the column 0 as depart
df_depart.rename(columns={0:'depart'},inplace=True)

#create a dictionary that maps each node to its department attribute
depart_attr = df_depart.set_index('index',drop=True).to_dict(orient='index')

#set the attribute of node
nx.set_node_attributes(G,depart_attr)

#calculate the assortativity
attr_assort = nx.attribute_assortativity_coefficient(G, "depart")

# By default, degree_assortativity_coefficient(..) compares the 'out' degree of source nodes
# with the 'in' degree of target nodes (x='out', y='in'); here both are set to 'out'
degree_assort = nx.degree_assortativity_coefficient(G, x = 'out', y = 'out')

print("Assortativity coefficient with respect to attribute is {0} whereas degree_assortativity_coefficient is {1}.".format(round(attr_assort,2),round(degree_assort,2)))

Assortativity coefficient with respect to attribute is 0.29 whereas degree_assortativity_coefficient is -0.02.


ANSWER: Individuals are more likely to mix with others in the same department (attribute assortativity coefficient of 0.29) than with those who have a similar number of outward connections (degree assortativity coefficient of -0.02).
- Explanation:
    - The department attribute was assigned to each node from a pandas dataframe.
    - The attribute assortativity coefficient was calculated.
    - Then, the degree assortativity coefficient was calculated.
- Justification:
    - Assortativity measures the tendency of nodes to connect to similar nodes, which makes it suitable for answering a question about homophily:
        - The attribute assortativity coefficient was calculated with respect to the department attribute of each node.
        - The degree assortativity coefficient was calculated with both x and y set to "out", instead of the defaults, since the question asks about similar numbers of outward connections.
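
What the two coefficients measure can be sketched by hand on a toy graph. The following is a minimal illustration in plain Python, assuming the standard definitions: attribute assortativity from the normalised mixing matrix, and out/out degree assortativity as the Pearson correlation of source and target out-degrees over all edges. The edge list, the department labels and the `pearson` helper are hypothetical and unrelated to the email data.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy graph: h and a share department 0, b and c share department 1
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("a", "h"), ("b", "c")]
depart = {"h": 0, "a": 0, "b": 1, "c": 1}

# Attribute assortativity: observed same-department share vs. share expected by chance
mixing = Counter((depart[u], depart[v]) for u, v in edges)
total = sum(mixing.values())
labels = sorted(set(depart.values()))
same_share = sum(mixing[(k, k)] for k in labels) / total
row = {k: sum(mixing[(k, j)] for j in labels) / total for k in labels}
col = {k: sum(mixing[(i, k)] for i in labels) / total for k in labels}
expected = sum(row[k] * col[k] for k in labels)
attr_r = (same_share - expected) / (1 - expected)

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Degree assortativity with x='out', y='out': out-degree of source vs. out-degree of target
out_degree = Counter(u for u, _ in edges)
degree_r = pearson([out_degree[u] for u, _ in edges],
                   [out_degree[v] for _, v in edges])

print(round(attr_r, 2), round(degree_r, 2))
```

On this toy graph the department coefficient is positive (about 0.29) and the out-degree coefficient is negative (about -0.37), the same qualitative pattern as in the email network: mixing by department rather than by number of outward connections.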

##### Q2. Are all departments with 15 or more members more tightly connected amongst themselves in comparison to all individuals across the overall network irrespective of their department?  Where in this context, 'more tightly connected' is defined as having more mutual AND clustered connections. In addition to answering the overall question as yes or no, provide a list of departments this is true for (if any) and not true for (if any).

In [7]:
#CODE:
#Find out the departments with 15 or more members
depart_morethan_15_mem = df_depart.pivot_table(index=['depart'],
                                               values=['index'],
                                               aggfunc=len)\
.query("index >=15").reset_index()['depart'].to_list()

"""computing the reciprocity and average clustering for the entire network irrespective of their department"""
entire_network_reciprocity = round(nx.reciprocity(G),2)
entire_network_clustering = round(nx.average_clustering(G),2)

"""consider all departments >=15 members as a whole subgraph"""
G_depart_15_mem = G.subgraph([node for node,data in G.nodes(data=True) if data.get('depart') in depart_morethan_15_mem])
G_depart_15_mem_reciprocity = round(nx.reciprocity(G_depart_15_mem),2)
G_depart_15_mem_clustering = round(nx.average_clustering(G_depart_15_mem),2)
#overall answer: the combined subgraph must exceed the entire network on both metrics
answer_task2_q2 = G_depart_15_mem_reciprocity>entire_network_reciprocity and G_depart_15_mem_clustering >  entire_network_clustering

"""consider each department with >=15 members as a separate subgraph"""
#create an empty list for appending nodes for different department
G_graph_list = []
#append the list use for loop
for depart in depart_morethan_15_mem:
    G_graph_list.append((depart,G.subgraph([node for node,data in G.nodes(data=True) if data.get('depart')== depart])))

#create three lists for appending the results of overall reciprocity and average_clustering for each depart id
depart_id = []
depart_reciprocity = []
depart_average_clustering = []

#calculate the reciprocity and average clustering for each depart in for-loop and append the results to the list
for depart, network in G_graph_list:
    depart_id.append(depart)
    depart_reciprocity.append(round(nx.reciprocity(network),2))
    depart_average_clustering.append(round(nx.average_clustering(network),2))
    
#create a dictionary for the result
dict_result = {'depart_id': depart_id, 'overall_reciprocity': depart_reciprocity, 'average_clustering': depart_average_clustering} 

#create a function that return True or False based on the condition
def is_more_tightly_connected(x):
    return x['overall_reciprocity']>entire_network_reciprocity and x['average_clustering']>entire_network_clustering

#Transform dictionary to pandas dataframe
df_result_by_depart = pd.DataFrame(dict_result)

#Apply function to the whole dataframe and create a column to store the result
df_result_by_depart['is_more_tightly_connected'] = df_result_by_depart.apply(is_more_tightly_connected,axis=1)

print("Situation 1 - Overall:",answer_task2_q2)
df_result_by_depart

Situation 1 - Overall: True


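|   | depart_id | overall_reciprocity | average_clustering | is_more_tightly_connected |
|---|-----------|---------------------|--------------------|---------------------------|
| 0 | 0 | 0.69 | 0.46 | False |
| 1 | 1 | 0.66 | 0.53 | False |
| 2 | 4 | 0.71 | 0.37 | False |
| 3 | 6 | 0.0 | 0.07 | False |
| 4 | 7 | 0.69 | 0.5 | False |
| 5 | 8 | 0.7 | 0.65 | True |
| 6 | 9 | 0.79 | 0.44 | True |
| 7 | 10 | 0.74 | 0.58 | True |
| 8 | 11 | 0.74 | 0.71 | True |
| 9 | 13 | 0.85 | 0.67 | True |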


ANSWER: If the departments with 15 or more members are taken together as a whole, it is true that they are more tightly connected amongst themselves than the overall network. However, when each such department is examined separately against the overall network, it does not hold for all of them: it is true for departments 8, 9, 10, 11 and 13, and not true for departments 0, 1, 4, 6 and 7.
- Explanation:
    - One subgraph containing all departments with 15 or more members was created.
    - The metrics (overall reciprocity and average clustering) were calculated for the entire network and for the above subgraph for direct comparison.
    - An individual subgraph was created for each department with 15 or more members (10 departments, as listed above). The metrics were calculated for each and compared with those of the entire network.
- Justification:
    - According to the definition in the question, the following metrics are the most relevant:
        - Mutual connections can be measured using overall reciprocity.
        - Clustered connections can be measured using average clustering.
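
The per-department result can be turned into the two lists the question asks for. The following is a minimal sketch in plain Python that only reuses the values printed in the result table; the variable names are illustrative, and the same split could be obtained by filtering df_result_by_depart on its is_more_tightly_connected column.

```python
# (depart_id, overall_reciprocity, average_clustering, is_more_tightly_connected)
# Values copied from the result table above
rows = [
    (0, 0.69, 0.46, False),
    (1, 0.66, 0.53, False),
    (4, 0.71, 0.37, False),
    (6, 0.0, 0.07, False),
    (7, 0.69, 0.5, False),
    (8, 0.7, 0.65, True),
    (9, 0.79, 0.44, True),
    (10, 0.74, 0.58, True),
    (11, 0.74, 0.71, True),
    (13, 0.85, 0.67, True),
]

# Departments that beat the entire network on both metrics, and those that do not
true_for = [depart for depart, _, _, flag in rows if flag]
not_true_for = [depart for depart, _, _, flag in rows if not flag]

# "All departments" holds only if every single department passes
all_tighter = all(flag for *_, flag in rows)

print("True for:", true_for)
print("Not true for:", not_true_for)
print("All departments more tightly connected:", all_tighter)
```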