# Exercise 4.2
## QAP & MRQAP
---

In this exercise you will compute correlations between two networks and measure their significance using a permutation test.\
You will work with `directed graphs` as DataFrames of edge lists, as well as adjacency matrices as NumPy arrays.\
Given the data, *'a social network of the managers of a high-tech company'*, you will learn how to create three types of hypotheses:\
dyadic, mixed monadic-dyadic, and node-level.\
\
At the end of this task, you will know how to interpret regression results as well as the significance from permutation tests.

1. Load the Krackhardt dataset (get to know the data!).
 - Compare data description file with actual data.


2. Practice QAP with dyadic hypotheses.
 - Create dyad-level hypotheses.
 - Run a correlation test between two variables, and check its significance using a permutation test!


3. Practice MRQAP with dyadic hypotheses.
 - Create dyad-level hypotheses.
 - Build a linear regression model, and check the significance of each independent variable using a permutation test!


4. Practice QAP with mixed monadic-dyadic hypotheses.
 - Create monadic-level and dyad-level hypotheses.
 - Run a correlation test between two variables, and check its significance using a permutation test!


5. Practice MRQAP with mixed monadic-dyadic hypotheses.
 - Create monadic-level and dyad-level hypotheses.
 - Build a linear regression model, and check the significance of each independent variable using a permutation test!


6. Practice MRQAP with node-level hypotheses.
 - Create node-level (node structure, node attribute) hypotheses.
 - Build a linear regression model, and check the significance of each ind. variable using permutation test!
 

---
## QUIZ
*Get to know the data from the '**description file**'!*\
Open the file `data/krackhardt/README.md` and read the description of this dataset.

 
- What type of network do we have: monolayer or multilayer?
- Is the network directed or undirected?
- How many layers, nodes, and total edges does it have?

---
## Task #1
#### Load the data!
Load the multilayer network, and nodes' metadata.

---

#### Dependencies (libraries)
Import all necessary Python packages here.

In [None]:
#### Dependencies
import ... as pd

In [None]:
#### Local dependencies
# %load_ext autoreload
# %autoreload 2
    
import sys
sys.path.append('../libs/')

from ... import QAP
from ... import MRQAP
import hel...

#### Loading network
The datasets for this exercise are in `data/krackhardt/`.

Load nodes' metadata, and store in `nodes` the node ids.

In [None]:
# read file containing node attributes
nodes_metadata = pd.read_csv("../../data/krackhardt/<filename>", index_col=None, header=0, sep=' ')
nodes_metadata.head()

In [None]:
nodes = nodes_metadata.nodeID

Load the *Krackhardt* multilayer network as a DataFrame, and rename the columns accordingly.\
Make sure you know which column refers to `source`, `target`, `layer`, and `weight`.\
**HINT**: *Check the 'description file' to learn the format of the file (what each column represents).*

In [None]:
# read file containing multilayer network
df_edges = pd.read_csv(...)
df_edges.rename(columns={0:..., 1:..., 2:..., 3:...}, inplace=True) # inplace = True, so the changes are effective 
                                                                    # in the same DataFrame
df_edges.head()

Load layers' metadata (labels).\
*HINT: The multilayer you just loaded identifies each layer as 1,2,3. But what are those numbers? Let's understand these layers.*

In [None]:
# read file containing layer metadata (labels)
layers = pd.read_csv(..., index_col=None, header=0, sep=' ')
layers

---
## QUIZ
*Get to know your data from a DataFrame.*\
Using the DataFrames you loaded before, write the necessary code to answer the following questions:
1. How many edges does this network have?
2. How many unique nodes?
3. How many layers?
4. What is the sum of edge-weights per layer?

Do these numbers match those reported in the *description file*?\
\
*HINT 1: You can ask for the `shape` of the DataFrame containing the edges (remember, it has rows referring to each edge) and columns (edge-attributes)*\
*HINT 2: You can ask for the `shape` of the DataFrame containing the nodes (remember, it has rows referring to each node) and columns (node-attributes)*\
*HINT 3: The DataFrame containing the edges has a column referring to the layer id (you actually gave it a particular name). Is there a function in pandas to get the number of unique values from a column? (you can ask Google).*\
*HINT 4: What about using a `groupby` in the DataFrame containing the edges? What should you group-by then? After grouping by, which function should you call to get the summation over the edge weights?*
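
The hints above can be tried out on a tiny made-up edge list before touching the real data. The sketch below is only an illustration: the DataFrame `toy_edges` and its values are hypothetical, and the column names `source`, `target`, `layer`, and `weight` are assumed to match the names you chose when renaming.

```python
import pandas as pd

# Hypothetical toy multilayer edge list (NOT the Krackhardt data)
toy_edges = pd.DataFrame({
    "source": [1, 1, 2, 3, 3],
    "target": [2, 3, 3, 1, 2],
    "layer":  [1, 1, 1, 2, 2],
    "weight": [1, 1, 1, 1, 1],
})

# One row per edge, so the number of rows is the number of edges
n_edges = toy_edges.shape[0]

# Distinct layer ids
n_layers = toy_edges["layer"].nunique()

# A node may appear only as source or only as target, so look at both columns
n_nodes = pd.unique(toy_edges[["source", "target"]].values.ravel()).size

# Sum of edge weights within each layer
weight_per_layer = toy_edges.groupby("layer")["weight"].sum()

print(n_edges, n_layers, n_nodes)
print(weight_per_layer)
```

Counting nodes from the edge list only finds nodes that have at least one edge; the node metadata DataFrame is the safer source when isolated nodes may exist.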

In [None]:
# number of edges
m = ...
print('Number of edges: {}'.format(m))

# number of nodes
n = ...
print('Number of nodes: {}'.format(n))

# number of layers
l = ...
print('Number of layers: {}'.format(l))

# sum of weights per layer, and overall
sewl = ...
print('Sum of all edge-weights: {}'.format(sewl.sum()))
print('\nSum of edge-weights per layer:\n {}'.format(sewl))

---
## Task #2
#### Practice QAP with dyadic hypotheses.

Find the correlation between two of the available variables (layers).\
How significant is this value?\
How would you interpret this result?
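
To make the idea of a QAP permutation test concrete, here is a minimal NumPy sketch. It is not the `QAP` class from the course library, whose internals may differ; the function name `qap_correlation` and the toy matrices are assumptions made for illustration. The key point is that rows and columns of one matrix are permuted together, so the network structure is kept while node labels are shuffled.

```python
import numpy as np

def qap_correlation(Y, X, npermutations=1000, seed=0):
    """Pearson correlation between two square matrices with a QAP permutation test."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    off_diag = ~np.eye(n, dtype=bool)  # ignore self-loops on the diagonal
    observed = np.corrcoef(Y[off_diag], X[off_diag])[0, 1]

    permuted = np.empty(npermutations)
    for i in range(npermutations):
        p = rng.permutation(n)
        Yp = Y[np.ix_(p, p)]  # same permutation for rows and columns
        permuted[i] = np.corrcoef(Yp[off_diag], X[off_diag])[0, 1]

    # Two-sided p-value: share of permutations at least as extreme as observed
    p_value = np.mean(np.abs(permuted) >= abs(observed))
    return observed, p_value, permuted

# Hypothetical toy data: a random directed network compared with itself
rng = np.random.default_rng(42)
X = (rng.random((10, 10)) < 0.3).astype(float)
np.fill_diagonal(X, 0)
Y = X.copy()  # identical matrices, so the observed correlation is 1

observed, p_value, permuted = qap_correlation(Y, X, npermutations=200, seed=1)
print(observed, p_value)
```

A small p-value here means that few random relabelings of the nodes produce a correlation as strong as the observed one.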

---

#### Adjacency matrices
Extract each layer separately from the multilayer DataFrame.\
\
*HINT 1: Are the networks directed or undirected? Check it out in the 'description' file.*\
*HINT 2: You need the adjacency matrices of each layer. So, you can either use one of the methods we learnt on Monday on how to convert pandas edgelist to networkx graph, and then to adjacency matrix, or you can simply use the helper function `get_adjacency_from_pandas_weighted_edgelist(...)` from the `helper` library (see the demo notebook).*
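
For intuition, the conversion from a weighted edge list to an adjacency matrix can be sketched by hand. This is not the helper function itself; the toy edges are hypothetical, and the convention "row = source, column = target" is an assumption you should verify against the helper's output.

```python
import numpy as np

# Hypothetical directed, weighted edge list as (source, target, weight) triples
toy_edges = [(1, 2, 1), (1, 3, 1), (3, 1, 1)]
nodes_order = [1, 2, 3]  # fixes which node goes to which row/column

# Map each node id to its position in the matrix
index = {node: i for i, node in enumerate(nodes_order)}

adj = np.zeros((len(nodes_order), len(nodes_order)))
for source, target, weight in toy_edges:
    adj[index[source], index[target]] = weight  # directed: only source -> target

print(adj)
```

Using the same `nodes_order` for every layer matters: QAP compares matrices cell by cell, so row `i` must refer to the same person in all of them.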

In [None]:
# Advice (layer=1)
tmp = df_edges...[['source','target','weight']]
A = helper.get_adjacency_from_pandas_weighted_edgelist(df=...,
                                                       nodes_order=...,
                                                       directed=...)

A.shape, A.min(), A.max(), A.sum()

In [None]:
# Let's check what the adjacency matrix looks like.
A

In [None]:
# Friendship (layer=2)
tmp = ...
F = helper.get_adjacency_from_pandas_weighted_edgelist(df=...,
                                                       nodes_order=...,
                                                       directed=...)
F.shape, F.min(), F.max(), F.sum()

#### QAP: advice vs. friendship

In [None]:
# Run QAP
qap_obj = QAP(Y=..., X=..., npermutations=..., seed=...)
qap_obj...
qap_obj...

In [None]:
# Plot distribution of correlation scores (permutation test)
qap_obj...

---
## Task #3
#### Practice MRQAP with dyadic hypotheses.
Your dependent variable is 'Advice'.\
How significant are the regression coefficients (for friendship and report) with respect to the dependent variable?\
How would you interpret these results?\
\
*HINT 1: Get the adjacency matrix of the layer 'reports_to'*\
*HINT 2: What is Y, and what is X?*
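
The following sketch shows one simple way an MRQAP-style test can work: fit an ordinary least-squares regression on the off-diagonal cells, then refit after permuting the rows and columns of Y. It is an assumption-laden illustration (the function name `mrqap_y_permutation` is hypothetical, and the course's `MRQAP` class may use a different permutation scheme or standardization).

```python
import numpy as np

def mrqap_y_permutation(Y, Xs, npermutations=500, seed=0):
    """OLS on off-diagonal cells, with p-values from permuting Y's node labels."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Design matrix: intercept plus one column per independent matrix
    design = np.column_stack([np.ones(off_diag.sum())] + [X[off_diag] for X in Xs])

    def fit(Ymat):
        coefs, *_ = np.linalg.lstsq(design, Ymat[off_diag], rcond=None)
        return coefs

    observed = fit(Y)
    permuted = np.empty((npermutations, design.shape[1]))
    for i in range(npermutations):
        p = rng.permutation(n)
        permuted[i] = fit(Y[np.ix_(p, p)])  # rows and columns permuted together

    # Two-sided p-value per coefficient
    p_values = np.mean(np.abs(permuted) >= np.abs(observed), axis=0)
    return observed, p_values

# Hypothetical toy data: Y is an exact linear combination of X1 and X2
rng = np.random.default_rng(7)
X1 = rng.random((12, 12))
X2 = rng.random((12, 12))
Y = 0.5 + 2.0 * X1 - 1.0 * X2

observed, p_values = mrqap_y_permutation(Y, [X1, X2], npermutations=200, seed=3)
print(observed)  # intercept, coefficient of X1, coefficient of X2
print(p_values)
```

Because the toy Y is built without noise, the fitted coefficients recover the values used to construct it; real data will not behave this neatly.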

---

In [None]:
# Report (layer=3)
tmp = ...
R = ...
R.shape, R.min(), R.max(), R.sum()

In [None]:
# MRQAP
Y = {'advice':...}
X = {'friendship':..., 'report':...}
mrqap_obj = MRQAP(Y=..., X=..., npermutations=..., standarized=..., seed=...)
mrqap_obj...

In [None]:
# Print MRQAP summary
mrqap_obj...

In [None]:
# Plot distribution of coefficients (permutation test)
mrqap_obj...

---
## Task #4
#### Practice QAP with mixed dyadic-monadic hypotheses.

*Hypothesis:*\
*People tend to **report** to people who are **older** than themselves.*

Find the correlation between 'reports' (dyad) and 'difference in age' (monadic) variables between people.\
How significant is this value?\
How would you interpret this result?\
\
*HINT 1: Use `helper.get_monadic_hypothesis` to obtain a node-by-node matrix using the `nodeAge` attribute.*\
*HINT 2: If 'source' node reports to 'target' node, that means that (according to the hypothesis) 'target' must be older than 'source'.*\
*HINT 3: Your 'comparison_function' must give higher scores to cases when 'age_target > age_source'.*\
*HINT 4: What about doing a subtraction? `age_target - age_source`? or `age_source - age_target`?*
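
To see what a node-by-node matrix built from a node attribute looks like, consider this small NumPy sketch. It does not use `helper.get_monadic_hypothesis`; the ages are made up, and the orientation "row = source, column = target" is an assumption that should match your adjacency matrices.

```python
import numpy as np

# Hypothetical ages for three nodes (positions 0, 1, 2)
ages = np.array([30, 45, 38])

# Broadcasting: entry [i, j] = age of target j minus age of source i
age_diff = ages[np.newaxis, :] - ages[:, np.newaxis]

print(age_diff)
```

The resulting matrix is not symmetric: swapping source and target flips the sign, which is why the direction of the subtraction matters for the hypothesis.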

---

In [None]:
# Create monadic-hypothesis (target is older than source)
O = helper.get_monadic_hypothesis(df=..., 
                                  keyid=..., 
                                  attribute=..., 
                                  comparison_fnc=helper...., 
                                  symmetric=False,
                                  keyorder=...)

O.shape, O.min(), O.max(), O.sum()

In [None]:
# Let's check what the monadic matrix looks like.
pd.DataFrame(O).head()

In [None]:
# Run QAP
qap_obj = QAP(Y=..., X=..., npermutations=..., seed=...)
qap_obj.qap()
qap_obj.summary()

In [None]:
# Plot distribution of correlation scores (permutation test)
qap_obj...

---
## Task #5
#### Practice MRQAP with mixed dyadic-monadic hypotheses.

*Hypothesis:*\
*People (regardless of their age) are more likely to **report** to **older people** who belong to the **same department** and are in **different levels** of hierarchy.*\
\
*HINT 1: Be careful! The monadic hypothesis related to 'nodeAge' in this task is different from the one in task 4.*\
*HINT 2: You just need to use another 'comparison_function'. What about `compare_target_value`?*
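
The difference between a "target value" matrix and a symmetric "same group" matrix can be illustrated with plain NumPy. This sketch only mimics the idea; the arrays are hypothetical, and the actual comparison functions in the `helper` library may score cells differently.

```python
import numpy as np

# Hypothetical attributes for three nodes
ages = np.array([30, 45, 38])
departments = np.array([1, 1, 2])
n = len(ages)

# Target value: entry [i, j] depends only on the target j (its age),
# regardless of who the source i is
target_age = np.tile(ages, (n, 1))

# Same department: 1 if source and target share a department, else 0
same_dept = (departments[:, np.newaxis] == departments[np.newaxis, :]).astype(int)
np.fill_diagonal(same_dept, 0)  # a node is not compared with itself

print(target_age)
print(same_dept)
```

The same-department matrix is symmetric, while the target-age matrix is not, which mirrors the `symmetric=True` and `symmetric=False` settings used in this task.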

---

In [None]:
# Create monadic-hypothesis (the older the target node/person, the more likely to get reports from source node/person)
P = helper.get_monadic_hypothesis(df=..., 
                                  keyid=..., 
                                  attribute=..., 
                                  comparison_fnc=helper...., 
                                  symmetric=False,
                                  keyorder=...)

P.shape, P.min(), P.max(), P.sum()

In [None]:
# Let's check what the monadic matrix looks like.
pd.DataFrame(P).head()

In [None]:
# Create monadic-hypothesis (same department)
D = helper.get_monadic_hypothesis(df=..., 
                                  keyid=..., 
                                  attribute=..., 
                                  comparison_fnc=helper..., 
                                  symmetric=True,
                                  keyorder=...)

D.shape, D.min(), D.max(), D.sum()

In [None]:
# Create monadic-hypothesis (different level)
L = helper.get_monadic_hypothesis(df=..., 
                                  keyid=..., 
                                  attribute=..., 
                                  comparison_fnc=..., 
                                  symmetric=True,
                                  keyorder=...)

L.shape, L.min(), L.max(), L.sum()

In [None]:
# Run MRQAP
Y = ...
X = ...
mrqap_obj = MRQAP(...)
mrqap_obj...

In [None]:
# Print summary of MRQAP
...

In [None]:
# Plot distribution of coefficients (permutation test)
...

---
## Task #6
#### Practice MRQAP with node-level hypotheses.

*Hypothesis:*\
*People's **PageRank** in the **reports_to** network can be explained by **tenure** and **level**.*\
\
*HINT 1: Recall that PageRank measures importance of nodes in a network. This is a node-structure property.*\
*HINT 2: Check in `code/libs/helper.py` if there is a function for `ego` that computes `pagerank`.*\
*HINT 3: We are testing a node-level hypothesis. What is the new parameter that `MRQAP` needs?*\
*HINT 4: Where did you store the node-attributes? Check task 1.*

What is your guess? How significant will the results be?
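
As a reminder of what PageRank computes, here is a minimal power-iteration sketch in NumPy. It is not the function from `helper.py`; the function name `pagerank`, the damping factor of 0.85, and the toy network are assumptions for illustration, with rows as sources and columns as targets.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank by power iteration on a directed adjacency matrix."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    safe_out = np.where(out > 0, out, 1.0)  # avoid division by zero
    # Row-normalise; nodes without out-links jump uniformly to every node
    transition = np.where(out > 0, adj / safe_out, 1.0 / n)

    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * rank @ transition
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical toy network: nodes 1 and 2 both report to node 0
toy_adj = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
], dtype=float)

pr = pagerank(toy_adj)
print(pr)
```

The result is one score per node, so the dependent variable in this task is a vector rather than a node-by-node matrix.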

---

In [None]:
# Create node-level (structural) hypothesis: PageRank of node
PR = helper.get_ego_hypothesis(adjacency=..., 
                               ego_fnc=helper...,
                               missing=0)

PR.shape, PR.min(), PR.max(), PR.sum(), PR

In [None]:
# Create node-level (attribute) hypothesis: tenure
T = ...
T.shape, T.min(), T.max(), T.sum(), T

In [None]:
# Create node-level (attribute) hypothesis: level
V = ...
V.shape, V.min(), V.max(), V.sum()

In [None]:
# Run MRQAP
Y = ...
X = ...
mrqap_obj = MRQAP(...)
mrqap_obj.mrqap()

In [None]:
# Print MRQAP summary
...

In [None]:
# Plot distribution of coefficients (permutation test)
...