# Social and Economic Networks | problem set 2

## Problem 1

We now have a local coordination game where when node $i$ chooses either A or B, it gets the respective utilities:
$$u_i^A = \alpha + d_i^Aa$$
$$u_i^B = \beta + d_i^Bb$$
where $\alpha$ and $\beta$ are constant and non-negative, and $d_i^C$ is the number of neighbours of $i$ that choose C.

The parameters $\alpha$ and $\beta$ give the minimum utility that node $i$ obtains when choosing A and B respectively (for example, if node $i$ chooses A but no neighbour chooses A, it still has a utility of at least $\alpha$).

To find the best reply, we note that node $i$ will choose A if the utility from choosing A is at least as high as the utility from choosing B:
$$u_i^A \geq u_i^B$$
$$\alpha + d_i^Aa \geq \beta + d_i^Bb$$
We divide each side by the total number of neighbours of $i$, $d_i = d_i^A + d_i^B$, and define $f_A = \frac{d_i^A}{d_i}$, and $f_B = \frac{d_i^B}{d_i}$:
$$\frac{\alpha}{d_i} + f_Aa \geq \frac{\beta}{d_i} + f_Bb$$
Since $f_B = 1 - f_A$:
$$f_Aa - (1 - f_A)b \geq \frac{\beta - \alpha}{d_i}$$
$$f_A(a + b) \geq \frac{\beta - \alpha}{d_i} + b$$
$$f_A \geq \frac{\beta - \alpha}{d_i(a + b)} + \frac{b}{a + b}$$

So, assuming $a + b > 0$, node $i$ chooses A when the share $f_A$ of its neighbours choosing A reaches a threshold with two components:
* The baseline threshold $\frac{b}{a + b}$, which captures the relative payoffs per neighbour who makes the same choice as $i$. Without the fixed payoffs, $i$ would choose A whenever $f_A$ is greater than or equal to $\frac{b}{a + b}$.
* A fixed term $\frac{\beta - \alpha}{d_i(a + b)}$, which depends on the difference between the fixed payoffs $\beta$ and $\alpha$, divided by the number of neighbours $d_i$ times the sum of the payoffs per neighbour $a + b$. The higher the minimum utility from choosing B relative to the minimum utility from choosing A, the higher the threshold and the less likely $i$ is to choose A. The importance of this term, however, is decreasing in the number of neighbours and in the payoff from each neighbour who chooses the same as $i$.

Finally, we will not be able to use p-clustering directly, since the threshold is no longer common to all nodes: it depends on $d_i$, so nodes with different numbers of neighbours face different thresholds.
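The best-reply condition derived above can be checked numerically. The following is a minimal sketch, not part of the original problem statement: the function names and the parameter values in the usage lines are hypothetical, and it assumes $a + b > 0$ and $d_i > 0$.

```python
def threshold_for_a(alpha, beta, a, b, degree):
    """Smallest share of A-neighbours at which A is a best reply.

    Implements (beta - alpha) / (d_i (a + b)) + b / (a + b).
    Assumes a + b > 0 and degree > 0 (illustrative restriction).
    """
    if degree <= 0 or a + b <= 0:
        raise ValueError("need degree > 0 and a + b > 0")
    return (beta - alpha) / (degree * (a + b)) + b / (a + b)


def best_reply(alpha, beta, a, b, n_a, n_b):
    """Compare the two utilities directly; ties go to A, as in u_A >= u_B."""
    u_a = alpha + n_a * a
    u_b = beta + n_b * b
    return "A" if u_a >= u_b else "B"


# Hypothetical parameters: alpha = 0, beta = 2, a = 3, b = 2.
# Both nodes below see half of their neighbours choosing A.
low_degree = best_reply(0, 2, 3, 2, n_a=1, n_b=1)    # threshold is about 0.6
high_degree = best_reply(0, 2, 3, 2, n_a=5, n_b=5)   # threshold is about 0.44
print(low_degree, high_degree)
```

With these made-up values, the same share $f_A = 0.5$ leads the node with two neighbours to choose B and the node with ten neighbours to choose A, which illustrates why a single common threshold is not available.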

## Questions About Community Detection  

#### A

Newman and Girvan introduced 'modularity' as a score for measuring the quality of a partition in community detection on networks. In "Resolution limit in community detection", Fortunato and Barthelemy (2007) point out that while Newman and Girvan's modularity led to important advances in community detection, the actual properties of modularity have not been thoroughly investigated. They demonstrate that communities defined by modularity have limitations; specifically, they show both theoretically and experimentally that modularity has an inherent scale that depends on the degree of interconnectedness between pairs of communities within a network structure. When modules are smaller than this scale they cannot be “resolved.” In other words, regardless of the size of a network, through modularity optimization one cannot know if a module is a single module or a cluster of smaller modules. Important substructures can be lost while finding the maximal modularity.
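The resolution limit can be illustrated with a small computation. The sketch below is written for this summary and is not taken from either paper's code. It builds a ring of cliques (complete subgraphs joined to their neighbours by single links) and evaluates modularity as $Q = \sum_s \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]$, where $l_s$ is the number of links inside module $s$, $d_s$ is the total degree of its nodes and $L$ is the total number of links. The function names and the choice of 30 cliques of 5 nodes are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def ring_of_cliques(n_cliques, clique_size):
    """Return (edges, clique_of) for cliques joined in a ring by single links."""
    edges, clique_of = [], {}
    for c in range(n_cliques):
        nodes = [(c, k) for k in range(clique_size)]
        for node in nodes:
            clique_of[node] = c
        edges.extend(combinations(nodes, 2))  # all links inside the clique
        # one link from this clique to the next one around the ring
        edges.append(((c, 0), ((c + 1) % n_cliques, 1)))
    return edges, clique_of


def modularity(edges, community_of):
    """Q = sum over modules of l_s / L - (d_s / 2L)^2 for an undirected graph."""
    total = len(edges)
    internal = defaultdict(int)  # l_s: links with both ends in module s
    degree = defaultdict(int)    # d_s: summed degree of the nodes in module s
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)


edges, clique_of = ring_of_cliques(30, 5)
pairs = {node: c // 2 for node, c in clique_of.items()}  # merge adjacent cliques

q_singles = modularity(edges, clique_of)
q_pairs = modularity(edges, pairs)
print(f"one module per clique: Q = {q_singles:.4f}")
print(f"adjacent cliques merged: Q = {q_pairs:.4f}")
```

In this example the partition that merges adjacent cliques in pairs scores higher (about 0.888) than the partition with one module per clique (about 0.876), so maximising modularity would hide the individual cliques.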

Berry et al. (2011), in "Tolerating the community detection resolution limit with edge weighting", reexamine Fortunato and Barthelemy's analysis of the resolution limit of modularity and show that adding weights to the edges of the network can improve the accuracy of community detection in real networks. In exploring the resolution limit in the context of edge weighting, Berry et al. offer an alternative inequality which takes into account the sum of weights. While this defining inequality still has a resolution limit, it allows more substructures to be detected. Further, Berry et al. go on to delineate how some community detection algorithms can be adapted to maximize weighted modularity.
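To see why edge weights can help, the sketch below uses the standard weighted generalisation of modularity, in which link counts are replaced by sums of weights. It is only an illustration under stated assumptions: the bridge weight of 0.1 is hand-picked and is not the weighting scheme proposed by Berry et al., and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def weighted_ring_of_cliques(n_cliques, clique_size, bridge_weight):
    """Ring of cliques with weight 1.0 inside cliques and bridge_weight between."""
    weighted_edges, clique_of = [], {}
    for c in range(n_cliques):
        nodes = [(c, k) for k in range(clique_size)]
        for node in nodes:
            clique_of[node] = c
        weighted_edges.extend((u, v, 1.0) for u, v in combinations(nodes, 2))
        weighted_edges.append(((c, 0), ((c + 1) % n_cliques, 1), bridge_weight))
    return weighted_edges, clique_of


def weighted_modularity(weighted_edges, community_of):
    """Q_w = sum over modules of w_s / W - (s_s / 2W)^2."""
    total = sum(w for _, _, w in weighted_edges)
    internal = defaultdict(float)  # w_s: weight inside module s
    strength = defaultdict(float)  # s_s: summed weighted degree of module s
    for u, v, w in weighted_edges:
        cu, cv = community_of[u], community_of[v]
        strength[cu] += w
        strength[cv] += w
        if cu == cv:
            internal[cu] += w
    return sum(internal[c] / total - (strength[c] / (2 * total)) ** 2
               for c in strength)


def compare_partitions(bridge_weight, n_cliques=30, clique_size=5):
    """Return (Q for one module per clique, Q for adjacent cliques merged)."""
    edges, clique_of = weighted_ring_of_cliques(n_cliques, clique_size, bridge_weight)
    pairs = {node: c // 2 for node, c in clique_of.items()}
    return weighted_modularity(edges, clique_of), weighted_modularity(edges, pairs)


for bridge_weight in (1.0, 0.1):
    q_singles, q_pairs = compare_partitions(bridge_weight)
    print(f"bridge weight {bridge_weight}: singles {q_singles:.4f}, pairs {q_pairs:.4f}")
```

With unit weights the merged partition scores higher, while with the down-weighted bridges the partition with one module per clique scores higher. This suggests how a weighting that de-emphasises links between communities can make small modules easier to resolve.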

1: Fortunato, Santo, and Marc Barthelemy. "Resolution limit in community detection." Proceedings of the National Academy of Sciences 104.1 (2007): 36-41.

2: Berry, Jonathan W., et al. "Tolerating the community detection resolution limit with edge weighting." Physical Review E 83.5 (2011): 056119.

#### B

Given the complexity and diversity of both the applications of network science generally and the methods of community detection more specifically, a formal and robust definition of community has been a topic of interest in the field. Ground-truth is an attempt to fix what is meant by a "community" in network science so that community detection can be evaluated across methodologies. Peel et al., in “The ground truth about metadata and community detection in networks”, note that a particular weakness of general definitions of ground-truth communities (a single partition of the network’s nodes into groups) is that they often cannot be extended beyond theoretical applications, since both the data-generating process and what constitutes a true "partition" in "real" social networks are unknown.

To account for this, in “Defining and Evaluating Network Communities based on Ground-truth”, Yang and Leskovec (2012) propose a methodology for more rigorously evaluating ground-truth communities. They use a set of large real-world social networks in which the group membership of nodes is explicitly defined. Specifically, they examine how various common structural definitions of network communities compare to ground-truth communities, four of which are strongly correlated with the real groupings. They go on to point out the differences between structural and functional definitions of communities and suggest that using the common function of nodes to define ground-truth communities yields more robust interpretations of “real networks.”

Still, ground-truth definitions of community present challenges. Peel et al. introduce two novel statistical techniques that quantify the relationship between metadata and community structure for a variety of models, and they show that there can be no algorithm that is optimal for every community detection task (i.e. a No Free Lunch theorem for community detection).

1: Yang, Jaewon, and Jure Leskovec. "Defining and evaluating network communities based on ground-truth." Knowledge and Information Systems 42.1 (2015): 181-213.

2: Peel, Leto, Daniel B. Larremore, and Aaron Clauset. "The ground truth about metadata and community detection in networks." Science Advances 3.5 (2017): e1602548.

#### C

Fortunato and Barthelemy (2007) illustrate their theoretical discussion in "Resolution limit in community detection" with five examples of networks taken from [the following repository](http://www.weizmann.ac.il/mcb/UriAlon/download/collection-complex-networks). In addition to the five network datasets used to detect communities (the transcriptional regulation network of Saccharomyces cerevisiae (yeast); Escherichia coli; a network of electronic circuits; a social network; and the neural network of Caenorhabditis elegans), the repository also contains other datasets. Each is structured as a list of the edges in the network: the first column is the source of the edge and the second column is the target of the edge.
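A two-column edge list of this kind is straightforward to load. The following is a minimal sketch, assuming whitespace-separated columns; the sample data is made up, and skipping lines that start with `#` is an assumption rather than a documented property of the repository's files.

```python
from collections import defaultdict


def read_edge_list(lines):
    """Parse 'source target' lines into a directed adjacency mapping."""
    successors = defaultdict(set)
    for line in lines:
        fields = line.split()
        # Skip blank, malformed and (by assumption) '#'-commented lines.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        source, target = fields[0], fields[1]
        successors[source].add(target)
        successors.setdefault(target, set())  # keep nodes with no outgoing edge
    return dict(successors)


# Hypothetical sample in the "source target" layout described above.
sample = """1 2
1 3
2 3
"""
graph = read_edge_list(sample.splitlines())
print(graph)
```

For a real file, the same function can be called on an open file object, since it only iterates over lines.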

Unlike Fortunato and Barthelemy, Berry et al. (2011) in "Tolerating the community detection resolution limit with edge weighting" use two different datasets, both of which are artificially generated networks: the ring of cliques and the 128-node benchmark data used by Lancichinetti et al. in "Benchmark graphs for testing community detection algorithms". A ring of cliques graph consists of cliques (i.e. modules) connected through single links. Each clique is a complete graph. The 128-node benchmark data is a simple artificial graph with a built-in community structure that tries to reflect the properties of nodes and communities found in real networks. An algorithm is then set to recover this structure.

Yang and Leskovec, in "Defining and evaluating network communities based on ground-truth", included data from [Defense Advanced Research Projects Agency's (DARPA)](https://opencatalog.darpa.mil/) open data catalogue. They also used social network data from Mislove et al.'s "Measurement and analysis of online social networks." [This data](http://socialnetworks.mpi-sws.org/datasets.html) was gathered from four popular online social networks (Flickr, YouTube, LiveJournal, and Orkut) and contains over 11.3 million users and 328 million links.


Other repositories of complex network data include: 
- http://www.chesapeakebay.net/data
- https://journals.aps.org/pre/abstract/10.1103/PhysRevE.90.012811
- http://snap.stanford.edu/data/index.html#communities
- http://www-personal.umich.edu/~mejn/netdata/
- http://www.linkgroup.hu/links.php#Networkdatasets
- http://www.princeton.edu/~ina/data/index.html
- https://github.com/caesar0301/awesome-public-datasets

## Exercise 1

#### A

#### B

## Exercise 2 

We did this exercise in R, so please see the separate pdf called "HW2_ex2.pdf", produced using R Markdown.

## Exercise 3 