# Network Assortativity Example

In this notebook we will use the following dataset.

[FauxMesaHigh.gml](http://gawron.sdsu.edu/python_for_ss/course_core/assignments/FauxMesaHigh.gml)

For a good description of the background on this dataset, see the
[R FauxMesaHigh data page](http://search.r-project.org/library/ergm/html/faux.mesa.high.html).

This graph is loaded in the next cell.
After that code runs, `fm` is a `networkx` graph instance containing the friendship links in the synthetically created data set for a fictional high school, "Faux Mesa High".

For more on the Faux Mesa High School graph, see
[the R page on the dataset](http://search.r-project.org/library/ergm/html/faux.mesa.high.html) as well as
[the statnet page](http://www.casos.cs.cmu.edu/tools/computational_tools/datasets/external/Goodreau/index11.php).

The original paper describing the methods and motivation for creating the dataset is:

>Resnick M.D., Bearman, P.S., Blum R.W. et al. (1997). Protecting adolescents from harm. Findings from the National Longitudinal Study on Adolescent Health, Journal of the American Medical Association, 278: 823-32.

The statnet paper is:

>Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris. 2003 statnet: Software tools for the Statistical Modeling of Network Data

In [1]:
import networkx as nx
import urllib.request
import os.path

def url_fetch_networkx_graph(url):
    # Read a GML file over HTTP and return it as a networkx graph.
    with urllib.request.urlopen(url) as filehandle:
        G = nx.read_gml(filehandle)
    return G

github_networks_data = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/networks/'
fm_url = os.path.join(github_networks_data, 'FauxMesaHigh.gml')
#url = 'http://www.casos.cs.cmu.edu/tools/computational_tools/datasets/external/Goodreau/index11.php'
fm = url_fetch_networkx_graph (fm_url).to_undirected()

In [None]:
url22 = 'http://gawron.sdsu.edu/python_for_ss/course_core/assignments/FauxMesaHigh.gml'
fm = url_fetch_networkx_graph (url22)

## Race

The nodes are students.  Links are friendship links.

Race information is available for each fictional student.

In [2]:
race_seq = [fm.nodes[n]['Race'] for n in fm.nodes()]
race_seq[:10]

['Hisp',
 'Hisp',
 'Hisp',
 'NatAm',
 'NatAm',
 'Hisp',
 'NatAm',
 'NatAm',
 'White',
 'White']

These are the races.

In [3]:
races = set(race_seq)
races

{'Black', 'Hisp', 'NatAm', 'Other', 'White'}

### Assortativity Coefficient

The **assortativity coefficient** of a graph measures the extent to which the
vertices of the graph are
connected to other vertices that are like them in some way.
See [Newman - Physical review E, 2003 - APS](https://journals.aps.org/pre/pdf/10.1103/PhysRevE.67.026126?casa_token=6Y08m5BvhMEAAAAA%3Achgp6T6cHHt7mxcfKWrmSL34nZ3r5FK6PQtojIlpPj4vI75nurG8ySN8aSCv_rAA9hI34HNs3Fmy)

We begin with a quantity Newman calls $e_{ij}$, the proportion of edges in a network that connect a vertex of type $i$ to one of type $j$.  It follows from this definition that for any type $i$, $e_{ii}$ is the proportion of edges in a network that connect a vertex of type $i$ to one of the same type.
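As a minimal sketch of how these proportions can be tabulated, assuming a small hypothetical graph `toy` with a `color` attribute (it is not part of the dataset), `networkx` can build the $e_{ij}$ matrix with `attribute_mixing_matrix`:

```python
import networkx as nx

# Hypothetical toy graph: three red and three blue nodes.
toy = nx.Graph()
toy.add_nodes_from([0, 1, 4], color="red")
toy.add_nodes_from([2, 3, 5], color="blue")
# Six ingroup edges plus one red-blue edge.
toy.add_edges_from([(0, 1), (0, 4), (1, 4), (2, 3), (3, 5), (2, 5), (0, 2)])

# Fix the row/column order so the matrix is easy to read.
mapping = {"red": 0, "blue": 1}
e = nx.attribute_mixing_matrix(toy, "color", mapping=mapping, normalized=True)

print(e)        # rows and columns are ordered red, blue
print(e.sum())  # the entries of a normalized mixing matrix sum to 1 (up to rounding)
```

For an undirected graph `networkx` counts each edge once in each direction, so the single red-blue edge is split between the two off-diagonal cells.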

The table below shows the values of $e_{ij}$ for
mixing by race among sexual partners in a 1992 study cited by Newman (J.A. Catania, T.J. Coates, S. Kegels, and M.T. Fullilove, Am. J.
Public Health 82, 284. 1992). This
part of the study focused on heterosexuals,
so there are two vertex types representing men and
women, with edges running only between vertices that are men
and vertices that are women.  The table
shows that mixing by race is highly **assortative** in this network,
meaning that the
individuals in the study strongly preferred partners from the same group.
You can see that, except for the category **Other**, the highest value in each row and column is the $e_{ii}$ value on the diagonal of the table.

<table>
    <tr>
        <td> </td><td> </td> <td> </td> <th> Women </th> <td> </td> <td> </td>
    </tr>
    <tr>
       <th> Men  </th>     <th> Black </th> <th> Hispanic </th>  
  <th> White </th>
        <th> Other </th> <th> $a_{i}$ </th> </tr>
        <tr>
<th>Black </th><td>      0.258</td>  <td> 0.016 </td><td>   0.035 </td> 
<td>  0.013 </td> <td> 0.323 </td>
</tr>
<tr>
    <th>Hispanic </th><td>   0.012</td><td>  0.157 </td><td>   0.058 </td>
    <td>  0.019 </td> <td> 0.247</td>
    </tr>
    <tr>
        <th> White </th><td>      0.013 </td><td> 0.023  </td><td>  0.306 </td>
        <td> 0.035 </td><td> 0.377  </td>     
    </tr>
    <tr>
        <th> Other </th><td>      0.005 </td><td> 0.007 </td><td>   0.024 </td> 
        <td>0.016 </td><td>0.053</td>
    </tr>
    <tr>
        <th> $b_{i}$ </th><td> 0.289 </td><td> 0.204 </td>
    <td> 0.423 </td><td>  0.084</td>
    </tr>
</table>

The column labeled $a_{i}$ and the row labeled $b_{i}$ contain
the row and column sums, respectively.  The table as a whole is sometimes called the **joint probability mixing matrix**.  That is, each entry $e_{ij}$ is the probability that a randomly chosen edge joins a man of type $i$ and a woman of type $j$.

Each value in the $a_{i}$ column is the total proportion of relationships involving men of a particular group in the study, also known
as the marginal probability of being a male of type $i$ in a relationship.
Each $b_{i}$ value is the total proportion of relationships involving women of a particular group in the study, or the marginal probability of being a female of type $i$ in a relationship.

Below we refer to each of $e_{ij}$ as a probability mass.

Because this is a study of heterosexuals, each edge in the graph links a male and a female, so the $a_{i}$ values sum to 1 and the $b_{i}$ values sum to 1 (up to rounding).

In [None]:
# a_i
.323 + .247 + .377 + .053

1.0

In [None]:
# b_i
.289 + .204 + .423 + .084

0.9999999999999999

Newman proposes a single coefficient, $r$, that measures the assortativity of the network overall.

$$
r = \frac{\sum_{i} e_{ii} - \sum_{i} a_{i}b_{i}}{1 - \sum_{i} a_{i}b_{i}}
$$

Some of the important properties:

1. Perfect assortativity (every vertex links to vertices of the same group) gets a score of 1.
2. Perfect dis-assortativity (vertices link only to outgroup members) gets a negative score
   (between 0 and -1).  By definition the above data is perfectly disassortative with respect to the sex attribute, since every link joins individuals of different sexes.
3. A score of 0 means mixing is random.  Each individual links to members
of any group in proportion to that group's share of the edge ends in the graph.  If
group A accounts for 10% of the edge ends, then on average, agents
in the graph assign 10% of their links to group A.

The numerator is the total edge mass of ingroup relations, $\sum_{i} e_{ii}$, minus a penalty, $\sum_{i} a_{i}b_{i}$, built from the marginals: for each group it multiplies the two marginal masses, which in an undirected graph (where $a_{i} = b_{i}$) is the sum of the squared marginal masses.  This is the ingroup mass we would expect if edges were formed at random.  The penalty is applied in the numerator and denominator. When all the probability mass is consumed by ingroup relations, the sum of the $e_{ii}$ will be 1,
and the numerator and denominator are the same, resulting in an assortativity score of 1.
When the penalty exceeds the mass of the ingroup relationships, we get a negative value
in the numerator, resulting in a negative assortativity score.  The penalty is designed to exactly equal the ingroup probability mass when mixing is random, yielding a score of 0.
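To make the formula concrete, here is a minimal sketch that plugs the values from the table above into it. The variable names are illustrative, and the marginals are the published (rounded) $a_{i}$ and $b_{i}$ values rather than sums recomputed from the cells, so the result is only accurate to about two decimal places.

```python
# e[i][j] values copied from the table above (rows: men, columns: women).
groups = ["Black", "Hispanic", "White", "Other"]
e = [
    [0.258, 0.016, 0.035, 0.013],
    [0.012, 0.157, 0.058, 0.019],
    [0.013, 0.023, 0.306, 0.035],
    [0.005, 0.007, 0.024, 0.016],
]
a = [0.323, 0.247, 0.377, 0.053]  # row marginals as printed in the table
b = [0.289, 0.204, 0.423, 0.084]  # column marginals as printed in the table

# Ingroup mass: the sum of the diagonal entries e_ii.
ingroup_mass = sum(e[i][i] for i in range(len(groups)))
# Penalty: the ingroup mass expected under random mixing, sum of a_i * b_i.
expected_mass = sum(a_i * b_i for a_i, b_i in zip(a, b))

r = (ingroup_mass - expected_mass) / (1 - expected_mass)
print(round(r, 3))
```

The coefficient comes out near 0.62, which is consistent with the strongly assortative pattern visible along the diagonal.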


## An example of perfect assortativity

In [4]:
G2 = nx.Graph()
G2.add_nodes_from([0, 1, 4], color="red")
G2.add_nodes_from([2, 3, 5], color="blue")
# All edges are ingroup relations
G2.add_edges_from([(0, 1), (0, 4), (1, 4), (2, 3), (3, 5), (2, 5) ])
print(nx.attribute_assortativity_coefficient(G2, "color"))
# Add one dis-assortative edge
G2.add_edges_from([(0, 2)])
print(nx.attribute_assortativity_coefficient(G2, "color"))

1.0
0.7142857142857143


## An example of perfect dis-assortativity

In [6]:
G3 = nx.Graph()
G3.add_nodes_from([0, 2, 3], color="red")
G3.add_nodes_from([1, 4, 5], color="blue")
G3.add_edges_from([ (0,1),(0,4),(2,5),(2,4),(3,1),(3,5),(0,5), (2,1),(3,4)])
print(nx.attribute_assortativity_coefficient(G3, "color"))
# Add one assortative edge
G3.add_edges_from([(0, 2)])
print(nx.attribute_assortativity_coefficient(G3, "color"))

-1.0
-0.8181818181818187


## Perfectly random assortativity

In [7]:
G3 = nx.DiGraph()
G3.add_nodes_from([0, 2, 3], color="red")
G3.add_nodes_from([1, 4, 5], color="blue")
# Build a complete directed graph: every node is linked to every node, including itself.
for n1 in G3.nodes():
    for n2 in G3.nodes():
        G3.add_edge(n1,n2)
print(nx.attribute_assortativity_coefficient(G3, "color"))

0.0


Note that leaving out the self-loops makes the score lean towards disassortativity, since from a mathematical point of view, one opportunity for
assortativity has been omitted.

In [8]:
G4 = nx.DiGraph()
G4.add_nodes_from([0, 2, 3], color="red")
G4.add_nodes_from([1, 4, 5], color="blue")
for n1 in G4.nodes():
    for n2 in G4.nodes():
        if n1 != n2:
            G4.add_edge(n1, n2)
print(nx.attribute_assortativity_coefficient(G4, "color"))

-0.19999999999999996


As the graph grows larger, the effect of leaving out self-loops diminishes.

In [9]:
G5 = nx.DiGraph()
G5.add_nodes_from(range(100), color="red")
G5.add_nodes_from(range(101,201), color="blue")
for n1 in G5.nodes():
    for n2 in G5.nodes():
        if n1 != n2:
            G5.add_edge(n1, n2)
print(nx.attribute_assortativity_coefficient(G5, "color"))

-0.005025125628140502
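The two scores above follow a simple pattern. With two groups of $m$ nodes each and no self-loops, every node sends $m-1$ of its $2m-1$ links to its own group, so the ingroup mass is $(m-1)/(2m-1)$ while the random baseline is $1/2$. Plugging these into the formula gives $r = -1/(2m-1)$, that is, $-1/5$ for $m=3$ and $-1/199$ for $m=100$. The sketch below checks this numerically; the helper name `complete_two_group_digraph` is made up for this illustration.

```python
import networkx as nx

def complete_two_group_digraph(m):
    """Complete directed graph without self-loops on two groups of m nodes each."""
    G = nx.complete_graph(2 * m, create_using=nx.DiGraph)
    colors = {n: ("red" if n < m else "blue") for n in G.nodes()}
    nx.set_node_attributes(G, colors, "color")
    return G

for m in (3, 10, 50):
    G = complete_two_group_digraph(m)
    r = nx.attribute_assortativity_coefficient(G, "color")
    predicted = -1 / (2 * m - 1)
    print(m, r, predicted)
```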


Newman says:

>Assortative mixing can have a profound effect on the structural properties of a network. For example, assortative mixing of a network by a discrete characteristic will tend to break the network up into separate communities. If people prefer to be friends with others who speak their own language, for example, then one might expect countries with more than one language to separate into communities by language. Assortative mixing by age could cause stratification of societies along age lines.

p. 67

Let's try out this idea on the famous example of the karate club graph, where we know
the `'club'` attribute defines two communities in the graph, and where we know that membership in one or the other club had real-world consequences.

In [10]:
kn = nx.karate_club_graph()
nx.attribute_assortativity_coefficient(kn,'club')

0.717530864197531

The assortativity is high, meaning a large proportion of the edges are ingroup edges, as expected.
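The proportion of ingroup edges can also be counted directly. The following is a small sketch; the helper `ingroup_edge_fraction` is a hypothetical name introduced here, not a `networkx` function.

```python
import networkx as nx

def ingroup_edge_fraction(G, attribute):
    """Fraction of edges whose two endpoints share the same attribute value."""
    same = sum(
        1 for u, v in G.edges()
        if G.nodes[u][attribute] == G.nodes[v][attribute]
    )
    return same / G.number_of_edges()

kn = nx.karate_club_graph()
print(ingroup_edge_fraction(kn, "club"))
```

For an undirected graph this fraction is the $\sum_{i} e_{ii}$ term in the numerator of the coefficient; the coefficient then discounts it by the share expected under random mixing.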

# Faux Mesa High Data

In [5]:
import os.path
import networkx as nx

In [6]:
#Goodreau's Faux Mesa High School 
github_networks_data = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/networks/'
fm_url = os.path.join(github_networks_data, 'FauxMesaHigh.gml')
fm = url_fetch_networkx_graph (fm_url).to_undirected()

In [7]:
type(fm)

networkx.classes.graph.Graph

In [8]:
len(fm.edges())

202

### Assortativity for race

In [9]:
race_seq = [fm.nodes[n]['Race'] for n in fm.nodes()]
race_seq[:10]

['Hisp',
 'Hisp',
 'Hisp',
 'NatAm',
 'NatAm',
 'Hisp',
 'NatAm',
 'NatAm',
 'White',
 'White']

In [10]:
races = {fm.nodes[n]['Race'] for n in fm.nodes()}
races

{'Black', 'Hisp', 'NatAm', 'Other', 'White'}

In [11]:
ac = nx.algorithms.attribute_assortativity_coefficient(fm, 'Race')
ac

0.23195376513754493

Fairly low (compare 0.72 for the karate club)!  Guess these (fictional) high schoolers are pretty integrated.

But wait, let's look at assortativity on a race by race basis, creating binary oppositions
between race $i$ and non-race $i$.

First we try White vs. non-White:

In [12]:
sorted_nodes = dict()
for r in races:
    sorted_nodes[r] = [n for n in fm.nodes() if  fm.nodes[n]['Race'] == r]

In [13]:
sorted_nodes['White']

['166',
 '25',
 '194',
 '89',
 '87',
 '101',
 '104',
 '34',
 '63',
 '19',
 '156',
 '43',
 '40',
 '5',
 '9',
 '141']

The binary White vs. non-White attribute, computed next, shows even less assortativity: Whites and non-Whites are mixing at very close to their proportions.

In [14]:
for n in fm.nodes():
    node_race = fm.nodes[n]['Race']
    if node_race == 'White':
        fm.nodes[n]['Race2'] = 'White'
    else:
        fm.nodes[n]['Race2'] = 'NonWhite'

nx.algorithms.attribute_assortativity_coefficient(fm, 'Race2')

0.07471371092541003

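If we restrict the computation to the edges of the White students, the coefficient is 0.  Note that this is forced by the formula rather than being evidence of random mixing: when every source node belongs to one group, that group's $a_{i}$ is 1, so the penalty $\sum_{i} a_{i}b_{i}$ equals $e_{ii}$ and the numerator vanishes.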

In [15]:
nx.algorithms.attribute_assortativity_coefficient(fm, 'Race2', nodes = sorted_nodes['White'])

0.0

The same happens for Native Americans with the original race attribute.

In [31]:
nx.algorithms.attribute_assortativity_coefficient(fm, 'Race', 
                                                  nodes = sorted_nodes['NatAm'])

0.0
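A quick check on a toy graph suggests why a score computed from a single group's nodes should be read with care. In the sketch below (the graph `toy` and its `color` attribute are hypothetical), the graph as a whole is strongly assortative, yet passing only the red nodes through the `nodes` argument still yields 0. With one source group the mixing matrix has a single non-empty row, so the ingroup mass and the random baseline coincide.

```python
import networkx as nx

# Hypothetical toy graph: two tight color groups joined by one cross edge.
toy = nx.Graph()
toy.add_nodes_from([0, 1, 4], color="red")
toy.add_nodes_from([2, 3, 5], color="blue")
toy.add_edges_from([(0, 1), (0, 4), (1, 4), (2, 3), (3, 5), (2, 5), (0, 2)])

red_nodes = [n for n in toy.nodes() if toy.nodes[n]["color"] == "red"]

# Coefficient over all nodes versus over the red nodes only.
r_all = nx.attribute_assortativity_coefficient(toy, "color")
r_red = nx.attribute_assortativity_coefficient(toy, "color", nodes=red_nodes)
print(r_all)
print(r_red)
```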

How many of these facts have to do with how many members each race has?  After all,
if your race is poorly represented at school, you are more likely to have out-of-group
relationships.

In [34]:
from collections import Counter
distribution = Counter(fm.nodes[n]['Race'] for n in fm.nodes())
distribution

Counter({'Black': 5, 'Hisp': 75, 'NatAm': 50, 'Other': 1, 'White': 16})

Now we see what the name Faux Mesa High is intended to suggest.  In this region, perhaps the Southwest,
the two best represented groups are Hispanic and Native American.

Let's binarize the Native American attribute and check assortativity.  It is higher than we saw for the original race attribute.

In [32]:
for n in fm.nodes():
    node_race = fm.nodes[n]['Race']
    if node_race == 'NatAm':
        fm.nodes[n]['Race3'] = node_race
    else:
        fm.nodes[n]['Race3'] = 'NonNatAm'

nx.algorithms.attribute_assortativity_coefficient(fm, 'Race3')

0.3316790736145576

In [21]:
for n in fm.nodes():
    node_race = fm.nodes[n]['Race']
    if node_race == 'Hisp':
        fm.nodes[n]['Race4'] = node_race
    else:
        fm.nodes[n]['Race4'] = 'NonHisp'

nx.algorithms.attribute_assortativity_coefficient(fm, 'Race4')

0.27511961722488043

And with 'Black' versus 'NonBlack' we see very slight disassortativity.

In [22]:
for n in fm.nodes():
    node_race = fm.nodes[n]['Race']
    if node_race == 'Black':
        fm.nodes[n]['Race5'] = node_race
    else:
        fm.nodes[n]['Race5'] = 'NonBlack'

nx.algorithms.attribute_assortativity_coefficient(fm, 'Race5')

-0.06878306878306872

In [33]:
for n in fm.nodes():
    node_race = fm.nodes[n]['Race']
    if node_race == 'Other':
        fm.nodes[n]['Race6'] = node_race
    else:
        fm.nodes[n]['Race6'] = 'NonOther'

nx.algorithms.attribute_assortativity_coefficient(fm, 'Race6')

-0.0024813895781289606
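The binarization cells above all follow the same pattern, so they could be folded into one helper. This is only a sketch: the function name `binary_assortativity` and the temporary attribute `_binary` are made up here, and the helper works on a copy so that the original graph is left unchanged. It is demonstrated on a small hypothetical graph rather than on `fm`.

```python
import networkx as nx

def binary_assortativity(G, attribute, value):
    """Assortativity of `value` versus everything else; G itself is not modified."""
    H = G.copy()
    for n in H.nodes():
        # Keep the value for members of the group, lump everyone else together.
        is_member = H.nodes[n][attribute] == value
        H.nodes[n]["_binary"] = value if is_member else "Non" + value
    return nx.attribute_assortativity_coefficient(H, "_binary")

# Hypothetical toy graph with two colors.
toy = nx.Graph()
toy.add_nodes_from([0, 1, 4], color="red")
toy.add_nodes_from([2, 3, 5], color="blue")
toy.add_edges_from([(0, 1), (0, 4), (1, 4), (2, 3), (3, 5), (2, 5), (0, 2)])

print(binary_assortativity(toy, "color", "red"))
```

Assuming `fm` and `races` from the cells above, something like `{r: binary_assortativity(fm, 'Race', r) for r in races}` would collect all five binary scores at once.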

In [24]:
other_student = [n for n in fm.nodes if fm.nodes[n]['Race'] == 'Other'][0]
[fm.nodes[n]['Race'] for n in fm.neighbors(other_student)]

['Hisp']

In [26]:
N = sum(distribution.values())
dd = distribution.copy()

for (key,val) in distribution.items():
    dd[key] = val/N

dd

Counter({'Black': 0.034013605442176874,
         'Hisp': 0.5102040816326531,
         'NatAm': 0.3401360544217687,
         'Other': 0.006802721088435374,
         'White': 0.10884353741496598})
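The proportions above are shares of students. The baseline in Newman's coefficient is built from $a_{i}$, a group's share of edge ends, and the two can differ when some groups have more friendships per student than others. The sketch below compares the two shares on a hypothetical star graph; the helper name `node_and_edge_end_shares` is introduced only for this illustration.

```python
from collections import Counter
import networkx as nx

def node_and_edge_end_shares(G, attribute):
    """Return (share of nodes, share of edge ends) for each attribute value."""
    node_counts = Counter(G.nodes[n][attribute] for n in G.nodes())
    end_counts = Counter()
    for n, degree in G.degree():
        end_counts[G.nodes[n][attribute]] += degree
    n_nodes = G.number_of_nodes()
    n_ends = sum(end_counts.values())
    node_share = {k: v / n_nodes for k, v in node_counts.items()}
    end_share = {k: end_counts[k] / n_ends for k in node_counts}
    return node_share, end_share

# Hypothetical star: one red hub linked to three blue leaves.
star = nx.Graph()
star.add_node(0, color="red")
star.add_nodes_from([1, 2, 3], color="blue")
star.add_edges_from([(0, 1), (0, 2), (0, 3)])

node_share, end_share = node_and_edge_end_shares(star, "color")
print(node_share)
print(end_share)
```

In the star the red hub is a quarter of the nodes but holds half of the edge ends, so a headcount alone would understate how often red appears at the end of an edge.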

### Other attributes

In [None]:
fm.nodes()['133']

{'Grade': '7', 'Race': 'Hisp', 'Sex': 'M'}

In [35]:
nx.algorithms.attribute_assortativity_coefficient(fm, 'Sex')

0.27775399723026567

In [36]:
nx.algorithms.attribute_assortativity_coefficient(fm, 'Grade')

0.7430270648475019