In [1]:
from __future__ import print_function  # must be the first statement in the cell

from IPython.display import HTML
from datascience import *

import matplotlib
from matplotlib import animation as animation
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
plt.style.use('fivethirtyeight')

import networkx as nx
from networkx.algorithms import bipartite

import pandas as pd

import pickle

from ipywidgets import interact, interactive, fixed, interact_manual, widgets

def css_styling():
    # read the shared notebook stylesheet and return it as renderable HTML
    with open('../notebook_styles.css', 'r') as f:
        styles = f.read()
    return HTML(styles)
css_styling()

### Randomize partners

**Question** Write your name here

<div class='response'>
[answer here]
</div>

**Question** Write your partner's name here

<div class='response'>
[answer here]
</div>

**Question** What was your partner's favorite class last year?

<div class='response'>
[answer here]
</div>

## Lab 10 - Exploring the final project datasets

Today, we're going to explore the two datasets that you can decide to use for your final projects.

The goal of this lab is to give you a chance to spend some time exploring and familiarizing yourself with these two datasets. Then, in the coming week, you will:

* find a partner
* work with your partner to pick a topic to explore in the project
* schedule a meeting with me to discuss your plan

## Workplace contact networks

This dataset comes from a [research project](http://www.sociopatterns.org/datasets/contacts-in-a-workplace/) that wanted to measure physical contact in the workplace. Epidemiologists are interested in understanding patterns of physical contact because these patterns are important for building realistic models of infectious disease transmission. You can find more information in the [paper](https://www.cambridge.org/core/journals/network-science/article/data-on-facetoface-contacts-in-an-office-building-suggest-a-lowcost-vaccination-strategy-based-on-community-linkers/18AB49AB4F2AEA33CE7501F06ADBC8E8).

In order to read in the data, we'll have to open up two text files:

* `contacts.csv` has the edge list: one row per contact, with a time and the ids of the two people involved
* `department.csv` has information on which department each node belongs to

This first chunk of code will read the edge-list into a `pandas` dataframe called `contacts_df`.

In [None]:
contacts_df = pd.read_csv(open(os.path.join("..", "data", "workplace-contact", "contacts.csv")),
                         names=['time', 'id1', 'id2'])
print(contacts_df.shape)
contacts_df.head()
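
Because each row of the edge list is a single timestamped contact, the same pair of people can appear in many rows. Here is a minimal sketch of one way to count contacts per pair and store the count as an edge weight. It uses made-up data with the same column names as `contacts_df`; the names `toy_contacts`, `pair_counts`, `weighted`, and the helper columns `a` and `b` are illustrative assumptions, not part of the lab's data.

```python
import networkx as nx
import pandas as pd

# Made-up contact records with the same column names as contacts_df.
toy_contacts = pd.DataFrame({
    'time': [20, 40, 60, 80, 100],
    'id1':  [1, 2, 1, 3, 2],
    'id2':  [2, 1, 3, 1, 3],
})

# Put each pair in a fixed order so that (1, 2) and (2, 1) count as the same pair.
toy_contacts['a'] = toy_contacts[['id1', 'id2']].min(axis=1)
toy_contacts['b'] = toy_contacts[['id1', 'id2']].max(axis=1)

# Count how many rows (contacts) each unordered pair has.
pair_counts = (toy_contacts.groupby(['a', 'b'])
                           .size()
                           .reset_index(name='n_contacts'))

# Build a graph whose edge weight is the number of recorded contacts.
weighted = nx.Graph()
for a, b, n in pair_counts.itertuples(index=False, name=None):
    weighted.add_edge(a, b, weight=int(n))

print(pair_counts)
```

The same steps should carry over to `contacts_df`, though whether contact counts are the right weight depends on the question you want to ask.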

This second chunk of code will read the department information into a `pandas` dataframe called `departments_df`.

In [None]:
departments_df = pd.read_csv(open(os.path.join("..", "data", "workplace-contact", "department.csv")),
                             names=['id', 'department'])
print(departments_df.shape)
departments_df.head()

Now we'll use the two dataframes we just read in to

* create a networkx `Graph` object
* add the appropriate attributes (i.e., department) to each node

In [None]:
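# Build an undirected graph from the edge list. A plain Graph keeps one edge
# per pair of people, so if a pair appears in several rows only the last
# 'time' value read is kept on that edge.
# Note: this function is called from_pandas_dataframe in networkx versions before 2.1.
contact_network = nx.from_pandas_edgelist(contacts_df,
                                          source='id1',
                                          target='id2',
                                          edge_attr='time',
                                          create_using=nx.Graph())

# Attach each node's department as a node attribute. Keyword arguments are
# used because the positional order of name/values changed in networkx 2.0.
nx.set_node_attributes(contact_network,
                       name='department',
                       values=departments_df.set_index('id').to_dict()['department'])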
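
Once the department is stored on each node, it can be used to split edges into within-department and between-department contacts. The sketch below shows the idea on a made-up four-person graph; the graph `toy`, the labels in `toy_departments`, and the counters are all hypothetical and only illustrate the pattern.

```python
import networkx as nx

# Toy contact graph: four people and three contacts (made-up data).
toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (3, 4)])

# Hypothetical department labels, keyed by node id.
toy_departments = {1: 'A', 2: 'A', 3: 'B', 4: 'B'}
nx.set_node_attributes(toy, values=toy_departments, name='department')

# Read the attribute back as a plain dictionary {node: department}.
dept = nx.get_node_attributes(toy, 'department')

# Count edges whose endpoints share a department, and edges that do not.
within = 0
between = 0
for u, v in toy.edges():
    if dept[u] == dept[v]:
        within += 1
    else:
        between += 1

print('within:', within, 'between:', between)
```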

**Question** Investigate this network. Use approaches we learned in previous labs to quantify different aspects of the network structure. As with any exploratory analysis, you might find it helpful to make plots to visualize the data.  
Some example questions you might explore:
* how many nodes / edges?
* what is the degree distribution? average degree?
* how many components are in the network?
* etc...
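
As a reminder of how these quantities can be computed with networkx, here is a short sketch on a small made-up graph (not the workplace data). The graph `toy_graph` and the variable names are assumptions chosen for illustration.

```python
import networkx as nx

# Made-up graph: a triangle, a separate pair, and one isolated node.
toy_graph = nx.Graph()
toy_graph.add_edges_from([(1, 2), (2, 3), (3, 1), (4, 5)])
toy_graph.add_node(6)

n_nodes = toy_graph.number_of_nodes()
n_edges = toy_graph.number_of_edges()

# Degree of every node, as a dictionary {node: degree}.
degrees = dict(toy_graph.degree())
avg_degree = sum(degrees.values()) / n_nodes

# degree_histogram()[k] is the number of nodes with degree k.
degree_counts = nx.degree_histogram(toy_graph)

n_components = nx.number_connected_components(toy_graph)

print(n_nodes, n_edges, avg_degree, degree_counts, n_components)
```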

In [None]:
...
...
...

**Question** Now that you have explored the data a little bit, talk to your partner and try to come up with two questions you might be interested in answering using this dataset. (These don't have to end up being project topics; this is just to get you thinking.)

<div class='response'>
1. [answer here]  
2. [answer here]
</div>

## Indian villages

The second dataset comes from a large study about microfinance that several economists conducted a few years ago. The researchers collected information about many different kinds of network relationships among people who lived in 75 different villages in Southern India (numbered 1 to 77, with villages 13 and 22 missing). Their ultimate goal was to understand what kinds of social influence factors might be important in whether or not people decide to make use of a microfinance program. You can see a description of their results in [their paper](http://science.sciencemag.org/content/341/6144/1236498).

There is also some more information on their dataset on [Prof. Matt Jackson's website](http://web.stanford.edu/~jacksonm/Data.html).

In particular, the [README](https://web.stanford.edu/~jacksonm/IndianVillagesREADME.pdf) file for the Indian Villages data is helpful. Here is an excerpt that describes some important parts of the dataset:

    3. Data
    The “Data” folder contains two subfolders: “Network Data” and “Demographics and Outcomes.” In the "Network Data" folder, there are adjacency matrices for each of the 75 villages surveyed. The 75 villages are numbered 1-77 (villages 13 and 22 are missing.) About half of households received detailed surveys in which individuals were asked to list the names of people with whom they shared a certain relationship. Households were randomly sampled and stratified by religion and geographic sub-region.
    For each variable, an individual matrix and a household matrix were constructed. A relationship between households exists if any household members indicated a relationship with members from the other household. These questions were asked in the individual survey.
    Individuals were asked who they:
    -- borrow money from
    -- give advice to
    -- help with a decision
    -- borrow kerosene or rice from
    -- lend kerosene or rice to
    -- lend money to
    -- obtain medical advice from
    -- engage socially with
    -- are related to
    -- go to temple with
    -- invite to one's home
    -- visit in another's home.
    We also include the ALL network which is a union and an AND network which is the intersection. This is done both at the individual and household levels.
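
To make the union (ALL) and intersection (AND) ideas from the excerpt concrete, here is a minimal sketch using two made-up relations on the same four households. The graphs `lend` and `visit` and the helper `edge_list` are hypothetical; this illustrates the definitions and is not a reconstruction of how the study's files were built.

```python
import networkx as nx

households = [1, 2, 3, 4]

# Two made-up relations defined on the same set of households.
lend = nx.Graph()
lend.add_nodes_from(households)
lend.add_edges_from([(1, 2), (2, 3)])

visit = nx.Graph()
visit.add_nodes_from(households)
visit.add_edges_from([(2, 3), (3, 4)])

# Union: an edge is present if it appears in either relation.
union_net = nx.compose(lend, visit)

# Intersection: an edge is present only if it appears in both relations.
and_net = nx.intersection(lend, visit)

def edge_list(g):
    """Return the edges of g as a sorted list of (small, large) pairs."""
    return sorted(tuple(sorted(e)) for e in g.edges())

print(edge_list(union_net))
print(edge_list(and_net))
```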


We've done much of the work needed to actually read the data in already. This function will be helpful:

In [None]:
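def load_iv_relation(relation):
    """Load the household-level Indian Villages networks for one relation."""
    fn = os.path.join("..", "data", "indian-villages", "iv_hh_" + relation + ".pickle")
    # use a context manager so the file is closed after loading
    with open(fn, 'rb') as f:
        return pickle.load(f)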

### Using the functions to load Indian Villages data

Here's an example that reads in the Indian Villages network for households using the 'lendmoney' relation for village id 6.

First, we load the `lendmoney` networks using the function defined above:

In [None]:
net_lendmoney = load_iv_relation('lendmoney')

The result, `net_lendmoney`, is a dictionary whose keys are village ids. This means that we can use the id of a specific village to look up the lendmoney network for that village:

In [None]:
net_lendmoney_village6 = net_lendmoney[6]

nx.draw(net_lendmoney_village6)

This list has all of the different types of network relation that are available.

In [None]:
all_relations = """borrowmoney
giveadvice
helpdecision
keroricecome
keroricego
lendmoney
medic
nonrel
rel
templecompany
visitcome
visitgo
allVillageRelationships
andRelationships""".split('\n')

print(all_relations)
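
One way to use a list of relations is to compare the same village across relation types. The sketch below assumes a nested dictionary of the form `{relation: {village_id: graph}}`; the function `count_edges_by_relation` and the toy networks are hypothetical and stand in for the real data.

```python
import networkx as nx

def count_edges_by_relation(networks_by_relation, village_id):
    """Return {relation: number of edges} for one village."""
    return {relation: networks[village_id].number_of_edges()
            for relation, networks in networks_by_relation.items()}

# Made-up stand-ins for two relations in village 6.
toy_networks = {
    'lendmoney': {6: nx.path_graph(4)},      # path on 4 nodes: 3 edges
    'visitcome': {6: nx.complete_graph(4)},  # complete graph on 4 nodes: 6 edges
}

toy_counts = count_edges_by_relation(toy_networks, 6)
print(toy_counts)
```

With the lab data, the outer dictionary could presumably be built with something like `{r: load_iv_relation(r) for r in all_relations}`, assuming a pickle file exists for every relation name in the list.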

Similarly, this list has the ids of all of the different villages in the dataset:

In [None]:
all_village_ids = list(range(1, 78, 1))
all_village_ids.remove(13)
all_village_ids.remove(22)

print(all_village_ids)

So we can compute the number of nodes in each village like this:

In [None]:
num_nodes = make_array()

# count the nodes in each village's lendmoney network, in village-id order
for cur_village in all_village_ids:
    num_nodes = np.append(num_nodes, nx.number_of_nodes(net_lendmoney[cur_village]))

# one row per village: its id and its number of nodes
lendmoney_num_nodes = Table().with_columns('village_id', all_village_ids,
                                           'num_nodes', num_nodes)

lendmoney_num_nodes
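
The same loop pattern extends to other per-village statistics. Here is a minimal sketch that collects several measures at once, using a made-up dictionary of graphs in place of `net_lendmoney`; the function `summarize_villages` and the toy data are assumptions for illustration.

```python
import networkx as nx

def summarize_villages(networks):
    """Return one (village_id, nodes, edges, avg_degree, density) tuple per village."""
    rows = []
    for village_id in sorted(networks):
        g = networks[village_id]
        n = g.number_of_nodes()
        m = g.number_of_edges()
        # each edge contributes to the degree of two nodes
        avg_degree = 2 * m / n if n > 0 else 0.0
        rows.append((village_id, n, m, avg_degree, nx.density(g)))
    return rows

# Made-up stand-in for a {village_id: graph} dictionary.
toy_villages = {1: nx.path_graph(3), 2: nx.complete_graph(4)}

summary = summarize_villages(toy_villages)
for row in summary:
    print(row)
```

Each tuple could then be turned into a row of a `Table` or a `pandas` dataframe for plotting.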


**Question** Now investigate these networks. Use approaches we learned in previous labs to quantify different aspects of the network structure. As with any exploratory analysis, you might find it helpful to make plots to visualize the data.   
You might choose to focus on one village, or on one type of relationship. There are tons of possibilities.

In [None]:
...
...
...

**Question** Now that you have explored the data a little bit, talk to your partner and try to come up with two questions you might be interested in answering using this dataset. (These don't have to end up being project topics; this is just to get you thinking.)

<div class='response'>
1. [answer here]  
2. [answer here]
</div>

## Submit the lab

You're almost done! Now please create a PDF version of your completed lab by **either**:

* printing your notebook to a PDF file, or
* going to the Jupyter 'File' menu, choosing 'Download as' and then 'PDF via LaTeX (.pdf)'.

Please save the resulting .pdf on your computer and then **submit the .pdf on bcourses**.

**The lab must be submitted by the end of the day on Monday, Nov. 13. Late labs will not be accepted.**