## Homework 2 

### Personal network surveys

In this homework, we will be analyzing some data from the [General Social Survey](http://gss.norc.org/) (GSS).
The GSS is the survey that was the basis of the debate over whether or not Americans are becoming more socially isolated, which we discussed in class.

In [None]:
from IPython.core.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

In [None]:
import os
os.getcwd()

In [None]:
#Loading testing data
from client.api.notebook import Notebook 
hw02 = Notebook('hw02.ok')
_ = hw02.auth(inline=True)

The file `GSS.csv` has an extract from the GSS which we will analyze today.

Here is the [codebook](http://gss.norc.org/documents/codebook/GSS_Codebook.pdf) for the entire GSS. Of course, the GSS is huge, so you will have to search through the codebook for the variable names included in this extract. (Don't do this by hand -- use your PDF viewer's search function.)

Read the GSS extract into a Table called `gss_data`.

In [None]:
url = 'GSS.csv'
gss_data = Table.read_table(url)

**Question.** How many rows and how many columns does `gss_data` have? Print out the first several rows to take a look at its contents.

In [None]:
gss_number_of_rows = ...
gss_number_of_cols = ...

print("num rows: ", gss_number_of_rows)
print("num cols: ", gss_number_of_cols)

In [None]:
_ = hw02.grade('q1')

In [None]:
gss_data

**Question** What range of years is covered by this dataset? Answer this by finding the largest and smallest year.

In [None]:
largest_year = ...
smallest_year = ...

print("earliest year: ", smallest_year)
print("latest year: ", largest_year)

In [None]:
_ = hw02.grade('q2')

We are interested in the years when the 'important matters' question was asked of survey respondents. It turns out that this question was only asked in 1985 and 2004. In order to continue with our analysis, we will pick out only the rows of the dataset that correspond to those two years.

**Question** Make two new datasets, `gss_1985` and `gss_2004`, which have only the responses from 1985 and from 2004, respectively.

In [None]:
gss_1985 = gss_data.where('year', are.equal_to(1985))
gss_2004 = gss_data.where('year', are.equal_to(2004))

In [None]:
_ = hw02.grade('q3')

In [None]:
_ = hw02.grade('q4')

**Question** How many responses are there from 1985, and how many from 2004? 

In [None]:
responses_from_1985 = ...
responses_from_2004 = ...

In [None]:
_ = hw02.grade('q5')

**Question** Make a table of the responses to the `numgiven` question for each year.

In [None]:
gss_1985_numgiven = ...

In [None]:
gss_2004_numgiven = ...

Your table for 2004 should show that quite a few respondents have -1 as the value of `numgiven`. These respondents actually did not answer the important matters name generator.

**Question** Narrow the 2004 dataset down so that it does not include the respondents who have -1 values for `numgiven`.

In [None]:
gss_2004_interviewed = gss_2004.where('numgiven', are.above(-1))

In [None]:
_ = hw02.grade('q7')

**Question** Narrow both datasets down so that they only have respondents who were asked the `numgiven` question and who provided answers to that question.

In [None]:
gss_1985_responded = gss_1985.where('numgiven', are.not_equal_to(9))
gss_2004_responded = gss_2004_interviewed.where('numgiven', are.not_equal_to(9))

In [None]:
_ = hw02.grade('q7b')

Now we have the set of respondents we will study in more detail: those who responded to the 'important matters' name generator.

**For the problems below, please use `gss_1985_responded` and `gss_2004_responded`.**

Many analysts have focused on how many survey respondents report that they don't discuss important matters with anyone.  They interpret the fraction of respondents who don't report discussing important matters with anyone as an indicator for the amount of social isolation. (These respondents who report not discussing important matters with anyone have `numgiven` equal to 0.)
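
As a minimal sketch of how this indicator works, the following uses a small made-up array of `numgiven` values (not GSS data) and plain NumPy. The names `toy_numgiven`, `toy_isolated`, and `toy_proportion_isolated` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical numgiven values for six made-up respondents (not GSS data)
toy_numgiven = np.array([0, 2, 5, 0, 1, 3])

# A respondent counts as isolated when they name no confidants
toy_isolated = toy_numgiven == 0

# The mean of a boolean array is the fraction of True values
toy_proportion_isolated = np.mean(toy_isolated)
print(toy_proportion_isolated)
```

Here two of the six made-up respondents have `numgiven` equal to 0, so the proportion is one third.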

**Question.** Do you think this is a good way to try to quantify social isolation? Name one way this could be a good measure of social isolation, and one way this could be a bad measure of social isolation. Please be specific.

<div class='response'>
[answer here]
</div>

**Question** For both the 1985 and 2004 datasets, create a new variable, `isolated`, which has the value False if the respondent reports discussing important matters with anyone, and True otherwise.

In [None]:
isolated_1985 = ...
isolated_2004 = ...

In [None]:
_ = hw02.grade('q8')

**Question** Using the variable you just created, what proportion of respondents was socially isolated in 1985? In 2004?

In [None]:
proportion_isolated_1985 = ...
proportion_isolated_2004 = ...

print("Proportion isolated in 1985: ", proportion_isolated_1985)
print("Proportion isolated in 2004: ", proportion_isolated_2004)

In [None]:
_ = hw02.grade('q9')

In [None]:
_ = hw02.grade('q10')

In [None]:
gss_1985

Here is a function that you may find useful in answering the next question. Given a row in a GSS dataset, the function returns `True` if one of the alters is a spouse, and `False` otherwise:

In [None]:
def reports_spouse(row):
    # True if any of the five alters (spouse1 ... spouse5) is coded 1, i.e. is a spouse
    return any(row.item('spouse' + str(idx)) == 1 for idx in range(1, 6))

**Question** What proportion of married respondents named a spouse?

In [None]:
married_1985 = gss_1985_responded.where("marital", are.equal_to(...))
married_spouses_1985 = married_1985.apply(...)
married_spouses_proportion_1985 = ...

married_2004 = gss_2004_responded.where("marital", are.equal_to(...))
married_spouses_2004 = married_2004.apply(...)
married_spouses_proportion_2004 = ...

print("proportion of married respondents naming spouse in 1985: ", married_spouses_proportion_1985)
print("proportion of married respondents naming spouse in 2004: ", married_spouses_proportion_2004)

In [None]:
_ = hw02.grade('q14')

### Homophily

Below, you will find the functions that we used to convert data from wide to long as part of Lab 1. The `wide_to_long` function has been slightly modified to account for the different format of the variable names in the GSS dataset, but it works in the same way we saw in the lab.

In [None]:
def repeat_single_col(data, var_name, times=5):
    """Repeats a single column multiple times.
    
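    Parameters
    ----------
    data : Table
        The data table containing the column to repeat.
    var_name : str
        Text that contains the name of the column to repeat.
    times : int
        The number of times to repeat the column (default 5).
    
    Returns
    -------
    np.array
        A single array with the contents of the column repeated `times` times.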
    
    Examples
    --------
    >>> repeat_single_col(Table().with_columns(['respondent_age', [10]]),
                          'respondent_age')
    
    array([10, 10, 10, 10, 10])
    """
    new_col = np.tile(data.column(var_name), times)
    return new_col

def wide_to_long(data, var_name, times=5):
    """Given columns of alter characteristics, stack them into one long column.
    
    Parameters
    ----------
    data : Table
        The data table containing the alter characteristics
    var_name : str
        Text that contains the variable name; columns of the dataset should
        match the pattern: [var_name][alter_number]
        For example, if var_name is 'age' then this function expects to find
        columns in the survey dataset named 
        'age1', 'age2', 'age3', 'age4', and 'age5'
    times : int
        The number of columns for each characteristic
    
    Returns
    -------
    np.array
        A single array with the contents of all of the columns stacked on top of one another.
    
    Examples
    --------
    >>> wide_to_long(Table().with_columns(['age1', [10, 15],
                                           'age2', [30, 35],
                                           'age3', [20, 15],
                                           'age4', [60, 70],
                                           'age5', [20, 25]]),
                     'age')
    
    array([10, 15, 30, 35, 20, 15, 60, 70, 20, 25])
    """
    new_col = np.concatenate([data.column(var_name + str(idx)) for idx in range(1,times+1)])
    return new_col

**Question** Now we will use these functions to convert the wide-format data from 1985 and 2004 into long format. This will enable us to examine whether or not there is evidence of homophily in the GSS confidant reports from those two years.

Follow the pattern that we used in Lab 1 to do this. Be sure to include the following:

* ego's age
* alter's gender (called 'sex' in the dataset)
* alter's age
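
The reshaping step can be sketched on a tiny made-up example. This is only an illustration, assuming two respondents with two alters each; the arrays `ego_age`, `alter_age_1`, and `alter_age_2` are hypothetical and are not columns of `GSS.csv`. It uses the same NumPy calls that `repeat_single_col` and `wide_to_long` rely on.

```python
import numpy as np

# Hypothetical wide-format data: two respondents, two alters each (not GSS data)
ego_age = np.array([30, 52])
alter_age_1 = np.array([28, 50])   # age of each respondent's first alter
alter_age_2 = np.array([35, -1])   # second alter; -1 marks "no alter reported"

# Ego columns are repeated once per alter slot, as repeat_single_col does
long_ego_age = np.tile(ego_age, 2)

# Alter columns are stacked end to end, as wide_to_long does
long_alter_age = np.concatenate([alter_age_1, alter_age_2])

# Each position now describes one ego-alter pair
for ego, alter in zip(long_ego_age, long_alter_age):
    print(ego, alter)
```

Because both arrays are built in the same order (all first alters, then all second alters), position `i` in each array refers to the same ego-alter pair.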

In [None]:
gss_1985_long_raw = ...

In [None]:
gss_2004_long_raw = ...

In [None]:
_ = hw02.grade('q15')

In [None]:
gss_1985_responded.show()

**Question** Similar to Lab 1, not all respondents reported on 5 alters. In cases where alter information is missing, `alter_age` is coded as -1. Furthermore, in cases where respondents did report on an alter but did not know or refused to give the alter's age, [the codebook](http://gss.norc.org/documents/codebook/GSS_Codebook.pdf) tells us that `alter_age` will have the value 98 or 99.

Create the Tables `gss_1985_long` and `gss_2004_long`, which start from `gss_1985_long_raw` and `gss_2004_long_raw` and filter out rows where `alter_age` equals -1, 98, or 99, so that we are left with only actual reported alters whose age was given.
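
The idea of dropping missing-data codes can be illustrated with plain NumPy on made-up values. This is a sketch under stated assumptions, not the Table-based approach the cells below ask for: `toy_alter_age`, `toy_ego_age`, and `keep` are hypothetical names, and the data are invented.

```python
import numpy as np

# Hypothetical long-format values (not GSS data); -1, 98, and 99 are missing-data codes
toy_alter_age = np.array([28, -1, 99, 45, 98, 61])
toy_ego_age = np.array([30, 30, 52, 52, 40, 40])

# True where the alter age is a real reported value rather than a missing-data code
keep = ~np.isin(toy_alter_age, [-1, 98, 99])

# Apply the same mask to every column so the ego-alter pairs stay aligned
print(toy_ego_age[keep])
print(toy_alter_age[keep])
```

A single combined mask keeps the same rows as three successive filters, one per code.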

In [None]:
gss_1985_long = gss_1985_long_raw.where(...)
gss_1985_long = gss_1985_long.where(...)
gss_1985_long = gss_1985_long.where(...)

In [None]:
gss_2004_long = gss_2004_long_raw.where(...)
gss_2004_long = gss_2004_long.where(...)
gss_2004_long = gss_2004_long.where(...)

In [None]:
_ = hw02.grade('q16')

**Question** Create a scatterplot of the respondent's age and the alter's age (make a separate plot for 1985 and for 2004).

In [None]:
...

In [None]:
...

**Question** What do the scatter plots you made suggest about homophily in Americans' confidant networks? How similar or different are these patterns to what we saw from our survey of Berkeley students? (Note: there is no single right answer here -- I just want you to interpret the plots and provide an argument for why your interpretation might be right.)

[answer here]

# Adjacency Matrices and Adjacency Lists

**Question** <br>
Consider the undirected graph shown in the figure below.

<img src="Graph1.png" height=40% width=40%>

<ol>
<li>Write down the adjacency matrix for this graph.</li>
<li>Write down the adjacency list for this graph (there can be different ways to represent an adjacency list).</li>
<li>Which representation is better for this graph and why?</li>
</ol>
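
To make the two representations concrete, here is a minimal sketch in plain Python. The graph is hypothetical (four nodes and three edges chosen for illustration) and is not the graph in the figure; the names `nodes`, `edges`, `adj_list`, and `adj_matrix` are likewise assumptions.

```python
# Hypothetical undirected graph on four nodes (not the graph in the figure)
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("C", "D")]

# Adjacency list: each node maps to the list of its neighbors
adj_list = {node: [] for node in nodes}
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)

# Adjacency matrix: entry [i][j] is 1 when nodes i and j share an edge
index = {node: i for i, node in enumerate(nodes)}
adj_matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    adj_matrix[index[u]][index[v]] = 1
    adj_matrix[index[v]][index[u]] = 1

print(adj_list)
for row in adj_matrix:
    print(row)
```

Because the graph is undirected, each edge is recorded in both directions, which makes the matrix symmetric.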

In [None]:
# Answer here

**Question** <br>
Now consider the following graph.

<img src="5_18.png" width="240" height="180" align="center"/>
<br>
Which representation (adjacency list or adjacency matrix) is better for this graph? Write down the representation that you think is better for this graph.

In [None]:
# Answer here

# Breadth first search

In class (lecture 2), we worked through an example using the ARPANET graph in which we systematically computed the distance between one node (MIT) and every other node. The search algorithm we used is called breadth-first search. Use this algorithm to calculate the longest distance between two nodes in the following graph.

<img src="bfs.png" width="240" height="180" align="center"/>
<br>
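
For reference, breadth-first search can be sketched in a few lines of Python. This is an illustrative version, not necessarily the exact procedure followed in lecture: the function `bfs_distances` and the graph `toy_graph` are hypothetical, and the graph is not the one in the figure.

```python
from collections import deque

def bfs_distances(adj_list, start):
    """Return a dict mapping each reachable node to its distance from start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj_list[node]:
            if neighbor not in distances:  # the first visit gives the shortest distance
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# Hypothetical graph (a path A-B-C-D plus an edge B-E), not the graph in the figure
toy_graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}

print(bfs_distances(toy_graph, "A"))

# Longest shortest-path distance over all pairs: run BFS from every node
longest = max(max(bfs_distances(toy_graph, s).values()) for s in toy_graph)
print(longest)
```

Starting from a single node only gives distances from that node, so finding the longest distance between any two nodes requires repeating the search from each starting node.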

**Question** Write down the steps of your search in the box below, and enter the longest distance between any pair of nodes as `q17`.

In [None]:
# Begin with A
d1 = make_array(...) # Nodes at distance 1 from A
...

q17 = ...

In [None]:
_ = hw02.grade('q17')

# SUBMIT YOUR ASSIGNMENT

You can rerun all the tests before submitting the homework if you'd like to.

In [None]:
import os
print("Running all tests...")
_ = [hw02.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
print("Finished running all tests.")

In order to submit your assignment, run the next cell.

You can submit as many times as you want (up to the deadline: Monday, Feb 11th, at 9 pm).

In [None]:
_ = hw02.submit()