In [None]:
from IPython.core.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore', FutureWarning)

In [None]:
def css_styling():
    # read the stylesheet and make sure the file is closed afterwards
    with open('../notebook_styles.css', 'r') as f:
        styles = f.read()
    return HTML(styles)
css_styling()

### L&S 88 - Social Networks

# Lab 02: Homophily and personal networks

Let's start by loading the results of our survey, which we started to explore in Lab 01.

In [None]:
from datascience import *

url = "../data/survey/berkeley_survey_clean.csv"
survey = Table.read_table(url)

At the end of the last lab, we made a scatter plot which showed the age of the survey respondents on the x axis and the age of the first reported confidant (`alter1_age`) on the y axis:

In [None]:
survey.scatter('respondent_age', 'alter1_age')

[Discuss as a group]

One distinctive feature of this plot is that many of the confidants have ages that are very similar to the ages of the survey respondents. This is an example of a phenomenon that is found in many different kinds of social networks: people tend to be connected to others who are similar to them. This phenomenon is called **homophily**. Homophily in social networks is not a universal law or a mathematical fact, but it is a striking empirical regularity. We see examples of homophily again and again in real-world social networks.

**Discussion question** Based on your personal experience, would you predict that Berkeley students' confidant networks will be homophilous? Which traits do you think will be most homophilous and which will be least homophilous? (A *trait* here is a measurable characteristic of Berkeley students like age, class year, or home. Of course, there are many traits that we did not measure on our survey.)

[Discuss as a group]

**Discussion question** Now look at the scatter plot again. There is a second cloud of points above the first one.  These points represent alters who are not about the same age as survey respondents; instead, they look like they are about 25 years older than survey respondents. What do you think might explain this?

[Discuss as a group]

## Randomize into pairs

**Question** Write your partner's name

<div class='response'>
[Answer here]
</div>

**Question** What is your partner's favorite food?

<div class='response'>
[Answer here]
</div>

## Converting from wide to long

So far, we have only looked at the relationship between the survey respondents and the first alter that they mentioned. However, we designed our survey so that respondents could tell us about up to five alters. In order to look at *all* of the alters respondents reported about, we're going to have to manipulate the dataset a bit more extensively. This manipulation is a little bit tricky, but we're going to go through how it can be done step by step. Later on in the semester, Data 8 will teach you a more direct way to perform some of these manipulations.

First, let's take a look at the first few rows of the dataset again to remind ourselves of how it is structured:

In [None]:
survey.show()

The dataset is in *wide* format: all of the information reported by a respondent is stored in a single row:

[respondent 1 info] ... [info about respondent 1's first alter] ... [info about respondent 1's second alter] ...   
[respondent 2 info] ... [info about respondent 2's first alter] ... [info about respondent 2's second alter] ...   
...

Our goal is to reshape the dataset so that the information is in *long* format instead:

[respondent 1 info] [info about respondent 1's first alter]  
[respondent 1 info] [info about respondent 1's second alter]  
[respondent 1 info] [info about respondent 1's third alter]  
[respondent 1 info] [info about respondent 1's fourth alter]  
[respondent 1 info] [info about respondent 1's fifth alter]  
[respondent 2 info] [info about respondent 2's first alter]  
[respondent 2 info] [info about respondent 2's second alter]  
...

[discuss wide to long transformations together on the board]
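
As a rough illustration of the reshaping described above, here is a minimal sketch in plain Python. The data, the two-alter limit, and the helper name `to_long` are all made up for this example; they are not part of the survey or of the functions used later in this lab.

```python
# Toy wide-format data: one dict per respondent (values are made up).
wide_rows = [
    {"respondent_id": 1, "alter1_age": 20, "alter2_age": 48},
    {"respondent_id": 2, "alter1_age": 19, "alter2_age": 21},
]

def to_long(rows, var_name, times):
    """Return one (respondent_id, value) pair per respondent-alter slot."""
    long_rows = []
    for idx in range(1, times + 1):      # alter 1 first, then alter 2, ...
        for row in rows:                 # every respondent for that alter slot
            key = "alter" + str(idx) + "_" + var_name
            long_rows.append((row["respondent_id"], row[key]))
    return long_rows

long_rows = to_long(wide_rows, "age", times=2)
for pair in long_rows:
    print(pair)
```

Each printed pair is one row of the long format. Note that this sketch lists every respondent's first alter before moving on to second alters, so the row order differs from the diagram above; the set of respondent-alter pairs is the same either way.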

We're going to start by creating an id variable for our survey responses. This is not strictly necessary, but it will help us double-check our results once we have finished.

In [None]:
## create a respondent id variable
survey['respondent_id'] = range(1, survey.num_rows + 1)

In [None]:
survey.show()

In order to convert the alter information from wide to long format, we're going to use two functions that have been written for you below. Since you have been learning about functions in Data 8, this will be a good chance to practice some of the skills you recently learned.

Let's look at the function below and try to understand how it works.

In [None]:
def repeat_single_col(data, var_name, times=5):
    """Repeats a single column multiple times.
    
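    Parameters
    ----------
    data : Table
        The data table containing the column to repeat.
    var_name : str
        Text that contains the name of the column to repeat.
    times : int
        The number of times to repeat the column (default 5).
    
    Returns
    -------
    np.array
        A single array with the contents of the column repeated `times` times.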
    
    Examples
    --------
    >>> repeat_single_col(Table().with_columns(['respondent_age', [10]]),
                          'respondent_age')
    
    array([10, 10, 10, 10, 10])
    """
    new_col = np.tile(data.column(var_name), times)
    return new_col

The key to understanding `repeat_single_col` is the `np.tile` function. Look that function up and read its help page.

**Question** Use `np.tile` to create an array that contains [1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7].

In [None]:
...

**Question** What is the difference between `np.tile` and `np.repeat`? Do they do the same thing?<BR>
[*Hint: you can use the help page to figure out what `np.repeat` does.*]

In [None]:
...

<div class="response">
[Answer here]
</div>

**[Discuss first function as a group]**

Now look at the next function, and try to understand how it works.

In [None]:
def wide_to_long(data, var_name, times=5):
    """Given columns of alter characteristics, stack them into one long column.
    
    Parameters
    ----------
    data : Table
        The data table containing the alter characteristics
    var_name : str
        Text that contains the variable name; columns of the dataset should
        match the pattern: alter[NUM]_[var_name]
        For example, if var_name is 'age' then this function expects to find
        columns in the survey dataset named 
        'alter1_age', 'alter2_age', 'alter3_age', 'alter4_age', and 'alter5_age'
    times : int
        The number of columns for each characteristic
    
    Returns
    -------
    np.array
        A single array with the contents of all of the columns stacked on top of one another.
    
    Examples
    --------
    >>> wide_to_long(Table().with_columns(['alter1_age', [10, 15],
                                           'alter2_age', [30, 35],
                                           'alter3_age', [20, 15],
                                           'alter4_age', [60, 70],
                                           'alter5_age', [20, 25]]),
                     'age')
    
    array([10, 15, 30, 35, 20, 15, 60, 70, 20, 25])
    """
    new_col = np.concatenate([data.column('alter' + str(idx) + '_' + var_name) for idx in range(1,times+1)])
    return new_col

The key to understanding this second function is `np.concatenate`. Look up the help page for `np.concatenate`.

**Question** Use `np.concatenate` to make a single array that has the contents of `column_one` and `column_two` concatenated together:

In [None]:
concat_table = Table().with_columns(['column_one', [1,3,5,7,9],
                                    'column_two', [2,4,6,8,10]])
## now use np.concatenate to concatenate column_one and column_two together
...

**[Discuss the second function as a group]**

Here's an example of the `wide_to_long` function to give you a sense of how it works.

In [None]:
wide_to_long(Table().with_columns(['alter1_age', [10, 15],
                                   'alter2_age', [30, 35],
                                   'alter3_age', [20, 15],
                                   'alter4_age', [60, 70],
                                   'alter5_age', [20, 25]]),
                     'age')

We'll take a first try at converting the alter data to long format together, and then you'll do a more complete job as an exercise.

For our first step, we'll only keep the respondent's id number and the age of the alters each respondent reported about. Since there is space for 5 alters per respondent, we expect (number of respondents x 5) = (79 x 5) = 395 rows in our resulting dataset.

In [None]:
alter_first_try = Table().with_columns([
    'respondent_id', repeat_single_col(survey, 'respondent_id'),
    'alter_age', wide_to_long(survey, 'age')])


You can see that we want to use `repeat_single_col` for the column that is a respondent characteristic and that we want to use `wide_to_long` for the column that is an alter characteristic.
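
To see why these two operations fit together, here is a small sketch that uses plain NumPy arrays in place of the survey Table. The values are invented and only two alter slots are used; the point is just that tiling the respondent column and stacking the alter columns keeps each alter next to the respondent who reported them.

```python
import numpy as np

# Made-up wide data for two respondents and two alter slots.
respondent_id = np.array([1, 2])
alter1_age = np.array([10, 15])
alter2_age = np.array([30, 35])

# Respondent column: the whole column repeated once per alter slot.
ids_long = np.tile(respondent_id, 2)
# Alter columns: stacked end to end, alter 1 first.
ages_long = np.concatenate([alter1_age, alter2_age])

# Pair up the two long columns row by row.
pairs = list(zip(ids_long.tolist(), ages_long.tolist()))
print(pairs)
```

Respondent 1 is paired with ages 10 and 30, and respondent 2 with ages 15 and 35, which matches the wide data.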

Let's check that the resulting Table, called `alter_first_try`, makes sense:

In [None]:
alter_first_try.num_rows

In [None]:
alter_first_try

**Question** Fill in the missing parts of the statement below in order to double-check that each respondent appears five times in this dataset.

In [None]:
np.all(alter_first_try.group(...).column(...) == ...)

**Question** Now it's your turn! Using the example above as a pattern, create a long dataset that has

* respondent id
* respondent age
* respondent class
* respondent home
* alter age
* alter gender
* alter class
* alter home

Don't forget to perform a couple of checks to be sure the resulting dataset makes sense (like we did above).

In [None]:
alter_data = Table().with_columns([
    'respondent_id', ...,
    'respondent_age', ...,
    'respondent_class', ...,
    'respondent_home', ...,
    'alter_age', ...,
    'alter_gender', ...,
    'alter_class', ...,
    'alter_home', ...,])

In [None]:
alter_data

In [None]:
## double check that each respondent appears 5 times
...

### Ages of Berkeley students' confidants

OK, now that we have created a long-form dataset, let's make use of it to learn about the people Berkeley students discuss important matters with.

**Question** Start by trying to make a histogram of the confidants' ages. (You will get an error -- read the error message and then see the next question.)

In [None]:
...

**Question** Use the `show()` method to browse `alter_data`.

In [None]:
...

You get an error message because some of the rows of the dataset have `nan` or 'not a number' for the value of age. Why? Because not every respondent reported 5 alters. So we need to filter our dataset down to only have data from actual reports about alters.

We will do this by assuming that if `alter_age` is `nan`, then no alter was reported about.

**Question** Use a `where` statement to keep only the rows of `alter_data` that do **not** have an `nan` value for `alter_age`.<BR>
*[HINT 1: this is a little tricky. Check out the `np.isnan` function.]*<BR>
*[HINT 2: you can flip a NumPy array of Boolean values by using `~`. So `~np.array([False, True, False])` gives an array containing `[True, False, True]`.]*

In [None]:
# demonstration of the ~ operator (see HINT 2)
~np.array([False, True, False])
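
The two hints can be combined. The sketch below works on a plain NumPy array of made-up ages rather than on `alter_data`, so it only shows how the Boolean mask is built; applying it to the Table is left for the cell that follows.

```python
import numpy as np

# Made-up ages; np.nan marks an alter slot that was left empty.
ages = np.array([21.0, np.nan, 47.0, np.nan, 19.0])

is_missing = np.isnan(ages)   # True where the value is nan
keep = ~is_missing            # flipped: True where an age was reported
reported_ages = ages[keep]    # keep only the reported ages

print(keep)
print(reported_ages)
```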

In [None]:
...

Check that your newly filtered Table has 347 rows:

In [None]:
...

**Question** Now make a histogram of the ages of the alters. Be sure to choose sensible bins.

In [None]:
...

**Question** Now make a histogram of the survey respondents' ages.  (Please be sure that the bins you use are sensible.)<BR>
*[HINT: think carefully about which dataset you should use here.]*

In [None]:
...

**Question** Compare the two histograms. What does this tell you about homophily among confidants?

<div class='response'>
[Answer here]
</div>

**[Discuss as a group]**

**Question** Now make a scatter plot comparing the ages of survey respondents and the ages of the alters.

In [None]:
...

**Question** What does this scatter plot tell you about homophily among confidants?

<div class='response'>
[Answer here]
</div>

### Class year of Berkeley students' confidants

In this section, we will start to explore the relationship between respondents' class years and their alters' class years. Our approach will be to walk through one example -- the alters reported by freshmen -- in detail. Then, we will write a function to easily allow us to repeat our analysis for sophomores, juniors, and seniors.

**Question** First, let us look at the distribution of class year among all of the confidants reported. First use `group` to make a simple table with the counts of alters by class year.

In [None]:
...

**Question** Now make a bar plot that shows those counts graphically.

In [None]:
...

**Question** Make another bar plot that shows the class years of survey respondents. (Be careful to pick the right dataset to use here.)

In [None]:
...

Now that we have a sense of what all respondents and all of the alters look like, we can dig into the alters of freshmen.

**Question** Create a new Table that only has alters reported by respondents who are freshmen.

In [None]:
alters_of_freshmen = ...

**Question** Make a plot that shows the class years of the alters reported by freshmen.

In [None]:
...

**Question** Make a function called `plot_alter_class` that makes a plot of the class years of alters reported by respondents in a particular class. Your function should take as its arguments

* `data` - the alter dataset
* `class_year` - the class year of respondents to focus on
    
For example, running

    plot_alter_class(alter_data, 'Freshman')

should produce the plot you just made above.

In [None]:
def plot_alter_class(data, class_year):
    ...
    ...

**Question** Use your function to produce plots of the class years of the alters of freshmen, sophomores, juniors, and seniors.

In [None]:
# freshmen
...

In [None]:
# sophomores
...

In [None]:
# juniors
...

In [None]:
# seniors
...

**Question** Do you see evidence of homophily with respect to class year?

<div class='response'>
[Answer here]
</div>

### Home of Berkeley students' confidants

**Question** Following the patterns we used above, look for evidence of homophily with respect to home. Please be sure to:

* start by looking at the distribution of all of the data, before looking at subgroups (like we did above)
* write a function to save having to repeat code over and over
* make plots to visualize your results

In [None]:
# home of all alters
...

In [None]:
# home of all respondents
...

In [None]:
# function
...
...
...

In [None]:
# plot for respondents from Bay Area

# plot for respondents from LA Area

# plot for respondents from Rest of California

# plot for respondents from Rest of United States

# plot for respondents from Rest of World

**Question** Did you find evidence suggestive of homophily with respect to where people are from?

<div class='response'>
[Answer here]
</div>

**Question** We looked at age, class year, and where people are from. Which of these do you think had the most homophily, and which had the least?

<div class='response'>
[Answer here]
</div>

**Challenge Question (optional)** Can you think of a way to try to quantify the extent of homophily in a network? What properties would you want such a metric to have?

<div class='response'>
[Answer here]
</div>

**Question** In order to test the OK infrastructure I've set up, please see if the following test passes. (Don't worry about the output -- please just run this cell so that I can see whether or not it works in your notebook.)

In [None]:
from client.api.notebook import Notebook
ok = Notebook('lab02.ok')                     # change this line to correct .ok file
_ = ok.auth(inline=True)

x = 9182017
_ = ok.grade("q1")

### Submit the lab

You're almost done! Now please create a pdf version of your completed lab by **either**:

* printing your notebook to a pdf file
* going to the Jupyter 'File' menu, choosing 'Download as' and then 'PDF via LaTeX (.pdf)'. 

Please save the resulting .pdf on your computer and then **submit the .pdf on bcourses**.

**The lab must be submitted by the end of the day on Monday, Sep. 19**