### Lab 1 -- part 2
### Homophily and personal networks

In [None]:
# Import all the modules we need
from IPython.core.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore', FutureWarning)

In [None]:
# Load in the ok book
from client.api.notebook import Notebook
lab1 = Notebook('lab1.ok')
_ = lab1.auth(inline=True, force=True)

## 1.1 Converting from wide to long

In the class demo, we looked only at the relationship between the survey respondents and the first alter they mentioned. However, we designed our survey so that respondents could tell us about up to five alters. To look at all of the alters that respondents reported, we need to manipulate the dataset more extensively. This manipulation is a little tricky, but we'll go through it step by step.

First, let's take a look at the first few rows of the dataset again to remind ourselves of how it is structured:

In [None]:
import os
os.getcwd() # shows the current working directory; handy to check before loading a file.
            # we are in the lab1 folder, so we can read the data file directly.

In [None]:
survey = Table.read_table('ucb_personal_networks_clean.csv')

survey.show(6)

The dataset is in *wide* format: all of the information reported by a respondent is stored in a single row:

[respondent 1 info] ... [info about respondent 1's first alter] ... [info about respondent 1's second alter] ...   
[respondent 2 info] ... [info about respondent 2's first alter] ... [info about respondent 2's second alter] ...   
...

Our goal is to reshape the dataset so that the information is in *long* format instead:

[respondent 1 info] [info about respondent 1's first alter]  
[respondent 1 info] [info about respondent 1's second alter]  
[respondent 1 info] [info about respondent 1's third alter]  
[respondent 1 info] [info about respondent 1's fourth alter]  
[respondent 1 info] [info about respondent 1's fifth alter]  
[respondent 2 info] [info about respondent 2's first alter]  
[respondent 2 info] [info about respondent 2's second alter]  
...
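
As a rough illustration of the two layouts, the sketch below reshapes a tiny made-up dataset with plain NumPy. The arrays `respondent_age` and `alter_ages` are hypothetical (two respondents, two alter slots each) and are not drawn from the survey. This version produces rows in exactly the order shown in the long-format schematic above; it is not the approach used by the helper functions later in this lab.

```python
import numpy as np

# Hypothetical wide data: one row per respondent.
respondent_age = np.array([19, 21])
alter_ages = np.array([[20, 18],    # respondent 1's first and second alter
                       [45, 22]])   # respondent 2's first and second alter

n_alters = alter_ages.shape[1]

# Long format: one row per (respondent, alter) pair.
# Each respondent's info is repeated once per alter slot...
long_respondent_age = np.repeat(respondent_age, n_alters)
# ...and the alter values are read off row by row.
long_alter_age = alter_ages.ravel()

for r_age, a_age in zip(long_respondent_age, long_alter_age):
    print(r_age, a_age)
```

The wide data has 2 rows and the long data has 2 x 2 = 4 rows, one for each respondent-alter pair.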

We're going to start by creating an id variable for our survey responses. 

Always remember to double-check the results of your operations.

In [None]:
## create a respondent id variable
survey['respondent_id'] = range(1, survey.num_rows + 1) # using [] is a short-hand approach for .with_column function

In [None]:
survey.show(6)

In order to convert the alter information from wide to long format, we're going to use two functions that have been written for you below.

Let's look at the function below and try to understand how it works.

In [None]:
def repeat_single_col(data, var_name, times=5):
    """Repeats a single column multiple times.
    
    Parameters
    ----------
    data : Table
        The data table containing the column to be repeated.
    var_name : str
        Text that contains the name of the column to repeat.
    times : int
        The number of times column is to be repeated.
    
    Returns
    -------
    np.array
        A single array with the contents of the column repeated `times` times.
    
    Examples
    --------
    >>> repeat_single_col(Table().with_columns(['respondent_age', [10]]),
                          'respondent_age')
    
    array([10, 10, 10, 10, 10])
    """
    new_col = np.tile(data.column(var_name), times)
    return new_col

The key to understanding repeat_single_col is the np.tile function. Look that function up and read its help page.
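
As a small sketch of what `np.tile` does, the example below tiles a made-up array of ids. The array `ids` is hypothetical and only serves to show how the `reps` argument behaves.

```python
import numpy as np

ids = np.array([1, 2, 3])  # hypothetical respondent ids

# An integer `reps` lays copies of the whole array end to end.
tiled = np.tile(ids, 2)
print(tiled)            # [1 2 3 1 2 3]

# A tuple `reps` tiles along several axes: here 2 rows, 1 copy per row.
tiled_2d = np.tile(ids, (2, 1))
print(tiled_2d.shape)   # (2, 3)
```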

**Practice** Use np.tile to create an array that contains [1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7].

In [None]:
#np.tile? # You can remove the # sign and see the 'help' documentation for np.tile
np.tile([1,3,5,7], 4)

**Practice** What is the difference between np.tile and np.repeat? Do they do the same thing?
[Hint: you can use the help page to figure out what np.repeat does.]

In [None]:
#np.repeat?
np.repeat([1,3,5,7], 4)

Now look at the next function, and try to understand how it works.

In [None]:
def wide_to_long(data, var_name, times=5):
    """Given columns of alter characteristics, stack them into one long column.
    
    Parameters
    ----------
    data : Table
        The data table containing the alter characteristics
    var_name : str
        Text that contains the variable name; columns of the dataset should
        match the pattern: alter[NUM]_[var_name]
        For example, if var_name is 'age' then this function expects to find
        columns in the survey dataset named 
        'alter1_age', 'alter2_age', 'alter3_age', 'alter4_age', and 'alter5_age'
    times : int
        The number of columns for each characteristic
    
    Returns
    -------
    np.array
        A single array with the contents of all of the columns stacked on top of one another.
    
    Examples
    --------
    >>> wide_to_long(Table().with_columns(['alter1_age', [10, 15],
                                           'alter2_age', [30, 35],
                                           'alter3_age', [20, 15],
                                           'alter4_age', [60, 70],
                                           'alter5_age', [20, 25]]),
                     'age')
    
    array([10, 15, 30, 35, 20, 15, 60, 70, 20, 25])
    """
    new_col = np.concatenate([data.column('alter' + str(idx) + '_' + var_name) for idx in range(1,times+1)])
    return new_col

The key to understanding this second function is np.concatenate. Look up the help page for np.concatenate.

In [None]:
np.concatenate?

**Practice** Use `np.concatenate` to make a single array that has the contents of `column_one` and `column_two` concatenated together:

In [None]:
concat_table = Table().with_columns(['column_one', [1,3,5,7,9],
                                    'column_two', [2,4,6,8,10]])
concat_table

In [None]:
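# Pass the two columns explicitly, as a list of arrays
np.concatenate([concat_table.column('column_one'), concat_table.column('column_two')])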

Now that we understand what np.concatenate does, let's run the wide_to_long() function upon a simple table to see what it does.

In [None]:
temp_table = Table().with_columns(['alter1_age', [10, 15],
                                   'alter2_age', [30, 35],
                                   'alter3_age', [20, 15],
                                   'alter4_age', [60, 70],
                                   'alter5_age', [20, 25]])
wide_to_long(temp_table,'age')

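You can see that it joins the columns end to end: all of the values from `alter1_age` come first, followed by all of the values from `alter2_age`, and so on.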
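
One way to picture the ordering of the result is as a column-by-column flattening of a grid. The sketch below assumes a hypothetical 2D array `ages` (rows are respondents, columns are alter slots) and checks that stacking its columns gives the same sequence as flattening it in column-major order.

```python
import numpy as np

# Hypothetical alter ages: rows are respondents, columns are alter slots 1-3.
ages = np.array([[10, 30, 20],
                 [15, 35, 15]])

# Stack the columns one after another, in the same spirit as wide_to_long.
stacked = np.concatenate([ages[:, j] for j in range(ages.shape[1])])

# Flattening in column-major ("Fortran") order visits the same values
# in the same sequence.
flattened = ages.ravel(order='F')

print(stacked)                              # [10 15 30 35 20 15]
print(np.array_equal(stacked, flattened))   # True
```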

We'll take a first try at converting the alter data to long format together, and then you'll do a more complete job as an exercise.

For our first step, we'll only keep the respondent's id number and the age of the alters each respondent reported about. Since there is space for 5 alters per respondent, we expect (number of respondents x 5) = (... x 5) = ... rows in our resulting dataset.

Let's construct the table.

In [None]:
alter_first_try = Table().with_columns([
    'respondent_id', repeat_single_col(survey, 'respondent_id'),
    'alter_age', wide_to_long(survey, 'age')])

You can see that we want to use `repeat_single_col` for the column that is a respondent characteristic and that we want to use `wide_to_long` for the column that is an alter characteristic.
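
The reason `repeat_single_col` is built on `np.tile` rather than `np.repeat` is that the respondent rows must line up with the column-by-column order produced by `wide_to_long`. The sketch below uses made-up ids and ages (all values hypothetical) to show that tiling keeps each alter paired with the right respondent, while repeating does not.

```python
import numpy as np

respondent_id = np.array([1, 2])   # hypothetical ids
alter1_age = np.array([10, 15])    # first alter of respondents 1 and 2
alter2_age = np.array([30, 35])    # second alter of respondents 1 and 2

# Column-by-column stacking, as in wide_to_long.
alter_age = np.concatenate([alter1_age, alter2_age])   # [10 15 30 35]

ids_tiled = np.tile(respondent_id, 2)        # [1 2 1 2]
ids_repeated = np.repeat(respondent_id, 2)   # [1 1 2 2]

# The (respondent, alter age) pairs that are actually in the wide data.
true_pairs = {(1, 10), (2, 15), (1, 30), (2, 35)}

tiled_pairs = set(zip(ids_tiled.tolist(), alter_age.tolist()))
repeated_pairs = set(zip(ids_repeated.tolist(), alter_age.tolist()))

print(tiled_pairs == true_pairs)      # True
print(repeated_pairs == true_pairs)   # False
```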

Let's check that the resulting Table, called `alter_first_try`, makes sense.

**Question 1** How many rows do you think alter_first_try should have? Check the number of rows.

In [None]:
q1 = ...
q1

In [None]:
_ = lab1.grade('q1')

**Practice** The following statement allows you to double-check that each respondent appears five times in this dataset.

In [None]:
np.all(alter_first_try.group('respondent_id').column('count') == 5)

**Practice** Now it's your turn! Using the example above as a pattern, create a long dataset that has

* respondent id
* respondent age
* respondent class
* respondent home
* alter age
* alter gender
* alter class
* alter home

Don't forget to perform a couple of checks to be sure the resulting dataset makes sense (like we did above).

**Hint: you need to repeat the information for the respondent and convert from wide to long for the alters' information.**

In [None]:
alter_data = Table().with_columns([
    'respondent_id', repeat_single_col(survey, 'respondent_id'),
    'respondent_age', repeat_single_col(survey, 'respondent_age'),
    'respondent_class', repeat_single_col(survey, 'respondent_class'),
    'respondent_home', repeat_single_col(survey, 'respondent_home'),
    'alter_age', wide_to_long(survey, 'age'),
    'alter_gender', wide_to_long(survey, 'gender'),
    'alter_class', wide_to_long(survey, 'class'),
    'alter_home', wide_to_long(survey, 'home'),])

alter_data

Check the number of rows in the new table, as well as the number of times each respondent appears in it.

In [None]:
nrow = alter_data.num_rows
nrow

In [None]:
# double check that each respondent appears 5 times
np.all(alter_data.group('respondent_id').column('count') == 5)

### Ages of Berkeley students' confidants

OK, now that we have created a long-form dataset, let's make use of it to learn about the people Berkeley students discuss important matters with.

Start by trying to make a histogram of the confidants' ages. (You will get an error -- read the error message and then see the next question.)

In [None]:
alter_data.hist('alter_age')

You get an error message because some of the rows of the dataset have nan or 'not a number' for the value of age. Why? Because not every respondent reported 5 alters. So we need to filter our dataset down to only have data from actual reports about alters.

We will do this by assuming that if alter_age is nan, then no alter was reported about.

**Question 2** Use a where statement to keep only the rows of alter_data that do not have an nan value for alter_age.

[HINT 1: this is a little tricky. Check out the np.isnan function.]
[HINT 2: you can flip a NumPy array of Boolean values by using ~. So ~np.array([False, True, False]) equals array([True, False, True]).]
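
The two hints can be tried out on a toy array first. The sketch below uses a hypothetical array `ages` with two missing values; it shows the mechanics on plain NumPy arrays only, not the `where` call needed for the question.

```python
import numpy as np

ages = np.array([21.0, np.nan, 19.0, np.nan])  # hypothetical ages

is_missing = np.isnan(ages)   # True where the value is nan
is_present = ~is_missing      # flipped: True where a real value exists

# Boolean indexing keeps only the positions marked True.
print(ages[is_present])       # [21. 19.]

# Note: ~ needs a NumPy Boolean array; applied to a plain Python list
# it raises a TypeError.
```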

In [None]:
# Prep
np.isnan?

In [None]:
# First step, get the booleans (trues and falses) for constructing the table
# If it's TRUE for the 1st row, it will be kept in the new table, otherwise it will be filtered out.
boolean = ...(alter_data.column('alter_age')) # fill in the code before the ()

In [None]:
# Second step, construct the new data table
alter_data = ...
alter_data.show() # the filtered alter_data

In [None]:
# Finally, check the number of rows
q2 = ...
q2

In [None]:
_ = lab1.grade('q2')

Now you can make a histogram of the ages of the alters.

In [None]:
alter_data.hist('alter_age')

**Question 3** Now make a histogram of the survey respondents' ages.

Look at the `bins` argument in the cell below to see how to choose reasonable bins. You can also try removing it and plotting with the same call we used for the alters' ages to see what changes.
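
To see what the bin edges from `np.arange(15, 30, 1)` look like, the sketch below counts a made-up set of ages with `np.histogram`. The `ages` array is hypothetical, and `np.histogram` is used here only as a stand-in for the plotting call; plotting functions generally treat bin edges the same way, but that is an assumption rather than something checked against `Table.hist`.

```python
import numpy as np

bins = np.arange(15, 30, 1)   # edges 15, 16, ..., 29 (the stop value 30 is excluded)
ages = np.array([18, 18, 19, 20, 20, 20, 22, 35])  # hypothetical ages

counts, edges = np.histogram(ages, bins=bins)

print(len(edges) - 1)   # 14 bins, each one year wide
print(counts.sum())     # 7: the age 35 lies outside the edges and is not counted
```

Fixed one-year bins make the respondent histogram directly comparable to other age histograms, at the cost of silently dropping values outside the chosen range.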

In [None]:
# Clean the nan values like you did for alter_data (try to combine the 2 steps in one line)
survey = ...
survey.select('respondent_age').hist(bins=np.arange(15, 30, 1))

In [None]:
# Check the number of rows of the cleaned "survey"
q3 = ...

In [None]:
_ = lab1.grade('q3')

**Question** Compare the two histograms. What does this tell you about homophily among confidants?

<div class='response'>
[Answer here]
</div>

Now you can make a scatter plot comparing the ages of survey respondents and the ages of the alters.

In [None]:
alter_data.scatter('respondent_age', 'alter_age')

**Question** What does this scatter plot tell you about homophily among confidants?

<div class='response'>
[Answer here]
</div>

We are able to get a lot of descriptive information from these two datasets. Here are two practice examples.
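
Both practice examples rest on simple array summaries. The sketch below shows the same ideas on made-up arrays (`ages` and `homes` are hypothetical), including how a leftover nan would affect `np.max` if the data had not been filtered.

```python
import numpy as np

ages = np.array([19.0, 52.0, np.nan, 23.0])                        # hypothetical alter ages
homes = np.array(['Bay Area', 'Other', 'Bay Area', 'Bay Area'])    # hypothetical homes

# np.max and np.min return nan if any value is nan;
# the nan-aware versions skip missing values.
print(np.max(ages))                       # nan
print(np.nanmax(ages), np.nanmin(ages))   # 52.0 19.0

# True counts as 1 and False as 0, so the mean of a Boolean array
# is the proportion of True values.
from_bay = homes == 'Bay Area'
print(np.mean(from_bay))                  # 0.75
```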

**Practice** What is the range of the alters' ages?

In [None]:
oldest = np.max(alter_data.column('alter_age'))
youngest = np.min(alter_data.column('alter_age'))

print('oldest age:', oldest)
print('youngest age:', youngest)

**Practice** What proportion of all the alters are from the Bay Area?

In [None]:
# First, create a variable, alter_bay, which is True if the alter is from the Bay Area and False otherwise.
alter_bay = alter_data.column('alter_home') == 'Bay Area' # a double equals sign (==) is a comparison; the result is True (equal to) or False (not equal to)

In [None]:
# Second, calculate the proportion of rows for which alter_bay is True.
alter_bay_proportion = np.mean(alter_bay)
alter_bay_proportion

### Class year of Berkeley students' confidants

In this section, we will start to explore the relationship between respondents' class years and their alters' class years. Our approach will be to walk through one example -- the alters reported by sophomores -- in detail. Then, we will write a function that lets us easily repeat our analysis for freshmen, juniors, and seniors.

First, let us look at the distribution of class year among all of the confidants reported. First use `group` to make a simple table with the counts of alters by class year.
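
Conceptually, a grouped count lists each distinct value together with how many times it appears. As a rough NumPy analogue (an illustration only, not how `group` is implemented), the sketch below counts a hypothetical array of class years with `np.unique`.

```python
import numpy as np

# Hypothetical class years for five alters.
alter_class = np.array(['Junior', 'Senior', 'Junior', 'Freshman', 'Junior'])

# return_counts=True gives the sorted distinct values and their frequencies.
labels, counts = np.unique(alter_class, return_counts=True)

for label, count in zip(labels, counts):
    print(label, count)
```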

In [None]:
alter_data.group('alter_class')

**Practice** Now make a bar plot that shows those counts graphically.

In [None]:
# First we sort this table by counts of each group
alter_data.group('alter_class').sort('count', descending=True)

In [None]:
# Adding the plotting method .barh(the column whose categories label the bars) creates a bar plot.
alter_data.group('alter_class').sort('count', descending=True).barh('alter_class')

**Practice** Make another bar plot that shows the class years of survey respondents.

In [None]:
survey.group('respondent_class').sort('count', descending=True).barh('respondent_class')

Now that we have a sense of what all respondents and all of the alters look like, we can dig into the alters of a particular class group.

**Question 4** Create a new table that only has alters reported by respondents who are sophomores using `where` and `are.equal_to`.

In [None]:
alters_of_sophomores = ...

q4 = alters_of_sophomores.num_rows
q4

In [None]:
_ = lab1.grade('q4')

Make a plot that shows the class years of the alters reported by sophomores.

In [None]:
alters_of_sophomores.group('alter_class').barh('alter_class')

**Practice** Make a function called `plot_alter_class` that makes a plot of the class years of alters reported by respondents in a particular class. Your function should take as its arguments

* `data` - the alter dataset
* `class_year` - the class year of respondents to focus on
    
For example, running

    plot_alter_class(alter_data, 'Sophomore')

should produce the plot you just made above.

In [None]:
def plot_alter_class(data, class_year):
    to_plot = data.where('respondent_class', are.equal_to(class_year)) #create the dataset for plotting
    to_plot.group('alter_class').barh('alter_class')

**Practice** Use your function to produce plots of the class years of the alters of freshmen, sophomores, juniors, and seniors.

In [None]:
# freshmen
plot_alter_class(alter_data, 'Freshman')

In [None]:
# sophomores
plot_alter_class(alter_data, 'Sophomore')

In [None]:
# juniors
plot_alter_class(alter_data, 'Junior')

In [None]:
# seniors
plot_alter_class(alter_data, 'Senior')

**Question** Do you see evidence of homophily with respect to class year?

<div class='response'>
[Answer here]
</div>

### Rerun the tests and submit your lab

In [None]:
import os
print("Running all tests...")
_ = [lab1.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
print("Finished running all tests.")

In order to submit your assignment, run the next cell.

You can submit as many times as you want (up to the deadline: Feb 8th, Friday 9pm).

In [None]:
_ = lab1.submit()