In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab2_personal_networks.ipynb")

In [None]:
# Import all the modules we need
from IPython.core.display import HTML
from datascience import *

import os
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore', FutureWarning)

## Lab 2: Review of survey results
### Load the survey responses

In [None]:
os.getcwd() # shows the current working directory; it's worth checking before you load a file.
            # We are in the lab2 folder, so we can load the data by file name alone.

In [None]:
survey = Table.read_table('ucb_fa2023_personal_networks_rev.csv')

How many responses are there?

In [None]:
num_rows = survey.num_rows
num_rows

### Who responded to the survey?

Look at the age distribution of respondents

In [None]:
survey.select('respondent_age').hist()

Look at the gender distribution

In [None]:
survey.group('respondent_gender').barh('respondent_gender')

Look at the class distribution

In [None]:
survey.group('respondent_class').barh('respondent_class')

### Relationship between respondent and first alter named: gender

In [None]:
#pd.crosstab(survey['respondent_gender'], survey['alter1_gender']) # this would show raw counts
pd.crosstab(survey['respondent_gender'], survey['alter1_gender'], normalize='index')
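To see what `normalize='index'` does, here is a small sketch on made-up data (the two Series below are hypothetical, not survey values). Each row of the normalized table is divided by its row total, so every row sums to 1.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey.
respondent = pd.Series(['Female', 'Female', 'Female', 'Male', 'Male'],
                       name='respondent_gender')
alter = pd.Series(['Female', 'Female', 'Male', 'Male', 'Female'],
                  name='alter1_gender')

raw = pd.crosstab(respondent, alter)                           # raw counts
row_share = pd.crosstab(respondent, alter, normalize='index')  # each row divided by its total

print(raw)
print(row_share)
```

Reading across a row of `row_share` answers the question "among respondents of this gender, what share named an alter of each gender?"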

In [None]:
pd.crosstab(survey['respondent_gender'], survey['alter1_gender']) # this shows the raw counts

In [None]:
obs_frac_nonhom = (23 + 43) / (93 + 23 + 43 + 48)
obs_frac_nonhom
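The numbers in the cell above are typed in by hand from the raw-count table. Here is a minimal sketch of the same arithmetic, assuming the 2×2 table is stored as a NumPy array with the same-gender counts (93 and 48) on the diagonal and the cross-gender counts (23 and 43) off it. Which gender each row stands for does not matter for this calculation. The name `counts` is hypothetical.

```python
import numpy as np

# Counts copied from the cell above; the row/column layout is an assumption.
counts = np.array([[93, 23],
                   [43, 48]])

same_gender = np.trace(counts)   # diagonal entries: 93 + 48
total = counts.sum()             # all dyads
frac_cross = (total - same_gender) / total
print(frac_cross)
```

Computing the fraction from the table rather than retyping the numbers avoids copy errors if the data file changes.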

In [None]:
# Berkeley undergrad gender breakdown source: 
# https://opa.berkeley.edu/campus-data/uc-berkeley-quick-facts
# (based on Fall 2022 undergraduate enrollment)
prop_male = 14183 / (14183 + 17808)
prop_female = 1 - prop_male

rand_expected_mf = 2 * prop_male * prop_female
rand_expected_mf
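The factor of 2 appears because a cross-gender dyad can arise in two ways: a male respondent naming a female alter (probability `prop_male * prop_female`) or a female respondent naming a male alter (the same product in the other order). Below is a rough simulation sketch, assuming respondents and alters are drawn independently from the enrollment proportions. The seed and sample size are arbitrary choices.

```python
import numpy as np

prop_male = 14183 / (14183 + 17808)
prop_female = 1 - prop_male
expected = 2 * prop_male * prop_female

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
n = 200_000                      # arbitrary number of simulated dyads
resp_is_male = rng.random(n) < prop_male
alter_is_male = rng.random(n) < prop_male

# A dyad is cross-gender when the two draws disagree.
simulated = np.mean(resp_is_male != alter_is_male)
print(expected, simulated)
```

The two printed values should be close, which is a quick sanity check on the formula.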

In [None]:
permuted_alter_gender = survey.select('alter1_gender').sample(num_rows, with_replacement=False) # NB: num_rows is the number of rows in our dataset; with_replacement=False shuffles the column instead of resampling it
permuted_dyads = Table().with_columns(
    'respondent_gender', survey.column('respondent_gender'),
    'alter1_gender', permuted_alter_gender.column(0))
permuted_dyads

Let's write a function to help calculate the fraction of dyads that go from male to female or from female to male.

In [None]:
def frac_mf_dyads(permuted_df):
    """
    Calculate the fraction of dyads that are male to female OR female to male
    """
    # Count dyads by (respondent gender, alter gender) using the table passed in,
    # not the global permuted_dyads
    counts = permuted_df.group(['respondent_gender', 'alter1_gender'])
    counts_mf = counts.where('respondent_gender', 'Male').where('alter1_gender', 'Female')
    mf = counts_mf.column('count').item(0)
    counts_fm = counts.where('respondent_gender', 'Female').where('alter1_gender', 'Male')
    fm = counts_fm.column('count').item(0)

    return (mf + fm) / permuted_df.num_rows

#permuted_frac_mf = permuted_dyads.where()
frac_mf_dyads(permuted_dyads)
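An alternative way to count cross-gender dyads is to compare the two columns element by element. Here is a minimal sketch on made-up arrays, assuming every value is either 'Male' or 'Female'. With other categories or missing values, `!=` would count those pairs too, so they would need to be filtered out first. The helper name `frac_cross_gender` is hypothetical.

```python
import numpy as np

def frac_cross_gender(respondent_gender, alter_gender):
    """Fraction of dyads whose two genders differ.

    Assumes exactly two categories and no missing values.
    """
    respondent_gender = np.asarray(respondent_gender)
    alter_gender = np.asarray(alter_gender)
    return np.mean(respondent_gender != alter_gender)

# Hypothetical toy dyads.
resp = ['Male', 'Male', 'Female', 'Female']
alt = ['Female', 'Male', 'Male', 'Female']
print(frac_cross_gender(resp, alt))  # 2 of the 4 pairs differ
```

This version does not depend on every gender combination appearing at least once in the data.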

Now take many resamples and calculate the fraction of cross-gender edges for each one

In [None]:
nonhom_fracs = make_array()

for _ in np.arange(10000):
    permuted_alter_gender = survey.select('alter1_gender').sample(num_rows, with_replacement=False) # NB: with_replacement=False shuffles the column instead of resampling it
    permuted_dyads = Table().with_columns(
        'respondent_gender', survey.column('respondent_gender'),
        'alter1_gender', permuted_alter_gender.column(0))
    nonhom_fracs = np.append(nonhom_fracs, frac_mf_dyads(permuted_dyads))
null_fracs = Table().with_column('frac_dyads_nonhom', nonhom_fracs)

Let's add a plot showing where our observed value is, so that we can easily compare the observed value to the null distribution.

In [None]:
null_fracs.hist('frac_dyads_nonhom')
#plt.scatter(obs_frac_nonhom,0,c='red',s=80);
plt.axvline(x=obs_frac_nonhom,c='red',linewidth=2);
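One way to summarize the comparison in a single number is the share of simulated values that fall at or below the observed one. Below is a sketch on made-up data, assuming shuffling without replacement and a one-sided comparison (homophily predicts fewer cross-gender dyads than chance). The arrays, seed, and number of shuffles are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Hypothetical toy dyads in which most pairs share a gender.
resp = np.array(['M'] * 40 + ['F'] * 60)
alt = np.array(['M'] * 35 + ['F'] * 5 + ['F'] * 50 + ['M'] * 10)

observed = np.mean(resp != alt)

n_shuffles = 2000
null = np.empty(n_shuffles)
for i in range(n_shuffles):
    shuffled = rng.permutation(alt)          # shuffle alters, keep respondents fixed
    null[i] = np.mean(resp != shuffled)

share_at_or_below = np.mean(null <= observed)
print(observed, null.mean(), share_at_or_below)
```

A share near zero means that shuffled data almost never produce as few cross-gender dyads as were observed.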

## Lab 2 (the part you will turn in): Homophily and personal networks

###  Converting from wide to long

So far, we have only looked at the relationship between the survey respondents and the first alter that they mentioned. However, we designed our survey so that respondents could tell us about up to five alters. In order to look at all of the alters respondents reported about, we're going to have to manipulate the dataset a bit more extensively. This manipulation is a little bit tricky, but we're going to go through how it can be done step by step.

First, let's take a look at the first few rows of the dataset again to remind ourselves of how it is structured:

In [None]:
survey.show(6)

The dataset is in *wide* format: all of the information reported by a respondent is stored in a single row:

[respondent 1 info] ... [info about respondent 1's first alter] ... [info about respondent 1's second alter] ...   
[respondent 2 info] ... [info about respondent 2's first alter] ... [info about respondent 2's second alter] ...   
...

Our goal is to reshape the dataset so that the information is in *long* format instead:

[respondent 1 info] [info about respondent 1's first alter]  
[respondent 1 info] [info about respondent 1's second alter]  
[respondent 1 info] [info about respondent 1's third alter]  
[respondent 1 info] [info about respondent 1's fourth alter]  
[respondent 1 info] [info about respondent 1's fifth alter]  
[respondent 2 info] [info about respondent 2's first alter]  
[respondent 2 info] [info about respondent 2's second alter]  
...

We're going to start by creating an id variable for our survey responses. 

Always remember to double-check the results of your operations.

In [None]:
## create a respondent id variable
survey['respondent_id'] = range(1, survey.num_rows + 1) # using [] is shorthand for the .with_column method

In [None]:
survey.show(6)

In order to convert the alter information from wide to long format, we're going to use two functions that have been written for you below.

Let's look at the function below and try to understand how it works.

P.S. The portion highlighted in red is called a docstring. It describes what the function does and gives examples. The code following `>>>` can be run as written, and the line after it shows what that function call should return.

In [None]:
def repeat_single_col(data, var_name, times=5):
    """Repeats a single column multiple times.
    
    Parameters
    ----------
    data : Table
        The data table containing the column to be repeated.
    var_name : str
        Text that contains the name of the column to repeat.
    times : int
        The number of times the column is to be repeated.
    
    Returns
    -------
    np.array
        A single array with the contents of the column repeated `times` times.
    
    Examples
    --------
    >>> repeat_single_col(Table().with_columns(['respondent_age', [10]]), 'respondent_age')
    
    array([10, 10, 10, 10, 10])
    """
    new_col = np.tile(data.column(var_name), times)
    return new_col

The key to understanding repeat_single_col is the `np.tile` function. Look that function up and read its help page.

**Practice** Use `np.tile` to create an array that contains [1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7].

In [None]:
#np.tile? # You can remove the # sign and see the 'help' documentation for np.tile
practice = [1, 3, 5, 7]
practice_array = np.tile(practice, 4)
practice_array

**Practice** What is the difference between `np.tile` and `np.repeat`? Do they do the same thing? What happens if you pass the same arguments to `np.repeat` as to `np.tile`?

[Hint: you can use the help page to figure out what np.repeat does.]

In [None]:
#np.repeat? # You can remove the # sign and see the 'help' documentation for np.repeat
practice2 = [1, 3, 5, 7]
practice2_array = np.repeat(practice2, 4)
practice2_array
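Running both functions on the same input makes the difference visible: `np.tile` repeats the whole sequence, while `np.repeat` repeats each element in place.

```python
import numpy as np

values = [1, 3, 5]

tiled = np.tile(values, 2)       # whole sequence, twice: 1 3 5 1 3 5
repeated = np.repeat(values, 2)  # each element, twice:   1 1 3 3 5 5

print(tiled)
print(repeated)
```

Both results contain the same values the same number of times; only the order differs.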

Now look at the next function, and try to understand how it works.

In [None]:
def wide_to_long(data, var_name, times=5):
    """Given columns of alter characteristics, stack them into one long column.
    
    Parameters
    ----------
    data : Table
        The data table containing the alter characteristics
    var_name : str
        Text that contains the variable name; columns of the dataset should
        match the pattern: alter[NUM]_[var_name]
        For example, if var_name is 'age' then this function expects to find
        columns in the survey dataset named 
        'alter1_age', 'alter2_age', 'alter3_age', 'alter4_age', and 'alter5_age'
    times : int
        The number of columns for each characteristic
    
    Returns
    -------
    np.array
        A single array with the contents of all of the columns stacked on top of one another.
    
    Examples
    --------
    >>> wide_to_long(Table().with_columns(['alter1_age', [10, 15],
                                           'alter2_age', [30, 35],
                                           'alter3_age', [20, 15],
                                           'alter4_age', [60, 70],
                                           'alter5_age', [20, 25]]),
                     'age')
    
    array([10, 15, 30, 35, 20, 15, 60, 70, 20, 25])
    """
    new_col = np.concatenate([data.column('alter' + str(idx) + '_' + var_name) for idx in range(1,times+1)])
    return new_col

The key to understanding this second function is np.concatenate. Look up the help page for np.concatenate.

In [None]:
np.concatenate?

**Practice** Use `np.concatenate` to make a single array that has the contents of `column_one` and `column_two` concatenated together:

In [None]:
concat_table = Table().with_columns(['column_one', [1,3,5,7,9],
                                    'column_two', [2,4,6,8,10]])
concat_table

In [None]:
#to make a single array with the contents of column_one and column_two
column_one = concat_table.column('column_one') 
column_two = concat_table.column('column_two')
concat_array = np.concatenate((column_one, column_two))
concat_array

Now that we understand what `np.concatenate` does, let's run the wide_to_long() function on a simple table to see what it does.

In [None]:
temp_table = Table().with_columns(['alter1_age', [10, 15],
                                   'alter2_age', [30, 35],
                                   'alter3_age', [20, 15],
                                   'alter4_age', [60, 70],
                                   'alter5_age', [20, 25]])
wide_to_long(temp_table,'age')

You can see that it stacks the columns end to end: all of `alter1_age`, followed by all of `alter2_age`, and so on.

We'll take a first try at converting the alter data to long format together, and then you'll do a more complete job as an exercise.

For our first step, we'll only keep the respondent's id number and the age of the alters each respondent reported about. Since there is space for 5 alters per respondent, we expect the resulting table to have (number of respondents x 5) rows.

Let's construct the table.

In [None]:
alter_first_try = Table().with_columns([
    'respondent_id', repeat_single_col(survey, 'respondent_id'),
    'alter_age', wide_to_long(survey, 'age')])
alter_first_try

You can see that we want to use `repeat_single_col` for the column that is a respondent characteristic and that we want to use `wide_to_long` for the column that is an alter characteristic.
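The two helpers fit together because `np.tile` repeats the respondent column in the same order in which `np.concatenate` stacks the alter columns, so each position in the long arrays still pairs an alter with the respondent who named them. Here is a toy sketch with two hypothetical respondents and two alters each, using plain NumPy arrays in place of a Table:

```python
import numpy as np

respondent_id = np.array([1, 2])
alter1_age = np.array([10, 15])   # first alter of respondents 1 and 2
alter2_age = np.array([30, 35])   # second alter of respondents 1 and 2

ids_long = np.tile(respondent_id, 2)                   # 1 2 1 2
ages_long = np.concatenate([alter1_age, alter2_age])   # 10 15 30 35

rows = list(zip(ids_long.tolist(), ages_long.tolist()))
print(rows)  # [(1, 10), (2, 15), (1, 30), (2, 35)]
```

With `np.repeat` the ids would come out as 1 1 2 2 and would no longer line up with the stacked ages.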

Let's check that the resulting Table, called `alter_first_try`, makes sense.



## Question 1: 
How many rows do you think alter_first_try should have? Check the number of rows.

In [None]:
q1 = ...
q1

In [None]:
grader.check("q1")

The following statement allows you to double-check that each respondent appears five times in this dataset. If the cell does not output True, you have a problem.

In [None]:
np.all(alter_first_try.group('respondent_id').column('count') == 5)
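The same kind of check can be written with NumPy alone. Here is a sketch on made-up ids, assuming each id should appear exactly `times` times. `np.unique` with `return_counts=True` returns each distinct id alongside how often it occurs.

```python
import numpy as np

times = 5
ids_long = np.tile(np.arange(1, 4), times)   # hypothetical ids 1..3, each repeated 5 times

unique_ids, counts = np.unique(ids_long, return_counts=True)
print(unique_ids, counts)
print(np.all(counts == times))
```

If any id were missing or duplicated, at least one entry of `counts` would differ from `times` and the last line would print `False`.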

## Question 1.5: Full Alters Table

Now it's your turn! Using the example above as a pattern, create a long dataset that has

* respondent id
* respondent age
* respondent class
* respondent home
* alter age
* alter gender
* alter class
* alter home

Don't forget to perform a couple of checks to be sure the resulting dataset makes sense (like we did above).

Hint: you need to **repeat** the information for the respondent and **convert from wide to long** for the alters' information.

In [None]:

alter_data = Table().with_columns([
    'respondent_id', repeat_single_col(survey, 'respondent_id'),
    'respondent_age', ...,
    'respondent_class', ...,
    'respondent_home', ...,
    'alter_age', wide_to_long(survey, 'age'),
    'alter_gender', ...,
    'alter_class', ...,
    'alter_home', ...,])

alter_data

In [None]:
grader.check("q1.5")

<!-- BEGIN QUESTION -->

## Question 2: Ages of Berkeley Students' Confidants

OK, now that we have created a long-form dataset, let's make use of it to learn about the people Berkeley students discuss important matters with.

Start by making a histogram of the confidants' ages. Please use the following bins: `np.arange(15, 70, 5)`

<!-- END QUESTION -->
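It can help to look at the bin edges from Question 2 before plotting. `np.arange(15, 70, 5)` stops before 70, so the edges run from 15 to 65 in steps of 5, giving ten bins. Below is a sketch with `np.histogram` on made-up ages (not survey data). Note that `np.histogram` ignores values outside the outermost edges.

```python
import numpy as np

edges = np.arange(15, 70, 5)
print(edges)  # 15, 20, ..., 65

# Hypothetical ages; 72 lies beyond the last edge and is not counted.
toy_ages = np.array([18, 19, 20, 21, 22, 45, 52, 64, 72])

counts, used_edges = np.histogram(toy_ages, bins=edges)
print(counts)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.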

<!-- BEGIN QUESTION -->

## Question 3: Respondents' Ages
Now make a histogram of the survey respondents' ages.

Please take a look at the `bins` argument of `.hist()` if you haven't already, and use it to make bins like `np.arange(15, 35, 1)`. Try removing it and plotting with the same line of code we used for the alters' ages to see what happens. <BR>

HINT: Make sure you use the `survey` table, not the `alter_data` table. If you're unsure why, ask a student nearby or the GSI.

In [None]:
... # replace this line with code to make a histogram of respondent's ages

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Question 4: Compare histograms
Compare the two histograms. What does this tell you about homophily among confidants?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Practice:** Now you can make a scatter plot comparing the ages of survey respondents and the ages of the alters.

In [None]:
alter_data.scatter(..., ...)

## Question 5
What does this scatter plot tell you about homophily among confidants?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

We are able to get a lot of descriptive information from these two datasets. Here is a practice example.

**Practice** What proportion of all the alters are from the Bay Area?

In [None]:
# First, create a variable, alter_bay, which is True if the alter is from the Bay Area
# and False otherwise.
alter_bay = alter_data.column(...) == 'Bay Area' # the double equals sign == makes a comparison; the result is True (equal to) or False (not equal to)

In [None]:
# Second, calculate the proportion of rows for which alter_bay is True.
alter_bay_proportion = ...
alter_bay_proportion
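Comparing an array to a value yields an array of True/False values, and because True counts as 1 and False as 0, the mean of that array is the proportion of True values. Here is a toy sketch with made-up home regions (the labels other than 'Bay Area' are hypothetical):

```python
import numpy as np

# Hypothetical values, not survey data.
home = np.array(['Bay Area', 'SoCal', 'Bay Area', 'Other', 'Bay Area'])
is_bay = home == 'Bay Area'

print(is_bay)
print(np.mean(is_bay))                          # 0.6
print(np.count_nonzero(is_bay) / len(is_bay))   # same proportion, computed by counting
```

Both approaches give the same number; the mean is simply shorter to write.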

### Class year of Berkeley students' confidants

In this section, we will start to explore the relationship between respondents' class years and their alters' class years. Our approach will be to walk through one example -- the alters reported by sophomores -- in detail. Then, we will write a function to easily allow us to repeat our analysis for sophomores, juniors, and seniors.

First, let us look at the distribution of class year among all of the confidants reported. First use `group` to make a simple table with the counts of alters by class year.

In [None]:
alter_data.group('alter_class')

**Practice** Now make a bar plot that shows those counts graphically.

In [None]:
# First we sort this table by counts of each group
alter_data.group('alter_class').sort('count', descending=True)

In [None]:
# Adding .barh(...) with the column of category labels turns the sorted table into a bar plot.
alter_data.group('alter_class').sort('count', descending=True).barh('alter_class')

**Practice** Make another bar plot that shows the class years of survey respondents.

In [None]:
...

Now that we have a sense of what all respondents and all of the alters look like, we can dig into the alters of a particular class group.

## Question 6: 
Create a new table that only has alters reported by respondents who are sophomores using `where` and `are.equal_to`.

In [None]:
alters_of_sophomores = ...
q6 = alters_of_sophomores.num_rows
q6

In [None]:
grader.check("q6")

Make a plot that shows the class years of the alters reported by sophomores using `.barh()`.

In [None]:
...

**Practice** Make a function called `plot_alter_class` that makes a plot of the class years of alters reported by respondents in a particular class. Your function should take as its arguments

* `data` - the alter dataset
* `class_year` - the class year of respondents to focus on
    
For example, running

    plot_alter_class(alter_data, 'Sophomore')

should produce the plot you just made above.

In [None]:
def plot_alter_class(data, class_year):
    to_plot = data.where(...) #create the dataset for plotting
    to_plot.group(...).barh(...)

**Practice** Use your function to produce plots of the class years of the alters of freshmen, sophomores, juniors, and seniors.

In [None]:
# freshmen
plot_alter_class(alter_data, 'Freshman')

In [None]:
# sophomores
plot_alter_class(alter_data, 'Sophomore')

In [None]:
# juniors
plot_alter_class(alter_data, 'Junior')

In [None]:
# seniors
plot_alter_class(alter_data, 'Senior')

<!-- BEGIN QUESTION -->

## Question 7
Do you see evidence of homophily with respect to class year?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Please upload the .zip file to Gradescope.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)