In [None]:
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.style.use('fivethirtyeight')

# Lecture: Personal networks of Berkeley students

### Load the survey responses

In [None]:
url = "../../data/survey/ucb_personal_networks_clean.csv"

survey = Table.read_table(url)
survey

How many responses are there?

In [None]:
num_responses = ...
num_responses

### Who responded to the survey?

Look at the age distribution of respondents

In [None]:
...

Look at the gender distribution

In [None]:
...

Look at the class year

In [None]:
...

Here's a different way to plot the same data, specifying the order of the bars.

In [None]:
class_order = ['Freshman', 'Sophomore', 'Junior', 'Senior', 'Other']
(survey.group('respondent_class')
       .to_df()
       .set_index('respondent_class')
       .loc[class_order]  # reorder the rows to match class_order
       .plot.barh())

# Core discussion networks of Berkeley students

About how many confidants, on average, do Americans have?

[?]

Do you think that Berkeley students will have confidant networks that are the same size? Smaller? Larger?

[DISCUSS]

OK, let's see!

In [None]:
... # distribution of number of reported confidants

It can be easier to make sense of this kind of information with a plot:

In [None]:
...

It looks pretty clear that the average is higher than 3.

In [None]:
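def recode_number_alters(na):
    """
    Recode the raw (string) number of alters as an integer, top-coded at 6
    """
    if na in ['0', '1', '2', '3', '4', '5']:
        return int(na)
    elif na in ['6', '6+']:
        # '6' and '6+' are both stored as 6
        return 6
    # any other value falls through and returns None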
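
A quick way to see what the recode does is to run it on a handful of raw answers. The sketch below repeats the same logic and applies it to a hypothetical list of responses (these strings are made up for illustration, not taken from the survey). Note that `'6'` and `'6+'` both end up as 6, and anything outside the listed values comes back as `None`.

```python
def recode_number_alters(na):
    # same logic as the lecture's recode: the top category '6+' is stored as 6
    if na in ['0', '1', '2', '3', '4', '5']:
        return int(na)
    elif na in ['6', '6+']:
        return 6
    return None  # anything else (e.g. a blank answer) is treated as missing

# Hypothetical raw responses, not taken from the survey
raw_answers = ['0', '3', '6', '6+', '']
recoded = [recode_number_alters(a) for a in raw_answers]
print(recoded)
```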

In [None]:
survey['number_alters_recoded'] = ... # apply the recode function

In [None]:
... # make a table of the recoded values

In [None]:
... # calculate the average

Note that this estimate is, if anything, low. Why?

[?]

We won't do this now, but an alternate approach would be to fit some kind of parametric distribution to the data we observe, and to use the inferred mean of that distribution as an estimate.  This would be an interesting extension to pursue.
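
To give a flavor of what such an extension might look like, here is a minimal sketch that fits a Poisson distribution by maximum likelihood while treating the top category as "6 or more". The Poisson is chosen only because it is simple, not because it is known to fit network sizes well, and the tallies in `toy_counts` are hypothetical rather than the survey results. The helper name `censored_poisson_loglik` and the grid search are likewise just illustrative choices.

```python
import math

def censored_poisson_loglik(lam, counts, cap=6):
    """Log-likelihood of Poisson(lam) when any value >= cap is recorded as cap."""
    # probability of observing a value strictly below the cap
    p_below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(cap))
    ll = 0.0
    for value, n in counts.items():
        if value < cap:
            # exact Poisson log-probability for uncensored values
            ll += n * (-lam + value * math.log(lam) - math.lgamma(value + 1))
        else:
            # censored category: all we know is that the true value is >= cap
            ll += n * math.log(1.0 - p_below)
    return ll

# Hypothetical tallies {recoded value: number of respondents}, for illustration only
toy_counts = {0: 2, 1: 5, 2: 10, 3: 15, 4: 12, 5: 8, 6: 8}

# crude grid search over candidate means from 0.50 to 10.00
grid = [i / 100 for i in range(50, 1001)]
lam_hat = max(grid, key=lambda lam: censored_poisson_loglik(lam, toy_counts))

# the plain average of the top-coded values, for comparison
capped_mean = sum(k * n for k, n in toy_counts.items()) / sum(toy_counts.values())
print(capped_mean, lam_hat)
```

On these toy tallies the fitted mean comes out above the plain average of the top-coded values, which is the direction we would expect if some of the 6s really stand for larger networks.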

### Is this a meaningful difference?

We see a difference in the point estimate between the average American discussion network and the network among Berkeley students. But, of course, this difference comes from a sample of Berkeley students. If we talked to a different set of Berkeley students, we could get a different answer. How do we know if what we observe in our sample is different enough from 3 to conclude that the networks of Berkeley students are in fact bigger?

We can actually estimate the sampling variation from the data we collected under the assumption that we have a random sample of Berkeley students. 

(This is a stretch, of course - we didn't actually take a random sample. What difference might we expect between the people in our dataset and a randomly selected set of Berkeley students?)

In order to estimate the sampling variation, we'll use an approach called resampling or the bootstrap.

We can take one resample of our survey like this:

In [None]:
resampled_survey = ...

To estimate the sampling variation, we need many resamples; we'll stick the resampling in a loop:

In [None]:
resampled_number_alters = make_array()

for _ in np.arange(10000):
    # NB: num_responses rows in our dataset
    resampled_survey = ... # resample the survey
    resampled_number_alters = np.append(resampled_number_alters, 
                                        ...) # calculate the mean and add it to our list of means
resampled_net_size = Table().with_column('number_alters_recoded', resampled_number_alters)

Let's look at the distribution of resampled network sizes

In [None]:
...

We see that we do indeed get different estimates from sample to sample, but they are all bigger than 3. So this analysis suggests that it's safe to conclude that Berkeley students' discussion networks are bigger than the average American's discussion networks, as long as we assume (1) that we have a random sample of Berkeley students; and (2) that the average American's discussion network has 3 people in it.
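
For readers who want to see the whole resampling idea in one place, here is a minimal sketch that uses only the Python standard library, so it runs without the survey file or the `datascience` package. The values in `toy_sizes` are hypothetical stand-ins for the recoded column, and the percentile interval at the end is one common way to summarize the resampled means, not something computed elsewhere in this notebook.

```python
import random
import statistics

# Hypothetical recoded network sizes (0-6); stand-ins for the survey column
toy_sizes = [0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

boot_means = []
for _ in range(2000):
    # draw len(toy_sizes) values WITH replacement from the observed sample
    resample = rng.choices(toy_sizes, k=len(toy_sizes))
    boot_means.append(statistics.mean(resample))

# middle 95% of the resampled means (a percentile interval)
boot_means.sort()
lower = boot_means[int(0.025 * len(boot_means))]
upper = boot_means[int(0.975 * len(boot_means))]
sample_mean = statistics.mean(toy_sizes)
print(sample_mean, lower, upper)
```

If the whole interval sits above 3, that points the same way as the histogram: the resampled averages rarely, if ever, fall as low as 3.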

# Respondents' confidants

Now let's dig a little deeper into who was named as a confidant by our survey respondents.

### Relationship between respondent and first alter named: gender

In Lab, you'll learn how to work with all of the alters that were named. Here in lecture, we'll keep things simple by looking at the relationship between the respondent and the first alter that the respondent named.

We'll start by looking at the alter's gender.

What is the distribution of genders among the first alters named?

In [None]:
...

Does the gender of the first named alter seem to be related to the gender of the ego?

In [None]:
pd.crosstab(...)

We might want to look at proportions by row:

In [None]:
...
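
Row proportions divide each cell of the cross-tabulation by its row total, so each row answers "among egos of this gender, what share named an alter of each gender?". The sketch below does this by hand with the standard library on hypothetical ego-alter pairs (the counts are made up). In pandas, `pd.crosstab` also accepts `normalize='index'`, which should give the same row-wise shares.

```python
from collections import Counter

# Hypothetical (respondent gender, first-alter gender) pairs, for illustration only
toy_dyads = ([('Female', 'Female')] * 9 + [('Female', 'Male')] * 3 +
             [('Male', 'Female')] * 3 + [('Male', 'Male')] * 5)

cell_counts = Counter(toy_dyads)                   # count of each (ego, alter) cell
row_totals = Counter(ego for ego, _ in toy_dyads)  # number of egos of each gender

# divide every cell by the total for its row (the ego's gender)
row_props = {(ego, alter): n / row_totals[ego]
             for (ego, alter), n in cell_counts.items()}

for (ego, alter), p in sorted(row_props.items()):
    print(f"{ego:>6} -> {alter:<6} {p:.3f}")
```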

What fraction of egos name a first alter of a different gender? (We'll only deal with males and females to keep this simple.)

In [None]:
obs_frac_nonhom = ...
obs_frac_nonhom

### Is this a meaningful difference?

Again, we are faced with an important question: is the observed fraction of ego-alter pairs that are nonhomogeneous by gender high? Low? What would we expect it to be?

To answer this question, we have to think about what we would expect to see if picking an alter were random with respect to gender, that is, if the alter's gender were unrelated to the ego's gender.

In [None]:
# narrow down to only male and female alters, to keep things simple
survey_altermf = survey.to_df()
survey_altermf = survey_altermf.loc[survey_altermf['alter1_gender'].isin(['Female', 'Male'])]
survey_altermf = Table().from_df(survey_altermf)
num_responses_mf = survey_altermf.num_rows
num_responses_mf

In [None]:
# shuffle the alter genders: draw all num_responses_mf rows without replacement
permuted_alter_gender = survey_altermf.select('alter1_gender').sample(num_responses_mf, with_replacement=False)
permuted_dyads = Table().with_columns(
    'respondent_gender', survey_altermf.column('respondent_gender'),
    'alter1_gender', permuted_alter_gender.column(0))
permuted_dyads

Let's write a function to help calculate the fraction of dyads that goes from male to female or from female to male.

In [None]:
def frac_mf_dyads(dyads):
    """
    Calculate the fraction of dyads that is male to female OR female to male
    """
    # count each (respondent gender, alter gender) combination once
    counts = dyads.group(['respondent_gender', 'alter1_gender'])
    counts_mf = counts.where('respondent_gender', 'Male').where('alter1_gender', 'Female')
    mf = counts_mf.column('count').item(0)
    counts_fm = counts.where('respondent_gender', 'Female').where('alter1_gender', 'Male')
    fm = counts_fm.column('count').item(0)

    # use the table that was passed in, not the global permuted_dyads
    return (mf + fm) / dyads.num_rows

frac_mf_dyads(permuted_dyads)

Now repeat this many times and calculate the fraction of cross-gender dyads for each repetition:

In [None]:
nonhom_fracs = make_array()

for _ in np.arange(10000):
    # shuffle the alter genders: draw all num_responses_mf rows without replacement
    permuted_alter_gender = survey_altermf.select('alter1_gender').sample(num_responses_mf, with_replacement=False)
    permuted_dyads = Table().with_columns(
        'respondent_gender', survey_altermf.column('respondent_gender'),
        'alter1_gender', permuted_alter_gender.column(0))
    nonhom_fracs = np.append(nonhom_fracs, frac_mf_dyads(permuted_dyads))
null_fracs = Table().with_column('frac_dyads_nonhom', nonhom_fracs)

In [None]:
null_fracs.hist('frac_dyads_nonhom')

Let's add vertical lines to the plot showing the mean of the null distribution (yellow) and our observed value (red), so that we can easily compare the observed value to the null distribution.

In [None]:
null_ev = np.mean(null_fracs['frac_dyads_nonhom'])
null_fracs.hist('frac_dyads_nonhom')
#plt.scatter(null_ev,0,c='yellow',s=80);
plt.axvline(x=null_ev,c='yellow',linewidth=2);
plt.axvline(x=obs_frac_nonhom,c='red',linewidth=2);
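
The same comparison can be summarized with a number as well as a picture. Here is a minimal, self-contained sketch of the shuffle-and-compare logic using only the standard library. The gender lists are hypothetical, the names `frac_cross_gender` and `toy_null_fracs` are made up for this example, and reporting the share of shuffles at or below the observed fraction is one reasonable summary rather than the only one.

```python
import random

# Hypothetical ego and first-alter genders, aligned by position (one pair per respondent)
ego_gender = ['Female'] * 12 + ['Male'] * 8
alter_gender = ['Female'] * 9 + ['Male'] * 3 + ['Female'] * 3 + ['Male'] * 5

def frac_cross_gender(egos, alters):
    """Share of ego-alter pairs whose genders differ."""
    return sum(e != a for e, a in zip(egos, alters)) / len(egos)

obs_frac = frac_cross_gender(ego_gender, alter_gender)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
toy_null_fracs = []
for _ in range(5000):
    shuffled = alter_gender[:]  # copy, so the observed pairing is left untouched
    rng.shuffle(shuffled)       # break any link between ego and alter gender
    toy_null_fracs.append(frac_cross_gender(ego_gender, shuffled))

null_mean = sum(toy_null_fracs) / len(toy_null_fracs)
# share of shuffles with a cross-gender fraction at or below the observed one
share_at_or_below = sum(f <= obs_frac for f in toy_null_fracs) / len(toy_null_fracs)
print(obs_frac, null_mean, share_at_or_below)
```

A small share here would mean that random pairing rarely produces as few cross-gender dyads as we observed, which is what we would expect if egos tend to name alters of their own gender.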

Let's compare the theoretical expected value (average) of the null model to this distribution:

In [None]:
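# Assumption: prop_male and prop_female are not defined earlier in this notebook;
# here they are taken to be the respondent gender shares in survey_altermf
prop_male = np.mean(survey_altermf.column('respondent_gender') == 'Male')
prop_female = np.mean(survey_altermf.column('respondent_gender') == 'Female')
null_ev = 2 * prop_male * prop_female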
null_fracs.hist('frac_dyads_nonhom')
#plt.scatter(null_ev,0,c='yellow',s=80);
plt.axvline(x=null_ev,c='yellow',linewidth=2);
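
Where does `2 * prop_male * prop_female` come from? Here is a short derivation, under two assumptions that the notebook does not state explicitly: egos and alters have the same gender shares $p_M$ and $p_F$ (with $p_M + p_F = 1$), and the alter's gender is drawn independently of the ego's. A dyad is cross-gender if it is male to female or female to male, so

$$
\Pr(\text{cross-gender}) = \Pr(\text{ego } M)\,\Pr(\text{alter } F) + \Pr(\text{ego } F)\,\Pr(\text{alter } M) = p_M\,p_F + p_F\,p_M = 2\,p_M\,p_F .
$$

If the gender shares among egos and alters differ, the two terms are no longer equal and should be computed separately with the ego shares and the alter shares.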

In [None]:
# end of demo

In [None]:
survey.scatter('respondent_age', 'alter1_age')