# Lab 14

Welcome to Lab 14! In this lab we will get more practice with random sampling. More information about randomness can be found in textbook chapter 8, starting [here](https://data-8r.gitbooks.io/textbook/chapters/08/randomness.html).

In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Don't change this cell; just run it. 
import otter
grader = otter.Notebook()
Table.interactive_plots()

The dataset for this lab includes salary data for City of San Francisco employees in 2020. This data was collected from [DataSF](https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd).

Run the cell below to load the employee data.

In [None]:
full_data = Table.read_table("sf_salary_data_2020.csv")
full_data.show(3)

Given that this is publicly available data from a government entity (the City of San Francisco), we can assume that this is *population-level* data. That means that we have all of the information; if we wanted to find the average salary of all employees, for example, we would only need to call `np.average` on the salary column.

However, we are not always so lucky to have all of the data. In fact, most of the time, we do not have all of the information. Instead, we are often forced to learn things about a large underlying population using a smaller sample.  When we do this, we are doing *statistical inference*.

A statistical inference is a statement about the underlying population, such as "the average salary of NBA players in 2014 was 3 million dollars", made on the basis of a sample.  You may have heard the word "inference" used in other contexts.  It's important to keep in mind that statistical inferences, unlike, say, logical inferences, can be wrong.

A general strategy for inference using samples is to estimate facts about the population by computing the same facts about a sample.  This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors, and we'll touch lightly on a few of those today.
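
As a rough illustration of this strategy, the sketch below estimates a population average from a random sample. The population here is made up (simulated numbers, not the SF salary data), and the names `population` and `sample` are only for this example.

```python
import numpy as np

# Hypothetical population: 10,000 simulated salaries (NOT the SF data).
rng = np.random.default_rng(0)
population = rng.normal(loc=80_000, scale=20_000, size=10_000)

# Draw 100 individuals at random, without replacement.
sample = rng.choice(population, size=100, replace=False)

# Compute the same fact (the average) on the population and on the sample.
population_mean = np.average(population)  # usually unknown in practice
sample_mean = np.average(sample)          # our estimate of it

print(f"Population mean: {population_mean:,.0f}")
print(f"Sample mean:     {sample_mean:,.0f}")
```

The two numbers will usually be close but not equal, which is why an inference made from a sample can be wrong.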

One very important factor in the utility of samples is how they were gathered.  We have prepared some example sample datasets to simulate inference from different kinds of samples from the SF salaries dataset.  Later we'll ask you to create your own samples to see how they behave.

To save typing and make your code clearer, we will package the plotting and analysis code into two functions. This will be useful throughout the rest of the lab, since we will repeatedly need to draw histograms and compute summary statistics.

In [None]:
# Run this cell to define a function that creates Salary
# and Benefits histograms.

def histograms(t):
    salary = t.column('Total Salary')
    benefits = t.column('Total Benefits')
    benefits_bins = np.arange(min(benefits), max(benefits) + 10000, 5000)
    salary_bins = np.arange(min(salary), max(salary) + 50000, 25000)
    t.hist('Total Salary', bins=salary_bins, unit='$')
    t.hist('Total Benefits', bins=benefits_bins, unit='$')
    
histograms(full_data)

### Question 1
Create a function called `compute_statistics` that takes a single argument, a Table containing salaries (in a column called "Total Salary") and benefits (in a column called "Total Benefits").  It should:
- Draw a histogram of salaries in that table.
- Draw a histogram of benefits in that table.
- Return a two-element array containing the average salary and average benefits from that table, in that order.

*Hint:* You can call the `histograms` function defined above to draw the histograms! You do not need to return the output of `histograms` to view the graphs, as the function displays them automatically. You will, however, need to return the array.

In [None]:
def compute_statistics(benefits_and_salary_data):
    ...
    salary = ...
    benefits = ...
    return ...

full_stats = compute_statistics(full_data)
full_stats

In [None]:
grader.check('q1') # Warning: Charts will be displayed while running this test

### Question 2
When you ran the cell containing your answer to question 1, you should have seen an array with 2 numbers, and then 2 histograms.  Describe the meaning of each number and each histogram.  What dataset do they come from?

*Write your answer here, replacing this text.*

### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose people who are somehow convenient to sample.  For example, you might choose employees that you personally know, since it's easier to survey them.  This is called, somewhat pejoratively, *convenience sampling*.

Suppose you survey only members of the Department of Public Health. Maybe you work there, so it's easier to just ask your coworkers. 
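
To see why this can mislead, here is a small sketch on made-up numbers (the departments "A", "B", "C" and their salaries are hypothetical, not taken from the SF data). It compares the average over everyone with the average over a single, conveniently chosen department.

```python
import numpy as np

# Hypothetical toy data: six employees in three departments.
departments = np.array(["A", "A", "B", "B", "B", "C"])
salaries = np.array([90_000, 110_000, 50_000, 60_000, 70_000, 40_000])

# Population average: uses every employee.
population_mean = np.average(salaries)

# Convenience sample: only the people in department "A".
in_dept_a = departments == "A"
convenience_mean = np.average(salaries[in_dept_a])

print("Population average: ", population_mean)
print("Convenience average:", convenience_mean)
```

In this toy example the department "A" average is well above the overall average, because that department is not representative of the whole group.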

### Question 3
Assign `convenience_sample` to the subset of `full_data` that contains only the rows that have "Public Health" listed under the "Job Family" column.

In [None]:
convenience_sample = ...
convenience_sample

In [None]:
grader.check('q3')

### Question 4
Use the `compute_statistics` function you wrote in Question 1 to draw the histograms of salary and benefits for Public Health workers and to generate an array of the average salary and average benefits of your convenience sample. Assign the array of averages to the name `convenience_stats`. Since they're computed on a sample, these are called *sample averages*.

In [None]:
convenience_stats = ...
convenience_stats

In [None]:
grader.check('q4')

Next, we'll compare the convenience sample salaries with the full data salaries in a single histogram. To do that, we'll need to use the `group` option of the `hist` method, which allows us to make one histogram of one column for each group in another column. The following cell should not require any changes; just run it.

In [None]:
## Just run this cell.

def compare_salaries(first, second, first_title, second_title):
    """Compare the salaries in two tables."""
    first_labeled = first.with_columns("Sample", np.repeat(first_title, first.num_rows))
    second_labeled = second.with_columns("Sample", np.repeat(second_title, second.num_rows))
    combined = first_labeled.copy().append(second_labeled)
    # Make a histogram for each sample.
    combined.hist("Total Salary", group="Sample", bins=20, unit="$")

compare_salaries(full_data, convenience_sample, 'All Employees', 'Convenience Sample')

### Question 5
Does the convenience sample give us an accurate picture of the **benefits or salary** of the full population of SF City employees in 2020?  Would you expect it to, in general?  You can refer to the statistics calculated above or perform your own analysis.

In [None]:
### Show your work here.

*Write your answer here, replacing this text.*

### Simple random sampling
A more principled approach is to sample uniformly at random from the employees.  Imagine writing down each employee's name on a card, putting the cards in an urn, and shuffling the urn.  Then, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached. Because each employee can be selected at most once, this is a *simple random sample without replacement*. We can also sample *with* replacement: in that case, we would put each card back in the urn after drawing it, so the same employee could be selected more than once.
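
The urn picture can be sketched with NumPy on a hypothetical deck of ten numbered cards (an illustration only, separate from the lab's `Table` methods). Setting `replace=False` sets each drawn card aside, while `replace=True` puts it back before the next draw.

```python
import numpy as np

rng = np.random.default_rng(42)
cards = np.arange(1, 11)  # hypothetical urn: cards numbered 1 to 10

# Without replacement: each drawn card is set aside, so no card repeats.
without = rng.choice(cards, size=10, replace=False)

# With replacement: each card goes back before the next draw, so repeats
# are possible, and we can even draw more times than there are cards.
with_repl = rng.choice(cards, size=20, replace=True)

print("Without replacement:", without)
print("With replacement:   ", with_repl)
print("Distinct cards:", len(np.unique(without)), "vs", len(np.unique(with_repl)))
```

Drawing all ten cards without replacement yields every card exactly once, in shuffled order. Twenty draws with replacement must contain repeats, since there are only ten distinct cards.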

### Question 7
For this sample of SF city employees, should we sample with or without replacement? Why or why not?

*Write your answer here, replacing this text.*

### Producing simple random samples
Let us now see what would happen if we had access to only a sample of the salary data.  Would our conclusions have been very inaccurate?

### Question 8
Produce a simple random sample **without replacement of size 44** from `full_data`, assigning the sample table to the name `my_small_srswor_data`.  Run your analysis on it again using the `compute_statistics` function. Feel free to run the cell multiple times to see how the sample varies.

Are your results similar to those we saw with the convenience sample?  What happens to the histograms?  Run your code several times to get new samples.  How much do things change across samples? Discuss your findings in the Markdown cell below.

*Hint:* We can use `tbl.sample(n, with_replacement = True/False)` where n is the sample size to create a table that is sampled from `tbl`.

In [None]:
my_small_srswor_data = ...
my_small_stats = ...
my_small_stats

*Write your answer here, replacing this text.*

### Question 9
As in the previous question, analyze a **simple random sample without replacement of size 4400** from `full_data`. Assign your sample to the name `my_large_srswor_data`, then use the `compute_statistics` function and assign the resulting array of statistics to `my_large_stats`. Try running the cell multiple times to see how the sample changes each time.

Do the averages and histograms seem to change more or less across samples of this size than across samples of size 44?  And are the sample averages and histograms closer to their true values for salary or for benefits?  What did you expect to see? Again, discuss your findings in the Markdown cell below.

In [None]:
my_large_srswor_data = ...
my_large_stats = ...
my_large_stats

*Write your answer here, replacing this text.*

## Submission

You're done with this lab!

To submit this notebook, please download your notebook as a .ipynb file and submit to Gradescope. You can do so by navigating to the toolbar at the top of this page, clicking File > Download as... > Notebook (.ipynb). Then, go to our class's Gradescope page [here](https://www.gradescope.com/courses/136698) and upload your file under "Lab 14." 

To check your work for all autograded questions, run the cell below. 

It's fine to submit multiple times, but we will only grade the final notebook you submit for each assignment. Make sure you pass all tests to receive credit.

In [None]:
# For your convenience, you can run this cell to run all the tests at once.
grader.check_all()