# Lab 14

Welcome to Lab 14! In this lab we will get more practice with random sampling. More information about randomness can be found in textbook chapter 8, starting [here](https://data-8r.gitbooks.io/textbook/chapters/08/randomness.html).

In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('lab14.ok')
_ = ok.auth(inline=True)

The dataset for this lab includes salary data and statistics for every NBA basketball player from the 2014-2015 NBA season. This data was collected from [basketball-reference](http://www.basketball-reference.com) and [spotrac](http://www.spotrac.com).

Run the cell below to load the player and salary data.

In [None]:
salary_data = Table.read_table("salary_data.csv")
player_data = Table.read_table("player_data.csv")
full_data = salary_data.join("PlayerName", player_data, "Name")
full_data.show(3)

Rather than getting data on every player, imagine that we had gotten data on only a smaller subset of the players.  For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.  Instead, we are often forced to learn things about a large underlying population using a smaller sample.  When we do this, we are doing *statistical inference*.

A statistical inference is a statement about some fact about the underlying population, such as "the average salary of NBA players in 2014 was $3".  You may have heard the word "inference" used in other contexts.  It's important to keep in mind that statistical inferences, unlike, say, logical inferences, can be wrong.

A general strategy for inference using samples is to estimate facts about the population by computing the same facts about a sample.  This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors, and we'll touch lightly on a few of those today.
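As a minimal sketch of this strategy, the cell below estimates a population average from a sample average. It uses only Python's standard library, and the 492 "salaries" are made-up numbers rather than the lab's NBA data; the names `population` and `sample` are chosen just for this illustration.

```python
import random
import statistics

# Hypothetical population: 492 made-up salaries, not the lab's NBA data.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.randint(500_000, 20_000_000) for _ in range(492)]

# A sample of 50 distinct individuals, chosen uniformly at random.
sample = rng.sample(population, 50)

# The population average is the fact we want to learn;
# the sample average is our estimate of it.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print("population average:", round(population_mean))
print("sample average:    ", round(sample_mean))
print("estimation error:  ", round(sample_mean - population_mean))
```

The estimation error is usually not zero, which is why a statistical inference can be wrong.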

One very important factor in the utility of samples is how they were gathered.  We have prepared some example sample datasets to simulate inference from different kinds of samples for the NBA player dataset.  Later we'll ask you to create your own samples to see how they behave.

To save typing and make your code clearer, we will package the plotting and analysis code into two functions. This will be useful in the rest of the lab, since we will repeatedly need to create histograms and compute summary statistics for different tables.

In [None]:
# Run this cell to define a function that creates age
# and salary histograms.

def histograms(t):
    ages = t.column('Age')
    salaries = t.column('Salary')
    age_bins = np.arange(min(ages), max(ages) + 2, 1)
    salary_bins = np.arange(min(salaries), max(salaries) + 2000000, 1000000)
    t.hist('Age', bins=age_bins, unit='year')
    t.hist('Salary', bins=salary_bins, unit='$')
    return age_bins # Keep this statement so that your work can be checked
    
histograms(full_data)

### Question 1
Create a function called `compute_statistics` that takes a single argument, a Table containing ages (in a column called "Age") and salaries (in a column called "Salary").  It should:
- Draw a histogram of ages in that table.
- Draw a histogram of salaries in that table.
- Return a two-element array containing the average age and average salary in that table, in that order.

*Hint:* You can call your `histograms` function to draw the histograms!

In [None]:
def compute_statistics(age_and_salary_data):
    histograms(age_and_salary_data) #SOLUTION
    age = age_and_salary_data.column("Age") #SOLUTION
    salary = age_and_salary_data.column("Salary") #SOLUTION
    return make_array(np.mean(age), np.mean(salary)) #SOLUTION

full_stats = compute_statistics(full_data)
full_stats

In [None]:
_ = ok.grade('q1') # Warning: Charts will be displayed while running this test

### Question 2
When you ran the cell containing your answer to question 1, you should have seen an array with 2 numbers, and then 2 histograms.  Describe the meaning of each number and each histogram.  What dataset do they come from?

**SOLUTION:** The results are all for *all NBA players in 2014-2015*.  The first number is their average age (26), and the second is their average salary.  The first histogram shows the distribution of their ages, and the second shows the distribution of their salaries.

### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose players who are somehow convenient to sample.  For example, you might choose players from one team that's near your house, since it's easier to survey them.  This is called, somewhat pejoratively, *convenience sampling*.

Suppose you survey only *relatively new* players with ages less than 22.  (The more experienced players didn't bother to answer your surveys about their salaries.)
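To see why a sample like this can mislead, here is a small sketch on invented data. It assumes a hypothetical league in which salaries tend to rise with age; the numbers and the helper `average_age_and_salary` are illustrative only and are not drawn from the NBA dataset.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical league: 10 players at each age from 19 to 38, with salaries
# that tend to rise with age. These numbers are invented for illustration.
players = [
    (age, 500_000 * (age - 18) + rng.randint(0, 500_000))
    for age in range(19, 39)
    for _ in range(10)
]

# Convenience sample: only the players younger than 22.
young_players = [(age, salary) for age, salary in players if age < 22]

def average_age_and_salary(rows):
    """Return (average age, average salary) for a list of (age, salary) pairs."""
    return (statistics.mean(age for age, _ in rows),
            statistics.mean(salary for _, salary in rows))

demo_population_stats = average_age_and_salary(players)
demo_convenience_stats = average_age_and_salary(young_players)

print("all players:       ", demo_population_stats)
print("convenience sample:", demo_convenience_stats)
```

Because the sample was selected by age, its average age cannot match the population's, and any quantity related to age (here, salary) is distorted along with it.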

### Question 3
Assign `convenience_sample` to the subset of `full_data` that contains only the rows for players under the age of 22.

In [None]:
convenience_sample = full_data.where("Age", are.below(22)) #SOLUTION
convenience_sample

In [None]:
_ = ok.grade('q3')

### Question 4
Assign `convenience_stats` to an array of the average age and average salary of your convenience sample, using the `compute_statistics` function.  Since they're computed on a sample, these are called *sample averages*.

In [None]:
convenience_stats = compute_statistics(convenience_sample) #SOLUTION
convenience_stats

In [None]:
_ = ok.grade('q4')

Next, we'll compare the convenience sample salaries with the full data salaries in a single histogram. To do that, we'll need to use the `group` option of the `hist` method, which allows us to make one histogram of one column for each group in another column. The following cell should not require any changes; just run it.

In [None]:
def compare_salaries(first, second, first_title, second_title):
    """Compare the salaries in two tables."""
    first_labeled = first.with_columns("Sample", np.repeat(first_title, first.num_rows))
    second_labeled = second.with_columns("Sample", np.repeat(second_title, second.num_rows))
    combined = first_labeled.copy().append(second_labeled)
    # Make a histogram for each sample.
    combined.hist("Salary", group="Sample", bins=20, unit="$")

compare_salaries(full_data, convenience_sample, 'All Players', 'Convenience Sample')

### Question 5
Does the convenience sample give us an accurate picture of the **age or salary** of the full population of NBA players in 2014-2015?  Would you expect it to, in general?  You can refer to the statistics calculated above or perform your own analysis.

**SOLUTION:** No, the convenience sample does not give us an accurate picture of the age or salary of the full population of NBA players in 2014-2015. We would not expect it to, because players younger than 22 obviously have a different distribution of ages, and almost as obviously have a different distribution of salaries.

### Question 6
The NBA has salary caps for players that depend on complicated rules.  Given that no player starts before the age of 18, can you guess the salary cap that applies to most players below 22 years old?

**SOLUTION:** From the histogram, there appears to be a cap around 6 million dollars for these young players.  (The rules are quite complicated, but that's about accurate.)

### Simple random sampling
A more principled approach is to sample uniformly at random from the players.  If we ensure that each player is selected at most once, this is a *simple random sample without replacement*.  Imagine writing down each player's name on a card, putting the cards in an urn, and shuffling them.  Then, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached.  We can also sample *with* replacement: in that case, we would put each card back in the urn after drawing it, so the same player could be drawn more than once.
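The urn picture can be sketched with Python's standard `random` module. The card names below are placeholders, and this is only an illustration of the two sampling schemes, not the lab's own sampling code.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical urn of name cards; the names are placeholders.
urn = [f"Player {letter}" for letter in "ABCDEFGH"]

# Without replacement: each drawn card is set aside, so no name can repeat.
without_replacement = rng.sample(urn, 5)

# With replacement: each card goes back in the urn before the next draw,
# so the same name may appear more than once.
with_replacement = rng.choices(urn, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```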

### Question 7
For this sample, should we sample with or without replacement? Why?

**SOLUTION:** Without replacement - we don't want to draw the same player twice.

### Producing simple random samples
Let us now see what would happen if we had access to only a sample of the salary data.  Would our conclusions have been very inaccurate?

### Question 8
Produce a simple random sample without replacement of size 44 from `full_data`.  Run your analysis on it again.  Are your results similar to those we saw with the convenience sample?  What happens to the histograms?  Run your code several times to get new samples.  How much do things change across samples?

In [None]:
my_small_srswor_data = full_data.sample(44, with_replacement=False) #SOLUTION
my_small_stats = compute_statistics(my_small_srswor_data) #SOLUTION
my_small_stats

**SOLUTION:** The averages are similar, but the histograms are quite noisy. The average age tends to stay around the same value as there is a limited range of ages for NBA players, but the salary changes by a sizeable factor due to larger variability in salary.

### Question 9
As in the previous question, analyze several simple random samples of size 100 from `full_data`.  Do the average and histogram statistics seem to change more or less across samples of this size than across samples of size 44?  And are the sample averages and histograms closer to their true values for age or for salary?  What did you expect to see?

In [None]:
my_large_srswor_data = full_data.sample(100, with_replacement=False) #SOLUTION
my_large_stats = compute_statistics(my_large_srswor_data) #SOLUTION
my_large_stats

**SOLUTION:** The average and histogram statistics seem to change less across samples of this size. They are closer to their true values, because we are sampling a larger subset of the population.
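One way to check this impression is to repeat the sampling many times and measure how much the sample average moves around. The sketch below does that on an invented, right-skewed population of 492 values (not the lab's salary data); the function `spread_of_sample_means` and the choice of 1000 repetitions are assumptions made for this illustration.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed "salaries" for 492 players (invented values).
population = [rng.expovariate(1 / 4_000_000) for _ in range(492)]

def spread_of_sample_means(sample_size, repetitions=1000):
    """Standard deviation of the sample average across many
    simple random samples drawn without replacement."""
    means = [statistics.mean(rng.sample(population, sample_size))
             for _ in range(repetitions)]
    return statistics.stdev(means)

spread_44 = spread_of_sample_means(44)
spread_100 = spread_of_sample_means(100)

print("spread of averages, size 44: ", round(spread_44))
print("spread of averages, size 100:", round(spread_100))
```

A smaller spread for size 100 means that sample averages from larger samples cluster more tightly around the population average.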

In [None]:
# For your convenience, you can run this cell to run all the tests at once.
_ = ok.grade_all()

In [None]:
_ = ok.submit()