# SW 282: Lab 8 - Sampling and T-tests

---

### Professor Erin Kerrison

In [None]:
from datascience import *
import numpy as np
import pyreadstat
from scipy import stats
from statsmodels.stats import weightstats
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [9,6]

## Part 1

In this section of the notebook, we'll be working with a dataset from United Airlines that records arrival delays at airports (in minutes) for the month of June 2015. Please load the dataset and review how the data are organized by running the cell below.

In [None]:
united = Table.read_table("united.csv")
united

We can determine the total population size (all of the delayed United Airlines flights observed for that month) by calculating the number of rows in the full table.

Conveniently, the `Table` object has a `num_rows` attribute that stores the number of rows. Once we assign our table to the name `data`, we can find the number of rows in that table (that is, how many delayed flights are included in this dataset) by evaluating `data.num_rows`.

**Question 1.1:** I have assigned the United Airlines table to the name `data` for you. In the cell below, please run the command that will provide you with the population size of our dataset.

In [None]:
data = united
data.num_rows

<div class="alert alert-info">

**QUESTION:** What does the output above represent? Please interpret the value in the cell below.

</div>

_**Type your answer here, replacing this text.**_

As you will recall, NumPy provides mean and standard deviation functions. The functions for the mean and SD are `np.mean` and `np.std`, respectively. These functions work on arrays, so let's extract the `Delay` column of our table into an array, which we will call `delays`.

In [None]:
delays = united.column("Delay")
delays

Now let's use `np.mean` and `np.std` to calculate the mean and SD of the `delays` array.

In [None]:
# mean
np.mean(delays)

In [None]:
# SD
np.std(delays)

Now let's work on some sampling. Recall that we can create a simple random sample (SRS) of our table using `Table.sample` with the `with_replacement` argument set to `False`. In the cell below, we create an SRS of $n=500$ observations from our table and extract the delays into the `sample_delays` array.

In [None]:
n = 500
sample = united.sample(n, with_replacement=False)
sample_delays = sample.column("Delay")
sample_delays
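
If it helps to see what sampling without replacement does under the hood, here is a minimal sketch using only NumPy. The `population` array is made-up data standing in for the `Delay` column (an assumption for illustration, not values from `united.csv`), and the seed is there only so the sketch is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch repeatable

# Hypothetical stand-in for the Delay column: 10,000 made-up delays in minutes.
population = rng.exponential(scale=20, size=10_000)

n = 500
# replace=False means each row index can be drawn at most once,
# which is what makes this a simple random sample.
srs_indices = rng.choice(len(population), size=n, replace=False)
srs = population[srs_indices]

# Both counts are n: we drew n values and no index was repeated.
print(len(srs), len(np.unique(srs_indices)))
```

Setting `replace=True` instead would allow the same row to appear more than once, which mirrors `with_replacement=True` in `Table.sample`.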

That's a lot of values! Let's try to make sense of this distribution with a measure of central tendency and a measure of spread. For our random sample, let's calculate the mean and standard deviation of the `sample_delays` array.

In [None]:
# mean
np.mean(sample_delays)

In [None]:
# SD
np.std(sample_delays)
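
One detail worth knowing: by default `np.std` divides by $n$ (the population formula). When the data are a sample, many textbooks divide by $n-1$ instead, which NumPy does if you pass `ddof=1`. The short sketch below uses a small made-up array (not the airline data) to show the difference.

```python
import numpy as np

# Small made-up array, chosen so the arithmetic is easy to follow by hand.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_sd = np.std(values)             # default ddof=0: divide by n
sample_sd = np.std(values, ddof=1)  # ddof=1: divide by n - 1

print(pop_sd, sample_sd)
```

With $n=500$ the two versions are nearly identical, so either is a reasonable summary of `sample_delays`.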

<div class="alert alert-info">

**QUESTION:** From what you understand of the sample's mean and standard deviation, how would you interpret the last two values that you calculated above?

</div>

_**Type your answer here, replacing this text.**_

Recall that the _standard error of the mean_ (SE) is the standard deviation of the sampling distribution of the mean. If we have a set of values $x_i$ with $i = 1, \ldots, n$ and SD $\sigma$, then the SE of the mean is

$$\Large
\sigma_m = \frac{\sigma}{\sqrt{n}}
$$

As with the mean and SD, there is a function that will calculate this value for us; it is in the `stats` submodule of SciPy. In the cell below, we use `stats.sem` to compute the standard error of the mean of the `delays` array.

In [None]:
stats.sem(delays)
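
To connect `stats.sem` back to the formula, the sketch below computes the SE by hand on a small made-up array (an assumption for illustration, not the airline data) and compares it with SciPy's result. Note that `stats.sem` uses the $n-1$ version of the SD by default (`ddof=1`), so we match that in the manual calculation.

```python
import numpy as np
from scipy import stats

# Small made-up array standing in for a column of delays.
values = np.array([3.0, 12.0, 0.0, 45.0, 7.0, 19.0, 5.0, 30.0])
n = len(values)

# SE of the mean = SD / sqrt(n), using the n - 1 SD to match stats.sem's default.
manual_sem = np.std(values, ddof=1) / np.sqrt(n)
scipy_sem = stats.sem(values)

print(manual_sem, scipy_sem)
```

The same call on `sample_delays` would give the standard error based on the $n=500$ sample.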

<div class="alert alert-info">
The standard deviation (SD) measures the amount of variability, or dispersion of data from the mean, while the standard error of the mean (SEM) measures how far the sample mean of the data is likely to be from the true population mean. 
    
    
**QUESTION:** Given this, what can you infer from the SEM output you just calculated above?

</div>

_**Type your answer here, replacing this text.**_

## Part 2

In this section, we'll be using another SPSS-formatted dataset. In the cell below, we use the usual pipeline to read this into a `Table`.

In [None]:
df, _ = pyreadstat.read_sav("ch-11-dataset-2.sav")
data = Table.from_df(df)
data

Let's look at the means of `Hands_Up` grouped by `Gender`. Recall that we can group values in a table using `Table.group` and giving it the label of a column. This function can also take an optional _aggregator function_. For each group, `Table.group` passes the array of values in each remaining column to that function, and the value it returns becomes the entry for that group in the resulting table.

For example, if I wanted the standard deviation grouped by gender, my call to `Table.group` would be

```python
data.group("Gender", np.std)
```

We calculate the mean of `Hands_Up` grouped by `Gender` using `Table.group` in the cell below.

In [None]:
data.group("Gender", np.mean)
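
As a rough picture of what grouping with an aggregator does, here is a minimal NumPy-only sketch. The `gender` and `hands_up` arrays are made-up values (not the contents of `ch-11-dataset-2.sav`), and this is only an illustration of the idea, not how `Table.group` is implemented.

```python
import numpy as np

# Made-up stand-ins for the Gender and Hands_Up columns.
gender = np.array([1, 2, 1, 2, 1, 2])
hands_up = np.array([4.0, 6.0, 8.0, 7.0, 6.0, 8.0])

# For each distinct group label, collect that group's values and aggregate them.
group_means = {
    int(g): float(np.mean(hands_up[gender == g]))
    for g in np.unique(gender)
}

print(group_means)
```

Swapping `np.mean` for `np.std` in the sketch gives the per-group standard deviation instead.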

<div class="alert alert-info">
    
    
**QUESTION:** Please review Salkind's description of this dataset and interpret the mean values you calculated just above.

</div>

_**Type your answer here, replacing this text.**_

Now let's perform a T-test comparing the two groups. Again, `scipy.stats` provides a useful function for this: `stats.ttest_ind`, which runs an independent-samples T-test. To use it, pass it the two arrays that you would like it to compare.

In the cell below, we create two arrays, `hands_1` and `hands_2`, of the `Hands_Up` values for each `Gender` value, and then perform a T-test on these arrays using `stats.ttest_ind`. Recall that you can filter rows based on some value of a column using `Table.where(column, are.equal_to(value))`.

In [None]:
hands_1 = data.where("Gender", are.equal_to(1)).column("Hands_Up")
hands_2 = data.where("Gender", are.equal_to(2)).column("Hands_Up")

stats.ttest_ind(hands_1, hands_2)
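
To see where the two numbers returned by `stats.ttest_ind` come from, the sketch below computes the pooled-variance T statistic and its two-sided p-value by hand and checks them against SciPy. The two arrays are made-up values (not the `Hands_Up` data), and the calculation assumes equal variances in the two groups, which is the default for `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Made-up scores for two independent groups.
group_a = np.array([5.0, 7.0, 6.0, 8.0, 9.0])
group_b = np.array([4.0, 5.0, 6.0, 5.0, 4.0, 6.0])

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2  # degrees of freedom for the pooled test

# Pooled variance: weighted average of the two sample variances (n - 1 version).
pooled_var = (
    (n_a - 1) * np.var(group_a, ddof=1) + (n_b - 1) * np.var(group_b, ddof=1)
) / df

# T statistic: difference in means divided by its standard error.
t_manual = (np.mean(group_a) - np.mean(group_b)) / np.sqrt(
    pooled_var * (1 / n_a + 1 / n_b)
)

# Two-sided p-value from the t distribution with df degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df)

result = stats.ttest_ind(group_a, group_b)  # equal_var=True by default
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

If the equal-variance assumption looks doubtful, passing `equal_var=False` to `stats.ttest_ind` runs Welch's T-test instead.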

Notice in our result that we're given both the test statistic value and the p-value.

<div class="alert alert-info">
 
**QUESTION:** Is the difference between these mean values (grouped by gender) statistically significant? How do you know?

</div>

_**Type your answer here, replacing this text.**_