# SW 282: Lab 8 - Sampling, Z-tests, and T-tests

---

### Professor Erin Kerrison

In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [9,6]

## Part 1

In this section of the notebook, we'll be working with a dataset from United Airlines that gives flight arrival delays at airports, in minutes, for June 2015. We load the dataset below.

In [None]:
united = Table.read_table("united.csv")
united

We can find the population size by finding the number of rows in the table; conveniently, the `Table` object has a `num_rows` attribute that stores the number of rows. If we had a table `data`, then the number of rows in that table is `data.num_rows`.

**Question 1.1:** Find the population size of our dataset.

In [None]:
...

As you will recall, NumPy provides functions for the mean and standard deviation: `np.mean` and `np.std`, respectively. These functions work on arrays, so first extract the `Delay` column of our table into an array called `delays`, for example with `united.column("Delay")`.
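As a quick sanity check before working with the real data, here is a minimal sketch on a small made-up array (`toy_values` is hypothetical and not part of the dataset). Note that `np.std` computes the population SD by default (`ddof=0`); passing `ddof=1` gives the sample SD instead.

```python
import numpy as np

# Hypothetical toy data, not taken from united.csv
toy_values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

toy_mean = np.mean(toy_values)              # arithmetic mean
toy_sd = np.std(toy_values)                 # population SD (divides by n)
toy_sample_sd = np.std(toy_values, ddof=1)  # sample SD (divides by n - 1)

print(toy_mean, toy_sd, toy_sample_sd)
```

The sample SD is always at least as large as the population SD, because it divides by a smaller number.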

**Question 1.2:** Use `np.mean` and `np.std` to calculate the mean and SD of the `delays` array.

In [None]:
# mean
np.mean(...)

In [None]:
# SD
np.std(...)

Now let's work on some sampling. Recall that we can create a simple random sample of our table using `Table.sample` with the `with_replacement` argument set to `False`.
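To illustrate what sampling without replacement means, here is a small NumPy sketch. It is only an analogue of `Table.sample`, not how that method is implemented; the array `toy_population` and the seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the draw is repeatable
toy_population = np.arange(100)       # hypothetical population of 100 values

# Draw 10 values without replacement: no value can be picked twice
toy_sample = rng.choice(toy_population, size=10, replace=False)

print(toy_sample)
print(len(np.unique(toy_sample)) == len(toy_sample))
```

Because no element can be drawn twice, every value in `toy_sample` is distinct, which is what makes this a simple random sample of the population.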

**Question 1.3:** Create an SRS of $n=500$ observations from our table.

In [None]:
n = 500
sample = ...

# create an array of delays from our sample
sample_delays = sample.column("Delay")
sample

**Question 1.4:** Calculate the mean and standard deviation of your `sample_delays` array.

In [None]:
# mean
np.mean(...)

In [None]:
# SD
np.std(...)

Recall that the _standard error of the mean_ is the standard deviation of the sampling distribution of the mean. If we have a set of values $x_i$ with $i = 1, \dots, n$ and SD $\sigma$, then the SE of the mean is

$$\Large
\sigma_m = \frac{\sigma}{\sqrt{n}}
$$

As with the mean and SD, there is a function that will calculate this value for us; it is in the `stats` submodule of SciPy.
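To see the formula above in action, here is a minimal sketch that computes the SE by hand on a made-up array (`toy_values` is hypothetical). One detail worth knowing: `stats.sem` uses `ddof=1` by default, so it divides the sample SD (not the population SD) by $\sqrt{n}$. Both versions are shown below.

```python
import numpy as np

# Hypothetical toy data, not taken from united.csv
toy_values = np.array([2, 4, 4, 4, 5, 5, 7, 9])
n = len(toy_values)

# SE using the population SD (ddof=0), matching the formula sigma / sqrt(n)
se_population = np.std(toy_values) / np.sqrt(n)

# SE using the sample SD (ddof=1), the convention stats.sem follows by default
se_sample = np.std(toy_values, ddof=1) / np.sqrt(n)

print(se_population, se_sample)
```

For large $n$ the two values are nearly identical, so the choice matters mostly for small samples.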

**Question 1.5:** Use `stats.sem` to compute the standard error of the mean of the `delays` array.

In [None]:
from scipy import stats

...

To perform a z-test on our sample, we can use the `statsmodels` library, which provides many statistical models and tests. The function we will use is `statsmodels.stats.weightstats.ztest`. To use it, pass it the array of your sample's `Delay` values; by default, it tests the null hypothesis that the mean is 0.

**Question 1.6:** Perform a z-test on your sample.

_Hint:_ Because of how we've imported the function, your call will look like `weightstats.ztest(...)`.

In [None]:
from statsmodels.stats import weightstats

...

The `ztest` function returns a tuple; the first value is the test statistic, and the second is the p-value of the test.
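To make those two numbers less of a black box, here is a sketch of a one-sample, two-sided z-test computed by hand. It is not the `statsmodels` implementation; the function name `one_sample_ztest` and the toy data are hypothetical, and we assume the sample SD (`ddof=1`) and a null mean of 0, which we believe matches the `ztest` defaults.

```python
import math
import numpy as np

def one_sample_ztest(values, null_mean=0.0):
    """Return (z statistic, two-sided p-value) for a one-sample z-test."""
    values = np.asarray(values, dtype=float)
    n = values.size
    # Standard error of the mean, using the sample SD
    se = values.std(ddof=1) / math.sqrt(n)
    z = (values.mean() - null_mean) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical delays in minutes, not taken from united.csv
toy_delays = np.array([12.0, -3.0, 5.0, 20.0, 8.0, 0.0, 15.0, 7.0])
z_stat, p_val = one_sample_ztest(toy_delays)
print(z_stat, p_val)
```

Unpacking the result into two names, as in the last lines, works the same way for the tuple returned by `weightstats.ztest`.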

## Part 2

In this section, we'll be using another SPSS-formatted dataset. In the cell below, we use the usual pipeline to read this into a `Table`.

In [None]:
import pyreadstat

df, _ = pyreadstat.read_sav("ch-11-dataset-2.sav")
data = Table.from_df(df)
data

Let's look at the means of `Hands_Up` grouped by `Gender`. Recall that we can group values in a table using `Table.group` and giving it the label of a column. This function also accepts an optional _aggregator function_: for each group, it passes the array of that group's values to the function and stores the returned value in the corresponding column.

For example, if we wanted the standard deviation grouped by gender, our call to `Table.group` would be

```python
data.group("Gender", np.std)
```
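To make the aggregator idea concrete, here is a plain-NumPy sketch of what grouping with an aggregator does conceptually. It is not how `Table.group` is implemented; the helper `group_aggregate` and the toy arrays are hypothetical.

```python
import numpy as np

# Hypothetical toy data, not taken from the SPSS file
gender = np.array([1, 2, 1, 2, 1])
hands_up = np.array([4.0, 6.0, 8.0, 2.0, 6.0])

def group_aggregate(labels, values, aggregator):
    """Apply `aggregator` to the values belonging to each distinct label."""
    return {
        int(label): float(aggregator(values[labels == label]))
        for label in np.unique(labels)
    }

group_means = group_aggregate(gender, hands_up, np.mean)
group_sds = group_aggregate(gender, hands_up, np.std)
print(group_means)
print(group_sds)
```

Swapping `np.mean` for `np.std` changes only the function applied to each group's array, which is the role the aggregator plays in `Table.group`.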

We calculate the mean of `Hands_Up` grouped by `Gender` using `Table.group` in the cell below.

In [None]:
data.group("Gender", np.mean)

Now let's perform a T-test to compare the two groups. Again, `scipy.stats` provides a useful function for this: `stats.ttest_ind`. To use it, pass it the two arrays that you would like it to compare.

In the cell below, we create two arrays, `hands_1` and `hands_2`, of the `Hands_Up` values for each `Gender` value, and then perform a T-test on these arrays using `stats.ttest_ind`. Recall that you can filter rows based on some value of a column using `Table.where(column, are.equal_to(value))`.

In [None]:
hands_1 = data.where("Gender", are.equal_to(1)).column("Hands_Up")
hands_2 = data.where("Gender", are.equal_to(2)).column("Hands_Up")

stats.ttest_ind(hands_1, hands_2)

Notice in our result that we're given both the test statistic value and the p-value.
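For intuition about what the test statistic measures, here is a sketch of the pooled-variance two-sample t statistic computed by hand with NumPy. We assume equal variances in the two groups, which is the default behavior of `stats.ttest_ind` (`equal_var=True`); the function name `pooled_t_statistic` and the toy arrays are hypothetical.

```python
import numpy as np

def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for a two-sample t-test with pooled variance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df
    # Standard error of the difference between the two means
    se_diff = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / se_diff, df

# Hypothetical toy data, not taken from the SPSS file
toy_group_1 = [4, 8, 6]
toy_group_2 = [6, 2, 4]
t_stat, dof = pooled_t_statistic(toy_group_1, toy_group_2)
print(t_stat, dof)
```

The p-value then comes from a t distribution with `dof` degrees of freedom, which `stats.ttest_ind` computes for us.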

---

### References

The data for this notebook is from https://www.eia.gov/state/seds/seds-data-complete.php?sid=US#CompleteDataFile.