# SW 282: Lab 4 - Standard Deviation & Bootstrapping

---

### Professor Erin Kerrison

In this lab, we consider the standard deviation and the data science technique of bootstrapping.

---

### Table of Contents

1. [Reading in SPSS Files](#section-1)
2. [Standard Deviation & the Bootstrap](#section-2)
3. [Population Standard Deviation](#section-3)

---

#### Dependencies

To begin with the lab, run the cell below, which loads in all of the Python packages we'll be making use of.

In [None]:
from datascience import *
import numpy as np
import pandas as pd
import pyreadstat
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")

## 1. Reading in SPSS Files <a id="section-1"></a>

CSV is one of the more common formats for saving data, but there are others as well (e.g. JSON, XML). The `datascience` library does not currently support reading in data stored in these other formats, so instead we use a library called `pandas`, which is the industry standard for working with rectangular data.

In this notebook, we will focus on how to import SPSS-formatted data (files with a `.sav` extension). We will use a library called `pyreadstat` which reads these files into a pandas `DataFrame`, analogous to the `Table`s you've been working with. Once we have the DataFrame, we can then use `Table.from_df()` from the `datascience` library to convert the DataFrame into a table for use.

In the cell below, we load in the data for this lab using `pyreadstat.read_sav()` and the `pyreadstat`-`pandas`-`datascience` pipeline discussed above.

In [14]:
df, _ = pyreadstat.read_sav("data/ch-3-ds-1.sav")
reaction_times = Table.from_df(df)
reaction_times

ReactionTime
0.3
0.5
0.2
1.9
0.7
0.4
0.2
0.2
0.2
0.5


## 2. Standard Deviation & the Bootstrap <a id="section-2"></a>

In this section of the notebook, we will be looking at the _standard deviation_ of samples from our `reaction_times` data. The standard deviation is a measure of the _spread_ of a sample, and is defined as the root-mean-squared difference from the mean of the data: take each value's difference from the mean, square those differences, average them, and take the square root.
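As a rough illustration of that definition, the sketch below computes the standard deviation step by step and compares it with `np.std`. The array `times` is an assumption made for illustration: it holds only the ten reaction times shown in the table preview above, not necessarily the full data set.

```python
import numpy as np

# Illustrative values only: the ten reaction times shown in the table preview.
times = np.array([0.3, 0.5, 0.2, 1.9, 0.7, 0.4, 0.2, 0.2, 0.2, 0.5])

mean = np.mean(times)                           # 1. the mean of the data
deviations = times - mean                       # 2. each value's difference from the mean
manual_sd = np.sqrt(np.mean(deviations ** 2))   # 3. root of the mean of the squared differences

# np.std with its default settings follows the same root-mean-squared definition.
print(manual_sd, np.std(times))
```

By default `np.std` divides by the number of values (`ddof=0`), which matches the definition given here; passing `ddof=1` would divide by one less than the number of values instead.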

Let's consider the following question:

> What is the standard deviation of the **population** from which our sample is drawn?

This question can't be answered exactly, because we can't know the distribution of the population our sample came from, but there are statistical techniques that we can use to attempt to answer such a question. One such technique is the **bootstrap**, in which we use data already collected to generate new samples in an attempt to estimate the value of a **population parameter**.

The bootstrap is a simple process:

1. Resample (with replacement) from your existing sample to create a new _bootstrap sample_, and repeat this $n$ times
2. Calculate the statistic being estimated on each bootstrap sample
3. Average these values for an estimate of the parameter
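
The steps above can be sketched in plain NumPy. This is a minimal sketch under stated assumptions, not the lab's solution: the helper `bootstrap_estimate`, its parameters, and the array `made_up_sample` are all hypothetical names and values introduced here for illustration, and each bootstrap sample is assumed to be the same size as the original sample.

```python
import numpy as np

def bootstrap_estimate(sample, statistic, num_resamples, seed=0):
    """Hypothetical helper: average a statistic over bootstrap resamples."""
    rng = np.random.default_rng(seed)   # seeded so the sketch is reproducible
    stats = np.empty(num_resamples)
    for i in range(num_resamples):
        # Step 1: draw a bootstrap sample with replacement from the original sample
        resample = rng.choice(sample, size=len(sample), replace=True)
        # Step 2: calculate the statistic on that bootstrap sample
        stats[i] = statistic(resample)
    # Step 3: average the per-resample values for an estimate of the parameter
    return np.mean(stats), stats

# Made-up values for illustration only; not the lab's data.
made_up_sample = np.array([0.3, 0.5, 0.2, 1.9, 0.7, 0.4])
estimate, resample_stats = bootstrap_estimate(made_up_sample, np.std, num_resamples=1000)
print(estimate)
```

Passing the statistic in as a function (here `np.std`) keeps the resampling logic separate from the quantity being estimated, so the same loop could be reused for a mean or a median.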

In this section, we will walk you through a simple bootstrap of our `reaction_times` data in an attempt to estimate the population standard deviation. If you want to learn more about the bootstrap, you can read about it and its statistical foundations in the [Data 8 textbook](https://www.inferentialthinking.com/chapters/13/2/Bootstrap.html).

#### Resample the Table

To begin the bootstrap, you will need to repeatedly resample from your table. Recall from the last lab that you can sample from a table using `Table.sample()`. In this example we will use $n=5$ resamples, although normal bootstrapping procedures resample on the order of thousands or tens of thousands of times.

To sample repeatedly, we will use a `for` loop, which will repeatedly execute an action for each element in an array. Consider the `for` loop below:

In [None]:
for word in make_array("This", "is", "a", "sample", "for", "loop."):
    print(word)

The loop above prints out each word in the array we define in the first line. If we wanted to collect values in an array, we would use a loop of the form

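```python
arr = make_array()  # start with an empty array
for i in arr2:  # arr2 stands for any existing array of values
    # do something with i
    arr = np.append(arr, i)  # add the value to the end of arr
```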

The function `np.append()` adds its second argument to the end of its first argument. In this way, we can construct a bootstrap procedure as follows:

```python
values = make_array()
for i in np.arange(n):
    # resample table
    value = ...  # statistic value
    values = np.append(values, value)
```

The function `np.arange(n)` creates an array of integers from 0 to $n-1$, which means that the `for` loop above runs $n$ times.
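
To see `np.arange` and `np.append` working together, here is a small sketch that collects the squares of the integers 0 through 4. It is unrelated to the lab data, and `np.array([])` is used in place of `make_array()` only so that the example runs with NumPy alone.

```python
import numpy as np

squares = np.array([])           # empty array to collect results in
for i in np.arange(5):           # i takes the values 0, 1, 2, 3, 4
    # np.append returns a new array, so the result must be assigned back
    squares = np.append(squares, i ** 2)

print(squares)                   # holds the values 0, 1, 4, 9, 16
```

Note that `np.append` does not modify its first argument in place, which is why the loop reassigns `squares` on every pass.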

<div class="alert alert-info">

**QUESTION:** Fill in the skeleton code below to bootstrap an estimate of the standard deviation of our sample with 5 resamples. Use a sample size of 75.

</div>

In [None]:
n = ...
sdevs = ...
for i in np.arange(...):
    sample = reaction_times.sample(...)
    sdev = np.std(sample.column("ReactionTime"))
    sdevs = np.append(..., ...)

Now that we have an array of standard deviations for each resample, we can average them using `np.mean` to get an estimate of the **population** standard deviation.

In [None]:
print("Estimate of Population SD: ", np.mean(sdevs))

<div class="alert alert-info">

**QUESTION:** Take a look at the values for the standard deviation we got with each resample (by running the cell below). How does what is calculated change based on the particular sample that is chosen?

</div>

In [None]:
sdevs

_Type your answer here, replacing this text._

## 3. Population Standard Deviation <a id="section-3"></a>

In the cell below, we load in the underlying population from which our sample `reaction_times` was drawn.

In [13]:
df, _ = pyreadstat.read_sav("data/ch-3-dataset-1.sav")
population = Table.from_df(df)
population

ReactionTime
0.4
0.7
0.4
0.9
0.8
0.7
0.3
1.9
1.2
2.8


<div class="alert alert-info">

**QUESTION:** Compute the standard deviation of the `ReactionTime` column of `population`.

</div>

In [None]:
np.std(...)

<div class="alert alert-info">

**QUESTION:** What are the implications of the observed differences among the five standard deviations from our bootstrap, our bootstrapped standard deviation, and the population standard deviation?

</div>

_Type your answer here, replacing this text._

---

## Submission

Congrats on finishing another lab notebook! To turn in this lab assignment, follow the steps below:

1. Press `Control + P` (or `Command + P` on Mac) to open the Print preview
2. Change the destination so that it saves locally on your own computer.
3. Save as PDF
4. If you are stuck, follow further instructions [here](https://www.wikihow.com/Save-a-Web-Page-as-a-PDF-in-Google-Chrome).
5. Upload this PDF to bCourses.

---
Notebook developed by: Chris Pyles

Data Science Modules: http://data.berkeley.edu/education/modules