# SW 282: Lab 3 - Data Files & Descriptive Statistics

---

### Professor Erin Kerrison

In this lab, we discuss how to read in data from external sources and how to calculate descriptive statistics in Python.

---

### Table of Contents

1. [Reading Tables from Files](#section-1)
2. [Random Sampling and the Mean, Median, and Mode](#section-2) <br>
&nbsp;&nbsp;&nbsp; a. [Sampling from a Table](#section-2a) <br>
&nbsp;&nbsp;&nbsp; b. [Descriptive Statistics](#section-2b)

---

#### Dependencies

To begin the lab, run the cell below, which loads all of the Python packages we will use.

In [None]:
from datascience import *
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
import warnings
warnings.simplefilter('ignore', FutureWarning)
%matplotlib inline

## 1. Reading Tables from Files <a id="section-1"></a>

In the last notebook, we discussed how to create tables from scratch using arrays. In this notebook, we start by importing data that has already been saved to a file. One of the most common formats for distributing rectangular data is CSV, which stands for comma-separated values. In a CSV file, each line is a row, and the column values within a row are separated by commas:

```
Col_1,Col_2,Col_3
1,2,3
4,5,6
7,8,9
...
```
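To make the format concrete, here is a minimal sketch that parses the example text above with Python's built-in `csv` module instead of `datascience`. The names `csv_text`, `header`, and `rows` are illustrative assumptions and are not used elsewhere in this lab.

```python
import csv
import io

# The example file contents from above, held in a string instead of a file.
csv_text = "Col_1,Col_2,Col_3\n1,2,3\n4,5,6\n7,8,9\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # the first line holds the column labels

# Every remaining line is one row; values arrive as text, so convert them.
rows = [[int(value) for value in row] for row in reader]

print(header)
print(rows)
```

Each inner list in `rows` corresponds to one line of the file, which is the same row-by-row layout a table has once the file is loaded.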

The `datascience` library provides a function to read these files: `Table.read_table`. To read a file, pass its relative path to the function as a string. For example, the dataset for this notebook is stored in the `data` directory under the filename `prejudice.csv`, so we load it with the call below:

In [None]:
prejudice = Table.read_table("data/prejudice.csv")
prejudice

## 2. Random Sampling and the Mean, Median, and Mode <a id="section-2"></a>

In this section, we will cover sampling tables and descriptive statistics.

### 2a. Sampling from a Table <a id="section-2a"></a>

To sample rows from a table, we use the `Table.sample()` function. This function defaults to sampling **with replacement**, meaning that the same row can appear in the sample more than once. To draw a sample with replacement, call `sample` on the table and pass it the number of rows we want:

In [None]:
n = 20
prejudice.sample(n)
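The difference between sampling with and without replacement can be sketched with Python's built-in `random` module. This is an illustration on made-up row numbers, not the code `datascience` runs internally; `row_ids`, `with_repl`, and `without_repl` are hypothetical names.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

row_ids = list(range(10))  # stand-in for the row numbers of a 10-row table

# With replacement: each draw is independent, so a row may be picked again.
with_repl = random.choices(row_ids, k=10)

# Without replacement: once a row is picked, it cannot be picked again.
without_repl = random.sample(row_ids, k=10)

print(sorted(with_repl))
print(sorted(without_repl))
```

The second list always contains each row number exactly once, while the first may repeat some rows and omit others. In `datascience`, passing `with_replacement=False` to `sample` requests the second behavior.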

<div class="alert alert-info">

**QUESTION:** Create a new sample of 25 rows from `prejudice`. Store this sample as `my_sample`.

</div>

In [None]:
n = ...
my_sample = ...

### 2b. Descriptive Statistics <a id="section-2b"></a>

The `numpy` package, which we have previously discussed, provides tools to compute some descriptive statistics on arrays of numerical data. These functions take in arrays, not tables, so to use them we first need to extract data from our table.

#### Extracting Columns

To get an array of the values in a column of a table, we use the function `Table.column()`, which takes a column label as its parameter and returns an array of the values in that column.

In [None]:
my_values = my_sample.column("Prejudice")
my_values

With the array `my_values`, we can now compute some descriptive statistics on the sampled data.

#### Means

To compute the mean of a sample, pass the array to `np.mean`.

In [None]:
my_mean = np.mean(my_values)
my_mean
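As a quick check on what `np.mean` computes, the sketch below works out the mean by hand on a small made-up array and compares it with the `numpy` result. The array `toy_values` is an assumption for illustration and does not come from `prejudice.csv`.

```python
import numpy as np

toy_values = np.array([2, 4, 4, 5, 10])  # made-up numbers for illustration

# The mean is the sum of the values divided by how many values there are.
manual_mean = toy_values.sum() / len(toy_values)

numpy_mean = np.mean(toy_values)

print(manual_mean, numpy_mean)
```

Both approaches give the same number, so `np.mean` can be read as shorthand for "add everything up and divide by the count."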

#### Medians 

To find the median of a sample, pass the array to `np.median`.
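The median is the middle value once the data are sorted; with an even number of values, it is the average of the two middle values. The sketch below spells this out on made-up lists and compares the result with `np.median`. The helper `manual_median` and both lists are hypothetical names used only here.

```python
import numpy as np


def manual_median(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value.
        return float(ordered[middle])
    # Even count: average the two values around the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_values = [7, 1, 3, 9, 5]   # made-up data with an odd count
even_values = [7, 1, 3, 9]     # made-up data with an even count

print(manual_median(odd_values), np.median(odd_values))
print(manual_median(even_values), np.median(even_values))
```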

<div class="alert alert-info">

**QUESTION:** Save the median of your sample values `my_values` to `my_median` using `np.median`.

</div>

In [None]:
my_median = ...
my_median

#### Modes

The `numpy` library does not have a function to compute the mode of a dataset. However, the `scipy` library has a submodule, `stats`, that _does_ have a mode function, and we have already imported it for you. The function is called as `stats.mode()`, and it also accepts arrays of values.
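To see what a mode is before calling `stats.mode()`, the sketch below counts values by hand with Python's built-in `collections.Counter`. This is not how `scipy` implements it; the lists and the helper `find_modes` are made up for illustration.

```python
from collections import Counter


def find_modes(values):
    """Return every value that occurs the maximum number of times."""
    counts = Counter(values)            # maps each value to its frequency
    highest = max(counts.values())      # the largest frequency observed
    return sorted(v for v, c in counts.items() if c == highest)


toy_values = [3, 1, 2, 3, 2, 3]   # 3 appears most often
tied_values = [1, 1, 2, 2]        # 1 and 2 are tied

print(find_modes(toy_values))
print(find_modes(tied_values))
```

A dataset can have more than one mode when values tie for the highest count. As far as its documentation describes, `stats.mode()` reports only one value in that case (the smallest of the tied values), so it is worth checking for ties when interpreting its output.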

<div class="alert alert-info">

**QUESTION:** Fill in the function call below to set `my_mode` to the mode of your sample.

</div>

In [None]:
my_mode = stats.mode(...)
my_mode

<div class="alert alert-info">

**QUESTION:** Which of the values computed above is most representative of the sample you drew from the dataset?

</div>

_Type your answer here, replacing this text._

---

## Submission

Congrats on finishing another lab notebook! To turn in this lab assignment, go to File > Download as > PDF via Chrome and upload the PDF to bCourses.

---
Notebook developed by: Chris Pyles

Data Science Modules: http://data.berkeley.edu/education/modules