# Chapter 02: Data, Dataframes, and Pandas

## Experiments, Outcomes, Datapoints, and Dataframes

We define a **set** as a collection of items, sometimes called elements. A set is typically denoted by a capital letter and its elements are listed inside curly braces, for example $A = \{1, 2, 3\}$.

We use $\mathcal{G}$ to denote the set of all possible outcomes from an experiment and call this set the **sample space**.
The term *"experiment"* has a broad meaning.
An experiment can mean everything from a randomized controlled trial to an observational study.
Experiments are what produce observations that we collect and try to characterize.
Here an **experiment** is the process that generates outcomes.

An **outcome** is defined as an element of the sample space, some result of the experiment that is recorded.
An outcome is a single observation from an experiment, and we define an **event** as a set, or group, of outcomes.
Most often we use $o_{i}$ to denote an outcome and $E_{i}$ to denote an event.
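
The relationship between a sample space, an outcome, and an event can be sketched with Python's built-in `set` type. The die-rolling experiment and the variable names below are illustrative assumptions, not part of the text above.

```python
# Hypothetical experiment: roll a six-sided die once and record the face.
sample_space = {1, 2, 3, 4, 5, 6}   # the set of all possible outcomes

outcome = 4                          # one element of the sample space
event_even = {2, 4, 6}               # an event: the set of even faces

print(outcome in sample_space)       # an outcome is an element of the sample space
print(event_even <= sample_space)    # an event is a subset of the sample space
print(outcome in event_even)         # this outcome belongs to the event
```

Here `<=` is Python's subset test for sets, so the second line checks that every outcome in the event also belongs to the sample space.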

The sample space, event, and outcome are all potential results from an experiment.
When we conduct an experiment we will generate an outcome from our sample space and call this realized outcome a **data point**.

---

*Example: Flipping a coin*
Consider the experiment of flipping a coin and recording whether the coin lands heads or tails side up.  
We can define a sample space $\mathcal{G} = \{H,T\}$ where $H$ is an outcome that represents the coin landing heads up and $T$ represents tails up. 
The sample space includes all the possible outcomes we wish to record: either the coin lands heads or it lands tails.

Up until this point we have structured our experiment, but we have not generated data.
We flip the coin and the coin lands tails side up. Now we have performed an experiment and generated the data point $T$.
We flip again and record a heads. $H$ is the second data point, and so on. 

---

Now suppose that we conduct an experiment with sample space $\mathcal{G}$ a total of $N$ times, and each time we record a data point.
A tuple (i.e. ordered list) of data points $d$ is called a **data set** $\mathcal{D} = (d_{1}, d_{2}, d_{3}, \cdots, d_{N})$ where $d_{i}$ is the data point generated from the $i^\text{th}$ experiment.
We say that we have _drawn_ or that we have _sampled_ a data set $\mathcal{D}$.
Further, data points $(d)$ are often called _realized_ outcomes because they are no longer in a set of potential possibilities but are now determined items.
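
Drawing a data set can be sketched by simulating the coin flip experiment $N$ times. This is a minimal sketch that assumes a fair coin; the seed, the value of `N`, and the variable names are arbitrary choices made for illustration.

```python
import random

random.seed(0)                 # fixed seed (arbitrary) so the draw is repeatable

sample_space = ("H", "T")      # the coin's sample space, stored as a tuple so we can sample from it
N = 10                         # number of times the experiment is repeated (arbitrary)

# Each call to random.choice runs the experiment once and yields one data point.
data_set = tuple(random.choice(sample_space) for _ in range(N))

print(data_set)                # the realized outcomes, in the order they were drawn
print(len(data_set))           # the number of data points, equal to N
```

The tuple keeps the order of the draws, so `data_set[0]` plays the role of $d_{1}$ (Python counts from zero).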

A data set $\mathcal{D}$ can be unwieldy depending on the number of data points, the complexity of the sample space, or both.
A **data frame** is one way to organize a data set.
A data frame $\mathcal{F}$ is a table where each data point $d$ in a data set $\mathcal{D}$ is represented as a row in the table.
Each data point may contain multiple pieces of information.
That is, each data point may itself be a tuple.
In this case, a separate column is created for each position in the tuple.

---

*Example: Human predictions of infectious diseases*
Suppose we design an experiment to collect human predictions of the number of incident cases and incident deaths from COVID-19 at the US national level, two weeks ahead of the time of the experiment ([Source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7523166/)).
From each participant we collect whether they are an expert in the modeling of infectious disease, a prediction of incident cases, and a prediction of incident deaths.
We draw a data set $\mathcal{D}$ of 5 human judgment predictions.
We can organize this data set into a data frame:

| Expert | Prediction of cases | Prediction of deaths |
|--------|---------------------|----------------------|
| Yes    | 145                 | 52                   |
| No     | 215                 | 34                   |
| Yes    | 524                 | 48                   |
| Yes    | 265                 | 95                   |
| No     | 354                 | 35                   |

**Table:** Example data frame $\mathcal{F}$ built from a data set $\mathcal{D}$ that contains 5 data points where each data point is a tuple of length three.

---

Above, the first data point is $(\text{Yes},145,52)$, the second data point is $(\text{No}, 215, 34)$, and so on until the last data point $(\text{No}, 354, 35)$.
A data frame can also include descriptive information for readers, such as a label for each column.
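
The table above can be rebuilt in code as a rough sketch. It uses the pandas library, which is introduced in the next section, and the column names are illustrative choices rather than names taken from the original study.

```python
import pandas as pd

# The five data points from the table, each a tuple of length three.
data_set = (
    ("Yes", 145, 52),
    ("No", 215, 34),
    ("Yes", 524, 48),
    ("Yes", 265, 95),
    ("No", 354, 35),
)

# Illustrative column labels: one column per position in the tuple.
columns = ["expert", "predicted_cases", "predicted_deaths"]

# Each tuple becomes one row of the data frame.
data_frame = pd.DataFrame(list(data_set), columns=columns)

print(data_frame)
print(data_frame.shape)   # (number of rows, number of columns)
```

Each row corresponds to one data point and each column to one position in the tuple, matching the definition of a data frame given earlier.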


## Pandas 

The **pandas** module is one of the most widely used sets of tools in Python for working with data, data points, and data frames. The documentation for pandas is available [here](https://pandas.pydata.org/docs/).

Pandas provides a structured way to import, organize, access, and compute with data frames.
But first, we need to discuss the fundamental object in pandas: the **Series**.

### Series
A **Series** is (1) a list of items plus (2) an index, a list of labels (often strings), one for each item in (1).

The typical way to define a Series is by calling `pd.Series` and passing a list and an index.
For example, the first two data points of our coin flip experiment above were tails and then heads.
Let's assign "tails" the value 0 and "heads" the value 1 to make this more numerically friendly.

In [10]:
import pandas as pd                                     #<--Import pandas (only needed once)
coin_flips = pd.Series([0,1], index=["flip1", "flip2"]) #<--Create a Series
print(coin_flips)                                       # Print this out so we can see what this object looks like

flip1    0
flip2    1
dtype: int64


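We see that a Series object was created where all the values are integers.
A Series stores its values under a single dtype. If we supply values of different types, pandas converts them to a common type: integers mixed with floats (decimal values) are stored as floats, and numbers mixed with strings are stored under the generic `object` dtype.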
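
A quick way to see how pandas handles values of different types in one Series is to inspect the `dtype` attribute. This is a small sketch with made-up values, and the variable names are only for illustration.

```python
import pandas as pd

ints = pd.Series([0, 1])               # all integers
mixed_numbers = pd.Series([0, 1.5])    # an integer and a float
mixed_kinds = pd.Series([0, "H"])      # an integer and a string

print(ints.dtype)            # an integer dtype
print(mixed_numbers.dtype)   # float64: the integer was converted to a float
print(mixed_kinds.dtype)     # object: pandas falls back to generic Python objects
```

In each case the Series ends up with exactly one dtype that applies to every item it holds.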

The index for our series is displayed on the left side.
We can access items in a series using the index like this. 

In [11]:
coin_flips.get("flip1")

0

or like this 

In [12]:
coin_flips["flip1"]

0
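
Beyond `get` and square brackets, pandas also offers the `.loc` and `.iloc` accessors. The sketch below rebuilds the same `coin_flips` Series so that it runs on its own; the label `"flip3"` is deliberately one that does not exist.

```python
import pandas as pd

coin_flips = pd.Series([0, 1], index=["flip1", "flip2"])

print(coin_flips.loc["flip2"])       # select by index label
print(coin_flips.iloc[0])            # select by integer position (the first item)
print(coin_flips.get("flip3"))       # missing label: get returns None instead of raising an error
print(coin_flips.get("flip3", -1))   # or returns a default value that we supply
```

One practical difference is that `coin_flips["flip3"]` raises a `KeyError`, while `get` lets us decide what a missing label should return.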

A Series is not literally a Python dictionary, but it behaves much like one.
The "keys" are the index values and the "values" are the items in the list.
In fact, we can build a Series from a dictionary.

In [13]:
coin_flips = pd.Series({"flip1":0,"flip2":1})
coin_flips

flip1    0
flip2    1
dtype: int64
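
The dictionary-like behavior can be checked directly. The following is a minimal sketch using the same two coin flips; it relies only on the `keys`, `items`, and `to_dict` methods of a Series.

```python
import pandas as pd

coin_flips = pd.Series({"flip1": 0, "flip2": 1})

print("flip1" in coin_flips)       # membership tests the index labels, like dictionary keys
print(list(coin_flips.keys()))     # the index labels

for label, value in coin_flips.items():
    print(label, value)            # (label, value) pairs, like dict.items()

print(coin_flips.to_dict())        # convert back to a plain dictionary
```

Note that `in` checks the index labels, not the stored values, which mirrors how `in` works on a dictionary.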

Finally, pandas Series objects can, for the most part, be treated the same as NumPy arrays.
Series support vectorized operations, so arithmetic and comparisons apply to every item at once.
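
As a brief sketch of vectorized operations, suppose we had recorded five coin flips. The values below are hypothetical, chosen only so that the first two match the tails and heads recorded earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical flips, coded as 0 = tails and 1 = heads.
coin_flips = pd.Series([0, 1, 1, 0, 1],
                       index=["flip1", "flip2", "flip3", "flip4", "flip5"])

print(coin_flips.sum())      # number of heads, since heads is coded as 1
print(coin_flips.mean())     # proportion of heads
print(1 - coin_flips)        # element-wise arithmetic: swaps the 0/1 coding
print(coin_flips == 1)       # element-wise comparison: a Series of booleans
print(np.sum(coin_flips))    # NumPy functions accept a Series as well
```

No explicit loop is needed: each operation is applied to all items in the Series, and results that are themselves Series keep the original index.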