### From Series to DataFrames, Boolean indexing and analyzing distributions 

#### Via the Pfizer COVID clinical trial

In the next series of notebooks, we will consider whether the Pfizer COVID vaccine actually reduces your risk of getting COVID, and by how much. Does the Pfizer vaccine actually work? How certain are we that it works? How do scientists make these kinds of decisions? As an information scientist, you don't have to just trust what your uncle says on Facebook (or TikTok). By the end of 2301 you will be able to reason through these questions yourself.

To answer these questions, we will need to learn the following:
- the pandas DataFrame
- Boolean indexing
- grouping data
- analyzing distributions (at least to start)

### Warm up 

How do you think scientists know the COVID vaccine works?

[Your thoughts here]

Patient-level data from the Pfizer [clinical trial](https://www.pfizer.com/research/clinical_trials/trial_data_and_results/data_requests/) is available to researchers, but the public only sees it in summary form. So for our purposes, I simulated data from the trial based on [aggregate data](https://www.fda.gov/media/144246/download) from Pfizer.

- I am going to show you how I simulated the data after we are done.
- In general, you don't get to see the real data generating process. That is a mystery from nature that you can't actually observe. But you *can* make inferences about the data generating process, based on the observed data.
- Let's get started by reading the simulated data into a pandas DataFrame.

In [31]:
import pandas as pd 

df = pd.read_csv("clinical_trial.csv")

df

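| Unnamed: 0 | group | covid |
|---|---|---|
| 0 | treatment | False |
| 1 | control | False |
| 2 | treatment | False |
| 3 | control | False |
| 4 | treatment | False |
| ... | ... | ... |
| 29995 | control | False |
| 29996 | control | False |
| 29997 | treatment | False |
| 29998 | treatment | False |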


### Check in

Why does the DataFrame show a "..." in the middle?

[Your answer here]

### Check in

Based on this data frame, do you think the COVID vaccine works? 

What steps would you take to figure that out using this data? (Even if you don't know the commands yet in pandas.)

[Your thoughts here]

### A strategy for answering

To answer these questions about the COVID vaccine, we will need to use the Pandas API and quantitative reasoning! 

**The point of this class is not software. Software is a means to quantitative and computational analysis.**

### What is a data frame? 

- A Series is a 1D collection of observations
- A DataFrame is a 2D collection of observations. You can think of it as a generalization or extension of a Series.
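
To make the 1D versus 2D distinction concrete, here is a minimal sketch. The values are made-up toy data (not the trial file), and the names `covid` and `toy` are just for illustration.

```python
import pandas as pd

# Hypothetical toy data, not read from clinical_trial.csv
covid = pd.Series([False, True, False], name="covid")

toy = pd.DataFrame({
    "group": ["treatment", "control", "treatment"],
    "covid": [False, True, False],
})

print(covid.ndim, covid.shape)  # 1 (3,)
print(toy.ndim, toy.shape)      # 2 (3, 2)

# Pulling a single column out of a DataFrame gives back a Series
print(type(toy["covid"]))
```

One way to read this: a DataFrame behaves like several Series of the same length lined up side by side.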

#### Indexes 

- Like a series, a DataFrame has an index. 
- Where the index of a Series labels each individual observation, the index of a DataFrame labels each row.
- If you hear "index" for a DataFrame, think rows.

#### Column names

- Additionally, because a DataFrame is 2D, a DataFrame has columns.
    - The column names are "indexes" along the second dimension of a DataFrame. But in the pandas way of thinking, they are separate from the index and use different syntax.
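
The difference in syntax between rows and columns can be sketched on a small toy frame. The row labels `p1`, `p2`, `p3` and the values are hypothetical, chosen only so that the index labels are easy to tell apart from positions.

```python
import pandas as pd

# Hypothetical toy data with explicit row labels as the index
toy = pd.DataFrame(
    {
        "group": ["treatment", "control", "treatment"],
        "covid": [False, True, False],
    },
    index=["p1", "p2", "p3"],
)

# Rows are selected by index label with .loc
row = toy.loc["p2"]

# Columns are selected by column name with square brackets
col = toy["group"]

print(row)
print(col)
```

Both selections return a Series: `row` is labeled by the column names, while `col` is labeled by the row index.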

### Check in

- What are the column names of `df` above?
- What is the index? 

Print these things out programmatically using code!

[Your code here]

### Creating a DataFrame

There are lots of ways to make a DataFrame. In this class we will often read them in from files using the `read_csv` function. Review the [documentation](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html) to investigate a few more.
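
As a starting point, here is a minimal sketch of three common routes to the same small DataFrame. The data is made up for illustration, and the CSV text is held in memory with `io.StringIO` so that the example does not depend on any file.

```python
import io

import pandas as pd

# 1. From a dict mapping column names to lists of values
from_dict = pd.DataFrame({
    "group": ["treatment", "control"],
    "covid": [False, True],
})

# 2. From a list of records (one dict per row)
from_records = pd.DataFrame([
    {"group": "treatment", "covid": False},
    {"group": "control", "covid": True},
])

# 3. From CSV text; read_csv accepts a file path or a file-like object
csv_text = "group,covid\ntreatment,False\ncontrol,True\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict)
print(from_records)
print(from_csv)
```

These are only a few of the options; the pandas documentation lists others.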

[What are some ways to create a DataFrame?]

### Getting to know your data 

Pandas has a number of tools for getting to know your data. Experiment with each of the following expressions using the `df` data frame and try to figure out what they do.


- `df.head()`
- `df.sample(5)`
- `len(df)`
- `len(df.columns)`
- `df.dtypes`