### From Series to DataFrames, Boolean indexing and analyzing distributions 

#### Via the Pfizer COVID clinical trial

In [2]:
import pandas as pd 

df = pd.read_csv("clinical_trial.csv")

### Debugging practice

In this class, we are going to try to cultivate a data scientist mindset by adopting a hacker ethic, a willingness to embrace confusion in order to build our skills, scientific values, and computational thinking.

(We talked about this on the first day of class!)

Part of cultivating computational thinking means becoming comfortable with practices like iterating and debugging. As you know, your code often won't work the first time, so debugging is a skill worth cultivating. Here is a great [blog post](https://jvns.ca/blog/2019/06/23/a-few-debugging-resources/) on debugging skills. Let's practice this skill now.

Remember, you can think of a DataFrame as a collection of Series objects, each of which behaves much like a list. That means we can build a DataFrame from a dictionary of lists, like this.

In [43]:
pd.DataFrame({'numbers': [1,2,3],
              'letters': ["A","B", "C"]})

Unnamed: 0,numbers,letters
0,1,A
1,2,B
2,3,C
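
To make the "collection of Series" idea concrete, here is a minimal sketch. The variable names `toy` and `numbers` are made up for illustration. It pulls one column back out of a small DataFrame and checks that it is a Series that converts to an ordinary list.

```python
import pandas as pd

# Hypothetical toy DataFrame, built the same way as the cell above
toy = pd.DataFrame({"numbers": [1, 2, 3],
                    "letters": ["A", "B", "C"]})

# Selecting a single column by name returns a Series
numbers = toy["numbers"]
print(type(numbers))

# A Series can be converted back into a plain Python list
print(numbers.tolist())

# Every column in a DataFrame shares the same row index
print(toy["numbers"].index.equals(toy["letters"].index))
```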


But you will get an error on the cell below if you run it. Why do you think that is the case? Can you debug the error?

Before you get started:
- Discussion of reading stack traces and error messages
- Reading the source, the nuclear option of bug fixing. Often not needed, but good to know there is no deep/hidden mystery here. It's just Python! See also ["Read the source, Luke"](https://blog.codinghorror.com/learn-to-read-the-source-luke/).

In [44]:
names = ["DeGette", "Neguse", "Boebert"]
party = ["D", "D", "R", "R"]
hometown = ["Denver", "Lafayette", "Rifle", ]

df3 = pd.DataFrame({'names': names,
                    'party': party,
                    'hometown': hometown
                    })

df3

ValueError: All arrays must be of the same length

## Subsetting DataFrames

It is important to be able to split up DataFrames. The simplest way to do this is using `iloc` (short for integer location). This allows you to select rows from a DataFrame by their integer position, counting from 0.
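
As a rough sketch of what "position" means here, the toy DataFrame below (hypothetical data, not the clinical trial file) uses index labels that deliberately differ from the row positions. `iloc` ignores the labels and counts rows from the top.

```python
import pandas as pd

# Hypothetical toy data: the index labels (10, 20, 30, 40) are NOT the positions
toy = pd.DataFrame({"letters": ["A", "B", "C", "D"]},
                   index=[10, 20, 30, 40])

# Rows at positions 1 and 2; like list slicing, the stop position (3) is excluded
middle = toy.iloc[1:3]
print(middle)

# A single integer selects one row, returned as a Series; -1 means the last row
last_row = toy.iloc[-1]
print(last_row)
```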

### Check in

What do you think this next cell is doing?

In [56]:
df.iloc[0:2]

Unnamed: 0,group,covid
0,treatment,False
1,control,False


What about this one? What changed?

In [57]:
df.iloc[0:4, :]

Unnamed: 0,group,covid
0,treatment,False
1,control,False
2,treatment,False
3,control,False


What is the meaning of the comma in the above statement?

How do you only select the first 5 values of the "covid" column using `iloc`?

### Boolean indexing

- Another way to subset DataFrames is via Boolean indexing
- Remember, a Boolean variable takes the value True or False
- The operation below is "vectorized", meaning each item in the Series `df["group"]` is compared to the string "control" in a single step, producing one True or False per row

In [6]:
booleanIndex = (df["group"] == "control")

booleanIndex

0        False
1         True
2        False
3         True
4        False
         ...  
29995     True
29996     True
29997    False
29998    False
29999     True
Name: group, Length: 30000, dtype: bool
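
One way to convince yourself of what "vectorized" means is to compare the one-line comparison against an explicit loop. This is a small sketch on a made-up Series (`groups` is a hypothetical stand-in for `df["group"]`).

```python
import pandas as pd

# Hypothetical stand-in for df["group"]
groups = pd.Series(["treatment", "control", "treatment", "control"])

# Vectorized: one expression compares every item to "control"
vectorized = (groups == "control")

# The same comparison written out item by item
looped = [g == "control" for g in groups]

# Both approaches give the same True/False values
print(vectorized.tolist() == looped)
```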

In [4]:
type(booleanIndex)

pandas.core.series.Series

- We can then use the booleanIndex to select rows from our dataset like this

In [7]:
df2 = df[booleanIndex]
df2

Unnamed: 0,group,covid
1,control,False
3,control,False
11,control,False
12,control,True
13,control,True
...,...,...
29991,control,False
29992,control,False
29995,control,False
29996,control,False
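
The same pattern can be tried on a DataFrame small enough to check by eye. This is only a sketch with made-up data (`toy`, `mask`, and `controls` are hypothetical names). It shows that the rows kept are the ones where the mask is True, and that they keep their original index labels.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the trial dataset
toy = pd.DataFrame({"group": ["treatment", "control", "control", "treatment"],
                    "covid": [False, True, False, False]})

# Build the Boolean mask, then use it to keep only the matching rows
mask = (toy["group"] == "control")
controls = toy[mask]

# True counts as 1, so summing the mask gives the number of rows kept
print(mask.sum(), len(controls))

# The surviving rows keep their original index labels
print(controls.index.tolist())
```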


### Check in 

How many rows are there in df2? Are there more rows or fewer than df? Why is that the case?