In [1]:
import pandas as pd 

df = pd.read_csv("clinical_trial.csv")

df

Unnamed: 0,group,covid
0,treatment,False
1,control,False
2,treatment,False
3,control,False
4,treatment,False
...,...,...
29995,control,False
29996,control,False
29997,treatment,False
29998,treatment,False


### Counting items in a group

- Pandas offers many ways to count items in a Series or DataFrame.
- One straightforward method is to call [`value_counts`](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.Series.value_counts.html) on a Series object.
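
As a minimal sketch of what `value_counts` returns, the cell below uses a small made-up Series (`fruit` is hypothetical toy data, not part of the clinical trial dataset). It counts each distinct value, and the optional `normalize=True` argument reports proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical toy data, not the clinical trial dataset
fruit = pd.Series(["apple", "pear", "apple", "plum", "apple", "pear"])

# Count how many times each distinct value appears (most frequent first)
counts = fruit.value_counts()
print(counts)

# normalize=True reports each value's share of the total instead of a count
shares = fruit.value_counts(normalize=True)
print(shares)
```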

1. Use the `value_counts` method to count how many patients got covid in our dataset.

In [4]:
# Your code here

2. What happens if you call `value_counts` on the whole DataFrame instead of on a Series?

In [None]:
# Your code here

### Sorting 

- It is also possible to sort a Series so that its values appear in order. By default, low values come first and high values come last.
- You can do this by calling the `sort_values` method on a Series.
- In the cell below, use the `sort_values` method to show the Series `simple_demo` in sorted order.

In [98]:
simple_demo = pd.Series([.9, .3, .5, .7])
# your code here

1    0.3
2    0.5
3    0.7
0    0.9
dtype: float64

- The previous example shows how to sort a Series. Often, though, you will be working with DataFrames instead. Here again you can use the `sort_values` method.
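
A minimal sketch of sorting a DataFrame, using a made-up table (`pets` and its columns are hypothetical and chosen only for illustration). It sorts the rows on a single column; the documentation linked further down describes the other options.

```python
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({"name": ["Rex", "Milo", "Luna"], "age": [7, 2, 4]})

# Sort the rows by the values in the "age" column (lowest first by default)
by_age = pets.sort_values(by="age")
print(by_age)
```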

In [100]:
import numpy as np
np.random.seed(42)
x = np.random.randint(1000, size=5)
y = np.random.randint(1000, size=5)
df = pd.DataFrame({"x": list(x), "y": list(y)})
df = df.astype({"x": int, "y": int})
df

Unnamed: 0,x,y
0,102,71
1,435,700
2,860,20
3,270,614
4,106,121


- Sort the above DataFrame by the `x` and `y` columns.
- You should consult the [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) for guidance on how to do this. Again, using documentation is a **crucial** skill for working with data computationally.

In [None]:
# Your code here

#### Check in

In your code above, you may have used a `by` parameter to sort a DataFrame. Why do you think you need to pass a `by` parameter when sorting a DataFrame but not a Series?

[your answer here]

#### Check in

Sometimes you need to sort to see the lowest values. Other times you need to sort to see the highest values. Using the pandas documentation for guidance, sort the DataFrame from the previous cells so that higher numbers for the column `y` come first.

In [None]:
# Your code here

#### Check in 

Remember that both Series and DataFrame objects in pandas have indexes. With that in mind, what do you think is happening in the next cell? If you access `.index` after calling `sort_values`, what do you think it shows?

In [107]:
df.sort_values(by="y").index

Int64Index([2, 0, 4, 3, 1], dtype='int64')
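
To compare against your answer, here is a small sketch on made-up data (`scores` and its letter labels are hypothetical). It prints the index of a sorted Series, then shows one way to replace those labels with a fresh 0, 1, 2, ... numbering using `reset_index(drop=True)`.

```python
import pandas as pd

# Hypothetical toy data with letter labels so the index is easy to follow
scores = pd.Series([30, 10, 20], index=["a", "b", "c"])

# Sort by value, then look at the index of the result
ordered = scores.sort_values()
print(list(ordered.index))

# reset_index(drop=True) discards the old labels and renumbers from 0
renumbered = ordered.reset_index(drop=True)
print(list(renumbered.index))
```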

### Pivot tables 

- A pivot table offers a way to summarize the data in a dataset.
- The basic idea of a pivot table is to present summary statistics about subsets of your data.
- To compute those summary statistics, you use an "aggregation function", or "agg function".
- Let's start with a very simple pivot table and build up intuition for this concept.

In [129]:
df = pd.DataFrame({"income": [100, 53, 40, 254, 53], 
                   "city": ["Denver", "Boulder", "Denver", "Boulder", "Boulder"],
                   "gender": ["F", "M", "M", "F", "F"]})

- The pivot tables in this section all have an index, which organizes the data into groups.
- The index is the key you use to group the rows in a pivot table.
- What is happening in the cell below?

In [130]:
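# values= limits the summary to the numeric column; newer pandas versions
# may raise an error when asked to aggregate the text column `gender`
df.pivot_table(index='city', values='income')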

Unnamed: 0_level_0,income
city,Unnamed: 1_level_1
Boulder,120
Denver,70


It is possible to group by more than one key. What is happening in the cell below?

In [131]:
df.pivot_table(index=['city', 'gender'])

Unnamed: 0_level_0,Unnamed: 1_level_0,income
city,gender,Unnamed: 2_level_1
Boulder,F,153.5
Boulder,M,53.0
Denver,F,100.0
Denver,M,40.0


- How is the data in the above tables aggregated?
- It is possible to change the way the values in each group are aggregated.
- What are other kinds of summary statistics? Think back to earlier INFO classes.

- What is happening in the code below?

[your answer here]

In [132]:
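# Passing the name "max" as a string avoids the warning that recent pandas
# versions emit for np.max; the result is the same
df.pivot_table(index=['city', 'gender'], aggfunc="max")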

Unnamed: 0_level_0,Unnamed: 1_level_0,income
city,gender,Unnamed: 2_level_1
Boulder,F,254
Boulder,M,53
Denver,F,100
Denver,M,40
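
As a further sketch, `aggfunc` can also take a list of function names, which computes several summary statistics in one table. The `sales` DataFrame below is made-up toy data, and the choice of `"min"`, `"median"` and `"max"` is only an illustration.

```python
import pandas as pd

# Hypothetical toy data
sales = pd.DataFrame({
    "store": ["north", "south", "north", "south", "south"],
    "amount": [10, 20, 30, 40, 60],
})

# A list of function names produces one group of columns per statistic
summary = sales.pivot_table(index="store", values="amount",
                            aggfunc=["min", "median", "max"])
print(summary)
```

The resulting columns have two levels: the statistic name on top and the original column name (`amount`) underneath.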
