# Overview
In this final notebook, we'll explore `pandas` and use it to load data, transform it, perform analysis, and create visualizations. This notebook will be much more interactive than the previous ones, so get ready to code!

# pandas

One great library for working with data in Python is `pandas` (pronounced "pan-dis" - not like the bear). Pandas offers high-performance data structures for processing data in a tabular format. It is one of the most useful tools for a data scientist and we'll use it frequently throughout this course. 

# Loading and viewing data
In the previous notebook, we were working with small samples of dummy data. We'll often want to load data from another source before analyzing it.


Ten data points are not enough to do any real analysis or gain any interesting insights. Let's instead use the entire dataset. This dataset has been saved to this directory as **"500_Person_Gender_Height_Weight_Index.csv"**. It originally came from [Kaggle](https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex/data).

We'll start by reading in the CSV file containing the data. Pandas can read a number of file formats. First, import pandas with the alias `pd`.

In [None]:
import pandas as pd

# Let's re-import our visualization modules
import seaborn as sns
sns.set()

import matplotlib.pyplot as plt
%matplotlib inline

Then, call the function `read_csv` with the filepath as an argument. Save the result as `df` (short for **"dataframe"**).

In [None]:
df = pd.read_csv("500_Person_Gender_Height_Weight_Index.csv")
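`read_csv` accepts a file-like object as well as a file path, so one quick way to see what it does is to parse a small CSV held in memory. The sketch below uses made-up rows (not taken from the real file) and a hypothetical variable name, `sample_df`:

```python
import io

import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
csv_text = """Gender,Height,Weight,Index
Male,180,80,2
Female,165,90,4
Female,170,60,2
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # column types inferred from the text
```

The first line of the text becomes the column names, and pandas infers a type for each column from its values.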

Let's take a quick look at what the data looks like. We can look at the first 5 rows of a dataframe by calling the dataframe's `head` method:

In [None]:
df.head()

From looking at this, we can see that there are 4 columns. Most are self-explanatory, but the **Index** variable may need some additional explanation. This is provided on Kaggle's website:

Index :

0 - Extremely Weak

1 - Weak

2 - Normal

3 - Overweight

4 - Obesity

5 - Extreme Obesity

**Note**:
- Height is in cm
- Weight is in kg

In addition to looking at the first rows, let's get a summary of the dataset as a whole. We can get summary statistics of the numerical columns by calling `df.describe()`. This gives the count, mean, standard deviation, min/max, and quartiles for each variable. Note that **"Gender"** is not included here because it is not numerical. We'll have to do some additional analysis later for that column.

In [None]:
df.describe()
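To make the numeric-only behavior concrete, here is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are made up). It also shows that calling `describe()` directly on a text column returns a different kind of summary:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Height": [180, 165, 170, 175],
})

# On a DataFrame with mixed columns, only numeric columns are summarized
numeric_summary = toy.describe()

# On a single text column, describe() reports count, unique, top, and freq
gender_summary = toy["Gender"].describe()

print(numeric_summary)
print(gender_summary)
```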

We can access a single column by indexing with the column's name, similar to getting the value in a dictionary:

In [None]:
# Column
df["Gender"]

We can then call methods on that specific column. For example, we can get descriptive statistics like the min, max, and mean:

In [None]:
df["Height"].mean()

In [None]:
df["Height"].min()

In [None]:
df["Height"].max()

#### TODO
Get the mean, min, and max values of the **"Weight"** column.

In [None]:
# Mean
df[____].mean() 

In [None]:
# Min
df["Weight"].____()

In [None]:
# Max
df[___].____()

## Indexing and Accessing Certain Values

We can get specific rows by using a numerical index on the `df.iloc` attribute. This works the same way as indexing and slicing lists or other ordered sequences in Python:

In [None]:
# Just the first row
df.iloc[0]

In [None]:
# Rows at positions 5 through 9 (the end of the slice, 10, is excluded)
df.iloc[5:10]

In [None]:
# The height and weight for the first 10 rows:
df.iloc[:10][["Height", "Weight"]]
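The slicing rules are easier to check on a tiny example. The sketch below uses a hypothetical five-row DataFrame (`toy`) to show that the end of a slice is excluded, that negative positions count from the end, and that `iloc` also accepts a row position and a column position together:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "Height": [150, 160, 170, 180, 190],
    "Weight": [50, 60, 70, 80, 90],
})

first_row = toy.iloc[0]    # a single row comes back as a Series
middle = toy.iloc[1:3]     # positions 1 and 2; position 3 is excluded
last_two = toy.iloc[-2:]   # negative positions count from the end
cell = toy.iloc[0, 1]      # row position 0, column position 1 ("Weight")

print(middle)
print(cell)
```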

## Aggregating by Variables
We can call methods on the columns of a DataFrame to do additional analysis on specific variables. Let's look at two categorical variables: **Gender** and **Index**. With categorical variables, we might want to get the count of rows where the variable takes on a certain value. For example, how many rows are **Male** vs. **Female**?

One way to do this is by using the `groupby` method. We group the dataframe by a column name and then call `size()` to get the count. The cell below shows how to do this with the **"Gender"** column. 

**Note**: This is equivalent to using a `GROUP BY` clause in SQL:

```sql
SELECT
    Gender, COUNT(*)
FROM bmi
GROUP BY Gender
```

In [None]:
df.groupby("Gender").size()
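As a point of comparison, a column's `value_counts()` method gives the same counts as `groupby(...).size()`, only ordered differently. This is a small sketch on hypothetical toy data, not the course dataset:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

by_group = toy.groupby("Gender").size()   # ordered by group label
by_value = toy["Gender"].value_counts()   # ordered by count, largest first

print(by_group)
print(by_value)
```

`groupby` is the more general tool, since other aggregations such as `mean()` can be applied to the groups in the same way.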

#### TODO
Get the count of each value of **Index**.

In [None]:
df.____("Index").____()

## Calling operations on columns and assigning new columns
We can add, subtract, or do other operations on pandas columns, just like we can with other variables. For example, multiplying a column by a scalar value will multiply each element of the column by that value:

In [None]:
df["Height"].head()

In [None]:
df["Height"].head() * 2

The same works for addition. Note that the operation behaves according to the datatype of the elements: numbers are added, while strings are concatenated. Remember how we added two strings together?

In [None]:
"This person is " +  df["Gender"].head()

We can also perform operations using multiple columns. 
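As a minimal sketch of a multi-column operation, the example below multiplies two columns row by row and stores the result as a new column. The `rooms` DataFrame and its column names are hypothetical and unrelated to the BMI data:

```python
import pandas as pd

# Hypothetical toy data, unrelated to the BMI dataset
rooms = pd.DataFrame({
    "length_m": [3.0, 4.0, 5.0],
    "width_m": [2.0, 2.5, 4.0],
})

# Each row's length is multiplied by the same row's width
area = rooms["length_m"] * rooms["width_m"]

# Assigning a Series to a new column name adds that column
rooms["area_m2"] = area

print(rooms)
```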

Let's use the height and weight columns to create a new column with **BMI** measurements. BMI is calculated from height and weight using this equation:

$$BMI = \frac{Weight (kg)}{Height (m) ^ 2}$$

#### TODO
To get the BMI measurement, we'll need to go through a few steps:
1. Convert **Height** from centimeters to meters. We'll save this as a new variable called `height_m`
2. Square the **Height in meters** column. Save this as a variable called `height_m_sqrd`
3. Divide the **Weight** column by the squared **Height in meters** column. Save this as a variable called `bmi`
4. Assign the result to a new column in `df`

**1. Convert Height from centimeters to meters**

In [None]:
height_m = ___

In [None]:
height_m.head()

**2. Square the *Height in meters* column**

In [None]:
# Square height_m
height_m_sqrd = ____

**3. Divide the *Weight* column by the squared *Height in meters* column**

In [None]:
# Divide the weight column by the height in meters squared
bmi = ____

In [None]:
bmi.head()

**4. Assign the result to a new column in `df`**

In [None]:
# Now add as a column in the DataFrame
df[___] = bmi

Now we should have a column called **BMI** in our dataset.

In [None]:
df.head()

## Pandas Plotting
Pandas also contains useful methods for plotting the data in dataframes. When combined with seaborn, this allows us to create powerful visualizations using datasets. Let's generate a histogram to look at the distribution of the BMI which we just calculated:

#### Discussion
Does this histogram show a **normal** distribution?

In [None]:
df["BMI"].hist()
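A histogram splits the range of values into equal-width bins and counts how many values fall into each one. To see the numbers behind the bars, here is a rough sketch using `numpy.histogram` on a made-up Series (`values` is hypothetical). Four bins are used to keep the output short, whereas `hist()` uses 10 bins by default:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Split the range 1 to 5 into 4 equal-width bins and count values in each
counts, edges = np.histogram(values, bins=4)

print(edges)   # the 5 boundaries of the 4 bins
print(counts)  # how many values fall into each bin
```

Note that the last bin includes its right edge, so the value 5 is counted together with the 4s.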

#### TODO
Generate histograms for the height and weight columns.

In [None]:
# Height
df[____].hist()

In [None]:
# Weight
____

Instead of a histogram for the **Index** and **Gender** columns, which are categorical, let's generate bar graphs. We can first get the count of values of male vs. female patients by calling the `groupby().size()` method, then calling `.plot.bar()`:

In [None]:
df.groupby("Gender").size().plot.bar()

Seaborn's version of this bar plot is called `countplot` and will assign a different color to each bar. We provide the dataframe in the `data` keyword argument and specify the column to use in the `x` keyword argument:

In [None]:
sns.countplot(x="Gender", data=df)

#### TODO
Create bar graphs for the **Index** column using both `.plot.bar()` and `sns.countplot()`:

In [None]:
df.____("Index").____().plot.____()

In [None]:
sns.countplot(x=____, ____=df)

## Boolean indexing
Earlier, we saw how we can access specific subsets of the data using column names or row indices. Next, we'll see how we can filter the dataset based on conditions. 

Let's say we want to disaggregate the data by sex so that we can compare statistics between the female and male populations. It would be useful to split these data points into two separate dataframes. **Boolean indexing** allows us to evaluate a condition and then filter to rows where that condition is True.

Here is the general syntax for filtering based on whether a column is equal to some value:

```python
df[df["column_name"] == value]
```

... or greater than:
```python
df[df["column_name"] > value]
```

... and so on.

Here is an example using our data:

In [None]:
female = df[df["Gender"] == "Female"]
male = df[df["Gender"] == "Male"]

In [None]:
female.head()

In [None]:
male.head()
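It can help to look at the intermediate step: the comparison itself produces a Series of True/False values, and indexing with that Series keeps only the True rows. The sketch below uses hypothetical toy data and also shows one way to combine two conditions:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [180, 165, 172, 168],
})

# The comparison produces one True/False value per row
mask = toy["Gender"] == "Female"

# Indexing with the mask keeps only the rows where it is True
females = toy[mask]

# Combine conditions with & (and) or | (or); each needs its own parentheses
tall_females = toy[(toy["Gender"] == "Female") & (toy["Height"] > 170)]

print(mask)
print(tall_females)
```

The parentheses matter because `&` binds more tightly than `==` and `>` in Python.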

#### TODO
Create two new dataframes: `norm` and `sev_obese`. `norm` contains all rows which have an **Index** of 2 (normal BMI) and `sev_obese` contains rows with an **Index** of 5 (severely obese). We can then compare and contrast the other measurements for these two populations.

In [None]:
norm = df[df[____] == 2]
____ = ____

In [None]:
sev_obese.head()

Let's compare the heights and weights of these two groups. In the cell below, the code to plot the height and weight for `norm` has already been completed. Uncomment the second line of code and edit it so you can plot the `sev_obese` group as well. Use a different color for each group and fill in the `label` keyword argument so we can visually differentiate between the two groups.

Note that we are using the same plot figure to plot both scatterplots.

In [None]:
ax = norm.plot.scatter(x="Weight", y="Height", color="C0", label="Normal Weight")
# ax = ____.plot.scatter(x="Weight", ____, ____=____, ax=ax, label=____)

# Increase the plot size
fig = plt.gcf()
fig.set_size_inches(10, 6)

#### TODO
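Let's now do the same using a histogram to compare the distribution of weight between the two populations. We'll use `sns.distplot` so that we can see the kernel density estimate of the data. Newer versions of seaborn deprecate `distplot` in favor of `histplot` and `displot`, so you may see a warning when running this cell.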

In [None]:
ax = sns.distplot(norm["Weight"], label="Normal BMI")
sns.distplot(____["Weight"], ax=ax, ____="Severely Obese")
ax.legend()
ax.grid(False)

# Next Steps
If you feel you need additional review of Python and SQL, there are two additional notebooks in this directory for quick review:

- [./03-python_review](./03-python_review.ipynb)
- [./04-sql_review](./04-sql_review.ipynb)

See Canvas for the homework assignment for next week.

Next week, we'll start to use **MIMIC-II**, a deidentified clinical database containing real-world data from an EHR. We'll combine Python and SQL to query this database and apply these tools we learned today to analyze medical data.