<img src="../../shared/img/banner.svg" width=2560></img>

# Pandas and Seaborn Cheat Sheet

This notebook is intended as a reference for the pandas and seaborn material in hw01 and lab01.

If you want a reference for the pure Python material,
e.g. on lists, dictionaries, functions, and control flow (`for`, `if`, `else`),
check out [these cheat sheets](https://ehmatthes.github.io/pcc/cheatsheets/README.html).

If you've taken data8 or are otherwise familiar with the `datascience` package
and its `Table`s,
check out the notebook at
[this interact link](http://datahub.berkeley.edu/user-redirect/interact?account=ds-modules&repo=core-resources&branch=master&path=tabular-data/datascience%20to%20pandas.ipynb)
for a "Rosetta Stone" matching `datascience.Table`s to `pandas.DataFrame`s.
For the first lab and homework, you only need the material up to, but not including, **Grouping and Aggregating**.

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import matplotlib.pyplot as plt
import numpy as np

## Pandas

Pandas is typically abbreviated to `pd`.

In [None]:
import pandas as pd

### Creating a `DataFrame`

One view of a `DataFrame` is that it is like a _dictionary of lists_.
Each list represents a column of the dataframe,
and the key gives the column name.
All of the lists are the same size.

In [None]:
data_in_columns = {"columnA": ["0A", "1A"], "columnB": ["0B", "1B"]}

pd.DataFrame(data_in_columns)
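
The requirement that all of the lists are the same size is enforced by pandas. As a minimal sketch (the name `uneven_columns` is made up for illustration), a dictionary with one short list is rejected with a `ValueError`:

```python
import pandas as pd

# Hypothetical data: columnB is one element shorter than columnA.
uneven_columns = {"columnA": ["0A", "1A"], "columnB": ["0B"]}

try:
    pd.DataFrame(uneven_columns)
    failed = False
except ValueError as err:
    # pandas cannot line the two columns up into rows
    failed = True
    print("pandas refused:", err)
```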

Another view of a `DataFrame` is that it is like a _list of lists_.
In this case, each element in the outermost list represents a _row_
of the dataframe.
Again, all of the lists must be the same size.

We must add the column names manually, in this case.

In [None]:
data_in_rows = [["0A", "0B"], ["1A", "1B"]]

pd.DataFrame(data_in_rows, columns=["columnA", "columnB"])
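
The two views describe the same table. The sketch below rebuilds both versions under new, illustrative names (`from_columns`, `from_rows`) and checks that pandas considers them equal:

```python
import pandas as pd

# Column-oriented: a dictionary of lists, one list per column.
from_columns = pd.DataFrame({"columnA": ["0A", "1A"], "columnB": ["0B", "1B"]})

# Row-oriented: a list of lists, one inner list per row.
from_rows = pd.DataFrame([["0A", "0B"], ["1A", "1B"]], columns=["columnA", "columnB"])

# equals compares shape, labels, and every element.
same = from_columns.equals(from_rows)
print(same)
```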

The cell below generates a slightly larger dataframe to work with for the rest of the pandas section.

In [None]:
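N = 6  # number of rows in the toy dataframe

# species is chosen at random from two options;
# rating is a random integer from 8 to 10 (randint excludes the upper bound, 11)
toy_data = {"species": np.random.choice(["puppy", "kitty"], size=N),
            "rating": np.random.randint(8, 11, size=N)}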

toy_df = pd.DataFrame(toy_data)
toy_df

### Series

Under the hood, though, a `DataFrame` does not store its data in `list`s.

Instead, selecting a row or a column of a dataframe gives back a datatype called a `Series`.

In [None]:
# this cell selects a column
toy_df["species"]

In [None]:
type(toy_df["species"])

In [None]:
# this cell selects a row; see indexing section below
toy_df.iloc[0]

In [None]:
type(toy_df.iloc[0])
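
One way to see how a `Series` differs from a `list` is that it carries labels along with its values. This is a small sketch on a hand-written dataframe (`example_df` is an illustrative name, not part of the lab): a column `Series` is labeled by the row index, while a row `Series` is labeled by the column names.

```python
import pandas as pd

example_df = pd.DataFrame({"species": ["puppy", "kitty"], "rating": [9, 10]})

column = example_df["rating"]  # a column, as a Series
row = example_df.iloc[0]       # a row, as a Series

print(column.index.tolist())  # labels of a column: the row index
print(row.index.tolist())     # labels of a row: the column names
print(column.name, row.name)  # each Series remembers where it came from
```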

Many operations on a `Series` produce a new `Series`.

In [None]:
toy_df["species"] == "puppy"

In [None]:
type(toy_df["species"] == "puppy")

In [None]:
toy_df["rating"] + 1

In [None]:
type(toy_df["rating"] + 1)

### Applying functions to Series

Use the `apply` method to feed each element of a `Series`
to a function and get back a new `Series`.

In [None]:
def is_ten(rating):
    return rating == 10

In [None]:
toy_df["rating"].apply(is_ten)
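
For a function as simple as `is_ten`, `apply` gives the same values as comparing the whole `Series` at once. A minimal sketch, using a hand-written `ratings` Series in place of the random toy data:

```python
import pandas as pd

ratings = pd.Series([8, 10, 9, 10], name="rating")


def is_ten(rating):
    return rating == 10


via_apply = ratings.apply(is_ten)  # calls is_ten once per element
via_comparison = ratings == 10     # compares the whole Series at once

print(via_apply.tolist())
print(via_apply.tolist() == via_comparison.tolist())
```

`apply` earns its keep when the function is too complicated to write as a single expression on the `Series`.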

We can then add a column to our dataframe by assignment.

In [None]:
toy_df["top_rated"] = toy_df["rating"].apply(is_ten)

In [None]:
toy_df

### Selection with Boolean Series

A series whose values are all `True` or `False` is called a _Boolean Series_.

If we put a Boolean Series, instead of a string,
inside the brackets following the name of a dataframe,
we pull out, or "select", the rows corresponding to `True` in the Series.

In [None]:
toy_df[toy_df["species"] == "kitty"]

When possible, pandas matches the Boolean Series to the rows using the index, not the order,
as shown by the cells below.

`sample` here draws, without replacement, rows from the dataframe.
`frac=1` means to draw all of the rows.
Since the sampling is without replacement,
`sample(frac=1)` results in a shuffling of all of the rows.

`tolist()` turns the `Series` into a list, which gets rid of the index.
Run both cells below several times to see the different results.

In [None]:
toy_df.sample(frac=1)[toy_df["species"] == "kitty"]

In [None]:
toy_df.sample(frac=1)[(toy_df["species"] == "kitty").tolist()]
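
Because `sample` is random, the difference can be hard to spot. The sketch below is a deterministic variant under stated assumptions: it uses a small hand-written dataframe and reverses the rows with `iloc[::-1]` instead of shuffling them. The warning filter is there because pandas may warn that the Boolean Series is being reindexed.

```python
import warnings

import pandas as pd

pets = pd.DataFrame({"species": ["puppy", "kitty", "kitty"], "rating": [8, 9, 10]})
mask = pets["species"] == "kitty"  # index 0: False, 1: True, 2: True
reversed_pets = pets.iloc[::-1]    # rows now appear in index order 2, 1, 0

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the possible reindexing warning
    by_label = reversed_pets[mask]   # Series mask: matched on the index

by_position = reversed_pets[mask.tolist()]  # list mask: matched on row order

print(by_label["species"].tolist())     # both kitties
print(by_position["species"].tolist())  # a kitty and the puppy
```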

We can combine Boolean series with `&`, pronounced "and", and `|`, pronounced "or".

Two gotchas:
- The keywords `and` and `or` will not work here.
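- Notice the parentheses! `&` and `|` bind more tightly than comparisons like `==` and `>`, so each comparison needs its own pair.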

In [None]:
toy_df[(toy_df["species"] == "kitty") & (toy_df["rating"] > 8)]
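
Both gotchas can be seen in a short sketch. The dataframe `pets` and the names `is_kitty` and `high_rating` are illustrative; storing each comparison in its own variable is one way to sidestep the parentheses issue.

```python
import pandas as pd

pets = pd.DataFrame({"species": ["puppy", "kitty", "kitty"], "rating": [10, 8, 9]})

is_kitty = pets["species"] == "kitty"
high_rating = pets["rating"] > 8

both = pets[is_kitty & high_rating]    # rows where both conditions hold
either = pets[is_kitty | high_rating]  # rows where at least one holds
print(both.index.tolist())
print(either.index.tolist())

# The keyword `and` asks for a single True/False answer from a whole Series,
# which pandas refuses to give.
try:
    pets[is_kitty and high_rating]
except ValueError as err:
    print("`and` failed:", err)
```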

### Indexing

When we want to select just one row, or a few consecutive rows,
we use the two indexers of pandas: `loc` and `iloc`.

They are akin to slicing into lists, with `list[a:b]` and such.

#### .iloc

`.iloc` selects by position in the dataframe.
It takes an `i`nteger as its argument and uses square brackets, `[` and `]`,
just like when slicing a list.

In [None]:
toy_df.iloc[0]

If the order of the rows changes,
e.g. in the example below,
the value returned by `iloc` will change:

In [None]:
toy_df.sample(frac=1).iloc[0]

#### .loc

`.loc`, on the other hand, selects by the `index`,
the "extra" column of labels displayed on the far left of the dataframe.

In [None]:
toy_df

In [None]:
toy_df.index.tolist()

In [None]:
toy_df.loc[0]

In our case, the index is just a number, so `loc` and `iloc` take the same arguments.
In other cases, the index might be a name, an ID string, or a timestamp,
and so `iloc` would still use an `int`eger,
but `loc` would take a different type of argument.

But in our case, how do `loc` and `iloc` differ?
Because the index label travels with its row, the value returned by `loc` is unaffected by changes in the order of the rows.

In [None]:
toy_df.sample(frac=1)

In [None]:
toy_df.sample(frac=1).loc[0]
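
The contrast is easier to see when the index is not a number. In this sketch the pet names used as the index (`rex`, `tom`, `luna`) are invented for illustration, and the rows are reversed rather than shuffled so the result is the same on every run.

```python
import pandas as pd

pets = pd.DataFrame(
    {"species": ["puppy", "kitty", "kitty"], "rating": [10, 8, 9]},
    index=["rex", "tom", "luna"],  # hypothetical name-based index
)
reordered = pets.iloc[::-1]  # rows now in the order luna, tom, rex

# iloc follows position, so the "first" row changes with the order...
first_before = pets.iloc[0]["species"]
first_after = reordered.iloc[0]["species"]

# ...while loc follows the label, so "rex" is always the same row.
rex_before = pets.loc["rex", "species"]
rex_after = reordered.loc["rex", "species"]

print(first_before, first_after)
print(rex_before, rex_after)
```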

### Sorting

Sort a dataframe by one of its columns with the method `sort_values`.

In [None]:
toy_df.sort_values(by="rating")

Reverse the order by passing `ascending=False` as a keyword argument.

In [None]:
toy_df.sort_values(by="rating", ascending=False)
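
By default, `sort_values` returns a new, sorted dataframe and leaves the original untouched. A minimal sketch on a hand-written dataframe (`pets` and `by_rating` are illustrative names):

```python
import pandas as pd

pets = pd.DataFrame({"species": ["puppy", "kitty", "kitty"], "rating": [10, 8, 9]})

by_rating = pets.sort_values(by="rating")

# The index labels move with their rows in the sorted copy...
print(by_rating.index.tolist())
# ...while the original dataframe keeps its order.
print(pets.index.tolist())
```

To keep the sorted version, assign the result to a name, as done with `by_rating` here.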

## Seaborn

Seaborn is typically abbreviated to `sns`.

In [None]:
import seaborn as sns

The plotting code with seaborn is mostly provided for you in the lab, in the following format:

```python
sns.stripplot(x=?, y=?, data=?);
```

where `stripplot` is just one of the types of plots seaborn can make,
and the one you'll make in the lab.

The `x=` keyword argument picks which column is plotted on the `x` axis,
while the `y=` keyword argument picks a column for `y`.
The `data` keyword argument tells seaborn which dataframe to pull the columns from.

So if the above code snippet were given, along with the instructions
> Create a `stripplot` of the data in `toy_df` with `species` on the `x` axis and `rating` on the `y` axis.

the correct answer would be:

In [None]:
sns.stripplot(x="species", y="rating", data=toy_df);

Seaborn also lets you choose a column to separate out datapoints by color,
using the `hue` argument.
This will be done for you in this first lab,
as below:

```python
sns.stripplot(x=?, y=?, hue="top_rated", data=?);
```

accompanied by the instructions
> Create a `stripplot` of the data in `toy_df` with `species` on the `x` axis and `rating` on the `y` axis.

and with the correct answer

In [None]:
sns.stripplot(x="species", y="rating", hue="top_rated", data=toy_df);
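
To check that the column names ended up on the intended axes, one option is to inspect the matplotlib `Axes` object that `stripplot` returns. This is a sketch on a hand-written dataframe (`pets` is an illustrative stand-in for `toy_df`); the `Agg` backend line is an assumption that the code runs outside a notebook and is not needed alongside `%matplotlib inline`.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import pandas as pd
import seaborn as sns

pets = pd.DataFrame({
    "species": ["puppy", "kitty", "kitty", "puppy"],
    "rating": [10, 8, 9, 9],
    "top_rated": [True, False, False, False],
})

# stripplot draws the plot and returns the Axes it drew on
ax = sns.stripplot(x="species", y="rating", hue="top_rated", data=pets)

# seaborn labels each axis with the name of the column plotted on it
print(ax.get_xlabel(), ax.get_ylabel())
```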