# Accessing and Filtering Data

## Metadata

- Teaching: 60
- Exercises: 0

## Questions

- How can we look at individual rows and columns in a dataframe?
- How can we look at subsets of the dataset?

## Objectives

- Access individual rows and columns
- Access multiple columns at once using a list
- Filter the dataframe based on the data it contains
- Sort the dataframe

In the previous lesson, we saw how to load a dataframe from a file. Now we'll look at how to access the data within that dataframe. We'll begin by importing pandas and reading our CSV, as we did in the previous lesson:

In [None]:
import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
surveys

## Getting a column

We can get a single column from the dataframe using square brackets. Square brackets are used in Python to access objects inside a container like a `list`, `dict`, or `DataFrame`. To get a column, we pass the name of the column inside a set of brackets appended to the dataframe. For example, to retrieve the year, use:

In [None]:
surveys["year"]

A single column is returned as a `Series`, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a `Series` and a `DataFrame` is that a `DataFrame` can have multiple columns. The two classes share many but not all attributes and methods.

Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe. **Most operations on a DataFrame or Series return a copy of the original object.**
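As a small illustration of how the two classes differ, the sketch below builds a tiny stand-in dataframe (the values are made up for illustration and are not taken from `surveys.csv`) and checks what kind of object a single column is, along with one attribute the classes share and one method they do not:

```python
import pandas as pd

# Hypothetical stand-in for the surveys dataframe (values are made up)
toy = pd.DataFrame(
    {
        "year": [1977, 1977, 1978, 2001],
        "species_id": ["DM", "DO", "DM", "DS"],
    }
)

# Selecting one column with square brackets gives a Series
year = toy["year"]
print(type(toy).__name__)   # DataFrame
print(type(year).__name__)  # Series

# Both classes have a shape attribute...
print(toy.shape)   # (4, 2): four rows, two columns
print(year.shape)  # (4,): four values, no column dimension

# ...but unique() is defined for Series and not for DataFrame
print(hasattr(year, "unique"))  # True
print(hasattr(toy, "unique"))   # False
```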

## Getting unique values

We can get the list of unique values within a column using the `unique()` method:

In [None]:
surveys["species_id"].unique()
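To see what `unique()` does without loading the full dataset, here is a minimal sketch using a made-up series of species codes (the values are an assumption for illustration). The result is an array-like object rather than a true Python `list`, so the example converts it with `list()` before printing:

```python
import pandas as pd

# Hypothetical species codes with repeats (not read from surveys.csv)
species = pd.Series(["DM", "DO", "DM", "DS", "DO"], name="species_id")

# unique() keeps one copy of each value, in order of first appearance
unique_species = species.unique()

print(list(unique_species))  # ['DM', 'DO', 'DS']
print(len(unique_species))   # 3
```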

## Getting multiple columns

It is also possible to retrieve more than one column at a time. To do so, we'll use the built-in `list` data type.

Python uses `list` to store sequences, that is, ordered lists of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using `help()` or [the Python docs](https://docs.python.org/3/library/stdtypes.html#list).

You can create a list using square brackets. Let's create a list of the three columns in our dataframe that together give the date of an observation:

In [None]:
cols = ["year", "month", "day"]
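Although this lesson does not go into detail about modifying lists, a brief sketch of appending, inserting, and deleting may help. It works on a separate list named `date_cols` (a name chosen here for illustration) so that nothing else in the lesson is affected:

```python
# A separate list holding the same three column names
date_cols = ["year", "month", "day"]

# Append an item to the end of the list
date_cols.append("species_id")

# Insert an item at a given position (0 is the first position)
date_cols.insert(0, "plot_id")

# Delete an item by position (-1 is the last position)
del date_cols[-1]

print(date_cols)  # ['plot_id', 'year', 'month', 'day']
```

These operations act on `date_cols` only; `cols` still holds the three date columns.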

When we pass this list to the survey dataframe using square brackets, as we did above, we get a copy of the dataframe containing just those columns. Note that, because we asked for more than one column, pandas returns a dataframe:

In [None]:
surveys[cols]
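The same pattern can be tried on a small made-up dataframe (an assumption for illustration, not the real survey data). The sketch suggests two things worth noticing: the result is a `DataFrame`, and its columns follow the order of the list rather than the order in the original dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the surveys dataframe (values are made up)
toy = pd.DataFrame(
    {
        "month": [7, 7, 8],
        "day": [16, 17, 19],
        "year": [1977, 1977, 1977],
        "species_id": ["DM", "DO", "DM"],
    }
)

cols = ["year", "month", "day"]
dates = toy[cols]

print(type(dates).__name__)  # DataFrame
print(list(dates.columns))   # ['year', 'month', 'day']
```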

Suppose we want to get the unique values for multiple columns. The `unique()` method only works on a `Series`, that is, a single column. Instead, we can use the `drop_duplicates()` method on a copy of the dataframe with the columns we're interested in. Like any well-named method, `drop_duplicates()` does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.

In [None]:
surveys[["plot_id", "species_id"]].drop_duplicates()
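A minimal sketch on made-up data (the values are an assumption for illustration) shows which rows survive. By default, `drop_duplicates()` keeps the first occurrence of each combination and preserves the original row labels:

```python
import pandas as pd

# Hypothetical observations; rows 0 and 1 share the same plot and species
toy = pd.DataFrame(
    {
        "plot_id": [2, 2, 3, 2],
        "species_id": ["DM", "DM", "DM", "DO"],
    }
)

pairs = toy[["plot_id", "species_id"]].drop_duplicates()

print(len(toy))           # 4 rows before
print(len(pairs))         # 3 rows after
print(list(pairs.index))  # [0, 2, 3]: row 1 was a duplicate of row 0
```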

## Getting one or more rows

Pandas provides a variety of ways to view rows within a dataframe. We can get the rows at the beginning of the dataframe using `head()`:

In [None]:
surveys.head()

By default, this method returns the first five rows. We can provide a number inside the parentheses if we want to view a different number of rows:

In [None]:
surveys.head(10)

The `tail()` method is similar, except it returns rows from the end of the table:

In [None]:
surveys.tail()

Or we can use `sample()` to return a random row from anywhere in the dataframe:

In [None]:
surveys.sample()

If you're following along, you may notice that the output of this method on your screen differs from what's shown here. That's exactly what we'd expect to see. Remember, `sample()` is returning a random row--it would be surprising for the outputs to be the same!
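If a repeatable sample is needed, for example when sharing a notebook, `sample()` accepts `n` (the number of rows) and `random_state` (a seed) as keyword arguments. These arguments are not used elsewhere in this lesson; the sketch below uses a made-up dataframe and an arbitrary seed of 42:

```python
import pandas as pd

# Hypothetical dataframe with ten rows (values are made up)
toy = pd.DataFrame({"year": range(1990, 2000)})

# Three random rows; the selection changes from run to run
three_rows = toy.sample(n=3)

# Using the same seed twice gives the same rows both times
first_draw = toy.sample(n=3, random_state=42)
second_draw = toy.sample(n=3, random_state=42)

print(len(three_rows))                 # 3
print(first_draw.equals(second_draw))  # True
```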

## Slicing the dataframe

The `head()`, `tail()`, and `sample()` methods are useful for getting a feel for how our data is structured, but we may also want to look at specific rows. One way to do so is to extract rows based on where they appear in the dataframe. We can use square brackets to extract these *slices*. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as `start:end`:

In [None]:
surveys[2:5]

There are three things to be aware of when slicing a dataframe:

- Row indexes are *zero-based*. The first row has an index of 0, not 1.
- When slicing, the slice includes the start index but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
- The row label can be different from the row index. They happen to be the same here, but don't count on that being true.

Core Python types like `list` and `tuple` use the same conventions, as do most Python packages that work with sequences.
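The three points above can be checked on a small made-up dataframe (an assumption for illustration). The sketch slices the dataframe, then slices a subset of it whose row labels no longer match the row positions, and finally applies the same `start:end` convention to a plain list:

```python
import pandas as pd

# Hypothetical dataframe with row labels 0 through 5
toy = pd.DataFrame({"year": [1998, 2001, 1999, 2002, 2003, 1997]})

# Positions 2, 3, and 4; position 5 is excluded
first_slice = toy[2:5]
print(list(first_slice.index))  # [2, 3, 4]

# A subset keeps its original row labels (1, 3, and 4 here)
recent = toy[toy["year"] >= 2000]

# Slicing still counts positions from zero, so labels and positions differ
recent_slice = recent[0:2]
print(list(recent_slice.index))  # [1, 3]

# Lists follow the same zero-based, end-exclusive convention
letters = ["a", "b", "c", "d", "e", "f"]
print(letters[2:5])  # ['c', 'd', 'e']
```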

## Filtering data

It is often more useful to subset a dataframe based on the data itself. Pandas provides a variety of ways to filter a dataframe in this way. For example, suppose we want to look at a specific species in the surveys dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of using a value or list of values, we will use a *conditional expression*.

A conditional expression is a statement that evaluates to either True or False. Conditional expressions often make use of comparison operators, for example:

- `==` for equals
- `!=` for does not equal
- `>` for greater than
- `>=` for greater than or equal to
- `<` for less than
- `<=` for less than or equal to

Examples of conditional statements include:

+ `"a" == "b"` evaluates False
+ `"a" != "b"` evaluates True
+ `3 > 4` evaluates False

Note that, when comparing strings, evaluations are case sensitive:

+ `"a" == "A"` evaluates False
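These expressions can be run directly in Python. The short sketch below repeats the examples above and stores one result in a variable; the names `year` and `is_recent` are chosen here purely for illustration:

```python
# Each comparison produces a boolean: True or False
print("a" == "b")  # False
print("a" != "b")  # True
print(3 > 4)       # False

# String comparisons are case sensitive
print("a" == "A")  # False

# The result of a comparison can be stored like any other value
year = 2000
is_recent = year >= 2000
print(is_recent)   # True
```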

### = for assignment, == for equality

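Remember that, in Python, a single equal sign is used to assign values to variables. We've already used the assignment operator in this lesson, for example, when we read the CSV into the `surveys` variable and when we created the `cols` list. A double equal sign, by contrast, tests whether two values are equal.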

To limit the dataframe to rows matching the species "DM", we will again use square brackets. This time, instead of passing a string or a number, we will include the conditional `surveys["species_id"] == "DM"` inside the square brackets:

In [None]:
surveys[surveys["species_id"] == "DM"]

Other comparisons can be used in the same way. To limit our results to observations made in or after 2000, use:

In [None]:
surveys[surveys["year"] >= 2000]

As when we selected columns above, each filtering operation returns a copy of the dataframe.
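It can help to look at the conditional on its own before using it as a filter. In the sketch below (made-up data; `mask` is a name chosen for illustration), the comparison produces a `Series` of True and False values, one per row, and passing that series to the square brackets keeps only the rows marked True:

```python
import pandas as pd

# Hypothetical stand-in for the surveys dataframe (values are made up)
toy = pd.DataFrame(
    {
        "species_id": ["DM", "DO", "DM", "DS"],
        "year": [1999, 2001, 2002, 1998],
    }
)

# The comparison is applied to every row of the column
mask = toy["species_id"] == "DM"
print(list(mask))  # [True, False, True, False]

# Only rows where the mask is True are returned
dm_only = toy[mask]
print(list(dm_only.index))  # [0, 2]
```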

## Using complex filters

When analyzing data, we will often need to filter on multiple columns at one time. In pandas, we can combine conditionals using *bitwise operators*. These work like the terms AND and OR in many search interfaces:

+ `&`: True if conditions on both sides of the operator are True (and)
+ `|`: True if a condition on either side is True (or)

To return all observations of DM in or after 2000, we can combine the two conditionals we used previously into a single operation. Note that, when joining conditionals using `&` or `|`, we must wrap each individual condition in parentheses. If we omit the parentheses, Python evaluates `&` and `|` before comparison operators like `==` and `>=`, so the comparisons will not be performed in the expected order.

In [None]:
surveys[(surveys["species_id"] == "DM") & (surveys["year"] >= 2000)]
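To compare the two operators side by side, the sketch below applies the same pair of conditions to a small made-up dataframe (an assumption for illustration), first with `&` and then with `|`:

```python
import pandas as pd

# Hypothetical stand-in for the surveys dataframe (values are made up)
toy = pd.DataFrame(
    {
        "species_id": ["DM", "DO", "DM", "DS"],
        "year": [1999, 2001, 2002, 1998],
    }
)

is_dm = toy["species_id"] == "DM"
is_recent = toy["year"] >= 2000

# and: both conditions must be True for a row to be kept
both = toy[(toy["species_id"] == "DM") & (toy["year"] >= 2000)]
print(list(both.index))  # [2]

# or: a row is kept if at least one condition is True
either = toy[(toy["species_id"] == "DM") | (toy["year"] >= 2000)]
print(list(either.index))  # [0, 1, 2]

# Conditions stored in variables can be combined the same way
print(toy[is_dm & is_recent].equals(both))  # True
```

Storing each condition in a variable first, as in the last line, is one way to keep long filters readable.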

We can also use methods to filter the dataframe. For example, `isin()` can be used to match a list of values. Methods can be combined with other conditionals as above. The example below returns rows from 2000 or later with either "DM", "DO", or "DS" in the species_id column:

In [None]:
surveys[
    surveys["species_id"].isin(["DM", "DO", "DS"]) & (surveys["year"] >= 2000)
]
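One way to think about `isin()` is as shorthand for several equality checks joined with `|`. The sketch below checks that idea on made-up data (the species code "PF" and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the surveys dataframe (values are made up)
toy = pd.DataFrame(
    {
        "species_id": ["DM", "DO", "PF", "DS", "DM"],
        "year": [1999, 2001, 2002, 2003, 2004],
    }
)

# Filter using isin()
with_isin = toy[
    toy["species_id"].isin(["DM", "DO", "DS"]) & (toy["year"] >= 2000)
]

# The same filter written out with | between equality checks
with_or = toy[
    (
        (toy["species_id"] == "DM")
        | (toy["species_id"] == "DO")
        | (toy["species_id"] == "DS")
    )
    & (toy["year"] >= 2000)
]

print(list(with_isin.index))      # [1, 3, 4]
print(with_isin.equals(with_or))  # True
```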

## Sorting data

We can sort a dataframe using the `sort_values()` method. To sort by weight, we'll pass the name of that column to `sort_values()`:

In [None]:
surveys.sort_values("weight")

By default, rows are sorted in ascending order (smallest to largest). We can reorder them from largest to smallest using the `ascending` keyword argument:

In [None]:
surveys.sort_values("weight", ascending=False)

We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the `ascending` keyword argument. The cell below sorts the results first by species_id (in reverse alphabetical order, since the column contains text), then by weight (smallest to largest):

In [None]:
surveys.sort_values(["species_id", "weight"], ascending=[False, True])

As with the dataframe methods above, `sort_values()` returns a copy of the original dataframe and leaves the original untouched.
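The sketch below illustrates that behavior on a small made-up dataframe (the values, including the missing weight, are assumptions for illustration). It also shows that rows with a missing value in the sort column are placed last by default, whichever direction we sort in:

```python
import pandas as pd

# Hypothetical weights, one of them missing (NaN)
toy = pd.DataFrame({"weight": [40.0, float("nan"), 12.0, 55.0]})

# Smallest to largest; the row with the missing weight goes last
by_weight = toy.sort_values("weight")
print(list(by_weight.index))  # [2, 0, 3, 1]

# Largest to smallest; the missing weight is still last
heaviest_first = toy.sort_values("weight", ascending=False)
print(list(heaviest_first.index))  # [3, 0, 2, 1]

# The original dataframe keeps its original order
print(list(toy.index))  # [0, 1, 2, 3]
```

To keep a sorted version, assign the result to a variable, as done with `by_weight` here.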

### Challenge

Write a query that returns year, species_id, and weight from the surveys table, sorted with the largest weights at the top.

In [None]:
# Create a subset containing only year, species_id, and weight
subset = surveys[["year", "species_id", "weight"]]

# Sort the subset by weight
subset.sort_values("weight", ascending=False)

## Showing data on a plot

We will discuss data visualization using plotly in depth in [lesson 6](06-visualizing-data.html) but will introduce some fundamental concepts as we go along. Like pandas, plotly is an external package installed separately from Python itself. It can be used to create interactive, highly customizable plots based on pandas dataframes using just a few lines of code. For example, to create a scatter plot of the weight and hindfoot length, we need only to import plotly:

In [None]:
import plotly.express as px

Once plotly is loaded, we can create a scatter plot using the `px.scatter()` function. We include the dataframe as the first argument, then x and y keyword arguments to select the columns we want to show on our scatter plot:

In [None]:
px.scatter(surveys, x="weight", y="hindfoot_length")

This simple plot is limited in what it can tell us about the observations in the dataset. We will return to this scatter plot in later lessons to see how we can improve it to better understand the survey data.

## Keypoints

- Use square brackets to access rows, columns, and specific cells
- Sort data and get unique values in a dataframe using methods provided by pandas
- By default, most dataframe operations return a copy of the original data
- Scatter plots can be used to visualize how two parameters in a dataset covary