# Accessing Data in a Dataframe

## Metadata

- Teaching: 60
- Exercises: 0

## Questions

- How can we look at individual rows and columns in a dataframe?
- How can we perform calculations?
- How can we modify the table and data?

## Objectives

- Access individual rows and columns
- Access multiple columns at once using a list
- Perform calculations like addition and subtraction
- Rename columns using a dictionary
- Access rows containing specific data
- Sort the data returned by a query
- Modify data using loc

We'll begin by importing pandas and reading our CSV, as we did in the previous lesson:

In [None]:
import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
surveys

We will now look at how to access rows and columns in the dataframe.

## Getting columns

We can get the values from a single column by passing a string inside square brackets to the dataframe object. For example, to look at the year column, use:

In [None]:
surveys["year"]

A single column is returned as a `Series`, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a `Series` and a `DataFrame` is that a `DataFrame` can have multiple columns. The two classes share many but not all attributes and methods. Note also that we should treat this series as a copy of data from the original dataframe. Depending on the pandas version, changing values in the series may have no effect on the original dataframe, so we should not rely on it to modify the original.
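To make the difference between the two classes concrete, here is a minimal sketch. It uses a small, made-up dataframe (the `toy` name and its values are hypothetical stand-ins for the survey data) so that it can run without the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for the surveys dataframe
toy = pd.DataFrame({"year": [1977, 1977, 1978], "month": [7, 8, 1]})

# A string inside square brackets returns a Series
year = toy["year"]
print(type(year).__name__)

# A Series has a single name instead of a set of column names
print(year.name)

# The dataframe has rows and columns; the series only has rows
print(toy.shape, year.shape)
```

The `shape` attribute is one of the attributes the two classes share, which makes it a quick way to check which kind of object we are holding.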

### Using lists to get more than one column at a time

Python uses the built-in `list` data type to store sequences, that is, ordered collections of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about lists using `help(list)` or [the Python docs](https://docs.python.org/3/library/stdtypes.html#list).

Create a list using square brackets. Let's create a list of the three columns in our dataframe that together give the date of an observation:

In [None]:
cols = ["year", "month", "day"]

When we pass this list to the survey dataframe using square brackets, as we did above, we retrieve a copy of the dataframe containing just those columns. Note that, when we get more than one column, pandas returns a dataframe:

In [None]:
surveys[cols]

## Getting rows

We can get the rows at the beginning of the table using the `head()` method:

In [None]:
surveys.head()

By default, this method returns the first five rows. We can provide a number inside the parentheses if we need a specific number of rows:

In [None]:
surveys.head(20)

The `tail()` method is similar, except it returns rows from the end of the table:

In [None]:
surveys.tail()

The `head()` and `tail()` methods are useful for getting a feel for how our data is structured, but we'll also want to be able to look at specific rows. As when we selected columns above, we can use square brackets to extract *slices* from the dataframe. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as `start:end`:

In [None]:
surveys[2:5]

There are three things to be aware of when slicing a dataframe:

- Row indexes are zero-based. That is, the first row has an index of 0, not 1.
- When slicing, the slice includes the start index but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
- The row label can be different from the row index. They happen to be the same here, but don't count on that being true.

Core Python types like `list` and `tuple` use the same conventions, as do most libraries that deal with sequences.
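The three points above can be sketched with a plain list and a small dataframe. The `letters` list and the `toy` dataframe are made up for illustration and are not part of the survey data:

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e", "f"]

# Lists are zero-based, and the end index is excluded
print(letters[2:5])  # ['c', 'd', 'e']

# Hypothetical dataframe with the default row labels 0 through 5
toy = pd.DataFrame({"letter": letters})

# Slicing the dataframe follows the same convention
print(toy[2:5])

# Take the last three rows; they keep their original labels 3, 4, and 5
subset = toy[3:]
print(list(subset.index))

# The slice counts positions, not labels, so this returns the row labeled 3
print(subset[0:1])
```

In the subset, the first row has an index of 0 but a label of 3, which is why we can't count on labels and indexes being the same.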

## Getting unique values

Recall that we can use square brackets to return a single column from a dataframe. We can get the unique values within that column, returned as an array, using the `unique()` method:

In [None]:
surveys["species_id"].unique()

To do the same across multiple columns, we can use the `drop_duplicates()` method on a copy of the dataframe containing only the columns we're interested in. Like any well-named method, `drop_duplicates()` does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.

In [None]:
surveys[["plot_id", "species_id"]].drop_duplicates()
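To see what counts as a duplicate, here is a minimal sketch using a made-up dataframe (the `toy` name and its values are hypothetical). The first and third rows are identical, so only one of them should survive:

```python
import pandas as pd

# Hypothetical observations; rows 0 and 2 are identical
toy = pd.DataFrame(
    {
        "plot_id": [2, 3, 2, 3],
        "species_id": ["NL", "DM", "NL", "NL"],
    }
)

# unique() returns each distinct value from a single column once
print(toy["species_id"].unique())

# drop_duplicates() compares whole rows, so plot 3 appears twice
# in the result because it is paired with two different species
pairs = toy[["plot_id", "species_id"]].drop_duplicates()
print(pairs)
```

Note that the rows that remain keep their original row labels.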

## Calculating values

The survey dataset includes two columns, hindfoot_length and weight, that are stored as numbers and represent measurements. We may want to perform calculations using numbers like these in our own data. We can do so using Python's built-in mathematical operators, including:

- `x + y` for addition
- `x - y` for subtraction
- `x * y` for multiplication
- `x / y` for division
- `x % y` for calculating remainders
- `x ** y` for exponents 

To make the examples in this section a little more useful, we're going to remove all rows that contain null values using the `dropna()` method. This will filter out any rows that don't have a valid hindfoot_length or weight, as well as those that have a null value in any other cell. (This is an inelegant solution to the problem of missing data. We'll talk about more nuanced solutions later in the lesson.)

In [None]:
surveys_nona = surveys.dropna().copy()
surveys_nona
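The sketch below shows how aggressive `dropna()` is on a small, made-up dataframe (the `toy` name and its values are hypothetical). It also shows the `subset` argument, one possible way to limit the check to the columns we care about:

```python
import pandas as pd

# Hypothetical data; None marks a missing value
toy = pd.DataFrame(
    {
        "species_id": ["NL", "DM", None, "PF"],
        "hindfoot_length": [32.0, None, 37.0, 15.0],
        "weight": [None, 40.0, 48.0, 8.0],
    }
)

# dropna() removes every row with a null in any column,
# which leaves only the last row here
print(toy.dropna())

# The subset argument only checks the columns we name,
# so only the row with a missing weight is removed
print(toy.dropna(subset=["weight"]))
```

Both calls return a new dataframe and leave `toy` itself unchanged.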

Suppose we want to convert the weight column from grams to milligrams. To do so, we can multiply that column by 1000:

In [None]:
surveys_nona["weight"] * 1000

To convert it to kilograms, we can divide by 1000:

In [None]:
surveys_nona["weight"] / 1000

Note that calculations do not modify the original dataset. If we want to retain the result, we have to assign it to a new column:

In [None]:
surveys_nona["weight_mg"] = surveys_nona["weight"] * 1000
surveys_nona

We can also add, subtract, multiply, and divide columns, as in the following (admittedly nonsensical) calculation, which adds together the hindfoot_length and weight columns: 

In [None]:
surveys_nona["hindfoot_length"] + surveys_nona["weight"]
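The examples in this section use only some of the operators listed earlier. As a minimal sketch, the cell below applies the remaining ones to a made-up dataframe (the `toy` name and its values are hypothetical):

```python
import pandas as pd

# Hypothetical measurements
toy = pd.DataFrame({"hindfoot_length": [32, 36, 15], "weight": [40, 48, 8]})

# An operator with a single number applies to every value in the column
print(toy["weight"] % 7)
print(toy["weight"] ** 2)

# An operator between two columns pairs up the values row by row
ratio = toy["weight"] / toy["hindfoot_length"]
print(ratio)

# None of these calculations changed the dataframe itself
print(list(toy.columns))
```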

## Renaming columns

The hindfoot_length and weight columns don't specify a unit, which may get confusing if we want to perform unit conversions like the one above. Fortunately, dataframes allow us to rename existing columns using the `rename()` method.

The `rename()` method uses a dictionary (or `dict`) to map between the old and new column names. As with `list` above, the `dict` data type is built into Python--we don't need to import anything to use it. A `dict` maps keys to values. We can create one using curly braces:

In [None]:
dct = {"key1": "val1", "key2": "val2"}
dct

Dictionaries are a useful and highly flexible data type. As with `list` above, we'll be giving them short shrift here, but you can learn more about them at [the Python docs](https://docs.python.org/3/library/stdtypes.html#dict).

Here we'll use a `dict` to specify how we want to rename our columns. The *keys* will be the current column names and the *values* the new column names. Note that we explicitly assign the result of `rename()` to the original variable--by default, `rename()` returns a copy of the original dataframe instead of modifying the original dataframe.

In [None]:
# Create a dict that maps from the old to the new column name
cols = {
    "hindfoot_length": "hindfoot_length_mm",
    "weight": "weight_g",
}

# Assign the result of the rename method back to surveys_nona
surveys_nona = surveys_nona.rename(columns=cols)

# View the dataframe with the new column names
surveys_nona
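To check that `rename()` really does return a copy, we can compare the column names before and after. This is a minimal sketch using a made-up dataframe (the `toy` and `renamed` names are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe with unit-less column names
toy = pd.DataFrame({"hindfoot_length": [32, 36], "weight": [40, 48]})

# Rename one column and store the result in a new variable
renamed = toy.rename(columns={"weight": "weight_g"})

# The original dataframe still has the old column names
print(list(toy.columns))

# The copy has the new name; columns missing from the dict are unchanged
print(list(renamed.columns))
```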

### Challenge

Create a dataframe that returns the year, month, day, species_id and weight in mg.

In [None]:
# Assign the weight in milligrams to the weight_mg column
surveys_nona["weight_mg"] = surveys_nona["weight_g"] * 1000

# Display a copy of surveys_nona with only the desired columns
surveys_nona[["year", "month", "day", "species_id", "weight_mg"]]

## Filtering data

pandas provides a variety of ways to filter a dataframe. For example, suppose we want to look at a specific species in the surveys_nona dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of passing a value or list of values, we will pass a *conditional expression*.

A conditional expression is an expression that evaluates to either True or False. Conditional expressions often make use of comparison operators, for example:

- `==` for equals
- `!=` for does not equal
- `>` for greater than
- `>=` for greater than or equal to
- `<` for less than
- `<=` for less than or equal to

Examples of conditional expressions include:

+ `"a" == "b"` evaluates to False
+ `"a" != "b"` evaluates to True
+ `3 > 4` evaluates to False

Note that, when comparing strings, evaluations are case sensitive:

+ `"a" == "A"` evaluates to False

### = for assignment, == for equality

Remember that, in Python, a single equal sign (`=`) is used to assign values to variables. We've already used the assignment operator in this lesson, for example, when we created a new column in the dataframe. To test whether two values are equal, use a double equal sign (`==`) instead.

To limit the dataframe to rows matching the species "DM", include the conditional `surveys_nona["species_id"] == "DM"` inside the square brackets:

In [None]:
surveys_nona[surveys_nona["species_id"] == "DM"]

To limit our results to observations made in or after 2000, use:

In [None]:
surveys_nona[surveys_nona["year"] >= 2000]

As with `rename()` above, each filtering operation returns a copy of the dataframe. We will look at how to make changes to the original dataframe at the end of this lesson.
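It can help to look at what the conditional produces on its own. The sketch below uses a made-up dataframe (the `toy` and `mask` names are hypothetical) to show that a conditional returns one True or False value per row, and that the square brackets keep only the rows marked True:

```python
import pandas as pd

# Hypothetical observations
toy = pd.DataFrame({"species_id": ["DM", "NL", "DM"], "year": [1999, 2000, 2001]})

# The conditional produces a Series of True/False values, one per row
mask = toy["species_id"] == "DM"
print(mask)

# Passing that Series inside square brackets keeps only the True rows
print(toy[mask])

# The comparison is case sensitive, so "dm" matches nothing
print(len(toy[toy["species_id"] == "dm"]))

# The original dataframe still has all three rows
print(len(toy))
```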

## Building more complex queries

We can combine conditionals using what are called *bitwise operators*:

+ `&`: True if conditions on both sides of the operator are True (and)
+ `|`: True if a condition on either side is True (or)

To return all observations of DM in or after 2000, we can combine the two conditionals we used previously. Note that, when joining conditionals using `&` or `|`, we must wrap each individual condition in parentheses. In Python, `&` and `|` take precedence over comparison operators like `==` and `>=`, so if we omit the parentheses, the comparisons will not be performed in the expected order.

In [None]:
surveys_nona[(surveys_nona["species_id"] == "DM") & (surveys_nona["year"] >= 2000)]

Some column methods can also be used for filtering. One example is `isin()`, which is used to match a list of values. This method can be combined with other conditionals as above. The example below returns rows from 2000 or later with either "DM", "DO", or "DS" in the species_id column:

In [None]:
surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]
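The examples above all use `&`. As a minimal sketch, the cell below compares `&` and `|` on a made-up dataframe (the `toy`, `is_dm`, and `recent` names are hypothetical). Assigning each condition to its own variable is one way to sidestep the parentheses issue:

```python
import pandas as pd

# Hypothetical observations
toy = pd.DataFrame(
    {
        "species_id": ["DM", "DM", "NL", "DO"],
        "year": [1999, 2001, 2002, 1998],
    }
)

# Store each condition in its own variable
is_dm = toy["species_id"] == "DM"
recent = toy["year"] >= 2000

# & keeps rows where both conditions are True
both = toy[is_dm & recent]
print(both)

# | keeps rows where at least one condition is True
either = toy[is_dm | recent]
print(either)

# The same query written inline needs parentheses around each condition
inline = toy[(toy["species_id"] == "DM") & (toy["year"] >= 2000)]
print(inline.equals(both))
```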

## Sorting data

We can sort a dataframe using the `sort_values()` method. For this example, we'll work from the subset defined above. First we need to assign that subset to a variable:

In [None]:
results = surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]

Now we'll sort the results by weight_g. To do so, pass that column name as an argument (that is, inside the trailing parentheses) to the `sort_values()` method:

In [None]:
results.sort_values("weight_g")

By default, rows are sorted in ascending order (smallest to largest). We can modify this behavior using the *ascending* keyword argument:

In [None]:
results.sort_values("weight_g", ascending=False)

We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (in reverse alphabetical order), then by weight (smallest to largest):

In [None]:
results.sort_values(["species_id", "weight_g"], ascending=[False, True])

As with the dataframe methods above, `sort_values()` returns a copy of the original dataframe and leaves the original untouched.
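A multi-column sort is easier to follow on a handful of rows. Here is a minimal sketch using a made-up dataframe (the `toy` and `ordered` names and the values are hypothetical):

```python
import pandas as pd

# Hypothetical observations
toy = pd.DataFrame(
    {
        "species_id": ["DM", "DS", "DM", "DS"],
        "weight_g": [44, 120, 39, 115],
    }
)

# Sort by species_id (descending), then by weight_g (ascending)
ordered = toy.sort_values(["species_id", "weight_g"], ascending=[False, True])
print(ordered)

# Row labels travel with their rows, so they are now out of order
print(list(ordered.index))

# The original dataframe is still in its original order
print(list(toy.index))
```

Within each species the weights run from smallest to largest, while the species themselves run from DS to DM.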

### Challenge

Write a query that returns year, species_id, and weight in kg from the surveys_nona table, sorted with the largest weights at the top.

In [None]:
# Create a new column with weight in kg
surveys_nona["weight_kg"] = surveys_nona["weight_g"] / 1000

# Create a subset containing only year, species_id, and weight_kg
subset = surveys_nona[["year", "species_id", "weight_kg"]]

# Sort the subset by weight_kg
subset.sort_values("weight_kg", ascending=False)

## Modifying data

We've already shown how to modify an existing dataframe by adding a new column. What if we want to modify existing cells instead? As we've seen, this can be a little tricky in pandas because most of its methods return a copy of the original dataframe. For example, we can get subsets of a dataframe using square brackets. The cell below returns the species_id column for rows 2 through 5:

In [None]:
surveys[2:6]["species_id"]

But trying to set new values using this syntax may not work as expected.


Say we want to set the species_id column to a new value, "FD". Try running the code in the cell below:

In [None]:
surveys[2:6]["species_id"] = "FD"

You should have received a `SettingWithCopyWarning` after running that cell (the exact warning may differ between pandas versions). This warning tells us that we tried to set values on a copy of part of the dataframe, so the data in the original dataframe has not been modified. This is because the square brackets returned a copy, meaning that any changes will be reflected in the copy, not the original. We can verify that the original dataframe has not been changed by displaying the rows that would have been modified:

In [None]:
surveys[2:6]

### Using loc to modify existing cells

One way to modify existing data in pandas is to use the `loc` attribute. This attribute allows you to extract and modify cells in a DataFrame using the following syntax: `df.loc[row_indexer, col_indexer]`.

The *row_indexer* argument is used to select one or more rows. It can be:

- A row label (i.e., a value from the bold column on the far left):
    - `0` returns the row with label 0
- A slice including multiple rows:
    - `:` returns all rows
    - `:2` returns all rows from the beginning of the dataframe to the row labeled 2, *inclusive*
    - `2:` returns all rows from the row labeled 2 to the end of the dataframe, *inclusive*
    - `2:5` returns all rows between those labeled 2 and 5, *inclusive*
- A conditional, as in the examples above

The *col_indexer* argument is used to select one or more columns. It will typically be a list of column names and can be omitted, in which case all columns will be returned.

### Row labels and indexes

The row labels in this dataframe happen to be numeric and aligned exactly to the row's index (so for the first row both the index and the label are 0, for the second row both the index and the label are 1, etc.). This is often **but not always** true in pandas. For example, if we used record_id as the label, the row labels would be one-based and the row indexes would be zero-based.

### loc slicing behavior

Slices using `loc` are inclusive--rows matching both the start and end values are included in the returned slice. This differs from list slices, where the start but not end value is included in the slice. `loc` works this way because it is looking at the row *label*, not the row *index*.
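The difference between labels and indexes is easiest to see when the two don't line up. The sketch below uses a made-up dataframe with a record_id column as the row label (the `toy` and `labeled` names and the values are hypothetical), then takes the same `2:4` slice both ways:

```python
import pandas as pd

# Hypothetical observations with a one-based record_id
toy = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4, 5],
        "species_id": ["NL", "NL", "DM", "DM", "DM"],
    }
)

# Use record_id as the row label; labels are now one-based
labeled = toy.set_index("record_id")

# loc slices by label and includes both ends: labels 2, 3, and 4
by_label = labeled.loc[2:4]
print(by_label)

# Square brackets slice by position and exclude the end:
# positions 2 and 3, which carry the labels 3 and 4
by_position = labeled[2:4]
print(by_position)
```

The same `2:4` returns three rows with `loc` but only two rows with plain square brackets.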

We'll be making some changes to our data, so let's work from a copy instead of modifying the original. Create a copy using the `copy()` method:

In [None]:
surveys_copy = surveys.copy()

To select a subset of rows and columns using `loc`, use:

In [None]:
surveys_copy.loc[2:5, "species_id"]

Unlike the chained square brackets we tried above, an assignment through `loc` is a single operation on the surveys_copy dataframe itself, so pandas writes the new value directly into that dataframe instead of into a temporary copy. We can now assign a new value to the species_id column in the matching rows of the original dataframe:

In [None]:
surveys_copy.loc[2:5, "species_id"] = "FD"

We can see that these changes are reflected in the original surveys_copy object:

In [None]:
surveys_copy.loc[1:6, "species_id"]
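The example above uses a slice as the row indexer, but `loc` also accepts a conditional and a list of column names. Here is a minimal sketch using a made-up dataframe (the `toy` name and its values are hypothetical):

```python
import pandas as pd

# Hypothetical observations
toy = pd.DataFrame(
    {
        "species_id": ["NL", "DM", "DM", "PF"],
        "year": [1977, 1977, 1978, 1978],
        "weight": [32.0, 40.0, 48.0, 8.0],
    }
)

# Select rows with a conditional and columns with a list of names
selected = toy.loc[toy["year"] == 1978, ["species_id", "weight"]]
print(selected)

# Assign a new value to species_id wherever it currently equals "DM"
toy.loc[toy["species_id"] == "DM", "species_id"] = "FD"
print(toy)
```

Because the assignment goes through `loc`, the change shows up in `toy` itself. The `selected` dataframe was created before the assignment, so it still shows the earlier values.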

### Slicing with iloc

pandas provides another indexer, `iloc`, that allows us to select and modify data using row and column indexes instead of labels. Learn more in the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html).

## Keypoints

- Use square brackets to access rows, columns, and specific cells
- Use operators like `+`, `-`, and `/` to perform arithmetic on rows and columns
- Store the results of calculations in a dataframe by adding a new column or overwriting an existing column
- Sort data, rename columns, and get unique values in a dataframe using methods provided by pandas
- By default, most dataframe operations return a copy of the original data