# EEB125 Lecture 7: Introduction to Pandas

## Feb 26, 2025

## Karen Reid

## Introducing `pandas`

So far this semester, you've worked in *base Python*, using only types of data, functions, and methods that are built into Python.

For the next few weeks, we'll learn how to use one of the most common **libraries** for doing data science in Python: `pandas`.

## What is `pandas`?

[`pandas`](https://pandas.pydata.org/) "is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language." 

<img width="400" src="https://cdn.britannica.com/80/150980-050-84B9202C/Giant-panda-cub-branch.jpg" alt="Image of a panda"/>

Today, we'll learn how to use `pandas` to:

- Read in a dataset from a CSV file
- Identify, use, and differentiate two new Pandas data types, `DataFrame` and `Series`
- Describe the properties of a dataset representation in Pandas
- Inspect parts of a large dataset
- Perform simple *data cleaning* and *data transformation* operations on a dataset
- Compute some summary statistics on a dataset

## Importing `pandas`

Because `pandas` doesn't come built-in with Python, we need to **import it** to be able to use it in our code.

This is done with a Python statement called an **import statement**.

In [None]:
import pandas

A common alternative is to import it under a shorter name (a "nickname"):

In [None]:
import pandas as pd

## Reading in data from a CSV file

Using `pandas`, we can read in data from a CSV file using the `read_csv` function.

In [None]:
species_data = pd.read_csv('PanTHERIA_WR05_Aug2008.csv')

Let's explore: what is `species_data`?

In [None]:
species_data

In [None]:
type(species_data)

Formally, `species_data` is a `DataFrame`, which is a custom data type defined by `pandas` to represent **tabular data**.
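If you want to experiment without the PanTHERIA file, here is a minimal sketch that reads a tiny CSV held in a string instead of a file on disk. The column names and values are made up for illustration, and `toy_data` is a hypothetical variable name.

```python
import io

import pandas as pd

# Toy CSV text standing in for a file on disk (all values are made up).
csv_text = """Genus,Species,Mass_g
GenusA,alpha,1200.5
GenusB,beta,-999
GenusC,gamma,35.0
"""

# read_csv accepts a file path or a file-like object such as StringIO.
toy_data = pd.read_csv(io.StringIO(csv_text))

print(type(toy_data))
print(toy_data)
```

The first line of the CSV text becomes the column names, and each remaining line becomes one row of the `DataFrame`.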

## Exploring `DataFrame`s

We can use the `DataFrame.head()` method to quickly see the first few rows.

In [None]:
species_data.head()

We use the `.shape` **attribute** to obtain the number of rows and columns of a `DataFrame`.

- An **attribute** is accessed with a dot, like a method, but it just stores a piece of data and is not a function.
- You do *not* write parentheses after an attribute name.

In [None]:
species_data.shape

We can access just the number of rows or columns by using indexing on `.shape` (with square brackets), just like with lists.

In [None]:
num_rows = species_data.shape[0]
num_cols = species_data.shape[1]

print(f"There are {num_rows} rows and {num_cols} columns in the dataset.")
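To make the attribute/method difference concrete, here is a small sketch on a made-up `DataFrame` (the name `toy` and its values are hypothetical). It also shows that `.shape` can be unpacked into two variables, and that `.head()` accepts an optional number of rows.

```python
import pandas as pd

# A tiny made-up table with 3 rows and 2 columns.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "mass_g": [1.5, 2.5, 3.5],
})

# .shape is an attribute: no parentheses. It is a (rows, columns) tuple,
# so we can unpack it instead of indexing with [0] and [1].
n_rows, n_cols = toy.shape

# .head() is a method: parentheses are required, and we may pass
# the number of rows we want to see.
first_two = toy.head(2)

print(n_rows, n_cols)
print(first_two)
```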

## `DataFrame` column properties

One of the most important properties of a `DataFrame` is its **columns**.
Each column has two important pieces of "metadata":

- the column's name
- the column's type (i.e., the type of data stored in that column)

We can see the **column names** by accessing the `.columns` attribute of a `DataFrame`.

In [None]:
species_data.columns

The `.columns` attribute has a special type called `Index`, which is like a `list`.

You don't need to worry about what `Index` is exactly, but if you want you can convert it into a `list`:

In [None]:
list(species_data.columns)

We can access *column types* by using the `.dtypes` attribute:

In [None]:
species_data.dtypes

Pandas uses its own custom data types to represent large datasets efficiently.
They typically correspond to Python's built-in data types.

For example:

- `float64` corresponds to a Python `float`
- `object` is a special `dtype` that means "any value"

**Note**: by default, Pandas reads in text column data as `object`, not `string`.

We'll see how to improve this later in this lecture.
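As a quick illustration of column types, the sketch below builds a made-up `DataFrame` and prints its `.dtypes`. The names and values are hypothetical. The numeric columns should show `float64` and `int64`; how the text column is reported may depend on your pandas version.

```python
import pandas as pd

# A made-up table with one text column and two numeric columns.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "mass_g": [1.5, 2.5, 3.5],   # decimal numbers
    "count": [1, 2, 3],          # whole numbers
})

# .dtypes is itself a Series: column names paired with column types.
column_types = toy.dtypes
print(column_types)
```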

Finally, we can use the `DataFrame.info()` **method** to display all of the previous information and more:

In [None]:
species_data.info()

### Summary

Given a `DataFrame`, we can access the following attributes/methods to obtain information about it.

| Attribute/Method | Description                                       |
|------------------|---------------------------------------------------|
| `.shape`         | (number of rows, number of columns)               |
| `.columns`       | column names                                      |
| `.dtypes`        | column names and types                            |
| `.info()`        | all of the above, and more (e.g. non-null counts) |
| `.head()`        | display the first few rows of the `DataFrame`     |


## Data Wrangling: Columns

In data science, **data wrangling** is the process of turning raw data into a format more suitable for subsequent computation, analysis, and visualization. This might more properly be called *data cleaning*.

There are many different types of data wrangling, but for now we'll look at four techniques centred on *columns*:

- renaming columns
- converting column types
- identifying and replacing "invalid" values
- extracting a subset of columns to work with

### Renaming columns

We rename columns by using the `DataFrame.rename(columns=...)` method, where we pass in a **dictionary** mapping "original column name" to "new column name".

In [None]:
old_to_new = {
    'MSW05_Genus': 'Genus',
    'MSW05_Species': 'Species',
    '1-1_ActivityCycle': 'Activity Cycle',
    '5-1_AdultBodyMass_g': 'Adult Body Mass (g)',
    '2-1_AgeatEyeOpening_d': 'Age at Eye Opening (days)',
    '17-1_MaxLongevity_m': 'Max Longevity (months)'
}

species_data_renamed = species_data.rename(columns=old_to_new)
species_data_renamed.head()
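Two details are worth checking on a small example: `rename` returns a *new* `DataFrame` and leaves the original alone, and any column not mentioned in the dictionary keeps its name. The sketch below uses a made-up two-column table (the `other` column is hypothetical).

```python
import pandas as pd

# A made-up table using one of the original PanTHERIA-style column names.
toy = pd.DataFrame({
    "5-1_AdultBodyMass_g": [1.0, 2.0],
    "other": [3, 4],
})

# Only the column named in the dictionary is renamed.
renamed = toy.rename(columns={"5-1_AdultBodyMass_g": "Adult Body Mass (g)"})

print(list(toy.columns))      # the original is unchanged
print(list(renamed.columns))  # the new DataFrame has the new name
```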

### Converting column types

We can also ask Pandas to *automatically choose* the best column types for an existing `DataFrame`.
This is done with the `DataFrame.convert_dtypes()` method.

In [None]:
species_data_converted = species_data_renamed.convert_dtypes()

species_data_converted.dtypes
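Here is a minimal sketch of what `convert_dtypes()` does on a made-up table (names and values are hypothetical). Comparing the two printed outputs shows the before and after types; note that a decimal column holding only whole numbers is converted to an integer type.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b"],
    "litter_size": [1.0, 2.0],  # stored as decimals, but all whole numbers
    "mass_g": [1.5, 2.5],
})

converted = toy.convert_dtypes()

print(toy.dtypes)        # types as originally inferred
print(converted.dtypes)  # types chosen by convert_dtypes()
```

The capitalized names (such as `Int64` and `Float64`) are pandas types that can store missing values as `NA`.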

### Identifying and replacing "missing" values

The PanTHERIA dataset uses a special value, `-999`, to represent missing or unknown data.

Instead of leaving these values in our `DataFrame`, we'll **replace** them with a special `pandas` value called `NA`.

In [None]:
species_data_with_na = species_data_converted.replace(-999, pd.NA)
species_data_with_na.head()
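One way to check that the replacement worked is to count the missing values in each column afterwards. The sketch below does this on a made-up table; it uses the `DataFrame.isna()` method, which is not covered elsewhere in this lecture, and all names and values are hypothetical.

```python
import pandas as pd

# A made-up table where -999 marks missing data.
toy = pd.DataFrame({
    "mass_g": [1200.5, -999, 35.0],
    "longevity_m": [-999, 480.5, -999],
}).convert_dtypes()

toy_with_na = toy.replace(-999, pd.NA)

# isna() gives True wherever a value is missing; summing counts the Trues.
na_counts = toy_with_na.isna().sum()

print(toy_with_na)
print(na_counts)
```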

### Extracting a subset of columns

Sometimes our full dataset contains *too much information*, and we only care about a subset of the data.

One common occurrence is when we only want a *subset of the columns* in a dataset.

For example, suppose we only care about the *genus*, *species*, *body mass*, and *longevity* of each species in our dataset.

We select a subset of columns in two steps:

1. Define a *list* containing the *column names* that we want to select.
2. Use *square bracket "lookup" syntax* on a `DataFrame`, with the list inside the square brackets.

In [None]:
columns_to_keep = [
    'Genus',
    'Species',
    'Adult Body Mass (g)',
    'Max Longevity (months)'
]

species_data_final = species_data_with_na[columns_to_keep]
species_data_final.head()

## Data Transformation: computing on columns

A typical step in the analysis of a dataset is to perform computations on individual columns, or operations that combine columns in some way.

For example:

- Add 1 to each value in a column
- Multiply the values in two columns together
- "Find and Replace" values in a column

### Retrieving a column by name

We can extract a *single* column from a `DataFrame` using square brackets with a *single string* instead of a list of strings.

In [None]:
masses = species_data_final['Adult Body Mass (g)']

masses

But what exactly is `masses`?

In [None]:
type(masses)

`masses` is a `Series`, which is a `pandas` data type that represents a single column of data.

A `Series` is similar to a `DataFrame`, but it can only hold one "series" of data, rather than storing a whole table.

But most of the descriptive attributes/methods we learned for `DataFrame`s can be applied to `Series` as well:

In [None]:
masses.shape

In [None]:
masses.dtypes

In [None]:
masses.info()

In [None]:
# We can even obtain the original column name from the Series
masses.name

But if `Series` are a simplified version of `DataFrame`s, why bother with them?

Because we can perform computations on `Series` "one element at a time", without needing to use for loops!
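To see what "without needing to use for loops" saves us, here is a small sketch comparing the two approaches on a made-up `Series` of three masses (the first value is the one used in the example that follows; the others are invented).

```python
import pandas as pd

masses_g = pd.Series([492714.47, 1200.0, 35.5])

# Base Python approach: loop over the values one at a time.
kg_loop = []
for mass in masses_g:
    kg_loop.append(mass / 1000)

# pandas approach: one expression applies the division to every element.
kg_series = masses_g / 1000

print(kg_loop)
print(kg_series)
```

Both versions compute the same numbers, but the `Series` version is shorter and returns another `Series` that we can keep computing with.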

### Example: transform a single Series

**Goal**: Given the species masses, convert to kg by dividing each one by 1000 and rounding to one decimal place.

Example: for a single mass like `492714.47`, we'd compute

```python
round(492714.47 / 1000, 1)  # 492.7
```

But we want to do this for every mass!

In [None]:
masses_kg = masses / 1000
masses_kg_rounded = masses_kg.round(1)
masses_kg_rounded

### Example: combine two `Series`

Now let's consider another problem: we'll calculate the ratio between the longevity and mass of each species.

Example: for *Camelus dromedarius*, we'll compute

```python
480.0 / 492714.47
```

But again, we want to do this for each species!

In [None]:
masses = species_data_final["Adult Body Mass (g)"]
longevities = species_data_final["Max Longevity (months)"]


longevities / masses

### Adding a new column to a `DataFrame`

In addition to creating new variables to store computed `Series`, it is common to modify an existing `DataFrame` by adding a computed `Series` as a new column.

We can do this using square bracket notation again, this time on the left-hand side of an assignment statement.

In [None]:
# You don't need to worry about the following line.
# It just hides a warning message that's beyond the scope of this course.
pd.set_option('mode.chained_assignment', None)

species_data_final["Longevity-to-Mass Ratio"] = longevities / masses

species_data_final

### WARNING!

**Warning**: the previous code cell *changes* the existing data frame `species_data_final`, rather than creating a new `DataFrame`.
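If you want to keep the original `DataFrame` untouched, one option (an addition to what this lecture covers) is to make a copy with the `DataFrame.copy()` method and add the column to the copy. The sketch below uses a made-up table; all names and values are hypothetical.

```python
import pandas as pd

original = pd.DataFrame({
    "mass_g": [1000.0, 2000.0],
    "longevity_m": [10.0, 40.0],
})

# Work on a copy so that the original stays as it was.
working = original.copy()
working["ratio"] = working["longevity_m"] / working["mass_g"]

print(list(original.columns))  # no "ratio" column here
print(list(working.columns))   # the copy has the new column
```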

## Boolean `Series` and filtering rows

Another common type of data transformation is to **filter** for specific rows in a dataset based on one or more conditions.

**Goal**: filter the rows of the dataset to keep the species with a mass *greater than or equal to 100 kg*.

As a first step, we create a *boolean `Series`* that stores `True` for the rows we want to keep, and `False` for the other rows.

In [None]:
is_large = species_data_final["Adult Body Mass (g)"] >= 100000
is_large

Then, we use this `Series` to index `species_data_final` by using square bracket notation.

In [None]:
species_data_final[is_large]

### Note: lots of square brackets!

One of the tricky things about `DataFrame`s is that there are different ways of obtaining subsets of the dataset that all have very similar code syntax:

```python
species_data_final[...]
```

The key principle is that **the type of the value inside the square brackets determines what kind of "subsetting" operation is being performed**.

| Type inside `[...]` | Example                    | Return type   | Which columns?  | Which rows? |
|---------------------|----------------------------|---------------|-----------|-------|
| `str`               | `species_data_final["Adult Body Mass (g)"]` | `Series` | The one specified | All rows |
| `list` of `str`     | `species_data_final[["Genus", "Species"]]` | `DataFrame` | The ones specified | All rows |
| `Series` of `bool`  | `species_data_final[is_large]` | `DataFrame` | All columns | The ones specified |
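The three rows of the table can be tried side by side on a small made-up `DataFrame` (names and values are hypothetical). Checking the type and shape of each result is a quick way to confirm which kind of subsetting happened.

```python
import pandas as pd

toy = pd.DataFrame({
    "Genus": ["A", "B", "C"],
    "Species": ["x", "y", "z"],
    "Mass": [5.0, 150.0, 80.0],
})

one_column = toy["Mass"]                 # str -> Series
two_columns = toy[["Genus", "Species"]]  # list of str -> DataFrame
big_rows = toy[toy["Mass"] >= 100]       # Series of bool -> DataFrame

print(type(one_column), one_column.shape)
print(type(two_columns), two_columns.shape)
print(type(big_rows), big_rows.shape)
```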

### Logical operators: `&` and `|`

Sometimes we want to filter on two conditions.
To start, suppose we have these two boolean `Series`:

In [None]:
is_large = species_data_final["Adult Body Mass (g)"] >= 100000
is_long_lived = species_data_final["Max Longevity (months)"] >= 240

There are two common ways to filter based on a combination of these two conditions.

**Filter 1**: find rows where the species is large **and** is long-lived.

To do this, we use the `&` operator to combine the two `Series`.

In [None]:
filter1 = is_large & is_long_lived

species_data_final[filter1]

**Filter 2**: find rows where the species is large **or** is long-lived.

To do this, we use the `|` operator to combine the two `Series`.

In [None]:
filter2 = is_large | is_long_lived

species_data_final[filter2]
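Here is a small sketch of both filters on a made-up table, so you can check the results by eye. It also shows a detail that is easy to trip over: if you write the comparisons directly inside the square brackets instead of storing them in variables first, each comparison needs its own parentheses, because `&` and `|` are evaluated before `>=`. All names and values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "mass_g": [150000.0, 500.0, 200000.0, 50.0],
    "longevity_m": [480.0, 300.0, 100.0, 20.0],
})

is_large = toy["mass_g"] >= 100000
is_long_lived = toy["longevity_m"] >= 240

both = toy[is_large & is_long_lived]    # only row 0 meets both conditions
either = toy[is_large | is_long_lived]  # rows 0, 1, and 2 meet at least one

# Same "and" filter written in one line: note the parentheses.
both_inline = toy[(toy["mass_g"] >= 100000) & (toy["longevity_m"] >= 240)]

print(both)
print(either)
```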

## Exploratory analysis: sorting and basic descriptive statistics

### Sorting

Suppose we want to take our `DataFrame` and sort it by the `"Adult Body Mass (g)"` column to see which species have the largest mass.

We do this by using the `DataFrame.sort_values(by=...)` method, where we pass in a `str` that names the column to sort by.

In [None]:
species_data_final.sort_values(by="Adult Body Mass (g)")

By default, the column values are sorted in *ascending* (low-to-high) order.

If we want to sort in *descending* (high-to-low) order, we can pass in an *optional* argument `ascending=False` to `DataFrame.sort_values`:

In [None]:
species_data_final.sort_values(by="Adult Body Mass (g)", ascending=False)
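As with `rename`, `sort_values` returns a new `DataFrame` rather than reordering the original. The sketch below checks this on a made-up table (names and values are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "mass_g": [80.0, 5.0, 150.0],
})

lightest_first = toy.sort_values(by="mass_g")
heaviest_first = toy.sort_values(by="mass_g", ascending=False)

print(list(lightest_first["name"]))  # ['b', 'a', 'c']
print(list(heaviest_first["name"]))  # ['c', 'a', 'b']
print(list(toy["name"]))             # ['a', 'b', 'c'] (original order kept)
```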

### Descriptive statistics

Here are five simple *descriptive statistics* that we can use to describe a collection of numbers:

- sum
- count (i.e., size; number of elements)
- mean (average)
- min
- max

Unsurprisingly, we can compute all of these on any Pandas `Series` containing numeric data by calling a corresponding `Series` method.

| Statistic | `Series` method |
|-----------|-----------------|
| sum       | `Series.sum()`  |
| count     | `Series.count()` |
| mean      | `Series.mean()` |
| min       | `Series.min()`  |
| max       | `Series.max()`  |

**Note**: all five of these methods *ignore* `NA` values.

Let's start by extracting the body mass column (again).

In [None]:
totals = species_data_final["Adult Body Mass (g)"]
totals.head()

In [None]:
totals.sum()

In [None]:
totals.count()

In [None]:
totals.mean()

In [None]:
totals.min()

In [None]:
totals.max()
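To see in isolation how these methods treat `NA` values, here is a minimal sketch on a made-up three-element `Series` with one missing value. The `dtype="Float64"` argument is an assumption chosen so that the `Series` can hold `pd.NA` alongside decimal numbers.

```python
import pandas as pd

values = pd.Series([10.0, pd.NA, 30.0], dtype="Float64")

print(values.sum())    # 40.0: the NA is skipped
print(values.count())  # 2: only non-missing values are counted
print(values.mean())   # 20.0: the sum divided by the count, not by 3
print(values.min())    # 10.0
print(values.max())    # 30.0
```

Notice that the mean is computed over the two known values only, so missing data does not drag the average down.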

## Further reading

`pandas` is the most complex part of Python we've studied so far in this course, and so we expect you'll need to review and practice more as we dive deeper into this library.

The official Pandas website has some great introductory materials, including:

- [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [*Getting Started* tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
