# EEB125 Lecture 7: Introduction to Pandas

## Feb 26, 2025

## Karen Reid

## Introducing `pandas`

So far this semester, you've worked in *base Python*, using only types of data, functions, and methods that are built into Python.

For the next few weeks, we'll learn how to use one of the most common **libraries** for doing data science in Python: `pandas`.

## What is `pandas`?

[`pandas`](https://pandas.pydata.org/) "is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language." 

<img width="400" src="https://cdn.britannica.com/80/150980-050-84B9202C/Giant-panda-cub-branch.jpg" alt="Image of a panda"/>

Today, we'll learn how to use `pandas` to:

- Read in a dataset from a CSV file
- Identify, use, and differentiate two new Pandas data types, `DataFrame` and `Series`
- Describe the properties of a dataset representation in Pandas
- Inspect parts of a large dataset
- Perform simple *data cleaning* and *data transformation* operations on a dataset
- Compute some summary statistics on a dataset

## Importing `pandas`

Because `pandas` doesn't come built-in with Python, we need to **import it** to be able to use it in our code.

This is done with a Python statement called an **import statement**.

In [39]:
import pandas

A common alternative is to import `pandas` under a shorter name (a "nickname"). The rest of this lecture uses the nickname `pd`:

In [None]:
import pandas as pd

## Reading in data from a CSV file

Using `pandas`, we can read in data from a CSV file using the `read_csv` function.

In [3]:
species_data = pd.read_csv('PanTHERIA_WR05_Aug2008.csv')
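To try `read_csv` without the PanTHERIA file, here is a minimal sketch that reads a tiny CSV from a string instead of from disk. The CSV text, its column names, and the variable names are made up for illustration; `io.StringIO` simply makes a string behave like an open file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """Genus,Species,Mass_g
Canis,lupus,31756.51
Canis,latrans,11989.10
"""

# read_csv accepts a file path or a file-like object such as StringIO.
tiny_data = pd.read_csv(io.StringIO(csv_text))

print(tiny_data)
```

The first line of the CSV text becomes the column names, and each later line becomes one row.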

Let's explore: what is `species_data`?

In [41]:
species_data

Unnamed: 0,MSW05_Order,MSW05_Family,MSW05_Genus,MSW05_Species,MSW05_Binomial,1-1_ActivityCycle,5-1_AdultBodyMass_g,8-1_AdultForearmLen_mm,13-1_AdultHeadBodyLen_mm,2-1_AgeatEyeOpening_d,...,26-6_GR_MinLong_dd,26-7_GR_MidRangeLong_dd,27-1_HuPopDen_Min_n/km2,27-2_HuPopDen_Mean_n/km2,27-3_HuPopDen_5p_n/km2,27-4_HuPopDen_Change,28-1_Precip_Mean_mm,28-2_Temp_Mean_01degC,30-1_AET_Mean_mm,30-2_PET_Mean_mm
0,Artiodactyla,Camelidae,Camelus,dromedarius,Camelus dromedarius,3,492714.47,-999.0,-999.00,-999.00,...,-999.00,-999.00,-999,-999.00,-999.0,-999.00,-999.00,-999.00,-999.00,-999.00
1,Carnivora,Canidae,Canis,adustus,Canis adustus,1,10392.49,-999.0,745.32,-999.00,...,-17.53,13.00,0,35.20,1.0,0.14,90.75,236.51,922.90,1534.40
2,Carnivora,Canidae,Canis,aureus,Canis aureus,2,9658.70,-999.0,827.53,7.50,...,-17.05,45.74,0,79.29,0.0,0.10,44.61,217.23,438.02,1358.98
3,Carnivora,Canidae,Canis,latrans,Canis latrans,2,11989.10,-999.0,872.39,11.94,...,-168.12,-117.60,0,27.27,0.0,0.06,53.03,58.18,503.02,728.37
4,Carnivora,Canidae,Canis,lupus,Canis lupus,2,31756.51,-999.0,1055.00,14.01,...,-171.84,3.90,0,37.87,0.0,0.04,34.79,4.82,313.33,561.11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5411,Rodentia,Muridae,Zyzomys,argurus,Zyzomys argurus,-999,40.42,-999.0,107.83,-999.00,...,114.33,131.09,0,1.10,0.0,0.02,62.33,256.75,692.93,1704.98
5412,Rodentia,Muridae,Zyzomys,maini,Zyzomys maini,-999,93.99,-999.0,-999.00,-999.00,...,131.45,132.66,0,0.17,0.0,0.00,90.76,265.30,877.90,1755.73
5413,Rodentia,Muridae,Zyzomys,palatilis,Zyzomys palatilis,-999,123.00,-999.0,-999.00,-999.00,...,136.72,137.08,0,0.00,0.0,-999.00,49.00,247.16,637.90,1638.67
5414,Rodentia,Muridae,Zyzomys,pedunculatus,Zyzomys pedunculatus,-999,100.00,-999.0,126.79,-999.00,...,130.16,132.97,0,0.09,0.0,0.25,21.64,215.72,291.82,1405.85


In [5]:
type(species_data)

pandas.core.frame.DataFrame

Formally, `species_data` is a `DataFrame`, which is a custom data type defined by `pandas` to represent **tabular data**.

## Exploring `DataFrame`s

We can use the `DataFrame.head()` method to quickly see the first few rows.

In [42]:
species_data.head()

Unnamed: 0,MSW05_Order,MSW05_Family,MSW05_Genus,MSW05_Species,MSW05_Binomial,1-1_ActivityCycle,5-1_AdultBodyMass_g,8-1_AdultForearmLen_mm,13-1_AdultHeadBodyLen_mm,2-1_AgeatEyeOpening_d,...,26-6_GR_MinLong_dd,26-7_GR_MidRangeLong_dd,27-1_HuPopDen_Min_n/km2,27-2_HuPopDen_Mean_n/km2,27-3_HuPopDen_5p_n/km2,27-4_HuPopDen_Change,28-1_Precip_Mean_mm,28-2_Temp_Mean_01degC,30-1_AET_Mean_mm,30-2_PET_Mean_mm
0,Artiodactyla,Camelidae,Camelus,dromedarius,Camelus dromedarius,3,492714.47,-999.0,-999.0,-999.0,...,-999.0,-999.0,-999,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
1,Carnivora,Canidae,Canis,adustus,Canis adustus,1,10392.49,-999.0,745.32,-999.0,...,-17.53,13.0,0,35.2,1.0,0.14,90.75,236.51,922.9,1534.4
2,Carnivora,Canidae,Canis,aureus,Canis aureus,2,9658.7,-999.0,827.53,7.5,...,-17.05,45.74,0,79.29,0.0,0.1,44.61,217.23,438.02,1358.98
3,Carnivora,Canidae,Canis,latrans,Canis latrans,2,11989.1,-999.0,872.39,11.94,...,-168.12,-117.6,0,27.27,0.0,0.06,53.03,58.18,503.02,728.37
4,Carnivora,Canidae,Canis,lupus,Canis lupus,2,31756.51,-999.0,1055.0,14.01,...,-171.84,3.9,0,37.87,0.0,0.04,34.79,4.82,313.33,561.11


We use the `.shape` **attribute** to obtain the number of rows and columns of a `DataFrame`.

- An **attribute** is like a method, but it just stores a piece of data, and is not a function.
- You do *not* write parentheses after an attribute name.

In [7]:
species_data.shape

(5416, 55)

We can access just the number of rows or columns by using indexing on `.shape` (with square brackets), just like with lists.

In [8]:
num_rows = species_data.shape[0]
num_cols = species_data.shape[1]

print(f"There are {num_rows} rows and {num_cols} columns in the dataset.")

There are 5416 rows and 55 columns in the dataset.
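The difference between an attribute and a method can be sketched on a small, made-up `DataFrame` (the column names and values below are invented). It also shows tuple unpacking as an alternative to indexing `.shape`.

```python
import pandas as pd

# A small, made-up DataFrame for illustration.
demo = pd.DataFrame({
    "Genus": ["Canis", "Canis", "Bos"],
    "Mass_g": [31756.51, 11989.10, 500000.0],
})

# .shape is an attribute: no parentheses.
# It is a tuple, so it can be unpacked into two variables.
n_rows, n_cols = demo.shape

# .head() is a method: it needs parentheses and can take an argument.
first_two = demo.head(2)

print(n_rows, n_cols)
print(first_two)
```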


## `DataFrame` columns properties

One of the most important properties of a `DataFrame` is its **columns**.
Each column has two important pieces of "metadata":

- the column's name
- the column's type (i.e., the type of data stored in that column)

We can see the **column names** by accessing the `.columns` attribute of a `DataFrame`.

In [9]:
species_data.columns

Index(['MSW05_Order', 'MSW05_Family', 'MSW05_Genus', 'MSW05_Species',
       'MSW05_Binomial', '1-1_ActivityCycle', '5-1_AdultBodyMass_g',
       '8-1_AdultForearmLen_mm', '13-1_AdultHeadBodyLen_mm',
       '2-1_AgeatEyeOpening_d', '3-1_AgeatFirstBirth_d',
       '18-1_BasalMetRate_mLO2hr', '5-2_BasalMetRateMass_g', '6-1_DietBreadth',
       '7-1_DispersalAge_d', '9-1_GestationLen_d', '12-1_HabitatBreadth',
       '22-1_HomeRange_km2', '22-2_HomeRange_Indiv_km2',
       '14-1_InterbirthInterval_d', '15-1_LitterSize', '16-1_LittersPerYear',
       '17-1_MaxLongevity_m', '5-3_NeonateBodyMass_g',
       '13-2_NeonateHeadBodyLen_mm', '21-1_PopulationDensity_n/km2',
       '10-1_PopulationGrpSize', '23-1_SexualMaturityAge_d',
       '10-2_SocialGrpSize', '24-1_TeatNumber', '12-2_Terrestriality',
       '6-2_TrophicLevel', '25-1_WeaningAge_d', '5-4_WeaningBodyMass_g',
       '13-3_WeaningHeadBodyLen_mm', 'References', '5-5_AdultBodyMass_g_EXT',
       '16-2_LittersPerYear_EXT', '5-6_NeonateB

The `.columns` attribute has a special type called `Index`, which is like a `list`.

You don't need to worry about what `Index` is exactly, but if you want you can convert it into a `list`:

In [10]:
list(species_data.columns)

['MSW05_Order',
 'MSW05_Family',
 'MSW05_Genus',
 'MSW05_Species',
 'MSW05_Binomial',
 '1-1_ActivityCycle',
 '5-1_AdultBodyMass_g',
 '8-1_AdultForearmLen_mm',
 '13-1_AdultHeadBodyLen_mm',
 '2-1_AgeatEyeOpening_d',
 '3-1_AgeatFirstBirth_d',
 '18-1_BasalMetRate_mLO2hr',
 '5-2_BasalMetRateMass_g',
 '6-1_DietBreadth',
 '7-1_DispersalAge_d',
 '9-1_GestationLen_d',
 '12-1_HabitatBreadth',
 '22-1_HomeRange_km2',
 '22-2_HomeRange_Indiv_km2',
 '14-1_InterbirthInterval_d',
 '15-1_LitterSize',
 '16-1_LittersPerYear',
 '17-1_MaxLongevity_m',
 '5-3_NeonateBodyMass_g',
 '13-2_NeonateHeadBodyLen_mm',
 '21-1_PopulationDensity_n/km2',
 '10-1_PopulationGrpSize',
 '23-1_SexualMaturityAge_d',
 '10-2_SocialGrpSize',
 '24-1_TeatNumber',
 '12-2_Terrestriality',
 '6-2_TrophicLevel',
 '25-1_WeaningAge_d',
 '5-4_WeaningBodyMass_g',
 '13-3_WeaningHeadBodyLen_mm',
 'References',
 '5-5_AdultBodyMass_g_EXT',
 '16-2_LittersPerYear_EXT',
 '5-6_NeonateBodyMass_g_EXT',
 '5-7_WeaningBodyMass_g_EXT',
 '26-1_GR_Area_km2',

We can access *column types* by using the `.dtypes` attribute:

In [11]:
species_data.dtypes

MSW05_Order                      object
MSW05_Family                     object
MSW05_Genus                      object
MSW05_Species                    object
MSW05_Binomial                   object
1-1_ActivityCycle                 int64
5-1_AdultBodyMass_g             float64
8-1_AdultForearmLen_mm          float64
13-1_AdultHeadBodyLen_mm        float64
2-1_AgeatEyeOpening_d           float64
3-1_AgeatFirstBirth_d           float64
18-1_BasalMetRate_mLO2hr        float64
5-2_BasalMetRateMass_g          float64
6-1_DietBreadth                   int64
7-1_DispersalAge_d              float64
9-1_GestationLen_d              float64
12-1_HabitatBreadth               int64
22-1_HomeRange_km2              float64
22-2_HomeRange_Indiv_km2        float64
14-1_InterbirthInterval_d       float64
15-1_LitterSize                 float64
16-1_LittersPerYear             float64
17-1_MaxLongevity_m             float64
5-3_NeonateBodyMass_g           float64
13-2_NeonateHeadBodyLen_mm      float64


Pandas uses its own custom data types to represent large datasets efficiently.
They typically correspond to Python's built-in data types.

For example:

- `float64` corresponds to a Python `float`
- `object` is a special `dtype` that means "any value"

**Note**: by default, Pandas reads in text column data as `object`, not `string`.

We'll see how to improve this later in this lecture.

Finally, we can use the `DataFrame.info()` **method** to display all of the previous information and more:

In [12]:
species_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5416 entries, 0 to 5415
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   MSW05_Order                   5416 non-null   object 
 1   MSW05_Family                  5416 non-null   object 
 2   MSW05_Genus                   5416 non-null   object 
 3   MSW05_Species                 5416 non-null   object 
 4   MSW05_Binomial                5416 non-null   object 
 5   1-1_ActivityCycle             5416 non-null   int64  
 6   5-1_AdultBodyMass_g           5416 non-null   float64
 7   8-1_AdultForearmLen_mm        5416 non-null   float64
 8   13-1_AdultHeadBodyLen_mm      5416 non-null   float64
 9   2-1_AgeatEyeOpening_d         5416 non-null   float64
 10  3-1_AgeatFirstBirth_d         5416 non-null   float64
 11  18-1_BasalMetRate_mLO2hr      5416 non-null   float64
 12  5-2_BasalMetRateMass_g        5416 non-null   float64
 13  6-1

### Summary

Given a `DataFrame`, we can access the following attributes/methods to obtain information about it.

| Attribute/Method | Description                                       |
|------------------|---------------------------------------------------|
| `.shape`         | (number of rows, number of columns)               |
| `.columns`       | column names                                      |
| `.dtypes`        | column names and types                            |
| `.info()`        | all of the above, and more (e.g. non-null counts) |
| `.head()`        | display the first few rows of the `DataFrame`     |


## Data Wrangling: Columns

In data science, **data wrangling** is the process of turning raw data into a format more suitable for subsequent computation, analysis, and visualization. This might be more properly called *data cleaning*.

There are many different types of data wrangling, but for now we'll look at four techniques centred on *columns*:

- renaming columns
- converting column types
- identifying and replacing "invalid" values
- extracting a subset of columns to work with

### Renaming columns

We rename columns by using the `DataFrame.rename(columns=...)` method, where we pass in a **dictionary** mapping "original column name" to "new column name".

In [13]:
old_to_new = {
    'MSW05_Genus': 'Genus',
    'MSW05_Species': 'Species',
    '1-1_ActivityCycle': 'Activity Cycle',
    '5-1_AdultBodyMass_g': 'Adult Body Mass (g)',
    '2-1_AgeatEyeOpening_d': 'Age at Eye Opening (days)',
    '17-1_MaxLongevity_m': 'Max Longevity (months)'
}

species_data_renamed = species_data.rename(columns=old_to_new)
species_data_renamed.head()

Unnamed: 0,MSW05_Order,MSW05_Family,Genus,Species,MSW05_Binomial,Activity Cycle,Adult Body Mass (g),8-1_AdultForearmLen_mm,13-1_AdultHeadBodyLen_mm,Age at Eye Opening (days),...,26-6_GR_MinLong_dd,26-7_GR_MidRangeLong_dd,27-1_HuPopDen_Min_n/km2,27-2_HuPopDen_Mean_n/km2,27-3_HuPopDen_5p_n/km2,27-4_HuPopDen_Change,28-1_Precip_Mean_mm,28-2_Temp_Mean_01degC,30-1_AET_Mean_mm,30-2_PET_Mean_mm
0,Artiodactyla,Camelidae,Camelus,dromedarius,Camelus dromedarius,3,492714.47,-999.0,-999.0,-999.0,...,-999.0,-999.0,-999,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
1,Carnivora,Canidae,Canis,adustus,Canis adustus,1,10392.49,-999.0,745.32,-999.0,...,-17.53,13.0,0,35.2,1.0,0.14,90.75,236.51,922.9,1534.4
2,Carnivora,Canidae,Canis,aureus,Canis aureus,2,9658.7,-999.0,827.53,7.5,...,-17.05,45.74,0,79.29,0.0,0.1,44.61,217.23,438.02,1358.98
3,Carnivora,Canidae,Canis,latrans,Canis latrans,2,11989.1,-999.0,872.39,11.94,...,-168.12,-117.6,0,27.27,0.0,0.06,53.03,58.18,503.02,728.37
4,Carnivora,Canidae,Canis,lupus,Canis lupus,2,31756.51,-999.0,1055.0,14.01,...,-171.84,3.9,0,37.87,0.0,0.04,34.79,4.82,313.33,561.11
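One detail worth checking for yourself: `rename` returns a *new* `DataFrame` and leaves the original unchanged, which is why the cell above stores the result in a new variable. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data reusing two of the original column names.
raw = pd.DataFrame({
    "MSW05_Genus": ["Canis"],
    "MSW05_Species": ["lupus"],
})

renamed = raw.rename(columns={"MSW05_Genus": "Genus"})

print(list(raw.columns))      # the original names are unchanged
print(list(renamed.columns))  # only the mapped column is renamed
```

Columns that are not keys in the dictionary keep their original names.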


### Converting column types

We can also ask Pandas to *automatically choose* the best column types for an existing `DataFrame`.
This is done with the `DataFrame.convert_dtypes()` method.

In [14]:
species_data_converted = species_data_renamed.convert_dtypes()

species_data_converted.dtypes

MSW05_Order                     string[python]
MSW05_Family                    string[python]
Genus                           string[python]
Species                         string[python]
MSW05_Binomial                  string[python]
Activity Cycle                           Int64
Adult Body Mass (g)                    Float64
8-1_AdultForearmLen_mm                 Float64
13-1_AdultHeadBodyLen_mm               Float64
Age at Eye Opening (days)              Float64
3-1_AgeatFirstBirth_d                  Float64
18-1_BasalMetRate_mLO2hr               Float64
5-2_BasalMetRateMass_g                 Float64
6-1_DietBreadth                          Int64
7-1_DispersalAge_d                     Float64
9-1_GestationLen_d                     Float64
12-1_HabitatBreadth                      Int64
22-1_HomeRange_km2                     Float64
22-2_HomeRange_Indiv_km2               Float64
14-1_InterbirthInterval_d              Float64
15-1_LitterSize                        Float64
16-1_LittersP
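The effect of `convert_dtypes()` is easier to see on a small example. The sketch below uses made-up data and prints the column types before and after conversion. The exact names printed for the converted types (for example `string[python]`) may differ between `pandas` versions.

```python
import pandas as pd

# Made-up data: one text column, one decimal column, one integer column.
demo = pd.DataFrame({
    "Genus": ["Canis", "Bos"],
    "Litter Size": [4.5, 1.2],
    "Diet Breadth": [3, 1],
})

print(demo.dtypes)       # the types pandas inferred originally

converted = demo.convert_dtypes()
print(converted.dtypes)  # the types chosen by convert_dtypes()
```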

### Identifying and replacing "missing" values

The PanTHERIA dataset uses a special value, `-999`, to represent missing or unknown data.

Instead of leaving these values in our `DataFrame`, we'll **replace** them with a special `pandas` value called `NA`.

In [15]:
species_data_with_na = species_data_converted.replace(-999, pd.NA)
species_data_with_na.head()

Unnamed: 0,MSW05_Order,MSW05_Family,Genus,Species,MSW05_Binomial,Activity Cycle,Adult Body Mass (g),8-1_AdultForearmLen_mm,13-1_AdultHeadBodyLen_mm,Age at Eye Opening (days),...,26-6_GR_MinLong_dd,26-7_GR_MidRangeLong_dd,27-1_HuPopDen_Min_n/km2,27-2_HuPopDen_Mean_n/km2,27-3_HuPopDen_5p_n/km2,27-4_HuPopDen_Change,28-1_Precip_Mean_mm,28-2_Temp_Mean_01degC,30-1_AET_Mean_mm,30-2_PET_Mean_mm
0,Artiodactyla,Camelidae,Camelus,dromedarius,Camelus dromedarius,3,492714.47,,,,...,,,,,,,,,,
1,Carnivora,Canidae,Canis,adustus,Canis adustus,1,10392.49,,745.32,,...,-17.53,13.0,0.0,35.2,1.0,0.14,90.75,236.51,922.9,1534.4
2,Carnivora,Canidae,Canis,aureus,Canis aureus,2,9658.7,,827.53,7.5,...,-17.05,45.74,0.0,79.29,0.0,0.1,44.61,217.23,438.02,1358.98
3,Carnivora,Canidae,Canis,latrans,Canis latrans,2,11989.1,,872.39,11.94,...,-168.12,-117.6,0.0,27.27,0.0,0.06,53.03,58.18,503.02,728.37
4,Carnivora,Canidae,Canis,lupus,Canis lupus,2,31756.51,,1055.0,14.01,...,-171.84,3.9,0.0,37.87,0.0,0.04,34.79,4.82,313.33,561.11
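After replacing the placeholder value, it is often useful to count how many values are missing in each column. The sketch below does this on made-up data; `isna()` and `sum()` are real `pandas` methods, but the column names and values are invented for illustration.

```python
import pandas as pd

# Made-up values; -999 marks a missing measurement, as in PanTHERIA.
demo = pd.DataFrame({
    "Species": ["lupus", "latrans", "aureus"],
    "Mass_g": [31756.51, -999.0, 9658.70],
    "Longevity_m": [354.0, 262.8, -999.0],
}).convert_dtypes()

cleaned = demo.replace(-999, pd.NA)

# isna() marks each missing entry as True;
# sum() then counts the True values in each column.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```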


### Extracting a subset of columns

Sometimes our full dataset contains *too much information*, and we only care about a subset of the data.

One common occurrence is when we only want a *subset of the columns* in a dataset.

For example, suppose we only care about the *genus*, *species*, *body mass*, and *longevity* of each species in our dataset.

We select a subset of columns in two steps:

1. Define a *list* containing the *column names* that we want to select.
2. Use *square bracket "lookup" syntax* on a `DataFrame`, with the list inside the square brackets.

In [16]:
columns_to_keep = [
    'Genus',
    'Species',
    'Adult Body Mass (g)',
    'Max Longevity (months)'
]

species_data_final = species_data_with_na[columns_to_keep]
species_data_final.head()

Unnamed: 0,Genus,Species,Adult Body Mass (g),Max Longevity (months)
0,Camelus,dromedarius,492714.47,480.0
1,Canis,adustus,10392.49,137.0
2,Canis,aureus,9658.7,192.0
3,Canis,latrans,11989.1,262.0
4,Canis,lupus,31756.51,354.0


## Data Transformation: computing on columns

A typical step in the analysis of a dataset is to perform computations on individual columns, or operations that combine columns in some way.

For example:

- Add 1 to each value in a column
- Multiply the values in two columns together
- "Find and Replace" values in a column

### Retrieving a column by name

We can extract a *single* column from a `DataFrame` using square brackets with a *single string* instead of a list of strings.

In [17]:
masses = species_data_final['Adult Body Mass (g)']

masses

0       492714.47
1        10392.49
2          9658.7
3         11989.1
4        31756.51
          ...    
5411        40.42
5412        93.99
5413        123.0
5414        100.0
5415        95.02
Name: Adult Body Mass (g), Length: 5416, dtype: Float64

But what exactly is `masses`?

In [18]:
type(masses)

pandas.core.series.Series

`masses` is a `Series`, which is a `pandas` data type that represents a single column of data.

A `Series` is similar to a `DataFrame`, but it can only hold one "series" of data, rather than storing a whole table.

But most of the descriptive attributes/methods we learned for `DataFrame`s can be applied to `Series` as well:

In [19]:
masses.shape

(5416,)

In [20]:
masses.dtypes

Float64Dtype()

In [21]:
masses.info()

<class 'pandas.core.series.Series'>
RangeIndex: 5416 entries, 0 to 5415
Series name: Adult Body Mass (g)
Non-Null Count  Dtype  
--------------  -----  
3542 non-null   Float64
dtypes: Float64(1)
memory usage: 47.7 KB


In [22]:
# We can even obtain the original column name from the Series
masses.name

'Adult Body Mass (g)'

But if `Series` are a simplified version of `DataFrame`s, why bother with them?

Because we can perform computations on `Series` "one element at a time", without needing to use for loops!
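To make the contrast concrete, here is a minimal sketch that performs the same computation twice on a made-up `Series`: once with a base-Python `for` loop and once with a single `Series` expression.

```python
import pandas as pd

masses_g = pd.Series([492714.47, 10392.49, 9658.70])  # made-up sample

# Base-Python approach: visit each element in a loop.
loop_result = []
for mass in masses_g:
    loop_result.append(mass / 1000)

# pandas approach: one expression is applied to every element.
vectorized_result = masses_g / 1000

print(loop_result)
print(vectorized_result.tolist())
```

Both approaches compute the same numbers, but the second is shorter and returns a `Series` that can be used in further computations.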

### Example: transform a single Series

**Goal**: Given the species masses, convert to kg by dividing each one by 1000 and rounding to one decimal place.

Example: for a single mass like `492714.47`, we'd compute

```python
round(492714.47 / 1000, 1)  # 492.7
```

But we want to do this for every mass!

In [23]:
masses_kg = masses / 1000
masses_kg_rounded = masses_kg.round(1)
masses_kg_rounded

0       492.7
1        10.4
2         9.7
3        12.0
4        31.8
        ...  
5411      0.0
5412      0.1
5413      0.1
5414      0.1
5415      0.1
Name: Adult Body Mass (g), Length: 5416, dtype: Float64

### Example: combine two `Series`

Now let's consider another problem: we'll calculate the ratio between the longevity and mass of each species.

Example: for *Camelus dromedarius*, we'll compute

```python
480.0 / 492714.47
```

But again, we want to do this for each species!

In [24]:
masses = species_data_final["Adult Body Mass (g)"]
longevities = species_data_final["Max Longevity (months)"]


longevities / masses

0       0.000974
1       0.013183
2       0.019878
3       0.021853
4       0.011147
          ...   
5411        <NA>
5412        <NA>
5413        <NA>
5414        <NA>
5415        <NA>
Length: 5416, dtype: Float64
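The `<NA>` entries in the output above appear wherever either input value is missing. A minimal sketch with made-up values (the variable names are hypothetical):

```python
import pandas as pd

# Made-up values with one missing entry in each Series.
longevity_m = pd.Series([480.0, 137.0, pd.NA], dtype="Float64")
mass_g = pd.Series([492714.47, pd.NA, 9658.70], dtype="Float64")

# The division is done row by row; a missing input gives a missing result.
ratio = longevity_m / mass_g

print(ratio)
print(ratio.isna().tolist())
```

Only the first row has both values present, so only the first ratio is a number.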

### Adding a new column to a `DataFrame`

In addition to creating new variables to store computed `Series`, it is common to modify an existing `DataFrame` by adding a computed `Series` as a new column.

We can do this using square bracket notation again, this time on the left-hand side of an assignment statement.

In [25]:
# You don't need to worry about the following line.
# It just hides a warning message that's beyond the scope of this course.
pd.set_option('mode.chained_assignment', None)

species_data_final["Longevity-to-Mass Ratio"] = longevities / masses

species_data_final

Unnamed: 0,Genus,Species,Adult Body Mass (g),Max Longevity (months),Longevity-to-Mass Ratio
0,Camelus,dromedarius,492714.47,480.0,0.000974
1,Canis,adustus,10392.49,137.0,0.013183
2,Canis,aureus,9658.7,192.0,0.019878
3,Canis,latrans,11989.1,262.0,0.021853
4,Canis,lupus,31756.51,354.0,0.011147
...,...,...,...,...,...
5411,Zyzomys,argurus,40.42,,
5412,Zyzomys,maini,93.99,,
5413,Zyzomys,palatilis,123.0,,
5414,Zyzomys,pedunculatus,100.0,,


### WARNING!

**Warning**: the previous code cell *changes* the existing data frame `species_data_final`, rather than creating a new `DataFrame`.
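If you would rather keep the original `DataFrame` untouched, two common options are sketched below on made-up data: make a copy with `DataFrame.copy()` before adding the column, or use `DataFrame.assign()`, which returns a new `DataFrame`. Neither is used elsewhere in this lecture; they are shown only as possible alternatives.

```python
import pandas as pd

# Made-up data for illustration.
original = pd.DataFrame({
    "Mass_g": [492714.47, 10392.49],
    "Longevity_m": [480.0, 137.0],
})

# Option 1: copy first, then add the column to the copy.
with_ratio = original.copy()
with_ratio["Ratio"] = with_ratio["Longevity_m"] / with_ratio["Mass_g"]

# Option 2: assign() returns a new DataFrame with the extra column.
also_with_ratio = original.assign(
    Ratio=original["Longevity_m"] / original["Mass_g"]
)

print(list(original.columns))    # no "Ratio" column here
print(list(with_ratio.columns))
```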

## Boolean `Series` and filtering rows

Another common type of data transformation is to **filter** for specific rows in a dataset based on one or more conditions.

**Goal**: filter the rows of the dataset to keep the species with a mass *greater than or equal to 100 kg*.

As a first step, we create a *boolean `Series`* that stores `True` for the rows we want to keep, and `False` for the other rows.

In [26]:
is_large = species_data_final["Adult Body Mass (g)"] >= 100000
is_large

0        True
1       False
2       False
3       False
4       False
        ...  
5411    False
5412    False
5413    False
5414    False
5415    False
Name: Adult Body Mass (g), Length: 5416, dtype: boolean

Then, we use this `Series` to index `species_data_final` by using square bracket notation.

In [27]:
species_data_final[is_large]

Unnamed: 0,Genus,Species,Adult Body Mass (g),Max Longevity (months),Longevity-to-Mass Ratio
0,Camelus,dromedarius,492714.47,480.0,0.000974
5,Bos,frontalis,800143.05,314.4,0.000393
6,Bos,grunniens,500000.0,267.0,0.000534
7,Bos,javanicus,635974.34,318.96,0.000502
23,Camelus,bactrianus,554515.91,480.0,0.000866
...,...,...,...,...,...
5331,Ursus,americanus,110500.0,384.0,0.003475
5332,Ursus,arctos,196287.5,600.0,0.003057
5333,Ursus,maritimus,371703.81,458.4,0.001233
5398,Zalophus,californianus,137194.86,360.0,0.002624


### Note: lots of square brackets!

One of the tricky things about `DataFrame`s is that there are different ways of obtaining subsets of the dataset that all have very similar code syntax:

```python
species_data_final[...]
```

The key principle is that **the type of the value inside the square brackets determines what kind of "subsetting" operation is being performed**.

| Type inside `[...]` | Example                    | Return type   | Which columns?  | Which rows? |
|---------------------|----------------------------|---------------|-----------|-------|
| `str`               | `species_data_final["Adult Body Mass (g)"]` | `Series` | The one specified | All rows |
| `list` of `str`     | `species_data_final[["Genus", "Species"]]` | `DataFrame` | The ones specified | All rows |
| `Series` of `bool`  | `species_data_final[is_large]` | `DataFrame` | All columns | The ones specified |
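The three rows of the table can be checked on a small, made-up `DataFrame` (column names and values invented for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "Genus": ["Camelus", "Canis", "Bos"],
    "Mass_g": [492714.47, 10392.49, 500000.0],
})

# A single str selects one column and gives a Series.
one_column = demo["Mass_g"]

# A list of str selects several columns and gives a DataFrame.
two_columns = demo[["Genus", "Mass_g"]]

# A boolean Series selects rows and gives a DataFrame.
large_rows = demo[demo["Mass_g"] >= 100000]

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(large_rows)
```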

### Logical operators: `&` and `|`

Sometimes we want to filter on two conditions.
To start, suppose we have these two boolean `Series`:

In [28]:
is_large = species_data_final["Adult Body Mass (g)"] >= 100000  # at least 100 kg
is_long_lived = species_data_final["Max Longevity (months)"] >= 240  # at least 20 years

There are two common ways to filter based on a combination of these two conditions.

**Filter 1**: find rows where the species is large **and** is long-lived.

To do this, we use the `&` operator to combine the two `Series`.

In [29]:
filter1 = is_large & is_long_lived

species_data_final[filter1]

Unnamed: 0,Genus,Species,Adult Body Mass (g),Max Longevity (months),Longevity-to-Mass Ratio
0,Camelus,dromedarius,492714.47,480.0,0.000974
5,Bos,frontalis,800143.05,314.4,0.000393
6,Bos,grunniens,500000.0,267.0,0.000534
7,Bos,javanicus,635974.34,318.96,0.000502
23,Camelus,bactrianus,554515.91,480.0,0.000866
...,...,...,...,...,...
5331,Ursus,americanus,110500.0,384.0,0.003475
5332,Ursus,arctos,196287.5,600.0,0.003057
5333,Ursus,maritimus,371703.81,458.4,0.001233
5398,Zalophus,californianus,137194.86,360.0,0.002624


**Filter 2**: find rows where the species is large **or** is long-lived.

To do this, we use the `|` operator to combine the two `Series`.

In [30]:
filter2 = is_large | is_long_lived

species_data_final[filter2]

Unnamed: 0,Genus,Species,Adult Body Mass (g),Max Longevity (months),Longevity-to-Mass Ratio
0,Camelus,dromedarius,492714.47,480.0,0.000974
3,Canis,latrans,11989.1,262.0,0.021853
4,Canis,lupus,31756.51,354.0,0.011147
5,Bos,frontalis,800143.05,314.4,0.000393
6,Bos,grunniens,500000.0,267.0,0.000534
...,...,...,...,...,...
5377,Vulpes,macrotis,4499.97,240.0,0.053334
5380,Vulpes,velox,2088.0,240.0,0.114943
5397,Zaglossus,bruijni,8951.71,372.0,0.041556
5398,Zalophus,californianus,137194.86,360.0,0.002624
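If you write the comparisons directly inside the square brackets instead of storing them in variables first, each comparison needs its own parentheses, because `&` and `|` are evaluated before `>=`. The sketch below shows this on made-up data. It also uses the `~` operator, which is not covered above: it flips every `True` to `False` and vice versa.

```python
import pandas as pd

# Made-up data for illustration.
demo = pd.DataFrame({
    "Mass_g": [492714.47, 11989.10, 40.42],
    "Longevity_m": [480.0, 262.0, 30.0],
})

# Parentheses around each comparison are required here.
large_and_long_lived = demo[
    (demo["Mass_g"] >= 100000) & (demo["Longevity_m"] >= 240)
]

# ~ negates a boolean Series: keep the rows that are NOT large.
not_large = demo[~(demo["Mass_g"] >= 100000)]

print(large_and_long_lived)
print(not_large)
```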


## Exploratory analysis: sorting and basic descriptive statistics

### Sorting

Suppose we want to take our `DataFrame` and sort it by the `"Adult Body Mass (g)"` column to see which species have the largest mass.

We do this by using the `DataFrame.sort_values(by=...)` method, where we pass in a `str` that names the column to sort by.

In [31]:
species_data_final.sort_values(by="Adult Body Mass (g)")

Unnamed: 0,Genus,Species,Adult Body Mass (g),Max Longevity (months),Longevity-to-Mass Ratio
965,Craseonycteris,thonglongyai,1.96,,
1866,Kerivoula,minuta,2.03,,
4630,Suncus,etruscus,2.26,32.4,14.336283
4505,Sorex,minutissimus,2.46,,
4635,Suncus,madagascariensis,2.47,,
...,...,...,...,...,...
5368,Volemys,musseri,,,
5395,Zaglossus,attenboroughi,,,
5396,Zaglossus,bartoni,,,
5399,Zalophus,japonicus,,,


By default, the column values are sorted in *ascending* (low-to-high) order.

If we want to sort in *descending* (high-to-low) order, we can pass in an *optional* argument `ascending=False` to `DataFrame.sort_values`:

In [32]:
species_data_final.sort_values(by="Adult Body Mass (g)", ascending=False)

Unnamed: 0,Genus,Species,Adult Body Mass (g),Max Longevity (months),Longevity-to-Mass Ratio
136,Balaenoptera,musculus,154321304.5,1320.0,0.000009
391,Balaena,mysticetus,79691178.99,480.0,0.000006
137,Balaenoptera,physalus,47506008.23,1392.0,0.000029
32,Caperea,marginata,31999999.98,,
1951,Megaptera,novaeangliae,30000000.01,1140.0,0.000038
...,...,...,...,...,...
5368,Volemys,musseri,,,
5395,Zaglossus,attenboroughi,,,
5396,Zaglossus,bartoni,,,
5399,Zalophus,japonicus,,,
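Notice that in both outputs above, the rows with a missing mass appear at the end. `sort_values` places missing values last by default, whichever direction you sort in. Its optional `na_position` argument changes this, as in the following sketch on made-up data:

```python
import pandas as pd

# Made-up data; None becomes a missing value in the numeric column.
demo = pd.DataFrame({
    "Species": ["lupus", "musseri", "etruscus"],
    "Mass_g": [31756.51, None, 2.26],
})

# Default behaviour: missing values are placed last.
na_last = demo.sort_values(by="Mass_g", ascending=False)

# na_position="first" moves missing values to the top instead.
na_first = demo.sort_values(by="Mass_g", ascending=False, na_position="first")

print(na_last)
print(na_first)
```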


### Descriptive statistics

Here are five simple *descriptive statistics* that we can use to describe a collection of numbers:

- sum
- count (i.e., size; number of elements)
- mean (average)
- min
- max

Unsurprisingly, we can compute all of these on any Pandas `Series` containing numeric data by calling a corresponding `Series` method.

| Statistic | `Series` method |
|-----------|-----------------|
| sum       | `Series.sum()`  |
| count     | `Series.count()` |
| mean      | `Series.mean()` |
| min       | `Series.min()`  |
| max       | `Series.max()`  |

**Note**: all five of these methods *ignore* `NA` values.

Let's start by extracting the body mass column (again).

In [33]:
totals = species_data_final["Adult Body Mass (g)"]
totals.head()

0    492714.47
1     10392.49
2       9658.7
3      11989.1
4     31756.51
Name: Adult Body Mass (g), dtype: Float64

In [34]:
totals.sum()

629803664.9200003

In [35]:
totals.count()

3542

In [36]:
totals.mean()

177810.18207792216

In [37]:
totals.min()

1.96

In [38]:
totals.max()

154321304.5
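This explains why `totals.count()` above is smaller than the number of rows in the dataset. A minimal sketch on a made-up `Series` with one missing value:

```python
import pandas as pd

# Made-up values with one missing entry.
values = pd.Series([10.0, pd.NA, 30.0], dtype="Float64")

print(len(values))     # length includes the missing entry
print(values.count())  # count() skips the missing entry
print(values.sum())    # sum of the two non-missing values
print(values.mean())   # mean of the two non-missing values
```

Here the mean is computed over the two known values only, not over all three entries.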

## Further reading

`pandas` is the most complex part of Python we've studied so far in this course, and so we expect you'll need to review and practice more as we dive deeper into this library.

The official Pandas website has some great introductory materials, including:

- [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [*Getting Started* tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
