# Combining Dataframes

## Metadata

- Teaching: 60
- Exercises: 0

## Questions

- How do we combine data from multiple sources using pandas?
- How do we add data to an existing dataframe?
- How do we split and combine data columns?

## Objectives

- Use `pd.merge()` to add species info to the survey dataset
- Use `pd.concat()` to add additional rows to the dataset
- Use string methods to combine, split, and modify text columns using the `str` accessor

Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. A merge (or join) allows us to combine two dataframes using values common to each. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and split data using pandas.

## Merging dataframes

The survey dataframe we've been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code. What does each code stand for? To find out, we'll now load both the survey dataset and a second dataset containing more detailed information about the various species observed. Read the second dataframe from a file called species.csv:

In [None]:
import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")

species

We can see that the species dataframe includes a genus, species, and taxa for each species_id. This is much more useful than the species_id included in the original dataframe. How can we add that data to our surveys dataframe? Adding it by hand would be tedious and error-prone. Fortunately, pandas provides the `pd.merge()` function to join two dataframes.

## Managing repetitive data

Why store species data in a separate table in the first place? Species information is repetitive: Every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing the complexity of the table and increasing the likelihood of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access that data when we need it.

To merge the surveys and species dataframes, we will use the `merge()` method:

In [None]:
merged = surveys.merge(species)
merged

Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.

### Joins

The `pd.merge()` function is equivalent to the JOIN operation in SQL.

## Challenge

Filter the merged dataframe to show the genus, the species name, and the weight for every individual captured at the site.

In [None]:
merged[["genus", "species", "weight"]]

In the example above, we didn't provide any information about how we wanted to merge the dataframes together, so pandas made an educated guess by looking at the columns in each of the dataframes and merging them on the only column that appeared in both datasets, species_id. For more complex tables, we may want to specify which columns are used for merging. We can do so by passing one or more column names using the *on* keyword argument:

In [None]:
pd.merge(surveys, species, on="species_id")
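
The following is a minimal sketch of the same idea on two small, made-up tables (the names `toy_obs` and `toy_species` and their values are illustrative assumptions, not part of the lesson data). It suggests that when exactly one column is shared, leaving out *on* and naming that column explicitly should give the same result:

```python
import pandas as pd

# Hypothetical stand-ins for the surveys and species tables
toy_obs = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["DM", "NL", "DM"],
})
toy_species = pd.DataFrame({
    "species_id": ["DM", "NL"],
    "genus": ["Dipodomys", "Neotoma"],
})

# Let pandas pick the shared column, then name it explicitly
implicit = pd.merge(toy_obs, toy_species)
explicit = pd.merge(toy_obs, toy_species, on="species_id")

# Check whether the two results are identical
same_result = implicit.equals(explicit)
print(same_result)
```

Naming the column explicitly is mostly a matter of readability here, but it guards against surprises if a second shared column is added to either table later.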

## Challenge

Compare the number of rows in the original and merged survey dataframes. How do they differ? Why do you think that might be?

**Hint:** Use the `pd.unique()` function to look at the species_id column in each dataframe.

In [None]:
pd.unique(surveys["species_id"].sort_values())

In [None]:
pd.unique(species["species_id"].sort_values())

The number of rows in the merged dataframe is lower than in the original surveys dataframe. By default, `pd.merge()` performs an **inner join**. This means that a row will only appear in the merged dataframe if the value(s) in the join column(s) appear in both dataframes. Here, observations that did not include a species_id or that included a species_id that was not defined in the species dataframe were dropped.

This is not always desirable behavior. Fortunately, pandas supports other kinds of merges:

- **Inner:** Include all rows with common values in the join columns. This is the default behavior.
- **Left:** Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
- **Right:** Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
- **Outer:** Include all rows from both dataframes. Columns from either dataframe are set to NaN where no common value exists.
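
To make the four merge types concrete, here is a small sketch that counts the rows each one returns. The tables `toy_obs` and `toy_species` are made up for illustration: one observation has no species_id, one has a code ("ZZ") missing from the species table, and one species ("PF") has no observations.

```python
import pandas as pd

# Hypothetical observations: one missing code, one unknown code
toy_obs = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["DM", "NL", None, "ZZ"],
})
# Hypothetical species table: "PF" never appears in toy_obs
toy_species = pd.DataFrame({
    "species_id": ["DM", "NL", "PF"],
    "genus": ["Dipodomys", "Neotoma", "Perognathus"],
})

# Count the rows returned by each kind of merge
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(toy_obs, toy_species, on="species_id", how=how)
    row_counts[how] = len(result)

print(row_counts)
```

With this toy data, only the inner merge drops observations, and only the right and outer merges keep the unobserved species.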

We want to keep all of our observations, so let's do a left join instead. To specify the type of merge, we use the *how* keyword argument:

In [None]:
pd.merge(surveys, species, how="left")

Now all 35,549 rows appear in the merged dataframe.
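
If we also want to know which observations found no match, one option is the `indicator` argument of `pd.merge()`, which adds a `_merge` column recording where each row came from. The sketch below uses made-up tables (`toy_obs` and `toy_species` are assumptions for illustration) rather than the lesson data:

```python
import pandas as pd

# Hypothetical observations: one missing code, one unknown code
toy_obs = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["DM", "NL", None, "ZZ"],
})
toy_species = pd.DataFrame({
    "species_id": ["DM", "NL"],
    "genus": ["Dipodomys", "Neotoma"],
})

# indicator=True adds a _merge column: "both", "left_only", or "right_only"
checked = pd.merge(toy_obs, toy_species, how="left", indicator=True)

# Keep only the observations with no matching species record
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["record_id", "species_id"]])
```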

## Appending rows to a dataframe

Merges address the case where information related to a set of observations is spread across multiple files. What about when the observations themselves are in more than one file? For a survey like the one we've been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?

We'll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can filter the dataset using conditionals, as we saw in lesson 3:

In [None]:
surveys_2001 = surveys[surveys["year"] == 2001].copy()
surveys_2001

In [None]:
surveys_2002 = surveys[surveys["year"] == 2002].copy()
surveys_2002

We now have two different dataframes with the same columns but different data, one with 1,610 rows, the other with 2,229 rows, for a total of 3,839 records. We can combine them into a new dataframe using `pd.concat()`, which stacks the dataframes vertically (that is, it appends records from the 2002 dataset to the 2001 dataset). This function accepts a list of dataframes and works from left to right (so the leftmost dataframe will be at the top of the new dataframe and the rightmost will be at the bottom). We're only combining two dataframes here but could include more if necessary.

In [None]:
surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002

The combined dataframe includes all rows from both dataframes.
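
A quick sanity check is that the combined row count equals the sum of the inputs. The sketch below shows this on two made-up tables (`part_a` and `part_b` are illustrative assumptions). It also shows that `pd.concat()` keeps each input's index labels by default; passing `ignore_index=True` is one way to renumber the rows.

```python
import pandas as pd

# Hypothetical yearly files with the same columns
part_a = pd.DataFrame({"record_id": [1, 2], "year": [2001, 2001]})
part_b = pd.DataFrame({"record_id": [3, 4, 5], "year": [2002, 2002, 2002]})

# Default: index labels are carried over from each input
stacked = pd.concat([part_a, part_b])

# ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(len(stacked) == len(part_a) + len(part_b))
print(list(stacked.index))
print(list(renumbered.index))
```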

In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will have the value NaN for those rows in the combined dataframe.

To illustrate this, we'll add a column called test to the 2002 survey, then re-run `pd.concat()`:

In [None]:
surveys_2002["test"] = True

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002

As expected, the test column has a value of NaN for the 2001 data in the combined dataframe.
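
The same behavior can be seen at a smaller scale. In this sketch, the tables `older` and `newer` are made up for illustration, and only `newer` has a test column:

```python
import pandas as pd

# Hypothetical data: the test column exists only in the newer table
older = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, 48.0]})
newer = pd.DataFrame({"record_id": [3], "weight": [29.0], "test": [True]})

combined = pd.concat([older, newer], ignore_index=True)
print(combined)

# True marks rows where the test column was filled with NaN
missing_test = combined["test"].isna().tolist()
print(missing_test)
```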

## Combining data in multiple columns

Sometimes we want to combine values from two or more columns into a single column. For example, we might want to refer to the species in each record by both its genus and species names. In Python, we use the `+` operator to concatenate (or join) strings, and pandas works the same way: 

In [None]:
species["genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"]

Note that the `+` operator can also be used to add numeric columns. In Python, the same operator can be used to perform different operations for different data types (but keep reading for an important caveat.)
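
One pitfall worth knowing when using `+` across columns is mixing data types: adding a text column directly to a numeric column typically fails with a `TypeError`. A minimal sketch, assuming a made-up table `toy`, converts the numeric column to text with `astype(str)` before concatenating:

```python
import pandas as pd

# Hypothetical table with one text column and one numeric column
toy = pd.DataFrame({"species_id": ["DM", "NL"], "year": [2001, 2002]})

# Convert the numeric column to strings before joining with +
toy["label"] = toy["species_id"] + "-" + toy["year"].astype(str)
print(toy["label"].tolist())
```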

## Combining dates

Another common need is to join or split dates. In the ecology dataset, the date of each observation is split across year, month, and day columns. However, pandas has a special data type, `datetime64`, for representing dates that is useful for analyzing time series data. To make use of that functionality, we can pass the date columns to the `pd.to_datetime()` function. pandas recognizes the year, month, and day column names and assembles each row into a single date.

In [None]:
surveys["date"] = pd.to_datetime(surveys[["year", "month", "day"]])
surveys["date"]
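
Once the column holds datetime values, the `dt` accessor exposes date-aware attributes and methods. The sketch below uses a small made-up table (`toy` and its values are assumptions for illustration) to build a date column and then pull out the month and the day name:

```python
import pandas as pd

# Hypothetical table with the date split across three columns
toy = pd.DataFrame({"year": [2001, 2002], "month": [1, 12], "day": [15, 31]})

# Assemble the three columns into a single datetime64 column
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]])

# The dt accessor works on the whole column at once
months = toy["date"].dt.month.tolist()
print(months)
print(toy["date"].dt.day_name().tolist())
```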

## Challenge

pandas allows us to ask specific questions about our data. A key skill is knowing how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).

Try using pandas to answer the following questions.

1. How many specimens of each sex are there for each year, including those whose sex is unknown?
2. What is the average weight of each taxa?
3. What are the minimum, maximum and average weight for each species of Rodent?
4. What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?
5. What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

### Show me the solution to challenge 1

How many specimens of each sex are there for each year, including those whose sex is unknown?

In [None]:
# Fill in NaN values in sex with U
surveys["sex"] = surveys["sex"].fillna("U")

# Count records by sex
result = surveys.groupby(["year", "sex"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

### Show me the solution to challenge 2

What is the average weight of each taxa?

In [None]:
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Group by taxa
grouped = merged.groupby("taxa")

# Calculate the mean weight
grouped["weight"].mean()

### Show me the solution to challenge 3

What are the minimum, maximum and average weight for each species of Rodent?

In [None]:
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species
grouped = rodents.groupby("species_id")

# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})

### Show me the solution to challenge 4

What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?

In [None]:
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])

# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
    print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))

### Show me the solution to challenge 5

What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

In [None]:
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])

# Calculate the mean weight by year
result = grouped["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

## Keypoints

- Combine two dataframes on one or more common values using `pd.merge()`
- Append rows from one dataframe to another using `pd.concat()`
- Combine multiple text columns into one using the `+` operator
- Use the `str` accessor to use string methods like `split()` and `zfill()` on text columns
- Convert date strings to datetime objects using `pd.to_datetime()`