# Video: Joining Data Frames with Pandas

This video shows how to join data sets in different data frames into one data frame with Pandas.


## Data Frames are Joined By Index

* The previous trivial joins worked because both data frames shared the same index.
* The data frame `join` method can join from any column of the calling data frame to the index of the other data frame.
* Key idea: looking up rows by index value is faster than searching an unindexed column.


Script:
* We previously saw trivial joins work through shared index columns.
* Shared index columns come up regularly when building data to match the original data frame.
* Another common case is matching a column in the original data frame to the index of another data frame.
* For example, a data frame with address data and a zipcode column might be joined with another data frame indexed on zipcode.
* Since pandas data frames support looking up rows by index values, you can easily join to another data frame's index.
* Let's walk through a concrete example.
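
Before the concrete example, here is a minimal sketch of the zipcode case described above. The data frames `addresses` and `zipcodes` and all of their values are hypothetical, made up only to illustrate joining a column to another data frame's index.

```python
import pandas as pd

# Hypothetical address data; zipcode is an ordinary column here.
addresses = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara"],
    "zipcode": ["02215", "10001", "02215"],
})

# Hypothetical lookup table indexed on zipcode.
zipcodes = pd.DataFrame(
    {"city": ["Boston", "New York"]},
    index=pd.Index(["02215", "10001"], name="zipcode"),
)

# Rows can be looked up directly by index value.
boston_row = zipcodes.loc["02215"]

# Join the zipcode column of addresses to the index of zipcodes.
addresses_with_city = addresses.join(zipcodes, on="zipcode")
print(addresses_with_city)
```

Each row of `addresses` picks up the `city` value from the row of `zipcodes` whose index matches its `zipcode`.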

## Joining Costs to Project Materials

![Raw materials for a 4' x 8' garden bed](https://github.com/user-attachments/assets/aedaf09f-1488-479a-a3e0-fe6395271b8a)

How much does all that wood cost?

Script:
* I will walk through an example of calculating the material costs for a garden bed that I built recently.
* I actually planned this out using Google Sheets before building it.
* The process there was pretty similar to what I will show you in pandas.


## Code Example: Garden Bed Data

In [None]:
import pandas as pd

In [None]:
bed_size_materials = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/garden-bed_size_materials.tsv", sep="\t")
bed_size_materials

Unnamed: 0,bed_size,material,quantity_per_bed
0,4' x 4',"2"" x 6"" x 4'",20
1,4' x 4',"8"" x 8"" x 16"" Cinder Block",12
2,4' x 8',"2"" x 6"" x 4'",6
3,4' x 8',"2"" x 6"" x 8'",14
4,4' x 8',"8"" x 8"" x 16"" Cinder Block",24


Script:
* The first input into this calculation is a table with different garden bed sizes and kinds of materials, and how many units of each material that I need for one bed of that size.
* I did not bother setting the index of this data frame.
* I could have built an index from the bed_size and material columns, but I don't see an advantage to spending time on that now.
* If you have ever built something out of wood, you probably noticed that I left out important things like screws.
* I am following the analysis that I did at home, focusing on the bulky and expensive materials, and I already had some buckets of screws.
* The next data file is for the material costs.

In [None]:
material_costs = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/garden-material_costs.tsv", sep="\t", index_col="material")
material_costs

material,unit_cost
"2"" x 6"" x 4'",4.92
"2"" x 6"" x 8'",6.62
"8"" x 8"" x 16"" Cinder Block",2.53


Script:
* For this table, I specifically set the index column to be the material column, because I want to join from the other data frame on its material column, and the join method always matches against the index of the other data frame.
* Let's do that join now.

In [None]:
bed_size_costs = bed_size_materials.join(material_costs, on="material")
bed_size_costs

Unnamed: 0,bed_size,material,quantity_per_bed,unit_cost
0,4' x 4',"2"" x 6"" x 4'",20,4.92
1,4' x 4',"8"" x 8"" x 16"" Cinder Block",12,2.53
2,4' x 8',"2"" x 6"" x 4'",6,4.92
3,4' x 8',"2"" x 6"" x 8'",14,6.62
4,4' x 8',"8"" x 8"" x 16"" Cinder Block",24,2.53


Script:
* Now we have quantities and unit costs in rows together, so we can calculate the costs.
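
Before multiplying, it can be worth checking that every material actually found a matching cost. The sketch below uses hypothetical miniature tables (the names `materials`, `costs`, "board", and "screws" are invented for illustration) to show what `join` does by default when a key is missing from the other index: the row is kept and the joined column is filled with NaN.

```python
import pandas as pd

# Hypothetical miniature tables; "screws" is deliberately absent
# from the cost table.
materials = pd.DataFrame({
    "material": ["board", "screws"],
    "quantity_per_bed": [6, 100],
})
costs = pd.DataFrame(
    {"unit_cost": [4.92]},
    index=pd.Index(["board"], name="material"),
)

# join defaults to how="left": every row of materials is kept, and
# rows with no match in the costs index get NaN for unit_cost.
joined = materials.join(costs, on="material")

# List the materials that did not find a unit cost.
missing = joined[joined["unit_cost"].isna()]
print(missing["material"].tolist())
```

In the garden bed data every material has a cost, so this check would come back empty there.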

In [None]:
bed_size_costs["cost"] = bed_size_costs["quantity_per_bed"] * bed_size_costs["unit_cost"]
bed_size_costs

Unnamed: 0,bed_size,material,quantity_per_bed,unit_cost,cost
0,4' x 4',"2"" x 6"" x 4'",20,4.92,98.4
1,4' x 4',"8"" x 8"" x 16"" Cinder Block",12,2.53,30.36
2,4' x 8',"2"" x 6"" x 4'",6,4.92,29.52
3,4' x 8',"2"" x 6"" x 8'",14,6.62,92.68
4,4' x 8',"8"" x 8"" x 16"" Cinder Block",24,2.53,60.72


Script:
* One reason why the material costs file used the column name unit cost was to avoid name collisions with future cost columns.
* Pandas will let you overwrite the cost column, so this is not required.
* But keeping the names separate will make it easier for you to distinguish them in your head.
* Let's add up the costs by bed size now.

In [None]:
bed_size_costs.groupby("bed_size")["cost"].sum()

bed_size
4' x 4'    128.76
4' x 8'    182.92
Name: cost, dtype: float64

Script:
* This returned a series because I selected one column by name.
* If I select with a list of column names, I will get a data frame back, so I will select with a list of just that one column name.

In [None]:
bed_size_costs = bed_size_costs.groupby("bed_size")[["cost"]].sum()
bed_size_costs

bed_size,cost
4' x 4',128.76
4' x 8',182.92


Script:
* Since this code example is wrapping up, the difference between a series and a data frame with the same data is not a big deal.
* I am personally biased towards presenting the data in data frames because the table layout is easier to read.
* The type information is not shown, but I find the table presentation in a Jupyter notebook much more legible.




## Garden Bed Wrap Up

![big cheap](https://github.com/user-attachments/assets/fc493d69-32bb-423a-9979-e5d2dbeb349b)

Final cost for 4' x 8' DIY garden bed ~ $250 after miscellaneous hardware and wax.

Cost of a slightly shinier 4' x 1' garden bed on Amazon.com ~ $400.


Script:
* Wrapping up this example: when I was looking into just ordering a garden bed online, most of the beds that I found were about four feet by one foot and cost 400 to 500 dollars, so I was pretty pleased with this result.
* Yes, there are missing costs, including the hardware and my personal time, but I got a much bigger garden bed too.
* Now, I doubt you all came here to become quantitative woodworkers.
* So let's talk about generalizations next.