# Lecture 16 – Joining and Row Methods

## Spark 10, Spring 2024

In [None]:
import pandas as pd
import numpy as np

## `pd.merge`

We often have useful data spread across multiple sources. Each dataset is informative on its own, but they are usually more powerful when combined. When several tables contain related data, we can use **`pd.merge`** to join them into a single larger table.

For example, we have two tables: `phones`, which lists the price of each phone model, and `inventory`, which shows how many of each phone we have at each store.

Using `pd.merge` we can answer the question: _If I sold all of the phones in my inventory, what would my revenue be?_

In [None]:
phones_data = [['iPhone 12',799,6.1],
               ['iPhone 12 Pro Max',1099,6.7],
               ['Samsung Galaxy S21',799,6.2],
               ['OnePlus 8',699,6.6]]

phones = pd.DataFrame(data = phones_data, columns = ['Model','Price','Screen Size'])

inventory_data = [['Samsung Galaxy S21',50, 'Berkeley'],
                  ['iPhone 12', 40, 'Berkeley'],
                  ['iPhone 12', 10, 'San Francisco'],
                  ['OnePlus 8', 100 , 'Oakland'],
                  ['Pixel 5',25, 'Oakland']]

inventory = pd.DataFrame(data = inventory_data, columns = ['Handset','Units','Store'])

In [None]:
phones

In [None]:
inventory

First, let's use `pd.merge` to combine the two tables.

In [None]:
... # Join the `phones` and `inventory` tables in the way that makes most sense

In [None]:
... # Try switching the order of the arguments in `pd.merge` to see if you get the same result

Notice that when we switch around the arguments to `pd.merge`, we get the same information, just in a different order. **This will not always be the case**.
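As a rough illustration of why argument order can matter, the sketch below uses trimmed-down copies of `phones` and `inventory`. It relies on the `how` parameter of `pd.merge`, which goes beyond what this lecture covers and is shown here only as an assumption-labeled aside: with the default (inner) join, swapping the arguments only reorders the columns, while with `how='left'` the two orders keep different rows.

```python
import pandas as pd

# Trimmed-down copies of the lecture tables (illustration only)
phones = pd.DataFrame({'Model': ['iPhone 12', 'OnePlus 8'],
                       'Price': [799, 699]})
inventory = pd.DataFrame({'Handset': ['OnePlus 8', 'Pixel 5'],
                          'Units': [100, 25]})

# Default (inner) join: same matching row either way, columns in a different order
a = pd.merge(phones, inventory, left_on='Model', right_on='Handset')
b = pd.merge(inventory, phones, left_on='Handset', right_on='Model')
print(list(a.columns))
print(list(b.columns))

# Left join (not covered in this lecture): keeps every row of the FIRST table,
# so the order of the arguments changes which rows survive
left_a = pd.merge(phones, inventory, how='left', left_on='Model', right_on='Handset')
left_b = pd.merge(inventory, phones, how='left', left_on='Handset', right_on='Model')
print(left_a)
print(left_b)
```

In this sketch, `left_a` keeps the unmatched iPhone 12 row, while `left_b` keeps the unmatched Pixel 5 row instead.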

In [None]:
store = ... # Join `phones` and `inventory` into one table
store

Using our joined table, we can calculate our revenue if we sold all of our phones.

In [None]:
... # Create an array of the revenue for each phone model if all phones were sold

In [None]:
... # Calculate the total revenue we would generate if we sold all of our phones
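One possible way to carry out the revenue calculation is sketched below. It rebuilds the two tables so the block runs on its own; the variable names `revenue` and `total_revenue` are illustrative choices, not names required by the lecture.

```python
import pandas as pd

phones = pd.DataFrame(
    [['iPhone 12', 799, 6.1], ['iPhone 12 Pro Max', 1099, 6.7],
     ['Samsung Galaxy S21', 799, 6.2], ['OnePlus 8', 699, 6.6]],
    columns=['Model', 'Price', 'Screen Size'])

inventory = pd.DataFrame(
    [['Samsung Galaxy S21', 50, 'Berkeley'], ['iPhone 12', 40, 'Berkeley'],
     ['iPhone 12', 10, 'San Francisco'], ['OnePlus 8', 100, 'Oakland'],
     ['Pixel 5', 25, 'Oakland']],
    columns=['Handset', 'Units', 'Store'])

# Match each inventory entry with its price
store = pd.merge(phones, inventory, left_on='Model', right_on='Handset')

# Revenue per joined row (price times units), as a NumPy array
revenue = (store['Price'] * store['Units']).to_numpy()

# Total revenue across all joined rows
total_revenue = revenue.sum()
print(revenue)
print(total_revenue)
```

Note that models appearing in only one table (here the iPhone 12 Pro Max and the Pixel 5) drop out of the joined table, so they do not contribute to the total.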

### Quick Check 1

In [None]:
contacts_data={'Name' :['Roxanne', 'Sandy', 'Stan', 'Tomas', 'Wilma'],
               'Email':['roxanne@berkeley.edu', 'sandy@nyu.edu', 'stan.vg@gmail.com', 'tomastrain@umich.edu', 'wilma@columbia.edu'],
               'Area Code':[510, 212, 734, 734, 212]}

contacts = pd.DataFrame.from_dict(contacts_data)

codes_data = {'Code' : [212, 310, 519, 734],
              'Region' : ['New York City', 'Los Angeles', 'Ontario, Canada', 'Metro Detroit']}

codes = pd.DataFrame.from_dict(codes_data)

In [None]:
contacts

In [None]:
codes

Consider the tables `contacts` and `codes`.

1. Fill in the blanks of the code below to join the two tables in the way that feels most natural.
2. Before running your code, think about how many rows and columns will be in the resulting table.

In [None]:
pd.merge(..., ..., left_on = ..., right_on = ...) # Replace the ... with your answers

### Followup

Suppose we were not careful and mistyped the Los Angeles area code 213 as 212 in the `extra_codes` table below.

In [None]:
extra_codes = pd.DataFrame.from_dict({'Code':[212, 212, 519, 734],
                                      'Region':['New York City', 'Los Angeles', 'Ontario, Canada', 'Metro Detroit']})

In [None]:
contacts

In [None]:
extra_codes

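Now, when we join the `contacts` table with the `extra_codes` table, we get multiple rows for the same person. `pd.merge` pairs each row of `contacts` with every row of `extra_codes` that has a matching code, so anyone with area code 212 appears twice. This may look unnatural, but it is how `pd.merge` works!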

In [None]:
pd.merge(contacts,extra_codes,left_on = 'Area Code', right_on = 'Code')
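If duplicated keys are a concern, one option is to check for them before or during the join. The sketch below uses shortened copies of the tables and the `validate` parameter of `pd.merge`, which is not covered in this lecture; treat it as an optional aside rather than the expected approach.

```python
import pandas as pd

# Shortened copies of the lecture tables (illustration only)
contacts = pd.DataFrame({'Name': ['Sandy', 'Stan', 'Wilma'],
                         'Area Code': [212, 734, 212]})
extra_codes = pd.DataFrame({'Code': [212, 212, 734],
                            'Region': ['New York City', 'Los Angeles', 'Metro Detroit']})

# Each contact is paired with every row that shares its code
joined = pd.merge(contacts, extra_codes, left_on='Area Code', right_on='Code')
print(len(joined))  # more rows than there are contacts

# Find the codes that appear more than once in the lookup table
duplicated_codes = extra_codes[extra_codes['Code'].duplicated(keep=False)]
print(duplicated_codes)

# Ask pandas to raise an error if the right-hand keys are not unique
try:
    pd.merge(contacts, extra_codes, left_on='Area Code', right_on='Code',
             validate='many_to_one')
    caught = False
except pd.errors.MergeError:
    caught = True
print(caught)
```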

### Disclaimer

When a join finds no matches between the two tables, the resulting table has no rows; only the column labels remain.

In [None]:
# Empty table (zero rows), because there are no matches between
# the 'Name' column in contacts and the 'Region' column in codes
pd.merge(contacts,codes,left_on = 'Name', right_on = 'Region')
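A quick way to confirm that a join came back empty is to inspect the result's shape. The sketch below uses two-row copies of `contacts` and `codes`; the name `no_matches` is just an illustrative choice.

```python
import pandas as pd

# Two-row copies of the lecture tables (illustration only)
contacts = pd.DataFrame({'Name': ['Roxanne', 'Sandy'],
                         'Area Code': [510, 212]})
codes = pd.DataFrame({'Code': [212, 310],
                      'Region': ['New York City', 'Los Angeles']})

# No name ever equals a region, so no rows match
no_matches = pd.merge(contacts, codes, left_on='Name', right_on='Region')

print(no_matches.empty)          # True when the table has no rows
print(no_matches.shape)          # (number of rows, number of columns)
print(list(no_matches.columns))  # the column labels are still there
```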

## Other Tools

## Rows

Since a row in a table can contain values of different data types, we cannot store it as a regular array (arrays have to contain values of the same data type). Instead, pandas uses a `Series` to store the values in a row.

In [None]:
phones

You can extract a particular row from a table as a `Series` using `tbl.iloc[index]`. Note that this is **not** the same as `tbl.iloc[[index]]`, which returns a one-row `DataFrame`.

In [None]:
... # Get the second row in `phones` as a Series object

In [None]:
type(phones.iloc[1])

In [None]:
phones.iloc[[1]]

In [None]:
type(phones.iloc[[1]])
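The difference between the two forms can be seen in a minimal sketch on a two-row copy of `phones`. The names `row` and `one_row_table` are illustrative only.

```python
import pandas as pd

# Two-row copy of the lecture table (illustration only)
phones = pd.DataFrame({'Model': ['iPhone 12', 'OnePlus 8'],
                       'Price': [799, 699],
                       'Screen Size': [6.1, 6.6]})

row = phones.iloc[1]              # single integer: a Series
one_row_table = phones.iloc[[1]]  # list of integers: a DataFrame

print(type(row).__name__)
print(type(one_row_table).__name__)

# The Series is labeled by the column names, so values can be looked up by column
print(list(row.index))
print(row['Price'])

# The DataFrame keeps the table layout: one row, three columns
print(one_row_table.shape)
```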

You _can_ convert a `Series` to an array, but it will do so by converting all values in the row to one data type. Not ideal!

In [None]:
... # Convert the last row of `phones` to an array
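To see what "one data type" means in practice, the sketch below converts rows with `Series.to_numpy()` and inspects the resulting `dtype`. It uses a two-row copy of `phones`, and the names `mixed_row` and `numeric_row` are illustrative.

```python
import numpy as np
import pandas as pd

# Two-row copy of the lecture table (illustration only)
phones = pd.DataFrame({'Model': ['iPhone 12', 'OnePlus 8'],
                       'Price': [799, 699],
                       'Screen Size': [6.1, 6.6]})

# A row mixing text and numbers falls back to the generic `object` type
mixed_row = phones.iloc[-1].to_numpy()
print(mixed_row, mixed_row.dtype)

# A row with only numbers is converted to a common numeric type,
# so the integer price becomes a float
numeric_row = phones[['Price', 'Screen Size']].iloc[-1].to_numpy()
print(numeric_row, numeric_row.dtype)
```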

## `pd.concat`

If you want to add a row to an existing table, there are several ways to do this. One is `pd.concat()`, short for **concatenate**. This function takes a list of DataFrames or Series and stacks them together. We won't use it often, but it's still good to know!

In [None]:
... # Add a row to `phones` with the following attributes: Model: 'iPhone 12 Mini', Price: 699, Screen Size: 5.8

In [None]:
... # Add two rows to `phones` with the following attributes:
    # Row 1 - Model: 'iPhone 12 Mini', Price: 699, Screen Size: 5.8
    # Row 2 - Model: 'Moto RAZR', Price: 459, Screen Size: 3.5
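One way to add rows with `pd.concat` is sketched below: put the new rows in their own DataFrame with the same column labels as `phones` (which uses `Model`, `Price`, and `Screen Size`), then concatenate the two. The names `new_rows` and `more_phones` are illustrative, and `ignore_index=True` is one reasonable choice for renumbering the rows, not the only one.

```python
import pandas as pd

phones = pd.DataFrame(
    [['iPhone 12', 799, 6.1], ['iPhone 12 Pro Max', 1099, 6.7],
     ['Samsung Galaxy S21', 799, 6.2], ['OnePlus 8', 699, 6.6]],
    columns=['Model', 'Price', 'Screen Size'])

# New rows go in their own DataFrame, reusing the column labels of `phones`
new_rows = pd.DataFrame(
    [['iPhone 12 Mini', 699, 5.8], ['Moto RAZR', 459, 3.5]],
    columns=phones.columns)

# pd.concat takes a LIST of tables and returns a new table;
# ignore_index=True renumbers the rows 0, 1, 2, ... in the result
more_phones = pd.concat([phones, new_rows], ignore_index=True)
print(more_phones)
```

`pd.concat` does not modify `phones` itself, so assign the result to a name if you want to keep it.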

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. **Please create a PDF using File->Save and Export Notebook as->PDF**