# Lecture 10 – Joining and Row Methods

## Data 6, Fall 2024

In [1]:
from datascience import *
import numpy as np

## `.join`

Oftentimes, we have useful data from multiple sources. While each dataset provides information on its own, the datasets are usually more powerful when combined. So when we have multiple tables with related data, we can **join** those tables into a single larger table.

For example, we have two tables: `phones`, which lists the price and screen size of each phone model, and `inventory`, which shows how many units of each handset each store has.

Using `.join` we can answer the question: _If I sold all of the phones in my inventory, what would my revenue be?_

In [2]:
phones = Table().with_columns(
    'Model', np.array(['iPhone 12', 'iPhone 12 Pro Max', 'Samsung Galaxy S21', 'OnePlus 8']),
    'Price', np.array([799, 1099, 799, 699]),
    'Screen Size', np.array([6.1, 6.7, 6.2, 6.6])
)

inventory = Table().with_columns(
    'Handset', np.array(['Samsung Galaxy S21', 'iPhone 12', 'iPhone 12', 'OnePlus 8', 'Pixel 5']),
    'Units', np.array([50, 40, 10, 100, 25]),
    'Store', np.array(['Berkeley', 'Berkeley', 'San Francisco', 'Oakland', 'Oakland'])
)

In [20]:
tbl = Table().with_columns("Value", make_array(-2, -2, 4, 8, 8, 2))
tbl_2 = Table().with_columns("Value", make_array(-2, -2, 4),
                             "String Representation", make_array("Neg two", "Neg two", "Four"))

tbl.join("Value", tbl_2)

Value,String Representation
-2,Neg two
-2,Neg two
-2,Neg two
-2,Neg two
4,Four


In [3]:
phones

Model,Price,Screen Size
iPhone 12,799,6.1
iPhone 12 Pro Max,1099,6.7
Samsung Galaxy S21,799,6.2
OnePlus 8,699,6.6


In [4]:
inventory

Handset,Units,Store
Samsung Galaxy S21,50,Berkeley
iPhone 12,40,Berkeley
iPhone 12,10,San Francisco
OnePlus 8,100,Oakland
Pixel 5,25,Oakland


First, let's use `tbl.join` to combine the two tables.

In [19]:
phones.join("Model", inventory, "Handset") # Join the `phones` and `inventory` tables in the way that makes most sense


Model,Price,Screen Size,Units,Store
OnePlus 8,699,6.6,100,Oakland
Samsung Galaxy S21,799,6.2,50,Berkeley
iPhone 12,799,6.1,40,Berkeley
iPhone 12,799,6.1,10,San Francisco


In [6]:
inventory.join("Handset", phones, "Model") # Try switching the order of the arguments in `.join` to see if you get the same result

Handset,Units,Store,Price,Screen Size
OnePlus 8,100,Oakland,699,6.6
Samsung Galaxy S21,50,Berkeley,799,6.2
iPhone 12,40,Berkeley,799,6.1
iPhone 12,10,San Francisco,799,6.1


Notice that when we switch the order of the tables in `.join`, we get the same information; only the column order and the name of the shared column (`Model` versus `Handset`) change. **This will not always be the case**.
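
To see why the two calls above carry the same information, it can help to picture `.join` as pairing up rows whose labels match. The following is a minimal plain-Python sketch of that idea, not the `datascience` implementation; the helper name `inner_join` and the list-of-dictionaries representation are assumptions made for illustration.

```python
# Illustrative stand-ins for the `phones` and `inventory` tables (only the
# columns needed here), written as lists of dictionaries.
phones_rows = [
    {'Model': 'iPhone 12', 'Price': 799},
    {'Model': 'iPhone 12 Pro Max', 'Price': 1099},
    {'Model': 'Samsung Galaxy S21', 'Price': 799},
    {'Model': 'OnePlus 8', 'Price': 699},
]
inventory_rows = [
    {'Handset': 'Samsung Galaxy S21', 'Units': 50},
    {'Handset': 'iPhone 12', 'Units': 40},
    {'Handset': 'iPhone 12', 'Units': 10},
    {'Handset': 'OnePlus 8', 'Units': 100},
    {'Handset': 'Pixel 5', 'Units': 25},
]

def inner_join(left, left_label, right, right_label):
    """Hypothetical helper: pair every left row with every matching right row."""
    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[left_label] == right_row[right_label]:
                combined = dict(left_row)  # keep the left table's columns first
                for label, value in right_row.items():
                    if label != right_label:  # the matching column appears once
                        combined[label] = value
                joined.append(combined)
    # Sort by the shared label so the rows come out in a predictable order
    return sorted(joined, key=lambda row: row[left_label])

phones_first = inner_join(phones_rows, 'Model', inventory_rows, 'Handset')
inventory_first = inner_join(inventory_rows, 'Handset', phones_rows, 'Model')

for row in phones_first:
    print(row)
```

In this sketch, rows with no partner in the other table (`iPhone 12 Pro Max` and `Pixel 5`) do not appear in either result, which matches the joined tables shown above.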

In [45]:
inventory.row(0).item("Handset")

'Samsung Galaxy S21'

Using our joined table, we can calculate our revenue if we sold all of our phones.

In [25]:
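store = phones.join("Model", inventory, "Handset") # Save the joined table so we can use its columns
revenue = store.column("Price") * store.column("Units") # Create an array of the revenue for each row of the joined table if all phones were sold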
revenue

array([69900, 39950, 31960,  7990])

In [28]:
np.sum(revenue) # Calculate the total revenue we would generate if we sold all of our phones

149800

### Quick Check 1

In [20]:
contacts = Table().with_columns(
    'Name', np.array(['Roxanne', 'Sandy', 'Stan', 'Tomas', 'Wilma']),
    'Email', np.array(['roxanne@berkeley.edu', 'sandy@nyu.edu', 'stan.vg@gmail.com', 'tomastrain@umich.edu', 'wilma@columbia.edu']),
    'Area Code', np.array([510, 212, 734, 734, 212]),
)

codes = Table().with_columns(
    'Code', np.array([212, 310, 519, 734]),
    'Region', np.array(['New York City', 'Los Angeles', 'Ontario, Canada', 'Metro Detroit'])
)

In [21]:
contacts

Name,Email,Area Code
Roxanne,roxanne@berkeley.edu,510
Sandy,sandy@nyu.edu,212
Stan,stan.vg@gmail.com,734
Tomas,tomastrain@umich.edu,734
Wilma,wilma@columbia.edu,212


In [22]:
codes

Code,Region
212,New York City
310,Los Angeles
519,"Ontario, Canada"
734,Metro Detroit


Consider the tables `contacts` and `codes`.

1. Fill in the blanks of the code below to join the two tables in the way that feels most natural.
2. Before running your code, think about how many rows and columns will be in the resulting table.

In [29]:
contacts.join("Area Code", codes, "Code") # Replace the blanks with your answers

Area Code,Name,Email,Region
212,Sandy,sandy@nyu.edu,New York City
212,Wilma,wilma@columbia.edu,New York City
734,Stan,stan.vg@gmail.com,Metro Detroit
734,Tomas,tomastrain@umich.edu,Metro Detroit


### Followup

Suppose we were not careful and mistyped the Los Angeles area code 310 as 212 in the `extra_codes` table below.

In [30]:
extra_codes = Table().with_columns(
    'Code', np.array([212, 212, 519, 734]),
    'Region', np.array(['New York City', 'Los Angeles', 'Ontario, Canada', 'Metro Detroit'])
)

In [31]:
contacts

Name,Email,Area Code
Roxanne,roxanne@berkeley.edu,510
Sandy,sandy@nyu.edu,212
Stan,stan.vg@gmail.com,734
Tomas,tomastrain@umich.edu,734
Wilma,wilma@columbia.edu,212


In [32]:
extra_codes

Code,Region
212,New York City
212,Los Angeles
519,"Ontario, Canada"
734,Metro Detroit


Now, when we join the `contacts` table with the `extra_codes` table, each contact with area code 212 matches two rows of `extra_codes`, so that person appears twice in the result. This may feel unnatural, but it is how `.join` works!

In [33]:
contacts.join('Area Code', extra_codes, 'Code')

Area Code,Name,Email,Region
212,Sandy,sandy@nyu.edu,New York City
212,Sandy,sandy@nyu.edu,Los Angeles
212,Wilma,wilma@columbia.edu,New York City
212,Wilma,wilma@columbia.edu,Los Angeles
734,Stan,stan.vg@gmail.com,Metro Detroit
734,Tomas,tomastrain@umich.edu,Metro Detroit
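
One way to predict how many rows a join will produce is to count, for each row of the first table, how many rows of the second table share its label. A rough plain-Python sketch of that bookkeeping follows; `joined_row_count` is a hypothetical helper written for illustration, not part of the `datascience` library.

```python
from collections import Counter

def joined_row_count(left_values, right_values):
    """Hypothetical helper: number of rows a join on these two columns would have."""
    right_counts = Counter(right_values)  # how often each label appears on the right
    # Each left row contributes one joined row per matching right row
    return sum(right_counts[value] for value in left_values)

area_codes = [510, 212, 734, 734, 212]     # the 'Area Code' column of contacts
codes_column = [212, 310, 519, 734]        # the 'Code' column of codes
extra_codes_column = [212, 212, 519, 734]  # the 'Code' column of extra_codes

print(joined_row_count(area_codes, codes_column))        # 4
print(joined_row_count(area_codes, extra_codes_column))  # 6
```

The counts agree with the four-row and six-row tables above: area code 510 matches nothing, and each 212 contact matches two rows once the label is duplicated.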


### Disclaimer

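When a join finds no matches between the two tables, there are no rows to combine. In the `datascience` library, `.join` returns `None` in this case rather than an empty table, so the cell below displays no output.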

In [14]:
# No output – because there are no matches between
# the 'Name' column in contacts and the 'Code' column in codes
contacts.join('Name', codes, 'Code')

## Other Tools

## `.row`

Since each row in a table can contain values of different data types, we cannot store a row as an array without losing those types (arrays must contain values of a single data type). Instead, the `datascience` library uses a `Row` data type to store the information in a row.

In [34]:
phones

Model,Price,Screen Size
iPhone 12,799,6.1
iPhone 12 Pro Max,1099,6.7
Samsung Galaxy S21,799,6.2
OnePlus 8,699,6.6


You can extract a particular `Row` from a table using `tbl.row(index)`. Note that this is **not** the same as `tbl.take(index)`, which returns a `Table`.

In [36]:
phones.row(1) # Get the second row in `phones` as a Row object

Row(Model='iPhone 12 Pro Max', Price=1099, Screen Size=6.7000000000000002)

In [43]:
type(phones.row(1))

datascience.tables.Row

In [38]:
type(phones.take(1))

datascience.tables.Table

In [39]:
phones.row(1).item(1)

1099

You _can_ convert a `Row` to an array, but it will do so by converting all values in the row to one data type. Not ideal!

In [None]:
... # Convert the last row of `phones` to an array
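
As a rough illustration of that coercion, the sketch below uses a plain tuple as a stand-in for the last row of `phones` (an assumption made so the snippet runs without the table) and passes it to `np.array`.

```python
import numpy as np

# Stand-in for the last row of `phones`: a string, an int, and a float
last_row_values = ('OnePlus 8', 699, 6.6)

as_array = np.array(last_row_values)

print(as_array)             # every value is now a string
print(as_array.dtype.kind)  # 'U' means a Unicode string dtype

# Arithmetic on the price no longer works directly; it must be converted back
price = int(as_array.item(1))
print(price + 1)
```

Because the price and screen size come back as text, any calculation needs an explicit conversion first, which is why working with the `Row` itself is usually more convenient.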

## `.with_row` and `.with_rows`

If you want to add a single row to an existing table, you can do so with `tbl.with_row(row_list)`; to add several rows at once, use `tbl.with_rows(list_of_rows)`. The single-row method takes a `list`, which is similar to an array but can hold values of multiple data types, and `with_rows` takes a list of such lists. We won't use these methods often, but they're still good to know!

In [None]:
... # Add a row to `phones` with the following attributes: Model: 'iPhone 12 Mini', Price: 699, Screen Size: 5.8

In [None]:
... # Add two rows to `phones` with the following attributes:
    # Row 1 - Model: 'iPhone 12 Mini', Price: 699, Screen Size: 5.8
    # Row 2 - Model: 'Moto RAZR', Price: 459, Screen Size: 3.5