In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 11 #

## Lists

In [None]:
# Recall that a list is a sequence of values, and the values can be of different types
mylist = ...
mylist

In [None]:
# We can access individual items in a list with the "bracket operator" (indexing operator)
...

In [None]:
...

In [None]:
# If we use an invalid index, we get an IndexError
...

In [None]:
# The largest valid index is one less than the length of the list
max_valid = len(mylist) - 1
max_valid
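
As a rough illustration of the indexing rules above, here is a minimal sketch on a made-up list. The name `example_list` and its contents are assumptions for illustration, not the list used in lecture.

```python
# A hypothetical list mixing strings, a float, and an int
example_list = ['Milk Tea', 'Asha', 5.5, 3]

# Indexing starts at 0, so the first item is at index 0
first_item = example_list[0]

# The largest valid index is one less than the length
last_item = example_list[len(example_list) - 1]

# Indexing one past the end raises an IndexError
try:
    example_list[len(example_list)]
except IndexError as err:
    message = str(err)

print(first_item, last_item, message)
```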

## Rows from lists

In [26]:
# We already know how to make a new table by providing arrays for the columns
# Use the `with_columns` method:
Table().with_columns('Numbers', make_array(1, 2, 3))

Numbers
1
2
3


In [27]:
# If we want, we can use a list instead of an array
Table().with_columns('Numbers', [1, 2, 3])

Numbers
1
2
3


In [28]:
# We can also make a new table by providing lists for the ROWS
# We start by making a rowless table with three column headings
drinks = Table(['Drink', 'Cafe', 'Price'])
drinks

Drink,Cafe,Price


In [29]:
# Then use the `with_rows` method, passing in a "list of lists" for the row values
drinks = drinks.with_rows([
    ['Milk Tea', 'Asha', 5.5],
    ['Espresso', 'Strada',  1.75],
    ['Latte',    'Strada',  3.25],
    ['Espresso', "FSM",   2]
])
drinks

Drink,Cafe,Price
Milk Tea,Asha,5.5
Espresso,Strada,1.75
Latte,Strada,3.25
Espresso,FSM,2.0


In [30]:
# Here's the list of lists which describes the sequence of rows
list_of_rows = [
    ['Milk Tea', 'Asha', 5.5],
    ['Espresso', 'Strada',  1.75],
    ['Latte',    'Strada',  3.25],
    ['Espresso', "FSM",   2]
]
list_of_rows

[['Milk Tea', 'Asha', 5.5],
 ['Espresso', 'Strada', 1.75],
 ['Latte', 'Strada', 3.25],
 ['Espresso', 'FSM', 2]]

**Question**: Why can't an `array` hold the row information for a row of the drinks table? Why use a `list` instead?
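
One way to explore this question is to place a row's values in a NumPy array and look at what comes back. This sketch calls `np.array` directly; the assumption is that `make_array` from the `datascience` library behaves similarly.

```python
import numpy as np

row_as_list = ['Milk Tea', 'Asha', 5.5]
row_as_array = np.array(row_as_list)

# The list keeps each value's own type
list_types = [type(value).__name__ for value in row_as_list]

# The array holds a single type, so the price is converted to a string
array_price = row_as_array[2]

print(list_types)
print(row_as_array.dtype.kind, repr(array_price))
```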

**Back to Slides...**

## Grouping by Two Categorical Variables (aka Cross-Classification)

In [None]:
# These are the Data 8 "Welcome Survey" data from Spring 2022
survey = Table.read_table('welcome_survey_sp22.csv')
survey.show(3)

We're going to group the survey data by 'Handedness' and 'Sleep position' to investigate a possible association between the two. It's a good idea to first group by each column individually, to familiarize ourselves with the distributions of those variables.

In [None]:
survey.group('Handedness')

In [None]:
survey.group('Sleep position')

There are 3 unique values for 'Handedness' and 4 for 'Sleep position'. 

  - How many rows might we get when we group on both variables simultaneously?
  - Is there an association between handedness (right/left) and preferred side for sleeping (right/left)?

In [None]:
# Notice the syntax: The two column labels are put into a list, and that list
# is the first argument to `group`
survey.group(['Handedness', 'Sleep position']).show()
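
To get a feel for what grouping on two columns counts, here is a plain-Python sketch that tallies pairs of values. The responses and category names below are hypothetical, not taken from the survey file. With 3 handedness values and 4 sleep positions there are at most 3 × 4 = 12 possible pairs, and only the pairs that actually occur produce a row.

```python
from collections import Counter

# Hypothetical (handedness, sleep position) pairs, one per respondent
responses = [
    ('Right', 'Back'),
    ('Right', 'Left side'),
    ('Left', 'Right side'),
    ('Right', 'Left side'),
    ('Both', 'Back'),
]

# Count how many respondents fall into each observed pair
pair_counts = Counter(responses)

for (hand, position), count in sorted(pair_counts.items()):
    print(hand, position, count)
```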

In [None]:
# To answer the second question, focus on just the 4 relevant rows
survey.group(['Handedness', 'Sleep position']).take(5, 6, 9, 10)

Leftie side-sleepers are almost evenly split between preferring to sleep on one side or the other. Rightie side-sleepers seem to be somewhat more inclined to sleep on their right.

We can also include a second argument (an aggregating function). Check this out:

In [None]:
survey.group(['Handedness', 'Sleep position'], np.average).sort(8, descending=True).show()

Hmm. Looks like Data 8 students who are ambidextrous and sleep on their backs tend to have a lot of piercings! At least, it was true in Spring 2022, according to the survey responses.
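
The idea of passing an aggregating function can be sketched in plain Python: collect the values of another column for each pair, then reduce each collection to one number. The rows below are hypothetical, and `statistics.mean` stands in for `np.average`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (handedness, sleep position, hours of sleep)
rows = [
    ('Right', 'Back', 7.0),
    ('Right', 'Back', 8.0),
    ('Left', 'Left side', 6.5),
]

# Gather the hours for each (handedness, sleep position) pair
hours_by_pair = defaultdict(list)
for hand, position, hours in rows:
    hours_by_pair[(hand, position)].append(hours)

# Apply the aggregating function to each group's values
averages = {pair: mean(values) for pair, values in hours_by_pair.items()}
print(averages)
```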

**Back to Slides...**

## Compare group and pivot

A pivot table is similar to grouping with two categorical variables.

In [None]:
# Using group
survey.group(['Handedness', 'Sleep position']).show()

In [None]:
# Using pivot
survey.pivot('Handedness', 'Sleep position').show()

These tables show the same information in different layouts.
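
The relationship between the two layouts can be sketched in plain Python: start from pair counts (the grouped layout) and spread them into a grid with one row per sleep position and one column per handedness value (the pivot layout). The data are hypothetical. Pairs that never occur get a count of 0 in the grid.

```python
from collections import Counter

# Hypothetical (handedness, sleep position) pairs
pairs = [
    ('Left', 'Back'),
    ('Right', 'Back'),
    ('Right', 'Left side'),
    ('Right', 'Left side'),
]

# Grouped layout: one entry per observed pair
counts = Counter(pairs)

# Pivot layout: rows are sleep positions, columns are handedness values
column_labels = sorted({hand for hand, _ in pairs})
row_labels = sorted({position for _, position in pairs})
grid = {
    position: [counts[(hand, position)] for hand in column_labels]
    for position in row_labels
}

print(column_labels)
for position, row in grid.items():
    print(position, row)
```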

To show aggregated values of some third column instead of counts, we have to provide the label for the third variable and the name of the aggregating function.

In [None]:
print("Average Sleep Hours, by Handedness and Sleep Position")
survey.pivot('Handedness', 'Sleep position', 'Hours of sleep', np.average).show()

**Back to Slides...**

## Pivot Table Discussion Questions ##

In [14]:
# From the CORGIS Dataset Project
# By Austin Cory Bart acbart@vt.edu
# Version 2.0.0, created 3/22/2016
# https://corgis-edu.github.io/corgis/csv/skyscrapers/
sky = Table.read_table('skyscrapers.csv')
sky.show(5)

name,material,city,height,completed
One World Trade Center,mixed/composite,New York City,541.3,2014
Willis Tower,steel,Chicago,442.14,1974
432 Park Avenue,concrete,New York City,425.5,2015
Trump International Hotel & Tower,concrete,Chicago,423.22,2009
Empire State Building,steel,New York City,381.0,1931


In [15]:
# 1. For each city, what’s the tallest building for each material?

# The first two columns we care about are city and material. We also need
# the building heights and names.

sky1 = sky.select('city', 'material', 'height', 'name')
sky1.show(5)

city,material,height,name
New York City,mixed/composite,541.3,One World Trade Center
Chicago,steel,442.14,Willis Tower
New York City,concrete,425.5,432 Park Avenue
Chicago,concrete,423.22,Trump International Hotel & Tower
New York City,steel,381.0,Empire State Building


In [16]:
# As a warm-up exercise, solve a simpler version of the question, like: What's the tallest
# concrete skyscraper in Atlanta?
sky1.where('city', 'Atlanta').where('material', 'concrete').sort('height', descending=True).show(5)

city,material,height,name
Atlanta,concrete,264.25,SunTrust Plaza
Atlanta,concrete,220.37,Westin Peachtree Plaza
Atlanta,concrete,206.35,AT&T Building
Atlanta,concrete,202.69,Sovereign
Atlanta,concrete,200.16,1180 Peachtree


**SunTrust Plaza is the winner, with a height of 264.25 meters.**

In [17]:
# To solve the overall question, use group() to cross-classify the data using city and material. Provide
# max as the aggregating function, to learn the max heights.

max_table = (
    sky1.group(['city','material'], max)
)
max_table.show(5)

city,material,height max,name max
Atlanta,concrete,264.25,Westin Peachtree Plaza
Atlanta,mixed/composite,311.8,Two Alliance Center
Atlanta,steel,169.47,State of Georgia Building
Austin,concrete,208.15,Windsor on the Lake
Austin,steel,93.6,University of Texas Tower


**Carefully inspect the result.**

  - Is 264.25 the correct height for the tallest concrete skyscraper in Atlanta?
  - Is Westin Peachtree Plaza the correct name?
  - How does `max` operate on an array of strings?
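
A small sketch may help with these questions. It uses only the five Atlanta concrete rows shown earlier and plain Python lists rather than table columns. `max` is applied to each column independently, and on strings it returns the value that sorts last, so the name it picks need not belong to the tallest building.

```python
# (name, height) pairs copied from the five Atlanta concrete rows shown earlier
atlanta_concrete = [
    ('SunTrust Plaza', 264.25),
    ('Westin Peachtree Plaza', 220.37),
    ('AT&T Building', 206.35),
    ('Sovereign', 202.69),
    ('1180 Peachtree', 200.16),
]

names = [name for name, _ in atlanta_concrete]
heights = [height for _, height in atlanta_concrete]

# max on numbers gives the largest height
max_height = max(heights)

# max on strings gives the name that sorts last, unrelated to height
max_name = max(names)

# To get the name of the tallest building, compare rows by height
tallest_name = max(atlanta_concrete, key=lambda pair: pair[1])[0]

print(max_height, max_name, tallest_name)
```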

In [18]:
# "name max" is not what we want
max_table = max_table.drop('name max').relabeled(2, 'maximum height')
max_table.show(5)

city,material,maximum height
Atlanta,concrete,264.25
Atlanta,mixed/composite,311.8
Atlanta,steel,169.47
Austin,concrete,208.15
Austin,steel,93.6


In [19]:
# Question: Define a function which takes a row index k (for max_table)
# and finds the name of the corresponding building in the sky1 table
def find_name(k):
    # c is the city
    c = max_table.column('city').item(k)
    # m is the material
    m = max_table.column('material').item(k)
    # h is the height
    h = max_table.column('maximum height').item(k)
    # matches should have just one row
    matches = sky1.where('city', c).where('material', m).where('height', h)
    
    return matches.column('name').item(0)

# Which building is max_table's fourth row? (Austin, concrete, 208.15)
find_name(3)

'The Austonian'

Now let's make a pivot table from the sky data.

In [20]:
sky_p = sky.pivot('material', 'city', 'height', max)
sky_p.show(5)

city,concrete,mixed/composite,steel
Atlanta,264.25,311.8,169.47
Austin,208.15,0.0,93.6
Baltimore,161.24,0.0,155.15
Boston,121.92,139.0,240.79
Charlotte,265.48,239.7,179.23


Compare the pivot table with the grouping table (max_table).

In [21]:
# 2. For each city, what’s the height difference between the tallest 
#    steel building and the tallest concrete building?

# Hint: Use the pivot table from the previous question to compute
# the absolute differences and add that new column to the pivot table

diff = ...
sky_p_with_difference = ...
sky_p_with_difference

Ellipsis

In [22]:
# Write code to count the rows in the previous table where the difference is negative
...
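
The arithmetic behind these two cells can be sketched on plain Python lists, using only the five pivot rows shown above. This is one possible approach, not the intended solution. Note that an absolute difference is never negative, so counting negative values only makes sense for the signed difference (here, steel minus concrete).

```python
# Rows copied from the pivot output shown above: (city, concrete, steel)
tallest = [
    ('Atlanta', 264.25, 169.47),
    ('Austin', 208.15, 93.6),
    ('Baltimore', 161.24, 155.15),
    ('Boston', 121.92, 240.79),
    ('Charlotte', 265.48, 179.23),
]

# Signed difference: negative when the concrete building is taller
signed = [steel - concrete for _, concrete, steel in tallest]

# Absolute difference: size of the gap, regardless of which is taller
absolute = [abs(d) for d in signed]

# Count the cities where the signed difference is negative
negative_count = sum(1 for d in signed if d < 0)
print(negative_count)
```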

**Back to slides...**

### Another Challenge

Back in the lecture slides, we see an image of the desired table...

In [23]:
# 3. For each material and each city (cross-classification), find the name of the oldest 
# skyscraper. Show the results in a table.

# Let's start by viewing the sky table:
sky.show(5)

name,material,city,height,completed
One World Trade Center,mixed/composite,New York City,541.3,2014
Willis Tower,steel,Chicago,442.14,1974
432 Park Avenue,concrete,New York City,425.5,2015
Trump International Hotel & Tower,concrete,Chicago,423.22,2009
Empire State Building,steel,New York City,381.0,1931


In [24]:
# Hint: You can use sort to find the name of the oldest building in Chicago
old = sky.where('city', 'Chicago').sort('completed').column('name').item(0)
print("Oldest in Chicago:", old)

Oldest in Chicago: The Rookery


In [25]:
# Define a function, first, which accepts an array of values and returns
# the item at index 0. We'll use it with "pivot", below...

def first(my_array):
    '''Takes a non-empty array of values and returns the first item.'''
    return my_array.item(0)

# To find the oldest building, taking the first isn't helpful unless we first sort by 'completed' 
# in ascending order to make the oldest building name pop up to the top.
sky.sort('completed').pivot('material', 'city', 'name', first)


city,concrete,mixed/composite,steel
Atlanta,Westin Peachtree Plaza,One Atlantic Center,FlatironCity
Austin,One American Center,,University of Texas Tower
Baltimore,Charles Towers North Apartments,,Emerson Tower
Boston,Harbor Towers I,Ellison Building,Marriott's Custom House
Charlotte,Bank of America Corporate Center,Hearst Tower,Midtown Plaza
Chicago,The Powhatan,American Furniture Mart,The Rookery
Cincinnati,Kroger Building,Great American Tower at Queen City Square,PNC Tower
Cleveland,National City Center,55 Public Square,Huntington Bank Building
Columbus,Key Bank Building,,Leveque Tower
Dallas,Reunion Tower,Bank of America Plaza,Three AT&T Plaza


That's an impressively short block of code, for what seemed like a huge challenge. Read up on the `pivot` method in the `datascience` documentation and be sure you understand how that last line of code did the trick.
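
The sort-then-take-first idea can be sketched in plain Python. The building names and years below are hypothetical. The sketch assumes, as the lecture code does, that rows keep their sorted order within each group, so the first name seen for a pair belongs to its oldest building.

```python
# Hypothetical rows: (name, material, city, completed)
buildings = [
    ('Tower A', 'steel', 'Chicago', 1974),
    ('Tower B', 'steel', 'Chicago', 1888),
    ('Tower C', 'concrete', 'Chicago', 1929),
    ('Tower D', 'steel', 'Boston', 1915),
]

oldest = {}
# Sort by completion year so the oldest buildings come first
for name, material, city, completed in sorted(buildings, key=lambda b: b[3]):
    # setdefault keeps only the first name seen for each (city, material) pair
    oldest.setdefault((city, material), name)

print(oldest)
```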

If you're feeling a little shaky about all these table manipulations, that's OK! You have Project 1 to work on, and it will give you the practice you need to solidify these skills. 

Just be sure to spend some time **each day** working on your code for this class. You cannot build a skill without regular practice.

And please seek out the help you need when you are stuck! You have classmates, tutors, and a professor to assist you.

**Back to Slides...**

## Joins ##

In [31]:
# From the beginning of lecture...
drinks

Drink,Cafe,Price
Milk Tea,Asha,5.5
Espresso,Strada,1.75
Latte,Strada,3.25
Espresso,FSM,2.0


In [32]:
# Here's the information about discounts
discounts = Table().with_columns(
    'Coupon % off', make_array(10, 25, 5),
    'Location', make_array('Asha', 'Strada', 'Asha')
)
discounts

Coupon % off,Location
10,Asha
25,Strada
5,Asha


In [33]:
# Combine the tables using `join`
combined = drinks.join('Cafe', discounts, 'Location')
combined

Cafe,Drink,Price,Coupon % off
Asha,Milk Tea,5.5,10
Asha,Milk Tea,5.5,5
Strada,Espresso,1.75,25
Strada,Latte,3.25,25


In [35]:
# Add a column which shows the discounted price
discount_proportion = combined.column('Coupon % off') / 100
discount_dollars = combined.column('Price') * discount_proportion
combined.with_column(
    'Discounted Price', 
    np.round(combined.column('Price') - discount_dollars, 2)
)

Cafe,Drink,Price,Coupon % off,Discounted Price
Asha,Milk Tea,5.5,10,4.95
Asha,Milk Tea,5.5,5,5.22
Strada,Espresso,1.75,25,1.31
Strada,Latte,3.25,25,2.44


In [None]:
# What happens when we join the drinks table with itself on the 'Cafe' column?
drinks.join('Cafe', drinks, 'Cafe')

For each cafe, we see all the options for ordering a first drink and a second drink.
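
The same matching logic, sketched in plain Python, shows how many rows to expect from the self-join: a cafe with n drinks contributes n × n rows. This is an illustration only, not the library's implementation.

```python
# (drink, cafe) pairs from the drinks table
drinks = [
    ('Milk Tea', 'Asha'),
    ('Espresso', 'Strada'),
    ('Latte', 'Strada'),
    ('Espresso', 'FSM'),
]

# Pair every drink with every drink sold at the same cafe
pairs = [
    (cafe_1, first_drink, second_drink)
    for first_drink, cafe_1 in drinks
    for second_drink, cafe_2 in drinks
    if cafe_1 == cafe_2
]

# Asha: 1 x 1, Strada: 2 x 2, FSM: 1 x 1
print(len(pairs))
```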