# Lecture 10: Groups

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Prediction

In [None]:
# read family heights data from CSV file
families = Table.read_table('family_heights.csv')
families

This is an old, classic data set. Each family consists of two parents of opposite genders and their biological children. The data for each family are spread over multiple rows, depending on the number of children (one row per child in the family). Columns at indexes 1, 2, and 3 are heights (in inches).

**Question**: Do we understand all the columns? 

We're wondering if there is an association between parent heights and child heights, but there are two parent heights for each child height. To simplify the situation, let's add a column, 'Parent Average', holding the average height of the two parents. Then we can look for an association between 'Parent Average' and 'Child' heights.

In [None]:
# Calculate an array of values for the new 'Parent Average' column
parent_avgs = (families.column('father') + families.column('mother')) / 2
parent_avgs[0:10]  # a slice, the first 10 values in the array
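
The `+` and `/ 2` in the cell above work position by position on the two arrays. Here is a minimal plain-Python sketch of the same arithmetic. The heights and the names `fathers`, `mothers`, and `parent_avgs_sketch` are made up for illustration; they are not rows from `family_heights.csv`.

```python
# Hypothetical heights in inches; not taken from family_heights.csv
fathers = [70.0, 68.5, 72.0]
mothers = [64.0, 65.5, 66.0]

# Pair each father's height with the matching mother's height and
# average the pair, which is what the array expression does at each position.
parent_avgs_sketch = [(f + m) / 2 for f, m in zip(fathers, mothers)]
print(parent_avgs_sketch)
```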

Use the `families` table and the `parent_avgs` array to build a new table, `heights`, with columns labeled as follows:

  - `Parent Average`
  - `Child`
  - `Sex`

In [None]:
heights = Table().with_columns(
    'Parent Average', parent_avgs,
    'Child', families.column('child'),
    'Sex', families.column('sex')
)
heights

Draw a scatterplot to check for an association between 'Parent Average' and 'Child' heights.

In [None]:
heights.scatter('Parent Average', 'Child')
plots.title("Heights in Inches for Adult Child vs. Parents' Average");

Do we see any association between the variables in this scatterplot? 



Suppose some parents from long ago had an average height between them of 68 inches, and they wanted to use this data to predict the adult height of their child. What would you tell them?

In [None]:
# It might help to focus our attention on the dots for long-ago parents with
# average height within 0.5 inches of 68
heights.scatter('Parent Average', 'Child')
plots.title("Heights in Inches for Adult Child vs. Parents' Average");

# x-y data for two points lying on a vertical line x = 67.5
xs = [67.5, 67.5]
ys = [55, 80]

plots.plot(xs, ys, color='red', lw=2);
xs = [68.5, 68.5]
plots.plot(xs, ys, color='red', lw=2);

The parents pressure us to provide a single height number, predicting the adult height of their child, based only on these data and the fact that the parents' average height is around 68 inches. So, we decide to find the average height for all the 'Child' values from the scatterplot dots which fall between the two red lines shown above.

Let's see how to find that average.

In [None]:
# Focus on the rows where Parent Average is between (68 - 0.5) and (68 + 0.5)
nearby = heights.where('Parent Average', are.between(67.5, 68.5))
nearby

In [None]:
# Find the mean of the 'Child' column in the `nearby` table
nearby_mean = np.mean(nearby.column('Child'))
nearby_mean
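
The two cells above keep the rows inside a one-inch window and then average the 'Child' values. Note that `are.between(67.5, 68.5)` includes the lower bound but excludes the upper bound. The sketch below mirrors that logic in plain Python; the data and the name `nearby_mean_sketch` are illustrative assumptions, and returning `None` for an empty window is a choice made for this sketch only.

```python
# Hypothetical (parent_average, child) pairs in inches; illustrative only
rows = [(67.5, 66.0), (68.0, 68.0), (68.4, 70.0), (68.5, 71.0), (70.0, 72.0)]

def nearby_mean_sketch(rows, center, half_width=0.5):
    """Mean child height for rows with lo <= parent average < hi."""
    lo, hi = center - half_width, center + half_width
    children = [child for parent_avg, child in rows if lo <= parent_avg < hi]
    if not children:
        return None  # no rows in the window, so there is nothing to average
    return sum(children) / len(children)

print(nearby_mean_sketch(rows, 68))  # the row at exactly 68.5 is left out
```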

In [None]:
# Redraw the previous scatterplot with a big red dot indicating the
# point (68, nearby_mean)
heights.scatter('Parent Average', 'Child')
plots.title("Heights in Inches for Adult Child vs. Parents' Average");

# vertical lines
xs = [67.5, 67.5]
ys = [55, 80]
plots.plot(xs, ys, color='red', lw=2);
xs = [68.5, 68.5]
plots.plot(xs, ys, color='red', lw=2);

# big red dot
plots.scatter(68, nearby_mean, color='red', s=50);

Should we confidently tell the parents, "Your child will be 67.62 inches tall when they grow up"?

Now imagine that many pairs of parents want our advice on predicting the adult heights of their children. Instead of repeating all of the code above, with slight variations, for each pair of parents, we define a function that generalizes the computation: it takes the average height of the parents and returns the `nearby_mean` child height as its prediction.

In [None]:
# Define the new function; name it `predict` since its job is to make a prediction
def predict(parent_avg):
    nearby = heights.where('Parent Average', are.between(parent_avg - 0.5, parent_avg + 0.5))
    nearby_mean = np.mean(nearby.column('Child'))
    return nearby_mean

# Call the function with the argument 68; it should return the familiar 67.62 answer
print(predict(68))

# If we pass in the argument 70, we should get a slightly larger prediction
print(predict(70))

What happens if we try to predict for parents with average height equal to 72 inches?

In [None]:
predict(72)

What happens if we use `apply`, as follows? What type of result does `heights.apply` return?

In [None]:
predicted_heights = heights.apply(predict, 'Parent Average')
predicted_heights[:10]

In [None]:
# Add a 'Prediction' column to our heights table
heights = heights.with_column('Prediction', predicted_heights)

# Select the columns 'Parent Average', 'Child', and 'Prediction' and use
# that result to make a new scatterplot. We refer to the yellow dots in the chart
# as a "Graph of Averages"
heights.select('Parent Average', 'Child', 'Prediction').scatter(0)
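
The 'Prediction' column comes from calling `predict` once per row, using that row's own 'Parent Average' as the center of the window. Here is a rough plain-Python sketch of that idea. The data and the name `window_mean` are made up for illustration.

```python
# Hypothetical (parent_average, child) pairs in inches; illustrative only
rows = [(66.0, 64.0), (66.2, 66.0), (68.0, 67.0), (68.3, 69.0), (70.0, 70.0)]

def window_mean(center, half_width=0.5):
    """Average child height among rows whose parent average is near center."""
    children = [c for p, c in rows if center - half_width <= p < center + half_width]
    return sum(children) / len(children)

# One prediction per row: each row's own parent average is the window center,
# so its window always contains at least that row.
predictions = [window_mean(p) for p, _ in rows]
print(predictions)
```

Plotting these predictions against the parent averages gives the graph of averages for this toy data.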

We say that the original scatterplot ('Child' versus 'Parent Average') shows a **linear association**, because the graph of averages is roughly linear. 

Later in the course, we'll learn how to find the "regression line" (the line suggested by the graph of averages for a linear association) and how to use it to make predictions.

## Prediction Errors

To get a feeling for how accurate our predictions are, we can calculate the differences between the actual heights for the adult children and the predicted heights. **Define a function** which takes two numbers and returns their difference.

In [None]:
def diff(a, b):
    """Takes two numbers, a and b, and returns their difference, a - b."""
    return a - b

In [None]:
# Find the prediction errors (`pred_errs`) using `apply` and `diff`
# Note: prediction error = prediction - actual (not actual - prediction)
pred_errs = heights.apply(diff, 'Prediction', 'Child')

# Add a new column, 'Errors', to the heights table
heights = heights.with_columns('Errors', pred_errs)
heights.show(6)
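
When `apply` is given two column labels, it passes one value from each column to the function, row by row. A small plain-Python sketch with made-up numbers shows how the sign of each error should be read:

```python
# Hypothetical predicted and actual heights in inches; illustrative only
predictions = [67.0, 67.0, 69.5]
actuals = [65.0, 70.0, 69.5]

def diff(a, b):
    """Takes two numbers and returns their difference, a - b."""
    return a - b

# error = prediction - actual, computed one row at a time
errors = [diff(p, a) for p, a in zip(predictions, actuals)]
print(errors)  # positive: prediction too large; negative: prediction too small
```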

In [None]:
# Draw a histogram to visualize the distribution of 'Errors'
heights.hist('Errors')

Because each prediction is the average of the 'Child' heights for families with similar parent averages, we are not surprised to see that the predictions are too large (positive error) about as often as they are too small (negative error).

In [None]:
# Is there a relationship between 'Errors' and 'Sex'? 
# Draw an overlaid histogram to investigate this question.
heights.hist('Errors', group='Sex')

Discuss with your neighbor: What does this histogram tell us?

**Back to Slides**

## Discussion Question

We can make a *smarter* prediction by taking the child's sex into account.

In [None]:
# Our new function takes two arguments, the average height of the parents and 
# the sex of the child ('male' or 'female'), and returns the predicted Child height
def predict_smarter(parent_avg, sex):
    nearby = (
        heights.where('Parent Average', are.between(parent_avg - 0.5, parent_avg + 0.5))
        .where('Sex', sex)
    )
    nearby_mean = np.mean(nearby.column('Child'))
    return nearby_mean

In [None]:
predict_smarter(68, 'female')

In [None]:
predict_smarter(68, 'male')
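
The only change from `predict` is a second filter on 'Sex'. Here is a plain-Python sketch of the two-condition filter. The rows and the name `predict_smarter_sketch` are illustrative assumptions, not the real data.

```python
# Hypothetical (parent_average, child, sex) rows; illustrative only
rows = [
    (68.0, 70.0, 'male'),
    (68.2, 72.0, 'male'),
    (67.8, 64.0, 'female'),
    (68.4, 66.0, 'female'),
    (71.0, 74.0, 'male'),
]

def predict_smarter_sketch(parent_avg, sex, half_width=0.5):
    """Mean child height among nearby rows whose sex matches."""
    children = [
        child for p, child, s in rows
        if parent_avg - half_width <= p < parent_avg + half_width and s == sex
    ]
    # Raises ZeroDivisionError if no row matches both conditions
    return sum(children) / len(children)

print(predict_smarter_sketch(68, 'male'))
print(predict_smarter_sketch(68, 'female'))
```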

In [None]:
smarter_predictions = heights.apply(predict_smarter, 'Parent Average', 'Sex')
heights = heights.with_column('Smarter Preds', smarter_predictions)
heights

In [None]:
# If this is really a "smarter" prediction, the errors should be smaller
smart_errs = heights.apply(diff, 'Smarter Preds', 'Child')
heights = heights.with_column('Smarter Errors', smart_errs)

# Make a histogram of the 'Smarter Errors'
heights.hist('Smarter Errors')

In [None]:
# The new errors are "smarter" because the values are more concentrated close to zero
# Remember the amount of spread in the original errors:
heights.hist('Errors')

Also, remember that with the original 'Errors', there was a pattern: the errors for males were mostly negative (predictions too small), and the errors for females were mostly positive (predictions too large).

In [None]:
heights.hist('Errors', group='Sex')

In [None]:
# Is that still true for 'Smarter Errors'?
heights.hist('Smarter Errors', group='Sex')

**Back to Slides**

## Grouping by One Column

In [None]:
# Load the `cones` table, which you saw in the reading; drop the 'Color' column
cones = Table.read_table('cones.csv').drop('Color')
cones.show()

In [None]:
# Describe what happens when we simply group by one column
cones.group('Flavor')

The values in the 'Flavor' column are now unique (no repeated values), and the 'Price' column has disappeared. Instead, we have a 'count' column, reminding us how many rows in the original table had a particular flavor value.
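
Conceptually, grouping by one column with no second argument is a tally of how often each distinct value appears. Here is a minimal sketch using the standard library's `Counter`. The rows are made up and are not the contents of `cones.csv`.

```python
from collections import Counter

# Hypothetical (flavor, price) rows; illustrative only
cone_rows = [
    ('strawberry', 3.55),
    ('chocolate', 4.75),
    ('chocolate', 6.55),
    ('strawberry', 5.25),
]

# Count rows per flavor; the prices play no part in the tally
counts = Counter(flavor for flavor, _ in cone_rows)

# Sorted here only so the printout has a predictable order
for flavor, count in sorted(counts.items()):
    print(flavor, count)
```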

In [None]:
# What if we wanted to list the prices for each flavor?
cones.group('Flavor', list)

In [None]:
# What if we wanted to show the average price for each flavor?
cones.group('Flavor', np.average)

In [None]:
# What if we wanted to show the smallest price for each flavor?
cones.group('Flavor', np.min).relabeled(1, 'Min Price')

Notice that the function named in the second argument is applied to **every column** except the one named in the first argument. If we add a 'Rating' column to our original cones table, we can show the average price and average rating for each flavor.

In [None]:
cones_rated = cones.with_column('Rating', make_array(4, 4, 2, 5, 5, 1))
cones_rated

In [None]:
cones_rated.group('Flavor', np.average)

In [None]:
cones_rated.group('Flavor', list)
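
The cells above pass different aggregating functions to `group`. One way to picture what happens: collect the rows for each flavor, then apply the function to each remaining column separately. The sketch below does this in plain Python. The rows and the names `group_sketch` and `average` are illustrative assumptions, not the library's implementation.

```python
from collections import defaultdict

# Hypothetical (flavor, price, rating) rows; illustrative only
rows = [
    ('chocolate', 4.0, 4),
    ('strawberry', 3.0, 2),
    ('chocolate', 6.0, 5),
    ('strawberry', 5.0, 4),
]

def group_sketch(rows, aggregate):
    """Group on the first field and apply `aggregate` to each other column."""
    groups = defaultdict(list)
    for key, *rest in rows:
        groups[key].append(rest)
    result = {}
    for key, members in groups.items():
        # zip(*members) turns the group's rows into columns
        result[key] = tuple(aggregate(list(col)) for col in zip(*members))
    return result

def average(values):
    return sum(values) / len(values)

print(group_sketch(rows, average))
print(group_sketch(rows, min))
print(group_sketch(rows, list))
```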

**Back to Slides**

## Grouping -- Survey Data

In [None]:
# These are the Data 8 "Welcome Survey" data from Spring 2022
survey = Table.read_table('welcome_survey_sp22.csv')
survey.show(3)

In [None]:
# If we group by Extroversion, we get a count for each unique value in that column
survey.group('Extroversion')

In [None]:
# We can visualize this distribution with a histogram
survey.hist('Extroversion')

In [None]:
# We can calculate averages "by Extroversion" for all the other numerical columns
survey.group('Extroversion', np.average)

In [None]:
# It looks like students who rated themselves as more extroverted
# tend to have more "textees"
by_extra = survey.group('Extroversion', np.average)
by_extra.plot(0, 2)

In [None]:
# We can group by "Year" and find the average hours of sleep for each group
survey.select("Year", "Hours of sleep").group("Year", np.average)

In [None]:
undergrad_years = make_array('First', 'Second', 'Third', 'Fourth', 'Fifth or above')
undergrad_years

In [None]:
# We can group by "Year" and find the average hours of sleep for each group,
# focusing only on the undergrads
undergrads = survey.where('Year', are.contained_in(undergrad_years))
undergrads.select("Year", "Hours of sleep").group("Year", np.average)

In [None]:
# We can assign number codes for year and then sort by that code to organize the rows
def get_yr_code(year_string):
    if year_string == 'First':
        return 1
    elif year_string == 'Second':
        return 2
    elif year_string == 'Third':
        return 3
    elif year_string == 'Fourth':
        return 4
    elif year_string == 'Fifth or above':
        return 5
    else:
        return None  # Should not happen

year_code = undergrads.apply(get_yr_code, 'Year')
undergrads = undergrads.with_column('Year_code', year_code)
undergrads_sorted = undergrads.select('Year_code', 'Hours of sleep').sort('Year_code')
undergrads_sorted.group('Year_code', np.mean)
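
The chain of `if` statements in `get_yr_code` is a lookup from a label to a number, so a dictionary can express the same mapping. The sketch below also shows why the codes help: sorting the labels as text puts them in alphabetical order, not chronological order. The names `YEAR_CODES` and `year_code_sketch` and the sample responses are illustrative.

```python
YEAR_CODES = {'First': 1, 'Second': 2, 'Third': 3, 'Fourth': 4, 'Fifth or above': 5}

def year_code_sketch(year_string):
    # dict.get returns None for labels that are not in the mapping
    return YEAR_CODES.get(year_string)

years = ['Third', 'First', 'Fifth or above', 'Second']  # hypothetical responses

alphabetical = sorted(years)                     # text order
chronological = sorted(years, key=year_code_sketch)  # order by year code
print(alphabetical)
print(chronological)
```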

## Grouping by Two Columns (aka Cross-Classification)

We're going to group the survey data by two columns, 'Handedness' and 'Sleep position'. It's a good idea to first group by each column individually, to familiarize ourselves with the distributions of those variables.

In [None]:
survey.group('Handedness')

In [None]:
survey.group('Sleep position')

There are 3 unique values for 'Handedness' and 4 for 'Sleep position'. 

  - How many rows might we get when we group on both variables simultaneously?
  - Is there an association between handedness (right/left) and preferred side for sleeping (right/left)?

In [None]:
# Notice the syntax: The two column labels are put into a list, and that list
# is the first argument to `group`
survey.group(['Handedness', 'Sleep position']).show()

In [None]:
# To answer the second question, focus on just the 4 relevant rows
survey.group(['Handedness', 'Sleep position']).take(5, 6, 9, 10)

Leftie side-sleepers are almost evenly split between preferring to sleep on one side or the other. Rightie side-sleepers seem to be somewhat more inclined to sleep on their right.
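
Grouping by two columns tallies each *pair* of values. With 3 handedness values and 4 sleep positions there can be at most 3 × 4 = 12 rows, and only combinations that actually occur get a row. Here is a minimal sketch with `Counter`. The responses and their labels are made up and may not match the survey's wording.

```python
from collections import Counter

# Hypothetical (handedness, sleep_position) responses; illustrative only
responses = [
    ('Right-handed', 'On your back'),
    ('Right-handed', 'On your right side'),
    ('Left-handed', 'On your left side'),
    ('Right-handed', 'On your right side'),
    ('Left-handed', 'On your back'),
]

# Each distinct pair becomes one "row" with its count
pair_counts = Counter(responses)

# Sorted here only so the printout has a predictable order
for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```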

We can also include a second argument (an aggregating function). Check this out:

In [None]:
survey.group(['Handedness', 'Sleep position'], np.average).show()

Hmm. Looks like Data 8 students who are ambidextrous and sleep on their backs tend to have a lot of piercings! At least, it was true in Spring 2022, according to the survey responses.