# Groups

In [None]:
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

## 1. Functions

### Think-pair-share

What will be printed after the following code is executed? 

In [None]:
def triple_add_one(x): 
    """Triples the input and adds one"""
    return 3*x + 1

In [None]:
x = 5
y = 3
z = 2
z = z + triple_add_one(y)
#print(z)

### Apply

In [None]:
heights_original = Table().read_table('data/galton.csv')
heights = heights_original.select('father', 'mother', 'childHeight')
heights = heights.relabeled('childHeight', 'child')
heights.show(5)

In [None]:
heights.hist('child')

There are times we want to perform mathematical operations on columns of a table but can't use array broadcasting...

In [None]:
min(heights.column('child'), 72)  # will cause an error

This is problematic because we cannot use array broadcasting with `min` in this way:

In [None]:
min(make_array(70, 73, 69), 72)  # will also cause an error
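
As a rough illustration of why the call above fails, the sketch below uses a plain Python list as a stand-in for the array (an assumption made so the example runs without the `datascience` package). The built-in `min`, given two arguments, compares them as whole objects rather than entry by entry. With a NumPy array the specific error raised is different, but the underlying issue is the same.

```python
# A plain Python list standing in for a column of heights (illustrative values).
sample_heights = [70, 73, 69]

# min(collection, number) tries to compare the two objects as wholes,
# which Python cannot do for a list and an int.
try:
    min(sample_heights, 72)
    whole_collection_failed = False
except TypeError:
    whole_collection_failed = True

# Taking the min one value at a time is well defined.
capped = [min(h, 72) for h in sample_heights]

print(whole_collection_failed)
print(capped)
```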

Instead, define our operation on a *single* value first:

In [None]:
def cut_off_at_72(x):
    """The smaller of x and 72"""
    return min(x, 72)

In [None]:
cut_off_at_72(62)

In [None]:
cut_off_at_72(72)

In [None]:
cut_off_at_72(78)

The table `apply` method can then apply such a function to every entry in a column.

In [None]:
cut_off = heights.apply(cut_off_at_72, 'child')
height2 = heights.with_columns('child', cut_off)
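
Conceptually, `apply` with one column label calls the function once per entry and collects the results. The sketch below mimics that with a plain list; `apply_to_column` is a hypothetical helper written for illustration, not part of the `datascience` package, and the heights are made-up values.

```python
def cut_off_at_72(x):
    """The smaller of x and 72"""
    return min(x, 72)

def apply_to_column(func, column):
    """Hypothetical stand-in for apply: call func on each entry in turn."""
    return [func(value) for value in column]

child_heights = [62, 72, 78]  # illustrative values, not the Galton data
cut_off = apply_to_column(cut_off_at_72, child_heights)
print(cut_off)
```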

In [None]:
height2.hist('child')

### Apply with multiple columns

In [None]:
heights.show(6)

In [None]:
def average(x, y):
    """Compute the average of two values"""
    return (x + y) / 2

In [None]:
parent_avg = heights.apply(average, 'mother', 'father')
parent_avg.take(np.arange(0, 6))
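
When several column labels are given, the function is called once per row, with that row's entries passed as arguments in the order the columns were listed. A minimal sketch of the idea, using plain lists and a hypothetical `apply_to_columns` helper (the values are illustrative):

```python
def average(x, y):
    """Compute the average of two values"""
    return (x + y) / 2

def apply_to_columns(func, *columns):
    """Hypothetical stand-in for apply with several columns:
    walk the columns in lockstep and call func once per row."""
    return [func(*row_values) for row_values in zip(*columns)]

mothers = [67.0, 66.5, 64.0]  # illustrative values
fathers = [78.5, 75.5, 75.0]  # illustrative values
parent_avg = apply_to_columns(average, mothers, fathers)
print(parent_avg)
```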

In [None]:
heights = heights.with_columns(
    'parent average', parent_avg
)
heights

In [None]:
heights.scatter('parent average', 'child')

## 2. Predicting heights using functions and apply

We're following the example in [Ch. 8.1.3](https://inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html).

**Think-pair-share:** Suppose researchers encountered a new couple, similar to those in this dataset, and wondered how tall their child would be once the child grew up. What would be a good way to predict the child’s height, given that the parent average height was, say, 68 inches?

In [None]:
plot = heights.scatter('parent average', 'child')
plot.line(68, color='orange', linestyle='--', lw=2);

**A:** One initial approach would be to base the prediction on all observations (child, parent pairs) whose parent average height is "close to" 68 inches.
- Let's take "close to" to mean within a half-inch
- Let's draw these with red lines

In [None]:
parent_avg_height = 68
close = 0.5

plot = heights.scatter('parent average', 'child')
plot.line(x=parent_avg_height - close, color='red', lw=1)
plot.line(x=parent_avg_height + close, color='red', lw=1)
plot.line(parent_avg_height, color='orange', linestyle='--', lw=2)
plot.dot(x=parent_avg_height, y=67.62, color='orange')

Let's now identify all points within that red strip. 

In [None]:
close_to_68 = heights.where('parent average', 
                            are.between(parent_avg_height - close, 
                                        parent_avg_height + close))
close_to_68

And take the average to make a prediction about the child. 

In [None]:
np.average(close_to_68.column('child'))

Ooo! Let's write a function to compute that mean child height for *any* parent average height.

In [None]:
def predict_child(parent_avg_height):
    close = 0.5
    close_points = heights.where('parent average', 
                                 are.between(parent_avg_height - close, 
                                             parent_avg_height + close))
    return np.mean(close_points.column('child'))

In [None]:
predict_child(68)

In [None]:
predict_child(65)
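
To make the logic of `predict_child` concrete without the table machinery, here is a small sketch on made-up (parent average, child) pairs. It assumes the window is half-open (lower bound included, upper bound excluded), which is how `are.between` is understood to behave; treat that detail as an assumption and check the predicate documentation.

```python
# Hypothetical (parent average, child) pairs; not taken from the Galton data.
pairs = [(67.4, 66.0), (67.8, 68.0), (68.2, 70.0), (68.5, 71.0), (70.0, 72.0)]

def predict_child_sketch(parent_avg_height, close=0.5):
    """Average the child heights whose parent average falls in the window."""
    lo = parent_avg_height - close
    hi = parent_avg_height + close
    # Assumed half-open window: lo <= parent average < hi.
    nearby = [child for parent, child in pairs if lo <= parent < hi]
    return sum(nearby) / len(nearby)

prediction = predict_child_sketch(68)
print(prediction)
```

Only the pairs at 67.8 and 68.2 fall inside the window here, so the prediction is the mean of their two child heights.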

**Apply** `predict_child` to all the parent averages.

In [None]:
predicted = heights.apply(predict_child, 'parent average')
predicted.take(np.arange(0,10))

Now, let's extend this table with these new predictions. 

In [None]:
height_pred = heights.with_columns('prediction', predicted)

In [None]:
height_pred.select('child', 'parent average', 'prediction').scatter('parent average')

**Preview:** Throughout this course we'll keep moving towards making our predictions *better!*

### Extra: How close is close enough for prediction?

Saying that two heights are "close to" each other if they are within a half-inch was a somewhat arbitrary choice. We could have chosen other values instead. What would happen if we changed that constant to be 0.25, 1, 2, or 5?

This visualization demonstrates the impact that choice has on our predictions. The `visualize_predictions` function plots the prediction for each child height using a window of parent average height +/- `delta`.

In [None]:

from functools import lru_cache as cache

@cache  # saves tables for each delta we compute to avoid recomputing.
def vary_range(delta):
    """Use a window of +/- delta when predicting child heights."""
    def predict_child(parent_avg_height):
        close_points = heights.where('parent average', 
                                     are.between(parent_avg_height - delta, 
                                                 parent_avg_height + delta))
        return np.mean(close_points.column('child'))

    predicted = heights.apply(predict_child, 'parent average')
    height_pred = heights.with_columns('prediction', predicted)
    return height_pred.select('child', 'parent average', 'prediction')

def visualize_predictions(delta = 0.5):
    predictions = vary_range(delta)
    predictions.scatter('parent average', s=50, width=6, height=4) # make dots a little bigger than usual
    
interact(visualize_predictions, delta = Slider(0, 10, 0.125))
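
The effect of `delta` can also be seen numerically. The sketch below uses made-up pairs and the same assumed half-open window as `are.between`. A wide window pulls the prediction toward the overall mean, while a window of width zero contains no points at all. This sketch returns `None` in that case; in the notebook, taking `np.mean` of an empty column would give `nan` instead.

```python
# Hypothetical (parent average, child) pairs; not taken from the Galton data.
pairs = [(66.0, 65.0), (67.0, 66.0), (68.0, 69.0), (69.0, 70.0), (70.0, 73.0)]

def predict(parent_avg_height, delta):
    """Mean child height within +/- delta, or None if the window is empty."""
    lo = parent_avg_height - delta
    hi = parent_avg_height + delta
    nearby = [child for parent, child in pairs if lo <= parent < hi]
    if not nearby:
        return None
    return sum(nearby) / len(nearby)

# Predict for a parent average of 68 using several window sizes.
predictions_by_delta = {delta: predict(68.0, delta) for delta in (0, 0.5, 1.5, 5)}
for delta, value in predictions_by_delta.items():
    print(delta, value)
```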

## 3. Groups with Scrabble 

Let's load a table of 98 tiles from [Scrabble](https://en.wikipedia.org/wiki/Scrabble). (We'll exclude the two blank tiles from the full set of 100.)

In [None]:
scrabble_tiles = Table().read_table('data/scrabble_tiles.csv')
scrabble_tiles.sample(10)

We must often divide rows into groups according to some feature, and then compute a basic characteristic for each resulting group.


In [None]:
scrabble_tiles.group('Letter')

In [None]:
scrabble_tiles.group('Vowel')

In [None]:
scrabble_tiles.group('Vowel', sum)

Notes: 
- When we pass a function to `group` (e.g. `sum`) instead of relying on the default count, the name of that function is appended to each column name.
- Some of the columns are empty because `sum` can only be applied to numerical (not categorical) variables. Our package is smart about this and leaves the columns empty (e.g. `Letter sum`).

In [None]:
scrabble_tiles.group('Vowel', max)

- Applying an aggregation function (e.g. `max`) to some columns (e.g. `Letter`) is not meaningful. That's ok. But we'll have to use our understanding of the dataset to ignore these aggregations.
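
One way to picture what `group` does: collect the rows that share a value of the grouping column, then reduce each pile to a single number. The sketch below does this with a dictionary. The rows and the `"Yes"`/`"No"` vowel labels are hypothetical (the CSV may encode them differently), and `group_sketch` is an illustrative helper, not the library's implementation.

```python
from collections import defaultdict

# Hypothetical rows: (letter, score, vowel flag).
tiles = [("A", 1, "Yes"), ("E", 1, "Yes"), ("B", 3, "No"),
         ("Q", 10, "No"), ("Z", 10, "No")]

def group_sketch(rows, collect=len):
    """Pile up the scores by vowel flag, then reduce each pile with collect.
    The default, len, gives a count of rows per group."""
    piles = defaultdict(list)
    for letter, score, vowel in rows:
        piles[vowel].append(score)
    return {key: collect(values) for key, values in sorted(piles.items())}

counts = group_sketch(tiles)            # rows per group
score_sums = group_sketch(tiles, sum)   # total score per group
score_maxes = group_sketch(tiles, max)  # largest score per group
print(counts, score_sums, score_maxes)
```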

### Group multiple columns

In [None]:
small_scrabble = scrabble_tiles.sample(10)
small_scrabble = small_scrabble.with_columns('Used', 
                                             make_array('Yes', 'Yes', 'Yes', 'No', 'No', 
                                                        'No', 'No', 'No', 'No', 'No'))
small_scrabble

**Q**: How many vowels do I have left that I have not used? 

In [None]:
small_scrabble.group(make_array('Vowel', 'Used'))

**Q:** What are the total scores of the non-vowels I have used and of those I have not used?

In [None]:
small_scrabble.group(make_array('Vowel', 'Used'), sum)
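
Grouping by two columns works the same way as grouping by one, except that each group is identified by a *pair* of values. A minimal sketch with a hypothetical rack of six tiles (the labels and rows are made up for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical rows: (letter, score, vowel flag, used flag).
rack = [("A", 1, "Yes", "Yes"), ("E", 1, "Yes", "No"), ("I", 1, "Yes", "No"),
        ("T", 1, "No", "Yes"), ("K", 5, "No", "No"), ("D", 2, "No", "No")]

# Count the rows for each (vowel, used) combination.
pair_counts = Counter((vowel, used) for _, _, vowel, used in rack)

# Sum the scores for each (vowel, used) combination.
pair_score_sums = defaultdict(int)
for _, score, vowel, used in rack:
    pair_score_sums[(vowel, used)] += score

print(pair_counts[("Yes", "No")])      # unused vowels
print(pair_score_sums[("No", "Yes")])  # total score of used non-vowels
print(pair_score_sums[("No", "No")])   # total score of unused non-vowels
```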

## 4. Groups with heights

In [None]:
heights_original.show(3)

**Q:** How many children does each family have? 

In [None]:
by_family = heights_original.group('family')
by_family.show(5)

Let's relabel based on what we know about this particular dataset (each row is a child).

In [None]:
by_family = by_family.relabeled("count", "number of children")

In [None]:
by_family.hist("number of children", bins=15)

**Q:** Per family, what is the average height of the children? 

In [None]:
by_family = heights_original.select('family', 'childHeight').group('family', np.mean)
by_family.show(5)
by_family.hist('childHeight mean')
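
The same pile-then-reduce idea answers both questions in this section: the size of each pile is the number of children, and the mean of each pile is the family's average child height. A small sketch on made-up rows (the family labels and heights are hypothetical, not read from `galton.csv`):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (family label, child height in inches).
children = [("1", 73.2), ("1", 69.2), ("2", 73.5), ("2", 72.5), ("2", 65.5)]

# Pile up the child heights by family.
heights_by_family = defaultdict(list)
for family, child_height in children:
    heights_by_family[family].append(child_height)

# Reduce each pile two ways: a count and a mean.
number_of_children = {fam: len(hs) for fam, hs in heights_by_family.items()}
mean_child_height = {fam: mean(hs) for fam, hs in heights_by_family.items()}
print(number_of_children)
print(mean_child_height)
```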