<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 10: Groups

Associated Textbook Sections: [8.2, 8.3](https://inferentialthinking.com/chapters/08/2/Classifying_by_One_Variable.html)

## Overview

* [Francis Galton](#Francis-Galton)
* [Prediction](#Prediction)
* [Prediction Accuracy](#Prediction-Accuracy)
* [Grouping](#Grouping)
* [Lists](#Lists)
* [Pivot Tables](#Pivot-Tables)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

---

## Francis Galton

* 1822 - 1911 (knighted in 1909)
* Charles Darwin's half-cousin
* Developed systems for making predictions in several fields
* An advocate for eugenics and scientific racism

<img src="./img/galton.jpg" width="20%">

If you are not familiar with the concept of eugenics or scientific racism, you might consider watching the following TED-Ed video _The movement that inspired the Holocaust - Alexandra Minna Stern and Natalie Lira_. 

**Keep in mind that the content in this video references forced reproductive sterilization and the Holocaust.**

In [None]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/6zCpRVP1DgQ?rel=0&amp;controls=0&amp;showinfo=0", width="560", height="315")

---

## Prediction

### Demo: Prediction

Load Galton's data and visualize the relationship between `midparentHeight` and `childHeight`.

In [None]:
galton = Table.read_table('./data/galton.csv')
galton

In [None]:
...

Identify points that are within 0.5 of the `midparentHeight` value of 68.

In [None]:
...
plots.plot([67.5, 67.5], [50, 85], color='red', lw=2)
plots.plot([68.5, 68.5], [50, 85], color='red', lw=2);

In [None]:
nearby = ...
nearby_mean = ...
nearby_mean

In [None]:
...
plots.plot([67.5, 67.5], [50, 85], color='red', lw=2)
plots.plot([68.5, 68.5], [50, 85], color='red', lw=2)
plots.scatter(68, nearby_mean, color='red', s=50);

Create a function that predicts a child's height as the average `childHeight` of all points whose `midparentHeight` is within 0.5 of the given `midparentHeight` value.

In [None]:
def predict(h):
    ...

In [None]:
predict(68)

In [None]:
predict(70)

In [None]:
predict(73)
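
The nearby-mean idea can be sketched in plain Python on a handful of made-up points. This is only an illustration, not the demo solution: `toy_pairs`, `predict_from_pairs`, and the `window` parameter are hypothetical names, the values are not Galton's data, and treating the window boundaries as inclusive is an assumption.

```python
# Toy (mid-parent height, child height) pairs; made-up values, not Galton's data.
toy_pairs = [
    (67.6, 66.0),
    (68.0, 67.0),
    (68.4, 68.0),
    (70.0, 69.0),
    (70.3, 71.0),
    (72.9, 72.0),
]

def predict_from_pairs(h, pairs, window=0.5):
    """Average the child heights whose mid-parent height is within `window` of h."""
    # Keep only the child heights of the nearby points (boundaries included).
    nearby = [child for parent, child in pairs if abs(parent - h) <= window]
    if not nearby:
        return None  # no nearby points, so this sketch makes no prediction
    return sum(nearby) / len(nearby)

prediction_68 = predict_from_pairs(68, toy_pairs)  # averages 66.0, 67.0, 68.0
prediction_70 = predict_from_pairs(70, toy_pairs)  # averages 69.0, 71.0
prediction_60 = predict_from_pairs(60, toy_pairs)  # no points near 60
print(prediction_68, prediction_70, prediction_60)
```

Returning `None` when no points fall in the window is one possible design choice; it makes gaps in the data visible instead of hiding them.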

Apply the function to the `galton` table.

In [None]:
predicted_heights = ...
predicted_heights

In [None]:
galton = ...

Visualize the predictions.

In [None]:
...

---

## Prediction Accuracy

### Demo: Prediction Accuracy

Define a function to compute the difference `x - y` between two values `x` and `y`.

In [None]:
def difference(x, y):
    ...

Apply the function to the `galton` table to measure the difference between the `predictedHeight` and `childHeight` values. Add the results to the table.

In [None]:
pred_errs = ...
pred_errs

In [None]:
galton = ...
galton
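
As a rough illustration of what the errors represent, the sketch below subtracts actual values from predicted values for three made-up cases. The names `signed_difference`, `toy_predicted`, and `toy_actual` are hypothetical, and the numbers are not taken from the `galton` table.

```python
def signed_difference(x, y):
    """Return x - y; positive when x is larger than y."""
    return x - y

# Made-up predicted and actual heights, paired by position.
toy_predicted = [67.0, 70.0, 66.5]
toy_actual = [66.0, 71.5, 66.5]

# One error per pair: predicted minus actual.
toy_errors = [signed_difference(p, a) for p, a in zip(toy_predicted, toy_actual)]
print(toy_errors)
```

With this sign convention, a positive error means the prediction was too high and a negative error means it was too low.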

Visualize the errors in prediction.

In [None]:
...

In [None]:
...

---

## Grouping

### Grouping by One Column

The `group` method aggregates all rows with the same value for a column into a single row in the resulting table.
* First argument: Which column to group by
* Second argument: (Optional) How to combine values
    * `len` — number of grouped values (default)
    * `list` — list of all grouped values
    * `sum`  — total of all grouped values
    * ...
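
To make the aggregation concrete, here is a minimal plain-Python sketch of grouping by one column. It mimics the idea only; it is not how the `datascience` library implements `group`. The helper `group_rows` and the toy rows are hypothetical, and the values are not the contents of `cones.csv`.

```python
from collections import defaultdict

# Toy rows of (flavor, price); hypothetical values.
toy_rows = [
    ("strawberry", 3.5),
    ("chocolate", 4.0),
    ("chocolate", 5.0),
    ("strawberry", 5.5),
    ("chocolate", 6.0),
]

def group_rows(rows, collect=len):
    """Collect the values that share a key, then combine each collection."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # One entry per distinct key, in sorted key order.
    return {key: collect(values) for key, values in sorted(groups.items())}

counts = group_rows(toy_rows)        # number of grouped values
lists = group_rows(toy_rows, list)   # list of all grouped values
totals = group_rows(toy_rows, sum)   # total of all grouped values
print(counts)
print(lists)
print(totals)
```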


In [None]:
from IPython.display import IFrame
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vT5DQDrDs21XnYnUD1000G97wukT1oj9N\
_ePPTdmGTp2vPh88jW_JCLcoK2yaWmmLjKjXelJDnT4m-J/embed?start=false&loop=false&delayms=3000', 800, 600)

### Demo: Grouping by One Column

Load the `cones.csv` data and explore the `group` method.

In [None]:
cones = Table.read_table('./data/cones.csv')
cones

In [None]:
...

In [None]:
...

In [None]:
...

### Demo: Grouping By One Column: Welcome Survey

Explore the `group` method using the `welcome_survey_sp23.csv` data. Include some visualizations based on the grouped data.

In [None]:
survey = Table.read_table('./data/welcome_survey_sp23.csv')
survey.show(3)

In [None]:
tutoring_reaction_sequence = survey.select('Tutoring Format Preference', 'Reaction Time (ms)', 'Sequence Length')

tutoring_reaction_sequence = (tutoring_reaction_sequence.where('Tutoring Format Preference', are.not_equal_to('nan'))
                              .where('Reaction Time (ms)', are.above_or_equal_to(0))
                              .where('Sequence Length', are.above_or_equal_to(0)))

...

In [None]:
prep_programming_statistics = survey.select('MATH 108 Prep Feeling',
                                            'Programming Language Experience',
                                            'Statistics Experience')

prep_programming_statistics = (prep_programming_statistics
                               .where('Programming Language Experience', are.above_or_equal_to(0))
                               .where('Statistics Experience', are.above_or_equal_to(0)))

...

In [None]:
...

---

## Lists

### Lists are Generic Sequences

* A list is a sequence of values (just like an array), but the values can all have different types:
> `[2+3, 'four', Table()]`
* Lists can be used to create table rows.
* If you create a table column from a list, it will be converted to an array automatically.
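
A small sketch of a list holding values of different types follows. The third element is a float rather than `Table()` so that the example runs without the `datascience` library; the variable names are illustrative only.

```python
# A list can mix types: an int (from 2 + 3), a string, and a float.
mixed = [2 + 3, 'four', 6.0]

# Record the type name of each element.
type_names = [type(value).__name__ for value in mixed]
print(mixed)
print(type_names)

# Lists can also contain other lists.
mixed.append([7, 8])
print(len(mixed))
```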

### Demo: Lists

Demonstrate how to make a list.

In [None]:
...

In [None]:
...

### Grouping by Two Columns

The `group` method can also aggregate all rows that share the same combination of values in multiple columns.
* First argument: A list of which columns to group by
* Second argument: (Optional) How to combine values
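
Grouping by two columns can be pictured as grouping by pairs of values. The plain-Python sketch below uses hypothetical (flavor, color, price) rows, not the survey data, and stands in for the idea rather than the library's implementation.

```python
from collections import Counter, defaultdict

# Toy rows of (flavor, color, price); hypothetical values.
toy_rows = [
    ("strawberry", "pink", 3.5),
    ("chocolate", "light brown", 4.0),
    ("chocolate", "dark brown", 5.0),
    ("strawberry", "pink", 5.5),
    ("chocolate", "dark brown", 6.0),
]

# Count rows for each (flavor, color) combination.
pair_counts = Counter((flavor, color) for flavor, color, _ in toy_rows)

# Total the prices for each (flavor, color) combination.
pair_totals = defaultdict(float)
for flavor, color, price in toy_rows:
    pair_totals[(flavor, color)] += price

print(dict(pair_counts))
print(dict(pair_totals))
```

Only combinations that actually occur in the rows appear in the result; for example, there is no entry for `("strawberry", "dark brown")`.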

### Demo: Grouping by Two Columns

Group the `welcome_survey_sp23.csv` data by more than one column.

In [None]:
section_python_age = ...
section_python_age

In [None]:
...

---

## Pivot Tables

### Pivot

* Cross-classifies according to two categorical variables
* Produces a grid of counts or aggregated values
* Two required arguments:
    * First: variable that forms column labels of grid
    * Second: variable that forms row labels of grid
* Two optional arguments (include both or neither)
    * `values='column_label_to_aggregate'`
    * `collect=function_to_aggregate_with`
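
The grid that a pivot produces can be sketched in plain Python as a dictionary of rows, each holding one cell per column label. The helper `pivot_rows` and the toy rows are hypothetical, and filling empty cells with 0 is an assumption of this sketch.

```python
# Toy rows of (flavor, color, price); hypothetical values.
toy_rows = [
    ("strawberry", "pink", 3.5),
    ("chocolate", "light brown", 4.0),
    ("chocolate", "dark brown", 5.0),
    ("strawberry", "pink", 5.5),
    ("chocolate", "dark brown", 6.0),
]

def pivot_rows(rows, collect=len):
    """Build a grid: one row per flavor, one column per color."""
    cells = {}
    for flavor, color, price in rows:
        cells.setdefault((flavor, color), []).append(price)
    row_labels = sorted({flavor for flavor, _, _ in rows})
    col_labels = sorted({color for _, color, _ in rows})
    # Combine the prices in each cell; combinations with no rows get 0.
    return {
        r: {c: collect(cells[(r, c)]) if (r, c) in cells else 0 for c in col_labels}
        for r in row_labels
    }

count_grid = pivot_rows(toy_rows)       # grid of counts
total_grid = pivot_rows(toy_rows, sum)  # grid of total prices
print(count_grid)
print(total_grid)
```

Unlike grouping by two columns, every combination of row label and column label gets a cell, including combinations that never occur in the data.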

### Demo: Pivot Tables

Create pivot tables using the survey data.

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

### Table Function Visualizer

Check out UC Berkeley's [Table Function Visualizer](https://www.data8.org/interactive_table_functions/) to better understand the `group` and `pivot` table methods.

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>