In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')



### Table of Contents

1. <a href='#section 1'>Ice Cream Flavors</a>

2. <a href='#section 2'>NBA Salaries</a>

3. <a href='#section 3'>Extra Practice</a>


## 1. Ice Cream Flavors <a id='section 1'></a>

Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. Let's start with a table of ice cream flavors.

In [2]:
cones = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'),
    'Price', make_array(3.55, 4.75, 6.55, 5.25, 5.25)
)
cones # RUN THIS CELL

Flavor,Price
strawberry,3.55
chocolate,4.75
chocolate,6.55
strawberry,5.25
chocolate,5.25


What if we wanted to count the total number of each flavor?
The group method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.

In [3]:
cones.group('Flavor') # RUN THIS CELL

Flavor,count
chocolate,3
strawberry,2
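
To make the counting step concrete, here is a minimal plain-Python sketch of what a one-argument group computes for the Flavor column. It uses the standard library's Counter as a stand-in and is not how the datascience library implements group; the variable names are only illustrative.

```python
from collections import Counter

# The Flavor column from the cones table above.
flavors = ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate']

# Count how many rows fall into each category.
counts = Counter(flavors)

# Print one line per unique flavor, in alphabetical order.
for flavor in sorted(counts):
    print(flavor, counts[flavor])
```

Each unique flavor appears exactly once in the result, which is why the grouped table has one row per category.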


But what if we wanted the total price of the cones of each different flavor? That's where the second argument of group comes in.

The optional second argument of group names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.

In [4]:
cones.group('Flavor', sum)

Flavor,Price sum
chocolate,16.55
strawberry,8.8


To create this new table, group has calculated the sum of the Price entries in all the rows corresponding to each distinct flavor. The prices in the three chocolate rows add up to 16.55. The prices in the two strawberry rows have a total of 8.80.

You can replace sum with any other function that works on arrays.
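
The idea of swapping in a different aggregating function can be sketched in plain Python. The helper group_prices below is hypothetical and written only for illustration; it is not part of the datascience library. It gathers the prices for each flavor and then applies whatever function it is given.

```python
from collections import defaultdict

# The two columns of the cones table above.
flavors = ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate']
prices = [3.55, 4.75, 6.55, 5.25, 5.25]

def group_prices(flavors, prices, collect):
    """Hypothetical helper: apply `collect` to the list of prices for each flavor."""
    by_flavor = defaultdict(list)
    for flavor, price in zip(flavors, prices):
        by_flavor[flavor].append(price)
    # One entry per unique flavor, in alphabetical order.
    return {flavor: collect(by_flavor[flavor]) for flavor in sorted(by_flavor)}

totals = group_prices(flavors, prices, sum)   # total price per flavor
sizes = group_prices(flavors, prices, len)    # number of cones per flavor

print(totals)
print(sizes)
```

Only the last argument changes between the two calls. Any function that takes a collection of values and returns a single value can be passed the same way.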

<div class="alert alert-warning">
<b> Question 1</b>: Use grouping to find the highest price in each flavor category.

In [None]:
... ## YOUR CODE HERE

In [5]:
# ANSWER KEY
cones.group('Flavor', max)

Flavor,Price max
chocolate,6.55
strawberry,5.25


<div class="alert alert-warning">
<b> Question 2</b>: Use grouping to find the lowest price in each flavor category.

In [None]:
... ## YOUR CODE HERE 

In [6]:
# ANSWER KEY
cones.group('Flavor', min)

Flavor,Price min
chocolate,4.75
strawberry,3.55


## 2. NBA Salaries <a id='section 2'></a>

Let's look at some data about NBA players. Salaries are measured in millions of dollars.

In [7]:
# Load the salary data, then give the salary column a shorter label.
nba1 = Table.read_table('nba_salaries.csv')
nba = nba1.relabeled('2015-2016 SALARY', 'SALARY')
nba

PLAYER,POSITION,TEAM,SALARY
Paul Millsap,PF,Atlanta Hawks,18.6717
Al Horford,C,Atlanta Hawks,12.0
Tiago Splitter,C,Atlanta Hawks,9.75625
Jeff Teague,PG,Atlanta Hawks,8.0
Kyle Korver,SG,Atlanta Hawks,5.74648
Thabo Sefolosha,SF,Atlanta Hawks,4.0
Mike Scott,PF,Atlanta Hawks,3.33333
Kent Bazemore,SF,Atlanta Hawks,2.0
Dennis Schroder,PG,Atlanta Hawks,1.7634
Tim Hardaway Jr.,SG,Atlanta Hawks,1.30452


<div class="alert alert-warning">
<b> Question 1: </b> How much money did each team pay for its players' salaries?
    Hint: Think about which columns you need, then which function you want to group with.

In [None]:
## YOUR CODE HERE
teams_and_money = nba.select(..., ...)
teams_and_money.group(..., ...)

In [8]:
# ANSWER KEY
teams_and_money = nba.select('TEAM', 'SALARY')
teams_and_money.group('TEAM', sum)

TEAM,SALARY sum
Atlanta Hawks,69.5731
Boston Celtics,50.2855
Brooklyn Nets,57.307
Charlotte Hornets,84.1024
Chicago Bulls,78.8209
Cleveland Cavaliers,102.312
Dallas Mavericks,65.7626
Denver Nuggets,62.4294
Detroit Pistons,42.2118
Golden State Warriors,94.0851


<div class="alert alert-warning">
<b> Question 2: </b> How many NBA players were there in each of the five positions?

In [None]:
... ## YOUR CODE HERE

In [9]:
## ANSWER KEY
nba.group('POSITION')

POSITION,count
C,69
PF,85
PG,85
SF,82
SG,96


<div class="alert alert-warning">
<b> Question 3: </b> What was the average salary of the players at each of the five positions?

In [None]:
positions_and_money = nba.select(..., ...)
positions_and_money.group(..., ...)

In [10]:
## ANSWER KEY
positions_and_money = nba.select('POSITION', 'SALARY')
positions_and_money.group('POSITION', np.mean)

POSITION,SALARY mean
C,6.08291
PF,4.95134
PG,5.16549
SF,5.53267
SG,3.9882


If we had not selected the two columns as our first step, group would still not attempt to "average" the categorical columns in nba. (It is impossible to average two strings like "Atlanta Hawks" and "Boston Celtics".) It performs arithmetic only on numerical columns and leaves the rest blank.

Run the cell below to see what happens in this case.

In [11]:
nba.group('POSITION', np.mean) ## RUN THIS CELL

POSITION,PLAYER mean,TEAM mean,SALARY mean
C,,,6.08291
PF,,,4.95134
PG,,,5.16549
SF,,,5.53267
SG,,,3.9882
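
The blank cells can be mimicked with a small plain-Python sketch. It assumes a hypothetical helper, mean_or_blank, that averages a column only when every value is a number; this illustrates the visible behavior and is not the datascience library's actual implementation. The four rows are copied from the nba preview above, so the averages here cover only those rows and will not match the full-table output.

```python
from collections import defaultdict
from numbers import Number

# Four rows from the nba preview above: (PLAYER, POSITION, TEAM, SALARY).
rows = [
    ('Al Horford', 'C', 'Atlanta Hawks', 12.0),
    ('Tiago Splitter', 'C', 'Atlanta Hawks', 9.75625),
    ('Jeff Teague', 'PG', 'Atlanta Hawks', 8.0),
    ('Dennis Schroder', 'PG', 'Atlanta Hawks', 1.7634),
]

def mean_or_blank(values):
    """Hypothetical helper: average numeric values, return '' for anything else."""
    if all(isinstance(v, Number) for v in values):
        return sum(values) / len(values)
    return ''

# Collect the rows that belong to each position.
by_position = defaultdict(list)
for row in rows:
    by_position[row[1]].append(row)

# Build one output row per position:
# (POSITION, PLAYER mean, TEAM mean, SALARY mean).
summary = []
for position in sorted(by_position):
    group_rows = by_position[position]
    players = [r[0] for r in group_rows]
    teams = [r[2] for r in group_rows]
    salaries = [r[3] for r in group_rows]
    summary.append((position,
                    mean_or_blank(players),
                    mean_or_blank(teams),
                    mean_or_blank(salaries)))

for line in summary:
    print(line)
```

The PLAYER and TEAM entries come back as empty strings because their values are text, while SALARY is averaged as usual.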


## 3. Extra Practice <a id='section 3'></a>

<div class="alert alert-warning">
<b> Question 1: </b> Find the highest player salary for each team, then sort the teams from highest to lowest by that salary.
    Hint: Think about which columns you need, what you would group by, and finally what you would sort by.

In [None]:
## YOUR CODE HERE
...

In [12]:
## ANSWER KEY
teams_and_money = nba.select('TEAM', 'SALARY')
teams_and_money.group('TEAM', max).sort('SALARY max', descending = True)
# The Los Angeles Lakers have the highest top salary.

TEAM,SALARY max
Los Angeles Lakers,25.0
Brooklyn Nets,24.8949
Cleveland Cavaliers,22.9705
New York Knicks,22.875
Houston Rockets,22.3594
Miami Heat,22.1927
Los Angeles Clippers,21.4687
Oklahoma City Thunder,20.1586
Chicago Bulls,20.0931
San Antonio Spurs,19.689
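
The same group-then-sort pattern can be traced by hand in plain Python. Because only the ten Atlanta Hawks rows from the nba preview are shown in this notebook, the sketch below groups those rows by POSITION rather than by TEAM; it illustrates the two steps and is not a substitute for the full-table answer.

```python
# (POSITION, SALARY) pairs for the ten rows shown in the nba preview.
rows = [
    ('PF', 18.6717), ('C', 12.0), ('C', 9.75625), ('PG', 8.0),
    ('SG', 5.74648), ('SF', 4.0), ('PF', 3.33333), ('SF', 2.0),
    ('PG', 1.7634), ('SG', 1.30452),
]

# Step 1: group by position, keeping the highest salary seen so far.
highest = {}
for position, salary in rows:
    highest[position] = max(salary, highest.get(position, salary))

# Step 2: sort the groups from highest to lowest top salary.
ranked = sorted(highest.items(), key=lambda item: item[1], reverse=True)

for position, salary in ranked:
    print(position, salary)
```

Grouping happens first so that there is exactly one value per category to sort by.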


---

## Bibliography

- Data 8 Textbook - Classifying by One Variable: https://www.inferentialthinking.com/chapters/08/2/Classifying_by_One_Variable.html

---
Notebook developed by: 

Data Science Modules: http://data.berkeley.edu/education/modules
