# Lab 4: Functions and Visualizations

Welcome to lab 4! This week, we'll learn about functions and the table method `apply` from [Section 7.1](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/07/1/applying-a-function-to-a-column.html).  We'll also learn about visualization using histograms from [Chapter 6](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/06/visualization.html).

First, set up the tests and imports by running the cell below.

In [None]:
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

from test import *

## 1. Functions and CEO Incomes

Let's start with a real data analysis task.  We'll look at the 2015 compensation of CEOs at the 100 largest companies in California.  The data were compiled for a Los Angeles Times analysis [here](http://spreadsheets.latimes.com/california-ceo-compensation/), and ultimately came from [filings](https://www.sec.gov/answers/proxyhtf.htm) mandated by the SEC from all publicly-traded companies.  Two companies have two CEOs, so there are 102 CEOs in the dataset.

We've copied the data in raw form from the LA Times page into a file called `raw_compensation.csv`.  The page notes that all dollar amounts are in millions of dollars. Run the following cell to view the table.

In [None]:
raw_compensation = Table.read_table('raw_compensation.csv')
raw_compensation

**Question 1.** We want to compute the average of the CEOs' pay. Try uncommenting (i.e., remove the `#`) and running the cell below.

In [None]:
# np.average(raw_compensation.column("Total Pay"))

After you uncomment and run the cell, you should see an error. Let's examine why this error occurred by looking at the values in the "Total Pay" column. Use the `type` function and set `total_pay_type` to the type of the first value in the "Total Pay" column. Mark the answer choice corresponding to the type of the entries in the "Total Pay" column.

1. str
2. float
3. list

In [None]:
total_pay_type = ...
total_pay_type

In [None]:
check1_1(total_pay_type)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 2.** You should have found that the values in "Total Pay" column are strings&mdash;that is, text&mdash;and have type `str`. Averaging is an operation that doesn't make sense for text, so we need to convert the values to numbers. Extract the first value in the "Total Pay" column.  It's Mark Hurd's pay in 2015, in millions of dollars.  Call it `mark_hurd_pay_string`.

In [None]:
mark_hurd_pay_string = ...
mark_hurd_pay_string

In [None]:
check1_2(mark_hurd_pay_string)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 3.** Convert `mark_hurd_pay_string` to a number of *dollars*.  The string method `strip` will be useful for removing the dollar sign; it removes a specified character from the start or end of a string.  For example, the value of `"100%".strip("%")` is the string `"100"`.  You'll also need the function `float`, which converts a string that looks like a number to an actual number.  Last, remember that the answer should be in dollars, not millions of dollars.
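
As a minimal sketch of how `strip` and `float` fit together, here is the same idea applied to a percentage string instead of a pay string. The name `percent_string` and its value are made up for illustration and are not part of the lab's data.

```python
# Hypothetical example value; not taken from the lab's dataset.
percent_string = "100%"

# strip returns a new string with the given character removed from both ends.
digits_only = percent_string.strip("%")

# float turns a string that looks like a number into an actual number.
as_number = float(digits_only)

# Once we have a number, ordinary arithmetic works.
as_proportion = as_number / 100

print(digits_only, as_number, as_proportion)
```

Notice that the conversion and the change of units are two separate steps; you can combine them into one expression once each piece works.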

In [None]:
mark_hurd_pay = ...
mark_hurd_pay

In [None]:
check1_3(mark_hurd_pay)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


To compute the average pay, we need to do this for every CEO.  But that looks like it would involve copying this code 102 times.

This is where functions come in.  First, we'll define a new function, giving a name to the expression that converts "total pay" strings to numeric values.  Later in this lab we'll see the payoff: we can call that function on every pay string in the dataset at once.

**Question 4.** Copy the expression you used to compute `mark_hurd_pay` as the `return` expression of the function below, but replace the specific `mark_hurd_pay_string` with the generic `pay_string` name specified in the first line of the `def` statement.

In [None]:
def convert_pay_string_to_number(pay_string):
    """Converts a pay string like '$100' (in millions) to a number of dollars."""
    return ...



In [None]:
check1_4(convert_pay_string_to_number)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


Running that cell doesn't convert any particular pay string. Instead, it creates a function called `convert_pay_string_to_number` that can convert any string with the right format to a number of dollars.

We can call our function just like we call the built-in functions we've seen. It takes one argument, a string, and it returns a number.

In [None]:
convert_pay_string_to_number('$42')

In [None]:
convert_pay_string_to_number(mark_hurd_pay_string)

In [None]:
# We can also compute Safra Catz's pay in the same way:
convert_pay_string_to_number(raw_compensation.where("Name", are.containing("Safra")).column("Total Pay").item(0))

So, what have we gained by defining the `convert_pay_string_to_number` function? 
Well, without it, we'd have to copy code (that strips off the $ sign, converts to a float, and multiplies by 1 million) each time we wanted to convert a pay string.  Now we just call a function whose name says exactly what it's doing.

Soon, we'll see how to apply this function to every pay string in a single expression. First, let's take a brief detour and introduce `interact`.

### Using `interact`

We've imported a nifty function called `interact` that lets you
call a function with different arguments through an interactive widget.

To use it, call `interact` with the function you want to interact with as the
first argument, then specify a default value for each argument of the original
function like so:

In [None]:
_ = interact(convert_pay_string_to_number, pay_string='$42')

You can now change the value in the textbox to automatically call
`convert_pay_string_to_number` with the argument you enter in the `pay_string`
textbox. For example, entering in `'$49'` in the textbox will display the result of
running `convert_pay_string_to_number('$49')`. Neat!

Note that we'll never ask you to write `interact` calls yourself as
part of a question. However, we'll include them here and there where they're helpful,
and you'll probably find them useful in your own work.

Now, let's continue on and write more functions.

## 2. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign.)

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Let's call our function `to_percentage`.
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  `to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function.  Each line of the body of the function must be indented; we usually use the TAB key to indent the right number of spaces.  In the body, we're allowed to write any code we could write outside of a function.  First let's give a name to the number we multiply a proportion by to get a percentage.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

##### `return`
The special instruction `return` in a function's body tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor

**Question 1.** Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.

In [None]:
# Finish the definition of to_percentage:
def to_percentage(proportion):
    """..."""
    ...
    return ...

# Call to_percentage below.
twenty_percent = ...
twenty_percent



In [None]:
check2_1(to_percentage)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


You can use variables as arguments to functions you define yourself, just like you can use them as arguments to built-in functions.

**Question 2.** Use `to_percentage` again to convert the value named `a_proportion` (defined below) to a percentage called `a_percentage`.

*Note:* You don't need to define `to_percentage` again in this cell; just make sure you've run the cell above in which you previously defined it.

In [None]:
a_proportion = 2**(.5) / 2
a_percentage = ...
a_percentage



In [None]:
check2_2(a_percentage)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


Here's something important about functions: the names assigned within a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even though you defined `factor = 100` inside `to_percentage` above and then called `to_percentage`, you cannot refer to `factor` anywhere except inside the body of `to_percentage`:

In [None]:
# Uncomment the line below and run it; you should receive a NameError
# factor
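
If you'd like to see the same scoping rule without triggering an error in your notebook, here is a small sketch. The function `to_fraction` and the name `divisor` are hypothetical and only mirror the structure of `to_percentage`; the `try`/`except` block is just a way to catch the `NameError` so the cell keeps running.

```python
def to_fraction(percentage):
    """Converts a percentage to a proportion."""
    divisor = 100  # this name exists only while the function body runs
    return percentage / divisor

half = to_fraction(50)

# Outside the function, the name divisor was never defined,
# so referring to it raises a NameError.
try:
    divisor
    divisor_is_visible = True
except NameError:
    divisor_is_visible = False

print(half, divisor_is_visible)
```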

As we've seen with the built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

**Question 3.** Define a function called `disemvowel`.  It should take a single string as its argument.  (You can call that argument whatever you want.)  It should return a copy of that string, but with all the characters that are vowels removed.  (In English, the vowels are the characters "a", "e", "i", "o", and "u". For this question, let's assume "y" is never a vowel.)

*Hint:* To remove all the "a"s from a string named `that_string`, you can use `that_string.replace('a', '')`, which will return a new string and leave `that_string` unchanged.  You can call `replace` multiple times.
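
To illustrate the hint on a different problem, here is a sketch that chains `replace` calls to strip punctuation out of a phone number. The phone number is a made-up value; the point is only that each `replace` returns a new string, so the calls can be chained, and the original string is left alone.

```python
# Hypothetical example string; not part of the lab.
phone = "(555) 867-5309"

# Each call to replace returns a new string, so the calls can be chained.
digits = phone.replace("(", "").replace(")", "").replace(" ", "").replace("-", "")

print(digits)
print(phone)  # the original string is unchanged
```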

In [None]:
def disemvowel(a_string):
    """Remove the vowels from a_string."""
    ...
    

# An example call to your function.  (It's often helpful to run
# an example call from time to time while you're writing a function,
# to see how it currently works.)
disemvowel("Can you read this without vowels?")

In [None]:
# Alternatively, you can use interact to call your function
_ = interact(disemvowel, a_string='Hello world')

In [None]:
check2_3(disemvowel)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


##### Calls on calls on calls
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, then defining a function in terms of other functions is like having a recipe for cake telling you to follow (i) another recipe to make the frosting, and (ii) another recipe to make a filling.  Referring to other recipes makes the cake recipe itself shorter and clearer.

For example, suppose you want to count the number of characters that *are not vowels* in a piece of text.  One way to do that is to remove all the vowels and count the size of the remaining string.
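
Here is a sketch of one function building on another, using spaces instead of vowels. Both function names are hypothetical and chosen only for this illustration.

```python
def remove_spaces(text):
    """Returns a copy of text with every space removed."""
    return text.replace(" ", "")

def num_non_space_characters(text):
    """The number of characters in text that are not spaces."""
    # Reuse remove_spaces rather than repeating its code here.
    return len(remove_spaces(text))

count = num_non_space_characters("a b c")
print(count)
```

The second function stays short because the first one already does the removal, just like the cake recipe that refers to the frosting recipe.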

**Question 4.** Write a function called `num_non_vowels`.  It should take a string as its argument and return a number.  The number should be the number of characters in the argument string that are not vowels.

*Hint:* The function `len` takes a string as its argument and returns the number of characters in it.

In [None]:
def num_non_vowels(a_string):
    """The number of characters in a_string that are not vowels."""
    ...

    
# Try calling your function yourself to make sure the output is what
# you expect. You can also use the interact function if you'd like.

In [None]:
check2_4(num_non_vowels)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


##### Abstraction

The `movies_by_year` dataset in the textbook has information about movie sales in recent years.  Suppose you'd like to compute the year with the highest-ranking total gross movie sales.  You might do this:

In [None]:
movies_by_year = Table.read_table("movies_by_year.csv")
rank = 1
year = movies_by_year.sort("Total Gross", descending=True).column("Year").item(rank-1)
year

Suppose that, after writing the code above, you realize you also want to know the 5th and 10th-highest years.  Instead of copying your code, you decide to put it in a function.  Since the rank varies, you make that an argument to your function.  (Designing functions like this is known as *functional abstraction* in programming, because the function extracts or removes some details, in this case rank, from the computation and makes them function arguments.)
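
The same move from a hard-coded value to an argument can be sketched with a plain Python list. The list `scores` and the function `kth_highest_score` are hypothetical and stand in for the table and the function you'll write.

```python
# Hypothetical data; not from movies_by_year.
scores = [72, 95, 88, 60, 91]

# Hard-coded version: the rank is baked into the expression.
rank = 2
second_highest = sorted(scores, reverse=True)[rank - 1]

# Abstracted version: the rank becomes an argument.
def kth_highest_score(k):
    """Returns the k-th highest value in scores (k=1 is the highest)."""
    return sorted(scores, reverse=True)[k - 1]

print(second_highest, kth_highest_score(1), kth_highest_score(5))
```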

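**Question 5.** Write a function called `highest_ranking_year`.  It should take a single argument, the rank of the year (e.g., an integer such as 1, 5, or 10), and return the year with that rank when years are ordered by total gross movie sales, highest first.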

In [None]:
def highest_ranking_year(k):
    ...


# Example call to your function:
highest_ranking_year(10)

In [None]:
# interact also allows you to pass in an array for a function argument. It will
# then present a dropdown menu of options.
_ = interact(highest_ranking_year, k=np.arange(1, 11))

In [None]:
check2_5(highest_ranking_year)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


## 3. Applying functions

Defining a function is a lot like giving a name to a value with `=`.  In fact, a function is a value just like the integer `1` or the string `"the"`!

For example, we can make a new name for the built-in function `max` if we want:

In [None]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around:

In [None]:
max(2, 6)

Try just writing `max` or `our_name_for_max` (or the name of any other function) in a cell, and run that cell.  Python will print out a (very brief) description of the function.

In [None]:
max

Why is this useful?  Since functions are just values, it's possible to pass them as arguments to other functions. In fact, we've been doing that all along in the lab when we pass functions we define to `check` functions. 
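
As a small sketch of passing a function as an argument, consider the hypothetical helper `apply_twice` below. It receives a function `f` and calls it for us. Neither `apply_twice` nor `add_three` is part of the lab.

```python
def apply_twice(f, x):
    """Calls f on x, then calls f again on the result."""
    return f(f(x))

def add_three(n):
    """Adds 3 to n."""
    return n + 3

# We pass the name add_three, without parentheses; apply_twice does the calling.
result = apply_twice(add_three, 10)

# Built-in functions can be passed the same way.
length_of_length = apply_twice(len, ["a", "b", "c"]) if False else len(str(len("abc")))

print(result, length_of_length)
```

The important detail is that `add_three` appears without parentheses: writing `add_three(10)` would call it right away and pass a number instead of a function.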

Here's another simple (but not-so-useful) example of how functions are just values: we can make an array of functions.

In [None]:
make_array(max, np.average, are.equal_to)

**Question 1.** Make an array containing any 4 other functions you've seen.  Call it `some_functions`.

In [None]:
some_functions = ...
some_functions

In [None]:
check3_1(some_functions)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


Working with functions as values can lead to some funny-looking code.  For example, see if you can figure out why this works:

In [None]:
make_array(max, np.average, are.equal_to, np.sum).item(0)(4, -2, 7)

##### The `apply` method

The table method `apply` calls a function many times, once on *each* element in a column of a table.  It produces an array of the results.  Here we use `apply` to convert every CEO's pay to a number, using the function you defined.  (Note that Python might choose to display the resulting array values using scientific notation, and that's okay.)

In [None]:
raw_compensation.apply(convert_pay_string_to_number, "Total Pay")

Here's an illustration of what that did:

<img src="apply.png"/>

Note that we didn't write something like `convert_pay_string_to_number()` or `convert_pay_string_to_number("Total Pay")`.  The job of `apply` is to call the function we give it, so instead of calling `convert_pay_string_to_number` ourselves, we just write its name as an argument to `apply`.
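
To make the idea concrete, here is a rough sketch of what `apply` does, written with a plain Python list. The helper `apply_to_each` is a hypothetical stand-in, not the `datascience` library's actual implementation (the real `apply` works on a table column and returns an array), and the percentage strings are made-up data.

```python
def apply_to_each(function, values):
    """Hypothetical stand-in for apply: calls function once per value."""
    results = []
    for value in values:
        results.append(function(value))
    return results

def percent_string_to_number(percent_string):
    """Converts a string like '12%' to the number 12.0."""
    return float(percent_string.strip("%"))

# Made-up column of strings.
percent_column = ["12%", "7.5%", "100%"]

# We pass the function by name; apply_to_each is the one that calls it.
numbers = apply_to_each(percent_string_to_number, percent_column)
print(numbers)
```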

**Question 2.** Using `apply`, make a table that's a copy of `raw_compensation` with one more column called "Total Pay (\$)".  It should be the result of applying `convert_pay_string_to_number` to the "Total Pay" column, as we did above.  Call the new table `compensation`.

In [None]:
compensation = raw_compensation.with_column(
    "Total Pay ($)",
    ...
)

compensation

In [None]:
check3_2(compensation)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


Now that we have the pay as numbers instead of strings, we can do arithmetic with them.

**Question 3.** Compute the mean total pay of the CEOs in the dataset. *Hint: there is a function `np.mean(a)` that will compute the average of array `a`, as well as an array method `a.mean()` that will do the same thing.*

In [None]:
average_total_pay = ...
average_total_pay

In [None]:
check3_3(average_total_pay)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 4.** Companies pay executives in a variety of ways: directly in cash, by granting stock or other "equity" in the company, or with ancillary benefits (e.g., private jets).  Compute the proportion of each CEO's pay that was cash.  Your answer should be an array of numbers, one for each CEO in the dataset.  

*Hint 1:* The `compensation` table has a column about cash pay that will help.

*Hint 2:* You will get a warning about an "invalid value encountered in `true_divide`".  That's okay. It's because Lawrence Page had a compensation of $0, and division by zero is undefined.  So his entry in your array will be `nan`, which stands for "not a number".
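
Here is a small sketch of how `nan` arises when one array is divided by another. The two arrays are made-up numbers, not values from `compensation`, and `np.errstate` is used only to silence the warning in this example.

```python
import numpy as np

# Made-up values; the middle entry mimics a CEO with $0 cash and $0 total pay.
cash = np.array([2.0, 0.0, 1.5])
total = np.array([8.0, 0.0, 3.0])

# Dividing arrays works element by element; 0/0 produces nan plus a warning.
with np.errstate(invalid="ignore"):
    proportions = cash / total

print(proportions)
print(np.isnan(proportions))  # True only where the result is nan
```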

In [None]:
cash_proportion = ...
cash_proportion

In [None]:
check3_4(cash_proportion)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


Check out the "% Change" column in `compensation`.  It shows the percentage increase in the CEO's pay from the previous year.  For CEOs with no previous year on record, it instead says "(No previous year)".  The values in this column are *strings*, not numbers, so like the "Total Pay" column, it's not usable without a bit of extra work.

Given your current pay and the percentage increase from the previous year, you can calculate your previous year's pay.  For example, if your pay is \$100 this year, and that's an increase of 50% from the previous year, then your previous year's pay was $\frac{\$100}{1 + \frac{50}{100}}$, or about \$66.67.
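
The arithmetic in that example can be sketched as a short function. The name `previous_year_pay` and its argument names are hypothetical; the function works on plain numbers, so turning the "% Change" strings into numbers is still up to you.

```python
def previous_year_pay(current_pay, percent_increase):
    """Previous year's pay, given this year's pay and the percent increase."""
    return current_pay / (1 + percent_increase / 100)

# The example from the text: $100 this year after a 50% increase.
example = previous_year_pay(100, 50)
print(round(example, 2))

# Sanity check: no change means the previous pay equals the current pay.
unchanged = previous_year_pay(100, 0)
print(unchanged)
```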

**Question 5.** Create a new table called `with_previous_compensation`.  It should be a copy of `compensation`, but with the "(No previous year)" CEOs filtered out, and with an extra column called "2014 Total Pay ($)".  That column should have each CEO's pay in 2014.

*Hint 1:* This question takes several steps, but each one is still something you've seen before.  Take it one step at a time, using as many lines as you need.  You can examine your results after each step to make sure you're on the right track.

*Hint 2:* You'll find it helpful to define a function or two.  You can do that just above your other code.  

*Hint 3:* Recall your first introduction to the strip function in Question 1.3.

In [None]:
# For reference, our solution involved around 10 lines of code, but shorter
# and longer solutions are possible.
with_previous_compensation = ...

with_previous_compensation

In [None]:
check3_5(with_previous_compensation)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 6.** What was the average pay of these CEOs in 2014?

In [None]:
average_pay_2014 = ...
average_pay_2014

In [None]:
check3_6(average_pay_2014)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


## 4. Histograms
Earlier, we computed the average pay among the CEOs in our dataset.  The average doesn't tell us everything about the amounts CEOs are paid, though.  For example, maybe just a few CEOs make the bulk of the money.

We can use a *histogram* to visualize the dataset.  In its simplest form, the table method `hist` takes a single argument, the name of a column of numbers, and produces a histogram of the numbers in that column.  Run the following cell to make a histogram:

In [None]:
def to_number(pay_string):
    return float(pay_string.strip('$'))
    
pay = Table().with_columns(
    'Name', raw_compensation.column('Name'),
    'Total Pay', raw_compensation.apply(to_number, 'Total Pay')
)
pay.show(3)

pay.hist('Total Pay')

By default, `hist` produces a histogram with 10 equally spaced bins.  It's hard to tell quite how wide the bins are above, so let's try specifying that the bins should be $10 million wide. Run the following cell:

In [None]:
pay.hist('Total Pay', bins=np.arange(0, 70, 10))
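
To see which bins that `bins` argument describes, it can help to look at the array on its own. This sketch only inspects the bin edges; it does not draw anything.

```python
import numpy as np

# arange starts at 0, steps by 10, and stops before 70.
bin_edges = np.arange(0, 70, 10)
print(bin_edges)

# Seven edges mark the boundaries of six bins, each 10 units wide.
num_bins = len(bin_edges) - 1
bin_widths = np.diff(bin_edges)
print(num_bins, bin_widths)
```

Because the stop value 70 is excluded, the last edge is 60, so the bins cover pay from 0 up to 60 (in millions of dollars).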

Now the width of the bins is easy to read, but the labels on the axes could be improved. Run the following cell:

In [None]:
pay.hist('Total Pay', bins=np.arange(0, 70, 10), unit="$1 Million")

Now both the x and y axes have labels that are easy to read.

**Question 1.** Based on the histogram, how many CEOs made more than $30 million?  Answer the question by looking at the chart, visually estimating the height of some bins, and doing some arithmetic.  Recall that there are 102 CEOs in the dataset.  Fill your answer in below.  *Do not write code that examines the table to answer this question; we'll do that in the next question. You are welcome, though, to write code that uses Python like a calculator to do some arithmetic.*
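
As a sketch of the kind of arithmetic involved, assume the vertical axis shows percent per unit (here, percent per \$1 million), so that a bar's height times its width gives the percent of CEOs in that bin. The bar height below is a made-up number, not one read from the chart above.

```python
total_ceos = 102
bin_width = 10  # each bin spans $10 million

# Hypothetical bar height, NOT read from the lab's histogram.
height_percent_per_unit = 0.5  # percent per $1 million

# Area of the bar = percent of all CEOs that fall in the bin.
percent_in_bin = height_percent_per_unit * bin_width

# Convert that percent into a count of CEOs.
count_in_bin = percent_in_bin / 100 * total_ceos

print(percent_in_bin, round(count_in_bin, 1))
```

For a question that spans several bins, you would repeat this for each bar and add the counts.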

In [None]:
estimate_ceos30 = ...
estimate_ceos30

In [None]:
check4_1(estimate_ceos30)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 2.** Answer the same question, but use code that analyzes the `pay` table.  *Hint:* Use the table method `where` and the property `num_rows`.

In [None]:
num_ceos30 = ...
num_ceos30

In [None]:
check4_2(num_ceos30)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 3.** Which of the following best describes the shape of the distribution?

1. Symmetric - about the mean total pay.
2. Right skewed - the majority of the distribution is concentrated on the left with few values on the right.
3. Left skewed - the majority of the distribution is concentrated on the right with few values on the left.
4. Bimodal - there are two peaks for total pay.

In [None]:
dist_shape = ...


In [None]:
check4_3(dist_shape)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


## 5. Submit

Great job; you're done with this lab! Please make sure to submit your assignment!

Before submitting, we recommend that you use the menu item Kernel -> Restart & Run All. That will re-run all your cells from scratch, just to make sure they all work as you are expecting.  Take a close look to make sure all your cells are still passing the checks.  Then, if they are, click the blue Submit button. 
